Vision Model Troubleshooting Checklist
Quick Diagnostics
1. Verify Both GPU Services Running
# Check container status
docker compose ps
# Should show both RUNNING:
# llama-swap (NVIDIA CUDA)
# llama-swap-amd (AMD ROCm)
If llama-swap is not running:
docker compose up -d llama-swap
docker compose logs llama-swap
If llama-swap-amd is not running:
docker compose up -d llama-swap-amd
docker compose logs llama-swap-amd
2. Check NVIDIA Vision Endpoint Health
# Test NVIDIA endpoint directly
curl -v http://llama-swap:8080/health
# Expected: 200 OK
# If timeout (no response for 5+ seconds):
# - NVIDIA GPU might not have enough VRAM
# - Model might be stuck loading
# - Docker network might be misconfigured
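Note that the llama-swap hostname only resolves inside the compose network. If running these curl commands from the host fails with a DNS error, one option (assuming curl is installed in the miku-bot container) is:
# Run the health check from inside the bot container, where service names resolve
docker compose exec miku-bot curl -v http://llama-swap:8080/health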
3. Check Current GPU State
# See which GPU is set as primary
cat bot/memory/gpu_state.json
# Expected output:
# {"current_gpu": "amd", "reason": "voice_session"}
# or
# {"current_gpu": "nvidia", "reason": "auto_switch"}
4. Verify Model Files Exist
# Check vision model files on disk
ls -lh models/MiniCPM*
# Should show both:
# -rw-r--r-- ... MiniCPM-V-4_5-Q3_K_S.gguf (main model, ~3.3GB)
# -rw-r--r-- ... MiniCPM-V-4_5-mmproj-f16.gguf (projection, ~500MB)
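A quick loop that names any missing file explicitly (filenames taken from the listing above):
# Flag missing vision model files by name
for f in models/MiniCPM-V-4_5-Q3_K_S.gguf models/MiniCPM-V-4_5-mmproj-f16.gguf; do
  [ -f "$f" ] || echo "MISSING: $f"
done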
Scenario-Based Troubleshooting
Scenario 1: Vision Works When NVIDIA is Primary, Fails When AMD is Primary
Diagnosis: The vision model is being unloaded from the NVIDIA GPU when AMD is primary
Root Cause: llama-swap is configured to unload unused models
Solution:
# In llama-swap-config.yaml, increase the TTL for the vision model:
vision:
  ttl: 3600  # up from 900, keeps the vision model loaded longer
Or:
# Disable TTL for vision to keep it always loaded:
vision:
  ttl: 0  # 0 means never auto-unload
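If llama-swap only reads its configuration at startup (assume so unless you know your version hot-reloads), restart the NVIDIA container after editing:
# Apply the TTL change
docker compose restart llama-swap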
Scenario 2: "Vision service currently unavailable: Endpoint timeout"
Diagnosis: NVIDIA endpoint not responding within 5 seconds
Causes:
- NVIDIA GPU out of memory
- Vision model stuck loading
- Network latency
Solutions:
# Check NVIDIA GPU memory
nvidia-smi
# If memory is full, restart NVIDIA container
docker compose restart llama-swap
# Wait for model to load (check logs)
docker compose logs llama-swap -f
# Should see: "model loaded" message
If persistent: Increase health check timeout in bot/utils/llm.py:
# Change from 5 to 10 seconds
async with session.get(f"{vision_url}/health", timeout=aiohttp.ClientTimeout(total=10)) as response:
Scenario 3: Vision Model Returns Empty Description
Diagnosis: Model loaded but not processing correctly
Causes:
- Model corruption
- Invalid or truncated image data slipping past input validation
- Model inference error
Solutions:
# Test vision model directly
curl -X POST http://llama-swap:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "vision",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "text", "text": "What is this?"},
        {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,/9j/4AAQSkZJ..."}}
      ]
    }],
    "max_tokens": 100
  }'
# If returns empty, check llama-swap logs for errors
docker compose logs llama-swap -n 50
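To build a complete request from a local test image instead of pasting base64 by hand (test.jpg is a placeholder; base64 -w0 assumes GNU coreutils):
# Encode a local JPEG and send it to the vision endpoint
IMG_B64=$(base64 -w0 test.jpg)
curl -s -X POST http://llama-swap:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "{\"model\": \"vision\", \"messages\": [{\"role\": \"user\", \"content\": [{\"type\": \"text\", \"text\": \"What is this?\"}, {\"type\": \"image_url\", \"image_url\": {\"url\": \"data:image/jpeg;base64,${IMG_B64}\"}}]}], \"max_tokens\": 100}"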
Scenario 4: "Error 503 Service Unavailable"
Diagnosis: llama-swap process crashed or model failed to load
Solutions:
# Check llama-swap container status
docker compose logs llama-swap -n 100
# Look for error messages, stack traces
# Restart the service
docker compose restart llama-swap
# Monitor startup
docker compose logs llama-swap -f
Scenario 5: Slow Vision Analysis When AMD is Primary
Diagnosis: Both GPUs under load, NVIDIA performance degraded
Expected Behavior: This is normal. Both GPUs are working simultaneously.
If Unacceptably Slow:
- Check whether text requests are still hitting the NVIDIA GPU and blocking vision requests (see the check after this list)
- Verify GPU memory allocation on both cards
- Consider processing images sequentially instead of in parallel
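One way to check whether text traffic is still hitting the NVIDIA container while AMD is primary (assuming both llama-swap containers log incoming requests):
# Terminal 1: NVIDIA container (should stay quiet while plain text messages are handled)
docker compose logs llama-swap -f
# Terminal 2: AMD container (should log the completion requests)
docker compose logs llama-swap-amd -f
# Send a normal text message to the bot and compare the two streams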
Log Analysis Tips
Enable Detailed Vision Logging
# Watch only vision-related logs
docker compose logs miku-bot -f 2>&1 | grep -i vision
# Watch with timestamps, filtered to log-level lines
docker compose logs miku-bot -f -t 2>&1 | grep -i vision | grep -E "ERROR|WARNING|INFO"
Check GPU Health During Vision Request
In one terminal:
# Monitor NVIDIA GPU while processing
watch -n 1 nvidia-smi
In another:
# Send image to bot that triggers vision
# Then watch GPU usage spike in first terminal
Monitor Both GPUs Simultaneously
# Terminal 1: NVIDIA
watch -n 1 nvidia-smi
# Terminal 2: AMD
watch -n 1 rocm-smi
# Terminal 3: Logs
docker compose logs miku-bot -f 2>&1 | grep -E "ERROR|vision"
Emergency Fixes
If Vision Completely Broken
# Full restart of all GPU services
docker compose down
docker compose up -d llama-swap llama-swap-amd
docker compose up -d miku-bot
# Wait for services to start (30-60 seconds)
sleep 30
# Test health
curl http://llama-swap:8080/health
curl http://llama-swap-amd:8080/health
Force NVIDIA GPU Vision
If you want vision requests to be attempted even when the NVIDIA health check fails:
# Comment out the vision endpoint health check (referenced in bot/utils/llm.py and image_handling.py)
# (Not recommended: failures will surface as raw request errors instead of a clean "unavailable" message)
Disable Dual-GPU Mode Temporarily
If the AMD GPU is causing issues:
# Stop the AMD container and restart the bot
docker compose stop llama-swap-amd
docker compose restart miku-bot
# This reverts to single-GPU mode (everything on NVIDIA)
Prevention Measures
1. Monitor GPU Memory
# Set up automated monitoring (run each watch in its own terminal)
watch -n 5 "nvidia-smi --query-gpu=memory.used,memory.free --format=csv,noheader"
watch -n 5 "rocm-smi --showmeminfo vram"
2. Set Appropriate Model TTLs
In llama-swap-config.yaml:
vision:
  ttl: 1800  # Keep loaded 30 minutes
llama3.1:
  ttl: 1800  # Keep loaded 30 minutes
In llama-swap-rocm-config.yaml:
llama3.1:
  ttl: 1800  # AMD text model
darkidol:
  ttl: 1800  # AMD evil mode
3. Monitor Container Logs
# Periodic log check
docker compose logs llama-swap | tail -20
docker compose logs llama-swap-amd | tail -20
docker compose logs miku-bot | grep vision | tail -20
4. Regular Health Checks
#!/bin/bash
# Script to check both GPU endpoints (run from inside the compose network)
echo "NVIDIA Health:"
curl -sf -o /dev/null http://llama-swap:8080/health && echo "✓ OK" || echo "✗ FAILED"
echo "AMD Health:"
curl -sf -o /dev/null http://llama-swap-amd:8080/health && echo "✓ OK" || echo "✗ FAILED"
Performance Optimization
If vision requests are too slow:
- Reduce image resolution/quality before sending images to the model (see the sketch after this list)
- Use smaller frames for video analysis
- Batch-process multiple images
- Allocate more VRAM to the NVIDIA GPU if available
- Reduce concurrent requests to the NVIDIA GPU during peak load
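A minimal preprocessing sketch using ImageMagick; this assumes images can be downscaled before they are base64-encoded and sent, and the filenames are placeholders:
# Shrink anything larger than 1280px on its longest side, re-encode at JPEG quality 80
convert input.png -resize '1280x1280>' -quality 80 preprocessed.jpg
Smaller files also mean a smaller base64 payload in the /v1/chat/completions request.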
Success Indicators
After applying the relevant fix, you should see:
✅ Images analyzed within 5-10 seconds (first load: 20-30 seconds)
✅ No "Vision service unavailable" errors
✅ Log shows Vision analysis completed successfully
✅ Works correctly whether AMD or NVIDIA is primary GPU
✅ No GPU memory errors in nvidia-smi/rocm-smi
Further Checks if Issues Persist
- Check the NVIDIA llama.cpp/llama-swap logs
- Check AMD ROCm compatibility for your GPU
- Verify Docker networking (if using custom networks)
- Check system VRAM (needs ~10GB+ for both models; see the commands below)
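To confirm how much VRAM each card actually has and how much is in use:
# NVIDIA: total and used VRAM
nvidia-smi --query-gpu=memory.total,memory.used --format=csv,noheader
# AMD: VRAM totals and usage
rocm-smi --showmeminfo vram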