Vision Model Debugging Guide
Issue Summary
The vision model does not work when the AMD GPU is set as the primary GPU for text inference.
Root Cause Analysis
The vision model (MiniCPM-V) should always run on the NVIDIA GPU, even when AMD is the primary GPU for text models. This is because:
- Separate GPU design: each GPU has its own llama-swap instance
  - llama-swap (NVIDIA) on port 8090 → handles vision, llama3.1, darkidol
  - llama-swap-amd (AMD) on port 8091 → handles llama3.1, darkidol (text models only)
- Vision model location: the vision model is ONLY configured on NVIDIA
  - Check: llama-swap-config.yaml (has the vision model)
  - Check: llama-swap-rocm-config.yaml (does NOT have the vision model)
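For illustration, here is a minimal sketch of that split as a routing table; the hostnames, the AMD instance's internal port, and the constant name are assumptions based on the description above, not the project's actual configuration:

```python
# Hypothetical routing table for the dual llama-swap setup described above.
# Hostnames, the AMD internal port, and the constant name are assumptions.
LLAMA_SWAP_INSTANCES = {
    "nvidia": {
        "url": "http://llama-swap:8080",       # exposed on host port 8090
        "models": ["vision", "llama3.1", "darkidol"],
    },
    "amd": {
        "url": "http://llama-swap-amd:8080",   # exposed on host port 8091 (assumed internal port)
        "models": ["llama3.1", "darkidol"],    # text only: no vision model configured
    },
}
```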
Fixes Applied
1. Improved GPU Routing (bot/utils/llm.py)
Function: get_vision_gpu_url()
- Now explicitly returns the NVIDIA URL regardless of the primary text GPU
- Added debug logging for when the text GPU is AMD
- Added clear documentation of the routing strategy
New Function: check_vision_endpoint_health()
- Pings the NVIDIA vision endpoint before attempting requests
- Provides detailed error messages if endpoint is unreachable
- Logs health status for troubleshooting
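A minimal sketch of what these two helpers could look like, assuming an aiohttp-based async client; the base URL, environment-variable name, and /health route follow this guide's examples, while everything else is illustrative rather than the project's actual code:

```python
# Hypothetical sketch of the routing helpers in bot/utils/llm.py.
# The env-var name and default URL are assumptions; /health matches
# the curl example in the Testing section below.
import asyncio
import logging
import os

import aiohttp

logger = logging.getLogger(__name__)

# Vision is pinned to the NVIDIA llama-swap instance regardless of which
# GPU currently serves text models (the AMD instance has no vision model).
NVIDIA_LLAMA_SWAP_URL = os.getenv("LLAMA_SWAP_URL", "http://llama-swap:8080")


def get_vision_gpu_url(current_text_gpu: str) -> str:
    """Return the base URL that should serve vision requests."""
    if current_text_gpu == "amd":
        logger.debug("Text GPU is AMD; routing vision request to NVIDIA anyway")
    return NVIDIA_LLAMA_SWAP_URL


async def check_vision_endpoint_health(timeout: float = 5.0) -> bool:
    """Ping the NVIDIA vision endpoint before attempting a real request."""
    url = f"{NVIDIA_LLAMA_SWAP_URL}/health"
    try:
        async with aiohttp.ClientSession() as session:
            async with session.get(
                url, timeout=aiohttp.ClientTimeout(total=timeout)
            ) as resp:
                logger.info("Vision endpoint health: HTTP %s (%s)", resp.status, url)
                return resp.status == 200
    except (aiohttp.ClientError, asyncio.TimeoutError) as exc:
        logger.error("Vision endpoint unreachable at %s: %s", url, exc)
        return False
```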
2. Enhanced Vision Analysis (bot/utils/image_handling.py)
Function: analyze_image_with_vision()
- Added health check before processing
- Increased timeout to 60 seconds (from default)
- Logs endpoint URL, model name, and detailed error messages
- Added exception info logging for better debugging
Function: analyze_video_with_vision()
- Added health check before processing
- Increased timeout to 120 seconds (from default)
- Logs media type, frame count, and detailed error messages
- Added exception info logging for better debugging
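As a rough sketch of how those pieces fit together in analyze_image_with_vision(), assuming the helpers above and an OpenAI-compatible payload shaped like the curl example in step 3 below; the function signature, return handling, and max_tokens value are illustrative:

```python
# Hypothetical flow for bot/utils/image_handling.py: health check first,
# then a vision request with a 60 s timeout. Signature and payload details
# are assumptions; the payload shape mirrors the curl example below.
import logging

import aiohttp

from bot.utils.llm import check_vision_endpoint_health, get_vision_gpu_url

logger = logging.getLogger(__name__)


async def analyze_image_with_vision(image_b64: str, prompt: str, current_text_gpu: str) -> str:
    if not await check_vision_endpoint_health():
        return "Vision service currently unavailable: Endpoint timeout"

    base_url = get_vision_gpu_url(current_text_gpu)
    payload = {
        "model": "vision",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
        "max_tokens": 300,  # illustrative value
    }
    logger.info("Sending vision request to %s", base_url)
    timeout = aiohttp.ClientTimeout(total=60)  # raised from the default
    try:
        async with aiohttp.ClientSession(timeout=timeout) as session:
            async with session.post(f"{base_url}/v1/chat/completions", json=payload) as resp:
                data = await resp.json()
        logger.info("Vision analysis completed successfully")
        return data["choices"][0]["message"]["content"]
    except Exception:
        logger.exception("Vision analysis failed against %s", base_url)
        raise
```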
Testing the Fix
1. Verify Docker Containers
# Check both llama-swap services are running
docker compose ps
# Expected output:
# llama-swap (port 8090)
# llama-swap-amd (port 8091)
2. Test NVIDIA Endpoint Health
# Check if the NVIDIA vision endpoint is responsive
# (run from inside the compose network; from the host, use the mapped port 8090)
curl -f http://llama-swap:8080/health
# Should return 200 OK
3. Test Vision Request to NVIDIA
# Send a simple vision request directly
curl -X POST http://llama-swap:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "vision",
"messages": [{
"role": "user",
"content": [
{"type": "text", "text": "Describe this image."},
{"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,..."}}
]
}],
"max_tokens": 100
}'
4. Check GPU State File
# Verify which GPU is primary
cat bot/memory/gpu_state.json
# Should show:
# {"current_gpu": "amd", "reason": "..."} when AMD is primary
# {"current_gpu": "nvidia", "reason": "..."} when NVIDIA is primary
5. Monitor Logs During Vision Request
# Watch bot logs during image analysis
docker compose logs -f miku-bot 2>&1 | grep -i vision
# Should see:
# "Sending vision request to http://llama-swap:8080"
# "Vision analysis completed successfully"
# OR detailed error messages if something is wrong
Troubleshooting Steps
Issue: Vision endpoint health check fails
Symptoms: "Vision service currently unavailable: Endpoint timeout"
Solutions:
- Verify the NVIDIA container is running: docker compose ps llama-swap
- Check NVIDIA GPU memory: nvidia-smi (should have free VRAM)
- Check if the vision model is loaded: docker compose logs llama-swap
- Increase the timeout if the model is loading slowly
Issue: Vision requests timeout (status 408/504)
Symptoms: Requests hang or return timeout errors
Solutions:
- Check that the NVIDIA GPU is not overloaded: nvidia-smi
- Check whether the vision model is already running: look for MiniCPM processes
- Restart llama-swap if the model is stuck: docker compose restart llama-swap
- Check available VRAM: MiniCPM-V needs ~4-6 GB
Issue: Vision model returns "No description"
Symptoms: Image analysis returns empty or generic responses
Solutions:
- Check that the vision model loaded correctly: docker compose logs llama-swap
- Verify the model file exists: /models/MiniCPM-V-4_5-Q3_K_S.gguf
- Check that the mmproj file loaded: /models/MiniCPM-V-4_5-mmproj-f16.gguf
- Test with a direct curl request (step 3 above) to ensure the model works
Issue: AMD GPU affects vision performance
Symptoms: Vision requests are slower when AMD is primary
Solutions:
- Some slowdown is expected behavior: NVIDIA still processes all vision requests
- It can also indicate NVIDIA GPU memory pressure
- Monitor both GPUs: rocm-smi (AMD) and nvidia-smi (NVIDIA)
Architecture Diagram
┌─────────────────────────────────────────────────────────────┐
│ Miku Bot │
│ │
│ Discord Messages with Images/Videos │
└─────────────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────┐
│ Vision Analysis Handler │
│ (image_handling.py) │
│ │
│ 1. Check NVIDIA health │
│ 2. Send to NVIDIA vision │
└──────────────────────────────┘
│
▼
┌──────────────────────────────┐
│ NVIDIA GPU (llama-swap) │
│ Port: 8090 │
│ │
│ Available Models: │
│ • vision (MiniCPM-V) │
│ • llama3.1 │
│ • darkidol │
└──────────────────────────────┘
│
┌───────────┴────────────┐
│ │
▼ (Vision only) ▼ (Text only in dual-GPU mode)
NVIDIA GPU AMD GPU (llama-swap-amd)
Port: 8091
Available Models:
• llama3.1
• darkidol
(NO vision model)
Key Files Changed
- bot/utils/llm.py
  - Enhanced get_vision_gpu_url() with documentation
  - Added check_vision_endpoint_health() function
- bot/utils/image_handling.py
  - analyze_image_with_vision(): added health check and logging
  - analyze_video_with_vision(): added health check and logging
Expected Behavior After Fix
When NVIDIA is Primary (default)
Image received
→ Check NVIDIA health: OK
→ Send to NVIDIA vision model
→ Analysis complete
✓ Works as before
When AMD is Primary (voice session active)
Image received
→ Check NVIDIA health: OK
→ Send to NVIDIA vision model (even though text uses AMD)
→ Analysis complete
✓ Vision now works correctly!
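This invariant is easy to pin down with a small regression-style check; the sketch below reuses the hypothetical get_vision_gpu_url() helper and constant from earlier in this guide and is not part of the project's actual test suite:

```python
# Hypothetical regression check: vision routing ignores the primary text GPU.
from bot.utils.llm import NVIDIA_LLAMA_SWAP_URL, get_vision_gpu_url


def test_vision_always_routes_to_nvidia():
    for text_gpu in ("nvidia", "amd"):
        assert get_vision_gpu_url(text_gpu) == NVIDIA_LLAMA_SWAP_URL
```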
Next Steps if Issues Persist
- Enable debug logging: set AUTONOMOUS_DEBUG=true in docker-compose
- Check Docker networking: docker network inspect miku-discord_default
- Verify environment variables: docker compose exec miku-bot env | grep LLAMA
- Check model file integrity: ls -lah models/MiniCPM*
- Review llama-swap logs: docker compose logs llama-swap -n 100