Vision Model Troubleshooting Checklist

Quick Diagnostics

1. Verify Both GPU Services Running

# Check container status
docker compose ps

# Should show both RUNNING:
# llama-swap      (NVIDIA CUDA)
# llama-swap-amd  (AMD ROCm)

If llama-swap is not running:

docker compose up -d llama-swap
docker compose logs llama-swap

If llama-swap-amd is not running:

docker compose up -d llama-swap-amd
docker compose logs llama-swap-amd

2. Check NVIDIA Vision Endpoint Health

# Test NVIDIA endpoint directly
curl -v http://llama-swap:8080/health

# Expected: 200 OK

# If timeout (no response for 5+ seconds):
# - NVIDIA GPU might not have enough VRAM
# - Model might be stuck loading
# - Docker network might be misconfigured
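
Note that llama-swap is a Docker service name, so it only resolves on the compose network. If the curl above fails from the host shell, re-run it from inside the bot container (a sketch, assuming curl is available in the miku-bot image):

# Run the health check from inside the bot container
docker compose exec miku-bot curl -s -o /dev/null -w "%{http_code}\n" http://llama-swap:8080/health

# Expected output: 200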

3. Check Current GPU State

# See which GPU is set as primary
cat bot/memory/gpu_state.json

# Expected output:
# {"current_gpu": "amd", "reason": "voice_session"}
# or
# {"current_gpu": "nvidia", "reason": "auto_switch"}

4. Verify Model Files Exist

# Check vision model files on disk
ls -lh models/MiniCPM*

# Should show both:
# -rw-r--r-- ... MiniCPM-V-4_5-Q3_K_S.gguf (main model, ~3.3GB)
# -rw-r--r-- ... MiniCPM-V-4_5-mmproj-f16.gguf (projection, ~500MB)
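
It is also worth confirming that the NVIDIA config actually references these files. A quick check, assuming the config file is llama-swap-config.yaml as used later in this document:

# Look for the vision model paths in the NVIDIA config
grep -n "MiniCPM" llama-swap-config.yaml

# No matches usually means the config points at different filenames or paths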

Scenario-Based Troubleshooting

Scenario 1: Vision Works When NVIDIA is Primary, Fails When AMD is Primary

Diagnosis: The vision model on the NVIDIA GPU is being unloaded when AMD is primary

Root Cause: llama-swap is configured to unload unused models

Solution:

# In llama-swap-config.yaml, increase the TTL for the vision model:
vision:
  ttl: 3600  # Increase from 900 to keep the vision model loaded longer

Or:

# Disable TTL for vision to keep it always loaded:
vision:
  ttl: 0  # 0 means never auto-unload
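
After editing the config, restart the container so the new TTL takes effect (a sketch; depending on your llama-swap version a full restart may not be strictly necessary):

# Apply the config change and watch the model reload
docker compose restart llama-swap
docker compose logs llama-swap -f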

Scenario 2: "Vision service currently unavailable: Endpoint timeout"

Diagnosis: NVIDIA endpoint not responding within 5 seconds

Causes:

  1. NVIDIA GPU out of memory
  2. Vision model stuck loading
  3. Network latency

Solutions:

# Check NVIDIA GPU memory
nvidia-smi

# If memory is full, restart NVIDIA container
docker compose restart llama-swap

# Wait for model to load (check logs)
docker compose logs llama-swap -f

# Should see: "model loaded" message

If the problem persists, increase the health check timeout in bot/utils/llm.py:

# Change from 5 to 10 seconds
async with session.get(f"{vision_url}/health", timeout=aiohttp.ClientTimeout(total=10)) as response:
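
Before raising the timeout, it helps to measure how long the endpoint actually takes to answer. A quick sketch using curl's timing output:

# Measure the health endpoint's total response time a few times
for i in 1 2 3 4 5; do
  curl -s -o /dev/null -w "%{time_total}s\n" http://llama-swap:8080/health
done

# If responses regularly approach or exceed 5 seconds, a larger timeout is reasonable;
# if they return in milliseconds, the timeout is not the real problem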

Scenario 3: Vision Model Returns Empty Description

Diagnosis: Model loaded but not processing correctly

Causes:

  1. Model corruption
  2. Insufficient input validation
  3. Model inference error

Solutions:

# Test vision model directly
curl -X POST http://llama-swap:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "vision",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "text", "text": "What is this?"},
        {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,/9j/4AAQSkZJ..."}}
      ]
    }],
    "max_tokens": 100
  }'

# If returns empty, check llama-swap logs for errors
docker compose logs llama-swap -n 50
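
The base64 string in the example above is truncated. To build a complete payload from a local image, one option is a small helper like this (a sketch; test.jpg is a placeholder, and -w0 assumes GNU base64, so on macOS use base64 -i test.jpg instead):

# Encode a small local image and send it to the vision endpoint
IMG_B64=$(base64 -w0 test.jpg)

curl -X POST http://llama-swap:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "{
    \"model\": \"vision\",
    \"messages\": [{
      \"role\": \"user\",
      \"content\": [
        {\"type\": \"text\", \"text\": \"What is this?\"},
        {\"type\": \"image_url\", \"image_url\": {\"url\": \"data:image/jpeg;base64,${IMG_B64}\"}}
      ]
    }],
    \"max_tokens\": 100
  }"

If this returns a sensible description, the model itself is fine and the problem is in how the bot builds its requests.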

Scenario 4: "Error 503 Service Unavailable"

Diagnosis: llama-swap process crashed or model failed to load

Solutions:

# Check llama-swap container status
docker compose logs llama-swap -n 100

# Look for error messages, stack traces

# Restart the service
docker compose restart llama-swap

# Monitor startup
docker compose logs llama-swap -f

Scenario 5: Slow Vision Analysis When AMD is Primary

Diagnosis: Both GPUs under load, NVIDIA performance degraded

Expected Behavior: This is normal. Both GPUs are working simultaneously.

If Unacceptably Slow:

  1. Check if text requests are blocking vision requests
  2. Verify GPU memory allocation (see the sketch below)
  3. Consider processing images sequentially instead of parallel
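
For items 1 and 2, watching NVIDIA utilization and memory at one-second intervals while you send a vision request makes contention easy to spot (a sketch):

# Log NVIDIA utilization and memory once per second; send a vision request while it runs
nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total --format=csv,noheader -l 1

# If utilization is already pinned near 100% before the image arrives,
# other traffic on the NVIDIA GPU is the likely bottleneck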

Log Analysis Tips

Enable Detailed Vision Logging

# Watch only vision-related logs
docker compose logs miku-bot -f 2>&1 | grep -i vision

# Watch with timestamps, filtered by log level
docker compose logs miku-bot -f -t 2>&1 | grep -i vision | grep -E "ERROR|WARNING|INFO"

Check GPU Health During Vision Request

In one terminal:

# Monitor NVIDIA GPU while processing
watch -n 1 nvidia-smi

In another:

# Send image to bot that triggers vision
# Then watch GPU usage spike in first terminal

Monitor Both GPUs Simultaneously

# Terminal 1: NVIDIA
watch -n 1 nvidia-smi

# Terminal 2: AMD
watch -n 1 rocm-smi

# Terminal 3: Logs
docker compose logs miku-bot -f 2>&1 | grep -E "ERROR|vision"

Emergency Fixes

If Vision Completely Broken

# Full restart of all GPU services
docker compose down
docker compose up -d llama-swap llama-swap-amd
docker compose up -d miku-bot   # 'down' removed the bot container too, so start it rather than restart

# Wait for services to start (30-60 seconds)
sleep 30

# Test health
curl http://llama-swap:8080/health
curl http://llama-swap-amd:8080/health
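
If a fixed sleep 30 is not long enough on your hardware, a small wait loop is more reliable (a sketch; adjust the retry count as needed):

# Poll the NVIDIA endpoint until it reports healthy (up to ~2 minutes)
for i in $(seq 1 24); do
  if curl -sf http://llama-swap:8080/health > /dev/null; then
    echo "NVIDIA endpoint healthy"
    break
  fi
  echo "waiting for llama-swap... ($i/24)"
  sleep 5
done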

Force NVIDIA GPU Vision

If you want vision requests to always be attempted, even when the NVIDIA health check is failing:

# Comment out the vision endpoint health check in bot/utils/llm.py (referenced from image_handling.py)
# (Not recommended, but allows requests to continue even when the endpoint looks unhealthy)

Disable Dual-GPU Mode Temporarily

If AMD GPU is causing issues:

# Stop the llama-swap-amd container (or comment out the service in docker-compose.yml)
# Restart the bot
# This reverts to single-GPU mode (everything on NVIDIA)
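
For example (a sketch using the service names from this document):

docker compose stop llama-swap-amd
docker compose restart miku-bot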

Prevention Measures

1. Monitor GPU Memory

# Set up automated monitoring
watch -n 5 "nvidia-smi --query-gpu=memory.used,memory.free --format=csv,noheader"
watch -n 5 "rocm-smi --showmeminfo vram"

2. Set Appropriate Model TTLs

In llama-swap-config.yaml:

vision:
  ttl: 1800  # Keep loaded 30 minutes
  
llama3.1:
  ttl: 1800  # Keep loaded 30 minutes

In llama-swap-rocm-config.yaml:

llama3.1:
  ttl: 1800  # AMD text model
  
darkidol:
  ttl: 1800  # AMD evil mode

3. Monitor Container Logs

# Periodic log check
docker compose logs llama-swap | tail -20
docker compose logs llama-swap-amd | tail -20
docker compose logs miku-bot | grep vision | tail -20

4. Regular Health Checks

#!/bin/bash
# Script to check both GPU endpoints
echo "NVIDIA Health:"
curl -sf -o /dev/null http://llama-swap:8080/health && echo "✓ OK" || echo "✗ FAILED"

echo "AMD Health:"
curl -sf -o /dev/null http://llama-swap-amd:8080/health && echo "✓ OK" || echo "✗ FAILED"
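
To keep this check running, watch works well (a sketch; check_gpu_health.sh is whatever name you saved the script under, and the llama-swap hostnames only resolve where the curl checks above already work, e.g. on the Docker network):

# Re-run the health script every 60 seconds
watch -n 60 ./check_gpu_health.sh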

Performance Optimization

If vision requests are too slow:

  1. Reduce image quality before sending to the model (see the sketch below)
  2. Use smaller frames for video analysis
  3. Batch process multiple images
  4. Allocate more VRAM to NVIDIA if available
  5. Reduce concurrent requests to NVIDIA during peak load
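
For item 1, downscaling and recompressing before the image ever reaches the model is usually the cheapest win. A minimal sketch using ImageMagick (filenames are placeholders; the ">" flag only shrinks images larger than the target):

# Downscale to at most 1024px on the long edge and recompress
convert input.jpg -resize "1024x1024>" -quality 85 output.jpg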

Success Indicators

After applying the relevant fixes, you should see:

Images analyzed within 5-10 seconds (first load: 20-30 seconds)
No "Vision service unavailable" errors
Logs show "Vision analysis completed successfully"
Works correctly whether AMD or NVIDIA is the primary GPU
No GPU memory errors in nvidia-smi/rocm-smi

Contact Points for Further Issues

  1. Check NVIDIA llama.cpp/llama-swap logs
  2. Check AMD ROCm compatibility for your GPU
  3. Verify Docker networking (if using custom networks)
  4. Check system VRAM (needs ~10GB+ for both models)