moved AI generated readmes to readme folder (may delete)
readmes/VISION_TROUBLESHOOTING.md (new file, 330 lines)

# Vision Model Troubleshooting Checklist

## Quick Diagnostics

### 1. Verify Both GPU Services Running

```bash
# Check container status
docker compose ps

# Should show both RUNNING:
# llama-swap     (NVIDIA CUDA)
# llama-swap-amd (AMD ROCm)
```

**If llama-swap is not running:**
```bash
docker compose up -d llama-swap
docker compose logs llama-swap
```

**If llama-swap-amd is not running:**
```bash
docker compose up -d llama-swap-amd
docker compose logs llama-swap-amd
```

### 2. Check NVIDIA Vision Endpoint Health

```bash
# Test NVIDIA endpoint directly
curl -v http://llama-swap:8080/health

# Expected: 200 OK

# If timeout (no response for 5+ seconds):
# - NVIDIA GPU might not have enough VRAM
# - Model might be stuck loading
# - Docker network might be misconfigured
```

### 3. Check Current GPU State

```bash
# See which GPU is set as primary
cat bot/memory/gpu_state.json

# Expected output:
# {"current_gpu": "amd", "reason": "voice_session"}
# or
# {"current_gpu": "nvidia", "reason": "auto_switch"}
```
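
If other tooling needs this value, it can be read with a few lines of Python. This is a minimal sketch, not the bot's actual code; it only assumes the file path and the `current_gpu`/`reason` keys shown above, and falls back to NVIDIA if the file is missing or unreadable.

```python
import json
from pathlib import Path

STATE_FILE = Path("bot/memory/gpu_state.json")  # path shown in the command above

def read_gpu_state() -> dict:
    """Return the GPU state dict, defaulting to NVIDIA if the file is absent or corrupt."""
    try:
        return json.loads(STATE_FILE.read_text())
    except (FileNotFoundError, json.JSONDecodeError):
        # Assumed fallback: treat NVIDIA as primary when no valid state file exists
        return {"current_gpu": "nvidia", "reason": "default"}

if __name__ == "__main__":
    state = read_gpu_state()
    print(f"Primary GPU: {state.get('current_gpu')} (reason: {state.get('reason')})")
```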

### 4. Verify Model Files Exist

```bash
# Check vision model files on disk
ls -lh models/MiniCPM*

# Should show both:
# -rw-r--r-- ... MiniCPM-V-4_5-Q3_K_S.gguf      (main model, ~3.3GB)
# -rw-r--r-- ... MiniCPM-V-4_5-mmproj-f16.gguf  (projection, ~500MB)
```

## Scenario-Based Troubleshooting

### Scenario 1: Vision Works When NVIDIA is Primary, Fails When AMD is Primary

**Diagnosis:** The NVIDIA vision model is being unloaded while AMD is primary.

**Root Cause:** llama-swap is configured to unload unused models.

**Solution:**
```yaml
# In llama-swap-config.yaml, increase the TTL for the vision model:
vision:
  ttl: 3600  # Increase from 900 to keep the vision model loaded longer
```

**Or:**
```yaml
# Disable TTL for vision to keep it always loaded:
vision:
  ttl: 0  # 0 means never auto-unload
```

### Scenario 2: "Vision service currently unavailable: Endpoint timeout"

**Diagnosis:** NVIDIA endpoint not responding within 5 seconds

**Causes:**
1. NVIDIA GPU out of memory
2. Vision model stuck loading
3. Network latency

**Solutions:**

```bash
# Check NVIDIA GPU memory
nvidia-smi

# If memory is full, restart NVIDIA container
docker compose restart llama-swap

# Wait for model to load (check logs)
docker compose logs llama-swap -f

# Should see: "model loaded" message
```

**If persistent:** Increase health check timeout in `bot/utils/llm.py`:
```python
# Change from 5 to 10 seconds
async with session.get(f"{vision_url}/health", timeout=aiohttp.ClientTimeout(total=10)) as response:
```
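
If the timeout needs tuning often, wrapping the check in a small helper keeps the value in one place. The sketch below is illustrative only, not the bot's actual code: `VISION_URL`, the function name, and the 10-second default are assumptions; only the `/health` route and the use of `aiohttp` come from the snippet above.

```python
import asyncio

import aiohttp

VISION_URL = "http://llama-swap:8080"  # NVIDIA endpoint used throughout this checklist

async def vision_endpoint_healthy(timeout_s: float = 10.0) -> bool:
    """Return True if the vision endpoint answers GET /health with HTTP 200 within timeout_s."""
    timeout = aiohttp.ClientTimeout(total=timeout_s)
    try:
        async with aiohttp.ClientSession(timeout=timeout) as session:
            async with session.get(f"{VISION_URL}/health") as response:
                return response.status == 200
    except (aiohttp.ClientError, asyncio.TimeoutError):
        # Timeouts and connection errors both count as "unhealthy"
        return False

if __name__ == "__main__":
    print(asyncio.run(vision_endpoint_healthy()))
```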

### Scenario 3: Vision Model Returns Empty Description

**Diagnosis:** Model loaded but not processing correctly

**Causes:**
1. Model file corruption
2. Invalid or unvalidated image input reaching the model
3. Model inference error

**Solutions:**

```bash
# Test the vision model directly
curl -X POST http://llama-swap:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "vision",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "text", "text": "What is this?"},
        {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,/9j/4AAQSkZJ..."}}
      ]
    }],
    "max_tokens": 100
  }'

# If the response is empty, check the llama-swap logs for errors
docker compose logs llama-swap -n 50
```
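
Typing a full base64 string into curl is awkward, so a short script that builds the same request from a local image file can be handier for testing. This is a test sketch, not part of the bot: the file-path argument, the function name, and using `aiohttp` outside the bot are assumptions; the endpoint, payload shape, and `"model": "vision"` value come from the curl example above.

```python
import asyncio
import base64
import sys

import aiohttp

VISION_URL = "http://llama-swap:8080/v1/chat/completions"  # same endpoint as the curl above

async def describe_image(path: str, prompt: str = "What is this?") -> str:
    """Send a local image to the vision model and return its text reply ('' if empty)."""
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    payload = {
        "model": "vision",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
        "max_tokens": 100,
    }
    async with aiohttp.ClientSession() as session:
        async with session.post(VISION_URL, json=payload) as response:
            data = await response.json()
    choices = data.get("choices") or [{}]
    return (choices[0].get("message") or {}).get("content", "") or ""

if __name__ == "__main__":
    print(asyncio.run(describe_image(sys.argv[1])) or "(empty description)")
```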

### Scenario 4: "Error 503 Service Unavailable"

**Diagnosis:** llama-swap process crashed or model failed to load

**Solutions:**

```bash
# Check llama-swap container status
docker compose logs llama-swap -n 100

# Look for error messages, stack traces

# Restart the service
docker compose restart llama-swap

# Monitor startup
docker compose logs llama-swap -f
```
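
A 503 right after a model swap can also just mean the model is still loading. If you would rather have the caller wait briefly than fail immediately, one option is a small retry wrapper like the sketch below. This is not the project's code: the function name, attempt count, and delay are all assumptions.

```python
import asyncio

import aiohttp

async def post_with_retry(session: aiohttp.ClientSession, url: str, payload: dict,
                          attempts: int = 4, delay_s: float = 5.0) -> dict:
    """POST payload, retrying on HTTP 503 (model likely still loading) with a fixed delay."""
    for attempt in range(1, attempts + 1):
        async with session.post(url, json=payload) as response:
            if response.status == 503 and attempt < attempts:
                await asyncio.sleep(delay_s)  # give llama-swap time to finish loading
                continue
            response.raise_for_status()  # raise on the final 503 or any other error status
            return await response.json()
    raise RuntimeError("unreachable")  # loop always returns or raises before this
```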

### Scenario 5: Slow Vision Analysis When AMD is Primary

**Diagnosis:** Both GPUs are under load, so NVIDIA performance is degraded.

**Expected Behavior:** This is normal; both GPUs are working simultaneously.

**If Unacceptably Slow:**
1. Check whether text requests are blocking vision requests
2. Verify GPU memory allocation
3. Consider processing images sequentially instead of in parallel (see the sketch below)
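
A simple way to get option 3 is to cap concurrent vision calls with a semaphore. This is an illustrative sketch, not the bot's code: `analyze_image` is a stand-in for whatever coroutine actually calls the vision endpoint, and a limit of 1 makes processing fully sequential.

```python
import asyncio

VISION_CONCURRENCY = asyncio.Semaphore(1)  # 1 = fully sequential vision processing

async def analyze_image(image_bytes: bytes) -> str:
    """Placeholder for the real vision call."""
    await asyncio.sleep(0.1)
    return "description"

async def analyze_image_limited(image_bytes: bytes) -> str:
    # Only one vision request runs at a time; others wait here
    async with VISION_CONCURRENCY:
        return await analyze_image(image_bytes)

async def main() -> None:
    images = [b"img1", b"img2", b"img3"]
    print(await asyncio.gather(*(analyze_image_limited(img) for img in images)))

if __name__ == "__main__":
    asyncio.run(main())
```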

## Log Analysis Tips

### Enable Detailed Vision Logging

```bash
# Watch only vision-related logs
docker compose logs miku-bot -f 2>&1 | grep -i vision

# Watch with timestamps, filtered by severity
docker compose logs miku-bot -f -t 2>&1 | grep -i vision | grep -E "ERROR|WARNING|INFO"
```

### Check GPU Health During a Vision Request

In one terminal:
```bash
# Monitor the NVIDIA GPU while processing
watch -n 1 nvidia-smi
```

In another:
```bash
# Send the bot an image that triggers vision analysis,
# then watch GPU usage spike in the first terminal
```

### Monitor Both GPUs Simultaneously

```bash
# Terminal 1: NVIDIA
watch -n 1 nvidia-smi

# Terminal 2: AMD
watch -n 1 rocm-smi

# Terminal 3: Logs
docker compose logs miku-bot -f 2>&1 | grep -E "ERROR|vision"
```

## Emergency Fixes

### If Vision Is Completely Broken

```bash
# Full restart of all GPU services
docker compose down
docker compose up -d llama-swap llama-swap-amd
docker compose up -d miku-bot  # after `down`, the bot must be brought back up too

# Wait for services to start (30-60 seconds)
sleep 30

# Test health
curl http://llama-swap:8080/health
curl http://llama-swap-amd:8080/health
```

### Force NVIDIA GPU Vision

If you want vision requests to be attempted even when the NVIDIA health check fails:

```python
# Comment out the vision health check (see bot/utils/llm.py / image_handling.py)
# Not recommended, but it lets requests go through instead of failing fast
```

### Disable Dual-GPU Mode Temporarily

If the AMD GPU is causing issues:

```bash
# Stop the AMD service and restart the bot
docker compose stop llama-swap-amd
docker compose restart miku-bot

# This reverts to single-GPU mode (everything runs on NVIDIA)
```

## Prevention Measures

### 1. Monitor GPU Memory

```bash
# Set up automated monitoring
watch -n 5 "nvidia-smi --query-gpu=memory.used,memory.free --format=csv,noheader"
watch -n 5 "rocm-smi --showmeminfo vram"
```
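
If you would rather keep a history than watch a live view, a tiny poller can append the same NVIDIA query to a log file. This is a sketch (the log path and interval are hypothetical); it only assumes `nvidia-smi` is on the PATH wherever you run it.

```python
import subprocess
import time
from datetime import datetime

def nvidia_memory_csv() -> str:
    """Return 'used, free' memory using the same nvidia-smi query as the watch command above."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used,memory.free", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()

if __name__ == "__main__":
    with open("gpu_memory.log", "a") as log:
        while True:
            log.write(f"{datetime.now().isoformat()} {nvidia_memory_csv()}\n")
            log.flush()
            time.sleep(5)  # same 5-second cadence as the watch example
```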

### 2. Set Appropriate Model TTLs

In `llama-swap-config.yaml`:
```yaml
vision:
  ttl: 1800  # Keep loaded 30 minutes

llama3.1:
  ttl: 1800  # Keep loaded 30 minutes
```

In `llama-swap-rocm-config.yaml`:
```yaml
llama3.1:
  ttl: 1800  # AMD text model

darkidol:
  ttl: 1800  # AMD evil mode
```

### 3. Monitor Container Logs

```bash
# Periodic log check
docker compose logs llama-swap | tail -20
docker compose logs llama-swap-amd | tail -20
docker compose logs miku-bot | grep vision | tail -20
```

### 4. Regular Health Checks

```bash
#!/bin/bash
# Script to check both GPU endpoints

echo "NVIDIA Health:"
curl -sf -o /dev/null http://llama-swap:8080/health && echo "✓ OK" || echo "✗ FAILED"

echo "AMD Health:"
curl -sf -o /dev/null http://llama-swap-amd:8080/health && echo "✓ OK" || echo "✗ FAILED"
```

## Performance Optimization

If vision requests are too slow:

1. **Reduce image quality** before sending to the model (see the sketch below)
2. **Use smaller frames** for video analysis
3. **Batch process** multiple images
4. **Allocate more VRAM** to NVIDIA if available
5. **Reduce concurrent requests** to NVIDIA during peak load
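
For item 1, downscaling and re-encoding an image before base64-encoding it shrinks both the upload and the model's prompt-processing time. A minimal sketch, assuming Pillow is available; the size and quality values are illustrative, not the bot's settings.

```python
import base64
import io

from PIL import Image  # assumes Pillow is installed

def shrink_for_vision(image_bytes: bytes, max_side: int = 1024, quality: int = 85) -> str:
    """Downscale so the longest side is max_side, re-encode as JPEG, and return a data URL."""
    img = Image.open(io.BytesIO(image_bytes)).convert("RGB")
    img.thumbnail((max_side, max_side))  # preserves aspect ratio and only ever shrinks
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=quality)
    b64 = base64.b64encode(buf.getvalue()).decode()
    return f"data:image/jpeg;base64,{b64}"
```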

## Success Indicators

After applying the fix, you should see:

- ✅ Images analyzed within 5-10 seconds (first load: 20-30 seconds)
- ✅ No "Vision service unavailable" errors
- ✅ Log shows `Vision analysis completed successfully`
- ✅ Works correctly whether AMD or NVIDIA is primary GPU
- ✅ No GPU memory errors in nvidia-smi/rocm-smi

## Contact Points for Further Issues

1. Check NVIDIA llama.cpp/llama-swap logs
2. Check AMD ROCm compatibility for your GPU
3. Verify Docker networking (if using custom networks)
4. Check system VRAM (needs ~10GB+ for both models)