# Vision Model Troubleshooting Checklist
## Quick Diagnostics
### 1. Verify Both GPU Services Running
```bash
# Check container status
docker compose ps
# Should show both RUNNING:
# llama-swap (NVIDIA CUDA)
# llama-swap-amd (AMD ROCm)
```
**If llama-swap is not running:**
```bash
docker compose up -d llama-swap
docker compose logs llama-swap
```
**If llama-swap-amd is not running:**
```bash
docker compose up -d llama-swap-amd
docker compose logs llama-swap-amd
```
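For a scripted version of this check, a minimal sketch (it assumes the Compose service names `llama-swap` and `llama-swap-amd` shown above):
```bash
# List the services Compose reports as running, then confirm both GPU services appear
running=$(docker compose ps --services --filter "status=running")
for svc in llama-swap llama-swap-amd; do
  echo "$running" | grep -qx "$svc" && echo "OK: $svc" || echo "MISSING: $svc"
done
```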
### 2. Check NVIDIA Vision Endpoint Health
```bash
# Test NVIDIA endpoint directly
curl -v http://llama-swap:8080/health
# Expected: 200 OK
# If timeout (no response for 5+ seconds):
# - NVIDIA GPU might not have enough VRAM
# - Model might be stuck loading
# - Docker network might be misconfigured
```
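To see how close the endpoint is to that 5-second limit, curl can report the status code and total request time (a quick diagnostic sketch, not something the bot runs):
```bash
# Print the HTTP status and total request time; -m 5 enforces the same 5-second ceiling
curl -s -o /dev/null -m 5 -w "status=%{http_code} time=%{time_total}s\n" http://llama-swap:8080/health
```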
### 3. Check Current GPU State
```bash
# See which GPU is set as primary
cat bot/memory/gpu_state.json
# Expected output:
# {"current_gpu": "amd", "reason": "voice_session"}
# or
# {"current_gpu": "nvidia", "reason": "auto_switch"}
```
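If `jq` is available, you can pull out just the relevant fields (assuming the JSON shape shown above):
```bash
# Print the primary GPU and the reason it was selected, e.g. "amd (voice_session)"
jq -r '"\(.current_gpu) (\(.reason))"' bot/memory/gpu_state.json
```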
### 4. Verify Model Files Exist
```bash
# Check vision model files on disk
ls -lh models/MiniCPM*
# Should show both:
# -rw-r--r-- ... MiniCPM-V-4_5-Q3_K_S.gguf (main model, ~3.3GB)
# -rw-r--r-- ... MiniCPM-V-4_5-mmproj-f16.gguf (projection, ~500MB)
```
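A scripted sanity check for both files, as a sketch; the 100 MB threshold is just a rough guard against partial downloads:
```bash
# Flag missing or suspiciously small GGUF files (a partial download shows up as a tiny file)
for f in models/MiniCPM-V-4_5-Q3_K_S.gguf models/MiniCPM-V-4_5-mmproj-f16.gguf; do
  if [ ! -f "$f" ]; then
    echo "MISSING: $f"
  elif [ "$(stat -c %s "$f")" -lt 100000000 ]; then
    echo "SUSPICIOUSLY SMALL: $f"
  else
    echo "OK: $f"
  fi
done
```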
## Scenario-Based Troubleshooting
### Scenario 1: Vision Works When NVIDIA is Primary, Fails When AMD is Primary
**Diagnosis:** The vision model is being unloaded from the NVIDIA GPU while AMD is primary
**Root Cause:** llama-swap is configured to unload unused models after a TTL
**Solution:**
```yaml
# In llama-swap-config.yaml, increase the TTL for the vision model:
vision:
  ttl: 3600  # increase from 900 to keep the vision model loaded longer
```
**Or:**
```yaml
# Disable the TTL for vision to keep it always loaded:
vision:
  ttl: 0  # 0 means never auto-unload
```
### Scenario 2: "Vision service currently unavailable: Endpoint timeout"
**Diagnosis:** NVIDIA endpoint not responding within 5 seconds
**Causes:**
1. NVIDIA GPU out of memory
2. Vision model stuck loading
3. Network latency
**Solutions:**
```bash
# Check NVIDIA GPU memory
nvidia-smi
# If memory is full, restart NVIDIA container
docker compose restart llama-swap
# Wait for model to load (check logs)
docker compose logs llama-swap -f
# Should see: "model loaded" message
```
**If persistent:** Increase health check timeout in `bot/utils/llm.py`:
```python
# Change from 5 to 10 seconds
async with session.get(f"{vision_url}/health", timeout=aiohttp.ClientTimeout(total=10)) as response:
```
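Before touching the bot code, it helps to confirm whether the timeout is transient or persistent; a small polling sketch:
```bash
# Poll the health endpoint 5 times, 10 s apart; "000" means no HTTP response within 5 seconds
for i in 1 2 3 4 5; do
  echo "attempt $i: $(curl -s -o /dev/null -m 5 -w '%{http_code}' http://llama-swap:8080/health)"
  sleep 10
done
```
Consistent `000` or `5xx` results point at VRAM exhaustion or a stuck model load rather than a one-off network blip.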
### Scenario 3: Vision Model Returns Empty Description
**Diagnosis:** Model loaded but not processing correctly
**Causes:**
1. Model corruption
2. Malformed image input (e.g. bad base64) slipping past input validation
3. Model inference error
**Solutions:**
```bash
# Test vision model directly
curl -X POST http://llama-swap:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "vision",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "text", "text": "What is this?"},
        {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,/9j/4AAQSkZJ..."}}
      ]
    }],
    "max_tokens": 100
  }'
# If the response is empty, check the llama-swap logs for errors
docker compose logs llama-swap -n 50
```
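The curl above needs a full base64 image. Here is a sketch that builds the payload from a local file and prints only the model's reply; `test.jpg` is a placeholder, and it assumes `jq` and GNU `base64` are installed:
```bash
# Encode a local image, build the request body with jq, and print just the returned description
IMG_B64=$(base64 -w0 test.jpg)
jq -n --arg img "$IMG_B64" '{
  model: "vision",
  messages: [{role: "user", content: [
    {type: "text", text: "What is this?"},
    {type: "image_url", image_url: {url: ("data:image/jpeg;base64," + $img)}}
  ]}],
  max_tokens: 100
}' | curl -s -X POST http://llama-swap:8080/v1/chat/completions \
  -H "Content-Type: application/json" -d @- | jq -r '.choices[0].message.content'
```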
### Scenario 4: "Error 503 Service Unavailable"
**Diagnosis:** llama-swap process crashed or model failed to load
**Solutions:**
```bash
# Check llama-swap container status
docker compose logs llama-swap -n 100
# Look for error messages, stack traces
# Restart the service
docker compose restart llama-swap
# Monitor startup
docker compose logs llama-swap -f
```
### Scenario 5: Slow Vision Analysis When AMD is Primary
**Diagnosis:** Both GPUs under load, NVIDIA performance degraded
**Expected Behavior:** This is normal. Both GPUs are working simultaneously.
**If Unacceptably Slow:**
1. Check if text requests are blocking vision requests
2. Verify GPU memory allocation
3. Consider processing images sequentially instead of parallel
## Log Analysis Tips
### Enable Detailed Vision Logging
```bash
# Watch only vision-related logs
docker compose logs miku-bot -f 2>&1 | grep -i vision
# Same, with timestamps, keeping only lines that carry a log level
docker compose logs miku-bot -f -t 2>&1 | grep -i vision | grep -E "ERROR|WARNING|INFO"
```
### Check GPU Health During Vision Request
In one terminal:
```bash
# Monitor NVIDIA GPU while processing
watch -n 1 nvidia-smi
```
In another:
```bash
# Send the bot an image that triggers vision analysis (e.g. post an image in Discord),
# then watch the GPU usage spike in the first terminal
```
### Monitor Both GPUs Simultaneously
```bash
# Terminal 1: NVIDIA
watch -n 1 nvidia-smi
# Terminal 2: AMD
watch -n 1 rocm-smi
# Terminal 3: Logs
docker compose logs miku-bot -f 2>&1 | grep -E "ERROR|vision"
```
## Emergency Fixes
### If Vision Completely Broken
```bash
# Full restart of all GPU services
docker compose down
docker compose up -d llama-swap llama-swap-amd
docker compose up -d miku-bot  # "down" removed the bot container too, so start it rather than restart
# Wait for services to start (30-60 seconds)
sleep 30
# Test health
curl http://llama-swap:8080/health
curl http://llama-swap-amd:8080/health
```
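Instead of a fixed `sleep 30`, you can poll until both endpoints answer (a sketch; the ~120-second cap is arbitrary):
```bash
# Wait up to ~120 s per endpoint before giving up; prints nothing for an endpoint that never recovers
for url in http://llama-swap:8080/health http://llama-swap-amd:8080/health; do
  for _ in $(seq 1 24); do
    curl -sf -m 5 "$url" > /dev/null && { echo "healthy: $url"; break; }
    sleep 5
  done
done
```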
### Force NVIDIA GPU Vision
If you want vision requests to be attempted even when the NVIDIA health check fails:
```python
# Comment out the pre-request vision health check (see bot/utils/llm.py / image_handling.py)
# Not recommended, but it lets requests go through even when the health check fails
```
### Disable Dual-GPU Mode Temporarily
If the AMD GPU is causing issues:
```bash
# Stop the AMD service and restart the bot; this reverts to single-GPU mode (everything on NVIDIA)
docker compose stop llama-swap-amd
docker compose restart miku-bot
```
## Prevention Measures
### 1. Monitor GPU Memory
```bash
# Set up simple periodic monitoring (run each in its own terminal)
watch -n 5 "nvidia-smi --query-gpu=memory.used,memory.free --format=csv,noheader"
watch -n 5 "rocm-smi --showmeminfo vram"
```
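To keep a record instead of only watching live, a sketch that appends NVIDIA memory samples to a CSV; the filename and 30-second interval are arbitrary:
```bash
# Append a timestamped NVIDIA memory sample every 30 seconds (Ctrl+C to stop)
while true; do
  nvidia-smi --query-gpu=timestamp,memory.used,memory.free --format=csv,noheader >> gpu_mem.csv
  sleep 30
done
```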
### 2. Set Appropriate Model TTLs
In `llama-swap-config.yaml`:
```yaml
vision:
  ttl: 1800  # keep loaded 30 minutes
llama3.1:
  ttl: 1800  # keep loaded 30 minutes
```
In `llama-swap-rocm-config.yaml`:
```yaml
llama3.1:
  ttl: 1800  # AMD text model
darkidol:
  ttl: 1800  # AMD evil mode
```
### 3. Monitor Container Logs
```bash
# Periodic log check
docker compose logs llama-swap | tail -20
docker compose logs llama-swap-amd | tail -20
docker compose logs miku-bot | grep vision | tail -20
```
### 4. Regular Health Checks
```bash
#!/bin/bash
# Check both GPU endpoints; -f makes curl fail on HTTP errors (e.g. 503), not just on timeouts
echo "NVIDIA Health:"
curl -sf http://llama-swap:8080/health > /dev/null && echo "✓ OK" || echo "✗ FAILED"
echo "AMD Health:"
curl -sf http://llama-swap-amd:8080/health > /dev/null && echo "✓ OK" || echo "✗ FAILED"
```
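If you save this as a script, it can run on a schedule; a hedged example that adds it to the current user's crontab every 5 minutes (the script path and log file are placeholders):
```bash
# Register the health check to run every 5 minutes, appending output to a log
(crontab -l 2>/dev/null; echo '*/5 * * * * /opt/miku/check_gpu_health.sh >> /var/log/miku_gpu_health.log 2>&1') | crontab -
```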
## Performance Optimization
If vision requests are too slow:
1. **Reduce image quality** before sending to the model (see the sketch after this list)
2. **Use smaller frames** for video analysis
3. **Batch process** multiple images
4. **Allocate more VRAM** to NVIDIA if available
5. **Reduce concurrent requests** to NVIDIA during peak load
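For item 1, a sketch using ImageMagick to downscale and recompress an image before it reaches the model; the filenames, the 1024 px cap, and quality 85 are arbitrary starting points:
```bash
# Shrink the longest side to at most 1024 px (the trailing ">" means never enlarge) and recompress
convert input.jpg -resize '1024x1024>' -quality 85 output.jpg
```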
## Success Indicators
After applying the relevant fix, you should see:
- ✅ Images analyzed within 5-10 seconds (20-30 seconds on first load)
- ✅ No "Vision service unavailable" errors
- ✅ Log shows `Vision analysis completed successfully`
- ✅ Correct behavior whether AMD or NVIDIA is the primary GPU
- ✅ No GPU memory errors in nvidia-smi/rocm-smi
## If Issues Persist
1. Check NVIDIA llama.cpp/llama-swap logs
2. Check AMD ROCm compatibility for your GPU
3. Verify Docker networking (if using custom networks)
4. Check total GPU VRAM (~10 GB+ needed for both models)