Features:
- Built custom ROCm container for AMD RX 6800 GPU
- Added GPU selection toggle in web UI (NVIDIA/AMD)
- Unified model names across both GPUs for seamless switching
- Vision model always uses NVIDIA GPU (optimal performance)
- Text models (llama3.1, darkidol) can use either GPU
- Added /gpu-status and /gpu-select API endpoints
- Implemented GPU state persistence in memory/gpu_state.json

Technical details:
- Multi-stage Dockerfile.llamaswap-rocm with ROCm 6.2.4
- llama.cpp compiled with GGML_HIP=ON for gfx1030 (RX 6800)
- Proper GPU permissions without root (groups 187/989)
- AMD container on port 8091, NVIDIA on port 8090
- Updated bot/utils/llm.py with get_current_gpu_url() and get_vision_gpu_url()
- Modified bot/utils/image_handling.py to always use NVIDIA for vision
- Enhanced web UI with GPU selector button (blue=NVIDIA, red=AMD)

Files modified:
- docker-compose.yml (added llama-swap-amd service)
- bot/globals.py (added LLAMA_AMD_URL)
- bot/api.py (added GPU selection endpoints and helper function)
- bot/utils/llm.py (GPU routing for text models)
- bot/utils/image_handling.py (GPU routing for vision models)
- bot/static/index.html (GPU selector UI)
- llama-swap-rocm-config.yaml (unified model names)

New files:
- Dockerfile.llamaswap-rocm
- bot/memory/gpu_state.json
- bot/utils/gpu_router.py (load balancing utility)
- setup-dual-gpu.sh (setup verification script)
- DUAL_GPU_*.md (documentation files)
# Dual GPU Setup Summary

## What We Built
A secondary llama-swap container optimized for your AMD RX 6800 GPU using ROCm.
## Architecture

```
Primary GPU (NVIDIA GTX 1660)        Secondary GPU (AMD RX 6800)
             ↓                                    ↓
      llama-swap (CUDA)                  llama-swap-amd (ROCm)
         Port: 8090                           Port: 8091
             ↓                                    ↓
       NVIDIA models                         AMD models
       - llama3.1                            - llama3.1-amd
       - darkidol                            - darkidol-amd
       - vision (MiniCPM)                    - moondream-amd
```
## Files Created

- `Dockerfile.llamaswap-rocm` - Custom multi-stage build:
  - Stage 1: Builds llama.cpp with ROCm from source
  - Stage 2: Builds llama-swap from source
  - Stage 3: Runtime image with both binaries
- `llama-swap-rocm-config.yaml` - Model configuration for the AMD GPU
- `docker-compose.yml` - Updated with the `llama-swap-amd` service
- `bot/utils/gpu_router.py` - Load balancing utility
- `bot/globals.py` - Updated with `LLAMA_AMD_URL`
- `setup-dual-gpu.sh` - Setup verification script
- `DUAL_GPU_SETUP.md` - Comprehensive documentation
- `DUAL_GPU_QUICK_REF.md` - Quick reference guide
## Why Custom Build?
- llama.cpp doesn't publish ROCm Docker images (yet)
- llama-swap doesn't provide ROCm variants
- Building from source ensures latest ROCm compatibility
- Full control over compilation flags and optimization
## Build Time
The initial build takes 15-30 minutes depending on your system:
- llama.cpp compilation: ~10-20 minutes
- llama-swap compilation: ~1-2 minutes
- Image layering: ~2-5 minutes
Subsequent builds are much faster due to Docker layer caching.
## Next Steps
Once the build completes:
```bash
# 1. Start both GPU services
docker compose up -d llama-swap llama-swap-amd

# 2. Verify both are running
docker compose ps

# 3. Test NVIDIA GPU
curl http://localhost:8090/health

# 4. Test AMD GPU
curl http://localhost:8091/health

# 5. Monitor logs
docker compose logs -f llama-swap-amd

# 6. Test model loading on AMD
curl -X POST http://localhost:8091/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1-amd",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 50
  }'
```
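The same smoke test can be run from Python, which mirrors how the bot talks to the OpenAI-compatible endpoint. This is a minimal sketch assuming the `requests` library; the model name and port match the AMD configuration above.

```python
import requests

# Send a small chat completion to the AMD llama-swap backend (port 8091).
resp = requests.post(
    "http://localhost:8091/v1/chat/completions",
    json={
        "model": "llama3.1-amd",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 50,
    },
    timeout=120,  # the first request may trigger a model load
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```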
## Device Access

The AMD container has access to:

- `/dev/kfd` - AMD GPU kernel driver
- `/dev/dri` - Direct Rendering Infrastructure
- Groups: `video`, `render`
## Environment Variables

RX 6800-specific settings:

```bash
HSA_OVERRIDE_GFX_VERSION=10.3.0  # Navi 21 (gfx1030) compatibility
ROCM_PATH=/opt/rocm
HIP_VISIBLE_DEVICES=0            # Use first AMD GPU
```
## Bot Integration

Your bot now has two endpoints available:

```python
import globals

# NVIDIA GPU (primary)
nvidia_url = globals.LLAMA_URL      # http://llama-swap:8080

# AMD GPU (secondary)
amd_url = globals.LLAMA_AMD_URL     # http://llama-swap-amd:8080
```
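The change set also adds `get_current_gpu_url()` and `get_vision_gpu_url()` to `bot/utils/llm.py`, backed by the state file `bot/memory/gpu_state.json`. The sketch below shows one way such helpers could work; it assumes a simple `{"selected_gpu": "nvidia"}` layout for the state file, which is a guess at the real format.

```python
import json
from pathlib import Path

import globals

# Hypothetical layout for the persisted selection, e.g. {"selected_gpu": "amd"}
GPU_STATE_FILE = Path("bot/memory/gpu_state.json")

def get_current_gpu_url() -> str:
    """Return the llama-swap URL for text models based on the persisted GPU selection."""
    try:
        state = json.loads(GPU_STATE_FILE.read_text())
    except (FileNotFoundError, json.JSONDecodeError):
        state = {}
    if state.get("selected_gpu") == "amd":
        return globals.LLAMA_AMD_URL
    return globals.LLAMA_URL  # default to the NVIDIA backend

def get_vision_gpu_url() -> str:
    """Vision models always run on the NVIDIA GPU for optimal performance."""
    return globals.LLAMA_URL
```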
Use the `gpu_router` utility for automatic load balancing:

```python
from bot.utils.gpu_router import get_llama_url_with_load_balancing

# Round-robin between GPUs
url, model = get_llama_url_with_load_balancing(task_type="text")

# Prefer AMD for vision
url, model = get_llama_url_with_load_balancing(
    task_type="vision",
    prefer_amd=True,
)
```
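Internally, the router only needs to alternate between the two backends and map the task type to the matching model name on each GPU. This is a rough sketch of that idea and may not match the actual `bot/utils/gpu_router.py` implementation:

```python
from itertools import cycle

import globals

# Model names are unified across GPUs apart from the "-amd" suffix (see the architecture above).
_MODELS = {
    ("text", "nvidia"): "llama3.1",
    ("text", "amd"): "llama3.1-amd",
    ("vision", "nvidia"): "vision",
    ("vision", "amd"): "moondream-amd",
}
_gpu_cycle = cycle(["nvidia", "amd"])

def get_llama_url_with_load_balancing(task_type: str = "text", prefer_amd: bool = False):
    """Pick a GPU (round-robin unless prefer_amd is set) and return (url, model_name)."""
    gpu = "amd" if prefer_amd else next(_gpu_cycle)
    url = globals.LLAMA_AMD_URL if gpu == "amd" else globals.LLAMA_URL
    return url, _MODELS[(task_type, gpu)]
```

A simple round-robin cycle keeps the routing stateless apart from a single iterator; the real utility may additionally consult GPU health or queue depth.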
## Troubleshooting

If the AMD container fails to start:

- Rebuild and inspect the build output: `docker compose build --no-cache llama-swap-amd`
- Verify GPU device access: `ls -l /dev/kfd /dev/dri`
- Check container logs: `docker compose logs llama-swap-amd`
- Confirm the GPU is visible from the host: `lspci | grep -i amd` (should show: Radeon RX 6800)
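To verify both backends programmatically, a small health probe along these lines can help; it assumes the `requests` library and the `/health` endpoints used in the Next Steps section.

```python
import requests

# Health endpoints exposed by the two llama-swap containers (see "Next Steps").
BACKENDS = {
    "nvidia": "http://localhost:8090/health",
    "amd": "http://localhost:8091/health",
}

for name, url in BACKENDS.items():
    try:
        resp = requests.get(url, timeout=5)
        status = "OK" if resp.ok else f"HTTP {resp.status_code}"
    except requests.RequestException as exc:
        status = f"unreachable ({exc.__class__.__name__})"
    print(f"{name}: {status}")
```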
## Performance Notes

RX 6800 specs:

- VRAM: 16 GB
- Architecture: RDNA 2 (Navi 21)
- Compute target: gfx1030

Recommended models:

- Q4_K_M quantization: 5-6 GB per model
- Can load 2-3 models simultaneously
- Good for: Llama 3.1 8B, DarkIdol 8B, Moondream2
## Future Improvements

- Automatic failover: route to AMD if NVIDIA is busy (sketched below)
- Health monitoring: track GPU utilization
- Dynamic routing: use the least-busy GPU
- VRAM monitoring: alert before OOM
- Model preloading: keep common models loaded
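None of these are implemented yet. As a rough illustration of the failover idea only, a text request could simply fall back to the AMD backend when the NVIDIA backend is unreachable or overloaded. A hedged sketch, assuming the `requests` library and the unified model names from the config:

```python
import requests

import globals

def chat_with_failover(messages: list[dict], max_tokens: int = 256) -> dict:
    """Try the NVIDIA backend first; fall back to the AMD backend if it is busy or down."""
    backends = [
        (globals.LLAMA_URL, "llama3.1"),          # primary: NVIDIA
        (globals.LLAMA_AMD_URL, "llama3.1-amd"),  # fallback: AMD
    ]
    last_error = None
    for url, model in backends:
        try:
            resp = requests.post(
                f"{url}/v1/chat/completions",
                json={"model": model, "messages": messages, "max_tokens": max_tokens},
                timeout=60,
            )
            resp.raise_for_status()
            return resp.json()
        except requests.RequestException as exc:
            last_error = exc
    raise RuntimeError("Both GPU backends failed") from last_error
```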