# Dual GPU Setup - NVIDIA + AMD RX 6800

This document describes the dual-GPU configuration for running two llama-swap instances simultaneously:

- **Primary GPU (NVIDIA)**: Runs main models via CUDA
- **Secondary GPU (AMD RX 6800)**: Runs additional models via ROCm

## Architecture Overview

```
┌────────────────────────────────────────────────────────────┐
│                          Miku Bot                          │
│                                                            │
│   LLAMA_URL=http://llama-swap:8080           (NVIDIA)      │
│   LLAMA_AMD_URL=http://llama-swap-amd:8080   (AMD RX 6800) │
└────────────────────────────────────────────────────────────┘
              │                         │
              ▼                         ▼
     ┌──────────────────┐      ┌──────────────────┐
     │    llama-swap    │      │  llama-swap-amd  │
     │      (CUDA)      │      │      (ROCm)      │
     │    Port: 8090    │      │    Port: 8091    │
     └──────────────────┘      └──────────────────┘
              │                         │
              ▼                         ▼
     ┌──────────────────┐      ┌──────────────────┐
     │    NVIDIA GPU    │      │   AMD RX 6800    │
     │  - llama3.1      │      │  - llama3.1-amd  │
     │  - darkidol      │      │  - darkidol-amd  │
     │  - vision        │      │  - moondream-amd │
     └──────────────────┘      └──────────────────┘
```

## Files Created

1. **Dockerfile.llamaswap-rocm** - ROCm-enabled Docker image for the AMD GPU
2. **llama-swap-rocm-config.yaml** - Model configuration for AMD models
3. **docker-compose.yml** - Updated with the `llama-swap-amd` service

## Configuration Details

### llama-swap-amd Service

```yaml
llama-swap-amd:
  build:
    context: .
    dockerfile: Dockerfile.llamaswap-rocm
  container_name: llama-swap-amd
  ports:
    - "8091:8080"  # External access on port 8091
  volumes:
    - ./models:/models
    - ./llama-swap-rocm-config.yaml:/app/config.yaml
  devices:
    - /dev/kfd:/dev/kfd  # AMD GPU kernel driver
    - /dev/dri:/dev/dri  # Direct Rendering Infrastructure
  group_add:
    - video
    - render
  environment:
    - HSA_OVERRIDE_GFX_VERSION=10.3.0  # RX 6800 (Navi 21) compatibility
```

### Available Models on AMD GPU

From `llama-swap-rocm-config.yaml`:

- **llama3.1-amd** - Llama 3.1 8B text model
- **darkidol-amd** - DarkIdol uncensored model
- **moondream-amd** - Moondream2 vision model (smaller, AMD-optimized)

### Model Aliases

You can access AMD models using these aliases:

- `llama3.1-amd`, `text-model-amd`, `amd-text`
- `darkidol-amd`, `evil-model-amd`, `uncensored-amd`
- `moondream-amd`, `vision-amd`, `moondream`

## Usage

### Building and Starting Services

```bash
# Build the AMD ROCm container
docker compose build llama-swap-amd

# Start both GPU services
docker compose up -d llama-swap llama-swap-amd

# Check logs
docker compose logs -f llama-swap-amd
```

### Accessing AMD Models from Bot Code

In your bot code, you can now use either endpoint:

```python
import requests

import globals

# Use NVIDIA GPU (primary)
nvidia_response = requests.post(
    f"{globals.LLAMA_URL}/v1/chat/completions",
    json={"model": "llama3.1", ...}
)

# Use AMD GPU (secondary)
amd_response = requests.post(
    f"{globals.LLAMA_AMD_URL}/v1/chat/completions",
    json={"model": "llama3.1-amd", ...}
)
```

### Load Balancing Strategy

You can implement load balancing by:

1. **Round-robin**: Alternate between GPUs for text generation
2. **Task-specific**:
   - NVIDIA: Primary text + MiniCPM vision (heavy)
   - AMD: Secondary text + Moondream vision (lighter)
3. **Failover**: Use AMD as backup if NVIDIA is busy (see the failover sketch below)

Example load balancing function:

```python
import random

import globals

def get_llama_url(prefer_amd=False):
    """Get llama URL with optional load balancing"""
    if prefer_amd:
        return globals.LLAMA_AMD_URL
    # Random load balancing for text models
    return random.choice([globals.LLAMA_URL, globals.LLAMA_AMD_URL])
```
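For the failover strategy in item 3, a minimal sketch is shown below. It relies on the `/health` endpoint used in the Testing section and the `requests` library; the `check_gpu_health` helper and the timeout value are illustrative and not part of the existing bot code.

```python
import requests

import globals

def check_gpu_health(base_url, timeout=2):
    """Return True if a llama-swap instance answers its /health endpoint."""
    try:
        return requests.get(f"{base_url}/health", timeout=timeout).ok
    except requests.RequestException:
        return False

def get_llama_url_with_failover():
    """Prefer the NVIDIA instance; fall back to AMD if it is unreachable."""
    if check_gpu_health(globals.LLAMA_URL):
        return globals.LLAMA_URL
    return globals.LLAMA_AMD_URL
```

Note that a request routed to the AMD instance must also use the `-amd` model names listed under Model Aliases.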
## Testing

### Test NVIDIA GPU (Port 8090)

```bash
curl http://localhost:8090/health
curl http://localhost:8090/v1/models
```

### Test AMD GPU (Port 8091)

```bash
curl http://localhost:8091/health
curl http://localhost:8091/v1/models
```

### Test Model Loading (AMD)

```bash
curl -X POST http://localhost:8091/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1-amd",
    "messages": [{"role": "user", "content": "Hello from AMD GPU!"}],
    "max_tokens": 50
  }'
```

## Monitoring

### Check GPU Usage

**AMD GPU:**

```bash
# ROCm monitoring
rocm-smi

# Or from host
watch -n 1 rocm-smi
```

**NVIDIA GPU:**

```bash
nvidia-smi
watch -n 1 nvidia-smi
```

### Check Container Resource Usage

```bash
docker stats llama-swap llama-swap-amd
```

## Troubleshooting

### AMD GPU Not Detected

1. Verify ROCm is installed on the host:
   ```bash
   rocm-smi --version
   ```
2. Check device permissions:
   ```bash
   ls -l /dev/kfd /dev/dri
   ```
3. Verify RX 6800 compatibility:
   ```bash
   rocminfo | grep "Name:"
   ```

### Model Loading Issues

If models fail to load on AMD:

1. Check VRAM availability:
   ```bash
   rocm-smi --showmeminfo vram
   ```
2. Adjust `-ngl` (GPU layers) in the config if needed:
   ```yaml
   # Reduce GPU layers for smaller VRAM
   cmd: /app/llama-server ... -ngl 50 ...  # Instead of 99
   ```
3. Check container logs:
   ```bash
   docker compose logs llama-swap-amd
   ```

### GFX Version Mismatch

RX 6800 is Navi 21 (gfx1030). If you see GFX errors:

```yaml
# Set in docker-compose.yml
environment:
  - HSA_OVERRIDE_GFX_VERSION=10.3.0
```

### llama-swap Build Issues

If the ROCm container fails to build:

1. The Dockerfile attempts to build llama-swap from source
2. Alternative: use a pre-built binary or a simpler proxy setup
3. Inspect the build output with a clean rebuild: `docker compose build --no-cache llama-swap-amd`

## Performance Considerations

### Memory Usage

- **RX 6800**: 16GB VRAM
- Q4_K_M/Q4_K_XL models: ~5-6GB each
- Can run 2 models simultaneously or 1 with long context

### Model Selection

**Best for AMD RX 6800:**

- ✅ Q4_K_M/Q4_K_S quantized models (5-6GB)
- ✅ Moondream2 vision (smaller, efficient)
- ⚠️ MiniCPM-V-4.5 (possible but may be tight on VRAM)

### TTL Configuration

Adjust model TTL in `llama-swap-rocm-config.yaml`:

- Lower TTL = more aggressive unloading = more VRAM available
- Higher TTL = less model swapping = faster response times
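For orientation, a model entry in the llama-swap config typically combines `cmd`, `aliases`, and `ttl`. The sketch below is illustrative only: the model path and values are hypothetical, and the `models:` key and `${PORT}` placeholder follow the upstream llama-swap config format rather than the actual contents of `llama-swap-rocm-config.yaml`.

```yaml
# Hypothetical model entry; check llama-swap-rocm-config.yaml for the real one.
models:
  "llama3.1-amd":
    cmd: /app/llama-server --model /models/llama-3.1-8b-q4_k_m.gguf --port ${PORT} -ngl 99
    aliases:
      - text-model-amd
      - amd-text
    ttl: 300  # seconds of idle time before the model is unloaded to free VRAM
```

A TTL of a few minutes is a reasonable middle ground between the two trade-offs above; lower it if you need VRAM back quickly for other models.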
## Advanced: Model-Specific Routing

Create a helper function to route models automatically:

```python
# bot/utils/gpu_router.py
import globals

MODEL_TO_GPU = {
    # NVIDIA models
    "llama3.1": globals.LLAMA_URL,
    "darkidol": globals.LLAMA_URL,
    "vision": globals.LLAMA_URL,
    # AMD models
    "llama3.1-amd": globals.LLAMA_AMD_URL,
    "darkidol-amd": globals.LLAMA_AMD_URL,
    "moondream-amd": globals.LLAMA_AMD_URL,
}

def get_endpoint_for_model(model_name):
    """Get the correct llama-swap endpoint for a model"""
    return MODEL_TO_GPU.get(model_name, globals.LLAMA_URL)

def is_amd_model(model_name):
    """Check if model runs on AMD GPU"""
    return model_name.endswith("-amd")
```

## Environment Variables

Add these to control GPU selection:

```yaml
# In docker-compose.yml
environment:
  - LLAMA_URL=http://llama-swap:8080
  - LLAMA_AMD_URL=http://llama-swap-amd:8080
  - PREFER_AMD_GPU=false      # Set to true to prefer AMD for general tasks
  - AMD_MODELS_ENABLED=true   # Enable/disable AMD models
```

## Future Enhancements

1. **Automatic load balancing**: Monitor GPU utilization and route requests accordingly
2. **Health checks**: Fall back to the primary GPU if AMD fails
3. **Model distribution**: Automatically assign models to GPUs based on VRAM
4. **Performance metrics**: Track response times per GPU
5. **Dynamic routing**: Use the least-busy GPU for new requests

## References

- [ROCm Documentation](https://rocmdocs.amd.com/)
- [llama.cpp ROCm Support](https://github.com/ggml-org/llama.cpp/blob/master/docs/build.md#rocm)
- [llama-swap GitHub](https://github.com/mostlygeek/llama-swap)
- [AMD GPU Compatibility Matrix](https://rocm.docs.amd.com/en/latest/release/gpu_os_support.html)