Dual GPU Setup - NVIDIA + AMD RX 6800

This document describes the dual-GPU configuration for running two llama-swap instances simultaneously:

  • Primary GPU (NVIDIA): Runs main models via CUDA
  • Secondary GPU (AMD RX 6800): Runs additional models via ROCm

Architecture Overview

┌─────────────────────────────────────────────────────────────┐
│                         Miku Bot                            │
│                                                             │
│  LLAMA_URL=http://llama-swap:8080 (NVIDIA)                │
│  LLAMA_AMD_URL=http://llama-swap-amd:8080 (AMD RX 6800)   │
└─────────────────────────────────────────────────────────────┘
                    │                      │
                    │                      │
                    ▼                      ▼
        ┌──────────────────┐    ┌──────────────────┐
        │  llama-swap      │    │  llama-swap-amd  │
        │  (CUDA)          │    │  (ROCm)          │
        │  Port: 8090      │    │  Port: 8091      │
        └──────────────────┘    └──────────────────┘
                    │                      │
                    ▼                      ▼
        ┌──────────────────┐    ┌──────────────────┐
        │  NVIDIA GPU      │    │  AMD RX 6800     │
        │  - llama3.1      │    │  - llama3.1-amd  │
        │  - darkidol      │    │  - darkidol-amd  │
        │  - vision        │    │  - moondream-amd │
        └──────────────────┘    └──────────────────┘
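
A quick way to confirm this layout from the bot container is to ask each instance which models it exposes. The sketch below is illustrative only: it assumes globals.LLAMA_URL and globals.LLAMA_AMD_URL are set as in the diagram and that /v1/models returns the standard OpenAI-style response shape.

# Illustrative check: list the models served by each llama-swap instance
import requests
import globals

for label, base_url in [("NVIDIA", globals.LLAMA_URL), ("AMD", globals.LLAMA_AMD_URL)]:
    try:
        resp = requests.get(f"{base_url}/v1/models", timeout=5)
        resp.raise_for_status()
        names = [m["id"] for m in resp.json().get("data", [])]
        print(f"{label} ({base_url}): {names}")
    except requests.RequestException as exc:
        print(f"{label} ({base_url}): unreachable ({exc})")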

Files Created and Updated

  1. Dockerfile.llamaswap-rocm - ROCm-enabled Docker image for AMD GPU
  2. llama-swap-rocm-config.yaml - Model configuration for AMD models
  3. docker-compose.yml - Updated with llama-swap-amd service

Configuration Details

llama-swap-amd Service

llama-swap-amd:
  build:
    context: .
    dockerfile: Dockerfile.llamaswap-rocm
  container_name: llama-swap-amd
  ports:
    - "8091:8080"  # External access on port 8091
  volumes:
    - ./models:/models
    - ./llama-swap-rocm-config.yaml:/app/config.yaml
  devices:
    - /dev/kfd:/dev/kfd    # AMD GPU kernel driver
    - /dev/dri:/dev/dri    # Direct Rendering Infrastructure
  group_add:
    - video
    - render
  environment:
    - HSA_OVERRIDE_GFX_VERSION=10.3.0  # RX 6800 (Navi 21) compatibility

Available Models on AMD GPU

From llama-swap-rocm-config.yaml:

  • llama3.1-amd - Llama 3.1 8B text model
  • darkidol-amd - DarkIdol uncensored model
  • moondream-amd - Moondream2 vision model (smaller, AMD-optimized)

Model Aliases

You can access AMD models using these aliases:

  • llama3.1-amd, text-model-amd, amd-text
  • darkidol-amd, evil-model-amd, uncensored-amd
  • moondream-amd, vision-amd, moondream

Usage

Building and Starting Services

# Build the AMD ROCm container
docker compose build llama-swap-amd

# Start both GPU services
docker compose up -d llama-swap llama-swap-amd

# Check logs
docker compose logs -f llama-swap-amd

Accessing AMD Models from Bot Code

In your bot code, you can now use either endpoint:

import requests
import globals

# Use NVIDIA GPU (primary)
nvidia_response = requests.post(
    f"{globals.LLAMA_URL}/v1/chat/completions",
    json={"model": "llama3.1", ...}
)

# Use AMD GPU (secondary)
amd_response = requests.post(
    f"{globals.LLAMA_AMD_URL}/v1/chat/completions", 
    json={"model": "llama3.1-amd", ...}
)

Load Balancing Strategy

You can implement load balancing by:

  1. Round-robin: Alternate between GPUs for text generation
  2. Task-specific:
    • NVIDIA: Primary text + MiniCPM vision (heavy)
    • AMD: Secondary text + Moondream vision (lighter)
  3. Failover: Use AMD as backup if NVIDIA is busy

Example load balancing function:

import random
import globals

def get_llama_url(prefer_amd=False):
    """Get llama URL with optional load balancing"""
    if prefer_amd:
        return globals.LLAMA_AMD_URL
    
    # Random load balancing for text models
    return random.choice([globals.LLAMA_URL, globals.LLAMA_AMD_URL])
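
The failover option from the list above can be sketched with a simple reachability check (detecting a "busy" GPU would take more bookkeeping, but the shape is the same). This is only an illustration and reuses the /health endpoint shown in the Testing section below:

import requests
import globals

def get_llama_url_with_failover(timeout=2):
    """Prefer the NVIDIA instance; fall back to AMD if it is unreachable."""
    try:
        # Same /health endpoint used in the Testing section
        requests.get(f"{globals.LLAMA_URL}/health", timeout=timeout).raise_for_status()
        return globals.LLAMA_URL
    except requests.RequestException:
        return globals.LLAMA_AMD_URL

Note that the model name still has to match the endpoint (llama3.1 on NVIDIA vs llama3.1-amd on AMD); the routing helper in the Advanced section below handles that mapping.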

Testing

Test NVIDIA GPU (Port 8090)

curl http://localhost:8090/health
curl http://localhost:8090/v1/models

Test AMD GPU (Port 8091)

curl http://localhost:8091/health
curl http://localhost:8091/v1/models

Test Model Loading (AMD)

curl -X POST http://localhost:8091/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1-amd",
    "messages": [{"role": "user", "content": "Hello from AMD GPU!"}],
    "max_tokens": 50
  }'

Monitoring

Check GPU Usage

AMD GPU:

# ROCm monitoring
rocm-smi

# Or from host
watch -n 1 rocm-smi

NVIDIA GPU:

nvidia-smi
watch -n 1 nvidia-smi

Check Container Resource Usage

docker stats llama-swap llama-swap-amd
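
For a combined view over time, both tools can be polled from the host with a small script. This is a rough sketch: it shells out to the same commands shown above and prints their raw output, since rocm-smi's text format varies between ROCm releases.

# Rough sketch: print VRAM usage for both GPUs every few seconds (Ctrl+C to stop)
import subprocess
import time

def show(cmd):
    try:
        print(subprocess.run(cmd, capture_output=True, text=True, timeout=10).stdout)
    except (OSError, subprocess.TimeoutExpired) as exc:
        print(f"{cmd[0]} failed: {exc}")

while True:
    show(["nvidia-smi", "--query-gpu=memory.used,memory.total", "--format=csv,noheader"])
    show(["rocm-smi", "--showmeminfo", "vram"])
    time.sleep(5)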

Troubleshooting

AMD GPU Not Detected

  1. Verify ROCm is installed on host:

    rocm-smi --version
    
  2. Check device permissions:

    ls -l /dev/kfd /dev/dri
    
  3. Verify RX 6800 compatibility:

    rocminfo | grep "Name:"
    

Model Loading Issues

If models fail to load on AMD:

  1. Check VRAM availability:

    rocm-smi --showmeminfo vram
    
  2. Adjust -ngl (GPU layers) in config if needed:

    # Reduce GPU layers for smaller VRAM
    cmd: /app/llama-server ... -ngl 50 ...  # Instead of 99
    
  3. Check container logs:

    docker compose logs llama-swap-amd
    

GFX Version Mismatch

RX 6800 is Navi 21 (gfx1030). If you see GFX errors:

# Set in docker-compose.yml environment:
HSA_OVERRIDE_GFX_VERSION=10.3.0

llama-swap Build Issues

If the ROCm container fails to build:

  1. The Dockerfile builds llama-swap from source, which is the most likely step to fail
  2. Alternative: use a pre-built llama-swap binary or a simpler proxy setup
  3. Rebuild without cache to see the full build output: docker compose build --no-cache llama-swap-amd

Performance Considerations

Memory Usage

  • RX 6800: 16GB VRAM
    • Q4_K_M/Q4_K_XL models: ~5-6GB each
    • Can run 2 models simultaneously or 1 with long context

Model Selection

Best for AMD RX 6800:

  • Q4_K_M/Q4_K_S quantized models (5-6GB)
  • Moondream2 vision (smaller, efficient)
  • ⚠️ MiniCPM-V-4.5 (possible but may be tight on VRAM)

TTL Configuration

Adjust model TTL in llama-swap-rocm-config.yaml:

  • Lower TTL = more aggressive unloading = more VRAM available
  • Higher TTL = less model swapping = faster response times

Advanced: Model-Specific Routing

Create a helper function to route models automatically:

# bot/utils/gpu_router.py
import globals

MODEL_TO_GPU = {
    # NVIDIA models
    "llama3.1": globals.LLAMA_URL,
    "darkidol": globals.LLAMA_URL,
    "vision": globals.LLAMA_URL,
    
    # AMD models
    "llama3.1-amd": globals.LLAMA_AMD_URL,
    "darkidol-amd": globals.LLAMA_AMD_URL,
    "moondream-amd": globals.LLAMA_AMD_URL,
}

def get_endpoint_for_model(model_name):
    """Get the correct llama-swap endpoint for a model"""
    return MODEL_TO_GPU.get(model_name, globals.LLAMA_URL)

def is_amd_model(model_name):
    """Check if model runs on AMD GPU"""
    return model_name.endswith("-amd")
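
With the router in place, call sites only need the model name. A hypothetical call site for illustration (the import path depends on how the bot package is laid out):

import requests
from utils.gpu_router import get_endpoint_for_model

def chat(model_name, prompt):
    """Send a chat request to whichever llama-swap instance hosts the model."""
    endpoint = get_endpoint_for_model(model_name)
    resp = requests.post(
        f"{endpoint}/v1/chat/completions",
        json={
            "model": model_name,
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": 50,
        },
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

# Routed to the AMD instance because of the -amd mapping above
print(chat("darkidol-amd", "Hello from the AMD GPU!"))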

Environment Variables

Add these to control GPU selection:

# In docker-compose.yml
environment:
  - LLAMA_URL=http://llama-swap:8080
  - LLAMA_AMD_URL=http://llama-swap-amd:8080
  - PREFER_AMD_GPU=false  # Set to true to prefer AMD for general tasks
  - AMD_MODELS_ENABLED=true  # Enable/disable AMD models
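
Whether the bot already reads these variables is not shown here; if it should, a minimal sketch of honoring them when picking a default endpoint could look like this:

import os
import globals

def default_llama_url():
    """Pick the default endpoint based on PREFER_AMD_GPU and AMD_MODELS_ENABLED."""
    amd_enabled = os.getenv("AMD_MODELS_ENABLED", "true").lower() == "true"
    prefer_amd = os.getenv("PREFER_AMD_GPU", "false").lower() == "true"
    if amd_enabled and prefer_amd:
        return globals.LLAMA_AMD_URL
    return globals.LLAMA_URL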

Future Enhancements

  1. Automatic load balancing: Monitor GPU utilization and route requests
  2. Health checks: Fallback to primary GPU if AMD fails
  3. Model distribution: Automatically assign models to GPUs based on VRAM
  4. Performance metrics: Track response times per GPU
  5. Dynamic routing: Use least-busy GPU for new requests

References