
Dual GPU Quick Reference

Quick Start

# 1. Run setup check
./setup-dual-gpu.sh

# 2. Build AMD container
docker compose build llama-swap-amd

# 3. Start both GPUs
docker compose up -d llama-swap llama-swap-amd

# 4. Verify
curl http://localhost:8090/health  # NVIDIA
curl http://localhost:8091/health  # AMD RX 6800

Endpoints

GPU          Container        Port   Internal URL
NVIDIA       llama-swap       8090   http://llama-swap:8080
AMD RX 6800  llama-swap-amd   8091   http://llama-swap-amd:8080
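The table above can be captured in a small helper for scripts that need to pick an endpoint. This is a sketch only: the container names, ports, and internal URLs come from the table, but the `ENDPOINTS` map and `resolve_url()` helper are illustrative, not code from the repo.

```python
# Endpoint map for the two GPU backends (names/ports from the table above).
ENDPOINTS = {
    "nvidia": {"container": "llama-swap", "host_port": 8090,
               "internal_url": "http://llama-swap:8080"},
    "amd": {"container": "llama-swap-amd", "host_port": 8091,
            "internal_url": "http://llama-swap-amd:8080"},
}

def resolve_url(gpu: str, inside_compose: bool = True) -> str:
    """Return the base URL for a GPU backend.

    Inside the compose network the containers talk to each other on
    port 8080; from the host, use the published 8090/8091 ports.
    """
    ep = ENDPOINTS[gpu]
    if inside_compose:
        return ep["internal_url"]
    return f"http://localhost:{ep['host_port']}"
```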

Models

NVIDIA GPU (Primary)

  • llama3.1 - Llama 3.1 8B Instruct
  • darkidol - DarkIdol Uncensored 8B
  • vision - MiniCPM-V-4.5 (4K context)

AMD RX 6800 (Secondary)

  • llama3.1-amd - Llama 3.1 8B Instruct
  • darkidol-amd - DarkIdol Uncensored 8B
  • moondream-amd - Moondream2 Vision (2K context)
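The naming convention above is deliberate: the AMD container serves the same text models under an `-amd` suffix, so switching GPUs is just a name rewrite. A hypothetical helper illustrating that mapping (not taken from the repo):

```python
# Hypothetical helper for the naming scheme above: text models keep their
# unified name on NVIDIA and gain an "-amd" suffix on the RX 6800.
TEXT_MODELS = {"llama3.1", "darkidol"}

def model_for_gpu(name: str, gpu: str) -> str:
    """Map a unified text-model name to the per-GPU model id."""
    if name not in TEXT_MODELS:
        raise ValueError(f"unknown text model: {name}")
    return f"{name}-amd" if gpu == "amd" else name
```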

Commands

Start/Stop

# Start both
docker compose up -d llama-swap llama-swap-amd

# Start only AMD
docker compose up -d llama-swap-amd

# Stop AMD
docker compose stop llama-swap-amd

# Restart AMD with logs
docker compose restart llama-swap-amd && docker compose logs -f llama-swap-amd

Monitoring

# Container status
docker compose ps

# Logs
docker compose logs -f llama-swap-amd

# GPU usage
watch -n 1 nvidia-smi  # NVIDIA
watch -n 1 rocm-smi    # AMD

# Resource usage
docker stats llama-swap llama-swap-amd

Testing

# List available models
curl http://localhost:8091/v1/models | jq

# Test text generation (AMD)
curl -X POST http://localhost:8091/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1-amd",
    "messages": [{"role": "user", "content": "Say hello!"}],
    "max_tokens": 20
  }' | jq

# Test vision model (AMD)
curl -X POST http://localhost:8091/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "moondream-amd",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "text", "text": "Describe this image"},
        {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,..."}}
      ]
    }],
    "max_tokens": 100
  }' | jq
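The same requests can be issued from Python using only the standard library. A minimal sketch: `build_chat_payload()` and `chat()` are illustrative helpers, and the URL assumes the AMD container's published port 8091 as above.

```python
import json
import urllib.request

def build_chat_payload(model: str, prompt: str, max_tokens: int = 20) -> dict:
    """Build an OpenAI-style chat completions request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def chat(base_url: str, payload: dict) -> dict:
    """POST the payload to /v1/chat/completions and decode the reply."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# chat("http://localhost:8091", build_chat_payload("llama3.1-amd", "Say hello!"))
```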

Bot Integration

Using GPU Router

from bot.utils.gpu_router import get_llama_url_with_load_balancing, get_endpoint_for_model

# Load balanced text generation
url, model = get_llama_url_with_load_balancing(task_type="text")

# Specific model
url = get_endpoint_for_model("darkidol-amd")

# Vision on AMD
url, model = get_llama_url_with_load_balancing(task_type="vision", prefer_amd=True)
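The internals of `bot/utils/gpu_router.py` are not shown here. As a rough idea of what such a load balancer might look like, here is a round-robin sketch; the `RoundRobinRouter` class and `BACKENDS` list are assumptions, not the repo's actual implementation.

```python
import itertools

# Hypothetical sketch of a text-model load balancer; the real
# bot/utils/gpu_router.py may work differently.
BACKENDS = [
    ("http://llama-swap:8080", "llama3.1"),         # NVIDIA
    ("http://llama-swap-amd:8080", "llama3.1-amd"), # AMD RX 6800
]

class RoundRobinRouter:
    def __init__(self, backends):
        self._cycle = itertools.cycle(backends)

    def next_backend(self):
        """Alternate between the NVIDIA and AMD endpoints."""
        return next(self._cycle)
```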

Direct Access

import globals

# AMD GPU
amd_url = globals.LLAMA_AMD_URL  # http://llama-swap-amd:8080

# NVIDIA GPU  
nvidia_url = globals.LLAMA_URL   # http://llama-swap:8080
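The selected GPU is persisted in memory/gpu_state.json, but the file format and how `get_current_gpu_url()` reads it are not documented here. A plausible sketch, with the state-file schema and fallback behaviour as assumptions:

```python
import json
from pathlib import Path

STATE_FILE = Path("memory/gpu_state.json")  # assumed location and format

URLS = {
    "nvidia": "http://llama-swap:8080",
    "amd": "http://llama-swap-amd:8080",
}

def get_current_gpu_url(state_file: Path = STATE_FILE) -> str:
    """Return the base URL for the currently selected GPU,
    falling back to NVIDIA when no state has been saved."""
    try:
        gpu = json.loads(state_file.read_text()).get("gpu", "nvidia")
    except FileNotFoundError:
        gpu = "nvidia"
    return URLS.get(gpu, URLS["nvidia"])
```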

Troubleshooting

AMD Container Won't Start

# Check ROCm
rocm-smi

# Check permissions
ls -l /dev/kfd /dev/dri

# Check logs
docker compose logs llama-swap-amd

# Rebuild
docker compose build --no-cache llama-swap-amd

Model Won't Load

# Check VRAM
rocm-smi --showmeminfo vram

# Lower GPU layers in llama-swap-rocm-config.yaml
# Change: -ngl 99
# To:     -ngl 50
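Rather than guessing a lower `-ngl`, the value can be estimated from free VRAM. A back-of-envelope sketch; the per-layer size, layer count, and overhead are rough assumptions for an 8B Q4 model, so tune them against your actual files:

```python
def max_gpu_layers(free_vram_gib: float, n_layers: int = 32,
                   layer_gib: float = 0.15, overhead_gib: float = 1.0) -> int:
    """Estimate how many layers fit in free VRAM.

    Assumes ~0.15 GiB per Q4 layer for an 8B model plus ~1 GiB of
    overhead (KV cache, scratch buffers); rough numbers, not measured.
    """
    usable = max(free_vram_gib - overhead_gib, 0.0)
    return min(n_layers, int(usable / layer_gib))
```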

GFX Version Error

# RX 6800 is gfx1030
# Ensure in docker-compose.yml:
HSA_OVERRIDE_GFX_VERSION=10.3.0

Environment Variables

Add the following to docker-compose.yml under the miku-bot service:

environment:
  - PREFER_AMD_GPU=true          # Prefer AMD for load balancing
  - AMD_MODELS_ENABLED=true      # Enable AMD models
  - LLAMA_AMD_URL=http://llama-swap-amd:8080
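On the Python side these variables would be read via `os.environ`. A sketch of how bot/globals.py might pick them up; the `env_flag()` helper and exact parsing are assumptions:

```python
import os

# Sketch of reading the variables above; the actual bot/globals.py may differ.
def env_flag(name: str, default: bool = False) -> bool:
    """Interpret 'true'/'1'/'yes' (any case) as True."""
    return os.environ.get(name, str(default)).strip().lower() in ("true", "1", "yes")

PREFER_AMD_GPU = env_flag("PREFER_AMD_GPU")
AMD_MODELS_ENABLED = env_flag("AMD_MODELS_ENABLED")
LLAMA_AMD_URL = os.environ.get("LLAMA_AMD_URL", "http://llama-swap-amd:8080")
```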

Files

  • Dockerfile.llamaswap-rocm - ROCm container
  • llama-swap-rocm-config.yaml - AMD model config
  • bot/utils/gpu_router.py - Load balancing utility
  • DUAL_GPU_SETUP.md - Full documentation
  • setup-dual-gpu.sh - Setup verification script

Performance Tips

  1. Model Selection: Use Q4_K quantization for the best size/quality balance
  2. VRAM: The RX 6800 has 16 GB and can run 2-3 Q4 models
  3. TTL: Adjust the model unload TTL in the config files (default 1800 s = 30 min)
  4. Context: Lower the context size (-c 8192) to save VRAM
  5. GPU Layers: -ngl 99 offloads all layers to the GPU; lower the value if VRAM runs short
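Tip 4 can be quantified: the KV cache grows linearly with context size. A rough calculator, assuming Llama-3.1-8B-like dimensions (32 layers, 8 KV heads, head dim 128, fp16 cache); these figures are assumptions, not measured values from this setup:

```python
def kv_cache_gib(n_ctx: int, n_layers: int = 32, n_kv_heads: int = 8,
                 head_dim: int = 128, bytes_per_el: int = 2) -> float:
    """KV cache size: 2 (K and V) * layers * ctx * kv_heads * head_dim * bytes."""
    return 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_el / 2**30

# Halving the context halves the cache:
# kv_cache_gib(16384) -> 2.0 GiB, kv_cache_gib(8192) -> 1.0 GiB
```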

Support