# Dual GPU Quick Reference

## Quick Start

```bash
# 1. Run setup check
./setup-dual-gpu.sh

# 2. Build AMD container
docker compose build llama-swap-amd

# 3. Start both GPUs
docker compose up -d llama-swap llama-swap-amd

# 4. Verify
curl http://localhost:8090/health   # NVIDIA
curl http://localhost:8091/health   # AMD RX 6800
```
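If you want a scripted version of step 4, a minimal readiness check might look like the sketch below. It assumes `/health` simply returns HTTP 200 once each proxy is up, as the curl commands above suggest.

```python
# Poll both llama-swap instances until they report healthy.
import time
import urllib.request

ENDPOINTS = {
    "NVIDIA": "http://localhost:8090/health",
    "AMD RX 6800": "http://localhost:8091/health",
}

def wait_for(name, url, timeout=120):
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    print(f"{name}: healthy")
                    return True
        except OSError:
            pass  # container still starting
        time.sleep(2)
    print(f"{name}: not healthy after {timeout}s")
    return False

for name, url in ENDPOINTS.items():
    wait_for(name, url)
```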
## Endpoints

| GPU | Container | Port | Internal URL |
|---|---|---|---|
| NVIDIA | llama-swap | 8090 | http://llama-swap:8080 |
| AMD RX 6800 | llama-swap-amd | 8091 | http://llama-swap-amd:8080 |
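The host ports (8090/8091) are what you use from outside Docker; the internal URLs are what the bot uses from inside the compose network. A small sketch that lists the models each instance serves via the OpenAI-compatible `/v1/models` endpoint:

```python
# Query /v1/models on both llama-swap instances from the host.
import json
import urllib.request

for name, base in [("NVIDIA", "http://localhost:8090"),
                   ("AMD RX 6800", "http://localhost:8091")]:
    with urllib.request.urlopen(f"{base}/v1/models", timeout=10) as resp:
        ids = [m["id"] for m in json.load(resp)["data"]]
    print(f"{name}: {ids}")
```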
## Models

### NVIDIA GPU (Primary)

- `llama3.1` - Llama 3.1 8B Instruct
- `darkidol` - DarkIdol Uncensored 8B
- `vision` - MiniCPM-V-4.5 (4K context)

### AMD RX 6800 (Secondary)

- `llama3.1-amd` - Llama 3.1 8B Instruct
- `darkidol-amd` - DarkIdol Uncensored 8B
- `moondream-amd` - Moondream2 Vision (2K context)
## Commands

### Start/Stop

```bash
# Start both
docker compose up -d llama-swap llama-swap-amd

# Start only AMD
docker compose up -d llama-swap-amd

# Stop AMD
docker compose stop llama-swap-amd

# Restart AMD and follow its logs
docker compose restart llama-swap-amd && docker compose logs -f llama-swap-amd
```
### Monitoring

```bash
# Container status
docker compose ps

# Logs
docker compose logs -f llama-swap-amd

# GPU usage
watch -n 1 nvidia-smi   # NVIDIA
watch -n 1 rocm-smi     # AMD

# Resource usage
docker stats llama-swap llama-swap-amd
```
### Testing

```bash
# List available models
curl http://localhost:8091/v1/models | jq

# Test text generation (AMD)
curl -X POST http://localhost:8091/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1-amd",
    "messages": [{"role": "user", "content": "Say hello!"}],
    "max_tokens": 20
  }' | jq

# Test vision model (AMD)
curl -X POST http://localhost:8091/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "moondream-amd",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "text", "text": "Describe this image"},
        {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,..."}}
      ]
    }],
    "max_tokens": 100
  }' | jq
```
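The same vision request is easier to build in Python, since the base64 payload is awkward to construct on the command line. A sketch that mirrors the curl example above; the image path is a placeholder:

```python
# Send a local image to the AMD vision model via the OpenAI-compatible API.
import base64
import json
import urllib.request

with open("test.jpg", "rb") as f:  # placeholder image path
    image_b64 = base64.b64encode(f.read()).decode()

payload = {
    "model": "moondream-amd",
    "messages": [{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image"},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
    "max_tokens": 100,
}

req = urllib.request.Request(
    "http://localhost:8091/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req, timeout=120) as resp:
    print(json.load(resp)["choices"][0]["message"]["content"])
```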
## Bot Integration

### Using GPU Router

```python
from bot.utils.gpu_router import get_llama_url_with_load_balancing, get_endpoint_for_model

# Load-balanced text generation
url, model = get_llama_url_with_load_balancing(task_type="text")

# Specific model
url = get_endpoint_for_model("darkidol-amd")

# Vision on AMD
url, model = get_llama_url_with_load_balancing(task_type="vision", prefer_amd=True)
```
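How the returned endpoint is used is up to the caller. A plausible sketch, assuming the router returns a base URL such as `http://llama-swap-amd:8080` together with a model name, as the snippet above suggests:

```python
# Route a request, then call the chosen GPU's OpenAI-compatible endpoint.
import json
import urllib.request

from bot.utils.gpu_router import get_llama_url_with_load_balancing

url, model = get_llama_url_with_load_balancing(task_type="text")

payload = {
    "model": model,
    "messages": [{"role": "user", "content": "Say hello!"}],
    "max_tokens": 20,
}
req = urllib.request.Request(
    f"{url}/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req, timeout=60) as resp:
    print(json.load(resp)["choices"][0]["message"]["content"])
```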
### Direct Access

```python
import globals

# AMD GPU
amd_url = globals.LLAMA_AMD_URL    # http://llama-swap-amd:8080

# NVIDIA GPU
nvidia_url = globals.LLAMA_URL     # http://llama-swap:8080
```
## Troubleshooting

### AMD Container Won't Start

```bash
# Check ROCm
rocm-smi

# Check permissions
ls -l /dev/kfd /dev/dri

# Check logs
docker compose logs llama-swap-amd

# Rebuild
docker compose build --no-cache llama-swap-amd
```
### Model Won't Load

```bash
# Check VRAM
rocm-smi --showmeminfo vram

# Lower GPU layers in llama-swap-rocm-config.yaml
# Change: -ngl 99
# To:     -ngl 50
```
### GFX Version Error

```bash
# The RX 6800 is gfx1030, so docker-compose.yml must set:
HSA_OVERRIDE_GFX_VERSION=10.3.0
```
## Environment Variables

Add to docker-compose.yml under the miku-bot service:

```yaml
environment:
  - PREFER_AMD_GPU=true        # Prefer AMD for load balancing
  - AMD_MODELS_ENABLED=true    # Enable AMD models
  - LLAMA_AMD_URL=http://llama-swap-amd:8080
```
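On the bot side these flags are presumably read from the environment at startup. A hypothetical sketch of that parsing; the real `globals.py` / `gpu_router.py` may handle it differently:

```python
# Hypothetical env-var parsing for the settings above.
import os

PREFER_AMD_GPU = os.getenv("PREFER_AMD_GPU", "false").lower() == "true"
AMD_MODELS_ENABLED = os.getenv("AMD_MODELS_ENABLED", "false").lower() == "true"
LLAMA_AMD_URL = os.getenv("LLAMA_AMD_URL", "http://llama-swap-amd:8080")
```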
## Files

- `Dockerfile.llamaswap-rocm` - ROCm container
- `llama-swap-rocm-config.yaml` - AMD model config
- `bot/utils/gpu_router.py` - Load balancing utility
- `DUAL_GPU_SETUP.md` - Full documentation
- `setup-dual-gpu.sh` - Setup verification script
## Performance Tips

- Model Selection: Use Q4_K quantization for the best size/quality balance
- VRAM: The RX 6800 has 16 GB and can run 2-3 Q4 models
- TTL: Adjust in the config files (default 1800 s = 30 min)
- Context: Lower the context size (`-c 8192`) to save VRAM
- GPU Layers: `-ngl 99` offloads everything to the GPU; lower it if needed
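A rough back-of-envelope behind the "2-3 Q4 models" figure, assuming roughly 4.9 GB of weights for an 8B Q4_K_M file and an f16 KV cache with Llama 3.1 8B's layout (32 layers, 8 KV heads, head dim 128):

```python
# Rough per-model VRAM estimate for an 8B Q4_K_M model (assumptions, not measurements).
weights_gb = 4.9                        # typical Q4_K_M file size for an 8B model
layers, kv_heads, head_dim = 32, 8, 128
ctx = 8192                              # context size (-c 8192)
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * 2   # K+V, 2 bytes each (f16)
kv_gb = kv_bytes_per_token * ctx / 1024**3                  # ~1.0 GB at 8K context
total_gb = weights_gb + kv_gb + 0.5     # +0.5 GB for compute buffers/overhead
print(f"~{total_gb:.1f} GB per model -> {int(16 // total_gb)} fit in 16 GB")
```

Shrinking the context or mixing in the much smaller Moondream2 is what pushes this toward three models on one card.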
## Support

- ROCm Docs: https://rocmdocs.amd.com/
- llama.cpp: https://github.com/ggml-org/llama.cpp
- llama-swap: https://github.com/mostlygeek/llama-swap