moved AI generated readmes to readme folder (may delete)

2026-01-27 19:57:48 +02:00
parent 0f1c30f757
commit c58b941587
34 changed files with 8709 additions and 770 deletions
--- a/readmes/DUAL_GPU_BUILD_SUMMARY.md
+++ b/readmes/DUAL_GPU_BUILD_SUMMARY.md
@@ -0,0 +1,184 @@
+# Dual GPU Setup Summary
+
+## What We Built
+
+A secondary llama-swap container optimized for your **AMD RX 6800** GPU using ROCm.
+
+### Architecture
+
+```
+Primary GPU (NVIDIA GTX 1660)     Secondary GPU (AMD RX 6800)
+         ↓                                    ↓
+   llama-swap (CUDA)                  llama-swap-amd (ROCm)
+   Port: 8090                         Port: 8091
+         ↓                                    ↓
+   NVIDIA models                       AMD models
+   - llama3.1                         - llama3.1-amd
+   - darkidol                         - darkidol-amd
+   - vision (MiniCPM)                 - moondream-amd
+```
+
+## Files Created
+
+1. **Dockerfile.llamaswap-rocm** - Custom multi-stage build:
+   - Stage 1: Builds llama.cpp with ROCm from source
+   - Stage 2: Builds llama-swap from source
+   - Stage 3: Runtime image with both binaries
+
+2. **llama-swap-rocm-config.yaml** - Model configuration for AMD GPU
+
+3. **docker-compose.yml** - Updated with `llama-swap-amd` service
+
+4. **bot/utils/gpu_router.py** - Load balancing utility
+
+5. **bot/globals.py** - Updated with `LLAMA_AMD_URL`
+
+6. **setup-dual-gpu.sh** - Setup verification script
+
+7. **DUAL_GPU_SETUP.md** - Comprehensive documentation
+
+8. **DUAL_GPU_QUICK_REF.md** - Quick reference guide
+
+## Why Custom Build?
+
+- llama.cpp doesn't publish ROCm Docker images (yet)
+- llama-swap doesn't provide ROCm variants
+- Building from source ensures latest ROCm compatibility
+- Full control over compilation flags and optimization
+
+## Build Time
+
+The initial build takes 15-30 minutes depending on your system:
+- llama.cpp compilation: ~10-20 minutes
+- llama-swap compilation: ~1-2 minutes
+- Image layering: ~2-5 minutes
+
+Subsequent builds are much faster due to Docker layer caching.
+
+## Next Steps
+
+Once the build completes:
+
+```bash
+# 1. Start both GPU services
+docker compose up -d llama-swap llama-swap-amd
+
+# 2. Verify both are running
+docker compose ps
+
+# 3. Test NVIDIA GPU
+curl http://localhost:8090/health
+
+# 4. Test AMD GPU
+curl http://localhost:8091/health
+
+# 5. Monitor logs
+docker compose logs -f llama-swap-amd
+
+# 6. Test model loading on AMD
+curl -X POST http://localhost:8091/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "llama3.1-amd",
+    "messages": [{"role": "user", "content": "Hello!"}],
+    "max_tokens": 50
+  }'
+```
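+
+The two `curl` health checks above can also be scripted. The following is a minimal sketch, assuming the host ports from the architecture diagram (8090 and 8091) and that `/health` answers with HTTP 200 when a service is up; `is_healthy` and `check_all` are hypothetical helper names, not part of this repository.
+
+```python
+import urllib.request
+
+# Host ports taken from the architecture diagram (assumed to be published on localhost).
+ENDPOINTS = {
+    "nvidia": "http://localhost:8090",
+    "amd": "http://localhost:8091",
+}
+
+
+def is_healthy(base_url, timeout=5.0, opener=urllib.request.urlopen):
+    """Return True if GET <base_url>/health answers with HTTP 200."""
+    try:
+        with opener(f"{base_url}/health", timeout=timeout) as response:
+            return response.status == 200
+    except OSError:
+        # urllib's URLError and HTTPError are OSError subclasses, so refused
+        # connections, timeouts, and non-2xx responses all end up here.
+        return False
+
+
+def check_all(endpoints=ENDPOINTS, probe=is_healthy):
+    """Map each endpoint name to its health status."""
+    return {name: probe(url) for name, url in endpoints.items()}
+```
+
+Calling `check_all()` on the host returns a dict such as `{"nvidia": True, "amd": False}`, which makes it easy to see which container still needs attention.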
+
+## Device Access
+
+The AMD container has access to:
+- `/dev/kfd` - AMD Kernel Fusion Driver, the compute interface ROCm uses
+- `/dev/dri` - Direct Rendering Infrastructure (GPU render nodes)
+- Groups: `video`, `render`
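+
+A quick way to confirm these prerequisites from inside the container (or on the host) is sketched below. It assumes a Linux system and only checks that the device paths exist and that the current process belongs to the listed groups; `missing_devices` and `missing_groups` are hypothetical helper names.
+
+```python
+import grp
+import os
+
+REQUIRED_DEVICES = ("/dev/kfd", "/dev/dri")
+REQUIRED_GROUPS = ("video", "render")
+
+
+def missing_devices(paths=REQUIRED_DEVICES, exists=os.path.exists):
+    """Return the device paths that are not present."""
+    return [path for path in paths if not exists(path)]
+
+
+def missing_groups(names=REQUIRED_GROUPS, current_gids=None, lookup=None):
+    """Return the group names the current process is not a member of."""
+    if current_gids is None:
+        current_gids = os.getgroups() + [os.getegid()]
+    if lookup is None:
+        lookup = lambda name: grp.getgrnam(name).gr_gid
+    gids = set(current_gids)
+    missing = []
+    for name in names:
+        try:
+            gid = lookup(name)
+        except KeyError:
+            # The group does not exist on this system at all.
+            missing.append(name)
+            continue
+        if gid not in gids:
+            missing.append(name)
+    return missing
+```
+
+Both functions returning empty lists suggests the device and group setup is in place; it does not prove that ROCm itself can use the GPU.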
+
+## Environment Variables
+
+RX 6800-specific settings:
+```bash
+HSA_OVERRIDE_GFX_VERSION=10.3.0  # Navi 21 (gfx1030) compatibility
+ROCM_PATH=/opt/rocm
+HIP_VISIBLE_DEVICES=0            # Use first AMD GPU
+```
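+
+To verify that the container actually received these values, a small check along the following lines can be run inside it. This is a sketch only; `env_mismatches` is a hypothetical helper, and the expected values are simply the ones listed above.
+
+```python
+import os
+
+# Expected values copied from the settings listed above.
+EXPECTED_ENV = {
+    "HSA_OVERRIDE_GFX_VERSION": "10.3.0",
+    "ROCM_PATH": "/opt/rocm",
+    "HIP_VISIBLE_DEVICES": "0",
+}
+
+
+def env_mismatches(expected=EXPECTED_ENV, environ=None):
+    """Return {name: (expected, actual)} for every variable that differs or is unset."""
+    environ = os.environ if environ is None else environ
+    return {
+        name: (value, environ.get(name))
+        for name, value in expected.items()
+        if environ.get(name) != value
+    }
+```
+
+An empty dict means all three variables match; an unset variable shows up with `None` as its actual value.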
+
+## Bot Integration
+
+Your bot now has two endpoints available:
+
+```python
+import globals
+
+# NVIDIA GPU (primary)
+nvidia_url = globals.LLAMA_URL  # http://llama-swap:8080
+
+# AMD GPU (secondary)
+amd_url = globals.LLAMA_AMD_URL  # http://llama-swap-amd:8080
+```
+
+Use the `gpu_router` utility for automatic load balancing:
+
+```python
+from bot.utils.gpu_router import get_llama_url_with_load_balancing
+
+# Round-robin between GPUs
+url, model = get_llama_url_with_load_balancing(task_type="text")
+
+# Prefer AMD for vision
+url, model = get_llama_url_with_load_balancing(
+    task_type="vision",
+    prefer_amd=True
+)
+```
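+
+The source of `gpu_router.py` is not shown in this summary, so the following is only an illustration of how round-robin routing with an AMD preference could work, not the actual implementation. The function name `pick_backend` and the `BACKENDS` table are hypothetical; the URLs and model names are taken from the snippets and diagram above.
+
+```python
+import itertools
+
+# Hypothetical routing table built from the URLs and model names in this document.
+BACKENDS = {
+    "nvidia": {"url": "http://llama-swap:8080", "text": "llama3.1", "vision": "vision"},
+    "amd": {"url": "http://llama-swap-amd:8080", "text": "llama3.1-amd", "vision": "moondream-amd"},
+}
+
+# Endless nvidia, amd, nvidia, amd, ... sequence used for round-robin selection.
+_rotation = itertools.cycle(["nvidia", "amd"])
+
+
+def pick_backend(task_type="text", prefer_amd=False):
+    """Return (url, model) for the next backend, or for AMD when prefer_amd is set."""
+    name = "amd" if prefer_amd else next(_rotation)
+    backend = BACKENDS[name]
+    return backend["url"], backend[task_type]
+```
+
+In this sketch a call with `prefer_amd=True` does not advance the rotation, so explicit AMD requests do not skew the alternation of the remaining traffic.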
+
+## Troubleshooting
+
+If the AMD container fails to start:
+
+1. **Rebuild without cache and watch the build output:**
+   ```bash
+   docker compose build --no-cache llama-swap-amd
+   ```
+
+2. **Verify GPU access:**
+   ```bash
+   ls -l /dev/kfd /dev/dri
+   ```
+
+3. **Check container logs:**
+   ```bash
+   docker compose logs llama-swap-amd
+   ```
+
+4. **Test GPU from host:**
+   ```bash
+   lspci | grep -i amd
+   # Should list a Navi 21 device, e.g. [Radeon RX 6800/6800 XT / 6900 XT]
+   ```
+
+## Performance Notes
+
+**RX 6800 Specs:**
+- VRAM: 16GB
+- Architecture: RDNA 2 (Navi 21)
+- Compute target: gfx1030
+
+**Recommended Models:**
+- Use Q4_K_M quantization: roughly 5-6GB of VRAM per model
+- 2-3 such models can stay loaded simultaneously in 16GB
+- Good fits: Llama 3.1 8B, DarkIdol 8B, Moondream2
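+
+The "2-3 models" figure follows from dividing the 16GB of VRAM by the 5-6GB per-model estimate. A small sketch of that arithmetic is below; `models_that_fit` is a hypothetical helper, and the optional `reserve_gb` parameter is an assumption added to account for context buffers and other overhead that this document does not quantify.
+
+```python
+def models_that_fit(vram_gb, model_gb, reserve_gb=0.0):
+    """Return how many models of size model_gb fit into vram_gb minus a reserve."""
+    if model_gb <= 0:
+        raise ValueError("model_gb must be positive")
+    usable = vram_gb - reserve_gb
+    return max(0, int(usable // model_gb))
+
+
+# 16GB card with the per-model estimates from the list above.
+print(models_that_fit(16, 5))  # 3
+print(models_that_fit(16, 6))  # 2
+```
+
+Reserving headroom lowers the count quickly: with `reserve_gb=2`, both the 5GB and the 6GB estimate yield two models.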
+
+## Future Improvements
+
+1. **Automatic failover:** Route to AMD if NVIDIA is busy
+2. **Health monitoring:** Track GPU utilization
+3. **Dynamic routing:** Use least-busy GPU
+4. **VRAM monitoring:** Alert before OOM
+5. **Model preloading:** Keep common models loaded
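+
+As a starting point for the failover idea, the sketch below picks the first endpoint that an availability check accepts. It is not implemented in this commit; `choose_endpoint` and the `is_available` callback are hypothetical, and what counts as "busy" or "available" is left to the caller.
+
+```python
+def choose_endpoint(candidates, is_available):
+    """Return the first URL in candidates for which is_available(url) is true, else None."""
+    for url in candidates:
+        if is_available(url):
+            return url
+    return None
+
+
+# Example: prefer the NVIDIA service and fall back to AMD.
+PREFERRED_ORDER = ["http://llama-swap:8080", "http://llama-swap-amd:8080"]
+```
+
+Passing `PREFERRED_ORDER` with a health or load probe as `is_available` routes to AMD only when the NVIDIA service is rejected by that probe.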
+
+## Resources
+
+- [ROCm Documentation](https://rocmdocs.amd.com/)
+- [llama.cpp ROCm Build](https://github.com/ggml-org/llama.cpp/blob/master/docs/build.md#rocm)
+- [llama-swap GitHub](https://github.com/mostlygeek/llama-swap)
+- [Full Setup Guide](./DUAL_GPU_SETUP.md)
+- [Quick Reference](./DUAL_GPU_QUICK_REF.md)