Organize documentation: Move all .md files to readmes/ directory
readmes/LLAMA_CPP_SETUP.md (new file, 199 lines)
# Llama.cpp Migration - Model Setup Guide

## Overview

This bot now uses **llama.cpp** with **llama-swap** instead of Ollama. This provides:

- ✅ Automatic model unloading after inactivity (saves VRAM)
- ✅ Seamless model switching between text and vision models
- ✅ OpenAI-compatible API
- ✅ Better resource management
## Required Models

You need to download two GGUF model files and place them in the repository's `models/` directory (mounted as `/models` inside the container):
### 1. Text Generation Model: Llama 3.1 8B

**Recommended:** Meta-Llama-3.1-8B-Instruct (Q4_K_M quantization)

**Download from HuggingFace:**
```bash
# Using huggingface-cli (recommended)
huggingface-cli download bartowski/Meta-Llama-3.1-8B-Instruct-GGUF \
  Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
  --local-dir ./models \
  --local-dir-use-symlinks False

# Or download manually from:
# https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF/blob/main/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf
```
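If `huggingface-cli` isn't available on your system, it ships with the `huggingface_hub` Python package:

```bash
pip install -U "huggingface_hub[cli]"
```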
**Rename the file to:**
```bash
mv models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf models/llama3.1.gguf
```

**File size:** ~4.9 GB
**VRAM usage:** ~5-6 GB
### 2. Vision Model: Moondream 2

**Moondream 2** is a small but capable vision-language model.

**Download model and projector:**
```bash
# Download the main model
wget -P models/ https://huggingface.co/vikhyatk/moondream2/resolve/main/moondream-0_5b-int8.gguf
# Rename for clarity
mv models/moondream-0_5b-int8.gguf models/moondream.gguf

# Download the multimodal projector (required for vision)
wget -P models/ https://huggingface.co/vikhyatk/moondream2/resolve/main/moondream-mmproj-f16.gguf
# Rename for clarity
mv models/moondream-mmproj-f16.gguf models/moondream-mmproj.gguf
```
**Alternative download locations:**
- Main: https://huggingface.co/vikhyatk/moondream2
- GGUF versions: https://huggingface.co/vikhyatk/moondream2/tree/main

**File sizes:**
- moondream.gguf: ~500 MB
- moondream-mmproj.gguf: ~1.2 GB

**VRAM usage:** ~2-3 GB
## Directory Structure

After downloading, your `models/` directory should look like this:

```
models/
├── .gitkeep
├── llama3.1.gguf           (~4.9 GB) - Text generation
├── moondream.gguf          (~500 MB) - Vision model
└── moondream-mmproj.gguf   (~1.2 GB) - Vision projector
```
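To get a similar listing with file sizes locally (assuming the `tree` utility is installed):

```bash
# -h prints human-readable file sizes
tree -h models/
```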
## Alternative Models

If you want to use different models:

### Alternative Text Models
- **Llama 3.2 3B** (smaller, faster): `Llama-3.2-3B-Instruct-Q4_K_M.gguf`
- **Qwen 2.5 7B**: `Qwen2.5-7B-Instruct-Q4_K_M.gguf`
- **Mistral 7B**: `Mistral-7B-Instruct-v0.3-Q4_K_M.gguf`

### Alternative Vision Models
- **LLaVA 1.5 7B**: larger, more capable vision model
- **BakLLaVA**: another vision-language option
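For example, swapping in the smaller Llama 3.2 3B could look like this (the repo path and filename are assumptions based on bartowski's GGUF uploads; verify them on HuggingFace first):

```bash
# Assumed repo path - confirm it exists before downloading
huggingface-cli download bartowski/Llama-3.2-3B-Instruct-GGUF \
  Llama-3.2-3B-Instruct-Q4_K_M.gguf \
  --local-dir ./models \
  --local-dir-use-symlinks False
```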
**Important:** If you use different models, update `llama-swap-config.yaml`:
```yaml
models:
  your-model-name:
    cmd: llama-server --port ${PORT} --model /models/your-model.gguf -ngl 99 -c 4096 --host 0.0.0.0
    ttl: 30m
```
And update environment variables in `docker-compose.yml`:
```yaml
environment:
  - TEXT_MODEL=your-model-name
  - VISION_MODEL=your-vision-model
```
## Verification

After placing the models in the directory, verify:

```bash
ls -lh models/
# Should show:
# llama3.1.gguf          (~4.9 GB)
# moondream.gguf         (~500 MB)
# moondream-mmproj.gguf  (~1.2 GB)
```
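As an extra guard against truncated or failed downloads, you can check that each file starts with the GGUF magic bytes (a valid GGUF file begins with the ASCII string `GGUF`):

```bash
# Each line should end with "GGUF"; anything else means a bad download
for f in models/*.gguf; do
  printf '%s: %s\n' "$f" "$(head -c 4 "$f")"
done
```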
## Starting the Bot

Once the models are in place:

```bash
docker-compose up -d
```
Check the logs to ensure the models load correctly:
```bash
docker-compose logs -f llama-swap
```

You should see output along these lines:
```
✅ Model llama3.1 loaded successfully
✅ Model moondream ready for vision tasks
```
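To confirm the API is reachable end to end, you can also send a test request through llama-swap's OpenAI-compatible endpoint (port 8080 is assumed from the monitoring URL below; adjust if your compose file maps a different port):

```bash
# List the configured models - a quick liveness check
curl http://localhost:8080/v1/models

# A one-off chat completion; requesting "llama3.1" should trigger a model load
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.1", "messages": [{"role": "user", "content": "Say hello"}]}'
```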
## Monitoring

Access the llama-swap web UI at:
```
http://localhost:8080/ui
```

This shows:
- Currently loaded models
- Model swap history
- Request logs
- Auto-unload timers
## Troubleshooting

### Model not found error
- Ensure the files are in the correct `models/` directory
- Check that filenames match exactly what's in `llama-swap-config.yaml`
- Verify file permissions (the files must be readable by the Docker user)
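If the files look right on the host, check what the container actually sees (the service name `llama-swap` matches the logs command used above):

```bash
# List the mounted models directory from inside the container
docker-compose exec llama-swap ls -lh /models
```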
### CUDA/GPU errors
- Ensure the NVIDIA container runtime works: `docker run --rm --gpus all nvidia/cuda:12.0.0-base-ubuntu22.04 nvidia-smi`
- Update your NVIDIA drivers if needed
- Check GPU memory: the models need ~6-8 GB VRAM combined, but only one is loaded at a time
### Model loads but generates gibberish
- Usually caused by a wrong quantization or a corrupted download
- Re-download the model file
- Try a different quantization (Q4_K_M is recommended)
## Resource Usage

With TTL-based unloading:

- **Idle:** ~0 GB VRAM (models unloaded)
- **Text generation active:** ~5-6 GB VRAM (llama3.1 loaded)
- **Vision analysis active:** ~2-3 GB VRAM (moondream loaded)
- **Switching:** brief spike as models swap (~1-2 seconds)
The `ttl` settings in `llama-swap-config.yaml` control auto-unload:

- Text model: 30 minutes of inactivity
- Vision model: 15 minutes of inactivity (used less frequently)
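For reference, a minimal sketch of what those two entries might look like, reusing the flags from the config example above (the `--mmproj` flag for the vision model is an assumption; multimodal support varies across llama.cpp builds):

```yaml
models:
  llama3.1:
    cmd: llama-server --port ${PORT} --model /models/llama3.1.gguf -ngl 99 -c 4096 --host 0.0.0.0
    ttl: 30m   # unload after 30 minutes of inactivity
  moondream:
    # --mmproj is assumed here; check your llama.cpp build's multimodal flags
    cmd: llama-server --port ${PORT} --model /models/moondream.gguf --mmproj /models/moondream-mmproj.gguf -ngl 99 --host 0.0.0.0
    ttl: 15m   # unload sooner; vision is used less frequently
```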
---
## Quick Start Summary

```bash
# 1. Download models
huggingface-cli download bartowski/Meta-Llama-3.1-8B-Instruct-GGUF Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf --local-dir ./models
wget -P models/ https://huggingface.co/vikhyatk/moondream2/resolve/main/moondream-0_5b-int8.gguf
wget -P models/ https://huggingface.co/vikhyatk/moondream2/resolve/main/moondream-mmproj-f16.gguf

# 2. Rename files
mv models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf models/llama3.1.gguf
mv models/moondream-0_5b-int8.gguf models/moondream.gguf
mv models/moondream-mmproj-f16.gguf models/moondream-mmproj.gguf

# 3. Start the bot
docker-compose up -d

# 4. Monitor
docker-compose logs -f
```

That's it! 🎉