# Llama.cpp Migration - Model Setup Guide

## Overview

This bot now uses **llama.cpp** with **llama-swap** instead of Ollama. This provides:

- ✅ Automatic model unloading after inactivity (saves VRAM)
- ✅ Seamless model switching between text and vision models
- ✅ OpenAI-compatible API
- ✅ Better resource management

## Required Models

You need to download two GGUF model files and place them in the `/models` directory:

### 1. Text Generation Model: Llama 3.1 8B

**Recommended:** Meta-Llama-3.1-8B-Instruct (Q4_K_M quantization)

**Download from HuggingFace:**

```bash
# Using huggingface-cli (recommended)
huggingface-cli download bartowski/Meta-Llama-3.1-8B-Instruct-GGUF \
  Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
  --local-dir ./models \
  --local-dir-use-symlinks False

# Or download manually from:
# https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF/blob/main/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf
```

**Rename the file to:**

```bash
mv models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf models/llama3.1.gguf
```

**File size:** ~4.9 GB
**VRAM usage:** ~5-6 GB

### 2. Vision Model: Moondream 2

**Moondream 2** is a small but capable vision-language model.

**Download model and projector:**

```bash
# Download the main model
wget -P models/ https://huggingface.co/vikhyatk/moondream2/resolve/main/moondream-0_5b-int8.gguf

# Rename for clarity
mv models/moondream-0_5b-int8.gguf models/moondream.gguf

# Download the multimodal projector (required for vision)
wget -P models/ https://huggingface.co/vikhyatk/moondream2/resolve/main/moondream-mmproj-f16.gguf

# Rename for clarity
mv models/moondream-mmproj-f16.gguf models/moondream-mmproj.gguf
```

**Alternative download locations:**

- Main: https://huggingface.co/vikhyatk/moondream2
- GGUF versions: https://huggingface.co/vikhyatk/moondream2/tree/main

**File sizes:**

- moondream.gguf: ~500 MB
- moondream-mmproj.gguf: ~1.2 GB

**VRAM usage:** ~2-3 GB

## Directory Structure

After downloading, your `models/` directory should look like this:

```
models/
├── .gitkeep
├── llama3.1.gguf          (~4.9 GB)  - Text generation
├── moondream.gguf         (~500 MB)  - Vision model
└── moondream-mmproj.gguf  (~1.2 GB)  - Vision projector
```

## Alternative Models

If you want to use different models:

### Alternative Text Models

- **Llama 3.2 3B** (smaller, faster): `Llama-3.2-3B-Instruct-Q4_K_M.gguf`
- **Qwen 2.5 7B** (alternative): `Qwen2.5-7B-Instruct-Q4_K_M.gguf`
- **Mistral 7B**: `Mistral-7B-Instruct-v0.3-Q4_K_M.gguf`

### Alternative Vision Models

- **LLaVA 1.5 7B**: larger, more capable vision model
- **BakLLaVA**: another vision-language option

**Important:** If you use different models, update `llama-swap-config.yaml`:

```yaml
models:
  your-model-name:
    cmd: llama-server --port ${PORT} --model /models/your-model.gguf -ngl 99 -c 4096 --host 0.0.0.0
    ttl: 30m
```

And update the environment variables in `docker-compose.yml`:

```yaml
environment:
  - TEXT_MODEL=your-model-name
  - VISION_MODEL=your-vision-model
```

## Verification

After placing the models in the directory, verify:

```bash
ls -lh models/
# Should show:
#   llama3.1.gguf          (~4.9 GB)
#   moondream.gguf         (~500 MB)
#   moondream-mmproj.gguf  (~1.2 GB)
```

## Starting the Bot

Once the models are in place:

```bash
docker-compose up -d
```

Check the logs to ensure the models load correctly:

```bash
docker-compose logs -f llama-swap
```

You should see output similar to:

```
✅ Model llama3.1 loaded successfully
✅ Model moondream ready for vision tasks
```

## Monitoring

Access the llama-swap web UI at:

```
http://localhost:8080/ui
```

This shows:

- Currently loaded models
- Model swap history
- Request logs
- Auto-unload timers
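Because llama-swap exposes an OpenAI-compatible API, you can also exercise the text model directly from a shell. Below is a minimal sketch, assuming llama-swap listens on port 8080 (as in the UI URL above) and the text model is registered under the name `llama3.1` from this guide's config:

```bash
# Send a chat completion through llama-swap's OpenAI-compatible endpoint.
# The "model" field tells llama-swap which configured model to (load and) use.
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "llama3.1",
        "messages": [{"role": "user", "content": "Reply with one short sentence."}]
      }'
```

The first request after an idle period will be noticeably slower, since llama-swap has to load the model before answering.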
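The same endpoint can be used to sanity-check the vision path. This is a sketch under a few assumptions: the vision model is registered as `moondream`, it was launched with its `--mmproj` projector, and your llama-server build supports OpenAI-style `image_url` content parts:

```bash
# Base64-encode a local test image (GNU base64; use `base64 -i` on macOS)
# and send it alongside a text prompt. test.jpg is a placeholder image.
IMG=$(base64 -w0 test.jpg)
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d @- <<EOF
{
  "model": "moondream",
  "messages": [{
    "role": "user",
    "content": [
      {"type": "text", "text": "Describe this image in one sentence."},
      {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,$IMG"}}
    ]
  }]
}
EOF
```

Sending this right after a text request also demonstrates the swap: llama-swap unloads llama3.1 and loads moondream before responding.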
## Troubleshooting

### Model not found error

- Ensure the files are in the correct `/models` directory
- Check that the filenames match exactly what's in `llama-swap-config.yaml`
- Verify file permissions (they should be readable by Docker)

### CUDA/GPU errors

- Ensure the NVIDIA runtime is available: `docker run --rm --gpus all nvidia/cuda:12.0-base nvidia-smi`
- Update NVIDIA drivers if needed
- Check GPU memory: the models need ~6-8 GB VRAM total (but only one is loaded at a time)

### Model loads but generates gibberish

- Wrong quantization or a corrupted download
- Re-download the model file
- Try a different quantization (Q4_K_M recommended)

## Resource Usage

With TTL-based unloading:

- **Idle:** ~0 GB VRAM (models unloaded)
- **Text generation active:** ~5-6 GB VRAM (llama3.1 loaded)
- **Vision analysis active:** ~2-3 GB VRAM (moondream loaded)
- **Switching:** brief spike as models swap (~1-2 seconds)

The TTL settings in `llama-swap-config.yaml` control auto-unload:

- Text model: 30 minutes of inactivity
- Vision model: 15 minutes of inactivity (it is used less frequently)

---

## Quick Start Summary

```bash
# 1. Download models
huggingface-cli download bartowski/Meta-Llama-3.1-8B-Instruct-GGUF Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf --local-dir ./models
wget -P models/ https://huggingface.co/vikhyatk/moondream2/resolve/main/moondream-0_5b-int8.gguf
wget -P models/ https://huggingface.co/vikhyatk/moondream2/resolve/main/moondream-mmproj-f16.gguf

# 2. Rename files
mv models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf models/llama3.1.gguf
mv models/moondream-0_5b-int8.gguf models/moondream.gguf
mv models/moondream-mmproj-f16.gguf models/moondream-mmproj.gguf

# 3. Start the bot
docker-compose up -d

# 4. Monitor
docker-compose logs -f
```

That's it! 🎉
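## Example llama-swap-config.yaml

For reference, a complete `llama-swap-config.yaml` for this setup could look like the sketch below. It follows the schema of the snippet in the Alternative Models section; the `--mmproj` flag wiring the vision projector into the moondream launch command is an assumption about how your config starts the vision model, so compare it against your actual file:

```yaml
models:
  llama3.1:
    # Text model: auto-unloads after 30 minutes of inactivity
    cmd: llama-server --port ${PORT} --model /models/llama3.1.gguf -ngl 99 -c 4096 --host 0.0.0.0
    ttl: 30m
  moondream:
    # Vision model plus projector (--mmproj is assumed here):
    # auto-unloads after 15 minutes of inactivity
    cmd: llama-server --port ${PORT} --model /models/moondream.gguf --mmproj /models/moondream-mmproj.gguf -ngl 99 --host 0.0.0.0
    ttl: 15m
```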