# Llama.cpp Migration - Model Setup Guide
## Overview
This bot now uses llama.cpp with llama-swap instead of Ollama. This provides:

- ✅ Automatic model unloading after inactivity (saves VRAM)
- ✅ Seamless model switching between text and vision models
- ✅ OpenAI-compatible API (see the example request below)
- ✅ Better resource management
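
Because llama-swap exposes an OpenAI-compatible endpoint, any OpenAI client or plain `curl` can drive it. A minimal sketch, assuming llama-swap is listening on port 8080 (as in the Monitoring section below) and the model names used throughout this guide:

```bash
# The "model" field tells llama-swap which backend to serve;
# it loads (or swaps in) that model automatically before answering.
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}]
  }'
```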
## Required Models
You need to download two GGUF models (three files in total, counting the vision projector) and place them in the `models/` directory, which is mounted as `/models` in the container:
### 1. Text Generation Model: Llama 3.1 8B

**Recommended:** `Meta-Llama-3.1-8B-Instruct` (Q4_K_M quantization)
Download from HuggingFace:

```bash
# Using huggingface-cli (recommended)
huggingface-cli download bartowski/Meta-Llama-3.1-8B-Instruct-GGUF \
  Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
  --local-dir ./models \
  --local-dir-use-symlinks False

# Or download manually from:
# https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF/blob/main/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf
```

Rename the file to:

```bash
mv models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf models/llama3.1.gguf
```
- File size: ~4.9 GB
- VRAM usage: ~5-6 GB
### 2. Vision Model: Moondream 2
Moondream 2 is a small but capable vision-language model.
Download model and projector:

```bash
# Download the main model
wget -P models/ https://huggingface.co/vikhyatk/moondream2/resolve/main/moondream-0_5b-int8.gguf

# Rename for clarity
mv models/moondream-0_5b-int8.gguf models/moondream.gguf

# Download the multimodal projector (required for vision)
wget -P models/ https://huggingface.co/vikhyatk/moondream2/resolve/main/moondream-mmproj-f16.gguf

# Rename for clarity
mv models/moondream-mmproj-f16.gguf models/moondream-mmproj.gguf
```
Alternative download locations:

- Main: https://huggingface.co/vikhyatk/moondream2
- GGUF versions: https://huggingface.co/vikhyatk/moondream2/tree/main

File sizes:

- moondream.gguf: ~500 MB
- moondream-mmproj.gguf: ~1.2 GB

VRAM usage: ~2-3 GB
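
Vision requests go through the same OpenAI-style endpoint as text. A sketch, assuming your llama.cpp build has multimodal support and the moondream entry in `llama-swap-config.yaml` passes the projector via `--mmproj` (see the config sketch in the Alternative Models section); `photo.jpg` is a placeholder for your own image, sent inline as a base64 data URL:

```bash
# Encode a local image and ask moondream to describe it.
IMG=$(base64 -w0 photo.jpg)   # -w0 disables line wrapping (GNU coreutils)
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "moondream",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "text", "text": "Describe this image."},
        {"type": "image_url",
         "image_url": {"url": "data:image/jpeg;base64,'"$IMG"'"}}
      ]
    }]
  }'
```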
## Directory Structure

After downloading, your `models/` directory should look like this:

```
models/
├── .gitkeep
├── llama3.1.gguf           (~4.9 GB)  - Text generation
├── moondream.gguf          (~500 MB)  - Vision model
└── moondream-mmproj.gguf   (~1.2 GB)  - Vision projector
```
## Alternative Models
If you want to use different models:
**Alternative Text Models:**

- Llama 3.2 3B (smaller, faster): `Llama-3.2-3B-Instruct-Q4_K_M.gguf` (example download below)
- Qwen 2.5 7B: `Qwen2.5-7B-Instruct-Q4_K_M.gguf`
- Mistral 7B: `Mistral-7B-Instruct-v0.3-Q4_K_M.gguf`
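
For example, a download of the smaller Llama 3.2 3B build might look like this (the exact repo and file names are assumptions; verify them on bartowski's HuggingFace page before running):

```bash
# Hypothetical repo/file names; check HuggingFace first.
huggingface-cli download bartowski/Llama-3.2-3B-Instruct-GGUF \
  Llama-3.2-3B-Instruct-Q4_K_M.gguf \
  --local-dir ./models
```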
**Alternative Vision Models:**

- LLaVA 1.5 7B: larger, more capable vision model
- BakLLaVA: another vision-language option
**Important:** If you use different models, update `llama-swap-config.yaml`:

```yaml
models:
  your-model-name:
    cmd: llama-server --port ${PORT} --model /models/your-model.gguf -ngl 99 -c 4096 --host 0.0.0.0
    ttl: 30m
```
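
Vision models additionally need the multimodal projector passed to llama-server via `--mmproj`. A sketch for the Moondream files above (an illustration, not the config shipped with the repo):

```yaml
models:
  moondream:
    cmd: llama-server --port ${PORT} --model /models/moondream.gguf --mmproj /models/moondream-mmproj.gguf -ngl 99 --host 0.0.0.0
    ttl: 15m  # the vision model is used less often, so unload it sooner
```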
And update the environment variables in `docker-compose.yml`:

```yaml
environment:
  - TEXT_MODEL=your-model-name
  - VISION_MODEL=your-vision-model
```
## Verification
After placing models in the directory, verify:
```bash
ls -lh models/
# Should show:
#   llama3.1.gguf           (~4.9 GB)
#   moondream.gguf          (~500 MB)
#   moondream-mmproj.gguf   (~1.2 GB)
```
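
A truncated download is the most common problem at this stage. Every valid GGUF file starts with the ASCII magic bytes `GGUF`, which makes for a quick sanity check:

```bash
# Each line should print "GGUF" followed by the filename.
for f in models/*.gguf; do
  printf '%s  %s\n' "$(head -c 4 "$f")" "$f"
done
```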
## Starting the Bot
Once models are in place:

```bash
docker-compose up -d
```

Check the logs to ensure the models load correctly:

```bash
docker-compose logs -f llama-swap
```

You should see something like:

```
✅ Model llama3.1 loaded successfully
✅ Model moondream ready for vision tasks
```
## Monitoring

Access the llama-swap web UI at http://localhost:8080/ui

This shows:
- Currently loaded models
- Model swap history
- Request logs
- Auto-unload timers
## Troubleshooting

### Model not found error

- Ensure the files are in the correct `/models` directory
- Check that filenames match exactly what's in `llama-swap-config.yaml`
- Verify file permissions (files should be readable by Docker)
### CUDA/GPU errors

- Ensure the NVIDIA runtime is available: `docker run --rm --gpus all nvidia/cuda:12.0-base nvidia-smi`
- Update NVIDIA drivers if needed
- Check GPU memory: the models need ~6-8 GB VRAM total (but only one is loaded at a time)
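
To watch actual VRAM consumption while the bot is generating, nvidia-smi's query mode works well:

```bash
# Report GPU name plus total and used memory in CSV form.
nvidia-smi --query-gpu=name,memory.total,memory.used --format=csv
```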
### Model loads but generates gibberish

- Usually a wrong quantization or a corrupted download
- Re-download the model file
- Try a different quantization (Q4_K_M is recommended)
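
To confirm corruption before re-downloading, compare the local file's hash with the SHA-256 value shown on the file's HuggingFace page (under "Files and versions"):

```bash
# Should match the checksum listed on the HuggingFace file page.
sha256sum models/llama3.1.gguf
```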
## Resource Usage
With TTL-based unloading:
- Idle: ~0 GB VRAM (models unloaded)
- Text generation active: ~5-6 GB VRAM (llama3.1 loaded)
- Vision analysis active: ~2-3 GB VRAM (moondream loaded)
- Switching: a brief pause (~1-2 seconds) while one model unloads and the next loads
The TTL settings in `llama-swap-config.yaml` control auto-unload:
- Text model: 30 minutes of inactivity
- Vision model: 15 minutes of inactivity (used less frequently)
## Quick Start Summary
```bash
# 1. Download models
huggingface-cli download bartowski/Meta-Llama-3.1-8B-Instruct-GGUF Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf --local-dir ./models
wget -P models/ https://huggingface.co/vikhyatk/moondream2/resolve/main/moondream-0_5b-int8.gguf
wget -P models/ https://huggingface.co/vikhyatk/moondream2/resolve/main/moondream-mmproj-f16.gguf

# 2. Rename files
mv models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf models/llama3.1.gguf
mv models/moondream-0_5b-int8.gguf models/moondream.gguf
mv models/moondream-mmproj-f16.gguf models/moondream-mmproj.gguf

# 3. Start the bot
docker-compose up -d

# 4. Monitor
docker-compose logs -f
```
That's it! 🎉