Organize documentation: Move all .md files to readmes/ directory
readmes/LLAMA_CPP_SETUP.md (new file, 199 lines)
# Llama.cpp Migration - Model Setup Guide

## Overview

This bot now uses **llama.cpp** with **llama-swap** instead of Ollama. This provides:

- ✅ Automatic model unloading after inactivity (saves VRAM)
- ✅ Seamless model switching between text and vision models
- ✅ OpenAI-compatible API
- ✅ Better resource management
## Required Models

You need to download two GGUF model files and place them in the repository's `models/` directory (mounted as `/models` inside the container):
### 1. Text Generation Model: Llama 3.1 8B

**Recommended:** Meta-Llama-3.1-8B-Instruct (Q4_K_M quantization)

**Download from HuggingFace:**
```bash
# Using huggingface-cli (recommended)
huggingface-cli download bartowski/Meta-Llama-3.1-8B-Instruct-GGUF \
  Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
  --local-dir ./models \
  --local-dir-use-symlinks False

# Or download manually from:
# https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF/blob/main/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf
```
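If `huggingface-cli` isn't available on your system, it ships with the `huggingface_hub` Python package:

```bash
pip install -U "huggingface_hub[cli]"
```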
**Rename the file to:**
```bash
mv models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf models/llama3.1.gguf
```

**File size:** ~4.9 GB
**VRAM usage:** ~5-6 GB
### 2. Vision Model: Moondream 2

**Moondream 2** is a small but capable vision-language model.

**Download model and projector:**
```bash
# Download the main model
wget -P models/ https://huggingface.co/vikhyatk/moondream2/resolve/main/moondream-0_5b-int8.gguf
# Rename for clarity
mv models/moondream-0_5b-int8.gguf models/moondream.gguf

# Download the multimodal projector (required for vision)
wget -P models/ https://huggingface.co/vikhyatk/moondream2/resolve/main/moondream-mmproj-f16.gguf
# Rename for clarity
mv models/moondream-mmproj-f16.gguf models/moondream-mmproj.gguf
```
**Alternative download locations:**
- Main: https://huggingface.co/vikhyatk/moondream2
- GGUF versions: https://huggingface.co/vikhyatk/moondream2/tree/main

**File sizes:**
- moondream.gguf: ~500 MB
- moondream-mmproj.gguf: ~1.2 GB

**VRAM usage:** ~2-3 GB
## Directory Structure

After downloading, your `models/` directory should look like this:

```
models/
├── .gitkeep
├── llama3.1.gguf           (~4.9 GB) - Text generation
├── moondream.gguf          (~500 MB) - Vision model
└── moondream-mmproj.gguf   (~1.2 GB) - Vision projector
```
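To get a similar listing with file sizes locally (assuming the `tree` utility is installed):

```bash
# -h prints human-readable file sizes
tree -h models/
```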
## Alternative Models

If you want to use different models:

### Alternative Text Models
- **Llama 3.2 3B** (smaller, faster): `Llama-3.2-3B-Instruct-Q4_K_M.gguf`
- **Qwen 2.5 7B**: `Qwen2.5-7B-Instruct-Q4_K_M.gguf`
- **Mistral 7B**: `Mistral-7B-Instruct-v0.3-Q4_K_M.gguf`

### Alternative Vision Models
- **LLaVA 1.5 7B**: larger, more capable vision model
- **BakLLaVA**: another vision-language option
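For example, swapping in the smaller Llama 3.2 3B could look like this (the repo path and filename are assumptions based on bartowski's GGUF uploads; verify them on HuggingFace first):

```bash
# Assumed repo path - confirm it exists before downloading
huggingface-cli download bartowski/Llama-3.2-3B-Instruct-GGUF \
  Llama-3.2-3B-Instruct-Q4_K_M.gguf \
  --local-dir ./models \
  --local-dir-use-symlinks False
```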
**Important:** If you use different models, update `llama-swap-config.yaml`:
```yaml
models:
  your-model-name:
    cmd: llama-server --port ${PORT} --model /models/your-model.gguf -ngl 99 -c 4096 --host 0.0.0.0
    ttl: 30m
```
And update environment variables in `docker-compose.yml`:
```yaml
environment:
  - TEXT_MODEL=your-model-name
  - VISION_MODEL=your-vision-model
```
## Verification

After placing the models in the directory, verify:

```bash
ls -lh models/
# Should show:
# llama3.1.gguf          (~4.9 GB)
# moondream.gguf         (~500 MB)
# moondream-mmproj.gguf  (~1.2 GB)
```
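As an extra guard against truncated or failed downloads, you can check that each file starts with the GGUF magic bytes (a valid GGUF file begins with the ASCII string `GGUF`):

```bash
# Each line should end with "GGUF"; anything else means a bad download
for f in models/*.gguf; do
  printf '%s: %s\n' "$f" "$(head -c 4 "$f")"
done
```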
## Starting the Bot

Once the models are in place:

```bash
docker-compose up -d
```
Check the logs to ensure the models load correctly:
```bash
docker-compose logs -f llama-swap
```

You should see output along these lines:
```
✅ Model llama3.1 loaded successfully
✅ Model moondream ready for vision tasks
```
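To confirm the API is reachable end to end, you can also send a test request through llama-swap's OpenAI-compatible endpoint (port 8080 is assumed from the monitoring URL below; adjust if your compose file maps a different port):

```bash
# List the configured models - a quick liveness check
curl http://localhost:8080/v1/models

# A one-off chat completion; requesting "llama3.1" should trigger a model load
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.1", "messages": [{"role": "user", "content": "Say hello"}]}'
```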
## Monitoring

Access the llama-swap web UI at:
```
http://localhost:8080/ui
```

This shows:
- Currently loaded models
- Model swap history
- Request logs
- Auto-unload timers
## Troubleshooting

### Model not found error
- Ensure the files are in the correct `models/` directory
- Check that filenames match exactly what's in `llama-swap-config.yaml`
- Verify file permissions (the files must be readable by the Docker user)
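If the files look right on the host, check what the container actually sees (the service name `llama-swap` matches the logs command used above):

```bash
# List the mounted models directory from inside the container
docker-compose exec llama-swap ls -lh /models
```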
### CUDA/GPU errors
- Ensure the NVIDIA container runtime works: `docker run --rm --gpus all nvidia/cuda:12.0.0-base-ubuntu22.04 nvidia-smi`
- Update your NVIDIA drivers if needed
- Check GPU memory: the models need ~6-8 GB VRAM combined, but only one is loaded at a time
### Model loads but generates gibberish
- Usually caused by a wrong quantization or a corrupted download
- Re-download the model file
- Try a different quantization (Q4_K_M is recommended)
## Resource Usage

With TTL-based unloading:

- **Idle:** ~0 GB VRAM (models unloaded)
- **Text generation active:** ~5-6 GB VRAM (llama3.1 loaded)
- **Vision analysis active:** ~2-3 GB VRAM (moondream loaded)
- **Switching:** brief spike as models swap (~1-2 seconds)
The `ttl` settings in `llama-swap-config.yaml` control auto-unload:

- Text model: 30 minutes of inactivity
- Vision model: 15 minutes of inactivity (used less frequently)
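For reference, a minimal sketch of what those two entries might look like, reusing the flags from the config example above (the `--mmproj` flag for the vision model is an assumption; multimodal support varies across llama.cpp builds):

```yaml
models:
  llama3.1:
    cmd: llama-server --port ${PORT} --model /models/llama3.1.gguf -ngl 99 -c 4096 --host 0.0.0.0
    ttl: 30m   # unload after 30 minutes of inactivity
  moondream:
    # --mmproj is assumed here; check your llama.cpp build's multimodal flags
    cmd: llama-server --port ${PORT} --model /models/moondream.gguf --mmproj /models/moondream-mmproj.gguf -ngl 99 --host 0.0.0.0
    ttl: 15m   # unload sooner; vision is used less frequently
```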
---
## Quick Start Summary

```bash
# 1. Download models
huggingface-cli download bartowski/Meta-Llama-3.1-8B-Instruct-GGUF Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf --local-dir ./models
wget -P models/ https://huggingface.co/vikhyatk/moondream2/resolve/main/moondream-0_5b-int8.gguf
wget -P models/ https://huggingface.co/vikhyatk/moondream2/resolve/main/moondream-mmproj-f16.gguf

# 2. Rename files
mv models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf models/llama3.1.gguf
mv models/moondream-0_5b-int8.gguf models/moondream.gguf
mv models/moondream-mmproj-f16.gguf models/moondream-mmproj.gguf

# 3. Start the bot
docker-compose up -d

# 4. Monitor
docker-compose logs -f
```

That's it! 🎉