readmes/LLAMA_CPP_SETUP.md

# Llama.cpp Migration - Model Setup Guide

## Overview
This bot now uses **llama.cpp** with **llama-swap** instead of Ollama. This provides:
- ✅ Automatic model unloading after inactivity (saves VRAM)
- ✅ Seamless model switching between text and vision models
- ✅ OpenAI-compatible API
- ✅ Better resource management

## Required Models

You need to download two GGUF model files and place them in the `/models` directory:

### 1. Text Generation Model: Llama 3.1 8B

**Recommended:** Meta-Llama-3.1-8B-Instruct (Q4_K_M quantization)

**Download from HuggingFace:**
```bash
# Using huggingface-cli (recommended)
huggingface-cli download bartowski/Meta-Llama-3.1-8B-Instruct-GGUF \
  Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
  --local-dir ./models \
  --local-dir-use-symlinks False

# Or download manually from:
# https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF/blob/main/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf
```

**Rename the file to:**
```bash
mv models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf models/llama3.1.gguf
```

**File size:** ~4.9 GB
**VRAM usage:** ~5-6 GB

### 2. Vision Model: Moondream 2

**Moondream 2** is a small but capable vision-language model.

**Download model and projector:**
```bash
# Download the main model
wget -P models/ https://huggingface.co/vikhyatk/moondream2/resolve/main/moondream-0_5b-int8.gguf
# Rename for clarity
mv models/moondream-0_5b-int8.gguf models/moondream.gguf

# Download the multimodal projector (required for vision)
wget -P models/ https://huggingface.co/vikhyatk/moondream2/resolve/main/moondream-mmproj-f16.gguf
# Rename for clarity
mv models/moondream-mmproj-f16.gguf models/moondream-mmproj.gguf
```

**Alternative download locations:**
- Main: https://huggingface.co/vikhyatk/moondream2
- GGUF versions: https://huggingface.co/vikhyatk/moondream2/tree/main

**File sizes:**
- moondream.gguf: ~500 MB
- moondream-mmproj.gguf: ~1.2 GB
**VRAM usage:** ~2-3 GB

## Directory Structure

After downloading, your `models/` directory should look like this:

```
models/
├── .gitkeep
├── llama3.1.gguf                 (~4.9 GB) - Text generation
├── moondream.gguf                (~500 MB) - Vision model
└── moondream-mmproj.gguf         (~1.2 GB) - Vision projector
```

## Alternative Models

If you want to use different models:

### Alternative Text Models:
- **Llama 3.2 3B** (smaller, faster): `Llama-3.2-3B-Instruct-Q4_K_M.gguf`
- **Qwen 2.5 7B** (alternative): `Qwen2.5-7B-Instruct-Q4_K_M.gguf`
- **Mistral 7B**: `Mistral-7B-Instruct-v0.3-Q4_K_M.gguf`

### Alternative Vision Models:
- **LLaVA 1.5 7B**: Larger, more capable vision model
- **BakLLaVA**: Another vision-language option

**Important:** If you use different models, update `llama-swap-config.yaml`:
```yaml
models:
  your-model-name:
    cmd: llama-server --port ${PORT} --model /models/your-model.gguf -ngl 99 -c 4096 --host 0.0.0.0
    ttl: 30m
```

And update environment variables in `docker-compose.yml`:
```yaml
environment:
  - TEXT_MODEL=your-model-name
  - VISION_MODEL=your-vision-model
```

## Verification

After placing models in the directory, verify:

```bash
ls -lh models/
# Should show:
# llama3.1.gguf          (~4.9 GB)
# moondream.gguf         (~500 MB)
# moondream-mmproj.gguf  (~1.2 GB)
```

## Starting the Bot

Once models are in place:

```bash
docker-compose up -d
```

Check the logs to ensure models load correctly:
```bash
docker-compose logs -f llama-swap
```

You should see:
```
✅ Model llama3.1 loaded successfully
✅ Model moondream ready for vision tasks
```

## Monitoring

Access the llama-swap web UI at:
```
http://localhost:8080/ui
```

This shows:
- Currently loaded models
- Model swap history
- Request logs
- Auto-unload timers

## Troubleshooting

### Model not found error
- Ensure files are in the correct `/models` directory
- Check filenames match exactly what's in `llama-swap-config.yaml`
- Verify file permissions (should be readable by Docker)

### CUDA/GPU errors
- Ensure NVIDIA runtime is available: `docker run --rm --gpus all nvidia/cuda:12.0-base nvidia-smi`
- Update NVIDIA drivers if needed
- Check GPU memory: Models need ~6-8 GB VRAM total (but only one loaded at a time)

### Model loads but generates gibberish
- Wrong quantization or corrupted download
- Re-download the model file
- Try a different quantization (Q4_K_M recommended)

## Resource Usage

With TTL-based unloading:
- **Idle:** ~0 GB VRAM (models unloaded)
- **Text generation active:** ~5-6 GB VRAM (llama3.1 loaded)
- **Vision analysis active:** ~2-3 GB VRAM (moondream loaded)
- **Switching:** Brief spike as models swap (~1-2 seconds)

The TTL settings in `llama-swap-config.yaml` control auto-unload:
- Text model: 30 minutes of inactivity
- Vision model: 15 minutes of inactivity (used less frequently)

---

## Quick Start Summary

```bash
# 1. Download models
huggingface-cli download bartowski/Meta-Llama-3.1-8B-Instruct-GGUF Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf --local-dir ./models
wget -P models/ https://huggingface.co/vikhyatk/moondream2/resolve/main/moondream-0_5b-int8.gguf
wget -P models/ https://huggingface.co/vikhyatk/moondream2/resolve/main/moondream-mmproj-f16.gguf

# 2. Rename files
mv models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf models/llama3.1.gguf
mv models/moondream-0_5b-int8.gguf models/moondream.gguf
mv models/moondream-mmproj-f16.gguf models/moondream-mmproj.gguf

# 3. Start the bot
docker-compose up -d

# 4. Monitor
docker-compose logs -f
```

That's it! 🎉
Initial commit: Miku Discord Bot 2025-12-07 17:15:09 +02:00			`# Llama.cpp Migration - Model Setup Guide`

			`## Overview`
			`This bot now uses llama.cpp with llama-swap instead of Ollama. This provides:`
			`- ✅ Automatic model unloading after inactivity (saves VRAM)`
			`- ✅ Seamless model switching between text and vision models`
			`- ✅ OpenAI-compatible API`
			`- ✅ Better resource management`

			`## Required Models`

			You need to download two GGUF model files and place them in the `/models` directory:

			`### 1. Text Generation Model: Llama 3.1 8B`

			`Recommended: Meta-Llama-3.1-8B-Instruct (Q4_K_M quantization)`

			`Download from HuggingFace:`
			```bash
			`# Using huggingface-cli (recommended)`
			`huggingface-cli download bartowski/Meta-Llama-3.1-8B-Instruct-GGUF \`
			`Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \`
			`--local-dir ./models \`
			`--local-dir-use-symlinks False`

			`# Or download manually from:`
			`# https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF/blob/main/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf`
			```

			`Rename the file to:`
			```bash
			`mv models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf models/llama3.1.gguf`
			```

			`File size: ~4.9 GB`
			`VRAM usage: ~5-6 GB`

			`### 2. Vision Model: Moondream 2`

			`Moondream 2 is a small but capable vision-language model.`

			`Download model and projector:`
			```bash
			`# Download the main model`
			`wget -P models/ https://huggingface.co/vikhyatk/moondream2/resolve/main/moondream-0_5b-int8.gguf`
			`# Rename for clarity`
			`mv models/moondream-0_5b-int8.gguf models/moondream.gguf`

			`# Download the multimodal projector (required for vision)`
			`wget -P models/ https://huggingface.co/vikhyatk/moondream2/resolve/main/moondream-mmproj-f16.gguf`
			`# Rename for clarity`
			`mv models/moondream-mmproj-f16.gguf models/moondream-mmproj.gguf`
			```

			`Alternative download locations:`
			`- Main: https://huggingface.co/vikhyatk/moondream2`
			`- GGUF versions: https://huggingface.co/vikhyatk/moondream2/tree/main`

			`File sizes:`
			`- moondream.gguf: ~500 MB`
			`- moondream-mmproj.gguf: ~1.2 GB`
			`VRAM usage: ~2-3 GB`

			`## Directory Structure`

			After downloading, your `models/` directory should look like this:

			```
			`models/`
			`├── .gitkeep`
			`├── llama3.1.gguf (~4.9 GB) - Text generation`
			`├── moondream.gguf (~500 MB) - Vision model`
			`└── moondream-mmproj.gguf (~1.2 GB) - Vision projector`
			```

			`## Alternative Models`

			`If you want to use different models:`

			`### Alternative Text Models:`
			- Llama 3.2 3B (smaller, faster): `Llama-3.2-3B-Instruct-Q4_K_M.gguf`
			- Qwen 2.5 7B (alternative): `Qwen2.5-7B-Instruct-Q4_K_M.gguf`
			- Mistral 7B: `Mistral-7B-Instruct-v0.3-Q4_K_M.gguf`

			`### Alternative Vision Models:`
			`- LLaVA 1.5 7B: Larger, more capable vision model`
			`- BakLLaVA: Another vision-language option`

			Important: If you use different models, update `llama-swap-config.yaml`:
			```yaml
			`models:`
			`your-model-name:`
			`cmd: llama-server --port ${PORT} --model /models/your-model.gguf -ngl 99 -c 4096 --host 0.0.0.0`
			`ttl: 30m`
			```

			And update environment variables in `docker-compose.yml`:
			```yaml
			`environment:`
			`- TEXT_MODEL=your-model-name`
			`- VISION_MODEL=your-vision-model`
			```

			`## Verification`

			`After placing models in the directory, verify:`

			```bash
			`ls -lh models/`
			`# Should show:`
			`# llama3.1.gguf (~4.9 GB)`
			`# moondream.gguf (~500 MB)`
			`# moondream-mmproj.gguf (~1.2 GB)`
			```

			`## Starting the Bot`

			`Once models are in place:`

			```bash
			`docker-compose up -d`
			```

			`Check the logs to ensure models load correctly:`
			```bash
			`docker-compose logs -f llama-swap`
			```

			`You should see:`
			```
			`✅ Model llama3.1 loaded successfully`
			`✅ Model moondream ready for vision tasks`
			```

			`## Monitoring`

			`Access the llama-swap web UI at:`
			```
			`http://localhost:8080/ui`
			```

			`This shows:`
			`- Currently loaded models`
			`- Model swap history`
			`- Request logs`
			`- Auto-unload timers`

			`## Troubleshooting`

			`### Model not found error`
			- Ensure files are in the correct `/models` directory
			- Check filenames match exactly what's in `llama-swap-config.yaml`
			`- Verify file permissions (should be readable by Docker)`

			`### CUDA/GPU errors`
			- Ensure NVIDIA runtime is available: `docker run --rm --gpus all nvidia/cuda:12.0-base nvidia-smi`
			`- Update NVIDIA drivers if needed`
			`- Check GPU memory: Models need ~6-8 GB VRAM total (but only one loaded at a time)`

			`### Model loads but generates gibberish`
			`- Wrong quantization or corrupted download`
			`- Re-download the model file`
			`- Try a different quantization (Q4_K_M recommended)`

			`## Resource Usage`

			`With TTL-based unloading:`
			`- Idle: ~0 GB VRAM (models unloaded)`
			`- Text generation active: ~5-6 GB VRAM (llama3.1 loaded)`
			`- Vision analysis active: ~2-3 GB VRAM (moondream loaded)`
			`- Switching: Brief spike as models swap (~1-2 seconds)`

			The TTL settings in `llama-swap-config.yaml` control auto-unload:
			`- Text model: 30 minutes of inactivity`
			`- Vision model: 15 minutes of inactivity (used less frequently)`

			`---`

			`## Quick Start Summary`

			```bash
			`# 1. Download models`
			`huggingface-cli download bartowski/Meta-Llama-3.1-8B-Instruct-GGUF Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf --local-dir ./models`
			`wget -P models/ https://huggingface.co/vikhyatk/moondream2/resolve/main/moondream-0_5b-int8.gguf`
			`wget -P models/ https://huggingface.co/vikhyatk/moondream2/resolve/main/moondream-mmproj-f16.gguf`

			`# 2. Rename files`
			`mv models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf models/llama3.1.gguf`
			`mv models/moondream-0_5b-int8.gguf models/moondream.gguf`
			`mv models/moondream-mmproj-f16.gguf models/moondream-mmproj.gguf`

			`# 3. Start the bot`
			`docker-compose up -d`

			`# 4. Monitor`
			`docker-compose logs -f`
			```

			`That's it! 🎉`