Llama.cpp Migration - Model Setup Guide

Overview

This bot now uses llama.cpp with llama-swap instead of Ollama. This provides:

  • Automatic model unloading after inactivity (saves VRAM)
  • Seamless model switching between text and vision models
  • OpenAI-compatible API (see the example request below)
  • Better resource management
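
Because the API is OpenAI-compatible, any OpenAI client or plain curl can talk to llama-swap. A minimal smoke test, assuming llama-swap listens on port 8080 (as in the Monitoring section below) and the text model is registered as llama3.1 in llama-swap-config.yaml:

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.1", "messages": [{"role": "user", "content": "Hello!"}]}'

The first request after an idle period triggers a model load, so expect a few seconds of delay before the response.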

Required Models

You need to download two models (three GGUF files in total, since the vision model needs a separate projector) and place them in the models/ directory, which is mounted at /models inside the container:

1. Text Generation Model: Llama 3.1 8B

Recommended: Meta-Llama-3.1-8B-Instruct (Q4_K_M quantization)

Download from HuggingFace:

# Using huggingface-cli (recommended)
huggingface-cli download bartowski/Meta-Llama-3.1-8B-Instruct-GGUF \
  Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
  --local-dir ./models \
  --local-dir-use-symlinks False

# Or download manually from:
# https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF/blob/main/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf

Rename the file:

mv models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf models/llama3.1.gguf

File size: ~4.9 GB
VRAM usage: ~5-6 GB

2. Vision Model: Moondream 2

Moondream 2 is a small but capable vision-language model.

Download model and projector:

# Download the main model
wget -P models/ https://huggingface.co/vikhyatk/moondream2/resolve/main/moondream-0_5b-int8.gguf
# Rename for clarity
mv models/moondream-0_5b-int8.gguf models/moondream.gguf

# Download the multimodal projector (required for vision)
wget -P models/ https://huggingface.co/vikhyatk/moondream2/resolve/main/moondream-mmproj-f16.gguf
# Rename for clarity
mv models/moondream-mmproj-f16.gguf models/moondream-mmproj.gguf
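
Both files are required: llama-server loads the multimodal projector separately from the language model via its --mmproj flag. As a sketch, the vision entry in llama-swap-config.yaml ends up running a command along these lines (the exact values here are illustrative):

llama-server --port ${PORT} --model /models/moondream.gguf \
  --mmproj /models/moondream-mmproj.gguf -ngl 99 --host 0.0.0.0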

File sizes:

  • moondream.gguf: ~500 MB
  • moondream-mmproj.gguf: ~1.2 GB

VRAM usage: ~2-3 GB

Directory Structure

After downloading, your models/ directory should look like this:

models/
├── .gitkeep
├── llama3.1.gguf                 (~4.9 GB) - Text generation
├── moondream.gguf                (~500 MB) - Vision model
└── moondream-mmproj.gguf         (~1.2 GB) - Vision projector

Alternative Models

If you want to use different models:

Alternative Text Models:

  • Llama 3.2 3B (smaller, faster): Llama-3.2-3B-Instruct-Q4_K_M.gguf
  • Qwen 2.5 7B (alternative): Qwen2.5-7B-Instruct-Q4_K_M.gguf
  • Mistral 7B: Mistral-7B-Instruct-v0.3-Q4_K_M.gguf

Alternative Vision Models:

  • LLaVA 1.5 7B: Larger, more capable vision model
  • BakLLaVA: Another vision-language option

Important: If you use different models, update llama-swap-config.yaml:

models:
  your-model-name:
    # ${PORT} is substituted by llama-swap at launch; -ngl 99 offloads all layers
    # to the GPU, -c 4096 sets the context window, --host 0.0.0.0 binds all interfaces
    cmd: llama-server --port ${PORT} --model /models/your-model.gguf -ngl 99 -c 4096 --host 0.0.0.0
    ttl: 30m  # unload after 30 minutes of inactivity

And update environment variables in docker-compose.yml:

environment:
  - TEXT_MODEL=your-model-name
  - VISION_MODEL=your-vision-model
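
After changing either file, recreate the containers so the new settings take effect:

docker-compose up -d --force-recreate llama-swap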

Verification

After placing models in the directory, verify:

ls -lh models/
# Should show:
# llama3.1.gguf          (~4.9 GB)
# moondream.gguf         (~500 MB)
# moondream-mmproj.gguf  (~1.2 GB)

Starting the Bot

Once models are in place:

docker-compose up -d

Check the logs to ensure models load correctly:

docker-compose logs -f llama-swap

You should see:

✅ Model llama3.1 loaded successfully
✅ Model moondream ready for vision tasks
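
To confirm the proxy is up, you can also list the configured models through the OpenAI-compatible endpoint (assuming llama-swap serves the standard /v1/models route):

curl http://localhost:8080/v1/models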

Monitoring

Access the llama-swap web UI at:

http://localhost:8080/ui

This shows:

  • Currently loaded models
  • Model swap history
  • Request logs
  • Auto-unload timers

Troubleshooting

Model not found error

  • Ensure files are in the correct /models directory
  • Check filenames match exactly what's in llama-swap-config.yaml
  • Verify file permissions (should be readable by Docker)
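
A quick way to check from the host (paths assume the compose file mounts ./models into the container at /models):

ls -l models/*.gguf
# If the files are not readable, loosen permissions:
chmod 644 models/*.gguf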

CUDA/GPU errors

  • Ensure NVIDIA runtime is available: docker run --rm --gpus all nvidia/cuda:12.0.0-base-ubuntu22.04 nvidia-smi
  • Update NVIDIA drivers if needed
  • Check GPU memory: models need ~6-8 GB VRAM total, but only one model is loaded at a time (quick check below)
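
For a quick read on current VRAM usage without the full nvidia-smi table:

nvidia-smi --query-gpu=memory.used,memory.total --format=csv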

Model loads but generates gibberish

  • Wrong quantization or corrupted download
  • Re-download the model file
  • Try a different quantization (Q4_K_M recommended)
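
To rule out a corrupted download, compare the file's hash against the SHA256 checksum listed on the model's HuggingFace file page:

sha256sum models/llama3.1.gguf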

Resource Usage

With TTL-based unloading:

  • Idle: ~0 GB VRAM (models unloaded)
  • Text generation active: ~5-6 GB VRAM (llama3.1 loaded)
  • Vision analysis active: ~2-3 GB VRAM (moondream loaded)
  • Switching: Brief spike as models swap (~1-2 seconds)

The TTL settings in llama-swap-config.yaml control auto-unload (see the combined config sketch below):

  • Text model: 30 minutes of inactivity
  • Vision model: 15 minutes of inactivity (used less frequently)
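
Put together, the two entries might look like this (a sketch; the file names match the directory layout above, and the ttl values follow the defaults described here):

models:
  llama3.1:
    cmd: llama-server --port ${PORT} --model /models/llama3.1.gguf -ngl 99 -c 4096 --host 0.0.0.0
    ttl: 30m
  moondream:
    cmd: llama-server --port ${PORT} --model /models/moondream.gguf --mmproj /models/moondream-mmproj.gguf -ngl 99 --host 0.0.0.0
    ttl: 15m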

Quick Start Summary

# 1. Download models
huggingface-cli download bartowski/Meta-Llama-3.1-8B-Instruct-GGUF Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf --local-dir ./models
wget -P models/ https://huggingface.co/vikhyatk/moondream2/resolve/main/moondream-0_5b-int8.gguf
wget -P models/ https://huggingface.co/vikhyatk/moondream2/resolve/main/moondream-mmproj-f16.gguf

# 2. Rename files
mv models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf models/llama3.1.gguf
mv models/moondream-0_5b-int8.gguf models/moondream.gguf
mv models/moondream-mmproj-f16.gguf models/moondream-mmproj.gguf

# 3. Start the bot
docker-compose up -d

# 4. Monitor
docker-compose logs -f

That's it! 🎉