Migration Complete: Ollama → Llama.cpp + llama-swap
✅ Migration Summary
Your Miku Discord bot has been successfully migrated from Ollama to llama.cpp with llama-swap!
What Changed
Architecture
- Before: Ollama server with manual model switching
- After: llama-swap proxy + llama-server (llama.cpp) with automatic model management
Benefits Gained
✅ Auto-unload models after inactivity (saves VRAM!)
✅ Seamless model switching - no more manual switch_model() calls
✅ OpenAI-compatible API - more standard and portable
✅ Better resource management - TTL-based unloading
✅ Web UI for monitoring at http://localhost:8080/ui
Files Modified
Configuration
- ✅ docker-compose.yml - Replaced ollama service with llama-swap
- ✅ llama-swap-config.yaml - Created (new configuration file)
- ✅ models/ - Created directory for GGUF files
Bot Code
- ✅ bot/globals.py - Updated environment variables (OLLAMA_URL → LLAMA_URL)
- ✅ bot/utils/llm.py - Converted to OpenAI API format
- ✅ bot/utils/image_handling.py - Updated vision API calls
- ✅ bot/utils/core.py - Removed switch_model() function
- ✅ bot/utils/scheduled.py - Removed switch_model() calls
Documentation
- ✅ LLAMA_CPP_SETUP.md - Created comprehensive setup guide
What You Need to Do
1. Download Models (~6.5 GB total)
See LLAMA_CPP_SETUP.md for detailed instructions. Quick version:
# Text model (Llama 3.1 8B)
huggingface-cli download bartowski/Meta-Llama-3.1-8B-Instruct-GGUF \
Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
--local-dir ./models
# Vision model (Moondream)
wget -P models/ https://huggingface.co/vikhyatk/moondream2/resolve/main/moondream-0_5b-int8.gguf
wget -P models/ https://huggingface.co/vikhyatk/moondream2/resolve/main/moondream-mmproj-f16.gguf
# Rename files
mv models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf models/llama3.1.gguf
mv models/moondream-0_5b-int8.gguf models/moondream.gguf
mv models/moondream-mmproj-f16.gguf models/moondream-mmproj.gguf
2. Verify File Structure
ls -lh models/
# Should show:
# llama3.1.gguf (~4.9 GB)
# moondream.gguf (~500 MB)
# moondream-mmproj.gguf (~1.2 GB)
3. Remove Old Ollama Data (Optional)
If you're completely done with Ollama:
# Stop containers
docker-compose down
# Remove old Ollama volume
docker volume rm ollama-discord_ollama_data
# Remove old Dockerfile (no longer used)
rm Dockerfile.ollama
rm entrypoint.sh
4. Start the Bot
docker-compose up -d
5. Monitor Startup
# Watch llama-swap logs
docker-compose logs -f llama-swap
# Watch bot logs
docker-compose logs -f bot
6. Access Web UI
Visit http://localhost:8080/ui to monitor:
- Currently loaded models
- Auto-unload timers
- Request history
- Model swap events
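If you'd rather check from a script than the browser, here is a minimal sketch using Python's requests library. It assumes llama-swap exposes a /running endpoint that returns the currently loaded models as JSON (the /health check is the same one used in the troubleshooting section below); verify the exact endpoints and response shape against the llama-swap docs.

```python
# Hypothetical monitoring sketch - assumes llama-swap exposes /health and /running.
# Verify the exact endpoints and response shape against the llama-swap docs.
import requests

BASE_URL = "http://localhost:8080"

# A 200 from /health means the proxy is up (same check as in the troubleshooting section)
health = requests.get(f"{BASE_URL}/health", timeout=5)
print("health:", health.status_code)

# /running (assumed) lists the models llama-swap currently has loaded
running = requests.get(f"{BASE_URL}/running", timeout=5)
print("running models:", running.json())
```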
API Changes (For Reference)
Before (Ollama):
# Manual model switching
await switch_model("moondream")
# Ollama API
payload = {
"model": "llama3.1",
"prompt": "Hello",
"system": "You are Miku"
}
response = await session.post(f"{OLLAMA_URL}/api/generate", ...)
After (llama.cpp):
# No manual switching needed!
# OpenAI-compatible API
payload = {
"model": "llama3.1", # llama-swap auto-switches
"messages": [
{"role": "system", "content": "You are Miku"},
{"role": "user", "content": "Hello"}
]
}
response = await session.post(f"{LLAMA_URL}/v1/chat/completions", ...)
Backward Compatibility
All existing code still works! Aliases were added:
- query_ollama() → now calls query_llama()
- analyze_image_with_qwen() → now calls analyze_image_with_vision()
So you don't need to update every file immediately.
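The aliases themselves are just re-exports of the new functions. A rough sketch of what they might look like in bot/utils/llm.py is shown below; the actual signatures in your codebase may differ.

```python
# Hypothetical sketch of the compatibility aliases - the real signatures in
# bot/utils/llm.py and bot/utils/image_handling.py may differ.
async def query_llama(prompt: str, system: str | None = None) -> str:
    ...  # new OpenAI-compatible implementation lives here

async def analyze_image_with_vision(image_bytes: bytes, prompt: str) -> str:
    ...  # new vision implementation lives here

# Old names keep working by pointing at the new functions
query_ollama = query_llama
analyze_image_with_qwen = analyze_image_with_vision
```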
Resource Usage
With Auto-Unload (TTL):
- Idle: 0 GB VRAM (models unloaded automatically)
- Text generation: ~5-6 GB VRAM
- Vision analysis: ~2-3 GB VRAM
- Model switching: 1-2 seconds
TTL Settings (in llama-swap-config.yaml):
- Text model: 30 minutes idle → auto-unload
- Vision model: 15 minutes idle → auto-unload
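If you want to confirm what's actually configured, you can read the TTL values straight out of the config file. The sketch below assumes each entry under a top-level models map carries a ttl field in seconds; check your llama-swap-config.yaml for the real layout.

```python
# Hypothetical sketch: print the TTL configured for each model.
# Assumes llama-swap-config.yaml has a top-level "models" map whose entries
# include a "ttl" value in seconds - adjust to your actual config layout.
import yaml  # pip install pyyaml

with open("llama-swap-config.yaml") as f:
    config = yaml.safe_load(f)

for name, settings in (config.get("models") or {}).items():
    ttl = settings.get("ttl", "not set (never auto-unloads)")
    print(f"{name}: ttl={ttl}")
```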
Troubleshooting
"Model not found" error
Check that model files are in ./models/ and named correctly:
- llama3.1.gguf
- moondream.gguf
- moondream-mmproj.gguf
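A quick way to verify this from the repo root is sketched below; it only checks that the three expected files exist in ./models/ and prints their sizes.

```python
# Quick sanity check that the expected GGUF files are present in ./models/.
from pathlib import Path

expected = ["llama3.1.gguf", "moondream.gguf", "moondream-mmproj.gguf"]
models_dir = Path("models")

for name in expected:
    path = models_dir / name
    if path.is_file():
        print(f"OK      {name} ({path.stat().st_size / 1e9:.2f} GB)")
    else:
        print(f"MISSING {name} - check the download/rename steps above")
```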
CUDA/GPU errors
Ensure NVIDIA runtime works:
docker run --rm --gpus all nvidia/cuda:12.0.0-base-ubuntu22.04 nvidia-smi
Bot won't connect to llama-swap
Check health:
curl http://localhost:8080/health
# Should return: {"status": "ok"}
Models load slowly
This is normal on first load! llama.cpp reads the full GGUF file from disk. Subsequent loads reuse the OS file cache and are much faster.
Next Steps
- ✅ Download models (see LLAMA_CPP_SETUP.md)
- ✅ Start services: docker-compose up -d
- ✅ Test in Discord
- ✅ Monitor web UI at http://localhost:8080/ui
- ✅ Adjust TTL settings in llama-swap-config.yaml if needed
Need Help?
- Setup Guide: See LLAMA_CPP_SETUP.md
- llama-swap Docs: https://github.com/mostlygeek/llama-swap
- llama.cpp Server Docs: https://github.com/ggml-org/llama.cpp/tree/master/tools/server
Migration completed successfully! 🎉
The bot will now automatically manage VRAM usage by unloading models when idle, and seamlessly switch between text and vision models as needed.