Organize documentation: Move all .md files to readmes/ directory
New file: readmes/MIGRATION_COMPLETE.md
# Migration Complete: Ollama → Llama.cpp + llama-swap

## ✅ Migration Summary

Your Miku Discord bot has been successfully migrated from Ollama to llama.cpp with llama-swap!

## What Changed

### Architecture

- **Before:** Ollama server with manual model switching
- **After:** llama-swap proxy + llama-server (llama.cpp) with automatic model management

### Benefits Gained

- ✅ **Auto-unload models** after inactivity (saves VRAM!)
- ✅ **Seamless model switching** - no more manual `switch_model()` calls
- ✅ **OpenAI-compatible API** - more standard and portable
- ✅ **Better resource management** - TTL-based unloading
- ✅ **Web UI** for monitoring at http://localhost:8080/ui
## Files Modified

### Configuration

- ✅ `docker-compose.yml` - Replaced ollama service with llama-swap
- ✅ `llama-swap-config.yaml` - Created (new configuration file)
- ✅ `models/` - Created directory for GGUF files

### Bot Code

- ✅ `bot/globals.py` - Updated environment variables (OLLAMA_URL → LLAMA_URL; see the sketch below)
- ✅ `bot/utils/llm.py` - Converted to OpenAI API format
- ✅ `bot/utils/image_handling.py` - Updated vision API calls
- ✅ `bot/utils/core.py` - Removed `switch_model()` function
- ✅ `bot/utils/scheduled.py` - Removed `switch_model()` calls
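
For reference, a minimal sketch of the `bot/globals.py` change. The default URL below is an assumption for illustration only; use whatever hostname and port your `docker-compose.yml` actually exposes for llama-swap:

```python
import os

# bot/globals.py (sketch): the bot now targets the llama-swap proxy instead of Ollama.
# "http://llama-swap:8080" is a hypothetical default shown for illustration.
LLAMA_URL = os.environ.get("LLAMA_URL", "http://llama-swap:8080")
```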
### Documentation

- ✅ `LLAMA_CPP_SETUP.md` - Created comprehensive setup guide
## What You Need to Do

### 1. Download Models (~6.5 GB total)

See `LLAMA_CPP_SETUP.md` for detailed instructions. Quick version:

```bash
# Text model (Llama 3.1 8B)
huggingface-cli download bartowski/Meta-Llama-3.1-8B-Instruct-GGUF \
  Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
  --local-dir ./models

# Vision model (Moondream)
wget -P models/ https://huggingface.co/vikhyatk/moondream2/resolve/main/moondream-0_5b-int8.gguf
wget -P models/ https://huggingface.co/vikhyatk/moondream2/resolve/main/moondream-mmproj-f16.gguf

# Rename files
mv models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf models/llama3.1.gguf
mv models/moondream-0_5b-int8.gguf models/moondream.gguf
mv models/moondream-mmproj-f16.gguf models/moondream-mmproj.gguf
```
### 2. Verify File Structure

```bash
ls -lh models/
# Should show:
#   llama3.1.gguf           (~4.9 GB)
#   moondream.gguf          (~500 MB)
#   moondream-mmproj.gguf   (~1.2 GB)
```
### 3. Remove Old Ollama Data (Optional)

If you're completely done with Ollama:

```bash
# Stop containers
docker-compose down

# Remove old Ollama volume
docker volume rm ollama-discord_ollama_data

# Remove old Dockerfile and entrypoint (no longer used)
rm Dockerfile.ollama
rm entrypoint.sh
```
### 4. Start the Bot

```bash
docker-compose up -d
```
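Once the containers are up, you can optionally smoke-test the text model from the host. This is a sketch that assumes llama-swap is reachable on `localhost:8080` (the same address used for the web UI and health check in this document) and that the text model is registered as `llama3.1`:

```python
import json
import urllib.request

# One-off smoke test against the llama-swap proxy (host port 8080 assumed).
payload = {
    "model": "llama3.1",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
}
req = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req, timeout=120) as resp:
    body = json.load(resp)
print(body["choices"][0]["message"]["content"])
```

The first request may take noticeably longer because the model has to be loaded first (see "Models load slowly" in Troubleshooting).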
### 5. Monitor Startup

```bash
# Watch llama-swap logs
docker-compose logs -f llama-swap

# Watch bot logs
docker-compose logs -f bot
```
### 6. Access Web UI

Visit http://localhost:8080/ui to monitor:

- Currently loaded models
- Auto-unload timers
- Request history
- Model swap events
## API Changes (For Reference)

### Before (Ollama):

```python
# Manual model switching
await switch_model("moondream")

# Ollama API
payload = {
    "model": "llama3.1",
    "prompt": "Hello",
    "system": "You are Miku"
}
response = await session.post(f"{OLLAMA_URL}/api/generate", ...)
```

### After (llama.cpp):

```python
# No manual switching needed!

# OpenAI-compatible API
payload = {
    "model": "llama3.1",  # llama-swap auto-switches
    "messages": [
        {"role": "system", "content": "You are Miku"},
        {"role": "user", "content": "Hello"}
    ]
}
response = await session.post(f"{LLAMA_URL}/v1/chat/completions", ...)
```
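Vision requests go through the same endpoint. Below is a minimal sketch of an image request; it assumes the bot uses `aiohttp`, that the vision model is registered as `moondream`, and that the running llama-server build accepts OpenAI-style `image_url` content parts via its mmproj setup. The actual code in `bot/utils/image_handling.py` may differ:

```python
import asyncio
import base64

import aiohttp

LLAMA_URL = "http://localhost:8080"  # llama-swap proxy

async def describe_image(path: str) -> str:
    # Images are sent as base64 data URLs inside an OpenAI-style message.
    with open(path, "rb") as f:
        data_url = "data:image/jpeg;base64," + base64.b64encode(f.read()).decode()

    payload = {
        "model": "moondream",  # llama-swap swaps to the vision model automatically
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image."},
                {"type": "image_url", "image_url": {"url": data_url}},
            ],
        }],
    }
    async with aiohttp.ClientSession() as session:
        async with session.post(f"{LLAMA_URL}/v1/chat/completions", json=payload) as resp:
            body = await resp.json()
    return body["choices"][0]["message"]["content"]

# Example: asyncio.run(describe_image("test.jpg"))
```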
## Backward Compatibility

All existing code still works! Aliases were added:

- `query_ollama()` → now calls `query_llama()`
- `analyze_image_with_qwen()` → now calls `analyze_image_with_vision()`

So you don't need to update every file immediately.
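
A rough sketch of what those aliases can look like. This is illustrative only; the real function names and signatures live in `bot/utils/llm.py` and `bot/utils/image_handling.py` and may differ:

```python
# Illustrative sketch of backward-compatible aliases; not the repo's actual code.

async def query_llama(prompt: str, system: str = "You are Miku") -> str:
    """New helper that posts to the OpenAI-compatible /v1/chat/completions endpoint."""
    ...  # actual request code lives in bot/utils/llm.py

async def analyze_image_with_vision(image_path: str) -> str:
    """New vision helper (see the image example above)."""
    ...  # actual request code lives in bot/utils/image_handling.py

# Old names keep working, so existing call sites don't need to change yet.
query_ollama = query_llama
analyze_image_with_qwen = analyze_image_with_vision
```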
## Resource Usage

### With Auto-Unload (TTL):

- **Idle:** 0 GB VRAM (models unloaded automatically)
- **Text generation:** ~5-6 GB VRAM
- **Vision analysis:** ~2-3 GB VRAM
- **Model switching:** 1-2 seconds

### TTL Settings (in llama-swap-config.yaml):

- Text model: 30 minutes idle → auto-unload
- Vision model: 15 minutes idle → auto-unload
## Troubleshooting

### "Model not found" error

Check that the model files are in `./models/` and named correctly:

- `llama3.1.gguf`
- `moondream.gguf`
- `moondream-mmproj.gguf`

### CUDA/GPU errors

Ensure the NVIDIA runtime works:

```bash
docker run --rm --gpus all nvidia/cuda:12.0-base nvidia-smi
```

### Bot won't connect to llama-swap

Check health:

```bash
curl http://localhost:8080/health
# Should return: {"status": "ok"}
```
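If the bot starts faster than llama-swap, it can simply wait for the health endpoint shown above before sending requests. A sketch, assuming `aiohttp` and the same proxy URL:

```python
import asyncio

import aiohttp

LLAMA_URL = "http://localhost:8080"  # llama-swap proxy

async def wait_for_llama_swap(timeout: float = 60.0) -> None:
    """Poll /health until llama-swap responds, or raise after `timeout` seconds."""
    loop = asyncio.get_running_loop()
    deadline = loop.time() + timeout
    async with aiohttp.ClientSession() as session:
        while True:
            try:
                async with session.get(f"{LLAMA_URL}/health") as resp:
                    if resp.status == 200:
                        return
            except aiohttp.ClientError:
                pass  # not up yet
            if loop.time() > deadline:
                raise RuntimeError("llama-swap did not become healthy in time")
            await asyncio.sleep(2)
```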
### Models load slowly

This is normal on the first load: llama.cpp has to read the full GGUF file from disk.
Subsequent loads are much faster because the file is already in the OS page cache.
## Next Steps

1. Download models (see `LLAMA_CPP_SETUP.md`)
2. Start services: `docker-compose up -d`
3. Test in Discord
4. Monitor the web UI at http://localhost:8080/ui
5. Adjust TTL settings in `llama-swap-config.yaml` if needed
## Need Help?

- **Setup Guide:** See `LLAMA_CPP_SETUP.md`
- **llama-swap Docs:** https://github.com/mostlygeek/llama-swap
- **llama.cpp Server Docs:** https://github.com/ggml-org/llama.cpp/tree/master/tools/server

---

**Migration completed successfully! 🎉**

The bot will now automatically manage VRAM usage by unloading models when idle, and seamlessly switch between text and vision models as needed.