Organize documentation: Move all .md files to readmes/ directory
New file: readmes/MIGRATION_COMPLETE.md
# Migration Complete: Ollama → Llama.cpp + llama-swap

## ✅ Migration Summary

Your Miku Discord bot has been successfully migrated from Ollama to llama.cpp with llama-swap!

## What Changed

### Architecture

- **Before:** Ollama server with manual model switching
- **After:** llama-swap proxy + llama-server (llama.cpp) with automatic model management

### Benefits Gained

- ✅ **Auto-unload models** after inactivity (saves VRAM!)
- ✅ **Seamless model switching** - no more manual `switch_model()` calls
- ✅ **OpenAI-compatible API** - more standard and portable
- ✅ **Better resource management** - TTL-based unloading
- ✅ **Web UI** for monitoring at http://localhost:8080/ui
## Files Modified

### Configuration

- ✅ `docker-compose.yml` - Replaced ollama service with llama-swap
- ✅ `llama-swap-config.yaml` - Created (new configuration file)
- ✅ `models/` - Created directory for GGUF files

### Bot Code

- ✅ `bot/globals.py` - Updated environment variables (OLLAMA_URL → LLAMA_URL; see the sketch below)
- ✅ `bot/utils/llm.py` - Converted to OpenAI API format
- ✅ `bot/utils/image_handling.py` - Updated vision API calls
- ✅ `bot/utils/core.py` - Removed `switch_model()` function
- ✅ `bot/utils/scheduled.py` - Removed `switch_model()` calls
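
For reference, a minimal sketch of the `bot/globals.py` change. The default URL below is an assumption for illustration only; use whatever hostname and port your `docker-compose.yml` actually exposes for llama-swap:

```python
import os

# bot/globals.py (sketch): the bot now targets the llama-swap proxy instead of Ollama.
# "http://llama-swap:8080" is a hypothetical default shown for illustration.
LLAMA_URL = os.environ.get("LLAMA_URL", "http://llama-swap:8080")
```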
### Documentation

- ✅ `LLAMA_CPP_SETUP.md` - Created comprehensive setup guide
## What You Need to Do

### 1. Download Models (~6.5 GB total)

See `LLAMA_CPP_SETUP.md` for detailed instructions. Quick version:

```bash
# Text model (Llama 3.1 8B)
huggingface-cli download bartowski/Meta-Llama-3.1-8B-Instruct-GGUF \
  Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
  --local-dir ./models

# Vision model (Moondream)
wget -P models/ https://huggingface.co/vikhyatk/moondream2/resolve/main/moondream-0_5b-int8.gguf
wget -P models/ https://huggingface.co/vikhyatk/moondream2/resolve/main/moondream-mmproj-f16.gguf

# Rename files
mv models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf models/llama3.1.gguf
mv models/moondream-0_5b-int8.gguf models/moondream.gguf
mv models/moondream-mmproj-f16.gguf models/moondream-mmproj.gguf
```
### 2. Verify File Structure

```bash
ls -lh models/
# Should show:
#   llama3.1.gguf           (~4.9 GB)
#   moondream.gguf          (~500 MB)
#   moondream-mmproj.gguf   (~1.2 GB)
```
### 3. Remove Old Ollama Data (Optional)

If you're completely done with Ollama:

```bash
# Stop containers
docker-compose down

# Remove old Ollama volume
docker volume rm ollama-discord_ollama_data

# Remove old Dockerfile and entrypoint (no longer used)
rm Dockerfile.ollama
rm entrypoint.sh
```
### 4. Start the Bot

```bash
docker-compose up -d
```
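Once the containers are up, you can optionally smoke-test the text model from the host. This is a sketch that assumes llama-swap is reachable on `localhost:8080` (the same address used for the web UI and health check in this document) and that the text model is registered as `llama3.1`:

```python
import json
import urllib.request

# One-off smoke test against the llama-swap proxy (host port 8080 assumed).
payload = {
    "model": "llama3.1",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
}
req = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req, timeout=120) as resp:
    body = json.load(resp)
print(body["choices"][0]["message"]["content"])
```

The first request may take noticeably longer because the model has to be loaded first (see "Models load slowly" in Troubleshooting).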
### 5. Monitor Startup

```bash
# Watch llama-swap logs
docker-compose logs -f llama-swap

# Watch bot logs
docker-compose logs -f bot
```
### 6. Access Web UI

Visit http://localhost:8080/ui to monitor:

- Currently loaded models
- Auto-unload timers
- Request history
- Model swap events
## API Changes (For Reference)

### Before (Ollama):

```python
# Manual model switching
await switch_model("moondream")

# Ollama API
payload = {
    "model": "llama3.1",
    "prompt": "Hello",
    "system": "You are Miku"
}
response = await session.post(f"{OLLAMA_URL}/api/generate", ...)
```

### After (llama.cpp):

```python
# No manual switching needed!

# OpenAI-compatible API
payload = {
    "model": "llama3.1",  # llama-swap auto-switches
    "messages": [
        {"role": "system", "content": "You are Miku"},
        {"role": "user", "content": "Hello"}
    ]
}
response = await session.post(f"{LLAMA_URL}/v1/chat/completions", ...)
```
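Vision requests go through the same endpoint. Below is a minimal sketch of an image request; it assumes the bot uses `aiohttp`, that the vision model is registered as `moondream`, and that the running llama-server build accepts OpenAI-style `image_url` content parts via its mmproj setup. The actual code in `bot/utils/image_handling.py` may differ:

```python
import asyncio
import base64

import aiohttp

LLAMA_URL = "http://localhost:8080"  # llama-swap proxy

async def describe_image(path: str) -> str:
    # Images are sent as base64 data URLs inside an OpenAI-style message.
    with open(path, "rb") as f:
        data_url = "data:image/jpeg;base64," + base64.b64encode(f.read()).decode()

    payload = {
        "model": "moondream",  # llama-swap swaps to the vision model automatically
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image."},
                {"type": "image_url", "image_url": {"url": data_url}},
            ],
        }],
    }
    async with aiohttp.ClientSession() as session:
        async with session.post(f"{LLAMA_URL}/v1/chat/completions", json=payload) as resp:
            body = await resp.json()
    return body["choices"][0]["message"]["content"]

# Example: asyncio.run(describe_image("test.jpg"))
```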
## Backward Compatibility

All existing code still works! Aliases were added:

- `query_ollama()` → now calls `query_llama()`
- `analyze_image_with_qwen()` → now calls `analyze_image_with_vision()`

So you don't need to update every file immediately.
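
A rough sketch of what those aliases can look like. This is illustrative only; the real function names and signatures live in `bot/utils/llm.py` and `bot/utils/image_handling.py` and may differ:

```python
# Illustrative sketch of backward-compatible aliases; not the repo's actual code.

async def query_llama(prompt: str, system: str = "You are Miku") -> str:
    """New helper that posts to the OpenAI-compatible /v1/chat/completions endpoint."""
    ...  # actual request code lives in bot/utils/llm.py

async def analyze_image_with_vision(image_path: str) -> str:
    """New vision helper (see the image example above)."""
    ...  # actual request code lives in bot/utils/image_handling.py

# Old names keep working, so existing call sites don't need to change yet.
query_ollama = query_llama
analyze_image_with_qwen = analyze_image_with_vision
```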
## Resource Usage

### With Auto-Unload (TTL):

- **Idle:** 0 GB VRAM (models unloaded automatically)
- **Text generation:** ~5-6 GB VRAM
- **Vision analysis:** ~2-3 GB VRAM
- **Model switching:** 1-2 seconds

### TTL Settings (in llama-swap-config.yaml):

- Text model: 30 minutes idle → auto-unload
- Vision model: 15 minutes idle → auto-unload
## Troubleshooting

### "Model not found" error

Check that the model files are in `./models/` and named correctly:

- `llama3.1.gguf`
- `moondream.gguf`
- `moondream-mmproj.gguf`

### CUDA/GPU errors

Ensure the NVIDIA runtime works:

```bash
docker run --rm --gpus all nvidia/cuda:12.0-base nvidia-smi
```

### Bot won't connect to llama-swap

Check health:

```bash
curl http://localhost:8080/health
# Should return: {"status": "ok"}
```
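If the bot starts faster than llama-swap, it can simply wait for the health endpoint shown above before sending requests. A sketch, assuming `aiohttp` and the same proxy URL:

```python
import asyncio

import aiohttp

LLAMA_URL = "http://localhost:8080"  # llama-swap proxy

async def wait_for_llama_swap(timeout: float = 60.0) -> None:
    """Poll /health until llama-swap responds, or raise after `timeout` seconds."""
    loop = asyncio.get_running_loop()
    deadline = loop.time() + timeout
    async with aiohttp.ClientSession() as session:
        while True:
            try:
                async with session.get(f"{LLAMA_URL}/health") as resp:
                    if resp.status == 200:
                        return
            except aiohttp.ClientError:
                pass  # not up yet
            if loop.time() > deadline:
                raise RuntimeError("llama-swap did not become healthy in time")
            await asyncio.sleep(2)
```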
### Models load slowly

This is normal on the first load: llama.cpp has to read the full GGUF file from disk.
Subsequent loads are much faster because the file is already in the OS page cache.
## Next Steps

1. Download models (see `LLAMA_CPP_SETUP.md`)
2. Start services: `docker-compose up -d`
3. Test in Discord
4. Monitor the web UI at http://localhost:8080/ui
5. Adjust TTL settings in `llama-swap-config.yaml` if needed
## Need Help?

- **Setup Guide:** See `LLAMA_CPP_SETUP.md`
- **llama-swap Docs:** https://github.com/mostlygeek/llama-swap
- **llama.cpp Server Docs:** https://github.com/ggml-org/llama.cpp/tree/master/tools/server

---

**Migration completed successfully! 🎉**

The bot will now automatically manage VRAM usage by unloading models when idle, and seamlessly switch between text and vision models as needed.