# Quick Reference: Ollama → Llama.cpp Migration

## Environment Variables

| Old (Ollama) | New (llama.cpp) | Purpose |
|--------------|-----------------|---------|
| `OLLAMA_URL` | `LLAMA_URL` | Server endpoint |
| `OLLAMA_MODEL` | `TEXT_MODEL` | Text generation model name |
| N/A | `VISION_MODEL` | Vision model name |

## API Endpoints

| Purpose | Old (Ollama) | New (llama.cpp) |
|---------|--------------|-----------------|
| Text generation | `/api/generate` | `/v1/chat/completions` |
| Vision | `/api/generate` | `/v1/chat/completions` |
| Health check | `GET /` | `GET /health` |
| Model management | Manual `switch_model()` | Automatic via llama-swap |

## Function Changes

| Old Function | New Function | Status |
|--------------|--------------|--------|
| `query_ollama()` | `query_llama()` | Aliased for compatibility |
| `analyze_image_with_qwen()` | `analyze_image_with_vision()` | Aliased for compatibility |
| `switch_model()` | **Removed** | llama-swap handles automatically |

## Request Format

### Text Generation

**Before (Ollama):**

```python
payload = {
    "model": "llama3.1",
    "prompt": "Hello world",
    "system": "You are Miku",
    "stream": False
}
await session.post(f"{OLLAMA_URL}/api/generate", json=payload)
```

**After (OpenAI):**

```python
payload = {
    "model": "llama3.1",
    "messages": [
        {"role": "system", "content": "You are Miku"},
        {"role": "user", "content": "Hello world"}
    ],
    "stream": False
}
await session.post(f"{LLAMA_URL}/v1/chat/completions", json=payload)
```

### Vision Analysis

**Before (Ollama):**

```python
await switch_model("moondream")  # Manual switch!
payload = {
    "model": "moondream",
    "prompt": "Describe this image",
    "images": [base64_img],
    "stream": False
}
await session.post(f"{OLLAMA_URL}/api/generate", json=payload)
```

**After (OpenAI):**

```python
# No manual switch needed!
payload = {
    "model": "moondream",  # llama-swap auto-switches
    "messages": [{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image"},
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{base64_img}"}}
        ]
    }],
    "stream": False
}
await session.post(f"{LLAMA_URL}/v1/chat/completions", json=payload)
```

## Response Format

**Before (Ollama):**

```json
{
  "response": "Hello! I'm Miku!",
  "model": "llama3.1"
}
```

**After (OpenAI):**

```json
{
  "choices": [{
    "message": {
      "role": "assistant",
      "content": "Hello! I'm Miku!"
    }
  }],
  "model": "llama3.1"
}
```

## Docker Services

**Before:**

```yaml
services:
  ollama:
    image: ollama/ollama
    ports: ["11434:11434"]
    volumes: ["ollama_data:/root/.ollama"]
  bot:
    environment:
      - OLLAMA_URL=http://ollama:11434
      - OLLAMA_MODEL=llama3.1
```

**After:**

```yaml
services:
  llama-swap:
    image: ghcr.io/mostlygeek/llama-swap:cuda
    ports: ["8080:8080"]
    volumes:
      - ./models:/models
      - ./llama-swap-config.yaml:/app/config.yaml
  bot:
    environment:
      - LLAMA_URL=http://llama-swap:8080
      - TEXT_MODEL=llama3.1
      - VISION_MODEL=moondream
```

## Model Management

| Feature | Ollama | llama.cpp + llama-swap |
|---------|--------|------------------------|
| Model loading | Manual `ollama pull` | Download GGUF files to `/models` |
| Model switching | Manual `switch_model()` call | Automatic based on request |
| Model unloading | Manual or never | Automatic after TTL (30m text, 15m vision) |
| VRAM management | Always loaded | Load on demand, unload when idle |
| Storage format | Ollama format | GGUF files |
| Location | Docker volume | Host directory `./models/` |

## Configuration Files

| File | Purpose | Format |
|------|---------|--------|
| `docker-compose.yml` | Service orchestration | YAML |
| `llama-swap-config.yaml` | Model configs, TTL settings | YAML |
| `models/llama3.1.gguf` | Text model weights | Binary GGUF |
| `models/moondream.gguf` | Vision model weights | Binary GGUF |
| `models/moondream-mmproj.gguf` | Vision projector | Binary GGUF |
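For orientation, a minimal `llama-swap-config.yaml` could look like the sketch below. Treat it as an assumption rather than the project's actual file: the model names must match the `model` field the bot sends, the `llama-server` path and flags depend on the image and GGUF files you use, and `ttl` is given in seconds (1800 = 30 minutes for text, 900 = 15 minutes for vision, matching the TTLs above).

```yaml
# Hypothetical llama-swap config sketch (adjust paths and flags to your setup)
models:
  "llama3.1":
    cmd: |
      /app/llama-server
      --model /models/llama3.1.gguf
      --port ${PORT}
    ttl: 1800   # unload after 30 minutes idle

  "moondream":
    cmd: |
      /app/llama-server
      --model /models/moondream.gguf
      --mmproj /models/moondream-mmproj.gguf
      --port ${PORT}
    ttl: 900    # unload after 15 minutes idle
```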
## Monitoring

| Tool | URL | Purpose |
|------|-----|---------|
| llama-swap Web UI | http://localhost:8080/ui | Monitor models, logs, timers |
| Health endpoint | http://localhost:8080/health | Check if server is ready |
| Running models | http://localhost:8080/running | List currently loaded models |
| Metrics | http://localhost:8080/metrics | Prometheus-compatible metrics |

## Common Commands

```bash
# Check what's running
curl http://localhost:8080/running

# Check health
curl http://localhost:8080/health

# Manually unload all models
curl -X POST http://localhost:8080/models/unload

# View logs
docker-compose logs -f llama-swap

# Restart services
docker-compose restart

# Check model files
ls -lh models/
```

## Quick Troubleshooting

| Issue | Solution |
|-------|----------|
| "Model not found" | Verify files in `./models/` match config |
| CUDA errors | Check: `docker run --rm --gpus all nvidia/cuda:12.0-base nvidia-smi` |
| Slow responses | First load is slow; subsequent loads use cache |
| High VRAM usage | Models will auto-unload after TTL expires |
| Bot can't connect | Check: `curl http://localhost:8080/health` |

---

**Remember:** The migration maintains backward compatibility. Old function names are aliased, so existing code continues to work!
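A sketch of what that aliasing might look like (the binding style is an assumption; the names come from the Function Changes table above):

```python
# Illustrative sketch: backward-compatibility aliases, placed after the new
# functions are defined so that old call sites keep working unchanged.
query_ollama = query_llama
analyze_image_with_qwen = analyze_image_with_vision
```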