# Quick Reference: Ollama → Llama.cpp Migration
## Environment Variables

| Old (Ollama) | New (llama.cpp) | Purpose |
|---|---|---|
| `OLLAMA_URL` | `LLAMA_URL` | Server endpoint |
| `OLLAMA_MODEL` | `TEXT_MODEL` | Text generation model name |
| N/A | `VISION_MODEL` | Vision model name |
## API Endpoints

| Purpose | Old (Ollama) | New (llama.cpp) |
|---|---|---|
| Text generation | `/api/generate` | `/v1/chat/completions` |
| Vision | `/api/generate` | `/v1/chat/completions` |
| Health check | `GET /` | `GET /health` |
| Model management | Manual `switch_model()` | Automatic via llama-swap |
## Function Changes

| Old Function | New Function | Status |
|---|---|---|
| `query_ollama()` | `query_llama()` | Aliased for compatibility |
| `analyze_image_with_qwen()` | `analyze_image_with_vision()` | Aliased for compatibility |
| `switch_model()` | Removed | llama-swap handles automatically |
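
The aliasing can be as simple as binding the old name to the new coroutine, as in the sketch below. It is illustrative only: the `query_llama(session, prompt, system, model)` signature and the module-level `LLAMA_URL` constant are assumptions, not the bot's actual code.

```python
import aiohttp

LLAMA_URL = "http://llama-swap:8080"  # assumed default from the compose example below

async def query_llama(
    session: aiohttp.ClientSession,
    prompt: str,
    system: str = "",
    model: str = "llama3.1",
) -> str:
    """Send a chat-completion request to the llama.cpp/llama-swap server."""
    payload = {
        "model": model,
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": prompt},
        ],
        "stream": False,
    }
    async with session.post(f"{LLAMA_URL}/v1/chat/completions", json=payload) as resp:
        data = await resp.json()
        return data["choices"][0]["message"]["content"]

# Old name kept as an alias so existing call sites keep working.
# analyze_image_with_qwen would be aliased to analyze_image_with_vision the same way.
query_ollama = query_llama
```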
## Request Format

### Text Generation

**Before (Ollama):**

```python
payload = {
    "model": "llama3.1",
    "prompt": "Hello world",
    "system": "You are Miku",
    "stream": False
}
await session.post(f"{OLLAMA_URL}/api/generate", json=payload)
```

**After (OpenAI):**

```python
payload = {
    "model": "llama3.1",
    "messages": [
        {"role": "system", "content": "You are Miku"},
        {"role": "user", "content": "Hello world"}
    ],
    "stream": False
}
await session.post(f"{LLAMA_URL}/v1/chat/completions", json=payload)
```
### Vision Analysis

**Before (Ollama):**

```python
await switch_model("moondream")  # Manual switch!
payload = {
    "model": "moondream",
    "prompt": "Describe this image",
    "images": [base64_img],
    "stream": False
}
await session.post(f"{OLLAMA_URL}/api/generate", json=payload)
```

**After (OpenAI):**

```python
# No manual switch needed!
payload = {
    "model": "moondream",  # llama-swap auto-switches
    "messages": [{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image"},
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{base64_img}"}}
        ]
    }],
    "stream": False
}
await session.post(f"{LLAMA_URL}/v1/chat/completions", json=payload)
```
## Response Format

**Before (Ollama):**

```json
{
  "response": "Hello! I'm Miku!",
  "model": "llama3.1"
}
```

**After (OpenAI):**

```json
{
  "choices": [{
    "message": {
      "role": "assistant",
      "content": "Hello! I'm Miku!"
    }
  }],
  "model": "llama3.1"
}
```
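
In client code, the only change this forces is where the reply text is read from. A minimal sketch using the example payloads above (the variable names are placeholders):

```python
# Illustrative only: ollama_data / openai_data stand in for the parsed JSON
# responses of the two shapes shown above.
ollama_data = {"response": "Hello! I'm Miku!", "model": "llama3.1"}
openai_data = {
    "choices": [{"message": {"role": "assistant", "content": "Hello! I'm Miku!"}}],
    "model": "llama3.1",
}

# Before: reply text sits at the top level of the Ollama response.
reply_before = ollama_data["response"]

# After: reply text is nested under choices[0].message.content.
reply_after = openai_data["choices"][0]["message"]["content"]

assert reply_before == reply_after == "Hello! I'm Miku!"
```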
## Docker Services

**Before:**

```yaml
services:
  ollama:
    image: ollama/ollama
    ports: ["11434:11434"]
    volumes: ["ollama_data:/root/.ollama"]
  bot:
    environment:
      - OLLAMA_URL=http://ollama:11434
      - OLLAMA_MODEL=llama3.1
```

**After:**

```yaml
services:
  llama-swap:
    image: ghcr.io/mostlygeek/llama-swap:cuda
    ports: ["8080:8080"]
    volumes:
      - ./models:/models
      - ./llama-swap-config.yaml:/app/config.yaml
  bot:
    environment:
      - LLAMA_URL=http://llama-swap:8080
      - TEXT_MODEL=llama3.1
      - VISION_MODEL=moondream
```
## Model Management

| Feature | Ollama | llama.cpp + llama-swap |
|---|---|---|
| Model loading | Manual `ollama pull` | Download GGUF files to `/models` |
| Model switching | Manual `switch_model()` call | Automatic based on request |
| Model unloading | Manual or never | Automatic after TTL (30 min text, 15 min vision) |
| VRAM management | Always loaded | Load on demand, unload when idle |
| Storage format | Ollama format | GGUF files |
| Location | Docker volume | Host directory `./models/` |
## Configuration Files

| File | Purpose | Format |
|---|---|---|
| `docker-compose.yml` | Service orchestration | YAML |
| `llama-swap-config.yaml` | Model configs, TTL settings | YAML |
| `models/llama3.1.gguf` | Text model weights | Binary GGUF |
| `models/moondream.gguf` | Vision model weights | Binary GGUF |
| `models/moondream-mmproj.gguf` | Vision projector | Binary GGUF |
## Monitoring
| Tool | URL | Purpose |
|---|---|---|
| llama-swap Web UI | http://localhost:8080/ui | Monitor models, logs, timers |
| Health endpoint | http://localhost:8080/health | Check if server is ready |
| Running models | http://localhost:8080/running | List currently loaded models |
| Metrics | http://localhost:8080/metrics | Prometheus-compatible metrics |
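
For scripted checks (a container healthcheck or a CI smoke test, for example), a small poller against `/health` is enough. The sketch below is illustrative and assumes the default `8080` port mapping from the compose example; the `wait_until_ready` helper is not part of the bot's code.

```python
import time
import requests

def wait_until_ready(base_url: str = "http://localhost:8080", attempts: int = 30) -> bool:
    """Poll GET /health once per second until it returns 200, or give up."""
    for _ in range(attempts):
        try:
            if requests.get(f"{base_url}/health", timeout=2).status_code == 200:
                return True
        except requests.RequestException:
            pass  # server not up yet, keep polling
        time.sleep(1)
    return False

if __name__ == "__main__":
    print("llama-swap ready:", wait_until_ready())
```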
## Common Commands

```bash
# Check what's running
curl http://localhost:8080/running

# Check health
curl http://localhost:8080/health

# Manually unload all models
curl -X POST http://localhost:8080/models/unload

# View logs
docker-compose logs -f llama-swap

# Restart services
docker-compose restart

# Check model files
ls -lh models/
```
## Quick Troubleshooting

| Issue | Solution |
|---|---|
| "Model not found" | Verify the files in `./models/` match the config |
| CUDA errors | Check: `docker run --rm --gpus all nvidia/cuda:12.0-base nvidia-smi` |
| Slow responses | First load is slow; subsequent loads use the cache |
| High VRAM usage | Models auto-unload after the TTL expires |
| Bot can't connect | Check: `curl http://localhost:8080/health` |
**Remember:** The migration maintains backward compatibility. Old function names are aliased, so existing code continues to work!