# Quick Reference: Ollama → Llama.cpp Migration

## Environment Variables

| Old (Ollama) | New (llama.cpp) | Purpose |
|--------------|-----------------|---------|
| `OLLAMA_URL` | `LLAMA_URL` | Server endpoint |
| `OLLAMA_MODEL` | `TEXT_MODEL` | Text generation model name |
| N/A | `VISION_MODEL` | Vision model name |

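A minimal sketch of how the bot can read the new variables. The defaults shown here simply mirror the values from the Docker Compose example later in this document; adjust them to your setup:

```python
import os

# New llama.cpp settings; defaults mirror the docker-compose example below
LLAMA_URL = os.getenv("LLAMA_URL", "http://llama-swap:8080")
TEXT_MODEL = os.getenv("TEXT_MODEL", "llama3.1")
VISION_MODEL = os.getenv("VISION_MODEL", "moondream")
```
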
## API Endpoints

| Purpose | Old (Ollama) | New (llama.cpp) |
|---------|--------------|-----------------|
| Text generation | `/api/generate` | `/v1/chat/completions` |
| Vision | `/api/generate` | `/v1/chat/completions` |
| Health check | `GET /` | `GET /health` |
| Model management | Manual `switch_model()` | Automatic via llama-swap |

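For example, a startup readiness check against the new health endpoint might look like this sketch (aiohttp is assumed, matching the request examples below; the retry loop and function name are illustrative):

```python
import asyncio
import aiohttp

async def wait_until_ready(base_url: str, retries: int = 30) -> bool:
    """Poll GET /health until the llama.cpp server reports ready."""
    async with aiohttp.ClientSession() as session:
        for _ in range(retries):
            try:
                async with session.get(f"{base_url}/health") as resp:
                    if resp.status == 200:
                        return True
            except aiohttp.ClientError:
                pass  # server not up yet, retry
            await asyncio.sleep(2)
    return False
```
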
## Function Changes

| Old Function | New Function | Status |
|--------------|--------------|--------|
| `query_ollama()` | `query_llama()` | Aliased for compatibility |
| `analyze_image_with_qwen()` | `analyze_image_with_vision()` | Aliased for compatibility |
| `switch_model()` | **Removed** | llama-swap handles automatically |

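"Aliased for compatibility" means the old names still resolve to the new implementations. A sketch of what that can look like (only the names come from the table above; the exact module layout is an assumption):

```python
# Old names kept as thin aliases so existing call sites keep working
query_ollama = query_llama
analyze_image_with_qwen = analyze_image_with_vision
```
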
## Request Format

### Text Generation

**Before (Ollama):**
```python
payload = {
    "model": "llama3.1",
    "prompt": "Hello world",
    "system": "You are Miku",
    "stream": False
}
await session.post(f"{OLLAMA_URL}/api/generate", json=payload)
```

**After (OpenAI-compatible):**
```python
payload = {
    "model": "llama3.1",
    "messages": [
        {"role": "system", "content": "You are Miku"},
        {"role": "user", "content": "Hello world"}
    ],
    "stream": False
}
await session.post(f"{LLAMA_URL}/v1/chat/completions", json=payload)
```

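Putting the pieces together, a migrated text helper might look like the following sketch. The function name matches the Function Changes table, but its exact signature and the `session` object are assumptions (aiohttp assumed, as in the snippets above):

```python
async def query_llama(session, prompt: str, system: str = "You are Miku") -> str:
    """Send a chat completion request to llama.cpp and return the reply text."""
    payload = {
        "model": TEXT_MODEL,
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": prompt},
        ],
        "stream": False,
    }
    async with session.post(f"{LLAMA_URL}/v1/chat/completions", json=payload) as resp:
        data = await resp.json()
        return data["choices"][0]["message"]["content"]
```
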
### Vision Analysis

**Before (Ollama):**
```python
await switch_model("moondream")  # Manual switch!
payload = {
    "model": "moondream",
    "prompt": "Describe this image",
    "images": [base64_img],
    "stream": False
}
await session.post(f"{OLLAMA_URL}/api/generate", json=payload)
```

**After (OpenAI-compatible):**
```python
# No manual switch needed!
payload = {
    "model": "moondream",  # llama-swap auto-switches
    "messages": [{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image"},
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{base64_img}"}}
        ]
    }],
    "stream": False
}
await session.post(f"{LLAMA_URL}/v1/chat/completions", json=payload)
```

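Both examples assume `base64_img` is already a base64-encoded string. If you start from raw image bytes, a hypothetical helper for building the data URL could look like this (the function name and the JPEG default are illustrative):

```python
import base64

def to_data_url(image_bytes: bytes, mime: str = "image/jpeg") -> str:
    """Encode raw image bytes as a data: URL for an image_url content part."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime};base64,{b64}"
```
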
## Response Format

**Before (Ollama):**
```json
{
  "response": "Hello! I'm Miku!",
  "model": "llama3.1"
}
```

**After (OpenAI-compatible):**
```json
{
  "choices": [{
    "message": {
      "role": "assistant",
      "content": "Hello! I'm Miku!"
    }
  }],
  "model": "llama3.1"
}
```

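In practice the only parsing change is where the reply text lives. A sketch of the before/after extraction, assuming `data` is the decoded JSON response:

```python
# Before (Ollama)
reply = data["response"]

# After (OpenAI-compatible llama.cpp)
reply = data["choices"][0]["message"]["content"]
```
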
## Docker Services

**Before:**
```yaml
services:
  ollama:
    image: ollama/ollama
    ports: ["11434:11434"]
    volumes: ["ollama_data:/root/.ollama"]

  bot:
    environment:
      - OLLAMA_URL=http://ollama:11434
      - OLLAMA_MODEL=llama3.1
```

**After:**
```yaml
services:
  llama-swap:
    image: ghcr.io/mostlygeek/llama-swap:cuda
    ports: ["8080:8080"]
    volumes:
      - ./models:/models
      - ./llama-swap-config.yaml:/app/config.yaml

  bot:
    environment:
      - LLAMA_URL=http://llama-swap:8080
      - TEXT_MODEL=llama3.1
      - VISION_MODEL=moondream
```

## Model Management

| Feature | Ollama | llama.cpp + llama-swap |
|---------|--------|------------------------|
| Model loading | Manual `ollama pull` | Download GGUF files to `/models` |
| Model switching | Manual `switch_model()` call | Automatic based on request |
| Model unloading | Manual or never | Automatic after TTL (30m text, 15m vision) |
| VRAM management | Always loaded | Load on demand, unload when idle |
| Storage format | Ollama format | GGUF files |
| Location | Docker volume | Host directory `./models/` |

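One way to populate `./models/` is with the `huggingface_hub` client. This sketch is illustrative only; the repository and file names are placeholders, not the models used in this migration:

```python
from huggingface_hub import hf_hub_download

# Download a GGUF file straight into the host models directory.
# Replace repo_id/filename with the quantized model you actually use.
hf_hub_download(
    repo_id="your-org/your-model-GGUF",   # placeholder
    filename="your-model.Q4_K_M.gguf",    # placeholder
    local_dir="./models",
)
```
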
## Configuration Files

| File | Purpose | Format |
|------|---------|--------|
| `docker-compose.yml` | Service orchestration | YAML |
| `llama-swap-config.yaml` | Model configs, TTL settings | YAML |
| `models/llama3.1.gguf` | Text model weights | Binary GGUF |
| `models/moondream.gguf` | Vision model weights | Binary GGUF |
| `models/moondream-mmproj.gguf` | Vision projector | Binary GGUF |

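For orientation, `llama-swap-config.yaml` maps each model name to the `llama-server` command that serves it, plus an idle TTL. The sketch below is an assumption about the shape of that file (check the llama-swap documentation for your version; the flags and paths are illustrative), with TTLs matching the 30m/15m values above:

```yaml
models:
  "llama3.1":
    cmd: llama-server --model /models/llama3.1.gguf --port ${PORT}
    ttl: 1800   # unload after 30 minutes idle
  "moondream":
    cmd: llama-server --model /models/moondream.gguf --mmproj /models/moondream-mmproj.gguf --port ${PORT}
    ttl: 900    # unload after 15 minutes idle
```
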
## Monitoring

| Tool | URL | Purpose |
|------|-----|---------|
| llama-swap Web UI | http://localhost:8080/ui | Monitor models, logs, timers |
| Health endpoint | http://localhost:8080/health | Check if server is ready |
| Running models | http://localhost:8080/running | List currently loaded models |
| Metrics | http://localhost:8080/metrics | Prometheus-compatible metrics |

## Common Commands

```bash
# Check what's running
curl http://localhost:8080/running

# Check health
curl http://localhost:8080/health

# Manually unload all models
curl -X POST http://localhost:8080/models/unload

# View logs
docker-compose logs -f llama-swap

# Restart services
docker-compose restart

# Check model files
ls -lh models/
```

## Quick Troubleshooting

| Issue | Solution |
|-------|----------|
| "Model not found" | Verify the files in `./models/` match the names in `llama-swap-config.yaml` |
| CUDA errors | Check GPU access: `docker run --rm --gpus all nvidia/cuda:12.0-base nvidia-smi` |
| Slow responses | First request loads the model and is slow; later requests reuse the loaded model |
| High VRAM usage | Models auto-unload after their TTL expires |
| Bot can't connect | Check: `curl http://localhost:8080/health` |

---

**Remember:** The migration maintains backward compatibility. Old function names are aliased, so existing code continues to work!