# Quick Reference: Ollama → Llama.cpp Migration

## Environment Variables

| Old (Ollama) | New (llama.cpp) | Purpose |
|--------------|-----------------|---------|
| `OLLAMA_URL` | `LLAMA_URL` | Server endpoint |
| `OLLAMA_MODEL` | `TEXT_MODEL` | Text generation model name |
| N/A | `VISION_MODEL` | Vision model name |

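A minimal sketch of how the bot can read the new variables. The defaults shown here simply mirror the values from the Docker Compose example later in this document; adjust them to your setup:

```python
import os

# New llama.cpp settings; defaults mirror the docker-compose example below
LLAMA_URL = os.getenv("LLAMA_URL", "http://llama-swap:8080")
TEXT_MODEL = os.getenv("TEXT_MODEL", "llama3.1")
VISION_MODEL = os.getenv("VISION_MODEL", "moondream")
```
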
## API Endpoints

| Purpose | Old (Ollama) | New (llama.cpp) |
|---------|--------------|-----------------|
| Text generation | `/api/generate` | `/v1/chat/completions` |
| Vision | `/api/generate` | `/v1/chat/completions` |
| Health check | `GET /` | `GET /health` |
| Model management | Manual `switch_model()` | Automatic via llama-swap |

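For example, a startup readiness check against the new health endpoint might look like this sketch (aiohttp is assumed, matching the request examples below; the retry loop and function name are illustrative):

```python
import asyncio
import aiohttp

async def wait_until_ready(base_url: str, retries: int = 30) -> bool:
    """Poll GET /health until the llama.cpp server reports ready."""
    async with aiohttp.ClientSession() as session:
        for _ in range(retries):
            try:
                async with session.get(f"{base_url}/health") as resp:
                    if resp.status == 200:
                        return True
            except aiohttp.ClientError:
                pass  # server not up yet, retry
            await asyncio.sleep(2)
    return False
```
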
## Function Changes

| Old Function | New Function | Status |
|--------------|--------------|--------|
| `query_ollama()` | `query_llama()` | Aliased for compatibility |
| `analyze_image_with_qwen()` | `analyze_image_with_vision()` | Aliased for compatibility |
| `switch_model()` | **Removed** | llama-swap handles automatically |

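"Aliased for compatibility" means the old names still resolve to the new implementations. A sketch of what that can look like (only the names come from the table above; the exact module layout is an assumption):

```python
# Old names kept as thin aliases so existing call sites keep working
query_ollama = query_llama
analyze_image_with_qwen = analyze_image_with_vision
```
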
## Request Format

### Text Generation

**Before (Ollama):**
```python
payload = {
    "model": "llama3.1",
    "prompt": "Hello world",
    "system": "You are Miku",
    "stream": False
}
await session.post(f"{OLLAMA_URL}/api/generate", json=payload)
```

**After (OpenAI-compatible):**
```python
payload = {
    "model": "llama3.1",
    "messages": [
        {"role": "system", "content": "You are Miku"},
        {"role": "user", "content": "Hello world"}
    ],
    "stream": False
}
await session.post(f"{LLAMA_URL}/v1/chat/completions", json=payload)
```

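Putting the pieces together, a migrated text helper might look like the following sketch. The function name matches the Function Changes table, but its exact signature and the `session` object are assumptions (aiohttp assumed, as in the snippets above):

```python
async def query_llama(session, prompt: str, system: str = "You are Miku") -> str:
    """Send a chat completion request to llama.cpp and return the reply text."""
    payload = {
        "model": TEXT_MODEL,
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": prompt},
        ],
        "stream": False,
    }
    async with session.post(f"{LLAMA_URL}/v1/chat/completions", json=payload) as resp:
        data = await resp.json()
        return data["choices"][0]["message"]["content"]
```
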
### Vision Analysis

**Before (Ollama):**
```python
await switch_model("moondream")  # Manual switch!
payload = {
    "model": "moondream",
    "prompt": "Describe this image",
    "images": [base64_img],
    "stream": False
}
await session.post(f"{OLLAMA_URL}/api/generate", json=payload)
```

**After (OpenAI-compatible):**
```python
# No manual switch needed!
payload = {
    "model": "moondream",  # llama-swap auto-switches
    "messages": [{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image"},
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{base64_img}"}}
        ]
    }],
    "stream": False
}
await session.post(f"{LLAMA_URL}/v1/chat/completions", json=payload)
```

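Both examples assume `base64_img` is already a base64-encoded string. If you start from raw image bytes, a hypothetical helper for building the data URL could look like this (the function name and the JPEG default are illustrative):

```python
import base64

def to_data_url(image_bytes: bytes, mime: str = "image/jpeg") -> str:
    """Encode raw image bytes as a data: URL for an image_url content part."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime};base64,{b64}"
```
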
## Response Format

**Before (Ollama):**
```json
{
  "response": "Hello! I'm Miku!",
  "model": "llama3.1"
}
```

**After (OpenAI-compatible):**
```json
{
  "choices": [{
    "message": {
      "role": "assistant",
      "content": "Hello! I'm Miku!"
    }
  }],
  "model": "llama3.1"
}
```

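In practice the only parsing change is where the reply text lives. A sketch of the before/after extraction, assuming `data` is the decoded JSON response:

```python
# Before (Ollama)
reply = data["response"]

# After (OpenAI-compatible llama.cpp)
reply = data["choices"][0]["message"]["content"]
```
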
## Docker Services

**Before:**
```yaml
services:
  ollama:
    image: ollama/ollama
    ports: ["11434:11434"]
    volumes: ["ollama_data:/root/.ollama"]

  bot:
    environment:
      - OLLAMA_URL=http://ollama:11434
      - OLLAMA_MODEL=llama3.1
```

**After:**
```yaml
services:
  llama-swap:
    image: ghcr.io/mostlygeek/llama-swap:cuda
    ports: ["8080:8080"]
    volumes:
      - ./models:/models
      - ./llama-swap-config.yaml:/app/config.yaml

  bot:
    environment:
      - LLAMA_URL=http://llama-swap:8080
      - TEXT_MODEL=llama3.1
      - VISION_MODEL=moondream
```

## Model Management

| Feature | Ollama | llama.cpp + llama-swap |
|---------|--------|------------------------|
| Model loading | Manual `ollama pull` | Download GGUF files to `/models` |
| Model switching | Manual `switch_model()` call | Automatic based on request |
| Model unloading | Manual or never | Automatic after TTL (30m text, 15m vision) |
| VRAM management | Always loaded | Load on demand, unload when idle |
| Storage format | Ollama format | GGUF files |
| Location | Docker volume | Host directory `./models/` |

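One way to populate `./models/` is with the `huggingface_hub` client. This sketch is illustrative only; the repository and file names are placeholders, not the models used in this migration:

```python
from huggingface_hub import hf_hub_download

# Download a GGUF file straight into the host models directory.
# Replace repo_id/filename with the quantized model you actually use.
hf_hub_download(
    repo_id="your-org/your-model-GGUF",   # placeholder
    filename="your-model.Q4_K_M.gguf",    # placeholder
    local_dir="./models",
)
```
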
## Configuration Files

| File | Purpose | Format |
|------|---------|--------|
| `docker-compose.yml` | Service orchestration | YAML |
| `llama-swap-config.yaml` | Model configs, TTL settings | YAML |
| `models/llama3.1.gguf` | Text model weights | Binary GGUF |
| `models/moondream.gguf` | Vision model weights | Binary GGUF |
| `models/moondream-mmproj.gguf` | Vision projector | Binary GGUF |

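For orientation, `llama-swap-config.yaml` maps each model name to the `llama-server` command that serves it, plus an idle TTL. The sketch below is an assumption about the shape of that file (check the llama-swap documentation for your version; the flags and paths are illustrative), with TTLs matching the 30m/15m values above:

```yaml
models:
  "llama3.1":
    cmd: llama-server --model /models/llama3.1.gguf --port ${PORT}
    ttl: 1800   # unload after 30 minutes idle
  "moondream":
    cmd: llama-server --model /models/moondream.gguf --mmproj /models/moondream-mmproj.gguf --port ${PORT}
    ttl: 900    # unload after 15 minutes idle
```
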
## Monitoring

| Tool | URL | Purpose |
|------|-----|---------|
| llama-swap Web UI | http://localhost:8080/ui | Monitor models, logs, timers |
| Health endpoint | http://localhost:8080/health | Check if server is ready |
| Running models | http://localhost:8080/running | List currently loaded models |
| Metrics | http://localhost:8080/metrics | Prometheus-compatible metrics |

## Common Commands

```bash
# Check what's running
curl http://localhost:8080/running

# Check health
curl http://localhost:8080/health

# Manually unload all models
curl -X POST http://localhost:8080/models/unload

# View logs
docker-compose logs -f llama-swap

# Restart services
docker-compose restart

# Check model files
ls -lh models/
```

## Quick Troubleshooting

| Issue | Solution |
|-------|----------|
| "Model not found" | Verify the files in `./models/` match the names in `llama-swap-config.yaml` |
| CUDA errors | Check GPU access: `docker run --rm --gpus all nvidia/cuda:12.0-base nvidia-smi` |
| Slow responses | First request loads the model and is slow; later requests reuse the loaded model |
| High VRAM usage | Models auto-unload after their TTL expires |
| Bot can't connect | Check: `curl http://localhost:8080/health` |

---

**Remember:** The migration maintains backward compatibility. Old function names are aliased, so existing code continues to work!