
# Quick Reference: Ollama → Llama.cpp Migration

## Environment Variables

| Old (Ollama) | New (llama.cpp) | Purpose |
|---|---|---|
| `OLLAMA_URL` | `LLAMA_URL` | Server endpoint |
| `OLLAMA_MODEL` | `TEXT_MODEL` | Text generation model name |
| N/A | `VISION_MODEL` | Vision model name |
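
These are typically read once at startup. A minimal sketch, assuming the bot reads them with `os.environ` (the fallback values below are illustrative, not taken from the project):

```python
import os

# Fallbacks here are illustrative defaults, not the bot's actual values.
LLAMA_URL = os.environ.get("LLAMA_URL", "http://llama-swap:8080")
TEXT_MODEL = os.environ.get("TEXT_MODEL", "llama3.1")
VISION_MODEL = os.environ.get("VISION_MODEL", "moondream")
```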

## API Endpoints

| Purpose | Old (Ollama) | New (llama.cpp) |
|---|---|---|
| Text generation | `/api/generate` | `/v1/chat/completions` |
| Vision | `/api/generate` | `/v1/chat/completions` |
| Health check | `GET /` | `GET /health` |
| Model management | Manual `switch_model()` | Automatic via llama-swap |

## Function Changes

| Old Function | New Function | Status |
|---|---|---|
| `query_ollama()` | `query_llama()` | Aliased for compatibility |
| `analyze_image_with_qwen()` | `analyze_image_with_vision()` | Aliased for compatibility |
| `switch_model()` | Removed | llama-swap switches models automatically |
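
The aliases are likely plain module-level name rebindings. A sketch (the signatures here are assumptions, not the bot's actual ones):

```python
# Sketch of the compatibility aliasing; actual signatures in the bot may differ.

async def query_llama(prompt: str, system: str = "") -> str:
    ...  # new OpenAI-style chat-completions call (see "Request Format" below)

async def analyze_image_with_vision(base64_img: str, prompt: str) -> str:
    ...  # new vision call via /v1/chat/completions

# Old names point at the new functions, so existing call sites keep working.
query_ollama = query_llama
analyze_image_with_qwen = analyze_image_with_vision
```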

## Request Format

### Text Generation

**Before (Ollama):**

```python
payload = {
    "model": "llama3.1",
    "prompt": "Hello world",
    "system": "You are Miku",
    "stream": False
}
await session.post(f"{OLLAMA_URL}/api/generate", json=payload)
```

**After (OpenAI):**

```python
payload = {
    "model": "llama3.1",
    "messages": [
        {"role": "system", "content": "You are Miku"},
        {"role": "user", "content": "Hello world"}
    ],
    "stream": False
}
await session.post(f"{LLAMA_URL}/v1/chat/completions", json=payload)
```

### Vision Analysis

**Before (Ollama):**

```python
await switch_model("moondream")  # Manual switch!
payload = {
    "model": "moondream",
    "prompt": "Describe this image",
    "images": [base64_img],
    "stream": False
}
await session.post(f"{OLLAMA_URL}/api/generate", json=payload)
```

**After (OpenAI):**

```python
# No manual switch needed!
payload = {
    "model": "moondream",  # llama-swap auto-switches
    "messages": [{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image"},
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{base64_img}"}}
        ]
    }],
    "stream": False
}
await session.post(f"{LLAMA_URL}/v1/chat/completions", json=payload)
```
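
The data URL expects plain base64 text. If `base64_img` needs to be built from a file, a sketch using only the standard library (the bot itself more likely encodes bytes it has already fetched from Discord):

```python
import base64

def encode_image(path: str) -> str:
    """Read an image file and return its base64 text for a data URL."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("ascii")

base64_img = encode_image("photo.jpg")  # hypothetical filename
```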

## Response Format

**Before (Ollama):**

```json
{
  "response": "Hello! I'm Miku!",
  "model": "llama3.1"
}
```

**After (OpenAI):**

```json
{
  "choices": [{
    "message": {
      "role": "assistant",
      "content": "Hello! I'm Miku!"
    }
  }],
  "model": "llama3.1"
}
```
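
The practical change in bot code is where the reply text lives. Assuming `data` is the decoded JSON body (e.g. `data = await resp.json()` with aiohttp):

```python
# Before (Ollama): reply text is a top-level field.
reply = data["response"]

# After (OpenAI-compatible): reply text is nested under choices[0].message.
reply = data["choices"][0]["message"]["content"]
```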

## Docker Services

**Before:**

```yaml
services:
  ollama:
    image: ollama/ollama
    ports: ["11434:11434"]
    volumes: ["ollama_data:/root/.ollama"]

  bot:
    environment:
      - OLLAMA_URL=http://ollama:11434
      - OLLAMA_MODEL=llama3.1
```

**After:**

```yaml
services:
  llama-swap:
    image: ghcr.io/mostlygeek/llama-swap:cuda
    ports: ["8080:8080"]
    volumes:
      - ./models:/models
      - ./llama-swap-config.yaml:/app/config.yaml

  bot:
    environment:
      - LLAMA_URL=http://llama-swap:8080
      - TEXT_MODEL=llama3.1
      - VISION_MODEL=moondream
```
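
Optionally, the bot can wait until llama-swap answers its health endpoint before starting. This is not part of the original compose file, just standard Compose healthcheck syntax, and it assumes `curl` is available inside the llama-swap image:

```yaml
  llama-swap:
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 10s
      timeout: 5s
      retries: 12

  bot:
    depends_on:
      llama-swap:
        condition: service_healthy
```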

## Model Management

| Feature | Ollama | llama.cpp + llama-swap |
|---|---|---|
| Model loading | Manual `ollama pull` | Download GGUF files to `/models` |
| Model switching | Manual `switch_model()` call | Automatic based on request |
| Model unloading | Manual or never | Automatic after TTL (30 min text, 15 min vision) |
| VRAM management | Always loaded | Load on demand, unload when idle |
| Storage format | Ollama format | GGUF files |
| Location | Docker volume | Host directory `./models/` |

## Configuration Files

| File | Purpose | Format |
|---|---|---|
| `docker-compose.yml` | Service orchestration | YAML |
| `llama-swap-config.yaml` | Model configs, TTL settings | YAML |
| `models/llama3.1.gguf` | Text model weights | Binary GGUF |
| `models/moondream.gguf` | Vision model weights | Binary GGUF |
| `models/moondream-mmproj.gguf` | Vision projector | Binary GGUF |
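
For orientation, llama-swap's config maps each model name to a `llama-server` command and an idle TTL. The shape below is illustrative only (check the llama-swap docs for the exact schema of your version); the TTLs match the 30-minute and 15-minute values above, expressed in seconds:

```yaml
# llama-swap-config.yaml — illustrative shape, not the project's actual file
models:
  "llama3.1":
    cmd: llama-server --port ${PORT} -m /models/llama3.1.gguf
    ttl: 1800   # unload the text model after 30 minutes idle

  "moondream":
    cmd: >
      llama-server --port ${PORT}
      -m /models/moondream.gguf
      --mmproj /models/moondream-mmproj.gguf
    ttl: 900    # unload the vision model after 15 minutes idle
```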

## Monitoring

| Tool | URL | Purpose |
|---|---|---|
| llama-swap Web UI | http://localhost:8080/ui | Monitor models, logs, timers |
| Health endpoint | http://localhost:8080/health | Check if server is ready |
| Running models | http://localhost:8080/running | List currently loaded models |
| Metrics | http://localhost:8080/metrics | Prometheus-compatible metrics |

## Common Commands

```bash
# Check what's running
curl http://localhost:8080/running

# Check health
curl http://localhost:8080/health

# Manually unload all models
curl -X POST http://localhost:8080/models/unload

# View logs
docker-compose logs -f llama-swap

# Restart services
docker-compose restart

# Check model files
ls -lh models/
```

## Quick Troubleshooting

| Issue | Solution |
|---|---|
| "Model not found" | Verify files in `./models/` match the config |
| CUDA errors | Check: `docker run --rm --gpus all nvidia/cuda:12.0-base nvidia-smi` |
| Slow responses | First load is slow; subsequent loads use cache |
| High VRAM usage | Models auto-unload after the TTL expires |
| Bot can't connect | Check: `curl http://localhost:8080/health` |

**Remember:** The migration maintains backward compatibility. Old function names are aliased, so existing code continues to work!