# Quick Reference: Ollama → Llama.cpp Migration
## Environment Variables

| Old (Ollama) | New (llama.cpp) | Purpose |
|---|---|---|
| `OLLAMA_URL` | `LLAMA_URL` | Server endpoint |
| `OLLAMA_MODEL` | `TEXT_MODEL` | Text generation model name |
| N/A | `VISION_MODEL` | Vision model name |
## API Endpoints

| Purpose | Old (Ollama) | New (llama.cpp) |
|---|---|---|
| Text generation | `/api/generate` | `/v1/chat/completions` |
| Vision | `/api/generate` | `/v1/chat/completions` |
| Health check | `GET /` | `GET /health` |
| Model management | Manual `switch_model()` | Automatic via llama-swap |
## Function Changes

| Old Function | New Function | Status |
|---|---|---|
| `query_ollama()` | `query_llama()` | Aliased for compatibility |
| `analyze_image_with_qwen()` | `analyze_image_with_vision()` | Aliased for compatibility |
| `switch_model()` | Removed | llama-swap handles automatically |
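
The aliasing can be as simple as binding the old name to the new coroutine, as in the sketch below. It is illustrative only: the `query_llama(session, prompt, system, model)` signature and the module-level `LLAMA_URL` constant are assumptions, not the bot's actual code.

```python
import aiohttp

LLAMA_URL = "http://llama-swap:8080"  # assumed default from the compose example below

async def query_llama(
    session: aiohttp.ClientSession,
    prompt: str,
    system: str = "",
    model: str = "llama3.1",
) -> str:
    """Send a chat-completion request to the llama.cpp/llama-swap server."""
    payload = {
        "model": model,
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": prompt},
        ],
        "stream": False,
    }
    async with session.post(f"{LLAMA_URL}/v1/chat/completions", json=payload) as resp:
        data = await resp.json()
        return data["choices"][0]["message"]["content"]

# Old name kept as an alias so existing call sites keep working.
# analyze_image_with_qwen would be aliased to analyze_image_with_vision the same way.
query_ollama = query_llama
```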
## Request Format

### Text Generation

**Before (Ollama):**

```python
payload = {
    "model": "llama3.1",
    "prompt": "Hello world",
    "system": "You are Miku",
    "stream": False
}
await session.post(f"{OLLAMA_URL}/api/generate", json=payload)
```

**After (OpenAI):**

```python
payload = {
    "model": "llama3.1",
    "messages": [
        {"role": "system", "content": "You are Miku"},
        {"role": "user", "content": "Hello world"}
    ],
    "stream": False
}
await session.post(f"{LLAMA_URL}/v1/chat/completions", json=payload)
```
### Vision Analysis

**Before (Ollama):**

```python
await switch_model("moondream")  # Manual switch!
payload = {
    "model": "moondream",
    "prompt": "Describe this image",
    "images": [base64_img],
    "stream": False
}
await session.post(f"{OLLAMA_URL}/api/generate", json=payload)
```

**After (OpenAI):**

```python
# No manual switch needed!
payload = {
    "model": "moondream",  # llama-swap auto-switches
    "messages": [{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image"},
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{base64_img}"}}
        ]
    }],
    "stream": False
}
await session.post(f"{LLAMA_URL}/v1/chat/completions", json=payload)
```
## Response Format

**Before (Ollama):**

```json
{
  "response": "Hello! I'm Miku!",
  "model": "llama3.1"
}
```

**After (OpenAI):**

```json
{
  "choices": [{
    "message": {
      "role": "assistant",
      "content": "Hello! I'm Miku!"
    }
  }],
  "model": "llama3.1"
}
```
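
In client code, the only change this forces is where the reply text is read from. A minimal sketch using the example payloads above (the variable names are placeholders):

```python
# Illustrative only: ollama_data / openai_data stand in for the parsed JSON
# responses of the two shapes shown above.
ollama_data = {"response": "Hello! I'm Miku!", "model": "llama3.1"}
openai_data = {
    "choices": [{"message": {"role": "assistant", "content": "Hello! I'm Miku!"}}],
    "model": "llama3.1",
}

# Before: reply text sits at the top level of the Ollama response.
reply_before = ollama_data["response"]

# After: reply text is nested under choices[0].message.content.
reply_after = openai_data["choices"][0]["message"]["content"]

assert reply_before == reply_after == "Hello! I'm Miku!"
```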
## Docker Services

**Before:**

```yaml
services:
  ollama:
    image: ollama/ollama
    ports: ["11434:11434"]
    volumes: ["ollama_data:/root/.ollama"]
  bot:
    environment:
      - OLLAMA_URL=http://ollama:11434
      - OLLAMA_MODEL=llama3.1
```

**After:**

```yaml
services:
  llama-swap:
    image: ghcr.io/mostlygeek/llama-swap:cuda
    ports: ["8080:8080"]
    volumes:
      - ./models:/models
      - ./llama-swap-config.yaml:/app/config.yaml
  bot:
    environment:
      - LLAMA_URL=http://llama-swap:8080
      - TEXT_MODEL=llama3.1
      - VISION_MODEL=moondream
```
## Model Management

| Feature | Ollama | llama.cpp + llama-swap |
|---|---|---|
| Model loading | Manual `ollama pull` | Download GGUF files to `/models` |
| Model switching | Manual `switch_model()` call | Automatic based on request |
| Model unloading | Manual or never | Automatic after TTL (30 min text, 15 min vision) |
| VRAM management | Always loaded | Load on demand, unload when idle |
| Storage format | Ollama format | GGUF files |
| Location | Docker volume | Host directory `./models/` |
## Configuration Files

| File | Purpose | Format |
|---|---|---|
| `docker-compose.yml` | Service orchestration | YAML |
| `llama-swap-config.yaml` | Model configs, TTL settings | YAML |
| `models/llama3.1.gguf` | Text model weights | Binary GGUF |
| `models/moondream.gguf` | Vision model weights | Binary GGUF |
| `models/moondream-mmproj.gguf` | Vision projector | Binary GGUF |
## Monitoring
| Tool | URL | Purpose |
|---|---|---|
| llama-swap Web UI | http://localhost:8080/ui | Monitor models, logs, timers |
| Health endpoint | http://localhost:8080/health | Check if server is ready |
| Running models | http://localhost:8080/running | List currently loaded models |
| Metrics | http://localhost:8080/metrics | Prometheus-compatible metrics |
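
For scripted checks (a container healthcheck or a CI smoke test, for example), a small poller against `/health` is enough. The sketch below is illustrative and assumes the default `8080` port mapping from the compose example; the `wait_until_ready` helper is not part of the bot's code.

```python
import time
import requests

def wait_until_ready(base_url: str = "http://localhost:8080", attempts: int = 30) -> bool:
    """Poll GET /health once per second until it returns 200, or give up."""
    for _ in range(attempts):
        try:
            if requests.get(f"{base_url}/health", timeout=2).status_code == 200:
                return True
        except requests.RequestException:
            pass  # server not up yet, keep polling
        time.sleep(1)
    return False

if __name__ == "__main__":
    print("llama-swap ready:", wait_until_ready())
```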
## Common Commands

```bash
# Check what's running
curl http://localhost:8080/running

# Check health
curl http://localhost:8080/health

# Manually unload all models
curl -X POST http://localhost:8080/models/unload

# View logs
docker-compose logs -f llama-swap

# Restart services
docker-compose restart

# Check model files
ls -lh models/
```
## Quick Troubleshooting

| Issue | Solution |
|---|---|
| "Model not found" | Verify the files in `./models/` match the config |
| CUDA errors | Check: `docker run --rm --gpus all nvidia/cuda:12.0-base nvidia-smi` |
| Slow responses | First load is slow; subsequent loads use the cache |
| High VRAM usage | Models auto-unload after the TTL expires |
| Bot can't connect | Check: `curl http://localhost:8080/health` |
**Remember:** The migration maintains backward compatibility. Old function names are aliased, so existing code continues to work!