llama-swap-config.yaml

# llama-swap configuration for Miku Discord Bot
# This manages automatic model switching and unloading

models:
  # Main text generation model (Llama 3.1 8B)
  llama3.1:
    cmd: /app/llama-server --port ${PORT} --model /models/Llama-3.1-8B-Instruct-UD-Q4_K_XL.gguf -ngl 99 -nkvo -c 16384 --host 0.0.0.0 --no-warmup
    ttl: 1800  # Unload after 30 minutes of inactivity (1800 seconds)
    swap: true  # CRITICAL: Unload other models when loading this one
    aliases:
      - llama3.1
      - text-model
  
  # Evil/Uncensored text generation model (DarkIdol-Llama 3.1 8B)
  darkidol:
    cmd: /app/llama-server --port ${PORT} --model /models/DarkIdol-Llama-3.1-8B-Instruct-1.3-Uncensored_Q4_K_M.gguf -ngl 99 -nkvo -c 16384 --host 0.0.0.0 --no-warmup
    ttl: 1800  # Unload after 30 minutes of inactivity
    swap: true  # CRITICAL: Unload other models when loading this one
    aliases:
      - darkidol
      - evil-model
      - uncensored
  
  # Japanese language model (Llama 3.1 Swallow - Japanese optimized)
  swallow:
    cmd: /app/llama-server --port ${PORT} --model /models/Llama-3.1-Swallow-8B-Instruct-v0.5-Q4_K_M.gguf -ngl 99 -nkvo -c 16384 --host 0.0.0.0 --no-warmup
    ttl: 1800  # Unload after 30 minutes of inactivity
    swap: true  # CRITICAL: Unload other models when loading this one
    aliases:
      - swallow
      - japanese
      - japanese-model
    
  # Vision/Multimodal model (MiniCPM-V-4.5 - supports images, video, and GIFs)
  vision:
    cmd: /app/llama-server --port ${PORT} --model /models/MiniCPM-V-4_5-Q3_K_S.gguf --mmproj /models/MiniCPM-V-4_5-mmproj-f16.gguf -ngl 99 -c 4096 --host 0.0.0.0 --no-warmup
    ttl: 900  # Vision model is used less often, so it gets a shorter TTL (15 minutes = 900 seconds)
    swap: true  # CRITICAL: Unload text models before loading vision
    aliases:
      - vision
      - vision-model
      - minicpm
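
# Sketch only (an assumption, not part of the original setup): the one-model-at-a-time
# behaviour described in the comments above can also be stated explicitly with
# llama-swap's top-level `groups` section. The group name `miku-models` is hypothetical;
# the members are the model IDs defined above. Left commented out so it does not change
# behaviour; check the llama-swap documentation for your version before enabling it.
#
# groups:
#   miku-models:
#     swap: true        # only one member of the group is loaded at a time
#     exclusive: true   # loading a member unloads models running in other groups
#     members:
#       - llama3.1
#       - darkidol
#       - swallow
#       - vision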

# Server configuration
# llama-swap will listen on this address
# Inside Docker, we bind to 0.0.0.0 so that the bot container can connect