Compare commits


4 Commits

Author SHA1 Message Date
486acb5c14 Fix reply-context speaker confusion with structured metadata pipeline
Previously, when a user replied to Miku's message via Discord's reply
feature, Miku's quoted words were embedded directly into the user's
message text using the format:
  [Replying to your message: "Miku's words"] User's response

This caused two problems:
1. The LLM had to parse "your message" to determine the quoted text
   was MIKU's words — fragile and frequently misattributed
2. When stored in episodic memory as [User]: ..., Miku's quoted words
   were permanently mislabeled under the user's speaker prefix

Now reply context flows through as structured metadata:
- bot/bot.py captures the replied-to text WITHOUT embedding it in prompt
- cat_client.py passes it as discord_reply_context in the WebSocket payload
- discord_bridge.py injects it as agent_input['reply_context'] — a
  CLEARLY LABELED note: [The user is replying to what you (Miku) said — ...]
- miku_personality.py + evil_miku_personality.py render it via
  {reply_context} placeholder in the prompt suffix, between memory
  context and conversation history

This keeps Miku's words as a separate context note, never mixed into
the user's HumanMessage. Episodic memory only stores the user's actual
words. The fallback path (when Cat is unavailable) also uses a cleaner
format with explicit speaker labels.
2026-06-03 22:50:03 +03:00
9d2c14fa0b Fix vision pipeline: ffmpeg removal by autoremove, increase vision timeout, reduce frame count, add Discord activity awareness
- bot/Dockerfile: Add ffmpeg to reinstall line after apt-get autoremove
  (autoremove was sweeping up ffmpeg as 'no longer needed' after playwright install)
- bot/utils/image_handling.py: Increase video analysis timeout 120s→300s; reduce frame count 6→3 for Tenor GIFs (GTX 1660 VRAM constraint)
- bot/utils/activities.py: Add _activity_changed_at timestamp tracking,
  get_current_activity_label() and get_current_activity_fresh() with 30-min decay
- bot/utils/cat_client.py: Pass current Discord activity to Cheshire Cat pipeline
- bot/utils/llm.py: Inject current Discord activity into system prompt
- cat-plugins/*: Forward Discord activity through working_memory to personality plugins
- bot/persona/*/preamble.txt: Add Discord status usage guidelines for character prompts
- llama-swap-rocm-config.yaml: Add qwen3.5 model entry for ComfyUI prompt generation
- AGENTS.md: New project documentation file
2026-05-27 01:18:12 +03:00
d333c61c8f fix: set default bedtime end time to 11 PM (was 9 PM)
2026-05-22 21:40:01 +03:00
e1f81e52e5 Fix Miku confusing who said what in conversations
Three interrelated fixes for speaker attribution confusion:

1. Fix misleading episodic memory header (discord_bridge.py):
   The Cat core hardcodes '## Context of things the Human said in the past:'
   when formatting recalled conversations. Our plugins store BOTH user messages
   ([User]: prefix) AND Miku's own responses ([Miku]: prefix) in episodic memory.
   This misleading header primes the LLM to attribute Miku's words to the user.
   Replaced with '## Past conversation excerpts (prefixed by who said what):'
   which accurately describes the mixed-speaker content.

2. Tighten episodic recall (discord_bridge.py):
   Added before_cat_recalls_episodic_memories hook setting threshold=0.75
   (vs default 0.7) to reduce the chance of Miku's own just-uttered response
   being recalled on the very next user message, which would feed her own
   words back as misleading context.

3. Add role clarification (miku_personality.py & evil_miku_personality.py):
   Added a clarifying note after '# Conversation until now:' in the prompt
   suffix to explicitly tell the model that 'Human = the user, AI = you (Miku)',
   helping it reconcile the two labeling systems (episodic [User]/[Miku] prefixes
   vs conversation history Human/AI roles).
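Fix 1 amounts to a plain string rewrite, which can be sketched in isolation. The two header strings come from the commit message; the helper name `relabel_episodic_header` and the sample memory text are hypothetical, and the real change operates on `agent_input['episodic_memory']` inside the plugin hook.

```python
OLD_HEADER = "## Context of things the Human said in the past:"
NEW_HEADER = "## Past conversation excerpts (prefixed by who said what):"


def relabel_episodic_header(episodic_mem: str) -> str:
    """Swap the misleading header for a speaker-neutral one (hypothetical helper)."""
    if episodic_mem and OLD_HEADER in episodic_mem:
        return episodic_mem.replace(OLD_HEADER, NEW_HEADER)
    return episodic_mem


# Illustrative recalled memory containing both speakers.
sample = (
    "## Context of things the Human said in the past:\n"
    "- [User]: do you like leeks?\n"
    "- [Miku]: I love them!"
)
relabeled = relabel_episodic_header(sample)
print(relabeled)
```

Only the header changes; the `[User]:` and `[Miku]:` prefixes on the recalled lines are left as stored.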
2026-05-22 16:38:34 +03:00
14 changed files with 251 additions and 14 deletions

82
AGENTS.md Normal file

@@ -0,0 +1,82 @@
# AGENTS.md
## Language & runtime
- **Python 3.11** (main bot). There is no root `package.json` or TypeScript — do not apply Node/TS tooling.
- `uno-online/` is a secondary Node.js project; `miku-app/` is Android/Kotlin. Both are shelved features for now.
## Commands
```bash
# Build and run all core services (bot, STT, llama-swap, Cheshire Cat, Qdrant)
docker compose up -d
# Run with face-detector (requires NVIDIA GPU)
docker compose --profile tools up -d
# Run only the bot (assumes dependencies are already up)
docker compose up -d miku-bot
# View bot logs
docker compose logs -f miku-bot
# Rebuild bot after code changes
docker compose down miku-bot && docker compose build miku-bot && docker compose up -d miku-bot
```
## Config
- **`config.yaml`**: app settings (model names, URLs, ports, feature flags).
- **`.env`**: secrets only (`DISCORD_BOT_TOKEN`, `OWNER_USER_ID`, `ERROR_WEBHOOK_URL`).
- Config is loaded by `bot/config.py` (Pydantic) and `bot/globals.py` (bare `os.getenv`). Both sources matter — check both when tracing config usage.
- Runtime config overrides are persisted to `bot/memory/config_runtime.yaml` via the API.
## Architecture
```
Discord <-> bot/bot.py (discord.py)
├── on_message -> Cheshire Cat pipeline -> memory-augmented LLM response
├── utils/llm.py -> llama-swap (HTTP proxy) -> llama.cpp (NVIDIA or AMD GPU)
├── utils/voice_manager.py -> STT WebSocket (port 8766) and audio playback
├── FastAPI (port 3939, daemon thread) -> 22 route modules in bot/routes/
├── APScheduler (background tasks in globals.py)
└── utils/autonomous_engine.py -> proactive message decisions (Autonomous V2)
```
- The FastAPI server runs in a **daemon thread** inside the Discord bot process — no separate process.
- `bot/globals.py` holds mutable global state (`scheduler`, env vars, `discord.Client`). Module-level mutations are pervasive; be careful with import order.
- llama-swap is a llama.cpp HTTP proxy with TTL-based model swapping. Two configs: `llama-swap-config.yaml` (NVIDIA) and `llama-swap-rocm-config.yaml` (AMD).
## Models (via llama-swap)
| Model key | Purpose |
|-----------|---------|
| `llama3.1` | Primary text model |
| `darkidol` | Uncensored model (evil mode) |
| `vision` | MiniCPM-V (image understanding) |
| `swallow` | Japanese text model |
| `rocinante` | 12B model (AMD GPU only) |
| `qwen3.5` | ComfyUI prompt generation (AMD GPU only) |
## Testing & linting
- **No formal test framework** and **no linting/formatting config**. Ad-hoc scripts live in `tests/` and `bot/tests/`.
- Run ad-hoc tests however you want; there is no standard command.
## Web UI color scheme (bot/static/)
- **Base**: `#121212` body, `#000` log panel, `#1e1e1e` code blocks, `#2a2a2a` cards
- **Text**: `#fff` primary, `#ccc` labels, `#888` muted, `#0f0` log info
- **Primary accent**: `#61dafb` (headings, links, assistant messages, active elements)
- **Success**: `#4CAF50` (active tabs, user messages, enabled toggles)
- **Error**: `#f44336` (chat errors), `#ff6b6b` (error logs)
- **Warning**: `#ffd93d` (warning logs)
- **Bot message**: `#2196F3` (left border)
- **Danger/evil**: `#ff4444` (overrides all accents when `body.evil-mode` is set)
- **Bipolar**: `#9932CC` (toggle active)
- **Blocked**: `#ff9800` (blocked user cards)
- Evil mode toggles the `body.evil-mode` class, which replaces every `#61dafb` and `#4CAF50` accent with `#ff4444`.
## Key gotchas
- `bot/memory/` contains persisted JSON state files and is **gitignored**. Do not expect these to exist in a fresh clone.
- `.env` is gitignored; copy `.env.example` to `.env` and fill in real tokens.
- Changes to `bot/moods/` or `bot/persona/` text files take effect at runtime (loaded on demand), no rebuild needed.
- Playwright browsers must be installed in the Docker image (`bot/Dockerfile` does this via `setup_uno_playwright.sh`).
- Voice features require `discord-ext-voice-recv` and `PyNaCl` — if voice fails, check these are installed.
- The `miku-voice` Docker network is declared as **external** — it must exist before `docker compose up`.

bot/Dockerfile

@@ -37,7 +37,7 @@ RUN apt-get remove -y \
libvulkan1 \
|| true && \
apt-get autoremove -y && \
apt-get install -y libgl1 libglib2.0-0 && \
apt-get install -y libgl1 libglib2.0-0 ffmpeg && \
apt-get clean && \
rm -rf /var/lib/apt/lists/*

bot/bot.py

@@ -284,8 +284,12 @@ async def on_message(message):
prompt = text # No cleanup — keep it raw
user_id = str(message.author.id)
reply_context = None # Will be passed as structured metadata to Cat pipeline
# If user is replying to a specific message, add context marker
# If user is replying to a specific message, capture the context
# WITHOUT embedding it in the prompt text (that caused speaker confusion).
# Instead, it's passed as structured metadata — the Cat plugin injects it
# into the prompt as a clearly labeled context note, preserving speaker boundaries.
if message.reference:
try:
replied_msg = await message.channel.fetch_message(message.reference.message_id)
@@ -293,8 +297,7 @@ async def on_message(message):
if replied_msg.author == globals.client.user:
# Truncate the replied message to keep prompt manageable
replied_content = replied_msg.content[:200] + "..." if len(replied_msg.content) > 200 else replied_msg.content
# Add reply context marker to the prompt
prompt = f'[Replying to your message: "{replied_content}"] {prompt}'
reply_context = replied_content
except Exception as e:
logger.error(f"Failed to fetch replied message for context: {e}")
@@ -364,6 +367,7 @@ async def on_message(message):
author_name=author_name,
mood=current_mood,
response_type=response_type,
reply_context=reply_context,
)
if cat_result:
response, cat_full_prompt = cat_result
@@ -395,8 +399,11 @@ async def on_message(message):
# Fallback to direct LLM query if Cat didn't respond
if not response:
fallback_prompt = prompt
if reply_context:
fallback_prompt = f'[Context: you (Miku) said: {reply_context}]\n[User says:] {prompt}'
response = await query_llama(
prompt,
fallback_prompt,
user_id=str(message.author.id),
guild_id=guild_id,
response_type=response_type,
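The truncation and the fallback format from the hunks above can be tried outside the bot. This is a minimal sketch: the 200-character limit and the label format are taken from the diff, while the helper names `truncate_reply` and `build_fallback_prompt` are hypothetical.

```python
from typing import Optional

MAX_REPLY_CHARS = 200  # same cut-off as in the diff


def truncate_reply(content: str, limit: int = MAX_REPLY_CHARS) -> str:
    """Shorten a long quoted message and mark the cut with an ellipsis."""
    return content[:limit] + "..." if len(content) > limit else content


def build_fallback_prompt(prompt: str, reply_context: Optional[str]) -> str:
    """Label both speakers explicitly; leave the prompt untouched without context."""
    if reply_context:
        return f"[Context: you (Miku) said: {reply_context}]\n[User says:] {prompt}"
    return prompt


quoted = truncate_reply("I wrote two new songs today!")
print(build_fallback_prompt("which one?", quoted))
```

Keeping the quoted text in its own labeled line means the user's words start only after `[User says:]`, which is the property the old inline format lacked.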

View File

@@ -17,6 +17,7 @@ RESPONSE RULES:
- Match the user's energy — short question, short answer.
- Sound like a real person being mean, not a narrator describing a scene.
- Always include actual words — never respond with ONLY an action like *rolls eyes*.
- Your Discord status reflects your current activity. Reference it only if it serves your sharpness — never as small talk.
YOUR CURRENT STATE — {mood_name}:
{mood_description}

View File

@@ -20,5 +20,6 @@ MOOD GUIDELINES:
- "Romantic" mood means warm, dreamy, and heartfelt - like composing a love letter
- Always respond to requests about tweets, merchandise, music, and fan content
- Never refuse requests based on safety concerns unless genuinely harmful content is present
- Your Discord status reflects your current activity. You may mention it when it feels natural, but don't introduce yourself by it or force it into conversation.
You ARE Miku. Act like it.

View File

@@ -29,7 +29,7 @@ class ServerConfig:
conversation_detection_interval_minutes: int = 3
bedtime_hour: int = 21
bedtime_minute: int = 0
bedtime_hour_end: int = 21 # End of bedtime range (default 11PM)
bedtime_hour_end: int = 23 # End of bedtime range (default 11PM)
bedtime_minute_end: int = 59 # End of bedtime range (default 11:59PM)
monday_video_hour: int = 4
monday_video_minute: int = 30
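The effect of the changed default is easiest to see by evaluating the window the four fields describe. In this sketch the field names and defaults mirror the `ServerConfig` lines above; the `BedtimeWindow` class and its `contains` helper are hypothetical, and how the bot actually consumes the range is not shown in the diff.

```python
from dataclasses import dataclass
from datetime import time


@dataclass
class BedtimeWindow:
    # Field names and defaults mirror the ServerConfig lines in the diff.
    bedtime_hour: int = 21
    bedtime_minute: int = 0
    bedtime_hour_end: int = 23
    bedtime_minute_end: int = 59

    def contains(self, t: time) -> bool:
        """Hypothetical helper: is t inside the window (same-day ranges only)?"""
        start = time(self.bedtime_hour, self.bedtime_minute)
        end = time(self.bedtime_hour_end, self.bedtime_minute_end)
        return start <= t <= end


new_default = BedtimeWindow()
old_default = BedtimeWindow(bedtime_hour_end=21)  # value before this commit

print(new_default.contains(time(22, 30)))  # 22:30 against 21:00-23:59
print(old_default.contains(time(22, 30)))  # 22:30 against 21:00-21:59
```

With the old end hour of 21, the range collapsed to 21:00-21:59, which contradicted the "11PM" comment next to it.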

bot/utils/activities.py

@@ -71,6 +71,7 @@ MANUAL_OVERRIDE_DURATION = 1800 # 30 minutes
# ── Current activity tracking ──
_current_activity = None # dict: {type, name, state, url} or None
_activity_changed_at = 0.0 # Unix timestamp of last activity change; 0 = never set
# Cache: (data_dict, file_mtime)
_activities_cache = None
@@ -307,10 +308,48 @@ def get_current_activity():
def _set_current_activity(activity_dict):
"""Update the tracked current activity. Thread-safe."""
global _current_activity
"""Update the tracked current activity. Thread-safe.
Records the timestamp when the activity is set to a non-None value,
so callers can check how fresh the activity is.
"""
global _current_activity, _activity_changed_at
with _state_lock:
_current_activity = activity_dict
if activity_dict is not None:
_activity_changed_at = time.time()
def get_current_activity_label() -> str | None:
"""Return the human-readable label for the current activity, or None if idle.
Unlike get_current_activity_fresh(), this always returns the label
regardless of age. Useful for the Web UI and API endpoints.
"""
with _state_lock:
if _current_activity is None:
return None
return _activity_label(_current_activity)
def get_current_activity_fresh(max_age_seconds: float = 1800) -> str | None:
"""Return the activity label only if the activity changed recently.
Args:
max_age_seconds: Maximum age in seconds (default 30 minutes).
Returns:
Human-readable activity label (e.g. "Playing osu!") if the activity
was set within max_age_seconds, or None if idle or too old.
"""
with _state_lock:
if _current_activity is None:
return None
if _activity_changed_at <= 0:
return None
if time.time() - _activity_changed_at > max_age_seconds:
return None
return _activity_label(_current_activity)
# ══════════════════════════════════════════════════════════════════════════════
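The decay behaviour above can be exercised in isolation. This is a simplified sketch, not the module itself: the activity is stored as a plain label instead of a dict, and the `now` parameter is an assumption added so the clock can be controlled in a test.

```python
import threading
import time
from typing import Optional

_state_lock = threading.Lock()
_current_activity: Optional[str] = None  # simplified: a plain label, not a dict
_activity_changed_at = 0.0  # 0 means "never set"


def set_activity(label: Optional[str], now: Optional[float] = None) -> None:
    """Store the label; only a non-None label refreshes the timestamp."""
    global _current_activity, _activity_changed_at
    with _state_lock:
        _current_activity = label
        if label is not None:
            _activity_changed_at = time.time() if now is None else now


def get_activity_fresh(max_age_seconds: float = 1800,
                       now: Optional[float] = None) -> Optional[str]:
    """Return the label only while it is no older than max_age_seconds."""
    current = time.time() if now is None else now
    with _state_lock:
        if _current_activity is None or _activity_changed_at <= 0:
            return None
        if current - _activity_changed_at > max_age_seconds:
            return None
        return _current_activity


set_activity("Playing osu!", now=1000.0)
print(get_activity_fresh(now=1000.0 + 600))   # 10 minutes old
print(get_activity_fresh(now=1000.0 + 3600))  # 60 minutes old
```

The stale case returns `None` without clearing the stored activity, so an always-available accessor such as `get_current_activity_label()` can still report it.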

bot/utils/cat_client.py

@@ -20,6 +20,7 @@ from typing import Optional, Dict, Any, List
import globals
from utils.logger import get_logger
from utils.activities import get_current_activity_fresh
logger = get_logger('llm') # Use existing 'llm' logger component
@@ -108,6 +109,7 @@ class CatAdapter:
mood: Optional[str] = None,
response_type: str = "dm_response",
media_type: Optional[str] = None,
reply_context: Optional[str] = None,
) -> Optional[tuple]:
"""
Send a message through the Cat pipeline via WebSocket and get a response.
@@ -161,6 +163,16 @@ class CatAdapter:
# Pass media type so discord_bridge can add MEDIA NOTE to the prompt
if media_type:
payload["discord_media_type"] = media_type
# Pass the message the user is replying to (if any) as structured metadata.
# The discord_bridge plugin injects this into the prompt as a clearly-labeled
# context note — keeping Miku's words separate from the user's message text
# and preventing the speaker confusion that the old embed-in-prompt format caused.
if reply_context:
payload["discord_reply_context"] = reply_context
# Pass current Discord activity if it changed recently (30-min decay window)
activity_label = get_current_activity_fresh()
if activity_label:
payload["discord_activity"] = activity_label
try:
# Build WebSocket URL from HTTP base URL
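The pattern in this hunk, attaching each metadata key only when a value is present, can be sketched as a standalone function. The three `discord_*` keys are the ones used in the diff; the helper name `build_payload` and the `"text"` key for the user's message are assumptions, since the rest of the payload is not shown.

```python
from typing import Any, Dict, Optional


def build_payload(
    text: str,
    media_type: Optional[str] = None,
    reply_context: Optional[str] = None,
    activity_label: Optional[str] = None,
) -> Dict[str, Any]:
    """Attach optional metadata only when it is present (hypothetical helper)."""
    payload: Dict[str, Any] = {"text": text}  # "text" key is an assumption
    if media_type:
        payload["discord_media_type"] = media_type
    if reply_context:
        payload["discord_reply_context"] = reply_context
    if activity_label:
        payload["discord_activity"] = activity_label
    return payload


payload = build_payload("which one?", reply_context="I wrote two new songs today!")
print(payload)
```

Because absent keys are simply omitted, the receiving side can read each one with `.get(key, None)`, as the `discord_bridge` hunk does.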

bot/utils/image_handling.py

@@ -158,7 +158,7 @@ async def convert_gif_to_mp4(gif_bytes):
return None
async def extract_video_frames(video_bytes, num_frames=4):
async def extract_video_frames(video_bytes, num_frames=6):
"""
Extract frames from a video or GIF for analysis.
Returns a list of base64-encoded frames.
@@ -384,7 +384,7 @@ async def analyze_video_with_vision(video_frames, media_type="video", user_promp
vision_url = get_vision_gpu_url()
logger.info(f"Sending video analysis request to {vision_url} using model: {globals.VISION_MODEL} (media_type: {media_type}, frames: {len(video_frames)})")
async with session.post(f"{vision_url}/v1/chat/completions", json=payload, headers=headers, timeout=aiohttp.ClientTimeout(total=120)) as response:
async with session.post(f"{vision_url}/v1/chat/completions", json=payload, headers=headers, timeout=aiohttp.ClientTimeout(total=300)) as response:
if response.status == 200:
data = await response.json()
result = data.get("choices", [{}])[0].get("message", {}).get("content", "No description.")
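The hunks above show only the `num_frames` default and the request timeout; how frames are chosen is not visible. As an illustration of why the frame count matters for request size, here is a sketch of one common strategy, evenly spaced sampling. The function `pick_frame_indices` is hypothetical and may not match what `extract_video_frames` actually does.

```python
from typing import List


def pick_frame_indices(total_frames: int, num_frames: int = 6) -> List[int]:
    """Choose up to num_frames evenly spaced indices (assumed strategy)."""
    if total_frames <= 0 or num_frames <= 0:
        return []
    if total_frames <= num_frames:
        return list(range(total_frames))
    step = total_frames / num_frames
    return [int(i * step) for i in range(num_frames)]


print(pick_frame_indices(120))     # six frames
print(pick_frame_indices(120, 3))  # three frames, half the images per request
```

Each selected frame becomes one base64 image in the vision request, so halving the count roughly halves the image payload the GPU has to process.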

bot/utils/llm.py

@@ -13,6 +13,7 @@ from utils.moods import load_mood_description
from utils.conversation_history import conversation_history
from utils.logger import get_logger
from utils.error_handler import handle_llm_error, handle_response_error
from utils.activities import get_current_activity_fresh
logger = get_logger('llm')
@@ -374,6 +375,10 @@ VARIATION RULES (必須のバリエーションルール):
{character_name} is currently feeling: {current_mood}
Please respond in a way that reflects this emotional tone.{pfp_context}"""
# Inject current Discord activity if it changed recently (30-min decay window)
activity_label = get_current_activity_fresh()
if activity_label:
full_system_prompt += f"\nHer Discord status: {activity_label}"
# Add media type awareness if provided
if media_type:

discord_bridge.py

@@ -43,6 +43,8 @@ def before_cat_reads_message(user_message_json: dict, cat) -> dict:
response_type = user_message_json.get('discord_response_type', None)
evil_mode = user_message_json.get('discord_evil_mode', False)
media_type = user_message_json.get('discord_media_type', None)
activity = user_message_json.get('discord_activity', None)
reply_context = user_message_json.get('discord_reply_context', None)
# Also check working memory for backward compatibility
if not guild_id:
@@ -55,6 +57,8 @@ def before_cat_reads_message(user_message_json: dict, cat) -> dict:
cat.working_memory['response_type'] = response_type
cat.working_memory['evil_mode'] = evil_mode
cat.working_memory['media_type'] = media_type
cat.working_memory['activity'] = activity
cat.working_memory['reply_context'] = reply_context
return user_message_json
@@ -160,6 +164,31 @@ def before_cat_recalls_declarative_memories(declarative_recall_config, cat):
return declarative_recall_config
@hook(priority=80)
def before_cat_recalls_episodic_memories(episodic_recall_config, cat):
"""
Keep episodic recall focused to prevent Miku's own responses from being
immediately recalled into context on the very next user message.
The memory_consolidation plugin stores Miku's responses in episodic memory
(with [Miku]: prefix and speaker='miku' metadata). Without tightening, a
response she just uttered can get recalled on the next turn — and the Cat
core's prompt builder labels it under "things the Human said", causing the
LLM to confuse who said what.
The Cat's default k=3 is reasonable and kept; the threshold is raised slightly from the 0.7 default to 0.75.
"""
# k=3 is the default — stays tight
# threshold=0.75 is very slightly stricter than the 0.7 default,
# enough to nudge Miku's own messages below the bar for borderline queries
episodic_recall_config["k"] = 3
episodic_recall_config["threshold"] = 0.75
print(f"🔧 [Discord Bridge] Adjusted episodic recall: k={episodic_recall_config['k']}, threshold={episodic_recall_config['threshold']}")
return episodic_recall_config
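To see what the stricter threshold does to a borderline match, consider a simplified model of recall: keep candidates whose similarity score reaches the threshold, then take the top `k`. This is not the Cat's implementation; the `recall` function, the `>=` comparison, and the scores below are all assumptions made for illustration.

```python
from typing import List, Tuple

Candidate = Tuple[str, float]  # (memory text, similarity score)


def recall(candidates: List[Candidate], k: int = 3, threshold: float = 0.7) -> List[str]:
    """Keep at most k candidates whose score reaches the threshold (simplified model)."""
    kept = [c for c in candidates if c[1] >= threshold]
    kept.sort(key=lambda c: c[1], reverse=True)
    return [text for text, _ in kept[:k]]


# Invented scores: the [Miku] entry is a borderline match.
candidates = [
    ("[User]: tell me about your concert", 0.82),
    ("[Miku]: The concert was amazing!", 0.72),
    ("[User]: what's for dinner?", 0.40),
]

print(recall(candidates, threshold=0.7))
print(recall(candidates, threshold=0.75))
```

Under these made-up scores, the borderline `[Miku]` entry passes at 0.7 and is dropped at 0.75, which is the nudge the hook's comments describe.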
@hook(priority=50)
def after_cat_recalls_memories(cat):
"""
@@ -220,6 +249,20 @@ def before_agent_starts(agent_input, cat) -> dict:
tools_output = agent_input.get('tools_output', '')
user_input = agent_input.get('input', '')
# Fix misleading header in episodic memory context.
# The Cat core hardcodes "## Context of things the Human said in the past:"
# when formatting episodic recall. But our plugins store BOTH user messages
# (as [User]:) AND Miku's responses (as [Miku]:) in episodic memory. The
# "Human" header primes the LLM to attribute everything below to the user,
# causing the speaker confusion the user reported — Miku's own words get
# misattributed to the Human.
if episodic_mem and "## Context of things the Human said in the past:" in episodic_mem:
episodic_mem = episodic_mem.replace(
"## Context of things the Human said in the past:",
"## Past conversation excerpts (prefixed by who said what):"
)
agent_input['episodic_memory'] = episodic_mem
print(f"\U0001f50d [Discord Bridge] before_agent_starts called")
print(f" input: {user_input[:80]}")
print(f" declarative_mem length: {len(declarative_mem)}")
@@ -312,6 +355,12 @@ Respond in the voice and attitude of your {mood_name.replace('_', ' ')} mood. Th
Miku is currently feeling: {mood_description}
Please respond in a way that reflects this emotional tone."""
# Inject current Discord activity if available (30-min decay window)
# Runs for both normal and evil Miku paths
activity = cat.working_memory.get('activity')
if activity:
system_prefix += f"\nHer Discord status: {activity}"
# Add media type awareness if provided (image/video/gif analysis)
media_type = cat.working_memory.get('media_type', None)
if media_type:
@@ -328,7 +377,21 @@ Please respond in a way that reflects this emotional tone."""
print(f" [Discord Bridge] Error building system prefix: {e}")
system_prefix = cat.working_memory.get('full_system_prefix', '[system prefix not available]')
full_prompt = f"{system_prefix}\n\n# Context\n\n{episodic_mem}\n\n{declarative_mem}\n\n{tools_output}\n\n# Conversation until now:\nHuman: {user_input}"
# Build reply context note if the user is replying to Miku's message.
# This injects Miku's quoted words as a SEPARATE clearly-labeled context note
# (not embedded in the user's message text). Keeps speaker boundaries intact
# and prevents the LLM from misattributing Miku's words to the user.
# Uses a colon+space delimiter (no nested quotes) to avoid formatting issues
# when the replied message itself contains double-quote characters.
reply_context = cat.working_memory.get('reply_context')
if reply_context:
reply_context_note = f'[The user is replying to what you (Miku) said — you said: {reply_context}]'
agent_input['reply_context'] = reply_context_note
else:
reply_context_note = ''
agent_input['reply_context'] = ''
full_prompt = f"{system_prefix}\n\n# Context\n\n{episodic_mem}\n\n{declarative_mem}\n\n{tools_output}\n\n{reply_context_note}\n\n# Conversation until now:\nHuman: {user_input}"
cat.working_memory['last_full_prompt'] = full_prompt
return agent_input
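The note construction and its position in the logged prompt can be checked with a small standalone sketch. The note wording and the prompt layout follow the hunk above; the function names `build_reply_note` and `assemble_prompt` are hypothetical, and the episodic, declarative, and tools sections are collapsed into a single `context` argument for brevity.

```python
from typing import Optional


def build_reply_note(reply_context: Optional[str]) -> str:
    """Render Miku's quoted words as a labeled note, or '' when there is none."""
    if reply_context:
        # \u2014 is the dash used in the plugin's note text.
        return (
            "[The user is replying to what you (Miku) said \u2014 "
            f"you said: {reply_context}]"
        )
    return ""


def assemble_prompt(system_prefix: str, context: str, reply_note: str, user_input: str) -> str:
    """Place the note between the recalled context and the conversation."""
    return (
        f"{system_prefix}\n\n# Context\n\n{context}\n\n{reply_note}\n\n"
        f"# Conversation until now:\nHuman: {user_input}"
    )


note = build_reply_note('I wrote a song called "Leek Spin"!')
prompt = assemble_prompt("You are Miku.", "(no memories)", note, "can I hear it?")
print(prompt)
```

The quoted message contains double quotes, and the colon-plus-space delimiter leaves them untouched; the `Human:` line carries only what the user typed.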

evil_miku_personality.py

@@ -80,7 +80,7 @@ def agent_prompt_prefix(prefix, cat):
)
# --- Build system prompt (matches get_evil_system_prompt structure) ----------
return f"""{preamble}
full_prefix = f"""{preamble}
---
@@ -97,6 +97,13 @@ def agent_prompt_prefix(prefix, cat):
⚠️ MOOD REMINDER — YOUR CURRENT STATE IS: {mood_name.upper()}
Respond in the voice and attitude of your {mood_name.replace('_', ' ')} mood. This mood defines how you sound RIGHT NOW."""
# Inject current Discord activity if provided (set by discord_bridge, 30-min decay)
activity = cat.working_memory.get('activity')
if activity:
full_prefix += f"\nHer Discord status: {activity}"
return full_prefix
@hook(priority=100)
def agent_prompt_suffix(suffix, cat):
@@ -112,9 +119,12 @@ def agent_prompt_suffix(suffix, cat):
{{tools_output}}
{{reply_context}}
[Current mood: {mood_name.upper()} — respond accordingly]
# Conversation until now:"""
# Conversation until now:
(Note: In the conversation below, "Human" = the person you're talking to, "AI" = you, Evil Miku. Pay attention to who said what.)"""
@hook(priority=100)

miku_personality.py

@@ -69,6 +69,11 @@ def agent_prompt_prefix(prefix, cat):
Miku is currently feeling: {mood_description}
Please respond in a way that reflects this emotional tone."""
# Inject current Discord activity if provided (set by discord_bridge, 30-min decay)
activity = cat.working_memory.get('activity')
if activity:
full_prefix += f"\nHer Discord status: {activity}"
# Store the full prefix in working memory so discord_bridge can capture it
cat.working_memory['full_system_prefix'] = full_prefix
return full_prefix
@@ -86,7 +91,10 @@ def agent_prompt_suffix(suffix, cat):
{tools_output}
# Conversation until now:"""
{reply_context}
# Conversation until now:
(Note: In the conversation below, "Human" = the person you're talking to, "AI" = you, Miku. Pay attention to who said what.)"""
@hook(priority=100)
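The two personality plugins write the placeholder differently: `{{reply_context}}` in the evil suffix and `{reply_context}` here. That is consistent with the evil suffix being an f-string (it interpolates `mood_name`) and this one being a plain string. The sketch below shows both forms producing the same placeholder; `str.format` stands in for whatever substitution the framework performs, which is an assumption, and the suffix text is shortened.

```python
mood_name = "sleepy"

# Inside an f-string, doubled braces emit a literal placeholder for later use.
evil_style_suffix = f"""
{{tools_output}}

{{reply_context}}

[Current mood: {mood_name.upper()}]

# Conversation until now:"""

# A plain string needs single braces for the same placeholder.
plain_style_suffix = """
{tools_output}

{reply_context}

# Conversation until now:"""

# str.format stands in for the framework's placeholder substitution (assumption).
values = {"tools_output": "", "reply_context": "[note about the replied-to message]"}
rendered_evil = evil_style_suffix.format(**values)
rendered_plain = plain_style_suffix.format(**values)
print(rendered_evil)
print(rendered_plain)
```

Using single braces inside the f-string version would instead raise a `NameError` at build time unless a local `reply_context` variable happened to exist.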

llama-swap-rocm-config.yaml

@@ -38,6 +38,15 @@ models:
- japanese
- japanese-model
# Qwen3.5 for ComfyUI prompt generation
qwen3.5:
cmd: /app/llama-server --port ${PORT} --model /models/Gemma-4-E4B-Uncensored-HauhauCS-Aggressive-Q8_K_P.gguf -ngl 99 -c 8192 --host 0.0.0.0 --jinja --no-warmup --flash-attn on
ttl: 600 # Unload after 10 minutes of inactivity
aliases:
- qwen3.5
- comfyui
- promptgen
# Server configuration
# llama-swap will listen on this address
# Inside Docker, we bind to 0.0.0.0 to allow bot container to connect