Miku Voice Channel Chat Feature - Implementation Plan
Executive Summary
This document outlines a comprehensive plan to implement real-time voice channel functionality for Miku, enabling her to:
- Join Discord voice channels and speak using her TTS pipeline
- Stream text tokens from llama-swap's LLM directly to TTS for maximum real-time responsiveness
- Accept text-based prompts from a designated text channel as a temporary input method
- Manage system resources strictly to maintain performance on constrained hardware
⚠️ Critical Resource Management Requirements
Due to tight hardware constraints, multiple bot features must be disabled during voice sessions:
| Feature | Action During Voice | Reason |
|---|---|---|
| Vision Model | ❌ Blocked | Frees GTX 1660 for TTS only |
| Image Generation | ❌ Blocked | Prevents ComfyUI GPU usage |
| Bipolar Mode | ❌ Disabled | Prevents dual-personality interactions |
| Profile Pictures | 🔒 Locked | Avoids Discord API overhead |
| Autonomous Engine | ⏸️ Paused | Reduces inference load |
| Scheduled Events | ⏸️ Paused | Prevents background tasks |
| Figurine Notifier | ⏸️ Paused | Prevents background tasks |
| Text Channels | 📦 Queued | Messages processed after session |
GPU Allocation During Voice:
- GTX 1660: Soprano TTS only (no LLM, no vision)
- AMD RX 6800: RVC API + llama-swap-amd text model (~10-12GB)
1. System Architecture Overview
1.1 Current TTS Pipeline (soprano_to_rvc)
Components:
- Soprano TTS Server (GTX 1660 + CUDA)
  - Runs in `miku-soprano-tts` container
  - Listens on ZMQ port 5555 (internal network)
  - Converts text → 32kHz audio
- RVC API Server (AMD RX 6800 + ROCm)
  - Runs in `miku-rvc-api` container
  - Exposes WebSocket endpoint: `ws://localhost:8765/ws/stream`
  - Exposes HTTP endpoint: `http://localhost:8765`
  - Converts Soprano output → Miku voice (48kHz)
WebSocket Protocol (/ws/stream):
Client → Server: {"token": "Hello", "pitch_shift": 0}
Client → Server: {"token": " world"}
Client → Server: {"token": "!", "flush": false}
Server → Client: [binary PCM float32 audio @ 48kHz]
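For reference, a minimal client for this protocol might look like the sketch below (using the `websockets` library; the standalone `{"flush": true}` message mirrors the flush used in section 4.3, and the read-until-quiet loop is an assumption since the protocol defines no end-of-stream marker):
import asyncio
import json
import websockets

async def speak_tokens(tokens, url="ws://localhost:8765/ws/stream"):
    """Sketch: send text tokens in, collect binary float32 PCM frames out."""
    chunks = []
    async with websockets.connect(url) as ws:
        for token in tokens:
            await ws.send(json.dumps({"token": token, "pitch_shift": 0}))
        await ws.send(json.dumps({"flush": True}))  # synthesize any buffered text
        try:
            # No end-of-stream marker is defined, so read until the server goes quiet
            while True:
                chunks.append(await asyncio.wait_for(ws.recv(), timeout=5))
        except asyncio.TimeoutError:
            pass
    return b"".join(chunks)

# asyncio.run(speak_tokens(["Hello", " world", "!"]))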
1.2 Current LLM Infrastructure
Text Models:
- llama-swap (GTX 1660) - Port 8090
  - Models: llama3.1, darkidol, vision
  - Supports streaming via the `/completion` endpoint with `stream=true`
- llama-swap-amd (AMD RX 6800) - Port 8091
  - Models: llama3.1, darkidol (no vision)
  - Supports streaming via the `/completion` endpoint with `stream=true`
1.3 Current Discord Bot Architecture
Main Components:
- `bot/bot.py` - Main Discord client event handler
- `bot/globals.py` - Global state and configuration
- `bot/utils/llm.py` - LLM query interface
- `bot/server_manager.py` - Multi-server configuration management
Key Features:
- Multi-server support with per-server mood/config
- DM support with separate mood system
- Evil mode (alternate personality with uncensored model)
- Bipolar mode (both personalities can interact)
- Vision model integration for images/videos
2. Voice Channel Feature Requirements
2.1 Core Functionality
- Voice Channel Connection
  - Miku can join a voice channel via command (e.g., `!miku join`)
  - Miku can leave via command (e.g., `!miku leave`)
  - Only one voice session active at a time (resource constraint)
- Real-Time Text-to-Speech
  - Stream LLM tokens directly to TTS WebSocket
  - Send audio chunks to Discord voice as they're generated
  - Minimize latency between token generation and audio playback
- Text-Based Input (Temporary)
  - Designated text channel for prompting Miku (e.g., `#miku-voice-prompt`)
  - Messages in this channel trigger voice responses
  - Only active when Miku is in a voice channel
- Resource Management
  - GPU Switching: Use llama-swap-amd (RX 6800) exclusively for text generation
  - Vision Model Blocking: Prevent the vision model from loading during a voice session
  - Text Channel Pausing: Pause/queue regular text channel inference
  - Cleanup: Properly release resources when the voice session ends
2.2 User Experience Goals
- Low Latency: First audio chunk should play within 1-2 seconds of prompt
- Natural Speech: Sentence boundaries should be respected for natural pauses
- Reliable: Graceful error handling and recovery
- Non-Intrusive: Shouldn't break existing bot functionality
3. Resource Management Strategy
3.1 Hardware Constraints
Available Resources:
- GTX 1660 (6GB VRAM): Currently runs llama-swap + Soprano TTS
- AMD RX 6800 (16GB VRAM): Currently runs llama-swap-amd + RVC API
During Voice Session:
- GTX 1660: Dedicated to Soprano TTS only (no LLM)
- AMD RX 6800: Split between RVC API + llama-swap-amd text model
- RVC uses ~4-6GB VRAM
- Text model uses ~5-6GB VRAM
- Total: ~10-12GB (within 16GB limit)
Features That Must Be Disabled During Voice Session:
Due to resource constraints and to ensure voice chat performance, the following features must be completely disabled while Miku is in a voice channel:
- Vision Model Loading - Prevents GTX 1660 from loading vision models (keeps TTS running)
- Image Generation (ComfyUI) - Blocks draw commands with custom message
- Bipolar Mode Interactions - Prevents Miku/Evil Miku arguments and dialogues
- Profile Picture Switching - Locks avatar changes during session
- Autonomous Engine - Pauses autonomous message generation
- Scheduled Events - Pauses all scheduled jobs (e.g., Monday videos)
- Figurine Notifier - Pauses figurine availability notifications
- Text Channel Inference - Queues regular text messages for later processing
User-Facing Messages for Blocked Features:
- Image generation: "🎤 I can't draw right now, I'm talking in voice chat! Ask me again after I leave the voice channel."
- Vision requests: "🎤 I can't look at images or videos right now, I'm talking in voice chat! Send it again after I leave."
- Bipolar triggers: (Silent - no argument starts)
- Profile changes: (Silent - no avatar updates)
3.2 Resource Locking Mechanism
Implementation via VoiceSessionManager singleton:
class VoiceSessionManager:
def __init__(self):
self.active_session = None # VoiceSession instance or None
self.session_lock = asyncio.Lock()
async def start_session(self, guild_id, voice_channel, text_channel):
"""Start voice session with resource locks"""
async with self.session_lock:
if self.active_session:
raise Exception("Voice session already active")
# 1. Switch to AMD GPU for text inference
await self._switch_to_amd_gpu()
# 2. Block vision model loading
await self._block_vision_model()
# 3. Pause text channel inference (queue messages)
await self._pause_text_channels()
# 4. Disable bipolar mode interactions (Miku/Evil Miku arguments)
await self._disable_bipolar_mode()
# 5. Disable profile picture switching
await self._disable_profile_picture_switching()
# 6. Disable image generation (ComfyUI)
await self._disable_image_generation()
# 7. Pause autonomous engine
await self._pause_autonomous_engine()
# 8. Pause scheduled events
await self._pause_scheduled_events()
# 9. Pause figurine notifier
await self._pause_figurine_notifier()
# 10. Create voice session
self.active_session = VoiceSession(guild_id, voice_channel, text_channel)
await self.active_session.connect()
async def end_session(self):
"""End voice session and release resources"""
async with self.session_lock:
if not self.active_session:
return
# 1. Disconnect from voice
await self.active_session.disconnect()
# 2. Resume text channel inference
await self._resume_text_channels()
# 3. Unblock vision model
await self._unblock_vision_model()
# 4. Re-enable bipolar mode interactions
await self._enable_bipolar_mode()
# 5. Re-enable profile picture switching
await self._enable_profile_picture_switching()
# 6. Re-enable image generation
await self._enable_image_generation()
# 7. Resume autonomous engine
await self._resume_autonomous_engine()
# 8. Resume scheduled events
await self._resume_scheduled_events()
# 9. Resume figurine notifier
await self._resume_figurine_notifier()
# 10. Restore original GPU (optional)
# Keep AMD for now to avoid extra switching
self.active_session = None
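Elsewhere in this plan the manager is imported as a ready-made instance (`from utils.voice_manager import voice_manager`), so `voice_manager.py` should expose a module-level singleton:
# At the bottom of bot/utils/voice_manager.py
voice_manager = VoiceSessionManager()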
3.3 Detailed Resource Lock Methods
Each resource lock/unlock requires specific implementation:
3.3.1 Text Channel Pausing
async def _pause_text_channels(self):
"""Queue text messages instead of processing during voice session"""
globals.VOICE_SESSION_ACTIVE = True
globals.TEXT_MESSAGE_QUEUE = []
logger.info("Text channels paused (messages will be queued)")
async def _resume_text_channels(self):
"""Process queued messages after voice session"""
globals.VOICE_SESSION_ACTIVE = False
queued_count = len(globals.TEXT_MESSAGE_QUEUE)
logger.info(f"Resuming text channels, processing {queued_count} queued messages")
# Process queue in background task
asyncio.create_task(self._process_message_queue())
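The `_process_message_queue` task referenced above is not defined in this plan; a minimal sketch, assuming queued items are raw `discord.Message` objects and that a `handle_text_message` re-entry point into the normal text pipeline exists (hypothetical name):
async def _process_message_queue(self):
    """Drain messages queued during the voice session, oldest first."""
    while globals.TEXT_MESSAGE_QUEUE:
        message = globals.TEXT_MESSAGE_QUEUE.pop(0)
        try:
            await handle_text_message(message)  # hypothetical normal-pipeline entry point
        except Exception as e:
            logger.error(f"Failed to process queued message: {e}")
        await asyncio.sleep(1)  # pace responses to avoid Discord rate limits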
3.3.2 Bipolar Mode Disabling
async def _disable_bipolar_mode(self):
"""Prevent Miku/Evil Miku arguments during voice session"""
from utils.bipolar_mode import pause_bipolar_interactions
pause_bipolar_interactions()
logger.info("Bipolar mode interactions disabled")
async def _enable_bipolar_mode(self):
"""Re-enable Miku/Evil Miku arguments after voice session"""
from utils.bipolar_mode import resume_bipolar_interactions
resume_bipolar_interactions()
logger.info("Bipolar mode interactions re-enabled")
3.3.3 Profile Picture Switching
async def _disable_profile_picture_switching(self):
"""Lock profile picture during voice session"""
from utils.profile_picture_manager import profile_picture_manager
profile_picture_manager.lock_switching()
logger.info("Profile picture switching disabled")
async def _enable_profile_picture_switching(self):
"""Unlock profile picture after voice session"""
from utils.profile_picture_manager import profile_picture_manager
profile_picture_manager.unlock_switching()
logger.info("Profile picture switching re-enabled")
3.3.4 Image Generation Blocking
async def _disable_image_generation(self):
"""Block ComfyUI image generation during voice session"""
globals.IMAGE_GENERATION_BLOCKED = True
globals.IMAGE_GENERATION_BLOCK_MESSAGE = (
"🎤 I can't draw right now, I'm talking in voice chat! "
"Ask me again after I leave the voice channel."
)
logger.info("Image generation disabled")
async def _enable_image_generation(self):
"""Re-enable image generation after voice session"""
globals.IMAGE_GENERATION_BLOCKED = False
globals.IMAGE_GENERATION_BLOCK_MESSAGE = None
logger.info("Image generation re-enabled")
3.3.5 Autonomous Engine Pausing
async def _pause_autonomous_engine(self):
"""Pause autonomous message generation during voice session"""
from utils.autonomous import pause_autonomous_system
pause_autonomous_system()
logger.info("Autonomous engine paused")
async def _resume_autonomous_engine(self):
"""Resume autonomous message generation after voice session"""
from utils.autonomous import resume_autonomous_system
resume_autonomous_system()
logger.info("Autonomous engine resumed")
3.3.6 Scheduled Events Pausing
async def _pause_scheduled_events(self):
"""Pause all scheduled jobs during voice session"""
globals.scheduler.pause()
logger.info("Scheduled events paused")
async def _resume_scheduled_events(self):
"""Resume scheduled jobs after voice session"""
globals.scheduler.resume()
logger.info("Scheduled events resumed")
3.3.7 Figurine Notifier Pausing
async def _pause_figurine_notifier(self):
"""Pause figurine notifications during voice session"""
# Assuming figurine notifier is a scheduled job
try:
globals.scheduler.pause_job('figurine_notifier')
logger.info("Figurine notifier paused")
except Exception as e:
logger.warning(f"Could not pause figurine notifier: {e}")
async def _resume_figurine_notifier(self):
"""Resume figurine notifications after voice session"""
try:
globals.scheduler.resume_job('figurine_notifier')
logger.info("Figurine notifier resumed")
except Exception as e:
logger.warning(f"Could not resume figurine notifier: {e}")
3.3.8 Vision Model Blocking
async def _block_vision_model(self):
"""Prevent vision model from loading during voice session"""
globals.VISION_MODEL_BLOCKED = True
logger.info("Vision model blocked")
async def _unblock_vision_model(self):
"""Allow vision model to load after voice session"""
globals.VISION_MODEL_BLOCKED = False
logger.info("Vision model unblocked")
3.3.9 GPU Switching
async def _switch_to_amd_gpu(self):
"""Switch text inference to AMD GPU"""
gpu_state_file = os.path.join("memory", "gpu_state.json")
with open(gpu_state_file, "w") as f:
json.dump({"current_gpu": "amd"}, f)
logger.info("Switched to AMD GPU for text inference")
3.4 Feature-Specific Response Handlers
Image Generation Request Handler:
# In bot message handler, before processing image generation
if globals.IMAGE_GENERATION_BLOCKED:
await message.channel.send(globals.IMAGE_GENERATION_BLOCK_MESSAGE)
await message.add_reaction('🎤')
return
Vision Model Request Handler:
# In image/video handling code
if globals.VISION_MODEL_BLOCKED:
await message.channel.send(
"🎤 I can't look at images right now, I'm in voice chat! "
"Send it again after I leave."
)
return
Bipolar Argument Trigger Handler:
# In bipolar_mode.py trigger detection
from utils.voice_manager import voice_manager
if voice_manager.active_session:
logger.info("Bipolar argument blocked (voice session active)")
return # Skip argument trigger
3.5 Required Module Modifications
The following existing modules need to be updated to check voice session state:
3.5.1 bot/utils/bipolar_mode.py
Add checks before:
- Argument triggers based on score thresholds
- Persona dialogue initiations
- Any webhook-based Miku/Evil Miku interactions
# At the top of functions that trigger bipolar interactions
from utils.voice_manager import voice_manager
if voice_manager.active_session:
logger.debug("Bipolar interaction blocked (voice session active)")
return
3.5.2 bot/utils/profile_picture_manager.py
Add locking mechanism:
class ProfilePictureManager:
def __init__(self):
self.switching_locked = False
def lock_switching(self):
"""Lock profile picture changes during voice session"""
self.switching_locked = True
def unlock_switching(self):
"""Unlock profile picture changes after voice session"""
self.switching_locked = False
async def update_profile_picture(self, ...):
if self.switching_locked:
logger.info("Profile picture change blocked (voice session active)")
return
# ... normal update logic ...
3.5.3 bot/utils/autonomous.py
Add pause/resume functions:
_autonomous_paused = False
def pause_autonomous_system():
"""Pause autonomous message generation"""
global _autonomous_paused
_autonomous_paused = True
def resume_autonomous_system():
"""Resume autonomous message generation"""
global _autonomous_paused
_autonomous_paused = False
# In autonomous trigger functions:
def should_send_autonomous_message():
if _autonomous_paused:
return False
# ... normal logic ...
3.5.4 bot/bot.py (main message handler)
Add checks for image generation and vision:
@globals.client.event
async def on_message(message):
# ... existing checks ...
# Check if image generation is blocked
if "draw" in message.content.lower() and globals.IMAGE_GENERATION_BLOCKED:
await message.channel.send(globals.IMAGE_GENERATION_BLOCK_MESSAGE)
await message.add_reaction('🎤')
return
# Check if vision model is blocked (images/videos)
if (message.attachments or message.embeds) and globals.VISION_MODEL_BLOCKED:
await message.channel.send(
"🎤 I can't look at images or videos right now, I'm talking in voice chat! "
"Send it again after I leave."
)
return
# ... rest of message handling ...
3.5.5 bot/commands/actions.py (ComfyUI integration)
Add check before image generation:
async def handle_image_generation(message, prompt):
if globals.IMAGE_GENERATION_BLOCKED:
await message.channel.send(globals.IMAGE_GENERATION_BLOCK_MESSAGE)
await message.add_reaction('🎤')
return
# ... normal image generation logic ...
3.6 Text Channel Pausing Strategy
Options:
Option A: Queue Messages (Recommended)
- Store incoming messages in a queue during voice session
- Process queue after voice session ends
- Pros: No messages lost, users get responses eventually
- Cons: Responses delayed until voice session ends
Option B: Ignore Messages
- Simply don't respond to text channels during voice session
- Send status message: "🎤 Miku is currently in voice chat..."
- Pros: Simple, clear behavior
- Cons: Users might think bot is broken
Recommendation: Option A with status indicator
- Queue messages with timestamps
- Set bot status to "🎤 In Voice Chat"
- Process queue in order after session ends
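With Option A, the queuing itself is a small early-return in `on_message`, placed before normal text handling (a sketch; the `(timestamp, message)` tuple shape is an assumption matching the "queue messages with timestamps" note above):
# In bot/bot.py, before normal text handling
session = voice_manager.active_session
if globals.VOICE_SESSION_ACTIVE and (not session or message.channel.id != session.text_channel.id):
    globals.TEXT_MESSAGE_QUEUE.append((time.time(), message))  # queue for later, with timestamp
    return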
4. Technical Implementation Details
4.1 Discord Voice Integration
Required Package:
pip install PyNaCl # Required for voice support
Voice Connection Flow:
import discord
# Connect to voice channel
voice_channel = client.get_channel(voice_channel_id)
voice_client = await voice_channel.connect()
# Create audio source from stream
audio_source = VoiceStreamSource(websocket_url)
# Play audio
voice_client.play(audio_source)
# Disconnect
await voice_client.disconnect()
Key Classes:
- `discord.VoiceClient` - Handles voice connection
- `discord.AudioSource` - Abstract base for audio streaming
- `discord.PCMAudio` - Raw PCM audio source (16-bit, 48kHz, stereo)
4.2 Custom Audio Source for TTS Stream
Implementation:
class MikuVoiceSource(discord.AudioSource):
"""
Streams audio from RVC WebSocket to Discord voice channel.
"""
def __init__(self, websocket_url="ws://localhost:8765/ws/stream"):
self.websocket_url = websocket_url
self.ws = None
self.audio_queue = asyncio.Queue(maxsize=100)
self.running = False
self.frame_size = 3840 # 20ms @ 48kHz stereo (960 samples * 2 channels * 2 bytes)
async def connect(self):
"""Connect to TTS WebSocket"""
self.ws = await websockets.connect(self.websocket_url)
self.running = True
asyncio.create_task(self._receive_audio())
async def _receive_audio(self):
"""Receive audio from WebSocket and queue for playback"""
while self.running:
try:
audio_bytes = await self.ws.recv()
# Convert float32 mono to int16 stereo
audio_data = self._process_audio(audio_bytes)
await self.audio_queue.put(audio_data)
except Exception as e:
logger.error(f"Audio receive error: {e}")
break
def _process_audio(self, audio_bytes):
"""
Convert float32 mono @ 48kHz to int16 stereo @ 48kHz for Discord.
Discord expects: 16-bit PCM, 48kHz, stereo
"""
# Decode float32
audio = np.frombuffer(audio_bytes, dtype=np.float32)
# Convert to int16
audio_int16 = (audio * 32767).clip(-32768, 32767).astype(np.int16)
# Convert mono to stereo (duplicate channel)
audio_stereo = np.repeat(audio_int16, 2)
return audio_stereo.tobytes()
    def read(self):
        """
        Called by discord.py to get the next audio frame (20ms).
        Must be synchronous and return exactly 3840 bytes.
        """
        try:
            # Get from queue (non-blocking)
            audio_chunk = self.audio_queue.get_nowait()
            # Ensure exactly frame_size bytes
            if len(audio_chunk) < self.frame_size:
                # Pad with silence
                audio_chunk += b'\x00' * (self.frame_size - len(audio_chunk))
            elif len(audio_chunk) > self.frame_size:
                # Trim excess (note: this drops samples; a production version
                # should carry the remainder over into the next frame)
                audio_chunk = audio_chunk[:self.frame_size]
            return audio_chunk
        except asyncio.QueueEmpty:
            # Queue empty: no audio available yet, return silence
            return b'\x00' * self.frame_size
def cleanup(self):
"""Clean up resources"""
self.running = False
if self.ws:
asyncio.create_task(self.ws.close())
4.3 LLM Streaming Integration
Llama.cpp Streaming Endpoint:
POST http://llama-swap-amd:8080/v1/models/llama3.1/completions
Content-Type: application/json
{
"prompt": "<full prompt>",
"stream": true,
"temperature": 0.8,
"max_tokens": 500
}
Response (SSE - Server-Sent Events):
data: {"choices":[{"text":"Hello","finish_reason":null}]}
data: {"choices":[{"text":" world","finish_reason":null}]}
data: {"choices":[{"text":"!","finish_reason":"stop"}]}
data: [DONE]
Streaming Handler:
async def stream_llm_to_tts(prompt, websocket):
"""
Stream LLM tokens directly to TTS WebSocket.
"""
url = f"{globals.LLAMA_AMD_URL}/v1/models/{globals.TEXT_MODEL}/completions"
payload = {
"prompt": prompt,
"stream": True,
"temperature": 0.8,
"max_tokens": 500,
"stop": ["\n\n", "User:", "Assistant:"]
}
async with aiohttp.ClientSession() as session:
async with session.post(url, json=payload) as resp:
async for line in resp.content:
line = line.decode('utf-8').strip()
if not line.startswith('data: '):
continue
data = line[6:] # Remove 'data: ' prefix
if data == '[DONE]':
# Flush remaining buffer
await websocket.send(json.dumps({"flush": True}))
break
try:
chunk = json.loads(data)
token = chunk['choices'][0]['text']
# Send token to TTS
await websocket.send(json.dumps({
"token": token,
"pitch_shift": 0
}))
except Exception as e:
logger.error(f"Token parse error: {e}")
4.4 Voice Session Implementation
Main Session Class:
class VoiceSession:
"""
Manages a single voice chat session.
"""
def __init__(self, guild_id, voice_channel, text_channel):
self.guild_id = guild_id
self.voice_channel = voice_channel
self.text_channel = text_channel
self.voice_client = None
self.audio_source = None
self.tts_websocket = None
self.active = False
async def connect(self):
"""Connect to voice channel and TTS pipeline"""
# 1. Connect to Discord voice
self.voice_client = await self.voice_channel.connect()
# 2. Connect to TTS WebSocket
self.tts_websocket = await websockets.connect("ws://localhost:8765/ws/stream")
# 3. Create audio source
self.audio_source = MikuVoiceSource()
await self.audio_source.connect()
# 4. Start playing audio stream
self.voice_client.play(self.audio_source)
self.active = True
logger.info(f"Voice session started in {self.voice_channel.name}")
async def speak(self, prompt):
"""
Generate speech for given prompt.
Streams LLM tokens → TTS → Discord voice.
"""
if not self.active:
raise Exception("Voice session not active")
# Build full LLM prompt with context
full_prompt = await self._build_llm_prompt(prompt)
# Stream tokens to TTS
await stream_llm_to_tts(full_prompt, self.tts_websocket)
async def _build_llm_prompt(self, user_prompt):
"""Build full prompt with context (similar to query_llama)"""
# Get mood and context
from utils.llm import get_context_for_response_type
from server_manager import server_manager
server_config = server_manager.get_server_config(self.guild_id)
current_mood = server_config.current_mood_description
miku_context = get_context_for_response_type("server_response")
# Build messages array
messages = [
{"role": "system", "content": f"{miku_context}\n\nMiku is currently feeling: {current_mood}"},
{"role": "user", "content": user_prompt}
]
# Convert to llama.cpp prompt format (depends on model)
# For Llama 3.1:
prompt = "<|begin_of_text|>"
for msg in messages:
if msg["role"] == "system":
prompt += f"<|start_header_id|>system<|end_header_id|>\n{msg['content']}<|eot_id|>"
elif msg["role"] == "user":
prompt += f"<|start_header_id|>user<|end_header_id|>\n{msg['content']}<|eot_id|>"
prompt += "<|start_header_id|>assistant<|end_header_id|>\n"
return prompt
async def disconnect(self):
"""Disconnect from voice and cleanup"""
self.active = False
# Stop audio playback
if self.voice_client and self.voice_client.is_playing():
self.voice_client.stop()
# Disconnect from voice
if self.voice_client:
await self.voice_client.disconnect()
# Close TTS WebSocket
if self.tts_websocket:
await self.tts_websocket.close()
# Cleanup audio source
if self.audio_source:
self.audio_source.cleanup()
logger.info("Voice session ended")
5. Command Implementation
5.1 Voice Commands
New commands to add:
- `!miku join [#voice-channel]`
  - Join specified voice channel (or user's current channel)
  - Set text channel as prompt input channel
  - Lock resources and start session
- `!miku leave`
  - Leave current voice channel
  - Release resources
  - Resume normal operation
- `!miku voice-status`
  - Show current voice session status
  - Show active prompt channel
  - Show resource allocation
5.2 Command Router Integration
Add to bot/command_router.py:
from commands.voice import handle_voice_command
# In route_command():
if cmd in ['join', 'leave', 'voice-status', 'speak']:
return await handle_voice_command(message, cmd, args)
New file bot/commands/voice.py:
from utils.voice_manager import voice_manager
async def handle_voice_command(message, cmd, args):
"""Handle voice-related commands"""
if cmd == 'join':
# Get voice channel
if args and args[0].startswith('<#'):
# Channel mentioned
channel_id = int(args[0][2:-1])
voice_channel = message.guild.get_channel(channel_id)
else:
# Use user's current voice channel
if message.author.voice:
voice_channel = message.author.voice.channel
else:
await message.channel.send("❌ You must be in a voice channel!")
return
try:
await voice_manager.start_session(
message.guild.id,
voice_channel,
message.channel # Use current text channel for prompts
)
await message.channel.send(f"🎤 Joined {voice_channel.name}! Send messages here to make me speak.")
except Exception as e:
await message.channel.send(f"❌ Failed to join voice: {e}")
elif cmd == 'leave':
if not voice_manager.active_session:
await message.channel.send("❌ I'm not in a voice channel!")
return
await voice_manager.end_session()
await message.channel.send("👋 Left voice channel!")
elif cmd == 'voice-status':
if voice_manager.active_session:
session = voice_manager.active_session
await message.channel.send(
f"🎤 **Voice Session Active**\n"
f"Voice Channel: {session.voice_channel.name}\n"
f"Prompt Channel: {session.text_channel.mention}\n"
f"GPU: AMD RX 6800 (text only)\n"
f"Text Channels: Paused (queued)"
)
else:
await message.channel.send("No active voice session")
5.3 Text Channel Prompt Handler
Modify bot/bot.py on_message handler:
@globals.client.event
async def on_message(message):
if message.author == globals.client.user:
return
# Check if this is voice prompt channel
if voice_manager.active_session:
session = voice_manager.active_session
if message.channel.id == session.text_channel.id:
# This is a voice prompt
await session.speak(message.content)
await message.add_reaction('🎤') # Acknowledge
return
# ... rest of normal message handling ...
6. Implementation Phases
Phase 1: Foundation (3-4 hours)
Goal: Set up basic voice connection and resource management
Tasks:
- Install PyNaCl dependency
- Add global state variables to `globals.py`
- Create `bot/utils/voice_manager.py`
- Implement `VoiceSessionManager` singleton
- Implement all resource locking methods:
  - GPU switching
  - Vision model blocking
  - Text channel pausing
  - Bipolar mode disabling
  - Profile picture lock
  - Image generation blocking
  - Autonomous engine pause
  - Scheduled events pause
  - Figurine notifier pause
- Add feature-specific response handlers (image gen, vision model)
- Test voice connection without TTS
Deliverables:
- Voice channel join/leave working
- All resource locks functional
- Text channels properly paused during session
- All features properly disabled/re-enabled around sessions
- Hardcoded responses for blocked features
Phase 2: Audio Streaming (3-4 hours)
Goal: Implement TTS audio streaming to Discord
Tasks:
- Create `MikuVoiceSource` class
- Implement WebSocket → Discord audio bridge
- Handle audio format conversion (float32 mono → int16 stereo)
- Implement frame buffering and timing
- Test with static text (no LLM streaming yet)
Deliverables:
- Audio plays in Discord voice channel
- TTS pipeline outputs correctly formatted audio
- No audio glitches or timing issues
Phase 3: LLM Streaming Integration (2-3 hours)
Goal: Connect LLM token stream to TTS pipeline
Tasks:
- Implement `stream_llm_to_tts()` function
- Handle SSE parsing from llama.cpp
- Build proper prompt with context/mood
- Test token-by-token streaming
- Handle edge cases (connection drops, errors)
Deliverables:
- LLM tokens stream to TTS in real-time
- Audio starts playing quickly (1-2s latency)
- Natural sentence boundaries respected
Phase 4: Commands & UX (1-2 hours)
Goal: Polish user interface and commands
Tasks:
- Create `bot/commands/voice.py`
- Add commands to router
- Implement status messages
- Add error handling and user feedback
- Test edge cases (multiple join attempts, etc.)
Deliverables:
- All voice commands working
- Clear user feedback
- Graceful error handling
Phase 5: Testing & Refinement (2-3 hours)
Goal: Ensure stability and performance
Tasks:
- Load testing (long sessions, many prompts)
- Resource leak detection
- Audio quality verification
- Latency optimization
- Documentation and README updates
Deliverables:
- Stable voice sessions (no crashes)
- Optimal latency (target: <2s first audio)
- Updated documentation
Total Estimated Time: 12-18 hours
7. Error Handling & Edge Cases
7.1 Common Error Scenarios
1. TTS Pipeline Unavailable
- Symptom: Can't connect to WebSocket
- Response: Return error, don't start voice session
- Message: "❌ TTS pipeline not available. Check soprano/rvc containers."
2. Voice Channel Full
- Symptom: Can't join voice channel (user limit)
- Response: Return error
- Message: "❌ Voice channel is full!"
3. Already in Voice Session
- Symptom: User tries to join while session active
- Response: Reject command
- Message: "❌ Already in a voice session! Use
!miku leavefirst."
4. LLM Timeout
- Symptom: LLM doesn't respond within timeout
- Response: Send silence, log error
- Message: "(in voice) Miku seems confused..."
5. Audio Buffer Underrun
- Symptom: TTS slower than playback rate
- Response: Pad with silence, don't crash
- Log: Warning about buffer underrun
6. Blocked Feature Attempted During Voice
- Symptom: User tries to generate image, send image, trigger bipolar mode
- Response: Send appropriate blocked feature message
- Examples:
- "🎤 I can't draw right now, I'm talking in voice chat!"
- "🎤 I can't look at images right now, I'm talking in voice chat!"
- Log: Feature block triggered
7. Resource Cleanup Failure
- Symptom: Feature doesn't resume after voice session
- Response: Log error, attempt manual cleanup
- Fallback: Restart bot if critical features stuck
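For scenario 1, a cheap pre-flight check before `start_session` avoids half-initialized sessions (a sketch using the `websockets` client already listed in requirements):
async def tts_available(url="ws://localhost:8765/ws/stream", timeout=3.0):
    """Return True if the RVC WebSocket endpoint accepts connections."""
    try:
        ws = await asyncio.wait_for(websockets.connect(url), timeout)
        await ws.close()
        return True
    except Exception:
        return False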
7.2 Graceful Degradation
Priority Order:
- Keep bot online (don't crash)
- Maintain voice connection if possible
- Inform user of issues
- Fallback to text if voice fails
Fallback Strategy:
async def speak_with_fallback(session, prompt):
"""Speak in voice, fallback to text if error"""
try:
await session.speak(prompt)
except Exception as e:
logger.error(f"Voice speak failed: {e}")
# Fallback: send text response
response = await query_llama(prompt, ...)
await session.text_channel.send(f"⚠️ (Voice failed, text mode): {response}")
8. Performance Optimization
8.1 Latency Reduction Strategies
Target: <2 seconds from prompt to first audio
Optimization Points:
- Pre-warm TTS connection
  - Keep WebSocket connected during session
  - Reduce handshake overhead
- Reduce LLM prompt length
  - Limit conversation history to 4 messages
  - Truncate long context
- Parallel processing
  - Start TTS as soon as the first token arrives
  - Don't wait for a full sentence
- Buffer tuning
  - Keep audio buffer small (5-10 chunks max)
  - Balance latency vs. smoothness
8.2 Resource Monitoring
Metrics to Track:
- VRAM usage (AMD GPU during session)
- CPU usage (RVC/Soprano processing)
- Audio buffer fill level
- LLM token rate (tokens/second)
- End-to-end latency (prompt → audio)
Implementation:
class VoiceMetrics:
def __init__(self):
self.prompt_times = []
self.first_audio_times = []
self.total_tokens = 0
def log_prompt(self):
self.prompt_times.append(time.time())
def log_first_audio(self):
if self.prompt_times:
latency = time.time() - self.prompt_times[-1]
self.first_audio_times.append(latency)
logger.info(f"First audio latency: {latency:.2f}s")
9. Testing Plan
9.1 Unit Tests
Components to Test:
- `MikuVoiceSource.read()` - Audio framing
- `stream_llm_to_tts()` - Token streaming
- `VoiceSessionManager` - Resource locking
- Audio format conversion
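A sketch of the framing tests (pytest style; the silence-padding behavior comes directly from `MikuVoiceSource.read()` in section 4.2, and the import path assumes the file layout in Appendix A):
from utils.voice_stream import MikuVoiceSource  # assumed module path

def test_read_returns_silence_when_queue_empty():
    source = MikuVoiceSource()
    assert source.read() == b"\x00" * source.frame_size

def test_read_pads_short_chunks_to_frame_size():
    source = MikuVoiceSource()
    source.audio_queue.put_nowait(b"\x01\x02")  # far under one 20ms frame
    frame = source.read()
    assert len(frame) == source.frame_size
    assert frame.startswith(b"\x01\x02")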
9.2 Integration Tests
Test Scenarios:
- Full voice session lifecycle (join → speak → leave)
- Resource cleanup after session
- Text channel pause/resume
- Multiple prompts in quick succession
- Long prompts (500+ characters)
- Error recovery (TTS crash, LLM timeout)
9.3 Feature Blocking Tests
Test each blocked feature during voice session:
- Vision Model Blocking
  - ✅ Send image to Miku while in voice
  - ✅ Verify blocked message appears
  - ✅ Confirm no vision model loaded (check logs)
  - ✅ After leaving voice, send image again
  - ✅ Verify vision model works normally
- Image Generation Blocking
  - ✅ Try "draw [prompt]" while in voice
  - ✅ Verify custom blocked message appears
  - ✅ Confirm ComfyUI not called
  - ✅ After leaving voice, try draw again
  - ✅ Verify image generation works normally
- Bipolar Mode Blocking
  - ✅ Trigger bipolar argument score while in voice
  - ✅ Verify no argument starts
  - ✅ Check logs for block message
  - ✅ After leaving voice, verify bipolar mode resumes
- Profile Picture Blocking
  - ✅ Trigger profile picture change while in voice
  - ✅ Verify avatar doesn't change
  - ✅ After leaving voice, verify pfp switching works
- Autonomous Engine Blocking
  - ✅ Wait for autonomous message trigger while in voice
  - ✅ Verify no autonomous messages sent
  - ✅ After leaving voice, verify autonomous resumes
- Scheduled Events Blocking
  - ✅ Join voice near scheduled event time
  - ✅ Verify event doesn't fire during session
  - ✅ After leaving voice, verify scheduler active
- Text Channel Queuing
  - ✅ Send regular message while in voice
  - ✅ Verify no response during session
  - ✅ Verify message queued (check logs)
  - ✅ After leaving voice, verify queued messages processed
9.4 Manual Testing Checklist
- Join voice channel via command
- Bot appears in voice channel
- Send prompt in text channel
- Audio plays in voice channel within 2s
- Audio quality is clear (no glitches)
- Multiple prompts work in sequence
- Leave command works
- Resources released after leave
- Text channels resume normal operation
- Bot can rejoin after leaving
10. Future Enhancements (Post-MVP)
10.1 Speech-to-Text (STT) Integration
Goal: Allow users to speak to Miku instead of typing
Approach:
- Use Whisper model for STT
- Run on AMD GPU during voice sessions
- Stream audio → text → LLM → TTS → audio
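The transcription step could be as small as the following (a sketch using the open-source `whisper` package; the model size, the WAV capture path, and a ROCm-compatible torch build are all assumptions):
import whisper

model = whisper.load_model("base")  # small model, to share the AMD GPU with RVC/LLM

def transcribe(wav_path: str) -> str:
    """Transcribe captured voice audio to text for the LLM prompt."""
    result = model.transcribe(wav_path)
    return result["text"].strip()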
10.2 Multi-User Voice Conversations
Goal: Multiple users can take turns speaking
Approach:
- Voice activity detection (VAD)
- Queue speaker turns
- Name prefixes in prompts ("User1: ...", "User2: ...")
10.3 Background Music/Sound Effects
Goal: Play background music while Miku speaks
Approach:
- Mix audio streams (voice + music)
- Volume ducking (lower music during speech)
- FFmpeg audio processing
10.4 Voice Commands
Goal: Control bot via voice ("Miku, leave voice chat")
Approach:
- Simple keyword detection in STT output
- Command routing from voice input
10.5 Emotion-Aware Speech
Goal: Vary TTS pitch/speed based on mood
Approach:
- Map mood → pitch_shift parameter
- Dynamic pitch based on emotion detection
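Since the `/ws/stream` protocol already takes `pitch_shift` per message (section 1.1), this could start as a simple lookup applied when sending tokens (the mood names and semitone offsets below are illustrative, not taken from the existing mood system):
MOOD_PITCH_SHIFT = {  # hypothetical mood -> semitone offset mapping
    "excited": 2,
    "happy": 1,
    "neutral": 0,
    "melancholy": -2,
}

def pitch_for_mood(mood: str) -> int:
    return MOOD_PITCH_SHIFT.get(mood, 0)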
11. Configuration & Deployment
11.1 Environment Variables
Add to docker-compose.yml:
miku-bot:
environment:
- VOICE_ENABLED=true
- TTS_WEBSOCKET_URL=ws://miku-rvc-api:8765/ws/stream
- VOICE_GPU=amd # Force AMD GPU during voice sessions
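On the bot side, `globals.py` would read these at startup; a minimal sketch (defaults mirror the compose values above):
import os

VOICE_ENABLED = os.getenv("VOICE_ENABLED", "false").lower() == "true"
TTS_WEBSOCKET_URL = os.getenv("TTS_WEBSOCKET_URL", "ws://miku-rvc-api:8765/ws/stream")
VOICE_GPU = os.getenv("VOICE_GPU", "amd")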
11.2 Network Configuration
Ensure containers can communicate:
- `miku-bot` → `miku-soprano-tts` (ZMQ 5555)
- `miku-bot` → `miku-rvc-api` (HTTP/WS 8765)
- `miku-bot` → `llama-swap-amd` (HTTP 8080)
Add to docker-compose networks if needed:
networks:
miku-network:
name: miku-network
driver: bridge
11.3 Dependencies
Add to bot/requirements.txt:
PyNaCl>=1.5.0 # Voice support
websockets>=12.0 # TTS WebSocket client
11.4 Global State Variables
Add to bot/globals.py:
# Voice Chat Session State
VOICE_SESSION_ACTIVE = False
TEXT_MESSAGE_QUEUE = [] # Queue for messages received during voice session
# Feature Blocking Flags (set during voice session)
VISION_MODEL_BLOCKED = False
IMAGE_GENERATION_BLOCKED = False
IMAGE_GENERATION_BLOCK_MESSAGE = None
12. Risk Assessment
12.1 Technical Risks
High Risk:
- Audio glitches/stuttering - Mitigation: Extensive buffer testing
- Resource exhaustion - Mitigation: Strict resource locking
- TTS pipeline crashes - Mitigation: Health checks, auto-restart
Medium Risk:
- High latency - Mitigation: Optimization, parallel processing
- Connection drops - Mitigation: Retry logic, graceful degradation
Low Risk:
- Command conflicts - Mitigation: Clear command names
- User confusion - Mitigation: Status messages, documentation
12.2 Resource Risks
Concern: AMD GPU overload (RVC + LLM simultaneously)
Mitigation:
- Monitor VRAM usage during testing
- Reduce RVC batch size if needed
- Consider limiting LLM context length
- Add VRAM threshold checks
Concern: Text channel message queue overflow
Mitigation:
- Limit queue size (e.g., 100 messages)
- Discard oldest messages if limit reached
- Send warning to users
13. Documentation Requirements
13.1 User Documentation
Create VOICE_CHAT_USER_GUIDE.md:
- How to invite Miku to voice channel
- How to send prompts
- Troubleshooting common issues
- Feature limitations
13.2 Developer Documentation
Create VOICE_CHAT_DEVELOPER_GUIDE.md:
- Architecture overview
- Code organization
- Adding new voice features
- Debugging tips
13.3 API Documentation
Document in API_REFERENCE.md:
- Voice command endpoints
- VoiceSessionManager API
- MikuVoiceSource interface
14. Success Criteria
14.1 Functional Requirements ✓
- Miku can join voice channel
- Miku can speak using TTS pipeline
- LLM tokens stream in real-time
- Text prompts trigger voice responses
- Resource management prevents conflicts
- Graceful session cleanup
14.2 Resource Management Requirements
- GPU switches to AMD during session
- Vision model blocked during session
- Text channels paused (messages queued)
- Bipolar mode interactions disabled
- Profile picture switching locked
- Image generation blocked with custom message
- Autonomous engine paused
- Scheduled events paused
- Figurine notifier paused
- All features resume after session ends
14.3 Performance Requirements
- First audio within 2 seconds of prompt
- Audio quality: Clear, no glitches
- VRAM usage: <14GB on AMD GPU
- Sessions stable for 30+ minutes
14.4 Usability Requirements
- Commands intuitive and documented
- Error messages clear and actionable
- Status indicators show session state
- Fallback to text on voice failure
- Helpful blocked feature messages
15. Next Steps
Immediate Actions:
- Review this plan with team/stakeholders
- Set up development branch (`feature/voice-chat`)
- Install dependencies (PyNaCl, test WebSocket connectivity)
- Create skeleton files (voice_manager.py, voice.py commands)
- Start Phase 1 implementation
Before Starting Implementation:
- Verify soprano/rvc containers are healthy
- Test WebSocket endpoint manually (`websocket_client_example.py`)
- Verify AMD GPU has sufficient VRAM (check with `rocm-smi`)
- Back up current bot state (in case rollback needed)
Development Workflow:
- Create feature branch
- Implement phase-by-phase
- Test each phase before moving to next
- Document changes in commit messages
- Merge to main when MVP complete
Appendix A: File Structure
miku-discord/
├── bot/
│ ├── commands/
│ │ ├── actions.py
│ │ └── voice.py # NEW: Voice commands
│ ├── utils/
│ │ ├── llm.py # MODIFY: Add streaming support
│ │ ├── voice_manager.py # NEW: Session management
│ │ └── voice_stream.py # NEW: Audio streaming classes
│ ├── bot.py # MODIFY: Add voice prompt handling
│ └── command_router.py # MODIFY: Add voice command routing
├── soprano_to_rvc/
│ ├── soprano_rvc_api.py # EXISTING: WebSocket endpoint
│ └── docker-compose.yml # EXISTING: TTS containers
├── docker-compose.yml # MODIFY: Add environment vars
└── readmes/
├── VOICE_CHAT_IMPLEMENTATION_PLAN.md # THIS FILE
├── VOICE_CHAT_USER_GUIDE.md # NEW: User docs
└── VOICE_CHAT_DEVELOPER_GUIDE.md # NEW: Dev docs
Appendix B: Key Code Snippets
B.1 Streaming LLM Tokens
async def stream_llm_tokens(prompt):
    """Stream tokens from llama-swap-amd"""
    url = f"{globals.LLAMA_AMD_URL}/v1/models/{globals.TEXT_MODEL}/completions"
    async with aiohttp.ClientSession() as session:
        async with session.post(url, json={"prompt": prompt, "stream": True}) as resp:
            async for line in resp.content:
                if not line.startswith(b'data: '):
                    continue
                payload = line[6:].strip()
                if payload == b'[DONE]':
                    break  # SSE end-of-stream sentinel, not JSON
                data = json.loads(payload)
                if 'choices' in data:
                    yield data['choices'][0]['text']
B.2 Discord Voice Connection
voice_client = await voice_channel.connect()
audio_source = MikuVoiceSource()
await audio_source.connect()
voice_client.play(audio_source)
B.3 Resource Lock Pattern
async with voice_manager.session_lock:
await switch_to_amd_gpu()
await block_vision_model()
await pause_text_channels()
# ... start session ...
Appendix C: Troubleshooting Guide
Issue: "TTS pipeline not available"
Cause: RVC container not running or WebSocket unreachable
Fix:
cd soprano_to_rvc
docker-compose up -d
docker-compose logs rvc
Issue: Audio stuttering/glitching
Cause: Buffer underrun (TTS too slow)
Fix: Increase audio buffer size in MikuVoiceSource
Issue: High latency (>5s first audio)
Cause: LLM slow to generate tokens
Fix: Reduce prompt length, check GPU utilization
Issue: Voice session hangs on disconnect
Cause: Resource cleanup timeout
Fix: Add timeout to disconnect operations
Issue: Features not resuming after voice session
Cause: Resource unlock methods not called or failed
Fix:
# Check logs for cleanup errors
docker logs miku-bot | grep -i "voice\|resume\|enable"
# Manual fix: Restart bot to reset all states
docker restart miku-bot
Issue: Image generation still works during voice session
Cause: Block check not implemented in image gen handler
Fix: Add an `if globals.IMAGE_GENERATION_BLOCKED` check in `commands/actions.py`
Issue: Bipolar argument triggered during voice
Cause: Block check missing in bipolar_mode.py
Fix: Add an `if voice_manager.active_session` check before argument triggers
Appendix D: Quick Reference - Resource Blocks
Developer Quick Reference: What to Check Before Each Feature
| Feature/Module | Check This Before Running | Global Flag |
|---|---|---|
| Vision model loading | `globals.VISION_MODEL_BLOCKED` | `VISION_MODEL_BLOCKED` |
| Image generation | `globals.IMAGE_GENERATION_BLOCKED` | `IMAGE_GENERATION_BLOCKED` |
| Bipolar triggers | `voice_manager.active_session` | N/A (check object) |
| Profile picture | `profile_picture_manager.switching_locked` | N/A (check object) |
| Autonomous msgs | `_autonomous_paused` (in autonomous.py) | N/A (module-level) |
| Scheduled events | N/A (`scheduler.pause()` called) | N/A (APScheduler) |
| Text channel response | `globals.VOICE_SESSION_ACTIVE` | `VOICE_SESSION_ACTIVE` |
Locations to Add Checks:
- `bot/bot.py` - Main message handler (vision, image gen, text response)
- `bot/utils/bipolar_mode.py` - Argument trigger functions
- `bot/utils/profile_picture_manager.py` - Update functions
- `bot/utils/autonomous.py` - Message generation functions
- `bot/commands/actions.py` - Image generation handler
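Rather than sprinkling raw flag checks across these modules, a small decorator could centralize the guard (a sketch; the flag-name indirection through `globals` is an assumption, not existing code):
import functools
import globals

def blocked_during_voice(flag_name, reply=None):
    """Skip the wrapped message handler while the given voice-session flag is set."""
    def decorator(func):
        @functools.wraps(func)
        async def wrapper(message, *args, **kwargs):
            if getattr(globals, flag_name, False):
                if reply:
                    await message.channel.send(reply)
                return
            return await func(message, *args, **kwargs)
        return wrapper
    return decorator

# Usage (hypothetical):
# @blocked_during_voice("IMAGE_GENERATION_BLOCKED", reply="🎤 I can't draw right now...")
# async def handle_image_generation(message, prompt): ...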
Conclusion
This implementation plan provides a comprehensive roadmap for adding voice channel functionality to Miku. The phased approach ensures incremental progress with testing at each stage. The resource management strategy carefully balances competing demands on limited hardware.
Key Success Factors:
- Strict resource locking prevents conflicts (8 features disabled during voice)
- Token streaming minimizes latency (<2s target)
- Graceful error handling ensures stability
- Clear user feedback improves experience (blocked feature messages)
- Comprehensive testing covers all edge cases
Critical Implementation Points:
- ⚠️ Must disable 8 features during voice: vision, image gen, bipolar, pfp, autonomous, scheduled events, figurine notifier, text channels
- ⚠️ GPU switching mandatory: AMD RX 6800 for text, GTX 1660 for TTS only
- ⚠️ User messaging important: Clear feedback when features blocked
- ⚠️ Cleanup critical: All features must resume properly after session
Estimated Timeline: 12-18 hours for MVP, additional 5-10 hours for polish and testing.
Files to Create:
- `bot/utils/voice_manager.py` (main session management)
- `bot/utils/voice_stream.py` (audio streaming classes)
- `bot/commands/voice.py` (voice commands)
Files to Modify:
- `bot/globals.py` (add voice state flags)
- `bot/bot.py` (add voice prompt handler, blocking checks)
- `bot/command_router.py` (add voice command routing)
- `bot/utils/bipolar_mode.py` (add session checks)
- `bot/utils/profile_picture_manager.py` (add locking)
- `bot/utils/autonomous.py` (add pause/resume)
- `bot/commands/actions.py` (add image gen blocking)
- `bot/requirements.txt` (add PyNaCl, websockets)
Ready to proceed? Review this plan, make any necessary adjustments, and begin Phase 1 implementation.