Miku Voice Channel Chat Feature - Implementation Plan

Executive Summary

This document outlines a comprehensive plan to implement real-time voice channel functionality for Miku, enabling her to:

  • Join Discord voice channels and speak using her TTS pipeline
  • Stream text tokens from llama-swap's LLM directly to TTS for maximum real-time responsiveness
  • Accept text-based prompts from a designated text channel as a temporary input method
  • Strictly manage system resources to maintain performance on constrained hardware

⚠️ Critical Resource Management Requirements

Due to tight hardware constraints, multiple bot features must be disabled during voice sessions:

| Feature | Action During Voice | Reason |
| --- | --- | --- |
| Vision Model | Blocked | Frees GTX 1660 for TTS only |
| Image Generation | Blocked | Prevents ComfyUI GPU usage |
| Bipolar Mode | Disabled | Prevents dual-personality interactions |
| Profile Pictures | 🔒 Locked | Avoids Discord API overhead |
| Autonomous Engine | ⏸️ Paused | Reduces inference load |
| Scheduled Events | ⏸️ Paused | Prevents background tasks |
| Figurine Notifier | ⏸️ Paused | Prevents background tasks |
| Text Channels | 📦 Queued | Messages processed after session |

GPU Allocation During Voice:

  • GTX 1660: Soprano TTS only (no LLM, no vision)
  • AMD RX 6800: RVC API + llama-swap-amd text model (~10-12GB)

1. System Architecture Overview

1.1 Current TTS Pipeline (soprano_to_rvc)

Components:

  • Soprano TTS Server (GTX 1660 + CUDA)

    • Runs in miku-soprano-tts container
    • Listens on ZMQ port 5555 (internal network)
    • Converts text → 32kHz audio
  • RVC API Server (AMD RX 6800 + ROCm)

    • Runs in miku-rvc-api container
    • Exposes WebSocket endpoint: ws://localhost:8765/ws/stream
    • Exposes HTTP endpoint: http://localhost:8765
    • Converts Soprano output → Miku voice (48kHz)

WebSocket Protocol (/ws/stream):

Client → Server: {"token": "Hello", "pitch_shift": 0}
Client → Server: {"token": " world"}
Client → Server: {"token": "!", "flush": false}
Server → Client: [binary PCM float32 audio @ 48kHz]
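
For reference, a minimal Python client for this protocol might look like the sketch below. It assumes the websockets package and the endpoint exactly as documented above; whether a bare {"flush": true} control message is accepted mirrors the usage in section 4.3 and should be verified against soprano_rvc_api.py.

import asyncio
import json

import websockets

async def speak_once(text):
    """Send one utterance to the RVC stream endpoint and collect PCM output."""
    async with websockets.connect("ws://localhost:8765/ws/stream") as ws:
        await ws.send(json.dumps({"token": text, "pitch_shift": 0}))
        await ws.send(json.dumps({"flush": True}))  # force synthesis of buffered text
        pcm = b""
        try:
            # Read binary float32 frames until the server goes quiet
            while True:
                pcm += await asyncio.wait_for(ws.recv(), timeout=5.0)
        except asyncio.TimeoutError:
            pass
        return pcm  # 48kHz float32 mono PCM

if __name__ == "__main__":
    audio = asyncio.run(speak_once("Hello world!"))
    print(f"Received {len(audio)} bytes of PCM")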

1.2 Current LLM Infrastructure

Text Models:

  • llama-swap (GTX 1660) - Port 8090

    • Models: llama3.1, darkidol, vision
    • Supports streaming via /completion endpoint with stream=true
  • llama-swap-amd (AMD RX 6800) - Port 8091

    • Models: llama3.1, darkidol (no vision)
    • Supports streaming via /completion endpoint with stream=true

1.3 Current Discord Bot Architecture

Main Components:

  • bot/bot.py - Main Discord client event handler
  • bot/globals.py - Global state and configuration
  • bot/utils/llm.py - LLM query interface
  • bot/server_manager.py - Multi-server configuration management

Key Features:

  • Multi-server support with per-server mood/config
  • DM support with separate mood system
  • Evil mode (alternate personality with uncensored model)
  • Bipolar mode (both personalities can interact)
  • Vision model integration for images/videos

2. Voice Channel Feature Requirements

2.1 Core Functionality

  1. Voice Channel Connection

    • Miku can join a voice channel via command (e.g., !miku join)
    • Miku can leave via command (e.g., !miku leave)
    • Only one voice session active at a time (resource constraint)
  2. Real-Time Text-to-Speech

    • Stream LLM tokens directly to TTS WebSocket
    • Send audio chunks to Discord voice as they're generated
    • Minimize latency between token generation and audio playback
  3. Text-Based Input (Temporary)

    • Designated text channel for prompting Miku (e.g., #miku-voice-prompt)
    • Messages in this channel trigger voice responses
    • Only active when Miku is in voice channel
  4. Resource Management

    • GPU Switching: Use llama-swap-amd (RX 6800) exclusively for text generation
    • Vision Model Blocking: Prevent vision model from loading during voice session
    • Text Channel Pausing: Pause/queue regular text channel inference
    • Cleanup: Properly release resources when voice session ends

2.2 User Experience Goals

  • Low Latency: First audio chunk should play within 1-2 seconds of prompt
  • Natural Speech: Sentence boundaries should be respected for natural pauses
  • Reliable: Graceful error handling and recovery
  • Non-Intrusive: Shouldn't break existing bot functionality

3. Resource Management Strategy

3.1 Hardware Constraints

Available Resources:

  • GTX 1660 (6GB VRAM): Currently runs llama-swap + Soprano TTS
  • AMD RX 6800 (16GB VRAM): Currently runs llama-swap-amd + RVC API

During Voice Session:

  • GTX 1660: Dedicated to Soprano TTS only (no LLM)
  • AMD RX 6800: Split between RVC API + llama-swap-amd text model
    • RVC uses ~4-6GB VRAM
    • Text model uses ~5-6GB VRAM
    • Total: ~10-12GB (within 16GB limit)

Features That Must Be Disabled During Voice Session:

Due to resource constraints and to ensure voice chat performance, the following features must be completely disabled while Miku is in a voice channel:

  1. Vision Model Loading - Prevents GTX 1660 from loading vision models (keeps TTS running)
  2. Image Generation (ComfyUI) - Blocks draw commands with custom message
  3. Bipolar Mode Interactions - Prevents Miku/Evil Miku arguments and dialogues
  4. Profile Picture Switching - Locks avatar changes during session
  5. Autonomous Engine - Pauses autonomous message generation
  6. Scheduled Events - Pauses all scheduled jobs (e.g., Monday videos)
  7. Figurine Notifier - Pauses figurine availability notifications
  8. Text Channel Inference - Queues regular text messages for later processing

User-Facing Messages for Blocked Features:

  • Image generation: "🎤 I can't draw right now, I'm talking in voice chat! Ask me again after I leave the voice channel."
  • Vision requests: "🎤 I can't look at images or videos right now, I'm talking in voice chat! Send it again after I leave."
  • Bipolar triggers: (Silent - no argument starts)
  • Profile changes: (Silent - no avatar updates)

3.2 Resource Locking Mechanism

Implementation via VoiceSessionManager singleton:

class VoiceSessionManager:
    def __init__(self):
        self.active_session = None  # VoiceSession instance or None
        self.session_lock = asyncio.Lock()
        
    async def start_session(self, guild_id, voice_channel, text_channel):
        """Start voice session with resource locks"""
        async with self.session_lock:
            if self.active_session:
                raise Exception("Voice session already active")
            
            # 1. Switch to AMD GPU for text inference
            await self._switch_to_amd_gpu()
            
            # 2. Block vision model loading
            await self._block_vision_model()
            
            # 3. Pause text channel inference (queue messages)
            await self._pause_text_channels()
            
            # 4. Disable bipolar mode interactions (Miku/Evil Miku arguments)
            await self._disable_bipolar_mode()
            
            # 5. Disable profile picture switching
            await self._disable_profile_picture_switching()
            
            # 6. Disable image generation (ComfyUI)
            await self._disable_image_generation()
            
            # 7. Pause autonomous engine
            await self._pause_autonomous_engine()
            
            # 8. Pause scheduled events
            await self._pause_scheduled_events()
            
            # 9. Pause figurine notifier
            await self._pause_figurine_notifier()
            
            # 10. Create voice session
            self.active_session = VoiceSession(guild_id, voice_channel, text_channel)
            await self.active_session.connect()
            
    async def end_session(self):
        """End voice session and release resources"""
        async with self.session_lock:
            if not self.active_session:
                return
            
            # 1. Disconnect from voice
            await self.active_session.disconnect()
            
            # 2. Resume text channel inference
            await self._resume_text_channels()
            
            # 3. Unblock vision model
            await self._unblock_vision_model()
            
            # 4. Re-enable bipolar mode interactions
            await self._enable_bipolar_mode()
            
            # 5. Re-enable profile picture switching
            await self._enable_profile_picture_switching()
            
            # 6. Re-enable image generation
            await self._enable_image_generation()
            
            # 7. Resume autonomous engine
            await self._resume_autonomous_engine()
            
            # 8. Resume scheduled events
            await self._resume_scheduled_events()
            
            # 9. Resume figurine notifier
            await self._resume_figurine_notifier()
            
            # 10. Restore original GPU (optional)
            # Keep AMD for now to avoid extra switching
            
            self.active_session = None

3.3 Detailed Resource Lock Methods

Each resource lock/unlock requires specific implementation:

3.3.1 Text Channel Pausing

async def _pause_text_channels(self):
    """Queue text messages instead of processing during voice session"""
    globals.VOICE_SESSION_ACTIVE = True
    globals.TEXT_MESSAGE_QUEUE = []
    logger.info("Text channels paused (messages will be queued)")

async def _resume_text_channels(self):
    """Process queued messages after voice session"""
    globals.VOICE_SESSION_ACTIVE = False
    queued_count = len(globals.TEXT_MESSAGE_QUEUE)
    logger.info(f"Resuming text channels, processing {queued_count} queued messages")
    # Process queue in background task
    asyncio.create_task(self._process_message_queue())
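
The queue drain task referenced above is not defined elsewhere in this plan; a minimal sketch follows, assuming queue entries are (timestamp, message) tuples and that handle_text_message is a hypothetical name for whatever the normal message-processing path is factored into:

async def _process_message_queue(self):
    """Drain messages queued during the voice session, oldest first."""
    while globals.TEXT_MESSAGE_QUEUE:
        _timestamp, message = globals.TEXT_MESSAGE_QUEUE.pop(0)
        try:
            # Re-dispatch through the regular handler (hypothetical name)
            await handle_text_message(message)
        except Exception as e:
            logger.error(f"Failed to process queued message: {e}")
        await asyncio.sleep(1)  # pace replies to stay under Discord rate limits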

3.3.2 Bipolar Mode Disabling

async def _disable_bipolar_mode(self):
    """Prevent Miku/Evil Miku arguments during voice session"""
    from utils.bipolar_mode import pause_bipolar_interactions
    pause_bipolar_interactions()
    logger.info("Bipolar mode interactions disabled")

async def _enable_bipolar_mode(self):
    """Re-enable Miku/Evil Miku arguments after voice session"""
    from utils.bipolar_mode import resume_bipolar_interactions
    resume_bipolar_interactions()
    logger.info("Bipolar mode interactions re-enabled")

3.3.3 Profile Picture Switching

async def _disable_profile_picture_switching(self):
    """Lock profile picture during voice session"""
    from utils.profile_picture_manager import profile_picture_manager
    profile_picture_manager.lock_switching()
    logger.info("Profile picture switching disabled")

async def _enable_profile_picture_switching(self):
    """Unlock profile picture after voice session"""
    from utils.profile_picture_manager import profile_picture_manager
    profile_picture_manager.unlock_switching()
    logger.info("Profile picture switching re-enabled")

3.3.4 Image Generation Blocking

async def _disable_image_generation(self):
    """Block ComfyUI image generation during voice session"""
    globals.IMAGE_GENERATION_BLOCKED = True
    globals.IMAGE_GENERATION_BLOCK_MESSAGE = (
        "🎤 I can't draw right now, I'm talking in voice chat! "
        "Ask me again after I leave the voice channel."
    )
    logger.info("Image generation disabled")

async def _enable_image_generation(self):
    """Re-enable image generation after voice session"""
    globals.IMAGE_GENERATION_BLOCKED = False
    globals.IMAGE_GENERATION_BLOCK_MESSAGE = None
    logger.info("Image generation re-enabled")

3.3.5 Autonomous Engine Pausing

async def _pause_autonomous_engine(self):
    """Pause autonomous message generation during voice session"""
    from utils.autonomous import pause_autonomous_system
    pause_autonomous_system()
    logger.info("Autonomous engine paused")

async def _resume_autonomous_engine(self):
    """Resume autonomous message generation after voice session"""
    from utils.autonomous import resume_autonomous_system
    resume_autonomous_system()
    logger.info("Autonomous engine resumed")

3.3.6 Scheduled Events Pausing

async def _pause_scheduled_events(self):
    """Pause all scheduled jobs during voice session"""
    globals.scheduler.pause()
    logger.info("Scheduled events paused")

async def _resume_scheduled_events(self):
    """Resume scheduled jobs after voice session"""
    globals.scheduler.resume()
    logger.info("Scheduled events resumed")

3.3.7 Figurine Notifier Pausing

async def _pause_figurine_notifier(self):
    """Pause figurine notifications during voice session"""
    # Assuming figurine notifier is a scheduled job
    try:
        globals.scheduler.pause_job('figurine_notifier')
        logger.info("Figurine notifier paused")
    except Exception as e:
        logger.warning(f"Could not pause figurine notifier: {e}")

async def _resume_figurine_notifier(self):
    """Resume figurine notifications after voice session"""
    try:
        globals.scheduler.resume_job('figurine_notifier')
        logger.info("Figurine notifier resumed")
    except Exception as e:
        logger.warning(f"Could not resume figurine notifier: {e}")

3.3.8 Vision Model Blocking

async def _block_vision_model(self):
    """Prevent vision model from loading during voice session"""
    globals.VISION_MODEL_BLOCKED = True
    logger.info("Vision model blocked")

async def _unblock_vision_model(self):
    """Allow vision model to load after voice session"""
    globals.VISION_MODEL_BLOCKED = False
    logger.info("Vision model unblocked")

3.3.9 GPU Switching

async def _switch_to_amd_gpu(self):
    """Switch text inference to AMD GPU"""
    gpu_state_file = os.path.join("memory", "gpu_state.json")
    with open(gpu_state_file, "w") as f:
        json.dump({"current_gpu": "amd"}, f)
    logger.info("Switched to AMD GPU for text inference")

3.4 Feature-Specific Response Handlers

Image Generation Request Handler:

# In bot message handler, before processing image generation
if globals.IMAGE_GENERATION_BLOCKED:
    await message.channel.send(globals.IMAGE_GENERATION_BLOCK_MESSAGE)
    await message.add_reaction('🎤')
    return

Vision Model Request Handler:

# In image/video handling code
if globals.VISION_MODEL_BLOCKED:
    await message.channel.send(
        "🎤 I can't look at images right now, I'm in voice chat! "
        "Send it again after I leave."
    )
    return

Bipolar Argument Trigger Handler:

# In bipolar_mode.py trigger detection
from utils.voice_manager import voice_manager

if voice_manager.active_session:
    logger.info("Bipolar argument blocked (voice session active)")
    return  # Skip argument trigger

3.5 Required Module Modifications

The following existing modules need to be updated to check voice session state:

3.5.1 bot/utils/bipolar_mode.py

Add checks before:

  • Argument triggers based on score thresholds
  • Persona dialogue initiations
  • Any webhook-based Miku/Evil Miku interactions

# At the top of functions that trigger bipolar interactions
from utils.voice_manager import voice_manager

if voice_manager.active_session:
    logger.debug("Bipolar interaction blocked (voice session active)")
    return

3.5.2 bot/utils/profile_picture_manager.py

Add locking mechanism:

class ProfilePictureManager:
    def __init__(self):
        self.switching_locked = False
        
    def lock_switching(self):
        """Lock profile picture changes during voice session"""
        self.switching_locked = True
        
    def unlock_switching(self):
        """Unlock profile picture changes after voice session"""
        self.switching_locked = False
        
    async def update_profile_picture(self, ...):
        if self.switching_locked:
            logger.info("Profile picture change blocked (voice session active)")
            return
        # ... normal update logic ...

3.5.3 bot/utils/autonomous.py

Add pause/resume functions:

_autonomous_paused = False

def pause_autonomous_system():
    """Pause autonomous message generation"""
    global _autonomous_paused
    _autonomous_paused = True
    
def resume_autonomous_system():
    """Resume autonomous message generation"""
    global _autonomous_paused
    _autonomous_paused = False
    
# In autonomous trigger functions:
def should_send_autonomous_message():
    if _autonomous_paused:
        return False
    # ... normal logic ...

3.5.4 bot/bot.py (main message handler)

Add checks for image generation and vision:

@globals.client.event
async def on_message(message):
    # ... existing checks ...
    
    # Check if image generation is blocked
    if "draw" in message.content.lower() and globals.IMAGE_GENERATION_BLOCKED:
        await message.channel.send(globals.IMAGE_GENERATION_BLOCK_MESSAGE)
        await message.add_reaction('🎤')
        return
    
    # Check if vision model is blocked (images/videos)
    if (message.attachments or message.embeds) and globals.VISION_MODEL_BLOCKED:
        await message.channel.send(
            "🎤 I can't look at images or videos right now, I'm talking in voice chat! "
            "Send it again after I leave."
        )
        return
    
    # ... rest of message handling ...

3.5.5 bot/commands/actions.py (ComfyUI integration)

Add check before image generation:

async def handle_image_generation(message, prompt):
    if globals.IMAGE_GENERATION_BLOCKED:
        await message.channel.send(globals.IMAGE_GENERATION_BLOCK_MESSAGE)
        await message.add_reaction('🎤')
        return
    
    # ... normal image generation logic ...

3.6 Text Channel Pausing Strategy

Options:

Option A: Queue Messages (Recommended)

  • Store incoming messages in a queue during voice session
  • Process queue after voice session ends
  • Pros: No messages lost, users get responses eventually
  • Cons: Responses delayed until voice session ends

Option B: Ignore Messages

  • Simply don't respond to text channels during voice session
  • Send status message: "🎤 Miku is currently in voice chat..."
  • Pros: Simple, clear behavior
  • Cons: Users might think bot is broken

Recommendation: Option A with status indicator

  • Queue messages with timestamps
  • Set bot status to "🎤 In Voice Chat"
  • Process queue in order after session ends
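
A sketch of the corresponding on_message check (the voice prompt channel itself stays live, as in section 5.3):

# In on_message, before normal inference (assumes time, globals, and
# voice_manager are imported at the top of bot/bot.py)
if globals.VOICE_SESSION_ACTIVE:
    session = voice_manager.active_session
    if session and message.channel.id != session.text_channel.id:
        globals.TEXT_MESSAGE_QUEUE.append((time.time(), message))
        return  # processed after the voice session ends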

4. Technical Implementation Details

4.1 Discord Voice Integration

Required Package:

pip install PyNaCl  # Required for voice support

Voice Connection Flow:

import discord

# Connect to voice channel
voice_channel = client.get_channel(voice_channel_id)
voice_client = await voice_channel.connect()

# Create audio source from stream
audio_source = VoiceStreamSource(websocket_url)

# Play audio
voice_client.play(audio_source)

# Disconnect
await voice_client.disconnect()

Key Classes:

  • discord.VoiceClient - Handles voice connection
  • discord.AudioSource - Abstract base for audio streaming
  • discord.PCMAudio - Raw PCM audio source (16-bit, 48kHz, stereo)

4.2 Custom Audio Source for TTS Stream

Implementation:

import asyncio
import logging

import discord
import numpy as np
import websockets

logger = logging.getLogger(__name__)

class MikuVoiceSource(discord.AudioSource):
    """
    Streams audio from RVC WebSocket to Discord voice channel.
    """
    def __init__(self, websocket_url="ws://localhost:8765/ws/stream"):
        self.websocket_url = websocket_url
        self.ws = None
        self.audio_queue = asyncio.Queue(maxsize=100)
        self.running = False
        self.frame_size = 3840  # 20ms @ 48kHz stereo (960 samples * 2 channels * 2 bytes)
        
    async def connect(self):
        """Connect to TTS WebSocket"""
        self.ws = await websockets.connect(self.websocket_url)
        self.running = True
        asyncio.create_task(self._receive_audio())
        
    async def _receive_audio(self):
        """Receive audio from WebSocket and queue for playback"""
        while self.running:
            try:
                audio_bytes = await self.ws.recv()
                # Convert float32 mono to int16 stereo
                audio_data = self._process_audio(audio_bytes)
                await self.audio_queue.put(audio_data)
            except Exception as e:
                logger.error(f"Audio receive error: {e}")
                break
                
    def _process_audio(self, audio_bytes):
        """
        Convert float32 mono @ 48kHz to int16 stereo @ 48kHz for Discord.
        Discord expects: 16-bit PCM, 48kHz, stereo
        """
        # Decode float32
        audio = np.frombuffer(audio_bytes, dtype=np.float32)
        
        # Convert to int16
        audio_int16 = (audio * 32767).clip(-32768, 32767).astype(np.int16)
        
        # Convert mono to stereo (duplicate channel)
        audio_stereo = np.repeat(audio_int16, 2)
        
        return audio_stereo.tobytes()
        
    def read(self):
        """
        Called by Discord.py to get next audio frame (20ms).
        Must be synchronous and return exactly 3840 bytes.
        """
        try:
            # Get from queue (non-blocking)
            audio_chunk = self.audio_queue.get_nowait()
            
            # Ensure exactly frame_size bytes
            if len(audio_chunk) < self.frame_size:
                # Pad with silence
                audio_chunk += b'\x00' * (self.frame_size - len(audio_chunk))
            elif len(audio_chunk) > self.frame_size:
                # Trim excess
                audio_chunk = audio_chunk[:self.frame_size]
                
            return audio_chunk
        except asyncio.QueueEmpty:
            # No audio available, return silence
            return b'\x00' * self.frame_size
            
    def cleanup(self):
        """Clean up resources"""
        self.running = False
        if self.ws:
            asyncio.create_task(self.ws.close())

4.3 LLM Streaming Integration

Llama.cpp Streaming Endpoint:

POST http://llama-swap-amd:8080/v1/models/llama3.1/completions
Content-Type: application/json

{
  "prompt": "<full prompt>",
  "stream": true,
  "temperature": 0.8,
  "max_tokens": 500
}

Response (SSE - Server-Sent Events):

data: {"choices":[{"text":"Hello","finish_reason":null}]}

data: {"choices":[{"text":" world","finish_reason":null}]}

data: {"choices":[{"text":"!","finish_reason":"stop"}]}

data: [DONE]

Streaming Handler:

async def stream_llm_to_tts(prompt, websocket):
    """
    Stream LLM tokens directly to TTS WebSocket.
    """
    url = f"{globals.LLAMA_AMD_URL}/v1/models/{globals.TEXT_MODEL}/completions"
    
    payload = {
        "prompt": prompt,
        "stream": True,
        "temperature": 0.8,
        "max_tokens": 500,
        "stop": ["\n\n", "User:", "Assistant:"]
    }
    
    async with aiohttp.ClientSession() as session:
        async with session.post(url, json=payload) as resp:
            async for line in resp.content:
                line = line.decode('utf-8').strip()
                
                if not line.startswith('data: '):
                    continue
                    
                data = line[6:]  # Remove 'data: ' prefix
                
                if data == '[DONE]':
                    # Flush remaining buffer
                    await websocket.send(json.dumps({"flush": True}))
                    break
                
                try:
                    chunk = json.loads(data)
                    token = chunk['choices'][0]['text']
                    
                    # Send token to TTS
                    await websocket.send(json.dumps({
                        "token": token,
                        "pitch_shift": 0
                    }))
                    
                except Exception as e:
                    logger.error(f"Token parse error: {e}")

4.4 Voice Session Implementation

Main Session Class:

class VoiceSession:
    """
    Manages a single voice chat session.
    """
    def __init__(self, guild_id, voice_channel, text_channel):
        self.guild_id = guild_id
        self.voice_channel = voice_channel
        self.text_channel = text_channel
        self.voice_client = None
        self.audio_source = None
        self.tts_websocket = None
        self.active = False
        
    async def connect(self):
        """Connect to voice channel and TTS pipeline"""
        # 1. Connect to Discord voice
        self.voice_client = await self.voice_channel.connect()
        
        # 2. Connect to TTS WebSocket
        self.tts_websocket = await websockets.connect("ws://localhost:8765/ws/stream")
        
        # 3. Create audio source
        self.audio_source = MikuVoiceSource()
        await self.audio_source.connect()
        
        # 4. Start playing audio stream
        self.voice_client.play(self.audio_source)
        
        self.active = True
        logger.info(f"Voice session started in {self.voice_channel.name}")
        
    async def speak(self, prompt):
        """
        Generate speech for given prompt.
        Streams LLM tokens → TTS → Discord voice.
        """
        if not self.active:
            raise Exception("Voice session not active")
        
        # Build full LLM prompt with context
        full_prompt = await self._build_llm_prompt(prompt)
        
        # Stream tokens to TTS
        await stream_llm_to_tts(full_prompt, self.tts_websocket)
        
    async def _build_llm_prompt(self, user_prompt):
        """Build full prompt with context (similar to query_llama)"""
        # Get mood and context
        from utils.llm import get_context_for_response_type
        from server_manager import server_manager
        
        server_config = server_manager.get_server_config(self.guild_id)
        current_mood = server_config.current_mood_description
        
        miku_context = get_context_for_response_type("server_response")
        
        # Build messages array
        messages = [
            {"role": "system", "content": f"{miku_context}\n\nMiku is currently feeling: {current_mood}"},
            {"role": "user", "content": user_prompt}
        ]
        
        # Convert to llama.cpp prompt format (depends on model)
        # For Llama 3.1:
        prompt = "<|begin_of_text|>"
        for msg in messages:
            if msg["role"] == "system":
                prompt += f"<|start_header_id|>system<|end_header_id|>\n\n{msg['content']}<|eot_id|>"
            elif msg["role"] == "user":
                prompt += f"<|start_header_id|>user<|end_header_id|>\n\n{msg['content']}<|eot_id|>"
        prompt += "<|start_header_id|>assistant<|end_header_id|>\n\n"
        
        return prompt
        
    async def disconnect(self):
        """Disconnect from voice and cleanup"""
        self.active = False
        
        # Stop audio playback
        if self.voice_client and self.voice_client.is_playing():
            self.voice_client.stop()
        
        # Disconnect from voice
        if self.voice_client:
            await self.voice_client.disconnect()
        
        # Close TTS WebSocket
        if self.tts_websocket:
            await self.tts_websocket.close()
        
        # Cleanup audio source
        if self.audio_source:
            self.audio_source.cleanup()
        
        logger.info("Voice session ended")

5. Command Implementation

5.1 Voice Commands

New commands to add:

  1. !miku join [#voice-channel]

    • Join specified voice channel (or user's current channel)
    • Set text channel as prompt input channel
    • Lock resources and start session
  2. !miku leave

    • Leave current voice channel
    • Release resources
    • Resume normal operation
  3. !miku voice-status

    • Show current voice session status
    • Show active prompt channel
    • Show resource allocation

5.2 Command Router Integration

Add to bot/command_router.py:

from commands.voice import handle_voice_command

# In route_command():
if cmd in ['join', 'leave', 'voice-status', 'speak']:
    return await handle_voice_command(message, cmd, args)

New file bot/commands/voice.py:

from utils.voice_manager import voice_manager

async def handle_voice_command(message, cmd, args):
    """Handle voice-related commands"""
    
    if cmd == 'join':
        # Get voice channel
        if args and args[0].startswith('<#'):
            # Channel mentioned
            channel_id = int(args[0][2:-1])
            voice_channel = message.guild.get_channel(channel_id)
        else:
            # Use user's current voice channel
            if message.author.voice:
                voice_channel = message.author.voice.channel
            else:
                await message.channel.send("❌ You must be in a voice channel!")
                return
        
        try:
            await voice_manager.start_session(
                message.guild.id,
                voice_channel,
                message.channel  # Use current text channel for prompts
            )
            await message.channel.send(f"🎤 Joined {voice_channel.name}! Send messages here to make me speak.")
        except Exception as e:
            await message.channel.send(f"❌ Failed to join voice: {e}")
    
    elif cmd == 'leave':
        if not voice_manager.active_session:
            await message.channel.send("❌ I'm not in a voice channel!")
            return
        
        await voice_manager.end_session()
        await message.channel.send("👋 Left voice channel!")
    
    elif cmd == 'voice-status':
        if voice_manager.active_session:
            session = voice_manager.active_session
            await message.channel.send(
                f"🎤 **Voice Session Active**\n"
                f"Voice Channel: {session.voice_channel.name}\n"
                f"Prompt Channel: {session.text_channel.mention}\n"
                f"GPU: AMD RX 6800 (text only)\n"
                f"Text Channels: Paused (queued)"
            )
        else:
            await message.channel.send("No active voice session")

5.3 Text Channel Prompt Handler

Modify bot/bot.py on_message handler:

@globals.client.event
async def on_message(message):
    if message.author == globals.client.user:
        return
    
    # Check if this is voice prompt channel
    if voice_manager.active_session:
        session = voice_manager.active_session
        if message.channel.id == session.text_channel.id:
            # This is a voice prompt
            await session.speak(message.content)
            await message.add_reaction('🎤')  # Acknowledge
            return
    
    # ... rest of normal message handling ...

6. Implementation Phases

Phase 1: Foundation (3-4 hours)

Goal: Set up basic voice connection and resource management

Tasks:

  1. Install PyNaCl dependency
  2. Add global state variables to globals.py
  3. Create bot/utils/voice_manager.py
  4. Implement VoiceSessionManager singleton
  5. Implement all resource locking methods:
    • GPU switching
    • Vision model blocking
    • Text channel pausing
    • Bipolar mode disabling
    • Profile picture lock
    • Image generation blocking
    • Autonomous engine pause
    • Scheduled events pause
    • Figurine notifier pause
  6. Add feature-specific response handlers (image gen, vision model)
  7. Test voice connection without TTS

Deliverables:

  • Voice channel join/leave working
  • All resource locks functional
  • Text channels properly paused during session
  • All features properly disabled/re-enabled around sessions
  • Hardcoded responses for blocked features

Phase 2: Audio Streaming (3-4 hours)

Goal: Implement TTS audio streaming to Discord

Tasks:

  1. Create MikuVoiceSource class
  2. Implement WebSocket → Discord audio bridge
  3. Handle audio format conversion (float32 mono → int16 stereo)
  4. Implement frame buffering and timing
  5. Test with static text (no LLM streaming yet)

Deliverables:

  • Audio plays in Discord voice channel
  • TTS pipeline outputs correctly formatted audio
  • No audio glitches or timing issues

Phase 3: LLM Streaming Integration (2-3 hours)

Goal: Connect LLM token stream to TTS pipeline

Tasks:

  1. Implement stream_llm_to_tts() function
  2. Handle SSE parsing from llama.cpp
  3. Build proper prompt with context/mood
  4. Test token-by-token streaming
  5. Handle edge cases (connection drops, errors)

Deliverables:

  • LLM tokens stream to TTS in real-time
  • Audio starts playing quickly (1-2s latency)
  • Natural sentence boundaries respected

Phase 4: Commands & UX (1-2 hours)

Goal: Polish user interface and commands

Tasks:

  1. Create bot/commands/voice.py
  2. Add commands to router
  3. Implement status messages
  4. Add error handling and user feedback
  5. Test edge cases (multiple join attempts, etc.)

Deliverables:

  • All voice commands working
  • Clear user feedback
  • Graceful error handling

Phase 5: Testing & Refinement (2-3 hours)

Goal: Ensure stability and performance

Tasks:

  1. Load testing (long sessions, many prompts)
  2. Resource leak detection
  3. Audio quality verification
  4. Latency optimization
  5. Documentation and README updates

Deliverables:

  • Stable voice sessions (no crashes)
  • Optimal latency (target: <2s first audio)
  • Updated documentation

Total Estimated Time: 12-18 hours


7. Error Handling & Edge Cases

7.1 Common Error Scenarios

1. TTS Pipeline Unavailable

  • Symptom: Can't connect to WebSocket
  • Response: Return error, don't start voice session
  • Message: " TTS pipeline not available. Check soprano/rvc containers."

2. Voice Channel Full

  • Symptom: Can't join voice channel (user limit)
  • Response: Return error
  • Message: " Voice channel is full!"

3. Already in Voice Session

  • Symptom: User tries to join while session active
  • Response: Reject command
  • Message: " Already in a voice session! Use !miku leave first."

4. LLM Timeout

  • Symptom: LLM doesn't respond within timeout
  • Response: Send silence, log error
  • Message: "(in voice) Miku seems confused..."

5. Audio Buffer Underrun

  • Symptom: TTS slower than playback rate
  • Response: Pad with silence, don't crash
  • Log: Warning about buffer underrun

6. Blocked Feature Attempted During Voice

  • Symptom: User tries to generate an image, send an image, or trigger bipolar mode
  • Response: Send appropriate blocked feature message
  • Examples:
    • "🎤 I can't draw right now, I'm talking in voice chat!"
    • "🎤 I can't look at images right now, I'm talking in voice chat!"
  • Log: Feature block triggered

7. Resource Cleanup Failure

  • Symptom: Feature doesn't resume after voice session
  • Response: Log error, attempt manual cleanup
  • Fallback: Restart bot if critical features stuck

7.2 Graceful Degradation

Priority Order:

  1. Keep bot online (don't crash)
  2. Maintain voice connection if possible
  3. Inform user of issues
  4. Fallback to text if voice fails

Fallback Strategy:

async def speak_with_fallback(session, prompt):
    """Speak in voice, fallback to text if error"""
    try:
        await session.speak(prompt)
    except Exception as e:
        logger.error(f"Voice speak failed: {e}")
        # Fallback: send text response
        response = await query_llama(prompt, ...)
        await session.text_channel.send(f"⚠️ (Voice failed, text mode): {response}")

8. Performance Optimization

8.1 Latency Reduction Strategies

Target: <2 seconds from prompt to first audio

Optimization Points:

  1. Pre-warm TTS connection

    • Keep WebSocket connected during session
    • Reduce handshake overhead
  2. Reduce LLM prompt length

    • Limit conversation history to 4 messages
    • Truncate long context
  3. Parallel processing

    • Start TTS as soon as first token arrives
    • Don't wait for full sentence
  4. Buffer tuning

    • Keep audio buffer small (5-10 chunks max)
    • Balance latency vs. smoothness
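
Points 3 and 4 interact: tokens should be forwarded to TTS immediately, while a flush at sentence punctuation (the flush field from the /ws/stream protocol) keeps pauses natural without waiting for the full response. A minimal sketch:

import json

SENTENCE_ENDINGS = ('.', '!', '?', '…')

async def send_token_to_tts(websocket, token):
    """Forward one token immediately; request a flush at sentence boundaries."""
    flush = token.rstrip().endswith(SENTENCE_ENDINGS)
    await websocket.send(json.dumps({"token": token, "flush": flush}))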

8.2 Resource Monitoring

Metrics to Track:

  • VRAM usage (AMD GPU during session)
  • CPU usage (RVC/Soprano processing)
  • Audio buffer fill level
  • LLM token rate (tokens/second)
  • End-to-end latency (prompt → audio)

Implementation:

class VoiceMetrics:
    def __init__(self):
        self.prompt_times = []
        self.first_audio_times = []
        self.total_tokens = 0
        
    def log_prompt(self):
        self.prompt_times.append(time.time())
        
    def log_first_audio(self):
        if self.prompt_times:
            latency = time.time() - self.prompt_times[-1]
            self.first_audio_times.append(latency)
            logger.info(f"First audio latency: {latency:.2f}s")

9. Testing Plan

9.1 Unit Tests

Components to Test:

  1. MikuVoiceSource.read() - Audio framing
  2. stream_llm_to_tts() - Token streaming
  3. VoiceSessionManager - Resource locking
  4. Audio format conversion
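
For example, the read() framing contract from section 4.2 can be pinned down with two pytest cases (a sketch; assumes MikuVoiceSource lives in bot/utils/voice_stream.py and Python 3.10+, where asyncio.Queue can be constructed outside a running event loop):

from utils.voice_stream import MikuVoiceSource

def test_read_pads_short_chunk_to_one_frame():
    src = MikuVoiceSource()
    src.audio_queue.put_nowait(b'\x01\x02' * 10)  # 20 bytes, well under one frame
    frame = src.read()
    assert len(frame) == 3840  # exactly one 20ms frame: 48kHz, 16-bit, stereo

def test_read_returns_silence_when_queue_empty():
    src = MikuVoiceSource()
    assert src.read() == b'\x00' * 3840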

9.2 Integration Tests

Test Scenarios:

  1. Full voice session lifecycle (join → speak → leave)
  2. Resource cleanup after session
  3. Text channel pause/resume
  4. Multiple prompts in quick succession
  5. Long prompts (500+ characters)
  6. Error recovery (TTS crash, LLM timeout)

9.3 Feature Blocking Tests

Test each blocked feature during voice session:

  1. Vision Model Blocking

    • Send image to Miku while in voice
    • Verify blocked message appears
    • Confirm no vision model loaded (check logs)
    • After leaving voice, send image again
    • Verify vision model works normally
  2. Image Generation Blocking

    • Try "draw [prompt]" while in voice
    • Verify custom blocked message appears
    • Confirm ComfyUI not called
    • After leaving voice, try draw again
    • Verify image generation works normally
  3. Bipolar Mode Blocking

    • Trigger bipolar argument score while in voice
    • Verify no argument starts
    • Check logs for block message
    • After leaving voice, verify bipolar mode resumes
  4. Profile Picture Blocking

    • Trigger profile picture change while in voice
    • Verify avatar doesn't change
    • After leaving voice, verify pfp switching works
  5. Autonomous Engine Blocking

    • Wait for autonomous message trigger while in voice
    • Verify no autonomous messages sent
    • After leaving voice, verify autonomous resumes
  6. Scheduled Events Blocking

    • Join voice near scheduled event time
    • Verify event doesn't fire during session
    • After leaving voice, verify scheduler active
  7. Text Channel Queuing

    • Send regular message while in voice
    • Verify no response during session
    • Verify message queued (check logs)
    • After leaving voice, verify queued messages processed

9.4 Manual Testing Checklist

  • Join voice channel via command
  • Bot appears in voice channel
  • Send prompt in text channel
  • Audio plays in voice channel within 2s
  • Audio quality is clear (no glitches)
  • Multiple prompts work in sequence
  • Leave command works
  • Resources released after leave
  • Text channels resume normal operation
  • Bot can rejoin after leaving

10. Future Enhancements (Post-MVP)

10.1 Speech-to-Text (STT) Integration

Goal: Allow users to speak to Miku instead of typing

Approach:

  • Use Whisper model for STT
  • Run on AMD GPU during voice sessions
  • Stream audio → text → LLM → TTS → audio

10.2 Multi-User Voice Conversations

Goal: Multiple users can take turns speaking

Approach:

  • Voice activity detection (VAD)
  • Queue speaker turns
  • Name prefixes in prompts ("User1: ...", "User2: ...")

10.3 Background Music/Sound Effects

Goal: Play background music while Miku speaks

Approach:

  • Mix audio streams (voice + music)
  • Volume ducking (lower music during speech)
  • FFmpeg audio processing

10.4 Voice Commands

Goal: Control bot via voice ("Miku, leave voice chat")

Approach:

  • Simple keyword detection in STT output
  • Command routing from voice input

10.5 Emotion-Aware Speech

Goal: Vary TTS pitch/speed based on mood

Approach:

  • Map mood → pitch_shift parameter
  • Dynamic pitch based on emotion detection

11. Configuration & Deployment

11.1 Environment Variables

Add to docker-compose.yml:

miku-bot:
  environment:
    - VOICE_ENABLED=true
    - TTS_WEBSOCKET_URL=ws://miku-rvc-api:8765/ws/stream
    - VOICE_GPU=amd  # Force AMD GPU during voice sessions
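
Consuming these in bot/globals.py could look like the following sketch (the Python-side names are assumptions):

import os

VOICE_ENABLED = os.environ.get("VOICE_ENABLED", "false").lower() == "true"
TTS_WEBSOCKET_URL = os.environ.get("TTS_WEBSOCKET_URL", "ws://miku-rvc-api:8765/ws/stream")
VOICE_GPU = os.environ.get("VOICE_GPU", "amd")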

11.2 Network Configuration

Ensure containers can communicate:

  • miku-bot → miku-soprano-tts (ZMQ 5555)
  • miku-bot → miku-rvc-api (HTTP/WS 8765)
  • miku-bot → llama-swap-amd (HTTP 8080)

Add to docker-compose networks if needed:

networks:
  miku-network:
    name: miku-network
    driver: bridge

11.3 Dependencies

Add to bot/requirements.txt:

PyNaCl>=1.5.0  # Voice support
websockets>=12.0  # TTS WebSocket client

11.4 Global State Variables

Add to bot/globals.py:

# Voice Chat Session State
VOICE_SESSION_ACTIVE = False
TEXT_MESSAGE_QUEUE = []  # Queue for messages received during voice session

# Feature Blocking Flags (set during voice session)
VISION_MODEL_BLOCKED = False
IMAGE_GENERATION_BLOCKED = False
IMAGE_GENERATION_BLOCK_MESSAGE = None

12. Risk Assessment

12.1 Technical Risks

High Risk:

  • Audio glitches/stuttering - Mitigation: Extensive buffer testing
  • Resource exhaustion - Mitigation: Strict resource locking
  • TTS pipeline crashes - Mitigation: Health checks, auto-restart

Medium Risk:

  • High latency - Mitigation: Optimization, parallel processing
  • Connection drops - Mitigation: Retry logic, graceful degradation

Low Risk:

  • Command conflicts - Mitigation: Clear command names
  • User confusion - Mitigation: Status messages, documentation

12.2 Resource Risks

Concern: AMD GPU overload (RVC + LLM simultaneously)

Mitigation:

  1. Monitor VRAM usage during testing
  2. Reduce RVC batch size if needed
  3. Consider limiting LLM context length
  4. Add VRAM threshold checks
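
For point 4, a sketch of a threshold check built on rocm-smi; output format differs between ROCm versions, so the parsing below is an assumption to verify on the target system:

import subprocess

VRAM_LIMIT_BYTES = 14 * 1024**3  # matches the <14GB target in section 14.3

def amd_vram_used_bytes():
    """Parse used VRAM from rocm-smi output; returns 0 if parsing fails."""
    out = subprocess.run(
        ["rocm-smi", "--showmeminfo", "vram"],
        capture_output=True, text=True,
    ).stdout
    for line in out.splitlines():
        if "Used Memory" in line:
            return int(line.split(":")[-1].strip())
    return 0

def vram_headroom_ok():
    return amd_vram_used_bytes() < VRAM_LIMIT_BYTES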

Concern: Text channel message queue overflow

Mitigation:

  1. Limit queue size (e.g., 100 messages)
  2. Discard oldest messages if limit reached
  3. Send warning to users
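
All three mitigations fit in a bounded deque, as sketched below (MAX_QUEUED_MESSAGES is an assumed constant; if the queue becomes a deque, the drain loop in section 3.3.1 should use popleft() instead of pop(0)):

import time
from collections import deque

MAX_QUEUED_MESSAGES = 100
TEXT_MESSAGE_QUEUE = deque(maxlen=MAX_QUEUED_MESSAGES)  # a full deque drops its oldest entry on append

async def queue_text_message(message):
    if len(TEXT_MESSAGE_QUEUE) == MAX_QUEUED_MESSAGES:
        # The oldest entry is about to be displaced; warn the channel
        await message.channel.send(
            "⚠️ I'm in voice chat and my message queue is full; oldest messages are being dropped."
        )
    TEXT_MESSAGE_QUEUE.append((time.time(), message))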

13. Documentation Requirements

13.1 User Documentation

Create VOICE_CHAT_USER_GUIDE.md:

  • How to invite Miku to voice channel
  • How to send prompts
  • Troubleshooting common issues
  • Feature limitations

13.2 Developer Documentation

Create VOICE_CHAT_DEVELOPER_GUIDE.md:

  • Architecture overview
  • Code organization
  • Adding new voice features
  • Debugging tips

13.3 API Documentation

Document in API_REFERENCE.md:

  • Voice command endpoints
  • VoiceSessionManager API
  • MikuVoiceSource interface

14. Success Criteria

14.1 Functional Requirements ✓

  • Miku can join voice channel
  • Miku can speak using TTS pipeline
  • LLM tokens stream in real-time
  • Text prompts trigger voice responses
  • Resource management prevents conflicts
  • Graceful session cleanup

14.2 Resource Management Requirements

  • GPU switches to AMD during session
  • Vision model blocked during session
  • Text channels paused (messages queued)
  • Bipolar mode interactions disabled
  • Profile picture switching locked
  • Image generation blocked with custom message
  • Autonomous engine paused
  • Scheduled events paused
  • Figurine notifier paused
  • All features resume after session ends

14.3 Performance Requirements

  • First audio within 2 seconds of prompt
  • Audio quality: Clear, no glitches
  • VRAM usage: <14GB on AMD GPU
  • Sessions stable for 30+ minutes

14.4 Usability Requirements

  • Commands intuitive and documented
  • Error messages clear and actionable
  • Status indicators show session state
  • Fallback to text on voice failure
  • Helpful blocked feature messages

15. Next Steps

Immediate Actions:

  1. Review this plan with team/stakeholders
  2. Set up development branch (feature/voice-chat)
  3. Install dependencies (PyNaCl, test WebSocket connectivity)
  4. Create skeleton files (voice_manager.py, voice.py commands)
  5. Start Phase 1 implementation

Before Starting Implementation:

  1. Verify soprano/rvc containers are healthy
  2. Test WebSocket endpoint manually (websocket_client_example.py)
  3. Verify AMD GPU has sufficient VRAM (check with rocm-smi)
  4. Back up current bot state (in case rollback needed)

Development Workflow:

  1. Create feature branch
  2. Implement phase-by-phase
  3. Test each phase before moving to next
  4. Document changes in commit messages
  5. Merge to main when MVP complete

Appendix A: File Structure

miku-discord/
├── bot/
│   ├── commands/
│   │   ├── actions.py
│   │   └── voice.py          # NEW: Voice commands
│   ├── utils/
│   │   ├── llm.py            # MODIFY: Add streaming support
│   │   ├── voice_manager.py  # NEW: Session management
│   │   └── voice_stream.py   # NEW: Audio streaming classes
│   ├── bot.py                # MODIFY: Add voice prompt handling
│   └── command_router.py     # MODIFY: Add voice command routing
├── soprano_to_rvc/
│   ├── soprano_rvc_api.py    # EXISTING: WebSocket endpoint
│   └── docker-compose.yml    # EXISTING: TTS containers
├── docker-compose.yml        # MODIFY: Add environment vars
└── readmes/
    ├── VOICE_CHAT_IMPLEMENTATION_PLAN.md  # THIS FILE
    ├── VOICE_CHAT_USER_GUIDE.md           # NEW: User docs
    └── VOICE_CHAT_DEVELOPER_GUIDE.md      # NEW: Dev docs

Appendix B: Key Code Snippets

B.1 Streaming LLM Tokens

async def stream_llm_tokens(prompt):
    """Stream tokens from llama-swap-amd"""
    url = f"{globals.LLAMA_AMD_URL}/v1/models/{globals.TEXT_MODEL}/completions"
    async with aiohttp.ClientSession() as session:
        async with session.post(url, json={"prompt": prompt, "stream": True}) as resp:
            async for line in resp.content:
                line = line.strip()
                if not line.startswith(b'data: '):
                    continue
                data = line[6:]
                if data == b'[DONE]':
                    break  # end-of-stream marker, not JSON
                chunk = json.loads(data)
                if 'choices' in chunk:
                    yield chunk['choices'][0]['text']

B.2 Discord Voice Connection

voice_client = await voice_channel.connect()
audio_source = MikuVoiceSource()
await audio_source.connect()
voice_client.play(audio_source)

B.3 Resource Lock Pattern

async with voice_manager.session_lock:
    await switch_to_amd_gpu()
    await block_vision_model()
    await pause_text_channels()
    # ... start session ...

Appendix C: Troubleshooting Guide

Issue: "TTS pipeline not available"

Cause: RVC container not running or WebSocket unreachable

Fix:

cd soprano_to_rvc
docker-compose up -d
docker-compose logs rvc

Issue: Audio stuttering/glitching

Cause: Buffer underrun (TTS too slow)

Fix: Increase audio buffer size in MikuVoiceSource

Issue: High latency (>5s first audio)

Cause: LLM slow to generate tokens

Fix: Reduce prompt length, check GPU utilization

Issue: Voice session hangs on disconnect

Cause: Resource cleanup timeout

Fix: Add timeout to disconnect operations

Issue: Features not resuming after voice session

Cause: Resource unlock methods not called or failed

Fix:

# Check logs for cleanup errors
docker logs miku-bot | grep -i "voice\|resume\|enable"

# Manual fix: Restart bot to reset all states
docker restart miku-bot

Issue: Image generation still works during voice session

Cause: Block check not implemented in the image generation handler

Fix: Add an if globals.IMAGE_GENERATION_BLOCKED check in commands/actions.py

Issue: Bipolar argument triggered during voice

Cause: Block check missing in bipolar_mode.py

Fix: Add an if voice_manager.active_session check before argument triggers


Appendix D: Quick Reference - Resource Blocks

Developer Quick Reference: What to Check Before Each Feature

| Feature/Module | Check This Before Running | Global Flag |
| --- | --- | --- |
| Vision model loading | globals.VISION_MODEL_BLOCKED | VISION_MODEL_BLOCKED |
| Image generation | globals.IMAGE_GENERATION_BLOCKED | IMAGE_GENERATION_BLOCKED |
| Bipolar triggers | voice_manager.active_session | N/A (check object) |
| Profile picture | profile_picture_manager.switching_locked | N/A (check object) |
| Autonomous msgs | _autonomous_paused (in autonomous.py) | N/A (module-level) |
| Scheduled events | N/A (scheduler.pause() called) | N/A (APScheduler) |
| Text channel response | globals.VOICE_SESSION_ACTIVE | VOICE_SESSION_ACTIVE |

Locations to Add Checks:

  • bot/bot.py - Main message handler (vision, image gen, text response)
  • bot/utils/bipolar_mode.py - Argument trigger functions
  • bot/utils/profile_picture_manager.py - Update functions
  • bot/utils/autonomous.py - Message generation functions
  • bot/commands/actions.py - Image generation handler

Conclusion

This implementation plan provides a comprehensive roadmap for adding voice channel functionality to Miku. The phased approach ensures incremental progress with testing at each stage. The resource management strategy carefully balances competing demands on limited hardware.

Key Success Factors:

  1. Strict resource locking prevents conflicts (8 features disabled during voice)
  2. Token streaming minimizes latency (<2s target)
  3. Graceful error handling ensures stability
  4. Clear user feedback improves experience (blocked feature messages)
  5. Comprehensive testing covers all edge cases

Critical Implementation Points:

  • ⚠️ Must disable 8 features during voice: vision, image gen, bipolar, pfp, autonomous, scheduled events, figurine notifier, text channels
  • ⚠️ GPU switching mandatory: AMD RX 6800 for text, GTX 1660 for TTS only
  • ⚠️ User messaging important: Clear feedback when features blocked
  • ⚠️ Cleanup critical: All features must resume properly after session

Estimated Timeline: 12-18 hours for MVP, additional 5-10 hours for polish and testing.

Files to Create:

  • bot/utils/voice_manager.py (main session management)
  • bot/utils/voice_stream.py (audio streaming classes)
  • bot/commands/voice.py (voice commands)

Files to Modify:

  • bot/globals.py (add voice state flags)
  • bot/bot.py (add voice prompt handler, blocking checks)
  • bot/command_router.py (add voice command routing)
  • bot/utils/bipolar_mode.py (add session checks)
  • bot/utils/profile_picture_manager.py (add locking)
  • bot/utils/autonomous.py (add pause/resume)
  • bot/commands/actions.py (add image gen blocking)
  • bot/requirements.txt (add PyNaCl, websockets)

Ready to proceed? Review this plan, make any necessary adjustments, and begin Phase 1 implementation.