miku-discord/VOICE_TO_VOICE_REFERENCE.md

Voice-to-Voice Quick Reference

Complete Pipeline Status

All phases are complete; Phase 4B is deployed and ready for end-to-end testing.

Phase Completion Status

Phase 1: Voice Connection (COMPLETE)

  • Discord voice channel connection
  • Audio playback via discord.py
  • Resource management and cleanup

Phase 2: Audio Streaming (COMPLETE)

  • Soprano TTS server (GTX 1660)
  • RVC voice conversion
  • Real-time streaming via WebSocket
  • Token-by-token synthesis

Phase 3: Text-to-Voice (COMPLETE)

  • LLaMA text generation (AMD RX 6800)
  • Streaming token pipeline
  • TTS integration with !miku say
  • Natural conversation flow
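The streaming token pipeline consumes server-sent events from llama-cpp-server, where each chunk carries a payload like `{"delta": {"content": "token"}}`. A minimal stdlib sketch of the parsing step (the `iter_tokens` helper and the exact SSE framing are illustrative assumptions, not the bot's actual code):

```python
import json
from typing import Iterable, Iterator

def iter_tokens(sse_lines: Iterable[str]) -> Iterator[str]:
    """Yield content tokens from an SSE stream of delta payloads.

    Each event line looks like: data: {"delta":{"content":"Hi"}}
    The stream conventionally ends with: data: [DONE]
    """
    for line in sse_lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        token = chunk.get("delta", {}).get("content")
        if token:
            yield token

# Example with a canned stream:
lines = [
    'data: {"delta":{"content":"Hel"}}',
    'data: {"delta":{"content":"lo"}}',
    "data: [DONE]",
]
print("".join(iter_tokens(lines)))  # → Hello
```

In the real pipeline each yielded token would be forwarded straight to `audio_source.send_token()` so synthesis overlaps with generation.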

Phase 4A: STT Container (COMPLETE)

  • Silero VAD on CPU
  • Faster-Whisper on GTX 1660
  • WebSocket server at port 8001
  • Per-user session management
  • Chunk buffering for VAD

Phase 4B: Bot STT Integration (COMPLETE - READY FOR TESTING)

  • Discord audio capture
  • Opus decode + resampling
  • STT client WebSocket integration
  • Voice commands: !miku listen, !miku stop-listening
  • LLM voice response generation
  • Interruption detection and cancellation
  • /interrupt endpoint in RVC API
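The STT integration's receive loop boils down to routing JSON events from miku-stt to the `voice_manager.py` callbacks. A dependency-free sketch of that dispatch (the `"type"` field values and class name are illustrative assumptions about the wire format):

```python
import json

class STTEventRouter:
    """Route decoded STT WebSocket events to voice-session callbacks."""

    def __init__(self, on_vad_event, on_partial_transcript,
                 on_final_transcript, on_interruption):
        self.on_vad_event = on_vad_event
        self.on_partial_transcript = on_partial_transcript
        self.on_final_transcript = on_final_transcript
        self.on_interruption = on_interruption

    def handle(self, raw: str) -> None:
        event = json.loads(raw)
        kind = event.get("type")
        if kind in ("speech_start", "speaking", "speech_end"):
            self.on_vad_event(event)
        elif kind == "partial_transcript":
            self.on_partial_transcript(event["text"])
        elif kind == "final_transcript":
            self.on_final_transcript(event["text"])
        elif kind == "interruption":
            self.on_interruption()

# Example:
seen = []
router = STTEventRouter(
    on_vad_event=lambda e: seen.append(("vad", e["type"])),
    on_partial_transcript=lambda t: seen.append(("partial", t)),
    on_final_transcript=lambda t: seen.append(("final", t)),
    on_interruption=lambda: seen.append(("interrupt", None)),
)
router.handle('{"type": "speech_start"}')
router.handle('{"type": "final_transcript", "text": "hello miku"}')
print(seen)  # → [('vad', 'speech_start'), ('final', 'hello miku')]
```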

Quick Start Commands

Setup

!miku join              # Join your voice channel
!miku listen            # Start listening to your voice

Usage

  • Speak into your microphone
  • Miku will transcribe your speech
  • Miku will respond with voice
  • Interrupt her by speaking while she's talking

Teardown

!miku stop-listening    # Stop listening to your voice
!miku leave             # Leave voice channel

Architecture Diagram

┌─────────────────────────────────────────────────────────────────┐
│                         USER INPUT                              │
└─────────────────────────────────────────────────────────────────┘
                              │
                              │ Discord Voice (Opus 48kHz)
                              ▼
┌─────────────────────────────────────────────────────────────────┐
│                    miku-bot Container                           │
│  ┌───────────────────────────────────────────────────────────┐ │
│  │ VoiceReceiver (discord.sinks.Sink)                        │ │
│  │  - Opus decode → PCM                                      │ │
│  │  - Stereo → Mono                                          │ │
│  │  - Resample 48kHz → 16kHz                                 │ │
│  └─────────────────┬─────────────────────────────────────────┘ │
│                    │ PCM int16, 16kHz, 20ms chunks              │
│  ┌─────────────────▼─────────────────────────────────────────┐ │
│  │ STTClient (WebSocket)                                     │ │
│  │  - Sends audio to miku-stt                                │ │
│  │  - Receives VAD events, transcripts                       │ │
│  └─────────────────┬─────────────────────────────────────────┘ │
└────────────────────┼───────────────────────────────────────────┘
                     │ ws://miku-stt:8001/ws/stt/{user_id}
                     ▼
┌─────────────────────────────────────────────────────────────────┐
│                    miku-stt Container                           │
│  ┌───────────────────────────────────────────────────────────┐ │
│  │ VADProcessor (Silero VAD 5.1.2)         [CPU]            │ │
│  │  - Chunk buffering (512 samples min)                      │ │
│  │  - Speech detection (threshold=0.5)                       │ │
│  │  - Events: speech_start, speaking, speech_end             │ │
│  └─────────────────┬─────────────────────────────────────────┘ │
│                    │ Audio segments                             │
│  ┌─────────────────▼─────────────────────────────────────────┐ │
│  │ WhisperTranscriber (Faster-Whisper 1.2.1) [GTX 1660]    │ │
│  │  - Model: small (1.3GB VRAM)                              │ │
│  │  - Transcribes speech segments                            │ │
│  │  - Returns: partial & final transcripts                   │ │
│  └─────────────────┬─────────────────────────────────────────┘ │
└────────────────────┼───────────────────────────────────────────┘
                     │ JSON events via WebSocket
                     ▼
┌─────────────────────────────────────────────────────────────────┐
│                    miku-bot Container                           │
│  ┌───────────────────────────────────────────────────────────┐ │
│  │ voice_manager.py Callbacks                                │ │
│  │  - on_vad_event()         → Log VAD states                │ │
│  │  - on_partial_transcript() → Show typing indicator        │ │
│  │  - on_final_transcript()   → Generate LLM response        │ │
│  │  - on_interruption()       → Cancel TTS playback          │ │
│  └─────────────────┬─────────────────────────────────────────┘ │
│                    │ Final transcript text                      │
│  ┌─────────────────▼─────────────────────────────────────────┐ │
│  │ _generate_voice_response()                                │ │
│  │  - Build LLM prompt with conversation history             │ │
│  │  - Stream LLM response                                    │ │
│  │  - Send tokens to TTS                                     │ │
│  └─────────────────┬─────────────────────────────────────────┘ │
└────────────────────┼───────────────────────────────────────────┘
                     │ HTTP streaming to LLaMA server
                     ▼
┌─────────────────────────────────────────────────────────────────┐
│              llama-cpp-server (AMD RX 6800)                     │
│  - Streaming text generation                                   │
│  - 20-30 tokens/sec                                            │
│  - Returns: {"delta": {"content": "token"}}                    │
└─────────────────┬───────────────────────────────────────────────┘
                  │ Token stream
                  ▼
┌─────────────────────────────────────────────────────────────────┐
│                    miku-bot Container                           │
│  ┌───────────────────────────────────────────────────────────┐ │
│  │ audio_source.send_token()                                 │ │
│  │  - Buffers tokens                                         │ │
│  │  - Sends to RVC WebSocket                                 │ │
│  └─────────────────┬─────────────────────────────────────────┘ │
└────────────────────┼───────────────────────────────────────────┘
                     │ ws://miku-rvc-api:8765/ws/stream
                     ▼
┌─────────────────────────────────────────────────────────────────┐
│                 miku-rvc-api Container                          │
│  ┌───────────────────────────────────────────────────────────┐ │
│  │ Soprano TTS Server (miku-soprano-tts)    [GTX 1660]      │ │
│  │  - Text → Audio synthesis                                 │ │
│  │  - 32kHz output                                           │ │
│  └─────────────────┬─────────────────────────────────────────┘ │
│                    │ Raw audio via ZMQ                          │
│  ┌─────────────────▼─────────────────────────────────────────┐ │
│  │ RVC Voice Conversion                     [GTX 1660]      │ │
│  │  - Voice cloning & pitch shifting                         │ │
│  │  - 48kHz output                                           │ │
│  └─────────────────┬─────────────────────────────────────────┘ │
└────────────────────┼───────────────────────────────────────────┘
                     │ PCM float32, 48kHz
                     ▼
┌─────────────────────────────────────────────────────────────────┐
│                    miku-bot Container                           │
│  ┌───────────────────────────────────────────────────────────┐ │
│  │ discord.VoiceClient                                       │ │
│  │  - Plays audio in voice channel                           │ │
│  │  - Can be interrupted by user speech                      │ │
│  └───────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────┐
│                       USER OUTPUT                               │
│                   (Miku's voice response)                       │
└─────────────────────────────────────────────────────────────────┘
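The VoiceReceiver stage in the diagram (stereo 48 kHz int16 in, mono 16 kHz out) can be sketched with the standard library alone. This is a naive decimator for illustration only; production code would low-pass filter before dropping samples (48000 / 16000 = 3, so every 3rd averaged frame is kept):

```python
import array

def stereo48k_to_mono16k(pcm: bytes) -> bytes:
    """Downmix interleaved stereo int16 PCM and decimate 48 kHz -> 16 kHz."""
    samples = array.array("h")          # signed 16-bit
    samples.frombytes(pcm)
    mono = array.array("h")
    # Step over interleaved L/R pairs, keeping every 3rd frame.
    for i in range(0, len(samples) - 1, 2 * 3):
        mono.append((samples[i] + samples[i + 1]) // 2)
    return mono.tobytes()

# One 20 ms Discord frame: 960 stereo frames at 48 kHz.
frame = array.array("h", [100, 200] * 960).tobytes()
out = stereo48k_to_mono16k(frame)
print(len(out) // 2)  # → 320 samples = 20 ms at 16 kHz
```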

Interruption Flow

User speaks during Miku's TTS
         │
         ▼
VAD detects speech (probability > 0.7)
         │
         ▼
STT sends interruption event
         │
         ▼
on_user_interruption() callback
         │
         ▼
_cancel_tts() → voice_client.stop()
         │
         ▼
POST http://miku-rvc-api:8765/interrupt
         │
         ▼
Flush ZMQ socket + clear RVC buffers
         │
         ▼
Miku stops speaking, ready for new input
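The trigger condition for the flow above is small: while Miku is speaking, the first VAD probability above 0.7 fires the interruption callback exactly once per utterance. A minimal sketch (class and attribute names are illustrative, not the bot's actual API):

```python
class InterruptionDetector:
    """Fire on_interruption once when speech is detected during playback."""

    def __init__(self, on_interruption, threshold=0.7):
        self.on_interruption = on_interruption
        self.threshold = threshold
        self.miku_speaking = False
        self._fired = False

    def start_playback(self):
        self.miku_speaking = True
        self._fired = False

    def stop_playback(self):
        self.miku_speaking = False

    def feed_vad(self, probability: float):
        if (self.miku_speaking and not self._fired
                and probability > self.threshold):
            self._fired = True
            self.on_interruption()  # e.g. stop playback, POST /interrupt

events = []
det = InterruptionDetector(on_interruption=lambda: events.append("cancel"))
det.feed_vad(0.9)           # ignored: Miku is not speaking
det.start_playback()
det.feed_vad(0.5)           # below threshold
det.feed_vad(0.9)           # fires once
det.feed_vad(0.95)          # already fired for this utterance
print(events)  # → ['cancel']
```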

Hardware Utilization

Listen Phase (User Speaking)

  • CPU: Silero VAD processing
  • GTX 1660: Faster-Whisper transcription (1.3GB VRAM)
  • AMD RX 6800: Idle

Think Phase (LLM Generation)

  • CPU: Idle
  • GTX 1660: Idle
  • AMD RX 6800: LLaMA inference (20-30 tokens/sec)

Speak Phase (Miku Responding)

  • CPU: Silero VAD monitoring for interruption
  • GTX 1660: Soprano TTS + RVC synthesis
  • AMD RX 6800: Idle

Performance Metrics

Expected Latencies

| Stage                     | Latency        |
|---------------------------|----------------|
| Discord audio capture     | ~20 ms         |
| Opus decode + resample    | <10 ms         |
| VAD processing            | <50 ms         |
| Whisper transcription     | 200-500 ms     |
| LLM token generation      | 33-50 ms/token |
| TTS synthesis             | Real-time      |
| Total (speech → response) | 1-2 s          |
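As a sanity check, the per-stage numbers roughly add up to the quoted total. The reply length below is an assumed example, and midpoints are used for the ranged stages:

```python
# Rough latency budget for a ~30-token reply (token count assumed).
capture_ms   = 20
decode_ms    = 10
vad_ms       = 50
whisper_ms   = 350          # midpoint of 200-500 ms
tokens       = 30
per_token_ms = 40           # midpoint of 33-50 ms/token
# TTS runs in real time, overlapped with token generation.
total_ms = (capture_ms + decode_ms + vad_ms + whisper_ms
            + tokens * per_token_ms)
print(total_ms / 1000)  # → 1.63 seconds, inside the 1-2 s envelope
```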

VRAM Usage

| GPU         | Component     | VRAM   |
|-------------|---------------|--------|
| AMD RX 6800 | LLaMA 8B Q4   | ~5.5GB |
| GTX 1660    | Whisper small | 1.3GB  |
| GTX 1660    | Soprano + RVC | ~3GB   |

Key Files

Bot Container

  • bot/utils/stt_client.py - WebSocket client for STT
  • bot/utils/voice_receiver.py - Discord audio sink
  • bot/utils/voice_manager.py - Voice session with STT integration
  • bot/commands/voice.py - Voice commands including listen/stop-listening

STT Container

  • stt/vad_processor.py - Silero VAD with chunk buffering
  • stt/whisper_transcriber.py - Faster-Whisper transcription
  • stt/stt_server.py - FastAPI WebSocket server

RVC Container

  • soprano_to_rvc/soprano_rvc_api.py - TTS + RVC pipeline with /interrupt endpoint

Configuration Files

docker-compose.yml

  • Network: miku-network (all containers)
  • Ports:
    • miku-bot: 8081 (API)
    • miku-rvc-api: 8765 (TTS)
    • miku-stt: 8001 (STT)
    • llama-cpp-server: 8080 (LLM)
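A compose fragment consistent with the list above might look like this; only the ports and the shared network come from this document, and the service layout shown is an assumption:

```yaml
services:
  miku-stt:
    ports:
      - "8001:8001"
    networks:
      - miku-network

  miku-rvc-api:
    ports:
      - "8765:8765"
    networks:
      - miku-network

networks:
  miku-network:
```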

VAD Settings (stt/vad_processor.py)

threshold = 0.5          # Speech detection sensitivity
min_speech = 250         # Minimum speech duration (ms)
min_silence = 500        # Silence before speech_end (ms)
interruption_threshold = 0.7  # Probability for interruption
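Silero VAD expects at least 512 samples (32 ms at 16 kHz) per inference call, while Discord delivers 20 ms (320-sample) chunks, hence the buffering step. A stdlib sketch of that accumulation (class name illustrative):

```python
class ChunkBuffer:
    """Accumulate short PCM chunks into fixed-size VAD windows."""

    WINDOW = 512  # samples per Silero VAD call at 16 kHz

    def __init__(self):
        self._buf = []

    def feed(self, chunk):
        """Add samples; return any complete 512-sample windows."""
        self._buf.extend(chunk)
        windows = []
        while len(self._buf) >= self.WINDOW:
            windows.append(self._buf[:self.WINDOW])
            self._buf = self._buf[self.WINDOW:]
        return windows

buf = ChunkBuffer()
print(len(buf.feed([0] * 320)))  # → 0 (320 < 512, keep buffering)
print(len(buf.feed([0] * 320)))  # → 1 (640 buffered, one window out)
```

Leftover samples (128 in the example) stay buffered for the next window, so no audio is dropped between VAD calls.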

Whisper Settings (stt/whisper_transcriber.py)

model = "small"          # 1.3GB VRAM
device = "cuda"
compute_type = "float16"
beam_size = 5
patience = 1.0

Testing Commands

# Check all container health
curl http://localhost:8001/health  # STT
curl http://localhost:8765/health  # RVC
curl http://localhost:8080/health  # LLM

# Monitor logs
docker logs -f miku-bot | grep -E "(listen|transcript|interrupt)"
docker logs -f miku-stt
docker logs -f miku-rvc-api | grep interrupt

# Test interrupt endpoint
curl -X POST http://localhost:8765/interrupt

# Check GPU usage
nvidia-smi

Troubleshooting

| Issue                    | Solution                                           |
|--------------------------|----------------------------------------------------|
| No audio from Discord    | Check bot has Connect and Speak permissions        |
| VAD not detecting        | Speak louder, check microphone, lower threshold    |
| Empty transcripts        | Speak for at least 1-2 seconds, check Whisper model |
| Interruption not working | Verify miku_speaking=true, check VAD probability   |
| High latency             | Profile each stage, check GPU utilization          |

Next Features (Phase 4C+)

  • KV cache precomputation from partial transcripts
  • Multi-user simultaneous conversation
  • Latency optimization (<1s total)
  • Voice activity history and analytics
  • Emotion detection from speech patterns
  • Context-aware interruption handling

Ready to test! Use !miku join → !miku listen → speak to Miku 🎤