# Voice-to-Voice Quick Reference

## Complete Pipeline Status

✅ All phases complete and deployed! Phase 4B is awaiting live testing.

## Phase Completion Status

### ✅ Phase 1: Voice Connection (COMPLETE)
- Discord voice channel connection
- Audio playback via discord.py
- Resource management and cleanup

### ✅ Phase 2: Audio Streaming (COMPLETE)
- Soprano TTS server (GTX 1660)
- RVC voice conversion
- Real-time streaming via WebSocket
- Token-by-token synthesis

### ✅ Phase 3: Text-to-Voice (COMPLETE)
- LLaMA text generation (AMD RX 6800)
- Streaming token pipeline
- TTS integration with `!miku say`
- Natural conversation flow

### ✅ Phase 4A: STT Container (COMPLETE)
- Silero VAD on CPU
- Faster-Whisper on GTX 1660
- WebSocket server at port 8001
- Per-user session management
- Chunk buffering for VAD

### ✅ Phase 4B: Bot STT Integration (COMPLETE - READY FOR TESTING)
- Discord audio capture
- Opus decode + resampling
- STT client WebSocket integration
- Voice commands: `!miku listen`, `!miku stop-listening`
- LLM voice response generation
- Interruption detection and cancellation
- `/interrupt` endpoint in RVC API

## Quick Start Commands

### Setup

```bash
!miku join     # Join your voice channel
!miku listen   # Start listening to your voice
```

### Usage

- **Speak** into your microphone
- Miku will **transcribe** your speech
- Miku will **respond** with voice
- **Interrupt** her by speaking while she's talking

### Teardown

```bash
!miku stop-listening   # Stop listening to your voice
!miku leave            # Leave voice channel
```

## Architecture Diagram

```
┌─────────────────────────────────────────────────────────────────┐
│                           USER INPUT                            │
└─────────────────────────────────────────────────────────────────┘
                     │
                     │ Discord Voice (Opus 48kHz)
                     ▼
┌─────────────────────────────────────────────────────────────────┐
│                       miku-bot Container                        │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │ VoiceReceiver (discord.sinks.Sink)                        │  │
│  │ - Opus decode → PCM                                       │  │
│  │ - Stereo → Mono                                           │  │
│  │ - Resample 48kHz → 16kHz                                  │  │
│  └─────────────────┬─────────────────────────────────────────┘  │
│                    │ PCM int16, 16kHz, 20ms chunks              │
│  ┌─────────────────▼─────────────────────────────────────────┐  │
│  │ STTClient (WebSocket)                                     │  │
│  │ - Sends audio to miku-stt                                 │  │
│  │ - Receives VAD events, transcripts                        │  │
│  └─────────────────┬─────────────────────────────────────────┘  │
└────────────────────┼────────────────────────────────────────────┘
                     │ ws://miku-stt:8001/ws/stt/{user_id}
                     ▼
┌─────────────────────────────────────────────────────────────────┐
│                       miku-stt Container                        │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │ VADProcessor (Silero VAD 5.1.2) [CPU]                     │  │
│  │ - Chunk buffering (512 samples min)                       │  │
│  │ - Speech detection (threshold=0.5)                        │  │
│  │ - Events: speech_start, speaking, speech_end              │  │
│  └─────────────────┬─────────────────────────────────────────┘  │
│                    │ Audio segments                             │
│  ┌─────────────────▼─────────────────────────────────────────┐  │
│  │ WhisperTranscriber (Faster-Whisper 1.2.1) [GTX 1660]      │  │
│  │ - Model: small (1.3GB VRAM)                               │  │
│  │ - Transcribes speech segments                             │  │
│  │ - Returns: partial & final transcripts                    │  │
│  └─────────────────┬─────────────────────────────────────────┘  │
└────────────────────┼────────────────────────────────────────────┘
                     │ JSON events via WebSocket
                     ▼
┌─────────────────────────────────────────────────────────────────┐
│                       miku-bot Container                        │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │ voice_manager.py Callbacks                                │  │
│  │ - on_vad_event() → Log VAD states                         │  │
│  │ - on_partial_transcript() → Show typing indicator         │  │
│  │ - on_final_transcript() → Generate LLM response           │  │
│  │ - on_interruption() → Cancel TTS playback                 │  │
│  └─────────────────┬─────────────────────────────────────────┘  │
│                    │ Final transcript text                      │
│  ┌─────────────────▼─────────────────────────────────────────┐  │
│  │ _generate_voice_response()                                │  │
│  │ - Build LLM prompt with conversation history              │  │
│  │ - Stream LLM response                                     │  │
│  │ - Send tokens to TTS                                      │  │
│  └─────────────────┬─────────────────────────────────────────┘  │
└────────────────────┼────────────────────────────────────────────┘
                     │ HTTP streaming to LLaMA server
                     ▼
┌─────────────────────────────────────────────────────────────────┐
│                 llama-cpp-server (AMD RX 6800)                  │
│   - Streaming text generation                                   │
│   - 20-30 tokens/sec                                            │
│   - Returns: {"delta": {"content": "token"}}                    │
└────────────────────┬────────────────────────────────────────────┘
                     │ Token stream
                     ▼
┌─────────────────────────────────────────────────────────────────┐
│                       miku-bot Container                        │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │ audio_source.send_token()                                 │  │
│  │ - Buffers tokens                                          │  │
│  │ - Sends to RVC WebSocket                                  │  │
│  └─────────────────┬─────────────────────────────────────────┘  │
└────────────────────┼────────────────────────────────────────────┘
                     │ ws://miku-rvc-api:8765/ws/stream
                     ▼
┌─────────────────────────────────────────────────────────────────┐
│                     miku-rvc-api Container                      │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │ Soprano TTS Server (miku-soprano-tts) [GTX 1660]          │  │
│  │ - Text → Audio synthesis                                  │  │
│  │ - 32kHz output                                            │  │
│  └─────────────────┬─────────────────────────────────────────┘  │
│                    │ Raw audio via ZMQ                          │
│  ┌─────────────────▼─────────────────────────────────────────┐  │
│  │ RVC Voice Conversion [GTX 1660]                           │  │
│  │ - Voice cloning & pitch shifting                          │  │
│  │ - 48kHz output                                            │  │
│  └─────────────────┬─────────────────────────────────────────┘  │
└────────────────────┼────────────────────────────────────────────┘
                     │ PCM float32, 48kHz
                     ▼
┌─────────────────────────────────────────────────────────────────┐
│                       miku-bot Container                        │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │ discord.VoiceClient                                       │  │
│  │ - Plays audio in voice channel                            │  │
│  │ - Can be interrupted by user speech                       │  │
│  └───────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────┘
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────┐
│                           USER OUTPUT                           │
│                     (Miku's voice response)                     │
└─────────────────────────────────────────────────────────────────┘
```
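### Code Sketches (Illustrative)

The `VoiceReceiver` box above is mostly format conversion. Below is a minimal sketch of its downmix/resample step, assuming plain numpy and clean 20 ms frames; the function name is illustrative, and the real sink also handles Opus decode and per-user routing:

```python
import numpy as np

def downmix_and_resample(pcm48_stereo: bytes) -> bytes:
    """Convert one 20 ms Discord frame (48 kHz stereo int16, 3840 bytes)
    into the 16 kHz mono int16 format the STT server expects."""
    samples = np.frombuffer(pcm48_stereo, dtype=np.int16)
    stereo = samples.reshape(-1, 2)              # [960, (L, R)]
    mono = stereo.astype(np.int32).mean(axis=1)  # average channels in int32
    # 48 kHz -> 16 kHz is an integer factor of 3. Plain decimation is
    # usually fine for speech, but a proper resampler would low-pass first.
    return mono[::3].astype(np.int16).tobytes()
```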
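The `STTClient` hop could look roughly like this, assuming the `websockets` package. The binary-frames-in, JSON-events-out protocol comes from the diagram; the exact event field names (`type`, `text`) are assumptions, not the actual wire format:

```python
import asyncio
import json
import websockets

async def stream_to_stt(user_id: int, chunks: asyncio.Queue) -> None:
    """Forward 16 kHz PCM chunks to miku-stt and handle the events it emits."""
    uri = f"ws://miku-stt:8001/ws/stt/{user_id}"
    async with websockets.connect(uri) as ws:

        async def sender() -> None:
            while True:
                chunk = await chunks.get()  # bytes from VoiceReceiver
                await ws.send(chunk)        # binary frame = raw PCM

        async def receiver() -> None:
            async for message in ws:        # VAD events + transcripts
                event = json.loads(message)
                if event.get("type") == "final_transcript":
                    print("user said:", event.get("text"))

        await asyncio.gather(sender(), receiver())
```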
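On the LLM hop, the `{"delta": {"content": "token"}}` shape is OpenAI-style streaming. A sketch of the token loop, assuming llama-cpp-server exposes the OpenAI-compatible `/v1/chat/completions` route and that `audio_source.send_token()` (from the diagram) is awaitable:

```python
import json
import aiohttp

async def stream_llm_to_tts(prompt: str, audio_source) -> None:
    """Stream tokens from the LLaMA server and forward each one to TTS."""
    payload = {"messages": [{"role": "user", "content": prompt}],
               "stream": True}
    async with aiohttp.ClientSession() as session:
        async with session.post(
            "http://llama-cpp-server:8080/v1/chat/completions", json=payload
        ) as resp:
            async for raw in resp.content:          # one SSE line at a time
                line = raw.decode().strip()
                if not line.startswith("data: ") or line.endswith("[DONE]"):
                    continue
                chunk = json.loads(line[len("data: "):])
                token = chunk["choices"][0]["delta"].get("content")
                if token:
                    await audio_source.send_token(token)  # to RVC WebSocket
```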
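Finally, the cancel path at the bottom of the diagram (spelled out step by step in the Interruption Flow section below) reduces to two calls in the bot. A sketch assuming `aiohttp` and a live `discord.VoiceClient`:

```python
import aiohttp

async def on_user_interruption(voice_client) -> None:
    """Stop local playback, then tell the RVC API to flush its pipeline."""
    if voice_client.is_playing():
        voice_client.stop()  # cut Discord playback immediately
    async with aiohttp.ClientSession() as session:
        # Server side, this flushes the ZMQ socket and clears RVC buffers.
        async with session.post("http://miku-rvc-api:8765/interrupt") as resp:
            resp.raise_for_status()
```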
## Interruption Flow

```
User speaks during Miku's TTS
            │
            ▼
VAD detects speech (probability > 0.7)
            │
            ▼
STT sends interruption event
            │
            ▼
on_user_interruption() callback
            │
            ▼
_cancel_tts() → voice_client.stop()
            │
            ▼
POST http://miku-rvc-api:8765/interrupt
            │
            ▼
Flush ZMQ socket + clear RVC buffers
            │
            ▼
Miku stops speaking, ready for new input
```

## Hardware Utilization

### Listen Phase (User Speaking)
- **CPU**: Silero VAD processing
- **GTX 1660**: Faster-Whisper transcription (1.3GB VRAM)
- **AMD RX 6800**: Idle

### Think Phase (LLM Generation)
- **CPU**: Idle
- **GTX 1660**: Idle
- **AMD RX 6800**: LLaMA inference (20-30 tokens/sec)

### Speak Phase (Miku Responding)
- **CPU**: Silero VAD monitoring for interruption
- **GTX 1660**: Soprano TTS + RVC synthesis
- **AMD RX 6800**: Idle

## Performance Metrics

### Expected Latencies

| Stage                          | Latency       |
|--------------------------------|---------------|
| Discord audio capture          | ~20ms         |
| Opus decode + resample         | <10ms         |
| VAD processing                 | <50ms         |
| Whisper transcription          | 200-500ms     |
| LLM token generation           | 33-50ms/token |
| TTS synthesis                  | Real-time     |
| **Total (speech → response)**  | **1-2s**      |

### VRAM Usage

| GPU         | Component     | VRAM   |
|-------------|---------------|--------|
| AMD RX 6800 | LLaMA 8B Q4   | ~5.5GB |
| GTX 1660    | Whisper small | 1.3GB  |
| GTX 1660    | Soprano + RVC | ~3GB   |

## Key Files

### Bot Container
- `bot/utils/stt_client.py` - WebSocket client for STT
- `bot/utils/voice_receiver.py` - Discord audio sink
- `bot/utils/voice_manager.py` - Voice session with STT integration
- `bot/commands/voice.py` - Voice commands including listen/stop-listening

### STT Container
- `stt/vad_processor.py` - Silero VAD with chunk buffering
- `stt/whisper_transcriber.py` - Faster-Whisper transcription
- `stt/stt_server.py` - FastAPI WebSocket server

### RVC Container
- `soprano_to_rvc/soprano_rvc_api.py` - TTS + RVC pipeline with /interrupt endpoint

## Configuration Files

### docker-compose.yml
- Network: `miku-network` (all containers)
- Ports:
  - miku-bot: 8081 (API)
  - miku-rvc-api: 8765 (TTS)
  - miku-stt: 8001 (STT)
  - llama-cpp-server: 8080 (LLM)

### VAD Settings (stt/vad_processor.py)

```python
threshold = 0.5               # Speech detection sensitivity
min_speech = 250              # Minimum speech duration (ms)
min_silence = 500             # Silence before speech_end (ms)
interruption_threshold = 0.7  # Probability for interruption
```

### Whisper Settings (stt/whisper_transcriber.py)

```python
model = "small"           # 1.3GB VRAM
device = "cuda"
compute_type = "float16"
beam_size = 5
patience = 1.0
```
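For reference, the VAD settings above plug into Silero roughly as follows; this is a sketch assuming the `torch.hub` distribution of the model. Silero scores fixed 512-sample windows at 16 kHz (32 ms), which is why the processor buffers incoming 20 ms frames before scoring:

```python
import numpy as np
import torch

# One-time model load; Silero VAD runs comfortably on CPU.
model, _ = torch.hub.load("snakers4/silero-vad", "silero_vad")

def speech_probability(chunk_int16: bytes) -> float:
    """Score one buffered 512-sample chunk (32 ms at 16 kHz) for speech."""
    pcm = np.frombuffer(chunk_int16, dtype=np.int16)
    audio = torch.from_numpy(pcm.astype(np.float32) / 32768.0)
    return model(audio, 16000).item()  # 0.0-1.0, compared against threshold
```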
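The Whisper settings map directly onto the `faster-whisper` API. A sketch assuming VAD-bounded segments arrive as 16 kHz mono int16 numpy arrays:

```python
import numpy as np
from faster_whisper import WhisperModel

# Matches the settings above: ~1.3GB VRAM on the GTX 1660.
model = WhisperModel("small", device="cuda", compute_type="float16")

def transcribe_segment(audio_int16: np.ndarray) -> str:
    """Transcribe one VAD-bounded speech segment."""
    audio = audio_int16.astype(np.float32) / 32768.0  # normalize to [-1, 1]
    segments, _info = model.transcribe(audio, beam_size=5, patience=1.0)
    return " ".join(seg.text.strip() for seg in segments)
```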
## Testing Commands

```bash
# Check all container health
curl http://localhost:8001/health   # STT
curl http://localhost:8765/health   # RVC
curl http://localhost:8080/health   # LLM

# Monitor logs
docker logs -f miku-bot | grep -E "(listen|transcript|interrupt)"
docker logs -f miku-stt
docker logs -f miku-rvc-api | grep interrupt

# Test interrupt endpoint
curl -X POST http://localhost:8765/interrupt

# Check GPU usage
nvidia-smi
```

## Troubleshooting

| Issue | Solution |
|-------|----------|
| No audio from Discord | Check bot has Connect and Speak permissions |
| VAD not detecting | Speak louder, check microphone, lower threshold |
| Empty transcripts | Speak for at least 1-2 seconds, check Whisper model |
| Interruption not working | Verify `miku_speaking=true`, check VAD probability |
| High latency | Profile each stage, check GPU utilization |

## Next Features (Phase 4C+)

- [ ] KV cache precomputation from partial transcripts
- [ ] Multi-user simultaneous conversation
- [ ] Latency optimization (<1s total)
- [ ] Voice activity history and analytics
- [ ] Emotion detection from speech patterns
- [ ] Context-aware interruption handling

---

**Ready to test!** Use `!miku join` → `!miku listen` → speak to Miku 🎤