# Voice-to-Voice Quick Reference

## Complete Pipeline Status ✅

All phases complete and deployed!

### Phase Completion Status

#### ✅ Phase 1: Voice Connection (COMPLETE)
- Discord voice channel connection
- Audio playback via discord.py
- Resource management and cleanup
#### ✅ Phase 2: Audio Streaming (COMPLETE)
- Soprano TTS server (GTX 1660)
- RVC voice conversion
- Real-time streaming via WebSocket
- Token-by-token synthesis
#### ✅ Phase 3: Text-to-Voice (COMPLETE)

- LLaMA text generation (AMD RX 6800)
- Streaming token pipeline
- TTS integration with `!miku say`
- Natural conversation flow
#### ✅ Phase 4A: STT Container (COMPLETE)
- Silero VAD on CPU
- Faster-Whisper on GTX 1660
- WebSocket server at port 8001
- Per-user session management
- Chunk buffering for VAD
#### ✅ Phase 4B: Bot STT Integration (COMPLETE - READY FOR TESTING)

- Discord audio capture
- Opus decode + resampling
- STT client WebSocket integration
- Voice commands: `!miku listen`, `!miku stop-listening`
- LLM voice response generation
- Interruption detection and cancellation
- `/interrupt` endpoint in RVC API
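The "Opus decode + resampling" step can be sketched in a few lines of NumPy. This is an illustrative helper, not the bot's actual `voice_receiver.py` code, and the naive 3:1 decimation stands in for a proper low-pass resampler:

```python
import numpy as np

def discord_pcm_to_stt(pcm: bytes) -> bytes:
    """Convert Discord's decoded audio (interleaved stereo int16 @ 48 kHz)
    into the mono int16 @ 16 kHz stream the STT container expects."""
    stereo = np.frombuffer(pcm, dtype=np.int16).reshape(-1, 2)
    mono = stereo.mean(axis=1).astype(np.int16)  # average the two channels
    return mono[::3].tobytes()  # naive 3:1 decimation: 48 kHz -> 16 kHz
```

One 20 ms Discord frame (960 stereo sample pairs, 3,840 bytes) comes out as 320 mono samples (640 bytes), matching the 20 ms chunk size the pipeline uses.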
## Quick Start Commands

### Setup

```
!miku join    # Join your voice channel
!miku listen  # Start listening to your voice
```
### Usage
- Speak into your microphone
- Miku will transcribe your speech
- Miku will respond with voice
- Interrupt her by speaking while she's talking
### Teardown

```
!miku stop-listening  # Stop listening to your voice
!miku leave           # Leave voice channel
```
## Architecture Diagram

```
USER INPUT
   │  Discord Voice (Opus 48kHz)
   ▼
miku-bot container
├─ VoiceReceiver (discord.sinks.Sink)
│    - Opus decode → PCM
│    - Stereo → Mono
│    - Resample 48kHz → 16kHz
│       │  PCM int16, 16kHz, 20ms chunks
│       ▼
└─ STTClient (WebSocket)
     - Sends audio to miku-stt
     - Receives VAD events, transcripts
   │  ws://miku-stt:8001/ws/stt/{user_id}
   ▼
miku-stt container
├─ VADProcessor (Silero VAD 5.1.2) [CPU]
│    - Chunk buffering (512 samples min)
│    - Speech detection (threshold=0.5)
│    - Events: speech_start, speaking, speech_end
│       │  Audio segments
│       ▼
└─ WhisperTranscriber (Faster-Whisper 1.2.1) [GTX 1660]
     - Model: small (1.3GB VRAM)
     - Transcribes speech segments
     - Returns: partial & final transcripts
   │  JSON events via WebSocket
   ▼
miku-bot container
├─ voice_manager.py callbacks
│    - on_vad_event() → log VAD states
│    - on_partial_transcript() → show typing indicator
│    - on_final_transcript() → generate LLM response
│    - on_interruption() → cancel TTS playback
│       │  Final transcript text
│       ▼
└─ _generate_voice_response()
     - Build LLM prompt with conversation history
     - Stream LLM response
     - Send tokens to TTS
   │  HTTP streaming to LLaMA server
   ▼
llama-cpp-server (AMD RX 6800)
   - Streaming text generation
   - 20-30 tokens/sec
   - Returns: {"delta": {"content": "token"}}
   │  Token stream
   ▼
miku-bot container
└─ audio_source.send_token()
     - Buffers tokens
     - Sends to RVC WebSocket
   │  ws://miku-rvc-api:8765/ws/stream
   ▼
miku-rvc-api container
├─ Soprano TTS Server (miku-soprano-tts) [GTX 1660]
│    - Text → Audio synthesis
│    - 32kHz output
│       │  Raw audio via ZMQ
│       ▼
└─ RVC Voice Conversion [GTX 1660]
     - Voice cloning & pitch shifting
     - 48kHz output
   │  PCM float32, 48kHz
   ▼
miku-bot container
└─ discord.VoiceClient
     - Plays audio in voice channel
     - Can be interrupted by user speech
   ▼
USER OUTPUT (Miku's voice response)
```
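The llama-cpp-server hop streams OpenAI-style chunks whose useful payload is `{"delta": {"content": "token"}}`. A minimal collector for that stream might look like this (a sketch: the `data:` SSE framing and the `choices` wrapper are assumptions based on the OpenAI-compatible streaming format, not confirmed details of this deployment):

```python
import json

def collect_deltas(sse_lines):
    """Yield content tokens from an OpenAI-style streaming response, where
    each SSE data line carries {"choices": [{"delta": {"content": ...}}]}."""
    for line in sse_lines:
        if not line.startswith("data: "):
            continue  # skip comments / keep-alives
        payload = line[len("data: "):]
        if payload.strip() == "[DONE]":
            break  # end-of-stream sentinel
        delta = json.loads(payload)["choices"][0]["delta"]
        if "content" in delta:
            yield delta["content"]
```

Each yielded token would be forwarded to `audio_source.send_token()` for TTS buffering.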
## Interruption Flow

```
User speaks during Miku's TTS
   │
   ▼
VAD detects speech (probability > 0.7)
   │
   ▼
STT sends interruption event
   │
   ▼
on_user_interruption() callback
   │
   ▼
_cancel_tts() → voice_client.stop()
   │
   ▼
POST http://miku-rvc-api:8765/interrupt
   │
   ▼
Flush ZMQ socket + clear RVC buffers
   │
   ▼
Miku stops speaking, ready for new input
```
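The interruption flow condenses into one callback. This is a hedged sketch rather than the bot's actual `_cancel_tts()`: the `post_interrupt` helper and the injectable `flush` argument are illustrative, and a real implementation would add error handling around the HTTP call:

```python
import asyncio
import urllib.request

# Endpoint taken from the interruption flow above.
RVC_INTERRUPT_URL = "http://miku-rvc-api:8765/interrupt"

def post_interrupt(url: str = RVC_INTERRUPT_URL) -> int:
    """POST to the RVC /interrupt endpoint; returns the HTTP status code."""
    req = urllib.request.Request(url, method="POST")
    with urllib.request.urlopen(req, timeout=2) as resp:
        return resp.status

async def cancel_tts(voice_client, tts_task=None, flush=post_interrupt):
    """Stop playback, cancel the token stream, then flush the RVC pipeline."""
    if voice_client and voice_client.is_playing():
        voice_client.stop()  # halt Discord playback immediately
    if tts_task is not None and not tts_task.done():
        tts_task.cancel()    # drop the rest of the LLM token stream
    # Run the blocking HTTP call off the event loop.
    await asyncio.get_running_loop().run_in_executor(None, flush)
```

Making `flush` injectable keeps the cancellation logic testable without a live RVC container.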
## Hardware Utilization

### Listen Phase (User Speaking)
- CPU: Silero VAD processing
- GTX 1660: Faster-Whisper transcription (1.3GB VRAM)
- AMD RX 6800: Idle
### Think Phase (LLM Generation)
- CPU: Idle
- GTX 1660: Idle
- AMD RX 6800: LLaMA inference (20-30 tokens/sec)
### Speak Phase (Miku Responding)
- CPU: Silero VAD monitoring for interruption
- GTX 1660: Soprano TTS + RVC synthesis
- AMD RX 6800: Idle
## Performance Metrics

### Expected Latencies
| Stage | Latency |
|---|---|
| Discord audio capture | ~20ms |
| Opus decode + resample | <10ms |
| VAD processing | <50ms |
| Whisper transcription | 200-500ms |
| LLM token generation | 33-50ms/tok |
| TTS synthesis | Real-time |
| Total (speech → response) | 1-2s |
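As a sanity check, the per-stage latencies roughly add up to the quoted total. Mid-range values are assumed, and the 15-token buffer before TTS starts speaking is an assumption for illustration:

```python
# Mid-range per-stage estimates from the latency table (ms).
capture, decode, vad, whisper = 20, 10, 50, 350
ms_per_token = 40          # mid-range of the 33-50 ms/tok figure
tokens_before_audio = 15   # assumed tokens buffered before TTS starts

time_to_first_audio = capture + decode + vad + whisper + ms_per_token * tokens_before_audio
print(time_to_first_audio)  # 1030 ms, inside the 1-2 s budget
```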
### VRAM Usage
| GPU | Component | VRAM |
|---|---|---|
| AMD RX 6800 | LLaMA 8B Q4 | ~5.5GB |
| GTX 1660 | Whisper small | 1.3GB |
| GTX 1660 | Soprano + RVC | ~3GB |
## Key Files

### Bot Container

- `bot/utils/stt_client.py` - WebSocket client for STT
- `bot/utils/voice_receiver.py` - Discord audio sink
- `bot/utils/voice_manager.py` - Voice session with STT integration
- `bot/commands/voice.py` - Voice commands including listen/stop-listening
### STT Container

- `stt/vad_processor.py` - Silero VAD with chunk buffering
- `stt/whisper_transcriber.py` - Faster-Whisper transcription
- `stt/stt_server.py` - FastAPI WebSocket server
### RVC Container

- `soprano_to_rvc/soprano_rvc_api.py` - TTS + RVC pipeline with `/interrupt` endpoint
## Configuration Files

### docker-compose.yml

- Network: `miku-network` (all containers)
- Ports:
  - miku-bot: 8081 (API)
  - miku-rvc-api: 8765 (TTS)
  - miku-stt: 8001 (STT)
  - llama-cpp-server: 8080 (LLM)
### VAD Settings (`stt/vad_processor.py`)

```python
threshold = 0.5               # Speech detection sensitivity
min_speech = 250              # Minimum speech duration (ms)
min_silence = 500             # Silence before speech_end (ms)
interruption_threshold = 0.7  # Probability required to interrupt playback
```
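These settings drive a small state machine: `speech_start` fires after `min_speech` ms above `threshold`, and `speech_end` after `min_silence` ms below it. A toy version of that event logic (an illustration, not the container's `VADProcessor`; the 32 ms chunk matches 512 samples at 16 kHz, and the intermediate `speaking` event is omitted):

```python
class VadGate:
    """Turn per-chunk speech probabilities into speech_start/speech_end events."""

    CHUNK_MS = 32  # 512 samples at 16 kHz

    def __init__(self, threshold=0.5, min_speech=250, min_silence=500):
        self.threshold = threshold
        self.min_speech = min_speech
        self.min_silence = min_silence
        self.speech_ms = 0
        self.silence_ms = 0
        self.in_speech = False

    def feed(self, prob):
        """Feed one chunk's speech probability; return an event name or None."""
        if prob >= self.threshold:
            self.speech_ms += self.CHUNK_MS
            self.silence_ms = 0
            if not self.in_speech and self.speech_ms >= self.min_speech:
                self.in_speech = True
                return "speech_start"
        else:
            self.silence_ms += self.CHUNK_MS
            if self.in_speech and self.silence_ms >= self.min_silence:
                self.in_speech = False
                self.speech_ms = 0
                return "speech_end"
        return None
```

With the defaults, roughly 8 consecutive speech chunks (256 ms) trigger `speech_start`, and 16 silent chunks (512 ms) trigger `speech_end`.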
### Whisper Settings (`stt/whisper_transcriber.py`)

```python
model = "small"           # 1.3GB VRAM
device = "cuda"
compute_type = "float16"
beam_size = 5
patience = 1.0
```
## Testing Commands

```bash
# Check all container health
curl http://localhost:8001/health   # STT
curl http://localhost:8765/health   # RVC
curl http://localhost:8080/health   # LLM

# Monitor logs
docker logs -f miku-bot | grep -E "(listen|transcript|interrupt)"
docker logs -f miku-stt
docker logs -f miku-rvc-api | grep interrupt

# Test interrupt endpoint
curl -X POST http://localhost:8765/interrupt

# Check GPU usage
nvidia-smi
```
## Troubleshooting
| Issue | Solution |
|---|---|
| No audio from Discord | Check bot has Connect and Speak permissions |
| VAD not detecting | Speak louder, check microphone, lower threshold |
| Empty transcripts | Speak for at least 1-2 seconds, check Whisper model |
| Interruption not working | Verify `miku_speaking=true`, check VAD probability |
| High latency | Profile each stage, check GPU utilization |
## Next Features (Phase 4C+)
- KV cache precomputation from partial transcripts
- Multi-user simultaneous conversation
- Latency optimization (<1s total)
- Voice activity history and analytics
- Emotion detection from speech patterns
- Context-aware interruption handling
Ready to test! Use `!miku join` → `!miku listen` → speak to Miku 🎤