STT Voice Testing Guide
Phase 4B: Bot-Side STT Integration - COMPLETE ✅
All code has been deployed to containers. Ready for testing!
Architecture Overview
Discord Voice (User) → Opus 48kHz stereo
↓
VoiceReceiver.write()
↓
Opus decode → Stereo-to-mono → Resample to 16kHz
↓
STTClient.send_audio() → WebSocket
↓
miku-stt:8001 (Silero VAD + Faster-Whisper)
↓
JSON events (vad, partial, final, interruption)
↓
VoiceReceiver callbacks → voice_manager
↓
on_final_transcript() → _generate_voice_response()
↓
LLM streaming → TTS tokens → Audio playback
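The decode-and-resample step in the middle of this pipeline is straightforward: average the two channels, then bring 48 kHz down to 16 kHz. Below is a minimal sketch of that conversion, assuming numpy and int16 PCM input; the function name and the naive 3:1 decimation are illustrative, not the actual VoiceReceiver code (which may use a proper low-pass resampler).

```python
import numpy as np

def downmix_and_resample(pcm_48k_stereo: bytes) -> bytes:
    """Convert 48 kHz stereo int16 PCM to 16 kHz mono int16 PCM (illustrative)."""
    samples = np.frombuffer(pcm_48k_stereo, dtype=np.int16)
    stereo = samples.reshape(-1, 2).astype(np.int32)
    mono = (stereo[:, 0] + stereo[:, 1]) // 2   # stereo -> mono by averaging channels
    mono_16k = mono[::3].astype(np.int16)       # naive 3:1 decimation (48 kHz -> 16 kHz)
    return mono_16k.tobytes()
```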
New Voice Commands
1. Start Listening
!miku listen
- Starts listening to your voice in the current voice channel
- You must be in the same channel as Miku
- Miku will transcribe your speech and respond with voice
!miku listen @username
- Start listening to a specific user's voice
- Useful for moderators or testing with multiple users
2. Stop Listening
!miku stop-listening
- Stop listening to your voice
- Miku will no longer transcribe or respond to your speech
!miku stop-listening @username
- Stop listening to a specific user
Testing Procedure
Test 1: Basic STT Connection
- Join a voice channel
- !miku join - Miku joins your channel
- !miku listen - Start listening to your voice
- Check bot logs for "Started listening to user"
- Check STT logs:
docker logs miku-stt --tail 50
- Should show: "WebSocket connection from user {user_id}"
- Should show: "Session started for user {user_id}"
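If you want to sanity-check the STT WebSocket outside Discord, a small smoke test like the one below can confirm the endpoint accepts connections. It assumes the `websockets` package and the per-user path listed under API Endpoints; whether the server accepts raw binary PCM frames and what it replies with are assumptions.

```python
import asyncio
import websockets  # pip install websockets

async def smoke_test(user_id: int = 123456789):
    uri = f"ws://localhost:8001/ws/stt/{user_id}"
    async with websockets.connect(uri) as ws:
        # Send one second of silence (16 kHz mono int16 = 32000 bytes), then wait
        # briefly for any event; a timeout just means VAD saw no speech.
        await ws.send(b"\x00" * 32000)
        try:
            print(await asyncio.wait_for(ws.recv(), timeout=5))
        except asyncio.TimeoutError:
            print("No event (expected for silence) - connection itself is OK")

asyncio.run(smoke_test())
```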
Test 2: VAD Detection
- After !miku listen, speak into your microphone
- Say something like: "Hello Miku, can you hear me?"
- Check STT logs for VAD events:
[DEBUG] VAD: speech_start probability=0.85
[DEBUG] VAD: speaking probability=0.92
[DEBUG] VAD: speech_end probability=0.15
- Bot logs should show: "VAD event for user {id}: speech_start/speaking/speech_end"
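To reproduce these probabilities outside the container, here is a rough sketch of running Silero VAD on a 16 kHz chunk. Loading via torch.hub is one common way to get the model, not necessarily how miku-stt bundles it; the 0.5 cutoff mirrors the current threshold.

```python
import torch

# Load Silero VAD on CPU (one common distribution channel; miku-stt may package it differently).
model, _utils = torch.hub.load("snakers4/silero-vad", "silero_vad")

def speech_probability(chunk_int16: torch.Tensor) -> float:
    """Return speech probability for one 512-sample chunk at 16 kHz."""
    audio = chunk_int16.float() / 32768.0      # int16 -> float32 in [-1, 1]
    return model(audio, 16000).item()

prob = speech_probability(torch.zeros(512, dtype=torch.int16))
print(f"probability={prob:.2f}  speech={'yes' if prob > 0.5 else 'no'}")
```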
Test 3: Transcription
- Speak clearly into microphone: "Hey Miku, tell me a joke"
- Watch bot logs for:
- "Partial transcript from user {id}: Hey Miku..."
- "Final transcript from user {id}: Hey Miku, tell me a joke"
- Miku should respond with LLM-generated speech
- Check channel for: "🎤 Miku: [her response]"
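The event names in the architecture diagram (vad, partial, final, interruption) suggest the bot-side dispatch looks roughly like the sketch below. The JSON field names and the `session` object are assumptions for illustration, not the actual miku-stt schema or VoiceReceiver code.

```python
import json

def handle_stt_event(raw: str, session) -> None:
    """Dispatch one JSON event from the STT WebSocket (field names assumed)."""
    event = json.loads(raw)
    kind = event.get("type")
    if kind == "partial":
        # Partial transcripts are informational; later phases may use them for KV-cache warmup.
        print(f"Partial transcript from user {session.user_id}: {event['text']}")
    elif kind == "final":
        print(f"Final transcript from user {session.user_id}: {event['text']}")
        session.on_final_transcript(event["text"])   # kicks off _generate_voice_response()
    elif kind == "vad":
        print(f"VAD event for user {session.user_id}: {event.get('state')}")
    elif kind == "interruption":
        session.on_interruption(event.get("probability"))
```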
Test 4: Interruption Detection
- !miku listen
- !miku say Tell me a very long story about your favorite song
- While Miku is speaking, start talking yourself
- Speak loudly enough to trigger VAD (probability > 0.7)
- Expected behavior:
- Miku's audio should stop immediately
- Bot logs: "User {id} interrupted Miku (probability={prob})"
- STT logs: "Interruption detected during TTS playback"
- RVC logs: "Interrupted: Flushed {N} ZMQ chunks"
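You can also hit the interrupt endpoint directly to confirm the RVC side flushes its buffers, independent of VAD. This uses only the Python standard library and the endpoint listed under API Endpoints; the response body is whatever the RVC API returns.

```python
import urllib.request

# POST with an empty body to the documented interrupt endpoint (port 8765).
req = urllib.request.Request("http://localhost:8765/interrupt", data=b"", method="POST")
with urllib.request.urlopen(req, timeout=5) as resp:
    print(resp.status, resp.read().decode())
```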
Test 5: Multi-User (if available)
- Have two users join voice channel
- !miku listen @user1 - Listen to first user
- !miku listen @user2 - Listen to second user
- Both users speak separately
- Verify Miku responds to each user individually
- Check STT logs for multiple active sessions
Logs to Monitor
Bot Logs
docker logs -f miku-bot | grep -E "(listen|STT|transcript|interrupt)"
Expected output:
[INFO] Started listening to user 123456789 (username)
[DEBUG] VAD event for user 123456789: speech_start
[DEBUG] Partial transcript from user 123456789: Hello Miku...
[INFO] Final transcript from user 123456789: Hello Miku, how are you?
[INFO] User 123456789 interrupted Miku (probability=0.82)
STT Logs
docker logs -f miku-stt
Expected output:
[INFO] WebSocket connection from user_123456789
[INFO] Session started for user 123456789
[DEBUG] Received 320 audio samples from user_123456789
[DEBUG] VAD speech_start: probability=0.87
[INFO] Transcribing audio segment (duration=2.5s)
[INFO] Final transcript: "Hello Miku, how are you?"
RVC Logs (for interruption)
docker logs -f miku-rvc-api | grep -i interrupt
Expected output:
[INFO] Interrupted: Flushed 15 ZMQ chunks, cleared 48000 RVC buffer samples
Component Status
✅ Completed
- STT container running (miku-stt:8001)
- Silero VAD on CPU with chunk buffering
- Faster-Whisper on GTX 1660 (1.3GB VRAM)
- STTClient WebSocket client
- VoiceReceiver Discord audio sink
- VoiceSession STT integration
- listen/stop-listening commands
- /interrupt endpoint in RVC API
- LLM response generation from transcripts
- Interruption detection and cancellation
⏳ Pending Testing
- Basic STT connection test
- VAD speech detection test
- End-to-end transcription test
- LLM voice response test
- Interruption cancellation test
- Multi-user testing (if available)
🔧 Configuration Tuning (after testing)
- VAD sensitivity (currently threshold=0.5)
- VAD timing (min_speech=250ms, min_silence=500ms)
- Interruption threshold (currently 0.7)
- Whisper beam size and patience
- LLM streaming chunk size
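If these knobs end up collected in one place, the config might look roughly like the following. The names and structure are hypothetical; only the values come from the list above (beam_size 5 is the faster-whisper default).

```python
STT_TUNING = {
    "vad": {
        "threshold": 0.5,          # speech probability needed to count as speech
        "min_speech_ms": 250,      # ignore blips shorter than this
        "min_silence_ms": 500,     # this much silence ends a segment
    },
    "interruption_threshold": 0.7, # VAD probability needed to cut off Miku mid-speech
    "whisper": {
        "beam_size": 5,            # raise for accuracy, lower for speed
        "patience": 1.0,
    },
}
```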
API Endpoints
STT Container (port 8001)
- WebSocket: ws://localhost:8001/ws/stt/{user_id}
- Health: http://localhost:8001/health
RVC Container (port 8765)
- WebSocket: ws://localhost:8765/ws/stream
- Interrupt: http://localhost:8765/interrupt (POST)
- Health: http://localhost:8765/health
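A quick way to confirm both containers are up before running the tests (standard library only; assumes the health endpoints return a 200 with a small body):

```python
import urllib.request

for name, url in [("STT", "http://localhost:8001/health"),
                  ("RVC", "http://localhost:8765/health")]:
    try:
        with urllib.request.urlopen(url, timeout=3) as resp:
            print(f"{name}: {resp.status} {resp.read().decode()[:80]}")
    except OSError as exc:
        print(f"{name}: DOWN ({exc})")
```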
Troubleshooting
No audio received from Discord
- Check bot logs for "write() called with data"
- Verify user is in same voice channel as Miku
- Check Discord permissions (View Channel, Connect, Speak)
VAD not detecting speech
- Check chunk buffer accumulation in STT logs
- Verify audio format: PCM int16, 16kHz mono
- Try speaking louder or more clearly
- Check VAD threshold (may need adjustment)
Transcription empty or gibberish
- Verify Whisper model loaded (check STT startup logs)
- Check GPU VRAM usage: nvidia-smi
- Ensure audio segments are at least 1-2 seconds long
- Try speaking more clearly with less background noise
Interruption not working
- Verify Miku is actually speaking (check miku_speaking flag)
- Check VAD probability in logs (must be > 0.7)
- Verify /interrupt endpoint returns success
- Check RVC logs for flushed chunks
Multiple users causing issues
- Check STT logs for per-user session management
- Verify each user has separate STTClient instance
- Check for resource contention on GTX 1660
Next Steps After Testing
Phase 4C: LLM KV Cache Precomputation
- Use partial transcripts to start LLM generation early
- Precompute KV cache for common phrases
- Reduce latency between speech end and response start
Phase 4D: Multi-User Refinement
- Queue management for multiple simultaneous speakers
- Priority system for interruptions
- Resource allocation for multiple Whisper requests
Phase 4E: Latency Optimization
- Profile each stage of the pipeline
- Optimize audio chunk sizes
- Reduce WebSocket message overhead
- Tune Whisper beam search parameters
- Implement VAD lookahead for quicker detection
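For the profiling step, even a coarse per-stage wall-clock timer is enough to see where the 1-2 seconds go. A minimal sketch follows; the stage names in the usage comments are illustrative, not the actual function names in the bot.

```python
import time
from contextlib import contextmanager

timings = {}

@contextmanager
def stage(name: str):
    """Record wall-clock time (ms) for one pipeline stage."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] = (time.perf_counter() - start) * 1000

# Usage inside the response path (stage and function names are illustrative):
# with stage("whisper"):         text = transcribe(segment)
# with stage("llm_first_token"): first = next(llm_stream(text))
# with stage("tts_first_chunk"): chunk = next(tts_stream(first))
print({k: f"{v:.0f} ms" for k, v in timings.items()})
```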
Hardware Utilization
Current Allocation
- AMD RX 6800: LLaMA text models (idle during listen/speak)
- GTX 1660:
- Listen phase: Faster-Whisper (1.3GB VRAM)
- Speak phase: Soprano TTS + RVC (time-multiplexed)
- CPU: Silero VAD, audio preprocessing
Expected Performance
- VAD latency: <50ms (CPU processing)
- Transcription latency: 200-500ms (Whisper inference)
- LLM streaming: 20-30 tokens/sec (RX 6800)
- TTS synthesis: Real-time (GTX 1660)
- Total latency (speech → response): 1-2 seconds
Testing Checklist
Before marking Phase 4B as complete:
- Test basic STT connection with !miku listen
- Verify VAD detects speech start/end correctly
- Confirm transcripts are accurate and complete
- Test LLM voice response generation works
- Verify interruption cancels TTS playback
- Check multi-user handling (if possible)
- Verify resource cleanup on !miku stop-listening
- Test edge cases (silence, background noise, overlapping speech)
- Profile latencies at each stage
- Document any configuration tuning needed
Status: Code deployed, ready for user testing! 🎤🤖