STT Voice Testing Guide
Phase 4B: Bot-Side STT Integration - COMPLETE ✅
All code has been deployed to containers. Ready for testing!
Architecture Overview
Discord Voice (User) → Opus 48kHz stereo
↓
VoiceReceiver.write()
↓
Opus decode → Stereo-to-mono → Resample to 16kHz
↓
STTClient.send_audio() → WebSocket
↓
miku-stt:8001 (Silero VAD + Faster-Whisper)
↓
JSON events (vad, partial, final, interruption)
↓
VoiceReceiver callbacks → voice_manager
↓
on_final_transcript() → _generate_voice_response()
↓
LLM streaming → TTS tokens → Audio playback
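The decode-and-resample step in the middle of this pipeline is straightforward: average the two channels, then bring 48 kHz down to 16 kHz. Below is a minimal sketch of that conversion, assuming numpy and int16 PCM input; the function name and the naive 3:1 decimation are illustrative, not the actual VoiceReceiver code (which may use a proper low-pass resampler).

```python
import numpy as np

def downmix_and_resample(pcm_48k_stereo: bytes) -> bytes:
    """Convert 48 kHz stereo int16 PCM to 16 kHz mono int16 PCM (illustrative)."""
    samples = np.frombuffer(pcm_48k_stereo, dtype=np.int16)
    stereo = samples.reshape(-1, 2).astype(np.int32)
    mono = (stereo[:, 0] + stereo[:, 1]) // 2   # stereo -> mono by averaging channels
    mono_16k = mono[::3].astype(np.int16)       # naive 3:1 decimation (48 kHz -> 16 kHz)
    return mono_16k.tobytes()
```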
New Voice Commands
1. Start Listening
!miku listen
- Starts listening to your voice in the current voice channel
- You must be in the same channel as Miku
- Miku will transcribe your speech and respond with voice
!miku listen @username
- Start listening to a specific user's voice
- Useful for moderators or testing with multiple users
2. Stop Listening
!miku stop-listening
- Stop listening to your voice
- Miku will no longer transcribe or respond to your speech
!miku stop-listening @username
- Stop listening to a specific user
Testing Procedure
Test 1: Basic STT Connection
- Join a voice channel
- !miku join - Miku joins your channel
- !miku listen - Start listening to your voice
- Check bot logs for "Started listening to user"
- Check STT logs:
docker logs miku-stt --tail 50
- Should show: "WebSocket connection from user {user_id}"
- Should show: "Session started for user {user_id}"
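If you want to sanity-check the STT WebSocket outside Discord, a small smoke test like the one below can confirm the endpoint accepts connections. It assumes the `websockets` package and the per-user path listed under API Endpoints; whether the server accepts raw binary PCM frames and what it replies with are assumptions.

```python
import asyncio
import websockets  # pip install websockets

async def smoke_test(user_id: int = 123456789):
    uri = f"ws://localhost:8001/ws/stt/{user_id}"
    async with websockets.connect(uri) as ws:
        # Send one second of silence (16 kHz mono int16 = 32000 bytes), then wait
        # briefly for any event; a timeout just means VAD saw no speech.
        await ws.send(b"\x00" * 32000)
        try:
            print(await asyncio.wait_for(ws.recv(), timeout=5))
        except asyncio.TimeoutError:
            print("No event (expected for silence) - connection itself is OK")

asyncio.run(smoke_test())
```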
Test 2: VAD Detection
- After !miku listen, speak into your microphone
- Say something like: "Hello Miku, can you hear me?"
- Check STT logs for VAD events:
[DEBUG] VAD: speech_start probability=0.85
[DEBUG] VAD: speaking probability=0.92
[DEBUG] VAD: speech_end probability=0.15
- Bot logs should show: "VAD event for user {id}: speech_start/speaking/speech_end"
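To reproduce these probabilities outside the container, here is a rough sketch of running Silero VAD on a 16 kHz chunk. Loading via torch.hub is one common way to get the model, not necessarily how miku-stt bundles it; the 0.5 cutoff mirrors the current threshold.

```python
import torch

# Load Silero VAD on CPU (one common distribution channel; miku-stt may package it differently).
model, _utils = torch.hub.load("snakers4/silero-vad", "silero_vad")

def speech_probability(chunk_int16: torch.Tensor) -> float:
    """Return speech probability for one 512-sample chunk at 16 kHz."""
    audio = chunk_int16.float() / 32768.0      # int16 -> float32 in [-1, 1]
    return model(audio, 16000).item()

prob = speech_probability(torch.zeros(512, dtype=torch.int16))
print(f"probability={prob:.2f}  speech={'yes' if prob > 0.5 else 'no'}")
```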
Test 3: Transcription
- Speak clearly into microphone: "Hey Miku, tell me a joke"
- Watch bot logs for:
- "Partial transcript from user {id}: Hey Miku..."
- "Final transcript from user {id}: Hey Miku, tell me a joke"
- Miku should respond with LLM-generated speech
- Check channel for: "🎤 Miku: [her response]"
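The event names in the architecture diagram (vad, partial, final, interruption) suggest the bot-side dispatch looks roughly like the sketch below. The JSON field names and the `session` object are assumptions for illustration, not the actual miku-stt schema or VoiceReceiver code.

```python
import json

def handle_stt_event(raw: str, session) -> None:
    """Dispatch one JSON event from the STT WebSocket (field names assumed)."""
    event = json.loads(raw)
    kind = event.get("type")
    if kind == "partial":
        # Partial transcripts are informational; later phases may use them for KV-cache warmup.
        print(f"Partial transcript from user {session.user_id}: {event['text']}")
    elif kind == "final":
        print(f"Final transcript from user {session.user_id}: {event['text']}")
        session.on_final_transcript(event["text"])   # kicks off _generate_voice_response()
    elif kind == "vad":
        print(f"VAD event for user {session.user_id}: {event.get('state')}")
    elif kind == "interruption":
        session.on_interruption(event.get("probability"))
```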
Test 4: Interruption Detection
- !miku listen
- !miku say Tell me a very long story about your favorite song
- While Miku is speaking, start talking yourself
- Speak loudly enough to trigger VAD (probability > 0.7)
- Expected behavior:
- Miku's audio should stop immediately
- Bot logs: "User {id} interrupted Miku (probability={prob})"
- STT logs: "Interruption detected during TTS playback"
- RVC logs: "Interrupted: Flushed {N} ZMQ chunks"
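You can also hit the interrupt endpoint directly to confirm the RVC side flushes its buffers, independent of VAD. This uses only the Python standard library and the endpoint listed under API Endpoints; the response body is whatever the RVC API returns.

```python
import urllib.request

# POST with an empty body to the documented interrupt endpoint (port 8765).
req = urllib.request.Request("http://localhost:8765/interrupt", data=b"", method="POST")
with urllib.request.urlopen(req, timeout=5) as resp:
    print(resp.status, resp.read().decode())
```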
Test 5: Multi-User (if available)
- Have two users join voice channel
- !miku listen @user1 - Listen to first user
- !miku listen @user2 - Listen to second user
- Both users speak separately
- Verify Miku responds to each user individually
- Check STT logs for multiple active sessions
Logs to Monitor
Bot Logs
docker logs -f miku-bot | grep -E "(listen|STT|transcript|interrupt)"
Expected output:
[INFO] Started listening to user 123456789 (username)
[DEBUG] VAD event for user 123456789: speech_start
[DEBUG] Partial transcript from user 123456789: Hello Miku...
[INFO] Final transcript from user 123456789: Hello Miku, how are you?
[INFO] User 123456789 interrupted Miku (probability=0.82)
STT Logs
docker logs -f miku-stt
Expected output:
[INFO] WebSocket connection from user_123456789
[INFO] Session started for user 123456789
[DEBUG] Received 320 audio samples from user_123456789
[DEBUG] VAD speech_start: probability=0.87
[INFO] Transcribing audio segment (duration=2.5s)
[INFO] Final transcript: "Hello Miku, how are you?"
RVC Logs (for interruption)
docker logs -f miku-rvc-api | grep -i interrupt
Expected output:
[INFO] Interrupted: Flushed 15 ZMQ chunks, cleared 48000 RVC buffer samples
Component Status
✅ Completed
- STT container running (miku-stt:8001)
- Silero VAD on CPU with chunk buffering
- Faster-Whisper on GTX 1660 (1.3GB VRAM)
- STTClient WebSocket client
- VoiceReceiver Discord audio sink
- VoiceSession STT integration
- listen/stop-listening commands
- /interrupt endpoint in RVC API
- LLM response generation from transcripts
- Interruption detection and cancellation
⏳ Pending Testing
- Basic STT connection test
- VAD speech detection test
- End-to-end transcription test
- LLM voice response test
- Interruption cancellation test
- Multi-user testing (if available)
🔧 Configuration Tuning (after testing)
- VAD sensitivity (currently threshold=0.5)
- VAD timing (min_speech=250ms, min_silence=500ms)
- Interruption threshold (currently 0.7)
- Whisper beam size and patience
- LLM streaming chunk size
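If these knobs end up collected in one place, the config might look roughly like the following. The names and structure are hypothetical; only the values come from the list above (beam_size 5 is the faster-whisper default).

```python
STT_TUNING = {
    "vad": {
        "threshold": 0.5,          # speech probability needed to count as speech
        "min_speech_ms": 250,      # ignore blips shorter than this
        "min_silence_ms": 500,     # this much silence ends a segment
    },
    "interruption_threshold": 0.7, # VAD probability needed to cut off Miku mid-speech
    "whisper": {
        "beam_size": 5,            # raise for accuracy, lower for speed
        "patience": 1.0,
    },
}
```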
API Endpoints
STT Container (port 8001)
- WebSocket: ws://localhost:8001/ws/stt/{user_id}
- Health: http://localhost:8001/health
RVC Container (port 8765)
- WebSocket: ws://localhost:8765/ws/stream
- Interrupt: http://localhost:8765/interrupt (POST)
- Health: http://localhost:8765/health
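A quick way to confirm both containers are up before running the tests (standard library only; assumes the health endpoints return a 200 with a small body):

```python
import urllib.request

for name, url in [("STT", "http://localhost:8001/health"),
                  ("RVC", "http://localhost:8765/health")]:
    try:
        with urllib.request.urlopen(url, timeout=3) as resp:
            print(f"{name}: {resp.status} {resp.read().decode()[:80]}")
    except OSError as exc:
        print(f"{name}: DOWN ({exc})")
```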
Troubleshooting
No audio received from Discord
- Check bot logs for "write() called with data"
- Verify user is in same voice channel as Miku
- Check Discord permissions (View Channel, Connect, Speak)
VAD not detecting speech
- Check chunk buffer accumulation in STT logs
- Verify audio format: PCM int16, 16kHz mono
- Try speaking louder or more clearly
- Check VAD threshold (may need adjustment)
Transcription empty or gibberish
- Verify Whisper model loaded (check STT startup logs)
- Check GPU VRAM usage: nvidia-smi
- Ensure audio segments are at least 1-2 seconds long
- Try speaking more clearly with less background noise
Interruption not working
- Verify Miku is actually speaking (check miku_speaking flag)
- Check VAD probability in logs (must be > 0.7)
- Verify /interrupt endpoint returns success
- Check RVC logs for flushed chunks
Multiple users causing issues
- Check STT logs for per-user session management
- Verify each user has separate STTClient instance
- Check for resource contention on GTX 1660
Next Steps After Testing
Phase 4C: LLM KV Cache Precomputation
- Use partial transcripts to start LLM generation early
- Precompute KV cache for common phrases
- Reduce latency between speech end and response start
Phase 4D: Multi-User Refinement
- Queue management for multiple simultaneous speakers
- Priority system for interruptions
- Resource allocation for multiple Whisper requests
Phase 4E: Latency Optimization
- Profile each stage of the pipeline
- Optimize audio chunk sizes
- Reduce WebSocket message overhead
- Tune Whisper beam search parameters
- Implement VAD lookahead for quicker detection
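For the profiling step, even a coarse per-stage wall-clock timer is enough to see where the 1-2 seconds go. A minimal sketch follows; the stage names in the usage comments are illustrative, not the actual function names in the bot.

```python
import time
from contextlib import contextmanager

timings = {}

@contextmanager
def stage(name: str):
    """Record wall-clock time (ms) for one pipeline stage."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] = (time.perf_counter() - start) * 1000

# Usage inside the response path (stage and function names are illustrative):
# with stage("whisper"):         text = transcribe(segment)
# with stage("llm_first_token"): first = next(llm_stream(text))
# with stage("tts_first_chunk"): chunk = next(tts_stream(first))
print({k: f"{v:.0f} ms" for k, v in timings.items()})
```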
Hardware Utilization
Current Allocation
- AMD RX 6800: LLaMA text models (idle during listen/speak)
- GTX 1660:
- Listen phase: Faster-Whisper (1.3GB VRAM)
- Speak phase: Soprano TTS + RVC (time-multiplexed)
- CPU: Silero VAD, audio preprocessing
Expected Performance
- VAD latency: <50ms (CPU processing)
- Transcription latency: 200-500ms (Whisper inference)
- LLM streaming: 20-30 tokens/sec (RX 6800)
- TTS synthesis: Real-time (GTX 1660)
- Total latency (speech → response): 1-2 seconds
Testing Checklist
Before marking Phase 4B as complete:
- Test basic STT connection with !miku listen
- Verify VAD detects speech start/end correctly
- Confirm transcripts are accurate and complete
- Test LLM voice response generation works
- Verify interruption cancels TTS playback
- Check multi-user handling (if possible)
- Verify resource cleanup on !miku stop-listening
- Test edge cases (silence, background noise, overlapping speech)
- Profile latencies at each stage
- Document any configuration tuning needed
Status: Code deployed, ready for user testing! 🎤🤖