# Silence Detection Implementation

## What Was Added

Implemented automatic silence detection to trigger final transcriptions in the new ONNX-based STT system.

### Problem

The new ONNX server requires the client to explicitly send a `{"type": "final"}` command to get the complete transcription. Without this, partial transcripts would appear but never be finalized and sent to LlamaCPP.

### Solution

Added silence tracking in `voice_receiver.py`:

1. **Track audio timestamps**: Record when the last audio chunk was sent
2. **Detect silence**: Start a timer after each audio chunk
3. **Send final command**: If no new audio arrives within 1.5 seconds, send `{"type": "final"}`
4. **Cancel on new audio**: Reset the timer if more audio arrives

---

## Implementation Details

### New Attributes

```python
self.last_audio_time: Dict[int, float] = {}       # Track last audio per user
self.silence_tasks: Dict[int, asyncio.Task] = {}  # Silence detection tasks
self.silence_timeout = 1.5                        # Seconds of silence before "final"
```

### New Method

```python
async def _detect_silence(self, user_id: int):
    """
    Wait for silence timeout and send 'final' command to STT.
    Called after each audio chunk.
    """
    # If new audio arrives before the timeout, this task is cancelled
    # and the sleep never completes, so no "final" command is sent.
    await asyncio.sleep(self.silence_timeout)
    stt_client = self.stt_clients.get(user_id)
    if stt_client and stt_client.is_connected():
        await stt_client.send_final()
```

### Integration

- Called after sending each audio chunk (see the sketch below)
- Cancels the previous silence task if new audio arrives
- Automatically cleaned up when stopping listening
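The cancel-and-restart logic is the heart of the integration. Here is a minimal sketch of how `_send_audio_chunk()` can arm the timer; the method body is simplified, `send_audio()` and `is_connected()` follow the names used elsewhere in this document, and the actual implementation in `voice_receiver.py` may differ:

```python
import asyncio
import time

async def _send_audio_chunk(self, user_id: int, chunk: bytes):
    """Forward one audio chunk to STT and (re)arm the silence timer."""
    stt_client = self.stt_clients.get(user_id)
    if not stt_client or not stt_client.is_connected():
        return

    await stt_client.send_audio(chunk)
    self.last_audio_time[user_id] = time.monotonic()  # monotonic clock assumed

    # New audio makes any pending countdown stale: cancel it.
    previous = self.silence_tasks.get(user_id)
    if previous and not previous.done():
        previous.cancel()

    # Arm a fresh countdown; if it completes (no new audio for
    # silence_timeout seconds), _detect_silence() sends {"type": "final"}.
    self.silence_tasks[user_id] = asyncio.create_task(
        self._detect_silence(user_id)
    )
```

Cancelling the previous task outright means at most one countdown per user is ever live, and it always corresponds to the most recent chunk, so the timer itself never needs to compare timestamps.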
---

## Testing

### Test 1: Basic Transcription

1. Join a voice channel
2. Run `!miku listen`
3. **Speak a sentence** and wait 1.5 seconds
4. **Expected**: Final transcript appears and is sent to LlamaCPP

### Test 2: Continuous Speech

1. Start listening
2. **Speak multiple sentences** with pauses < 1.5s between them
3. **Expected**: Partial transcripts update, final sent after the last sentence

### Test 3: Multiple Users

1. Have 2+ users in the voice channel
2. Each runs `!miku listen`
3. Both speak (taking turns or simultaneously)
4. **Expected**: Each user's speech is transcribed independently

---

## Configuration

### Silence Timeout

Default: `1.5` seconds

**To adjust**, edit `voice_receiver.py`:

```python
self.silence_timeout = 1.5  # Change this value
```

**Recommendations**:

- **Too short (< 1.0s)**: May cut off during natural pauses in speech
- **Too long (> 3.0s)**: User waits too long for a response
- **Sweet spot**: 1.5-2.0s works well for conversational speech

---

## Monitoring

### Check Logs for Silence Detection

```bash
docker logs miku-bot 2>&1 | grep "Silence detected"
```

**Expected output**:

```
[DEBUG] Silence detected for user 209381657369772032, requesting final transcript
```

### Check Final Transcripts

```bash
docker logs miku-bot 2>&1 | grep "FINAL TRANSCRIPT"
```

### Check STT Processing

```bash
docker logs miku-stt 2>&1 | grep "Final transcription"
```

---

## Debugging

### Issue: No Final Transcript

**Symptoms**: Partial transcripts appear but never finalize

**Debug steps**:

1. Check whether silence detection is triggering:
   ```bash
   docker logs miku-bot 2>&1 | grep "Silence detected"
   ```
2. Check whether the final command is being sent:
   ```bash
   docker logs miku-stt 2>&1 | grep "type.*final"
   ```
3. Increase the log level in `stt_client.py`:
   ```python
   logger.setLevel(logging.DEBUG)
   ```

### Issue: Cuts Off Mid-Sentence

**Symptoms**: Final transcript triggers during natural pauses

**Solution**: Increase the silence timeout:

```python
self.silence_timeout = 2.0  # or 2.5
```

### Issue: Too Slow to Respond

**Symptoms**: Long wait after the user stops speaking

**Solution**: Decrease the silence timeout:

```python
self.silence_timeout = 1.0  # or 1.2
```

---

## Architecture

```
Discord Voice → voice_receiver.py
        ↓
[Audio Chunk Received]
        ↓
┌─────────────────────┐
│ send_audio()        │
│ to STT server       │
└─────────────────────┘
        ↓
┌─────────────────────┐
│ Start silence       │
│ detection timer     │
│ (1.5s countdown)    │
└─────────────────────┘
        ↓
  ┌─────┴─────┐
  │           │
More audio   No more audio
arrives      for 1.5s
  │           │
  ↓           ↓
Cancel timer  ┌──────────────┐
Start new     │ send_final() │
              │   to STT     │
              └──────────────┘
                     ↓
              ┌─────────────────┐
              │ Final transcript│
              │   → LlamaCPP    │
              └─────────────────┘
```

---

## Files Modified

1. **bot/utils/voice_receiver.py**
   - Added `last_audio_time` tracking
   - Added `silence_tasks` management
   - Added `_detect_silence()` method
   - Integrated silence detection in `_send_audio_chunk()`
   - Added cleanup in `stop_listening()`

2. **bot/utils/stt_client.py** (previously)
   - Added `send_final()` method
   - Added `send_reset()` method
   - Updated protocol handler

---

## Next Steps

1. **Test thoroughly** with different speech patterns
2. **Tune the silence timeout** based on user feedback
3. **Consider VAD integration** for more accurate end-of-speech detection
4. **Add metrics** to track transcription latency

---

**Status**: ✅ **READY FOR TESTING**

The system now:

- ✅ Connects to the ONNX STT server (port 8766)
- ✅ Uses CUDA GPU acceleration (cuDNN 9)
- ✅ Receives partial transcripts
- ✅ Automatically detects silence
- ✅ Sends the final command after 1.5s of silence
- ✅ Forwards the final transcript to LlamaCPP

**Test it now with `!miku listen`!**
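---

For reference, a minimal sketch of the `send_final()` / `send_reset()` commands that `stt_client.py` gained earlier. The `{"type": "final"}` payload matches the protocol described above; the `websockets`-style `self.ws` connection attribute and the `{"type": "reset"}` payload shape are assumptions, not confirmed details of the actual client:

```python
import json

class STTClient:
    # ... connection management omitted ...

    async def send_final(self) -> None:
        """Ask the STT server to finalize and return the complete transcript."""
        await self.ws.send(json.dumps({"type": "final"}))

    async def send_reset(self) -> None:
        """Clear the server-side audio buffer for a new utterance (assumed payload)."""
        await self.ws.send(json.dumps({"type": "reset"}))
```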