# Silence Detection Implementation
## What Was Added
Implemented automatic silence detection to trigger final transcriptions in the new ONNX-based STT system.
### Problem
The new ONNX server requires manually sending a `{"type": "final"}` command to get the complete transcription. Without this, partial transcripts would appear but never be finalized and sent to LlamaCPP.
### Solution
Added silence tracking in `voice_receiver.py`:
1. **Track audio timestamps**: Record when the last audio chunk was sent
2. **Detect silence**: Start a timer after each audio chunk
3. **Send final command**: If no new audio arrives within 1.5 seconds, send `{"type": "final"}`
4. **Cancel on new audio**: Reset the timer if more audio arrives
---
## Implementation Details
### New Attributes
```python
self.last_audio_time: Dict[int, float] = {}       # Track last audio per user
self.silence_tasks: Dict[int, asyncio.Task] = {}  # Silence detection tasks
self.silence_timeout = 1.5                        # Seconds of silence before "final"
```
### New Method
```python
async def _detect_silence(self, user_id: int):
    """
    Wait for the silence timeout, then send the 'final' command to STT.
    Scheduled after each audio chunk; cancelled if more audio arrives.
    """
    await asyncio.sleep(self.silence_timeout)
    stt_client = self.stt_clients.get(user_id)
    if stt_client and stt_client.is_connected():
        await stt_client.send_final()
### Integration
- Called after sending each audio chunk
- Cancels previous silence task if new audio arrives
- Automatically cleaned up when stopping listening
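The cancel-and-restart flow above can be sketched as a small self-contained class. The names mirror `voice_receiver.py`, but `SilenceTracker`, `on_audio_chunk()`, and the `finals` list (a stand-in for `stt_client.send_final()`) are hypothetical illustrations, not the actual implementation:

```python
import asyncio
import time
from typing import Dict

class SilenceTracker:
    """Minimal sketch of the silence-detection flow (hypothetical class)."""

    def __init__(self, timeout: float = 1.5):
        self.silence_timeout = timeout
        self.last_audio_time: Dict[int, float] = {}
        self.silence_tasks: Dict[int, asyncio.Task] = {}
        self.finals = []  # user_ids we sent "final" for (stand-in for STT call)

    async def on_audio_chunk(self, user_id: int) -> None:
        # Record when this user's audio last arrived.
        self.last_audio_time[user_id] = time.monotonic()
        # New audio resets the countdown: cancel any pending silence timer.
        task = self.silence_tasks.get(user_id)
        if task and not task.done():
            task.cancel()
        self.silence_tasks[user_id] = asyncio.create_task(
            self._detect_silence(user_id)
        )

    async def _detect_silence(self, user_id: int) -> None:
        # If this sleep completes without cancellation, the user went silent.
        await asyncio.sleep(self.silence_timeout)
        self.finals.append(user_id)  # stand-in for stt_client.send_final()

async def main():
    tracker = SilenceTracker(timeout=0.1)
    await tracker.on_audio_chunk(1)
    await asyncio.sleep(0.05)        # pause shorter than the timeout
    await tracker.on_audio_chunk(1)  # resets the timer, so no final yet
    assert tracker.finals == []
    await asyncio.sleep(0.2)         # silence longer than the timeout
    assert tracker.finals == [1]     # exactly one final per utterance

asyncio.run(main())
```

Because each new chunk cancels the pending task, `send_final()` fires exactly once per utterance, no matter how many chunks preceded the pause.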
---
## Testing
### Test 1: Basic Transcription
1. Join voice channel
2. Run `!miku listen`
3. **Speak a sentence** and wait 1.5 seconds
4. **Expected**: Final transcript appears and is sent to LlamaCPP
### Test 2: Continuous Speech
1. Start listening
2. **Speak multiple sentences** with pauses < 1.5s between them
3. **Expected**: Partial transcripts update, final sent after last sentence
### Test 3: Multiple Users
1. Have 2+ users in voice channel
2. Each runs `!miku listen`
3. Both speak (taking turns or simultaneously)
4. **Expected**: Each user's speech is transcribed independently
---
## Configuration
### Silence Timeout
Default: `1.5` seconds
**To adjust**, edit `voice_receiver.py`:
```python
self.silence_timeout = 1.5 # Change this value
```
**Recommendations**:
- **Too short (< 1.0s)**: May cut off during natural pauses in speech
- **Too long (> 3.0s)**: User waits too long for response
- **Sweet spot**: 1.5-2.0s works well for conversational speech
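If hand-editing the file becomes tedious, the timeout could instead be read from the environment at startup. This is a sketch only: the `SILENCE_TIMEOUT` variable name and the `load_silence_timeout()` helper are assumptions, not part of the current code. The clamp enforces the recommended 1.0-3.0s range from above:

```python
import os

def load_silence_timeout(default: float = 1.5) -> float:
    """Read the silence timeout from an env var (hypothetical name),
    clamped to the recommended 1.0-3.0 second range."""
    value = float(os.getenv("SILENCE_TIMEOUT", str(default)))
    return min(max(value, 1.0), 3.0)
```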
---
## Monitoring
### Check Logs for Silence Detection
```bash
docker logs miku-bot 2>&1 | grep "Silence detected"
```
**Expected output**:
```
[DEBUG] Silence detected for user 209381657369772032, requesting final transcript
```
### Check Final Transcripts
```bash
docker logs miku-bot 2>&1 | grep "FINAL TRANSCRIPT"
```
### Check STT Processing
```bash
docker logs miku-stt 2>&1 | grep "Final transcription"
```
---
## Debugging
### Issue: No Final Transcript
**Symptoms**: Partial transcripts appear but never finalize
**Debug steps**:
1. Check if silence detection is triggering:
```bash
docker logs miku-bot 2>&1 | grep "Silence detected"
```
2. Check if final command is being sent:
```bash
docker logs miku-stt 2>&1 | grep "type.*final"
```
3. Enable debug logging in `stt_client.py`:
```python
logger.setLevel(logging.DEBUG)
```
### Issue: Cuts Off Mid-Sentence
**Symptoms**: Final transcript triggers during natural pauses
**Solution**: Increase silence timeout:
```python
self.silence_timeout = 2.0 # or 2.5
```
### Issue: Too Slow to Respond
**Symptoms**: Long wait after user stops speaking
**Solution**: Decrease silence timeout:
```python
self.silence_timeout = 1.0 # or 1.2
```
---
## Architecture
```
Discord Voice → voice_receiver.py
             │
             ▼
   [Audio Chunk Received]
             │
             ▼
  ┌─────────────────────┐
  │     send_audio()    │
  │    to STT server    │
  └─────────────────────┘
             │
             ▼
  ┌─────────────────────┐
  │    Start silence    │
  │   detection timer   │
  │   (1.5s countdown)  │
  └─────────────────────┘
             │
      ┌──────┴──────┐
      │             │
 More audio    No more audio
  arrives        for 1.5s
      │             │
      ▼             ▼
Cancel timer,  ┌──────────────┐
start new one  │ send_final() │
               │   to STT     │
               └──────────────┘
                      │
                      ▼
             ┌─────────────────┐
             │ Final transcript│
             │    → LlamaCPP   │
             └─────────────────┘
```
---
## Files Modified
1. **bot/utils/voice_receiver.py**
- Added `last_audio_time` tracking
- Added `silence_tasks` management
- Added `_detect_silence()` method
- Integrated silence detection in `_send_audio_chunk()`
- Added cleanup in `stop_listening()`
2. **bot/utils/stt_client.py** (modified in an earlier change)
- Added `send_final()` method
- Added `send_reset()` method
- Updated protocol handler
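The `send_final()` and `send_reset()` methods presumably serialize the control messages described in the Problem section as JSON over the websocket. A sketch of that control-message half of the client, assuming a `websockets`-style async `send()` interface (the class shape here is an illustration, not the actual `stt_client.py`):

```python
import json

class STTClient:
    """Sketch of the control-message half of the STT client (assumed shape)."""

    def __init__(self, websocket):
        self.ws = websocket

    def is_connected(self) -> bool:
        return self.ws is not None

    async def send_final(self) -> None:
        # Ask the ONNX server to finalize the current utterance.
        await self.ws.send(json.dumps({"type": "final"}))

    async def send_reset(self) -> None:
        # Clear server-side transcription state between utterances.
        await self.ws.send(json.dumps({"type": "reset"}))
```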
---
## Next Steps
1. **Test thoroughly** with different speech patterns
2. **Tune silence timeout** based on user feedback
3. **Consider VAD integration** for more accurate speech end detection
4. **Add metrics** to track transcription latency
---
**Status**: ✅ **READY FOR TESTING**
The system now:
- ✅ Connects to ONNX STT server (port 8766)
- ✅ Uses CUDA GPU acceleration (cuDNN 9)
- ✅ Receives partial transcripts
- ✅ Automatically detects silence
- ✅ Sends final command after 1.5s silence
- ✅ Forwards final transcript to LlamaCPP
**Test it now with `!miku listen`!**