# Silence Detection Implementation

## What Was Added

Implemented automatic silence detection to trigger final transcriptions in the new ONNX-based STT system.
### Problem

The new ONNX server only produces a complete transcription after receiving an explicit `{"type": "final"}` command. Without it, partial transcripts appear but are never finalized and sent to LlamaCPP.
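The finalize handshake itself is tiny. A minimal sketch of sending the command (the `send_final` helper and its callback argument are illustrative, not the actual `stt_client.py` signature):

```python
import asyncio
import json
from typing import Awaitable, Callable

# Hypothetical helper: the real stt_client.py API may differ.
async def send_final(send: Callable[[str], Awaitable[None]]) -> None:
    """Ask the STT server to finalize the current utterance."""
    await send(json.dumps({"type": "final"}))

# Usage with a stand-in for a websocket's send():
async def main() -> list:
    sent = []

    async def fake_send(msg: str) -> None:
        sent.append(msg)

    await send_final(fake_send)
    return sent

messages = asyncio.run(main())
print(messages)  # ['{"type": "final"}']
```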
### Solution

Added silence tracking in `voice_receiver.py`:

1. **Track audio timestamps**: Record when the last audio chunk was sent
2. **Detect silence**: Start a timer after each audio chunk
3. **Send final command**: If no new audio arrives within 1.5 seconds, send `{"type": "final"}`
4. **Cancel on new audio**: Reset the timer if more audio arrives
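The four steps above can be sketched as a small self-contained asyncio demo (class and method names are illustrative, not the real `voice_receiver.py` API):

```python
import asyncio
import time

class SilenceDetector:
    def __init__(self, timeout: float, on_silence):
        self.timeout = timeout
        self.on_silence = on_silence    # fires after `timeout` s of no audio
        self.last_audio_time = 0.0      # step 1: track timestamps
        self._task = None

    def audio_chunk_received(self):
        self.last_audio_time = time.monotonic()
        if self._task is not None:
            self._task.cancel()         # step 4: reset timer on new audio
        self._task = asyncio.create_task(self._detect_silence())

    async def _detect_silence(self):    # step 2: one timer per chunk
        await asyncio.sleep(self.timeout)
        self.on_silence()               # step 3: request the final transcript

async def main():
    events = []
    det = SilenceDetector(timeout=0.05, on_silence=lambda: events.append("final"))
    for _ in range(3):                  # chunks arriving faster than the timeout
        det.audio_chunk_received()
        await asyncio.sleep(0.01)
    await asyncio.sleep(0.1)            # silence: the last timer fires once
    return events

events = asyncio.run(main())
print(events)  # ['final']
```

Only the last chunk's timer survives; the earlier two are cancelled before they fire.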
---

## Implementation Details

### New Attributes

```python
self.last_audio_time: Dict[int, float] = {}       # Track last audio per user
self.silence_tasks: Dict[int, asyncio.Task] = {}  # Silence detection tasks
self.silence_timeout = 1.5                        # Seconds of silence before "final"
```

### New Method

```python
async def _detect_silence(self, user_id: int):
    """
    Wait for silence timeout and send 'final' command to STT.
    Called after each audio chunk.
    """
    await asyncio.sleep(self.silence_timeout)
    stt_client = self.stt_clients.get(user_id)
    if stt_client and stt_client.is_connected():
        logger.debug(f"Silence detected for user {user_id}, requesting final transcript")
        await stt_client.send_final()
```

### Integration

- Called after sending each audio chunk
- Cancels the previous silence task if new audio arrives
- Automatically cleaned up when stopping listening
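The cleanup side can be sketched like this (attribute names follow the ones listed above, but the `stop_listening()` body is a hypothetical reconstruction, not the actual code):

```python
import asyncio

class VoiceReceiverCleanup:
    def __init__(self):
        self.last_audio_time = {}
        self.silence_tasks = {}

    def stop_listening(self, user_id: int):
        # Cancel any pending silence timer so no stray "final" is sent
        # after the user stops listening, then drop per-user state.
        task = self.silence_tasks.pop(user_id, None)
        if task is not None and not task.done():
            task.cancel()
        self.last_audio_time.pop(user_id, None)

async def main():
    recv = VoiceReceiverCleanup()
    recv.silence_tasks[42] = asyncio.create_task(asyncio.sleep(10))
    recv.last_audio_time[42] = 0.0
    recv.stop_listening(42)
    await asyncio.sleep(0)  # let the cancellation propagate
    return recv.silence_tasks, recv.last_audio_time

tasks, times = asyncio.run(main())
print(tasks, times)  # {} {}
```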

---

## Testing

### Test 1: Basic Transcription

1. Join a voice channel
2. Run `!miku listen`
3. **Speak a sentence** and wait 1.5 seconds
4. **Expected**: Final transcript appears and is sent to LlamaCPP

### Test 2: Continuous Speech

1. Start listening
2. **Speak multiple sentences** with pauses < 1.5s between them
3. **Expected**: Partial transcripts update; a final is sent only after the last sentence

### Test 3: Multiple Users

1. Have 2+ users in the voice channel
2. Each runs `!miku listen`
3. Both speak (taking turns or simultaneously)
4. **Expected**: Each user's speech is transcribed independently
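The independence in Test 3 comes from keying every timer by user ID, so one user's restarts never touch another's countdown. A toy check (IDs and timeouts are illustrative):

```python
import asyncio

async def main():
    finals = []

    def make_timer(user_id, timeout=0.05):
        async def timer():
            await asyncio.sleep(timeout)
            finals.append(user_id)
        return asyncio.create_task(timer())

    silence_tasks = {}
    silence_tasks[111] = make_timer(111)  # user 111 goes silent immediately
    silence_tasks[222] = make_timer(222)
    silence_tasks[222].cancel()           # user 222 keeps talking...
    silence_tasks[222] = make_timer(222)  # ...so their timer restarts

    await asyncio.sleep(0.1)              # both timers have fired by now
    return sorted(finals)

finals = asyncio.run(main())
print(finals)  # [111, 222]
```

Restarting user 222's timer did not disturb user 111's; each fired exactly once.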

---

## Configuration

### Silence Timeout

Default: `1.5` seconds

**To adjust**, edit `voice_receiver.py`:

```python
self.silence_timeout = 1.5  # Change this value
```

**Recommendations**:

- **Too short (< 1.0s)**: May cut off during natural pauses in speech
- **Too long (> 3.0s)**: User waits too long for a response
- **Sweet spot**: 1.5-2.0s works well for conversational speech
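If editing the source for every tuning pass is inconvenient, one possible extension (not currently implemented; the `STT_SILENCE_TIMEOUT` variable name is made up) is an environment override with the same 1.5s default:

```python
import os

# Hypothetical: STT_SILENCE_TIMEOUT is not an existing variable in this
# project; this only sketches an environment override pattern.
def load_silence_timeout(default: float = 1.5) -> float:
    raw = os.environ.get("STT_SILENCE_TIMEOUT")
    if raw is None:
        return default
    try:
        value = float(raw)
    except ValueError:
        return default          # ignore malformed values
    return value if value > 0 else default

os.environ["STT_SILENCE_TIMEOUT"] = "2.0"
timeout = load_silence_timeout()
print(timeout)  # 2.0
```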

---

## Monitoring

### Check Logs for Silence Detection

```bash
docker logs miku-bot 2>&1 | grep "Silence detected"
```

**Expected output**:

```
[DEBUG] Silence detected for user 209381657369772032, requesting final transcript
```

### Check Final Transcripts

```bash
docker logs miku-bot 2>&1 | grep "FINAL TRANSCRIPT"
```

### Check STT Processing

```bash
docker logs miku-stt 2>&1 | grep "Final transcription"
```

---

## Debugging

### Issue: No Final Transcript

**Symptoms**: Partial transcripts appear but never finalize

**Debug steps**:

1. Check if silence detection is triggering:

   ```bash
   docker logs miku-bot 2>&1 | grep "Silence detected"
   ```

2. Check if the final command is being sent:

   ```bash
   docker logs miku-stt 2>&1 | grep "type.*final"
   ```

3. Increase the log level in `stt_client.py`:

   ```python
   logger.setLevel(logging.DEBUG)
   ```
### Issue: Cuts Off Mid-Sentence

**Symptoms**: Final transcript triggers during natural pauses

**Solution**: Increase the silence timeout:

```python
self.silence_timeout = 2.0  # or 2.5
```

### Issue: Too Slow to Respond

**Symptoms**: Long wait after the user stops speaking

**Solution**: Decrease the silence timeout:

```python
self.silence_timeout = 1.0  # or 1.2
```

---

## Architecture

```
Discord Voice → voice_receiver.py
        ↓
[Audio Chunk Received]
        ↓
┌─────────────────────┐
│    send_audio()     │
│    to STT server    │
└─────────────────────┘
        ↓
┌─────────────────────┐
│   Start silence     │
│   detection timer   │
│   (1.5s countdown)  │
└─────────────────────┘
        ↓
   ┌────┴────────┐
   │             │
More audio    No more audio
arrives       for 1.5s
   │             │
   ↓             ↓
Cancel timer  ┌──────────────┐
Start new     │ send_final() │
              │   to STT     │
              └──────────────┘
                     ↓
          ┌─────────────────┐
          │ Final transcript│
          │   → LlamaCPP    │
          └─────────────────┘
```

---

## Files Modified

1. **bot/utils/voice_receiver.py**
   - Added `last_audio_time` tracking
   - Added `silence_tasks` management
   - Added `_detect_silence()` method
   - Integrated silence detection in `_send_audio_chunk()`
   - Added cleanup in `stop_listening()`

2. **bot/utils/stt_client.py** (previously)
   - Added `send_final()` method
   - Added `send_reset()` method
   - Updated protocol handler

---

## Next Steps

1. **Test thoroughly** with different speech patterns
2. **Tune the silence timeout** based on user feedback
3. **Consider VAD integration** for more accurate speech end detection
4. **Add metrics** to track transcription latency

---

**Status**: ✅ **READY FOR TESTING**

The system now:

- ✅ Connects to the ONNX STT server (port 8766)
- ✅ Uses CUDA GPU acceleration (cuDNN 9)
- ✅ Receives partial transcripts
- ✅ Automatically detects silence
- ✅ Sends the final command after 1.5s of silence
- ✅ Forwards the final transcript to LlamaCPP

**Test it now with `!miku listen`!**