Implemented experimental production-ready voice chat; relegated old flow to voice debug mode. New Web UI panel for Voice Chat.
New file: SILENCE_DETECTION.md
# Silence Detection Implementation

## What Was Added

Implemented automatic silence detection to trigger final transcriptions in the new ONNX-based STT system.
### Problem

The new ONNX server requires manually sending a `{"type": "final"}` command to get the complete transcription. Without this, partial transcripts would appear but never be finalized and sent to LlamaCPP.

### Solution

Added silence tracking in `voice_receiver.py`:

1. **Track audio timestamps**: Record when the last audio chunk was sent
2. **Detect silence**: Start a timer after each audio chunk
3. **Send final command**: If no new audio arrives within 1.5 seconds, send `{"type": "final"}`
4. **Cancel on new audio**: Reset the timer if more audio arrives
---

## Implementation Details

### New Attributes

```python
self.last_audio_time: Dict[int, float] = {}       # Track last audio per user
self.silence_tasks: Dict[int, asyncio.Task] = {}  # Silence detection tasks
self.silence_timeout = 1.5                        # Seconds of silence before "final"
```
### New Method

```python
async def _detect_silence(self, user_id: int):
    """
    Wait for the silence timeout, then send the 'final' command to STT.

    Called after each audio chunk; cancelled if new audio arrives first.
    """
    await asyncio.sleep(self.silence_timeout)
    stt_client = self.stt_clients.get(user_id)
    if stt_client and stt_client.is_connected():
        await stt_client.send_final()
```
### Integration

- Called after sending each audio chunk
- Cancels the previous silence task if new audio arrives
- Automatically cleaned up when stopping listening
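The cancel/restart pattern described above can be sketched as a self-contained asyncio demo. Note that `SilenceTracker`, `on_audio_chunk`, and the shortened 0.05s timeout are illustrative stand-ins for the real `voice_receiver.py` code, not its actual implementation:

```python
import asyncio

SILENCE_TIMEOUT = 0.05  # shortened from 1.5s so the demo runs quickly


class SilenceTracker:
    """Toy stand-in for the per-user silence tracking in voice_receiver.py."""

    def __init__(self):
        self.silence_tasks: dict[int, asyncio.Task] = {}
        self.finals_sent: list[int] = []  # records which users got "final"

    async def _detect_silence(self, user_id: int):
        # If this sleep completes without cancellation, silence was detected.
        await asyncio.sleep(SILENCE_TIMEOUT)
        self.finals_sent.append(user_id)  # stands in for stt_client.send_final()

    def on_audio_chunk(self, user_id: int):
        # Cancel any pending timer for this user, then start a fresh one.
        task = self.silence_tasks.get(user_id)
        if task and not task.done():
            task.cancel()
        self.silence_tasks[user_id] = asyncio.ensure_future(
            self._detect_silence(user_id)
        )


async def demo():
    tracker = SilenceTracker()
    # Three rapid chunks: each cancels the prior timer, so only the last survives.
    for _ in range(3):
        tracker.on_audio_chunk(user_id=42)
        await asyncio.sleep(0.01)  # pause shorter than SILENCE_TIMEOUT
    await asyncio.sleep(0.2)  # silence: the surviving timer fires exactly once
    return tracker.finals_sent


print(asyncio.run(demo()))  # → [42]
```

The key property is that "final" fires once per utterance, no matter how many chunks arrived, because every chunk before the last had its timer cancelled.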
---

## Testing

### Test 1: Basic Transcription

1. Join a voice channel
2. Run `!miku listen`
3. **Speak a sentence** and wait 1.5 seconds
4. **Expected**: Final transcript appears and is sent to LlamaCPP

### Test 2: Continuous Speech

1. Start listening
2. **Speak multiple sentences** with pauses < 1.5s between them
3. **Expected**: Partial transcripts update; the final is sent after the last sentence

### Test 3: Multiple Users

1. Have 2+ users in the voice channel
2. Each runs `!miku listen`
3. Both speak (taking turns or simultaneously)
4. **Expected**: Each user's speech is transcribed independently
---

## Configuration

### Silence Timeout

Default: `1.5` seconds

**To adjust**, edit `voice_receiver.py`:

```python
self.silence_timeout = 1.5  # Change this value
```

**Recommendations**:

- **Too short (< 1.0s)**: May cut off during natural pauses in speech
- **Too long (> 3.0s)**: The user waits too long for a response
- **Sweet spot**: 1.5-2.0s works well for conversational speech
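If editing the source for every tuning pass is inconvenient, the timeout could instead be read from an environment variable. This is a hypothetical extension, not something `voice_receiver.py` currently does, and `MIKU_SILENCE_TIMEOUT` is an invented name:

```python
import os

def get_silence_timeout(default: float = 1.5) -> float:
    """Hypothetical helper: read the silence timeout from the environment,
    falling back to the documented 1.5s default."""
    raw = os.environ.get("MIKU_SILENCE_TIMEOUT", "")
    try:
        value = float(raw)
    except ValueError:
        return default
    # Clamp to a sane range: < 1.0s risks mid-sentence cutoffs,
    # > 3.0s makes responses feel sluggish (see recommendations above).
    return min(max(value, 1.0), 3.0)

print(get_silence_timeout())  # → 1.5 when the variable is unset
```

The clamp mirrors the recommendations above so a typo in the variable cannot push the timeout outside the usable range.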
---

## Monitoring

### Check Logs for Silence Detection

```bash
docker logs miku-bot 2>&1 | grep "Silence detected"
```

**Expected output**:

```
[DEBUG] Silence detected for user 209381657369772032, requesting final transcript
```

### Check Final Transcripts

```bash
docker logs miku-bot 2>&1 | grep "FINAL TRANSCRIPT"
```

### Check STT Processing

```bash
docker logs miku-stt 2>&1 | grep "Final transcription"
```
---

## Debugging

### Issue: No Final Transcript

**Symptoms**: Partial transcripts appear but never finalize

**Debug steps**:

1. Check whether silence detection is triggering:

   ```bash
   docker logs miku-bot 2>&1 | grep "Silence detected"
   ```

2. Check whether the final command is being sent:

   ```bash
   docker logs miku-stt 2>&1 | grep "type.*final"
   ```

3. Increase the log level in `stt_client.py`:

   ```python
   logger.setLevel(logging.DEBUG)
   ```

### Issue: Cuts Off Mid-Sentence

**Symptoms**: The final transcript triggers during natural pauses

**Solution**: Increase the silence timeout:

```python
self.silence_timeout = 2.0  # or 2.5
```

### Issue: Too Slow to Respond

**Symptoms**: Long wait after the user stops speaking

**Solution**: Decrease the silence timeout:

```python
self.silence_timeout = 1.0  # or 1.2
```
---

## Architecture

```
Discord Voice → voice_receiver.py
        ↓
[Audio Chunk Received]
        ↓
┌─────────────────────┐
│   send_audio()      │
│   to STT server     │
└─────────────────────┘
        ↓
┌─────────────────────┐
│  Start silence      │
│  detection timer    │
│  (1.5s countdown)   │
└─────────────────────┘
        ↓
   ┌──────┴──────┐
   │             │
More audio   No more audio
arrives      for 1.5s
   │             │
   ↓             ↓
Cancel timer  ┌──────────────┐
Start new     │ send_final() │
              │   to STT     │
              └──────────────┘
                     ↓
              ┌─────────────────┐
              │ Final transcript│
              │   → LlamaCPP    │
              └─────────────────┘
```
---

## Files Modified

1. **bot/utils/voice_receiver.py**
   - Added `last_audio_time` tracking
   - Added `silence_tasks` management
   - Added `_detect_silence()` method
   - Integrated silence detection in `_send_audio_chunk()`
   - Added cleanup in `stop_listening()`

2. **bot/utils/stt_client.py** (previously)
   - Added `send_final()` method
   - Added `send_reset()` method
   - Updated protocol handler
---

## Next Steps

1. **Test thoroughly** with different speech patterns
2. **Tune the silence timeout** based on user feedback
3. **Consider VAD integration** for more accurate end-of-speech detection
4. **Add metrics** to track transcription latency
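As a starting point for item 3, end-of-speech could be gated on frame energy so the silence timer only runs after genuinely quiet audio. This is a minimal sketch, not a real VAD (a production integration would more likely use something like WebRTC VAD or Silero), and the 500.0 RMS threshold is an arbitrary illustrative value:

```python
import math
import struct

def is_speech(frame: bytes, threshold: float = 500.0) -> bool:
    """Crude energy gate for 16-bit little-endian mono PCM.

    Assumes well-formed frames (an even number of bytes). Returns True when
    the frame's RMS amplitude meets the threshold.
    """
    if len(frame) < 2:
        return False  # no complete 16-bit sample
    samples = struct.unpack(f"<{len(frame) // 2}h", frame)
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return rms >= threshold

# A loud square-wave-like frame vs. near-silence:
loud = struct.pack("<4h", 8000, -8000, 8000, -8000)
quiet = struct.pack("<4h", 10, -10, 10, -10)
print(is_speech(loud), is_speech(quiet))  # → True False
```

Plugged into the receiver, frames failing this gate would skip the timer restart, so pauses filled with breathing or keyboard noise would not keep postponing the `final` command.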
---

**Status**: ✅ **READY FOR TESTING**

The system now:

- ✅ Connects to the ONNX STT server (port 8766)
- ✅ Uses CUDA GPU acceleration (cuDNN 9)
- ✅ Receives partial transcripts
- ✅ Automatically detects silence
- ✅ Sends the final command after 1.5s of silence
- ✅ Forwards the final transcript to LlamaCPP

**Test it now with `!miku listen`!**