Implemented experimental production-ready voice chat; relegated old flow to voice debug mode. New Web UI panel for Voice Chat.
New file: SILENCE_DETECTION.md
# Silence Detection Implementation

## What Was Added

Implemented automatic silence detection to trigger final transcriptions in the new ONNX-based STT system.
### Problem

The new ONNX server requires manually sending a `{"type": "final"}` command to get the complete transcription. Without this, partial transcripts would appear but never be finalized and sent to LlamaCPP.

### Solution

Added silence tracking in `voice_receiver.py`:

1. **Track audio timestamps**: Record when the last audio chunk was sent
2. **Detect silence**: Start a timer after each audio chunk
3. **Send final command**: If no new audio arrives within 1.5 seconds, send `{"type": "final"}`
4. **Cancel on new audio**: Reset the timer if more audio arrives
---

## Implementation Details

### New Attributes

```python
self.last_audio_time: Dict[int, float] = {}       # Track last audio per user
self.silence_tasks: Dict[int, asyncio.Task] = {}  # Silence detection tasks
self.silence_timeout = 1.5                        # Seconds of silence before "final"
```
### New Method

```python
async def _detect_silence(self, user_id: int):
    """
    Wait for the silence timeout, then send the 'final' command to STT.

    Called after each audio chunk; cancelled if new audio arrives first.
    """
    await asyncio.sleep(self.silence_timeout)
    stt_client = self.stt_clients.get(user_id)
    if stt_client and stt_client.is_connected():
        await stt_client.send_final()
```
### Integration

- Called after sending each audio chunk
- Cancels the previous silence task if new audio arrives
- Automatically cleaned up when stopping listening
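The cancel/restart pattern described above can be sketched as a self-contained asyncio demo. Note that `SilenceTracker`, `on_audio_chunk`, and the shortened 0.05s timeout are illustrative stand-ins for the real `voice_receiver.py` code, not its actual implementation:

```python
import asyncio

SILENCE_TIMEOUT = 0.05  # shortened from 1.5s so the demo runs quickly


class SilenceTracker:
    """Toy stand-in for the per-user silence tracking in voice_receiver.py."""

    def __init__(self):
        self.silence_tasks: dict[int, asyncio.Task] = {}
        self.finals_sent: list[int] = []  # records which users got "final"

    async def _detect_silence(self, user_id: int):
        # If this sleep completes without cancellation, silence was detected.
        await asyncio.sleep(SILENCE_TIMEOUT)
        self.finals_sent.append(user_id)  # stands in for stt_client.send_final()

    def on_audio_chunk(self, user_id: int):
        # Cancel any pending timer for this user, then start a fresh one.
        task = self.silence_tasks.get(user_id)
        if task and not task.done():
            task.cancel()
        self.silence_tasks[user_id] = asyncio.ensure_future(
            self._detect_silence(user_id)
        )


async def demo():
    tracker = SilenceTracker()
    # Three rapid chunks: each cancels the prior timer, so only the last survives.
    for _ in range(3):
        tracker.on_audio_chunk(user_id=42)
        await asyncio.sleep(0.01)  # pause shorter than SILENCE_TIMEOUT
    await asyncio.sleep(0.2)  # silence: the surviving timer fires exactly once
    return tracker.finals_sent


print(asyncio.run(demo()))  # → [42]
```

The key property is that "final" fires once per utterance, no matter how many chunks arrived, because every chunk before the last had its timer cancelled.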
---

## Testing

### Test 1: Basic Transcription

1. Join a voice channel
2. Run `!miku listen`
3. **Speak a sentence** and wait 1.5 seconds
4. **Expected**: Final transcript appears and is sent to LlamaCPP

### Test 2: Continuous Speech

1. Start listening
2. **Speak multiple sentences** with pauses < 1.5s between them
3. **Expected**: Partial transcripts update; the final is sent after the last sentence

### Test 3: Multiple Users

1. Have 2+ users in the voice channel
2. Each runs `!miku listen`
3. Both speak (taking turns or simultaneously)
4. **Expected**: Each user's speech is transcribed independently
---

## Configuration

### Silence Timeout

Default: `1.5` seconds

**To adjust**, edit `voice_receiver.py`:

```python
self.silence_timeout = 1.5  # Change this value
```

**Recommendations**:

- **Too short (< 1.0s)**: May cut off during natural pauses in speech
- **Too long (> 3.0s)**: The user waits too long for a response
- **Sweet spot**: 1.5-2.0s works well for conversational speech
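If editing the source for every tuning pass is inconvenient, the timeout could instead be read from an environment variable. This is a hypothetical extension, not something `voice_receiver.py` currently does, and `MIKU_SILENCE_TIMEOUT` is an invented name:

```python
import os

def get_silence_timeout(default: float = 1.5) -> float:
    """Hypothetical helper: read the silence timeout from the environment,
    falling back to the documented 1.5s default."""
    raw = os.environ.get("MIKU_SILENCE_TIMEOUT", "")
    try:
        value = float(raw)
    except ValueError:
        return default
    # Clamp to a sane range: < 1.0s risks mid-sentence cutoffs,
    # > 3.0s makes responses feel sluggish (see recommendations above).
    return min(max(value, 1.0), 3.0)

print(get_silence_timeout())  # → 1.5 when the variable is unset
```

The clamp mirrors the recommendations above so a typo in the variable cannot push the timeout outside the usable range.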
---

## Monitoring

### Check Logs for Silence Detection

```bash
docker logs miku-bot 2>&1 | grep "Silence detected"
```

**Expected output**:

```
[DEBUG] Silence detected for user 209381657369772032, requesting final transcript
```

### Check Final Transcripts

```bash
docker logs miku-bot 2>&1 | grep "FINAL TRANSCRIPT"
```

### Check STT Processing

```bash
docker logs miku-stt 2>&1 | grep "Final transcription"
```
---

## Debugging

### Issue: No Final Transcript

**Symptoms**: Partial transcripts appear but never finalize

**Debug steps**:

1. Check whether silence detection is triggering:

   ```bash
   docker logs miku-bot 2>&1 | grep "Silence detected"
   ```

2. Check whether the final command is being sent:

   ```bash
   docker logs miku-stt 2>&1 | grep "type.*final"
   ```

3. Increase the log level in `stt_client.py`:

   ```python
   logger.setLevel(logging.DEBUG)
   ```

### Issue: Cuts Off Mid-Sentence

**Symptoms**: The final transcript triggers during natural pauses

**Solution**: Increase the silence timeout:

```python
self.silence_timeout = 2.0  # or 2.5
```

### Issue: Too Slow to Respond

**Symptoms**: Long wait after the user stops speaking

**Solution**: Decrease the silence timeout:

```python
self.silence_timeout = 1.0  # or 1.2
```
---

## Architecture

```
Discord Voice → voice_receiver.py
        ↓
[Audio Chunk Received]
        ↓
┌─────────────────────┐
│   send_audio()      │
│   to STT server     │
└─────────────────────┘
        ↓
┌─────────────────────┐
│  Start silence      │
│  detection timer    │
│  (1.5s countdown)   │
└─────────────────────┘
        ↓
   ┌──────┴──────┐
   │             │
More audio   No more audio
arrives      for 1.5s
   │             │
   ↓             ↓
Cancel timer  ┌──────────────┐
Start new     │ send_final() │
              │   to STT     │
              └──────────────┘
                     ↓
              ┌─────────────────┐
              │ Final transcript│
              │   → LlamaCPP    │
              └─────────────────┘
```
---

## Files Modified

1. **bot/utils/voice_receiver.py**
   - Added `last_audio_time` tracking
   - Added `silence_tasks` management
   - Added `_detect_silence()` method
   - Integrated silence detection in `_send_audio_chunk()`
   - Added cleanup in `stop_listening()`

2. **bot/utils/stt_client.py** (previously)
   - Added `send_final()` method
   - Added `send_reset()` method
   - Updated protocol handler
---

## Next Steps

1. **Test thoroughly** with different speech patterns
2. **Tune the silence timeout** based on user feedback
3. **Consider VAD integration** for more accurate end-of-speech detection
4. **Add metrics** to track transcription latency
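As a starting point for item 3, end-of-speech could be gated on frame energy so the silence timer only runs after genuinely quiet audio. This is a minimal sketch, not a real VAD (a production integration would more likely use something like WebRTC VAD or Silero), and the 500.0 RMS threshold is an arbitrary illustrative value:

```python
import math
import struct

def is_speech(frame: bytes, threshold: float = 500.0) -> bool:
    """Crude energy gate for 16-bit little-endian mono PCM.

    Assumes well-formed frames (an even number of bytes). Returns True when
    the frame's RMS amplitude meets the threshold.
    """
    if len(frame) < 2:
        return False  # no complete 16-bit sample
    samples = struct.unpack(f"<{len(frame) // 2}h", frame)
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return rms >= threshold

# A loud square-wave-like frame vs. near-silence:
loud = struct.pack("<4h", 8000, -8000, 8000, -8000)
quiet = struct.pack("<4h", 10, -10, 10, -10)
print(is_speech(loud), is_speech(quiet))  # → True False
```

Plugged into the receiver, frames failing this gate would skip the timer restart, so pauses filled with breathing or keyboard noise would not keep postponing the `final` command.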
---

**Status**: ✅ **READY FOR TESTING**

The system now:

- ✅ Connects to the ONNX STT server (port 8766)
- ✅ Uses CUDA GPU acceleration (cuDNN 9)
- ✅ Receives partial transcripts
- ✅ Automatically detects silence
- ✅ Sends the final command after 1.5s of silence
- ✅ Forwards the final transcript to LlamaCPP

**Test it now with `!miku listen`!**