# Silence Detection Implementation
## What Was Added
Implemented automatic silence detection to trigger final transcriptions in the new ONNX-based STT system.
### Problem
The new ONNX server requires manually sending a `{"type": "final"}` command to get the complete transcription. Without this, partial transcripts would appear but never be finalized and sent to LlamaCPP.
### Solution
Added silence tracking in `voice_receiver.py`:
1. **Track audio timestamps**: Record when the last audio chunk was sent
2. **Detect silence**: Start a timer after each audio chunk
3. **Send final command**: If no new audio arrives within 1.5 seconds, send `{"type": "final"}`
4. **Cancel on new audio**: Reset the timer if more audio arrives
---
## Implementation Details
### New Attributes
```python
self.last_audio_time: Dict[int, float] = {}       # Track last audio per user
self.silence_tasks: Dict[int, asyncio.Task] = {}  # Silence detection tasks
self.silence_timeout = 1.5                        # Seconds of silence before "final"
```
### New Method
```python
async def _detect_silence(self, user_id: int):
    """
    Wait for the silence timeout, then send the 'final' command to STT.
    Scheduled after each audio chunk; cancelled if more audio arrives.
    """
    await asyncio.sleep(self.silence_timeout)
    stt_client = self.stt_clients.get(user_id)
    if stt_client and stt_client.is_connected():
        await stt_client.send_final()
### Integration
- Called after sending each audio chunk
- Cancels previous silence task if new audio arrives
- Automatically cleaned up when stopping listening
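The cancel-and-restart flow above can be sketched as a small self-contained class. The names mirror `voice_receiver.py`, but `SilenceTracker`, `on_audio_chunk()`, and the `finals` list (a stand-in for `stt_client.send_final()`) are hypothetical illustrations, not the actual implementation:

```python
import asyncio
import time
from typing import Dict

class SilenceTracker:
    """Minimal sketch of the silence-detection flow (hypothetical class)."""

    def __init__(self, timeout: float = 1.5):
        self.silence_timeout = timeout
        self.last_audio_time: Dict[int, float] = {}
        self.silence_tasks: Dict[int, asyncio.Task] = {}
        self.finals = []  # user_ids we sent "final" for (stand-in for STT call)

    async def on_audio_chunk(self, user_id: int) -> None:
        # Record when this user's audio last arrived.
        self.last_audio_time[user_id] = time.monotonic()
        # New audio resets the countdown: cancel any pending silence timer.
        task = self.silence_tasks.get(user_id)
        if task and not task.done():
            task.cancel()
        self.silence_tasks[user_id] = asyncio.create_task(
            self._detect_silence(user_id)
        )

    async def _detect_silence(self, user_id: int) -> None:
        # If this sleep completes without cancellation, the user went silent.
        await asyncio.sleep(self.silence_timeout)
        self.finals.append(user_id)  # stand-in for stt_client.send_final()

async def main():
    tracker = SilenceTracker(timeout=0.1)
    await tracker.on_audio_chunk(1)
    await asyncio.sleep(0.05)        # pause shorter than the timeout
    await tracker.on_audio_chunk(1)  # resets the timer, so no final yet
    assert tracker.finals == []
    await asyncio.sleep(0.2)         # silence longer than the timeout
    assert tracker.finals == [1]     # exactly one final per utterance

asyncio.run(main())
```

Because each new chunk cancels the pending task, `send_final()` fires exactly once per utterance, no matter how many chunks preceded the pause.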
---
## Testing
### Test 1: Basic Transcription
1. Join voice channel
2. Run `!miku listen`
3. **Speak a sentence** and wait 1.5 seconds
4. **Expected**: Final transcript appears and is sent to LlamaCPP
### Test 2: Continuous Speech
1. Start listening
2. **Speak multiple sentences** with pauses < 1.5s between them
3. **Expected**: Partial transcripts update, final sent after last sentence
### Test 3: Multiple Users
1. Have 2+ users in voice channel
2. Each runs `!miku listen`
3. Both speak (taking turns or simultaneously)
4. **Expected**: Each user's speech is transcribed independently
---
## Configuration
### Silence Timeout
Default: `1.5` seconds
**To adjust**, edit `voice_receiver.py`:
```python
self.silence_timeout = 1.5 # Change this value
```
**Recommendations**:
- **Too short (< 1.0s)**: May cut off during natural pauses in speech
- **Too long (> 3.0s)**: User waits too long for response
- **Sweet spot**: 1.5-2.0s works well for conversational speech
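If hand-editing the file becomes tedious, the timeout could instead be read from the environment at startup. This is a sketch only: the `SILENCE_TIMEOUT` variable name and the `load_silence_timeout()` helper are assumptions, not part of the current code. The clamp enforces the recommended 1.0-3.0s range from above:

```python
import os

def load_silence_timeout(default: float = 1.5) -> float:
    """Read the silence timeout from an env var (hypothetical name),
    clamped to the recommended 1.0-3.0 second range."""
    value = float(os.getenv("SILENCE_TIMEOUT", str(default)))
    return min(max(value, 1.0), 3.0)
```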
---
## Monitoring
### Check Logs for Silence Detection
```bash
docker logs miku-bot 2>&1 | grep "Silence detected"
```
**Expected output**:
```
[DEBUG] Silence detected for user 209381657369772032, requesting final transcript
```
### Check Final Transcripts
```bash
docker logs miku-bot 2>&1 | grep "FINAL TRANSCRIPT"
```
### Check STT Processing
```bash
docker logs miku-stt 2>&1 | grep "Final transcription"
```
---
## Debugging
### Issue: No Final Transcript
**Symptoms**: Partial transcripts appear but never finalize
**Debug steps**:
1. Check if silence detection is triggering:
```bash
docker logs miku-bot 2>&1 | grep "Silence detected"
```
2. Check if final command is being sent:
```bash
docker logs miku-stt 2>&1 | grep "type.*final"
```
3. Enable debug logging in `stt_client.py`:
```python
logger.setLevel(logging.DEBUG)
```
### Issue: Cuts Off Mid-Sentence
**Symptoms**: Final transcript triggers during natural pauses
**Solution**: Increase silence timeout:
```python
self.silence_timeout = 2.0 # or 2.5
```
### Issue: Too Slow to Respond
**Symptoms**: Long wait after user stops speaking
**Solution**: Decrease silence timeout:
```python
self.silence_timeout = 1.0 # or 1.2
```
---
## Architecture
```
Discord Voice → voice_receiver.py
             │
             ▼
   [Audio Chunk Received]
             │
             ▼
  ┌─────────────────────┐
  │     send_audio()    │
  │    to STT server    │
  └─────────────────────┘
             │
             ▼
  ┌─────────────────────┐
  │    Start silence    │
  │   detection timer   │
  │   (1.5s countdown)  │
  └─────────────────────┘
             │
      ┌──────┴──────┐
      │             │
 More audio    No more audio
  arrives        for 1.5s
      │             │
      ▼             ▼
Cancel timer,  ┌──────────────┐
start new one  │ send_final() │
               │   to STT     │
               └──────────────┘
                      │
                      ▼
             ┌─────────────────┐
             │ Final transcript│
             │    → LlamaCPP   │
             └─────────────────┘
```
---
## Files Modified
1. **bot/utils/voice_receiver.py**
- Added `last_audio_time` tracking
- Added `silence_tasks` management
- Added `_detect_silence()` method
- Integrated silence detection in `_send_audio_chunk()`
- Added cleanup in `stop_listening()`
2. **bot/utils/stt_client.py** (modified in an earlier change)
- Added `send_final()` method
- Added `send_reset()` method
- Updated protocol handler
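The `send_final()` and `send_reset()` methods presumably serialize the control messages described in the Problem section as JSON over the websocket. A sketch of that control-message half of the client, assuming a `websockets`-style async `send()` interface (the class shape here is an illustration, not the actual `stt_client.py`):

```python
import json

class STTClient:
    """Sketch of the control-message half of the STT client (assumed shape)."""

    def __init__(self, websocket):
        self.ws = websocket

    def is_connected(self) -> bool:
        return self.ws is not None

    async def send_final(self) -> None:
        # Ask the ONNX server to finalize the current utterance.
        await self.ws.send(json.dumps({"type": "final"}))

    async def send_reset(self) -> None:
        # Clear server-side transcription state between utterances.
        await self.ws.send(json.dumps({"type": "reset"}))
```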
---
## Next Steps
1. **Test thoroughly** with different speech patterns
2. **Tune silence timeout** based on user feedback
3. **Consider VAD integration** for more accurate speech end detection
4. **Add metrics** to track transcription latency
---
**Status**: ✅ **READY FOR TESTING**
The system now:
- ✅ Connects to ONNX STT server (port 8766)
- ✅ Uses CUDA GPU acceleration (cuDNN 9)
- ✅ Receives partial transcripts
- ✅ Automatically detects silence
- ✅ Sends final command after 1.5s silence
- ✅ Forwards final transcript to LlamaCPP
**Test it now with `!miku listen`!**