# Silence Detection Implementation

## What Was Added

Implemented automatic silence detection to trigger final transcriptions in the new ONNX-based STT system.
### Problem

The new ONNX server only produces a complete transcription after receiving an explicit `{"type": "final"}` command. Without it, partial transcripts appear but are never finalized and sent to LlamaCPP.
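The finalize handshake itself is tiny. A minimal sketch of sending the command (the `send_final` helper and its callback argument are illustrative, not the actual `stt_client.py` signature):

```python
import asyncio
import json
from typing import Awaitable, Callable

# Hypothetical helper: the real stt_client.py API may differ.
async def send_final(send: Callable[[str], Awaitable[None]]) -> None:
    """Ask the STT server to finalize the current utterance."""
    await send(json.dumps({"type": "final"}))

# Usage with a stand-in for a websocket's send():
async def main() -> list:
    sent = []

    async def fake_send(msg: str) -> None:
        sent.append(msg)

    await send_final(fake_send)
    return sent

messages = asyncio.run(main())
print(messages)  # ['{"type": "final"}']
```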
### Solution

Added silence tracking in `voice_receiver.py`:

1. **Track audio timestamps**: Record when the last audio chunk was sent
2. **Detect silence**: Start a timer after each audio chunk
3. **Send final command**: If no new audio arrives within 1.5 seconds, send `{"type": "final"}`
4. **Cancel on new audio**: Reset the timer if more audio arrives
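The four steps above can be sketched as a small self-contained asyncio demo (class and method names are illustrative, not the real `voice_receiver.py` API):

```python
import asyncio
import time

class SilenceDetector:
    def __init__(self, timeout: float, on_silence):
        self.timeout = timeout
        self.on_silence = on_silence    # fires after `timeout` s of no audio
        self.last_audio_time = 0.0      # step 1: track timestamps
        self._task = None

    def audio_chunk_received(self):
        self.last_audio_time = time.monotonic()
        if self._task is not None:
            self._task.cancel()         # step 4: reset timer on new audio
        self._task = asyncio.create_task(self._detect_silence())

    async def _detect_silence(self):    # step 2: one timer per chunk
        await asyncio.sleep(self.timeout)
        self.on_silence()               # step 3: request the final transcript

async def main():
    events = []
    det = SilenceDetector(timeout=0.05, on_silence=lambda: events.append("final"))
    for _ in range(3):                  # chunks arriving faster than the timeout
        det.audio_chunk_received()
        await asyncio.sleep(0.01)
    await asyncio.sleep(0.1)            # silence: the last timer fires once
    return events

events = asyncio.run(main())
print(events)  # ['final']
```

Only the last chunk's timer survives; the earlier two are cancelled before they fire.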
---

## Implementation Details

### New Attributes

```python
self.last_audio_time: Dict[int, float] = {}       # Track last audio per user
self.silence_tasks: Dict[int, asyncio.Task] = {}  # Silence detection tasks
self.silence_timeout = 1.5                        # Seconds of silence before "final"
```

### New Method

```python
async def _detect_silence(self, user_id: int):
    """
    Wait for silence timeout and send 'final' command to STT.
    Called after each audio chunk.
    """
    await asyncio.sleep(self.silence_timeout)
    stt_client = self.stt_clients.get(user_id)
    if stt_client and stt_client.is_connected():
        logger.debug(f"Silence detected for user {user_id}, requesting final transcript")
        await stt_client.send_final()
```

### Integration

- Called after sending each audio chunk
- Cancels the previous silence task if new audio arrives
- Automatically cleaned up when stopping listening
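The cleanup side can be sketched like this (attribute names follow the ones listed above, but the `stop_listening()` body is a hypothetical reconstruction, not the actual code):

```python
import asyncio

class VoiceReceiverCleanup:
    def __init__(self):
        self.last_audio_time = {}
        self.silence_tasks = {}

    def stop_listening(self, user_id: int):
        # Cancel any pending silence timer so no stray "final" is sent
        # after the user stops listening, then drop per-user state.
        task = self.silence_tasks.pop(user_id, None)
        if task is not None and not task.done():
            task.cancel()
        self.last_audio_time.pop(user_id, None)

async def main():
    recv = VoiceReceiverCleanup()
    recv.silence_tasks[42] = asyncio.create_task(asyncio.sleep(10))
    recv.last_audio_time[42] = 0.0
    recv.stop_listening(42)
    await asyncio.sleep(0)  # let the cancellation propagate
    return recv.silence_tasks, recv.last_audio_time

tasks, times = asyncio.run(main())
print(tasks, times)  # {} {}
```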

---

## Testing

### Test 1: Basic Transcription

1. Join a voice channel
2. Run `!miku listen`
3. **Speak a sentence** and wait 1.5 seconds
4. **Expected**: Final transcript appears and is sent to LlamaCPP

### Test 2: Continuous Speech

1. Start listening
2. **Speak multiple sentences** with pauses < 1.5s between them
3. **Expected**: Partial transcripts update; a final is sent only after the last sentence

### Test 3: Multiple Users

1. Have 2+ users in the voice channel
2. Each runs `!miku listen`
3. Both speak (taking turns or simultaneously)
4. **Expected**: Each user's speech is transcribed independently
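The independence in Test 3 comes from keying every timer by user ID, so one user's restarts never touch another's countdown. A toy check (IDs and timeouts are illustrative):

```python
import asyncio

async def main():
    finals = []

    def make_timer(user_id, timeout=0.05):
        async def timer():
            await asyncio.sleep(timeout)
            finals.append(user_id)
        return asyncio.create_task(timer())

    silence_tasks = {}
    silence_tasks[111] = make_timer(111)  # user 111 goes silent immediately
    silence_tasks[222] = make_timer(222)
    silence_tasks[222].cancel()           # user 222 keeps talking...
    silence_tasks[222] = make_timer(222)  # ...so their timer restarts

    await asyncio.sleep(0.1)              # both timers have fired by now
    return sorted(finals)

finals = asyncio.run(main())
print(finals)  # [111, 222]
```

Restarting user 222's timer did not disturb user 111's; each fired exactly once.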

---

## Configuration

### Silence Timeout

Default: `1.5` seconds

**To adjust**, edit `voice_receiver.py`:

```python
self.silence_timeout = 1.5  # Change this value
```

**Recommendations**:

- **Too short (< 1.0s)**: May cut off during natural pauses in speech
- **Too long (> 3.0s)**: User waits too long for a response
- **Sweet spot**: 1.5-2.0s works well for conversational speech
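If editing the source for every tuning pass is inconvenient, one possible extension (not currently implemented; the `STT_SILENCE_TIMEOUT` variable name is made up) is an environment override with the same 1.5s default:

```python
import os

# Hypothetical: STT_SILENCE_TIMEOUT is not an existing variable in this
# project; this only sketches an environment override pattern.
def load_silence_timeout(default: float = 1.5) -> float:
    raw = os.environ.get("STT_SILENCE_TIMEOUT")
    if raw is None:
        return default
    try:
        value = float(raw)
    except ValueError:
        return default          # ignore malformed values
    return value if value > 0 else default

os.environ["STT_SILENCE_TIMEOUT"] = "2.0"
timeout = load_silence_timeout()
print(timeout)  # 2.0
```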

---

## Monitoring

### Check Logs for Silence Detection

```bash
docker logs miku-bot 2>&1 | grep "Silence detected"
```

**Expected output**:

```
[DEBUG] Silence detected for user 209381657369772032, requesting final transcript
```

### Check Final Transcripts

```bash
docker logs miku-bot 2>&1 | grep "FINAL TRANSCRIPT"
```

### Check STT Processing

```bash
docker logs miku-stt 2>&1 | grep "Final transcription"
```

---

## Debugging

### Issue: No Final Transcript

**Symptoms**: Partial transcripts appear but never finalize

**Debug steps**:

1. Check if silence detection is triggering:

   ```bash
   docker logs miku-bot 2>&1 | grep "Silence detected"
   ```

2. Check if the final command is being sent:

   ```bash
   docker logs miku-stt 2>&1 | grep "type.*final"
   ```

3. Increase the log level in `stt_client.py`:

   ```python
   logger.setLevel(logging.DEBUG)
   ```
### Issue: Cuts Off Mid-Sentence

**Symptoms**: Final transcript triggers during natural pauses

**Solution**: Increase the silence timeout:

```python
self.silence_timeout = 2.0  # or 2.5
```

### Issue: Too Slow to Respond

**Symptoms**: Long wait after the user stops speaking

**Solution**: Decrease the silence timeout:

```python
self.silence_timeout = 1.0  # or 1.2
```

---

## Architecture

```
Discord Voice → voice_receiver.py
        ↓
[Audio Chunk Received]
        ↓
┌─────────────────────┐
│    send_audio()     │
│    to STT server    │
└─────────────────────┘
        ↓
┌─────────────────────┐
│   Start silence     │
│   detection timer   │
│   (1.5s countdown)  │
└─────────────────────┘
        ↓
   ┌────┴────────┐
   │             │
More audio    No more audio
arrives       for 1.5s
   │             │
   ↓             ↓
Cancel timer  ┌──────────────┐
Start new     │ send_final() │
              │   to STT     │
              └──────────────┘
                     ↓
          ┌─────────────────┐
          │ Final transcript│
          │   → LlamaCPP    │
          └─────────────────┘
```

---

## Files Modified

1. **bot/utils/voice_receiver.py**
   - Added `last_audio_time` tracking
   - Added `silence_tasks` management
   - Added `_detect_silence()` method
   - Integrated silence detection in `_send_audio_chunk()`
   - Added cleanup in `stop_listening()`

2. **bot/utils/stt_client.py** (previously)
   - Added `send_final()` method
   - Added `send_reset()` method
   - Updated protocol handler

---

## Next Steps

1. **Test thoroughly** with different speech patterns
2. **Tune the silence timeout** based on user feedback
3. **Consider VAD integration** for more accurate speech end detection
4. **Add metrics** to track transcription latency

---

**Status**: ✅ **READY FOR TESTING**

The system now:

- ✅ Connects to the ONNX STT server (port 8766)
- ✅ Uses CUDA GPU acceleration (cuDNN 9)
- ✅ Receives partial transcripts
- ✅ Automatically detects silence
- ✅ Sends the final command after 1.5s of silence
- ✅ Forwards the final transcript to LlamaCPP

**Test it now with `!miku listen`!**