Phase 4 STT pipeline implemented — Silero VAD + faster-whisper — still not working well at all
stt/README.md (new file, 152 lines)
@@ -0,0 +1,152 @@
# Miku STT (Speech-to-Text) Server

Real-time speech-to-text service for Miku voice chat using Silero VAD (CPU) and Faster-Whisper (GPU).

## Architecture

- **Silero VAD** (CPU): Lightweight voice activity detection, runs continuously
- **Faster-Whisper** (GPU, GTX 1660): Efficient speech transcription using CTranslate2
- **FastAPI WebSocket**: Real-time bidirectional communication

## Features

- ✅ Real-time voice activity detection with conservative settings
- ✅ Streaming partial transcripts during speech
- ✅ Final transcript on speech completion
- ✅ Interruption detection (user speaking over Miku)
- ✅ Multi-user support with isolated sessions
- ✅ KV cache optimization ready (partial transcripts are surfaced early so the LLM can precompute its KV cache)

## API Endpoints

### WebSocket: `/ws/stt/{user_id}`

Real-time STT session for a specific user.

**Client sends:** Raw PCM audio (int16, 16kHz mono, 20ms chunks = 320 samples)

**Server sends:** JSON events:
```json
// VAD events
{"type": "vad", "event": "speech_start", "speaking": true, "probability": 0.85, "timestamp": 1250.5}
{"type": "vad", "event": "speaking", "speaking": true, "probability": 0.92, "timestamp": 1270.5}
{"type": "vad", "event": "speech_end", "speaking": false, "probability": 0.35, "timestamp": 3500.0}

// Transcription events
{"type": "partial", "text": "Hello how are", "user_id": "123", "timestamp": 2000.0}
{"type": "final", "text": "Hello how are you?", "user_id": "123", "timestamp": 3500.0}

// Interruption detection
{"type": "interruption", "probability": 0.92, "timestamp": 1500.0}
```
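
A minimal client-side dispatcher over these event types might look like the sketch below; the handler bodies are illustrative, not part of the server contract:

```python
import json

def handle_event(raw: str) -> None:
    """Illustrative dispatch over the event types documented above."""
    event = json.loads(raw)
    kind = event["type"]
    if kind == "vad":
        print(f"VAD {event['event']} (p={event['probability']:.2f})")
    elif kind == "partial":
        # Partial text can be forwarded to the LLM for KV cache precomputation
        print(f"partial: {event['text']!r}")
    elif kind == "final":
        print(f"final: {event['text']!r}")
    elif kind == "interruption":
        # User is speaking over Miku; the caller would typically stop TTS here
        print(f"interruption (p={event['probability']:.2f})")
```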

### HTTP GET: `/health`

Health check with model status.

**Response:**
```json
{
  "status": "healthy",
  "models": {
    "vad": {"loaded": true, "device": "cpu"},
    "whisper": {"loaded": true, "model": "small", "device": "cuda"}
  },
  "sessions": {
    "active": 2,
    "users": ["user123", "user456"]
  }
}
```
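
Clients can poll this endpoint to wait for model load before streaming. A minimal sketch, assuming the `requests` package; the field names match the response above:

```python
import time
import requests

def wait_until_ready(url: str = "http://localhost:8001/health", timeout: float = 60.0) -> None:
    """Block until both models report loaded, or raise after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            models = requests.get(url, timeout=2).json()["models"]
            if models["vad"]["loaded"] and models["whisper"]["loaded"]:
                return
        except requests.RequestException:
            pass  # server not up yet
        time.sleep(1.0)
    raise TimeoutError("STT server did not become healthy in time")
```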

## Configuration

### VAD Parameters (Conservative)

- **Threshold**: 0.5 (speech probability)
- **Min speech duration**: 250ms (avoid false triggers)
- **Min silence duration**: 500ms (don't cut off mid-sentence)
- **Speech padding**: 30ms (context around speech)
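
For orientation, these settings map onto the silero-vad API roughly as follows. This is a sketch, not the service's actual wiring; note that the streaming `VADIterator` does not expose a min-speech-duration knob, so that check is assumed to live in the session logic:

```python
import torch

# Load Silero VAD and its helper utilities from torch.hub (runs on CPU)
model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
get_speech_timestamps, _, read_audio, VADIterator, _ = utils

# Streaming use: per-chunk speech/no-speech decisions
vad = VADIterator(
    model,
    threshold=0.5,                # speech probability threshold
    sampling_rate=16000,
    min_silence_duration_ms=500,  # don't cut off mid-sentence
    speech_pad_ms=30,             # context around speech
)

# Offline equivalent over a whole clip, where min_speech_duration_ms exists
wav = read_audio("sample.wav", sampling_rate=16000)
print(get_speech_timestamps(
    wav, model,
    threshold=0.5,
    min_speech_duration_ms=250,   # avoid false triggers
    min_silence_duration_ms=500,
    speech_pad_ms=30,
))
```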

### Whisper Parameters

- **Model**: small (balanced speed/quality, ~500MB VRAM)
- **Compute**: float16 (GPU optimization)
- **Language**: en (English)
- **Beam size**: 5 (quality/speed balance)
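
In faster-whisper terms these parameters translate to roughly the following sketch; `speech.wav` is a stand-in for the buffered utterance:

```python
from faster_whisper import WhisperModel

# Load once at startup: small model, float16 on the GTX 1660
model = WhisperModel("small", device="cuda", compute_type="float16")

# Transcribe a buffered utterance; segments are yielded lazily
segments, info = model.transcribe("speech.wav", language="en", beam_size=5)
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
```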

## Usage Example

```python
import asyncio
import json

import numpy as np
import websockets


def audio_stream():
    """Placeholder audio source: one second of silence in 20ms chunks.

    Swap in real 16kHz mono int16 capture (e.g. from a microphone) here.
    """
    for _ in range(50):
        yield np.zeros(320, dtype=np.int16)


async def stream_audio():
    uri = "ws://localhost:8001/ws/stt/user123"

    async with websockets.connect(uri) as websocket:
        # Wait for ready
        ready = await websocket.recv()
        print(ready)

        # Stream audio chunks (16kHz, 20ms chunks = 320 samples)
        for audio_chunk in audio_stream():
            # Convert to bytes (int16)
            audio_bytes = audio_chunk.astype(np.int16).tobytes()
            await websocket.send(audio_bytes)

            # Drain any pending events without blocking the send loop
            try:
                while True:
                    event = await asyncio.wait_for(websocket.recv(), timeout=0.001)
                    print(json.loads(event))
            except asyncio.TimeoutError:
                pass

asyncio.run(stream_audio())
```

## Docker Setup

### Build
```bash
docker-compose build miku-stt
```

### Run
```bash
docker-compose up -d miku-stt
```

### Logs
```bash
docker-compose logs -f miku-stt
```

### Test
```bash
curl http://localhost:8001/health
```

## GPU Sharing with Soprano

Both STT (Whisper) and TTS (Soprano) run on the GTX 1660, but at different times:

1. **User speaking** → Whisper active, Soprano idle
2. **LLM processing** → both idle
3. **Miku speaking** → Soprano active, Whisper idle (VAD monitoring only)

Interruption detection runs VAD continuously, but on the CPU, so it never contends for the GPU.
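
A rough sketch of the turn-taking gate this implies; all names here are illustrative, not the actual session code:

```python
def route_chunk(session, chunk, miku_speaking: bool) -> None:
    """Illustrative per-chunk routing: VAD always runs, Whisper only when Miku is idle."""
    prob = session.vad_probability(chunk)  # Silero VAD, CPU, every chunk
    if miku_speaking:
        # Keep the GPU free for Soprano; only watch for barge-in
        if prob > 0.8:
            session.emit({"type": "interruption", "probability": prob})
        return
    session.feed_whisper(chunk)  # GPU transcription path
```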

## Performance

- **VAD latency**: 10-20ms per chunk (CPU)
- **Whisper latency**: ~1-2s for 2s of audio (GPU)
- **Memory usage**:
  - Silero VAD: ~100MB (CPU RAM)
  - Faster-Whisper small: ~500MB (GPU VRAM)

## Future Improvements

- [ ] Multi-language support (auto-detect)
- [ ] Word-level timestamps for better sync
- [ ] Custom vocabulary/prompt tuning
- [ ] Speaker diarization (multiple speakers)
- [ ] Noise suppression preprocessing