# Miku STT (Speech-to-Text) Server

Real-time speech-to-text service for Miku voice chat using Silero VAD (CPU) and Faster-Whisper (GPU).

## Architecture

- **Silero VAD** (CPU): Lightweight voice activity detection, runs continuously
- **Faster-Whisper** (GPU, GTX 1660): Efficient speech transcription using CTranslate2
- **FastAPI WebSocket**: Real-time bidirectional communication

## Features

- ✅ Real-time voice activity detection with conservative settings
- ✅ Streaming partial transcripts during speech
- ✅ Final transcript on speech completion
- ✅ Interruption detection (user speaking over Miku)
- ✅ Multi-user support with isolated sessions
- ✅ KV cache optimization ready (partial text for LLM precomputation)

## API Endpoints

### WebSocket: `/ws/stt/{user_id}`

Real-time STT session for a specific user.

**Client sends:** Raw PCM audio (int16, 16kHz mono, 20ms chunks = 320 samples)
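
The chunk size follows from the format: at 16kHz, a 20ms chunk is 16000 × 0.02 = 320 samples, and int16 doubles that in bytes. A minimal sketch of packing one chunk with the stdlib `array` module (a real client would fill the buffer from microphone capture rather than silence):

```python
import array

SAMPLE_RATE = 16_000  # Hz
CHUNK_MS = 20         # chunk duration in milliseconds

# 16000 samples/s * 0.02 s = 320 samples per chunk
SAMPLES_PER_CHUNK = SAMPLE_RATE * CHUNK_MS // 1000

# "h" = signed 16-bit integers, matching the int16 PCM format
chunk = array.array("h", [0] * SAMPLES_PER_CHUNK)
payload = chunk.tobytes()  # 320 samples * 2 bytes = 640 bytes per message
print(len(payload))
```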

**Server sends:** JSON events:

```json
// VAD events
{"type": "vad", "event": "speech_start", "speaking": true, "probability": 0.85, "timestamp": 1250.5}
{"type": "vad", "event": "speaking", "speaking": true, "probability": 0.92, "timestamp": 1270.5}
{"type": "vad", "event": "speech_end", "speaking": false, "probability": 0.35, "timestamp": 3500.0}

// Transcription events
{"type": "partial", "text": "Hello how are", "user_id": "123", "timestamp": 2000.0}
{"type": "final", "text": "Hello how are you?", "user_id": "123", "timestamp": 3500.0}

// Interruption detection
{"type": "interruption", "probability": 0.92, "timestamp": 1500.0}
```
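
Each line above is a separate JSON message (the `//` comments are annotations, not part of the payload). A client might dispatch on the `type` field like this (`handle_event` is a hypothetical helper, not part of the server API):

```python
import json

def handle_event(message: str) -> str:
    """Route one server event by its 'type' field; returns a summary string."""
    event = json.loads(message)
    kind = event["type"]
    if kind == "vad":
        return f"vad:{event['event']} p={event['probability']}"
    elif kind == "partial":
        return f"partial: {event['text']}"
    elif kind == "final":
        return f"final: {event['text']}"
    elif kind == "interruption":
        return f"interruption p={event['probability']}"
    return f"unknown event type: {kind}"

print(handle_event('{"type": "partial", "text": "Hello how are", "user_id": "123", "timestamp": 2000.0}'))
```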

### HTTP GET: `/health`

Health check with model status.

**Response:**

```json
{
  "status": "healthy",
  "models": {
    "vad": {"loaded": true, "device": "cpu"},
    "whisper": {"loaded": true, "model": "small", "device": "cuda"}
  },
  "sessions": {
    "active": 2,
    "users": ["user123", "user456"]
  }
}
```

## Configuration

### VAD Parameters (Conservative)

- **Threshold**: 0.5 (speech probability)
- **Min speech duration**: 250ms (avoid false triggers)
- **Min silence duration**: 500ms (don't cut off mid-sentence)
- **Speech padding**: 30ms (context around speech)
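
These durations amount to debouncing the raw per-chunk probabilities. A minimal sketch of that state machine in pure Python (not the Silero API itself; a real session would feed it Silero's per-chunk probability every 20ms):

```python
THRESHOLD = 0.5                  # speech probability cutoff
MIN_SPEECH_CHUNKS = 250 // 20    # 250ms -> 12 consecutive chunks to confirm speech
MIN_SILENCE_CHUNKS = 500 // 20   # 500ms -> 25 consecutive chunks to confirm end

class VadGate:
    """Debounces per-chunk probabilities into speech_start/speech_end events."""

    def __init__(self):
        self.speaking = False
        self.run = 0  # consecutive chunks above (or below) the threshold

    def update(self, probability: float):
        """Process one 20ms chunk; returns 'speech_start', 'speech_end', or None."""
        voiced = probability >= THRESHOLD
        if not self.speaking:
            self.run = self.run + 1 if voiced else 0
            if self.run >= MIN_SPEECH_CHUNKS:
                self.speaking, self.run = True, 0
                return "speech_start"
        else:
            self.run = self.run + 1 if not voiced else 0
            if self.run >= MIN_SILENCE_CHUNKS:
                self.speaking, self.run = False, 0
                return "speech_end"
        return None
```

A brief burst of noise shorter than 250ms never reaches `MIN_SPEECH_CHUNKS`, and a mid-sentence pause shorter than 500ms never reaches `MIN_SILENCE_CHUNKS`, which is exactly what "conservative" means here.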

### Whisper Parameters

- **Model**: small (balanced speed/quality, ~500MB VRAM)
- **Compute**: float16 (GPU optimization)
- **Language**: en (English)
- **Beam size**: 5 (quality/speed balance)
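
With faster-whisper, these map onto the model constructor and the transcribe call roughly as follows (a sketch, not the server's actual code; the GPU call is shown commented out so the snippet stays self-contained):

```python
# Model/device settings (WhisperModel constructor arguments)
MODEL_OPTS = {"device": "cuda", "compute_type": "float16"}

# Per-request decoding settings (transcribe arguments)
TRANSCRIBE_OPTS = {"language": "en", "beam_size": 5}

# With faster-whisper installed and a GPU available, this would be roughly:
#   from faster_whisper import WhisperModel
#   model = WhisperModel("small", **MODEL_OPTS)
#   segments, info = model.transcribe(audio, **TRANSCRIBE_OPTS)
#   text = "".join(segment.text for segment in segments)
print(MODEL_OPTS, TRANSCRIBE_OPTS)
```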

## Usage Example

```python
import asyncio

import numpy as np
import websockets


async def stream_audio():
    uri = "ws://localhost:8001/ws/stt/user123"

    async with websockets.connect(uri) as websocket:
        # Wait for the server's ready message
        ready = await websocket.recv()
        print(ready)

        async def receive_events():
            # Print VAD and transcript events as they arrive
            async for event in websocket:
                print(event)

        # Receive concurrently so sending never blocks on the server
        receiver = asyncio.create_task(receive_events())

        # Stream audio chunks (16kHz mono int16, 20ms chunks = 320 samples);
        # audio_stream is your audio source (e.g. microphone capture)
        for audio_chunk in audio_stream:
            audio_bytes = audio_chunk.astype(np.int16).tobytes()
            await websocket.send(audio_bytes)
            await asyncio.sleep(0.02)  # pace chunks at real time

        # Give the server a moment to deliver the final transcript
        await asyncio.sleep(2.0)
        receiver.cancel()


asyncio.run(stream_audio())
```

## Docker Setup

### Build

```bash
docker-compose build miku-stt
```

### Run

```bash
docker-compose up -d miku-stt
```

### Logs

```bash
docker-compose logs -f miku-stt
```

### Test

```bash
curl http://localhost:8001/health
```

## GPU Sharing with Soprano

Both STT (Whisper) and TTS (Soprano) run on the GTX 1660, but at different times:

1. **User speaking** → Whisper active, Soprano idle
2. **LLM processing** → both idle
3. **Miku speaking** → Soprano active, Whisper idle (VAD monitoring only)

Interruption detection runs VAD continuously, but VAD is CPU-only and never touches the GPU.
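
One simple way to enforce this time-multiplexing (an illustrative pattern, not necessarily how the services coordinate today) is a shared lock around all GPU work, with VAD kept outside it:

```python
import asyncio

gpu_lock = asyncio.Lock()  # only one model touches the GTX 1660 at a time


async def transcribe(audio: bytes) -> str:
    async with gpu_lock:
        # Whisper inference would run here (stubbed for illustration)
        return "transcript"


async def synthesize(text: str) -> bytes:
    async with gpu_lock:
        # Soprano inference would run here (stubbed for illustration)
        return b"audio"


async def main():
    # VAD stays outside the lock: it is CPU-only and always on
    text = await transcribe(b"\x00" * 640)
    audio = await synthesize(text)
    return text, audio


result = asyncio.run(main())
print(result)
```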

## Performance

- **VAD latency**: 10-20ms per chunk (CPU)
- **Whisper latency**: ~1-2s for 2s of audio (GPU)
- **Memory usage**:
  - Silero VAD: ~100MB (CPU)
  - Faster-Whisper small: ~500MB (GPU VRAM)

## Future Improvements

- [ ] Multi-language support (auto-detect)
- [ ] Word-level timestamps for better sync
- [ ] Custom vocabulary/prompt tuning
- [ ] Speaker diarization (multiple speakers)
- [ ] Noise suppression preprocessing