Changed STT to Parakeet; still experimental, though performance seems to be better

stt/PARAKEET_MIGRATION.md (new file, 114 lines)

# NVIDIA Parakeet Migration

## Summary

Replaced Faster-Whisper with NVIDIA Parakeet TDT (Token-and-Duration Transducer) for real-time speech transcription.

## Changes Made

### 1. New Transcriber: `parakeet_transcriber.py`

- **Model**: `nvidia/parakeet-tdt-0.6b-v3` (600M parameters)
- **Features**:
  - Real-time streaming transcription
  - Word-level timestamps for LLM pre-computation
  - GPU-accelerated (CUDA)
  - Lower latency than Faster-Whisper
  - Native PyTorch (no CTranslate2 dependency)
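
A minimal sketch of what the wrapper might look like. NVIDIA documents this checkpoint as loading through NeMo's `ASRModel.from_pretrained`, so that route is assumed here (NeMo itself is not in the requirements list below); the class shape and method names are illustrative, not the file's actual contents:

```python
# Illustrative sketch only; the real parakeet_transcriber.py may differ.
import torch
import nemo.collections.asr as nemo_asr


class ParakeetTranscriber:
    """Wraps nvidia/parakeet-tdt-0.6b-v3 for word-timestamped transcription."""

    def __init__(self, model_name: str = "nvidia/parakeet-tdt-0.6b-v3"):
        # Downloads the checkpoint on first use, then loads from cache.
        self.model = nemo_asr.models.ASRModel.from_pretrained(model_name)
        if torch.cuda.is_available():
            self.model = self.model.cuda()
        self.model.eval()

    def transcribe_chunk(self, wav_path: str) -> dict:
        # timestamps=True asks NeMo for per-word offsets alongside the text.
        hyp = self.model.transcribe([wav_path], timestamps=True)[0]
        words = [
            {"word": w["word"], "start_time": w["start"], "end_time": w["end"]}
            for w in hyp.timestamp["word"]
        ]
        return {"text": hyp.text, "words": words}
```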

### 2. Requirements Updated

**Removed**:
- `faster-whisper==1.2.1`
- `ctranslate2==4.5.0`

**Added**:
- `transformers==4.47.1` - HuggingFace model loading
- `accelerate==1.2.1` - GPU optimization
- `sentencepiece==0.2.0` - Tokenization

**Kept**:
- `torch==2.9.1` & `torchaudio==2.9.1` - Core ML framework
- `silero-vad==5.1.2` - VAD still uses Silero (CPU)
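
Since Silero stays in the pipeline, VAD keeps running on the CPU and decides which audio ever reaches the GPU transcriber. A small sketch using the `silero-vad` pip API (the file name and the hand-off policy are illustrative):

```python
# Gate audio with Silero VAD (CPU) before spending GPU time on ASR.
from silero_vad import load_silero_vad, read_audio, get_speech_timestamps

vad_model = load_silero_vad()      # lightweight CPU model
wav = read_audio("capture.wav")    # 16 kHz mono tensor

# return_seconds=True yields {"start": s, "end": s} in seconds.
for seg in get_speech_timestamps(wav, vad_model, return_seconds=True):
    print(f"speech {seg['start']:.2f}s to {seg['end']:.2f}s -> send to Parakeet")
```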

### 3. Server Updates: `stt_server.py`

**Changes**:
- Import `ParakeetTranscriber` instead of `WhisperTranscriber`
- Partial transcripts now include a `words` array with timestamps
- Final transcripts include a `words` array for LLM pre-computation
- Startup logs show "Loading NVIDIA Parakeet TDT model"

**Word-level Token Format**:

```json
{
  "type": "partial",
  "text": "hello world",
  "words": [
    {"word": "hello", "start_time": 0.0, "end_time": 0.5},
    {"word": "world", "start_time": 0.5, "end_time": 1.0}
  ],
  "user_id": "123",
  "timestamp": 1234.56
}
```
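
On the producing side, a helper along these lines could assemble and send that envelope (the helper name and the websockets-style async `send()` are assumptions, not the server's actual code):

```python
# Hypothetical sketch of how stt_server.py might emit a partial transcript.
import json
import time


async def send_partial(ws, user_id: str, text: str, words: list) -> None:
    message = {
        "type": "partial",
        "text": text,
        "words": words,  # [{"word", "start_time", "end_time"}, ...]
        "user_id": user_id,
        "timestamp": time.time(),
    }
    await ws.send(json.dumps(message))  # ws: websockets-style connection
```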

## Advantages Over Faster-Whisper

1. **Real-time Performance**: TDT architecture designed for streaming
2. **No cuDNN Issues**: Native PyTorch, no CTranslate2 library loading problems
3. **Word-level Tokens**: Enables LLM prompt pre-computation during speech
4. **Lower Latency**: Optimized for real-time use cases
5. **Better GPU Utilization**: Uses standard PyTorch CUDA
6. **Simpler Dependencies**: No external compiled libraries

## Deployment

1. **Build Container**:
   ```bash
   docker-compose build miku-stt
   ```

2. **First Run** (downloads the ~600MB model):
   ```bash
   docker-compose up miku-stt
   ```
   The model is cached in the `/models` volume for subsequent runs.

3. **Verify GPU Usage**:
   ```bash
   docker exec miku-stt nvidia-smi
   ```
   You should see a `python3` process using VRAM (~1.5GB for model + inference).

## Testing

Same test procedure as before:

1. Join a voice channel
2. `!miku listen`
3. Speak clearly
4. Check the logs for "Parakeet model loaded"
5. Verify that transcripts appear faster than before

## Bot-Side Compatibility

No changes are needed to the bot code - the STT WebSocket protocol is identical. The bot will automatically receive word-level tokens in partial/final transcript messages.

### Future Enhancement: LLM Pre-computation

The `words` array can be used to start LLM inference before the full transcript completes (see the sketch below):
- Send partial words to the LLM as they arrive
- The LLM begins processing prompt tokens
- Faster response time when the user finishes speaking
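
A rough sketch of that flow, with the LLM call abstracted behind a hypothetical `llm` client (none of these names exist in the codebase yet; this is the idea, not an implementation):

```python
# Hypothetical consumer: feed words into the LLM prompt as they arrive,
# so prompt processing overlaps with the user's speech.
async def precompute_prompt(transcript_stream, llm):
    prompt_words = []
    async for msg in transcript_stream:  # partial/final STT messages
        # Assumes each partial carries the cumulative word list for the
        # current utterance; take only the words we haven't seen yet.
        new = [w["word"] for w in msg["words"][len(prompt_words):]]
        prompt_words.extend(new)
        if new:
            # The LLM ingests the extended prompt; no tokens generated yet.
            await llm.extend_prompt(" ".join(new))
        if msg["type"] == "final":
            # User finished speaking; the prompt is already processed.
            return await llm.generate()
```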

## Rollback (if needed)

To revert to Faster-Whisper:

1. Restore `requirements.txt` from git
2. Restore `stt_server.py` from git
3. Delete `parakeet_transcriber.py`
4. Rebuild container

## Performance Expectations

- **Model Load Time**: ~5-10 seconds (first run downloads the model from HuggingFace)
- **VRAM Usage**: ~1.5GB (vs ~800MB for Whisper small)
- **Latency**: ~200-500ms for 2-second audio chunks
- **GPU Utilization**: 30-60% during active transcription
- **Accuracy**: Similar to Whisper small (designed for English)