# NVIDIA Parakeet Migration

## Summary

Replaced Faster-Whisper with NVIDIA Parakeet TDT (Token-and-Duration Transducer) for real-time speech transcription.
## Changes Made

### 1. New Transcriber: `parakeet_transcriber.py`

- **Model:** `nvidia/parakeet-tdt-0.6b-v3` (600M parameters)
- **Features:**
  - Real-time streaming transcription
  - Word-level timestamps for LLM pre-computation (interface sketched after this list)
  - GPU-accelerated (CUDA)
  - Lower latency than Faster-Whisper
  - Native PyTorch (no CTranslate2 dependency)
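For orientation, a minimal sketch of the interface the rest of this document assumes the transcriber exposes. The names below are illustrative, not the actual contents of `parakeet_transcriber.py`:

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class Word:
    """One word-level token; times are seconds from the start of the utterance."""
    word: str
    start_time: float
    end_time: float


class Transcriber(Protocol):
    """Shape assumed by stt_server.py (illustrative sketch, not the real module)."""

    def transcribe(self, audio_chunk: bytes) -> tuple[str, list[Word]]:
        """Return (partial_text, words_so_far) for one chunk of PCM audio."""
        ...
```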
### 2. Requirements Updated

**Removed:**
- `faster-whisper==1.2.1`
- `ctranslate2==4.5.0`

**Added:**
- `transformers==4.47.1` - HuggingFace model loading
- `accelerate==1.2.1` - GPU optimization
- `sentencepiece==0.2.0` - Tokenization

**Kept:**
- `torch==2.9.1` & `torchaudio==2.9.1` - Core ML framework
- `silero-vad==5.1.2` - VAD still uses Silero (CPU)
### 3. Server Updates: `stt_server.py`

Changes:
- Import `ParakeetTranscriber` instead of `WhisperTranscriber`
- Partial transcripts now include a `words` array with timestamps
- Final transcripts include a `words` array for LLM pre-computation
- Startup logs show "Loading NVIDIA Parakeet TDT model"
**Word-level Token Format:**

```json
{
  "type": "partial",
  "text": "hello world",
  "words": [
    {"word": "hello", "start_time": 0.0, "end_time": 0.5},
    {"word": "world", "start_time": 0.5, "end_time": 1.0}
  ],
  "user_id": "123",
  "timestamp": 1234.56
}
```
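For reference, a small helper that produces messages in exactly this shape (illustrative only; the real `stt_server.py` may build the payload differently):

```python
import json
import time


def make_partial_message(text: str, words: list[dict], user_id: str) -> str:
    """Serialize a partial transcript in the wire format shown above."""
    return json.dumps({
        "type": "partial",
        "text": text,
        # Each entry: {"word": str, "start_time": float, "end_time": float}
        "words": words,
        "user_id": user_id,
        "timestamp": time.time(),  # timestamp semantics assumed (float seconds)
    })


msg = make_partial_message(
    "hello world",
    [{"word": "hello", "start_time": 0.0, "end_time": 0.5},
     {"word": "world", "start_time": 0.5, "end_time": 1.0}],
    user_id="123",
)
```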
## Advantages Over Faster-Whisper

- **Real-time Performance:** TDT architecture designed for streaming
- **No cuDNN Issues:** Native PyTorch, no CTranslate2 library-loading problems
- **Word-level Tokens:** Enables LLM prompt pre-computation during speech
- **Lower Latency:** Optimized for real-time use cases
- **Better GPU Utilization:** Uses standard PyTorch CUDA
- **Simpler Dependencies:** No external compiled libraries
## Deployment

1. **Build container:**

   ```bash
   docker-compose build miku-stt
   ```

2. **First run** (downloads model, ~600MB):

   ```bash
   docker-compose up miku-stt
   ```

   The model is cached in the `/models` volume for subsequent runs.

3. **Verify GPU usage:**

   ```bash
   docker exec miku-stt nvidia-smi
   ```

   You should see a `python3` process using VRAM (~1.5GB for model + inference).
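For a more direct check from inside the container, a quick sanity script (the filename is illustrative; `torch` is available per the pinned requirements):

```python
# check_gpu.py - run with: docker exec miku-stt python3 check_gpu.py
import torch

print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
```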
## Testing

Same test procedure as before:

1. Join a voice channel
2. `!miku listen`
3. Speak clearly
4. Check logs for "Parakeet model loaded"
5. Verify transcripts appear faster than before
## Bot-Side Compatibility

No changes are needed to the bot code - the STT WebSocket protocol is identical. The bot will automatically receive word-level tokens in partial/final transcript messages; a minimal consumption sketch follows.
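The sketch below assumes the `websockets` library and a placeholder URL (both illustrative; the bot's existing connection code works unchanged):

```python
import asyncio
import json

import websockets  # illustrative client; any async WebSocket library works


async def consume_transcripts(url: str = "ws://miku-stt:8000") -> None:
    """Print word-level tokens from partial/final transcript messages."""
    async with websockets.connect(url) as ws:
        async for raw in ws:
            msg = json.loads(raw)
            if msg.get("type") in ("partial", "final"):
                for w in msg.get("words", []):
                    print(f'{w["word"]} [{w["start_time"]:.2f}-{w["end_time"]:.2f}s]')


asyncio.run(consume_transcripts())
```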
## Future Enhancement: LLM Pre-computation

The `words` array can be used to start LLM inference before the full transcript completes (sketched after this list):

- Send partial words to the LLM as they arrive
- The LLM begins processing prompt tokens
- Faster response time when the user finishes speaking
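A rough sketch of the idea, built around a hypothetical `llm.prefill()`/`llm.generate()` API (nothing like this exists in the codebase yet; it stands in for whatever incremental prompt ingestion the chosen LLM backend offers):

```python
class IncrementalPrompter:
    """Feed words to the LLM as they arrive so the prompt is mostly
    pre-computed by the time the final transcript lands."""

    def __init__(self, llm):
        self.llm = llm  # hypothetical client exposing prefill() and generate()
        self.sent = 0   # count of words already pushed to the LLM

    def on_partial(self, words: list[dict]) -> None:
        """Push only the words not yet sent."""
        new = words[self.sent:]
        if new:
            self.llm.prefill(" ".join(w["word"] for w in new))  # hypothetical API
            self.sent = len(words)

    def on_final(self, words: list[dict]) -> str:
        """Flush remaining words, then generate the reply."""
        self.on_partial(words)
        return self.llm.generate()  # hypothetical API
```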
## Rollback (if needed)

To revert to Faster-Whisper:

1. Restore `requirements.txt` from git
2. Restore `stt_server.py` from git
3. Delete `parakeet_transcriber.py`
4. Rebuild the container
## Performance Expectations

- **Model Load Time:** ~5-10 seconds (first run downloads from HuggingFace)
- **VRAM Usage:** ~1.5GB (vs ~800MB for Whisper small)
- **Latency:** ~200-500ms for 2-second audio chunks
- **GPU Utilization:** 30-60% during active transcription
- **Accuracy:** Similar to Whisper small (designed for English)