
NVIDIA Parakeet Migration

Summary

Replaced Faster-Whisper with NVIDIA Parakeet TDT (Token-and-Duration Transducer) for real-time speech transcription.

Changes Made

1. New Transcriber: parakeet_transcriber.py

  • Model: nvidia/parakeet-tdt-0.6b-v3 (600M parameters)
  • Features:
    • Real-time streaming transcription
    • Word-level timestamps for LLM pre-computation
    • GPU-accelerated (CUDA)
    • Lower latency than Faster-Whisper
    • Native PyTorch (no CTranslate2 dependency)
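
For orientation, the surface that stt_server.py consumes from the new transcriber can be sketched as below. The class and field names here (Word, TranscriptionResult, Transcriber.transcribe) are assumptions made for illustration, not the actual parakeet_transcriber.py API.

# Sketch of the transcriber interface assumed by stt_server.py (illustrative names).
from dataclasses import dataclass
from typing import List, Protocol

import numpy as np


@dataclass
class Word:
    word: str
    start_time: float  # seconds, relative to the start of the audio chunk
    end_time: float


@dataclass
class TranscriptionResult:
    text: str
    words: List[Word]


class Transcriber(Protocol):
    def transcribe(self, audio: np.ndarray, sample_rate: int = 16000) -> TranscriptionResult:
        """Transcribe a mono float32 audio buffer and return text plus word timings."""
        ...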

2. Requirements Updated

Removed:

  • faster-whisper==1.2.1
  • ctranslate2==4.5.0

Added:

  • transformers==4.47.1 - HuggingFace model loading
  • accelerate==1.2.1 - GPU optimization
  • sentencepiece==0.2.0 - Tokenization

Kept:

  • torch==2.9.1 & torchaudio==2.9.1 - Core ML framework
  • silero-vad==5.1.2 - VAD still uses Silero (CPU)

3. Server Updates: stt_server.py

Changes:

  • Import ParakeetTranscriber instead of WhisperTranscriber
  • Partial transcripts now include a words array with per-word timestamps
  • Final transcripts include the same words array for LLM pre-computation
  • Startup logs show "Loading NVIDIA Parakeet TDT model"

Word-level Token Format:

{
  "type": "partial",
  "text": "hello world",
  "words": [
    {"word": "hello", "start_time": 0.0, "end_time": 0.5},
    {"word": "world", "start_time": 0.5, "end_time": 1.0}
  ],
  "user_id": "123",
  "timestamp": 1234.56
}
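
As a rough illustration, a message of this shape could be assembled and pushed to the bot as shown below. The helper name and the choice of clock are assumptions for the sketch, the words objects are assumed to match the Word sketch shown earlier, and the actual stt_server.py may structure this differently.

import json
import time


async def send_partial(websocket, user_id: str, text: str, words) -> None:
    # Build the partial-transcript payload in the format shown above and push it
    # over an open WebSocket connection (any connection object with an async
    # send(str) method, e.g. from the websockets library).
    message = {
        "type": "partial",
        "text": text,
        "words": [
            {"word": w.word, "start_time": w.start_time, "end_time": w.end_time}
            for w in words
        ],
        "user_id": user_id,
        "timestamp": time.monotonic(),
    }
    await websocket.send(json.dumps(message))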

Advantages Over Faster-Whisper

  1. Real-time Performance: TDT architecture designed for streaming
  2. No cuDNN Issues: Native PyTorch, no CTranslate2 library loading problems
  3. Word-level Tokens: Enables LLM prompt pre-computation during speech
  4. Lower Latency: Optimized for real-time use cases
  5. Better GPU Utilization: Uses standard PyTorch CUDA
  6. Simpler Dependencies: No external compiled libraries

Deployment

  1. Build Container:

    docker-compose build miku-stt
    
  2. First Run (downloads the ~600 MB model):

    docker-compose up miku-stt
    

    The model is cached in the /models volume, so subsequent runs skip the download.

  3. Verify GPU Usage:

    docker exec miku-stt nvidia-smi
    

    You should see a python3 process using VRAM (~1.5 GB for the model plus inference).

Testing

Same test procedure as before:

  1. Join voice channel
  2. !miku listen
  3. Speak clearly
  4. Check logs for "Parakeet model loaded"
  5. Verify transcripts appear faster than before

Bot-Side Compatibility

No changes are needed to the bot code; the STT WebSocket protocol is identical. The bot will automatically receive word-level tokens in partial and final transcript messages.
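
If the bot does want to consume the new field, it is a small optional addition. A sketch, assuming the bot already parses incoming STT messages as JSON; the function name and the logging are illustrative, not the actual bot code.

import json


def handle_stt_message(raw: str) -> None:
    # Existing fields ("type", "text", "user_id") are unchanged; "words" is extra
    # data that older bot code can simply ignore.
    msg = json.loads(raw)
    if msg.get("type") in ("partial", "final"):
        for w in msg.get("words", []):
            print(f'{w["word"]}: {w["start_time"]:.2f}-{w["end_time"]:.2f}s')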

Future Enhancement: LLM Pre-computation

The words array can be used to start LLM inference before the full transcript completes (see the sketch after this list):

  • Send partial words to LLM as they arrive
  • LLM begins processing prompt tokens
  • Faster response time when user finishes speaking
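
A sketch of that flow follows, with llm_prefill and llm_complete as hypothetical placeholders for whatever inference backend is used; a real implementation would also need to handle partial words being revised between updates.

async def llm_prefill(text: str) -> None:
    """Hypothetical placeholder: feed prompt tokens to the LLM ahead of time."""


async def llm_complete(text: str) -> str:
    """Hypothetical placeholder: run generation once the transcript is final."""
    return ""


async def stream_to_llm(stt_messages) -> str:
    # stt_messages: async iterator of decoded partial/final transcript messages.
    sent = 0  # number of words already forwarded to the LLM
    async for msg in stt_messages:
        words = [w["word"] for w in msg.get("words", [])]
        if msg["type"] == "partial" and len(words) > sent:
            # Forward only the new words so prompt processing overlaps with speech.
            await llm_prefill(" ".join(words[sent:]))
            sent = len(words)
        elif msg["type"] == "final":
            # Most of the prompt is already processed, so generation can start at once.
            return await llm_complete(msg["text"])
    return ""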

Rollback (if needed)

To revert to Faster-Whisper:

  1. Restore requirements.txt from git
  2. Restore stt_server.py from git
  3. Delete parakeet_transcriber.py
  4. Rebuild container
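
In command form (run from the stt/ directory, assuming the pre-migration files are still on the current branch):

git checkout -- requirements.txt stt_server.py
rm parakeet_transcriber.py
docker-compose build miku-stt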

Performance Expectations

  • Model Load Time: ~5-10 seconds (the first run also downloads the model from HuggingFace)
  • VRAM Usage: ~1.5GB (vs ~800MB for Whisper small)
  • Latency: ~200-500ms for 2-second audio chunks
  • GPU Utilization: 30-60% during active transcription
  • Accuracy: Similar to Whisper small (designed for English)