# STT Migration: NeMo → ONNX Runtime
## What Changed

**Old Implementation (`stt/`):**

- Used NVIDIA NeMo toolkit with PyTorch
- Heavy memory usage (~4-5GB VRAM)
- Complex dependency tree (NeMo, transformers, huggingface-hub conflicts)
- Slow transcription (~2-3 seconds per utterance)
- Custom VAD + FastAPI WebSocket server

**New Implementation (`stt-parakeet/`):**

- Uses the `onnx-asr` library with ONNX Runtime
- Lower VRAM usage (~2-3GB)
- Simple dependencies (onnxruntime-gpu, onnx-asr, numpy)
- Much faster transcription (~0.5-1 second per utterance)
- Clean architecture with modular ASR pipeline
## Architecture

```
stt-parakeet/
├── Dockerfile            # CUDA 12.6 + Python 3.11 + ONNX Runtime (see Issue 1 below)
├── requirements-stt.txt  # Exact pinned dependencies
├── asr/
│   └── asr_pipeline.py   # ONNX ASR wrapper with GPU acceleration
├── server/
│   └── ws_server.py      # WebSocket server (port 8766)
├── vad/
│   └── silero_vad.py     # Voice Activity Detection
└── models/               # Model cache (auto-downloaded)
```
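For orientation, here is a minimal sketch of what `asr_pipeline.py` does, built on the `onnx-asr` package. The class name, `recognize()` call shape, and normalization details are illustrative assumptions, not a copy of the actual file; check the `onnx-asr` docs for the exact API.

```python
# asr_pipeline.py sketch -- illustrative only; structure and defaults are assumptions.
import numpy as np
import onnx_asr  # pip install onnx-asr


class ASRPipeline:
    def __init__(self, model_name: str = "nemo-parakeet-tdt-0.6b-v2"):
        # onnx-asr downloads and caches the ONNX model on first use.
        # With onnxruntime-gpu installed, CUDAExecutionProvider is used when available.
        self.model = onnx_asr.load_model(model_name)

    def transcribe(self, pcm_int16: bytes) -> str:
        # The server receives raw int16 PCM at 16 kHz mono; the model expects
        # a float32 waveform normalized to [-1.0, 1.0].
        audio = np.frombuffer(pcm_int16, dtype=np.int16).astype(np.float32) / 32768.0
        # Exact recognize() signature per the onnx-asr docs (assumption here).
        return self.model.recognize(audio, sample_rate=16000)
```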
## Docker Setup

### Build

```bash
docker-compose build miku-stt
```

### Run

```bash
docker-compose up -d miku-stt
```

### Check Logs

```bash
docker logs -f miku-stt
```

### Verify CUDA

```bash
docker exec miku-stt python3.11 -c "import onnxruntime as ort; print('CUDA:', 'CUDAExecutionProvider' in ort.get_available_providers())"
```
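The same check can run inside the server at startup so a silent CPU fallback shows up in the logs. A sketch of such a check (not necessarily present in `ws_server.py` today; the log wording is illustrative):

```python
# Provider selection with a logged CPU fallback -- a sketch, not the actual server code.
import onnxruntime as ort


def pick_providers() -> list[str]:
    available = ort.get_available_providers()
    if "CUDAExecutionProvider" in available:
        # Keep CPU as a fallback so the server still starts if CUDA init fails.
        return ["CUDAExecutionProvider", "CPUExecutionProvider"]
    print("WARNING: CUDAExecutionProvider not available; running on CPU:", available)
    return ["CPUExecutionProvider"]
```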
## API Changes

### Old Protocol (port 8001)

```
# FastAPI with /ws/stt/{user_id} endpoint
ws://localhost:8001/ws/stt/123456

# Events:
{
  "type": "vad",
  "event": "speech_start" | "speaking" | "speech_end",
  "probability": 0.95
}
{
  "type": "partial",
  "text": "Hello",
  "words": []
}
{
  "type": "final",
  "text": "Hello world",
  "words": [{"word": "Hello", "start_time": 0.0, "end_time": 0.5}]
}
```
### New Protocol (port 8766)

```
# Direct WebSocket connection
ws://localhost:8766

# Send audio (binary):
# - int16 PCM, 16kHz mono
# - Send as raw bytes

# Send commands (JSON):
{"type": "final"}   # Trigger final transcription
{"type": "reset"}   # Clear audio buffer

# Receive transcripts:
{
  "type": "transcript",
  "text": "Hello world",
  "is_final": false   # Progressive transcription
}
{
  "type": "transcript",
  "text": "Hello world",
  "is_final": true    # Final transcription after "final" command
}
```
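For a quick end-to-end check of the new protocol, here is a minimal client sketch using the `websockets` package. The `test.wav` path is a placeholder; the file must already be 16 kHz, mono, 16-bit PCM.

```python
# Smoke test for the new protocol -- a sketch, not production code.
import asyncio
import json
import wave

import websockets


async def main():
    # Placeholder input: 16 kHz, mono, 16-bit PCM WAV.
    with wave.open("test.wav", "rb") as f:
        pcm = f.readframes(f.getnframes())

    async with websockets.connect("ws://localhost:8766") as ws:
        # Stream the audio in small binary chunks, as the bot would.
        for i in range(0, len(pcm), 3200):  # 3200 bytes = 100 ms of int16 at 16 kHz
            await ws.send(pcm[i:i + 3200])

        # Ask the server for the final transcription.
        await ws.send(json.dumps({"type": "final"}))

        # Read transcript events until the final one arrives.
        while True:
            msg = json.loads(await ws.recv())
            if msg["type"] == "transcript":
                print(msg["text"], "(final)" if msg["is_final"] else "(partial)")
                if msg["is_final"]:
                    break


asyncio.run(main())
```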
## Bot Integration Changes Needed

### 1. Update WebSocket URL

```
# Old
ws://miku-stt:8000/ws/stt/{user_id}

# New
ws://miku-stt:8766
```

### 2. Update Message Format

```python
# Old: Send audio with metadata
await websocket.send_bytes(audio_data)

# New: Send raw audio bytes (same idea, different client API)
await websocket.send(audio_data)  # bytes

# Old: Listen for VAD events
if msg["type"] == "vad":
    ...  # Handle VAD

# New: No VAD events (handled internally).
# Just send the "final" command when the user stops speaking
# (see the end-of-speech sketch below):
await websocket.send(json.dumps({"type": "final"}))
```
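Since the bot no longer receives VAD events, it has to decide on its own when the user has stopped speaking. One simple approach is a silence timer that fires the "final" command after a short gap in incoming audio. A sketch; the class name and the 0.8 s threshold are arbitrary assumptions, not the actual bot code:

```python
# End-of-speech detection on the bot side -- illustrative sketch.
import asyncio
import json

SILENCE_TIMEOUT = 0.8  # seconds without audio before requesting a final transcript


class SpeechEndDetector:
    def __init__(self, websocket):
        self.websocket = websocket
        self._timer: asyncio.Task | None = None

    async def on_audio(self, chunk: bytes):
        # Forward the user's audio and restart the silence timer.
        await self.websocket.send(chunk)
        if self._timer:
            self._timer.cancel()
        self._timer = asyncio.create_task(self._fire_after_silence())

    async def _fire_after_silence(self):
        try:
            await asyncio.sleep(SILENCE_TIMEOUT)
            # No audio arrived for SILENCE_TIMEOUT seconds: finalize.
            await self.websocket.send(json.dumps({"type": "final"}))
        except asyncio.CancelledError:
            pass  # More audio arrived; timer was restarted.
```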
### 3. Update Response Handling

```python
# Old
if msg["type"] == "partial":
    text = msg["text"]
    words = msg["words"]
if msg["type"] == "final":
    text = msg["text"]
    words = msg["words"]

# New
if msg["type"] == "transcript":
    text = msg["text"]
    is_final = msg["is_final"]
    # No word-level timestamps in the ONNX version
```
## Performance Comparison

| Metric | Old (NeMo) | New (ONNX) |
|---|---|---|
| VRAM Usage | 4-5GB | 2-3GB |
| Transcription Speed | 2-3s per utterance | 0.5-1s per utterance |
| Build Time | ~10 min | ~5 min |
| Dependencies | 50+ packages | 15 packages |
| GPU Utilization | 60-70% | 85-95% |
| OOM Crashes | Frequent | None |
## Migration Steps

- ✅ Build new container: `docker-compose build miku-stt`
- ✅ Update bot WebSocket client (`bot/utils/stt_client.py`)
- ✅ Update voice receiver to send the "final" command
- ⏳ Test transcription quality
- ⏳ Remove old `stt/` directory
## Troubleshooting

### Issue 1: CUDA Not Working (Falling Back to CPU)

**Symptoms:**

```
[E:onnxruntime:Default] Failed to load library libonnxruntime_providers_cuda.so
with error: libcudnn.so.9: cannot open shared object file
```

**Cause:** ONNX Runtime GPU requires cuDNN 9, but the CUDA 12.1 base image only ships cuDNN 8.

**Fix:** Update the Dockerfile base image:

```dockerfile
FROM nvidia/cuda:12.6.2-cudnn-runtime-ubuntu22.04
```

**Verify:**

```bash
docker logs miku-stt 2>&1 | grep "Providers"
# Should show: CUDAExecutionProvider (not just CPUExecutionProvider)
```
### Issue 2: Connection Refused (Port 8000)

**Symptoms:**

```
ConnectionRefusedError: [Errno 111] Connect call failed ('172.20.0.5', 8000)
```

**Cause:** The new ONNX server runs on port 8766, not 8000.

**Fix:** Update `bot/utils/stt_client.py`:

```python
stt_url: str = "ws://miku-stt:8766/ws/stt"  # Changed from 8000
```
### Issue 3: Protocol Mismatch

**Symptoms:** Bot doesn't receive transcripts, or transcripts are empty.

**Cause:** The new ONNX server uses a different WebSocket protocol.

- Old protocol (NeMo): automatic VAD-triggered `partial` and `final` events
- New protocol (ONNX): manual control with the `{"type": "final"}` command

**Fix** (see the sketch below):

- Updated `stt_client._handle_event()` to handle the `transcript` type with the `is_final` flag
- Added a `send_final()` method to request final transcription
- Bot should call `stt_client.send_final()` when the user stops speaking
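A sketch of what those `stt_client.py` changes might look like. Only `_handle_event()` and `send_final()` are named in this doc; the surrounding class and callback structure are assumptions:

```python
# stt_client.py sketch -- illustrative, not the actual file.
import json


class STTClient:
    def __init__(self, websocket, on_transcript):
        self.websocket = websocket
        self.on_transcript = on_transcript  # callback(text: str, is_final: bool)

    def _handle_event(self, raw: str):
        msg = json.loads(raw)
        # New protocol: a single "transcript" type with an is_final flag,
        # replacing the old "partial"/"final" event types.
        if msg.get("type") == "transcript":
            self.on_transcript(msg["text"], msg["is_final"])

    async def send_final(self):
        # Ask the server to run final transcription on its buffered audio.
        await self.websocket.send(json.dumps({"type": "final"}))
```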
## Rollback Plan

If needed, revert `docker-compose.yml`:

```yaml
miku-stt:
  build:
    context: ./stt
    dockerfile: Dockerfile.stt
  # ... rest of old config
```
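For comparison, a sketch of what the new service entry presumably looks like; everything beyond the build context and port is an assumption (GPU reservation syntax in particular), so check the actual `docker-compose.yml`:

```yaml
# Assumed new service entry -- verify against the real compose file.
miku-stt:
  build:
    context: ./stt-parakeet
    dockerfile: Dockerfile
  ports:
    - "8766:8766"
  deploy:
    resources:
      reservations:
        devices:
          - driver: nvidia
            device_ids: ["0"]   # GTX 1660, device 0 (per Notes below)
            capabilities: [gpu]
```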
## Notes

- Model downloads on first run (~600MB)
- Models cached in `./stt-parakeet/models/`
- No word-level timestamps (the ONNX model doesn't provide them)
- VAD handled internally (no need for external VAD integration)
- Uses same GPU (GTX 1660, device 0) as before