STT Migration: NeMo → ONNX Runtime

What Changed

Old Implementation (stt/):

  • Used NVIDIA NeMo toolkit with PyTorch
  • Heavy memory usage (~4-5GB VRAM)
  • Complex dependency tree (NeMo, transformers, huggingface-hub conflicts)
  • Slow transcription (~2-3 seconds per utterance)
  • Custom VAD + FastAPI WebSocket server

New Implementation (stt-parakeet/):

  • Uses the onnx-asr library with ONNX Runtime (see the sketch after this list)
  • Lower VRAM usage (~2-3GB)
  • Simple dependencies (onnxruntime-gpu, onnx-asr, numpy)
  • Much faster transcription (~0.5-1 second per utterance)
  • Clean architecture with modular ASR pipeline
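
The core of the new pipeline is a thin wrapper around onnx-asr. A minimal sketch of that core is below; the model identifier is an assumption, and the actual wrapper, including GPU provider setup, lives in asr/asr_pipeline.py:

# Sketch only: the model name is an assumed onnx-asr identifier; see
# asr/asr_pipeline.py for the real wrapper and GPU provider configuration.
import onnx_asr

model = onnx_asr.load_model("nemo-parakeet-tdt-0.6b-v2")  # downloads on first use
print(model.recognize("utterance.wav"))                   # 16kHz mono WAV in, text out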

Architecture

stt-parakeet/
├── Dockerfile              # CUDA 12.6 (cuDNN 9) + Python 3.11 + ONNX Runtime
├── requirements-stt.txt    # Pinned dependencies
├── asr/
│   └── asr_pipeline.py    # ONNX ASR wrapper with GPU acceleration
├── server/
│   └── ws_server.py       # WebSocket server (port 8766)
├── vad/
│   └── silero_vad.py      # Voice Activity Detection
└── models/                # Model cache (auto-downloaded)

Docker Setup

Build

docker-compose build miku-stt

Run

docker-compose up -d miku-stt

Check Logs

docker logs -f miku-stt

Verify CUDA

docker exec miku-stt python3.11 -c "import onnxruntime as ort; print('CUDA:', 'CUDAExecutionProvider' in ort.get_available_providers())"

API Changes

Old Protocol (port 8001)

# FastAPI with /ws/stt/{user_id} endpoint
ws://localhost:8001/ws/stt/123456

# Events:
{
  "type": "vad",
  "event": "speech_start" | "speaking" | "speech_end",
  "probability": 0.95
}
{
  "type": "partial",
  "text": "Hello",
  "words": []
}
{
  "type": "final",
  "text": "Hello world",
  "words": [{"word": "Hello", "start_time": 0.0, "end_time": 0.5}]
}

New Protocol (port 8766)

# Direct WebSocket connection
ws://localhost:8766

# Send audio (binary):
# - int16 PCM, 16kHz mono
# - Send as raw bytes

# Send commands (JSON):
{"type": "final"}   # Trigger final transcription
{"type": "reset"}   # Clear audio buffer

# Receive transcripts:
{
  "type": "transcript",
  "text": "Hello world",
  "is_final": false  # Progressive transcription
}
{
  "type": "transcript",
  "text": "Hello world",
  "is_final": true   # Final transcription after "final" command
}
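
Putting the new protocol together, a minimal test client could look like the sketch below. It uses the third-party websockets package, and the silent PCM buffer is a stand-in for real audio from Discord:

# Hypothetical end-to-end test of the new protocol.
import asyncio
import json

import numpy as np
import websockets

async def main():
    async with websockets.connect("ws://localhost:8766") as ws:
        pcm = np.zeros(16000, dtype=np.int16)         # 1s of 16kHz mono silence
        await ws.send(pcm.tobytes())                  # audio goes as raw binary
        await ws.send(json.dumps({"type": "final"}))  # request final transcript
        while True:
            msg = json.loads(await ws.recv())
            if msg["type"] == "transcript":
                tag = "final" if msg["is_final"] else "partial"
                print(f"[{tag}] {msg['text']}")
                if msg["is_final"]:
                    break

asyncio.run(main())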

Bot Integration Changes Needed

1. Update WebSocket URL

# Old
ws://miku-stt:8000/ws/stt/{user_id}

# New
ws://miku-stt:8766

2. Update Message Format

# Old: send audio via the FastAPI client's send_bytes()
await websocket.send_bytes(audio_data)

# New: plain websockets client; the payload is the same raw int16 PCM bytes
await websocket.send(audio_data)

# Old: Listen for VAD events
if msg["type"] == "vad":
    # Handle VAD

# New: No VAD events (handled internally)
# Just send final command when user stops speaking
await websocket.send(json.dumps({"type": "final"}))

3. Update Response Handling

# Old
if msg["type"] == "partial":
    text = msg["text"]
    words = msg["words"]
    
if msg["type"] == "final":
    text = msg["text"]
    words = msg["words"]

# New
if msg["type"] == "transcript":
    text = msg["text"]
    is_final = msg["is_final"]
    # No word-level timestamps in ONNX version

Performance Comparison

Metric               Old (NeMo)     New (ONNX)
VRAM Usage           4-5GB          2-3GB
Transcription Speed  2-3s           0.5-1s
Build Time           ~10 min        ~5 min
Dependencies         50+ packages   15 packages
GPU Utilization      60-70%         85-95%
OOM Crashes          Frequent       None

Migration Steps

  1. Build new container: docker-compose build miku-stt
  2. Update bot WebSocket client (bot/utils/stt_client.py)
  3. Update voice receiver to send "final" command
  4. Test transcription quality
  5. Remove old stt/ directory

Troubleshooting

Issue 1: CUDA Not Working (Falling Back to CPU)

Symptoms:

[E:onnxruntime:Default] Failed to load library libonnxruntime_providers_cuda.so 
with error: libcudnn.so.9: cannot open shared object file

Cause: ONNX Runtime GPU requires cuDNN 9, but CUDA 12.1 base image only has cuDNN 8.

Fix: Update Dockerfile base image:

FROM nvidia/cuda:12.6.2-cudnn-runtime-ubuntu22.04

Verify:

docker logs miku-stt 2>&1 | grep "Providers"
# Should show: CUDAExecutionProvider (not just CPUExecutionProvider)

Issue 2: Connection Refused (Port 8000)

Symptoms:

ConnectionRefusedError: [Errno 111] Connect call failed ('172.20.0.5', 8000)

Cause: New ONNX server runs on port 8766, not 8000.

Fix: Update bot/utils/stt_client.py:

stt_url: str = "ws://miku-stt:8766"  # Changed from port 8000; the new protocol has no /ws/stt path

Issue 3: Protocol Mismatch

Symptoms: Bot doesn't receive transcripts, or transcripts are empty.

Cause: New ONNX server uses different WebSocket protocol.

Old Protocol (NeMo): Automatic VAD-triggered partial and final events.
New Protocol (ONNX): Manual control with the {"type": "final"} command.

Fix:

  • Update stt_client._handle_event() to handle the transcript type with the is_final flag (see the sketch after this list)
  • Add a send_final() method that requests the final transcription
  • Call stt_client.send_final() when the user stops speaking
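
A sketch of those client-side changes, assuming an asyncio client that keeps an open websockets connection in self.ws; the on_partial/on_final callbacks are illustrative, not part of the existing client:

import json

class STTClient:
    # Connection setup omitted; assume self.ws is an open websocket and
    # on_partial/on_final are illustrative callbacks supplied by the bot.

    async def send_final(self) -> None:
        # Ask the server to flush its buffer and emit the final transcript.
        await self.ws.send(json.dumps({"type": "final"}))

    async def _handle_event(self, raw: str) -> None:
        msg = json.loads(raw)
        if msg.get("type") != "transcript":
            return
        if msg["is_final"]:
            await self.on_final(msg["text"])
        else:
            await self.on_partial(msg["text"])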

Rollback Plan

If needed, revert docker-compose.yml:

miku-stt:
  build:
    context: ./stt
    dockerfile: Dockerfile.stt
  # ... rest of old config

Notes

  • Model downloads on first run (~600MB)
  • Models cached in ./stt-parakeet/models/
  • No word-level timestamps (ONNX model doesn't provide them)
  • VAD handled internally (no need for external VAD integration)
  • Uses same GPU (GTX 1660, device 0) as before
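
For reference, these notes map onto a compose service along these lines. This is a sketch, not the project's actual docker-compose.yml; the container-side mount path and the deploy GPU syntax are assumptions:

miku-stt:
  build:
    context: ./stt-parakeet
  ports:
    - "8766:8766"
  volumes:
    - ./stt-parakeet/models:/app/models   # persist the model cache (container path assumed)
  deploy:
    resources:
      reservations:
        devices:
          - driver: nvidia
            device_ids: ["0"]             # GTX 1660, device 0
            capabilities: [gpu]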