# STT Migration: NeMo → ONNX Runtime
## What Changed

**Old Implementation (`stt/`):**

- Used NVIDIA NeMo toolkit with PyTorch
- Heavy memory usage (~4-5GB VRAM)
- Complex dependency tree (NeMo, transformers, huggingface-hub conflicts)
- Slow transcription (~2-3 seconds per utterance)
- Custom VAD + FastAPI WebSocket server

**New Implementation (`stt-parakeet/`):**

- Uses the `onnx-asr` library with ONNX Runtime
- Lower VRAM usage (~2-3GB)
- Simple dependencies (onnxruntime-gpu, onnx-asr, numpy)
- Much faster transcription (~0.5-1 second per utterance)
- Clean architecture with modular ASR pipeline
## Architecture

```
stt-parakeet/
├── Dockerfile            # CUDA 12.6 + Python 3.11 + ONNX Runtime (see Issue 1 below)
├── requirements-stt.txt  # Exact pinned dependencies
├── asr/
│   └── asr_pipeline.py   # ONNX ASR wrapper with GPU acceleration
├── server/
│   └── ws_server.py      # WebSocket server (port 8766)
├── vad/
│   └── silero_vad.py     # Voice Activity Detection
└── models/               # Model cache (auto-downloaded)
```
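For orientation, here is a minimal sketch of what `asr_pipeline.py` does, built on the `onnx-asr` package. The class name, `recognize()` call shape, and normalization details are illustrative assumptions, not a copy of the actual file; check the `onnx-asr` docs for the exact API.

```python
# asr_pipeline.py sketch -- illustrative only; structure and defaults are assumptions.
import numpy as np
import onnx_asr  # pip install onnx-asr


class ASRPipeline:
    def __init__(self, model_name: str = "nemo-parakeet-tdt-0.6b-v2"):
        # onnx-asr downloads and caches the ONNX model on first use.
        # With onnxruntime-gpu installed, CUDAExecutionProvider is used when available.
        self.model = onnx_asr.load_model(model_name)

    def transcribe(self, pcm_int16: bytes) -> str:
        # The server receives raw int16 PCM at 16 kHz mono; the model expects
        # a float32 waveform normalized to [-1.0, 1.0].
        audio = np.frombuffer(pcm_int16, dtype=np.int16).astype(np.float32) / 32768.0
        # Exact recognize() signature per the onnx-asr docs (assumption here).
        return self.model.recognize(audio, sample_rate=16000)
```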
## Docker Setup

### Build

```bash
docker-compose build miku-stt
```

### Run

```bash
docker-compose up -d miku-stt
```

### Check Logs

```bash
docker logs -f miku-stt
```

### Verify CUDA

```bash
docker exec miku-stt python3.11 -c "import onnxruntime as ort; print('CUDA:', 'CUDAExecutionProvider' in ort.get_available_providers())"
```
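The same check can run inside the server at startup so a silent CPU fallback shows up in the logs. A sketch of such a check (not necessarily present in `ws_server.py` today; the log wording is illustrative):

```python
# Provider selection with a logged CPU fallback -- a sketch, not the actual server code.
import onnxruntime as ort


def pick_providers() -> list[str]:
    available = ort.get_available_providers()
    if "CUDAExecutionProvider" in available:
        # Keep CPU as a fallback so the server still starts if CUDA init fails.
        return ["CUDAExecutionProvider", "CPUExecutionProvider"]
    print("WARNING: CUDAExecutionProvider not available; running on CPU:", available)
    return ["CPUExecutionProvider"]
```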
## API Changes

### Old Protocol (port 8001)

```
# FastAPI with /ws/stt/{user_id} endpoint
ws://localhost:8001/ws/stt/123456

# Events:
{
  "type": "vad",
  "event": "speech_start" | "speaking" | "speech_end",
  "probability": 0.95
}
{
  "type": "partial",
  "text": "Hello",
  "words": []
}
{
  "type": "final",
  "text": "Hello world",
  "words": [{"word": "Hello", "start_time": 0.0, "end_time": 0.5}]
}
```
### New Protocol (port 8766)

```
# Direct WebSocket connection
ws://localhost:8766

# Send audio (binary):
# - int16 PCM, 16kHz mono
# - Send as raw bytes

# Send commands (JSON):
{"type": "final"}   # Trigger final transcription
{"type": "reset"}   # Clear audio buffer

# Receive transcripts:
{
  "type": "transcript",
  "text": "Hello world",
  "is_final": false   # Progressive transcription
}
{
  "type": "transcript",
  "text": "Hello world",
  "is_final": true    # Final transcription after "final" command
}
```
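For a quick end-to-end check of the new protocol, here is a minimal client sketch using the `websockets` package. The `test.wav` path is a placeholder; the file must already be 16 kHz, mono, 16-bit PCM.

```python
# Smoke test for the new protocol -- a sketch, not production code.
import asyncio
import json
import wave

import websockets


async def main():
    # Placeholder input: 16 kHz, mono, 16-bit PCM WAV.
    with wave.open("test.wav", "rb") as f:
        pcm = f.readframes(f.getnframes())

    async with websockets.connect("ws://localhost:8766") as ws:
        # Stream the audio in small binary chunks, as the bot would.
        for i in range(0, len(pcm), 3200):  # 3200 bytes = 100 ms of int16 at 16 kHz
            await ws.send(pcm[i:i + 3200])

        # Ask the server for the final transcription.
        await ws.send(json.dumps({"type": "final"}))

        # Read transcript events until the final one arrives.
        while True:
            msg = json.loads(await ws.recv())
            if msg["type"] == "transcript":
                print(msg["text"], "(final)" if msg["is_final"] else "(partial)")
                if msg["is_final"]:
                    break


asyncio.run(main())
```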
## Bot Integration Changes Needed

### 1. Update WebSocket URL

```
# Old
ws://miku-stt:8000/ws/stt/{user_id}

# New
ws://miku-stt:8766
```

### 2. Update Message Format

```python
# Old: Send audio with metadata
await websocket.send_bytes(audio_data)

# New: Send raw audio bytes (same idea, different client API)
await websocket.send(audio_data)  # bytes

# Old: Listen for VAD events
if msg["type"] == "vad":
    ...  # Handle VAD

# New: No VAD events (handled internally).
# Just send the "final" command when the user stops speaking
# (see the end-of-speech sketch below):
await websocket.send(json.dumps({"type": "final"}))
```
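Since the bot no longer receives VAD events, it has to decide on its own when the user has stopped speaking. One simple approach is a silence timer that fires the "final" command after a short gap in incoming audio. A sketch; the class name and the 0.8 s threshold are arbitrary assumptions, not the actual bot code:

```python
# End-of-speech detection on the bot side -- illustrative sketch.
import asyncio
import json

SILENCE_TIMEOUT = 0.8  # seconds without audio before requesting a final transcript


class SpeechEndDetector:
    def __init__(self, websocket):
        self.websocket = websocket
        self._timer: asyncio.Task | None = None

    async def on_audio(self, chunk: bytes):
        # Forward the user's audio and restart the silence timer.
        await self.websocket.send(chunk)
        if self._timer:
            self._timer.cancel()
        self._timer = asyncio.create_task(self._fire_after_silence())

    async def _fire_after_silence(self):
        try:
            await asyncio.sleep(SILENCE_TIMEOUT)
            # No audio arrived for SILENCE_TIMEOUT seconds: finalize.
            await self.websocket.send(json.dumps({"type": "final"}))
        except asyncio.CancelledError:
            pass  # More audio arrived; timer was restarted.
```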
### 3. Update Response Handling

```python
# Old
if msg["type"] == "partial":
    text = msg["text"]
    words = msg["words"]
if msg["type"] == "final":
    text = msg["text"]
    words = msg["words"]

# New
if msg["type"] == "transcript":
    text = msg["text"]
    is_final = msg["is_final"]
    # No word-level timestamps in the ONNX version
```
## Performance Comparison

| Metric | Old (NeMo) | New (ONNX) |
|---|---|---|
| VRAM Usage | 4-5GB | 2-3GB |
| Transcription Speed | 2-3s per utterance | 0.5-1s per utterance |
| Build Time | ~10 min | ~5 min |
| Dependencies | 50+ packages | 15 packages |
| GPU Utilization | 60-70% | 85-95% |
| OOM Crashes | Frequent | None |
## Migration Steps

- ✅ Build new container: `docker-compose build miku-stt`
- ✅ Update bot WebSocket client (`bot/utils/stt_client.py`)
- ✅ Update voice receiver to send the "final" command
- ⏳ Test transcription quality
- ⏳ Remove old `stt/` directory
## Troubleshooting

### Issue 1: CUDA Not Working (Falling Back to CPU)

**Symptoms:**

```
[E:onnxruntime:Default] Failed to load library libonnxruntime_providers_cuda.so
with error: libcudnn.so.9: cannot open shared object file
```

**Cause:** ONNX Runtime GPU requires cuDNN 9, but the CUDA 12.1 base image only ships cuDNN 8.

**Fix:** Update the Dockerfile base image:

```dockerfile
FROM nvidia/cuda:12.6.2-cudnn-runtime-ubuntu22.04
```

**Verify:**

```bash
docker logs miku-stt 2>&1 | grep "Providers"
# Should show: CUDAExecutionProvider (not just CPUExecutionProvider)
```
### Issue 2: Connection Refused (Port 8000)

**Symptoms:**

```
ConnectionRefusedError: [Errno 111] Connect call failed ('172.20.0.5', 8000)
```

**Cause:** The new ONNX server runs on port 8766, not 8000.

**Fix:** Update `bot/utils/stt_client.py`:

```python
stt_url: str = "ws://miku-stt:8766/ws/stt"  # Changed from 8000
```
### Issue 3: Protocol Mismatch

**Symptoms:** Bot doesn't receive transcripts, or transcripts are empty.

**Cause:** The new ONNX server uses a different WebSocket protocol.

- Old protocol (NeMo): automatic VAD-triggered `partial` and `final` events
- New protocol (ONNX): manual control with the `{"type": "final"}` command

**Fix** (see the sketch below):

- Updated `stt_client._handle_event()` to handle the `transcript` type with the `is_final` flag
- Added a `send_final()` method to request final transcription
- Bot should call `stt_client.send_final()` when the user stops speaking
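A sketch of what those `stt_client.py` changes might look like. Only `_handle_event()` and `send_final()` are named in this doc; the surrounding class and callback structure are assumptions:

```python
# stt_client.py sketch -- illustrative, not the actual file.
import json


class STTClient:
    def __init__(self, websocket, on_transcript):
        self.websocket = websocket
        self.on_transcript = on_transcript  # callback(text: str, is_final: bool)

    def _handle_event(self, raw: str):
        msg = json.loads(raw)
        # New protocol: a single "transcript" type with an is_final flag,
        # replacing the old "partial"/"final" event types.
        if msg.get("type") == "transcript":
            self.on_transcript(msg["text"], msg["is_final"])

    async def send_final(self):
        # Ask the server to run final transcription on its buffered audio.
        await self.websocket.send(json.dumps({"type": "final"}))
```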
## Rollback Plan

If needed, revert `docker-compose.yml`:

```yaml
miku-stt:
  build:
    context: ./stt
    dockerfile: Dockerfile.stt
  # ... rest of old config
```
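For comparison, a sketch of what the new service entry presumably looks like; everything beyond the build context and port is an assumption (GPU reservation syntax in particular), so check the actual `docker-compose.yml`:

```yaml
# Assumed new service entry -- verify against the real compose file.
miku-stt:
  build:
    context: ./stt-parakeet
    dockerfile: Dockerfile
  ports:
    - "8766:8766"
  deploy:
    resources:
      reservations:
        devices:
          - driver: nvidia
            device_ids: ["0"]   # GTX 1660, device 0 (per Notes below)
            capabilities: [gpu]
```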
## Notes

- Model downloads on first run (~600MB)
- Models cached in `./stt-parakeet/models/`
- No word-level timestamps (the ONNX model doesn't provide them)
- VAD handled internally (no need for external VAD integration)
- Uses same GPU (GTX 1660, device 0) as before