Quick Start Guide

🚀 Getting Started in 5 Minutes

1. Setup Environment

# Make setup script executable and run it
chmod +x setup_env.sh
./setup_env.sh

The setup script will:

  • Create a virtual environment
  • Install all dependencies including onnx-asr
  • Check CUDA/GPU availability
  • Run system diagnostics
  • Optionally download the Parakeet model

2. Activate Virtual Environment

source venv/bin/activate

3. Test Your Setup

Run diagnostics to verify everything is working:

python3 tools/diagnose.py

The output should show:

  • ✓ Python 3.10+
  • ✓ onnx-asr installed
  • ✓ CUDAExecutionProvider available
  • ✓ GPU detected

4. Test Offline Transcription

Create a test audio file or use an existing WAV file:

python3 tools/test_offline.py test.wav

5. Start Real-Time Streaming

Terminal 1 - Start Server:

python3 server/ws_server.py

Terminal 2 - Start Client:

# List audio devices first
python3 client/mic_stream.py --list-devices

# Start streaming with your microphone
python3 client/mic_stream.py --device 0

🎯 Common Commands

Offline Transcription

# Basic transcription
python3 tools/test_offline.py audio.wav

# With Voice Activity Detection (for long files)
python3 tools/test_offline.py audio.wav --use-vad

# With quantization (faster, uses less memory)
python3 tools/test_offline.py audio.wav --quantization int8

WebSocket Server

# Start server on default port (8765)
python3 server/ws_server.py

# Custom host and port
python3 server/ws_server.py --host 0.0.0.0 --port 9000

# With VAD enabled
python3 server/ws_server.py --use-vad

Microphone Client

# List available audio devices
python3 client/mic_stream.py --list-devices

# Connect to server
python3 client/mic_stream.py --url ws://localhost:8765

# Use specific device
python3 client/mic_stream.py --device 2

# Custom sample rate
python3 client/mic_stream.py --sample-rate 16000
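Device indices can shift between reboots, so matching by name is often more robust than hard-coding a `--device` number. A small helper sketch (it assumes the device list is a plain list of name strings, a simplification of whatever `--list-devices` prints):

```python
def find_input_device(device_names: list[str], substring: str) -> int:
    """Return the index of the first device whose name contains the
    given substring (case-insensitive)."""
    needle = substring.lower()
    for index, name in enumerate(device_names):
        if needle in name.lower():
            return index
    raise LookupError(f"no audio device matching {substring!r}")
```

For example, `find_input_device(["HDA Intel PCH", "USB Microphone"], "usb")` returns 1.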

🔧 Troubleshooting

GPU Not Detected

  1. Check NVIDIA driver:

    nvidia-smi
    
  2. Check CUDA version:

    nvcc --version
    
  3. Verify ONNX Runtime can see GPU:

    python3 -c "import onnxruntime as ort; print(ort.get_available_providers())"
    

    The output should include CUDAExecutionProvider
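If CUDAExecutionProvider is missing, ONNX Runtime silently falls back to the CPU. A sketch of an explicit preference check you could add to the pipeline (the provider names are real ONNX Runtime identifiers; the preference order is an assumption):

```python
def pick_provider(available: list[str]) -> str:
    """Choose the best available execution provider, preferring GPU."""
    for preferred in ("TensorrtExecutionProvider",
                      "CUDAExecutionProvider",
                      "CPUExecutionProvider"):
        if preferred in available:
            return preferred
    raise RuntimeError(f"no usable provider in {available}")
```

In practice you would pass it `ort.get_available_providers()` from onnxruntime.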

Out of Memory

If you get CUDA out of memory errors:

  1. Use quantization:

    python3 tools/test_offline.py audio.wav --quantization int8
    
  2. Close other GPU applications

  3. Reduce GPU memory limit in asr/asr_pipeline.py:

    "gpu_mem_limit": 4 * 1024 * 1024 * 1024,  # 4GB instead of 6GB
    

Microphone Not Working

  1. Check permissions:

    sudo usermod -a -G audio $USER
    # Then logout and login again
    
  2. Test with system audio recorder first

  3. List and test devices:

    python3 client/mic_stream.py --list-devices
    

Model Download Fails

If Hugging Face is slow or blocked:

  1. Set an HF token (optional; useful for gated models and to avoid rate limits):

    export HF_TOKEN="your_huggingface_token"
    
  2. Manual download:

    # With the Hugging Face CLI (assumes huggingface_hub is installed):
    huggingface-cli download istupakov/parakeet-tdt-0.6b-v3-onnx --local-dir models/parakeet
    # Or download from: https://huggingface.co/istupakov/parakeet-tdt-0.6b-v3-onnx
    # and extract to: models/parakeet/
    

📊 Performance Tips

For Best GPU Performance

  1. Use the TensorRT provider (typically faster than CUDA):

    pip install tensorrt tensorrt-cu12-libs
    

    Then edit asr/asr_pipeline.py to use the TensorRT provider

  2. Enable FP16 precision (TensorRT provider only):

    providers = [
        ("TensorrtExecutionProvider", {
            "trt_fp16_enable": True,
        })
    ]
    
  3. Enable quantization:

    --quantization int8  # Good balance
    --quantization fp16  # Better quality
    

For Lower Latency Streaming

  1. Reduce chunk duration in client:

    python3 client/mic_stream.py --chunk-duration 0.05
    
  2. Disable VAD for shorter responses

  3. Use quantized model for faster processing
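For 16-bit mono PCM at 16 kHz, the chunk duration translates directly into bytes sent per WebSocket message. A quick sketch of the arithmetic:

```python
def chunk_bytes(sample_rate: int, chunk_duration: float,
                bytes_per_sample: int = 2) -> int:
    """Bytes of mono PCM audio in one streaming chunk."""
    return int(sample_rate * chunk_duration) * bytes_per_sample
```

At 16 kHz, a 0.05 s chunk is 800 samples, i.e. 1600 bytes per message.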

🎤 Audio File Requirements

Supported Formats

  • Format: WAV (PCM_16, PCM_24, PCM_32, PCM_U8)
  • Sample Rate: 16000 Hz (recommended)
  • Channels: Mono (stereo will be converted to mono)
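The requirements above can be checked with Python's standard wave module before sending a file to the transcriber. A sketch (it handles plain PCM WAVs only, not compressed variants):

```python
import wave

def check_wav(path: str) -> list[str]:
    """Return a list of problems; an empty list means the file matches
    the recommended format (16 kHz, mono)."""
    problems = []
    with wave.open(path, "rb") as wav:
        if wav.getframerate() != 16000:
            problems.append(f"sample rate is {wav.getframerate()} Hz, expected 16000")
        if wav.getnchannels() != 1:
            problems.append(f"{wav.getnchannels()} channels, expected mono")
    return problems
```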

Convert Audio Files

# Using ffmpeg
ffmpeg -i input.mp3 -ar 16000 -ac 1 output.wav

# Using sox
sox input.mp3 -r 16000 -c 1 output.wav
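If ffmpeg or sox is unavailable, stereo-to-mono downmixing is simple enough to do in pure Python by averaging the interleaved left/right samples. A sketch for 16-bit PCM only (it does not resample):

```python
from array import array

def stereo_to_mono(pcm16: bytes) -> bytes:
    """Average interleaved L/R int16 samples into a mono int16 stream."""
    samples = array("h")
    samples.frombytes(pcm16)
    mono = array("h", ((samples[i] + samples[i + 1]) // 2
                       for i in range(0, len(samples), 2)))
    return mono.tobytes()
```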

📝 Example Workflow

Complete example for transcribing a meeting recording:

# 1. Activate environment
source venv/bin/activate

# 2. Convert audio to correct format
ffmpeg -i meeting.mp3 -ar 16000 -ac 1 meeting.wav

# 3. Transcribe with VAD (for long recordings)
python3 tools/test_offline.py meeting.wav --use-vad

# Output will show transcription with automatic segmentation

🌐 Supported Languages

The Parakeet TDT 0.6B V3 model supports 25 European languages, including:

  • English
  • Spanish
  • French
  • German
  • Italian
  • Portuguese
  • Russian
  • Ukrainian
  • Polish
  • Dutch
  • And more...

The model automatically detects the language.

💡 Tips

  1. For short audio clips (<30 seconds): Don't use VAD
  2. For long audio files: Use --use-vad flag
  3. For real-time streaming: Keep chunks small (0.1-0.5 seconds)
  4. For best accuracy: Use 16kHz mono WAV files
  5. For faster inference: Use --quantization int8

📚 More Information

  • See README.md for detailed documentation
  • Run python3 tools/diagnose.py for system check
  • Check logs for debugging information

🆘 Getting Help

If you encounter issues:

  1. Run diagnostics:

    python3 tools/diagnose.py
    
  2. Check the logs in the terminal output

  3. Verify your audio format and sample rate

  4. Review the troubleshooting section above