Decided on Parakeet ONNX Runtime. Works pretty great. Realtime voice chat possible now. UX lacking.

2026-01-19 00:29:44 +02:00
parent 0a8910fff8
commit 362108f4b0
34 changed files with 4593 additions and 73 deletions

stt-parakeet/README.md Normal file

@@ -0,0 +1,280 @@
# Parakeet ASR with ONNX Runtime
Real-time Automatic Speech Recognition (ASR) system using NVIDIA's Parakeet TDT 0.6B V3 model via the `onnx-asr` library, optimized for NVIDIA GPUs (GTX 1660 and better).
## Features
- **ONNX Runtime with GPU acceleration** (CUDA/TensorRT support)
- **Parakeet TDT 0.6B V3** multilingual model from Hugging Face
- **Real-time streaming** via WebSocket server
- **Voice Activity Detection** (Silero VAD)
- **Microphone client** for live transcription
- **Offline transcription** from audio files
- **Quantization support** (int8, fp16) for faster inference
## Model Information
This implementation uses:
- **Model**: `nemo-parakeet-tdt-0.6b-v3` (Multilingual)
- **Source**: https://huggingface.co/istupakov/parakeet-tdt-0.6b-v3-onnx
- **Library**: https://github.com/istupakov/onnx-asr
- **Original Model**: https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3
## System Requirements
- **GPU**: NVIDIA GPU with CUDA support (tested on GTX 1660)
- **CUDA**: Version 11.8 or 12.x
- **Python**: 3.10 or higher
- **Memory**: At least 4GB GPU memory recommended
## Installation
### 1. Enter the project directory
```bash
cd /home/koko210Serve/parakeet-test
```
### 2. Create virtual environment
```bash
python3 -m venv venv
source venv/bin/activate
```
### 3. Install CUDA dependencies
Make sure you have CUDA installed. For Ubuntu:
```bash
# Check CUDA version
nvcc --version
# If you need to install CUDA, follow NVIDIA's instructions:
# https://developer.nvidia.com/cuda-downloads
```
### 4. Install Python dependencies
```bash
pip install --upgrade pip
pip install -r requirements.txt
```
Or manually:
```bash
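# With GPU support (recommended); quoted so the shell does not expand the brackets
pip install "onnx-asr[gpu,hub]"
# Additional dependencies; quoted so the shell does not treat "<" as a redirect
pip install "numpy<2.0" websockets sounddevice soundfile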
```
### 5. Verify CUDA availability
```bash
python3 -c "import onnxruntime as ort; print('Available providers:', ort.get_available_providers())"
```
You should see `CUDAExecutionProvider` in the list.
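
The same check can be made from Python so that a script fails early instead of silently falling back to the CPU. This is a minimal sketch: `choose_providers` and `detect_providers` are hypothetical helpers, not part of this repository, and only `ort.get_available_providers()` comes from the command above.

```python
def choose_providers(available, require_gpu=False):
    """Return an ordered provider list, preferring CUDA over CPU.

    `available` is a list of provider names such as the one returned by
    onnxruntime.get_available_providers().
    """
    preferred = ["CUDAExecutionProvider", "CPUExecutionProvider"]
    chosen = [name for name in preferred if name in available]
    if require_gpu and "CUDAExecutionProvider" not in chosen:
        raise RuntimeError("CUDAExecutionProvider is not available")
    if not chosen:
        raise RuntimeError("no usable execution provider found")
    return chosen


def detect_providers(require_gpu=False):
    """Query ONNX Runtime (imported lazily) and pick providers."""
    import onnxruntime as ort

    return choose_providers(ort.get_available_providers(), require_gpu)
```

Calling `detect_providers(require_gpu=True)` at startup would raise a `RuntimeError` on a machine where only the CPU provider is installed.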
## Usage
### Test Offline Transcription
Transcribe an audio file:
```bash
python3 tools/test_offline.py test.wav
```
With VAD (for long audio files):
```bash
python3 tools/test_offline.py test.wav --use-vad
```
With quantization (faster, less memory):
```bash
python3 tools/test_offline.py test.wav --quantization int8
```
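
For reference, the options above map roughly onto the `onnx-asr` Python API. The sketch below is not the contents of `tools/test_offline.py`; the `transcribe` helper is hypothetical, and it assumes that segments returned in VAD mode expose a `text` attribute and that the quantization values listed under Features are the ones accepted.

```python
# Quantization values named in the Features list (None means full precision).
SUPPORTED_QUANTIZATION = (None, "int8", "fp16")


def transcribe(path, quantization=None, use_vad=False):
    """Transcribe one audio file with Parakeet TDT 0.6B V3 via onnx-asr."""
    if quantization not in SUPPORTED_QUANTIZATION:
        raise ValueError(f"unsupported quantization: {quantization!r}")

    # Imported lazily so argument validation works without the library.
    import onnx_asr

    model = onnx_asr.load_model(
        "nemo-parakeet-tdt-0.6b-v3", quantization=quantization
    )
    if use_vad:
        # With VAD, long audio is split into speech segments first.
        vad = onnx_asr.load_vad("silero")
        segments = model.with_vad(vad).recognize(path)
        return " ".join(segment.text for segment in segments)
    return model.recognize(path)
```

For example, `transcribe("test.wav", quantization="int8")` would correspond to the last command above.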
### Start WebSocket Server
Start the ASR server:
```bash
python3 server/ws_server.py
```
With options:
```bash
python3 server/ws_server.py --host 0.0.0.0 --port 8765 --use-vad
```
### Start Microphone Client
In a separate terminal, start the microphone client:
```bash
python3 client/mic_stream.py
```
List available audio devices:
```bash
python3 client/mic_stream.py --list-devices
```
Connect to a specific device:
```bash
python3 client/mic_stream.py --device 0
```
## Project Structure
```
parakeet-test/
├── asr/
│   ├── __init__.py
│   └── asr_pipeline.py    # Main ASR pipeline using onnx-asr
├── client/
│   ├── __init__.py
│   └── mic_stream.py      # Microphone streaming client
├── server/
│   ├── __init__.py
│   └── ws_server.py       # WebSocket server for streaming ASR
├── vad/
│   ├── __init__.py
│   └── silero_vad.py      # VAD wrapper using onnx-asr
├── tools/
│   ├── test_offline.py    # Test offline transcription
│   └── diagnose.py        # System diagnostics
├── models/
│   └── parakeet/          # Model files (auto-downloaded)
├── requirements.txt       # Python dependencies
└── README.md              # This file
```
## Model Files
Model files are downloaded automatically from Hugging Face on first run to:
```
models/parakeet/
├── config.json
├── encoder-parakeet-tdt-0.6b-v3.onnx
├── decoder_joint-parakeet-tdt-0.6b-v3.onnx
└── vocab.txt
```
## Configuration
### GPU Settings
The ASR pipeline is configured to use CUDA by default. You can customize the execution providers in `asr/asr_pipeline.py`:
```python
providers = [
    (
        "CUDAExecutionProvider",
        {
            "device_id": 0,
            "arena_extend_strategy": "kNextPowerOfTwo",
            "gpu_mem_limit": 6 * 1024 * 1024 * 1024,  # 6GB
            "cudnn_conv_algo_search": "EXHAUSTIVE",
            "do_copy_in_default_stream": True,
        },
    ),
    "CPUExecutionProvider",
]
```
```
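
On a 6GB card such as the GTX 1660, a 6GB `gpu_mem_limit` may leave no headroom for the display or other processes. One way to derive the limit from the card size is sketched below; `cuda_providers` and its parameters are hypothetical, and the 1 GiB default headroom is an assumption, not a measured value.

```python
GIB = 1024 ** 3  # bytes per GiB


def cuda_providers(total_vram_gib=6.0, headroom_gib=1.0, device_id=0):
    """Build a CUDA-first provider list with a memory cap below total VRAM."""
    if headroom_gib < 0 or headroom_gib >= total_vram_gib:
        raise ValueError("headroom must be >= 0 and smaller than total VRAM")
    limit_bytes = int((total_vram_gib - headroom_gib) * GIB)
    return [
        (
            "CUDAExecutionProvider",
            {
                "device_id": device_id,
                "arena_extend_strategy": "kNextPowerOfTwo",
                "gpu_mem_limit": limit_bytes,
                "cudnn_conv_algo_search": "EXHAUSTIVE",
                "do_copy_in_default_stream": True,
            },
        ),
        "CPUExecutionProvider",  # fallback if CUDA cannot be initialised
    ]
```

The returned list has the same shape as the `providers` value shown above, so it could be passed wherever that list is used.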
### TensorRT (Optional - Faster Inference)
For even better performance, you can use TensorRT:
```bash
pip install tensorrt tensorrt-cu12-libs
```
Then modify the providers:
```python
providers = [
    (
        "TensorrtExecutionProvider",
        {
            "trt_max_workspace_size": 6 * 1024**3,
            "trt_fp16_enable": True,
        },
    )
]
```
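
A TensorRT-only list has no fallback if the TensorRT provider fails to load, and engines are rebuilt on every start unless caching is enabled. The sketch below adds both; `tensorrt_providers` is a hypothetical helper and the `trt_cache` directory name is an assumption, while `trt_engine_cache_enable` and `trt_engine_cache_path` are standard ONNX Runtime TensorRT provider options.

```python
def tensorrt_providers(cache_dir="trt_cache", workspace_gib=6, fp16=True):
    """Provider list that tries TensorRT, then CUDA, then CPU."""
    return [
        (
            "TensorrtExecutionProvider",
            {
                "trt_max_workspace_size": workspace_gib * 1024**3,
                "trt_fp16_enable": fp16,
                # Reuse built engines across runs instead of rebuilding them.
                "trt_engine_cache_enable": True,
                "trt_engine_cache_path": cache_dir,
            },
        ),
        "CUDAExecutionProvider",
        "CPUExecutionProvider",
    ]
```

ONNX Runtime tries providers in list order, so nodes that TensorRT cannot handle would run on CUDA and then on the CPU.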
## Troubleshooting
### CUDA Not Available
If CUDA is not detected:
1. Check CUDA installation: `nvcc --version`
2. Verify GPU: `nvidia-smi`
3. Reinstall onnxruntime-gpu:
```bash
pip uninstall onnxruntime onnxruntime-gpu
pip install onnxruntime-gpu
```
### Memory Issues
If you run out of GPU memory:
1. Use quantization: `--quantization int8`
2. Reduce `gpu_mem_limit` in the configuration
3. Close other GPU-using applications
### Audio Issues
If the microphone is not working:
1. List devices: `python3 client/mic_stream.py --list-devices`
2. Select the correct device: `--device <id>`
3. Check permissions: `sudo usermod -a -G audio $USER` (then log out and back in)
### Slow Performance
1. Ensure GPU is being used (check logs for "CUDAExecutionProvider")
2. Try quantization for faster inference
3. Consider using TensorRT provider
4. Check GPU utilization: `nvidia-smi`
## Performance
Expected performance on GTX 1660 (6GB):
- **Offline transcription**: ~50-100x realtime (depending on audio length)
- **Streaming**: <100ms latency
- **Memory usage**: ~2-3GB GPU memory
- **Quantized (int8)**: ~30% faster, ~50% less memory
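
The "x realtime" figure is the audio duration divided by the processing time. A small sketch for measuring it on your own hardware follows; `realtime_speed` and `time_call` are hypothetical helpers, not part of this repository.

```python
import time


def realtime_speed(audio_seconds, processing_seconds):
    """Return how many seconds of audio are processed per second of compute."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


def time_call(fn, *args, **kwargs):
    """Run fn and return (result, elapsed_seconds) using a monotonic clock."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start
```

For instance, a 60-second file transcribed in 1 second gives `realtime_speed(60.0, 1.0) == 60.0`, which falls inside the range quoted above.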
## License
This project uses:
- `onnx-asr`: MIT License
- Parakeet model: CC-BY-4.0 License
## References
- [onnx-asr GitHub](https://github.com/istupakov/onnx-asr)
- [Parakeet TDT 0.6B V3 ONNX](https://huggingface.co/istupakov/parakeet-tdt-0.6b-v3-onnx)
- [NVIDIA Parakeet](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3)
- [ONNX Runtime](https://onnxruntime.ai/)
## Credits
- Model conversion by [istupakov](https://github.com/istupakov)
- Original Parakeet model by NVIDIA