stt-parakeet/REFACTORING.md

# Refactoring Summary

## Overview

The Parakeet ASR codebase has been refactored to use the `onnx-asr` library, with GPU inference through ONNX Runtime on an NVIDIA GTX 1660.

## Changes Made

### 1. Dependencies (`requirements.txt`)
- **Removed**: `onnxruntime-gpu`, `silero-vad`
- **Added**: `onnx-asr[gpu,hub]`, `soundfile`
- **Kept**: `numpy<2.0`, `websockets`, `sounddevice`

### 2. ASR Pipeline (`asr/asr_pipeline.py`)
- Completely refactored to use `onnx_asr.load_model()`
- Added support for:
  - GPU acceleration via CUDA/TensorRT
  - Model quantization (int8, fp16)
  - Voice Activity Detection (VAD)
  - Batch processing
  - Streaming audio chunks
- Configurable execution providers for GPU optimization
- Automatic model download from Hugging Face
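
As a rough sketch of how the execution providers might be configured, the helper below orders ONNX Runtime providers so that CPU is always the last fallback. `choose_providers` and `load_parakeet` are hypothetical names, not taken from `asr_pipeline.py`, and the `quantization` and `providers` keyword arguments are assumed to match the installed version of onnx-asr.

```python
from typing import Optional, Sequence

CPU = "CPUExecutionProvider"
CUDA = "CUDAExecutionProvider"
TENSORRT = "TensorrtExecutionProvider"


def choose_providers(available: Sequence[str], prefer_tensorrt: bool = False) -> list[str]:
    """Order providers from fastest to slowest, always ending with CPU."""
    chosen = []
    if prefer_tensorrt and TENSORRT in available:
        chosen.append(TENSORRT)
    if CUDA in available:
        chosen.append(CUDA)
    chosen.append(CPU)  # CPU is kept as the final fallback
    return chosen


def load_parakeet(quantization: Optional[str] = None, prefer_tensorrt: bool = False):
    """Load the model on the best available provider (hypothetical wrapper)."""
    # Imported lazily so choose_providers() works without the GPU stack installed.
    import onnx_asr
    import onnxruntime as ort

    providers = choose_providers(ort.get_available_providers(), prefer_tensorrt)
    return onnx_asr.load_model(
        "nemo-parakeet-tdt-0.6b-v3",
        quantization=quantization,  # e.g. "int8"; None selects the default weights
        providers=providers,        # assumed keyword; check your onnx-asr version
    )
```

Keeping the ordering logic in a pure function makes it easy to test on a machine without a GPU.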

### 3. VAD Module (`vad/silero_vad.py`)
- Refactored to use `onnx_asr.load_vad()`
- Integrated Silero VAD via onnx-asr
- Simplified API for VAD operations
- Note: VAD is best used via the `model.with_vad()` method

### 4. WebSocket Server (`server/ws_server.py`)
- Created from scratch for streaming ASR
- Features:
  - Real-time audio streaming
  - JSON-based protocol
  - Support for multiple concurrent connections
  - Buffer management for audio chunks
  - Error handling and logging
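
The buffer management and JSON protocol listed above could look roughly like the following. This is a minimal sketch, not the code in `ws_server.py`: the `AudioBuffer` class, the one-second chunk size, and the message fields (`type`, `text`, `is_final`) are all assumptions made for illustration.

```python
import json

SAMPLE_RATE = 16_000   # Hz, assumed to match the model's expected input
BYTES_PER_SAMPLE = 2   # int16 PCM


class AudioBuffer:
    """Accumulate raw int16 PCM bytes and hand them out in fixed-size chunks."""

    def __init__(self, chunk_seconds: float = 1.0) -> None:
        self._chunk_bytes = int(SAMPLE_RATE * chunk_seconds) * BYTES_PER_SAMPLE
        self._data = bytearray()

    def push(self, pcm: bytes) -> list[bytes]:
        """Append incoming audio and return every complete chunk now available."""
        self._data.extend(pcm)
        chunks = []
        while len(self._data) >= self._chunk_bytes:
            chunks.append(bytes(self._data[: self._chunk_bytes]))
            del self._data[: self._chunk_bytes]
        return chunks

    def flush(self) -> bytes:
        """Return whatever is left, e.g. when the client disconnects."""
        rest = bytes(self._data)
        self._data.clear()
        return rest


def transcript_message(text: str, is_final: bool) -> str:
    """Serialize a transcription result as a JSON text frame (hypothetical schema)."""
    return json.dumps({"type": "transcript", "text": text, "is_final": is_final})
```

Giving each connection its own `AudioBuffer` instance is one simple way to keep concurrent clients from mixing audio.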

### 5. Microphone Client (`client/mic_stream.py`)
- Created streaming client using `sounddevice`
- Features:
  - Real-time microphone capture
  - WebSocket streaming to server
  - Audio device selection
  - Automatic format conversion (float32 to int16)
  - Async communication
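
The float32 to int16 conversion can be sketched as below. The function name is hypothetical and the little-endian byte order is an assumption about the wire format; it uses only the standard library so it runs without the audio stack installed.

```python
import struct


def float32_to_int16_bytes(samples) -> bytes:
    """Convert float samples in [-1.0, 1.0] to little-endian int16 PCM bytes."""
    # Clip first so out-of-range input cannot overflow the 16-bit range.
    ints = [int(round(max(-1.0, min(1.0, s)) * 32767)) for s in samples]
    return struct.pack(f"<{len(ints)}h", *ints)
```

With NumPy, which the project already depends on, a roughly equivalent one-liner is `(np.clip(x, -1.0, 1.0) * 32767).astype(np.int16).tobytes()`.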

### 6. Test Script (`tools/test_offline.py`)
- Completely rewritten for onnx-asr
- Features:
  - Command-line interface
  - Support for WAV files
  - Optional VAD and quantization
  - Audio statistics and diagnostics

### 7. Diagnostics Tool (`tools/diagnose.py`)
- New comprehensive system check tool
- Checks:
  - Python version
  - Installed packages
  - CUDA availability
  - ONNX Runtime providers
  - Audio devices
  - Model files

### 8. Setup Script (`setup_env.sh`)
- Automated setup script
- Features:
  - Virtual environment creation
  - Dependency installation
  - CUDA/GPU detection
  - System diagnostics
  - Optional model download

### 9. Documentation
- **README.md**: Comprehensive documentation with:
  - Installation instructions
  - Usage examples
  - Configuration options
  - Troubleshooting guide
  - Performance tips
  
- **QUICKSTART.md**: Quick start guide with:
  - 5-minute setup
  - Common commands
  - Troubleshooting
  - Performance optimization
  
- **example.py**: Simple usage example

## Key Benefits

### 1. GPU Optimization
- Native CUDA support via ONNX Runtime
- Configurable GPU memory limits
- Optional TensorRT for even faster inference
- Automatic fallback to CPU if GPU unavailable

### 2. Simplified Model Management
- Automatic model download from Hugging Face
- No manual ONNX export needed
- Pre-converted models ready to use
- Support for quantized versions

### 3. Better Performance
- Optimized ONNX inference
- GPU acceleration on GTX 1660
- Roughly 50-100x faster than real time on GPU
- Reduced memory usage with quantization
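
A speed figure of this kind is the ratio of audio duration to processing time. The helpers below are a minimal sketch for measuring it on your own hardware; both function names are hypothetical.

```python
import time


def realtime_speed(audio_seconds: float, processing_seconds: float) -> float:
    """Return how many seconds of audio are transcribed per second of compute."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed_seconds) using a monotonic clock."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start
```

For example, `text, elapsed = timed(model.recognize, "audio.wav")` followed by `realtime_speed(clip_seconds, elapsed)`: a 60-second clip transcribed in 1.2 seconds is about 50x real time. The first call usually includes model warm-up, so it is worth timing a second run.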

### 4. Improved Usability
- Simpler API
- Better error handling
- Comprehensive logging
- Easy configuration

### 5. Modern Features
- WebSocket streaming
- Real-time transcription
- VAD integration
- Batch processing

## Model Information

- **Model**: Parakeet TDT 0.6B V3 (Multilingual)
- **Source**: https://huggingface.co/istupakov/parakeet-tdt-0.6b-v3-onnx
- **Size**: ~600MB
- **Languages**: 25+ languages
- **Location**: `models/parakeet/` (auto-downloaded)

## File Structure

```
parakeet-test/
├── asr/
│   ├── __init__.py              ✓ Updated
│   └── asr_pipeline.py          ✓ Refactored
├── client/
│   ├── __init__.py              ✓ Updated
│   └── mic_stream.py            ✓ New
├── server/
│   ├── __init__.py              ✓ Updated
│   └── ws_server.py             ✓ New
├── vad/
│   ├── __init__.py              ✓ Updated
│   └── silero_vad.py            ✓ Refactored
├── tools/
│   ├── diagnose.py              ✓ New
│   └── test_offline.py          ✓ Refactored
├── models/
│   └── parakeet/                ✓ Auto-created
├── requirements.txt             ✓ Updated
├── setup_env.sh                 ✓ New
├── README.md                    ✓ New
├── QUICKSTART.md                ✓ New
├── example.py                   ✓ New
├── .gitignore                   ✓ New
└── REFACTORING.md               ✓ This file
```

## Migration from Old Code

### Old Code Pattern:
```python
# Manual ONNX session creation
import onnxruntime as ort
session = ort.InferenceSession("encoder.onnx", providers=["CUDAExecutionProvider"])
# Manual preprocessing and decoding
```

### New Code Pattern:
```python
# Simple onnx-asr interface
import onnx_asr
model = onnx_asr.load_model("nemo-parakeet-tdt-0.6b-v3")
text = model.recognize("audio.wav")
```

## Testing Instructions

### 1. Setup
```bash
./setup_env.sh
source venv/bin/activate
```

### 2. Run Diagnostics
```bash
python3 tools/diagnose.py
```

### 3. Test Offline
```bash
python3 tools/test_offline.py test.wav
```

### 4. Test Streaming
```bash
# Terminal 1
python3 server/ws_server.py

# Terminal 2
python3 client/mic_stream.py
```

## Known Limitations

1. **Audio Format**: Only PCM-encoded WAV files are supported directly
2. **Segment Length**: Models work best with segments shorter than 30 seconds
3. **GPU Memory**: Requires at least 2-3 GB of GPU memory
4. **Sample Rate**: 16 kHz is recommended for best results
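
To check a file against these limits before transcription, something like the following could be used. It is a sketch built on the standard `wave` module; the function name and the returned keys are hypothetical, and the thresholds simply mirror the list above.

```python
import wave


def inspect_wav(path: str, expected_rate: int = 16_000, max_seconds: float = 30.0) -> dict:
    """Report basic properties of a PCM WAV file and flag likely problems."""
    # wave.open raises wave.Error for formats the stdlib cannot read.
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        duration = wav.getnframes() / rate
        info = {
            "sample_rate": rate,
            "channels": wav.getnchannels(),
            "duration_s": duration,
        }
    info["needs_resample"] = rate != expected_rate
    info["needs_splitting"] = duration > max_seconds  # consider VAD or chunking
    return info
```

Files flagged with `needs_splitting` are good candidates for the VAD path, which breaks long recordings into shorter speech segments.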

## Future Enhancements

Possible improvements:
- [ ] Add support for other audio formats (MP3, FLAC, etc.)
- [ ] Implement beam search decoding
- [ ] Add language selection option
- [ ] Support for speaker diarization
- [ ] REST API in addition to WebSocket
- [ ] Docker containerization
- [ ] Batch file processing script
- [ ] Real-time visualization of transcription

## References

- [onnx-asr GitHub](https://github.com/istupakov/onnx-asr)
- [onnx-asr Documentation](https://istupakov.github.io/onnx-asr/)
- [Parakeet ONNX Model](https://huggingface.co/istupakov/parakeet-tdt-0.6b-v3-onnx)
- [Original Parakeet Model](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3)
- [ONNX Runtime](https://onnxruntime.ai/)

## Support

For issues related to:
- **onnx-asr library**: https://github.com/istupakov/onnx-asr/issues
- **This implementation**: Check the logs and run `tools/diagnose.py`
- **GPU/CUDA issues**: Verify that `nvidia-smi` works and that CUDA is installed

---

- **Refactoring completed on**: January 18, 2026
- **Primary changes**: Migration to onnx-asr library for simplified ONNX inference with GPU support

Decided on Parakeet ONNX Runtime. Works pretty great. Realtime voice chat possible now. UX lacking. 2026-01-19 00:29:44 +02:00