245 lines
6.6 KiB
Markdown
245 lines
6.6 KiB
Markdown
# Refactoring Summary
|
|
|
|
## Overview
|
|
|
|
Successfully refactored the Parakeet ASR codebase to use the `onnx-asr` library with ONNX Runtime GPU support for NVIDIA GTX 1660.
|
|
|
|
## Changes Made
|
|
|
|
### 1. Dependencies (`requirements.txt`)
|
|
- **Removed**: `onnxruntime-gpu`, `silero-vad`
|
|
- **Added**: `onnx-asr[gpu,hub]`, `soundfile`
|
|
- **Kept**: `numpy<2.0`, `websockets`, `sounddevice`
|
|
|
|
### 2. ASR Pipeline (`asr/asr_pipeline.py`)
|
|
- Completely refactored to use `onnx_asr.load_model()`
|
|
- Added support for:
|
|
- GPU acceleration via CUDA/TensorRT
|
|
- Model quantization (int8, fp16)
|
|
- Voice Activity Detection (VAD)
|
|
- Batch processing
|
|
- Streaming audio chunks
|
|
- Configurable execution providers for GPU optimization
|
|
- Automatic model download from Hugging Face
|
|
|
|
### 3. VAD Module (`vad/silero_vad.py`)
|
|
- Refactored to use `onnx_asr.load_vad()`
|
|
- Integrated Silero VAD via onnx-asr
|
|
- Simplified API for VAD operations
|
|
- Note: VAD is best used via `model.with_vad()` method
|
|
|
|
### 4. WebSocket Server (`server/ws_server.py`)
|
|
- Created from scratch for streaming ASR
|
|
- Features:
|
|
- Real-time audio streaming
|
|
- JSON-based protocol
|
|
- Support for multiple concurrent connections
|
|
- Buffer management for audio chunks
|
|
- Error handling and logging
|
|
|
|
### 5. Microphone Client (`client/mic_stream.py`)
|
|
- Created streaming client using `sounddevice`
|
|
- Features:
|
|
- Real-time microphone capture
|
|
- WebSocket streaming to server
|
|
- Audio device selection
|
|
- Automatic format conversion (float32 to int16)
|
|
- Async communication
|
|
|
|
### 6. Test Script (`tools/test_offline.py`)
|
|
- Completely rewritten for onnx-asr
|
|
- Features:
|
|
- Command-line interface
|
|
- Support for WAV files
|
|
- Optional VAD and quantization
|
|
- Audio statistics and diagnostics
|
|
|
|
### 7. Diagnostics Tool (`tools/diagnose.py`)
|
|
- New comprehensive system check tool
|
|
- Checks:
|
|
- Python version
|
|
- Installed packages
|
|
- CUDA availability
|
|
- ONNX Runtime providers
|
|
- Audio devices
|
|
- Model files
|
|
|
|
### 8. Setup Script (`setup_env.sh`)
|
|
- Automated setup script
|
|
- Features:
|
|
- Virtual environment creation
|
|
- Dependency installation
|
|
- CUDA/GPU detection
|
|
- System diagnostics
|
|
- Optional model download
|
|
|
|
### 9. Documentation
|
|
- **README.md**: Comprehensive documentation with:
|
|
- Installation instructions
|
|
- Usage examples
|
|
- Configuration options
|
|
- Troubleshooting guide
|
|
- Performance tips
|
|
|
|
- **QUICKSTART.md**: Quick start guide with:
|
|
- 5-minute setup
|
|
- Common commands
|
|
- Troubleshooting
|
|
- Performance optimization
|
|
|
|
- **example.py**: Simple usage example
|
|
|
|
## Key Benefits
|
|
|
|
### 1. GPU Optimization
|
|
- Native CUDA support via ONNX Runtime
|
|
- Configurable GPU memory limits
|
|
- Optional TensorRT for even faster inference
|
|
- Automatic fallback to CPU if GPU unavailable
|
|
|
|
### 2. Simplified Model Management
|
|
- Automatic model download from Hugging Face
|
|
- No manual ONNX export needed
|
|
- Pre-converted models ready to use
|
|
- Support for quantized versions
|
|
|
|
### 3. Better Performance
|
|
- Optimized ONNX inference
|
|
- GPU acceleration on GTX 1660
|
|
- ~50-100x realtime on GPU
|
|
- Reduced memory usage with quantization
|
|
|
|
### 4. Improved Usability
|
|
- Simpler API
|
|
- Better error handling
|
|
- Comprehensive logging
|
|
- Easy configuration
|
|
|
|
### 5. Modern Features
|
|
- WebSocket streaming
|
|
- Real-time transcription
|
|
- VAD integration
|
|
- Batch processing
|
|
|
|
## Model Information
|
|
|
|
- **Model**: Parakeet TDT 0.6B V3 (Multilingual)
|
|
- **Source**: https://huggingface.co/istupakov/parakeet-tdt-0.6b-v3-onnx
|
|
- **Size**: ~600MB
|
|
- **Languages**: 25+ languages
|
|
- **Location**: `models/parakeet/` (auto-downloaded)
|
|
|
|
## File Structure
|
|
|
|
```
|
|
parakeet-test/
|
|
├── asr/
|
|
│ ├── __init__.py ✓ Updated
|
|
│ └── asr_pipeline.py ✓ Refactored
|
|
├── client/
|
|
│ ├── __init__.py ✓ Updated
|
|
│ └── mic_stream.py ✓ New
|
|
├── server/
|
|
│ ├── __init__.py ✓ Updated
|
|
│ └── ws_server.py ✓ New
|
|
├── vad/
|
|
│ ├── __init__.py ✓ Updated
|
|
│ └── silero_vad.py ✓ Refactored
|
|
├── tools/
|
|
│ ├── diagnose.py ✓ New
|
|
│ └── test_offline.py ✓ Refactored
|
|
├── models/
|
|
│ └── parakeet/ ✓ Auto-created
|
|
├── requirements.txt ✓ Updated
|
|
├── setup_env.sh ✓ New
|
|
├── README.md ✓ New
|
|
├── QUICKSTART.md ✓ New
|
|
├── example.py ✓ New
|
|
├── .gitignore ✓ New
|
|
└── REFACTORING.md ✓ This file
|
|
```
|
|
|
|
## Migration from Old Code
|
|
|
|
### Old Code Pattern:
|
|
```python
|
|
# Manual ONNX session creation
|
|
import onnxruntime as ort
|
|
session = ort.InferenceSession("encoder.onnx", providers=["CUDAExecutionProvider"])
|
|
# Manual preprocessing and decoding
|
|
```
|
|
|
|
### New Code Pattern:
|
|
```python
|
|
# Simple onnx-asr interface
|
|
import onnx_asr
|
|
model = onnx_asr.load_model("nemo-parakeet-tdt-0.6b-v3")
|
|
text = model.recognize("audio.wav")
|
|
```
|
|
|
|
## Testing Instructions
|
|
|
|
### 1. Setup
|
|
```bash
|
|
./setup_env.sh
|
|
source venv/bin/activate
|
|
```
|
|
|
|
### 2. Run Diagnostics
|
|
```bash
|
|
python3 tools/diagnose.py
|
|
```
|
|
|
|
### 3. Test Offline
|
|
```bash
|
|
python3 tools/test_offline.py test.wav
|
|
```
|
|
|
|
### 4. Test Streaming
|
|
```bash
|
|
# Terminal 1
|
|
python3 server/ws_server.py
|
|
|
|
# Terminal 2
|
|
python3 client/mic_stream.py
|
|
```
|
|
|
|
## Known Limitations
|
|
|
|
1. **Audio Format**: Only WAV files with PCM encoding supported directly
|
|
2. **Segment Length**: Models work best with <30 second segments
|
|
3. **GPU Memory**: Requires at least 2-3GB GPU memory
|
|
4. **Sample Rate**: 16kHz recommended for best results
|
|
|
|
## Future Enhancements
|
|
|
|
Possible improvements:
|
|
- [ ] Add support for other audio formats (MP3, FLAC, etc.)
|
|
- [ ] Implement beam search decoding
|
|
- [ ] Add language selection option
|
|
- [ ] Support for speaker diarization
|
|
- [ ] REST API in addition to WebSocket
|
|
- [ ] Docker containerization
|
|
- [ ] Batch file processing script
|
|
- [ ] Real-time visualization of transcription
|
|
|
|
## References
|
|
|
|
- [onnx-asr GitHub](https://github.com/istupakov/onnx-asr)
|
|
- [onnx-asr Documentation](https://istupakov.github.io/onnx-asr/)
|
|
- [Parakeet ONNX Model](https://huggingface.co/istupakov/parakeet-tdt-0.6b-v3-onnx)
|
|
- [Original Parakeet Model](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3)
|
|
- [ONNX Runtime](https://onnxruntime.ai/)
|
|
|
|
## Support
|
|
|
|
For issues related to:
|
|
- **onnx-asr library**: https://github.com/istupakov/onnx-asr/issues
|
|
- **This implementation**: Check logs and run diagnose.py
|
|
- **GPU/CUDA issues**: Verify nvidia-smi and CUDA installation
|
|
|
|
---
|
|
|
|
**Refactoring completed on**: January 18, 2026
|
|
**Primary changes**: Migration to onnx-asr library for simplified ONNX inference with GPU support
|