# Refactoring Summary
## Overview
The Parakeet ASR codebase has been refactored to use the `onnx-asr` library with ONNX Runtime GPU support, targeting an NVIDIA GTX 1660.
## Changes Made
### 1. Dependencies (`requirements.txt`)
- **Removed**: `onnxruntime-gpu`, `silero-vad`
- **Added**: `onnx-asr[gpu,hub]`, `soundfile`
- **Kept**: `numpy<2.0`, `websockets`, `sounddevice`
### 2. ASR Pipeline (`asr/asr_pipeline.py`)
- Completely refactored to use `onnx_asr.load_model()`
- Added support for:
  - GPU acceleration via CUDA/TensorRT
  - Model quantization (int8, fp16)
  - Voice Activity Detection (VAD)
  - Batch processing
  - Streaming audio chunks
- Configurable execution providers for GPU optimization
- Automatic model download from Hugging Face
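
The provider configuration and CPU fallback described above could look roughly like the sketch below. `select_providers` and `load_parakeet` are hypothetical helper names used for illustration and are not taken from `asr/asr_pipeline.py`; the `quantization` and `providers` keyword arguments are passed to `onnx_asr.load_model()` as described in the onnx-asr documentation.

```python
# Sketch only: pick ONNX Runtime execution providers with a CPU fallback.
# select_providers and load_parakeet are hypothetical names, not the project's API.

DEFAULT_PREFERENCE = ("CUDAExecutionProvider", "CPUExecutionProvider")


def select_providers(available, preferred=DEFAULT_PREFERENCE):
    """Keep preferred providers that are available, always ending with CPU."""
    chosen = [name for name in preferred if name in available]
    if "CPUExecutionProvider" not in chosen:
        chosen.append("CPUExecutionProvider")
    return chosen


def load_parakeet(quantization=None, use_tensorrt=False):
    """Load the Parakeet model through onnx-asr with the selected providers."""
    # Imported lazily so select_providers stays usable without these packages.
    import onnx_asr
    import onnxruntime as ort

    preferred = list(DEFAULT_PREFERENCE)
    if use_tensorrt:
        # TensorRT is optional, so it is only tried when explicitly requested.
        preferred.insert(0, "TensorrtExecutionProvider")

    providers = select_providers(ort.get_available_providers(), preferred)
    return onnx_asr.load_model(
        "nemo-parakeet-tdt-0.6b-v3",
        quantization=quantization,  # e.g. "int8"; None loads the default weights
        providers=providers,
    )
```

Keeping the selection logic in a pure function makes the fallback behaviour easy to check on a machine without a GPU.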
### 3. VAD Module (`vad/silero_vad.py`)
- Refactored to use `onnx_asr.load_vad()`
- Integrated Silero VAD via onnx-asr
- Simplified API for VAD operations
- Note: VAD is best used via `model.with_vad()` method
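
As a rough illustration of the `model.with_vad()` route mentioned in the note, the sketch below follows the usage shown in the onnx-asr documentation. `transcribe_with_vad` is a hypothetical wrapper name, and the exact shape of each yielded segment result is not assumed here.

```python
# Sketch only: attach Silero VAD to the model via onnx-asr.
# transcribe_with_vad is a hypothetical helper, not part of vad/silero_vad.py.


def transcribe_with_vad(wav_path):
    """Yield one recognition result per speech segment found by the VAD."""
    import onnx_asr  # imported lazily; requires the onnx-asr package

    vad = onnx_asr.load_vad("silero")
    model = onnx_asr.load_model("nemo-parakeet-tdt-0.6b-v3").with_vad(vad)

    # With VAD attached, recognize() is iterated segment by segment.
    for segment in model.recognize(wav_path):
        yield segment
```

A caller would iterate over `transcribe_with_vad("test.wav")` and print each segment as it arrives.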
### 4. WebSocket Server (`server/ws_server.py`)
- Created from scratch for streaming ASR
- Features:
  - Real-time audio streaming
  - JSON-based protocol
  - Support for multiple concurrent connections
  - Buffer management for audio chunks
  - Error handling and logging
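
To make the buffer management and JSON protocol more concrete, here is a minimal sketch of how a per-connection buffer and message helpers might be structured. Everything in it is an assumption for illustration: the `AudioBuffer` class, the message fields (`type`, `text`, `is_final`, `message`), and the 16 kHz int16 format are not taken from `server/ws_server.py`.

```python
import json

SAMPLE_RATE = 16000   # assumed: 16 kHz mono input
BYTES_PER_SAMPLE = 2  # assumed: int16 PCM


class AudioBuffer:
    """Accumulate raw PCM bytes and release fixed-length chunks."""

    def __init__(self, chunk_seconds=1.0):
        self.chunk_bytes = int(SAMPLE_RATE * chunk_seconds) * BYTES_PER_SAMPLE
        self._data = bytearray()

    def append(self, pcm_bytes):
        """Add newly received audio bytes to the buffer."""
        self._data.extend(pcm_bytes)

    def pop_chunks(self):
        """Return every complete chunk currently buffered, oldest first."""
        chunks = []
        while len(self._data) >= self.chunk_bytes:
            chunks.append(bytes(self._data[: self.chunk_bytes]))
            del self._data[: self.chunk_bytes]
        return chunks

    def flush(self):
        """Return whatever is left (possibly empty) and reset the buffer."""
        rest = bytes(self._data)
        self._data.clear()
        return rest


def transcript_message(text, is_final):
    """Build a hypothetical JSON transcript message."""
    return json.dumps({"type": "transcript", "text": text, "is_final": is_final})


def error_message(reason):
    """Build a hypothetical JSON error message."""
    return json.dumps({"type": "error", "message": reason})
```

Giving each connection its own `AudioBuffer` instance keeps concurrent clients from mixing audio.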
### 5. Microphone Client (`client/mic_stream.py`)
- Created streaming client using `sounddevice`
- Features:
  - Real-time microphone capture
  - WebSocket streaming to server
  - Audio device selection
  - Automatic format conversion (float32 to int16)
  - Async communication
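
The float32 to int16 conversion listed above is typically a clip-and-scale step. The sketch below shows one common way to do it with NumPy; `float32_to_int16` is a hypothetical name and the scaling constant is an assumption, not necessarily what `client/mic_stream.py` uses.

```python
import numpy as np


def float32_to_int16(samples):
    """Convert float samples in [-1.0, 1.0] to int16 PCM."""
    # Clip first so out-of-range samples saturate instead of wrapping around.
    clipped = np.clip(samples, -1.0, 1.0)
    return (clipped * 32767.0).astype(np.int16)
```

The resulting array can be turned into bytes with `.tobytes()` before being sent over the WebSocket.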
### 6. Test Script (`tools/test_offline.py`)
- Completely rewritten for onnx-asr
- Features:
  - Command-line interface
  - Support for WAV files
  - Optional VAD and quantization
  - Audio statistics and diagnostics
### 7. Diagnostics Tool (`tools/diagnose.py`)
- New comprehensive system check tool
- Checks:
  - Python version
  - Installed packages
  - CUDA availability
  - ONNX Runtime providers
  - Audio devices
  - Model files
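
A few of these checks can be sketched with the standard library alone. The function below is a simplified illustration, not the contents of `tools/diagnose.py`; `check_environment` and the default package list are assumptions based on the dependencies named earlier.

```python
import sys
from importlib import metadata

# Assumed package list, based on the dependencies named in this document.
DEFAULT_PACKAGES = ("onnx-asr", "numpy", "websockets", "sounddevice", "soundfile")


def check_environment(packages=DEFAULT_PACKAGES):
    """Collect the Python version, package versions, and ONNX Runtime providers."""
    report = {"python": "%d.%d.%d" % sys.version_info[:3]}

    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = None  # not installed

    try:
        import onnxruntime as ort
        report["providers"] = ort.get_available_providers()
    except ImportError:
        report["providers"] = []  # ONNX Runtime itself is missing

    return report
```

If `CUDAExecutionProvider` is absent from `report["providers"]`, inference would be expected to run on the CPU.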
### 8. Setup Script (`setup_env.sh`)
- Automated setup script
- Features:
  - Virtual environment creation
  - Dependency installation
  - CUDA/GPU detection
  - System diagnostics
  - Optional model download
### 9. Documentation
- **README.md**: Comprehensive documentation with:
  - Installation instructions
  - Usage examples
  - Configuration options
  - Troubleshooting guide
  - Performance tips
- **QUICKSTART.md**: Quick start guide with:
  - 5-minute setup
  - Common commands
  - Troubleshooting
  - Performance optimization
- **example.py**: Simple usage example
## Key Benefits
### 1. GPU Optimization
- Native CUDA support via ONNX Runtime
- Configurable GPU memory limits
- Optional TensorRT for even faster inference
- Automatic fallback to CPU if GPU unavailable
### 2. Simplified Model Management
- Automatic model download from Hugging Face
- No manual ONNX export needed
- Pre-converted models ready to use
- Support for quantized versions
### 3. Better Performance
- Optimized ONNX inference
- GPU acceleration on GTX 1660
- ~50-100x realtime on GPU
- Reduced memory usage with quantization
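
A figure such as "~50-100x realtime" is usually read as seconds of audio divided by seconds of processing time. The sketch below shows how that ratio could be measured for any transcription callable; `measure_realtime_factor` is a hypothetical helper and is not part of the project.

```python
import time


def realtime_factor(audio_seconds, processing_seconds):
    """Seconds of audio processed per second of wall-clock time."""
    return audio_seconds / processing_seconds


def measure_realtime_factor(transcribe, audio_seconds):
    """Time a zero-argument callable and return (factor, result)."""
    start = time.perf_counter()
    result = transcribe()
    # Guard against a zero reading from the timer on very fast calls.
    elapsed = max(time.perf_counter() - start, 1e-9)
    return realtime_factor(audio_seconds, elapsed), result
```

For example, 60 seconds of audio transcribed in 1 second gives a factor of 60.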
### 4. Improved Usability
- Simpler API
- Better error handling
- Comprehensive logging
- Easy configuration
### 5. Modern Features
- WebSocket streaming
- Real-time transcription
- VAD integration
- Batch processing
## Model Information
- **Model**: Parakeet TDT 0.6B V3 (Multilingual)
- **Source**: https://huggingface.co/istupakov/parakeet-tdt-0.6b-v3-onnx
- **Size**: ~600MB
- **Languages**: 25+ languages
- **Location**: `models/parakeet/` (auto-downloaded)
## File Structure
```
parakeet-test/
├── asr/
│   ├── __init__.py      ✓ Updated
│   └── asr_pipeline.py  ✓ Refactored
├── client/
│   ├── __init__.py      ✓ Updated
│   └── mic_stream.py    ✓ New
├── server/
│   ├── __init__.py      ✓ Updated
│   └── ws_server.py     ✓ New
├── vad/
│   ├── __init__.py      ✓ Updated
│   └── silero_vad.py    ✓ Refactored
├── tools/
│   ├── diagnose.py      ✓ New
│   └── test_offline.py  ✓ Refactored
├── models/
│   └── parakeet/        ✓ Auto-created
├── requirements.txt     ✓ Updated
├── setup_env.sh         ✓ New
├── README.md            ✓ New
├── QUICKSTART.md        ✓ New
├── example.py           ✓ New
├── .gitignore           ✓ New
└── REFACTORING.md       ✓ This file
```
## Migration from Old Code
### Old Code Pattern:
```python
# Manual ONNX session creation
import onnxruntime as ort
session = ort.InferenceSession("encoder.onnx", providers=["CUDAExecutionProvider"])
# Manual preprocessing and decoding
```
### New Code Pattern:
```python
# Simple onnx-asr interface
import onnx_asr
model = onnx_asr.load_model("nemo-parakeet-tdt-0.6b-v3")
text = model.recognize("audio.wav")
```
## Testing Instructions
### 1. Setup
```bash
./setup_env.sh
source venv/bin/activate
```
### 2. Run Diagnostics
```bash
python3 tools/diagnose.py
```
### 3. Test Offline
```bash
python3 tools/test_offline.py test.wav
```
### 4. Test Streaming
```bash
# Terminal 1
python3 server/ws_server.py
# Terminal 2
python3 client/mic_stream.py
```
## Known Limitations
1. **Audio Format**: Only PCM-encoded WAV files are supported directly
2. **Segment Length**: Models work best with segments shorter than 30 seconds
3. **GPU Memory**: Requires at least 2-3 GB of GPU memory
4. **Sample Rate**: 16 kHz is recommended for best results
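
A file can be checked against the format, length, and sample-rate limits above before it is transcribed. The sketch below uses only the standard-library `wave` module; `inspect_wav` and the warning texts are hypothetical and not part of `tools/test_offline.py`.

```python
import wave

RECOMMENDED_RATE = 16000      # 16 kHz, as recommended above
MAX_SEGMENT_SECONDS = 30.0    # segment length guideline from above


def inspect_wav(path):
    """Return basic properties of a PCM WAV file plus any warnings."""
    # wave.open raises wave.Error for files it cannot parse as WAV.
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        info = {
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
            "sample_rate": rate,
            "duration_s": wav.getnframes() / float(rate),
        }

    warnings = []
    if rate != RECOMMENDED_RATE:
        warnings.append("sample rate is %d Hz, 16000 Hz is recommended" % rate)
    if info["duration_s"] >= MAX_SEGMENT_SECONDS:
        warnings.append("audio is 30 s or longer, consider VAD or splitting")
    info["warnings"] = warnings
    return info
```

An empty `warnings` list suggests the file already fits the limits listed above.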
## Future Enhancements
Possible improvements:
- [ ] Add support for other audio formats (MP3, FLAC, etc.)
- [ ] Implement beam search decoding
- [ ] Add language selection option
- [ ] Support for speaker diarization
- [ ] REST API in addition to WebSocket
- [ ] Docker containerization
- [ ] Batch file processing script
- [ ] Real-time visualization of transcription
## References
- [onnx-asr GitHub](https://github.com/istupakov/onnx-asr)
- [onnx-asr Documentation](https://istupakov.github.io/onnx-asr/)
- [Parakeet ONNX Model](https://huggingface.co/istupakov/parakeet-tdt-0.6b-v3-onnx)
- [Original Parakeet Model](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3)
- [ONNX Runtime](https://onnxruntime.ai/)
## Support
For issues related to:
- **onnx-asr library**: https://github.com/istupakov/onnx-asr/issues
- **This implementation**: Check the logs and run `tools/diagnose.py`
- **GPU/CUDA issues**: Verify the output of `nvidia-smi` and the CUDA installation
---
**Refactoring completed on**: January 18, 2026
**Primary changes**: Migration to onnx-asr library for simplified ONNX inference with GPU support