Decided on Parakeet ONNX Runtime. Works pretty great. Realtime voice chat possible now. UX lacking.
# Quick Start Guide

## 🚀 Getting Started in 5 Minutes

### 1. Setup Environment

```bash
# Make setup script executable and run it
chmod +x setup_env.sh
./setup_env.sh
```

The setup script will:
- Create a virtual environment
- Install all dependencies including `onnx-asr`
- Check CUDA/GPU availability
- Run system diagnostics
- Optionally download the Parakeet model
### 2. Activate Virtual Environment

```bash
source venv/bin/activate
```

### 3. Test Your Setup

Run diagnostics to verify everything is working:

```bash
python3 tools/diagnose.py
```

Expected output should show:
- ✓ Python 3.10+
- ✓ onnx-asr installed
- ✓ CUDAExecutionProvider available
- ✓ GPU detected

### 4. Test Offline Transcription

Create a test audio file or use an existing WAV file:

```bash
python3 tools/test_offline.py test.wav
```
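If you don't have a WAV file handy, the sketch below (standard library only; the script name and the 440 Hz tone are arbitrary choices) writes one in the expected 16 kHz mono PCM_16 format. A pure tone won't produce a meaningful transcript, but it verifies the pipeline end to end:

```python
# make_test_wav.py - write a 2-second 440 Hz tone as a 16 kHz mono 16-bit WAV
import math
import struct
import wave

SAMPLE_RATE = 16000  # matches the model's expected input rate
DURATION_S = 2.0
FREQ_HZ = 440.0

with wave.open("test.wav", "wb") as wf:
    wf.setnchannels(1)  # mono
    wf.setsampwidth(2)  # 16-bit PCM
    wf.setframerate(SAMPLE_RATE)
    frames = b"".join(
        struct.pack("<h", int(0.3 * 32767 * math.sin(2 * math.pi * FREQ_HZ * t / SAMPLE_RATE)))
        for t in range(int(SAMPLE_RATE * DURATION_S))
    )
    wf.writeframes(frames)
```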
### 5. Start Real-Time Streaming

**Terminal 1 - Start Server:**
```bash
python3 server/ws_server.py
```

**Terminal 2 - Start Client:**
```bash
# List audio devices first
python3 client/mic_stream.py --list-devices

# Start streaming with your microphone
python3 client/mic_stream.py --device 0
```
## 🎯 Common Commands

### Offline Transcription

```bash
# Basic transcription
python3 tools/test_offline.py audio.wav

# With Voice Activity Detection (for long files)
python3 tools/test_offline.py audio.wav --use-vad

# With quantization (faster, uses less memory)
python3 tools/test_offline.py audio.wav --quantization int8
```
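The test script is a thin wrapper around the `onnx-asr` library, so you can also call it from your own code. A minimal sketch (the model name is assumed to match what the setup script downloads; adjust it if yours differs):

```python
# Minimal sketch of direct onnx-asr usage with the Parakeet TDT 0.6B v3 model
import onnx_asr

model = onnx_asr.load_model("nemo-parakeet-tdt-0.6b-v3")
print(model.recognize("audio.wav"))
```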
### WebSocket Server

```bash
# Start server on the default port (8765)
python3 server/ws_server.py

# Custom host and port
python3 server/ws_server.py --host 0.0.0.0 --port 9000

# With VAD enabled
python3 server/ws_server.py --use-vad
```

### Microphone Client

```bash
# List available audio devices
python3 client/mic_stream.py --list-devices

# Connect to server
python3 client/mic_stream.py --url ws://localhost:8765

# Use specific device
python3 client/mic_stream.py --device 2

# Custom sample rate
python3 client/mic_stream.py --sample-rate 16000
```
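For scripted testing without a microphone, a small programmatic client can pace a WAV file to the server like a live mic. This sketch uses the `websockets` package and *assumes* the server accepts raw 16-bit PCM chunks and replies with transcript text; check `client/mic_stream.py` for the actual wire format before relying on it:

```python
# stream_wav.py - hypothetical sketch: stream a WAV file to the server in real time.
# ASSUMPTION: the server consumes raw PCM bytes and returns transcripts as text
# frames; verify against client/mic_stream.py.
import asyncio
import wave

import websockets  # pip install websockets

async def stream(path: str, url: str = "ws://localhost:8765") -> None:
    async with websockets.connect(url) as ws:
        with wave.open(path, "rb") as wf:
            chunk = int(wf.getframerate() * 0.1)  # ~100 ms of audio per message
            while data := wf.readframes(chunk):
                await ws.send(data)
                await asyncio.sleep(0.1)  # simulate real-time pacing
        try:  # drain transcripts that arrived during streaming
            while True:
                print(await asyncio.wait_for(ws.recv(), timeout=2.0))
        except asyncio.TimeoutError:
            pass

asyncio.run(stream("test.wav"))
```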
## 🔧 Troubleshooting

### GPU Not Detected

1. Check the NVIDIA driver:
```bash
nvidia-smi
```

2. Check the CUDA version:
```bash
nvcc --version
```

3. Verify ONNX Runtime can see the GPU:
```bash
python3 -c "import onnxruntime as ort; print(ort.get_available_providers())"
```

The list should include `CUDAExecutionProvider`.
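If the provider is listed but transcription still runs on CPU, a session-level check helps: ONNX Runtime silently falls back to CPU when CUDA initialization fails, so compare the providers actually applied. The model path below is a placeholder; point it at any `.onnx` file you have (e.g. under `models/parakeet/`):

```python
import onnxruntime as ort

print("Available:", ort.get_available_providers())
sess = ort.InferenceSession(
    "models/parakeet/encoder-model.onnx",  # placeholder - use any .onnx file you have
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
print("Applied:", sess.get_providers())  # CUDA missing here means init failed
```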
### Out of Memory

If you get CUDA out-of-memory errors:

1. **Use quantization:**
```bash
python3 tools/test_offline.py audio.wav --quantization int8
```

2. **Close other GPU applications**

3. **Reduce the GPU memory limit** in `asr/asr_pipeline.py`:
```python
"gpu_mem_limit": 4 * 1024 * 1024 * 1024,  # 4 GB instead of 6 GB
```
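For context, `gpu_mem_limit` is a `CUDAExecutionProvider` option measured in bytes. The surrounding structure in this sketch is illustrative; match it to the actual code in `asr/asr_pipeline.py`:

```python
providers = [
    ("CUDAExecutionProvider", {
        "gpu_mem_limit": 4 * 1024 * 1024 * 1024,  # 4 GB, in bytes
    }),
    "CPUExecutionProvider",  # fallback if CUDA cannot allocate
]
```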
### Microphone Not Working

1. Check permissions:
```bash
sudo usermod -a -G audio $USER
# Then log out and log back in
```

2. Test with the system audio recorder first

3. List and test devices:
```bash
python3 client/mic_stream.py --list-devices
```
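If the client's device listing fails, you can also query the hardware directly. This assumes the `sounddevice` package (install it just for this check if needed); it lists devices and records one second from the default input:

```python
import sounddevice as sd  # pip install sounddevice

print(sd.query_devices())  # all input/output devices with their indices
rec = sd.rec(16000, samplerate=16000, channels=1, dtype="int16")
sd.wait()  # block until the one-second recording finishes
print("peak amplitude:", abs(rec).max())  # near zero means no signal captured
```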
### Model Download Fails

If Hugging Face is slow or blocked:

1. **Set HF token** (optional, for faster downloads):
```bash
export HF_TOKEN="your_huggingface_token"
```

2. **Manual download:**
```bash
# Download from: https://huggingface.co/istupakov/parakeet-tdt-0.6b-v3-onnx
# Extract to: models/parakeet/
```
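If you prefer to script the manual download, one option is the `huggingface_hub` package (an assumption; it isn't necessarily among this project's dependencies):

```python
from huggingface_hub import snapshot_download  # pip install huggingface_hub

# Mirrors the manual download above into the expected directory
snapshot_download("istupakov/parakeet-tdt-0.6b-v3-onnx", local_dir="models/parakeet")
```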
## 📊 Performance Tips

### For Best GPU Performance

1. **Use the TensorRT provider** (faster than CUDA):
```bash
pip install tensorrt tensorrt-cu12-libs
```

Then edit `asr/asr_pipeline.py` to use the TensorRT provider.

2. **Use FP16 quantization** (on TensorRT):
```python
providers = [
    ("TensorrtExecutionProvider", {
        "trt_fp16_enable": True,  # run supported ops in FP16
    })
]
```

3. **Enable quantization:**
```bash
--quantization int8  # good balance of speed and accuracy
--quantization fp16  # better quality
```
### For Lower Latency Streaming

1. **Reduce the chunk duration** in the client (see the sizing sketch after this list):
```bash
python3 client/mic_stream.py --chunk-duration 0.05
```

2. **Disable VAD** for lower response latency

3. **Use a quantized model** for faster processing
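To see what a given `--chunk-duration` means on the wire, here is the arithmetic at 16 kHz mono 16-bit PCM; smaller chunks mean lower latency but more messages per second:

```python
SAMPLE_RATE = 16000  # Hz, mono, 2 bytes per sample (16-bit PCM)
for duration_s in (0.05, 0.1, 0.5):
    samples = int(SAMPLE_RATE * duration_s)
    print(f"{duration_s} s -> {samples} samples -> {samples * 2} bytes per chunk")
# 0.05 s -> 800 samples -> 1600 bytes per chunk
# 0.1 s -> 1600 samples -> 3200 bytes per chunk
# 0.5 s -> 8000 samples -> 16000 bytes per chunk
```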
## 🎤 Audio File Requirements

### Supported Formats
- **Format**: WAV (PCM_16, PCM_24, PCM_32, PCM_U8)
- **Sample Rate**: 16000 Hz (recommended)
- **Channels**: Mono (stereo will be converted to mono)

### Convert Audio Files

```bash
# Using ffmpeg
ffmpeg -i input.mp3 -ar 16000 -ac 1 output.wav

# Using sox
sox input.mp3 -r 16000 -c 1 output.wav
```
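If you would rather convert in Python, a sketch using `librosa` and `soundfile` (assumed installed; neither is required by the commands above):

```python
import librosa
import soundfile as sf

# Resample to 16 kHz and downmix to mono, then write 16-bit PCM
audio, sr = librosa.load("input.mp3", sr=16000, mono=True)
sf.write("output.wav", audio, sr, subtype="PCM_16")
```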
## 📝 Example Workflow

A complete example for transcribing a meeting recording:

```bash
# 1. Activate the environment
source venv/bin/activate

# 2. Convert the audio to the correct format
ffmpeg -i meeting.mp3 -ar 16000 -ac 1 meeting.wav

# 3. Transcribe with VAD (for long recordings)
python3 tools/test_offline.py meeting.wav --use-vad

# The output will show the transcription with automatic segmentation
```
## 🌐 Supported Languages

The Parakeet TDT 0.6B V3 model supports **25 European languages**, including:
- English
- Spanish
- French
- German
- Italian
- Portuguese
- Russian
- Ukrainian
- Polish
- Dutch
- And more...

The model automatically detects the language.

## 💡 Tips

1. **For short audio clips** (<30 seconds): don't use VAD
2. **For long audio files**: use the `--use-vad` flag
3. **For real-time streaming**: keep chunks small (0.1-0.5 seconds)
4. **For best accuracy**: use 16 kHz mono WAV files
5. **For faster inference**: use `--quantization int8`
## 📚 More Information

- See `README.md` for detailed documentation
- Run `python3 tools/diagnose.py` for a system check
- Check the logs for debugging information

## 🆘 Getting Help

If you encounter issues:

1. Run diagnostics:
```bash
python3 tools/diagnose.py
```

2. Check the logs in the terminal output

3. Verify your audio format and sample rate

4. Review the troubleshooting section above