Decided on Parakeet ONNX Runtime. Works pretty great. Realtime voice chat possible now. UX lacking.
# Quick Start Guide

## 🚀 Getting Started in 5 Minutes

### 1. Setup Environment

```bash
# Make setup script executable and run it
chmod +x setup_env.sh
./setup_env.sh
```

The setup script will:
- Create a virtual environment
- Install all dependencies including `onnx-asr`
- Check CUDA/GPU availability
- Run system diagnostics
- Optionally download the Parakeet model
### 2. Activate Virtual Environment

```bash
source venv/bin/activate
```

### 3. Test Your Setup

Run diagnostics to verify everything is working:

```bash
python3 tools/diagnose.py
```

Expected output should show:
- ✓ Python 3.10+
- ✓ onnx-asr installed
- ✓ CUDAExecutionProvider available
- ✓ GPU detected

### 4. Test Offline Transcription

Create a test audio file or use an existing WAV file:

```bash
python3 tools/test_offline.py test.wav
```
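If you don't have a WAV file handy, the sketch below (standard library only; the script name and the 440 Hz tone are arbitrary choices) writes one in the expected 16 kHz mono PCM_16 format. A pure tone won't produce a meaningful transcript, but it verifies the pipeline end to end:

```python
# make_test_wav.py - write a 2-second 440 Hz tone as a 16 kHz mono 16-bit WAV
import math
import struct
import wave

SAMPLE_RATE = 16000  # matches the model's expected input rate
DURATION_S = 2.0
FREQ_HZ = 440.0

with wave.open("test.wav", "wb") as wf:
    wf.setnchannels(1)  # mono
    wf.setsampwidth(2)  # 16-bit PCM
    wf.setframerate(SAMPLE_RATE)
    frames = b"".join(
        struct.pack("<h", int(0.3 * 32767 * math.sin(2 * math.pi * FREQ_HZ * t / SAMPLE_RATE)))
        for t in range(int(SAMPLE_RATE * DURATION_S))
    )
    wf.writeframes(frames)
```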
### 5. Start Real-Time Streaming

**Terminal 1 - Start Server:**
```bash
python3 server/ws_server.py
```

**Terminal 2 - Start Client:**
```bash
# List audio devices first
python3 client/mic_stream.py --list-devices

# Start streaming with your microphone
python3 client/mic_stream.py --device 0
```
## 🎯 Common Commands

### Offline Transcription

```bash
# Basic transcription
python3 tools/test_offline.py audio.wav

# With Voice Activity Detection (for long files)
python3 tools/test_offline.py audio.wav --use-vad

# With quantization (faster, uses less memory)
python3 tools/test_offline.py audio.wav --quantization int8
```
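The test script is a thin wrapper around the `onnx-asr` library, so you can also call it from your own code. A minimal sketch (the model name is assumed to match what the setup script downloads; adjust it if yours differs):

```python
# Minimal sketch of direct onnx-asr usage with the Parakeet TDT 0.6B v3 model
import onnx_asr

model = onnx_asr.load_model("nemo-parakeet-tdt-0.6b-v3")
print(model.recognize("audio.wav"))
```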
### WebSocket Server

```bash
# Start server on the default port (8765)
python3 server/ws_server.py

# Custom host and port
python3 server/ws_server.py --host 0.0.0.0 --port 9000

# With VAD enabled
python3 server/ws_server.py --use-vad
```

### Microphone Client

```bash
# List available audio devices
python3 client/mic_stream.py --list-devices

# Connect to server
python3 client/mic_stream.py --url ws://localhost:8765

# Use specific device
python3 client/mic_stream.py --device 2

# Custom sample rate
python3 client/mic_stream.py --sample-rate 16000
```
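For scripted testing without a microphone, a small programmatic client can pace a WAV file to the server like a live mic. This sketch uses the `websockets` package and *assumes* the server accepts raw 16-bit PCM chunks and replies with transcript text; check `client/mic_stream.py` for the actual wire format before relying on it:

```python
# stream_wav.py - hypothetical sketch: stream a WAV file to the server in real time.
# ASSUMPTION: the server consumes raw PCM bytes and returns transcripts as text
# frames; verify against client/mic_stream.py.
import asyncio
import wave

import websockets  # pip install websockets

async def stream(path: str, url: str = "ws://localhost:8765") -> None:
    async with websockets.connect(url) as ws:
        with wave.open(path, "rb") as wf:
            chunk = int(wf.getframerate() * 0.1)  # ~100 ms of audio per message
            while data := wf.readframes(chunk):
                await ws.send(data)
                await asyncio.sleep(0.1)  # simulate real-time pacing
        try:  # drain transcripts that arrived during streaming
            while True:
                print(await asyncio.wait_for(ws.recv(), timeout=2.0))
        except asyncio.TimeoutError:
            pass

asyncio.run(stream("test.wav"))
```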
## 🔧 Troubleshooting

### GPU Not Detected

1. Check the NVIDIA driver:
```bash
nvidia-smi
```

2. Check the CUDA version:
```bash
nvcc --version
```

3. Verify ONNX Runtime can see the GPU:
```bash
python3 -c "import onnxruntime as ort; print(ort.get_available_providers())"
```

The list should include `CUDAExecutionProvider`.
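If the provider is listed but transcription still runs on CPU, a session-level check helps: ONNX Runtime silently falls back to CPU when CUDA initialization fails, so compare the providers actually applied. The model path below is a placeholder; point it at any `.onnx` file you have (e.g. under `models/parakeet/`):

```python
import onnxruntime as ort

print("Available:", ort.get_available_providers())
sess = ort.InferenceSession(
    "models/parakeet/encoder-model.onnx",  # placeholder - use any .onnx file you have
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
print("Applied:", sess.get_providers())  # CUDA missing here means init failed
```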
### Out of Memory

If you get CUDA out-of-memory errors:

1. **Use quantization:**
```bash
python3 tools/test_offline.py audio.wav --quantization int8
```

2. **Close other GPU applications**

3. **Reduce the GPU memory limit** in `asr/asr_pipeline.py`:
```python
"gpu_mem_limit": 4 * 1024 * 1024 * 1024,  # 4 GB instead of 6 GB
```
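For context, `gpu_mem_limit` is a `CUDAExecutionProvider` option measured in bytes. The surrounding structure in this sketch is illustrative; match it to the actual code in `asr/asr_pipeline.py`:

```python
providers = [
    ("CUDAExecutionProvider", {
        "gpu_mem_limit": 4 * 1024 * 1024 * 1024,  # 4 GB, in bytes
    }),
    "CPUExecutionProvider",  # fallback if CUDA cannot allocate
]
```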
### Microphone Not Working

1. Check permissions:
```bash
sudo usermod -a -G audio $USER
# Then log out and log back in
```

2. Test with the system audio recorder first

3. List and test devices:
```bash
python3 client/mic_stream.py --list-devices
```
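If the client's device listing fails, you can also query the hardware directly. This assumes the `sounddevice` package (install it just for this check if needed); it lists devices and records one second from the default input:

```python
import sounddevice as sd  # pip install sounddevice

print(sd.query_devices())  # all input/output devices with their indices
rec = sd.rec(16000, samplerate=16000, channels=1, dtype="int16")
sd.wait()  # block until the one-second recording finishes
print("peak amplitude:", abs(rec).max())  # near zero means no signal captured
```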
### Model Download Fails

If Hugging Face is slow or blocked:

1. **Set HF token** (optional, for faster downloads):
```bash
export HF_TOKEN="your_huggingface_token"
```

2. **Manual download:**
```bash
# Download from: https://huggingface.co/istupakov/parakeet-tdt-0.6b-v3-onnx
# Extract to: models/parakeet/
```
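If you prefer to script the manual download, one option is the `huggingface_hub` package (an assumption; it isn't necessarily among this project's dependencies):

```python
from huggingface_hub import snapshot_download  # pip install huggingface_hub

# Mirrors the manual download above into the expected directory
snapshot_download("istupakov/parakeet-tdt-0.6b-v3-onnx", local_dir="models/parakeet")
```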
## 📊 Performance Tips

### For Best GPU Performance

1. **Use the TensorRT provider** (faster than CUDA):
```bash
pip install tensorrt tensorrt-cu12-libs
```

Then edit `asr/asr_pipeline.py` to use the TensorRT provider.

2. **Use FP16 quantization** (on TensorRT):
```python
providers = [
    ("TensorrtExecutionProvider", {
        "trt_fp16_enable": True,  # run supported ops in FP16
    })
]
```

3. **Enable quantization:**
```bash
--quantization int8  # good balance of speed and accuracy
--quantization fp16  # better quality
```
### For Lower Latency Streaming

1. **Reduce the chunk duration** in the client (see the sizing sketch after this list):
```bash
python3 client/mic_stream.py --chunk-duration 0.05
```

2. **Disable VAD** for lower response latency

3. **Use a quantized model** for faster processing
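To see what a given `--chunk-duration` means on the wire, here is the arithmetic at 16 kHz mono 16-bit PCM; smaller chunks mean lower latency but more messages per second:

```python
SAMPLE_RATE = 16000  # Hz, mono, 2 bytes per sample (16-bit PCM)
for duration_s in (0.05, 0.1, 0.5):
    samples = int(SAMPLE_RATE * duration_s)
    print(f"{duration_s} s -> {samples} samples -> {samples * 2} bytes per chunk")
# 0.05 s -> 800 samples -> 1600 bytes per chunk
# 0.1 s -> 1600 samples -> 3200 bytes per chunk
# 0.5 s -> 8000 samples -> 16000 bytes per chunk
```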
## 🎤 Audio File Requirements

### Supported Formats
- **Format**: WAV (PCM_16, PCM_24, PCM_32, PCM_U8)
- **Sample Rate**: 16000 Hz (recommended)
- **Channels**: Mono (stereo will be converted to mono)

### Convert Audio Files

```bash
# Using ffmpeg
ffmpeg -i input.mp3 -ar 16000 -ac 1 output.wav

# Using sox
sox input.mp3 -r 16000 -c 1 output.wav
```
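If you would rather convert in Python, a sketch using `librosa` and `soundfile` (assumed installed; neither is required by the commands above):

```python
import librosa
import soundfile as sf

# Resample to 16 kHz and downmix to mono, then write 16-bit PCM
audio, sr = librosa.load("input.mp3", sr=16000, mono=True)
sf.write("output.wav", audio, sr, subtype="PCM_16")
```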
## 📝 Example Workflow

A complete example for transcribing a meeting recording:

```bash
# 1. Activate the environment
source venv/bin/activate

# 2. Convert the audio to the correct format
ffmpeg -i meeting.mp3 -ar 16000 -ac 1 meeting.wav

# 3. Transcribe with VAD (for long recordings)
python3 tools/test_offline.py meeting.wav --use-vad

# The output will show the transcription with automatic segmentation
```
## 🌐 Supported Languages

The Parakeet TDT 0.6B V3 model supports **25 European languages**, including:
- English
- Spanish
- French
- German
- Italian
- Portuguese
- Russian
- Ukrainian
- Polish
- Dutch
- And more...

The model automatically detects the language.

## 💡 Tips

1. **For short audio clips** (<30 seconds): don't use VAD
2. **For long audio files**: use the `--use-vad` flag
3. **For real-time streaming**: keep chunks small (0.1-0.5 seconds)
4. **For best accuracy**: use 16 kHz mono WAV files
5. **For faster inference**: use `--quantization int8`
## 📚 More Information

- See `README.md` for detailed documentation
- Run `python3 tools/diagnose.py` for a system check
- Check the logs for debugging information

## 🆘 Getting Help

If you encounter issues:

1. Run diagnostics:
```bash
python3 tools/diagnose.py
```

2. Check the logs in the terminal output

3. Verify your audio format and sample rate

4. Review the troubleshooting section above