# Quick Start Guide

## 🚀 Getting Started in 5 Minutes

### 1. Setup Environment

```bash
# Make the setup script executable and run it
chmod +x setup_env.sh
./setup_env.sh
```

The setup script will:

- Create a virtual environment
- Install all dependencies including `onnx-asr`
- Check CUDA/GPU availability
- Run system diagnostics
- Optionally download the Parakeet model

### 2. Activate Virtual Environment

```bash
source venv/bin/activate
```

### 3. Test Your Setup

Run diagnostics to verify that everything is working:

```bash
python3 tools/diagnose.py
```

The expected output should show:

- ✓ Python 3.10+
- ✓ onnx-asr installed
- ✓ CUDAExecutionProvider available
- ✓ GPU detected

### 4. Test Offline Transcription

Create a test audio file or use an existing WAV file:

```bash
python3 tools/test_offline.py test.wav
```
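
If you don't have a WAV file handy, a short tone generated with Python's standard library works as a smoke test. This is just a sketch: the audio content doesn't matter, only that the file is 16 kHz mono PCM_16, which is what the pipeline expects.

```python
import math
import struct
import wave


def write_test_wav(path: str, seconds: float = 2.0, rate: int = 16000) -> None:
    """Write a mono 16-bit PCM WAV containing a quiet 440 Hz tone."""
    n = int(seconds * rate)
    frames = b"".join(
        struct.pack("<h", int(32767 * 0.3 * math.sin(2 * math.pi * 440 * i / rate)))
        for i in range(n)
    )
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)     # mono
        wf.setsampwidth(2)     # 2 bytes per sample = PCM_16
        wf.setframerate(rate)  # 16 kHz, the pipeline's expected rate
        wf.writeframes(frames)


if __name__ == "__main__":
    write_test_wav("test.wav")
```

Run it once, then pass the resulting `test.wav` to `tools/test_offline.py` as shown above.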

### 5. Start Real-Time Streaming

**Terminal 1 - Start the server:**

```bash
python3 server/ws_server.py
```

**Terminal 2 - Start the client:**

```bash
# List audio devices first
python3 client/mic_stream.py --list-devices

# Start streaming with your microphone
python3 client/mic_stream.py --device 0
```
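
Under the hood, a streaming client like this captures audio and ships fixed-size raw PCM chunks over the socket. As a sketch of the framing only (assuming the server expects 16-bit little-endian mono PCM, the usual convention for such pipelines; this is not the project's actual client code):

```python
import struct


def pack_pcm16(samples: list[float]) -> bytes:
    """Clip float samples to [-1, 1] and pack as little-endian 16-bit PCM."""
    clipped = (max(-1.0, min(1.0, s)) for s in samples)
    return b"".join(struct.pack("<h", int(s * 32767)) for s in clipped)
```

At 16 kHz mono, a 0.1-second chunk is 1600 samples, i.e. 3200 bytes per WebSocket message.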

## 🎯 Common Commands

### Offline Transcription

```bash
# Basic transcription
python3 tools/test_offline.py audio.wav

# With voice activity detection (for long files)
python3 tools/test_offline.py audio.wav --use-vad

# With quantization (faster, uses less memory)
python3 tools/test_offline.py audio.wav --quantization int8
```

### WebSocket Server

```bash
# Start the server on the default port (8765)
python3 server/ws_server.py

# Custom host and port
python3 server/ws_server.py --host 0.0.0.0 --port 9000

# With VAD enabled
python3 server/ws_server.py --use-vad
```

### Microphone Client

```bash
# List available audio devices
python3 client/mic_stream.py --list-devices

# Connect to the server
python3 client/mic_stream.py --url ws://localhost:8765

# Use a specific device
python3 client/mic_stream.py --device 2

# Custom sample rate
python3 client/mic_stream.py --sample-rate 16000
```

## 🔧 Troubleshooting

### GPU Not Detected

1. Check the NVIDIA driver:

```bash
nvidia-smi
```

2. Check the CUDA version:

```bash
nvcc --version
```

3. Verify that ONNX Runtime can see the GPU:

```bash
python3 -c "import onnxruntime as ort; print(ort.get_available_providers())"
```

The output should include `CUDAExecutionProvider`.
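
If `CUDAExecutionProvider` is missing from that list, ONNX Runtime will silently run on CPU. A defensive pattern (a sketch, not the project's actual code) is to build the provider list from whatever is actually available, always keeping the CPU fallback last:

```python
def pick_providers(available: list[str]) -> list[str]:
    """Prefer TensorRT, then CUDA, and always keep the CPU fallback last."""
    preferred = ["TensorrtExecutionProvider", "CUDAExecutionProvider"]
    chosen = [p for p in preferred if p in available]
    return chosen + ["CPUExecutionProvider"]
```

The result can be passed as the `providers=` argument when creating an `onnxruntime.InferenceSession`, with `available` taken from `ort.get_available_providers()`.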

### Out of Memory

If you get CUDA out-of-memory errors:

1. **Use quantization:**

```bash
python3 tools/test_offline.py audio.wav --quantization int8
```

2. **Close other GPU applications.**

3. **Reduce the GPU memory limit** in `asr/asr_pipeline.py`:

```python
"gpu_mem_limit": 4 * 1024 * 1024 * 1024,  # 4 GB instead of 6 GB
```
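
For context, `gpu_mem_limit` is a per-provider option, so in the session setup it lives inside the CUDA provider's options dict. A sketch of what that typically looks like (the surrounding structure is an assumption about this project's code; the option keys and byte units are from ONNX Runtime's CUDA execution provider):

```python
GIB = 1024 ** 3  # gpu_mem_limit is specified in bytes

providers = [
    ("CUDAExecutionProvider", {
        "gpu_mem_limit": 4 * GIB,                     # cap the arena at 4 GB
        "arena_extend_strategy": "kSameAsRequested",  # grow only as requested
    }),
    "CPUExecutionProvider",                           # fallback if CUDA fails
]
```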

### Microphone Not Working

1. Check permissions:

```bash
sudo usermod -a -G audio $USER
# Then log out and log back in
```

2. Test with a system audio recorder first.

3. List and test devices:

```bash
python3 client/mic_stream.py --list-devices
```

### Model Download Fails

If Hugging Face is slow or blocked:

1. **Set an HF token** (optional, for faster downloads):

```bash
export HF_TOKEN="your_huggingface_token"
```

2. **Download manually:**

```bash
# Download from: https://huggingface.co/istupakov/parakeet-tdt-0.6b-v3-onnx
# Extract to: models/parakeet/
```

## 📊 Performance Tips

### For Best GPU Performance

1. **Use the TensorRT provider** (faster than CUDA):

```bash
pip install tensorrt tensorrt-cu12-libs
```

Then edit `asr/asr_pipeline.py` to use the TensorRT provider.

2. **Use FP16 quantization** (with TensorRT):

```python
providers = [
    ("TensorrtExecutionProvider", {
        "trt_fp16_enable": True,
    })
]
```

3. **Enable quantization:**

```bash
--quantization int8  # Good balance of speed and quality
--quantization fp16  # Better quality
```

### For Lower Latency Streaming

1. **Reduce the chunk duration** in the client:

```bash
python3 client/mic_stream.py --chunk-duration 0.05
```

2. **Disable VAD** for faster responses.

3. **Use a quantized model** for faster processing.
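
Latency scales directly with chunk duration, and so does per-message overhead. Assuming 16 kHz mono 16-bit audio (this guide's recommended format), the wire size of each chunk is simple arithmetic, which makes the cost of a flag like `--chunk-duration 0.05` concrete:

```python
def chunk_bytes(duration_s: float, rate: int = 16000, sample_width: int = 2) -> int:
    """Size in bytes of one mono PCM chunk of the given duration."""
    return int(duration_s * rate) * sample_width
```

A 0.05 s chunk is 1600 bytes sent 20 times per second; halving the chunk duration halves the buffering latency but doubles the message rate, so very small chunks trade latency for overhead.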

## 🎤 Audio File Requirements

### Supported Formats

- **Format**: WAV (PCM_16, PCM_24, PCM_32, PCM_U8)
- **Sample rate**: 16000 Hz (recommended)
- **Channels**: Mono (stereo is converted to mono)

### Convert Audio Files

```bash
# Using ffmpeg
ffmpeg -i input.mp3 -ar 16000 -ac 1 output.wav

# Using sox
sox input.mp3 -r 16000 -c 1 output.wav
```
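
Before reaching for ffmpeg, you can check whether a file already meets the requirements using only Python's standard library. A small sketch (note that `wave` only reads uncompressed PCM WAV files):

```python
import wave


def check_wav(path: str) -> dict:
    """Report the WAV properties the pipeline cares about."""
    with wave.open(path, "rb") as wf:
        return {
            "rate": wf.getframerate(),          # want 16000
            "channels": wf.getnchannels(),      # want 1 (mono)
            "sample_width": wf.getsampwidth(),  # bytes per sample; 2 = PCM_16
            "duration_s": wf.getnframes() / wf.getframerate(),
        }
```

If `rate` or `channels` don't match, convert the file with one of the commands above.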

## 📝 Example Workflow

A complete example for transcribing a meeting recording:

```bash
# 1. Activate the environment
source venv/bin/activate

# 2. Convert the audio to the correct format
ffmpeg -i meeting.mp3 -ar 16000 -ac 1 meeting.wav

# 3. Transcribe with VAD (recommended for long recordings)
python3 tools/test_offline.py meeting.wav --use-vad

# The output shows the transcription with automatic segmentation
```

## 🌐 Supported Languages

The Parakeet TDT 0.6B V3 model supports **25+ languages**, including:

- English
- Spanish
- French
- German
- Italian
- Portuguese
- Russian
- Chinese
- Japanese
- Korean
- and more

The model detects the language automatically.

## 💡 Tips

1. **For short audio clips** (under 30 seconds): skip VAD
2. **For long audio files**: use the `--use-vad` flag
3. **For real-time streaming**: keep chunks small (0.1-0.5 seconds)
4. **For best accuracy**: use 16 kHz mono WAV files
5. **For faster inference**: use `--quantization int8`

## 📚 More Information

- See `README.md` for detailed documentation
- Run `python3 tools/diagnose.py` for a system check
- Check the logs for debugging information

## 🆘 Getting Help

If you encounter issues:

1. Run diagnostics:

```bash
python3 tools/diagnose.py
```

2. Check the logs in the terminal output
3. Verify your audio format and sample rate
4. Review the troubleshooting section above