# Quick Start Guide

## 🚀 Getting Started in 5 Minutes

### 1. Setup Environment

```bash
# Make setup script executable and run it
chmod +x setup_env.sh
./setup_env.sh
```

The setup script will:

- Create a virtual environment
- Install all dependencies including `onnx-asr`
- Check CUDA/GPU availability
- Run system diagnostics
- Optionally download the Parakeet model

### 2. Activate Virtual Environment

```bash
source venv/bin/activate
```

### 3. Test Your Setup

Run diagnostics to verify everything is working:

```bash
python3 tools/diagnose.py
```

Expected output should show:

- ✓ Python 3.10+
- ✓ onnx-asr installed
- ✓ CUDAExecutionProvider available
- ✓ GPU detected

### 4. Test Offline Transcription

Create a test audio file or use an existing WAV file:

```bash
python3 tools/test_offline.py test.wav
```

### 5. Start Real-Time Streaming

**Terminal 1 - Start Server:**

```bash
python3 server/ws_server.py
```

**Terminal 2 - Start Client:**

```bash
# List audio devices first
python3 client/mic_stream.py --list-devices

# Start streaming with your microphone
python3 client/mic_stream.py --device 0
```

## 🎯 Common Commands

### Offline Transcription

```bash
# Basic transcription
python3 tools/test_offline.py audio.wav

# With Voice Activity Detection (for long files)
python3 tools/test_offline.py audio.wav --use-vad

# With quantization (faster, uses less memory)
python3 tools/test_offline.py audio.wav --quantization int8
```

### WebSocket Server

```bash
# Start server on default port (8765)
python3 server/ws_server.py

# Custom host and port
python3 server/ws_server.py --host 0.0.0.0 --port 9000

# With VAD enabled
python3 server/ws_server.py --use-vad
```

### Microphone Client

```bash
# List available audio devices
python3 client/mic_stream.py --list-devices

# Connect to server
python3 client/mic_stream.py --url ws://localhost:8765

# Use specific device
python3 client/mic_stream.py --device 2

# Custom sample rate
python3 client/mic_stream.py --sample-rate 16000
```

## 🔧 Troubleshooting

### GPU Not Detected

1. Check NVIDIA driver:

   ```bash
   nvidia-smi
   ```

2. Check CUDA version:

   ```bash
   nvcc --version
   ```

3. Verify ONNX Runtime can see GPU:

   ```bash
   python3 -c "import onnxruntime as ort; print(ort.get_available_providers())"
   ```

   Should include `CUDAExecutionProvider`

### Out of Memory

If you get CUDA out of memory errors:

1. **Use quantization:**

   ```bash
   python3 tools/test_offline.py audio.wav --quantization int8
   ```

2. **Close other GPU applications**

3. **Reduce GPU memory limit** in `asr/asr_pipeline.py`:

   ```python
   "gpu_mem_limit": 4 * 1024 * 1024 * 1024,  # 4GB instead of 6GB
   ```

### Microphone Not Working

1. Check permissions:

   ```bash
   sudo usermod -a -G audio $USER
   # Then logout and login again
   ```

2. Test with system audio recorder first

3. List and test devices:

   ```bash
   python3 client/mic_stream.py --list-devices
   ```

### Model Download Fails

If Hugging Face is slow or blocked:

1. **Set HF token** (optional, for faster downloads):

   ```bash
   export HF_TOKEN="your_huggingface_token"
   ```

2. **Manual download:**

   ```bash
   # Download from: https://huggingface.co/istupakov/parakeet-tdt-0.6b-v3-onnx
   # Extract to: models/parakeet/
   ```

## 📊 Performance Tips

### For Best GPU Performance

1. **Use TensorRT provider** (faster than CUDA):

   ```bash
   pip install tensorrt tensorrt-cu12-libs
   ```

   Then edit `asr/asr_pipeline.py` to use TensorRT provider

2. **Use FP16 quantization** (on TensorRT):

   ```python
   providers = [
       ("TensorrtExecutionProvider", {
           "trt_fp16_enable": True,
       })
   ]
   ```

3. **Enable quantization:**

   ```bash
   --quantization int8  # Good balance
   --quantization fp16  # Better quality
   ```

### For Lower Latency Streaming

1. **Reduce chunk duration** in client:

   ```bash
   python3 client/mic_stream.py --chunk-duration 0.05
   ```

2. **Disable VAD** for shorter responses

3. **Use quantized model** for faster processing

## 🎤 Audio File Requirements

### Supported Formats

- **Format**: WAV (PCM_16, PCM_24, PCM_32, PCM_U8)
- **Sample Rate**: 16000 Hz (recommended)
- **Channels**: Mono (stereo will be converted to mono)

### Convert Audio Files

```bash
# Using ffmpeg
ffmpeg -i input.mp3 -ar 16000 -ac 1 output.wav

# Using sox
sox input.mp3 -r 16000 -c 1 output.wav
```

## 📝 Example Workflow

Complete example for transcribing a meeting recording:

```bash
# 1. Activate environment
source venv/bin/activate

# 2. Convert audio to correct format
ffmpeg -i meeting.mp3 -ar 16000 -ac 1 meeting.wav

# 3. Transcribe with VAD (for long recordings)
python3 tools/test_offline.py meeting.wav --use-vad

# Output will show transcription with automatic segmentation
```

## 🌐 Supported Languages

The Parakeet TDT 0.6B V3 model supports **25+ languages** including:

- English
- Spanish
- French
- German
- Italian
- Portuguese
- Russian
- Chinese
- Japanese
- Korean
- And more...

The model automatically detects the language.

## 💡 Tips

1. **For short audio clips** (<30 seconds): Don't use VAD
2. **For long audio files**: Use `--use-vad` flag
3. **For real-time streaming**: Keep chunks small (0.1-0.5 seconds)
4. **For best accuracy**: Use 16kHz mono WAV files
5. **For faster inference**: Use `--quantization int8`

## 📚 More Information

- See `README.md` for detailed documentation
- Run `python3 tools/diagnose.py` for system check
- Check logs for debugging information

## 🆘 Getting Help

If you encounter issues:

1. Run diagnostics:

   ```bash
   python3 tools/diagnose.py
   ```

2. Check the logs in the terminal output
3. Verify your audio format and sample rate
4. Review the troubleshooting section above
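
To verify the audio format and sample rate of a WAV file before transcribing it, a small check along these lines may help. This is a minimal sketch, not part of the project: `check_wav` is a hypothetical helper name, and the defaults (16000 Hz, mono) are taken from the recommended values in this guide. It relies only on Python's standard `wave` module, which reads PCM WAV files and may reject other encodings.

```python
import wave


def check_wav(path, expected_rate=16000, expected_channels=1):
    """Return a list of format problems; an empty list means the file matches.

    Hypothetical helper: the defaults mirror the 16000 Hz mono
    recommendation in this guide, not any check done by the project itself.
    """
    problems = []
    # wave.open raises wave.Error if the file is not a PCM WAV it can parse.
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
    if rate != expected_rate:
        problems.append(f"sample rate is {rate} Hz, expected {expected_rate} Hz")
    if channels != expected_channels:
        problems.append(f"file has {channels} channel(s), expected {expected_channels}")
    return problems
```

Calling `check_wav("meeting.wav")` returns an empty list when the file is already 16 kHz mono. A non-empty list suggests re-running the ffmpeg or sox conversion shown in the audio requirements section.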