# Quick Start Guide

## 🚀 Getting Started in 5 Minutes

### 1. Setup Environment

```bash
# Make the setup script executable and run it
chmod +x setup_env.sh
./setup_env.sh
```

The setup script will:

- Create a virtual environment
- Install all dependencies including `onnx-asr`
- Check CUDA/GPU availability
- Run system diagnostics
- Optionally download the Parakeet model

### 2. Activate Virtual Environment

```bash
source venv/bin/activate
```

### 3. Test Your Setup

Run diagnostics to verify that everything is working:

```bash
python3 tools/diagnose.py
```

The expected output should show:

- ✓ Python 3.10+
- ✓ onnx-asr installed
- ✓ CUDAExecutionProvider available
- ✓ GPU detected

### 4. Test Offline Transcription

Create a test audio file or use an existing WAV file:

```bash
python3 tools/test_offline.py test.wav
```
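
If you don't have a WAV file handy, a short tone generated with Python's standard library works as a smoke test. This is just a sketch: the audio content doesn't matter, only that the file is 16 kHz mono PCM_16, which is what the pipeline expects.

```python
import math
import struct
import wave


def write_test_wav(path: str, seconds: float = 2.0, rate: int = 16000) -> None:
    """Write a mono 16-bit PCM WAV containing a quiet 440 Hz tone."""
    n = int(seconds * rate)
    frames = b"".join(
        struct.pack("<h", int(32767 * 0.3 * math.sin(2 * math.pi * 440 * i / rate)))
        for i in range(n)
    )
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)     # mono
        wf.setsampwidth(2)     # 2 bytes per sample = PCM_16
        wf.setframerate(rate)  # 16 kHz, the pipeline's expected rate
        wf.writeframes(frames)


if __name__ == "__main__":
    write_test_wav("test.wav")
```

Run it once, then pass the resulting `test.wav` to `tools/test_offline.py` as shown above.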

### 5. Start Real-Time Streaming

**Terminal 1 - Start the server:**

```bash
python3 server/ws_server.py
```

**Terminal 2 - Start the client:**

```bash
# List audio devices first
python3 client/mic_stream.py --list-devices

# Start streaming with your microphone
python3 client/mic_stream.py --device 0
```
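
Under the hood, a streaming client like this captures audio and ships fixed-size raw PCM chunks over the socket. As a sketch of the framing only (assuming the server expects 16-bit little-endian mono PCM, the usual convention for such pipelines; this is not the project's actual client code):

```python
import struct


def pack_pcm16(samples: list[float]) -> bytes:
    """Clip float samples to [-1, 1] and pack as little-endian 16-bit PCM."""
    clipped = (max(-1.0, min(1.0, s)) for s in samples)
    return b"".join(struct.pack("<h", int(s * 32767)) for s in clipped)
```

At 16 kHz mono, a 0.1-second chunk is 1600 samples, i.e. 3200 bytes per WebSocket message.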

## 🎯 Common Commands

### Offline Transcription

```bash
# Basic transcription
python3 tools/test_offline.py audio.wav

# With voice activity detection (for long files)
python3 tools/test_offline.py audio.wav --use-vad

# With quantization (faster, uses less memory)
python3 tools/test_offline.py audio.wav --quantization int8
```

### WebSocket Server

```bash
# Start the server on the default port (8765)
python3 server/ws_server.py

# Custom host and port
python3 server/ws_server.py --host 0.0.0.0 --port 9000

# With VAD enabled
python3 server/ws_server.py --use-vad
```

### Microphone Client

```bash
# List available audio devices
python3 client/mic_stream.py --list-devices

# Connect to the server
python3 client/mic_stream.py --url ws://localhost:8765

# Use a specific device
python3 client/mic_stream.py --device 2

# Custom sample rate
python3 client/mic_stream.py --sample-rate 16000
```

## 🔧 Troubleshooting

### GPU Not Detected

1. Check the NVIDIA driver:

```bash
nvidia-smi
```

2. Check the CUDA version:

```bash
nvcc --version
```

3. Verify that ONNX Runtime can see the GPU:

```bash
python3 -c "import onnxruntime as ort; print(ort.get_available_providers())"
```

The output should include `CUDAExecutionProvider`.
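
If `CUDAExecutionProvider` is missing from that list, ONNX Runtime will silently run on CPU. A defensive pattern (a sketch, not the project's actual code) is to build the provider list from whatever is actually available, always keeping the CPU fallback last:

```python
def pick_providers(available: list[str]) -> list[str]:
    """Prefer TensorRT, then CUDA, and always keep the CPU fallback last."""
    preferred = ["TensorrtExecutionProvider", "CUDAExecutionProvider"]
    chosen = [p for p in preferred if p in available]
    return chosen + ["CPUExecutionProvider"]
```

The result can be passed as the `providers=` argument when creating an `onnxruntime.InferenceSession`, with `available` taken from `ort.get_available_providers()`.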

### Out of Memory

If you get CUDA out-of-memory errors:

1. **Use quantization:**

```bash
python3 tools/test_offline.py audio.wav --quantization int8
```

2. **Close other GPU applications.**

3. **Reduce the GPU memory limit** in `asr/asr_pipeline.py`:

```python
"gpu_mem_limit": 4 * 1024 * 1024 * 1024,  # 4 GB instead of 6 GB
```
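
For context, `gpu_mem_limit` is a per-provider option, so in the session setup it lives inside the CUDA provider's options dict. A sketch of what that typically looks like (the surrounding structure is an assumption about this project's code; the option keys and byte units are from ONNX Runtime's CUDA execution provider):

```python
GIB = 1024 ** 3  # gpu_mem_limit is specified in bytes

providers = [
    ("CUDAExecutionProvider", {
        "gpu_mem_limit": 4 * GIB,                     # cap the arena at 4 GB
        "arena_extend_strategy": "kSameAsRequested",  # grow only as requested
    }),
    "CPUExecutionProvider",                           # fallback if CUDA fails
]
```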

### Microphone Not Working

1. Check permissions:

```bash
sudo usermod -a -G audio $USER
# Then log out and log back in
```

2. Test with a system audio recorder first.

3. List and test devices:

```bash
python3 client/mic_stream.py --list-devices
```

### Model Download Fails

If Hugging Face is slow or blocked:

1. **Set an HF token** (optional, for faster downloads):

```bash
export HF_TOKEN="your_huggingface_token"
```

2. **Download manually:**

```bash
# Download from: https://huggingface.co/istupakov/parakeet-tdt-0.6b-v3-onnx
# Extract to: models/parakeet/
```

## 📊 Performance Tips

### For Best GPU Performance

1. **Use the TensorRT provider** (faster than CUDA):

```bash
pip install tensorrt tensorrt-cu12-libs
```

Then edit `asr/asr_pipeline.py` to use the TensorRT provider.

2. **Use FP16 quantization** (with TensorRT):

```python
providers = [
    ("TensorrtExecutionProvider", {
        "trt_fp16_enable": True,
    })
]
```

3. **Enable quantization:**

```bash
--quantization int8  # Good balance of speed and quality
--quantization fp16  # Better quality
```

### For Lower Latency Streaming

1. **Reduce the chunk duration** in the client:

```bash
python3 client/mic_stream.py --chunk-duration 0.05
```

2. **Disable VAD** for faster responses.

3. **Use a quantized model** for faster processing.
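
Latency scales directly with chunk duration, and so does per-message overhead. Assuming 16 kHz mono 16-bit audio (this guide's recommended format), the wire size of each chunk is simple arithmetic, which makes the cost of a flag like `--chunk-duration 0.05` concrete:

```python
def chunk_bytes(duration_s: float, rate: int = 16000, sample_width: int = 2) -> int:
    """Size in bytes of one mono PCM chunk of the given duration."""
    return int(duration_s * rate) * sample_width
```

A 0.05 s chunk is 1600 bytes sent 20 times per second; halving the chunk duration halves the buffering latency but doubles the message rate, so very small chunks trade latency for overhead.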

## 🎤 Audio File Requirements

### Supported Formats

- **Format**: WAV (PCM_16, PCM_24, PCM_32, PCM_U8)
- **Sample rate**: 16000 Hz (recommended)
- **Channels**: Mono (stereo is converted to mono)

### Convert Audio Files

```bash
# Using ffmpeg
ffmpeg -i input.mp3 -ar 16000 -ac 1 output.wav

# Using sox
sox input.mp3 -r 16000 -c 1 output.wav
```
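
Before reaching for ffmpeg, you can check whether a file already meets the requirements using only Python's standard library. A small sketch (note that `wave` only reads uncompressed PCM WAV files):

```python
import wave


def check_wav(path: str) -> dict:
    """Report the WAV properties the pipeline cares about."""
    with wave.open(path, "rb") as wf:
        return {
            "rate": wf.getframerate(),          # want 16000
            "channels": wf.getnchannels(),      # want 1 (mono)
            "sample_width": wf.getsampwidth(),  # bytes per sample; 2 = PCM_16
            "duration_s": wf.getnframes() / wf.getframerate(),
        }
```

If `rate` or `channels` don't match, convert the file with one of the commands above.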

## 📝 Example Workflow

A complete example for transcribing a meeting recording:

```bash
# 1. Activate the environment
source venv/bin/activate

# 2. Convert the audio to the correct format
ffmpeg -i meeting.mp3 -ar 16000 -ac 1 meeting.wav

# 3. Transcribe with VAD (recommended for long recordings)
python3 tools/test_offline.py meeting.wav --use-vad

# The output shows the transcription with automatic segmentation
```

## 🌐 Supported Languages

The Parakeet TDT 0.6B V3 model supports **25+ languages**, including:

- English
- Spanish
- French
- German
- Italian
- Portuguese
- Russian
- Chinese
- Japanese
- Korean
- and more

The model detects the language automatically.

## 💡 Tips

1. **For short audio clips** (under 30 seconds): skip VAD
2. **For long audio files**: use the `--use-vad` flag
3. **For real-time streaming**: keep chunks small (0.1-0.5 seconds)
4. **For best accuracy**: use 16 kHz mono WAV files
5. **For faster inference**: use `--quantization int8`

## 📚 More Information

- See `README.md` for detailed documentation
- Run `python3 tools/diagnose.py` for a system check
- Check the logs for debugging information

## 🆘 Getting Help

If you encounter issues:

1. Run diagnostics:

```bash
python3 tools/diagnose.py
```

2. Check the logs in the terminal output
3. Verify your audio format and sample rate
4. Review the troubleshooting section above