Quick Start Guide

🚀 Getting Started in 5 Minutes

1. Setup Environment

# Make setup script executable and run it
chmod +x setup_env.sh
./setup_env.sh

The setup script will:

  • Create a virtual environment
  • Install all dependencies including onnx-asr
  • Check CUDA/GPU availability
  • Run system diagnostics
  • Optionally download the Parakeet model

2. Activate Virtual Environment

source venv/bin/activate

3. Test Your Setup

Run diagnostics to verify everything is working:

python3 tools/diagnose.py

The output should show:

  • ✓ Python 3.10+
  • ✓ onnx-asr installed
  • ✓ CUDAExecutionProvider available
  • ✓ GPU detected

4. Test Offline Transcription

Create a test audio file or use an existing WAV file:

python3 tools/test_offline.py test.wav

5. Start Real-Time Streaming

Terminal 1 - Start Server:

python3 server/ws_server.py

Terminal 2 - Start Client:

# List audio devices first
python3 client/mic_stream.py --list-devices

# Start streaming with your microphone
python3 client/mic_stream.py --device 0

🎯 Common Commands

Offline Transcription

# Basic transcription
python3 tools/test_offline.py audio.wav

# With Voice Activity Detection (for long files)
python3 tools/test_offline.py audio.wav --use-vad

# With quantization (faster, uses less memory)
python3 tools/test_offline.py audio.wav --quantization int8

WebSocket Server

# Start server on default port (8765)
python3 server/ws_server.py

# Custom host and port
python3 server/ws_server.py --host 0.0.0.0 --port 9000

# With VAD enabled
python3 server/ws_server.py --use-vad

Microphone Client

# List available audio devices
python3 client/mic_stream.py --list-devices

# Connect to server
python3 client/mic_stream.py --url ws://localhost:8765

# Use specific device
python3 client/mic_stream.py --device 2

# Custom sample rate
python3 client/mic_stream.py --sample-rate 16000
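Device indices can shift between reboots, so matching by name is often more robust than hard-coding a `--device` number. A small helper sketch (it assumes the device list is a plain list of name strings, a simplification of whatever `--list-devices` prints):

```python
def find_input_device(device_names: list[str], substring: str) -> int:
    """Return the index of the first device whose name contains the
    given substring (case-insensitive)."""
    needle = substring.lower()
    for index, name in enumerate(device_names):
        if needle in name.lower():
            return index
    raise LookupError(f"no audio device matching {substring!r}")
```

For example, `find_input_device(["HDA Intel PCH", "USB Microphone"], "usb")` returns 1.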

🔧 Troubleshooting

GPU Not Detected

  1. Check NVIDIA driver:

    nvidia-smi
    
  2. Check CUDA version:

    nvcc --version
    
  3. Verify ONNX Runtime can see GPU:

    python3 -c "import onnxruntime as ort; print(ort.get_available_providers())"
    

    The output should include CUDAExecutionProvider
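If CUDAExecutionProvider is missing, ONNX Runtime silently falls back to the CPU. A sketch of an explicit preference check you could add to the pipeline (the provider names are real ONNX Runtime identifiers; the preference order is an assumption):

```python
def pick_provider(available: list[str]) -> str:
    """Choose the best available execution provider, preferring GPU."""
    for preferred in ("TensorrtExecutionProvider",
                      "CUDAExecutionProvider",
                      "CPUExecutionProvider"):
        if preferred in available:
            return preferred
    raise RuntimeError(f"no usable provider in {available}")
```

In practice you would pass it `ort.get_available_providers()` from onnxruntime.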

Out of Memory

If you get CUDA out of memory errors:

  1. Use quantization:

    python3 tools/test_offline.py audio.wav --quantization int8
    
  2. Close other GPU applications

  3. Reduce GPU memory limit in asr/asr_pipeline.py:

    "gpu_mem_limit": 4 * 1024 * 1024 * 1024,  # 4GB instead of 6GB
    

Microphone Not Working

  1. Check permissions:

    sudo usermod -a -G audio $USER
    # Then logout and login again
    
  2. Test with system audio recorder first

  3. List and test devices:

    python3 client/mic_stream.py --list-devices
    

Model Download Fails

If Hugging Face is slow or blocked:

  1. Set an HF token (optional; useful for gated models and to avoid rate limits):

    export HF_TOKEN="your_huggingface_token"
    
  2. Manual download:

    # With the Hugging Face CLI (assumes huggingface_hub is installed):
    huggingface-cli download istupakov/parakeet-tdt-0.6b-v3-onnx --local-dir models/parakeet
    # Or download from: https://huggingface.co/istupakov/parakeet-tdt-0.6b-v3-onnx
    # and extract to: models/parakeet/
    

📊 Performance Tips

For Best GPU Performance

  1. Use the TensorRT provider (typically faster than CUDA):

    pip install tensorrt tensorrt-cu12-libs
    

    Then edit asr/asr_pipeline.py to use the TensorRT provider

  2. Enable FP16 precision (TensorRT provider only):

    providers = [
        ("TensorrtExecutionProvider", {
            "trt_fp16_enable": True,
        })
    ]
    
  3. Enable quantization:

    --quantization int8  # Good balance
    --quantization fp16  # Better quality
    

For Lower Latency Streaming

  1. Reduce chunk duration in client:

    python3 client/mic_stream.py --chunk-duration 0.05
    
  2. Disable VAD for shorter responses

  3. Use quantized model for faster processing
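For 16-bit mono PCM at 16 kHz, the chunk duration translates directly into bytes sent per WebSocket message. A quick sketch of the arithmetic:

```python
def chunk_bytes(sample_rate: int, chunk_duration: float,
                bytes_per_sample: int = 2) -> int:
    """Bytes of mono PCM audio in one streaming chunk."""
    return int(sample_rate * chunk_duration) * bytes_per_sample
```

At 16 kHz, a 0.05 s chunk is 800 samples, i.e. 1600 bytes per message.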

🎤 Audio File Requirements

Supported Formats

  • Format: WAV (PCM_16, PCM_24, PCM_32, PCM_U8)
  • Sample Rate: 16000 Hz (recommended)
  • Channels: Mono (stereo will be converted to mono)
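The requirements above can be checked with Python's standard wave module before sending a file to the transcriber. A sketch (it handles plain PCM WAVs only, not compressed variants):

```python
import wave

def check_wav(path: str) -> list[str]:
    """Return a list of problems; an empty list means the file matches
    the recommended format (16 kHz, mono)."""
    problems = []
    with wave.open(path, "rb") as wav:
        if wav.getframerate() != 16000:
            problems.append(f"sample rate is {wav.getframerate()} Hz, expected 16000")
        if wav.getnchannels() != 1:
            problems.append(f"{wav.getnchannels()} channels, expected mono")
    return problems
```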

Convert Audio Files

# Using ffmpeg
ffmpeg -i input.mp3 -ar 16000 -ac 1 output.wav

# Using sox
sox input.mp3 -r 16000 -c 1 output.wav
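If ffmpeg or sox is unavailable, stereo-to-mono downmixing is simple enough to do in pure Python by averaging the interleaved left/right samples. A sketch for 16-bit PCM only (it does not resample):

```python
from array import array

def stereo_to_mono(pcm16: bytes) -> bytes:
    """Average interleaved L/R int16 samples into a mono int16 stream."""
    samples = array("h")
    samples.frombytes(pcm16)
    mono = array("h", ((samples[i] + samples[i + 1]) // 2
                       for i in range(0, len(samples), 2)))
    return mono.tobytes()
```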

📝 Example Workflow

Complete example for transcribing a meeting recording:

# 1. Activate environment
source venv/bin/activate

# 2. Convert audio to correct format
ffmpeg -i meeting.mp3 -ar 16000 -ac 1 meeting.wav

# 3. Transcribe with VAD (for long recordings)
python3 tools/test_offline.py meeting.wav --use-vad

# Output will show transcription with automatic segmentation

🌐 Supported Languages

The Parakeet TDT 0.6B V3 model supports 25 European languages, including:

  • English
  • Spanish
  • French
  • German
  • Italian
  • Portuguese
  • Russian
  • Ukrainian
  • Polish
  • Dutch
  • And more...

The model automatically detects the language.

💡 Tips

  1. For short audio clips (<30 seconds): Don't use VAD
  2. For long audio files: Use --use-vad flag
  3. For real-time streaming: Keep chunks small (0.1-0.5 seconds)
  4. For best accuracy: Use 16kHz mono WAV files
  5. For faster inference: Use --quantization int8

📚 More Information

  • See README.md for detailed documentation
  • Run python3 tools/diagnose.py for system check
  • Check logs for debugging information

🆘 Getting Help

If you encounter issues:

  1. Run diagnostics:

    python3 tools/diagnose.py
    
  2. Check the logs in the terminal output

  3. Verify your audio format and sample rate

  4. Review the troubleshooting section above