Decided on Parakeet ONNX Runtime. Works pretty great. Realtime voice chat possible now. UX lacking.

2026-01-19 00:29:44 +02:00
parent 0a8910fff8
commit 362108f4b0
34 changed files with 4593 additions and 73 deletions
--- a/stt-parakeet/CLIENT_GUIDE.md
+++ b/stt-parakeet/CLIENT_GUIDE.md
@@ -0,0 +1,303 @@
+# Server & Client Usage Guide
+
+## ✅ Server is Working!
+
+The WebSocket server is running on port **8766** with GPU acceleration.
+
+## Quick Start
+
+### 1. Start the Server
+
+```bash
+./run.sh server/ws_server.py
+```
+
+Server will start on: `ws://localhost:8766`
+
+### 2. Test with Simple Client
+
+```bash
+./run.sh test_client.py test.wav
+```
+
+### 3. Use Microphone Client
+
+```bash
+# List audio devices first
+./run.sh client/mic_stream.py --list-devices
+
+# Start streaming from microphone
+./run.sh client/mic_stream.py
+
+# Or specify device
+./run.sh client/mic_stream.py --device 0
+```
+
+## Available Clients
+
+### 1. **test_client.py** - Simple File Testing
+```bash
+./run.sh test_client.py your_audio.wav
+```
+- Sends audio file to server
+- Shows real-time transcription
+- Good for testing
+
+### 2. **client/mic_stream.py** - Live Microphone
+```bash
+./run.sh client/mic_stream.py
+```
+- Captures from microphone
+- Streams to server
+- Real-time transcription display
+
+### 3. **Custom Client** - Your Own Script
+
+```python
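+import asyncio
+import json
+
+import soundfile as sf
+import websockets
+
+async def connect():
+    # Load samples as int16; the file must already be 16 kHz mono
+    your_audio_data, _ = sf.read("test.wav", dtype='int16')
+
+    async with websockets.connect("ws://localhost:8766") as ws:
+        # Send audio as int16 PCM bytes, then ask the server to process its buffer
+        audio_bytes = your_audio_data.astype('int16').tobytes()
+        await ws.send(audio_bytes)
+        await ws.send(json.dumps({"type": "final"}))
+        
+        # The server also sends "info" and "error" messages,
+        # so read until a transcript arrives
+        async for response in ws:
+            result = json.loads(response)
+            if result.get('type') == 'transcript':
+                print(result['text'])
+                break
+            if result.get('type') == 'error':
+                print('Server error:', result['message'])
+                break
+
+asyncio.run(connect())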
+```
+
+## Server Options
+
+```bash
+# Custom host/port
+./run.sh server/ws_server.py --host 0.0.0.0 --port 9000
+
+# Enable VAD (voice activity detection, for long audio)
+./run.sh server/ws_server.py --use-vad
+
+# Different model
+./run.sh server/ws_server.py --model nemo-parakeet-tdt-0.6b-v3
+
+# Change sample rate
+./run.sh server/ws_server.py --sample-rate 16000
+```
+
+## Client Options
+
+### Microphone Client
+```bash
+# List devices
+./run.sh client/mic_stream.py --list-devices
+
+# Use specific device
+./run.sh client/mic_stream.py --device 2
+
+# Custom server URL
+./run.sh client/mic_stream.py --url ws://192.168.1.100:8766
+
+# Adjust chunk duration (lower = lower latency)
+./run.sh client/mic_stream.py --chunk-duration 0.05
+```
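
To see what a given `--chunk-duration` means on the wire, the arithmetic can be sketched as below. This assumes the flag is expressed in seconds and that audio is 16 kHz mono int16, as described under Audio Format Requirements; `chunk_bytes` is a hypothetical helper for illustration, not part of the shipped client.

```python
SAMPLE_RATE = 16000   # Hz, from the audio format requirements
BYTES_PER_SAMPLE = 2  # int16, mono


def chunk_bytes(duration_s: float, sample_rate: int = SAMPLE_RATE) -> int:
    """Size in bytes of one mono int16 chunk lasting duration_s seconds."""
    samples = round(duration_s * sample_rate)
    return samples * BYTES_PER_SAMPLE


if __name__ == "__main__":
    for duration in (0.05, 0.1):
        print(f"{duration:.2f} s -> {chunk_bytes(duration)} bytes per message")
```

Shorter chunks mean the server sees audio sooner, at the cost of more WebSocket messages per second.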
+
+## Protocol
+
+The server exchanges JSON text messages for status, transcripts, and commands. Audio itself is sent as binary WebSocket frames.
+
+### Server → Client Messages
+
+```json
+{
+  "type": "info",
+  "message": "Connected to ASR server",
+  "sample_rate": 16000
+}
+```
+
+```json
+{
+  "type": "transcript",
+  "text": "transcribed text here",
+  "is_final": false
+}
+```
+
+```json
+{
+  "type": "error",
+  "message": "error description"
+}
+```
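
A client has to tell these three message types apart before it can use a transcript. A minimal sketch of that dispatch follows; `handle_message` is a hypothetical name, and treating unrecognised types like `info` (silently ignored) is an assumption rather than documented server behavior.

```python
import json
from typing import Optional


def handle_message(raw: str) -> Optional[str]:
    """Return the transcript text, None for non-transcript messages.

    Raises RuntimeError when the server reports an error.
    """
    msg = json.loads(raw)
    kind = msg.get("type")
    if kind == "transcript":
        return msg["text"]
    if kind == "error":
        raise RuntimeError(msg.get("message", "unknown server error"))
    # "info" (and, by assumption, any unrecognised type) carries no transcript
    return None
```

In a receive loop this would be called on every incoming text frame, printing or storing the result only when it is not `None`.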
+
+### Client → Server Messages
+
+**Send audio:**
+- Binary data (int16 PCM, little-endian)
+- Sample rate: 16000 Hz
+- Mono channel
+
+**Send commands** (JSON text messages):
+- `{"type": "final"}`: process the remaining buffer
+- `{"type": "reset"}`: reset the audio buffer
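
Putting the two client message kinds together, one utterance is a run of binary audio frames followed by a `final` command. The generator below is a sketch of that sequence; `utterance_frames` and the 3200-byte default (0.1 s at 16 kHz int16 mono) are illustrative choices, not something the server requires.

```python
import json
from typing import Iterator, Union


def utterance_frames(pcm: bytes, chunk_size: int = 3200) -> Iterator[Union[bytes, str]]:
    """Yield binary audio frames, then the JSON "final" command as text."""
    for start in range(0, len(pcm), chunk_size):
        yield pcm[start:start + chunk_size]
    # A str is sent as a text frame, which is how commands are distinguished
    yield json.dumps({"type": "final"})
```

With the `websockets` library, each item can be passed straight to `await ws.send(frame)`: `bytes` go out as binary frames and `str` as text frames.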
+
+## Audio Format Requirements
+
+- **Format**: int16 PCM (bytes)
+- **Sample Rate**: 16000 Hz
+- **Channels**: Mono (1)
+- **Byte Order**: Little-endian
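
Before streaming a file, these requirements can be checked with the standard-library `wave` module. The sketch below is one way to do it; `wav_problems` is a hypothetical helper, and it only handles PCM WAV files (WAV data is little-endian by definition, so byte order needs no separate check).

```python
import io
import wave


def wav_problems(path_or_file) -> list:
    """List the ways a PCM WAV file deviates from 16 kHz mono 16-bit."""
    problems = []
    with wave.open(path_or_file, "rb") as wf:
        if wf.getframerate() != 16000:
            problems.append(f"sample rate is {wf.getframerate()} Hz, expected 16000")
        if wf.getnchannels() != 1:
            problems.append(f"{wf.getnchannels()} channels, expected 1")
        if wf.getsampwidth() != 2:
            problems.append(f"{8 * wf.getsampwidth()}-bit samples, expected 16-bit")
    return problems


if __name__ == "__main__":
    # Build a small conforming WAV in memory and check it
    buf = io.BytesIO()
    with wave.open(buf, "wb") as out:
        out.setnchannels(1)
        out.setsampwidth(2)
        out.setframerate(16000)
        out.writeframes(b"\x00\x00" * 160)
    buf.seek(0)
    print(wav_problems(buf) or "format OK")
```

An empty list means the file can be sent as-is; anything else points at the conversion step that is still needed.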
+
+### Convert Audio in Python
+
+```python
+import numpy as np
+import soundfile as sf
+
+# Load audio
+audio, sr = sf.read("file.wav", dtype='float32')
+
+# Convert to mono by averaging the channels
+if audio.ndim > 1:
+    audio = audio.mean(axis=1)
+
+# Resample if needed (install resampy)
+if sr != 16000:
+    import resampy
+    audio = resampy.resample(audio, sr, 16000)
+
+# Convert to little-endian int16 for sending; clip first so
+# out-of-range samples don't wrap around
+audio_int16 = (np.clip(audio, -1.0, 1.0) * 32767).astype('<i2')
+audio_bytes = audio_int16.tobytes()
+```
+
+## Examples
+
+### Browser Client (JavaScript)
+
+```javascript
+const ws = new WebSocket('ws://localhost:8766');
+
+ws.onopen = () => {
+    console.log('Connected!');
+    
+    // Capture from microphone
+    navigator.mediaDevices.getUserMedia({ audio: true })
+        .then(stream => {
+            const audioContext = new AudioContext({ sampleRate: 16000 });
+            const source = audioContext.createMediaStreamSource(stream);
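+            // ScriptProcessorNode is deprecated in favor of AudioWorklet,
+            // but it keeps this example short
+            const processor = audioContext.createScriptProcessor(4096, 1, 1);
+            
+            processor.onaudioprocess = (e) => {
+                // Skip audio captured while the socket is not open
+                if (ws.readyState !== WebSocket.OPEN) return;
+                const audioData = e.inputBuffer.getChannelData(0);
+                // Convert float32 to int16
+                const int16Data = new Int16Array(audioData.length);
+                for (let i = 0; i < audioData.length; i++) {
+                    int16Data[i] = Math.max(-32768, Math.min(32767, audioData[i] * 32768));
+                }
+                ws.send(int16Data.buffer);
+            };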
+            
+            source.connect(processor);
+            processor.connect(audioContext.destination);
+        });
+};
+
+ws.onmessage = (event) => {
+    const data = JSON.parse(event.data);
+    if (data.type === 'transcript') {
+        console.log('Transcription:', data.text);
+    }
+};
+```
+
+### Python Script Client
+
+```python
+#!/usr/bin/env python3
+import asyncio
+import websockets
+import sounddevice as sd
+import numpy as np
+import json
+
+async def stream_microphone():
+    uri = "ws://localhost:8766"
+    loop = asyncio.get_running_loop()
+    
+    async with websockets.connect(uri) as ws:
+        print("Connected!")
+        
+        def audio_callback(indata, frames, time, status):
+            # This runs on sounddevice's audio thread, not the event loop
+            # thread, so schedule the send with run_coroutine_threadsafe
+            audio = (np.clip(indata[:, 0], -1.0, 1.0) * 32767).astype(np.int16)
+            asyncio.run_coroutine_threadsafe(ws.send(audio.tobytes()), loop)
+        
+        # Start recording
+        with sd.InputStream(callback=audio_callback,
+                           channels=1,
+                           samplerate=16000,
+                           blocksize=1600):  # 0.1 second chunks
+            
+            while True:
+                response = await ws.recv()
+                data = json.loads(response)
+                if data.get('type') == 'transcript':
+                    print(f"→ {data['text']}")
+
+asyncio.run(stream_microphone())
+```
+
+## Performance
+
+With GPU (GTX 1660):
+- **Latency**: <100ms per chunk
+- **Throughput**: ~50-100x realtime
+- **GPU Memory**: ~1.3GB
+- **Languages**: 25+ (auto-detected)
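
To put the throughput figure in concrete terms, "Nx realtime" means N seconds of audio are transcribed per second of compute. The arithmetic is sketched below; `processing_seconds` is a hypothetical helper, and the speedups plugged in are simply the ends of the range quoted above.

```python
def processing_seconds(audio_seconds: float, speedup: float) -> float:
    """Compute time needed for audio_seconds of audio at `speedup`x realtime."""
    return audio_seconds / speedup


if __name__ == "__main__":
    for speedup in (50, 100):
        t = processing_seconds(60, speedup)
        print(f"60 s of audio at {speedup}x realtime: {t:.1f} s of compute")
```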
+
+## Troubleshooting
+
+### Server won't start
+```bash
+# Check if port is in use
+lsof -i:8766
+
+# Kill existing server
+pkill -f ws_server.py
+
+# Restart
+./run.sh server/ws_server.py
+```
+
+### Client can't connect
+```bash
+# Check server is running
+ps aux | grep ws_server
+
+# Check firewall
+sudo ufw allow 8766
+```
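
The same reachability check can be done from Python, which helps when the client runs on a machine without `lsof` or shell access to the server. This is a sketch using only the standard library; `port_is_open` is a hypothetical helper, and it only confirms that something accepts TCP connections on the port, not that it is the ASR server.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """True if a TCP connection to host:port succeeds within timeout seconds."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(timeout)
        # connect_ex returns 0 on success and an errno value otherwise
        return s.connect_ex((host, port)) == 0


if __name__ == "__main__":
    print("reachable" if port_is_open("localhost", 8766) else "not reachable")
```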
+
+### No transcription output
+- Check audio format (must be int16 PCM, 16kHz, mono)
+- Check chunk size (not too small; the examples in this guide use 0.05 to 0.1 s chunks)
+- Check server logs for errors
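
When checking the audio format, it can help to inspect a raw chunk just before it is sent. The sketch below reports duration and peak level for a buffer, assuming it is mono int16 little-endian PCM as required above; `pcm_stats` is a hypothetical helper, not part of the shipped clients.

```python
import struct


def pcm_stats(data: bytes, sample_rate: int = 16000) -> dict:
    """Duration and peak level of a mono int16 little-endian PCM buffer."""
    if len(data) % 2:
        raise ValueError("odd byte count: not whole int16 samples")
    samples = [s for (s,) in struct.iter_unpack("<h", data)]
    peak = max((abs(s) for s in samples), default=0)
    return {
        "seconds": len(samples) / sample_rate,
        "peak": peak / 32768,  # 0.0 is digital silence, 1.0 is full scale
    }
```

A peak of 0.0 suggests a muted or wrong input device, and a duration that does not match the intended chunk length suggests the wrong sample rate or sample width.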
+
+### GPU not working
+- Server will fall back to CPU automatically
+- Check `nvidia-smi` for GPU status
+- Verify CUDA libraries are loaded (should be automatic with `./run.sh`)
+
+## Next Steps
+
+1. **Test the server**: `./run.sh test_client.py test.wav`
+2. **Try microphone**: `./run.sh client/mic_stream.py`
+3. **Build your own client** using the examples above
+
+Happy transcribing! 🎤