Decided on Parakeet ONNX Runtime. Works pretty great. Realtime voice chat possible now. UX lacking.
303
stt-parakeet/CLIENT_GUIDE.md
Normal file
@@ -0,0 +1,303 @@
# Server & Client Usage Guide

## ✅ Server is Working!

The WebSocket server listens on port **8766** and runs with GPU acceleration.

## Quick Start

### 1. Start the Server

```bash
./run.sh server/ws_server.py
```

Server will start on: `ws://localhost:8766`

### 2. Test with Simple Client

```bash
./run.sh test_client.py test.wav
```

### 3. Use Microphone Client

```bash
# List audio devices first
./run.sh client/mic_stream.py --list-devices

# Start streaming from microphone
./run.sh client/mic_stream.py

# Or specify device
./run.sh client/mic_stream.py --device 0
```

## Available Clients

### 1. **test_client.py** - Simple File Testing
```bash
./run.sh test_client.py your_audio.wav
```
- Sends audio file to server
- Shows real-time transcription
- Good for testing

### 2. **client/mic_stream.py** - Live Microphone
```bash
./run.sh client/mic_stream.py
```
- Captures from microphone
- Streams to server
- Real-time transcription display

### 3. **Custom Client** - Your Own Script

```python
import asyncio
import json
import websockets

async def connect(your_audio_data):
    """your_audio_data: NumPy array of 16 kHz mono samples in int16 range."""
    async with websockets.connect("ws://localhost:8766") as ws:
        # Send audio as int16 PCM bytes
        audio_bytes = your_audio_data.astype('int16').tobytes()
        await ws.send(audio_bytes)
        # Ask the server to process whatever is left in its buffer
        await ws.send(json.dumps({"type": "final"}))

        # Receive messages until a transcription arrives; other message
        # types (such as "info") may come first, see Protocol
        while True:
            result = json.loads(await ws.recv())
            if result.get('type') == 'transcript':
                print(result['text'])
                break

# Supply your own samples here
asyncio.run(connect(your_audio_data))
```

## Server Options

```bash
# Custom host/port
./run.sh server/ws_server.py --host 0.0.0.0 --port 9000

# Enable VAD (for long audio)
./run.sh server/ws_server.py --use-vad

# Different model
./run.sh server/ws_server.py --model nemo-parakeet-tdt-0.6b-v3

# Change sample rate
./run.sh server/ws_server.py --sample-rate 16000
```

## Client Options

### Microphone Client
```bash
# List devices
./run.sh client/mic_stream.py --list-devices

# Use specific device
./run.sh client/mic_stream.py --device 2

# Custom server URL
./run.sh client/mic_stream.py --url ws://192.168.1.100:8766

# Adjust chunk duration (lower = lower latency)
./run.sh client/mic_stream.py --chunk-duration 0.05
```

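The `--chunk-duration` value determines how much audio goes into each WebSocket message. As a rough sketch, assuming the flag is given in seconds and the stream uses the 16 kHz mono int16 format described under Audio Format Requirements, the chunk size works out as follows. `chunk_size_bytes` and `split_into_chunks` are illustrative helpers, not part of `mic_stream.py`:

```python
from typing import List

SAMPLE_RATE = 16000    # Hz, the rate the server expects
BYTES_PER_SAMPLE = 2   # int16 PCM


def chunk_size_bytes(chunk_duration: float, sample_rate: int = SAMPLE_RATE) -> int:
    """Bytes in one mono int16 chunk lasting chunk_duration seconds (assumed unit)."""
    size = int(round(chunk_duration * sample_rate)) * BYTES_PER_SAMPLE
    if size <= 0:
        raise ValueError("chunk_duration is too small to hold a single sample")
    return size


def split_into_chunks(pcm: bytes, chunk_duration: float) -> List[bytes]:
    """Split raw PCM bytes into consecutive chunks; the last one may be shorter."""
    size = chunk_size_bytes(chunk_duration)
    return [pcm[i:i + size] for i in range(0, len(pcm), size)]
```

Under these assumptions, `--chunk-duration 0.05` corresponds to 800 samples (1600 bytes) per message, and `0.1` to 1600 samples (3200 bytes).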
## Protocol

The server uses a simple protocol: clients send binary audio and JSON commands, and the server replies with JSON messages.

### Server → Client Messages

```json
{
  "type": "info",
  "message": "Connected to ASR server",
  "sample_rate": 16000
}
```

```json
{
  "type": "transcript",
  "text": "transcribed text here",
  "is_final": false
}
```

```json
{
  "type": "error",
  "message": "error description"
}
```

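A client typically needs to branch on the `type` field of each incoming message. The following is a minimal sketch of such a dispatcher, based only on the three message shapes shown above; `handle_server_message` is a hypothetical name, and the real clients may handle messages differently:

```python
import json
from typing import Optional, Tuple


def handle_server_message(raw: str) -> Optional[Tuple[str, bool]]:
    """Return (text, is_final) for "transcript" messages and None for "info".

    Raises RuntimeError for "error" messages and ValueError for unknown types.
    """
    msg = json.loads(raw)
    kind = msg.get("type")
    if kind == "transcript":
        # is_final defaults to False if the field is missing
        return msg["text"], bool(msg.get("is_final", False))
    if kind == "info":
        # Connection banner; carries the expected sample rate
        return None
    if kind == "error":
        raise RuntimeError(msg.get("message", "unknown server error"))
    raise ValueError(f"unexpected message type: {kind!r}")
```

Raising on `error` messages keeps failures visible instead of silently skipping them; a long-running client might prefer to log and continue.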
### Client → Server Messages

**Send audio:**
- Binary data (int16 PCM, little-endian)
- Sample rate: 16000 Hz
- Mono channel

**Send commands:**
```json
{"type": "final"}  // Process remaining buffer
{"type": "reset"}  // Reset audio buffer
```

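A client might wrap the two commands in a small helper. The sketch below assumes commands travel as JSON text while audio travels as binary data, which this guide implies but does not state outright; `make_command` is a hypothetical name:

```python
import json

VALID_COMMANDS = {"final", "reset"}  # the two commands listed above


def make_command(kind: str) -> str:
    """Serialize a control command as a JSON string ready to send."""
    if kind not in VALID_COMMANDS:
        raise ValueError(f"unknown command: {kind!r}")
    return json.dumps({"type": kind})
```

With the `websockets` library, `await ws.send(make_command("final"))` goes out as a text frame because the argument is a `str`, while `bytes` arguments such as PCM audio go out as binary frames.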
## Audio Format Requirements

- **Format**: int16 PCM (bytes)
- **Sample Rate**: 16000 Hz
- **Channels**: Mono (1)
- **Byte Order**: Little-endian

### Convert Audio in Python

```python
import numpy as np
import soundfile as sf

# Load audio as float32 samples in the range [-1.0, 1.0]
audio, sr = sf.read("file.wav", dtype='float32')

# Convert to mono (keeps the first channel)
if audio.ndim > 1:
    audio = audio[:, 0]

# Resample if needed (install resampy)
if sr != 16000:
    import resampy
    audio = resampy.resample(audio, sr, 16000)

# Convert to little-endian int16 for sending; clip first so that
# out-of-range samples do not wrap around
audio_int16 = (np.clip(audio, -1.0, 1.0) * 32767).astype('<i2')
audio_bytes = audio_int16.tobytes()
```

## Examples

### Browser Client (JavaScript)

```javascript
const ws = new WebSocket('ws://localhost:8766');

ws.onopen = () => {
    console.log('Connected!');

    // Capture from microphone
    navigator.mediaDevices.getUserMedia({ audio: true })
        .then(stream => {
            const audioContext = new AudioContext({ sampleRate: 16000 });
            const source = audioContext.createMediaStreamSource(stream);
            // createScriptProcessor is deprecated in favor of AudioWorklet,
            // but it keeps this example short
            const processor = audioContext.createScriptProcessor(4096, 1, 1);

            processor.onaudioprocess = (e) => {
                const audioData = e.inputBuffer.getChannelData(0);
                // Convert float32 to int16
                const int16Data = new Int16Array(audioData.length);
                for (let i = 0; i < audioData.length; i++) {
                    int16Data[i] = Math.max(-32768, Math.min(32767, audioData[i] * 32768));
                }
                ws.send(int16Data.buffer);
            };

            source.connect(processor);
            processor.connect(audioContext.destination);
        });
};

ws.onmessage = (event) => {
    const data = JSON.parse(event.data);
    if (data.type === 'transcript') {
        console.log('Transcription:', data.text);
    }
};
```

### Python Script Client

```python
#!/usr/bin/env python3
import asyncio
import websockets
import sounddevice as sd
import numpy as np
import json

async def stream_microphone():
    uri = "ws://localhost:8766"

    async with websockets.connect(uri) as ws:
        print("Connected!")
        loop = asyncio.get_running_loop()

        def audio_callback(indata, frames, time, status):
            # Runs on the audio thread, where asyncio.create_task would fail:
            # convert to int16 and hand the send back to the event loop
            audio = (indata[:, 0] * 32767).astype(np.int16)
            asyncio.run_coroutine_threadsafe(ws.send(audio.tobytes()), loop)

        # Start recording
        with sd.InputStream(callback=audio_callback,
                            channels=1,
                            samplerate=16000,
                            blocksize=1600):  # 0.1 second chunks

            while True:
                response = await ws.recv()
                data = json.loads(response)
                if data.get('type') == 'transcript':
                    print(f"→ {data['text']}")

asyncio.run(stream_microphone())
```

## Performance

With GPU (GTX 1660):
- **Latency**: <100 ms per chunk
- **Throughput**: ~50-100x realtime
- **GPU Memory**: ~1.3 GB
- **Languages**: 25+ (auto-detected)

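To put the throughput figure in perspective: at N× realtime, a clip lasting T seconds takes roughly T/N seconds of compute. A quick back-of-the-envelope helper (illustrative only; it ignores network transfer and model load time):

```python
def processing_time(audio_seconds: float, realtime_factor: float) -> float:
    """Approximate compute time in seconds for a clip at a given realtime factor."""
    if realtime_factor <= 0:
        raise ValueError("realtime_factor must be positive")
    return audio_seconds / realtime_factor


# A 60-second clip at the quoted 50-100x realtime
slowest = processing_time(60, 50)    # about 1.2 seconds
fastest = processing_time(60, 100)   # about 0.6 seconds
```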
## Troubleshooting

### Server won't start
```bash
# Check if port is in use
lsof -i:8766

# Kill existing server
pkill -f ws_server.py

# Restart
./run.sh server/ws_server.py
```

### Client can't connect
```bash
# Check server is running
ps aux | grep ws_server

# Allow the port through the firewall (ufw)
sudo ufw allow 8766
```

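When the client runs on a different machine, it can help to test reachability from the client side. The sketch below uses only the standard library and checks that a TCP connection can be opened; it does not perform a WebSocket handshake, so a `True` result only shows that something is listening on the port. `port_is_open` is a hypothetical helper name:

```python
import socket


def port_is_open(host: str = "localhost", port: int = 8766, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable hosts
        return False
```

For example, `port_is_open("192.168.1.100")` run on the client machine distinguishes a network or firewall problem from a client-side bug.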
### No transcription output
- Check audio format (must be int16 PCM, 16kHz, mono)
- Check chunk size (not too small)
- Check server logs for errors

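For file-based tests, the audio format check can be automated. The sketch below uses Python's standard `wave` module to report how a WAV file deviates from the required format. `wav_format_issues` is a hypothetical helper, and whether `test_client.py` converts files on its own is not covered by this guide:

```python
import wave
from typing import List


def wav_format_issues(path: str) -> List[str]:
    """List the ways a PCM WAV file deviates from 16 kHz mono 16-bit audio."""
    issues = []
    with wave.open(path, "rb") as wf:
        if wf.getframerate() != 16000:
            issues.append(f"sample rate is {wf.getframerate()} Hz, expected 16000")
        if wf.getnchannels() != 1:
            issues.append(f"{wf.getnchannels()} channels, expected 1 (mono)")
        if wf.getsampwidth() != 2:
            issues.append(f"{8 * wf.getsampwidth()}-bit samples, expected 16-bit")
    return issues
```

An empty list means the file already matches; otherwise the conversion steps under Convert Audio in Python apply.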
### GPU not working
- Server will fall back to CPU automatically
- Check `nvidia-smi` for GPU status
- Verify CUDA libraries are loaded (should be automatic with `./run.sh`)

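Since the server is built on ONNX Runtime, one way to see whether the GPU path is available is to ask ONNX Runtime which execution providers it can use. This is a sketch under the assumption that the server relies on the standard `CUDAExecutionProvider`; it should be run in the same Python environment as the server. `cuda_provider_available` is a hypothetical helper name:

```python
from typing import Optional


def cuda_provider_available() -> Optional[bool]:
    """True/False if ONNX Runtime is importable, None if it is not installed."""
    try:
        import onnxruntime as ort
    except ImportError:
        return None
    return "CUDAExecutionProvider" in ort.get_available_providers()
```

A `False` result with a healthy `nvidia-smi` output usually points at a CPU-only ONNX Runtime build or missing CUDA libraries rather than at the GPU itself.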
## Next Steps

1. **Test the server**: `./run.sh test_client.py test.wav`
2. **Try microphone**: `./run.sh client/mic_stream.py`
3. **Build your own client** using the examples above

Happy transcribing! 🎤