# Miku STT (Speech-to-Text) Server

Real-time speech-to-text service for Miku voice chat using Silero VAD (CPU) and Faster-Whisper (GPU).

## Architecture

- **Silero VAD** (CPU): Lightweight voice activity detection, runs continuously
- **Faster-Whisper** (GPU, GTX 1660): Efficient speech transcription using CTranslate2
- **FastAPI WebSocket**: Real-time bidirectional communication

## Features

- ✅ Real-time voice activity detection with conservative settings
- ✅ Streaming partial transcripts during speech
- ✅ Final transcript on speech completion
- ✅ Interruption detection (user speaking over Miku)
- ✅ Multi-user support with isolated sessions
- ✅ KV cache optimization ready (partial text for LLM precomputation)

## API Endpoints

### WebSocket: `/ws/stt/{user_id}`

Real-time STT session for a specific user.

**Client sends:** Raw PCM audio (int16, 16kHz mono, 20ms chunks = 320 samples)
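
The chunk size follows from the format: at 16kHz, a 20ms chunk is 16000 × 0.02 = 320 samples, and int16 doubles that in bytes. A minimal sketch of packing one chunk with the stdlib `array` module (a real client would fill the buffer from microphone capture rather than silence):

```python
import array

SAMPLE_RATE = 16_000  # Hz
CHUNK_MS = 20         # chunk duration in milliseconds

# 16000 samples/s * 0.02 s = 320 samples per chunk
SAMPLES_PER_CHUNK = SAMPLE_RATE * CHUNK_MS // 1000

# "h" = signed 16-bit integers, matching the int16 PCM format
chunk = array.array("h", [0] * SAMPLES_PER_CHUNK)
payload = chunk.tobytes()  # 320 samples * 2 bytes = 640 bytes per message
print(len(payload))
```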

**Server sends:** JSON events:

```json
// VAD events
{"type": "vad", "event": "speech_start", "speaking": true, "probability": 0.85, "timestamp": 1250.5}
{"type": "vad", "event": "speaking", "speaking": true, "probability": 0.92, "timestamp": 1270.5}
{"type": "vad", "event": "speech_end", "speaking": false, "probability": 0.35, "timestamp": 3500.0}

// Transcription events
{"type": "partial", "text": "Hello how are", "user_id": "123", "timestamp": 2000.0}
{"type": "final", "text": "Hello how are you?", "user_id": "123", "timestamp": 3500.0}

// Interruption detection
{"type": "interruption", "probability": 0.92, "timestamp": 1500.0}
```
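
Each line above is a separate JSON message (the `//` comments are annotations, not part of the payload). A client might dispatch on the `type` field like this (`handle_event` is a hypothetical helper, not part of the server API):

```python
import json

def handle_event(message: str) -> str:
    """Route one server event by its 'type' field; returns a summary string."""
    event = json.loads(message)
    kind = event["type"]
    if kind == "vad":
        return f"vad:{event['event']} p={event['probability']}"
    elif kind == "partial":
        return f"partial: {event['text']}"
    elif kind == "final":
        return f"final: {event['text']}"
    elif kind == "interruption":
        return f"interruption p={event['probability']}"
    return f"unknown event type: {kind}"

print(handle_event('{"type": "partial", "text": "Hello how are", "user_id": "123", "timestamp": 2000.0}'))
```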

### HTTP GET: `/health`

Health check with model status.

**Response:**

```json
{
  "status": "healthy",
  "models": {
    "vad": {"loaded": true, "device": "cpu"},
    "whisper": {"loaded": true, "model": "small", "device": "cuda"}
  },
  "sessions": {
    "active": 2,
    "users": ["user123", "user456"]
  }
}
```

## Configuration

### VAD Parameters (Conservative)

- **Threshold**: 0.5 (speech probability)
- **Min speech duration**: 250ms (avoid false triggers)
- **Min silence duration**: 500ms (don't cut off mid-sentence)
- **Speech padding**: 30ms (context around speech)
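
These durations amount to debouncing the raw per-chunk probabilities. A minimal sketch of that state machine in pure Python (not the Silero API itself; a real session would feed it Silero's per-chunk probability every 20ms):

```python
THRESHOLD = 0.5                  # speech probability cutoff
MIN_SPEECH_CHUNKS = 250 // 20    # 250ms -> 12 consecutive chunks to confirm speech
MIN_SILENCE_CHUNKS = 500 // 20   # 500ms -> 25 consecutive chunks to confirm end

class VadGate:
    """Debounces per-chunk probabilities into speech_start/speech_end events."""

    def __init__(self):
        self.speaking = False
        self.run = 0  # consecutive chunks above (or below) the threshold

    def update(self, probability: float):
        """Process one 20ms chunk; returns 'speech_start', 'speech_end', or None."""
        voiced = probability >= THRESHOLD
        if not self.speaking:
            self.run = self.run + 1 if voiced else 0
            if self.run >= MIN_SPEECH_CHUNKS:
                self.speaking, self.run = True, 0
                return "speech_start"
        else:
            self.run = self.run + 1 if not voiced else 0
            if self.run >= MIN_SILENCE_CHUNKS:
                self.speaking, self.run = False, 0
                return "speech_end"
        return None
```

A brief burst of noise shorter than 250ms never reaches `MIN_SPEECH_CHUNKS`, and a mid-sentence pause shorter than 500ms never reaches `MIN_SILENCE_CHUNKS`, which is exactly what "conservative" means here.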

### Whisper Parameters

- **Model**: small (balanced speed/quality, ~500MB VRAM)
- **Compute**: float16 (GPU optimization)
- **Language**: en (English)
- **Beam size**: 5 (quality/speed balance)
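
With faster-whisper, these map onto the model constructor and the transcribe call roughly as follows (a sketch, not the server's actual code; the GPU call is shown commented out so the snippet stays self-contained):

```python
# Model/device settings (WhisperModel constructor arguments)
MODEL_OPTS = {"device": "cuda", "compute_type": "float16"}

# Per-request decoding settings (transcribe arguments)
TRANSCRIBE_OPTS = {"language": "en", "beam_size": 5}

# With faster-whisper installed and a GPU available, this would be roughly:
#   from faster_whisper import WhisperModel
#   model = WhisperModel("small", **MODEL_OPTS)
#   segments, info = model.transcribe(audio, **TRANSCRIBE_OPTS)
#   text = "".join(segment.text for segment in segments)
print(MODEL_OPTS, TRANSCRIBE_OPTS)
```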

## Usage Example

```python
import asyncio

import numpy as np
import websockets


async def stream_audio():
    uri = "ws://localhost:8001/ws/stt/user123"

    async with websockets.connect(uri) as websocket:
        # Wait for the server's ready message
        ready = await websocket.recv()
        print(ready)

        async def receive_events():
            # Print VAD and transcript events as they arrive
            async for event in websocket:
                print(event)

        # Receive concurrently so sending never blocks on the server
        receiver = asyncio.create_task(receive_events())

        # Stream audio chunks (16kHz mono int16, 20ms chunks = 320 samples);
        # audio_stream is your audio source (e.g. microphone capture)
        for audio_chunk in audio_stream:
            audio_bytes = audio_chunk.astype(np.int16).tobytes()
            await websocket.send(audio_bytes)
            await asyncio.sleep(0.02)  # pace chunks at real time

        # Give the server a moment to deliver the final transcript
        await asyncio.sleep(2.0)
        receiver.cancel()


asyncio.run(stream_audio())
```

## Docker Setup

### Build

```bash
docker-compose build miku-stt
```

### Run

```bash
docker-compose up -d miku-stt
```

### Logs

```bash
docker-compose logs -f miku-stt
```

### Test

```bash
curl http://localhost:8001/health
```

## GPU Sharing with Soprano

Both STT (Whisper) and TTS (Soprano) run on the GTX 1660, but at different times:

1. **User speaking** → Whisper active, Soprano idle
2. **LLM processing** → both idle
3. **Miku speaking** → Soprano active, Whisper idle (VAD monitoring only)

Interruption detection runs VAD continuously, but VAD is CPU-only and never touches the GPU.
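
One simple way to enforce this time-multiplexing (an illustrative pattern, not necessarily how the services coordinate today) is a shared lock around all GPU work, with VAD kept outside it:

```python
import asyncio

gpu_lock = asyncio.Lock()  # only one model touches the GTX 1660 at a time


async def transcribe(audio: bytes) -> str:
    async with gpu_lock:
        # Whisper inference would run here (stubbed for illustration)
        return "transcript"


async def synthesize(text: str) -> bytes:
    async with gpu_lock:
        # Soprano inference would run here (stubbed for illustration)
        return b"audio"


async def main():
    # VAD stays outside the lock: it is CPU-only and always on
    text = await transcribe(b"\x00" * 640)
    audio = await synthesize(text)
    return text, audio


result = asyncio.run(main())
print(result)
```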

## Performance

- **VAD latency**: 10-20ms per chunk (CPU)
- **Whisper latency**: ~1-2s for 2s of audio (GPU)
- **Memory usage**:
  - Silero VAD: ~100MB (CPU)
  - Faster-Whisper small: ~500MB (GPU VRAM)

## Future Improvements

- [ ] Multi-language support (auto-detect)
- [ ] Word-level timestamps for better sync
- [ ] Custom vocabulary/prompt tuning
- [ ] Speaker diarization (multiple speakers)
- [ ] Noise suppression preprocessing