miku-discord/VOICE_TO_VOICE_REFERENCE.md

Voice-to-Voice Quick Reference

Complete Pipeline Status

All phases are complete; Phase 4B is deployed and ready for end-to-end testing.

Phase Completion Status

Phase 1: Voice Connection (COMPLETE)

  • Discord voice channel connection
  • Audio playback via discord.py
  • Resource management and cleanup

Phase 2: Audio Streaming (COMPLETE)

  • Soprano TTS server (GTX 1660)
  • RVC voice conversion
  • Real-time streaming via WebSocket
  • Token-by-token synthesis

Phase 3: Text-to-Voice (COMPLETE)

  • LLaMA text generation (AMD RX 6800)
  • Streaming token pipeline
  • TTS integration with !miku say
  • Natural conversation flow
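The streaming token pipeline consumes server-sent events from llama-cpp-server, where each chunk carries a payload like `{"delta": {"content": "token"}}`. A minimal stdlib sketch of the parsing step (the `iter_tokens` helper and the exact SSE framing are illustrative assumptions, not the bot's actual code):

```python
import json
from typing import Iterable, Iterator

def iter_tokens(sse_lines: Iterable[str]) -> Iterator[str]:
    """Yield content tokens from an SSE stream of delta payloads.

    Each event line looks like: data: {"delta":{"content":"Hi"}}
    The stream conventionally ends with: data: [DONE]
    """
    for line in sse_lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        token = chunk.get("delta", {}).get("content")
        if token:
            yield token

# Example with a canned stream:
lines = [
    'data: {"delta":{"content":"Hel"}}',
    'data: {"delta":{"content":"lo"}}',
    "data: [DONE]",
]
print("".join(iter_tokens(lines)))  # → Hello
```

In the real pipeline each yielded token would be forwarded straight to `audio_source.send_token()` so synthesis overlaps with generation.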

Phase 4A: STT Container (COMPLETE)

  • Silero VAD on CPU
  • Faster-Whisper on GTX 1660
  • WebSocket server at port 8001
  • Per-user session management
  • Chunk buffering for VAD

Phase 4B: Bot STT Integration (COMPLETE - READY FOR TESTING)

  • Discord audio capture
  • Opus decode + resampling
  • STT client WebSocket integration
  • Voice commands: !miku listen, !miku stop-listening
  • LLM voice response generation
  • Interruption detection and cancellation
  • /interrupt endpoint in RVC API
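The STT integration's receive loop boils down to routing JSON events from miku-stt to the `voice_manager.py` callbacks. A dependency-free sketch of that dispatch (the `"type"` field values and class name are illustrative assumptions about the wire format):

```python
import json

class STTEventRouter:
    """Route decoded STT WebSocket events to voice-session callbacks."""

    def __init__(self, on_vad_event, on_partial_transcript,
                 on_final_transcript, on_interruption):
        self.on_vad_event = on_vad_event
        self.on_partial_transcript = on_partial_transcript
        self.on_final_transcript = on_final_transcript
        self.on_interruption = on_interruption

    def handle(self, raw: str) -> None:
        event = json.loads(raw)
        kind = event.get("type")
        if kind in ("speech_start", "speaking", "speech_end"):
            self.on_vad_event(event)
        elif kind == "partial_transcript":
            self.on_partial_transcript(event["text"])
        elif kind == "final_transcript":
            self.on_final_transcript(event["text"])
        elif kind == "interruption":
            self.on_interruption()

# Example:
seen = []
router = STTEventRouter(
    on_vad_event=lambda e: seen.append(("vad", e["type"])),
    on_partial_transcript=lambda t: seen.append(("partial", t)),
    on_final_transcript=lambda t: seen.append(("final", t)),
    on_interruption=lambda: seen.append(("interrupt", None)),
)
router.handle('{"type": "speech_start"}')
router.handle('{"type": "final_transcript", "text": "hello miku"}')
print(seen)  # → [('vad', 'speech_start'), ('final', 'hello miku')]
```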

Quick Start Commands

Setup

!miku join              # Join your voice channel
!miku listen            # Start listening to your voice

Usage

  • Speak into your microphone
  • Miku will transcribe your speech
  • Miku will respond with voice
  • Interrupt her by speaking while she's talking

Teardown

!miku stop-listening    # Stop listening to your voice
!miku leave             # Leave voice channel

Architecture Diagram

┌─────────────────────────────────────────────────────────────────┐
│                         USER INPUT                              │
└─────────────────────────────────────────────────────────────────┘
                              │
                              │ Discord Voice (Opus 48kHz)
                              ▼
┌─────────────────────────────────────────────────────────────────┐
│                    miku-bot Container                           │
│  ┌───────────────────────────────────────────────────────────┐ │
│  │ VoiceReceiver (discord.sinks.Sink)                        │ │
│  │  - Opus decode → PCM                                      │ │
│  │  - Stereo → Mono                                          │ │
│  │  - Resample 48kHz → 16kHz                                 │ │
│  └─────────────────┬─────────────────────────────────────────┘ │
│                    │ PCM int16, 16kHz, 20ms chunks              │
│  ┌─────────────────▼─────────────────────────────────────────┐ │
│  │ STTClient (WebSocket)                                     │ │
│  │  - Sends audio to miku-stt                                │ │
│  │  - Receives VAD events, transcripts                       │ │
│  └─────────────────┬─────────────────────────────────────────┘ │
└────────────────────┼───────────────────────────────────────────┘
                     │ ws://miku-stt:8001/ws/stt/{user_id}
                     ▼
┌─────────────────────────────────────────────────────────────────┐
│                    miku-stt Container                           │
│  ┌───────────────────────────────────────────────────────────┐ │
│  │ VADProcessor (Silero VAD 5.1.2)         [CPU]            │ │
│  │  - Chunk buffering (512 samples min)                      │ │
│  │  - Speech detection (threshold=0.5)                       │ │
│  │  - Events: speech_start, speaking, speech_end             │ │
│  └─────────────────┬─────────────────────────────────────────┘ │
│                    │ Audio segments                             │
│  ┌─────────────────▼─────────────────────────────────────────┐ │
│  │ WhisperTranscriber (Faster-Whisper 1.2.1) [GTX 1660]    │ │
│  │  - Model: small (1.3GB VRAM)                              │ │
│  │  - Transcribes speech segments                            │ │
│  │  - Returns: partial & final transcripts                   │ │
│  └─────────────────┬─────────────────────────────────────────┘ │
└────────────────────┼───────────────────────────────────────────┘
                     │ JSON events via WebSocket
                     ▼
┌─────────────────────────────────────────────────────────────────┐
│                    miku-bot Container                           │
│  ┌───────────────────────────────────────────────────────────┐ │
│  │ voice_manager.py Callbacks                                │ │
│  │  - on_vad_event()         → Log VAD states                │ │
│  │  - on_partial_transcript() → Show typing indicator        │ │
│  │  - on_final_transcript()   → Generate LLM response        │ │
│  │  - on_interruption()       → Cancel TTS playback          │ │
│  └─────────────────┬─────────────────────────────────────────┘ │
│                    │ Final transcript text                      │
│  ┌─────────────────▼─────────────────────────────────────────┐ │
│  │ _generate_voice_response()                                │ │
│  │  - Build LLM prompt with conversation history             │ │
│  │  - Stream LLM response                                    │ │
│  │  - Send tokens to TTS                                     │ │
│  └─────────────────┬─────────────────────────────────────────┘ │
└────────────────────┼───────────────────────────────────────────┘
                     │ HTTP streaming to LLaMA server
                     ▼
┌─────────────────────────────────────────────────────────────────┐
│              llama-cpp-server (AMD RX 6800)                     │
│  - Streaming text generation                                   │
│  - 20-30 tokens/sec                                            │
│  - Returns: {"delta": {"content": "token"}}                    │
└─────────────────┬───────────────────────────────────────────────┘
                  │ Token stream
                  ▼
┌─────────────────────────────────────────────────────────────────┐
│                    miku-bot Container                           │
│  ┌───────────────────────────────────────────────────────────┐ │
│  │ audio_source.send_token()                                 │ │
│  │  - Buffers tokens                                         │ │
│  │  - Sends to RVC WebSocket                                 │ │
│  └─────────────────┬─────────────────────────────────────────┘ │
└────────────────────┼───────────────────────────────────────────┘
                     │ ws://miku-rvc-api:8765/ws/stream
                     ▼
┌─────────────────────────────────────────────────────────────────┐
│                 miku-rvc-api Container                          │
│  ┌───────────────────────────────────────────────────────────┐ │
│  │ Soprano TTS Server (miku-soprano-tts)    [GTX 1660]      │ │
│  │  - Text → Audio synthesis                                 │ │
│  │  - 32kHz output                                           │ │
│  └─────────────────┬─────────────────────────────────────────┘ │
│                    │ Raw audio via ZMQ                          │
│  ┌─────────────────▼─────────────────────────────────────────┐ │
│  │ RVC Voice Conversion                     [GTX 1660]      │ │
│  │  - Voice cloning & pitch shifting                         │ │
│  │  - 48kHz output                                           │ │
│  └─────────────────┬─────────────────────────────────────────┘ │
└────────────────────┼───────────────────────────────────────────┘
                     │ PCM float32, 48kHz
                     ▼
┌─────────────────────────────────────────────────────────────────┐
│                    miku-bot Container                           │
│  ┌───────────────────────────────────────────────────────────┐ │
│  │ discord.VoiceClient                                       │ │
│  │  - Plays audio in voice channel                           │ │
│  │  - Can be interrupted by user speech                      │ │
│  └───────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────┐
│                       USER OUTPUT                               │
│                   (Miku's voice response)                       │
└─────────────────────────────────────────────────────────────────┘
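The VoiceReceiver stage in the diagram (stereo 48 kHz int16 in, mono 16 kHz out) can be sketched with the standard library alone. This is a naive decimator for illustration only; production code would low-pass filter before dropping samples (48000 / 16000 = 3, so every 3rd averaged frame is kept):

```python
import array

def stereo48k_to_mono16k(pcm: bytes) -> bytes:
    """Downmix interleaved stereo int16 PCM and decimate 48 kHz -> 16 kHz."""
    samples = array.array("h")          # signed 16-bit
    samples.frombytes(pcm)
    mono = array.array("h")
    # Step over interleaved L/R pairs, keeping every 3rd frame.
    for i in range(0, len(samples) - 1, 2 * 3):
        mono.append((samples[i] + samples[i + 1]) // 2)
    return mono.tobytes()

# One 20 ms Discord frame: 960 stereo frames at 48 kHz.
frame = array.array("h", [100, 200] * 960).tobytes()
out = stereo48k_to_mono16k(frame)
print(len(out) // 2)  # → 320 samples = 20 ms at 16 kHz
```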

Interruption Flow

User speaks during Miku's TTS
         │
         ▼
VAD detects speech (probability > 0.7)
         │
         ▼
STT sends interruption event
         │
         ▼
on_user_interruption() callback
         │
         ▼
_cancel_tts() → voice_client.stop()
         │
         ▼
POST http://miku-rvc-api:8765/interrupt
         │
         ▼
Flush ZMQ socket + clear RVC buffers
         │
         ▼
Miku stops speaking, ready for new input
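The trigger condition for the flow above is small: while Miku is speaking, the first VAD probability above 0.7 fires the interruption callback exactly once per utterance. A minimal sketch (class and attribute names are illustrative, not the bot's actual API):

```python
class InterruptionDetector:
    """Fire on_interruption once when speech is detected during playback."""

    def __init__(self, on_interruption, threshold=0.7):
        self.on_interruption = on_interruption
        self.threshold = threshold
        self.miku_speaking = False
        self._fired = False

    def start_playback(self):
        self.miku_speaking = True
        self._fired = False

    def stop_playback(self):
        self.miku_speaking = False

    def feed_vad(self, probability: float):
        if (self.miku_speaking and not self._fired
                and probability > self.threshold):
            self._fired = True
            self.on_interruption()  # e.g. stop playback, POST /interrupt

events = []
det = InterruptionDetector(on_interruption=lambda: events.append("cancel"))
det.feed_vad(0.9)           # ignored: Miku is not speaking
det.start_playback()
det.feed_vad(0.5)           # below threshold
det.feed_vad(0.9)           # fires once
det.feed_vad(0.95)          # already fired for this utterance
print(events)  # → ['cancel']
```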

Hardware Utilization

Listen Phase (User Speaking)

  • CPU: Silero VAD processing
  • GTX 1660: Faster-Whisper transcription (1.3GB VRAM)
  • AMD RX 6800: Idle

Think Phase (LLM Generation)

  • CPU: Idle
  • GTX 1660: Idle
  • AMD RX 6800: LLaMA inference (20-30 tokens/sec)

Speak Phase (Miku Responding)

  • CPU: Silero VAD monitoring for interruption
  • GTX 1660: Soprano TTS + RVC synthesis
  • AMD RX 6800: Idle

Performance Metrics

Expected Latencies

| Stage                     | Latency        |
|---------------------------|----------------|
| Discord audio capture     | ~20 ms         |
| Opus decode + resample    | <10 ms         |
| VAD processing            | <50 ms         |
| Whisper transcription     | 200-500 ms     |
| LLM token generation      | 33-50 ms/token |
| TTS synthesis             | Real-time      |
| Total (speech → response) | 1-2 s          |
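As a sanity check, the per-stage numbers roughly add up to the quoted total. The reply length below is an assumed example, and midpoints are used for the ranged stages:

```python
# Rough latency budget for a ~30-token reply (token count assumed).
capture_ms   = 20
decode_ms    = 10
vad_ms       = 50
whisper_ms   = 350          # midpoint of 200-500 ms
tokens       = 30
per_token_ms = 40           # midpoint of 33-50 ms/token
# TTS runs in real time, overlapped with token generation.
total_ms = (capture_ms + decode_ms + vad_ms + whisper_ms
            + tokens * per_token_ms)
print(total_ms / 1000)  # → 1.63 seconds, inside the 1-2 s envelope
```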

VRAM Usage

| GPU         | Component     | VRAM   |
|-------------|---------------|--------|
| AMD RX 6800 | LLaMA 8B Q4   | ~5.5GB |
| GTX 1660    | Whisper small | 1.3GB  |
| GTX 1660    | Soprano + RVC | ~3GB   |

Key Files

Bot Container

  • bot/utils/stt_client.py - WebSocket client for STT
  • bot/utils/voice_receiver.py - Discord audio sink
  • bot/utils/voice_manager.py - Voice session with STT integration
  • bot/commands/voice.py - Voice commands including listen/stop-listening

STT Container

  • stt/vad_processor.py - Silero VAD with chunk buffering
  • stt/whisper_transcriber.py - Faster-Whisper transcription
  • stt/stt_server.py - FastAPI WebSocket server

RVC Container

  • soprano_to_rvc/soprano_rvc_api.py - TTS + RVC pipeline with /interrupt endpoint

Configuration Files

docker-compose.yml

  • Network: miku-network (all containers)
  • Ports:
    • miku-bot: 8081 (API)
    • miku-rvc-api: 8765 (TTS)
    • miku-stt: 8001 (STT)
    • llama-cpp-server: 8080 (LLM)
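A compose fragment consistent with the list above might look like this; only the ports and the shared network come from this document, and the service layout shown is an assumption:

```yaml
services:
  miku-stt:
    ports:
      - "8001:8001"
    networks:
      - miku-network

  miku-rvc-api:
    ports:
      - "8765:8765"
    networks:
      - miku-network

networks:
  miku-network:
```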

VAD Settings (stt/vad_processor.py)

threshold = 0.5          # Speech detection sensitivity
min_speech = 250         # Minimum speech duration (ms)
min_silence = 500        # Silence before speech_end (ms)
interruption_threshold = 0.7  # Probability for interruption
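Silero VAD expects at least 512 samples (32 ms at 16 kHz) per inference call, while Discord delivers 20 ms (320-sample) chunks, hence the buffering step. A stdlib sketch of that accumulation (class name illustrative):

```python
class ChunkBuffer:
    """Accumulate short PCM chunks into fixed-size VAD windows."""

    WINDOW = 512  # samples per Silero VAD call at 16 kHz

    def __init__(self):
        self._buf = []

    def feed(self, chunk):
        """Add samples; return any complete 512-sample windows."""
        self._buf.extend(chunk)
        windows = []
        while len(self._buf) >= self.WINDOW:
            windows.append(self._buf[:self.WINDOW])
            self._buf = self._buf[self.WINDOW:]
        return windows

buf = ChunkBuffer()
print(len(buf.feed([0] * 320)))  # → 0 (320 < 512, keep buffering)
print(len(buf.feed([0] * 320)))  # → 1 (640 buffered, one window out)
```

Leftover samples (128 in the example) stay buffered for the next window, so no audio is dropped between VAD calls.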

Whisper Settings (stt/whisper_transcriber.py)

model = "small"          # 1.3GB VRAM
device = "cuda"
compute_type = "float16"
beam_size = 5
patience = 1.0

Testing Commands

# Check all container health
curl http://localhost:8001/health  # STT
curl http://localhost:8765/health  # RVC
curl http://localhost:8080/health  # LLM

# Monitor logs
docker logs -f miku-bot | grep -E "(listen|transcript|interrupt)"
docker logs -f miku-stt
docker logs -f miku-rvc-api | grep interrupt

# Test interrupt endpoint
curl -X POST http://localhost:8765/interrupt

# Check GPU usage
nvidia-smi

Troubleshooting

| Issue                    | Solution                                           |
|--------------------------|----------------------------------------------------|
| No audio from Discord    | Check bot has Connect and Speak permissions        |
| VAD not detecting        | Speak louder, check microphone, lower threshold    |
| Empty transcripts        | Speak for at least 1-2 seconds, check Whisper model |
| Interruption not working | Verify miku_speaking=true, check VAD probability   |
| High latency             | Profile each stage, check GPU utilization          |

Next Features (Phase 4C+)

  • KV cache precomputation from partial transcripts
  • Multi-user simultaneous conversation
  • Latency optimization (<1s total)
  • Voice activity history and analytics
  • Emotion detection from speech patterns
  • Context-aware interruption handling

Ready to test! Use !miku join → !miku listen → speak to Miku 🎤