# Voice-to-Voice Quick Reference

## Complete Pipeline Status ✅

All phases complete and deployed!

## Phase Completion Status

### ✅ Phase 1: Voice Connection (COMPLETE)
- Discord voice channel connection
- Audio playback via discord.py
- Resource management and cleanup

### ✅ Phase 2: Audio Streaming (COMPLETE)
- Soprano TTS server (GTX 1660)
- RVC voice conversion
- Real-time streaming via WebSocket
- Token-by-token synthesis

### ✅ Phase 3: Text-to-Voice (COMPLETE)
- LLaMA text generation (AMD RX 6800)
- Streaming token pipeline
- TTS integration with `!miku say`
- Natural conversation flow

### ✅ Phase 4A: STT Container (COMPLETE)
- Silero VAD on CPU
- Faster-Whisper on GTX 1660
- WebSocket server at port 8001
- Per-user session management
- Chunk buffering for VAD

### ✅ Phase 4B: Bot STT Integration (COMPLETE - READY FOR TESTING)
- Discord audio capture
- Opus decode + resampling
- STT client WebSocket integration
- Voice commands: `!miku listen`, `!miku stop-listening`
- LLM voice response generation
- Interruption detection and cancellation
- `/interrupt` endpoint in RVC API

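Phases 2-3 stream LLM output into TTS token by token; in practice that means buffering deltas until a natural break before flushing them to synthesis, so TTS gets whole phrases rather than fragments. A minimal sketch of that buffering idea — the `TokenBuffer` class, `flush_to_tts` callback, and sentence-break rule are illustrative assumptions, not the bot's actual code:

```python
# Sketch: accumulate streamed LLM tokens and flush complete sentences to TTS.
# The sentence-break rule and the flush_to_tts callback are assumptions.

SENTENCE_ENDINGS = (".", "!", "?", "\n")

class TokenBuffer:
    def __init__(self, flush_to_tts):
        self.flush_to_tts = flush_to_tts  # e.g. sends text over the RVC WebSocket
        self.pending = ""

    def send_token(self, token: str) -> None:
        self.pending += token
        # Flush once the buffer ends at a sentence boundary, so TTS
        # synthesizes natural-sounding chunks instead of single tokens.
        if self.pending.rstrip().endswith(SENTENCE_ENDINGS):
            self.flush_to_tts(self.pending.strip())
            self.pending = ""

    def finish(self) -> None:
        # Flush any trailing partial sentence when the LLM stream ends.
        if self.pending.strip():
            self.flush_to_tts(self.pending.strip())
            self.pending = ""
```
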
## Quick Start Commands

### Setup
```bash
!miku join              # Join your voice channel
!miku listen            # Start listening to your voice
```

### Usage
- **Speak** into your microphone
- Miku will **transcribe** your speech
- Miku will **respond** with voice
- **Interrupt** her by speaking while she's talking

### Teardown
```bash
!miku stop-listening    # Stop listening to your voice
!miku leave             # Leave voice channel
```

## Architecture Diagram

```
┌─────────────────────────────────────────────────────────────────┐
│                           USER INPUT                            │
└─────────────────────────────────────────────────────────────────┘
                    │
                    │ Discord Voice (Opus 48kHz)
                    ▼
┌─────────────────────────────────────────────────────────────────┐
│                       miku-bot Container                        │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ VoiceReceiver (discord.sinks.Sink)                          │ │
│ │ - Opus decode → PCM                                         │ │
│ │ - Stereo → Mono                                             │ │
│ │ - Resample 48kHz → 16kHz                                    │ │
│ └─────────────────┬───────────────────────────────────────────┘ │
│                   │ PCM int16, 16kHz, 20ms chunks               │
│ ┌─────────────────▼───────────────────────────────────────────┐ │
│ │ STTClient (WebSocket)                                       │ │
│ │ - Sends audio to miku-stt                                   │ │
│ │ - Receives VAD events, transcripts                          │ │
│ └─────────────────┬───────────────────────────────────────────┘ │
└───────────────────┼─────────────────────────────────────────────┘
                    │ ws://miku-stt:8001/ws/stt/{user_id}
                    ▼
┌─────────────────────────────────────────────────────────────────┐
│                       miku-stt Container                        │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ VADProcessor (Silero VAD 5.1.2) [CPU]                       │ │
│ │ - Chunk buffering (512 samples min)                         │ │
│ │ - Speech detection (threshold=0.5)                          │ │
│ │ - Events: speech_start, speaking, speech_end                │ │
│ └─────────────────┬───────────────────────────────────────────┘ │
│                   │ Audio segments                              │
│ ┌─────────────────▼───────────────────────────────────────────┐ │
│ │ WhisperTranscriber (Faster-Whisper 1.2.1) [GTX 1660]        │ │
│ │ - Model: small (1.3GB VRAM)                                 │ │
│ │ - Transcribes speech segments                               │ │
│ │ - Returns: partial & final transcripts                      │ │
│ └─────────────────┬───────────────────────────────────────────┘ │
└───────────────────┼─────────────────────────────────────────────┘
                    │ JSON events via WebSocket
                    ▼
┌─────────────────────────────────────────────────────────────────┐
│                       miku-bot Container                        │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ voice_manager.py Callbacks                                  │ │
│ │ - on_vad_event() → Log VAD states                           │ │
│ │ - on_partial_transcript() → Show typing indicator           │ │
│ │ - on_final_transcript() → Generate LLM response             │ │
│ │ - on_interruption() → Cancel TTS playback                   │ │
│ └─────────────────┬───────────────────────────────────────────┘ │
│                   │ Final transcript text                       │
│ ┌─────────────────▼───────────────────────────────────────────┐ │
│ │ _generate_voice_response()                                  │ │
│ │ - Build LLM prompt with conversation history                │ │
│ │ - Stream LLM response                                       │ │
│ │ - Send tokens to TTS                                        │ │
│ └─────────────────┬───────────────────────────────────────────┘ │
└───────────────────┼─────────────────────────────────────────────┘
                    │ HTTP streaming to LLaMA server
                    ▼
┌─────────────────────────────────────────────────────────────────┐
│                 llama-cpp-server (AMD RX 6800)                  │
│ - Streaming text generation                                     │
│ - 20-30 tokens/sec                                              │
│ - Returns: {"delta": {"content": "token"}}                      │
└───────────────────┬─────────────────────────────────────────────┘
                    │ Token stream
                    ▼
┌─────────────────────────────────────────────────────────────────┐
│                       miku-bot Container                        │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ audio_source.send_token()                                   │ │
│ │ - Buffers tokens                                            │ │
│ │ - Sends to RVC WebSocket                                    │ │
│ └─────────────────┬───────────────────────────────────────────┘ │
└───────────────────┼─────────────────────────────────────────────┘
                    │ ws://miku-rvc-api:8765/ws/stream
                    ▼
┌─────────────────────────────────────────────────────────────────┐
│                     miku-rvc-api Container                      │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ Soprano TTS Server (miku-soprano-tts) [GTX 1660]            │ │
│ │ - Text → Audio synthesis                                    │ │
│ │ - 32kHz output                                              │ │
│ └─────────────────┬───────────────────────────────────────────┘ │
│                   │ Raw audio via ZMQ                           │
│ ┌─────────────────▼───────────────────────────────────────────┐ │
│ │ RVC Voice Conversion [GTX 1660]                             │ │
│ │ - Voice cloning & pitch shifting                            │ │
│ │ - 48kHz output                                              │ │
│ └─────────────────┬───────────────────────────────────────────┘ │
└───────────────────┼─────────────────────────────────────────────┘
                    │ PCM float32, 48kHz
                    ▼
┌─────────────────────────────────────────────────────────────────┐
│                       miku-bot Container                        │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ discord.VoiceClient                                         │ │
│ │ - Plays audio in voice channel                              │ │
│ │ - Can be interrupted by user speech                         │ │
│ └─────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
                    │
                    ▼
┌─────────────────────────────────────────────────────────────────┐
│                           USER OUTPUT                           │
│                     (Miku's voice response)                     │
└─────────────────────────────────────────────────────────────────┘
```

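The VoiceReceiver stage above (Opus-decoded 48 kHz stereo PCM → 16 kHz mono int16 for STT) can be sketched in pure Python. This is a data-flow illustration only: it averages the channels and keeps every third sample (48000 / 16000 = 3); a production pipeline would low-pass filter before decimating to avoid aliasing.

```python
import array

def stereo48k_to_mono16k(pcm: bytes) -> bytes:
    """Convert interleaved stereo int16 PCM at 48 kHz to mono int16 at 16 kHz.

    Sketch only: averages L/R channels, then naively decimates by 3.
    """
    samples = array.array("h")  # signed 16-bit samples
    samples.frombytes(pcm)
    # Average interleaved L/R pairs -> mono.
    mono = [(samples[i] + samples[i + 1]) // 2 for i in range(0, len(samples), 2)]
    # Keep every 3rd sample: 48 kHz -> 16 kHz.
    return array.array("h", mono[::3]).tobytes()
```
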
## Interruption Flow

```
User speaks during Miku's TTS
        │
        ▼
VAD detects speech (probability > 0.7)
        │
        ▼
STT sends interruption event
        │
        ▼
on_user_interruption() callback
        │
        ▼
_cancel_tts() → voice_client.stop()
        │
        ▼
POST http://miku-rvc-api:8765/interrupt
        │
        ▼
Flush ZMQ socket + clear RVC buffers
        │
        ▼
Miku stops speaking, ready for new input
```

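The decision at the top of this flow — treat speech as an interruption only while Miku is speaking, and only when VAD probability clears the stricter 0.7 bar — can be sketched as below. The wiring (`voice_client`, `post_interrupt`) is an illustrative assumption, not the bot's exact code:

```python
# Sketch of the interruption decision and cancellation sequence.
# `voice_client` and `post_interrupt` are illustrative stand-ins.

INTERRUPTION_THRESHOLD = 0.7  # stricter than the 0.5 speech threshold

def handle_vad_probability(prob: float, miku_speaking: bool,
                           voice_client, post_interrupt) -> bool:
    """Return True if playback was interrupted."""
    if not miku_speaking or prob <= INTERRUPTION_THRESHOLD:
        return False
    voice_client.stop()   # stop Discord playback immediately
    post_interrupt()      # POST /interrupt: flush ZMQ socket + RVC buffers
    return True
```
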
## Hardware Utilization

### Listen Phase (User Speaking)
- **CPU**: Silero VAD processing
- **GTX 1660**: Faster-Whisper transcription (1.3GB VRAM)
- **AMD RX 6800**: Idle

### Think Phase (LLM Generation)
- **CPU**: Idle
- **GTX 1660**: Idle
- **AMD RX 6800**: LLaMA inference (20-30 tokens/sec)

### Speak Phase (Miku Responding)
- **CPU**: Silero VAD monitoring for interruption
- **GTX 1660**: Soprano TTS + RVC synthesis
- **AMD RX 6800**: Idle

## Performance Metrics

### Expected Latencies

| Stage                         | Latency     |
|-------------------------------|-------------|
| Discord audio capture         | ~20ms       |
| Opus decode + resample        | <10ms       |
| VAD processing                | <50ms       |
| Whisper transcription         | 200-500ms   |
| LLM token generation          | 33-50ms/tok |
| TTS synthesis                 | Real-time   |
| **Total (speech → response)** | **1-2s**    |

### VRAM Usage

| GPU         | Component     | VRAM   |
|-------------|---------------|--------|
| AMD RX 6800 | LLaMA 8B Q4   | ~5.5GB |
| GTX 1660    | Whisper small | 1.3GB  |
| GTX 1660    | Soprano + RVC | ~3GB   |

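As a rough sanity check, summing midpoints of the per-stage estimates above for a short spoken reply lands inside the quoted 1-2 s envelope. The token count and "midpoint" values are illustrative assumptions drawn from the table, not measurements:

```python
# Rough latency budget from the table above (milliseconds).
# Token count and midpoint values are illustrative assumptions.
stages_ms = {
    "capture": 20,           # Discord audio capture
    "decode_resample": 10,   # Opus decode + resample
    "vad": 50,               # VAD processing
    "whisper": 350,          # midpoint of 200-500ms
}
tokens = 15                  # a short spoken reply
ms_per_token = 40            # midpoint of 33-50ms/tok (20-30 tok/s)
# TTS is real-time, so it adds no extra budget beyond generation.
total_ms = sum(stages_ms.values()) + tokens * ms_per_token
```
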
## Key Files

### Bot Container
- `bot/utils/stt_client.py` - WebSocket client for STT
- `bot/utils/voice_receiver.py` - Discord audio sink
- `bot/utils/voice_manager.py` - Voice session with STT integration
- `bot/commands/voice.py` - Voice commands, including listen/stop-listening

### STT Container
- `stt/vad_processor.py` - Silero VAD with chunk buffering
- `stt/whisper_transcriber.py` - Faster-Whisper transcription
- `stt/stt_server.py` - FastAPI WebSocket server

### RVC Container
- `soprano_to_rvc/soprano_rvc_api.py` - TTS + RVC pipeline with `/interrupt` endpoint

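On the bot side, `stt_client.py`'s main job is demultiplexing JSON events from the STT WebSocket into the `voice_manager.py` callbacks. A minimal dispatch sketch — the `"type"` values mirror the architecture above but are assumptions about the actual wire format:

```python
import json

# Sketch: route JSON events from the STT WebSocket to voice_manager-style
# callbacks. The "type" values are assumptions about the wire format.
def dispatch_stt_event(raw: str, callbacks: dict) -> None:
    event = json.loads(raw)
    handlers = {
        "vad": "on_vad_event",  # speech_start / speaking / speech_end
        "partial_transcript": "on_partial_transcript",
        "final_transcript": "on_final_transcript",
        "interruption": "on_interruption",
    }
    name = handlers.get(event.get("type"))
    if name and name in callbacks:
        callbacks[name](event)
    # Unknown event types or unregistered callbacks are ignored.
```
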
## Configuration Files

### docker-compose.yml
- Network: `miku-network` (all containers)
- Ports:
  - miku-bot: 8081 (API)
  - miku-rvc-api: 8765 (TTS)
  - miku-stt: 8001 (STT)
  - llama-cpp-server: 8080 (LLM)

### VAD Settings (stt/vad_processor.py)
```python
threshold = 0.5               # Speech detection sensitivity
min_speech = 250              # Minimum speech duration (ms)
min_silence = 500             # Silence before speech_end (ms)
interruption_threshold = 0.7  # Probability for interruption
```

### Whisper Settings (stt/whisper_transcriber.py)
```python
model = "small"               # 1.3GB VRAM
device = "cuda"
compute_type = "float16"
beam_size = 5
patience = 1.0
```

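How the VAD settings interact can be sketched as a small state machine over per-chunk speech probabilities: speech must persist for `min_speech` ms before a segment opens, and `min_silence` ms of quiet closes it. The fixed 32 ms chunk size (512 samples at 16 kHz) and this structure are illustrative; Silero's actual API differs:

```python
# Sketch of how threshold / min_speech / min_silence interact.
# Assumes fixed 32ms chunks (512 samples at 16kHz); Silero's real API differs.

CHUNK_MS = 32
THRESHOLD = 0.5
MIN_SPEECH_MS = 250
MIN_SILENCE_MS = 500

def segment_events(probs):
    """Yield 'speech_start' / 'speech_end' events from chunk probabilities."""
    speech_ms = silence_ms = 0
    in_speech = False
    for p in probs:
        if p >= THRESHOLD:
            speech_ms += CHUNK_MS
            silence_ms = 0
            if not in_speech and speech_ms >= MIN_SPEECH_MS:
                in_speech = True
                yield "speech_start"
        else:
            silence_ms += CHUNK_MS
            speech_ms = 0
            if in_speech and silence_ms >= MIN_SILENCE_MS:
                in_speech = False
                yield "speech_end"
    # A segment still open at stream end is left to the caller.
```
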
## Testing Commands

```bash
# Check all container health
curl http://localhost:8001/health   # STT
curl http://localhost:8765/health   # RVC
curl http://localhost:8080/health   # LLM

# Monitor logs
docker logs -f miku-bot | grep -E "(listen|transcript|interrupt)"
docker logs -f miku-stt
docker logs -f miku-rvc-api | grep interrupt

# Test interrupt endpoint
curl -X POST http://localhost:8765/interrupt

# Check GPU usage
nvidia-smi
```

## Troubleshooting

| Issue | Solution |
|-------|----------|
| No audio from Discord | Check that the bot has Connect and Speak permissions |
| VAD not detecting speech | Speak louder, check the microphone, or lower `threshold` |
| Empty transcripts | Speak for at least 1-2 seconds; check the Whisper model |
| Interruption not working | Verify `miku_speaking=true` and check the VAD probability |
| High latency | Profile each stage and check GPU utilization |

## Next Features (Phase 4C+)

- [ ] KV cache precomputation from partial transcripts
- [ ] Multi-user simultaneous conversation
- [ ] Latency optimization (<1s total)
- [ ] Voice activity history and analytics
- [ ] Emotion detection from speech patterns
- [ ] Context-aware interruption handling

---

**Ready to test!** Use `!miku join` → `!miku listen` → speak to Miku 🎤