# Voice Chat Implementation with Fish.audio
## Overview
This document explains how to integrate Fish.audio TTS API with the Miku Discord bot for voice channel conversations.
## Fish.audio API Setup
### 1. Get API Key
- Create account at https://fish.audio/
- Get API key from: https://fish.audio/app/api-keys/
### 2. Find Your Miku Voice Model ID
- Browse voices at https://fish.audio/
- Find your Miku voice model
- Copy the model ID from the URL (e.g., `8ef4a238714b45718ce04243307c57a7`)
- Or use the copy button on the voice page
## API Usage for Discord Voice Chat
### Basic TTS Request (REST API)
```python
import requests

def generate_speech(text: str, voice_id: str, api_key: str) -> bytes:
    """Generate speech using Fish.audio API"""
    url = "https://api.fish.audio/v1/tts"
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
        "model": "s1"  # Recommended model
    }
    payload = {
        "text": text,
        "reference_id": voice_id,  # Your Miku voice model ID
        "format": "mp3",           # or "pcm" for raw audio
        "latency": "balanced",     # Lower latency for real-time
        "temperature": 0.9,        # Controls randomness (0-1)
        "normalize": True          # Reduces latency
    }
    response = requests.post(url, json=payload, headers=headers)
    response.raise_for_status()  # Surface API errors instead of returning an error body as "audio"
    return response.content      # Returns audio bytes
```
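A quick way to sanity-check the API key and model ID, as a sketch; it assumes the two environment variables from the Environment Variables section below are already set:
```python
import os

# Hypothetical smoke test: synthesize one line and write it to disk.
audio = generate_speech(
    "Hello! Miku here!",
    voice_id=os.environ["MIKU_VOICE_ID"],
    api_key=os.environ["FISH_API_KEY"],
)
with open("miku_test.mp3", "wb") as f:  # matches the "mp3" format used above
    f.write(audio)
```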
### Real-time Streaming (WebSocket - Recommended for VC)
```python
from fish_audio_sdk import WebSocketSession, TTSRequest

def stream_to_discord(text: str, voice_id: str, api_key: str):
    """Stream audio directly to Discord voice channel"""
    ws_session = WebSocketSession(api_key)

    # Define text generator (can stream from LLM responses)
    def text_stream():
        # You can yield text as it's generated from your LLM
        yield text

    with ws_session:
        for audio_chunk in ws_session.tts(
            TTSRequest(
                text="",               # Empty when streaming
                reference_id=voice_id,
                format="pcm",          # Best for Discord
                sample_rate=48000      # Discord uses 48kHz
            ),
            text_stream()
        ):
            # Send audio_chunk to Discord voice channel
            yield audio_chunk
```
### Async Streaming (Better for Discord.py)
```python
from fish_audio_sdk import AsyncWebSocketSession, TTSRequest

async def async_stream_speech(text: str, voice_id: str, api_key: str):
    """Async streaming for Discord.py integration"""
    ws_session = AsyncWebSocketSession(api_key)

    async def text_stream():
        yield text

    async with ws_session:
        audio_buffer = bytearray()
        async for audio_chunk in ws_session.tts(
            TTSRequest(
                text="",
                reference_id=voice_id,
                format="pcm",
                sample_rate=48000
            ),
            text_stream()
        ):
            audio_buffer.extend(audio_chunk)
        return bytes(audio_buffer)
```
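Once a full response is buffered like this, it can be handed to discord.py for playback. A minimal sketch, assuming an already-connected `discord.VoiceClient` and PCM that has been converted to the stereo layout described in the Audio Format section below:
```python
import io
import discord

async def play_pcm(voice_client: discord.VoiceClient, stereo_pcm: bytes):
    """Play 48 kHz / 16-bit / stereo PCM through a connected voice client."""
    source = discord.PCMAudio(io.BytesIO(stereo_pcm))
    if not voice_client.is_playing():
        voice_client.play(source)
```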
## Integration with Miku Bot
### Required Dependencies
Add to `requirements.txt`:
```
discord.py[voice]
PyNaCl
fish-audio-sdk
speech_recognition # For STT
pydub # Audio processing
```
### Environment Variables
Add to your `.env` or docker-compose.yml:
```bash
FISH_API_KEY=your_api_key_here
MIKU_VOICE_ID=your_miku_model_id_here
```
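These can be read at startup, for example as below; the later examples reference them through the bot's `globals` module, which is assumed to expose the same values:
```python
import os

# Assumed to be set via .env / docker-compose as shown above.
FISH_API_KEY = os.environ["FISH_API_KEY"]
MIKU_VOICE_ID = os.environ["MIKU_VOICE_ID"]
```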
### Discord Voice Channel Flow
```
1. User speaks in VC
2. Capture audio → Speech Recognition (STT)
3. Convert speech to text
4. Process with Miku's LLM (existing bot logic)
5. Generate response text
6. Send to Fish.audio TTS API
7. Stream audio back to Discord VC
```
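A minimal orchestration sketch of that flow. `transcribe()` and `synthesize_speech()` are hypothetical placeholders for the STT and TTS pieces (only `query_ollama()` exists in the bot today), and `play_pcm()` is the helper sketched earlier:
```python
import globals
from utils.llm import query_ollama

# Hypothetical glue for the seven steps above.
async def handle_voice_turn(voice_client, captured_audio: bytes):
    user_text = await transcribe(captured_audio)                            # steps 2-3: STT (assumed helper)
    reply_text = await query_ollama(prompt=user_text,
                                    model=globals.OLLAMA_MODEL)             # steps 4-5: existing LLM logic
    pcm = await synthesize_speech(reply_text)                               # step 6: Fish.audio TTS (assumed helper
                                                                            #         returning playback-ready PCM)
    await play_pcm(voice_client, pcm)                                       # step 7: back into the VC
```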
## Key Implementation Details
### For Low Latency Voice Chat:
- Use WebSocket streaming instead of REST API
- Set `latency: "balanced"` in requests
- Use `format: "pcm"` with `sample_rate: 48000` for Discord
- Stream LLM responses as they generate (don't wait for full response)
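The last point is the main structural change versus the async example above: instead of yielding one finished string, the text generator yields pieces as the LLM produces them. A sketch, assuming a hypothetical `stream_ollama_tokens()` helper (the existing `query_ollama()` returns the full response, so a streaming variant would need to be added):
```python
from fish_audio_sdk import TTSRequest

async def llm_to_speech(prompt: str, ws_session, voice_id: str):
    async def text_stream():
        # stream_ollama_tokens() is an assumed streaming variant of query_ollama().
        async for piece in stream_ollama_tokens(prompt):
            yield piece

    async with ws_session:
        async for audio_chunk in ws_session.tts(
            TTSRequest(text="", reference_id=voice_id, format="pcm", sample_rate=48000),
            text_stream()
        ):
            yield audio_chunk
```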
### Audio Format for Discord:
- **Sample Rate**: 48000 Hz (Discord standard)
- **Channels**: 2 (stereo) for discord.py playback; TTS output is typically mono, so upmix before playing (see the sketch below)
- **Format**: PCM (raw audio) or Opus (compressed)
- **Bit Depth**: 16-bit
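discord.py's built-in `PCMAudio` source assumes 48 kHz, 16-bit, stereo PCM. A sketch of the upmix using pydub (already in the dependency list), assuming Fish.audio returns 16-bit mono PCM at the requested sample rate:
```python
from pydub import AudioSegment

def mono_pcm_to_stereo(pcm: bytes) -> bytes:
    """Upmix 48 kHz / 16-bit mono PCM to the stereo layout discord.py expects."""
    segment = AudioSegment(
        data=pcm,
        sample_width=2,    # 16-bit
        frame_rate=48000,  # Discord standard
        channels=1,        # Fish.audio PCM is assumed to be mono here
    )
    return segment.set_channels(2).raw_data
```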
### Cost Considerations:
- **TTS**: $15.00 per million UTF-8 bytes
- Example: ~$0.015 for 1,000 characters (1 character ≈ 1 UTF-8 byte for ASCII text; see the helper below)
- Monitor usage at https://fish.audio/app/billing/
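A quick back-of-the-envelope helper for the pricing above (the rate comes from this section; check the billing page for current numbers):
```python
def estimate_tts_cost_usd(text: str) -> float:
    """Estimate TTS cost at $15.00 per million UTF-8 bytes."""
    return len(text.encode("utf-8")) * 15.00 / 1_000_000

estimate_tts_cost_usd("Hello!" * 167)  # ~1,000 characters -> ~$0.015
```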
### API Features Available:
- **Temperature** (0-1): Controls speech randomness/expressiveness
- **Prosody**: Control speed and volume
```python
"prosody": {
"speed": 1.0, # 0.5-2.0 range
"volume": 0 # -10 to 10 dB
}
```
- **Chunk Length** (100-300): Affects streaming speed
- **Normalize**: Reduces latency but may affect number/date pronunciation
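Put together, a REST payload using these options might look like the following sketch (values are illustrative; `prosody` and `chunk_length` are optional):
```python
payload = {
    "text": "Hello! Miku here!",
    "reference_id": MIKU_VOICE_ID,
    "format": "pcm",
    "latency": "balanced",
    "temperature": 0.9,
    "chunk_length": 200,                     # 100-300
    "normalize": True,                       # lower latency, may affect numbers/dates
    "prosody": {"speed": 1.0, "volume": 0},  # speed 0.5-2.0, volume -10 to 10 dB
}
```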
## Example: Integrate with Existing LLM
```python
import globals  # bot-wide config module (OLLAMA_MODEL, FISH_API_KEY, MIKU_VOICE_ID)
from utils.llm import query_ollama
from fish_audio_sdk import AsyncWebSocketSession, TTSRequest

async def miku_voice_response(user_message: str):
    """Generate Miku's response and convert to speech"""
    # 1. Get text response from existing LLM
    response_text = await query_ollama(
        prompt=user_message,
        model=globals.OLLAMA_MODEL
    )

    # 2. Convert to speech
    ws_session = AsyncWebSocketSession(globals.FISH_API_KEY)

    async def text_stream():
        # Can stream as LLM generates if needed
        yield response_text

    async with ws_session:
        async for audio_chunk in ws_session.tts(
            TTSRequest(
                text="",
                reference_id=globals.MIKU_VOICE_ID,
                format="pcm",
                sample_rate=48000
            ),
            text_stream()
        ):
            # Send to Discord voice channel
            yield audio_chunk
```
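One way to wire this into a command, as a sketch: `bot` is assumed to be the existing `commands.Bot` instance, and `mono_pcm_to_stereo()` is the upmix helper sketched in the Audio Format section.
```python
import io
import discord

@bot.command(name="speak")
async def speak(ctx, *, message: str):
    """Join the caller's voice channel (if needed) and speak Miku's reply."""
    if ctx.author.voice is None:
        await ctx.send("Join a voice channel first!")
        return
    voice_client = ctx.voice_client or await ctx.author.voice.channel.connect()

    # Simple approach: buffer the whole reply, then play it.
    # A lower-latency version could feed chunks to a custom AudioSource instead.
    pcm = bytearray()
    async for chunk in miku_voice_response(message):
        pcm.extend(chunk)

    voice_client.play(discord.PCMAudio(io.BytesIO(mono_pcm_to_stereo(bytes(pcm)))))
```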
## Rate Limits
Check the current rate limits at:
https://docs.fish.audio/developer-platform/models-pricing/pricing-and-rate-limits
## Additional Resources
- **API Reference**: https://docs.fish.audio/api-reference/introduction
- **Python SDK**: https://github.com/fishaudio/fish-audio-python
- **WebSocket Docs**: https://docs.fish.audio/sdk-reference/python/websocket
- **Discord Community**: https://discord.com/invite/dF9Db2Tt3Y
- **Support**: support@fish.audio
## Next Steps
1. Create Fish.audio account and get API key
2. Find/select Miku voice model and get its ID
3. Install required dependencies
4. Implement voice channel connection in bot
5. Add speech-to-text for user audio
6. Connect Fish.audio TTS to output audio
7. Test latency and quality