# Voice Chat Implementation with Fish.audio

## Overview
This document explains how to integrate Fish.audio TTS API with the Miku Discord bot for voice channel conversations.
## Fish.audio API Setup

### 1. Get an API Key

- Create an account at https://fish.audio/
- Get an API key from https://fish.audio/app/api-keys/

### 2. Find Your Miku Voice Model ID

- Browse voices at https://fish.audio/
- Find your Miku voice model
- Copy the model ID from the URL (e.g., `8ef4a238714b45718ce04243307c57a7`), or use the copy button on the voice page
## API Usage for Discord Voice Chat

### Basic TTS Request (REST API)
```python
import requests

def generate_speech(text: str, voice_id: str, api_key: str) -> bytes:
    """Generate speech using the Fish.audio REST API."""
    url = "https://api.fish.audio/v1/tts"
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
        "model": "s1",  # Recommended model
    }
    payload = {
        "text": text,
        "reference_id": voice_id,  # Your Miku voice model ID
        "format": "mp3",           # or "pcm" for raw audio
        "latency": "balanced",     # Lower latency for real-time use
        "temperature": 0.9,        # Controls randomness (0-1)
        "normalize": True,         # Reduces latency
    }
    response = requests.post(url, json=payload, headers=headers)
    response.raise_for_status()    # Fail loudly on API errors
    return response.content       # Raw audio bytes
```
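Long LLM replies may be worth splitting into several shorter requests so playback can start sooner. The helper below is a hypothetical utility (not part of the Fish.audio API) that splits text on sentence boundaries while keeping each chunk under a UTF-8 byte budget:

```python
import re

def split_text(text: str, max_bytes: int = 500) -> list[str]:
    """Split text into chunks of at most max_bytes UTF-8 bytes,
    breaking on sentence boundaries. A single sentence longer than
    max_bytes is kept whole rather than cut mid-word."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if len(candidate.encode("utf-8")) <= max_bytes:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be passed to `generate_speech` in turn.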
### Real-time Streaming (WebSocket, Recommended for VC)
```python
from fish_audio_sdk import WebSocketSession, TTSRequest

def stream_to_discord(text: str, voice_id: str, api_key: str):
    """Stream audio chunks suitable for a Discord voice channel."""
    ws_session = WebSocketSession(api_key)

    # Text generator: can yield text incrementally as your LLM produces it
    def text_stream():
        yield text

    with ws_session:
        for audio_chunk in ws_session.tts(
            TTSRequest(
                text="",               # Empty when streaming text separately
                reference_id=voice_id,
                format="pcm",          # Best for Discord
                sample_rate=48000,     # Discord uses 48 kHz
            ),
            text_stream(),
        ):
            # Forward each chunk to the Discord voice connection
            yield audio_chunk
```
### Async Streaming (Better for discord.py)

```python
from fish_audio_sdk import AsyncWebSocketSession, TTSRequest

async def async_stream_speech(text: str, voice_id: str, api_key: str) -> bytes:
    """Async streaming for discord.py integration."""
    ws_session = AsyncWebSocketSession(api_key)

    async def text_stream():
        yield text

    async with ws_session:
        audio_buffer = bytearray()
        async for audio_chunk in ws_session.tts(
            TTSRequest(
                text="",
                reference_id=voice_id,
                format="pcm",
                sample_rate=48000,
            ),
            text_stream(),
        ):
            audio_buffer.extend(audio_chunk)
        return bytes(audio_buffer)
```
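The PCM above arrives in whatever chunk sizes the API sends, while Discord playback works in 20 ms frames. A small helper (an illustrative sketch, assuming 48 kHz 16-bit mono PCM) re-slices the buffer into fixed-size frames:

```python
FRAME_BYTES = 48_000 * 2 * 20 // 1000  # 48 kHz, 16-bit mono, 20 ms -> 1920 bytes

def iter_frames(pcm: bytes, frame_bytes: int = FRAME_BYTES):
    """Yield fixed-size 20 ms frames, zero-padding (silence) the final
    partial frame so every frame has the same length."""
    for i in range(0, len(pcm), frame_bytes):
        frame = pcm[i:i + frame_bytes]
        if len(frame) < frame_bytes:
            frame += b"\x00" * (frame_bytes - len(frame))
        yield frame
```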
## Integration with Miku Bot

### Required Dependencies

Add to requirements.txt (the PyPI package for `speech_recognition` is `SpeechRecognition`):

```
discord.py[voice]
PyNaCl
fish-audio-sdk
SpeechRecognition  # For STT
pydub              # Audio processing
```
### Environment Variables

Add to your .env or docker-compose.yml:

```
FISH_API_KEY=your_api_key_here
MIKU_VOICE_ID=your_miku_model_id_here
```
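A minimal loader for these variables that fails fast when one is missing (the function name is an assumption, not part of any SDK):

```python
import os

def load_fish_config() -> tuple[str, str]:
    """Read Fish.audio credentials from the environment."""
    api_key = os.environ.get("FISH_API_KEY")
    voice_id = os.environ.get("MIKU_VOICE_ID")
    if not api_key or not voice_id:
        raise RuntimeError("FISH_API_KEY and MIKU_VOICE_ID must be set")
    return api_key, voice_id
```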
### Discord Voice Channel Flow

```
1. User speaks in VC
        ↓
2. Capture audio → Speech Recognition (STT)
        ↓
3. Convert speech to text
        ↓
4. Process with Miku's LLM (existing bot logic)
        ↓
5. Generate response text
        ↓
6. Send to Fish.audio TTS API
        ↓
7. Stream audio back to Discord VC
```
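The flow above can be sketched as one coroutine, with the STT, LLM, and TTS stages injected as placeholder callables (none of these names come from the bot's codebase):

```python
from typing import AsyncIterator, Awaitable, Callable

async def voice_pipeline(
    stt: Callable[[bytes], Awaitable[str]],
    llm: Callable[[str], Awaitable[str]],
    tts: Callable[[str], AsyncIterator[bytes]],
    user_audio: bytes,
) -> list[bytes]:
    """Steps 2-7 of the flow: speech -> text -> response -> audio chunks."""
    text = await stt(user_audio)      # steps 2-3: capture + STT
    reply = await llm(text)           # steps 4-5: generate response text
    chunks = []
    async for chunk in tts(reply):    # steps 6-7: TTS, streamed back
        chunks.append(chunk)
    return chunks
```

In the real bot, `tts` would wrap `async_stream_speech` above and the chunks would go to the voice connection instead of a list.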
## Key Implementation Details

### For Low-Latency Voice Chat

- Use WebSocket streaming instead of the REST API
- Set `latency: "balanced"` in requests
- Use `format: "pcm"` with `sample_rate: 48000` for Discord
- Stream LLM responses as they generate (don't wait for the full response)
### Audio Format for Discord

- Sample Rate: 48000 Hz (Discord standard)
- Channels: 1 (mono)
- Format: PCM (raw audio) or Opus (compressed)
- Bit Depth: 16-bit
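Note that discord.py's built-in `PCMAudio` source expects 16-bit 48 kHz *stereo*, so mono PCM from the TTS API needs each sample duplicated into both channels. A pure-Python sketch of that conversion:

```python
def mono_to_stereo(pcm: bytes) -> bytes:
    """Duplicate each 16-bit mono sample into left and right channels
    (assumes little-endian signed 16-bit PCM)."""
    out = bytearray()
    for i in range(0, len(pcm) - 1, 2):
        sample = pcm[i:i + 2]
        out += sample + sample  # same sample on both channels
    return bytes(out)
```

For production use, `audioop.tostereo` or pydub can do the same conversion faster.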
### Cost Considerations

- TTS: $15.00 per million UTF-8 bytes
- Example: ~$0.015 for 1,000 ASCII characters (1,000 bytes)
- Monitor usage at https://fish.audio/app/billing/
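A back-of-the-envelope estimator based on the pricing above (the helper itself is illustrative, not an official calculator):

```python
PRICE_PER_MILLION_BYTES = 15.00  # USD, from the pricing above

def estimate_tts_cost(text: str) -> float:
    """Estimate TTS cost in USD from the UTF-8 byte length of the input."""
    return len(text.encode("utf-8")) / 1_000_000 * PRICE_PER_MILLION_BYTES
```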
### API Features Available

- Temperature (0-1): controls speech randomness/expressiveness
- Prosody: controls speed and volume:

  ```
  "prosody": {
      "speed": 1.0,   # 0.5-2.0 range
      "volume": 0     # -10 to +10 dB
  }
  ```

- Chunk Length (100-300): affects streaming speed
- Normalize: reduces latency but may affect number/date pronunciation
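These fields can be assembled into one REST payload. The builder below is a sketch that clamps each value to the documented range; the field names follow the list above, but the function itself is hypothetical:

```python
def build_tts_payload(text: str, voice_id: str, *, speed: float = 1.0,
                      volume: int = 0, chunk_length: int = 200,
                      temperature: float = 0.9, normalize: bool = True) -> dict:
    """Assemble a TTS request payload, clamping tuning values to the
    documented ranges so out-of-range inputs can't reach the API."""
    return {
        "text": text,
        "reference_id": voice_id,
        "format": "pcm",
        "sample_rate": 48000,
        "temperature": max(0.0, min(1.0, temperature)),
        "chunk_length": max(100, min(300, chunk_length)),
        "normalize": normalize,
        "prosody": {
            "speed": max(0.5, min(2.0, speed)),
            "volume": max(-10, min(10, volume)),
        },
    }
```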
### Example: Integrate with the Existing LLM

```python
from fish_audio_sdk import AsyncWebSocketSession, TTSRequest

import globals  # Bot-wide config (OLLAMA_MODEL, FISH_API_KEY, MIKU_VOICE_ID)
from utils.llm import query_ollama

async def miku_voice_response(user_message: str):
    """Generate Miku's response and convert it to speech."""
    # 1. Get the text response from the existing LLM
    response_text = await query_ollama(
        prompt=user_message,
        model=globals.OLLAMA_MODEL,
    )

    # 2. Convert to speech
    ws_session = AsyncWebSocketSession(globals.FISH_API_KEY)

    async def text_stream():
        # Could yield incrementally as the LLM generates, if needed
        yield response_text

    async with ws_session:
        async for audio_chunk in ws_session.tts(
            TTSRequest(
                text="",
                reference_id=globals.MIKU_VOICE_ID,
                format="pcm",
                sample_rate=48000,
            ),
            text_stream(),
        ):
            # Send to the Discord voice channel
            yield audio_chunk
```
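To play these chunks in discord.py you would typically wrap them in a custom `discord.AudioSource`. The buffer below is a framework-independent sketch of the `read()` contract that source must satisfy: one fixed-size frame per call, silence-padded, empty bytes at end of stream (the class name and frame size default are assumptions):

```python
class PCMBuffer:
    """Accumulates TTS chunks and serves fixed-size reads, the shape
    discord.py's AudioSource.read() expects (one 20 ms frame per call)."""

    def __init__(self, frame_bytes: int = 3840):  # 48 kHz 16-bit stereo, 20 ms
        self._buf = bytearray()
        self.frame_bytes = frame_bytes

    def feed(self, chunk: bytes) -> None:
        """Append a freshly received audio chunk."""
        self._buf.extend(chunk)

    def read(self) -> bytes:
        """Return one frame, padding a short final frame with silence."""
        frame = bytes(self._buf[:self.frame_bytes])
        del self._buf[:self.frame_bytes]
        if frame and len(frame) < self.frame_bytes:
            frame += b"\x00" * (self.frame_bytes - len(frame))
        return frame  # b"" signals end of stream
```

A real `AudioSource` subclass would call `feed` from the TTS consumer task and let the voice client call `read` every 20 ms.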
## Rate Limits

Check the current rate limits at https://docs.fish.audio/developer-platform/models-pricing/pricing-and-rate-limits
## Additional Resources
- API Reference: https://docs.fish.audio/api-reference/introduction
- Python SDK: https://github.com/fishaudio/fish-audio-python
- WebSocket Docs: https://docs.fish.audio/sdk-reference/python/websocket
- Discord Community: https://discord.com/invite/dF9Db2Tt3Y
- Support: support@fish.audio
## Next Steps
- Create Fish.audio account and get API key
- Find/select Miku voice model and get its ID
- Install required dependencies
- Implement voice channel connection in bot
- Add speech-to-text for user audio
- Connect Fish.audio TTS to output audio
- Test latency and quality