# Voice Chat Implementation with Fish.audio

## Overview

This document explains how to integrate the Fish.audio TTS API with the Miku Discord bot for voice channel conversations.

## Fish.audio API Setup

### 1. Get API Key

- Create an account at https://fish.audio/
- Get an API key from: https://fish.audio/app/api-keys/

### 2. Find Your Miku Voice Model ID

- Browse voices at https://fish.audio/
- Find your Miku voice model
- Copy the model ID from the URL (e.g., `8ef4a238714b45718ce04243307c57a7`)
- Or use the copy button on the voice page

## API Usage for Discord Voice Chat

### Basic TTS Request (REST API)

```python
import requests

def generate_speech(text: str, voice_id: str, api_key: str) -> bytes:
    """Generate speech using the Fish.audio TTS API."""
    url = "https://api.fish.audio/v1/tts"
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
        "model": "s1"  # Recommended model
    }
    payload = {
        "text": text,
        "reference_id": voice_id,  # Your Miku voice model ID
        "format": "mp3",           # or "pcm" for raw audio
        "latency": "balanced",     # Lower latency for real-time use
        "temperature": 0.9,        # Controls randomness (0-1)
        "normalize": True          # Text normalization (see notes below)
    }
    response = requests.post(url, json=payload, headers=headers)
    response.raise_for_status()
    return response.content  # Audio bytes
```

### Real-time Streaming (WebSocket - Recommended for VC)

```python
from fish_audio_sdk import WebSocketSession, TTSRequest

def stream_to_discord(text: str, voice_id: str, api_key: str):
    """Stream audio directly to a Discord voice channel."""
    ws_session = WebSocketSession(api_key)

    # Text generator (can stream from LLM responses)
    def text_stream():
        # You can yield text as it is generated by your LLM
        yield text

    with ws_session:
        for audio_chunk in ws_session.tts(
            TTSRequest(
                text="",               # Empty when streaming
                reference_id=voice_id,
                format="pcm",          # Best for Discord
                sample_rate=48000      # Discord uses 48 kHz
            ),
            text_stream()
        ):
            # Send audio_chunk to the Discord voice channel
            yield audio_chunk
```

### Async Streaming (Better for Discord.py)

```python
from fish_audio_sdk import AsyncWebSocketSession, TTSRequest

async def async_stream_speech(text: str, voice_id: str, api_key: str) -> bytes:
    """Async streaming for Discord.py integration."""
    ws_session = AsyncWebSocketSession(api_key)

    async def text_stream():
        yield text

    async with ws_session:
        audio_buffer = bytearray()
        async for audio_chunk in ws_session.tts(
            TTSRequest(
                text="",
                reference_id=voice_id,
                format="pcm",
                sample_rate=48000
            ),
            text_stream()
        ):
            audio_buffer.extend(audio_chunk)
        return bytes(audio_buffer)
```

## Integration with Miku Bot

### Required Dependencies

Add to `requirements.txt`:

```
discord.py[voice]
PyNaCl
fish-audio-sdk
SpeechRecognition  # For STT (imported as speech_recognition)
pydub              # Audio processing
```

### Environment Variables

Add to your `.env` or docker-compose.yml:

```bash
FISH_API_KEY=your_api_key_here
MIKU_VOICE_ID=your_miku_model_id_here
```

### Discord Voice Channel Flow

```
1. User speaks in VC
        ↓
2. Capture audio → Speech Recognition (STT)
        ↓
3. Convert speech to text
        ↓
4. Process with Miku's LLM (existing bot logic)
        ↓
5. Generate response text
        ↓
6. Send to Fish.audio TTS API
        ↓
7. Stream audio back to Discord VC
```

## Key Implementation Details

### For Low Latency Voice Chat:

- Use WebSocket streaming instead of the REST API
- Set `latency: "balanced"` in requests
- Use `format: "pcm"` with `sample_rate: 48000` for Discord
- Stream LLM responses as they generate (don't wait for the full response)

### Audio Format for Discord:

- **Sample Rate**: 48000 Hz (Discord standard)
- **Channels**: 1 (mono)
- **Format**: PCM (raw audio) or Opus (compressed)
- **Bit Depth**: 16-bit

### Cost Considerations:

- **TTS**: $15.00 per million UTF-8 bytes
  - Example: ~$0.015 for 1000 characters
- Monitor usage at https://fish.audio/app/billing/

### API Features Available:

- **Temperature** (0-1): Controls speech randomness/expressiveness
- **Prosody**: Controls speed and volume

  ```python
  "prosody": {
      "speed": 1.0,   # 0.5-2.0 range
      "volume": 0     # -10 to 10 dB
  }
  ```

- **Chunk Length** (100-300): Affects streaming speed
- **Normalize**: Reduces latency but may affect number/date pronunciation

## Example: Integrate with Existing LLM

```python
from utils.llm import query_ollama
from fish_audio_sdk import AsyncWebSocketSession, TTSRequest

async def miku_voice_response(user_message: str):
    """Generate Miku's response and convert it to speech."""
    # 1. Get the text response from the existing LLM
    response_text = await query_ollama(
        prompt=user_message,
        model=globals.OLLAMA_MODEL
    )

    # 2. Convert to speech
    ws_session = AsyncWebSocketSession(globals.FISH_API_KEY)

    async def text_stream():
        # Can stream as the LLM generates, if needed
        yield response_text

    async with ws_session:
        async for audio_chunk in ws_session.tts(
            TTSRequest(
                text="",
                reference_id=globals.MIKU_VOICE_ID,
                format="pcm",
                sample_rate=48000
            ),
            text_stream()
        ):
            # Send to the Discord voice channel
            yield audio_chunk
```

## Rate Limits

Check the current rate limits at:
https://docs.fish.audio/developer-platform/models-pricing/pricing-and-rate-limits

## Additional Resources

- **API Reference**: https://docs.fish.audio/api-reference/introduction
- **Python SDK**: https://github.com/fishaudio/fish-audio-python
- **WebSocket Docs**: https://docs.fish.audio/sdk-reference/python/websocket
- **Discord Community**: https://discord.com/invite/dF9Db2Tt3Y
- **Support**: support@fish.audio

## Next Steps

1. Create a Fish.audio account and get an API key
2. Find/select a Miku voice model and get its ID
3. Install the required dependencies
4. Implement the voice channel connection in the bot
5. Add speech-to-text for user audio
6. Connect Fish.audio TTS to the output audio
7. Test latency and quality
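## Appendix: Feeding PCM to Discord.py

One wrinkle for step 6: Fish.audio is configured above to emit mono PCM, but discord.py's `discord.PCMAudio` source expects 48 kHz 16-bit *stereo* PCM. A minimal sketch of the conversion, assuming the mono output really is 16-bit little-endian samples (the `voice_client` usage below is a hypothetical illustration, not tested against a live bot):

```python
import io

def mono_to_stereo(pcm_mono: bytes) -> bytes:
    """Duplicate each 16-bit mono sample onto both channels."""
    out = bytearray()
    for i in range(0, len(pcm_mono) - 1, 2):
        sample = pcm_mono[i:i + 2]
        out += sample + sample  # same sample on left and right
    return bytes(out)

# Usage sketch inside the bot, after collecting PCM from Fish.audio:
#   stereo = mono_to_stereo(audio_bytes)
#   voice_client.play(discord.PCMAudio(io.BytesIO(stereo)))
```

Duplicating samples is the simplest approach; `pydub` (already in the dependency list) can do the same resampling/channel conversion if you prefer.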
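## Appendix: Estimating TTS Cost

The pricing figures above ($15.00 per million UTF-8 bytes) are easy to sanity-check before sending long responses to the API. A small helper (the function name and constant are ours, not part of the Fish.audio SDK):

```python
PRICE_PER_MILLION_BYTES = 15.00  # USD, per the TTS pricing above

def estimate_tts_cost(text: str) -> float:
    """Estimate the Fish.audio TTS cost of a piece of text, in USD."""
    byte_count = len(text.encode("utf-8"))
    return byte_count / 1_000_000 * PRICE_PER_MILLION_BYTES

# 1000 ASCII characters = 1000 UTF-8 bytes, matching the ~$0.015 example
print(round(estimate_tts_cost("a" * 1000), 6))  # → 0.015
```

Note that non-ASCII text (e.g., Japanese) uses multiple bytes per character, so character count alone underestimates the cost.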