
Voice Chat Implementation with Fish.audio

Overview

This document explains how to integrate Fish.audio TTS API with the Miku Discord bot for voice channel conversations.

Fish.audio API Setup

1. Get API Key

  • Create an account at https://fish.audio/ and generate an API key

2. Find Your Miku Voice Model ID

  • Browse voices at https://fish.audio/
  • Find your Miku voice model
  • Copy the model ID from the URL (e.g., 8ef4a238714b45718ce04243307c57a7)
  • Or use the copy button on the voice page
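If you grab the full voice-page URL instead of the bare ID, a tiny helper can normalize it. This is a minimal sketch that assumes the ID is the last path segment, as in the example ID above:

```python
def extract_voice_id(url_or_id: str) -> str:
    """Return the model ID whether given a bare ID or a full
    voice-page URL (ID assumed to be the last path segment)."""
    return url_or_id.rstrip("/").split("/")[-1]
```

This accepts either form, so config values can be pasted straight from the browser.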

API Usage for Discord Voice Chat

Basic TTS Request (REST API)

import requests

def generate_speech(text: str, voice_id: str, api_key: str) -> bytes:
    """Generate speech using Fish.audio API"""
    url = "https://api.fish.audio/v1/tts"
    
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
        "model": "s1"  # Recommended model
    }
    
    payload = {
        "text": text,
        "reference_id": voice_id,  # Your Miku voice model ID
        "format": "mp3",           # or "pcm" for raw audio
        "latency": "balanced",     # Lower latency for real-time
        "temperature": 0.9,        # Controls randomness (0-1)
        "normalize": True          # Reduces latency
    }
    
    response = requests.post(url, json=payload, headers=headers)
    return response.content  # Returns audio bytes
Streaming Speech (WebSocket)

from fish_audio_sdk import WebSocketSession, TTSRequest

def stream_to_discord(text: str, voice_id: str, api_key: str):
    """Stream audio directly to Discord voice channel"""
    ws_session = WebSocketSession(api_key)
    
    # Define text generator (can stream from LLM responses)
    def text_stream():
        # You can yield text as it's generated from your LLM
        yield text
    
    with ws_session:
        for audio_chunk in ws_session.tts(
            TTSRequest(
                text="",  # Empty when streaming
                reference_id=voice_id,
                format="pcm",        # Best for Discord
                sample_rate=48000    # Discord uses 48kHz
            ),
            text_stream()
        ):
            # Send audio_chunk to Discord voice channel
            yield audio_chunk

Async Streaming (Better for Discord.py)

from fish_audio_sdk import AsyncWebSocketSession, TTSRequest
import asyncio

async def async_stream_speech(text: str, voice_id: str, api_key: str):
    """Async streaming for Discord.py integration"""
    ws_session = AsyncWebSocketSession(api_key)
    
    async def text_stream():
        yield text
    
    async with ws_session:
        audio_buffer = bytearray()
        async for audio_chunk in ws_session.tts(
            TTSRequest(
                text="",
                reference_id=voice_id,
                format="pcm",
                sample_rate=48000
            ),
            text_stream()
        ):
            audio_buffer.extend(audio_chunk)
    
    return bytes(audio_buffer)

Integration with Miku Bot

Required Dependencies

Add to requirements.txt:

discord.py[voice]
PyNaCl
fish-audio-sdk
SpeechRecognition  # For STT
pydub  # Audio processing

Environment Variables

Add to your .env or docker-compose.yml:

FISH_API_KEY=your_api_key_here
MIKU_VOICE_ID=your_miku_model_id_here
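Reading these at startup with a loud failure beats a confusing error later. A minimal sketch using only the two variable names defined above:

```python
import os

def load_voice_config() -> dict:
    """Read Fish.audio settings from the environment and fail fast
    if either required variable is missing or empty."""
    api_key = os.getenv("FISH_API_KEY")
    voice_id = os.getenv("MIKU_VOICE_ID")
    missing = [name for name, value in
               [("FISH_API_KEY", api_key), ("MIKU_VOICE_ID", voice_id)]
               if not value]
    if missing:
        raise RuntimeError(f"Missing required env vars: {', '.join(missing)}")
    return {"api_key": api_key, "voice_id": voice_id}
```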

Discord Voice Channel Flow

1. User speaks in VC
   ↓
2. Capture audio → Speech Recognition (STT)
   ↓
3. Convert speech to text
   ↓
4. Process with Miku's LLM (existing bot logic)
   ↓
5. Generate response text
   ↓
6. Send to Fish.audio TTS API
   ↓
7. Stream audio back to Discord VC
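One way to wire the steps above together is to inject each stage as a callable, keeping the STT, LLM, TTS, and playback pieces swappable. All four callables here are hypothetical stand-ins, not real library APIs:

```python
async def handle_voice_turn(captured_audio, stt, llm, tts, play):
    """Run one voice-chat turn: audio in, spoken reply out.
    stt/llm are async callables, tts is an async generator of
    audio chunks, play pushes a chunk to the Discord VC."""
    text = await stt(captured_audio)   # steps 2-3: speech -> text
    reply = await llm(text)            # steps 4-5: Miku's LLM response
    async for chunk in tts(reply):     # step 6: Fish.audio TTS stream
        await play(chunk)              # step 7: into the voice channel
```

Because nothing here touches discord.py or Fish.audio directly, each stage can be unit-tested with stubs.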

Key Implementation Details

For Low Latency Voice Chat:

  • Use WebSocket streaming instead of REST API
  • Set latency: "balanced" in requests
  • Use format: "pcm" with sample_rate: 48000 for Discord
  • Stream LLM responses as they generate (don't wait for full response)
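The last bullet — streaming LLM output as it generates — can be sketched by grouping tokens into sentence-sized chunks before feeding them to the TTS websocket, so playback starts on the first sentence. A stdlib-only sketch (the token stream itself is assumed to come from your LLM client):

```python
async def sentence_chunks(token_stream):
    """Group streamed LLM tokens into sentences so TTS can start
    speaking the first sentence before the full reply exists."""
    buffer = ""
    async for token in token_stream:
        buffer += token
        if buffer and buffer[-1] in ".!?\n":
            yield buffer.strip()
            buffer = ""
    if buffer.strip():  # flush any trailing partial sentence
        yield buffer.strip()
```

Each yielded chunk can then be sent through the websocket text stream shown earlier.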

Audio Format for Discord:

  • Sample Rate: 48000 Hz (Discord standard)
  • Channels: 2 (stereo) — discord.py's raw PCM playback expects stereo, so upmix mono TTS output
  • Format: PCM (raw audio) or Opus (compressed)
  • Bit Depth: 16-bit
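If the TTS engine returns mono PCM, duplicating each sample into both channels is the simplest upmix. A minimal sketch assuming 16-bit samples (pydub or FFmpeg can do the same job):

```python
def mono_to_stereo(pcm: bytes, sample_width: int = 2) -> bytes:
    """Upmix mono PCM to stereo by writing each sample twice,
    once per channel (16-bit samples via the default sample_width)."""
    out = bytearray()
    for i in range(0, len(pcm), sample_width):
        sample = pcm[i:i + sample_width]
        out += sample + sample
    return bytes(out)
```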

Cost Considerations:

  • See the pricing page linked under Rate Limits below for current per-character costs

API Features Available:

  • Temperature (0-1): Controls speech randomness/expressiveness
  • Prosody: Control speed and volume
    "prosody": {
        "speed": 1.0,  # 0.5-2.0 range
        "volume": 0    # -10 to 10 dB
    }
    
  • Chunk Length (100-300): Affects streaming speed
  • Normalize: Improves number/date pronunciation; disabling it can reduce latency
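Putting those knobs together, a payload for expressive low-latency output might look like this. Parameter names follow the ones used earlier in this doc; verify them against the current API reference before relying on them:

```python
def build_tts_payload(text: str, voice_id: str) -> dict:
    """Assemble a Fish.audio TTS payload combining the tuning
    options above (names assumed from this doc's examples)."""
    return {
        "text": text,
        "reference_id": voice_id,
        "format": "pcm",
        "sample_rate": 48000,
        "latency": "balanced",
        "temperature": 0.9,
        "chunk_length": 200,   # 100-300; smaller starts streaming sooner
        "normalize": True,     # better number/date reading
        "prosody": {"speed": 1.0, "volume": 0},
    }
```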

Example: Integrate with Existing LLM

from utils.llm import query_ollama
from fish_audio_sdk import AsyncWebSocketSession, TTSRequest

async def miku_voice_response(user_message: str):
    """Generate Miku's response and convert to speech"""
    
    # 1. Get text response from existing LLM
    response_text = await query_ollama(
        prompt=user_message,
        model=globals.OLLAMA_MODEL
    )
    
    # 2. Convert to speech
    ws_session = AsyncWebSocketSession(globals.FISH_API_KEY)
    
    async def text_stream():
        # Can stream as LLM generates if needed
        yield response_text
    
    async with ws_session:
        async for audio_chunk in ws_session.tts(
            TTSRequest(
                text="",
                reference_id=globals.MIKU_VOICE_ID,
                format="pcm",
                sample_rate=48000
            ),
            text_stream()
        ):
            # Send to Discord voice channel
            yield audio_chunk

Rate Limits

Check the current rate limits at: https://docs.fish.audio/developer-platform/models-pricing/pricing-and-rate-limits

Next Steps

  1. Create Fish.audio account and get API key
  2. Find/select Miku voice model and get its ID
  3. Install required dependencies
  4. Implement voice channel connection in bot
  5. Add speech-to-text for user audio
  6. Connect Fish.audio TTS to output audio
  7. Test latency and quality