
Voice Chat Implementation with Fish.audio

Overview

This document explains how to integrate Fish.audio TTS API with the Miku Discord bot for voice channel conversations.

Fish.audio API Setup

1. Get API Key

  • Create an account at https://fish.audio/ and generate an API key

2. Find Your Miku Voice Model ID

  • Browse voices at https://fish.audio/
  • Find your Miku voice model
  • Copy the model ID from the URL (e.g., 8ef4a238714b45718ce04243307c57a7)
  • Or use the copy button on the voice page
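If you grab the full voice-page URL instead of the bare ID, a tiny helper can normalize it. This is a minimal sketch that assumes the ID is the last path segment, as in the example ID above:

```python
def extract_voice_id(url_or_id: str) -> str:
    """Return the model ID whether given a bare ID or a full
    voice-page URL (ID assumed to be the last path segment)."""
    return url_or_id.rstrip("/").split("/")[-1]
```

This accepts either form, so config values can be pasted straight from the browser.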

API Usage for Discord Voice Chat

Basic TTS Request (REST API)

import requests

def generate_speech(text: str, voice_id: str, api_key: str) -> bytes:
    """Generate speech using Fish.audio API"""
    url = "https://api.fish.audio/v1/tts"
    
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
        "model": "s1"  # Recommended model
    }
    
    payload = {
        "text": text,
        "reference_id": voice_id,  # Your Miku voice model ID
        "format": "mp3",           # or "pcm" for raw audio
        "latency": "balanced",     # Lower latency for real-time
        "temperature": 0.9,        # Controls randomness (0-1)
        "normalize": True          # Reduces latency
    }
    
    response = requests.post(url, json=payload, headers=headers)
    return response.content  # Returns audio bytes
Streaming Speech (WebSocket)

from fish_audio_sdk import WebSocketSession, TTSRequest

def stream_to_discord(text: str, voice_id: str, api_key: str):
    """Stream audio directly to Discord voice channel"""
    ws_session = WebSocketSession(api_key)
    
    # Define text generator (can stream from LLM responses)
    def text_stream():
        # You can yield text as it's generated from your LLM
        yield text
    
    with ws_session:
        for audio_chunk in ws_session.tts(
            TTSRequest(
                text="",  # Empty when streaming
                reference_id=voice_id,
                format="pcm",        # Best for Discord
                sample_rate=48000    # Discord uses 48kHz
            ),
            text_stream()
        ):
            # Send audio_chunk to Discord voice channel
            yield audio_chunk

Async Streaming (Better for Discord.py)

from fish_audio_sdk import AsyncWebSocketSession, TTSRequest
import asyncio

async def async_stream_speech(text: str, voice_id: str, api_key: str):
    """Async streaming for Discord.py integration"""
    ws_session = AsyncWebSocketSession(api_key)
    
    async def text_stream():
        yield text
    
    async with ws_session:
        audio_buffer = bytearray()
        async for audio_chunk in ws_session.tts(
            TTSRequest(
                text="",
                reference_id=voice_id,
                format="pcm",
                sample_rate=48000
            ),
            text_stream()
        ):
            audio_buffer.extend(audio_chunk)
    
    return bytes(audio_buffer)

Integration with Miku Bot

Required Dependencies

Add to requirements.txt:

discord.py[voice]
PyNaCl
fish-audio-sdk
SpeechRecognition  # For STT
pydub  # Audio processing

Environment Variables

Add to your .env or docker-compose.yml:

FISH_API_KEY=your_api_key_here
MIKU_VOICE_ID=your_miku_model_id_here
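Reading these at startup with a loud failure beats a confusing error later. A minimal sketch using only the two variable names defined above:

```python
import os

def load_voice_config() -> dict:
    """Read Fish.audio settings from the environment and fail fast
    if either required variable is missing or empty."""
    api_key = os.getenv("FISH_API_KEY")
    voice_id = os.getenv("MIKU_VOICE_ID")
    missing = [name for name, value in
               [("FISH_API_KEY", api_key), ("MIKU_VOICE_ID", voice_id)]
               if not value]
    if missing:
        raise RuntimeError(f"Missing required env vars: {', '.join(missing)}")
    return {"api_key": api_key, "voice_id": voice_id}
```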

Discord Voice Channel Flow

1. User speaks in VC
   ↓
2. Capture audio → Speech Recognition (STT)
   ↓
3. Convert speech to text
   ↓
4. Process with Miku's LLM (existing bot logic)
   ↓
5. Generate response text
   ↓
6. Send to Fish.audio TTS API
   ↓
7. Stream audio back to Discord VC
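One way to wire the steps above together is to inject each stage as a callable, keeping the STT, LLM, TTS, and playback pieces swappable. All four callables here are hypothetical stand-ins, not real library APIs:

```python
async def handle_voice_turn(captured_audio, stt, llm, tts, play):
    """Run one voice-chat turn: audio in, spoken reply out.
    stt/llm are async callables, tts is an async generator of
    audio chunks, play pushes a chunk to the Discord VC."""
    text = await stt(captured_audio)   # steps 2-3: speech -> text
    reply = await llm(text)            # steps 4-5: Miku's LLM response
    async for chunk in tts(reply):     # step 6: Fish.audio TTS stream
        await play(chunk)              # step 7: into the voice channel
```

Because nothing here touches discord.py or Fish.audio directly, each stage can be unit-tested with stubs.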

Key Implementation Details

For Low Latency Voice Chat:

  • Use WebSocket streaming instead of REST API
  • Set latency: "balanced" in requests
  • Use format: "pcm" with sample_rate: 48000 for Discord
  • Stream LLM responses as they generate (don't wait for full response)
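The last bullet — streaming LLM output as it generates — can be sketched by grouping tokens into sentence-sized chunks before feeding them to the TTS websocket, so playback starts on the first sentence. A stdlib-only sketch (the token stream itself is assumed to come from your LLM client):

```python
async def sentence_chunks(token_stream):
    """Group streamed LLM tokens into sentences so TTS can start
    speaking the first sentence before the full reply exists."""
    buffer = ""
    async for token in token_stream:
        buffer += token
        if buffer and buffer[-1] in ".!?\n":
            yield buffer.strip()
            buffer = ""
    if buffer.strip():  # flush any trailing partial sentence
        yield buffer.strip()
```

Each yielded chunk can then be sent through the websocket text stream shown earlier.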

Audio Format for Discord:

  • Sample Rate: 48000 Hz (Discord standard)
  • Channels: 2 (stereo) — discord.py's raw PCM playback expects stereo, so upmix mono TTS output
  • Format: PCM (raw audio) or Opus (compressed)
  • Bit Depth: 16-bit
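If the TTS engine returns mono PCM, duplicating each sample into both channels is the simplest upmix. A minimal sketch assuming 16-bit samples (pydub or FFmpeg can do the same job):

```python
def mono_to_stereo(pcm: bytes, sample_width: int = 2) -> bytes:
    """Upmix mono PCM to stereo by writing each sample twice,
    once per channel (16-bit samples via the default sample_width)."""
    out = bytearray()
    for i in range(0, len(pcm), sample_width):
        sample = pcm[i:i + sample_width]
        out += sample + sample
    return bytes(out)
```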

Cost Considerations:

  • See the pricing page linked under Rate Limits below for current per-character costs

API Features Available:

  • Temperature (0-1): Controls speech randomness/expressiveness
  • Prosody: Control speed and volume
    "prosody": {
        "speed": 1.0,  # 0.5-2.0 range
        "volume": 0    # -10 to 10 dB
    }
    
  • Chunk Length (100-300): Affects streaming speed
  • Normalize: Improves number/date pronunciation; disabling it can reduce latency
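Putting those knobs together, a payload for expressive low-latency output might look like this. Parameter names follow the ones used earlier in this doc; verify them against the current API reference before relying on them:

```python
def build_tts_payload(text: str, voice_id: str) -> dict:
    """Assemble a Fish.audio TTS payload combining the tuning
    options above (names assumed from this doc's examples)."""
    return {
        "text": text,
        "reference_id": voice_id,
        "format": "pcm",
        "sample_rate": 48000,
        "latency": "balanced",
        "temperature": 0.9,
        "chunk_length": 200,   # 100-300; smaller starts streaming sooner
        "normalize": True,     # better number/date reading
        "prosody": {"speed": 1.0, "volume": 0},
    }
```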

Example: Integrate with Existing LLM

from utils.llm import query_ollama
from fish_audio_sdk import AsyncWebSocketSession, TTSRequest

async def miku_voice_response(user_message: str):
    """Generate Miku's response and convert to speech"""
    
    # 1. Get text response from existing LLM
    response_text = await query_ollama(
        prompt=user_message,
        model=globals.OLLAMA_MODEL
    )
    
    # 2. Convert to speech
    ws_session = AsyncWebSocketSession(globals.FISH_API_KEY)
    
    async def text_stream():
        # Can stream as LLM generates if needed
        yield response_text
    
    async with ws_session:
        async for audio_chunk in ws_session.tts(
            TTSRequest(
                text="",
                reference_id=globals.MIKU_VOICE_ID,
                format="pcm",
                sample_rate=48000
            ),
            text_stream()
        ):
            # Send to Discord voice channel
            yield audio_chunk

Rate Limits

Check the current rate limits at: https://docs.fish.audio/developer-platform/models-pricing/pricing-and-rate-limits

Next Steps

  1. Create Fish.audio account and get API key
  2. Find/select Miku voice model and get its ID
  3. Install required dependencies
  4. Implement voice channel connection in bot
  5. Add speech-to-text for user audio
  6. Connect Fish.audio TTS to output audio
  7. Test latency and quality