# Voice Chat Implementation with Fish.audio
## Overview
This document explains how to integrate Fish.audio TTS API with the Miku Discord bot for voice channel conversations.
## Fish.audio API Setup
### 1. Get API Key
- Create account at https://fish.audio/
- Get API key from: https://fish.audio/app/api-keys/
### 2. Find Your Miku Voice Model ID
- Browse voices at https://fish.audio/
- Find your Miku voice model
- Copy the model ID from the URL (e.g., `8ef4a238714b45718ce04243307c57a7`)
- Or use the copy button on the voice page
## API Usage for Discord Voice Chat
### Basic TTS Request (REST API)
```python
import requests

def generate_speech(text: str, voice_id: str, api_key: str) -> bytes:
    """Generate speech using Fish.audio API"""
    url = "https://api.fish.audio/v1/tts"
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
        "model": "s1"  # Recommended model
    }
    payload = {
        "text": text,
        "reference_id": voice_id,  # Your Miku voice model ID
        "format": "mp3",           # or "pcm" for raw audio
        "latency": "balanced",     # Lower latency for real-time
        "temperature": 0.9,        # Controls randomness (0-1)
        "normalize": True          # Reduces latency
    }
    response = requests.post(url, json=payload, headers=headers)
    response.raise_for_status()  # Surface API errors instead of returning an error body as "audio"
    return response.content      # Returns audio bytes
```
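A quick way to sanity-check the API key and model ID, as a sketch; it assumes the two environment variables from the Environment Variables section below are already set:
```python
import os

# Hypothetical smoke test: synthesize one line and write it to disk.
audio = generate_speech(
    "Hello! Miku here!",
    voice_id=os.environ["MIKU_VOICE_ID"],
    api_key=os.environ["FISH_API_KEY"],
)
with open("miku_test.mp3", "wb") as f:  # matches the "mp3" format used above
    f.write(audio)
```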
### Real-time Streaming (WebSocket - Recommended for VC)
```python
from fish_audio_sdk import WebSocketSession, TTSRequest

def stream_to_discord(text: str, voice_id: str, api_key: str):
    """Stream audio directly to Discord voice channel"""
    ws_session = WebSocketSession(api_key)

    # Define text generator (can stream from LLM responses)
    def text_stream():
        # You can yield text as it's generated from your LLM
        yield text

    with ws_session:
        for audio_chunk in ws_session.tts(
            TTSRequest(
                text="",               # Empty when streaming
                reference_id=voice_id,
                format="pcm",          # Best for Discord
                sample_rate=48000      # Discord uses 48kHz
            ),
            text_stream()
        ):
            # Send audio_chunk to Discord voice channel
            yield audio_chunk
```
### Async Streaming (Better for Discord.py)
```python
from fish_audio_sdk import AsyncWebSocketSession, TTSRequest

async def async_stream_speech(text: str, voice_id: str, api_key: str):
    """Async streaming for Discord.py integration"""
    ws_session = AsyncWebSocketSession(api_key)

    async def text_stream():
        yield text

    async with ws_session:
        audio_buffer = bytearray()
        async for audio_chunk in ws_session.tts(
            TTSRequest(
                text="",
                reference_id=voice_id,
                format="pcm",
                sample_rate=48000
            ),
            text_stream()
        ):
            audio_buffer.extend(audio_chunk)
        return bytes(audio_buffer)
```
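Once a full response is buffered like this, it can be handed to discord.py for playback. A minimal sketch, assuming an already-connected `discord.VoiceClient` and PCM that has been converted to the stereo layout described in the Audio Format section below:
```python
import io
import discord

async def play_pcm(voice_client: discord.VoiceClient, stereo_pcm: bytes):
    """Play 48 kHz / 16-bit / stereo PCM through a connected voice client."""
    source = discord.PCMAudio(io.BytesIO(stereo_pcm))
    if not voice_client.is_playing():
        voice_client.play(source)
```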
## Integration with Miku Bot
### Required Dependencies
Add to `requirements.txt`:
```
discord.py[voice]
PyNaCl
fish-audio-sdk
speech_recognition # For STT
pydub # Audio processing
```
### Environment Variables
Add to your `.env` or docker-compose.yml:
```bash
FISH_API_KEY=your_api_key_here
MIKU_VOICE_ID=your_miku_model_id_here
```
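These can be read at startup, for example as below; the later examples reference them through the bot's `globals` module, which is assumed to expose the same values:
```python
import os

# Assumed to be set via .env / docker-compose as shown above.
FISH_API_KEY = os.environ["FISH_API_KEY"]
MIKU_VOICE_ID = os.environ["MIKU_VOICE_ID"]
```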
### Discord Voice Channel Flow
```
1. User speaks in VC
2. Capture audio → Speech Recognition (STT)
3. Convert speech to text
4. Process with Miku's LLM (existing bot logic)
5. Generate response text
6. Send to Fish.audio TTS API
7. Stream audio back to Discord VC
```
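A minimal orchestration sketch of that flow. `transcribe()` and `synthesize_speech()` are hypothetical placeholders for the STT and TTS pieces (only `query_ollama()` exists in the bot today), and `play_pcm()` is the helper sketched earlier:
```python
import globals
from utils.llm import query_ollama

# Hypothetical glue for the seven steps above.
async def handle_voice_turn(voice_client, captured_audio: bytes):
    user_text = await transcribe(captured_audio)                            # steps 2-3: STT (assumed helper)
    reply_text = await query_ollama(prompt=user_text,
                                    model=globals.OLLAMA_MODEL)             # steps 4-5: existing LLM logic
    pcm = await synthesize_speech(reply_text)                               # step 6: Fish.audio TTS (assumed helper
                                                                            #         returning playback-ready PCM)
    await play_pcm(voice_client, pcm)                                       # step 7: back into the VC
```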
## Key Implementation Details
### For Low Latency Voice Chat:
- Use WebSocket streaming instead of REST API
- Set `latency: "balanced"` in requests
- Use `format: "pcm"` with `sample_rate: 48000` for Discord
- Stream LLM responses as they generate (don't wait for full response)
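The last point is the main structural change versus the async example above: instead of yielding one finished string, the text generator yields pieces as the LLM produces them. A sketch, assuming a hypothetical `stream_ollama_tokens()` helper (the existing `query_ollama()` returns the full response, so a streaming variant would need to be added):
```python
from fish_audio_sdk import TTSRequest

async def llm_to_speech(prompt: str, ws_session, voice_id: str):
    async def text_stream():
        # stream_ollama_tokens() is an assumed streaming variant of query_ollama().
        async for piece in stream_ollama_tokens(prompt):
            yield piece

    async with ws_session:
        async for audio_chunk in ws_session.tts(
            TTSRequest(text="", reference_id=voice_id, format="pcm", sample_rate=48000),
            text_stream()
        ):
            yield audio_chunk
```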
### Audio Format for Discord:
- **Sample Rate**: 48000 Hz (Discord standard)
- **Channels**: 2 (stereo) for discord.py playback; TTS output is typically mono, so upmix before playing (see the sketch below)
- **Format**: PCM (raw audio) or Opus (compressed)
- **Bit Depth**: 16-bit
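discord.py's built-in `PCMAudio` source assumes 48 kHz, 16-bit, stereo PCM. A sketch of the upmix using pydub (already in the dependency list), assuming Fish.audio returns 16-bit mono PCM at the requested sample rate:
```python
from pydub import AudioSegment

def mono_pcm_to_stereo(pcm: bytes) -> bytes:
    """Upmix 48 kHz / 16-bit mono PCM to the stereo layout discord.py expects."""
    segment = AudioSegment(
        data=pcm,
        sample_width=2,    # 16-bit
        frame_rate=48000,  # Discord standard
        channels=1,        # Fish.audio PCM is assumed to be mono here
    )
    return segment.set_channels(2).raw_data
```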
### Cost Considerations:
- **TTS**: $15.00 per million UTF-8 bytes
- Example: ~$0.015 for 1,000 characters (1 character ≈ 1 UTF-8 byte for ASCII text; see the helper below)
- Monitor usage at https://fish.audio/app/billing/
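A quick back-of-the-envelope helper for the pricing above (the rate comes from this section; check the billing page for current numbers):
```python
def estimate_tts_cost_usd(text: str) -> float:
    """Estimate TTS cost at $15.00 per million UTF-8 bytes."""
    return len(text.encode("utf-8")) * 15.00 / 1_000_000

estimate_tts_cost_usd("Hello!" * 167)  # ~1,000 characters -> ~$0.015
```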
### API Features Available:
- **Temperature** (0-1): Controls speech randomness/expressiveness
- **Prosody**: Control speed and volume
```python
"prosody": {
"speed": 1.0, # 0.5-2.0 range
"volume": 0 # -10 to 10 dB
}
```
- **Chunk Length** (100-300): Affects streaming speed
- **Normalize**: Reduces latency but may affect number/date pronunciation
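Put together, a REST payload using these options might look like the following sketch (values are illustrative; `prosody` and `chunk_length` are optional):
```python
payload = {
    "text": "Hello! Miku here!",
    "reference_id": MIKU_VOICE_ID,
    "format": "pcm",
    "latency": "balanced",
    "temperature": 0.9,
    "chunk_length": 200,                     # 100-300
    "normalize": True,                       # lower latency, may affect numbers/dates
    "prosody": {"speed": 1.0, "volume": 0},  # speed 0.5-2.0, volume -10 to 10 dB
}
```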
## Example: Integrate with Existing LLM
```python
import globals  # bot-wide config module (OLLAMA_MODEL, FISH_API_KEY, MIKU_VOICE_ID)
from utils.llm import query_ollama
from fish_audio_sdk import AsyncWebSocketSession, TTSRequest

async def miku_voice_response(user_message: str):
    """Generate Miku's response and convert to speech"""
    # 1. Get text response from existing LLM
    response_text = await query_ollama(
        prompt=user_message,
        model=globals.OLLAMA_MODEL
    )

    # 2. Convert to speech
    ws_session = AsyncWebSocketSession(globals.FISH_API_KEY)

    async def text_stream():
        # Can stream as LLM generates if needed
        yield response_text

    async with ws_session:
        async for audio_chunk in ws_session.tts(
            TTSRequest(
                text="",
                reference_id=globals.MIKU_VOICE_ID,
                format="pcm",
                sample_rate=48000
            ),
            text_stream()
        ):
            # Send to Discord voice channel
            yield audio_chunk
```
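One way to wire this into a command, as a sketch: `bot` is assumed to be the existing `commands.Bot` instance, and `mono_pcm_to_stereo()` is the upmix helper sketched in the Audio Format section.
```python
import io
import discord

@bot.command(name="speak")
async def speak(ctx, *, message: str):
    """Join the caller's voice channel (if needed) and speak Miku's reply."""
    if ctx.author.voice is None:
        await ctx.send("Join a voice channel first!")
        return
    voice_client = ctx.voice_client or await ctx.author.voice.channel.connect()

    # Simple approach: buffer the whole reply, then play it.
    # A lower-latency version could feed chunks to a custom AudioSource instead.
    pcm = bytearray()
    async for chunk in miku_voice_response(message):
        pcm.extend(chunk)

    voice_client.play(discord.PCMAudio(io.BytesIO(mono_pcm_to_stereo(bytes(pcm)))))
```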
## Rate Limits
Check the current rate limits at:
https://docs.fish.audio/developer-platform/models-pricing/pricing-and-rate-limits
## Additional Resources
- **API Reference**: https://docs.fish.audio/api-reference/introduction
- **Python SDK**: https://github.com/fishaudio/fish-audio-python
- **WebSocket Docs**: https://docs.fish.audio/sdk-reference/python/websocket
- **Discord Community**: https://discord.com/invite/dF9Db2Tt3Y
- **Support**: support@fish.audio
## Next Steps
1. Create Fish.audio account and get API key
2. Find/select Miku voice model and get its ID
3. Install required dependencies
4. Implement voice channel connection in bot
5. Add speech-to-text for user audio
6. Connect Fish.audio TTS to output audio
7. Test latency and quality