# Voice Chat Implementation with Fish.audio

## Overview

This document explains how to integrate Fish.audio TTS API with the Miku Discord bot for voice channel conversations.

## Fish.audio API Setup

### 1. Get API Key

- Create an account at https://fish.audio/
- Get an API key from https://fish.audio/app/api-keys/

### 2. Find Your Miku Voice Model ID

- Browse voices at https://fish.audio/
- Find your Miku voice model
- Copy the model ID from the URL (e.g., `8ef4a238714b45718ce04243307c57a7`)
- Or use the copy button on the voice page

## API Usage for Discord Voice Chat

### Basic TTS Request (REST API)

```python
import requests

def generate_speech(text: str, voice_id: str, api_key: str) -> bytes:
    """Generate speech using the Fish.audio TTS API."""
    url = "https://api.fish.audio/v1/tts"

    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
        "model": "s1"  # Recommended model
    }

    payload = {
        "text": text,
        "reference_id": voice_id,  # Your Miku voice model ID
        "format": "mp3",           # or "pcm" for raw audio
        "latency": "balanced",     # Lower latency for real-time use
        "temperature": 0.9,        # Controls randomness (0-1)
        "normalize": True          # Reduces latency
    }

    response = requests.post(url, json=payload, headers=headers)
    response.raise_for_status()  # Surface auth/quota errors instead of returning an error body
    return response.content      # Raw audio bytes
```

### Real-time Streaming (WebSocket - Recommended for VC)

```python
from fish_audio_sdk import WebSocketSession, TTSRequest

def stream_to_discord(text: str, voice_id: str, api_key: str):
    """Stream audio chunks suitable for a Discord voice channel."""
    ws_session = WebSocketSession(api_key)

    # Text generator: can yield text incrementally as the LLM produces it
    def text_stream():
        yield text

    with ws_session:
        for audio_chunk in ws_session.tts(
            TTSRequest(
                text="",               # Empty when text is streamed separately
                reference_id=voice_id,
                format="pcm",          # Best for Discord
                sample_rate=48000      # Discord uses 48 kHz
            ),
            text_stream()
        ):
            # Forward each chunk to the Discord voice connection
            yield audio_chunk
```

### Async Streaming (Better for discord.py)

```python
from fish_audio_sdk import AsyncWebSocketSession, TTSRequest

async def async_stream_speech(text: str, voice_id: str, api_key: str) -> bytes:
    """Async streaming for discord.py integration."""
    ws_session = AsyncWebSocketSession(api_key)

    async def text_stream():
        yield text

    async with ws_session:
        audio_buffer = bytearray()
        async for audio_chunk in ws_session.tts(
            TTSRequest(
                text="",
                reference_id=voice_id,
                format="pcm",
                sample_rate=48000
            ),
            text_stream()
        ):
            audio_buffer.extend(audio_chunk)

    return bytes(audio_buffer)
```

## Integration with Miku Bot

### Required Dependencies

Add to `requirements.txt`:

```
discord.py[voice]
PyNaCl
fish-audio-sdk
SpeechRecognition  # For STT (imported as speech_recognition)
pydub              # Audio processing
```

### Environment Variables

Add to your `.env` or `docker-compose.yml`:

```bash
FISH_API_KEY=your_api_key_here
MIKU_VOICE_ID=your_miku_model_id_here
```
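
These can be read at startup with the standard library alone. A small sketch; failing fast on missing keys is a design choice for this bot, not a Fish.audio requirement:

```python
import os

def load_fish_config() -> tuple[str, str]:
    """Read Fish.audio credentials from the environment, failing fast if absent."""
    api_key = os.getenv("FISH_API_KEY")
    voice_id = os.getenv("MIKU_VOICE_ID")
    if not api_key or not voice_id:
        raise RuntimeError("FISH_API_KEY and MIKU_VOICE_ID must be set")
    return api_key, voice_id
```

Raising at startup makes a misconfigured container fail immediately instead of erroring on the first voice request.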

### Discord Voice Channel Flow

```
1. User speaks in VC
        ↓
2. Capture audio → Speech Recognition (STT)
        ↓
3. Convert speech to text
        ↓
4. Process with Miku's LLM (existing bot logic)
        ↓
5. Generate response text
        ↓
6. Send to Fish.audio TTS API
        ↓
7. Stream audio back to Discord VC
```
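
The flow above can be sketched as a single coroutine with the STT, LLM, and TTS stages passed in as callables; every function name here is a placeholder for the real components, not an actual API:

```python
from typing import AsyncIterator, Awaitable, Callable

async def voice_pipeline(
    captured_audio: bytes,
    stt: Callable[[bytes], Awaitable[str]],      # steps 2-3: audio -> text
    llm: Callable[[str], Awaitable[str]],        # steps 4-5: text -> reply
    tts: Callable[[str], AsyncIterator[bytes]],  # step 6: reply -> audio chunks
) -> AsyncIterator[bytes]:
    """Run one turn of the VC conversation: STT -> LLM -> TTS."""
    user_text = await stt(captured_audio)
    reply_text = await llm(user_text)
    async for chunk in tts(reply_text):
        yield chunk  # step 7: forward to the Discord voice connection
```

Keeping the stages injectable makes each one swappable and easy to test with stubs.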

## Key Implementation Details

### For Low Latency Voice Chat:

- Use WebSocket streaming instead of the REST API
- Set `latency: "balanced"` in requests
- Use `format: "pcm"` with `sample_rate: 48000` for Discord
- Stream LLM responses as they generate (don't wait for the full response)
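
One way to implement the last point is to cut the LLM's token stream at sentence boundaries and hand each completed sentence to TTS immediately. A minimal sketch; the boundary characters and token-stream shape are simplifying assumptions:

```python
from typing import Iterable, Iterator

def sentence_chunks(tokens: Iterable[str], boundaries: str = ".!?") -> Iterator[str]:
    """Group a stream of LLM tokens into sentences ready for TTS."""
    buffer = ""
    for token in tokens:
        buffer += token
        if buffer and buffer[-1] in boundaries:
            yield buffer.strip()  # a sentence is complete; ship it to TTS now
            buffer = ""
    if buffer.strip():
        yield buffer.strip()      # flush any trailing partial sentence
```

With this, TTS for the first sentence starts while the LLM is still generating the rest of the reply.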

### Audio Format for Discord:

- **Sample Rate**: 48000 Hz (Discord standard)
- **Channels**: 1 (mono)
- **Format**: PCM (raw audio) or Opus (compressed)
- **Bit Depth**: 16-bit
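
If the TTS output arrives as mono 16-bit PCM but the playback path expects stereo (discord.py's `PCMAudio`, for instance, reads 48 kHz 16-bit stereo), each mono sample can simply be duplicated into both channels. A pure-Python sketch:

```python
def mono_to_stereo(pcm: bytes) -> bytes:
    """Duplicate each little-endian 16-bit mono sample into left and right channels."""
    out = bytearray()
    for i in range(0, len(pcm), 2):
        sample = pcm[i:i + 2]  # one 16-bit sample (2 bytes)
        out += sample + sample
    return bytes(out)
```

For production, doing this with `numpy` or `ffmpeg` would be faster, but the byte layout is the same.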

### Cost Considerations:

- **TTS**: $15.00 per million UTF-8 bytes
- Example: ~$0.015 for 1000 ASCII characters
- Monitor usage at https://fish.audio/app/billing/
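
At that rate, per-message cost is easy to estimate before sending; note that billing counts UTF-8 bytes, so multi-byte characters cost more than ASCII. A small helper (the rate is the figure quoted above; check the billing page for current pricing):

```python
def estimate_tts_cost(text: str, usd_per_million_bytes: float = 15.00) -> float:
    """Estimate TTS cost for a message; Fish.audio bills UTF-8 bytes, not characters."""
    n_bytes = len(text.encode("utf-8"))
    return n_bytes * usd_per_million_bytes / 1_000_000
```

A 1000-character ASCII message works out to $0.015, matching the example above.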

### API Features Available:

- **Temperature** (0-1): Controls speech randomness/expressiveness
- **Prosody**: Control speed and volume

```python
"prosody": {
    "speed": 1.0,  # 0.5-2.0 range
    "volume": 0    # -10 to 10 dB
}
```
- **Chunk Length** (100-300): Affects streaming speed
- **Normalize**: Reduces latency but may affect number/date pronunciation

## Example: Integrate with Existing LLM

```python
import globals

from fish_audio_sdk import AsyncWebSocketSession, TTSRequest
from utils.llm import query_ollama

async def miku_voice_response(user_message: str):
    """Generate Miku's response and convert it to speech."""

    # 1. Get the text response from the existing LLM
    response_text = await query_ollama(
        prompt=user_message,
        model=globals.OLLAMA_MODEL
    )

    # 2. Convert the response to speech
    ws_session = AsyncWebSocketSession(globals.FISH_API_KEY)

    async def text_stream():
        # Could stream chunks here as the LLM generates them instead
        yield response_text

    async with ws_session:
        async for audio_chunk in ws_session.tts(
            TTSRequest(
                text="",
                reference_id=globals.MIKU_VOICE_ID,
                format="pcm",
                sample_rate=48000
            ),
            text_stream()
        ):
            # Send to the Discord voice channel
            yield audio_chunk
```

## Rate Limits

Check the current rate limits at:
https://docs.fish.audio/developer-platform/models-pricing/pricing-and-rate-limits

## Additional Resources

- **API Reference**: https://docs.fish.audio/api-reference/introduction
- **Python SDK**: https://github.com/fishaudio/fish-audio-python
- **WebSocket Docs**: https://docs.fish.audio/sdk-reference/python/websocket
- **Discord Community**: https://discord.com/invite/dF9Db2Tt3Y
- **Support**: support@fish.audio

## Next Steps

1. Create a Fish.audio account and get an API key
2. Find/select a Miku voice model and get its ID
3. Install the required dependencies
4. Implement the voice channel connection in the bot
5. Add speech-to-text for user audio
6. Connect Fish.audio TTS to the output audio
7. Test latency and quality