Remove all Ollama remnants and complete migration to llama.cpp
- Remove Ollama-specific files (Dockerfile.ollama, entrypoint.sh)
- Replace all query_ollama imports and calls with query_llama
- Remove langchain-ollama dependency from requirements.txt
- Update all utility files (autonomous, kindness, image_generation, etc.)
- Update README.md documentation references
- Maintain backward compatibility alias in llm.py
@@ -1,222 +0,0 @@
# Voice Chat Implementation with Fish.audio

## Overview

This document explains how to integrate the Fish.audio TTS API with the Miku Discord bot for voice channel conversations.

## Fish.audio API Setup

### 1. Get API Key

- Create an account at https://fish.audio/
- Get an API key from: https://fish.audio/app/api-keys/

### 2. Find Your Miku Voice Model ID

- Browse voices at https://fish.audio/
- Find your Miku voice model
- Copy the model ID from the URL (e.g., `8ef4a238714b45718ce04243307c57a7`)
- Or use the copy button on the voice page

## API Usage for Discord Voice Chat

### Basic TTS Request (REST API)

```python
import requests

def generate_speech(text: str, voice_id: str, api_key: str) -> bytes:
    """Generate speech using the Fish.audio TTS API."""
    url = "https://api.fish.audio/v1/tts"

    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
        "model": "s1"  # Recommended model
    }

    payload = {
        "text": text,
        "reference_id": voice_id,  # Your Miku voice model ID
        "format": "mp3",           # or "pcm" for raw audio
        "latency": "balanced",     # Lower latency for real-time use
        "temperature": 0.9,        # Controls randomness (0-1)
        "normalize": True          # Reduces latency
    }

    response = requests.post(url, json=payload, headers=headers)
    response.raise_for_status()  # Fail loudly on API errors
    return response.content      # Raw audio bytes
```

### Real-time Streaming (WebSocket - Recommended for VC)

```python
from fish_audio_sdk import WebSocketSession, TTSRequest

def stream_to_discord(text: str, voice_id: str, api_key: str):
    """Stream audio directly to a Discord voice channel."""
    ws_session = WebSocketSession(api_key)

    # Text generator (can stream text from LLM responses as they arrive)
    def text_stream():
        yield text

    with ws_session:
        for audio_chunk in ws_session.tts(
            TTSRequest(
                text="",               # Empty when streaming text separately
                reference_id=voice_id,
                format="pcm",          # Best for Discord
                sample_rate=48000      # Discord uses 48 kHz
            ),
            text_stream()
        ):
            # Forward each chunk to the Discord voice channel
            yield audio_chunk
```

### Async Streaming (Better for Discord.py)

```python
from fish_audio_sdk import AsyncWebSocketSession, TTSRequest

async def async_stream_speech(text: str, voice_id: str, api_key: str) -> bytes:
    """Async streaming for discord.py integration."""
    ws_session = AsyncWebSocketSession(api_key)

    async def text_stream():
        yield text

    async with ws_session:
        audio_buffer = bytearray()
        async for audio_chunk in ws_session.tts(
            TTSRequest(
                text="",
                reference_id=voice_id,
                format="pcm",
                sample_rate=48000
            ),
            text_stream()
        ):
            audio_buffer.extend(audio_chunk)

    return bytes(audio_buffer)
```

## Integration with Miku Bot

### Required Dependencies

Add to `requirements.txt`:

```
discord.py[voice]
PyNaCl
fish-audio-sdk
speech_recognition  # For STT
pydub               # Audio processing
```

### Environment Variables

Add to your `.env` or `docker-compose.yml`:

```bash
FISH_API_KEY=your_api_key_here
MIKU_VOICE_ID=your_miku_model_id_here
```

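These variables can be loaded once at bot startup; a minimal sketch using only the standard library (`require_voice_config` is a hypothetical helper, not part of the existing bot):

```python
import os

# Read the Fish.audio credentials from the environment at startup
FISH_API_KEY = os.getenv("FISH_API_KEY", "")
MIKU_VOICE_ID = os.getenv("MIKU_VOICE_ID", "")

def require_voice_config() -> None:
    """Fail fast before joining a voice channel if credentials are missing."""
    if not FISH_API_KEY or not MIKU_VOICE_ID:
        raise RuntimeError("FISH_API_KEY and MIKU_VOICE_ID must be set")
```
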
### Discord Voice Channel Flow

```
1. User speaks in VC
        ↓
2. Capture audio → Speech Recognition (STT)
        ↓
3. Convert speech to text
        ↓
4. Process with Miku's LLM (existing bot logic)
        ↓
5. Generate response text
        ↓
6. Send to Fish.audio TTS API
        ↓
7. Stream audio back to Discord VC
```

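The flow above can be sketched as a single coroutine per conversational turn; `transcribe`, `query_llm`, and `speak` are hypothetical placeholders for the STT, LLM, and Fish.audio pieces covered elsewhere in this document:

```python
import asyncio

# Hypothetical stand-ins for the pipeline stages in the flow above
async def transcribe(audio: bytes) -> str:   # steps 2-3: STT
    return "hello miku"                      # placeholder transcription

async def query_llm(prompt: str) -> str:     # steps 4-5: LLM response
    return f"Miku says: {prompt}!"           # placeholder reply

async def speak(text: str) -> bytes:         # step 6: TTS (Fish.audio)
    return text.encode("utf-8")              # placeholder audio bytes

async def voice_pipeline(user_audio: bytes) -> bytes:
    """Run one turn of the voice-chat loop: STT -> LLM -> TTS."""
    text = await transcribe(user_audio)
    reply = await query_llm(text)
    return await speak(reply)
```
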
## Key Implementation Details

### For Low Latency Voice Chat:

- Use WebSocket streaming instead of the REST API
- Set `latency: "balanced"` in requests
- Use `format: "pcm"` with `sample_rate: 48000` for Discord
- Stream LLM responses as they generate (don't wait for the full response)

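Streaming the LLM output can be as simple as feeding sentence-sized chunks to the TTS text stream so speech starts before the full reply exists; a minimal standard-library sketch (the sentence-boundary rule is an assumption):

```python
import re
from typing import Iterable, Iterator

def sentence_chunks(tokens: Iterable[str]) -> Iterator[str]:
    """Accumulate streamed LLM tokens and yield them at sentence
    boundaries, so TTS can start speaking mid-generation."""
    buffer = ""
    for token in tokens:
        buffer += token
        # Flush whenever the buffer ends a sentence
        if re.search(r"[.!?]\s*$", buffer):
            yield buffer.strip()
            buffer = ""
    if buffer.strip():
        yield buffer.strip()  # flush any trailing fragment
```
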
### Audio Format for Discord:

- **Sample Rate**: 48000 Hz (Discord standard)
- **Channels**: 2 (stereo; discord.py's Opus encoder expects stereo, so upmix mono sources)
- **Format**: PCM (raw audio) or Opus (compressed)
- **Bit Depth**: 16-bit

### Cost Considerations:

- **TTS**: $15.00 per million UTF-8 bytes
- Example: ~$0.015 for 1000 characters
- Monitor usage at https://fish.audio/app/billing/

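The quoted rate makes per-message cost easy to estimate; note that non-ASCII text (e.g. Japanese) costs more per character because it uses more UTF-8 bytes:

```python
RATE_PER_MILLION_BYTES = 15.00  # USD, from the pricing above

def tts_cost(text: str) -> float:
    """Estimate TTS cost in USD from the UTF-8 byte length of the text."""
    return len(text.encode("utf-8")) * RATE_PER_MILLION_BYTES / 1_000_000

# 1000 ASCII characters -> 1000 bytes -> $0.015
```
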
### API Features Available:

- **Temperature** (0-1): Controls speech randomness/expressiveness
- **Prosody**: Controls speed and volume

```python
"prosody": {
    "speed": 1.0,  # 0.5-2.0 range
    "volume": 0    # -10 to 10 dB
}
```

- **Chunk Length** (100-300): Affects streaming speed
- **Normalize**: Reduces latency but may affect number/date pronunciation

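Combined with the REST payload shown earlier, prosody is just one more key in the request body; a sketch of a merged payload (the values are illustrative, and `chunk_length` is assumed to be the API's field name for the Chunk Length setting):

```python
def build_tts_payload(text: str, voice_id: str) -> dict:
    """Assemble a Fish.audio TTS request body with prosody controls."""
    return {
        "text": text,
        "reference_id": voice_id,
        "format": "pcm",
        "latency": "balanced",
        "temperature": 0.9,
        "chunk_length": 200,   # 100-300: affects streaming speed
        "normalize": True,
        "prosody": {
            "speed": 1.1,      # slightly faster than default
            "volume": 0        # no gain change
        },
    }
```
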
## Example: Integrate with Existing LLM

Note: `query_ollama` is kept as a backward-compatible alias for `query_llama` in `utils/llm.py`.

```python
from utils.llm import query_ollama
from fish_audio_sdk import AsyncWebSocketSession, TTSRequest

# `globals` below is the bot's existing shared-config module

async def miku_voice_response(user_message: str):
    """Generate Miku's response and convert it to speech."""

    # 1. Get text response from the existing LLM
    response_text = await query_ollama(
        prompt=user_message,
        model=globals.OLLAMA_MODEL
    )

    # 2. Convert to speech
    ws_session = AsyncWebSocketSession(globals.FISH_API_KEY)

    async def text_stream():
        # Can stream as the LLM generates, if needed
        yield response_text

    async with ws_session:
        async for audio_chunk in ws_session.tts(
            TTSRequest(
                text="",
                reference_id=globals.MIKU_VOICE_ID,
                format="pcm",
                sample_rate=48000
            ),
            text_stream()
        ):
            # Send to the Discord voice channel
            yield audio_chunk
```

## Rate Limits

Check the current rate limits at:
https://docs.fish.audio/developer-platform/models-pricing/pricing-and-rate-limits

## Additional Resources

- **API Reference**: https://docs.fish.audio/api-reference/introduction
- **Python SDK**: https://github.com/fishaudio/fish-audio-python
- **WebSocket Docs**: https://docs.fish.audio/sdk-reference/python/websocket
- **Discord Community**: https://discord.com/invite/dF9Db2Tt3Y
- **Support**: support@fish.audio

## Next Steps

1. Create a Fish.audio account and get an API key
2. Find/select a Miku voice model and get its ID
3. Install the required dependencies
4. Implement the voice channel connection in the bot
5. Add speech-to-text for user audio
6. Connect Fish.audio TTS to the output audio
7. Test latency and quality