# Testing Autonomous System V2
## Quick Start Guide

### Step 1: Enable V2 System (Optional - Test Mode)

The V2 system can run **alongside** V1 for comparison. To enable it:

**Option A: Edit `bot.py` to start V2 on bot ready**

Add this to the `on_ready()` function in `bot/bot.py`:

```python
# After the existing setup code, add:
from utils.autonomous_v2_integration import start_v2_system_for_all_servers

# Start the V2 autonomous system
await start_v2_system_for_all_servers(client)
```

**Option B: Manual API testing (no code changes needed)**

Use the API endpoints to inspect what V2 is "thinking" without actually running it.
### Step 2: Test the V2 Decision System

#### Check what V2 is "thinking" for a server:

```bash
# Get the current social stats
curl http://localhost:3939/autonomous/v2/stats/<GUILD_ID>
```

Example response:

```json
{
  "status": "ok",
  "guild_id": 759889672804630530,
  "stats": {
    "loneliness": "0.42",
    "boredom": "0.65",
    "excitement": "0.15",
    "curiosity": "0.20",
    "chattiness": "0.70",
    "action_urgency": "0.48"
  }
}
```
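The stat values in the response are strings, not numbers. If you script against this endpoint, a small helper can coerce them — illustrative only, not part of the bot's codebase:

```python
import json

def parse_stats(payload: str) -> dict[str, float]:
    """Convert the string-valued stats in a /stats response to floats."""
    data = json.loads(payload)
    return {name: float(value) for name, value in data["stats"].items()}

# The example response from above:
sample = '''{"status": "ok", "guild_id": 759889672804630530,
             "stats": {"loneliness": "0.42", "boredom": "0.65",
                       "excitement": "0.15", "curiosity": "0.20",
                       "chattiness": "0.70", "action_urgency": "0.48"}}'''

stats = parse_stats(sample)
print(stats["boredom"])  # 0.65
```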
#### Trigger a manual V2 analysis:

```bash
# See what V2 would decide right now
curl http://localhost:3939/autonomous/v2/check/<GUILD_ID>
```

Example response:

```json
{
  "status": "ok",
  "guild_id": 759889672804630530,
  "analysis": {
    "stats": { ... },
    "interest_score": "0.73",
    "triggers": [
      "KEYWORD_DETECTED (0.60): Interesting keywords: vocaloid, miku",
      "CONVERSATION_PEAK (0.60): Lots of people are chatting"
    ],
    "recent_messages": 15,
    "conversation_active": true,
    "would_call_llm": true
  }
}
```

#### Get overall V2 status:

```bash
# See V2 status for all servers
curl http://localhost:3939/autonomous/v2/status
```

Example response:

```json
{
  "status": "ok",
  "servers": {
    "759889672804630530": {
      "server_name": "Example Server",
      "loop_running": true,
      "action_urgency": "0.52",
      "loneliness": "0.30",
      "boredom": "0.45",
      "excitement": "0.20",
      "chattiness": "0.70"
    }
  }
}
```
### Step 3: Monitor Behavior

#### Watch for V2 log messages:

```bash
docker compose logs -f bot | grep -E "🧠|🎯|🤔"
```

You'll see messages like:

```
🧠 Starting autonomous decision loop for server 759889672804630530
🎯 Interest score 0.73 - Consulting LLM for server 759889672804630530
🤔 LLM decision: YES, someone mentioned you (Interest: 0.73)
```

#### Compare V1 vs V2:

**V1 logs:**

```
💬 Miku said something general in #miku-chat
```

**V2 logs:**

```
🎯 Interest score 0.82 - Consulting LLM
🤔 LLM decision: YES
💬 Miku said something general in #miku-chat
```
### Step 4: Tune the System

Edit `bot/utils/autonomous_v2.py` to adjust behavior:

```python
# How sensitive is the decision system?
self.LLM_CALL_THRESHOLD = 0.6  # Lower = more responsive (more LLM calls)
self.ACTION_THRESHOLD = 0.5    # Lower = more chatty

# How fast do the stats build?
LONELINESS_BUILD_RATE = 0.01  # Higher = gets lonely faster
BOREDOM_BUILD_RATE = 0.01     # Higher = gets bored faster

# Check intervals
MIN_SLEEP = 30   # Seconds between checks during active chat
MAX_SLEEP = 180  # Seconds between checks when quiet
```
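To see how `MIN_SLEEP` and `MAX_SLEEP` trade off, here is a sketch of how the loop might pick its next check interval from `action_urgency` — the interpolation rule is an assumption for illustration; the actual scheduling logic is whatever `autonomous_v2.py` implements:

```python
MIN_SLEEP = 30   # seconds between checks during active chat
MAX_SLEEP = 180  # seconds between checks when quiet

def next_sleep(action_urgency: float) -> float:
    """Hypothetical rule: higher urgency -> shorter wait (linear interpolation)."""
    urgency = min(max(action_urgency, 0.0), 1.0)  # clamp to [0, 1]
    return MAX_SLEEP - (MAX_SLEEP - MIN_SLEEP) * urgency

print(next_sleep(0.0))  # 180.0 - quiet server, back off
print(next_sleep(1.0))  # 30.0  - active chat, check often
```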
### Step 5: Understanding the Stats

#### Loneliness (0.0 - 1.0)

- **Increases**: When not mentioned for >30 minutes
- **Decreases**: When mentioned or engaged
- **Effect**: At 0.7+, seeks attention
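Combined with `LONELINESS_BUILD_RATE` from Step 4, a per-check update could look like the sketch below — the recovery rate and the exact rule are assumptions; the real update lives in `autonomous_v2.py`:

```python
LONELINESS_BUILD_RATE = 0.01  # from Step 4

def tick_loneliness(current: float, minutes_since_mention: float) -> float:
    """Hypothetical: build loneliness while ignored, relax it toward 0 when engaged."""
    if minutes_since_mention > 30:
        return min(1.0, current + LONELINESS_BUILD_RATE)
    return max(0.0, current - 0.1)  # recovery rate is a made-up value

print(round(tick_loneliness(0.69, 45), 2))  # 0.7 - past the attention-seeking threshold
```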
#### Boredom (0.0 - 1.0)

- **Increases**: When the channel is quiet and Miku hasn't spoken in >1 hour
- **Decreases**: When she shares content or conversation happens
- **Effect**: At 0.7+, likely to share tweets/content

#### Excitement (0.0 - 1.0)

- **Increases**: During active conversations
- **Decreases**: Fades over time (decays fast)
- **Effect**: Higher = more likely to jump into the conversation

#### Curiosity (0.0 - 1.0)

- **Increases**: When interesting keywords are detected
- **Decreases**: Fades over time
- **Effect**: High curiosity = asks questions

#### Chattiness (0.0 - 1.0)

- **Set by mood**:
  - excited/bubbly: 0.85-0.9
  - neutral: 0.5
  - shy/sleepy: 0.2-0.3
  - asleep: 0.0
- **Effect**: Base multiplier for all interactions
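Since chattiness multiplies everything else, one plausible way the drives could roll up into an overall urgency is sketched below — illustrative only; the real formula is whatever `autonomous_v2.py` implements:

```python
def combined_urgency(loneliness: float, boredom: float, excitement: float,
                     curiosity: float, chattiness: float) -> float:
    """Hypothetical: average the four drives, then scale by mood-driven chattiness."""
    drive = (loneliness + boredom + excitement + curiosity) / 4
    return round(drive * chattiness, 2)

print(combined_urgency(0.8, 0.8, 0.8, 0.8, 0.0))  # 0.0 - an asleep mood mutes everything
```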
### Step 6: Trigger Examples

Test specific triggers by creating the right conditions:

#### Test MENTIONED trigger:

1. Mention @Miku in the autonomous channel
2. Check stats: `curl http://localhost:3939/autonomous/v2/check/<GUILD_ID>`
3. Should show: `"triggers": ["MENTIONED (0.90): Someone mentioned me!"]`

#### Test KEYWORD trigger:

1. Say "I love Vocaloid music" in the channel
2. Check stats
3. Should show: `"triggers": ["KEYWORD_DETECTED (0.60): Interesting keywords: vocaloid, music"]`

#### Test CONVERSATION_PEAK:

1. Have 3+ people chat within 5 minutes
2. Check stats
3. Should show: `"triggers": ["CONVERSATION_PEAK (0.60): Lots of people are chatting"]`

#### Test LONELINESS:

1. Don't mention Miku for 30+ minutes
2. Check stats: `curl http://localhost:3939/autonomous/v2/stats/<GUILD_ID>`
3. Watch loneliness increase over time
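How the individual trigger scores roll up into `interest_score` is defined in `autonomous_v2.py`. A minimal sketch, assuming the strongest trigger dominates (consistent with a bare mention scoring an instant 0.9):

```python
def interest_score(triggers: list[tuple[str, float]]) -> float:
    """Hypothetical combination: the strongest trigger wins."""
    return max((score for _, score in triggers), default=0.0)

print(interest_score([("MENTIONED", 0.90)]))  # 0.9
print(interest_score([]))                     # 0.0
```

Note that the Step 2 example reports 0.73 for two 0.60 triggers, so the real combination evidently rewards stacked triggers; treat this `max()` as a lower bound.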
### Step 7: Debugging

#### V2 won't start?

```bash
# Check that the import works
docker compose exec bot python -c "from utils.autonomous_v2 import autonomous_system_v2; print('OK')"
```

#### V2 never calls the LLM?

```bash
# Check the interest scores
curl http://localhost:3939/autonomous/v2/check/<GUILD_ID>
```

If `interest_score` is always < 0.6:

- The channel might be too quiet
- The stats might not be building
- Try mentioning Miku (instant 0.9 score)

#### V2 calls the LLM too much?

```python
# Increase the threshold in autonomous_v2.py:
self.LLM_CALL_THRESHOLD = 0.7  # Was 0.6
```
## Performance Monitoring

### Expected LLM Call Frequency

**Quiet server (few messages):**

- V1: ~10 random calls/day
- V2: ~2-5 targeted calls/day
- **GPU usage: LOWER with V2**

**Active server (100+ messages/day):**

- V1: ~10 random calls/day (same)
- V2: ~10-20 targeted calls/day (responsive to activity)
- **GPU usage: SLIGHTLY HIGHER, but far more relevant**

### Check GPU Usage

```bash
# Monitor the GPU while the bot is running
nvidia-smi -l 1
```

- V1: GPU spikes randomly every 15 minutes
- V2: GPU spikes only when something interesting happens

### Monitor the LLM Queue

If you notice lag:

1. Check how many LLM calls are queued
2. Increase `LLM_CALL_THRESHOLD` to reduce frequency
3. Increase the check intervals for quieter periods
## Migration Path

### Phase 1: Testing (Current)

- V1 running (scheduled actions)
- V2 running in parallel, logging decisions
- Compare behaviors
- Tune V2 parameters

### Phase 2: Gradual Replacement

```python
# In server_manager.py, comment out the V1 jobs:
# scheduler.add_job(
#     self._run_autonomous_for_server,
#     IntervalTrigger(minutes=15),
#     ...
# )

# Keep V2 running
autonomous_system_v2.start_loop_for_server(guild_id, client)
```

### Phase 3: Full Migration

- Disable all V1 autonomous jobs
- Keep only the V2 system
- Keep the manual triggers for testing
## Troubleshooting

### "Module not found: autonomous_v2"

```bash
# Restart the bot container
docker compose restart bot
```

### "Stats always show 0.00"

- The V2 decision loop might not be running
- Check: `curl http://localhost:3939/autonomous/v2/status`
- Should show: `"loop_running": true`
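For scripted health checks, a helper that flags servers whose loop has stopped, given a `/autonomous/v2/status` payload — illustrative, not part of the bot:

```python
import json

def stopped_loops(status_payload: str) -> list[str]:
    """Return the names of servers whose V2 loop is not running."""
    servers = json.loads(status_payload)["servers"]
    return [info["server_name"] for info in servers.values()
            if not info["loop_running"]]

sample = '''{"status": "ok", "servers": {
    "759889672804630530": {"server_name": "Example Server",
                           "loop_running": false, "action_urgency": "0.52"}}}'''
print(stopped_loops(sample))  # ['Example Server']
```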
### "Interest score always low"

- The channel might be genuinely quiet
- Try creating activity: post messages and images, mention Miku
- Loneliness/boredom build over time (30-60 min)

### "LLM called too frequently"

- Increase the thresholds in `autonomous_v2.py`
- Check which triggers are firing with `/autonomous/v2/check`
- Adjust the trigger scores if needed

## API Endpoints Reference

```
GET /autonomous/v2/stats/{guild_id}   - Get social stats
GET /autonomous/v2/check/{guild_id}   - Manual analysis (what would V2 do?)
GET /autonomous/v2/status             - V2 status for all servers
```
## Next Steps

1. Run V2 for 24-48 hours
2. Compare decision quality against V1
3. Tune the thresholds based on server activity
4. Gradually phase out V1 if V2 works well
5. Add a dashboard for real-time stats visualization