# VRAM-Aware Profile Picture System

## Overview
The profile picture feature now manages GPU VRAM efficiently by coordinating between the vision model and face detection model. Since both require VRAM and there isn't enough for both simultaneously, the system automatically swaps models as needed.
## Architecture

### Services in `docker-compose.yml`
```
┌─────────────────────────────────────────────────────────────┐
│                      GPU (Shared VRAM)                      │
│  ┌───────────────┐        ┌──────────────────────────────┐  │
│  │  llama-swap   │  ←──→  │     anime-face-detector      │  │
│  │ (Text/Vision) │        │   (YOLOv3 Face Detection)    │  │
│  └───────────────┘        └──────────────────────────────┘  │
│          ↑                            ↑                     │
└──────────┼────────────────────────────┼─────────────────────┘
           │                            │
     ┌─────┴────────────────────────────┴─────┐
     │                miku-bot                │
     │      (Coordinates model swapping)      │
     └────────────────────────────────────────┘
```
### VRAM Management Flow

**Profile Picture Change Process:**

1. **Vision Model Phase (if using Danbooru):**
   User triggers change → Danbooru search → Download image → Vision model verifies it's Miku → Vision model returns result

2. **VRAM Swap:**
   Bot swaps to the text model → Vision model unloads → VRAM freed (3-second wait for complete unload)

3. **Face Detection Phase:**
   Face detector loads → Detect face → Return bbox/keypoints → Face detector stays loaded for future requests

4. **Cropping & Upload:**
   Crop image using the face bbox → Upload to Discord
## Key Files

### Consolidated Structure

```
miku-discord/
├── docker-compose.yml          # All 3 services (llama-swap, miku-bot, anime-face-detector)
├── face-detector/              # Face detection service (moved from separate repo)
│   ├── Dockerfile
│   ├── supervisord.conf
│   ├── api/
│   │   ├── main.py             # FastAPI face detection endpoint
│   │   └── outputs/            # Detection results
│   └── images/                 # Test images
└── bot/
    └── utils/
        ├── profile_picture_manager.py  # Updated with VRAM management
        └── face_detector_manager.py    # (Optional advanced version)
```
### Modified Files

#### 1. `profile_picture_manager.py`

Added the `_ensure_vram_available()` method:
```python
async def _ensure_vram_available(self, debug: bool = False):
    """
    Ensure VRAM is available for face detection by swapping to the text model.
    This unloads the vision model if it's loaded.
    """
    # Trigger a swap to the text model
    # The vision model auto-unloads
    # Wait 3 seconds for VRAM to clear
```
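Fleshed out, the method might look like the following. This is a minimal sketch, assuming the bot uses `httpx` and reaches llama-swap's OpenAI-compatible API at `http://llama-swap:8090`; it mirrors the manual `curl` swap shown in the Troubleshooting section below.

```python
import asyncio

import httpx  # assumed HTTP client; the real bot may use something else

LLAMA_SWAP_URL = "http://llama-swap:8090"  # assumed in-network service address

async def _ensure_vram_available(self, debug: bool = False):
    """Swap llama-swap to the text model so the vision model unloads."""
    async with httpx.AsyncClient(timeout=60.0) as client:
        # A one-token request against the text model forces llama-swap to
        # load it, which unloads the vision model as a side effect.
        await client.post(
            f"{LLAMA_SWAP_URL}/v1/chat/completions",
            json={
                "model": "llama3.1",
                "messages": [{"role": "user", "content": "hi"}],
                "max_tokens": 1,
            },
        )
    # Give the driver a moment to actually release the VRAM.
    await asyncio.sleep(3)
    if debug:
        print("VRAM swap complete: vision model unloaded")
```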
Updated `_detect_face()`:

```python
async def _detect_face(self, image_bytes: bytes, debug: bool = False):
    # First: free VRAM
    await self._ensure_vram_available(debug=debug)
    # Then: call the face detection API
    # The face detector has exclusive VRAM access
```
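A sketch of the updated method body, again assuming `httpx`; the endpoint and response shape come from the API Endpoints section below.

```python
import httpx  # same assumed client as in the previous sketch

FACE_DETECTOR_URL = "http://anime-face-detector:6078"  # from docker-compose.yml

async def _detect_face(self, image_bytes: bytes, debug: bool = False):
    # First: free VRAM by swapping llama-swap over to the text model.
    await self._ensure_vram_available(debug=debug)
    # Then: call the face detection API while it has exclusive VRAM access.
    async with httpx.AsyncClient(timeout=30.0) as client:
        resp = await client.post(
            f"{FACE_DETECTOR_URL}/detect",
            files={"file": ("image.jpg", image_bytes, "image/jpeg")},
        )
        resp.raise_for_status()
        data = resp.json()
    if debug:
        print(f"Face detector returned {data['count']} detection(s)")
    detections = data.get("detections", [])
    # Return the highest-confidence face, or None to trigger saliency fallback.
    return max(detections, key=lambda d: d["confidence"]) if detections else None
```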
#### 2. `docker-compose.yml`

Added the `anime-face-detector` service:

```yaml
anime-face-detector:
  build: ./face-detector
  runtime: nvidia
  volumes:
    - ./face-detector/api:/app/api
  ports:
    - "7860:7860"  # Gradio UI
    - "6078:6078"  # FastAPI
```
## Model Characteristics
| Model | Size | VRAM Usage | TTL (Auto-unload) | Purpose |
|---|---|---|---|---|
| llama3.1 (Text) | ~4.5GB | ~5GB | 30 min | Text generation |
| vision (MiniCPM-V) | ~3.8GB | ~4GB+ | 15 min | Image understanding |
| YOLOv3 Face Detector | ~250MB | ~1GB | Always loaded | Anime face detection |
**Total VRAM:** ~8GB available on the GPU.

**Conflict:** Vision (~4GB plus runtime overhead) + face detector (~1GB) exceeds what is safely free, so the two must not be resident at the same time.
## How It Works

### Automatic VRAM Management

1. **When the vision model is needed:**
   - The bot makes a request to llama-swap
   - llama-swap loads the vision model (unloading the text model if needed)
   - The vision model processes the request
   - The vision model stays loaded for 15 minutes (TTL)

2. **When face detection is needed:**
   - `_ensure_vram_available()` swaps to the text model
   - llama-swap unloads the vision model automatically
   - A 3-second wait ensures VRAM is fully released
   - The face detection API is called (loads YOLOv3)
   - Face detection succeeds with enough VRAM

3. **After face detection:**
   - The face detector stays loaded (no TTL, always ready)
   - The vision model can be loaded again when needed
   - llama-swap handles the swap automatically
### Why This Works

- ✅ **Sequential Processing:** Vision verification happens first, face detection after
- ✅ **Automatic Swapping:** llama-swap handles model management
- ✅ **Minimal Code Changes:** Just one method added to ensure the swap happens
- ✅ **Graceful Fallback:** If face detection fails, saliency detection still works
## API Endpoints

### Face Detection API

**Endpoint:** `http://anime-face-detector:6078/detect`

**Request:**

```bash
curl -X POST http://localhost:6078/detect -F "file=@image.jpg"
```

**Response:**

```json
{
  "detections": [
    {
      "bbox": [x1, y1, x2, y2],
      "confidence": 0.98,
      "keypoints": [[x, y, score], ...]
    }
  ],
  "count": 1,
  "annotated_image": "/app/api/outputs/..._annotated.jpg",
  "json_file": "/app/api/outputs/..._results.json"
}
```
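The `bbox` in this response drives the cropping step. A minimal Pillow sketch, where the helper name and the padding factor are illustrative assumptions:

```python
from io import BytesIO

from PIL import Image  # Pillow

def crop_to_face(image_bytes: bytes, bbox: list, pad: float = 0.4) -> bytes:
    """Crop around bbox = [x1, y1, x2, y2], padded by `pad` x the box size."""
    img = Image.open(BytesIO(image_bytes))
    x1, y1, x2, y2 = bbox
    dx, dy = (x2 - x1) * pad, (y2 - y1) * pad
    box = (
        max(0, int(x1 - dx)),
        max(0, int(y1 - dy)),
        min(img.width, int(x2 + dx)),
        min(img.height, int(y2 + dy)),
    )
    out = BytesIO()
    img.crop(box).save(out, format="PNG")
    return out.getvalue()
```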
**Health Check:**

```bash
curl http://localhost:6078/health
# Returns: {"status":"healthy","detector_loaded":true}
```

**Gradio UI:** http://localhost:7860 (visual testing)
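For reference, a startup probe that produces the log line checked under Verify Services below could look like this; `httpx` and the helper name are assumptions, not the bot's actual code:

```python
import httpx

async def check_face_detector() -> bool:
    """Probe /health and log whether the detector is ready."""
    try:
        async with httpx.AsyncClient(timeout=5.0) as client:
            resp = await client.get("http://anime-face-detector:6078/health")
            ok = resp.json().get("detector_loaded", False)
    except httpx.HTTPError:
        ok = False
    if ok:
        print("✅ Anime face detector API connected")
    else:
        print("⚠️ Face detector unreachable; saliency fallback will be used")
    return ok
```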
## Deployment

### Build and Start All Services

```bash
cd /home/koko210Serve/docker/miku-discord
docker-compose up -d --build
```
This starts:
- ✅ llama-swap (text/vision models)
- ✅ miku-bot (Discord bot)
- ✅ anime-face-detector (face detection API)
### Verify Services

```bash
# Check all containers are running
docker-compose ps

# Check face detector API
curl http://localhost:6078/health

# Check llama-swap
curl http://localhost:8090/health

# Check bot logs
docker-compose logs -f miku-bot | grep "face detector"
# Should see: "✅ Anime face detector API connected"
```
### Test Profile Picture Change

```bash
# Via API
curl -X POST "http://localhost:3939/profile-picture/change"

# Via Web UI
# Navigate to http://localhost:3939 → Actions → Profile Picture
```
## Monitoring VRAM Usage

### Check GPU Memory

```bash
# From host
nvidia-smi

# From llama-swap container
docker exec llama-swap nvidia-smi

# From face-detector container
docker exec anime-face-detector nvidia-smi
```
### Check Model Status

```bash
# See which model is loaded in llama-swap
docker exec llama-swap ps aux | grep llama-server

# Check the face detector process
docker exec anime-face-detector ps aux | grep python
```
## Troubleshooting

### "Out of Memory" Errors

**Symptom:** The vision model crashes with `cudaMalloc failed: out of memory`.

**Solution:** The VRAM swap should prevent this. If it still occurs:
1. **Check swap timing:**

   ```python
   # In profile_picture_manager.py, increase the wait time:
   await asyncio.sleep(5)  # instead of 3
   ```

2. **Manually unload the vision model:**

   ```bash
   # Force a swap to the text model
   curl -X POST http://localhost:8090/v1/chat/completions \
     -H "Content-Type: application/json" \
     -d '{"model":"llama3.1","messages":[{"role":"user","content":"hi"}],"max_tokens":1}'
   ```

3. **Check whether the face detector is already loaded:**

   ```bash
   docker exec anime-face-detector nvidia-smi
   ```
### Face Detection Not Working

**Symptom:** `Cannot connect to host anime-face-detector:6078`

**Solution:**

```bash
# Check the container is running
docker ps | grep anime-face-detector

# Check the network
docker network inspect miku-discord_default

# Restart the face detector
docker-compose restart anime-face-detector

# Check logs
docker-compose logs anime-face-detector
```
### Vision Model Still Loaded

**Symptom:** Face detection hits OOM even after the swap.

**Solution:**

```bash
# Force a model unload by restarting llama-swap briefly
docker-compose restart llama-swap

# Or increase the wait time in _ensure_vram_available()
```
## Performance Metrics

### Typical Timeline
| Step | Duration | VRAM State |
|---|---|---|
| Vision verification | 5-10s | Vision model loaded (~4GB) |
| Model swap + wait | 3-5s | Transitioning (releasing VRAM) |
| Face detection | 1-2s | Face detector loaded (~1GB) |
| Cropping & upload | 1-2s | Face detector still loaded |
| Total | 10-19s | Efficient VRAM usage |
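If you want the bot to log these step durations itself, a small helper like the following works; the helper name and log format are illustrative, not existing bot code:

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(step: str):
    """Log how long a pipeline step takes, e.g. 'Face detection: 1.4s'."""
    start = time.perf_counter()
    try:
        yield
    finally:
        print(f"{step}: {time.perf_counter() - start:.1f}s")

# Usage inside the profile-picture flow (fine in async code, since the
# context manager itself does no I/O):
#
#     with timed("Model swap + wait"):
#         await self._ensure_vram_available()
```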
### VRAM Timeline

```
Time:    0s      5s      10s  13s 15s
         │       │       │    │   │
Vision:  ████████████░░░░░░░░░░░░    ← Unloads after verification
Swap:    ░░░░░░░░░░░░███░░░░░░░░░    ← 3s transition
Face:    ░░░░░░░░░░░░░░░█████████    ← Loads for detection
```
## Benefits of This Approach

- ✅ **No Manual Intervention:** Automatic VRAM management
- ✅ **Reliable:** Sequential processing avoids conflicts
- ✅ **Efficient:** Models are only loaded when needed
- ✅ **Simple:** Minimal code changes
- ✅ **Maintainable:** Uses existing llama-swap features
- ✅ **Graceful:** Falls back to saliency if face detection is unavailable
## Future Enhancements

Potential improvements:

- **Dynamic Model Unloading:** Explicitly unload the vision model via API if llama-swap adds support
- **VRAM Monitoring:** Check actual VRAM usage before loading the face detector (see the sketch below)
- **Queue System:** Process multiple images without repeated model swaps
- **Persistent Face Detector:** Keep it loaded in the background and use pause/resume
- **Smaller Models:** Use quantized versions to reduce VRAM requirements
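A possible shape for the VRAM-monitoring idea, using the `pynvml` bindings (an assumption; the project does not currently depend on them) and an illustrative ~1GB threshold taken from the Model Characteristics table:

```python
import pynvml  # NVIDIA management library bindings (nvidia-ml-py)

def vram_free_mib(gpu_index: int = 0) -> int:
    """Return the free VRAM on the given GPU, in MiB."""
    pynvml.nvmlInit()
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(gpu_index)
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        return mem.free // (1024 * 1024)
    finally:
        pynvml.nvmlShutdown()

# Before calling the face detector, require roughly its ~1GB footprint:
if vram_free_mib() < 1024:
    raise RuntimeError("Not enough free VRAM; swap the vision model out first")
```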
## Related Documentation

- `/miku-discord/FACE_DETECTION_API_MIGRATION.md` - Original API migration
- `/miku-discord/PROFILE_PICTURE_IMPLEMENTATION.md` - Profile picture feature details
- `/face-detector/api/main.py` - Face detection API implementation
- `llama-swap-config.yaml` - Model swap configuration