
VRAM-Aware Profile Picture System

Overview

The profile picture feature now manages GPU VRAM efficiently by coordinating between the vision model and face detection model. Since both require VRAM and there isn't enough for both simultaneously, the system automatically swaps models as needed.

Architecture

Services in docker-compose.yml

┌─────────────────────────────────────────────────────────────┐
│                    GPU (Shared VRAM)                        │
│  ┌───────────────┐      ┌──────────────────────────────┐   │
│  │  llama-swap   │ ←──→ │   anime-face-detector        │   │
│  │ (Text/Vision) │      │   (YOLOv3 Face Detection)    │   │
│  └───────────────┘      └──────────────────────────────┘   │
│         ↑                           ↑                       │
└─────────┼───────────────────────────┼───────────────────────┘
          │                           │
    ┌─────┴──────────────────────────┴────┐
    │         miku-bot                     │
    │  (Coordinates model swapping)        │
    └──────────────────────────────────────┘

VRAM Management Flow

Profile Picture Change Process:

  1. Vision Model Phase (if using Danbooru):

    User triggers change → Danbooru search → Download image → 
    Vision model verifies it's Miku → Vision model returns result
    
  2. VRAM Swap:

    Bot swaps to text model → Vision model unloads → VRAM freed
    (3-second wait for complete unload)
    
  3. Face Detection Phase:

    Face detector loads → Detect face → Return bbox/keypoints → 
    Face detector stays loaded for future requests
    
  4. Cropping & Upload:

    Crop image using face bbox → Upload to Discord
    

Key Files

Consolidated Structure

miku-discord/
├── docker-compose.yml           # All 3 services (llama-swap, miku-bot, anime-face-detector)
├── face-detector/               # Face detection service (moved from separate repo)
│   ├── Dockerfile
│   ├── supervisord.conf
│   ├── api/
│   │   ├── main.py             # FastAPI face detection endpoint
│   │   └── outputs/            # Detection results
│   └── images/                 # Test images
└── bot/
    └── utils/
        ├── profile_picture_manager.py    # Updated with VRAM management
        └── face_detector_manager.py      # (Optional advanced version)

Modified Files

1. profile_picture_manager.py

Added _ensure_vram_available() method:

async def _ensure_vram_available(self, debug: bool = False):
    """
    Ensure VRAM is available for face detection by swapping to text model.
    This unloads the vision model if it's loaded.
    """
    # Trigger swap to text model
    # Vision model auto-unloads
    # Wait 3 seconds for VRAM to clear
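
The body above is abbreviated. A minimal sketch of what the swap can look like, assuming the bot uses aiohttp and that llama-swap is reachable as http://llama-swap:8090 inside the compose network (the hostname and constant name are illustrative; the swap request itself mirrors the curl command in the Troubleshooting section):

import asyncio
import aiohttp

LLAMA_SWAP_URL = "http://llama-swap:8090"  # assumed in-network hostname

async def _ensure_vram_available(self, debug: bool = False):
    """Swap llama-swap to the text model so the vision model unloads."""
    async with aiohttp.ClientSession() as session:
        # Requesting the text model makes llama-swap unload the vision model.
        async with session.post(
            f"{LLAMA_SWAP_URL}/v1/chat/completions",
            json={
                "model": "llama3.1",
                "messages": [{"role": "user", "content": "hi"}],
                "max_tokens": 1,
            },
        ) as resp:
            await resp.read()  # swap is done once the text model answers
    # Give the driver a moment to actually release the VRAM.
    await asyncio.sleep(3)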

Updated _detect_face():

async def _detect_face(self, image_bytes: bytes, debug: bool = False):
    # First: Free VRAM
    await self._ensure_vram_available(debug=debug)
    
    # Then: Call face detection API
    # Face detector has exclusive VRAM access
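
The API call itself might look like the following sketch (again assuming aiohttp; the endpoint and response shape are those documented under API Endpoints below):

import aiohttp

FACE_DETECTOR_URL = "http://anime-face-detector:6078"  # compose service name

async def _detect_face(self, image_bytes: bytes, debug: bool = False):
    # First: free VRAM by swapping the vision model out.
    await self._ensure_vram_available(debug=debug)

    # Then: POST the image to the face detection API.
    form = aiohttp.FormData()
    form.add_field("file", image_bytes,
                   filename="image.jpg", content_type="image/jpeg")
    async with aiohttp.ClientSession() as session:
        async with session.post(f"{FACE_DETECTOR_URL}/detect", data=form) as resp:
            result = await resp.json()

    # Return the highest-confidence detection, if any face was found.
    detections = result.get("detections", [])
    if not detections:
        return None
    return max(detections, key=lambda d: d["confidence"])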

2. docker-compose.yml

Added anime-face-detector service:

anime-face-detector:
  build: ./face-detector
  runtime: nvidia
  volumes:
    - ./face-detector/api:/app/api
  ports:
    - "7860:7860"  # Gradio UI
    - "6078:6078"  # FastAPI

Model Characteristics

| Model | Size | VRAM Usage | TTL (Auto-unload) | Purpose |
|---|---|---|---|---|
| llama3.1 (Text) | ~4.5GB | ~5GB | 30 min | Text generation |
| vision (MiniCPM-V) | ~3.8GB | ~4GB+ | 15 min | Image understanding |
| YOLOv3 Face Detector | ~250MB | ~1GB | Always loaded | Anime face detection |

Total VRAM: ~8GB available on the GPU
Conflict: Vision (~4GB) + Face Detector (~1GB) = too much once the vision model's runtime overhead is included

How It Works

Automatic VRAM Management

  1. When vision model is needed:

    • Bot makes request to llama-swap
    • llama-swap loads vision model (unloads text if needed)
    • Vision model processes request
    • Vision model stays loaded for 15 minutes (TTL)
  2. When face detection is needed:

    • _ensure_vram_available() swaps to text model
    • llama-swap unloads vision model automatically
    • 3-second wait ensures VRAM is fully released
    • Face detection API called (loads YOLOv3)
    • Face detection succeeds with enough VRAM
  3. After face detection:

    • Face detector stays loaded (no TTL, always ready)
    • Vision model can be loaded again when needed
    • llama-swap handles the swap automatically
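
Read end to end, the sequence is simple enough to sketch as straight-line code. Helper names other than _detect_face() and _ensure_vram_available() are illustrative, not the actual method names in profile_picture_manager.py:

async def change_profile_picture(self):
    # 1. Vision phase: verify the candidate image (loads the vision model).
    image_bytes = await self._download_candidate_image()   # hypothetical helper
    if not await self._verify_is_miku(image_bytes):        # hypothetical helper
        return

    # 2-3. _detect_face() swaps to the text model (freeing VRAM),
    #      then calls the face detection API.
    detection = await self._detect_face(image_bytes)

    # 4. Crop around the face if one was found, else fall back to saliency.
    if detection is not None:
        cropped = self._crop_to_bbox(image_bytes, detection["bbox"])  # hypothetical
    else:
        cropped = self._crop_by_saliency(image_bytes)                 # hypothetical
    await self._upload_to_discord(cropped)                            # hypothetical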

Why This Works

  • Sequential Processing: Vision verification happens first, face detection after
  • Automatic Swapping: llama-swap handles model management
  • Minimal Code Changes: Just one method added to ensure the swap happens
  • Graceful Fallback: If face detection fails, saliency detection still works

API Endpoints

Face Detection API

Endpoint: http://anime-face-detector:6078/detect

Request:

curl -X POST http://localhost:6078/detect -F "file=@image.jpg"

Response:

{
  "detections": [
    {
      "bbox": [x1, y1, x2, y2],
      "confidence": 0.98,
      "keypoints": [[x, y, score], ...]
    }
  ],
  "count": 1,
  "annotated_image": "/app/api/outputs/..._annotated.jpg",
  "json_file": "/app/api/outputs/..._results.json"
}
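
For quick testing outside the bot, the same endpoint can be exercised from Python (a small example using the requests library):

import requests

# Post an image and print each detected face's bounding box.
with open("image.jpg", "rb") as f:
    resp = requests.post("http://localhost:6078/detect", files={"file": f})
resp.raise_for_status()

data = resp.json()
print(f"Found {data['count']} face(s)")
for det in data["detections"]:
    x1, y1, x2, y2 = det["bbox"]
    print(f"  bbox=({x1}, {y1}, {x2}, {y2})  confidence={det['confidence']:.2f}")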

Health Check:

curl http://localhost:6078/health
# Returns: {"status":"healthy","detector_loaded":true}

Gradio UI: http://localhost:7860 (visual testing)

Deployment

Build and Start All Services

cd /home/koko210Serve/docker/miku-discord
docker-compose up -d --build

This starts:

  • llama-swap (text/vision models)
  • miku-bot (Discord bot)
  • anime-face-detector (face detection API)

Verify Services

# Check all containers are running
docker-compose ps

# Check face detector API
curl http://localhost:6078/health

# Check llama-swap
curl http://localhost:8090/health

# Check bot logs
docker-compose logs -f miku-bot | grep "face detector"
# Should see: "✅ Anime face detector API connected"
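
The HTTP checks can also be scripted in Python (ports as published in docker-compose.yml; a convenience sketch, not part of the repo):

import requests

SERVICES = {
    "face detector": "http://localhost:6078/health",
    "llama-swap": "http://localhost:8090/health",
}

for name, url in SERVICES.items():
    try:
        r = requests.get(url, timeout=5)
        print(f"{name}: {r.status_code} {r.text.strip()}")
    except requests.RequestException as exc:
        print(f"{name}: UNREACHABLE ({exc})")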

Test Profile Picture Change

# Via API
curl -X POST "http://localhost:3939/profile-picture/change"

# Via Web UI
# Navigate to http://localhost:3939 → Actions → Profile Picture

Monitoring VRAM Usage

Check GPU Memory

# From host
nvidia-smi

# From llama-swap container
docker exec llama-swap nvidia-smi

# From face-detector container
docker exec anime-face-detector nvidia-smi
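
If the bot should check VRAM programmatically before loading the face detector (see Future Enhancements), nvidia-smi's query mode is easy to parse. A sketch, assuming nvidia-smi is on PATH inside the container:

import subprocess

def gpu_memory_used_mib() -> int:
    """Return current GPU memory usage in MiB via nvidia-smi."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"],
        text=True,
    )
    # One line per GPU; this setup has a single shared GPU.
    return int(out.strip().splitlines()[0])

print(f"GPU memory in use: {gpu_memory_used_mib()} MiB")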

Check Model Status

# See which model is loaded in llama-swap
docker exec llama-swap ps aux | grep llama-server

# Check face detector
docker exec anime-face-detector ps aux | grep python

Troubleshooting

"Out of Memory" Errors

Symptom: Vision model crashes with cudaMalloc failed: out of memory

Solution: The VRAM swap should prevent this. If it still occurs:

  1. Check swap timing:

    # In profile_picture_manager.py, increase wait time:
    await asyncio.sleep(5)  # Instead of 3
    
  2. Manually unload vision:

    # Force swap to text model
    curl -X POST http://localhost:8090/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{"model":"llama3.1","messages":[{"role":"user","content":"hi"}],"max_tokens":1}'
    
  3. Check if face detector is already loaded:

    docker exec anime-face-detector nvidia-smi
    

Face Detection Not Working

Symptom: Cannot connect to host anime-face-detector:6078

Solution:

# Check container is running
docker ps | grep anime-face-detector

# Check network
docker network inspect miku-discord_default

# Restart face detector
docker-compose restart anime-face-detector

# Check logs
docker-compose logs anime-face-detector

Vision Model Still Loaded

Symptom: Face detection OOM even after swap

Solution:

# Force model unload by stopping llama-swap briefly
docker-compose restart llama-swap

# Or increase wait time in _ensure_vram_available()

Performance Metrics

Typical Timeline

| Step | Duration | VRAM State |
|---|---|---|
| Vision verification | 5-10s | Vision model loaded (~4GB) |
| Model swap + wait | 3-5s | Transitioning (releasing VRAM) |
| Face detection | 1-2s | Face detector loaded (~1GB) |
| Cropping & upload | 1-2s | Face detector still loaded |
| Total | 10-19s | Efficient VRAM usage |

VRAM Timeline

Time:   0s    5s    10s   13s   15s
        │     │     │     │     │
Vision: ████████████░░░░░░░░░░░░   ← Unloads after verification
Swap:   ░░░░░░░░░░░░███░░░░░░░░░   ← 3s transition
Face:   ░░░░░░░░░░░░░░░█████████   ← Loads for detection

Benefits of This Approach

  • No Manual Intervention: Automatic VRAM management
  • Reliable: Sequential processing avoids conflicts
  • Efficient: Models only loaded when needed
  • Simple: Minimal code changes
  • Maintainable: Uses existing llama-swap features
  • Graceful: Fallback to saliency if face detection is unavailable

Future Enhancements

Potential improvements:

  1. Dynamic Model Unloading: Explicitly unload vision model via API if llama-swap adds support
  2. VRAM Monitoring: Check actual VRAM usage before loading face detector
  3. Queue System: Process multiple images without repeated model swaps (see the sketch after this list)
  4. Persistent Face Detector: Keep loaded in background, use pause/resume
  5. Smaller Models: Use quantized versions to reduce VRAM requirements
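
A possible shape for the queue idea (item 3), purely illustrative; _detect_face_api_only() is a hypothetical variant of _detect_face() that calls the API without re-swapping:

import asyncio

class FaceDetectionQueue:
    """Batch face-detection requests so one vision→text swap covers many images."""

    def __init__(self, manager):
        self.manager = manager                      # a ProfilePictureManager-like object
        self.pending: asyncio.Queue[bytes] = asyncio.Queue()

    async def worker(self):
        while True:
            batch = [await self.pending.get()]
            # Drain whatever else is queued so one swap covers the whole batch.
            while not self.pending.empty():
                batch.append(self.pending.get_nowait())
            await self.manager._ensure_vram_available()   # one swap per batch
            for img in batch:
                await self.manager._detect_face_api_only(img)  # hypothetical
                self.pending.task_done()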

Related Documentation

  • /miku-discord/FACE_DETECTION_API_MIGRATION.md - Original API migration
  • /miku-discord/PROFILE_PICTURE_IMPLEMENTATION.md - Profile picture feature details
  • /face-detector/api/main.py - Face detection API implementation
  • llama-swap-config.yaml - Model swap configuration