
VRAM-Aware Profile Picture System

Overview

The profile picture feature now manages GPU VRAM efficiently by coordinating between the vision model and face detection model. Since both require VRAM and there isn't enough for both simultaneously, the system automatically swaps models as needed.

Architecture

Services in docker-compose.yml

┌─────────────────────────────────────────────────────────────┐
│                    GPU (Shared VRAM)                        │
│  ┌───────────────┐      ┌──────────────────────────────┐   │
│  │  llama-swap   │ ←──→ │   anime-face-detector        │   │
│  │ (Text/Vision) │      │   (YOLOv3 Face Detection)    │   │
│  └───────────────┘      └──────────────────────────────┘   │
│         ↑                           ↑                       │
└─────────┼───────────────────────────┼───────────────────────┘
          │                           │
    ┌─────┴──────────────────────────┴────┐
    │         miku-bot                     │
    │  (Coordinates model swapping)        │
    └──────────────────────────────────────┘

VRAM Management Flow

Profile Picture Change Process:

  1. Vision Model Phase (if using Danbooru):

    User triggers change → Danbooru search → Download image → 
    Vision model verifies it's Miku → Vision model returns result
    
  2. VRAM Swap:

    Bot swaps to text model → Vision model unloads → VRAM freed
    (3-second wait for complete unload)
    
  3. Face Detection Phase:

    Face detector loads → Detect face → Return bbox/keypoints → 
    Face detector stays loaded for future requests
    
  4. Cropping & Upload:

    Crop image using face bbox → Upload to Discord
    

Key Files

Consolidated Structure

miku-discord/
├── docker-compose.yml           # All 3 services (llama-swap, miku-bot, anime-face-detector)
├── face-detector/               # Face detection service (moved from separate repo)
│   ├── Dockerfile
│   ├── supervisord.conf
│   ├── api/
│   │   ├── main.py             # FastAPI face detection endpoint
│   │   └── outputs/            # Detection results
│   └── images/                 # Test images
└── bot/
    └── utils/
        ├── profile_picture_manager.py    # Updated with VRAM management
        └── face_detector_manager.py      # (Optional advanced version)

Modified Files

1. profile_picture_manager.py

Added _ensure_vram_available() method:

async def _ensure_vram_available(self, debug: bool = False):
    """
    Ensure VRAM is available for face detection by swapping to text model.
    This unloads the vision model if it's loaded.
    """
    # Trigger swap to text model
    # Vision model auto-unloads
    # Wait 3 seconds for VRAM to clear
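
The body above is abbreviated. A minimal sketch of what the swap can look like, assuming the bot uses aiohttp and that llama-swap is reachable as http://llama-swap:8090 inside the compose network (the hostname and constant name are illustrative; the swap request itself mirrors the curl command in the Troubleshooting section):

import asyncio
import aiohttp

LLAMA_SWAP_URL = "http://llama-swap:8090"  # assumed in-network hostname

async def _ensure_vram_available(self, debug: bool = False):
    """Swap llama-swap to the text model so the vision model unloads."""
    async with aiohttp.ClientSession() as session:
        # Requesting the text model makes llama-swap unload the vision model.
        async with session.post(
            f"{LLAMA_SWAP_URL}/v1/chat/completions",
            json={
                "model": "llama3.1",
                "messages": [{"role": "user", "content": "hi"}],
                "max_tokens": 1,
            },
        ) as resp:
            await resp.read()  # swap is done once the text model answers
    # Give the driver a moment to actually release the VRAM.
    await asyncio.sleep(3)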

Updated _detect_face():

async def _detect_face(self, image_bytes: bytes, debug: bool = False):
    # First: Free VRAM
    await self._ensure_vram_available(debug=debug)
    
    # Then: Call face detection API
    # Face detector has exclusive VRAM access
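
The API call itself might look like the following sketch (again assuming aiohttp; the endpoint and response shape are those documented under API Endpoints below):

import aiohttp

FACE_DETECTOR_URL = "http://anime-face-detector:6078"  # compose service name

async def _detect_face(self, image_bytes: bytes, debug: bool = False):
    # First: free VRAM by swapping the vision model out.
    await self._ensure_vram_available(debug=debug)

    # Then: POST the image to the face detection API.
    form = aiohttp.FormData()
    form.add_field("file", image_bytes,
                   filename="image.jpg", content_type="image/jpeg")
    async with aiohttp.ClientSession() as session:
        async with session.post(f"{FACE_DETECTOR_URL}/detect", data=form) as resp:
            result = await resp.json()

    # Return the highest-confidence detection, if any face was found.
    detections = result.get("detections", [])
    if not detections:
        return None
    return max(detections, key=lambda d: d["confidence"])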

2. docker-compose.yml

Added anime-face-detector service:

anime-face-detector:
  build: ./face-detector
  runtime: nvidia
  volumes:
    - ./face-detector/api:/app/api
  ports:
    - "7860:7860"  # Gradio UI
    - "6078:6078"  # FastAPI

Model Characteristics

| Model | Size | VRAM Usage | TTL (Auto-unload) | Purpose |
|---|---|---|---|---|
| llama3.1 (Text) | ~4.5GB | ~5GB | 30 min | Text generation |
| vision (MiniCPM-V) | ~3.8GB | ~4GB+ | 15 min | Image understanding |
| YOLOv3 Face Detector | ~250MB | ~1GB | Always loaded | Anime face detection |

Total VRAM: ~8GB available on the GPU
Conflict: Vision (~4GB) + Face Detector (~1GB) = too much once the vision model's runtime overhead is included

How It Works

Automatic VRAM Management

  1. When vision model is needed:

    • Bot makes request to llama-swap
    • llama-swap loads vision model (unloads text if needed)
    • Vision model processes request
    • Vision model stays loaded for 15 minutes (TTL)
  2. When face detection is needed:

    • _ensure_vram_available() swaps to text model
    • llama-swap unloads vision model automatically
    • 3-second wait ensures VRAM is fully released
    • Face detection API called (loads YOLOv3)
    • Face detection succeeds with enough VRAM
  3. After face detection:

    • Face detector stays loaded (no TTL, always ready)
    • Vision model can be loaded again when needed
    • llama-swap handles the swap automatically
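
Read end to end, the sequence is simple enough to sketch as straight-line code. Helper names other than _detect_face() and _ensure_vram_available() are illustrative, not the actual method names in profile_picture_manager.py:

async def change_profile_picture(self):
    # 1. Vision phase: verify the candidate image (loads the vision model).
    image_bytes = await self._download_candidate_image()   # hypothetical helper
    if not await self._verify_is_miku(image_bytes):        # hypothetical helper
        return

    # 2-3. _detect_face() swaps to the text model (freeing VRAM),
    #      then calls the face detection API.
    detection = await self._detect_face(image_bytes)

    # 4. Crop around the face if one was found, else fall back to saliency.
    if detection is not None:
        cropped = self._crop_to_bbox(image_bytes, detection["bbox"])  # hypothetical
    else:
        cropped = self._crop_by_saliency(image_bytes)                 # hypothetical
    await self._upload_to_discord(cropped)                            # hypothetical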

Why This Works

  • Sequential Processing: Vision verification happens first, face detection after
  • Automatic Swapping: llama-swap handles model management
  • Minimal Code Changes: Just one method added to ensure the swap happens
  • Graceful Fallback: If face detection fails, saliency detection still works

API Endpoints

Face Detection API

Endpoint: http://anime-face-detector:6078/detect

Request:

curl -X POST http://localhost:6078/detect -F "file=@image.jpg"

Response:

{
  "detections": [
    {
      "bbox": [x1, y1, x2, y2],
      "confidence": 0.98,
      "keypoints": [[x, y, score], ...]
    }
  ],
  "count": 1,
  "annotated_image": "/app/api/outputs/..._annotated.jpg",
  "json_file": "/app/api/outputs/..._results.json"
}
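
For quick testing outside the bot, the same endpoint can be exercised from Python (a small example using the requests library):

import requests

# Post an image and print each detected face's bounding box.
with open("image.jpg", "rb") as f:
    resp = requests.post("http://localhost:6078/detect", files={"file": f})
resp.raise_for_status()

data = resp.json()
print(f"Found {data['count']} face(s)")
for det in data["detections"]:
    x1, y1, x2, y2 = det["bbox"]
    print(f"  bbox=({x1}, {y1}, {x2}, {y2})  confidence={det['confidence']:.2f}")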

Health Check:

curl http://localhost:6078/health
# Returns: {"status":"healthy","detector_loaded":true}

Gradio UI: http://localhost:7860 (visual testing)

Deployment

Build and Start All Services

cd /home/koko210Serve/docker/miku-discord
docker-compose up -d --build

This starts:

  • llama-swap (text/vision models)
  • miku-bot (Discord bot)
  • anime-face-detector (face detection API)

Verify Services

# Check all containers are running
docker-compose ps

# Check face detector API
curl http://localhost:6078/health

# Check llama-swap
curl http://localhost:8090/health

# Check bot logs
docker-compose logs -f miku-bot | grep "face detector"
# Should see: "✅ Anime face detector API connected"
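
The HTTP checks can also be scripted in Python (ports as published in docker-compose.yml; a convenience sketch, not part of the repo):

import requests

SERVICES = {
    "face detector": "http://localhost:6078/health",
    "llama-swap": "http://localhost:8090/health",
}

for name, url in SERVICES.items():
    try:
        r = requests.get(url, timeout=5)
        print(f"{name}: {r.status_code} {r.text.strip()}")
    except requests.RequestException as exc:
        print(f"{name}: UNREACHABLE ({exc})")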

Test Profile Picture Change

# Via API
curl -X POST "http://localhost:3939/profile-picture/change"

# Via Web UI
# Navigate to http://localhost:3939 → Actions → Profile Picture

Monitoring VRAM Usage

Check GPU Memory

# From host
nvidia-smi

# From llama-swap container
docker exec llama-swap nvidia-smi

# From face-detector container
docker exec anime-face-detector nvidia-smi
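
If the bot should check VRAM programmatically before loading the face detector (see Future Enhancements), nvidia-smi's query mode is easy to parse. A sketch, assuming nvidia-smi is on PATH inside the container:

import subprocess

def gpu_memory_used_mib() -> int:
    """Return current GPU memory usage in MiB via nvidia-smi."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"],
        text=True,
    )
    # One line per GPU; this setup has a single shared GPU.
    return int(out.strip().splitlines()[0])

print(f"GPU memory in use: {gpu_memory_used_mib()} MiB")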

Check Model Status

# See which model is loaded in llama-swap
docker exec llama-swap ps aux | grep llama-server

# Check face detector
docker exec anime-face-detector ps aux | grep python

Troubleshooting

"Out of Memory" Errors

Symptom: Vision model crashes with cudaMalloc failed: out of memory

Solution: The VRAM swap should prevent this. If it still occurs:

  1. Check swap timing:

    # In profile_picture_manager.py, increase wait time:
    await asyncio.sleep(5)  # Instead of 3
    
  2. Manually unload vision:

    # Force swap to text model
    curl -X POST http://localhost:8090/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{"model":"llama3.1","messages":[{"role":"user","content":"hi"}],"max_tokens":1}'
    
  3. Check if face detector is already loaded:

    docker exec anime-face-detector nvidia-smi
    

Face Detection Not Working

Symptom: Cannot connect to host anime-face-detector:6078

Solution:

# Check container is running
docker ps | grep anime-face-detector

# Check network
docker network inspect miku-discord_default

# Restart face detector
docker-compose restart anime-face-detector

# Check logs
docker-compose logs anime-face-detector

Vision Model Still Loaded

Symptom: Face detection OOM even after swap

Solution:

# Force model unload by stopping llama-swap briefly
docker-compose restart llama-swap

# Or increase wait time in _ensure_vram_available()

Performance Metrics

Typical Timeline

| Step | Duration | VRAM State |
|---|---|---|
| Vision verification | 5-10s | Vision model loaded (~4GB) |
| Model swap + wait | 3-5s | Transitioning (releasing VRAM) |
| Face detection | 1-2s | Face detector loaded (~1GB) |
| Cropping & upload | 1-2s | Face detector still loaded |
| Total | 10-19s | Efficient VRAM usage |

VRAM Timeline

Time:   0s    5s    10s   13s   15s
        │     │     │     │     │
Vision: ████████████░░░░░░░░░░░░   ← Unloads after verification
Swap:   ░░░░░░░░░░░░███░░░░░░░░░   ← 3s transition
Face:   ░░░░░░░░░░░░░░░█████████   ← Loads for detection

Benefits of This Approach

  • No Manual Intervention: Automatic VRAM management
  • Reliable: Sequential processing avoids conflicts
  • Efficient: Models only loaded when needed
  • Simple: Minimal code changes
  • Maintainable: Uses existing llama-swap features
  • Graceful: Fallback to saliency if face detection is unavailable

Future Enhancements

Potential improvements:

  1. Dynamic Model Unloading: Explicitly unload vision model via API if llama-swap adds support
  2. VRAM Monitoring: Check actual VRAM usage before loading face detector
  3. Queue System: Process multiple images without repeated model swaps (see the sketch after this list)
  4. Persistent Face Detector: Keep loaded in background, use pause/resume
  5. Smaller Models: Use quantized versions to reduce VRAM requirements
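
A possible shape for the queue idea (item 3), purely illustrative; _detect_face_api_only() is a hypothetical variant of _detect_face() that calls the API without re-swapping:

import asyncio

class FaceDetectionQueue:
    """Batch face-detection requests so one vision→text swap covers many images."""

    def __init__(self, manager):
        self.manager = manager                      # a ProfilePictureManager-like object
        self.pending: asyncio.Queue[bytes] = asyncio.Queue()

    async def worker(self):
        while True:
            batch = [await self.pending.get()]
            # Drain whatever else is queued so one swap covers the whole batch.
            while not self.pending.empty():
                batch.append(self.pending.get_nowait())
            await self.manager._ensure_vram_available()   # one swap per batch
            for img in batch:
                await self.manager._detect_face_api_only(img)  # hypothetical
                self.pending.task_done()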

Related Documentation

  • /miku-discord/FACE_DETECTION_API_MIGRATION.md - Original API migration
  • /miku-discord/PROFILE_PICTURE_IMPLEMENTATION.md - Profile picture feature details
  • /face-detector/api/main.py - Face detection API implementation
  • llama-swap-config.yaml - Model swap configuration