# VRAM-Aware Profile Picture System

## Overview

The profile picture feature now manages GPU VRAM efficiently by coordinating between the vision model and the face detection model. Since both require VRAM and there isn't enough for both simultaneously, the system automatically swaps models as needed.

## Architecture

### Services in docker-compose.yml

```
┌─────────────────────────────────────────────────────────────┐
│                      GPU (Shared VRAM)                      │
│  ┌───────────────┐       ┌──────────────────────────────┐  │
│  │  llama-swap   │ ←──→  │     anime-face-detector      │  │
│  │ (Text/Vision) │       │   (YOLOv3 Face Detection)    │  │
│  └───────────────┘       └──────────────────────────────┘  │
│          ↑                              ↑                   │
└──────────┼──────────────────────────────┼───────────────────┘
           │                              │
      ┌────┴──────────────────────────────┴────┐
      │                miku-bot                │
      │     (Coordinates model swapping)       │
      └────────────────────────────────────────┘
```

### VRAM Management Flow

#### Profile Picture Change Process

1. **Vision Model Phase** (if using Danbooru):
   ```
   User triggers change → Danbooru search → Download image →
   Vision model verifies it's Miku → Vision model returns result
   ```

2. **VRAM Swap**:
   ```
   Bot swaps to text model → Vision model unloads → VRAM freed
   (3 second wait for complete unload)
   ```

3. **Face Detection Phase**:
   ```
   Face detector loads → Detect face → Return bbox/keypoints →
   Face detector stays loaded for future requests
   ```

4. **Cropping & Upload**:
   ```
   Crop image using face bbox → Upload to Discord
   ```

## Key Files

### Consolidated Structure

```
miku-discord/
├── docker-compose.yml       # All 3 services (llama-swap, miku-bot, anime-face-detector)
├── face-detector/           # Face detection service (moved from separate repo)
│   ├── Dockerfile
│   ├── supervisord.conf
│   ├── api/
│   │   ├── main.py          # FastAPI face detection endpoint
│   │   └── outputs/         # Detection results
│   └── images/              # Test images
└── bot/
    └── utils/
        ├── profile_picture_manager.py  # Updated with VRAM management
        └── face_detector_manager.py    # (Optional advanced version)
```
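The four-phase flow above can be sketched as a single coroutine. This is a minimal sketch, not the bot's actual implementation: the stub names (`verify_with_vision`, `swap_to_text_model`, `detect_face`, `crop_and_upload`) are hypothetical stand-ins for the real bot methods. The point it illustrates is the strict ordering, with the unload wait sitting between the vision phase and the face detection phase.

```python
import asyncio

# Hypothetical stubs standing in for the real bot calls; each phase
# records its name so the ordering is visible.
calls = []

async def verify_with_vision(image):      # Phase 1: vision model checks it's Miku
    calls.append("vision_verify")
    return True

async def swap_to_text_model():           # Phase 2: any text request makes
    calls.append("swap_to_text")          # llama-swap unload the vision model

async def detect_face(image):             # Phase 3: face detector now has the VRAM
    calls.append("face_detect")
    return {"bbox": [10, 10, 90, 90]}

async def crop_and_upload(image, bbox):   # Phase 4: crop and push to Discord
    calls.append("crop_upload")

async def change_profile_picture(image):
    if not await verify_with_vision(image):
        return
    await swap_to_text_model()
    await asyncio.sleep(0.01)  # stands in for the 3-second VRAM-release wait
    bbox = (await detect_face(image))["bbox"]
    await crop_and_upload(image, bbox)

asyncio.run(change_profile_picture(b"fake-image-bytes"))
print(calls)  # → ['vision_verify', 'swap_to_text', 'face_detect', 'crop_upload']
```

Because each phase is awaited in turn, the vision model and the face detector are never resident at the same time.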
### Modified Files

#### 1. **profile_picture_manager.py**

Added `_ensure_vram_available()` method:

```python
async def _ensure_vram_available(self, debug: bool = False):
    """
    Ensure VRAM is available for face detection by swapping to the text model.
    This unloads the vision model if it's loaded.
    """
    # Trigger swap to text model
    # Vision model auto-unloads
    # Wait 3 seconds for VRAM to clear
```

Updated `_detect_face()`:

```python
async def _detect_face(self, image_bytes: bytes, debug: bool = False):
    # First: free VRAM
    await self._ensure_vram_available(debug=debug)
    # Then: call the face detection API
    # Face detector has exclusive VRAM access
```

#### 2. **docker-compose.yml**

Added the `anime-face-detector` service:

```yaml
anime-face-detector:
  build: ./face-detector
  runtime: nvidia
  volumes:
    - ./face-detector/api:/app/api
  ports:
    - "7860:7860"  # Gradio UI
    - "6078:6078"  # FastAPI
```

## Model Characteristics

| Model | Size | VRAM Usage | TTL (Auto-unload) | Purpose |
|-------|------|------------|-------------------|---------|
| llama3.1 (Text) | ~4.5GB | ~5GB | 30 min | Text generation |
| vision (MiniCPM-V) | ~3.8GB | ~4GB+ | 15 min | Image understanding |
| YOLOv3 Face Detector | ~250MB | ~1GB | Always loaded | Anime face detection |

**Total VRAM**: ~8GB available on GPU
**Conflict**: Vision (~4GB) + Face Detector (~1GB) is too much once the vision model's runtime overhead is included

## How It Works

### Automatic VRAM Management

1. **When the vision model is needed**:
   - Bot makes a request to llama-swap
   - llama-swap loads the vision model (unloading the text model if needed)
   - Vision model processes the request
   - Vision model stays loaded for 15 minutes (TTL)

2. **When face detection is needed**:
   - `_ensure_vram_available()` swaps to the text model
   - llama-swap unloads the vision model automatically
   - A 3-second wait ensures VRAM is fully released
   - The face detection API is called (loading YOLOv3)
   - Face detection succeeds with enough VRAM
3. **After face detection**:
   - Face detector stays loaded (no TTL, always ready)
   - Vision model can be loaded again when needed
   - llama-swap handles the swap automatically

### Why This Works

✅ **Sequential Processing**: Vision verification happens first, face detection after
✅ **Automatic Swapping**: llama-swap handles model management
✅ **Minimal Code Changes**: Just one method added to ensure the swap happens
✅ **Graceful Fallback**: If face detection fails, saliency detection still works

## API Endpoints

### Face Detection API

**Endpoint**: `http://anime-face-detector:6078/detect`

**Request**:
```bash
curl -X POST http://localhost:6078/detect -F "file=@image.jpg"
```

**Response**:
```json
{
  "detections": [
    {
      "bbox": [x1, y1, x2, y2],
      "confidence": 0.98,
      "keypoints": [[x, y, score], ...]
    }
  ],
  "count": 1,
  "annotated_image": "/app/api/outputs/..._annotated.jpg",
  "json_file": "/app/api/outputs/..._results.json"
}
```

**Health Check**:
```bash
curl http://localhost:6078/health
# Returns: {"status":"healthy","detector_loaded":true}
```

**Gradio UI**: http://localhost:7860 (visual testing)

## Deployment

### Build and Start All Services

```bash
cd /home/koko210Serve/docker/miku-discord
docker-compose up -d --build
```

This starts:
- ✅ llama-swap (text/vision models)
- ✅ miku-bot (Discord bot)
- ✅ anime-face-detector (face detection API)

### Verify Services

```bash
# Check all containers are running
docker-compose ps

# Check face detector API
curl http://localhost:6078/health

# Check llama-swap
curl http://localhost:8090/health

# Check bot logs
docker-compose logs -f miku-bot | grep "face detector"
# Should see: "✅ Anime face detector API connected"
```

### Test Profile Picture Change

```bash
# Via API
curl -X POST "http://localhost:3939/profile-picture/change"

# Via Web UI
# Navigate to http://localhost:3939 → Actions → Profile Picture
```

## Monitoring VRAM Usage

### Check GPU Memory

```bash
# From host
nvidia-smi

# From llama-swap container
docker exec llama-swap nvidia-smi
```
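The `bbox` returned by the `/detect` endpoint feeds the cropping step. Below is a minimal sketch of consuming that response: pick the highest-confidence detection and expand its bbox into a square crop with a margin, clamped to the image bounds. The helper name and the 20% margin are illustrative assumptions, not the bot's actual values.

```python
def crop_box_from_response(resp, img_w, img_h, margin=0.2):
    """Turn a /detect response into a square (left, top, right, bottom) crop box."""
    if resp["count"] == 0:
        return None  # caller falls back to saliency detection
    # Highest-confidence face wins
    best = max(resp["detections"], key=lambda d: d["confidence"])
    x1, y1, x2, y2 = best["bbox"]
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    # Square side = larger bbox dimension plus margin, centered on the face
    side = max(x2 - x1, y2 - y1) * (1 + margin)
    half = side / 2
    left = max(0, int(cx - half))
    top = max(0, int(cy - half))
    right = min(img_w, int(cx + half))
    bottom = min(img_h, int(cy + half))
    return (left, top, right, bottom)

# Example response shaped like the /detect documentation above
sample = {
    "count": 1,
    "detections": [
        {"bbox": [100, 80, 200, 200], "confidence": 0.98, "keypoints": []}
    ],
}
print(crop_box_from_response(sample, 512, 512))  # → (78, 68, 222, 212)
```

The resulting tuple can be passed directly to something like Pillow's `Image.crop()` before uploading to Discord.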
```bash
# From face-detector container
docker exec anime-face-detector nvidia-smi
```

### Check Model Status

```bash
# See which model is loaded in llama-swap
docker exec llama-swap ps aux | grep llama-server

# Check the face detector
docker exec anime-face-detector ps aux | grep python
```

## Troubleshooting

### "Out of Memory" Errors

**Symptom**: Vision model crashes with `cudaMalloc failed: out of memory`

**Solution**: The VRAM swap should prevent this. If it still occurs:

1. **Check swap timing**:
   ```python
   # In profile_picture_manager.py, increase the wait time:
   await asyncio.sleep(5)  # Instead of 3
   ```

2. **Manually unload vision**:
   ```bash
   # Force swap to the text model
   curl -X POST http://localhost:8090/v1/chat/completions \
     -H "Content-Type: application/json" \
     -d '{"model":"llama3.1","messages":[{"role":"user","content":"hi"}],"max_tokens":1}'
   ```

3. **Check if the face detector is already loaded**:
   ```bash
   docker exec anime-face-detector nvidia-smi
   ```

### Face Detection Not Working

**Symptom**: `Cannot connect to host anime-face-detector:6078`

**Solution**:
```bash
# Check the container is running
docker ps | grep anime-face-detector

# Check the network
docker network inspect miku-discord_default

# Restart the face detector
docker-compose restart anime-face-detector

# Check logs
docker-compose logs anime-face-detector
```

### Vision Model Still Loaded

**Symptom**: Face detection hits OOM even after the swap

**Solution**:
```bash
# Force model unload by briefly restarting llama-swap
docker-compose restart llama-swap

# Or increase the wait time in _ensure_vram_available()
```

## Performance Metrics

### Typical Timeline

| Step | Duration | VRAM State |
|------|----------|------------|
| Vision verification | 5-10s | Vision model loaded (~4GB) |
| Model swap + wait | 3-5s | Transitioning (releasing VRAM) |
| Face detection | 1-2s | Face detector loaded (~1GB) |
| Cropping & upload | 1-2s | Face detector still loaded |
| **Total** | **10-19s** | Efficient VRAM usage |

### VRAM Timeline

```
Time:    0s        5s        10s  13s  15s
         │         │         │    │    │
Vision:  ████████████░░░░░░░░░░░░         ← Unloads after verification
Swap:    ░░░░░░░░░░░░███░░░░░░░░░         ← 3s transition
Face:    ░░░░░░░░░░░░░░░████████          ← Loads for detection
```

## Benefits of This Approach

✅ **No Manual Intervention**: Automatic VRAM management
✅ **Reliable**: Sequential processing avoids conflicts
✅ **Efficient**: Models only loaded when needed
✅ **Simple**: Minimal code changes
✅ **Maintainable**: Uses existing llama-swap features
✅ **Graceful**: Falls back to saliency detection if face detection is unavailable

## Future Enhancements

Potential improvements:

1. **Dynamic Model Unloading**: Explicitly unload the vision model via API if llama-swap adds support
2. **VRAM Monitoring**: Check actual VRAM usage before loading the face detector
3. **Queue System**: Process multiple images without repeated model swaps
4. **Persistent Face Detector**: Keep it loaded in the background, using pause/resume
5. **Smaller Models**: Use quantized versions to reduce VRAM requirements

## Related Documentation

- `/miku-discord/FACE_DETECTION_API_MIGRATION.md` - Original API migration
- `/miku-discord/PROFILE_PICTURE_IMPLEMENTATION.md` - Profile picture feature details
- `/face-detector/api/main.py` - Face detection API implementation
- `llama-swap-config.yaml` - Model swap configuration
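As an appendix to future enhancement #2 (VRAM monitoring), a check could parse the output of `nvidia-smi --query-gpu=memory.free --format=csv,noheader,nounits` before loading the face detector. The sketch below is an illustrative assumption, not part of the current bot; the helper name and 1GB threshold (from the model table) are hypothetical.

```python
FACE_DETECTOR_VRAM_MB = 1024  # ~1GB per the model characteristics table

def enough_vram_for_face_detector(nvidia_smi_output: str) -> bool:
    """Parse `nvidia-smi --query-gpu=memory.free --format=csv,noheader,nounits`
    output (free MiB, one line per GPU) and check the face detector fits."""
    free_mb = int(nvidia_smi_output.strip().splitlines()[0])
    return free_mb >= FACE_DETECTOR_VRAM_MB

print(enough_vram_for_face_detector("6144\n"))  # → True
print(enough_vram_for_face_detector("512\n"))   # → False
```

If the check fails, the bot could trigger `_ensure_vram_available()` (or simply fall back to saliency detection) instead of letting the detector hit an OOM.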