
BAGEL Integration in CrowdFace

This document describes the integration of ByteDance's BAGEL (ByteDance Ad Generation and Embedding Library) into the CrowdFace system.

Overview

BAGEL is an advanced AI system developed by ByteDance that provides intelligent ad placement capabilities. In CrowdFace, BAGEL is used to analyze video frames and determine optimal locations for ad placement based on scene understanding and content analysis.

Integration Architecture

The integration follows these key principles:

  1. Loose Coupling: CrowdFace can function without BAGEL, falling back to basic placement algorithms (see the sketch after this list)
  2. Seamless Enhancement: When BAGEL is available, it enhances ad placement with advanced features
  3. Consistent API: The same API is used regardless of whether BAGEL is available
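
Concretely, the third principle means every call site looks the same and the branching lives inside the wrapper. A minimal sketch of the dispatch, condensed from BAGELWrapper.analyze_frame in the full source below:

# Sketch of the dispatch inside BAGELWrapper.analyze_frame (full source below):
# the caller always receives an analysis dict, with or without BAGEL loaded.
if self.inferencer is None:
    return self._fallback_analysis(frame, mask)   # basic placement heuristics
try:
    ...                                           # BAGEL-backed scene analysis
except Exception as e:
    logger.error(f"Error in BAGEL analysis: {e}")
    return self._fallback_analysis(frame, mask)   # same fallback on any failure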

Setup Instructions

1. Clone the BAGEL Repository

git clone https://github.com/ByteDance-Seed/Bagel.git

The repository should be cloned into the root directory of the CrowdFace project.

2. Install BAGEL Dependencies

cd Bagel
pip install -r requirements.txt

3. Set Environment Variables

For Hugging Face model access:

export HUGGINGFACE_TOKEN=your_token_here
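
The loaders read this variable from the environment at runtime rather than taking it as an argument; for example, load_sam_model in src/python/main.py passes it straight to Hugging Face:

token = os.environ.get('HUGGINGFACE_TOKEN')  # None if unset; gated models will then fail to load
sam_processor = SamProcessor.from_pretrained(model_id, token=token)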

Usage

The BAGEL integration is handled through the BAGELWrapper class in src/python/bagel_loader.py. This wrapper provides:

  1. Model Loading: Handles loading the BAGEL models with appropriate error handling
  2. Frame Analysis: Processes video frames to determine optimal ad placement
  3. Fallback Mechanisms: Provides basic functionality when BAGEL is unavailable
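
In practice, callers obtain the wrapper through the module's setup_bagel() helper rather than constructing it directly; condensed from how main.py below uses it:

bagel_wrapper = setup_bagel()  # locates Bagel/, loads the model, keeps the fallback on failure
analysis = bagel_wrapper.analyze_frame(frame, mask)  # mask is optional
x, y = analysis['optimal_placement']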

Key Features

When integrated with BAGEL, CrowdFace gains these additional capabilities:

Scene Understanding

BAGEL analyzes the video content to understand:

  • Scene type (indoor/outdoor, crowd density, etc.)
  • Visual context and mood
  • Audience demographics

Intelligent Ad Placement

Based on scene analysis, BAGEL determines:

  • Optimal ad placement locations
  • Appropriate ad sizes and styles
  • Contextual relevance scoring

Ad Effectiveness Prediction

BAGEL can predict:

  • Viewer attention patterns
  • Ad visibility metrics
  • Potential engagement levels
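
Whether these predictions come from BAGEL or from the fallback heuristics, analyze_frame always returns a single dictionary. The fallback implementation in bagel_loader.py produces results of this shape (placement values shown for a 640x480 frame; BAGEL would supply scene-specific estimates):

{
    'optimal_placement': (480, 144),   # (x, y): width * 0.75, height * 0.3
    'scene_understanding': 'crowd gathering in urban environment',
    'audience_demographics': 'mixed age group, outdoor activity',
    'recommended_ad_type': 'semi-transparent overlay',
}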

Implementation Details

The integration is implemented in three main components:

  1. BAGELWrapper (src/python/bagel_loader.py): Handles loading and initializing BAGEL
  2. CrowdFacePipeline (src/python/crowdface_pipeline.py): Uses BAGEL for ad placement
  3. Main Module (src/python/main.py): Orchestrates the integration
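
The wiring between these components is straightforward; condensed from main() in src/python/main.py below:

sam_model, sam_processor = load_sam_model()  # each loader returns None on failure
rvm_model = load_rvm_model()
bagel_wrapper = setup_bagel()                # always returns a wrapper, BAGEL-backed or fallback

pipeline = CrowdFacePipeline(
    sam_model=sam_model,
    sam_processor=sam_processor,
    rvm_model=rvm_model,
    bagel_wrapper=bagel_wrapper,
)
processed = pipeline.process_video(frames, ad_image, output_path='output.mp4')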

Fallback Mechanism

When BAGEL is unavailable, CrowdFace falls back to a basic placement algorithm (condensed after this list) that:

  1. Identifies people in the frame using segmentation masks
  2. Places ads in empty spaces, typically to the right of detected people
  3. Ensures ads don't overlap with important content
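
The core of this heuristic, condensed from find_ad_placement in crowdface_pipeline.py below:

import cv2
import numpy as np

def basic_placement(frame, mask):
    """Place the ad just to the right of the largest detected person."""
    binary_mask = (mask > 128).astype(np.uint8)
    contours, _ = cv2.findContours(binary_mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return (frame.shape[1] * 3 // 4, frame.shape[0] // 2)  # center-right default
    x, y, w, h = cv2.boundingRect(max(contours, key=cv2.contourArea))
    return (min(x + w + 20, frame.shape[1] - 100), y)  # 20 px to the right, clamped to the frame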

Source Code

The three modules described above are reproduced in full below.

src/python/bagel_loader.py

"""
BAGEL Loader Module
This module provides functionality to load and initialize the BAGEL
(ByteDance Ad Generation and Embedding Library) model for intelligent
ad placement in the CrowdFace system.
This implementation integrates with the official ByteDance BAGEL repository:
https://github.com/ByteDance-Seed/Bagel
"""
import os
import sys
import torch
import numpy as np
from PIL import Image
from typing import Dict, Tuple, Optional, Any, List
import logging
# Configure logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)
class BAGELWrapper:
"""
Wrapper class for BAGEL model integration with CrowdFace.
"""
def __init__(self, bagel_path=None):
"""
Initialize the BAGEL wrapper.
Args:
bagel_path: Path to the BAGEL repository
"""
self.bagel_path = bagel_path or os.path.join(os.path.dirname(os.path.dirname(os.path.dirname(__file__))), 'Bagel')
self.inferencer = None
self.model = None
self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# Add BAGEL to Python path
if self.bagel_path not in sys.path:
sys.path.append(self.bagel_path)
logger.info(f"BAGEL path set to: {self.bagel_path}")
    def load_model(self, model_path=None):
        """
        Load the BAGEL model.

        Args:
            model_path: Path to the model weights (optional)

        Returns:
            True if successful, False otherwise
        """
        try:
            # Import BAGEL modules
            logger.info("Importing BAGEL modules...")
            from inferencer import InterleaveInferencer
            from modeling.bagel.qwen2_navit import BagelForCausalLM
            from transformers import AutoTokenizer

            # Use the default checkpoint path from the BAGEL repo if none is provided
            if model_path is None:
                model_path = os.path.join(self.bagel_path, 'checkpoints', 'bagel-7b')
            logger.info(f"Loading BAGEL model from: {model_path}")

            # Load tokenizer and model
            # Note: this is a simplified version; actual loading would follow BAGEL's inference.ipynb
            tokenizer = AutoTokenizer.from_pretrained(model_path)
            model = BagelForCausalLM.from_pretrained(
                model_path,
                torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
                device_map="auto" if torch.cuda.is_available() else None,
            )

            # Create the inferencer
            self.inferencer = InterleaveInferencer(
                model=model,
                tokenizer=tokenizer,
                # Additional parameters would be added based on BAGEL's requirements
            )
            self.model = model
            logger.info("BAGEL model loaded successfully")
            return True
        except ImportError as e:
            logger.error(f"Failed to import BAGEL modules: {e}")
            logger.error("Make sure the BAGEL repository is properly cloned and accessible")
            return False
        except Exception as e:
            logger.error(f"Error loading BAGEL model: {e}")
            return False
    def analyze_frame(self, frame, mask=None):
        """
        Analyze a video frame to determine optimal ad placement.

        Args:
            frame: Input video frame (numpy array, BGR as produced by OpenCV)
            mask: Segmentation mask (numpy array, optional)

        Returns:
            Dictionary with analysis results including optimal placement
        """
        if self.inferencer is None:
            logger.warning("BAGEL model not loaded. Using fallback analysis.")
            return self._fallback_analysis(frame, mask)
        try:
            # Convert the numpy array to a PIL Image if needed. Frames from the
            # pipeline come from OpenCV in BGR order, so flip channels for PIL.
            if isinstance(frame, np.ndarray):
                rgb_frame = np.ascontiguousarray(frame[:, :, ::-1])  # BGR -> RGB
                pil_image = Image.fromarray(rgb_frame)
            else:
                pil_image = frame

            # Convert the mask to a PIL Image if provided
            mask_image = None
            if mask is not None:
                mask_image = Image.fromarray(mask) if isinstance(mask, np.ndarray) else mask

            # Process with BAGEL
            # Note: this would be replaced with actual BAGEL API calls based on
            # the specific methods provided by the BAGEL inferencer, e.g.:
            #   result = self.inferencer.analyze_image_for_ad_placement(
            #       image=pil_image,
            #       mask=mask_image,
            #   )
            # For now, use the fallback since we don't have the exact API.
            return self._fallback_analysis(frame, mask)
        except Exception as e:
            logger.error(f"Error in BAGEL analysis: {e}")
            return self._fallback_analysis(frame, mask)
    def _fallback_analysis(self, frame, mask=None):
        """
        Fallback analysis when BAGEL is not available or fails.

        Args:
            frame: Input video frame
            mask: Segmentation mask (optional)

        Returns:
            Dictionary with basic analysis results
        """
        height, width = frame.shape[:2]

        # Basic placement logic - right side of frame
        optimal_x = int(width * 0.75)
        optimal_y = int(height * 0.3)

        # If a mask is provided, try to avoid placing the ad over people
        if mask is not None:
            try:
                # Simple heuristic: place the ad next to the largest person
                mask_array = mask if isinstance(mask, np.ndarray) else np.array(mask)
                binary_mask = mask_array > 128

                # OpenCV is imported locally so it is only required on this path
                import cv2
                contours, _ = cv2.findContours(binary_mask.astype(np.uint8),
                                               cv2.RETR_EXTERNAL,
                                               cv2.CHAIN_APPROX_SIMPLE)
                if contours:
                    # Find the bounding box of the largest contour
                    largest_contour = max(contours, key=cv2.contourArea)
                    x, y, w, h = cv2.boundingRect(largest_contour)
                    # Place the ad to the right of the person
                    optimal_x = min(x + w + 20, width - 100)
                    optimal_y = y
            except Exception as e:
                logger.warning(f"Error in mask-based placement: {e}")

        return {
            'optimal_placement': (optimal_x, optimal_y),
            'scene_understanding': 'crowd gathering in urban environment',
            'audience_demographics': 'mixed age group, outdoor activity',
            'recommended_ad_type': 'semi-transparent overlay'
        }
def setup_bagel():
    """
    Set up the BAGEL integration for CrowdFace.

    Returns:
        BAGELWrapper instance
    """
    # Check whether the BAGEL repository exists
    bagel_path = os.path.join(os.path.dirname(os.path.dirname(os.path.dirname(__file__))), 'Bagel')
    if not os.path.exists(bagel_path):
        logger.warning(f"BAGEL repository not found at {bagel_path}")
        logger.warning("Please clone the BAGEL repository: git clone https://github.com/ByteDance-Seed/Bagel.git")
        logger.warning("Using fallback implementation")
    else:
        logger.info(f"Found BAGEL repository at {bagel_path}")

    # Create the wrapper and try to load the model
    wrapper = BAGELWrapper(bagel_path)
    if os.path.exists(bagel_path):
        success = wrapper.load_model()
        if not success:
            logger.warning("Failed to load BAGEL model. Using fallback implementation.")
    return wrapper
"""
CrowdFace Pipeline Implementation
This module provides the core functionality for the CrowdFace system,
which combines SAM2 (Segment Anything Model 2), RVM (Robust Video Matting),
and BAGEL (ByteDance Ad Generation and Embedding Library) for neural-adaptive
crowd segmentation with contextual pixel-space advertisement integration.
"""
import os
import sys
import torch
import numpy as np
import cv2
from PIL import Image
from tqdm import tqdm
import logging
# Configure logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)
class CrowdFacePipeline:
"""
Main pipeline for CrowdFace system that handles segmentation, matting,
and ad placement in videos.
"""
def __init__(self, sam_model=None, sam_processor=None, rvm_model=None, bagel_wrapper=None):
"""
Initialize the CrowdFace pipeline with optional models.
Args:
sam_model: SAM2 model for segmentation
sam_processor: SAM2 processor for input preparation
rvm_model: RVM model for video matting
bagel_wrapper: BAGEL wrapper for ad placement optimization
"""
self.sam_model = sam_model
self.sam_processor = sam_processor
self.rvm_model = rvm_model
self.bagel_wrapper = bagel_wrapper
self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# Initialize state variables for video processing
self.prev_frame = None
self.prev_fgr = None
self.prev_pha = None
self.prev_state = None
logger.info(f"CrowdFace pipeline initialized with device: {self.device}")
logger.info(f"SAM2 model: {'Loaded' if sam_model else 'Not loaded'}")
logger.info(f"RVM model: {'Loaded' if rvm_model else 'Not loaded'}")
logger.info(f"BAGEL integration: {'Available' if bagel_wrapper else 'Not available'}")
    def segment_people(self, frame):
        """
        Segment people in the frame using SAM2, or fall back to a placeholder.

        Args:
            frame: Input video frame (numpy array)

        Returns:
            Binary mask of segmented people (numpy array)
        """
        if self.sam_model is None or self.sam_processor is None:
            # Create a simple placeholder mask for demonstration
            mask = np.zeros((frame.shape[0], frame.shape[1]), dtype=np.uint8)
            # Add a simple ellipse as a "person"
            cv2.ellipse(mask,
                        (frame.shape[1] // 2, frame.shape[0] // 2),
                        (frame.shape[1] // 4, frame.shape[0] // 2),
                        0, 0, 360, 255, -1)
            return mask

        # Convert the frame to RGB if it's in BGR format
        if isinstance(frame, np.ndarray) and frame.shape[2] == 3:
            rgb_frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        else:
            rgb_frame = frame

        # Process the image with SAM
        inputs = self.sam_processor(rgb_frame, return_tensors="pt").to(self.device)

        # Generate automatic mask predictions
        with torch.no_grad():
            outputs = self.sam_model(**inputs)

        # Get the predicted masks
        masks = self.sam_processor.image_processor.post_process_masks(
            outputs.pred_masks.cpu(),
            inputs["original_sizes"].cpu(),
            inputs["reshaped_input_sizes"].cpu()
        )

        # Take the largest mask as a person (simplified approach)
        combined_mask = np.zeros((frame.shape[0], frame.shape[1]), dtype=np.uint8)
        if len(masks) > 0 and len(masks[0]) > 0:
            largest_mask = None
            largest_area = 0
            for mask in masks[0]:
                mask_np = mask.numpy()
                area = np.sum(mask_np)
                if area > largest_area:
                    largest_area = area
                    largest_mask = mask_np
            if largest_mask is not None:
                combined_mask = largest_mask.astype(np.uint8) * 255
        return combined_mask
    def generate_matte(self, frame):
        """
        Generate an alpha matte using RVM, or fall back to segmentation.

        Args:
            frame: Input video frame (numpy array, BGR)

        Returns:
            Alpha matte (numpy array)
        """
        if self.rvm_model is None:
            # Fall back to simple segmentation
            return self.segment_people(frame)
        try:
            # Convert the BGR frame to a normalized RGB tensor of shape (1, 3, H, W)
            rgb_frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            frame_tensor = torch.from_numpy(rgb_frame).float().permute(2, 0, 1).unsqueeze(0) / 255.0
            frame_tensor = frame_tensor.to(self.device)

            # Generate the matte. RVM's forward pass takes the source frame plus
            # four recurrent states and returns the foreground, the alpha matte,
            # and the updated states, which are carried over to the next frame.
            with torch.no_grad():
                fgr, pha, *self.rec = self.rvm_model(frame_tensor, *self.rec)

            # Convert the alpha matte to a uint8 numpy array
            alpha_matte = (pha[0, 0].cpu().numpy() * 255).astype(np.uint8)
            return alpha_matte
        except Exception as e:
            logger.error(f"Error in RVM matting: {e}")
            # Fall back to the segmentation mask
            return self.segment_people(frame)
    def find_ad_placement(self, frame, mask):
        """
        Find suitable locations for ad placement based on segmentation.

        Args:
            frame: Input video frame (numpy array)
            mask: Segmentation mask (numpy array)

        Returns:
            (x, y) coordinates for ad placement
        """
        # Use BAGEL for optimal placement if available
        if self.bagel_wrapper is not None:
            try:
                # Get BAGEL recommendations
                bagel_result = self.bagel_wrapper.analyze_frame(frame, mask)
                # Extract the optimal placement
                if 'optimal_placement' in bagel_result:
                    logger.info(f"Using BAGEL placement: {bagel_result['optimal_placement']}")
                    return bagel_result['optimal_placement']
            except Exception as e:
                logger.error(f"Error in BAGEL ad placement: {e}")
                # Fall through to basic placement

        # Basic placement logic
        binary_mask = (mask > 128).astype(np.uint8)
        contours, _ = cv2.findContours(binary_mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
        if not contours:
            # Default to center-right if no contours are found
            return (frame.shape[1] * 3 // 4, frame.shape[0] // 2)

        largest_contour = max(contours, key=cv2.contourArea)
        x, y, w, h = cv2.boundingRect(largest_contour)
        # Default placement to the right of the person
        ad_x = min(x + w + 20, frame.shape[1] - 100)
        ad_y = y
        return (ad_x, ad_y)
    def place_ad(self, frame, ad_image, position, scale=0.3):
        """
        Place the ad in the frame at the specified position with alpha blending.

        Args:
            frame: Input video frame (numpy array, BGR)
            ad_image: Advertisement image, ideally with an alpha channel (numpy array or PIL Image)
            position: (x, y) coordinates for placement
            scale: Ad height as a fraction of the frame height (0.0-1.0)

        Returns:
            Frame with ad placed (numpy array)
        """
        # Convert ad_image to a numpy array if it's a PIL Image
        if isinstance(ad_image, Image.Image):
            ad_image = np.array(ad_image)
            # PIL images are RGB(A); convert to OpenCV's BGR(A) channel order
            if ad_image.shape[2] == 3:
                ad_image = cv2.cvtColor(ad_image, cv2.COLOR_RGB2BGR)
            elif ad_image.shape[2] == 4:
                ad_image = cv2.cvtColor(ad_image, cv2.COLOR_RGBA2BGRA)

        # Resize the ad, preserving its aspect ratio
        ad_height = int(frame.shape[0] * scale)
        ad_width = int(ad_image.shape[1] * (ad_height / ad_image.shape[0]))
        ad_resized = cv2.resize(ad_image, (ad_width, ad_height))

        # Clamp the position so the ad fits within the frame
        x, y = position
        x = max(0, min(x, frame.shape[1] - ad_width))
        y = max(0, min(y, frame.shape[0] - ad_height))

        # Work on a copy of the frame
        result = frame.copy()

        if ad_resized.shape[2] == 4:
            # Alpha-blend the ad over the region of interest
            alpha = ad_resized[:, :, 3:4] / 255.0
            rgb = ad_resized[:, :, :3]
            roi = result[y:y + ad_height, x:x + ad_width]
            result[y:y + ad_height, x:x + ad_width] = \
                ((1.0 - alpha) * roi + alpha * rgb).astype(np.uint8)
        else:
            # Simple overlay without alpha blending
            result[y:y + ad_height, x:x + ad_width] = ad_resized

        return result
    def process_video(self, frames, ad_image, output_path=None, display_results=True):
        """
        Process video frames with ad placement.

        Args:
            frames: List of video frames (numpy arrays)
            ad_image: Advertisement image with alpha channel (numpy array or PIL Image)
            output_path: Path to save the output video (optional)
            display_results: Whether to display comparison results (currently unused)

        Returns:
            List of processed frames (numpy arrays)
        """
        results = []
        if not frames:
            logger.error("No frames to process")
            return results

        # Reset the RVM recurrent states
        self.rec = [None] * 4

        logger.info(f"Processing {len(frames)} frames")
        ad_position = None
        for i, frame in enumerate(tqdm(frames, desc="Processing frames")):
            # Every 10 frames, re-detect people and recompute the ad placement
            if i % 10 == 0:
                mask = self.generate_matte(frame)
                ad_position = self.find_ad_placement(frame, mask)
                logger.debug(f"Frame {i}: Ad position = {ad_position}")

            # Place the ad
            result_frame = self.place_ad(frame, ad_image, ad_position)
            results.append(result_frame)

        # Save the video if an output path is provided
        if output_path and results:
            height, width = results[0].shape[:2]
            fourcc = cv2.VideoWriter_fourcc(*"mp4v")
            out = cv2.VideoWriter(output_path, fourcc, 30, (width, height))
            for frame in results:
                out.write(frame)
            out.release()
            logger.info(f"Video saved to {output_path}")

        return results
"""
CrowdFace Main Module
This module provides the entry point for the CrowdFace system,
integrating SAM2, RVM, and BAGEL for neural-adaptive crowd segmentation
with contextual pixel-space advertisement integration.
"""
import os
import sys
import argparse
import logging
import cv2
import numpy as np
from pathlib import Path
# Configure logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)
# Add parent directory to path
sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
from python.crowdface_pipeline import CrowdFacePipeline
from python.bagel_loader import setup_bagel
from python.utils import load_video, create_sample_ad, save_video, display_comparison
def load_sam_model(model_path=None):
    """
    Load the SAM2 model.

    Args:
        model_path: Path to the model weights (optional)

    Returns:
        Tuple of (model, processor)
    """
    try:
        import torch
        from transformers import SamModel, SamProcessor

        # Use the default model ID if no path is provided
        model_id = model_path or "facebook/sam2"
        logger.info(f"Loading SAM2 model from {model_id}")

        # Try to get the token from the environment
        token = os.environ.get('HUGGINGFACE_TOKEN')

        # Load the processor and model
        sam_processor = SamProcessor.from_pretrained(model_id, token=token)
        sam_model = SamModel.from_pretrained(model_id, token=token)

        # Move the model to the appropriate device
        device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
        sam_model = sam_model.to(device)

        logger.info("SAM2 model loaded successfully")
        return sam_model, sam_processor
    except Exception as e:
        logger.error(f"Error loading SAM2 model: {e}")
        logger.warning("Will use a placeholder for demonstration purposes")
        return None, None
def load_rvm_model(model_path=None):
    """
    Load the RVM model.

    Args:
        model_path: Path to the model weights (optional)

    Returns:
        RVM model
    """
    try:
        import torch

        # Make the RobustVideoMatting repository importable
        sys.path.append(os.path.join(
            os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__)))),
            'RobustVideoMatting'))
        from model import MattingNetwork

        # Use the default path if none is provided
        if model_path is None:
            model_path = os.path.join(
                os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__)))),
                'rvm_mobilenetv3.pth')
        logger.info(f"Loading RVM model from {model_path}")

        # Check that the model file exists
        if not os.path.exists(model_path):
            logger.error(f"RVM model file not found: {model_path}")
            return None

        # Load the RVM model and its weights
        device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
        rvm_model = MattingNetwork('mobilenetv3').eval().to(device)
        rvm_model.load_state_dict(torch.load(model_path, map_location=device))

        logger.info("RVM model loaded successfully")
        return rvm_model
    except Exception as e:
        logger.error(f"Error loading RVM model: {e}")
        logger.warning("Will use a placeholder for demonstration purposes")
        return None
def main():
    """Main entry point for the CrowdFace system."""
    parser = argparse.ArgumentParser(description='CrowdFace: Neural-Adaptive Crowd Segmentation with Ad Integration')
    parser.add_argument('--input', type=str, help='Input video path')
    parser.add_argument('--output', type=str, default='output.mp4', help='Output video path')
    parser.add_argument('--ad', type=str, help='Advertisement image path')
    parser.add_argument('--max-frames', type=int, default=100, help='Maximum number of frames to process')
    parser.add_argument('--scale', type=float, default=0.3, help='Scale factor for the ad (0.0-1.0)')
    parser.add_argument('--debug', action='store_true', help='Enable debug logging')
    args = parser.parse_args()

    # Set the logging level
    if args.debug:
        logging.getLogger().setLevel(logging.DEBUG)

    # Load the video
    if args.input:
        video_path = args.input
    else:
        # Use the sample video if no input is provided
        video_path = os.path.join(
            os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__)))),
            'sample_video.mp4')
        if not os.path.exists(video_path):
            logger.error(f"No input video provided and sample video not found at {video_path}")
            return 1

    frames = load_video(video_path, max_frames=args.max_frames)
    if not frames:
        logger.error("Failed to load video frames")
        return 1

    # Load or create the ad image
    if args.ad:
        try:
            ad_image = cv2.imread(args.ad, cv2.IMREAD_UNCHANGED)
            if ad_image is None:
                logger.error(f"Failed to load ad image from {args.ad}")
                ad_image = create_sample_ad()
        except Exception as e:
            logger.error(f"Error loading ad image: {e}")
            ad_image = create_sample_ad()
    else:
        ad_image = create_sample_ad()

    # Load the models
    sam_model, sam_processor = load_sam_model()
    rvm_model = load_rvm_model()
    bagel_wrapper = setup_bagel()

    # Initialize the pipeline
    pipeline = CrowdFacePipeline(
        sam_model=sam_model,
        sam_processor=sam_processor,
        rvm_model=rvm_model,
        bagel_wrapper=bagel_wrapper
    )

    # Process the video
    output_path = args.output
    processed_frames = pipeline.process_video(
        frames,
        ad_image,
        output_path=output_path
    )
    if not processed_frames:
        logger.error("Failed to process video")
        return 1

    logger.info(f"Successfully processed {len(processed_frames)} frames")
    logger.info(f"Output saved to {output_path}")
    return 0


if __name__ == "__main__":
    sys.exit(main())
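
Running the Pipeline

With the modules above in place, the system can be run end to end from the project root (the sample file names are illustrative):

python src/python/main.py --input sample_video.mp4 --ad my_ad.png --output output.mp4 --max-frames 100

Omitting --input falls back to sample_video.mp4 in the project root, and omitting --ad generates a sample advertisement image.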