Contributor | Aryan Nanda |
Organization | BeagleBoard.org |
Mentors | Deepak Khatri, Kumar Abhishek, Jason Kridner |
Project | Enhanced Media Experience with AI-Powered Commercial Detection and Replacement |
- Main Code Repository - Gitlab Link
- Mirror Code Repository - Github Link
- Pre GSoC - Project Description and Discussion
- Proposal Link - Commercial Detection and Replacement Proposal
- Project Intro Video - Youtube Link
- Weekly Updates - Forum Thread
- Blogs Site - Biweekly Blogs
- Discord Discussion Link - Discord Project Thread
I have posted detailed weekly updates here.
I have also written detailed blogs on the approaches I used and the errors I encountered here.
Getting into GSoC was one of the goals I aimed to achieve before finishing college. On May 1, I was thrilled to be accepted into my preferred project with a highly respected open-source organization. I will always cherish the hours of research, the extensive documentation I reviewed, and the meaningful discussions I had during that time. I interacted frequently with my mentors and the community throughout the process, which you can see here. These interactions were invaluable, as they helped resolve many potential blockers that could have caused issues later on.
Contributors use this period to connect with their mentors, other GSoC candidates, and the organization as a whole!
- At BeagleBoard.org, we have weekly meetings to discuss the projects and work through everyone's difficulties. Attending these meetings is important, as they are your main channel of communication with mentors and other GSoC contributors. The round-table discussion also resolves most queries, since all the students and mentors offer solutions and guidance. Additionally, we can set up a separate meeting with a mentor if required.
- At BeagleBoard.org, there is also the practice of presenting our projects to the community and anyone else who is interested, so we become familiar with each other's projects, and by that time our own concrete project strategy is ready as well.
Here is my presentation for reference:- Project Presentation
- Lastly, I explored the YouTube-8M dataset and read the BeagleBone AI-64 documentation to give myself a boost before the start of the coding period.
The goal of my GSoC project was to build a system capable of detecting commercials in a media stream and to accelerate it using the C7x DSPs and DL accelerators present in the BeagleBone AI-64.
Dataset collection was itself a huge task, because videos occupy a lot of space and keeping a large dataset of raw videos is not feasible. For a project like "Enhanced Media Experience with AI-Powered Commercial Detection and Replacement," obtaining comprehensive datasets is crucial. However, acquiring large-scale video datasets, particularly for commercial and non-commercial content, presents challenges due to their size and storage requirements.
During the Pre-GSoC period I found that the YouTube-8M dataset stands out as a significant resource for this project. It includes millions of YouTube video IDs with detailed machine-generated annotations across a diverse set of visual entities, including "Television Advertisement," which is particularly relevant here.
Pro-tip: Use the Pre-GSoC period well. Clear out all your doubts and research the details of your project; you will definitely benefit from it later.
Code Snippet of implementation:
counter = 0
for tfrecord_file in tfrecord_files:
    print(f'counter {counter}')
    counter += 1
    print(f'Processing file: {tfrecord_file}')
    # Iterate over each tfrecord in the TFRecord file
    try:
        for record in tf.compat.v1.python_io.tf_record_iterator(tfrecord_file):
            try:
                # Parse the SequenceExample from the binary record data
                seq_example = tf.train.SequenceExample.FromString(record)
                # Extract the video_id and labels
                vid_id = seq_example.context.feature['id'].bytes_list.value[0].decode('UTF-8')
                labels_list = seq_example.context.feature['labels'].int64_list.value
                # Check if label ID 315 is in the labels list (label ID 315 corresponds to "Television Advertisement")
                if 315 in labels_list:
                    # Append video ID and labels to the filtered lists
                    filtered_vid_ids.append(vid_id)
                    filtered_labels.append(labels_list)
                    # Lists to store frame-level and audio-level features
                    rgb_features = []
                    audio_features = []
                    # Iterate over each frame-level feature pair (RGB and audio) in the SequenceExample
                    for rgb_feature, audio_feature in zip(seq_example.feature_lists.feature_list['rgb'].feature,
                                                          seq_example.feature_lists.feature_list['audio'].feature):
                        # Decode the quantized RGB features and audio features
                        decoded_rgb = decode_quantized_features(rgb_feature.bytes_list.value[0])
                        decoded_audio = decode_quantized_features(audio_feature.bytes_list.value[0])
                        rgb_features.append(decoded_rgb)
                        audio_features.append(decoded_audio)
                    # Append frame-level lists to the main filtered lists
                    filtered_rgb.append(rgb_features)
                    filtered_audio.append(audio_features)
                else:
                    # Check if the labels list contains any of the specific labels
                    if any(label in specific_labels for label in labels_list):
                        # Append video ID and labels to the non-filtered lists
                        non_filtered_vid_ids.append(vid_id)
                        non_filtered_labels.append(labels_list)
                        # Lists to store frame-level and audio-level features
                        rgb_features = []
                        audio_features = []
                        # Iterate over each frame-level feature pair (RGB and audio) in the SequenceExample
                        for rgb_feature, audio_feature in zip(seq_example.feature_lists.feature_list['rgb'].feature,
                                                              seq_example.feature_lists.feature_list['audio'].feature):
                            # Decode the quantized RGB features and audio features
                            decoded_rgb = decode_quantized_features(rgb_feature.bytes_list.value[0])
                            decoded_audio = decode_quantized_features(audio_feature.bytes_list.value[0])
                            rgb_features.append(decoded_rgb)
                            audio_features.append(decoded_audio)
                        # Append frame-level lists to the main non-filtered lists
                        non_filtered_rgb.append(rgb_features)
                        non_filtered_audio.append(audio_features)
            except tf.errors.DataLossError as e:
                print(f'Skipping corrupted record in file {tfrecord_file}: {e}')
    except tf.errors.DataLossError as e:
        print(f'Skipping entire file due to corruption: {tfrecord_file}: {e}')
- I have written a very detailed blog on the dataset collection and feature extraction process I used here.
After dataset collection and feature extraction, the shape of the different features is as shown:
- commercialRgb : [4364][119-300][1024]
- commercialAudio : [4364][119-300][128]
- nonCommercialRgb : [4600][109-301][1024]
- nonCommercialAudio : [4600][109-301][128]
Here, the first dimension is the number of videos whose features were extracted. It is the same for the audio and visual features of a given class because both belong to the same set of videos; for example, 4364 means that audio-visual features were extracted from 4364 commercial videos.
The second dimension is the number of frames in each video. The YouTube-8M dataset limits feature extraction to the first 300 seconds of a video at 1 frame per second, so a full-length video has 300 frames (1 per second for 300 seconds).
Why, then, is this dimension variable? Because a video can also be shorter than 300 seconds, in which case it has fewer than 300 frames.
The third dimension is where the features lie: for every frame there is a visual feature vector of length 1024 and an audio feature vector of length 128.
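As a quick sanity check of these shapes (a minimal sketch, assuming the features are held as nested Python lists under the names listed above):

# Nested lengths correspond to the three dimensions described above
print(len(commercialRgb))           # number of commercial videos, e.g. 4364
print(len(commercialRgb[0]))        # frames in the first video (between 119 and 300)
print(len(commercialRgb[0][0]))     # visual feature length per frame: 1024
print(len(commercialAudio[0][0]))   # audio feature length per frame: 128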
- Next, I merged the audio and visual features.
Code snippet of implementation:-
# Function for merging audio-visual features
def merge_visual_audio_features(visual_features, audio_features):
    """
    This function merges the visual and audio features of each video.

    Parameters:
    - visual_features: list of length num_videos, where each element is a list of shape (num_frames_visual[i], 1024).
    - audio_features: list of length num_videos, where each element is a list of shape (num_frames_audio[i], 128).
    - num_frames_visual[i] = num_frames_audio[i]

    Returns:
    - merged_features: list of length num_videos, where each element is a numpy array of shape (num_frames[i], 1152).

    Note:
    - num_videos is the number of audio-visual features = ( len(visual_features) || len(audio_features) ) = 4364
    """
    merged_features = []
    for visual, audio in zip(visual_features, audio_features):
        # Convert each list to numpy arrays
        visual_array = np.array(visual)
        audio_array = np.array(audio)
        # Concatenate visual and audio features for this video along the feature axis
        merged_video_features = np.concatenate((visual_array, audio_array), axis=1)
        merged_features.append(merged_video_features)
    return merged_features
- Now, uniform-length features are generated:
def preprocess_features(max_frames, features):
    """
    Preprocess video features by trimming or padding them to a fixed number of frames.

    Parameters:
    - max_frames (int) - The maximum number of frames each video should have.
    - features (list of ndarray) - A list where each element is a numpy array representing the features of a video. Each array is of shape (number_of_frames, number_Vf).

    Returns:
    - video_data (ndarray) - A numpy array of shape (num_videos, max_frames, number_Vf) containing the preprocessed video data.
    """
    num_videos = len(features)
    number_Vf = features[0].shape[1]
    video_data = np.zeros((num_videos, max_frames, number_Vf))
    # Process each video
    for i, video in enumerate(features):
        length = len(video)
        if length >= max_frames:
            # Trim the video if longer than max_frames
            video_data[i, :, :] = video[:max_frames]
        else:
            # Pad the video with zeros if shorter than max_frames
            video_data[i, :length, :] = video
    return video_data


def convertToNumpyArray(features):
    """
    Convert a list of features to a NumPy array.

    Parameters:
    - features (list) - A list of features to be converted to a NumPy array.

    Returns:
    - numpy.ndarray: A NumPy array containing the elements of the input list.
    """
    return np.array(features)
Pro-tip: Before pushing any Python code, always make sure it is documented with comments and NumPy-style docstrings.
After considering the trade-off between padding and trimming, I chose 150 as the max_frames length. This generated a NumPy array of dimensions [len(features)][max_frames(=150)][1152].
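For reference, this is roughly how the helpers above fit together end to end (a sketch; the commercial/non-commercial variable names are the ones listed earlier, the rest are illustrative):

# Merge the per-video visual (1024-d) and audio (128-d) features into 1152-d frame features
commercial_merged = merge_visual_audio_features(commercialRgb, commercialAudio)
non_commercial_merged = merge_visual_audio_features(nonCommercialRgb, nonCommercialAudio)

# Trim or zero-pad every video to 150 frames and stack each class into one array
X_commercial = preprocess_features(150, commercial_merged)          # shape: (4364, 150, 1152)
X_non_commercial = preprocess_features(150, non_commercial_merged)  # shape: (4600, 150, 1152)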
- Again, I have written a very detailed blog on dataset pre-processing here.
I implemented a number of models so that I could compare the results and select the best one.
The results of all the models are shown below:
Model | Accuracy | True Positives | False Positives | True Negatives | False Negatives |
---|---|---|---|---|---|
Unidirectional LSTMs | 96.28% | 628 | 30 | 667 | 20 |
Bidirectional LSTMs | 97.1% | 648 | 10 | 652 | 35 |
CNNs | 93.97% | 623 | 35 | 646 | 41 |
LSTMs+CNNs | 54.57% | 146 | 512 | 588 | 99 |
Transformers | 96.05% | 640 | 18 | 652 | 35 |
From the results, it is clear that the LSTM models gave the best accuracy. However, I later found out that the BeagleBone AI-64 supports TIDL version 08_02_00_05, and this version of TIDL only supports CNNs, whereas the latest TIDL release, 10_00_04_00, supports transformer layers as well. I therefore had no choice but to use a CNN model.
- Again, I have written a very detailed blog on the model pipeline and training process I used here.
CNNs are typically used for learning spatial patterns, but my problem required understanding sequential meaning. In advertisements, for instance, a sequence often starts with some content and ends with a logo appearing. To capture this sequential aspect, I trained my model on sequences of frames rather than individual frames. Instead of processing each frame independently, I grouped multiple frames together and trained the model on these sets. This approach introduced sequential context, allowing the model to learn from the sequence of frames rather than just one frame at a time. This significantly improved the model’s real-time performance.
Snippet of my model.summary() after fixing the batch size to 1:
Model: "sequential"
_________________________________________________________________
 Layer (type)                     Output Shape              Param #
=================================================================
 reshape (Reshape)                (1, 150, 1152, 1)         0
 conv2d (Conv2D)                  (1, 150, 1152, 32)        320
 max_pooling2d (MaxPooling2D)     (1, 75, 576, 32)          0
 conv2d_1 (Conv2D)                (1, 75, 576, 64)          18496
 max_pooling2d_1 (MaxPooling2D)   (1, 37, 288, 64)          0
 conv2d_2 (Conv2D)                (1, 37, 288, 128)         73856
 max_pooling2d_2 (MaxPooling2D)   (1, 18, 144, 128)         0
 flatten (Flatten)                (1, 331776)               0
 dense (Dense)                    (1, 256)                  84934912
 dropout (Dropout)                (1, 256)                  0
 dense_1 (Dense)                  (1, 128)                  32896
 dropout_1 (Dropout)              (1, 128)                  0
 dense_2 (Dense)                  (1, 1)                    129
=================================================================
Total params: 85060609 (324.48 MB)
Trainable params: 85060609 (324.48 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
For each inference, the model outputs 0 (the sequence of frames is from a non-commercial video) or 1 (the sequence of frames is from a commercial video).
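For reference, here is a minimal Keras sketch that reproduces this architecture (the output shapes and parameter counts match the summary above; the activations, the 'same' padding, and the dropout rates are my assumptions, and the batch dimension is left dynamic here since it is fixed to 1 only later, as described below):

import tensorflow as tf
from tensorflow.keras import layers, models

def build_cnn_model(max_frames=150, num_features=1152):
    # Architecture reconstructed from the model.summary() above; hyperparameters
    # such as activations, padding, and dropout rates are assumptions.
    model = models.Sequential([
        layers.Input(shape=(max_frames, num_features)),
        layers.Reshape((max_frames, num_features, 1)),
        layers.Conv2D(32, (3, 3), padding='same', activation='relu'),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), padding='same', activation='relu'),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(128, (3, 3), padding='same', activation='relu'),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(256, activation='relu'),
        layers.Dropout(0.5),
        layers.Dense(128, activation='relu'),
        layers.Dropout(0.5),
        layers.Dense(1, activation='sigmoid'),  # 0 = non-commercial, 1 = commercial
    ])
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model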
- I have written in detail about the steps of model compilation and inferencing on the BeagleBone AI-64 using the C7x DSPs and DL accelerator, and the errors I faced, in these blogs:-
In summary, after training the model, I fixed its batch size to 1 to avoid the "RuntimeError: Attempting to use a delegate that only supports static-sized tensors with a graph that has dynamic-sized tensors" error while compiling the model.
- The output of cnn_model.get_config() before fixing the batch size to 1.
{'name': 'sequential',
'layers': [{'module': 'keras.layers',
'class_name': 'InputLayer',
'config': {'batch_input_shape': (None, 150, 1152),
'dtype': 'float32',
'sparse': False,
.
.
.
.
- Code implementation to fix the batch size to 1:-
config = cnn_model.get_config()
for layer in config['layers']:
    if 'batch_input_shape' in layer['config']:
        shape = layer['config']['batch_input_shape']
        shape = (1, *shape[1:])
        layer['config']['batch_input_shape'] = shape

# Now, we need to create a new model from the updated config:
cnn_model_new = cnn_model.from_config(config)
cnn_model_new.set_weights(cnn_model.get_weights())
cnn_model_new.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
The output of cnn_model_new.get_config() after fixing the batch size:
{'name': 'sequential',
'layers': [{'module': 'keras.layers',
'class_name': 'InputLayer',
'config': {'batch_input_shape': (1, 150, 1152),
'dtype': 'float32',
'sparse': False,
.
.
.
.
After this, I converted the model to the TFLite runtime format, and then downloaded the model, calibration data, and test data.
converter = tf.lite.TFLiteConverter.from_keras_model(cnn_model_new)
converter.experimental_new_converter = True
tflite_model = converter.convert()

# Save the converted model
with open('cnn_model.tflite', 'wb') as f:
    f.write(tflite_model)
I faced a number of errors during the model compilation step, which I have discussed on Discord, in the forum project thread, and in my blog. This is because the BeagleBone AI-64 supports TIDL SDK version 08_02_00_05, whereas the latest TIDL SDK release is 10_00_04_00; due to this, there were a lot of Python compatibility issues.
My folder structure for model compilation is shown below (the calibration files here are used by the compilation script for model quantization; a sketch of how such files can be generated follows the listing):
.
├── cal_Data
│ ├── cal_1.npy
│ ├── cal_2.npy
│ ├── cal_3.npy
│ ├── cal_4.npy
│ ├── cal_5.npy
├── Model
│   ├── cnn_model.tflite
├── compile.py
├── config.json
├── edgeai-tidl-tools-08_02_00_05.Dockerfile
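For completeness, here is a minimal sketch of how calibration samples like cal_1.npy can be produced from the preprocessed feature arrays (the file and variable names are illustrative, not my exact script, and I assume the 1152-d merged features here):

import numpy as np

# Save a few preprocessed feature samples as calibration data for TIDL quantization.
# Assuming X_test.npy holds arrays of shape (num_samples, 150, 1152); each calibration
# file stores one input tensor with the batch dimension fixed to 1, matching the
# compiled model's input shape.
X_test = np.load('X_test.npy')
for i in range(5):
    sample = np.expand_dims(X_test[i], axis=0).astype(np.float32)  # shape: (1, 150, 1152)
    np.save(f'cal_Data/cal_{i + 1}.npy', sample)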
After a lot of research and interaction on the forum to resolve errors, I finally ran the script and compiled the model successfully. After compilation, the artifacts folder was generated.
The generated runtimes_visualization.svg is shown below:
The directory structure for inferencing on BeagleBone AI-64:
├── Model
│ └── cnn_model.tflite
├── artifacts
│ ├── 26_tidl_io_1.bin
│ ├── 26_tidl_net.bin
│ ├── allowedNode.txt
│ ├── cnn_model.tflite
│ ├── param.yaml
│ └── tempDir
│ ├── 26_calib_raw_data.bin
│ ├── 26_tidl_io_.perf_sim_config.txt
│ ├── 26_tidl_io_.qunat_stats_config.txt
│ ├── 26_tidl_io_1.bin
│ ├── 26_tidl_io__LayerPerChannelMean.bin
│ ├── 26_tidl_io__stats_tool_out.bin
│ ├── 26_tidl_net
│ │ ├── bufinfolog.csv
│ │ ├── bufinfolog.txt
│ │ └── perfSimInfo.bin
│ ├── 26_tidl_net.bin
│ ├── 26_tidl_net.bin.layer_info.txt
│ ├── 26_tidl_net.bin.svg
│ ├── 26_tidl_net.bin_netLog.txt
│ ├── 26_tidl_net.bin_paramDebug.csv
│ ├── graphvizInfo.txt
│ └── runtimes_visualization.svg
├── environment.yml
├── inferencing
│ └── tflite_model_inferencing.ipynb
└── test data
├── X_test.npy
└── y_test.npy
- Code snippet for inferencing on BeagleBone AI-64 with custom_delegate:-
import tensorflow as tf

tflite_model_path = '../Model/cnn_model.tflite'
delegate_lib_path = '/usr/lib/libtidl_tfl_delegate.so'

delegate_options = {
    'artifacts_folder': '../artifacts_folder',
    'tidl_tools_path': '/usr/lib/',
    'import': 'no'
}

# Fall back to CPU-only inference if the TIDL delegate fails to load
tidl_delegate = None
try:
    print('1')
    tidl_delegate = tf.lite.experimental.load_delegate(delegate_lib_path, delegate_options)
    print(tidl_delegate)
except Exception as e:
    print("An error occurred:", str(e))

if tidl_delegate:
    print("I am inside")
    interpreter = tf.lite.Interpreter(model_path=tflite_model_path, experimental_delegates=[tidl_delegate])
else:
    print("I didn't go inside")
    interpreter = tf.lite.Interpreter(model_path=tflite_model_path)

interpreter.allocate_tensors()
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()
print("Input details:", input_details)
print("Output details:", output_details)
I’ve encountered several errors while running this code. I’ve managed to resolve many of them, but I’m currently working on getting the model to execute on the BeagleBone AI-64's onboard DSPs and accelerators with assistance from experts on the e2e forum. I’ve raised an issue there. Link
Before performing inference, there’s an additional step: developing an input pipeline. This involves extracting features from the videos and then processing them into a format that can be fed into the model. The visual feature-extraction pipeline works as follows:
- Initialization: Loads an InceptionV3 model for feature extraction. Loads a TensorFlow Lite model for classification. Loads PCA (Principal Component Analysis) parameters.
- Feature Extraction: Extracts features from video frames using the InceptionV3 model.
- Video Processing: Processes the video in chunks of 150 frames. Extracts features from frames at regular intervals.
- Feature Processing: Applies PCA to reduce feature dimensionality. Quantizes the features to 8 bits.
- Classification: Uses the TensorFlow Lite model to classify the processed features. Returns prediction ranges (start frame, end frame, and predicted label).
The audio feature-extraction pipeline (based on VGGish) works as follows; a rough sketch is given after this list.
- Initialization and Setup: Imports necessary libraries. Defines paths for the input video and output audio.
- VGGish Model Preparation: Creates a frozen graph of the VGGish model if it doesn’t exist.
- Audio Preprocessing: Extracts audio from the video. Preprocesses the audio (resampling, converting to mono if needed).
- Feature Extraction: Computes log-mel spectrogram from audio segments. Uses the VGGish model to extract features from the spectrogram.
- Post-processing: Quantizes the extracted features to 8-bit integers.
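For reference, here is a rough sketch of this audio path, assuming the audio track has already been extracted from the video and that the frozen VGGish graph already exists. It is based on the public AudioSet/VGGish release (the vggish_input module, the frozen-graph filename, and the tensor names come from that release, not from my exact script):

import numpy as np
import librosa
import tensorflow as tf
import vggish_input  # from the public AudioSet/VGGish release

def extract_audio_features(audio_path, frozen_graph_path='vggish_frozen_graph.pb'):
    # Load the extracted audio track as a mono waveform at its native sample rate
    wav_data, sr = librosa.load(audio_path, sr=None, mono=True)

    # Convert the waveform into the log-mel spectrogram examples VGGish expects
    examples = vggish_input.waveform_to_examples(wav_data, sr)

    # Run the frozen VGGish graph to get one 128-d embedding per ~1 s example
    graph = tf.Graph()
    with graph.as_default(), tf.compat.v1.Session(graph=graph) as sess:
        graph_def = tf.compat.v1.GraphDef()
        with tf.io.gfile.GFile(frozen_graph_path, 'rb') as f:
            graph_def.ParseFromString(f.read())
        tf.compat.v1.import_graph_def(graph_def, name='')
        features_tensor = graph.get_tensor_by_name('vggish/input_features:0')
        embedding_tensor = graph.get_tensor_by_name('vggish/embedding:0')
        [embeddings] = sess.run([embedding_tensor],
                                feed_dict={features_tensor: examples})
    return embeddings  # shape: (num_examples, 128)

The resulting 128-d embeddings are then quantized to 8 bits, analogous to the quantize_features helper in the real-time pipeline code further below.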
The video is loaded from its file, and frames are sampled at intervals of 250 ms (i.e., 4 frames per second). After collecting 150 sampled frames (150/4 = 37.5 seconds), inference is performed. The result of the inference is stored in a tuple containing the label, start frame, and end frame at the rate the video is loaded (not at 4 frames per second, to avoid a flickery output video). After performing inference and storing the results for all video frames in chunks of 150 frames, the output video is displayed. When the label for a frame is 0 (non-commercial), the frame is displayed as-is; when the label is 1 (commercial), a black screen is shown instead.
Initially, when I followed this workflow and started performing inference on videos, I realized that extracting audio features was taking significantly longer than extracting visual features. For example, for a 30-second video, visual features were extracted in 30-35 seconds, but audio features took 5 minutes for the same video. Therefore, I decided to exclude the audio features, as the accuracy was similar with or without them, and trained a new CNN model using only visual features (shape: (150, 1024), since 1152 - 128 = 1024). The results below are based on that model.
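Dropping the audio part of the merged features is just a slice over the last axis (a sketch; the variable name is illustrative):

# Keep only the first 1024 visual dimensions of each 1152-d merged frame feature
# (the merge concatenated the visual features first, then the 128-d audio features).
X_visual_only = X_merged[:, :, :1024]  # shape: (num_videos, 150, 1024)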
I tested the complete pipeline on three videos. One was a random commercial video downloaded from YouTube, another was a non-commercial news video, and the last was a mix of commercial and non-commercial content (drama + commercial). (Small chunks of compressed videos are included with the results.)
- Results on a random commercial video:
- Video length: 150 seconds.
- Processing time: 151 seconds.
- Accuracy: 80%
video2.mp4
- Results on a non-commercial video (news):
- Video length: 176 seconds.
- Processing time: 186 seconds
- Accuracy: 80%
video3.mp4
- Results on a mixed video (dramas + commercials):
- Video length: 30 mins
- Processing time: 33-34 mins
- Accuracy: 65-70%
- Here, see how the transition at 1:20 happens when the commercial ends and the drama starts.
video5.mp4
- Above video after post-processing:
video4.mp4
The pipeline consists of two threads. The main thread is responsible for display, while the processing thread captures input from the VideoCapture device and processes it. Frames are continuously added to a queue as they come in. Every second, four frames are selected at 250ms intervals. Once 150 frames have been collected at a rate of 4fps, inference is performed, and the results are displayed by the main thread using the frame queue. Meanwhile, the processing of the next set of frames occurs in the second thread simultaneously.
Code snippet of the implementation of the real-time pipeline (the one shown in the flowchart above):-
import argparse
import cv2
import numpy as np
import tensorflow as tf
from tensorflow.keras.applications.inception_v3 import InceptionV3, preprocess_input
from tensorflow.keras.preprocessing import image
import time
import queue
import threading
import pygame

pygame.mixer.init()
pygame.mixer.music.load('./music/music.mp3')

# Initialize queues for frames and predictions, and an event to signal when a label is available.
q = queue.Queue()
predicted_label = queue.Queue()
label_event = threading.Event()
end_frame = 0


def extract_features_from_frame(frame, model):
    """
    Extract features from a single frame using the InceptionV3 model.

    Parameters:
    - frame (numpy array): The image frame from which to extract features.
    - model (Keras Model): The pre-trained InceptionV3 model for feature extraction.

    Returns:
    - numpy array: Flattened feature vector.
    """
    img = cv2.resize(frame, (299, 299))
    img = img[:, :, ::-1]  # Convert BGR to RGB
    x = image.img_to_array(img)
    x = np.expand_dims(x, axis=0)
    x = preprocess_input(x)
    features = model.predict(x)
    return features.flatten()


def load_pca(pca_mean_path, pca_eigenvals_path, pca_eigenvecs_path):
    """
    Load PCA parameters from saved numpy files.

    Parameters:
    - pca_mean_path (str): Path to the PCA mean vector file.
    - pca_eigenvals_path (str): Path to the PCA eigenvalues file.
    - pca_eigenvecs_path (str): Path to the PCA eigenvectors file.

    Returns:
    - tuple: PCA mean, eigenvalues, and eigenvectors.
    """
    pca_mean = np.load(pca_mean_path)[:, 0]
    pca_eigenvals = np.load(pca_eigenvals_path)[:1024, 0]
    pca_eigenvecs = np.load(pca_eigenvecs_path).T[:, :1024]
    return pca_mean, pca_eigenvals, pca_eigenvecs


def apply_pca(features, pca_mean, pca_eigenvals, pca_eigenvecs):
    """
    Apply PCA to reduce the dimensionality of the feature vectors.

    Parameters:
    - features (numpy array): The feature vector to be reduced.
    - pca_mean (numpy array): The PCA mean vector.
    - pca_eigenvals (numpy array): The PCA eigenvalues.
    - pca_eigenvecs (numpy array): The PCA eigenvectors.

    Returns:
    - numpy array: The PCA-reduced feature vector.
    """
    features = features - pca_mean
    features = features.dot(pca_eigenvecs)
    features /= np.sqrt(pca_eigenvals + 1e-4)
    return features


def quantize_features(features, num_bits=8):
    """
    Quantize the PCA-reduced features to a lower bit representation.

    Parameters:
    - features (numpy array): The PCA-reduced feature vector.
    - num_bits (int): The number of bits for quantization. Default is 8.

    Returns:
    - numpy array: The quantized feature vector.
    """
    num_bins = 2 ** num_bits
    min_val = np.min(features)
    max_val = np.max(features)
    bin_edges = np.linspace(min_val, max_val, num_bins + 1)
    quantized = np.digitize(features, bin_edges[:-1])
    quantized = np.clip(quantized, 0, num_bins - 1)
    return quantized


def run_inference(interpreter, input_data):
    """
    Run inference using a TensorFlow Lite interpreter.

    Parameters:
    - interpreter (tf.lite.Interpreter): The TensorFlow Lite interpreter.
    - input_data (numpy array): The input data for the model.

    Returns:
    - numpy array: The output from the model after inference.
    """
    interpreter.set_tensor(interpreter.get_input_details()[0]['index'], input_data)
    interpreter.invoke()
    output_data = interpreter.get_tensor(interpreter.get_output_details()[0]['index'])
    return output_data


def evaluate_model(interpreter, X_test, threshold=0.5):
    """
    Evaluate the model on the test data to predict the label.

    Parameters:
    - interpreter (tf.lite.Interpreter): The TensorFlow Lite interpreter.
    - X_test (numpy array): The test data for model evaluation.
    - threshold (float): The threshold for binary classification. Default is 0.5.

    Returns:
    - int: The predicted label (0 or 1).
    """
    input_data = np.expand_dims(X_test[0], axis=0).astype(np.float32)
    predicted_output = run_inference(interpreter, input_data)
    predicted_label = (predicted_output >= threshold).astype(int)[0][0]
    return predicted_label


def display_chunk_results(target_fps=20):
    """
    Display the video frames with classification results.

    Parameters:
    - target_fps (int): The target frames per second for display. Default is 20.
    """
    frame_delay = 1 / target_fps
    prev_time = time.time()
    frameCount = 0
    flag = 0
    global end_frame
    end_frame1 = end_frame
    fl = 0
    while True:
        if not label_event.is_set():
            continue
        if q.empty():
            break
        frameCount += 1
        if flag == 0 or frameCount >= end_frame1:
            current_label = predicted_label.get()
            flag = 1
            end_frame1 = end_frame
        frame = q.get()
        if current_label == 1:
            # Apply effect based on prediction
            frame = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            frame = cv2.cvtColor(frame, cv2.COLOR_GRAY2BGR)
            frame[:, :] = 0  # Set frame to black
        current_time = time.time()
        elapsed_time = current_time - prev_time
        sleep_time = frame_delay - elapsed_time
        if current_label == 0:
            cv2.putText(frame, f'Label: Content', (frame.shape[1] - 250, 40), cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 255, 0), 2, cv2.LINE_AA)
            if fl == 1:
                fl = 0
                pygame.mixer.music.stop()
        elif current_label == 1:
            text = 'Commercial'
            (text_width, text_height), _ = cv2.getTextSize(text, cv2.FONT_HERSHEY_SIMPLEX, 1, 2)
            x_position = (frame.shape[1] - text_width) // 2
            y_position = (frame.shape[0] + text_height) // 2
            cv2.putText(frame, text, (x_position, y_position), cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 255, 0), 2, cv2.LINE_AA)
            if fl == 0:
                pygame.mixer.music.play()
                fl = 1
        cv2.imshow('Frame', frame)
        if sleep_time > 0:
            time.sleep(sleep_time)
        prev_time = time.time()
        if cv2.waitKey(1) & 0xFF == ord('q'):
            break
    cv2.destroyAllWindows()


def process_video(args):
    """
    Process the input video, extract features, apply PCA, quantize, and classify frames.

    Parameters:
    - args (argparse.Namespace): Parsed command-line arguments.
    """
    global predicted_label, end_frame
    model = InceptionV3(weights='imagenet', include_top=False, pooling='avg')
    interpreter = tf.lite.Interpreter(model_path=args.model_path)
    interpreter.allocate_tensors()

    # Load PCA components
    pca_mean_path = './pca/mean.npy'
    pca_eigenvals_path = './pca/eigenvals.npy'
    pca_eigenvecs_path = './pca/eigenvecs.npy'
    pca_mean, pca_eigenvals, pca_eigenvecs = load_pca(pca_mean_path, pca_eigenvals_path, pca_eigenvecs_path)

    cap = cv2.VideoCapture(args.video_path)
    chunk_size = 150
    frame_count = 0

    # Determine frame interval for feature extraction
    fps = cap.get(cv2.CAP_PROP_FPS)
    interval_frames = int(fps * (250 / 1000))

    while True:
        frame_features_list = []
        frames_in_chunk = 0
        start_frame = frame_count
        while frames_in_chunk < chunk_size:
            ret, frame = cap.read()
            if not ret:
                break
            q.put(frame)
            # Extract features at regular intervals
            if frame_count % interval_frames == 0:
                features = extract_features_from_frame(frame, model)
                frame_features_list.append(features)
                frames_in_chunk += 1
            frame_count += 1
        if frames_in_chunk < chunk_size:
            break
        # Apply PCA and quantization to extracted features
        features = np.array(frame_features_list)
        reduced_features = apply_pca(features, pca_mean, pca_eigenvals, pca_eigenvecs)
        final_features = quantize_features(reduced_features)
        final_features = final_features / 255
        final_features = np.reshape(final_features, (1, 150, 1024))
        # Classify and store the predicted label
        predicted_label.put(evaluate_model(interpreter, final_features))
        end_frame = frame_count - 1
        label_event.set()
        print(f"Inferencing on frames {start_frame} to {end_frame} done. Prediction: {predicted_label.queue[-1]}")
    cap.release()
    label_event.set()


def main():
    """
    Main function to parse command-line arguments, start video processing and display results.
    """
    parser = argparse.ArgumentParser(description='Process a video and classify frames.')
    parser.add_argument('--model_path', type=str, required=True, help='Path to the TFLite model file')
    parser.add_argument('--video_path', type=str, required=True, help='Path to the video file')
    args = parser.parse_args()

    # Start thread for video processing
    thread = threading.Thread(target=process_video, args=(args,))
    thread.start()

    # Start displaying the video
    display_chunk_results()

    # Wait for the thread to finish
    thread.join()


if __name__ == '__main__':
    main()
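For reference, the script is launched with the model and video paths as arguments, for example: python3 real_time_pipeline.py --model_path ./Model/cnn_model.tflite --video_path ./videos/sample.mp4 (the script and video filenames here are illustrative).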
Laptop: The pipeline is extremely fast, maintaining continuity. It takes 1 minute to process the initial chunk of frames. As soon as the display of the first chunk begins (at >35fps), processing of the next chunk starts concurrently. By the time the first chunk finishes displaying, the second chunk’s processing is complete, and the third chunk’s processing is already underway, ensuring continuous display.
BeagleBone AI-64: The results show significant improvement over the previous approach of processing the entire video before displaying it. The first chunk of frames is ready in 7 minutes. With the display set at 7fps, continuity is maintained, allowing smooth viewing after the initial 7 minutes without any lag.
- In the following demonstrations, video is captured using a VideoCapture device, processed through the pipeline, and then displayed.
Setup_bbai64.mp4
Setup_laptop.mp4
The future of this project is bright. We have a working model that can detect commercials, a working input pipeline to extract audio-visual features, and a working real-time pipeline to perform inference and display the results based on the classification. I mentioned certain challenges we are currently facing in my edgeai-continued blog; I will keep working on solutions for them and on getting my model to run on the BeagleBone AI-64. Once we achieve this, we will have our first custom model that can be inferenced on the BeagleBone AI-64, leveraging its onboard DSPs and accelerators. I will also focus on developing the documentation for the BeagleBone AI-64 and actively contributing to the community.
GSoC has been a great learning experience for me. I’m extremely grateful to everyone who was part of this journey. I would like to thank my mentors Deepak Khatri, Kumar Abhishek, and Jason Kridner; every time I joined a meeting with them, I learnt something new. I’m very grateful to them for giving me the opportunity to work on this project.
I would also like to thank the whole BeagleBoard community for their support and feedback throughout the project. It has been a wonderful time working with the community. I am genuinely grateful for the opportunity to collaborate with such a talented and committed group, and I look forward to working and growing with them in the future.
I would also like to thank the Google Summer of Code team for taking the wonderful initiative and giving me this opportunity to work on this project.
Regards,
Aryan Nanda