In GSoC 2022, I worked with the Red Hen Lab. The objective was to develop a machine learning model to tag sound effects in streams of Red Hen's data (for example, a police siren in a news stream). A single stream can contain multiple sound effects, so the model must label each of them from a set of known sound effects, making this a multi-label classification problem. YamNet is used as the pre-trained model in this project. The video files are first converted into audio files; YamNet then tags the sound effects in them, and the results are written to several kinds of output files to help interpret the tagging on the original videos.
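The multi-label step can be sketched as follows. This is a minimal illustration, not the project's actual code: the class names, scores, and threshold below are mock values (the real pipeline works over YamNet's AudioSet classes and frame-level score output).

```python
import numpy as np

# Mock class vocabulary; YamNet's real vocabulary is much larger.
CLASS_NAMES = ["Siren", "Speech", "Music", "Dog"]

def tag_sound_effects(scores, class_names, threshold=0.5):
    """Multi-label tagging: report every class whose score exceeds
    the threshold in at least one frame of the clip."""
    clip_scores = scores.max(axis=0)  # best score per class across frames
    return [name for name, s in zip(class_names, clip_scores) if s >= threshold]

# Mock frame-level scores for a 3-frame clip over 4 classes.
scores = np.array([
    [0.9, 0.2, 0.1, 0.0],  # frame 1: strong "Siren"
    [0.7, 0.6, 0.1, 0.0],  # frame 2: "Siren" and "Speech" overlap
    [0.1, 0.8, 0.2, 0.0],  # frame 3: strong "Speech"
])

print(tag_sound_effects(scores, CLASS_NAMES))  # ['Siren', 'Speech']
```

Because each class is thresholded independently, overlapping sounds in the same stream can all be reported, which is exactly what distinguishes multi-label tagging from single-label classification.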
The project consists of several blocks, in the form of scripts, that work together to annotate Red Hen videos with these tags.