
@technosaby
Last active March 20, 2023 20:26
GSOC 2022 [RedHen Lab] Tagging Audio Effects Consolidated Report

Introduction

In GSoC 2022, I worked with the Red Hen Lab. The objective was to develop a machine learning model to tag sound effects in streams of Red Hen's data (for example, police sirens in a news stream). A single stream can contain multiple sound effects, so the model should be able to label them from a set of known sound effects, making this a multi-label classification problem. YamNet is used as the pre-trained model in this project. The video files are first converted into audio files; these are then tagged by YamNet for sound effects, and the results are dumped into different kinds of files so that the tagging can be related back to the video files.

[Figure: multi-label tagging of sound effects in an audio stream]

Description

The project contains several blocks (in the form of scripts) which integrate to annotate tags on Red Hen videos. Each block is described in detail below.

Generation of Audio Files

In the first step, the video files are converted into audio files. This is done using the audio convertor script. The script uses the ffmpeg tool, which handles most common video formats. For the pipeline built around YamNet, we convert all videos to the WAV format with a sampling rate of 16000 Hz because of YamNet's input restrictions.
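
As a rough illustration of this conversion step (a minimal sketch, not the project script itself; the folder names are placeholders), each video can be converted with a single ffmpeg call:

    import subprocess
    from pathlib import Path

    VIDEO_DIR = Path("videos")              # placeholder input folder
    AUDIO_DIR = Path("Output/AudioFiles")   # matches the output layout used later
    AUDIO_DIR.mkdir(parents=True, exist_ok=True)

    for video in VIDEO_DIR.glob("*.mp4"):
        wav_path = AUDIO_DIR / (video.stem + ".wav")
        # -ar 16000: resample to 16 kHz, -ac 1: mono, as expected by YamNet
        subprocess.run(
            ["ffmpeg", "-y", "-i", str(video), "-ar", "16000", "-ac", "1", str(wav_path)],
            check=True,
        )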

Tagging of Audio Files

Audio files contain samples captured over time, which can be represented as waveforms. A waveform can be converted into a log-mel spectrogram, a visual representation of its frequency content over time. The figure below shows an example of a waveform (top) and a log-mel spectrogram (bottom) for a cat sound taken from the AudioSet dataset. The Y axis of the spectrogram is frequency in Hertz, the X axis is time in milliseconds, and the color represents amplitude (in dB), with brighter colors indicating higher amplitude. Several studies have found that computer vision techniques, such as Convolutional Neural Networks (CNNs), can be applied to spectrogram images to train a model that recognizes sounds with similar characteristics.

[Figure: waveform (top) and log-mel spectrogram (bottom) of a cat sound from AudioSet]

The generation of a log-mel spectrogram follows the sequence of steps shown below. The log-mel spectrogram is used as a feature, which is then framed into overlapping examples of fixed length, producing patches. These patches can be used to train a deep learning (CNN-based) model that extracts the dominant audio class per time frame by finding patterns in them. There is also a concept of window framing: a minimum length of input waveform is required to obtain the first frame of output scores.

[Figure: feature calculation steps from waveform to log-mel spectrogram patches]
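
As a sketch of this feature step (using librosa rather than YamNet's internal implementation; the YamNet-like parameters here, 64 mel bands, 25 ms windows, 10 ms hops and roughly 0.96 s patches, are assumptions for illustration):

    import numpy as np
    import librosa

    # Load the converted audio at 16 kHz mono (as produced by the conversion step).
    waveform, sr = librosa.load("example.wav", sr=16000, mono=True)

    # Log-mel spectrogram: 25 ms windows (400 samples) with a 10 ms hop and 64 mel bands.
    mel = librosa.feature.melspectrogram(
        y=waveform, sr=sr, n_fft=400, hop_length=160, n_mels=64
    )
    log_mel = np.log(mel + 1e-6)  # shape: (64 mel bands, num_frames)

    # Frame the spectrogram into overlapping fixed-length patches
    # (96 frames ~ 0.96 s, hopping by 48 frames ~ 0.48 s).
    patch_frames, patch_hop = 96, 48
    patches = [
        log_mel[:, start:start + patch_frames]
        for start in range(0, log_mel.shape[1] - patch_frames + 1, patch_hop)
    ]
    print(f"generated {len(patches)} patches, each of shape (64, {patch_frames})")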

All of this is handled by YamNet: we run the pre-trained model on the converted audio files to generate the audio tags. The tagging is performed by the audio tagging script, and the resulting tags are dumped into two different kinds of files, described below (a rough sketch of the inference step is shown after the list).

  • SFX Files: These files follow Red Hen's standards, where tags are mapped to every frame of audio data. JQ queries can be used to filter the SFX tags.
  • CSV Files: These files contain the tags together with their begin and end times for each frame. The CSV files are consumed by the ELAN tool for annotation, with the sound effects as tiers.
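
The sketch below shows roughly what the tagging step looks like (not the exact project script): the pre-trained YamNet model is loaded from TensorFlow Hub and applied to a 16 kHz waveform. The file name is a placeholder.

    import csv
    import numpy as np
    import soundfile as sf
    import tensorflow_hub as hub

    # Load the pre-trained YamNet model from TensorFlow Hub.
    model = hub.load("https://tfhub.dev/google/yamnet/1")

    # Read the class map shipped with the model (class index -> display name).
    class_map_path = model.class_map_path().numpy().decode("utf-8")
    with open(class_map_path) as f:
        class_names = [row["display_name"] for row in csv.DictReader(f)]

    # Read a 16 kHz mono WAV produced by the conversion step (placeholder name).
    waveform, sr = sf.read("example.wav", dtype="float32")
    assert sr == 16000, "YamNet expects 16 kHz mono input"

    # scores has shape (num_frames, 521): one score per class for each ~0.96 s frame.
    scores, embeddings, log_mel_spectrogram = model(waveform)
    scores = scores.numpy()

    # Report the top class per frame; the CSV/SFX outputs are built from this data.
    for i, frame_scores in enumerate(scores):
        top = int(np.argmax(frame_scores))
        print(f"frame {i}: {class_names[top]} ({frame_scores[top]:.2f})")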

The details and the use of each of these formats are given below.

Annotating the CSV results in ELAN

ELAN is a professional annotation tool for manually and semi-automatically annotating and transcribing audio or video recordings, used by linguistics and gesture researchers all over the world. We import the CSV file generated by the audio tagger into ELAN using the options shown below. This is done through the File -> Import option after the video and its corresponding EAF file (ELAN annotation file) have been loaded.

[Screenshot: ELAN CSV import options]

The first column, which contains the tags, is mapped to Tiers; the start and end times of the tagged audio frames are mapped to Begin Time and End Time, and the score of each tag is imported as an Annotation.

[Screenshot: mapping of CSV columns to ELAN tiers and times]

After the import is done, the audio tags are shown along with the other annotations: each tag becomes a tier, and its score is plotted against the video timeline.

[Screenshot: imported audio tags displayed as tiers on the ELAN timeline]

It is interesting to see how well the tags line up with the video timeline. The speech in the video actually starts at 00:22, and it is detected there with a confidence of 82%.

Parsing the SFX file

The other output format is the SFX file, which is compliant with Red Hen's output formats. A sample SFX file contains a TOP block with the file name, along with other blocks such as COL, UID, SRC, TTL and PID. These data are captured from the .seg files that are sometimes present with the video in the Red Hen database. In addition to these blocks, the file contains the audio tags on a frame-by-frame basis along with their scores. A sample SFX file might look like the one below.

[Figure: sample SFX file]

Sometimes the data in an SFX file can be overwhelming for a researcher to use or understand, so a dedicated SFX file parser was written to parse and filter the tag metadata. The filtering is done through JQ queries, which can be passed to the script as shown in the script documentation. Red Hen data are normally videos organized by year, month and day, so if a folder contains a set of generated SFX files and a researcher wants to filter files between certain dates with a tag filter such as Music or Speech, this can be expressed as a JQ query and the output will be a filtered CSV file.
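
A minimal sketch of the date-range part of such a filter is shown below. It assumes (purely for illustration) a .sfx file extension, a placeholder output root, and a folder tree that mirrors Red Hen's year/month/day layout; the tag filter itself would still be applied through the parser's JQ query.

    from datetime import date
    from pathlib import Path

    # Placeholder root, assumed to mirror Red Hen's layout:
    # <root>/2022/2022-01/2022-01-01/<files>.sfx
    ROOT = Path("sfx_output")
    START, END = date(2022, 1, 1), date(2022, 1, 7)

    selected = []
    for sfx_path in ROOT.glob("*/*/*/*.sfx"):
        # The day-level folder is named YYYY-MM-DD.
        try:
            folder_date = date.fromisoformat(sfx_path.parent.name)
        except ValueError:
            continue
        if START <= folder_date <= END:
            selected.append(sfx_path)

    # Each selected file can then be handed to the SFX parser together with a
    # JQ tag filter (e.g. Music or Speech) to produce the filtered CSV.
    print(f"{len(selected)} SFX files between {START} and {END}")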

Codebook Generator

This is a simple tool which prepares a codebook file for the Red Hen website. A sample codebook file contains the tag display names along with their indices.
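
A minimal sketch of what such a generator might do, assuming YamNet's class map CSV (yamnet_class_map.csv, with index, mid and display_name columns) as input and a hypothetical output file name:

    import csv

    # YamNet's class map maps each class index to a human-readable display name.
    with open("yamnet_class_map.csv") as src, open("codebook.txt", "w") as dst:
        for row in csv.DictReader(src):
            # One codebook line per tag: "<index> <display name>".
            dst.write(f"{row['index']} {row['display_name']}\n")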

Usage

Singularity Environment

The steps below describe how to run the audio tagger on the Case Western Reserve HPC as of August 2022.

  1. Create a folder with the videos to be tagged, or use an existing folder from Red Hen's mount point, and store its path in the variable VIDEO_FILES, e.g. VIDEO_FILES=/mnt/rds/redhen/gallina/tv/2022/2022-01/2022-01-01/. If you plan to generate the tags in SFX files, it is better to have .seg files for your videos; if a .seg file is missing, only the TOP block will be generated along with the audio tags.

  2. Clone the repo on Red Hen's HPC as a scratch user, into the scratch user's home (e.g. /scratch/users/sxg1263/). After cloning you will have a gsoc2022 folder.

  3. Set the variables as below:

     SCRATCH_USER=/scratch/users/$USER
     TOOLS_FOLDER=$SCRATCH_USER/gsoc2022/tagging_audio_effects/tools
     ROOT_FOLDER=$SCRATCH_USER/gsoc2022/tagging_audio_effects
     HOME_FOLDER=$SCRATCH_USER/gsoc2022
    
  4. Load the Singularity module. The current version on the Case Western Reserve HPC is 3.8.1: module load singularity/3.8.1

  5. In the scratch workspace (e.g. /scratch/users/sxg1263/), create the Singularity image from the GitHub Container Registry: singularity pull image.sif docker://ghcr.io/technosaby/gsoc2022-redhen-audio-tagging-stages:1

  6. Create temporary folders for the outputs. The AudioFiles folder will contain the converted audio files, while the TaggedAudioFiles folder will contain the tagged files.

    mkdir Output/
    cd Output || exit
    mkdir AudioFiles
    mkdir TaggedAudioFiles
    
  7. Execute the following command to convert the videos (from $VIDEO_FILES) to audio files in the WAV format (in Output/AudioFiles). This command runs the audio_file_convertor.py script. The script documentation contains a detailed description of the different arguments that can be used with the script.

singularity exec --bind $SCRATCH_USER $SCRATCH_USER/image.sif python3 $TOOLS_FOLDER/audio_file_convertor.py -i $VIDEO_FILES -a "wav" -o $SCRATCH_USER/Output/AudioFiles/

  8. Execute the following command to use the audio files generated in the last step to generate the audio tags in CSV (with confidence >= 0.2) and SFX formats. This runs the tag_audio_effects.py script and writes the tags to the TaggedAudioFiles folder. The script documentation lists several optional arguments which can be configured to customize the outputs.

singularity exec --bind $SCRATCH_USER $SCRATCH_USER/image.sif python3 $ROOT_FOLDER/tag_audio_effects.py -i $SCRATCH_USER/Output/AudioFiles/ -o $SCRATCH_USER/Output/TaggedAudioFiles/ -s 0.2

  9. After the script has run, a TaggedAudioFiles folder containing the tagged audio files will be present in the Output folder. Samples of the generated SFX and CSV files are given in this folder.

  10. You can now copy the tagged files to your HPC home or your PC for analysis using ELAN or JQ.

A script called hpc_script.py contains all of these steps for running in a Singularity container, but it is better to run the steps individually.

Future work

Other audio taggers

It would be valuable to evaluate other audio taggers and compare their results with those of YamNet (used here).

Transfer Learning

Red Hen has a large database of videos (news in different languages), but they are not labelled with audio tags, so transfer learning with this data will be a challenge.

Develop better utility scripts

Many scripts were developed in this project; they could be improved, for example by using CORBA (see Reference).

The detailed backlog is also given in the Project Dashboard.
