Learning Embeddings for Laughter Categorization
UPDATE: This project was deemed successful, and I received a very positive evaluation from my mentors! :-) (you can view it at http://ganesh-srinivas.github.io/gsoc_final_evaluation.pdf)
The main deliverables from this project are machine learning classifiers that can perform laughter detection and categorization: identify if an audio clip contains laughter or not, and categorize the laughter (giggle, baby laugh, chuckle/chortle, snicker, belly laugh).
| Model Architecture | Input Feature | Output pooling | Test set accuracy |
|---|---|---|---|
| Bidirectional LSTM with dropout and batch normalization (Adam optimizer) | VGGish embeddings | None | 67% |
| Bidirectional LSTM with dropout and batch normalization (SGD optimizer) | VGGish embeddings | None | 64% |
| Convolutional Neural Network | Spectrogram of entire 10-second audio clip | None | 43% |
The first deliverable is a hybrid model: a VGGish convolutional network that produces a 128-dimensional embedding for each second of audio, followed by a Bidirectional LSTM with Dropout and Batch Normalization that is trained to classify the embedding sequence. The VGGish model was recently released by the AudioSet team and has the advantage of being pre-trained on millions of audio examples, so its features/embeddings should be more informative than raw audio spectrograms.
This model was trained on about five thousand examples from the laughter categories and five thousand from the remaining 521 categories in the AudioSet ontology. This particular sequential model gave the best performance: 67% top-1 classification accuracy. Its design choices are partly responsible for this performance: a sigmoid loss instead of softmax (to take advantage of multi-labelled examples, as recommended by the AudioSet team), the use of Dropout and Batch Normalization layers, etc.
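To illustrate why a sigmoid loss suits multi-labelled examples, here is a minimal NumPy sketch. It is not the project's training code: the logits, the label vector, and the function names are made up for illustration. It only shows that sigmoid outputs can score two labels highly at once, while softmax forces the classes to share a total probability of 1.

```python
import numpy as np

def sigmoid(z):
    """Independent per-class probabilities (each in (0, 1))."""
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    """Mutually exclusive class probabilities (each row sums to 1)."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def binary_cross_entropy(logits, targets, eps=1e-7):
    """Mean per-class binary cross-entropy on sigmoid outputs."""
    p = np.clip(sigmoid(logits), eps, 1.0 - eps)
    return float(-np.mean(targets * np.log(p) + (1.0 - targets) * np.log(1.0 - p)))

# Six classes as described in this post; the order is arbitrary.
CLASSES = ["giggle", "baby laugh", "chuckle/chortle",
           "snicker", "belly laugh", "none of the above"]

# Hypothetical logits for one clip labelled both "giggle" and "chuckle/chortle".
logits = np.array([[4.0, -3.0, 4.0, -3.0, -3.0, -3.0]])
targets = np.array([[1.0, 0.0, 1.0, 0.0, 0.0, 0.0]])

sig = sigmoid(logits)    # both positive labels can be close to 1
soft = softmax(logits)   # the two positive labels must split the probability mass
loss = binary_cross_entropy(logits, targets)

for name, s, m in zip(CLASSES, sig[0], soft[0]):
    print(f"{name:18s} sigmoid={s:.3f} softmax={m:.3f}")
print(f"multi-label loss: {loss:.4f}")
```

With softmax, a clip carrying two correct labels can never give either label a probability above about one half, which is the limitation the sigmoid formulation avoids.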
This sequential model gives better performance on the AudioSet embeddings than any of the shallow classifiers tried (k-Nearest Neighbors, SVM, Logistic Regression, etc.).
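For comparison, a shallow baseline of this kind could look roughly like the sketch below: a k-Nearest Neighbors classifier on mean-pooled embedding sequences. Several things here are assumptions rather than details from the project: the mean-pooling step, the value of `k`, and the synthetic random data, which stands in for real 10 x 128 VGGish embedding sequences only so that the example runs on its own.

```python
import numpy as np

def mean_pool(sequences):
    """Collapse (n_clips, n_seconds, dim) sequences to (n_clips, dim) vectors."""
    return np.asarray(sequences).mean(axis=1)

def knn_predict(train_x, train_y, query_x, k=3):
    """Predict labels by majority vote among the k nearest training vectors."""
    # Pairwise squared Euclidean distances, shape (n_query, n_train).
    dists = ((query_x[:, None, :] - train_x[None, :, :]) ** 2).sum(axis=-1)
    nearest = np.argsort(dists, axis=1)[:, :k]
    votes = train_y[nearest]  # labels of the k neighbours, shape (n_query, k)
    return np.array([np.bincount(v).argmax() for v in votes])

# Synthetic stand-in data (NOT real AudioSet embeddings): two well-separated
# classes, each clip being 10 seconds of 128-dimensional vectors.
rng = np.random.default_rng(0)
train_seq = np.concatenate([rng.normal(-1.0, 1.0, (20, 10, 128)),
                            rng.normal(1.0, 1.0, (20, 10, 128))])
train_y = np.array([0] * 20 + [1] * 20)
query_seq = np.concatenate([rng.normal(-1.0, 1.0, (5, 10, 128)),
                            rng.normal(1.0, 1.0, (5, 10, 128))])
query_y = np.array([0] * 5 + [1] * 5)

pred = knn_predict(mean_pool(train_seq), train_y, mean_pool(query_seq))
accuracy = float((pred == query_y).mean())
print(f"k-NN accuracy on synthetic data: {accuracy:.2f}")
```

Pooling discards the order of the per-second embeddings, which is one plausible reason a sequential model could do better on the same features.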
The second deliverable is the proposed convolutional neural network that categorizes laughter. More than a dozen experiments were performed to pick good hyperparameters, varying the number and type of layers, the learning rate, the number of training epochs, etc. The best performing model achieved an accuracy of around 42% and was trained on more than 5600 examples chosen evenly from each of the six categories: the five laughter categories and a sixth "none of the above" category with examples from the remaining 521 classes in the AudioSet Ontology.
The following trained models are provided for prediction/inference tasks:
- A Convolutional Neural Network for doing laughter categorization on a 10-second audio clip (43% accuracy)
- A ConvNet for doing laughter categorization on a 1-second audio clip (22.79% accuracy)
- A ConvNet for detecting the presence of laughter on a 10-second clip (90% accuracy)
Visualization scripts are also provided; they produce t-SNE plots from audio spectrogram features and from AudioSet embeddings. t-SNE can be used to understand the structure of the space of laughter sounds.
- A very complex model ("feedforward attention with triplet loss") was initially proposed for this project. It would operate on variable-sized inputs and deal with a large number of classes. Given the difficulties with straightforward softmax/sigmoid loss classifiers on fixed-length clips, I did not implement this model. I am confident that we did not forgo any performance gains.
- Scripts for reading in large audio files and doing segment-wise classification (1-second segments, 10-second segments) haven't been extensively tested for our pure ConvNet models (the script for the best performing hybrid model has been tested a lot). Testing them should take less than one hour.
- The ConvNet filters haven't been visualized, because the ConvNet did not achieve strong results.
- The hybrid model (VGGish + LSTM) hasn't been trained on all of the AudioSet embeddings (2 million+) due to difficulties with the TFRecord data format. One way to solve this issue is to load the embeddings from the thousands of TFRecord files and store them as a pickled dictionary. The dictionary can later be loaded into memory and inputs can be supplied to the model.
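The pickled-dictionary idea from the last point can be sketched as follows. This is a minimal sketch, not code from the project: `build_embedding_cache`, `load_embedding_cache`, and the fake records are hypothetical names and data. Parsing the TFRecord files is deliberately left out; the sketch assumes some reader already yields `(video_id, embedding_sequence)` pairs.

```python
import pickle
import tempfile
from pathlib import Path

def build_embedding_cache(records, cache_path):
    """Collect (video_id, embedding_sequence) pairs into a dict and pickle it.

    `records` can be any iterable of pairs; turning TFRecord files into such
    pairs is assumed to happen elsewhere and is not shown here.
    """
    cache = {video_id: embeddings for video_id, embeddings in records}
    with open(cache_path, "wb") as f:
        pickle.dump(cache, f, protocol=pickle.HIGHEST_PROTOCOL)
    return len(cache)

def load_embedding_cache(cache_path):
    """Load the whole dictionary back into memory in one read."""
    with open(cache_path, "rb") as f:
        return pickle.load(f)

# Fake records standing in for parsed TFRecord examples (hypothetical IDs,
# tiny 2-second x 3-dimension "embeddings" instead of 10 x 128).
fake_records = [
    ("clip_a", [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]),
    ("clip_b", [[0.7, 0.8, 0.9], [1.0, 1.1, 1.2]]),
]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "embeddings.pkl"
    n_saved = build_embedding_cache(fake_records, path)
    restored = load_embedding_cache(path)

print(n_saved, sorted(restored))
```

A single dictionary only works while everything fits in memory; for the full set of 2 million+ embeddings, splitting the cache into several pickle files would be a natural extension.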
Merging into Red Hen's Repositories
This is a standalone project for now, so a merge with other code isn't required. My mentors have informed me that Red Hen is studying options for storing and managing its large number of repositories, and have asked me not to add it to the Red Hen Lab organization on GitHub.
- Audio ML using deep neural networks isn't straightforward. For example, treating the spectrogram as an "image" of a sound doesn't work very well.
- Large datasets (thousands of clips, millions of embeddings) are the bottleneck in learning workflows. With GPUs, the learning phase doesn't take too much time. We still haven't figured out a good way to load the dataset into memory very quickly.
Useful Knowledge Gained
- Pretrained models provide useful features on top of which a classifier can be trained. The AudioSet embeddings are a useful place to start for researchers/programmers with less labelled data and few computational resources.
- Keras and TF-Slim provide APIs that allow one to define a model in very few lines of code. This reduces the chance of introducing bugs when modifying hyperparameters.
- Reddit.com/r/MachineLearning and the audioset-users Google group are good places to ask questions about (audio) ML issues.
- There was a YouTube-8M Kaggle competition a few months ago, and some of the winning submissions dealt with similar sequences of embeddings for video/audio data. Their techniques could be applied to our problem.