@artificialsoph
Last active May 1, 2019 14:50
Audio Recognition in Keras
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Brief intro\n",
"\n",
"Tensorflow has a [fantastic tutorial](https://www.tensorflow.org/versions/master/tutorials/audio_recognition) that illustrates how audio recognition works in modern deep learning models. I won't reiterate the details in that post, so I recommend you read it. \n",
"\n",
"Here, I want to demo:\n",
"- How to prep audio data according to the Tensorflow examle using python (the tutorial uses tensorflow-specific code).\n",
"- How to reimplement several of the included models using Keras.\n",
"\n",
"\n",
"## Further reading\n",
"\n",
"- [mfcc intro with good features](http://haythamfayek.com/2016/04/21/speech-processing-for-machine-learning.html)\n",
"- [mfcc explanation with math](http://www.practicalcryptography.com/miscellaneous/machine-learning/guide-mel-frequency-cepstral-coefficients-mfccs/)\n",
"- [tensorlfow tutorial](https://www.tensorflow.org/versions/master/tutorials/audio_recognition) \n",
"- [Python Speech Features](http://python-speech-features.readthedocs.io/en/latest/) the library we'll use today.\n",
"- [Librosa](https://librosa.github.io/librosa/index.html) popular alternative\n",
"- [Kapre](https://github.com/keunwoochoi/kapre) library for computing audio features *within* Keras. great if you want to do this on GPU.\n",
"\n",
"\n",
"## Prep\n",
"\n",
"The data I'll use comes from this kaggle competition. Download the train files, unzip them, and set the directory below. The following blocks grabs all the training files and build a dataframe that organizes them along with their respective labels."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"ExecuteTime": {
"end_time": "2018-03-05T19:56:27.119884Z",
"start_time": "2018-03-05T19:56:25.840989Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Populating the interactive namespace from numpy and matplotlib\n"
]
}
],
"source": [
"%pylab inline\n",
"\n",
"import pandas\n",
"import glob\n",
"import scipy"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"ExecuteTime": {
"end_time": "2018-03-05T20:02:21.177017Z",
"start_time": "2018-03-05T20:02:21.173313Z"
}
},
"outputs": [],
"source": [
"src_path = \"/home/soph/kaggle/tensorflow-src\" #set the path to your data files"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"ExecuteTime": {
"end_time": "2018-03-05T20:03:23.114711Z",
"start_time": "2018-03-05T20:03:22.874218Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"found 64727 files\n"
]
}
],
"source": [
"filenames = glob.glob(src_path + \"/train/**/*.wav\", recursive=True)\n",
"print(f\"found {len(filenames)} files\")"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"ExecuteTime": {
"end_time": "2018-03-05T20:03:26.352185Z",
"start_time": "2018-03-05T20:03:26.337642Z"
}
},
"outputs": [],
"source": [
"ex_df = pandas.DataFrame({\"filename\":filenames})"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"ExecuteTime": {
"end_time": "2018-03-05T20:03:32.741795Z",
"start_time": "2018-03-05T20:03:32.637013Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style>\n",
" .dataframe thead tr:only-child th {\n",
" text-align: right;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: left;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>14591</th>\n",
" <th>1274</th>\n",
" <th>24990</th>\n",
" <th>38870</th>\n",
" <th>49206</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>filename</th>\n",
" <td>/home/soph/kaggle/tensorflow-src/train/audio/t...</td>\n",
" <td>/home/soph/kaggle/tensorflow-src/train/audio/s...</td>\n",
" <td>/home/soph/kaggle/tensorflow-src/train/audio/f...</td>\n",
" <td>/home/soph/kaggle/tensorflow-src/train/audio/s...</td>\n",
" <td>/home/soph/kaggle/tensorflow-src/train/audio/l...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>label</th>\n",
" <td>two</td>\n",
" <td>seven</td>\n",
" <td>five</td>\n",
" <td>stop</td>\n",
" <td>left</td>\n",
" </tr>\n",
" <tr>\n",
" <th>label_i</th>\n",
" <td>7</td>\n",
" <td>0</td>\n",
" <td>12</td>\n",
" <td>18</td>\n",
" <td>23</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" 14591 \\\n",
"filename /home/soph/kaggle/tensorflow-src/train/audio/t... \n",
"label two \n",
"label_i 7 \n",
"\n",
" 1274 \\\n",
"filename /home/soph/kaggle/tensorflow-src/train/audio/s... \n",
"label seven \n",
"label_i 0 \n",
"\n",
" 24990 \\\n",
"filename /home/soph/kaggle/tensorflow-src/train/audio/f... \n",
"label five \n",
"label_i 12 \n",
"\n",
" 38870 \\\n",
"filename /home/soph/kaggle/tensorflow-src/train/audio/s... \n",
"label stop \n",
"label_i 18 \n",
"\n",
" 49206 \n",
"filename /home/soph/kaggle/tensorflow-src/train/audio/l... \n",
"label left \n",
"label_i 23 "
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#build labels\n",
"def get_label(fn):\n",
" return fn.split(\"/\")[-2]\n",
"ex_df[\"label\"] = ex_df.filename.map(get_label)\n",
"\n",
"# build indices\n",
"unique_labels = ex_df.label.unique()\n",
"label_dict = dict(zip(unique_labels, range(len(unique_labels))))\n",
"ex_df[\"label_i\"] = ex_df.label.map(label_dict)\n",
"\n",
"# remove background noise category\n",
"ex_df.label.where(ex_df.label != \"_background_noise_\", inplace=True)\n",
"\n",
"ex_df.sample(5).T"
]
},
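{
"cell_type": "markdown",
"metadata": {},
"source": [
"The label lookup above relies on the directory layout: every clip sits in a folder named after its word. As a minimal sketch of the same idea on a few made-up paths (the paths and the names `label_from_path` and `label_to_index` are hypothetical, not part of the dataset or the cells above), `pathlib` can pull out the parent folder, and sorting the labels first makes the integer indices reproducible across runs:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from pathlib import Path\n",
"\n",
"def label_from_path(path):\n",
"    # the label is the name of the directory that contains the file\n",
"    return Path(path).parent.name\n",
"\n",
"# hypothetical example paths, mimicking the train/audio/<label>/<file>.wav layout\n",
"paths = ['train/audio/yes/a.wav', 'train/audio/no/b.wav', 'train/audio/yes/c.wav']\n",
"labels = [label_from_path(p) for p in paths]\n",
"\n",
"# sort the unique labels so each label always gets the same index\n",
"label_to_index = {name: i for i, name in enumerate(sorted(set(labels)))}\n",
"print(labels, label_to_index)"
]
},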
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"ExecuteTime": {
"end_time": "2018-03-05T20:03:42.609146Z",
"start_time": "2018-03-05T20:03:42.603995Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"array(['_background_noise_', 'bed', 'bird', 'cat', 'dog', 'down', 'eight',\n",
" 'five', 'four', 'go', 'happy', 'house', 'left', 'marvin', 'nine',\n",
" 'no', 'off', 'on', 'one', 'right', 'seven', 'sheila', 'six',\n",
" 'stop', 'three', 'tree', 'two', 'up', 'wow', 'yes', 'zero'],\n",
" dtype=object)"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sort(unique_labels)"
]
},
{
"cell_type": "markdown",
"metadata": {
"ExecuteTime": {
"end_time": "2017-11-18T16:08:50.498638Z",
"start_time": "2017-11-18T16:08:50.483423Z"
}
},
"source": [
"# Initial analysis on limited samples\n",
"\n",
"**Important** You will likely want to reduce `num_ex` if you are on a laptop."
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"ExecuteTime": {
"end_time": "2018-03-05T20:06:07.370345Z",
"start_time": "2018-03-05T20:06:07.328397Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style>\n",
" .dataframe thead tr:only-child th {\n",
" text-align: right;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: left;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>25610</th>\n",
" <th>50262</th>\n",
" <th>41729</th>\n",
" <th>62226</th>\n",
" <th>38065</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>filename</th>\n",
" <td>/home/soph/kaggle/tensorflow-src/train/audio/f...</td>\n",
" <td>/home/soph/kaggle/tensorflow-src/train/audio/o...</td>\n",
" <td>/home/soph/kaggle/tensorflow-src/train/audio/t...</td>\n",
" <td>/home/soph/kaggle/tensorflow-src/train/audio/t...</td>\n",
" <td>/home/soph/kaggle/tensorflow-src/train/audio/s...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>label</th>\n",
" <td>five</td>\n",
" <td>one</td>\n",
" <td>three</td>\n",
" <td>tree</td>\n",
" <td>stop</td>\n",
" </tr>\n",
" <tr>\n",
" <th>label_i</th>\n",
" <td>6</td>\n",
" <td>17</td>\n",
" <td>23</td>\n",
" <td>24</td>\n",
" <td>22</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" 25610 \\\n",
"filename /home/soph/kaggle/tensorflow-src/train/audio/f... \n",
"label five \n",
"label_i 6 \n",
"\n",
" 50262 \\\n",
"filename /home/soph/kaggle/tensorflow-src/train/audio/o... \n",
"label one \n",
"label_i 17 \n",
"\n",
" 41729 \\\n",
"filename /home/soph/kaggle/tensorflow-src/train/audio/t... \n",
"label three \n",
"label_i 23 \n",
"\n",
" 62226 \\\n",
"filename /home/soph/kaggle/tensorflow-src/train/audio/t... \n",
"label tree \n",
"label_i 24 \n",
"\n",
" 38065 \n",
"filename /home/soph/kaggle/tensorflow-src/train/audio/s... \n",
"label stop \n",
"label_i 22 "
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# grab a fixed number of examples\n",
"num_ex = 30000\n",
"include_labels = ['bed', 'bird', 'cat', 'dog', 'down', 'eight',\n",
" 'five', 'four', 'go', 'happy', 'house', 'left', 'marvin', 'nine',\n",
" 'no', 'off', 'on', 'one', 'right', 'seven', 'sheila', 'six',\n",
" 'stop', 'three', 'tree', 'two', 'up', 'wow', 'yes', 'zero']\n",
"small_bool = ex_df.label.isin(include_labels)\n",
"num_ex = min(num_ex, sum(small_bool))\n",
"small_data = ex_df[ex_df.label.isin(include_labels)].sample(num_ex)\n",
"\n",
"label_dict = dict(zip(include_labels, range(len(include_labels))))\n",
"small_data[\"label_i\"] = small_data.label.map(label_dict)\n",
"y = small_data.label_i.as_matrix()\n",
"small_data.head().T"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {
"ExecuteTime": {
"end_time": "2018-03-05T20:06:11.569410Z",
"start_time": "2018-03-05T20:06:11.564106Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"30000"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(small_data)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## MFCC\n",
"\n",
"We'll use Mel-Frequency Cepstrum Coefficients to transform audio files into a format a neural network can understand. "
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {
"ExecuteTime": {
"end_time": "2018-03-05T20:06:58.593306Z",
"start_time": "2018-03-05T20:06:58.437349Z"
}
},
"outputs": [],
"source": [
"import scipy.io.wavfile\n",
"from python_speech_features import mfcc"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {
"ExecuteTime": {
"end_time": "2018-03-05T20:07:04.384381Z",
"start_time": "2018-03-05T20:07:04.333192Z"
}
},
"outputs": [],
"source": [
"figsize(8,4)\n",
"rate, wave = scipy.io.wavfile.read(ex_df.filename[0])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"First, plot the amplitude of the wav file just to get an idea of what it looks like"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {
"ExecuteTime": {
"end_time": "2018-03-05T20:07:05.611964Z",
"start_time": "2018-03-05T20:07:05.458078Z"
}
},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"<matplotlib.figure.Figure at 0x7fa97e6c55c0>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"plt.plot(wave)\n",
"plt.xlabel(\"Sample num → (16kHz)\")\n",
"plt.ylabel(\"Amplitude\");"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, let's plot the MFCC features that we'll be using"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {
"ExecuteTime": {
"end_time": "2018-03-05T20:13:21.658987Z",
"start_time": "2018-03-05T20:13:21.448792Z"
}
},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"<matplotlib.figure.Figure at 0x7fa96e998668>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"mfcc_feat = mfcc(wave, rate, numcep=40, winlen=100/16000, winstep=100/16000, nfilt=40)\n",
"plt.imshow(mfcc_feat.T, cmap=\"binary\", aspect='auto')\n",
"plt.xlabel(\"Time → (each step is 100 samples)\")\n",
"plt.ylabel(\"Cepstrum #\");"
]
},
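{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see why each time step in the plot corresponds to 100 samples: `winlen` and `winstep` are given in seconds, so `100/16000` s at a 16 kHz sample rate is 100 samples per window with no overlap. The sketch below works through that arithmetic with the usual framing formula. `num_frames` is a hypothetical helper written for illustration, and I'm assuming (not verifying here) that `python_speech_features` counts frames the same way."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import math\n",
"\n",
"def num_frames(n_samples, frame_len, frame_step):\n",
"    # hypothetical helper: number of windows needed to cover n_samples,\n",
"    # assuming the last window is padded if it runs past the end\n",
"    if n_samples <= frame_len:\n",
"        return 1\n",
"    return 1 + math.ceil((n_samples - frame_len) / frame_step)\n",
"\n",
"sample_rate = 16000              # samples per second\n",
"frame_len = frame_step = 100     # 100/16000 s, expressed in samples\n",
"frames_per_second = num_frames(sample_rate, frame_len, frame_step)\n",
"print(frames_per_second)"
]
},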
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, let's pre-calculate the MFCCs for each audio file so we don't have to repeat this. It might take a minute or so on a fast computer. Longer on a laptop."
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {
"ExecuteTime": {
"end_time": "2018-03-05T20:16:34.098961Z",
"start_time": "2018-03-05T20:15:02.856911Z"
}
},
"outputs": [],
"source": [
"maxlen = 16000\n",
"num_windows = 160\n",
"num_cep = 40\n",
"\n",
"mfcc_dict = dict(\n",
" numcep=num_cep,\n",
" winlen=1 / num_windows,\n",
" winstep=1 / num_windows,\n",
" nfilt=num_cep)\n",
"num_ex = len(small_data)\n",
"x_mfcc = numpy.zeros((num_ex, num_windows, num_cep, 1))\n",
"for i, fn in enumerate(small_data.filename):\n",
" rate, wave = scipy.io.wavfile.read(fn)\n",
" wave = wave[:maxlen]\n",
" wave = np.pad(wave, (0, maxlen - wave.shape[0]), 'minimum')\n",
" mfcc_feat = mfcc(wave, rate, **mfcc_dict)\n",
" x_mfcc[i, :, :, 0] = mfcc_feat"
]
},
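{
"cell_type": "markdown",
"metadata": {},
"source": [
"The truncate-then-pad step is what guarantees that every clip yields the same number of windows. Here is a small standalone sketch of that step on toy arrays. Note one assumption: this version pads with zeros (`'constant'`), whereas the loop above uses NumPy's `'minimum'` mode. `fix_length` is a hypothetical helper name."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"def fix_length(wave, maxlen):\n",
"    # hypothetical helper: cut clips that are too long, zero-pad clips that are too short\n",
"    wave = np.asarray(wave)[:maxlen]\n",
"    return np.pad(wave, (0, maxlen - wave.shape[0]), 'constant')\n",
"\n",
"short_clip = fix_length(np.ones(5), 8)     # padded up to 8 samples\n",
"long_clip = fix_length(np.arange(20), 8)   # truncated down to 8 samples\n",
"print(short_clip.shape, long_clip.shape)"
]
},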
{
"cell_type": "code",
"execution_count": 121,
"metadata": {
"ExecuteTime": {
"end_time": "2017-11-20T21:18:14.939494Z",
"start_time": "2017-11-20T21:18:14.215119Z"
}
},
"outputs": [],
"source": [
"# x_shift = x_mfcc - (numpy.mean(x_mfcc, axis=1, keepdims=True) + 1e-8)"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {
"ExecuteTime": {
"end_time": "2018-03-05T20:17:30.014828Z",
"start_time": "2018-03-05T20:17:29.878672Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Shape: (30000, 160, 40, 1), (num_ex, num_windows, num_cep, num_chanel)\n"
]
},
{
"data": {
"image/png": "\n",
"text/plain": [
"<matplotlib.figure.Figure at 0x7fa912fc4b00>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"figsize(8,4)\n",
"plt.imshow(x_mfcc[0,:,:,0].T, aspect=\"equal\", cmap=\"binary\")\n",
"print(\"Shape: {}, (num_ex, num_windows, num_cep, num_chanel)\".format(x_mfcc.shape))\n",
"plt.xlabel(\"Time → (each step is 100 samples)\")\n",
"plt.ylabel(\"Cepstrum #\");"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {
"ExecuteTime": {
"end_time": "2018-03-05T20:17:50.887407Z",
"start_time": "2018-03-05T20:17:48.364953Z"
}
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/home/soph/miniconda3/lib/python3.6/site-packages/h5py/__init__.py:34: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.\n",
" from ._conv import register_converters as _register_converters\n",
"Using TensorFlow backend.\n"
]
}
],
"source": [
"from keras.utils.np_utils import to_categorical\n",
"y_sparse = small_data.label_i.as_matrix()\n",
"num_labels = max(y_sparse)+1"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {
"ExecuteTime": {
"end_time": "2018-03-05T20:18:09.207639Z",
"start_time": "2018-03-05T20:18:09.204023Z"
}
},
"outputs": [],
"source": [
"import keras\n",
"import keras.backend as K"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {
"ExecuteTime": {
"end_time": "2018-03-05T20:18:09.959384Z",
"start_time": "2018-03-05T20:18:09.938647Z"
}
},
"outputs": [],
"source": [
"def plot_history(history):\n",
"\n",
" measures = np.unique([m.replace('val_', '') for m in history.history.keys()])\n",
" num_meas = len(measures)\n",
" x = arange(len(history.history[measures[0]]))\n",
" fix, axes = subplots(nrows=num_meas,ncols=1,squeeze=True, sharex=True,figsize=(6,2*num_meas), tight_layout=True)\n",
" if num_meas == 1:\n",
" axes = [axes]\n",
" for i,meas in enumerate(measures):\n",
" axes[i].plot(x, history.history[meas], label=meas)\n",
" if \"val_\"+meas in history.history.keys():\n",
" axes[i].plot(x, history.history[\"val_\"+meas], label=\"val_\"+meas)\n",
" axes[i].legend()\n",
" if meas in [\"acc\", \"top_3\"]:\n",
" axes[i].set_ylim((-0.01,1.01))\n",
" axes[-1].set_xlabel(\"epoch\");"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Model 1: Feedforward\n",
"\n",
"This is a very simple model with one hidden layer. I recommend running a model like this on your data as an initial step for almost any task."
]
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {
"ExecuteTime": {
"end_time": "2018-03-05T20:27:44.319621Z",
"start_time": "2018-03-05T20:27:44.297002Z"
}
},
"outputs": [],
"source": [
"# default values from tensorflow tutorial\n",
"\n",
"dropout_prob = .2\n",
"\n",
"init_stddev = 0.01\n",
"\n",
"x = x_mfcc\n",
"y = y_sparse\n",
"\n",
"\n",
"def top_3(y_true, y_pred):\n",
" return keras.metrics.sparse_top_k_categorical_accuracy(y_true, y_pred, k=3)\n",
"\n",
"\n",
"def my_fit(model_layers):\n",
"\n",
" model = keras.Sequential(model_layers)\n",
"\n",
" model.summary()\n",
"\n",
" model.compile(\n",
" loss='sparse_categorical_crossentropy',\n",
" optimizer='nadam',\n",
" metrics=['accuracy', top_3])\n",
"\n",
" history = model.fit(\n",
" x,\n",
" y,\n",
" epochs=100,\n",
" verbose=1,\n",
" validation_split=.25,\n",
" callbacks=[\n",
" keras.callbacks.EarlyStopping(verbose=1, patience=5),\n",
" keras.callbacks.ReduceLROnPlateau(\n",
" factor=.5, patience=1, cooldown=1, verbose=1)\n",
" ])\n",
"\n",
" plot_history(history)\n",
"\n",
" return model, history"
]
},
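{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `top_3` metric counts a prediction as correct when the true label is among the three highest-scoring classes. As a rough NumPy sketch of that idea (an illustration with a hypothetical `top_k_accuracy` function and made-up scores, not Keras's implementation, and with no special handling of tied scores):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"def top_k_accuracy(y_true, y_prob, k=3):\n",
"    # indices of the k largest scores in each row\n",
"    top_k = np.argsort(y_prob, axis=1)[:, -k:]\n",
"    # a row is a hit if its true label appears among those indices\n",
"    hits = (top_k == np.asarray(y_true)[:, None]).any(axis=1)\n",
"    return hits.mean()\n",
"\n",
"# made-up scores for two examples over four classes\n",
"y_true_demo = [0, 2]\n",
"y_prob_demo = np.array([[0.1, 0.2, 0.3, 0.4],\n",
"                        [0.1, 0.2, 0.3, 0.4]])\n",
"print(top_k_accuracy(y_true_demo, y_prob_demo, k=3))"
]
},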
{
"cell_type": "code",
"execution_count": 50,
"metadata": {
"ExecuteTime": {
"end_time": "2018-03-05T20:27:46.257558Z",
"start_time": "2018-03-05T20:27:46.248184Z"
}
},
"outputs": [],
"source": [
"model_layers1 = [\n",
" keras.layers.InputLayer(input_shape=x[0].shape),\n",
" keras.layers.Flatten(),\n",
" keras.layers.BatchNormalization(),\n",
" keras.layers.Dropout(.5),\n",
" keras.layers.Dense(128, activation=\"elu\"),\n",
" keras.layers.BatchNormalization(),\n",
" keras.layers.Dropout(.5),\n",
" # Classification\n",
" keras.layers.Dense(\n",
" num_labels,\n",
" activation=\"softmax\"),\n",
"]"
]
},
{
"cell_type": "code",
"execution_count": 51,
"metadata": {
"ExecuteTime": {
"end_time": "2018-03-05T20:30:55.795631Z",
"start_time": "2018-03-05T20:27:47.207694Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"_________________________________________________________________\n",
"Layer (type) Output Shape Param # \n",
"=================================================================\n",
"input_7 (InputLayer) (None, 160, 40, 1) 0 \n",
"_________________________________________________________________\n",
"flatten_7 (Flatten) (None, 6400) 0 \n",
"_________________________________________________________________\n",
"batch_normalization_8 (Batch (None, 6400) 25600 \n",
"_________________________________________________________________\n",
"dropout_9 (Dropout) (None, 6400) 0 \n",
"_________________________________________________________________\n",
"dense_9 (Dense) (None, 128) 819328 \n",
"_________________________________________________________________\n",
"batch_normalization_9 (Batch (None, 128) 512 \n",
"_________________________________________________________________\n",
"dropout_10 (Dropout) (None, 128) 0 \n",
"_________________________________________________________________\n",
"dense_10 (Dense) (None, 30) 3870 \n",
"=================================================================\n",
"Total params: 849,310\n",
"Trainable params: 836,254\n",
"Non-trainable params: 13,056\n",
"_________________________________________________________________\n",
"Train on 22500 samples, validate on 7500 samples\n",
"Epoch 1/100\n",
"22500/22500 [==============================] - 6s 286us/step - loss: 2.7086 - acc: 0.2562 - top_3: 0.4809 - val_loss: 1.9437 - val_acc: 0.4495 - val_top_3: 0.6892\n",
"Epoch 2/100\n",
"22500/22500 [==============================] - 6s 260us/step - loss: 2.0300 - acc: 0.4052 - top_3: 0.6604 - val_loss: 1.7610 - val_acc: 0.4940 - val_top_3: 0.7324\n",
"Epoch 3/100\n",
"22500/22500 [==============================] - 6s 259us/step - loss: 1.8258 - acc: 0.4573 - top_3: 0.7116 - val_loss: 1.6269 - val_acc: 0.5327 - val_top_3: 0.7593\n",
"Epoch 4/100\n",
"22500/22500 [==============================] - 6s 259us/step - loss: 1.7082 - acc: 0.4869 - top_3: 0.7414 - val_loss: 1.5519 - val_acc: 0.5560 - val_top_3: 0.7776\n",
"Epoch 5/100\n",
"22500/22500 [==============================] - 6s 261us/step - loss: 1.6037 - acc: 0.5163 - top_3: 0.7652 - val_loss: 1.5119 - val_acc: 0.5680 - val_top_3: 0.7903\n",
"Epoch 6/100\n",
"22500/22500 [==============================] - 6s 261us/step - loss: 1.5339 - acc: 0.5347 - top_3: 0.7821 - val_loss: 1.4679 - val_acc: 0.5745 - val_top_3: 0.7963\n",
"Epoch 7/100\n",
"22500/22500 [==============================] - 6s 258us/step - loss: 1.4880 - acc: 0.5480 - top_3: 0.7891 - val_loss: 1.4532 - val_acc: 0.5781 - val_top_3: 0.7976\n",
"Epoch 8/100\n",
"22500/22500 [==============================] - 6s 259us/step - loss: 1.4309 - acc: 0.5635 - top_3: 0.8041 - val_loss: 1.4284 - val_acc: 0.5855 - val_top_3: 0.8017\n",
"Epoch 9/100\n",
"22500/22500 [==============================] - 6s 259us/step - loss: 1.3802 - acc: 0.5798 - top_3: 0.8172 - val_loss: 1.4037 - val_acc: 0.5868 - val_top_3: 0.8061\n",
"Epoch 10/100\n",
"22500/22500 [==============================] - 6s 259us/step - loss: 1.3521 - acc: 0.5851 - top_3: 0.8199 - val_loss: 1.3990 - val_acc: 0.5960 - val_top_3: 0.8064\n",
"Epoch 11/100\n",
"22500/22500 [==============================] - 6s 261us/step - loss: 1.3054 - acc: 0.6000 - top_3: 0.8296 - val_loss: 1.3865 - val_acc: 0.5971 - val_top_3: 0.8084\n",
"Epoch 12/100\n",
"22500/22500 [==============================] - 6s 256us/step - loss: 1.2818 - acc: 0.6052 - top_3: 0.8359 - val_loss: 1.3761 - val_acc: 0.6004 - val_top_3: 0.8105\n",
"Epoch 13/100\n",
"22500/22500 [==============================] - 6s 259us/step - loss: 1.2588 - acc: 0.6095 - top_3: 0.8419 - val_loss: 1.3716 - val_acc: 0.6005 - val_top_3: 0.8097\n",
"Epoch 14/100\n",
"22500/22500 [==============================] - 6s 256us/step - loss: 1.2324 - acc: 0.6143 - top_3: 0.8458 - val_loss: 1.3675 - val_acc: 0.5997 - val_top_3: 0.8121\n",
"Epoch 15/100\n",
"22500/22500 [==============================] - 6s 260us/step - loss: 1.2066 - acc: 0.6262 - top_3: 0.8502 - val_loss: 1.3654 - val_acc: 0.5983 - val_top_3: 0.8107\n",
"Epoch 16/100\n",
"22500/22500 [==============================] - 6s 259us/step - loss: 1.1980 - acc: 0.6294 - top_3: 0.8534 - val_loss: 1.3421 - val_acc: 0.6101 - val_top_3: 0.8141\n",
"Epoch 17/100\n",
"22500/22500 [==============================] - 6s 259us/step - loss: 1.1620 - acc: 0.6354 - top_3: 0.8576 - val_loss: 1.3569 - val_acc: 0.6072 - val_top_3: 0.8141\n",
"Epoch 18/100\n",
"22464/22500 [============================>.] - ETA: 0s - loss: 1.1571 - acc: 0.6415 - top_3: 0.8627\n",
"Epoch 00018: ReduceLROnPlateau reducing learning rate to 0.0010000000474974513.\n",
"22500/22500 [==============================] - 6s 266us/step - loss: 1.1566 - acc: 0.6416 - top_3: 0.8628 - val_loss: 1.3435 - val_acc: 0.6095 - val_top_3: 0.8192\n",
"Epoch 19/100\n",
"22500/22500 [==============================] - 6s 256us/step - loss: 1.0797 - acc: 0.6588 - top_3: 0.8779 - val_loss: 1.3310 - val_acc: 0.6088 - val_top_3: 0.8181\n",
"Epoch 20/100\n",
"22500/22500 [==============================] - 6s 262us/step - loss: 1.0493 - acc: 0.6690 - top_3: 0.8817 - val_loss: 1.3163 - val_acc: 0.6157 - val_top_3: 0.8219\n",
"Epoch 21/100\n",
"22500/22500 [==============================] - 6s 258us/step - loss: 1.0302 - acc: 0.6776 - top_3: 0.8872 - val_loss: 1.3103 - val_acc: 0.6145 - val_top_3: 0.8245\n",
"Epoch 22/100\n",
"22500/22500 [==============================] - 6s 260us/step - loss: 1.0242 - acc: 0.6767 - top_3: 0.8881 - val_loss: 1.3138 - val_acc: 0.6119 - val_top_3: 0.8231\n",
"Epoch 23/100\n",
"22336/22500 [============================>.] - ETA: 0s - loss: 0.9895 - acc: 0.6868 - top_3: 0.8926\n",
"Epoch 00023: ReduceLROnPlateau reducing learning rate to 0.0005000000237487257.\n",
"22500/22500 [==============================] - 6s 259us/step - loss: 0.9901 - acc: 0.6867 - top_3: 0.8925 - val_loss: 1.3121 - val_acc: 0.6148 - val_top_3: 0.8231\n",
"Epoch 24/100\n",
"22500/22500 [==============================] - 6s 260us/step - loss: 0.9677 - acc: 0.6902 - top_3: 0.8973 - val_loss: 1.3065 - val_acc: 0.6208 - val_top_3: 0.8251\n",
"Epoch 25/100\n",
"22500/22500 [==============================] - 6s 260us/step - loss: 0.9478 - acc: 0.6977 - top_3: 0.9011 - val_loss: 1.3055 - val_acc: 0.6184 - val_top_3: 0.8264\n",
"Epoch 26/100\n",
"22500/22500 [==============================] - 6s 260us/step - loss: 0.9322 - acc: 0.7050 - top_3: 0.9025 - val_loss: 1.3082 - val_acc: 0.6188 - val_top_3: 0.8247\n",
"Epoch 27/100\n",
"22500/22500 [==============================] - 6s 258us/step - loss: 0.9216 - acc: 0.7065 - top_3: 0.9048 - val_loss: 1.3018 - val_acc: 0.6204 - val_top_3: 0.8252\n",
"Epoch 28/100\n",
"22500/22500 [==============================] - 6s 262us/step - loss: 0.9067 - acc: 0.7121 - top_3: 0.9066 - val_loss: 1.3080 - val_acc: 0.6168 - val_top_3: 0.8236\n",
"Epoch 29/100\n",
"22336/22500 [============================>.] - ETA: 0s - loss: 0.9238 - acc: 0.7072 - top_3: 0.9045\n",
"Epoch 00029: ReduceLROnPlateau reducing learning rate to 0.0002500000118743628.\n",
"22500/22500 [==============================] - 6s 258us/step - loss: 0.9241 - acc: 0.7071 - top_3: 0.9045 - val_loss: 1.3050 - val_acc: 0.6175 - val_top_3: 0.8252\n",
"Epoch 30/100\n",
"22500/22500 [==============================] - 6s 262us/step - loss: 0.8991 - acc: 0.7168 - top_3: 0.9080 - val_loss: 1.3086 - val_acc: 0.6181 - val_top_3: 0.8232\n",
"Epoch 31/100\n",
"22496/22500 [============================>.] - ETA: 0s - loss: 0.8972 - acc: 0.7159 - top_3: 0.9071\n",
"Epoch 00031: ReduceLROnPlateau reducing learning rate to 0.0001250000059371814.\n",
"22500/22500 [==============================] - 6s 261us/step - loss: 0.8976 - acc: 0.7159 - top_3: 0.9071 - val_loss: 1.3078 - val_acc: 0.6192 - val_top_3: 0.8240\n",
"Epoch 32/100\n",
"22500/22500 [==============================] - 6s 262us/step - loss: 0.8873 - acc: 0.7188 - top_3: 0.9096 - val_loss: 1.3053 - val_acc: 0.6193 - val_top_3: 0.8235\n",
"Epoch 00032: early stopping\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"/home/soph/miniconda3/lib/python3.6/site-packages/matplotlib/figure.py:2022: UserWarning: This figure includes Axes that are not compatible with tight_layout, so results might be incorrect.\n",
" warnings.warn(\"This figure includes Axes that are not compatible \"\n"
]
},
{
"data": {
"image/png": "\n",
"text/plain": [
"<matplotlib.figure.Figure at 0x7fa86534e828>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"model1, history1 = my_fit(model_layers1)"
]
},
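{
"cell_type": "markdown",
"metadata": {},
"source": [
"The parameter counts in the summary above can be checked by hand. A dense layer has one weight per input-output pair plus one bias per output, and batch normalization keeps four values per feature (a learned scale and offset, plus a moving mean and variance that are not trained). A short sketch of that arithmetic, with hypothetical helper names:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def dense_params(n_in, n_out):\n",
"    # weights plus biases\n",
"    return n_in * n_out + n_out\n",
"\n",
"def batchnorm_params(n_features):\n",
"    # scale and offset (trainable) + moving mean and variance (non-trainable)\n",
"    return 4 * n_features\n",
"\n",
"n_inputs = 160 * 40 * 1  # flattened MFCC image\n",
"total_params = (batchnorm_params(n_inputs) + dense_params(n_inputs, 128)\n",
"                + batchnorm_params(128) + dense_params(128, 30))\n",
"non_trainable_params = 2 * n_inputs + 2 * 128\n",
"print(total_params, non_trainable_params)"
]
},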
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Simple feedforward: ~60% val accuracy"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Convolutional networks\n",
"\n",
"Convolutional networks are the go-to tool for image processing. Here, we'll treat our MFCC features as images. \n",
"\n",
"This network includes 3 convolution layers of the same size. Max pooling serves to step down the dimension size after the first two. Finally, global pooling is used to pool the final convolutional layers across the entire remaining image, giving a summary value for each feature. "
]
},
{
"cell_type": "code",
"execution_count": 56,
"metadata": {
"ExecuteTime": {
"end_time": "2018-03-05T20:36:47.041174Z",
"start_time": "2018-03-05T20:36:47.017720Z"
}
},
"outputs": [],
"source": [
"DROP = .25\n",
"\n",
"model_layers3 = [\n",
" keras.layers.InputLayer(input_shape=x[0].shape),\n",
" keras.layers.BatchNormalization(),\n",
"\n",
" # conv layer 1\n",
" keras.layers.Conv2D(64, 3, padding=\"same\", activation=\"relu\"),\n",
" keras.layers.MaxPool2D(3, 2, padding=\"same\"),\n",
" keras.layers.BatchNormalization(),\n",
" keras.layers.Dropout(DROP),\n",
"\n",
" # Conv layer 2\n",
" keras.layers.Conv2D(128, 3, padding=\"same\", activation=\"relu\"),\n",
" keras.layers.MaxPool2D(3, 2, padding=\"same\"),\n",
" keras.layers.BatchNormalization(),\n",
" keras.layers.Dropout(DROP),\n",
" \n",
" # Conv layer 3\n",
" keras.layers.Conv2D(256, 3, padding=\"same\", activation=\"relu\"),\n",
" keras.layers.GlobalAveragePooling2D(),\n",
" keras.layers.BatchNormalization(),\n",
" keras.layers.Dropout(DROP),\n",
"\n",
" # Hidden Layer\n",
" keras.layers.Dense(128, activation=\"relu\"),\n",
" keras.layers.BatchNormalization(),\n",
" keras.layers.Dropout(DROP),\n",
"\n",
" # Classification\n",
" keras.layers.Dense(num_labels, activation=\"softmax\"),\n",
"]"
]
},
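{
"cell_type": "markdown",
"metadata": {},
"source": [
"To get a feel for the size of this network before training it, the shapes and parameter counts can be worked out by hand. With `padding=\"same\"`, a convolution keeps the spatial size and a stride-2 pooling layer halves it (rounding up). A convolution has one `k x k` kernel per input-output channel pair plus a bias per output channel. The helper names below are hypothetical:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import math\n",
"\n",
"def conv2d_params(kernel, c_in, c_out):\n",
"    # one kernel x kernel filter per (input, output) channel pair, plus biases\n",
"    return kernel * kernel * c_in * c_out + c_out\n",
"\n",
"def same_pool_size(size, stride):\n",
"    # output length of a pooling layer with 'same' padding\n",
"    return math.ceil(size / stride)\n",
"\n",
"spatial = (160, 40)  # (num_windows, num_cep)\n",
"conv_counts = []\n",
"for c_in, c_out, pooled in [(1, 64, True), (64, 128, True), (128, 256, False)]:\n",
"    conv_counts.append(conv2d_params(3, c_in, c_out))\n",
"    if pooled:\n",
"        spatial = tuple(same_pool_size(s, 2) for s in spatial)\n",
"print(conv_counts, spatial)"
]
},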
{
"cell_type": "code",
"execution_count": 57,
"metadata": {
"ExecuteTime": {
"end_time": "2018-03-05T20:51:35.307893Z",
"start_time": "2018-03-05T20:36:47.981633Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"_________________________________________________________________\n",
"Layer (type) Output Shape Param # \n",
"=================================================================\n",
"input_11 (InputLayer) (None, 160, 40, 1) 0 \n",
"_________________________________________________________________\n",
"batch_normalization_22 (Batc (None, 160, 40, 1) 4 \n",
"_________________________________________________________________\n",
"conv2d_8 (Conv2D) (None, 160, 40, 64) 640 \n",
"_________________________________________________________________\n",
"max_pooling2d_6 (MaxPooling2 (None, 80, 20, 64) 0 \n",
"_________________________________________________________________\n",
"batch_normalization_23 (Batc (None, 80, 20, 64) 256 \n",
"_________________________________________________________________\n",
"dropout_19 (Dropout) (None, 80, 20, 64) 0 \n",
"_________________________________________________________________\n",
"conv2d_9 (Conv2D) (None, 80, 20, 128) 73856 \n",
"_________________________________________________________________\n",
"max_pooling2d_7 (MaxPooling2 (None, 40, 10, 128) 0 \n",
"_________________________________________________________________\n",
"batch_normalization_24 (Batc (None, 40, 10, 128) 512 \n",
"_________________________________________________________________\n",
"dropout_20 (Dropout) (None, 40, 10, 128) 0 \n",
"_________________________________________________________________\n",
"conv2d_10 (Conv2D) (None, 40, 10, 256) 295168 \n",
"_________________________________________________________________\n",
"global_average_pooling2d_3 ( (None, 256) 0 \n",
"_________________________________________________________________\n",
"batch_normalization_25 (Batc (None, 256) 1024 \n",
"_________________________________________________________________\n",
"dropout_21 (Dropout) (None, 256) 0 \n",
"_________________________________________________________________\n",
"dense_15 (Dense) (None, 128) 32896 \n",
"_________________________________________________________________\n",
"batch_normalization_26 (Batc (None, 128) 512 \n",
"_________________________________________________________________\n",
"dropout_22 (Dropout) (None, 128) 0 \n",
"_________________________________________________________________\n",
"dense_16 (Dense) (None, 30) 3870 \n",
"=================================================================\n",
"Total params: 408,738\n",
"Trainable params: 407,584\n",
"Non-trainable params: 1,154\n",
"_________________________________________________________________\n",
"Train on 22500 samples, validate on 7500 samples\n",
"Epoch 1/100\n",
"22500/22500 [==============================] - 43s 2ms/step - loss: 2.4186 - acc: 0.3131 - top_3: 0.5460 - val_loss: 4.3830 - val_acc: 0.2092 - val_top_3: 0.3935\n",
"Epoch 2/100\n",
"22500/22500 [==============================] - 42s 2ms/step - loss: 1.1826 - acc: 0.6499 - top_3: 0.8535 - val_loss: 1.3374 - val_acc: 0.6595 - val_top_3: 0.8416\n",
"Epoch 3/100\n",
"22500/22500 [==============================] - 42s 2ms/step - loss: 0.8081 - acc: 0.7630 - top_3: 0.9136 - val_loss: 0.7689 - val_acc: 0.7760 - val_top_3: 0.9201\n",
"Epoch 4/100\n",
"22500/22500 [==============================] - 42s 2ms/step - loss: 0.6664 - acc: 0.8027 - top_3: 0.9336 - val_loss: 0.7322 - val_acc: 0.7996 - val_top_3: 0.9263\n",
"Epoch 5/100\n",
"22500/22500 [==============================] - 42s 2ms/step - loss: 0.5925 - acc: 0.8234 - top_3: 0.9440 - val_loss: 0.6394 - val_acc: 0.8213 - val_top_3: 0.9333\n",
"Epoch 6/100\n",
"22500/22500 [==============================] - 42s 2ms/step - loss: 0.5462 - acc: 0.8379 - top_3: 0.9482 - val_loss: 1.0433 - val_acc: 0.7455 - val_top_3: 0.8921\n",
"Epoch 7/100\n",
"22500/22500 [==============================] - 42s 2ms/step - loss: 0.5056 - acc: 0.8505 - top_3: 0.9508 - val_loss: 0.5928 - val_acc: 0.8261 - val_top_3: 0.9368\n",
"Epoch 8/100\n",
"22500/22500 [==============================] - 42s 2ms/step - loss: 0.4688 - acc: 0.8590 - top_3: 0.9571 - val_loss: 0.4811 - val_acc: 0.8651 - val_top_3: 0.9509\n",
"Epoch 9/100\n",
"22500/22500 [==============================] - 42s 2ms/step - loss: 0.4480 - acc: 0.8682 - top_3: 0.9580 - val_loss: 0.8812 - val_acc: 0.7653 - val_top_3: 0.9111\n",
"Epoch 10/100\n",
"22496/22500 [============================>.] - ETA: 0s - loss: 0.4324 - acc: 0.8731 - top_3: 0.9593\n",
"Epoch 00010: ReduceLROnPlateau reducing learning rate to 0.0010000000474974513.\n",
"22500/22500 [==============================] - 42s 2ms/step - loss: 0.4325 - acc: 0.8731 - top_3: 0.9593 - val_loss: 0.4995 - val_acc: 0.8544 - val_top_3: 0.9513\n",
"Epoch 11/100\n",
"22500/22500 [==============================] - 42s 2ms/step - loss: 0.3631 - acc: 0.8924 - top_3: 0.9675 - val_loss: 0.3059 - val_acc: 0.9143 - val_top_3: 0.9696\n",
"Epoch 12/100\n",
"22500/22500 [==============================] - 42s 2ms/step - loss: 0.3275 - acc: 0.9005 - top_3: 0.9704 - val_loss: 0.3163 - val_acc: 0.9116 - val_top_3: 0.9679\n",
"Epoch 13/100\n",
"22496/22500 [============================>.] - ETA: 0s - loss: 0.3201 - acc: 0.9038 - top_3: 0.9706\n",
"Epoch 00013: ReduceLROnPlateau reducing learning rate to 0.0005000000237487257.\n",
"22500/22500 [==============================] - 42s 2ms/step - loss: 0.3203 - acc: 0.9038 - top_3: 0.9706 - val_loss: 0.3600 - val_acc: 0.8968 - val_top_3: 0.9661\n",
"Epoch 14/100\n",
"22500/22500 [==============================] - 42s 2ms/step - loss: 0.2807 - acc: 0.9180 - top_3: 0.9752 - val_loss: 0.3217 - val_acc: 0.9085 - val_top_3: 0.9671\n",
"Epoch 15/100\n",
"22496/22500 [============================>.] - ETA: 0s - loss: 0.2668 - acc: 0.9195 - top_3: 0.9772\n",
"Epoch 00015: ReduceLROnPlateau reducing learning rate to 0.0002500000118743628.\n",
"22500/22500 [==============================] - 42s 2ms/step - loss: 0.2671 - acc: 0.9194 - top_3: 0.9771 - val_loss: 0.3234 - val_acc: 0.9081 - val_top_3: 0.9679\n",
"Epoch 16/100\n",
"22500/22500 [==============================] - 42s 2ms/step - loss: 0.2509 - acc: 0.9233 - top_3: 0.9784 - val_loss: 0.2813 - val_acc: 0.9228 - val_top_3: 0.9713\n",
"Epoch 17/100\n",
"22500/22500 [==============================] - 42s 2ms/step - loss: 0.2408 - acc: 0.9266 - top_3: 0.9798 - val_loss: 0.2854 - val_acc: 0.9213 - val_top_3: 0.9709\n",
"Epoch 18/100\n",
"22496/22500 [============================>.] - ETA: 0s - loss: 0.2286 - acc: 0.9295 - top_3: 0.9808\n",
"Epoch 00018: ReduceLROnPlateau reducing learning rate to 0.0001250000059371814.\n",
"22500/22500 [==============================] - 42s 2ms/step - loss: 0.2290 - acc: 0.9294 - top_3: 0.9808 - val_loss: 0.3008 - val_acc: 0.9173 - val_top_3: 0.9695\n",
"Epoch 19/100\n",
"22500/22500 [==============================] - 42s 2ms/step - loss: 0.2226 - acc: 0.9329 - top_3: 0.9818 - val_loss: 0.2832 - val_acc: 0.9225 - val_top_3: 0.9712\n",
"Epoch 20/100\n",
"22496/22500 [============================>.] - ETA: 0s - loss: 0.2189 - acc: 0.9351 - top_3: 0.9822\n",
"Epoch 00020: ReduceLROnPlateau reducing learning rate to 6.25000029685907e-05.\n",
"22500/22500 [==============================] - 42s 2ms/step - loss: 0.2189 - acc: 0.9351 - top_3: 0.9822 - val_loss: 0.2852 - val_acc: 0.9217 - val_top_3: 0.9712\n",
"Epoch 21/100\n",
"22500/22500 [==============================] - 42s 2ms/step - loss: 0.2131 - acc: 0.9358 - top_3: 0.9822 - val_loss: 0.2819 - val_acc: 0.9229 - val_top_3: 0.9713\n",
"Epoch 00021: early stopping\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"/home/soph/miniconda3/lib/python3.6/site-packages/matplotlib/figure.py:2022: UserWarning: This figure includes Axes that are not compatible with tight_layout, so results might be incorrect.\n",
" warnings.warn(\"This figure includes Axes that are not compatible \"\n"
]
},
{
"data": {
"image/png": "\n",
"text/plain": [
"<matplotlib.figure.Figure at 0x7fa85f2079b0>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"model3, history3 = my_fit(model_layers3)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"3-layer CNN: over 92% validation accuracy (val_acc 0.9229 at epoch 21, when early stopping ended training)!"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.3"
},
"toc": {
"nav_menu": {},
"number_sections": true,
"sideBar": true,
"skip_h1_title": false,
"toc_cell": false,
"toc_position": {},
"toc_section_display": "block",
"toc_window_display": false
}
},
"nbformat": 4,
"nbformat_minor": 2
}
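The final training log shows the learning rate being halved five times (epochs 10, 13, 15, 18 and 20) and training stopping early at epoch 21. As a rough check of how those two callbacks interact, the sketch below replays the logged `val_loss` values through a simplified plateau rule in plain Python. It is not the Keras implementation, and the settings are assumptions inferred from the log rather than taken from the notebook source: an initial learning rate of 0.002, a reduction factor of 0.5, a patience of 2 epochs for the learning-rate schedule, and a patience of 5 epochs for early stopping. The function name `replay` is hypothetical.

```python
# val_loss per epoch, copied from the training log (epochs 1 to 21).
VAL_LOSS = [
    4.3830, 1.3374, 0.7689, 0.7322, 0.6394, 1.0433, 0.5928,
    0.4811, 0.8812, 0.4995, 0.3059, 0.3163, 0.3600, 0.3217,
    0.3234, 0.2813, 0.2854, 0.3008, 0.2832, 0.2852, 0.2819,
]


def replay(val_loss, lr=0.002, factor=0.5, lr_patience=2,
           stop_patience=5, min_delta=1e-4):
    """Replay a simplified reduce-on-plateau and early-stopping rule.

    All default values are assumptions inferred from the log.
    Returns (list of (epoch, new_lr) reductions, stopping epoch or None).
    """
    best_for_lr, lr_wait = float("inf"), 0
    best_for_stop, stop_wait = float("inf"), 0
    reductions = []

    for epoch, loss in enumerate(val_loss, start=1):
        # Learning-rate schedule: count epochs without a clear improvement
        # and scale the rate down once the count reaches lr_patience.
        if loss < best_for_lr - min_delta:
            best_for_lr, lr_wait = loss, 0
        else:
            lr_wait += 1
            if lr_wait >= lr_patience:
                lr *= factor
                reductions.append((epoch, lr))
                lr_wait = 0

        # Early stopping: stop after stop_patience epochs in a row
        # without a new best val_loss.
        if loss < best_for_stop:
            best_for_stop, stop_wait = loss, 0
        else:
            stop_wait += 1
            if stop_wait >= stop_patience:
                return reductions, epoch

    return reductions, None


reductions, stopped_at = replay(VAL_LOSS)
for epoch, new_lr in reductions:
    print(f"epoch {epoch}: learning rate reduced to {new_lr:g}")
print(f"stopped at epoch {stopped_at}")
```

Under these assumed settings the replay reduces the learning rate at the same epochs as the log and stops at epoch 21, five epochs after the best `val_loss` of 0.2813 at epoch 16. That agreement suggests the inferred patience values are plausible, but it does not confirm what `my_fit` actually passes to its callbacks.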