For Juan: VoxCeleb
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Requirements\n",
"\n",
"\n",
"### pyannote.database\n",
"\n",
"\n",
"[`pyannote.database`](https://github.com/pyannote/pyannote-database) is a library that provides utility functions to manipulate (mostly audio) corpora, through database plugins. It can be installed with\n",
"\n",
"```bash\n",
"pip install pyannote.database\n",
"```\n",
"\n",
"### pyannote.db.voxceleb\n",
"\n",
"[`pyannote.db.voxceleb`](https://github.com/pyannote/pyannote-db-voxceleb) is the VoxCeleb plugin for pyannote.database. It can be installed with\n",
"\n",
"\n",
"```bash\n",
"pip install pyannote.db.voxceleb \n",
"```\n",
"\n",
"### `database.yml ` configuration file\n",
"\n",
"`pyannote.db.*` plugins do not provide actual data. They just provide manual annotation and define evaluation protocols (e.g. train/dev/test splits). Therefore, one has to tell `pyannote.database` where to look for audio files. This is achieved with the `database.yml` configuration file:\n",
"\n",
"```bash\n",
"cp /vol/work1/bredin/for/juan/database.yml ~/.pyannote/database.yml\n",
"```\n"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Databases:\r\n",
" VoxCeleb:\r\n",
" - /vol/corpora4/voxceleb/voxceleb1/dev/wav/{uri}.wav\r\n",
" - /vol/corpora4/voxceleb/voxceleb1/test/wav/{uri}.wav\r\n",
" - /vol/corpora4/voxceleb/voxceleb2/dev/aac/{uri}.wav\r\n",
" - /vol/corpora4/voxceleb/voxceleb2/test/aac/{uri}.wav\r\n"
]
}
],
"source": [
"!cat /vol/work1/bredin/for/juan/database.yml"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Quick introduction to pyannote.database and VoxCeleb"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"I suggest you start playing with `VoxCeleb1` (which is a smaller database than `VoxCeleb2` but already quite large)."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"from pyannote.database import get_protocol\n",
"protocol = get_protocol('VoxCeleb.SpeakerVerification.VoxCeleb1')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The above `protocol` object has several useful methods. \n",
" \n",
" * `protocol.train()` allows you to iterate over all files in the training set\n",
" * `protocol.development()` does the same for the development set\n",
" * `protocol.test()` does the same for the test set"
]
},
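{
"cell_type": "markdown",
"metadata": {},
"source": [
"Neither `development()` nor `test()` is used later in this notebook, so here is a minimal sketch (not executed here) that peeks at the first file of each, just to show they behave like `train()`:\n",
"\n",
"```python\n",
"# peek at the first file of the development and test subsets\n",
"dev_file = next(iter(protocol.development()))\n",
"test_file = next(iter(protocol.test()))\n",
"print(dev_file['uri'], test_file['uri'])\n",
"```\n"
]
},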
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Gathering training data"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Speaker \"id10092\" speaks between t=0.000s and t=17.680s in file id10092/LbVIZMrQGmQ/00001\n"
]
}
],
"source": [
"for current_file in protocol.train():\n",
"\n",
" # `current_file` is a dictionary whose keys I will describe here:\n",
" \n",
" # key \"uri\" assigns a unique identifier to each file\n",
" # \"uri\" stands for 'uniform resource identifier'\n",
" uri = current_file['uri']\n",
" \n",
"\n",
" # key \"annotation\" provides a `pyannote.core.Annotation` instance\n",
" # that basically tells you \"who speaks when\" in the file\n",
" \n",
" # see pyannote.core documentation for a description of `pyannote.core.Annotation` data structure\n",
" # http://pyannote.github.io/pyannote-core\n",
" \n",
" # in particular, one can use the `itertracks` method to iterate over annotated segments\n",
" # this should speak for itself:\n",
" \n",
" for segment, _, label in current_file['annotation'].itertracks(yield_label=True):\n",
" # each speaker has a unique `label`\n",
" print(f'Speaker \"{label}\" speaks between t={segment.start:.3f}s and t={segment.end:.3f}s in file {uri}')\n",
" \n",
" # for illustration purposes, I stop after the first file -- but there are obvisouly lots of them!\n",
" break"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`current_file` does not contain any information about where the original audio file is located:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'uri': 'id10092/LbVIZMrQGmQ/00001',\n",
" 'database': 'VoxCeleb',\n",
" 'annotation': <pyannote.core.annotation.Annotation at 0x7f230b230908>,\n",
" 'annotated': <Timeline(uri=id10092/LbVIZMrQGmQ/00001, segments=[<Segment(0, 17.6801)>])>}"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"current_file"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We use `FileFinder` to automatically locate the audio file (based on the previously mentioned `database.yml` configuration file):"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"from pyannote.database import FileFinder\n",
"preprocessors = {'audio': FileFinder()}\n",
"protocol = get_protocol('VoxCeleb.SpeakerVerification.VoxCeleb1', preprocessors=preprocessors)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Audio for \"id10092/LbVIZMrQGmQ/00001\" is \"/vol/corpora4/voxceleb/voxceleb1/dev/wav/id10092/LbVIZMrQGmQ/00001.wav\".\n"
]
}
],
"source": [
"for current_file in protocol.train():\n",
" uri = current_file['uri']\n",
" audio = current_file['audio']\n",
" print(f'Audio for \"{uri}\" is \"{audio}\".')\n",
" break"
]
},
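{
"cell_type": "markdown",
"metadata": {},
"source": [
"Once the path is available in `current_file['audio']`, any audio library can load it. Below is a minimal sketch (not executed here) using `scipy.io.wavfile`, which is an assumption on my side -- use whichever audio library you prefer:\n",
"\n",
"```python\n",
"from scipy.io import wavfile\n",
"\n",
"# load the whole waveform as a numpy array\n",
"sample_rate, waveform = wavfile.read(current_file['audio'])\n",
"\n",
"# crop the first annotated segment (pyannote segments are in seconds)\n",
"segment = next(iter(current_file['annotated']))\n",
"chunk = waveform[int(segment.start * sample_rate):int(segment.end * sample_rate)]\n",
"print(sample_rate, chunk.shape)\n",
"```\n"
]
},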
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can have a look at [`SpeechSegmentGenerator`](https://github.com/pyannote/pyannote-audio/blob/814c0d8cbcaca373c7091a21f6bace9cc4299b90/pyannote/audio/embedding/generators.py#L37) to get an idea of how I used this `protocol` object to gather training data."
]
},
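{
"cell_type": "markdown",
"metadata": {},
"source": [
"For a rough idea of what such a generator does, here is a much simplified sketch of my own (not the actual `SpeechSegmentGenerator`): it indexes annotated segments by speaker, then samples random fixed-duration chunks together with their speaker label.\n",
"\n",
"```python\n",
"import random\n",
"\n",
"# index annotated segments by speaker: label --> list of (audio path, segment)\n",
"segments_by_speaker = {}\n",
"for f in protocol.train():\n",
"    for segment, _, label in f['annotation'].itertracks(yield_label=True):\n",
"        segments_by_speaker.setdefault(label, []).append((f['audio'], segment))\n",
"\n",
"def sample_chunk(duration=2.0):\n",
"    # one random (audio path, start time, speaker label) training example\n",
"    label = random.choice(list(segments_by_speaker))\n",
"    audio, segment = random.choice(segments_by_speaker[label])\n",
"    start = random.uniform(segment.start, max(segment.start, segment.end - duration))\n",
"    return audio, start, label\n",
"```\n"
]
},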
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Performing speaker verification experiment\n",
"\n",
"Once an embedding model is trained, you will have to evaluate its performance on the standard speaker verification task.\n",
"\n",
"A speaker verification task is composed of a bunch of trials. \n",
"Each trial consists in deciding whether two files actually corresponds to the same speaker or not."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"id10270/5r0dWxy17C8/00001 vs. id10270/8jEAjG6SegY/00030 same\n",
"id10270/5r0dWxy17C8/00001 vs. id10309/0cYFdtyWVds/00002 different\n",
"id10270/5r0dWxy17C8/00001 vs. id10302/WAbHmvQ9zME/00006 different\n",
"id10270/5r0dWxy17C8/00001 vs. id10270/8jEAjG6SegY/00012 same\n",
"id10270/5r0dWxy17C8/00001 vs. id10270/8jEAjG6SegY/00027 same\n",
"id10270/5r0dWxy17C8/00001 vs. id10270/5r0dWxy17C8/00022 same\n",
"id10270/5r0dWxy17C8/00001 vs. id10292/gm6PJowclv0/00027 different\n",
"id10270/5r0dWxy17C8/00001 vs. id10272/wb6ligRbbZ4/00001 different\n",
"id10270/5r0dWxy17C8/00002 vs. id10270/8jEAjG6SegY/00018 same\n",
"id10270/5r0dWxy17C8/00002 vs. id10307/yUv37vQWmzE/00014 different\n",
"id10270/5r0dWxy17C8/00002 vs. id10309/e-IdJ8a4gy4/00009 different\n",
"id10270/5r0dWxy17C8/00002 vs. id10270/8jEAjG6SegY/00035 same\n"
]
}
],
"source": [
"for t, trial in enumerate(protocol.test_trial()):\n",
" \n",
" # first file\n",
" file1 = trial['file1']\n",
" \n",
" # second file\n",
" file2 = trial['file2'] \n",
" \n",
" # boolean indicating whether this is the same speaker (same == True) \n",
" # or two different speakers (same == False)\n",
" same = trial['reference']\n",
" \n",
" msg = f\"{file1['uri']} vs. {file2['uri']} {'same' if same else 'different'}\"\n",
" print(msg)\n",
" \n",
" # stopping early but they are much more trials than 10!\n",
" if t > 10:\n",
" break "
]
},
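{
"cell_type": "markdown",
"metadata": {},
"source": [
"With a trained embedding model, a common way to score each trial is to compare the embeddings of the two files, e.g. with the cosine distance. The sketch below (not executed here) assumes a hypothetical `embed(file)` function standing in for your model; it only illustrates how trials, scores and labels fit together:\n",
"\n",
"```python\n",
"import numpy as np\n",
"from scipy.spatial.distance import cosine\n",
"\n",
"y_true, distances = [], []\n",
"for trial in protocol.test_trial():\n",
"    # embed() is a hypothetical function returning a fixed-size vector per file\n",
"    emb1, emb2 = embed(trial['file1']), embed(trial['file2'])\n",
"    distances.append(cosine(emb1, emb2))\n",
"    y_true.append(trial['reference'])\n",
"\n",
"y_true, distances = np.array(y_true), np.array(distances)\n",
"```\n"
]
},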
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A standard evaluation metric for speaker verification task is the Equal Error Rate.\n",
"See [det_curve](https://github.com/pyannote/pyannote-metrics/blob/master/pyannote/metrics/binary_classification.py#L38) function to know how to compute it."
]
}
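,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you prefer to compute the EER by hand (reusing the `y_true` and `distances` arrays from the scoring sketch above), here is a minimal recipe based on scikit-learn's `roc_curve` -- an assumption on my side, not part of pyannote:\n",
"\n",
"```python\n",
"import numpy as np\n",
"from sklearn.metrics import roc_curve\n",
"\n",
"# smaller distance means 'same speaker', hence the sign flip to get scores\n",
"fpr, tpr, _ = roc_curve(y_true, -distances)\n",
"fnr = 1 - tpr\n",
"\n",
"# equal error rate: operating point where false positive and false negative rates meet\n",
"eer = fpr[np.nanargmin(np.abs(fnr - fpr))]\n",
"print(f'EER = {100 * eer:.2f}%')\n",
"```\n"
]
}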
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.5"
}
},
"nbformat": 4,
"nbformat_minor": 2
}