@hbredin
Last active March 19, 2024 04:16
rich-transcription.ipynb
{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"provenance": [],
"collapsed_sections": [],
"include_colab_link": true
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
},
"language_info": {
"name": "python"
},
"accelerator": "GPU",
"gpuClass": "standard"
},
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "view-in-github",
"colab_type": "text"
},
"source": [
"<a href=\"https://colab.research.google.com/gist/hbredin/049f2b629700bcea71324d2c1e7f8337/rich-transcription.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
]
},
{
"cell_type": "markdown",
"source": [
"# Rich transcription with OpenAI [Whisper](https://github.com/openai/whisper/) and [pyannote.audio](https://github.com/pyannote/pyannote-audio)"
],
"metadata": {
"id": "gTpyDU0Ob1gX"
}
},
{
"cell_type": "markdown",
"source": [
"## Installation"
],
"metadata": {
"id": "UfhHveF0dInb"
}
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "JXENoRuVUB8U"
},
"outputs": [],
"source": [
"# speechbrain (used for speaker embedding)\n",
"!pip install -qq torch==1.11.0 torchvision==0.12.0 torchaudio==0.11.0 torchtext==0.12.0\n",
"!pip install -qq speechbrain==0.5.12\n",
"\n",
"# pyannote.audio (used for speaker diarization)\n",
"!pip install -qq pyannote.audio==2.1.1\n",
"\n",
"# OpenAI whisper (used for automatic speech recognition)\n",
"!pip install -qq git+https://github.com/openai/whisper.git "
]
},
{
"cell_type": "markdown",
"source": [
"## File upload"
],
"metadata": {
"id": "HW8wu1fVdaSq"
}
},
{
"cell_type": "code",
"source": [
"# upload an audio file (might not work for large files)\n",
"import google.colab\n",
"# upload() returns a dict keyed by file name; keep the name of the first uploaded file\n",
"audio_file = list(google.colab.files.upload())[0]"
],
"metadata": {
"id": "buPNetoUVvP3"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"## Speaker diarization\n",
"\n",
"* Visit [hf.co/pyannote/speaker-diarization](https://hf.co/pyannote/speaker-diarization) and accept the user conditions\n",
"* Visit [hf.co/pyannote/segmentation](https://hf.co/pyannote/segmentation) and accept the user conditions\n"
],
"metadata": {
"id": "cXqEufVucxd5"
}
},
{
"cell_type": "code",
"source": [
"# log in on Huggingface hub (where pretrained pyannote models are hosted)\n",
"from huggingface_hub import notebook_login\n",
"notebook_login()"
],
"metadata": {
"id": "GhndY4WYZaYI"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# load the pretrained pyannote.audio speaker diarization pipeline\n",
"from pyannote.audio import Pipeline\n",
"speaker_diarization = Pipeline.from_pretrained(\n",
"    \"pyannote/speaker-diarization@2.1\",\n",
"    use_auth_token=True)"
],
"metadata": {
"id": "CFgecc6KZL3y"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# apply speaker diarization\n",
"who_speaks_when = speaker_diarization(\n",
"    audio_file,\n",
"    num_speakers=None,  # these values can be\n",
"    min_speakers=None,  # provided by the user\n",
"    max_speakers=None)  # when they are known"
],
"metadata": {
"id": "1JHnwrejbLNP"
},
"execution_count": 8,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# reset notebook visualization (including start time, end time and speaker colors)\n",
"from pyannote.core import notebook\n",
"notebook.reset()\n",
"\n",
"# uncomment the two lines below to visualize only the first minute of the file\n",
"#from pyannote.core import Segment\n",
"#notebook.crop = Segment(0, 60)"
],
"metadata": {
"id": "dd5IWK-PKrKV"
},
"execution_count": 9,
"outputs": []
},
{
"cell_type": "code",
"source": [
"who_speaks_when"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 205
},
"id": "ZkS2XVPdbXiS",
"outputId": "6e1bed85-92f9-4311-970c-56662364080d"
},
"execution_count": 10,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"<pyannote.core.annotation.Annotation at 0x7f502dd50a50>"
],
"image/png": "\n"
},
"metadata": {},
"execution_count": 10
}
]
},
{
"cell_type": "code",
"source": [
"# rename speakers if you know their names\n",
"who_speaks_when = who_speaks_when.rename_labels({\"SPEAKER_06\": \"Sheldon\", \"SPEAKER_05\": \"Leonard\"})\n",
"who_speaks_when"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 205
},
"id": "AnvT_sfKIHOq",
"outputId": "c68b74c2-6555-462b-ba4a-869c86e160c6"
},
"execution_count": 11,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"<pyannote.core.annotation.Annotation at 0x7f502c4b8250>"
],
"image/png": "\n"
},
"metadata": {},
"execution_count": 11
}
]
},
{
"cell_type": "markdown",
"source": [
"## Transcription"
],
"metadata": {
"id": "6orsP5ymc1Je"
}
},
{
"cell_type": "code",
"source": [
"# load OpenAI Whisper automatic speech recognition model\n",
"import whisper\n",
"\n",
"# choose among \"tiny\", \"base\", \"small\", \"medium\", \"large\"\n",
"# see https://github.com/openai/whisper/\n",
"model = whisper.load_model(\"small\") "
],
"metadata": {
"id": "u36cv-C5bbMj"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# transcribe the first minute, one speaker turn at a time\n",
"from pyannote.core import Segment\n",
"first_minute = Segment(0, 60)\n",
"\n",
"# load audio as 16 kHz mono, the input format Whisper expects\n",
"from pyannote.audio import Audio\n",
"audio = Audio(sample_rate=16000, mono=True)\n",
"\n",
"for segment, _, speaker in who_speaks_when.crop(first_minute).itertracks(yield_label=True):\n",
"    # extract the waveform of this speaker turn and transcribe it on its own\n",
"    waveform, sample_rate = audio.crop(audio_file, segment)\n",
"    text = model.transcribe(waveform.squeeze().numpy())[\"text\"]\n",
"    print(f\"{segment.start:06.1f}s {segment.end:06.1f}s {speaker}: {text}\")"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "SLNMSx0GczfU",
"outputId": "ef6d556d-4e9e-43b4-daed-7aa4986929ca"
},
"execution_count": 12,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"0001.4s 0012.9s Sheldon: So if a photon is directed through a plane with two slits in it and either slip is observed, it will not go through both slits. If it's unobserved, it will. However, if it's observed after it's left the plane but before it hits its target, it will not have gone through both slits.\n",
"0006.7s 0006.7s SPEAKER_03: \n",
"0012.9s 0013.4s Leonard: Agreed.\n",
"0014.0s 0014.5s Leonard: What's your point?\n",
"0014.5s 0017.1s Sheldon: There's no point. I just think it's a good idea for a t-shirt.\n",
"0023.5s 0024.8s Leonard: Excuse me. We are on weather alert.\n",
"0029.2s 0036.8s Leonard: One across is a G in, eight down is Novikov, 26 across is MCM, 14 down is, I'm your finger.\n",
"0037.9s 0040.3s Leonard: which makes 14 across pot of prints.\n",
"0042.0s 0045.0s Leonard: C, Popa Docs, Capital Idea. So that's Port-au-Prince.\n",
"0046.5s 0047.0s Leonard: Hey, Eddie.\n",
"0048.9s 0050.3s Leonard: Can I help you? Yes!\n",
"0052.5s 0056.1s Leonard: Is this the high IQ Spombank?\n",
"0057.8s 0057.9s Leonard: So, that's it. Thank you, everybody. Thank you. And we'll see you in the next video.\n",
"0057.9s 0060.0s SPEAKER_03: If you have to ask, maybe you shouldn't be here.\n"
]
}
]
},
{
"cell_type": "markdown",
"source": [
"## Local installation (based on conda)\n",
"\n",
"```bash\n",
"conda create -n rich_transcription python=3.9\n",
"conda activate rich_transcription\n",
"\n",
"# speechbrain (used for speaker embedding)\n",
"pip install -qq torch==1.11.0 torchvision==0.12.0 torchaudio==0.11.0 torchtext==0.12.0\n",
"pip install -qq speechbrain==0.5.12\n",
"\n",
"# pyannote.audio (used for speaker diarization)\n",
"pip install -qq pyannote.audio==2.1.1\n",
"\n",
"# OpenAI whisper (used for automatic speech recognition)\n",
"pip install -qq git+https://github.com/openai/whisper.git \n",
"```"
],
"metadata": {
"id": "6Usm50J9D4yg"
}
}
]
}
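
Note on very short speaker turns: the recorded transcription output in the notebook includes a zero-length turn (0006.7s to 0006.7s) with empty text and a 0.1-second turn (0057.8s to 0057.9s) for which Whisper returned an unrelated sentence. One possible mitigation, sketched below, is to drop very short turns and merge consecutive turns of the same speaker before transcribing. This is not part of the original notebook: the helper name `clean_turns`, its parameters `min_duration` and `max_gap`, and the threshold values are all assumptions chosen for illustration.

```python
from typing import List, Tuple

# (start in seconds, end in seconds, speaker label)
Turn = Tuple[float, float, str]


def clean_turns(turns: List[Turn],
                min_duration: float = 0.3,
                max_gap: float = 1.0) -> List[Turn]:
    """Hypothetical helper: drop turns shorter than `min_duration` seconds,
    then merge consecutive kept turns of the same speaker when they are
    separated by at most `max_gap` seconds."""
    cleaned: List[Turn] = []
    for start, end, speaker in sorted(turns):
        if end - start < min_duration:
            continue  # too short to transcribe reliably
        if cleaned and cleaned[-1][2] == speaker and start - cleaned[-1][1] <= max_gap:
            prev_start, prev_end, _ = cleaned[-1]
            cleaned[-1] = (prev_start, max(prev_end, end), speaker)
        else:
            cleaned.append((start, end, speaker))
    return cleaned


# Turn boundaries copied from the notebook output (a subset of the first minute).
turns = [
    (1.4, 12.9, "Sheldon"),
    (6.7, 6.7, "SPEAKER_03"),
    (12.9, 13.4, "Leonard"),
    (14.0, 14.5, "Leonard"),
    (14.5, 17.1, "Sheldon"),
    (57.8, 57.9, "Leonard"),
    (57.9, 60.0, "SPEAKER_03"),
]

cleaned = clean_turns(turns, min_duration=0.3, max_gap=1.0)
for start, end, speaker in cleaned:
    print(f"{start:06.1f}s {end:06.1f}s {speaker}")
```

In the notebook, the input list could be built from `who_speaks_when.itertracks(yield_label=True)` as `(segment.start, segment.end, speaker)` tuples, and each cleaned turn converted back with `Segment(start, end)` before calling `audio.crop`. Merging trades a little timing precision for longer audio excerpts, which may give Whisper more context per call.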
Eranba92 commented Apr 10, 2023

@hbredin Three questions:

  1. Does this code include faster inference?
  2. Are the weights free to use?
  3. How should I handle a large number of audio files, for example more than 1,000 minutes of audio?
