@a-chen
Last active August 30, 2021 18:39
Extract and prettify YouTube transcripts
# -*- coding: utf-8 -*-
# @Author: Isaac Pei
# @Date: 2020-09-28 18:11:14
# @Last Modified by: Isaac Pei
# @Last Modified time: 2021-05-29 13:49:12
## Take the vtt file as input, to generate the transcript text, 1 per line
# --
# -- There is an online tool doing similar work: https://hierogly.ph/
import sys
import re

vtt_input = sys.argv[1]
previous_line = ""  # Last text line seen; a line is printed only when it differs from this
with open(vtt_input) as f:
    for l in f:
        if "align:start" in l:
            # Cue timing line, nothing to print
            pass
        elif "<c>" in l:
            # Lines split into words with <c> </c> markers;
            # the full line is usually repeated a few lines later
            pass
        elif l.strip():
            # Reached only when the line is not empty
            # Strip each remaining tag on its own (non-greedy across tags)
            line = re.sub(r"<[^>]*>", " ", l.strip())
            if line != previous_line:
                print(line)
            previous_line = line
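
The filtering above can be tried without a downloaded file. What follows is only a minimal sketch of the same rules, wrapped in a function; the name `clean_vtt_lines` and the sample captions are made up for illustration and are not part of the gist. It assumes the input looks like a YouTube auto-caption file, where cue timing lines carry `align:start` and word-timed lines carry `<c>` tags.

```python
import re


def clean_vtt_lines(lines):
    """Yield transcript text, skipping cue metadata and consecutive repeats.

    Hypothetical helper: mirrors the rules of the script above.
    """
    previous_line = ""
    for raw in lines:
        if "align:start" in raw or "<c>" in raw:
            # Cue timing lines and word-timed lines are dropped;
            # the plain text of a word-timed line shows up again later
            continue
        text = raw.strip()
        if not text:
            continue
        text = re.sub(r"<[^>]*>", " ", text)
        if text != previous_line:
            yield text
        previous_line = text


# Made-up sample shaped like a YouTube auto-caption file (header omitted)
sample = """\
00:00:00.000 --> 00:00:02.000 align:start position:0%

hello<00:00:00.500><c> world</c>

00:00:02.000 --> 00:00:02.010 align:start position:0%
hello world

00:00:02.010 --> 00:00:04.000 align:start position:0%
hello world
this<00:00:02.500><c> is</c><00:00:03.000><c> a</c><00:00:03.500><c> test</c>

00:00:04.000 --> 00:00:04.010 align:start position:0%
hello world
this is a test
"""

print(list(clean_vtt_lines(sample.splitlines())))
# ['hello world', 'this is a test']
```

Header lines such as `WEBVTT` contain neither marker, so they pass through unchanged, both here and in the script above.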
#!/usr/bin/env bash
# DESCRIPTION
# Given the input URL ($1), download its transcript
# USAGE
# yt_transcript_dl.sh $URL
# e.g. yt_transcript_dl.sh https://youtu.be/LSvJX2pJxQ
# dependencies for MacOS, comment these out if not needed
# needs youtube-dl
brew install youtube-dl
# needs jq
brew install jq
# needs coreutils
brew install coreutils
# The following command is used to download the youtube transcript
url="$1"
mkdir -p input/
mkdir -p output/
# getting title
title=$(youtube-dl -f mp4 -o '%(id)s.%(ext)s' --print-json --no-warnings --skip-download "$url" | jq -r .title)
title=$(echo "$title" | sed 's/ //g')
# downloading transcript
youtube-dl --write-auto-sub --output "input/transcript" --skip-download "$url"
# cleaning up
python ./parse_youtube_transcript.py input/transcript.en.vtt > output/"$title"-transcript.txt
rm -rf ./input
echo "The transcript is available at: $(find . -iname "$title*" -exec realpath {} \;)"