@glasslion
Last active March 23, 2024 20:47
This script converts a YouTube subtitle file (vtt) to plain text.
r"""
Convert YouTube subtitles (vtt) to human-readable text.

Download only subtitles from YouTube with youtube-dl:
youtube-dl --skip-download --convert-subs vtt <video_url>

Note that the default subtitle format provided by YouTube is ass, which is hard
to process with simple regex. Luckily youtube-dl can convert ass to vtt, which
is easier to process.

To convert all vtt files inside a directory:
find . -name "*.vtt" -exec python vtt2text.py {} \;
"""

import re
import sys

# Matches the shortened HH:MM timestamps produced by remove_tags().
TIMESTAMP_RE = re.compile(r'^\d{2}:\d{2}$')


def remove_tags(text):
    """
    Remove vtt markup tags
    """
    tags = [
        r'</c>',
        r'<c(\.color\w+)?>',
        r'<\d{2}:\d{2}:\d{2}\.\d{3}>',
    ]

    for pat in tags:
        text = re.sub(pat, '', text)

    # extract timestamp, only keep HH:MM
    text = re.sub(
        r'(\d{2}:\d{2}):\d{2}\.\d{3} --> .* align:start position:0%',
        r'\g<1>',
        text
    )

    # blank out lines that contain only whitespace
    text = re.sub(r'^\s+$', '', text, flags=re.MULTILINE)
    return text


def remove_header(lines):
    """
    Remove vtt file header
    """
    pos = -1
    for mark in ('##', 'Language: en',):
        if mark in lines:
            pos = lines.index(mark)
    lines = lines[pos+1:]
    return lines


def merge_duplicates(lines):
    """
    Remove duplicated subtitles. Duplicates are always adjacent.
    """
    last_timestamp = ''
    last_cap = ''
    for line in lines:
        if line == "":
            continue
        if TIMESTAMP_RE.match(line):
            if line != last_timestamp:
                yield line
                last_timestamp = line
        else:
            if line != last_cap:
                yield line
                last_cap = line


def merge_short_lines(lines):
    """
    Join consecutive short captions into lines of up to about 80 characters.
    """
    buffer = ''
    for line in lines:
        if line == "" or TIMESTAMP_RE.match(line):
            # flush pending text first so it stays above the next timestamp
            if buffer:
                yield buffer.strip()
                buffer = ''
            yield '\n' + line
            continue

        if len(line+buffer) < 80:
            buffer += ' ' + line
        else:
            yield buffer.strip()
            buffer = line
    if buffer:
        yield buffer.strip()


def main():
    vtt_file_name = sys.argv[1]
    # escape the dot so only a literal ".vtt" suffix is replaced
    txt_name = re.sub(r'\.vtt$', '.txt', vtt_file_name)
    # WebVTT is specified as UTF-8; do not rely on the platform default encoding
    with open(vtt_file_name, encoding='utf-8') as f:
        text = f.read()
    text = remove_tags(text)
    lines = text.splitlines()
    lines = remove_header(lines)
    lines = merge_duplicates(lines)
    lines = list(lines)
    lines = merge_short_lines(lines)
    lines = list(lines)

    with open(txt_name, 'w', encoding='utf-8') as f:
        for line in lines:
            f.write(line)
            f.write("\n")


if __name__ == "__main__":
    main()
@dugsmith137

dugsmith137 commented Feb 7, 2021

Good work. I have been playing with cleaning up vtt files from YouTube using Notepad++ search & replace with regular expressions.

Explanation
YouTube vtt files seem to have a REPEATING structure every 24 lines, not including the header.

From the start of each block (the first block starts at absolute line 5, the 2nd at 29, the 3rd at 53, ...),
using relative offsets inside the block:
line 1 = first timecode
then just concatenate the text (subtitles) from lines 3, 11 and 19, dropping the EOL from all but line 19.
I had originally included line 2 as the first subtitle, but this knocked out the timings.

Load the VTT file into Notepad++, then navigate to search and replace. Ensure Search Mode = Regular expression, and that "matches newline" is UNCHECKED.


1. First remove all youtube tags of the form <.....>

search
<.*?>
replace

{line above} is the empty string
click 'REPLACE ALL'


2. Now concat timecode and 3 (unique) subtitles from block

The cursor needs to be at the top of the file, otherwise search & replace might not align to the start of the 24-line blocks. I left out the first subtitle of each block because it knocks out the timecode, so you may lose the very first subtitle (it is added as the last subtitle of the previous block, which works for all but the first block).

search
(.*? --> ).*?\n.*?\n(.*?)\n\n.*?\n.*?\n.*?\n\n.*?\n.*?\n(.*?)\n\n.*?\n.*?\n.*?\n\n.*?\n.*?\n(.*?\n)\n(.*?)-->.*?\n.*?\n.*?\n\n
replace
\1\2 \3 \4

click 'REPLACE ALL'


In Python you could convert the timecode at the beginning of a line into seconds and a URL that points to where the text appears in the YouTube video, e.g.

https://youtu.be/ZxYOEwM6Wbk?t=64 --> those that existed on the unit circle so here I have a little complex plane drawn we've got the real
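
A minimal sketch of that conversion, assuming lines that start with an `HH:MM:SS.mmm` timecode as in the sample output. The helper names `timecode_to_seconds` and `link_line` are made up for illustration, and milliseconds are simply dropped:

```python
import re

# Matches a leading "HH:MM:SS.mmm" timecode as it appears in the cleaned-up output.
TIMECODE_RE = re.compile(r'^(\d{2}):(\d{2}):(\d{2})\.(\d{3})')


def timecode_to_seconds(timecode):
    """Convert 'HH:MM:SS.mmm' to whole seconds (milliseconds are dropped)."""
    match = TIMECODE_RE.match(timecode)
    if match is None:
        raise ValueError(f'not a vtt timecode: {timecode!r}')
    hours, minutes, seconds, _millis = (int(g) for g in match.groups())
    return hours * 3600 + minutes * 60 + seconds


def link_line(line, video_id):
    """Replace the leading timecode of a line with a youtu.be link to that moment."""
    match = TIMECODE_RE.match(line)
    if match is None:
        return line  # no timecode at the start: leave the line untouched
    seconds = timecode_to_seconds(match.group(0))
    return f'https://youtu.be/{video_id}?t={seconds}' + line[match.end():]
```

Calling `link_line(line, 'ZxYOEwM6Wbk')` on each line of the sample output would produce lines of the form shown above.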

hope this is of interest

sample output from https://www.youtube.com/watch?v=ZxYOEwM6Wbk


WEBVTT
Kind: captions
Language: en

00:00:00.030 --> welcome back to lockdown math today we are going to be talking about Euler's formula and just to give you a little
00:00:06.000 --> sense of where we're going to be ending up with this lesson I'm gonna go ahead and show you what we're aiming for at
00:00:10.830 --> the end which is a certain visualization so I don't expect you to necessarily understand this immediately but the
00:00:17.490 --> point is that this is something we're going to walk towards what we're going to analyze is an extension of the idea
00:00:23.609 --> of Exponential's in a way that works in the complex plane and the illustration that you're looking at is showing very
00:00:30.090 --> literally what the claim of Euler's formula is because what I want you to appreciate is what the actual statement
00:00:36.120 --> says rather than letting it be shrouded in a certain mystery or a certain question of what the conventions are now
00:00:42.600 --> needless to say this is kind of a confusing thing we've got this spiral of vectors and if it's not entirely clear
00:00:48.539 --> don't worry about it I just want to give you a little sense of where we're going to be going with
00:00:52.140 --> this but before any of that let's take a step back and remember where were we okay back in the end of the last lesson
00:00:59.219 --> when we were talking about complex members one of the key types of complex numbers that we were looking at were
00:01:04.830 --> those that existed on the unit circle so here I have a little complex plane drawn we've got the real number line with the
00:01:11.369 --> points 1 and negative 1 indicated we've got the imaginary number line I being the square root of negative 1 and if you
00:01:18.060 --> remember one of the main points that we emphasized last time is that when you have a number who's sitting one unit
00:01:24.150 --> away from the origin at some angle theta multiplying by this number has the effect of rotating things by that angle
00:01:32.070 --> this is incredibly important throughout physics throughout electrical engineering all throughout math you see
00:01:36.840 --> these numbers everywhere they describe wave mechanics they're very important for polynomials it's really hard to
00:01:42.659 --> overstate how important numbers that sit on this unit circle are now one way that you could write them is with the real
00:01:49.170 --> and imaginary parts and based on lecture two if we know our trigonometry the x coordinate is going to be the cosine of
00:01:55.350 --> that angle and the y coordinate which is the imaginary part is going to be I times the sine of that angle okay so you
00:02:03.119 --> might think all throughout physics all throughout electrical engineering you see the expression cosine of theta plus
00:02:08.759 --> I sine of theta in fact what you often see is another form of this almost always you see this
00:02:15.569 --> written down as e to the power I times theta and this relationship is what's known as Euler's formula okay now he is
00:02:25.200 --> a special constant of nature and I always remember in high school it was never crystal clear to me exactly what
00:02:30.180 --> it was it was something that was just kind of handed down okay it's 2.71828 on and on and we were just taking you know
00:02:39.480 --> we were to take this as a an analogue of Pi it's an irrational number that evidently the universe side finds


if you want to just clean up vtt files (no dupes) so they play nice in vlc (good for checking things work)

remove tags <......> as above

now (search is the same - but replace is different)

search
(.*? --> ).*?\n.*?\n(.*?)\n\n.*?\n.*?\n.*?\n\n.*?\n.*?\n(.*?)\n\n.*?\n.*?\n.*?\n\n.*?\n.*?\n(.*?\n)\n(.*?)-->.*?\n.*?\n.*?\n\n
replace
\1\5\n\2 \3 \4\n

sample output from https://www.youtube.com/watch?v=ZxYOEwM6Wbk


WEBVTT
Kind: captions
Language: en

00:00:00.030 --> 00:00:05.990
welcome back to lockdown math today we are going to be talking about Euler's formula and just to give you a little

00:00:06.000 --> 00:00:10.820
sense of where we're going to be ending up with this lesson I'm gonna go ahead and show you what we're aiming for at

00:00:10.830 --> 00:00:17.480
the end which is a certain visualization so I don't expect you to necessarily understand this immediately but the

00:00:17.490 --> 00:00:23.599
point is that this is something we're going to walk towards what we're going to analyze is an extension of the idea

00:00:23.609 --> 00:00:30.080
of Exponential's in a way that works in the complex plane and the illustration that you're looking at is showing very

00:00:30.090 --> 00:00:36.110
literally what the claim of Euler's formula is because what I want you to appreciate is what the actual statement

00:00:36.120 --> 00:00:42.590
says rather than letting it be shrouded in a certain mystery or a certain question of what the conventions are now

00:00:42.600 --> 00:00:48.529
needless to say this is kind of a confusing thing we've got this spiral of vectors and if it's not entirely clear

@jtsoftware

Should it work with Japanese files?

Error message:

C:\Video\YouTube\Benjiro\Subtitles>python c:\tools\bin\vtt2text.py Kensuke1.ja.vtt
Traceback (most recent call last):
  File "c:\tools\bin\vtt2text.py", line 110, in <module>
    main()
  File "c:\tools\bin\vtt2text.py", line 93, in main
    text = f.read()
  File "C:\Users\JohnT\AppData\Local\Programs\Python\Python39\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 93: character maps to <undefined>

Small file for testing:

WEBVTT
Kind: captions
Language: ja

00:00:00.500 --> 00:00:03.260 align:start position:0%

こんにちは<00:00:01.069>こんにちは

00:00:03.260 --> 00:00:03.270 align:start position:0%
こんにちはこんにちは

00:00:03.270 --> 00:00:04.730 align:start position:0%
こんにちはこんにちは
元気<00:00:03.750>です<00:00:03.900>か

00:00:04.730 --> 00:00:04.740 align:start position:0%
元気ですか

00:00:04.740 --> 00:00:07.670 align:start position:0%
元気ですか
言及<00:00:05.189>数<00:00:06.269>だん<00:00:06.450>すけ<00:00:06.569>さん<00:00:06.779>は

00:00:07.670 --> 00:00:07.680 align:start position:0%
言及数だんすけさんは
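
The traceback shows Python decoding the file with cp1252, the default on that Windows machine. In the test file above, position 93 falls inside the UTF-8 bytes of the first Japanese character, and WebVTT is specified as UTF-8. A minimal sketch of reading the file with an explicit encoding, assuming Python 3 (`read_vtt` is a hypothetical helper name, not part of the gist):

```python
def read_vtt(path):
    """Read a vtt file as UTF-8 regardless of the platform's default encoding."""
    # 'utf-8-sig' behaves like 'utf-8' but also strips a leading byte order mark.
    with open(path, encoding='utf-8-sig') as f:
        return f.read()
```

In the script this would stand in for the plain `open(vtt_file_name)` call in `main()`. The output file probably needs `encoding='utf-8'` as well, so that writing Japanese text does not fail in the same way.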

@dugsmith137

jtsoftware, can you post the URL of the example from YouTube (Kensuke1.ja)?

@jtsoftware

Here's the Kensuke1 video: https://www.youtube.com/watch?v=wHyf3Hy8InQ

Thanks!

@parthu34

Thank you for the script. I want to add one feature to it: analyzing the time gaps between sentences. For example, if one sentence takes 5 seconds, then put a full stop, and likewise for the whole text.
Let me know if anyone has ideas or a script that fits this.


ghost commented Mar 11, 2021

thank you, good script

@saerdnaer

FYI: Here is a different approach inspired by this script. I needed to download an autogenerated transcript, which at least my version of ytdl was not downloading. It also checks the spelling of words, so you have to change it for your target language. https://gist.github.com/saerdnaer/23ddea28f1ce8efca3377151c1c9f5c8

@totoLab

totoLab commented Jun 2, 2021

# extract timestamp, only kep HH:MM
How are you obtaining this behavior? I ask because I wanted to remove these timestamps too, but I couldn't figure out how.

@wankio

wankio commented Jun 25, 2021

I don't know if it can convert this https://www.youtube.com/watch?v=77tTyXRpPx4 into timed words. I tried everything, but it converted whole lines.

@epogrebnyak

Does downloading subtitles still work? youtube-dl -o ytdl-subs --skip-download --write-sub --sub-format vtt has no effect: no text files are written.

@freeload101

freeload101 commented Sep 27, 2021

> Does downloading subtitles still work? youtube-dl -o ytdl-subs --skip-download --write-sub --sub-format vtt has no effect: no text files are written.

I had to use: youtube-dl --write-auto-sub --convert-subs=srt --skip-download URL

see also WIP https://github.com/freeload101/SCRIPTS/blob/master/Bash/Stream_to_Text_with_Keywords.sh

@dugsmith137

dugsmith137 commented Oct 30, 2021 via email

@claudchereji

When I run this with the asterisk, the program only converts one file, not all of them.

@freeload101

freeload101 commented Nov 9, 2021

> When I run this with the asterisk, the program only converts one file, not all of them.

Use a for loop? Or:

find . -iname "*.vtt" -exec python vtt2text.py '{}' \;

Reference: https://github.com/freeload101/SCRIPTS/blob/master/Bash/Stream_to_Text_with_Keywords.sh

@claudchereji

> find . -iname "*.vtt" -exec python vtt2text.py '{}' \;

how do I run this? sorry I'm still learning, I feel like a script kiddie

@freeload101

> > find . -iname "*.vtt" -exec python vtt2text.py '{}' \;
>
> how do I run this? sorry I'm still learning, I feel like a script kiddie

Well you know what a script kiddie is so your 1/2 way there! Not sure this is the place to have this conversation so hit me up on Discord operat0r#1379 or 404.647.4250 -RMcCurdy.com

@xloem

xloem commented Dec 1, 2021

@claudchereji it's a script for a Linux terminal. It's also not hard to modify the Python script to handle multiple files.

I had trouble with international characters using this script with python3 (it works with python2). It seems YouTube doesn't use utf-8 for everything. Passing encoding='iso-8859-1' to preserve bytes when opening the vtt file fixed this for me. I plan to fork the gist.
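
On handling multiple files: the script only reads `sys.argv[1]`, so when the shell expands `*.vtt` into several names, every name after the first is ignored. A rough sketch of one way around that is below. `find_vtt_files` and `convert_all` are hypothetical names, and `convert` stands for any function that turns one vtt file into one text file, for example the body of `main()` wrapped into a function:

```python
from pathlib import Path


def find_vtt_files(args):
    """Expand command-line arguments into a sorted list of .vtt paths.

    Each argument may be a single file or a directory, which is searched recursively.
    """
    found = set()
    for arg in args:
        path = Path(arg)
        if path.is_dir():
            found.update(path.rglob('*.vtt'))
        elif path.suffix == '.vtt':
            found.add(path)
    return sorted(found)


def convert_all(args, convert):
    """Call convert(vtt_path, txt_path) for every vtt file found; return the txt paths."""
    outputs = []
    for vtt_path in find_vtt_files(args):
        txt_path = vtt_path.with_suffix('.txt')
        convert(vtt_path, txt_path)
        outputs.append(txt_path)
    return outputs
```

With something like `convert_all(sys.argv[1:], convert)` at the bottom of the script, both `python vtt2text.py *.vtt` and `python vtt2text.py some_directory` would process every file.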

@xloem

xloem commented Dec 1, 2021

My fork is at https://gist.github.com/xloem/f7ecb8668c14ef07718b4d3447ebe9a2 . This fork handles unexpected encodings and multiple vtt files (@claudchereji ). If people work on this further I request somebody make a git repository for it to track the work.

@ashutoshdubey133

Kudos for the awesome work. Just a question: how do I make it remove the timestamp altogether? I don't even want the HH:MM.
Thanks

@xloem

xloem commented Dec 16, 2021

It looks like timestamp output is produced by line 66 in this file (yield line after matching a time format), not sure.

@Arkohub

Arkohub commented Jun 28, 2022

I am also seeking a way to remove the timestamp. I'm very new to python so I am struggling to follow where I can tweak the code without breaking it. But I think it's falling off somewhere because it's removing duplicates. I tried making another def later on with re.sub but no dice.
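
One way to drop the timestamps without touching the de-duplication logic is to filter them out as a separate step. This is only a sketch. `drop_timestamps` is a hypothetical helper, and it assumes the timestamps have already been shortened to `HH:MM` by `remove_tags()`:

```python
import re

TIMESTAMP_RE = re.compile(r'^\d{2}:\d{2}$')


def drop_timestamps(lines):
    """Yield only caption lines, skipping the HH:MM timestamps left by remove_tags()."""
    for line in lines:
        if not TIMESTAMP_RE.match(line):
            yield line
```

In `main()` this would go between the existing calls, as `lines = drop_timestamps(lines)` after `merge_duplicates` and before `merge_short_lines`.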

@vuslatx

vuslatx commented Jul 25, 2022

Alternative is https://github.com/vuslatx/vtt-to-plain-text

Working great.

@haazy

haazy commented Nov 9, 2022

> Alternative is https://github.com/vuslatx/vtt-to-plain-text
>
> Working great.

This looks like what I want but I am not sure of how to use it.

@freeload101

> > Alternative is https://github.com/vuslatx/vtt-to-plain-text
> > Working great.
>
> This looks like what I want but I am not sure of how to use it.

if you want to join me on a Stream we can walk though it and record podcast/video for HackerPublicRadio.org ! just hit me up sometime freeload01____yahoo.com

@gala8y

gala8y commented Jan 11, 2023

Thanks a lot for the script @glasslion.

@arturmartins

Just found this script after I made this one:
https://gist.github.com/arturmartins/1c78de3e8c21ffce81a17dc2f2181de4

Might be of help to some.

@epogrebnyak

Would a command-line tool with interface below be welcome?

yt-text bZ6pA--F3D4 > subtitles.txt

or better with full URL?

yt-text https://youtu.be/bZ6pA--F3D4 > subtitles.txt
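
Both forms could be accepted by normalizing the argument first. A small sketch, assuming only bare IDs, `youtu.be` links and `youtube.com/watch?v=...` URLs need to be handled (`extract_video_id` is an illustrative name, not part of any existing tool):

```python
from urllib.parse import parse_qs, urlparse


def extract_video_id(arg):
    """Return the video ID from a bare ID, a youtu.be link, or a youtube.com/watch URL."""
    if '://' not in arg:
        return arg  # assume a bare video ID such as "bZ6pA--F3D4"
    url = urlparse(arg)
    host = url.hostname or ''
    if host == 'youtu.be':
        return url.path.lstrip('/')
    if host.endswith('youtube.com'):
        values = parse_qs(url.query).get('v')
        if values:
            return values[0]
    raise ValueError(f'cannot find a video ID in {arg!r}')
```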

@ibrahimkettaneh

ibrahimkettaneh commented Jan 26, 2024

> Would a command-line tool with interface below be welcome?
>
> yt-text bZ6pA--F3D4 > subtitles.txt
>
> or better with full URL?
>
> yt-text https://youtu.be/bZ6pA--F3D4 > subtitles.txt

Yes, it would be 😁

EDIT: For anyone interested, https://gist.github.com/epogrebnyak/ba87ba52f779f7ebd93b04b2af1059aa

@epogrebnyak

Hi everyone, I wrapped this script here: https://github.com/epogrebnyak/justsubs

Sample usage:

from justsubs import Video

subs = Video("KzWS7gJX5Z8").subtitles(language="en-uYU-mmqFLq8")
subs.download()
print(subs.get_text_blocks()[:10])
print(subs.get_plain_text()[:550])

It seems simply "en" does not work, need "en-uYU-mmqFLq8".

@epogrebnyak

Also pip install justsubs should work
