r"""
Convert YouTube subtitles(vtt) to human readable text.

Download only subtitles from YouTube with youtube-dl:
youtube-dl --skip-download --convert-subs vtt <video_url>

Note that default subtitle format provided by YouTube is ass, which is hard
to process with simple regex. Luckily youtube-dl can convert ass to vtt, which
is easier to process.

To convert all vtt files inside a directory:
find . -name "*.vtt" -exec python vtt2text.py {} \;
"""

import sys
import re


def remove_tags(text):
    """
    Remove vtt markup tags
    """
    tags = [
        r'</c>',
        r'<c(\.color\w+)?>',
        r'<\d{2}:\d{2}:\d{2}\.\d{3}>',
    ]

    for pat in tags:
        text = re.sub(pat, '', text)

    # extract timestamp, only keep HH:MM
    text = re.sub(
        r'(\d{2}:\d{2}):\d{2}\.\d{3} --> .* align:start position:0%',
        r'\g<1>',
        text
    )

    # blank out lines that contain only whitespace
    text = re.sub(r'^\s+$', '', text, flags=re.MULTILINE)
    return text


def remove_header(lines):
    """
    Remove vtt file header
    """
    pos = -1
    for mark in ('##', 'Language: en',):
        if mark in lines:
            pos = lines.index(mark)
    # keep everything after the last header marker that was found
    lines = lines[pos+1:]
    return lines


def merge_duplicates(lines):
    """
    Remove duplicated subtitles. Duplicates are always adjacent.
    """
    last_timestamp = ''
    last_cap = ''
    for line in lines:
        if line == "":
            continue
        if re.match(r'^\d{2}:\d{2}$', line):
            # timestamp line: emit only when it differs from the previous one
            if line != last_timestamp:
                yield line
                last_timestamp = line
        else:
            # caption line: emit only when it differs from the previous one
            if line != last_cap:
                yield line
                last_cap = line


def merge_short_lines(lines):
    buffer = ''
    for line in lines:
        if line == "" or re.match(r'^\d{2}:\d{2}$', line):
            yield '\n' + line
            continue

        # join caption lines until the merged line would reach 80 characters
        if len(line+buffer) < 80:
            buffer += ' ' + line
        else:
            yield buffer.strip()
            buffer = line
    yield buffer


def main():
    vtt_file_name = sys.argv[1]
    # escape the dot so that only a literal ".vtt" suffix is replaced
    txt_name = re.sub(r'\.vtt$', '.txt', vtt_file_name)
    with open(vtt_file_name) as f:
        text = f.read()
    text = remove_tags(text)
    lines = text.splitlines()
    lines = remove_header(lines)
    lines = merge_duplicates(lines)
    lines = list(lines)
    lines = merge_short_lines(lines)
    lines = list(lines)

    with open(txt_name, 'w') as f:
        for line in lines:
            f.write(line)
            f.write("\n")


if __name__ == "__main__":
    main()
Should it work with Japanese files?
Error message:
C:\Video\YouTube\Benjiro\Subtitles>python c:\tools\bin\vtt2text.py Kensuke1.ja.vtt
Traceback (most recent call last):
  File "c:\tools\bin\vtt2text.py", line 110, in <module>
    main()
  File "c:\tools\bin\vtt2text.py", line 93, in main
    text = f.read()
  File "C:\Users\JohnT\AppData\Local\Programs\Python\Python39\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 93: character maps to <undefined>
Small file for testing:
WEBVTT
Kind: captions
Language: ja
00:00:00.500 --> 00:00:03.260 align:start position:0%
こんにちは<00:00:01.069>こんにちは
00:00:03.260 --> 00:00:03.270 align:start position:0%
こんにちはこんにちは
00:00:03.270 --> 00:00:04.730 align:start position:0%
こんにちはこんにちは
元気<00:00:03.750>です<00:00:03.900>か
00:00:04.730 --> 00:00:04.740 align:start position:0%
元気ですか
00:00:04.740 --> 00:00:07.670 align:start position:0%
元気ですか
言及<00:00:05.189>数<00:00:06.269>だん<00:00:06.450>すけ<00:00:06.569>さん<00:00:06.779>は
00:00:07.670 --> 00:00:07.680 align:start position:0%
言及数だんすけさんは
@jtsoftware, can you post the YouTube URL of the example (Kensuke1.ja)?
Here's the Kensuke1 video: https://www.youtube.com/watch?v=wHyf3Hy8InQ
Thanks!
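The traceback above comes from `open()` falling back to the Windows default code page (cp1252), not from the Japanese text as such. The WebVTT format specifies UTF-8, so one possible fix is to pass the encoding explicitly. The sketch below is not part of the gist: `read_vtt` and `remove_header_any_language` are hypothetical helper names, and the second one assumes the header should end at the `Language:` line for any language code, since the original `remove_header` only looks for `Language: en`.

```python
import re


def read_vtt(path):
    """Read a .vtt file as UTF-8, the encoding the WebVTT format specifies."""
    # utf-8-sig also drops a leading byte-order mark if one is present.
    with open(path, encoding="utf-8-sig") as f:
        return f.read()


def remove_header_any_language(lines):
    """Drop everything up to and including the 'Language: xx' header line."""
    for pos, line in enumerate(lines):
        if re.match(r"^Language: \S+$", line):
            return lines[pos + 1:]
    # No Language line found: leave the input untouched.
    return lines
```

If this is adopted, the output file would probably need `open(txt_name, 'w', encoding='utf-8')` as well, so that writing the Japanese text does not hit the same code-page problem.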
Thank you for the script. I want to add one feature to it: analyzing the time gaps between sentences. For example, if a sentence takes 5 seconds, put a full stop after it, and likewise for the rest of the text.
Let me know if anyone has ideas or a script that fits this.
thank you, good script
FYI: Here is a different approach inspired by this script. I needed to download an autogenerated transcript, which at least my version of ytdl was not downloading. It also checks the spelling of words, so you have to change it for your target language. https://gist.github.com/saerdnaer/23ddea28f1ce8efca3377151c1c9f5c8
# extract timestamp, only kep HH:MM
How are you obtaining this behavior? I ask because I wanted to remove these timestamps too, but I couldn't figure out how.
I don't know if it can convert this https://www.youtube.com/watch?v=77tTyXRpPx4 into timed words; I tried everything, but it only converted whole lines.
Does downloading subtitles still work? youtube-dl -o ytdl-subs --skip-download --write-sub --sub-format vtt
has no effect: no text files are written.
I had to use: youtube-dl --write-auto-sub --convert-subs=srt --skip-download URL
see also WIP https://github.com/freeload101/SCRIPTS/blob/master/Bash/Stream_to_Text_with_Keywords.sh
When I run this with the asterisk, the program only converts one file, not all of them.
Use a for loop, or:
find . -iname "*.vtt" -exec python vtt2text.py '{}' \;
Reference: https://github.com/freeload101/SCRIPTS/blob/master/Bash/Stream_to_Text_with_Keywords.sh
find . -iname "*.vtt" -exec python vtt2text.py '{}' \;
how do I run this? sorry I'm still learning, I feel like a script kiddie
Well you know what a script kiddie is so your 1/2 way there! Not sure this is the place to have this conversation so hit me up on Discord operat0r#1379 or 404.647.4250 -RMcCurdy.com
@claudchereji It's a script for a Linux terminal. It's also not hard to modify the Python script so that it handles multiple files.
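The script reads only `sys.argv[1]`, so when the shell expands `*.vtt` into several arguments, every file after the first is ignored. A minimal sketch of the multi-file change follows. `expand_inputs` and `convert_all` are hypothetical names, and `convert` stands in for the per-file body of `main()`. The glob expansion is there on the assumption that some shells (the Windows command prompt, for example) pass the asterisk through unexpanded.

```python
import glob


def expand_inputs(args):
    """Expand each argument as a glob pattern; keep it as-is when nothing matches."""
    paths = []
    for arg in args:
        matches = sorted(glob.glob(arg))
        paths.extend(matches if matches else [arg])
    return paths


def convert_all(args, convert):
    """Call convert(path) once per input file and return the processed paths."""
    paths = expand_inputs(args)
    for path in paths:
        convert(path)
    return paths
```

In the script this would be called as something like `convert_all(sys.argv[1:], convert_one_file)`, where `convert_one_file` is whatever function ends up holding the current per-file logic.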
I had trouble with international characters using this script with python3 (it works with python2). It seems YouTube doesn't use utf-8 for everything. Passing encoding='iso-8859-1' to preserve bytes when opening the vtt file fixed this for me. I plan to fork the gist.
My fork is at https://gist.github.com/xloem/f7ecb8668c14ef07718b4d3447ebe9a2 . This fork handles unexpected encodings and multiple vtt files (@claudchereji ). If people work on this further I request somebody make a git repository for it to track the work.
Kudos for the awesome work. Just a question: how do I make it remove the timestamp altogether? I don't even want the HH:MM.
Thanks
It looks like timestamp output is produced by line 66 in this file (yield line after matching a time format), not sure.
I am also seeking a way to remove the timestamp. I'm very new to python so I am struggling to follow where I can tweak the code without breaking it. But I think it's falling off somewhere because it's removing duplicates. I tried making another def later on with re.sub but no dice.
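For context, `remove_tags` collapses each cue timing line into a bare `HH:MM`, and the later stages pass any line matching `^\d{2}:\d{2}$` straight through to the output. One way to drop the timestamps is therefore to filter out those lines between `merge_duplicates` and `merge_short_lines`. This is a sketch, not the author's method, and `drop_timestamps` is a hypothetical name.

```python
import re

TIMESTAMP_RE = re.compile(r"^\d{2}:\d{2}$")


def drop_timestamps(lines):
    """Yield only caption lines, skipping the bare HH:MM lines the script emits."""
    for line in lines:
        if not TIMESTAMP_RE.match(line):
            yield line
```

In `main()` this would slot in as `lines = drop_timestamps(merge_duplicates(lines))`. One caveat: a caption consisting of nothing but a time such as `12:30` would be dropped too.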
Alternative is https://github.com/vuslatx/vtt-to-plain-text
Working great.
This looks like what I want but I am not sure of how to use it.
if you want to join me on a Stream we can walk though it and record podcast/video for HackerPublicRadio.org ! just hit me up sometime freeload01____yahoo.com
Thanks a lot for the script @glasslion.
Just found out this script after I made this one:
https://gist.github.com/arturmartins/1c78de3e8c21ffce81a17dc2f2181de4
Might be of help to some.
Would a command-line tool with interface below be welcome?
yt-text bZ6pA--F3D4 > subtitles.txt
or better with full URL?
yt-text https://youtu.be/bZ6pA--F3D4 > subtitles.txt
Yes, it would be 😁
EDIT: For anyone interested, https://gist.github.com/epogrebnyak/ba87ba52f779f7ebd93b04b2af1059aa
Hi everyone, wrapped this script here: https://github.com/epogrebnyak/justsubs
Sample usage:
from justsubs import Video
subs = Video("KzWS7gJX5Z8").subtitles(language="en-uYU-mmqFLq8")
subs.download()
print(subs.get_text_blocks()[:10])
print(subs.get_plain_text()[:550])
It seems that simply "en" does not work; "en-uYU-mmqFLq8" is needed.
Also, pip install justsubs should work.
Good work, I have been playing with cleaning up vtt files from youtube. Using Notepad++ Search & Replace Regular Expressions
Explanation
YouTube vtt files seem to have a REPEATING structure every 24 lines, not including the header.
Counting from the start of each block (the first block starts at absolute line 5, then the 2nd at 29, the 3rd at 53, ...),
using relative offsets inside the block:
line 1 = first timecode
then just concat the text (subtitles) from lines 3, 11 & 19, dropping the EOL from all but line 19.
I had originally included line 2 as the first subtitle, but this knocked out the timings.
Load the VTT file into Notepad++, then navigate to search and replace. Ensure Search Mode = Regular expression, and that "matches newline" is UNCHECKED.
1. First remove all youtube tags of the form <.....>
search
<.*?>
replace
(leave the replace field empty, i.e. replace with the empty string)
click 'REPLACE ALL'
2. Now concat timecode and 3 (unique) subtitles from block
The cursor needs to be at the top of the file, otherwise search & replace might not align to the start of the 24-line blocks. I left out the first subtitle of each block because it knocks out the timecode, so you may lose the very first subtitle (it is added as the last subtitle of the previous block, which works for all but the first block).
search
(.*? --> ).*?\n.*?\n(.*?)\n\n.*?\n.*?\n.*?\n\n.*?\n.*?\n(.*?)\n\n.*?\n.*?\n.*?\n\n.*?\n.*?\n(.*?\n)\n(.*?)-->.*?\n.*?\n.*?\n\n
replace
\1\2 \3 \4
click 'REPLACE ALL'
In Python you could convert the timecode at the beginning of a line into seconds and a URL that points to where the text appears in the YouTube video, e.g.
https://youtu.be/ZxYOEwM6Wbk?t=64 --> those that existed on the unit circle so here I have a little complex plane drawn we've got the real
hope this is of interest
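The timecode-to-link idea above can be sketched as follows. The function names are hypothetical, the video ID is passed in by the caller, and the milliseconds are simply discarded because the `?t=` parameter is used here with whole seconds.

```python
import re

TIMECODE_RE = re.compile(r"^(\d{2}):(\d{2}):(\d{2})\.\d{3}")


def timecode_to_seconds(timecode):
    """Convert 'HH:MM:SS.mmm' to whole seconds, discarding the milliseconds."""
    m = TIMECODE_RE.match(timecode)
    if m is None:
        raise ValueError(f"not a VTT timecode: {timecode!r}")
    hours, minutes, seconds = (int(x) for x in m.groups())
    return hours * 3600 + minutes * 60 + seconds


def link_line(line, video_id):
    """Replace a leading timecode with a youtu.be link that starts at that time."""
    m = TIMECODE_RE.match(line)
    if m is None:
        # Lines without a leading timecode are returned unchanged.
        return line
    url = f"https://youtu.be/{video_id}?t={timecode_to_seconds(m.group(0))}"
    return url + line[m.end():]
```

For example, `link_line("00:01:04.830 --> those that existed on the unit circle", "ZxYOEwM6Wbk")` gives a line starting with `https://youtu.be/ZxYOEwM6Wbk?t=64`.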
sample output from https://www.youtube.com/watch?v=ZxYOEwM6Wbk
WEBVTT
Kind: captions
Language: en
00:00:00.030 --> welcome back to lockdown math today we are going to be talking about Euler's formula and just to give you a little
00:00:06.000 --> sense of where we're going to be ending up with this lesson I'm gonna go ahead and show you what we're aiming for at
00:00:10.830 --> the end which is a certain visualization so I don't expect you to necessarily understand this immediately but the
00:00:17.490 --> point is that this is something we're going to walk towards what we're going to analyze is an extension of the idea
00:00:23.609 --> of Exponential's in a way that works in the complex plane and the illustration that you're looking at is showing very
00:00:30.090 --> literally what the claim of Euler's formula is because what I want you to appreciate is what the actual statement
00:00:36.120 --> says rather than letting it be shrouded in a certain mystery or a certain question of what the conventions are now
00:00:42.600 --> needless to say this is kind of a confusing thing we've got this spiral of vectors and if it's not entirely clear
00:00:48.539 --> don't worry about it I just want to give you a little sense of where we're going to be going with
00:00:52.140 --> this but before any of that let's take a step back and remember where were we okay back in the end of the last lesson
00:00:59.219 --> when we were talking about complex members one of the key types of complex numbers that we were looking at were
00:01:04.830 --> those that existed on the unit circle so here I have a little complex plane drawn we've got the real number line with the
00:01:11.369 --> points 1 and negative 1 indicated we've got the imaginary number line I being the square root of negative 1 and if you
00:01:18.060 --> remember one of the main points that we emphasized last time is that when you have a number who's sitting one unit
00:01:24.150 --> away from the origin at some angle theta multiplying by this number has the effect of rotating things by that angle
00:01:32.070 --> this is incredibly important throughout physics throughout electrical engineering all throughout math you see
00:01:36.840 --> these numbers everywhere they describe wave mechanics they're very important for polynomials it's really hard to
00:01:42.659 --> overstate how important numbers that sit on this unit circle are now one way that you could write them is with the real
00:01:49.170 --> and imaginary parts and based on lecture two if we know our trigonometry the x coordinate is going to be the cosine of
00:01:55.350 --> that angle and the y coordinate which is the imaginary part is going to be I times the sine of that angle okay so you
00:02:03.119 --> might think all throughout physics all throughout electrical engineering you see the expression cosine of theta plus
00:02:08.759 --> I sine of theta in fact what you often see is another form of this almost always you see this
00:02:15.569 --> written down as e to the power I times theta and this relationship is what's known as Euler's formula okay now he is
00:02:25.200 --> a special constant of nature and I always remember in high school it was never crystal clear to me exactly what
00:02:30.180 --> it was it was something that was just kind of handed down okay it's 2.71828 on and on and we were just taking you know
00:02:39.480 --> we were to take this as a an analogue of Pi it's an irrational number that evidently the universe side finds
if you want to just clean up vtt files (no dupes) so they play nice in vlc (good for checking things work)
remove tags <......> as above
now (search is the same - but replace is different)
search
(.*? --> ).*?\n.*?\n(.*?)\n\n.*?\n.*?\n.*?\n\n.*?\n.*?\n(.*?)\n\n.*?\n.*?\n.*?\n\n.*?\n.*?\n(.*?\n)\n(.*?)-->.*?\n.*?\n.*?\n\n
replace
\1\5\n\2 \3 \4\n
sample output from https://www.youtube.com/watch?v=ZxYOEwM6Wbk
WEBVTT
Kind: captions
Language: en
00:00:00.030 --> 00:00:05.990
welcome back to lockdown math today we are going to be talking about Euler's formula and just to give you a little
00:00:06.000 --> 00:00:10.820
sense of where we're going to be ending up with this lesson I'm gonna go ahead and show you what we're aiming for at
00:00:10.830 --> 00:00:17.480
the end which is a certain visualization so I don't expect you to necessarily understand this immediately but the
00:00:17.490 --> 00:00:23.599
point is that this is something we're going to walk towards what we're going to analyze is an extension of the idea
00:00:23.609 --> 00:00:30.080
of Exponential's in a way that works in the complex plane and the illustration that you're looking at is showing very
00:00:30.090 --> 00:00:36.110
literally what the claim of Euler's formula is because what I want you to appreciate is what the actual statement
00:00:36.120 --> 00:00:42.590
says rather than letting it be shrouded in a certain mystery or a certain question of what the conventions are now
00:00:42.600 --> 00:00:48.529
needless to say this is kind of a confusing thing we've got this spiral of vectors and if it's not entirely clear