@davidcortesortuno
Created August 30, 2020 12:04
Remove duplicated lines from a .vtt file generated by youtube-dl when downloading auto-generated YouTube subtitles
# Remove duplicated lines from a .vtt file generated by youtube-dl when
# downloading auto-subs from a Youtube video using the --write-auto-sub option
# This script only prints the lines, so save the edited subs as:
#
#     python this_script.py original_sub.vtt > new_sub.vtt

import re
import sys

f = open(sys.argv[1])
patt = re.compile(r'^\d\d:\d\d:\d\d', re.M)
dup_line = ''

for line in f:
    # line = f.readline()
    # Find a line starting with a time stamp: 00:13:23 ...
    res = re.findall(patt, line)
    if res:
        # If so, print this line and read the next line, which we store
        # in dup_line.
        # In the next loop, if we find another section starting with a timestamp,
        # dup_line is compared with the line below that timestamp. If they match,
        # skip it and do not print the duplicated line.
        # Else, store the new line as the next candidate duplicate and print it.
        print(line, end='')
        next_line = f.readline()
        if dup_line and next_line == dup_line:
            dup_line = ''
            res = []
            continue
        else:
            dup_line = next_line
            print(dup_line, end='')
            res = []
    else:
        print(line, end='')

f.close()
@yogi555

yogi555 commented Mar 30, 2021

Is line 15 OK?

@davidcortesortuno
Author

Yes, as we are already iterating through every line: for line in f
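
A small sketch of why an explicit `readline()` at the top of the loop is not needed: the `for` loop already pulls one line per iteration, and a `readline()` call inside the body consumes the following line from the same stream. `io.StringIO` stands in for the opened subtitle file here (an assumption for illustration; the script itself opens `sys.argv[1]`).

```python
import io

# io.StringIO stands in for the opened subtitle file (illustrative only).
f = io.StringIO("first\nsecond\nthird\nfourth\n")

pairs = []
for line in f:
    # The for loop has already read `line`; readline() takes the line after it.
    following = f.readline()
    pairs.append((line.strip(), following.strip()))

print(pairs)
```

Each iteration therefore handles two lines: the one yielded by the loop and the one read explicitly.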

@felisucoibi

It is not working for me. I have this:
03:15.654 --> 03:20.325
granting Kyonan University,
a mere private university,

03:15.654 --> 03:20.325
granting Kyonan University,
a mere private university,

It does not remove the duplicate.
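
In the sample above the timestamps have no hour field (`03:15.654`), so the script's pattern `^\d\d:\d\d:\d\d` never matches them, and the repeated cue spans several lines rather than one. Below is a minimal sketch of one way to handle this case. It assumes cues are separated by blank lines, that lines end in `\n`, and that a duplicate is a cue identical to the one immediately before it. `drop_repeated_cues` is a hypothetical helper, not part of the original script.

```python
import re

def drop_repeated_cues(text):
    """Return `text` with any cue identical to the previous cue removed."""
    # Assumption: cues (timing line plus text lines) are separated by blank lines.
    blocks = re.split(r'\n\s*\n', text.strip('\n'))
    kept = []
    for block in blocks:
        # Skip a block that repeats the block kept just before it.
        if kept and block.strip() == kept[-1].strip():
            continue
        kept.append(block)
    return '\n\n'.join(kept) + '\n'

sample = """03:15.654 --> 03:20.325
granting Kyonan University,
a mere private university,

03:15.654 --> 03:20.325
granting Kyonan University,
a mere private university,
"""

print(drop_repeated_cues(sample), end='')
```

Called on this sample, the function keeps a single copy of the cue. Comparing whole blocks instead of single lines means multi-line cues are handled without relying on the timestamp format.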

@jjjchens235

It is not working for me either.

1021
00:20:18,630 --> 00:20:20,540
aquí desde medellín me despido nos vemos

1022
00:20:20,540 --> 00:20:20,550
aquí desde medellín me despido nos vemos
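
In this sample the two cues carry the same text under different timings, so comparing whole cues would not catch the repeat. Below is a rough sketch that compares only the cue text and drops a cue whose text repeats the previous cue's text. It assumes such a cue is redundant, which discards the second cue's timing, and it does not handle cues that only partially overlap. `TIMING`, `cue_text` and `drop_cues_with_repeated_text` are hypothetical names, not part of the original script.

```python
import re

# Matches a timing line such as "00:20:18,630 --> 00:20:20,540";
# the hour field is optional and "," or "." may precede the milliseconds.
TIMING = re.compile(r'^(?:\d+:)?\d\d:\d\d[.,]\d{3}\s+-->')

def cue_text(block):
    """Return the text of a cue (everything after its timing line), or None."""
    lines = block.split('\n')
    for i, line in enumerate(lines):
        if TIMING.match(line):
            return '\n'.join(lines[i + 1:]).strip()
    return None  # not a cue, e.g. a header block

def drop_cues_with_repeated_text(text):
    """Drop every cue whose text equals the text of the previous cue."""
    # Assumption: cues are separated by blank lines.
    blocks = re.split(r'\n\s*\n', text.strip('\n'))
    kept = []
    last_text = None
    for block in blocks:
        current = cue_text(block)
        if current is not None and current == last_text:
            continue  # same text as the previous cue: skip the whole cue
        if current is not None:
            last_text = current
        kept.append(block)
    return '\n\n'.join(kept) + '\n'

sample = """1021
00:20:18,630 --> 00:20:20,540
aquí desde medellín me despido nos vemos

1022
00:20:20,540 --> 00:20:20,550
aquí desde medellín me despido nos vemos
"""

print(drop_cues_with_repeated_text(sample), end='')
```

On this sample only cue 1021 remains. The remaining cues are not renumbered, so gaps in the cue numbers are left as they are.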

@GNtrazios

I am interested in solving this issue, but I need your help. Could you please share the URLs of the videos whose .vtt files have duplicated lines?

@davidcortesortuno
Author

Yes, please share a video that can be tested to update the code :D

@jjjchens235

jjjchens235 commented May 20, 2022

This is the video that generated dup lines for me:
https://www.youtube.com/watch?v=ubOqOCukR40&t=941s&ab_channel=GabrielHerrera

@hanimourra

Were you able to find a fix? This is becoming a problem for us. Attached is a .vtt file from YouTube that needs duplicates fixed: https://drive.google.com/file/d/163Y-rg2qouJOQ2rjeudQ3dAFrE7TF_2M/view?usp=sharing

@volehuy1998

Hello, is it OK?
