@stefanschmidt
Created May 10, 2023 21:27
Convert BBC subtitles to plain text

Install prerequisites

brew install youtube-dl
pip install pysrt beautifulsoup4 lxml
pip install --pre ttconv

Download the subtitles

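Download the subtitles in ttml format with youtube-dl, then rename the downloaded .ttml file to subtitles.ttml. The command below also downloads the video; pass --skip-download if only the subtitles are needed.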

youtube-dl --write-subs https://www.bbc.com/news/world-us-canada-65452940
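The rename step can also be scripted. The following is a minimal sketch, assuming the working directory contains exactly one downloaded .ttml file; the helper name `rename_single_ttml` is hypothetical and not part of youtube-dl.

```python
from pathlib import Path


def rename_single_ttml(directory: str = ".", target: str = "subtitles.ttml") -> Path:
    """Rename the only downloaded .ttml file in `directory` to `target`."""
    folder = Path(directory)
    # Skip a file that already carries the target name so re-runs do not count it.
    candidates = [p for p in folder.glob("*.ttml") if p.name != target]
    if len(candidates) != 1:
        raise RuntimeError(f"expected exactly one .ttml file, found {len(candidates)}")
    # Path.rename returns the new path (Python 3.8+).
    return candidates[0].rename(folder / target)
```

Calling `rename_single_ttml()` in the download directory leaves a file named subtitles.ttml for the next step. The function raises instead of guessing when several .ttml files are present.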

Convert the subtitles

Convert the subtitles from ttml to srt format (see note 1).

tt convert -i subtitles.ttml -o subtitles.srt
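If the conversion should run from a Python script rather than the shell, the same command can be invoked through `subprocess`. This is only a sketch: it assumes the `tt` executable installed by ttconv is on the PATH, and the function names `build_convert_command` and `convert_subtitles` are hypothetical.

```python
import subprocess


def build_convert_command(src: str = "subtitles.ttml", dst: str = "subtitles.srt") -> list:
    """Return the argument list for the `tt convert` call shown above."""
    return ["tt", "convert", "-i", src, "-o", dst]


def convert_subtitles(src: str = "subtitles.ttml", dst: str = "subtitles.srt") -> None:
    """Run the conversion; assumes the `tt` executable is on the PATH."""
    # check=True raises CalledProcessError if tt exits with a non-zero status.
    subprocess.run(build_convert_command(src, dst), check=True)
```

Passing the arguments as a list avoids shell quoting problems when file names contain spaces.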

Extract the plain text

Read the subtitles from the srt file, remove all formatting (e.g. font tags), and save the result as plain text.

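import pysrt
from bs4 import BeautifulSoup

# Load every cue from the srt file produced in the previous step.
subs = pysrt.open("subtitles.srt")

# Cue text may still contain markup such as <font> tags.
html_text = "\n".join(sub.text for sub in subs)

# Parse the text as HTML and keep only the text nodes (the "lxml" parser needs the lxml package).
soup = BeautifulSoup(html_text, "lxml")
plain_text = soup.get_text()

with open("subtitles.txt", "w", encoding="utf-8") as text_file:
    text_file.write(plain_text)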
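To illustrate what the tag removal does to a single cue, here is a small sketch that uses the standard library's `html.parser` as a stand-in for BeautifulSoup. It is not the method used above, and the helper name `strip_tags` and the sample cue text are made up for the example.

```python
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Collect only the text nodes and drop every tag."""

    def __init__(self):
        super().__init__()  # convert_charrefs=True by default, so &amp; becomes &
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(markup: str) -> str:
    parser = _TextOnly()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


# Hypothetical cue text with the kind of markup an srt file may carry.
sample = '<font color="#ffffff">Hello</font>\n<i>world</i>'
print(strip_tags(sample))
```

Line breaks between cues survive because they are ordinary text nodes; only the markup is discarded.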

1. youtube-dl provides --convert-subs, which could be used to obtain the subtitles in srt format directly, but ttconv is used here because it automatically removes unnecessary line breaks.
