@teemow
Last active May 14, 2023 18:03
Fetch a YouTube playlist with the title, description, and subtitles of each video, and train GPT with the information
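#!/bin/bash
# Usage: ./fetch_subtitles.sh <output-folder> <playlist-url>
set -eu

FOLDER=$1
PLAYLIST=$2

rm -f playlist.txt
mkdir -p "$FOLDER"

# write one video URL per line to playlist.txt
yt-dlp --flat-playlist -i --print-to-file url playlist.txt "$PLAYLIST"

for i in $(cat playlist.txt)
do
  # derive the output file name from the video title
  FILENAME=$(yt-dlp --get-title --skip-download "$i" | tr -s '[[:space:]]' '_').content

  # skip videos that were already fetched
  if [ -f "$FOLDER/$FILENAME" ]; then
    continue
  fi

  rm -rf tmp
  mkdir -p tmp
  cd tmp

  # fetch subtitle
  yt-dlp --skip-download \
    --sub-lang en-orig \
    --write-auto-sub \
    "$i"

  # compgen -G succeeds if at least one .vtt file exists
  # ([ -f *.vtt ] errors out when the glob matches several files)
  if compgen -G "*.vtt" > /dev/null; then
    # convert subtitle
    for j in *.vtt
    do
      vtt2text "$j"
    done

    # get title and description
    yt-dlp --get-title --get-description --skip-download "$i" > "$FILENAME"
    cat *.txt >> "$FILENAME"
    mv "$FILENAME" "../$FOLDER/$FILENAME"
  fi

  cd ..
done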
import os
import logging
import sys
import textwrap

from llama_index import (
    GPTKeywordTableIndex,
    Document,
    SimpleDirectoryReader,
    LLMPredictor,
)
from langchain import OpenAI

if __name__ == "__main__":
    logging.basicConfig(stream=sys.stdout, level=logging.CRITICAL)
    logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))

    if not os.path.exists("index.json"):
        # first run: build the keyword index from the fetched .content files
        subtitles_folder = sys.argv[1]
        documents = SimpleDirectoryReader(subtitles_folder).load_data()
        llm_predictor = LLMPredictor(
            llm=OpenAI(temperature=0,
                       model_name="text-davinci-003",
                       max_tokens=2048)
        )
        index = GPTKeywordTableIndex(documents, llm_predictor=llm_predictor)
        index.save_to_disk("index.json")
    else:
        # later runs: reuse the saved index
        index = GPTKeywordTableIndex.load_from_disk("index.json")

    # simple question loop; Ctrl-C or Ctrl-D exits
    while True:
        try:
            prompt = input("What should I figure out? ")
            response = index.query(prompt)
            response = str(response).strip()
            if not response:
                continue
            for line in textwrap.wrap(response, width=75):
                print(line)
            print("-----")
        except (KeyboardInterrupt, EOFError):
            break

teemow commented Mar 21, 2023

E.g., fetch all the subtitles of the videos from KubeCon Europe 2022:

./fetch_subtitles.sh kubecon-europe-22 'https://www.youtube.com/playlist?list=PLj6h78yzYM2MCEgkd8zH0vJWF7jdQ-GRR'

teemow commented Mar 21, 2023

And then you run python train-with-subtitles.py kubecon-europe-22

Don't forget to put your OPENAI_API_KEY in the env.

Prerequisites:

  • install yt-dlp
  • install some python dependencies:
pip install llama-index openai nltk
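
Before starting a long run, it can help to check that everything is in place. The sketch below is only an illustration, not part of the gist: the function name `missing_prerequisites` is hypothetical, and it assumes that `yt-dlp` and `vtt2text` (both called by the fetch script) are expected on the PATH and that the key is read from the `OPENAI_API_KEY` environment variable.

```python
import os
import shutil


def missing_prerequisites(env=None, which=shutil.which):
    """Return a list of human-readable problems; an empty list means ready.

    `env` and `which` are injectable so the check can be tested without
    touching the real environment (hypothetical helper, not from the gist).
    """
    env = os.environ if env is None else env
    problems = []
    # Command-line tools that the fetch script invokes.
    for tool in ("yt-dlp", "vtt2text"):
        if which(tool) is None:
            problems.append(f"{tool} not found on PATH")
    # An unset or empty variable both count as missing.
    if not env.get("OPENAI_API_KEY"):
        problems.append("OPENAI_API_KEY is not set")
    return problems


if __name__ == "__main__":
    for problem in missing_prerequisites():
        print(problem)
```

Running it with no output means all three checks passed; it does not verify that the Python packages are installed or that the key is valid.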

@IntegralD-503

I was wondering where vtt2text (line 34 of the fetch script) comes from. I don't see it in any Linux repos. I found a Python package, but it needs to be in its own script. Am I missing something here? Thanks
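
The thread does not say which vtt2text the script relies on. Judging from the fetch script alone, it calls `vtt2text "$j"` for each .vtt file and then concatenates `*.txt`, so any tool that writes a .txt file next to each .vtt would fit. A minimal stand-in could look like the sketch below. This is an assumption-laden illustration, not the original tool: the function name `vtt_to_text` is hypothetical, and it simply drops header lines, cue timing lines, inline tags, and consecutive repeated lines (auto-generated captions tend to repeat the previous line in each new cue).

```python
import re
import sys
from pathlib import Path

# Cue timing line, e.g. "00:00:00.000 --> 00:00:02.000 align:start".
TIMING = re.compile(r"^\d{2}:\d{2}(:\d{2})?\.\d{3}\s+-->\s+")
# Inline markup such as <c>, </c> or <00:00:00.480>.
TAG = re.compile(r"<[^>]+>")
# Simplification: any line starting with one of these is treated as metadata.
HEADER_PREFIXES = ("WEBVTT", "Kind:", "Language:", "NOTE")


def vtt_to_text(vtt: str) -> str:
    """Return the caption text of a WebVTT document, one cue line per line."""
    lines = []
    for raw in vtt.splitlines():
        line = TAG.sub("", raw).strip()
        if not line or TIMING.match(line) or line.startswith(HEADER_PREFIXES):
            continue
        # Skip a line that merely repeats the previous one.
        if lines and lines[-1] == line:
            continue
        lines.append(line)
    return "\n".join(lines)


if __name__ == "__main__":
    # Write <name>.txt next to every <name>.vtt given on the command line.
    for name in sys.argv[1:]:
        source = Path(name)
        text = vtt_to_text(source.read_text(encoding="utf-8"))
        source.with_suffix(".txt").write_text(text + "\n", encoding="utf-8")
```

Saved as an executable named vtt2text on the PATH (with a suitable shebang line), it would be invoked the same way the fetch script does. It ignores numeric cue identifiers and other WebVTT features, so treat the output as rough text.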
