Last active
October 10, 2023 08:15
Use GPT to summarize the SRT files and output SRT.
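# https://www.youtube.com/watch?v=I3gKT1CL5Z
0:20:13 Design of XX
0:22:50 Design of OO
0:31:51 Design of XX
0:52:08 New outfit reveal
0:52:42 Describes the soles and shoelaces
0:53:50 Describes the thigh bone and bones
0:54:40 Describes the face design, including the teeth and nose
0:57:42 Detailed introduction to the jacket design and LOGO
0:58:52 Design of the prey badge and belt
1:10:51 Explains membership benefits, including wallpapers and personal letters
1:15:26 Describes the food toppers and the car design
1:20:12 Talks about future merchandise and exhibition plans
1:22:10 Introduces the artwork's background and the AI-assisted retouching process
1:22:51 Mentions the time spent retouching the artwork and the versions produced
1:23:01 Explains the plan to turn the artwork into a hanging scroll and hold an XX contest
1:23:32 Mentions changes to membership tiers and the entry fee
1:24:14 Introduces AI-assisted perspective line adjustment and product giveaways
1:25:36 Discusses the bidding format and prizes of the OO contest
1:27:02 Announces a change in the frequency of the OO session
1:29:57 Explains the importance of rest and the use of a self-assessment scale
1:36:32 The need to balance body, mind, and soul
1:39:19 Describes how viewers react when facing similar difficulties and stresses being aware of one's own condition
1:40:01 Discusses XX's growth and improvement
1:41:06 Stresses valuing physical, mental, and spiritual health and explores how to measure and improve one's state
1:41:50 Mentions an interest in OO and plans to introduce his achievements
1:44:59 Discusses using YT's built-in self-assessment scale
1:45:41 Discusses the pressure and challenges of pursuing dreams
1:47:17 Explores the importance of balancing body, mind, and soul and mentions various methods and techniques
1:49:49 Discusses the channel's future direction and plans to build an IP
1:52:07 Discusses choices about commissions and collaborations, mentioning the expertise of designers and teachers
1:54:00 Discusses plans to adjust stream content and online influence
1:56:47 Introduces the OO session and the break plan
1:59:43 Discusses XX donations and free-talk plans
2:03:31 Thanks viewers for their support and introduces membership milestones
2:18:51 Looks back on the OO session and thoughts during the break
2:22:56 Discusses the birthday special and matters about mailing items
2:25:33 Discusses the typhoon and shopping needs
2:27:20 Discusses issues with using the P.O. box
2:28:04 Exchanging greetings and self-introductions
2:29:10 Recommends other streamers
2:30:51 Thanks viewers for their support and discusses own influence
2:32:49 Discusses other streamers' characteristics and recommendations
2:43:43 Thanks for the support and encourages viewers to cherish their benefactors
2:48:53 Announces the new outfit and the birthday stream
2:50:54 Comments on viewer subscriptions and thanks them
2:51:04 The importance of benefactors and self-growth
2:51:31 Thanks viewers for their support and chat
2:52:22 Recaps the stream and the new outfit
2:53:18 Thanks viewers and mentions the next stream
2:54:00 Mentions the game streamed today and next plans
2:54:26 Says good night to viewers individually
2:55:17 Thanks viewers and says good night
2:55:59 Transitions to the next program and reminds viewers to be polite
2:57:32 Reminds viewers to rest and says good night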
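Each line above pairs a topic with the start time of its first subtitle. The following is a minimal sketch of how one such line can be formatted from a `datetime.timedelta`, assuming the same `str(...).split(".")[0]` approach that the script's `simple_display` function uses. The helper name `format_line` is hypothetical and not part of the gist.

```python
from datetime import timedelta


def format_line(start: timedelta, topic: str) -> str:
    """Format one outline line as 'H:MM:SS topic', dropping fractional seconds."""
    # str(timedelta) yields e.g. '0:20:13.500000'; keep only the part before '.'
    stamp = str(start).split(".")[0]
    return f"{stamp} {topic}"


# Hypothetical input: a subtitle starting at 20 min 13.5 s
print(format_line(timedelta(minutes=20, seconds=13, milliseconds=500), "Design of XX"))
```

Splitting on `"."` truncates rather than rounds, so a subtitle starting at 0:20:13.9 is still shown as 0:20:13.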
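# pip install srt "openai<1"
# Note: openai.ChatCompletion, used below, is the pre-1.0 interface of the openai package.
import re

import openai
import srt

openai_key = ""  # Fill in your OpenAI API key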
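

def chatGPT_summary(file_in: str, file_out: str) -> None:
    """
    Summarize the subtitles in an SRT file into a topic outline.

    Parameters:
      file_in: Path to input srt
      file_out: Path to output srt
    """
    with open(file_in, encoding="utf-8") as f:
        transcripts = list(srt.parse(f.read()))
    # Prefix each subtitle with its 0-based index so the model can cite line numbers
    transcripts_txt = [f"{i}. {t.content}\n" for i, t in enumerate(transcripts)]
    # System prompt, kept in Chinese because the subtitles are Chinese and the
    # length checks below count characters. In English it asks the model to:
    # build an outline of topics with their line ranges from the subtitles;
    # keep each topic concise; merge adjacent miscellaneous lines into one topic;
    # order topics chronologically; never skip more than 30 lines between topics;
    # keep each topic within 20 characters and without punctuation; cover at
    # least 5 lines per topic (merge or drop shorter ones); give each topic
    # exactly one start and one end line; and avoid overlapping line ranges.
    prompt_sys = """\
    你的任務是根據提供的字幕,製作一份包含大綱主題和相對應行數,以幫助觀眾迅速了解內容結構
    請注意以下要求:
    1. 大綱應該簡明扼要地描述相應部分的內容或主題
    2. 如果有多個雜項行與同一主題相鄰,請將它們合併成一個整體主題,雜項行號算入主題行號
    3. 大綱按時間軸排列
    4. 確保在大綱之間不會略過超過30行
    5. 每個大綱項目應保持在20個字以內,不使用標點符號
    6. 大綱應包含至少5個行數,但如果有小於5行的項目,請將它們合併到相鄰主題中,或者略過不顯示
    7. 一個大綱只能有一個開始行號跟結束行號
    8. 不同大綱的行號不可重疊(overlap)
    格式:
    1. {{主題描述}} line_number: {{開始行號}} - {{結束行號}}\\n
    輸出範例:
    1. 說明XXX做了OOO line_number: 1 - 30
    2. 介紹AAA line_number: 31 - 45
    """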
    openai.api_key = openai_key
    batch = 10000  # A <16k-token context holds roughly 12k Chinese characters
    max_error = 3
    err_count = 0
    start = 0
    topics_srt = []
    while True:
        # Batch the content: take lines from `start` until about `batch` characters
        current_words = 0
        current_end = start
        while current_words < batch and current_end < len(transcripts_txt):
            current_words += len(transcripts_txt[current_end])
            current_end += 1
        content = "".join(transcripts_txt[start:current_end])
        print(f"ChatGPT Topic: line {start}-{current_end} Total words: {len(content)}")
        # Call API
        response = openai.ChatCompletion.create(
            model="gpt-3.5-turbo-16k",
            messages=[
                {
                    "role": "system",
                    "content": prompt_sys,
                },
                {
                    "role": "user",
                    "content": content,
                },
            ],
            temperature=1,
            max_tokens=512,
            top_p=1,
            frequency_penalty=0,
            presence_penalty=0,
            timeout=30,
        )
        print("ChatGPT Topic", response)
        content = response["choices"][0]["message"]["content"]
        print(content)
        # Extract response: list[(topic_txt, line_st, line_ed)]
        # Raw string so that \d and \. are regex escapes, not string escapes
        topic_re = re.findall(r"\d+\. (.*?)line_number: (\d+) - (\d+)", content)
        topic_re = [(t.strip(), int(ts), int(te)) for t, ts, te in topic_re]
        topic_re = sorted(topic_re, key=lambda i: i[1])
        # Drop short topics (te - ts must be at least 3)
        topic_re = [(t, ts, te) for t, ts, te in topic_re if te - ts > 2]
        # Check response is valid
        try:
            if not topic_re:
                raise ValueError("Format Error")
            for t in topic_re:
                if len(t[0]) > 30:
                    raise ValueError("Topic too long")
            if topic_re[-1][2] < start:
                raise ValueError("line_number incorrect")
            # Added check: a line number past the end would raise IndexError
            # when transcripts[te] is looked up later.
            if topic_re[-1][2] >= len(transcripts):
                raise ValueError("line_number out of range")
            for i in range(1, len(topic_re)):
                if topic_re[i][1] <= topic_re[i - 1][2]:
                    raise ValueError("Line Number Overlap")
            # No minimum-length check is needed here: the te - ts > 2 filter
            # above already removed every topic that is too short.
        except ValueError as e:
            # Retry policy: re-request the same batch up to max_error times
            print(f"ChatGPT retry. {e}")
            err_count += 1
            if err_count < max_error:
                continue
            raise ValueError(f"ChatGPT retry failed {max_error} times") from e
        err_count = 0
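        # Add to output srt
        for t, ts, te in topic_re:
            topics_srt.append(
                srt.Subtitle(
                    len(topics_srt) + 1,
                    transcripts[ts].start,
                    transcripts[te].end,
                    t,
                )
            )
        # Always write into file
        with open(file_out, "w", encoding="utf-8") as f:
            f.write(srt.compose(topics_srt))
        # Break if this was the last batch
        if current_end >= len(transcripts_txt):
            break
        # Otherwise restart from the last topic, which may have been cut off by
        # the batch boundary, and drop it so that it is regenerated.
        # Fixed: the original referenced an undefined name `topics_rc`.
        next_start = topic_re[-1][1]
        if next_start > start:
            start = next_start
            topics_srt.pop()
        else:
            # Added guard: avoid looping forever when no progress would be made
            start = current_end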
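

def simple_display(file_in: str) -> None:
    """
    Display srt.

    Example:
    ```
    2:43:43 Thanks for the support and encourages viewers to cherish their benefactors
    2:48:53 Announces the new outfit and the birthday stream
    2:50:54 Comments on viewer subscriptions and thanks them
    2:51:04 The importance of benefactors and self-growth
    ```
    """
    with open(file_in, encoding="utf-8") as f:
        for s in srt.parse(f.read()):
            # Drop fractional seconds: '2:43:43.120000' -> '2:43:43'
            t = str(s.start).split(".")[0]
            print(f"{t} {s.content}")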


if __name__ == "__main__":
    name = "test"
    chatGPT_summary(name + ".srt", name + ".summary.srt")
    name += ".summary"
    simple_display(name + ".srt")
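
The reply parsing and validation inside `chatGPT_summary` can be tried without calling the API. Below is a minimal standalone sketch of that step, assuming the `N. topic line_number: start - end` reply format requested by the system prompt. The function name `parse_topics` and its parameters `n_lines` and `min_span` are hypothetical and not part of the gist, and the out-of-range check is an assumption about what a safe validator should reject.

```python
from __future__ import annotations

import re

# Same pattern as the script: "<index>. <topic> line_number: <start> - <end>"
TOPIC_RE = re.compile(r"\d+\. (.*?)line_number: (\d+) - (\d+)")


def parse_topics(reply: str, n_lines: int, min_span: int = 3) -> list[tuple[str, int, int]]:
    """Extract (topic, start, end) tuples and reject malformed replies."""
    topics = [(t.strip(), int(s), int(e)) for t, s, e in TOPIC_RE.findall(reply)]
    topics.sort(key=lambda item: item[1])
    # Drop topics spanning too few lines, mirroring the `te - ts > 2` filter
    topics = [item for item in topics if item[2] - item[1] >= min_span]
    if not topics:
        raise ValueError("Format Error")
    # Assumed extra check: end lines must index an existing subtitle
    if topics[-1][2] >= n_lines:
        raise ValueError("line_number out of range")
    # Consecutive topics must not share lines
    for prev, cur in zip(topics, topics[1:]):
        if cur[1] <= prev[2]:
            raise ValueError("Line Number Overlap")
    return topics


# Hypothetical model reply in the requested format
reply = "1. Explains XXX line_number: 1 - 30\n2. Introduces AAA line_number: 31 - 45\n"
print(parse_topics(reply, n_lines=50))
```

Raising `ValueError` for every failure mode keeps the caller's retry logic simple: any exception of that type means the same batch should be requested again.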