@AeroXi
Created October 21, 2021 10:12
Convert a txt file into the GPT2-Chinese training format
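import json

# Dict keys hold the merged text chunks; using keys also drops exact duplicate chunks.
dic = {}
with open("train.txt", "r", encoding="utf8") as f:
    merge_line = ""
    for line in f:
        # Strip the newline and surrounding whitespace, then append to the current chunk.
        line = line.strip()
        merge_line += line
        # Once the chunk exceeds 500 characters, store it and start a new one.
        if len(merge_line) > 500:
            dic[merge_line] = 1
            merge_line = ""
    # Keep the final chunk, which would otherwise be lost when it is 500 characters or shorter.
    if merge_line:
        dic[merge_line] = 1

# ensure_ascii=False writes Chinese characters as-is instead of \uXXXX escapes.
with open("train.json", "w", encoding="utf8") as f:
    json.dump(dic, f, ensure_ascii=False)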
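The chunking step in the script can also be sketched as a small function that returns a list instead of a dict, which makes it easier to try on sample input. This is only an illustration: the names `merge_lines` and `min_chars` are hypothetical and do not appear in the original script, and it assumes whatever reads train.json accepts a JSON array of strings. Unlike the dict version, a list keeps duplicate chunks.

```python
import json


def merge_lines(lines, min_chars=500):
    """Concatenate stripped lines into chunks longer than min_chars characters."""
    chunks = []
    current = ""
    for line in lines:
        current += line.strip()
        if len(current) > min_chars:
            chunks.append(current)
            current = ""
    # Keep whatever is left over as a final, shorter chunk.
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    # Tiny in-memory sample with a small threshold, just to show the grouping.
    sample = ["abc\n", "  def \n", "gh\n"]
    print(json.dumps(merge_lines(sample, min_chars=4), ensure_ascii=False))
```

To write a file with it, pass the open handle of train.txt as `lines` and hand the returned list to `json.dump` with `ensure_ascii=False`, as in the script above.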