@erickrf erickrf/json2txt.py
Created Aug 19, 2019

Script to join text from JSON files for training GPT-2
import json
import os
import argparse
parser = argparse.ArgumentParser()
parser.add_argument('input_dir', help='Directory with wiki json files')
parser.add_argument('output', help='Txt file output')
args = parser.parse_args()
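for filename in os.listdir(args.input_dir):
    # only process the extracted wiki files (names start with 'wiki')
    if not filename.startswith('wiki'):
        continue

    path = os.path.join(args.input_dir, filename)
    # the output is opened in append mode, so every input file adds to the same txt file;
    # the encoding is set explicitly so the result does not depend on the platform default
    with open(path, 'r', encoding='utf-8') as fin, \
            open(args.output, 'a', encoding='utf-8') as fout:
        # each line of the input is one JSON document
        for line in fin:
            data = json.loads(line)
            # remove non-processed templates
            text = data['text'].replace('()', '')
            fout.write(text)
            # separator between documents for GPT-2 training
            fout.write('<|endoftext|>\n')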
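
To see what the script writes for a single input line, the per-line step can be pulled into a small function. This is only a sketch: `convert_line` and the sample record are hypothetical and not part of the gist, and it assumes each input line is a JSON object with a `text` key, as the script above does.

```python
import json

END_OF_TEXT = '<|endoftext|>'


def convert_line(line):
    """Return the text the script would write for one JSON line (hypothetical helper)."""
    data = json.loads(line)
    # same cleanup as the script: drop empty parentheses left by removed templates
    text = data['text'].replace('()', '')
    return text + END_OF_TEXT + '\n'


# made-up sample record, for illustration only
sample = json.dumps({'id': '1', 'title': 'Example',
                     'text': 'Example\n\nAn example () article.\n'})
print(convert_line(sample), end='')
```

Note that removing `()` leaves the surrounding spaces in place, so the sample yields two consecutive spaces where the parentheses were.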