@jamesonthecrow
Created November 11, 2018 03:34
Preprocess data for the subreddit suggester.
import pandas
import re
import json
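# Use pandas and regex to clean up the post titles.
# `posts` is not defined in this snippet; it is assumed to be a list of
# (subreddit, title) pairs collected in an earlier step.
df = pandas.DataFrame(posts, columns=['subreddit', 'title'])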
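# Remove any [tag] markers in a post title. The match is non-greedy so that
# text between two separate tags (e.g. "[OC] title [1080p]") is kept.
df.title = df.title.apply(lambda x: re.sub(r'\[.*?\]', '', x))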
# Remove all other punctuation except spaces
df.title = df.title.apply(lambda x: re.sub(r'\W(?<![ ])', '', x))
# Save the data in the exact format CreateML expects
output = []
for idx, row in df.iterrows():
    output.append({'text': row.title, 'label': row.subreddit})
filename = 'PATH/TO/data.json'
with open(filename, 'w') as fid:
    fid.write(json.dumps(output))
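
The cleaning steps above can be tried without pandas or a real dataset. The following is a minimal sketch, not the author's code: `sample_posts` and `clean_title` are hypothetical names introduced for illustration, and the tag pattern is the non-greedy variant `\[.*?\]`, so that a title carrying two tags does not lose the text between them. Leading and trailing spaces are left in place.

```python
import json
import re

# Hypothetical sample data standing in for `posts`: (subreddit, title) pairs.
sample_posts = [
    ("python", "[Help] Why doesn't my regex work?"),
    ("swift", "CreateML is great! [Discussion]"),
]


def clean_title(title):
    """Strip [tag] markers, then drop every non-word character except spaces."""
    # Non-greedy match: each [tag] is removed on its own.
    title = re.sub(r'\[.*?\]', '', title)
    # \W matches a non-word character; the lookbehind rejects the match
    # when that character is a space, so spaces are kept.
    return re.sub(r'\W(?<![ ])', '', title)


# Build a list of {'text': ..., 'label': ...} records, as in the script above.
records = [{'text': clean_title(title), 'label': subreddit}
           for subreddit, title in sample_posts]

print(json.dumps(records, indent=2))
```

With a greedy pattern (`\[.*\]`), a title such as `[a] b [c]` would be reduced to an empty string, because the match would run from the first `[` to the last `]`.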