@derekpeterson
Last active December 14, 2015 04:49
Simple MRjob script to count words from a TSV with data in the form "category\t[item1,item2,item3]".
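#!/usr/bin/env python
import re
from collections import Counter

from mrjob.job import MRJob


def merge_counts(partial_counts):
    # Sum a stream of {word: count} dicts into a single dict.
    totals = Counter()
    for counts in partial_counts:
        totals.update(counts)
    return dict(totals)


class CityReviews(MRJob):

    def mapper(self, _, line):
        # Strip brackets and spaces, then split the category from its item list.
        line = re.sub(r'\[|\]| ', '', line)
        city, items = line.split('\t', 1)
        # Emit per-line counts as a dict so the combiner and reducer
        # both consume and produce the same value type.
        yield city, dict(Counter(item for item in items.split(',') if item))

    def combiner(self, city, partial_counts):
        # A combiner may run zero or more times, so its output must
        # have the same shape as its input.
        yield city, merge_counts(partial_counts)

    def reducer(self, city, partial_counts):
        # Final per-city totals: {word: count}.
        yield city, merge_counts(partial_counts)


if __name__ == '__main__':
    CityReviews.run()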
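
To check the parsing and counting logic without installing mrjob or running a job, the same steps can be sketched in plain Python. This is only an illustration of the intended result, not part of the gist: the helper names `parse_line` and `count_words` and the sample rows are made up for the example, and the input is assumed to follow the "category\t[item1,item2,item3]" form described above.

```python
import re
from collections import Counter


def parse_line(line):
    # Hypothetical helper: drop brackets and spaces, then split
    # the category from its comma-separated items at the tab.
    cleaned = re.sub(r'\[|\]| ', '', line)
    category, items = cleaned.split('\t', 1)
    return category, [item for item in items.split(',') if item]


def count_words(lines):
    # Hypothetical helper: accumulate one Counter per category,
    # which is what the map/combine/reduce steps do in a distributed way.
    totals = {}
    for line in lines:
        category, items = parse_line(line)
        totals.setdefault(category, Counter()).update(items)
    return {category: dict(counts) for category, counts in totals.items()}


# Made-up sample rows in the assumed TSV format.
sample = [
    "austin\t[tacos, bbq, tacos]",
    "boston\t[chowder]",
    "austin\t[bbq]",
]
result = count_words(sample)
print(result)
```

Comparing this dictionary against the job's output for the same small file is a quick way to confirm that the per-category counts agree.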