@mkemp
Created May 11, 2015 15:01
Used in CloudCamp Chicago 2015.05.11 presentation.
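#!/usr/bin/env python
import re
import string

# Matches any single ASCII punctuation character.
regex = re.compile('[%s]' % re.escape(string.punctuation))


def word_count(sc, in_file_name, out_file_name):
    """Count words in in_file_name (sc is a SparkContext) and save sorted 'word,count' lines."""
    (sc.textFile(in_file_name)
        # Replace punctuation with spaces, lowercase, and emit a (word, 1) pair per non-empty word.
        .flatMap(lambda line: [(word, 1) for word in regex.sub(' ', line).strip().lower().split(' ') if word])
        # Sum the counts for each word.
        .reduceByKey(lambda a, b: a + b)
        .sortByKey()
        # Tuple-parameter lambdas are Python 2 only, so format the (word, count) pair directly.
        .map(lambda pair: '%s,%s' % pair)
        .saveAsTextFile(out_file_name))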
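
To check what the pipeline should produce without starting Spark, the same tokenizing, counting, and sorting steps can be sketched in plain Python. This is only an illustrative stand-in: `local_word_count` and the sample lines are hypothetical and not part of the gist, and it runs on an in-memory list rather than an RDD.

```python
import re
import string
from collections import Counter

# Same punctuation-stripping pattern as the gist.
regex = re.compile('[%s]' % re.escape(string.punctuation))


def local_word_count(lines):
    """Pure-Python stand-in for the Spark pipeline (hypothetical helper, not in the gist)."""
    counts = Counter()
    for line in lines:
        # Mirrors the flatMap step: punctuation to spaces, lowercase, split, drop empty strings.
        counts.update(word for word in regex.sub(' ', line).strip().lower().split(' ') if word)
    # Mirrors sortByKey and the final map to 'word,count' strings.
    return ['%s,%s' % (word, count) for word, count in sorted(counts.items())]


sample = ['Hello, world!', 'hello Spark.']  # made-up input for illustration
print(local_word_count(sample))  # prints ['hello,2', 'spark,1', 'world,1']
```

Comparing this list against the lines Spark writes for a small input file is a quick sanity check on the tokenization rules.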