@qcl
Created August 6, 2014 05:52
Hadoop Spark Word Count Python Example
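# -*- coding: utf-8 -*-
# qcl
from __future__ import print_function  # keeps print() working on Python 2 as well

from datetime import datetime

from pyspark import SparkContext

if __name__ == "__main__":
    sc = SparkContext(appName="WordCount")
    start_time = datetime.now()

    # Read the input file from HDFS as an RDD of lines.
    f = sc.textFile("hdfs://NLG-WKS-9:9000/user/qcl/Apache_Hadoop")

    # Split each line into words, pair each word with 1, then sum the 1s per word.
    counts = f.flatMap(lambda line: line.split(" ")) \
              .map(lambda word: (word, 1)) \
              .reduceByKey(lambda a, b: a + b)

    # saveAsTextFile is the action that triggers the job; the output directory must not already exist.
    counts.saveAsTextFile("hdfs://NLG-WKS-9:9000/user/qcl/sparkResult")

    # total_seconds() avoids the unpadded-microseconds bug of "%d.%d".
    diff = datetime.now() - start_time
    print("Spent %.6f seconds" % diff.total_seconds())
    sc.stop()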
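
For checking the logic without a cluster, the same flatMap / map / reduceByKey pipeline can be mirrored in plain Python. This is only a rough sketch: `local_word_count` and the sample lines are hypothetical names and data made up for illustration, not part of the original gist, and it runs in a single process rather than on Spark.

```python
from collections import defaultdict
from itertools import chain


def local_word_count(lines):
    """Hypothetical helper: mirror the Spark pipeline's three steps in-process."""
    # flatMap step: split every line on single spaces and flatten into one stream
    words = chain.from_iterable(line.split(" ") for line in lines)
    # map step: pair every word with the count 1
    pairs = ((word, 1) for word in words)
    # reduceByKey step: sum the counts that share the same word
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)


if __name__ == "__main__":
    # Made-up sample input; note the double space in the second line.
    sample = ["hadoop spark hadoop", "spark  word count"]
    print(local_word_count(sample))
```

Because the split is on a single space, consecutive spaces produce empty-string "words" that get counted too. Calling `split()` with no argument would split on any run of whitespace and drop them, if that is the behavior wanted.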