@drorata
Created July 16, 2015 07:26
Word count of a file stored on S3 using Spark (Python version)
from pyspark import SparkContext

sc = SparkContext(appName="simple app")

# Provide the S3 credentials expected by the s3n:// filesystem via the Hadoop configuration
sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", "yourAccessKeyId")
sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", "yourSecretAccessKey")

# Read the file from S3 as an RDD of lines
text_file = sc.textFile("s3n://bucketName/filename.tar.gz")

# Split each line into words, emit (word, 1) pairs, and sum the counts per word
counts = text_file.flatMap(lambda line: line.split(" ")) \
    .map(lambda word: (word, 1)) \
    .reduceByKey(lambda a, b: a + b)

# Write the (word, count) pairs to the "output" directory
counts.saveAsTextFile("output")
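Before pointing the job at S3, the same flatMap/map/reduceByKey pipeline can be smoke-tested on an in-memory RDD. The sketch below assumes only a local Spark installation; the sample sentences and the app name are made up for illustration.

from pyspark import SparkContext

# A minimal local smoke test of the word-count pipeline (no S3 involved)
sc = SparkContext(appName="wordcount smoke test")
lines = sc.parallelize(["to be or not to be", "to be is to do"])
counts = lines.flatMap(lambda line: line.split(" ")) \
    .map(lambda word: (word, 1)) \
    .reduceByKey(lambda a, b: a + b)
# collect() pulls the small result back to the driver for inspection
print(sorted(counts.collect()))
sc.stop()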