@afonsoaugusto
Last active June 23, 2019 17:25
rm -rf wordcount.py
curl https://gist.githubusercontent.com/afonsoaugusto/63b1ff49a8418eb1b94e645acdfd3607/raw/9004ff9993a4e2844af782646de58f7d1e8c2300/wordcount.py -o wordcount.py
from pyspark.sql import SparkSession
from operator import add
import re
print("Okay Google.")
spark = SparkSession\
    .builder\
    .appName("CountUniqueWords")\
    .getOrCreate()
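# Read the file as an RDD of plain strings, one per input line.
lines = spark.read.text("/sampledata/road-not-taken.txt").rdd.map(lambda x: x[0])
# Strip non-letter characters from each token, drop tokens shorter than two
# letters, upper-case them, then count occurrences and sort by word.
counts = lines.flatMap(lambda x: x.split(' ')) \
    .map(lambda x: re.sub('[^a-zA-Z]+', '', x)) \
    .filter(lambda x: len(x) > 1) \
    .map(lambda x: x.upper()) \
    .map(lambda x: (x, 1)) \
    .reduceByKey(add) \
    .sortByKey()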
output = counts.collect()
for (word, count) in output:
    print("%s = %i" % (word, count))
spark.stop()
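
To sanity-check the expected counts without a Spark cluster, the same pipeline can be sketched in plain Python. This is only an illustration, not part of the gist: `count_words` and the sample lines are hypothetical, and it assumes the `re.sub` step is meant to strip non-letter characters from each token before counting.

```python
import re
from collections import Counter


def count_words(lines):
    """Return sorted (WORD, count) pairs for words of two or more letters."""
    counts = Counter()
    for line in lines:
        for token in line.split(' '):
            # Assumed intent of the regex step: strip non-letter characters.
            word = re.sub('[^a-zA-Z]+', '', token)
            if len(word) > 1:
                counts[word.upper()] += 1
    # sorted() on (word, count) pairs orders by word, like sortByKey().
    return sorted(counts.items())


# Hypothetical sample input standing in for /sampledata/road-not-taken.txt.
sample = [
    "Two roads diverged in a yellow wood,",
    "And sorry I could not travel both",
]
for word, count in count_words(sample):
    print("%s = %i" % (word, count))
```

Running both versions on the same small file and comparing the printed lines is a quick way to confirm the Spark job behaves as intended.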