@afonsoaugusto afonsoaugusto/curl.sh
Last active Jun 23, 2019

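# Remove any stale copy, then download wordcount.py from the gist.
# -f makes curl fail on HTTP errors instead of saving an error page; -L follows redirects.
rm -f wordcount.py
curl -fL https://gist.githubusercontent.com/afonsoaugusto/63b1ff49a8418eb1b94e645acdfd3607/raw/9004ff9993a4e2844af782646de58f7d1e8c2300/wordcount.py -o wordcount.py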
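# wordcount.py: count how often each word occurs in a text file with PySpark.
import re
from operator import add

from pyspark.sql import SparkSession

print("Okay Google.")

spark = SparkSession \
    .builder \
    .appName("CountUniqueWords") \
    .getOrCreate()

# Read the file as an RDD of plain strings, one per line.
lines = spark.read.text("/sampledata/road-not-taken.txt").rdd.map(lambda x: x[0])

# Split on spaces, strip non-letter characters (map, not filter, so that
# punctuation is actually removed), drop tokens shorter than two letters,
# uppercase, then count per word and sort alphabetically.
counts = lines.flatMap(lambda x: x.split(' ')) \
    .map(lambda x: re.sub('[^a-zA-Z]+', '', x)) \
    .filter(lambda x: len(x) > 1) \
    .map(lambda x: x.upper()) \
    .map(lambda x: (x, 1)) \
    .reduceByKey(add) \
    .sortByKey()

output = counts.collect()
for (word, count) in output:
    print("%s = %i" % (word, count))

spark.stop()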
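
To check the word-cleaning logic without a Spark cluster, the same steps can be sketched in plain Python. This is only an illustration, not part of the original gist: `count_words` is a hypothetical helper, the sample lines are made up, and it assumes the regex step is meant to strip non-letter characters from each token before counting.

```python
import re
from collections import Counter


def count_words(lines):
    """Hypothetical helper mirroring the Spark pipeline on an in-memory iterable of lines."""
    counts = Counter()
    for line in lines:
        for token in line.split(' '):
            # Assumption: the regex is intended to strip non-letter characters.
            word = re.sub('[^a-zA-Z]+', '', token)
            # Keep only words of two or more letters, as the Spark job does.
            if len(word) > 1:
                counts[word.upper()] += 1
    # Sort alphabetically by word, like sortByKey().
    return sorted(counts.items())


if __name__ == "__main__":
    # Made-up sample input, for illustration only.
    sample = ["Two roads, two paths", "a bb bb, Bb!"]
    for word, count in count_words(sample):
        print("%s = %i" % (word, count))
```

Running the Spark job and this sketch on the same small file gives a quick way to compare the two outputs.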