@dgadiraju
Last active May 29, 2020 17:12
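# RDD API word count; assumes `sc` is the SparkContext provided by the pyspark shell
data = sc.textFile('/public/randomtextwriter/part-m-00000')
# Split each line into words, pair each word with 1, then sum the counts per word
wc = data. \
  flatMap(lambda line: line.split(' ')). \
  map(lambda word: (word, 1)). \
  reduceByKey(lambda x, y: x + y)
# Write each (word, count) pair as a "word,count" text line
wc. \
  map(lambda rec: rec[0] + ',' + str(rec[1])). \
  saveAsTextFile('/user/training/core/wordcount')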
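
# DataFrame API word count; assumes `spark` is the SparkSession provided by the pyspark shell
from pyspark.sql.functions import split, explode, count
data = spark.read.text('/public/randomtextwriter/part-m-00000')
# Explode each line into one row per word, then count the rows for each word
wc = data.select(explode(split(data.value, ' ')).alias('words')). \
  groupBy('words'). \
  agg(count('words').alias('wc'))
# Write the (words, wc) rows as CSV part files
wc.write.csv('/user/training/df/wordcount')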
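
To sanity-check what either job should produce on a small sample without a cluster, the same split-and-count logic can be sketched in plain Python. This is only an illustrative sketch and not part of the original snippets: `word_count` and the `sample` lines are hypothetical names introduced here, and the single-space split is assumed to mirror `line.split(' ')` in the RDD version.

```python
from collections import Counter


def word_count(lines):
    """Count words after splitting each line on a single space (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        # Same tokenization as line.split(' ') in the RDD snippet:
        # consecutive spaces produce empty-string tokens.
        counts.update(line.split(' '))
    return dict(counts)


# Hypothetical sample input standing in for lines of the text file
sample = ['spark makes word count easy', 'word count with spark']
result = word_count(sample)

# Print in the same "word,count" shape the RDD job writes
for word, n in sorted(result.items()):
    print(word + ',' + str(n))
```

Because the split is on a single space, runs of spaces yield empty-string tokens that get counted as a word; filtering them out before counting may be worthwhile if the input is not cleanly single-spaced.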