Skip to content

Instantly share code, notes, and snippets.

@walteryu
Created October 22, 2018 16:17
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save walteryu/88feb32d687595dd4b3779fb0ff71bbe to your computer and use it in GitHub Desktop.
Save walteryu/88feb32d687595dd4b3779fb0ff71bbe to your computer and use it in GitHub Desktop.
HW7 Script
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode
from pyspark.sql.functions import split
spark = SparkSession.builder.appName("StructuredNetworkWordCount").getOrCreate()
lines = spark.readStream.format("socket").option("host", "localhost") \
.option("port", 9999).load()
# Split the lines into words
words = lines.select( explode( split(lines.value, " ")).alias("word"))
# Generate running word count
wordCounts = words.groupBy("word").count()
# Start running the query that prints the running counts to the console
query = wordCounts.writeStream.outputMode("complete").format("console") \
.start()
query.awaitTermination()
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment