@timvw
Created September 22, 2016 19:53
SparkSQL and CTE for increased readability
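// Load the log file as a DataFrame with a single string column named "value".
val df = spark.read.text(inputFile)
// Register it as a temporary view so the SQL below can refer to it as "data".
df.createOrReplaceTempView("data")
// Each CTE names one step: loglevel extracts the first space-separated token
// of every non-empty line, and levelcount counts the rows per level.
val query =
"""
| WITH loglevel AS (SELECT SPLIT(value, ' ')[0] AS level FROM data WHERE LENGTH(value) > 0),
| levelcount AS (SELECT level, COUNT(*) AS count FROM loglevel GROUP BY level)
| SELECT level, count FROM levelcount ORDER BY count DESC
""".stripMargin
val result = spark.sql(query)
// Print the RDD lineage behind the query, then the counts themselves.
println(result.rdd.toDebugString)
result.show()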
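To check what the query should return without starting Spark, the same three steps can be mirrored on plain Scala collections. This is only a sketch, not part of the original gist: the sample log lines and the names `lines`, `levels`, `levelCounts` and `ranked` are hypothetical, and it assumes a Scala REPL or script context, like the snippet above.

```scala
// Hypothetical sample log lines; the gist reads these from inputFile instead.
val lines = Seq(
  "INFO starting job",
  "WARN disk almost full",
  "",
  "INFO job finished",
  "ERROR task failed",
  "INFO shutting down"
)

// Step 1 (loglevel CTE): drop empty lines, keep the first space-separated token.
// The limit -1 keeps trailing empty strings, so a line of only spaces still has a first token.
val levels: Seq[String] = lines.filter(_.nonEmpty).map(_.split(" ", -1)(0))

// Step 2 (levelcount CTE): count how often each level occurs.
val levelCounts: Map[String, Int] =
  levels.groupBy(identity).map { case (level, group) => (level, group.size) }

// Final SELECT: order by count, highest first. The order of ties is not specified.
val ranked: Seq[(String, Int)] = levelCounts.toSeq.sortBy { case (_, count) => -count }

ranked.foreach { case (level, count) => println(s"$level\t$count") }
```

Each `val` lines up with one named step of the query, which is the readability gain the CTEs aim for compared with nesting the subqueries.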