Skip to content

Instantly share code, notes, and snippets.

@oneryalcin
Created September 23, 2019 22:19
Show Gist options
  • Save oneryalcin/f4faade2c0210092fc852e698839708a to your computer and use it in GitHub Desktop.
Save oneryalcin/f4faade2c0210092fc852e698839708a to your computer and use it in GitHub Desktop.
7 Sparkify SQL
# Show that we can do the same calculation above using SQL
data.createOrReplaceTempView('sparkify')
sub_query = """
SELECT
userId,
sessionId,
max(itemInSession) as itemCount
FROM
sparkify
GROUP BY
userId, sessionId
ORDER BY
userId, sessionId
"""
query = """
SELECT
userId,
count(userId) as sessionCount,
avg(itemCount) as meanSongCount
FROM
({sub_query})
GROUP BY
userId
ORDER BY
userId
""".format(sub_query=sub_query)
spark.sql(query).show(10)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment