@oneryalcin
Created September 23, 2019 22:05
5 Sparkify listen_freq
from pyspark.sql import functions as F

# Create a new aggregated DataFrame called listen_freq
# (stands for listening frequency) for each user.
# `data` is assumed to be the Sparkify event DataFrame defined earlier.
listen_freq = (
    data.select('userId', 'sessionId', 'timeStamp')
    # Start time of each session: the earliest event timestamp in that session
    .groupBy('userId', 'sessionId')
    .agg(F.min('timeStamp').alias('sessionTime'))
    # First session, last session and number of sessions per user
    .groupBy('userId')
    .agg(F.min('sessionTime').alias('minSessionTime'),
         F.max('sessionTime').alias('maxSessionTime'),
         F.count('sessionId').alias('sessionCount'))
    # Days between first and last session, divided by the session count
    .withColumn('sessionsFreqDay',
                F.datediff('maxSessionTime', 'minSessionTime') / F.col('sessionCount'))
    .orderBy('userId')
)
listen_freq.cache()
listen_freq.show(10)
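
To check the arithmetic of the aggregation without a Spark session, here is a minimal plain-Python sketch of the same two-step grouping. The helper name `listen_frequency` and the sample rows are hypothetical, not part of the gist. It assumes `timeStamp` holds datetime values and imitates `F.datediff` by subtracting calendar dates.

```python
from collections import defaultdict
from datetime import datetime


def listen_frequency(rows):
    """rows: iterable of (userId, sessionId, timeStamp) tuples (assumed layout)."""
    # Step 1: earliest timestamp per (userId, sessionId), i.e. the session start
    session_start = {}
    for user, session, ts in rows:
        key = (user, session)
        if key not in session_start or ts < session_start[key]:
            session_start[key] = ts

    # Step 2: collect the session start times of each user
    per_user = defaultdict(list)
    for (user, _session), ts in session_start.items():
        per_user[user].append(ts)

    result = {}
    for user, starts in per_user.items():
        first, last = min(starts), max(starts)
        # Whole calendar days between first and last session, like F.datediff
        span_days = (last.date() - first.date()).days
        result[user] = {
            'minSessionTime': first,
            'maxSessionTime': last,
            'sessionCount': len(starts),
            'sessionsFreqDay': span_days / len(starts),
        }
    return result


# Made-up sample events: u1 has two sessions four days apart, u2 has one
rows = [
    ('u1', 1, datetime(2018, 10, 1, 9, 0)),
    ('u1', 1, datetime(2018, 10, 1, 9, 5)),
    ('u1', 2, datetime(2018, 10, 5, 20, 0)),
    ('u2', 7, datetime(2018, 10, 3, 12, 0)),
]
freq = listen_frequency(rows)
print(freq['u1']['sessionsFreqDay'])  # 2.0
```

Note that a user with a single session gets `sessionsFreqDay` of 0, because the first and last session coincide.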