Last active
October 26, 2020 15:23
# Note: this import shadows Python's built-in min() with Spark's column function.
from pyspark.sql.functions import col, avg, min

# Minimum number of implicit ratings for a song
print("Minimum implicit ratings for a song: ")
msd.filter(col("num_plays") > 0).groupBy("songId").count().select(min("count")).show()
# Average number of implicit ratings per song
print("Average implicit ratings per song: ")
msd.filter(col("num_plays") > 0).groupBy("songId").count().select(avg("count")).show()
# Minimum number of implicit ratings from a user
print("Minimum implicit ratings from a user: ")
msd.filter(col("num_plays") > 0).groupBy("userId").count().select(min("count")).show()
# Average number of implicit ratings per user
print("Average implicit ratings per user: ")
msd.filter(col("num_plays") > 0).groupBy("userId").count().select(avg("count")).show()
In this exercise, we combine the .groupBy() and .filter() methods you've used previously to calculate the min() and avg() number of users that have rated each song, and the min() and avg() number of songs that each user has rated. Because our data now includes 0s for items not yet consumed, we need to .filter() those rows out before computing grouped summary statistics like these; otherwise every zero-play row would be counted as a rating. The msd dataset is provided for you here.
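To see why the filter matters, here is a minimal plain-Python sketch (no Spark needed) of the same filter, group, and count logic. The rows are hypothetical toy data, not taken from msd, and the helper names ratings_per_key and min_and_avg are made up for illustration.

```python
from collections import Counter

# Hypothetical toy rows of (userId, songId, num_plays); not taken from msd.
rows = [
    (1, "a", 3),
    (1, "b", 0),
    (2, "a", 1),
    (2, "b", 2),
    (3, "a", 0),
    (3, "b", 0),
]


def ratings_per_key(rows, key_index, keep_zeros=False):
    """Count rows per key, dropping rows with num_plays == 0 unless keep_zeros is set."""
    return Counter(
        row[key_index] for row in rows if keep_zeros or row[2] > 0
    )


def min_and_avg(counts):
    """Return (minimum, mean) of the per-key counts."""
    values = list(counts.values())
    return min(values), sum(values) / len(values)


per_song_filtered = ratings_per_key(rows, key_index=1)
per_song_unfiltered = ratings_per_key(rows, key_index=1, keep_zeros=True)
per_user_filtered = ratings_per_key(rows, key_index=0)

print("per song, zeros filtered:", min_and_avg(per_song_filtered))
print("per song, zeros kept:    ", min_and_avg(per_song_unfiltered))
print("per user, zeros filtered:", min_and_avg(per_user_filtered))
```

With the zeros kept, every song appears to have been rated by every user, which inflates both statistics. Note also that a user whose rows are all zeros (user 3 in the toy data) drops out of the grouped result entirely once the filter is applied, so the minimum is taken only over users with at least one play.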
Great work. Users have at least 21 implicit ratings with an average of 77 and each song has at least 3 implicit ratings with an average of 35.