@naiborhujosua
Last active October 26, 2020 15:23
# Import under a namespace so Spark's min() does not shadow Python's built-in min()
from pyspark.sql import functions as F

# Keep only rows with at least one play; 0 marks a song the user has not played
played = msd.filter(F.col("num_plays") > 0)

# Number of implicit ratings per song and per user
song_counts = played.groupBy("songId").count()
user_counts = played.groupBy("userId").count()

# Min num implicit ratings for a song
print("Minimum implicit ratings for a song: ")
song_counts.select(F.min("count")).show()
# Avg num implicit ratings per song
print("Average implicit ratings per song: ")
song_counts.select(F.avg("count")).show()
# Min num implicit ratings from a user
print("Minimum implicit ratings from a user: ")
user_counts.select(F.min("count")).show()
# Avg num implicit ratings per user
print("Average implicit ratings per user: ")
user_counts.select(F.avg("count")).show()

naiborhujosua commented Oct 26, 2020

In this exercise, we combine the .groupBy() and .filter() methods used previously to calculate the min() and avg() number of users that have rated each song, and the min() and avg() number of songs that each user has rated. Because our data now includes 0s for items not yet consumed, we need to .filter() them out before computing grouped summary statistics like this; otherwise every unplayed user-song pair would be counted as a rating. The msd dataset is provided for you here.
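
To see why the zero rows have to be dropped first, here is a minimal plain-Python sketch of the same count-per-group logic. It does not use Spark or the msd dataset: the `rows` list and the `rating_counts` helper are hypothetical and made up purely for illustration.

```python
from collections import Counter
from statistics import mean

# Hypothetical toy data: (userId, songId, num_plays); 0 means "not yet played".
rows = [
    (1, 10, 3),
    (1, 11, 0),
    (2, 10, 1),
    (2, 11, 2),
    (3, 10, 0),
    (3, 11, 0),
]

def rating_counts(rows, key_index, drop_zeros=True):
    """Count rows per key (0 = userId, 1 = songId), optionally skipping zero-play rows."""
    return Counter(
        row[key_index] for row in rows if not drop_zeros or row[2] > 0
    )

# Analogue of filter(num_plays > 0).groupBy("songId").count()
songs_filtered = rating_counts(rows, key_index=1)
# Same grouping without the filter: every user-song pair counts as a rating
songs_unfiltered = rating_counts(rows, key_index=1, drop_zeros=False)

print("filtered:  ", min(songs_filtered.values()), mean(songs_filtered.values()))
print("unfiltered:", min(songs_unfiltered.values()), mean(songs_unfiltered.values()))
```

Without the filter, each toy song appears to have 3 ratings, which is just the number of users. With the filter, the counts fall to 2 and 1. Grouping by user instead (`key_index=0`) also shows that a user with no plays at all disappears from the filtered result, which is what happens when a Spark filter precedes the groupBy.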

Great work. Users have at least 21 implicit ratings with an average of 77 and each song has at least 3 implicit ratings with an average of 35.
