sjainit/check Secret

Created January 28, 2020 04:13
import time
from collections import defaultdict


def get_artists(df):
    """
    Get artist information (artist_name, artist_msid etc) for every user,
    ordered by listen count (number of times a user has listened to tracks
    which belong to a particular artist).

    Args:
        df: DataFrame of listens with the columns user_name, artist_name,
            artist_msid and artist_mbids.

    Returns:
        artists: A dict mapping each user name to a list of dicts,
            which can be depicted as:
            {
                'user1': [{
                    'artist_name': str,
                    'artist_msid': str,
                    'artist_mbids': str,
                    'listen_count': int
                }],
                'user2': [{...}]
            }
    """
    t0 = time.time()
    # Count listens per (user, artist) combination.
    df = df.select("user_name", "artist_name", "artist_msid", "artist_mbids")
    df = df.groupBy("user_name", "artist_name", "artist_msid", "artist_mbids").count()
    # Sort globally by count so that each user's list is built in descending order.
    df = df.sort(df['count'].desc())
    rows = df.collect()
    artists = defaultdict(list)
    for row in rows:
        # Use row['count'] rather than row.count: rows are tuples, so .count is a method.
        artists[row.user_name].append({
            'artist_name': row.artist_name,
            'artist_msid': row.artist_msid,
            'artist_mbids': row.artist_mbids,
            'listen_count': row['count']
        })
    print("Query to calculate artist stats processed in %.2f s" % (time.time() - t0))
    return artists
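
The per-user grouping step can be tried without a Spark session. The following is only a minimal sketch under stated assumptions: `FakeRow` and `group_by_user` are hypothetical names that do not appear in the gist, the sample data is made up, and the count field is called `listen_count` here because a namedtuple field named `count` would shadow the built-in tuple method.

```python
from collections import defaultdict, namedtuple

# Hypothetical stand-in for the collected rows, so the sketch runs without Spark.
FakeRow = namedtuple(
    "FakeRow",
    ["user_name", "artist_name", "artist_msid", "artist_mbids", "listen_count"],
)


def group_by_user(rows):
    """Group rows (already sorted by listen count, descending) into per-user lists."""
    artists = defaultdict(list)
    for row in rows:
        artists[row.user_name].append({
            'artist_name': row.artist_name,
            'artist_msid': row.artist_msid,
            'artist_mbids': row.artist_mbids,
            'listen_count': row.listen_count,
        })
    return artists


# Made-up sample data, sorted by listen count across all users.
rows = [
    FakeRow("alice", "Artist A", "msid-a", "mbid-a", 5),
    FakeRow("bob", "Artist B", "msid-b", "mbid-b", 3),
    FakeRow("alice", "Artist B", "msid-b", "mbid-b", 2),
]
result = group_by_user(rows)
```

Because the rows arrive sorted across all users, appending them in order leaves each user's list sorted by listen count as well, which is why no per-user sort is needed.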