Skip to content

Instantly share code, notes, and snippets.

@danielnegri
Created June 4, 2012 15:43
Show Gist options
  • Star 1 You must be signed in to star a gist
  • Fork 7 You must be signed in to fork a gist
  • Save danielnegri/2869132 to your computer and use it in GitHub Desktop.
Save danielnegri/2869132 to your computer and use it in GitHub Desktop.
LastFM - Ranking

Exercise

Description

The attached utf-8 encoded text file contains the favorite musical artists of 1000 users from LastFM. Each line is a list of up to 50 artists, formatted as follows:

  • Radiohead, Pulp, Morrissey, Delays, Stereophonics, Blur, Suede, Sleeper, The La's, Super Furry Animals
  • Band of Horses, Iggy Pop, The Velvet Underground, Radiohead, The Decemberists, Morrissey, Television

Write a program that, using this file as input, produces a list of pairs of artists which appear TOGETHER in at least fifty different lists. For example, in the above sample, Radiohead and Morrissey appear together twice, but every other pair appears only once. Your program should output the pair list to stdout in the same form as the input.

eg.: Artist Name 1, Artist Name 2

You MAY return an approximate solution, i.e. lists which appear at least 50 times with high probability, as long as you explain why this tradeoff improves the performance of the algorithm. Please include, either in comments or in a separate file, a brief description of the run-time and space complexity of your algorithm.

Resolution

Approach

First approach: I used Naive Search and Brute-force searching. That was very slow and takes 20s to calculate 1k lines.

Second approach: I transformed the Artists map into a par of key="artist name" and value=[line_index1, line_index2, ...]. After reduced and combined those groups, I used Levenshtein distance to find similarity into array values for every keys. That was very slow too and takes 11s to calculate 1k lines.

Third: Transform those values into Sets and calc the intersection between all those combined groups, after that we just need to reduce. This approach is very fast and takes ~1s to calculate 1k lines.

Resolution steps

[ Map Artist and Lines ] ---> [ Reduce Artists ] ---> [ Combining Artists to Groups ] ---> [ Reduce Groups ]

Map Artists and Lines

We create a map for every Artist a set containing all line index that was played.

Reduce Artists

If we want know if a combination was played a least 50 times, we need to certificate the set size is equal or greater than the min of 50 plays. That is, remove from the map all artists with sets smaller than 50.

Combining Artists to Groups

Create a group map from all reduced artists with your respective sets. Example:

{["Jason Mraz", "Lady Gaga"]=>[#<Set: {0, 1}>, #<Set: {0, 1}>]}

Reduce Groups

To know the similarity between one group we can make an intersection from your sets, that will give us a list with all lines in common. The size of this represents how many times this artists appear together. That is, remove from the map all groups with lines in commons smaller than 50.

Space Complexity

The high order complexity appear when we are making the intersection operations with set, so I believe we are working with a O(n*m), where n = number of lines and m is the number of artists.

Usage

Run

$ cd $PROGRAM_PATH
$ chmod +x /bin/ranking
$ ./bin/ranking sample/artists_list.txt output.txt

(Alternative)

$ cd $PROGRAM_PATH
$ ruby ./bin/ranking sample/artists_list.txt output.txt

Test

$ cd $PROGRAM_PATH
$ rake

or

$ cd $PROGRAM_PATH
$ rake test

Requirements

  • Ruby 1.9.3
  • Rake
@Baba-A
Copy link

Baba-A commented Sep 24, 2012

Hi,
could you explain further what you mean by: "Create a group map from all reduced artists with your respective sets" in third step (Combining Artists to Groups).

@blaine12100
Copy link

Could you Please Explain The 2nd,3rd and 4th steps properly please?
Also in the second step,are we supposed to remove artists which appear less than 50 times?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment