@yoshiki146
Last active August 11, 2022 13:27

Collaborative Filtering in R

This post implements user-based collaborative filtering in R. The idea comes from a blog post by Salem; I vectorise the code, which significantly reduces the runtime. Visit his blog for more details (http://www.salemmarafi.com/).
You can download Collaborative_Filtering.Rmd to run the code in your own environment.

Import library and dataset

library(magrittr)
data.germany = read.csv("http://www.salemmarafi.com/wp-content/uploads/2014/04/lastfm-matrix-germany.csv")
data.germany[1:6,1:8] # show head of data
##   user a.perfect.circle abba ac.dc adam.green aerosmith afi air
## 1    1                0    0     0          0         0   0   0
## 2   33                0    0     0          1         0   0   0
## 3   42                0    0     0          0         0   0   0
## 4   51                0    0     0          0         0   0   0
## 5   62                0    0     0          0         0   0   0
## 6   75                0    0     0          0         0   0   0

Item-Based Collaborative Filtering

Let's start by implementing item-based CF to obtain a similarity matrix. Cosine similarity is defined as:

similarity(A, B) = (A · B) / (|A| |B|) = Σᵢ AᵢBᵢ / ( √(Σᵢ Aᵢ²) √(Σᵢ Bᵢ²) )

where A and B denote item/song vectors and Aᵢ is the i-th element of A (whether user i played song A). To calculate this, I drop the user column and convert the dataframe into a matrix:

data.germany.ibs= as.matrix(data.germany[,!(names(data.germany) %in% c("user"))])

Then A ⋅ B (numerator) is calculated as

data.germany.ibs.similarity.num= t(data.germany.ibs) %*% data.germany.ibs
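To see what this cross-product does, here is a tiny hand-made example (the 3×2 play matrix and song names below are made up for illustration): each entry of t(M) %*% M is the dot product of two song columns, i.e. the number of users who played both songs.

```r
# Hypothetical 3-user x 2-song play matrix
M <- matrix(c(1, 0,
              1, 1,
              0, 1),
            nrow = 3, byrow = TRUE,
            dimnames = list(NULL, c("songA", "songB")))
num <- t(M) %*% M
num["songA", "songB"]             # A . B = 1 (one user played both songs)
sum(M[, "songA"] * M[, "songB"])  # the same dot product, computed directly
```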

To see why, denote the item vectors A, B, …, Z and stack them as the columns of a matrix M = [A B ⋯ Z]. The numerator in matrix form is then:

M'M = [ A·A  A·B  ⋯  A·Z
        B·A  B·B  ⋯  B·Z
         ⋮    ⋮   ⋱   ⋮
        Z·A  Z·B  ⋯  Z·Z ]

where M' is the transpose of M, so entry (j, k) is the dot product of items j and k.

Notice that the square root of a diagonal element equals the norm of the corresponding vector (|A|). The denominator is therefore the outer product of these norms:

sq_diag=sqrt(diag(data.germany.ibs.similarity.num))
data.germany.ibs.similarity.denom= sq_diag %*% t(sq_diag)

Now we obtain the similarity matrix by simply dividing A · B by |A||B| (an element-wise operation):

data.germany.ibs.similarity= data.germany.ibs.similarity.num/data.germany.ibs.similarity.denom
data.germany.ibs.similarity=as.data.frame(data.germany.ibs.similarity) %>% 
  set_rownames(names(data.germany[-1]))
data.germany.ibs.similarity[1:6, 1:4]  # head 
##                  a.perfect.circle       abba      ac.dc adam.green
## a.perfect.circle       1.00000000 0.00000000 0.01791723 0.05155393
## abba                   0.00000000 1.00000000 0.05227877 0.02507061
## ac.dc                  0.01791723 0.05227877 1.00000000 0.11315371
## adam.green             0.05155393 0.02507061 0.11315371 1.00000000
## aerosmith              0.06277648 0.06105625 0.17715300 0.05663655
## afi                    0.00000000 0.00000000 0.06789420 0.00000000

Notice that data.germany.ibs.similarity is, by definition, a symmetric matrix with ones on the diagonal.
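We can sanity-check the symmetry and unit diagonal on a small synthetic binary matrix built the same way (the 8×5 matrix here is random, not from the dataset; the check assumes no song column is all zeros, otherwise the division would produce NaN):

```r
set.seed(1)
X <- matrix(rbinom(40, 1, 0.5), nrow = 8)  # synthetic 8-user x 5-song play matrix
X[1, ] <- 1                                # ensure every song has at least one play
num <- t(X) %*% X
sq  <- sqrt(diag(num))
sim <- num / (sq %*% t(sq))
isTRUE(all.equal(sim, t(sim)))   # TRUE: symmetric
all(abs(diag(sim) - 1) < 1e-12)  # TRUE: ones on the diagonal
```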

Next, we sort each column of the similarity matrix in decreasing order to get the top 10 neighbours for each song. This is the same process as in the original post, except that I use the pipe operator for readability.

# Placeholder matrix 
data.germany.neighbours = matrix(NA, nrow=ncol(data.germany.ibs.similarity),ncol=11,
                                  dimnames=list(colnames(data.germany.ibs.similarity)))
# Get the names of the song
for(i in 1:ncol(data.germany.ibs)) { # loop through songs
  data.germany.neighbours[i,] = data.germany.ibs.similarity[order(data.germany.ibs.similarity[,i],decreasing=T),][i] %>% 
    rownames %>% head(n=11) %>% t
}
data.germany.neighbours[1:6, 1:4] # head
##                  [,1]               [,2]                   
## a.perfect.circle "a.perfect.circle" "tool"                 
## abba             "abba"             "madonna"              
## ac.dc            "ac.dc"            "red.hot.chili.peppers"
## adam.green       "adam.green"       "the.libertines"       
## aerosmith        "aerosmith"        "u2"                   
## afi              "afi"              "funeral.for.a.friend" 
##                  [,3]              [,4]           
## a.perfect.circle "dredg"           "deftones"     
## abba             "robbie.williams" "elvis.presley"
## ac.dc            "metallica"       "iron.maiden"  
## adam.green       "the.strokes"     "babyshambles" 
## aerosmith        "led.zeppelin"    "metallica"    
## afi              "rise.against"    "fall.out.boy"

User-Based Collaborative Filtering

The steps to calculate the user score are:

  1. Choose an item
  2. Get its top 10 similar items and their similarity scores from the item-based CF part (similⱼ, j = 1, 2, …, 10)
  3. Get each user's consumption record for those similar items (from the original dataset)
  4. Calculate the score as Σⱼ histⱼ · similⱼ / Σⱼ similⱼ, where j runs over the top 10 similar items, histⱼ is the purchase history for item j (retrieved from data.germany) and similⱼ is the similarity to item j (retrieved from data.germany.ibs.similarity)
  5. Zero out the score if the user already consumed that item
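The Step 4 formula for a single user and item works out like this (the three similarity values and the play history below are made-up numbers, not from the dataset):

```r
simil <- c(0.30, 0.25, 0.10)  # similarity to the item's 3 nearest neighbours
hist  <- c(1, 0, 1)           # the user played neighbours 1 and 3
score <- sum(hist * simil) / sum(simil)
score                         # 0.40 / 0.65 = 0.6153846...
```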

First, create a placeholder matrix for the user scores. The size of this matrix is #-of-users × #-of-songs.

data.germany.user.scores = matrix(NA, nrow=nrow(data.germany),ncol=ncol(data.germany)-1,
                                  dimnames=list((data.germany$user),colnames(data.germany[-1])))

Steps 1 to 4 can be processed in a single loop:

for (i in 1:ncol(data.germany.user.scores)){ # loop through songs (Step1)
  # get a product's top 10 neighbours sorted by similarity(Step2)
  topN=data.germany.ibs.similarity[order(data.germany.ibs.similarity[,i], decreasing=T),][i] %>% 
    head(11)
  topN.names=as.character(rownames(topN))
  topN.similarities= as.numeric(topN[,1])
  # Drop the first one because it will always be the same song (Step2)
  topN.names= topN.names[-1]
  topN.similarities= topN.similarities[-1]
  # We then get the user's purchase history for those 10 items (Step3)
  topN.userPurchases= data.germany[,topN.names]
  # Fill the user score matrix (Step4)
  data.germany.user.scores[,i]= as.matrix(topN.userPurchases) %*% topN.similarities / sum(topN.similarities)
}
data.germany.user.scores[1:6,1:6] # head
##    a.perfect.circle       abba      ac.dc adam.green  aerosmith        afi
## 1         0.0000000 0.00000000 0.20440540 0.00000000 0.26617451 0.00000000
## 33        0.0823426 0.00000000 0.09591153 0.29937093 0.08888522 0.00000000
## 42        0.0000000 0.08976655 0.00000000 0.00000000 0.00000000 0.00000000
## 51        0.0823426 0.08356811 0.00000000 0.08350177 0.00000000 0.09295439
## 62        0.0000000 0.00000000 0.11430459 0.09319688 0.08822137 0.09295439
## 75        0.0000000 0.00000000 0.00000000 0.00000000 0.00000000 0.22002398

Notice that the structure of data.germany.user.scores is the same as data.germany, except that the user IDs appear in the first column of data.germany rather than as rownames.

data.germany[1:6, 1:7] 
##   user a.perfect.circle abba ac.dc adam.green aerosmith afi
## 1    1                0    0     0          0         0   0
## 2   33                0    0     0          1         0   0
## 3   42                0    0     0          0         0   0
## 4   51                0    0     0          0         0   0
## 5   62                0    0     0          0         0   0
## 6   75                0    0     0          0         0   0

This implies that if you drop the user ID column from data.germany, you get data.germany.ibs. Therefore data.germany.user.scores[i,j] and data.germany.ibs[i,j] refer to the same user i and the same item j.

Then, make sure that items a user has already consumed are not recommended (i.e., we recommend item j to user i iff `data.germany.ibs[i,j]` is zero).

data.germany.user.scores[data.germany.ibs!=0]= 0 # Step5
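Step 5 in isolation, on a made-up 2×2 example: scores for items a user has already played are set to zero, so they never surface as recommendations.

```r
scores <- matrix(c(0.6, 0.2, 0.1, 0.9), nrow = 2)  # hypothetical user scores
played <- matrix(c(1, 0, 0, 1), nrow = 2)          # 1 = already consumed
scores[played != 0] <- 0
scores  # only the unplayed items keep a nonzero score
```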

Finally, make the recommendation pretty.

# Placeholder. Top 100 songs
data.germany.user.scores.holder = matrix(NA, nrow=nrow(data.germany.user.scores), ncol=100, 
                                         dimnames=list(rownames(data.germany.user.scores)))
for(i in 1:nrow(data.germany.user.scores)) {
    data.germany.user.scores.holder[i,] = 
      data.germany.user.scores[, order(data.germany.user.scores[i,], decreasing=T)][i,] %>% 
      head(n=100) %>% names
}
data.germany.user.scores.holder[1:6,1:4] # head
##    [,1]             [,2]            [,3]                   
## 1  "flogging.molly" "coldplay"      "aerosmith"            
## 33 "peter.fox"      "gentleman"     "red.hot.chili.peppers"
## 42 "oomph."         "lacuna.coil"   "rammstein"            
## 51 "the.subways"    "the.kooks"     "the.hives"            
## 62 "mando.diao"     "the.fratellis" "jack.johnson"         
## 75 "hoobastank"     "papa.roach"    "the.prodigy"          
##    [,4]             
## 1  "the.beatles"    
## 33 "kings.of.leon"  
## 42 "schandmaul"     
## 51 "franz.ferdinand"
## 62 "incubus"        
## 75 "sum.41"

Summary

In this gist, I replicated the user-based collaborative filtering implemented by Salem using vectors and matrices instead of scalars and loops as in the original post. This vectorisation does not change the result and reduces the runtime dramatically (from about 70 minutes to 20 seconds in my environment).

Entire Code

# import library
library(magrittr)

# import data
data.germany = read.csv("http://www.salemmarafi.com/wp-content/uploads/2014/04/lastfm-matrix-germany.csv")

#### Item-Based ####
data.germany.ibs= as.matrix(data.germany[,!(names(data.germany) %in% c("user"))])

# Compute cosine similarity
data.germany.ibs.similarity.num= t(data.germany.ibs) %*% data.germany.ibs
sq_diag=sqrt(diag(data.germany.ibs.similarity.num))
data.germany.ibs.similarity.denom= sq_diag %*% t(sq_diag)
data.germany.ibs.similarity= data.germany.ibs.similarity.num/data.germany.ibs.similarity.denom
# back to dataframe
data.germany.ibs.similarity=as.data.frame(data.germany.ibs.similarity) %>% 
  set_rownames(names(data.germany[-1]))
# neat recommendation
data.germany.neighbours = matrix(NA, nrow=ncol(data.germany.ibs.similarity),ncol=11,
                                  dimnames=list(colnames(data.germany.ibs.similarity)))
for(i in 1:ncol(data.germany.ibs)) {
  data.germany.neighbours[i,] = data.germany.ibs.similarity[order(data.germany.ibs.similarity[,i],decreasing=T),][i] %>% 
    rownames %>% head(n=11) %>% t
}

#### User-Based ####
data.germany.user.scores = matrix(NA, nrow=nrow(data.germany),ncol=ncol(data.germany)-1,
                                  dimnames=list((data.germany$user),colnames(data.germany[-1])))
# compute user score
for (i in 1:ncol(data.germany.user.scores)){ # loop through songs (Step1)
  # get a product's top 10 neighbours sorted by similarity(Step2)
  topN=data.germany.ibs.similarity[order(data.germany.ibs.similarity[,i], decreasing=T),][i] %>% 
    head(11)
  topN.names=as.character(rownames(topN))
  topN.similarities= as.numeric(topN[,1])
  # Drop the first one because it will always be the same song (Step2)
  topN.names= topN.names[-1]
  topN.similarities= topN.similarities[-1]
  # We then get the user's purchase history for those 10 items (Step3)
  topN.userPurchases= data.germany[,topN.names]
  # Fill the user score matrix (Step4)
  data.germany.user.scores[,i]= as.matrix(topN.userPurchases) %*% topN.similarities / sum(topN.similarities)
}

data.germany.user.scores[data.germany.ibs!=0]= 0 # Step5

# Placeholder. Top 100 songs
data.germany.user.scores.holder = matrix(NA, nrow=nrow(data.germany.user.scores), ncol=100, 
                                         dimnames=list(rownames(data.germany.user.scores)))
for(i in 1:nrow(data.germany.user.scores)) {
    data.germany.user.scores.holder[i,] = 
      data.germany.user.scores[, order(data.germany.user.scores[i,], decreasing=T)][i,] %>% 
      head(n=100) %>% names
}
@CDunhill

This sounds just what I need! Was getting similar run times (ie. 1-2 hours). This will hopefully allow me to use a larger data set so thank you for this.

@martinpanacek789

martinpanacek789 commented Aug 11, 2022

I made a slight change to the last part of the code, as the final sorting loop runs very slowly on large datasets. The following runs much faster, and the number of recommendations can be adjusted to any appropriate amount.

sort_fun <- function(x) {
  names(head(sort(x, decreasing = TRUE), n = 10))
}

# apply() over rows returns a 10 x n-users matrix (one column per user)
user_based_holder <- apply(data.germany.user.scores, 1, sort_fun)
