{{ message }}

Instantly share code, notes, and snippets.

# yoshiki146/Collaborative_Filtering.md

Last active Sep 7, 2018

## Collaborative Filtering by R

This post impliments user-based collaborative filtering using R. Idea comes from the blog post by Salem. I vectorise the code thus significantly reducing the runtime. Visit his blog for more information. (http://www.salemmarafi.com/)

#### Import library and dataset

library(magrittr)
data.germany = read.csv("http://www.salemmarafi.com/wp-content/uploads/2014/04/lastfm-matrix-germany.csv")
data.germany[1:6,1:8] # show head of data
##   user a.perfect.circle abba ac.dc adam.green aerosmith afi air
## 1    1                0    0     0          0         0   0   0
## 2   33                0    0     0          1         0   0   0
## 3   42                0    0     0          0         0   0   0
## 4   51                0    0     0          0         0   0   0
## 5   62                0    0     0          0         0   0   0
## 6   75                0    0     0          0         0   0   0


#### Item-Based Collaborative Filtering

Then, let's start implementing item-based CF to get similarity matrix. Cosine similarity is defined as:

where A and B denote vectors of items/songs and Ai represents i-th element of A (whether user i played song A). To calculate, I drop name column and convert dataframe into matrix

data.germany.ibs= as.matrix(data.germany[,!(names(data.germany) %in% c("user"))])

Then A ⋅ B (numerator) is calculated as

data.germany.ibs.similarity.num= t(data.germany.ibs) %*% data.germany.ibs

To understand, let's denote items A, B,..., Z. Then numerator in matrix form can be written as:

where A' is the transpose of A.

Notice that square root of the diagonal element is equivalent to the distance of the vector (|A|). Therefore denominator is

sq_diag=sqrt(diag(data.germany.ibs.similarity.num))
data.germany.ibs.similarity.denom= sq_diag %*% t(sq_diag)

Now we can get similarity matrix by simply dividing A ⋅ B by |A||B| (this is an element-wise operation)

data.germany.ibs.similarity= data.germany.ibs.similarity.num/data.germany.ibs.similarity.denom
data.germany.ibs.similarity=as.data.frame(data.germany.ibs.similarity) %>%
set_rownames(names(data.germany[-1]))
data.germany.ibs.similarity[1:6, 1:4]  # head 
##                  a.perfect.circle       abba      ac.dc adam.green
## a.perfect.circle       1.00000000 0.00000000 0.01791723 0.05155393
## abba                   0.00000000 1.00000000 0.05227877 0.02507061
## ac.dc                  0.01791723 0.05227877 1.00000000 0.11315371
## adam.green             0.05155393 0.02507061 0.11315371 1.00000000
## aerosmith              0.06277648 0.06105625 0.17715300 0.05663655
## afi                    0.00000000 0.00000000 0.06789420 0.00000000


Notice that data.germany.ibs.similarity is a synmetric matrix with ones on diagonal by definition.

Next, we sort the similarity in a decreasing order to get top 10 neighbours for each song. This process is the same as the original post except that I prefer to use pipe operator to look neat.

# Placeholder matrix
data.germany.neighbours = matrix(NA, nrow=ncol(data.germany.ibs.similarity),ncol=11,
dimnames=list(colnames(data.germany.ibs.similarity)))
# Get the names of the song
for(i in 1:ncol(data.germany.ibs)) { # loop through songs
data.germany.neighbours[i,] = data.germany.ibs.similarity[order(data.germany.ibs.similarity[,i],decreasing=T),][i] %>%
}
data.germany.neighbours[1:6, 1:4] # head
##                  [,1]               [,2]
## a.perfect.circle "a.perfect.circle" "tool"
## ac.dc            "ac.dc"            "red.hot.chili.peppers"
## aerosmith        "aerosmith"        "u2"
## afi              "afi"              "funeral.for.a.friend"
##                  [,3]              [,4]
## a.perfect.circle "dredg"           "deftones"
## abba             "robbie.williams" "elvis.presley"
## ac.dc            "metallica"       "iron.maiden"
## aerosmith        "led.zeppelin"    "metallica"
## afi              "rise.against"    "fall.out.boy"


#### User-Based Collaborative Filtering

The steps to calculate the user score are:

1. Choose an item
2. Get the top 10 similar items and its similarity score (from item-based CF part, similj,  (j = 1, 2, …, 10))
3. Get the consumption record to the similar items (from original dataset)
4. Calculate the score , where j is the top 10 similar items, histj is the purchase history for item j (retreived from data.germany) and similj is the similarity to item j (retreive from 'data.germany.ibs.similarity')
5. Check if a user already consumed that item

First, create a placeholder matrix for user score. The size of this matrix is #-of-users x #-of-songs.

data.germany.user.scores = matrix(NA, nrow=nrow(data.germany),ncol=ncol(data.germany)-1,
dimnames=list((data.germany$user),colnames(data.germany[-1]))) Step 1 to 4 can be processed inside one loop for (i in 1:ncol(data.germany.user.scores)){ # loop through songs (Step1) # get a product's top 10 neighbours sorted by similarity(Step2) topN=data.germany.ibs.similarity[order(data.germany.ibs.similarity[,i], decreasing=T),][i] %>% head(11) topN.names=as.character(rownames(topN)) topN.similarities= as.numeric(topN[,1]) # Drop the first one because it will always be the same song (Step2) topN.names= topN.names[-1] topN.similarities= topN.similarities[-1] # We then get the user's purchase history for those 10 items (Step3) topN.userPurchases= data.germany[,topN.names] # Fill the user score matrix (Step4) data.germany.user.scores[,i]= as.matrix(topN.userPurchases) %*% topN[-1,1] / sum(topN[-1,1]) } data.germany.user.scores[1:6,1:6] # head ## a.perfect.circle abba ac.dc adam.green aerosmith afi ## 1 0.0000000 0.00000000 0.20440540 0.00000000 0.26617451 0.00000000 ## 33 0.0823426 0.00000000 0.09591153 0.29937093 0.08888522 0.00000000 ## 42 0.0000000 0.08976655 0.00000000 0.00000000 0.00000000 0.00000000 ## 51 0.0823426 0.08356811 0.00000000 0.08350177 0.00000000 0.09295439 ## 62 0.0000000 0.00000000 0.11430459 0.09319688 0.08822137 0.09295439 ## 75 0.0000000 0.00000000 0.00000000 0.00000000 0.00000000 0.22002398  Notice that the structure of data.germany.user.score is the same as data.germany except that user IDs is in the first column in data.germany instead of rownames. data.germany[1:6, 1:7]  ## user a.perfect.circle abba ac.dc adam.green aerosmith afi ## 1 1 0 0 0 0 0 0 ## 2 33 0 0 0 1 0 0 ## 3 42 0 0 0 0 0 0 ## 4 51 0 0 0 0 0 0 ## 5 62 0 0 0 0 0 0 ## 6 75 0 0 0 0 0 0  This implies that if you drop the user ID from data.gernamy, then you get data.germany.ibs. Therefore, data.germany.user.score[i,j] and data.germany.ibs[i,j] refers to the same user i and the same item j. Then, make sure that items are not recommended if a user has already consumed one. (i.e., we recommend item j for user i iff 'data.germany.ibs[i,j] is zero) data.germany.user.scores[data.germany.ibs!=0]= 0 # Step5 Finally, make the recommendation pretty. # Placeholder. Top 100 songs data.germany.user.scores.holder = matrix(NA, nrow=nrow(data.germany.user.scores), ncol=100, dimnames=list(rownames(data.germany.user.scores))) for(i in 1:nrow(data.germany.user.scores)) { data.germany.user.scores.holder[i,] = data.germany.user.scores[, order(data.germany.user.scores[i,], decreasing=T)][i,] %>% head(n=100) %>% names } data.germany.user.scores.holder[1:6,1:4] # head ## [,1] [,2] [,3] ## 1 "flogging.molly" "coldplay" "aerosmith" ## 33 "peter.fox" "gentleman" "red.hot.chili.peppers" ## 42 "oomph." "lacuna.coil" "rammstein" ## 51 "the.subways" "the.kooks" "the.hives" ## 62 "mando.diao" "the.fratellis" "jack.johnson" ## 75 "hoobastank" "papa.roach" "the.prodigy" ## [,4] ## 1 "the.beatles" ## 33 "kings.of.leon" ## 42 "schandmaul" ## 51 "franz.ferdinand" ## 62 "incubus" ## 75 "sum.41"  #### Summary In this gist, I replicated the use-based collaborative filtering implemented by Salem using vectors and matrices, instead of scalers and loops as in the original post. I have found that this vectorisation does not change the result and reduces the runtime more than (from 70 minutes to 20 seconds in my environment) #### Entire Code # import library library(magrittr) # import data data.germany = read.csv("http://www.salemmarafi.com/wp-content/uploads/2014/04/lastfm-matrix-germany.csv") #### Item-Based #### data.germany.ibs= as.matrix(data.germany[,!(names(data.germany) %in% c("user"))]) # Compute cosine similarity data.germany.ibs.similarity.num= t(data.germany.ibs) %*% data.germany.ibs sq_diag=sqrt(diag(data.germany.ibs.similarity.num)) data.germany.ibs.similarity.denom= sq_diag %*% t(sq_diag) data.germany.ibs.similarity= data.germany.ibs.similarity.num/data.germany.ibs.similarity.denom # back to dataframe data.germany.ibs.similarity=as.data.frame(data.germany.ibs.similarity) %>% set_rownames(names(data.germany[-1])) # neat recommendation data.germany.neighbours = matrix(NA, nrow=ncol(data.germany.ibs.similarity),ncol=11, dimnames=list(colnames(data.germany.ibs.similarity))) for(i in 1:ncol(data.germany.ibs)) { data.germany.neighbours[i,] = data.germany.ibs.similarity[order(data.germany.ibs.similarity[,i],decreasing=T),][i] %>% rownames %>% head(n=11) %>% t } #### User-Based #### data.germany.user.scores = matrix(NA, nrow=nrow(data.germany),ncol=ncol(data.germany)-1, dimnames=list((data.germany$user),colnames(data.germany[-1])))
# compute user score
for (i in 1:ncol(data.germany.user.scores)){ # loop through songs (Step1)
# get a product's top 10 neighbours sorted by similarity(Step2)
topN=data.germany.ibs.similarity[order(data.germany.ibs.similarity[,i], decreasing=T),][i] %>%
topN.names=as.character(rownames(topN))
topN.similarities= as.numeric(topN[,1])
# Drop the first one because it will always be the same song (Step2)
topN.names= topN.names[-1]
topN.similarities= topN.similarities[-1]
# We then get the user's purchase history for those 10 items (Step3)
topN.userPurchases= data.germany[,topN.names]
# Fill the user score matrix (Step4)
data.germany.user.scores[,i]= as.matrix(topN.userPurchases) %*% topN[-1,1] / sum(topN[-1,1])
}

data.germany.user.scores[data.germany.ibs!=0]= 0 # Step5

# Placeholder. Top 100 songs
data.germany.user.scores.holder = matrix(NA, nrow=nrow(data.germany.user.scores), ncol=100,
dimnames=list(rownames(data.germany.user.scores)))
for(i in 1:nrow(data.germany.user.scores)) {
data.germany.user.scores.holder[i,] =
data.germany.user.scores[, order(data.germany.user.scores[i,], decreasing=T)][i,] %>%
}`