This post implements user-based collaborative filtering in R. The idea comes from a blog post by Salem; I vectorise his code, which significantly reduces the runtime. Visit his blog for more information. (http://www.salemmarafi.com/)
You can download Collaborative_Filtering.Rmd to run the code in your environment.
library(magrittr)
data.germany = read.csv("http://www.salemmarafi.com/wp-content/uploads/2014/04/lastfm-matrix-germany.csv")
data.germany[1:6,1:8] # show head of data
## user a.perfect.circle abba ac.dc adam.green aerosmith afi air
## 1 1 0 0 0 0 0 0 0
## 2 33 0 0 0 1 0 0 0
## 3 42 0 0 0 0 0 0 0
## 4 51 0 0 0 0 0 0 0
## 5 62 0 0 0 0 0 0 0
## 6 75 0 0 0 0 0 0 0
Then, let's start implementing item-based CF to get the similarity matrix. Cosine similarity is defined as:

similarity(A, B) = (A ⋅ B) / (|A| |B|) = Σᵢ AᵢBᵢ / (√(Σᵢ Aᵢ²) √(Σᵢ Bᵢ²))

where A and B denote item/song vectors and Aᵢ represents the i-th element of A (whether user i played song A). To calculate this, I drop the user column and convert the data frame into a matrix
data.germany.ibs= as.matrix(data.germany[,!(names(data.germany) %in% c("user"))])
Then A ⋅ B (numerator) is calculated as
data.germany.ibs.similarity.num= t(data.germany.ibs) %*% data.germany.ibs
To see why, denote the items A, B, …, Z and stack them as the columns of a matrix M = [A B … Z]. The numerator for every pair of items is then given by the single matrix product M′M, where M′ is the transpose of M: its (i, j) entry is the dot product of item i with item j.
Notice that each diagonal element of this matrix is A ⋅ A = |A|², so its square root is the norm (length) of the corresponding item vector, |A|. Therefore the denominator is
sq_diag=sqrt(diag(data.germany.ibs.similarity.num))
data.germany.ibs.similarity.denom= sq_diag %*% t(sq_diag)
Now we can get similarity matrix by simply dividing A ⋅ B by |A||B| (this is an element-wise operation)
data.germany.ibs.similarity= data.germany.ibs.similarity.num/data.germany.ibs.similarity.denom
data.germany.ibs.similarity=as.data.frame(data.germany.ibs.similarity) %>%
set_rownames(names(data.germany[-1]))
data.germany.ibs.similarity[1:6, 1:4] # head
## a.perfect.circle abba ac.dc adam.green
## a.perfect.circle 1.00000000 0.00000000 0.01791723 0.05155393
## abba 0.00000000 1.00000000 0.05227877 0.02507061
## ac.dc 0.01791723 0.05227877 1.00000000 0.11315371
## adam.green 0.05155393 0.02507061 0.11315371 1.00000000
## aerosmith 0.06277648 0.06105625 0.17715300 0.05663655
## afi 0.00000000 0.00000000 0.06789420 0.00000000
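As a sanity check, the vectorised similarity can be verified against an explicit scalar computation on a toy play matrix (the 3×2 matrix below is made up for illustration):

```r
# Toy play matrix: 3 users x 2 songs (made-up data for illustration)
m <- matrix(c(1, 0,
              1, 1,
              0, 1), nrow = 3, byrow = TRUE)

# Vectorised cosine similarity, as in the post
num <- t(m) %*% m
sq  <- sqrt(diag(num))
S   <- num / (sq %*% t(sq))

# Scalar computation for song 1 vs song 2
a <- m[, 1]; b <- m[, 2]
manual <- sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

all.equal(S[1, 2], manual)  # TRUE
```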
Notice that data.germany.ibs.similarity
is a symmetric matrix with ones on the diagonal, by definition.
Next, we sort each column of the similarity matrix in decreasing order to get the top 10 neighbours for each song. This process is the same as in the original post, except that I prefer the pipe operator for readability.
# Placeholder matrix
data.germany.neighbours = matrix(NA, nrow=ncol(data.germany.ibs.similarity),ncol=11,
dimnames=list(colnames(data.germany.ibs.similarity)))
# Get the names of the songs
for(i in 1:ncol(data.germany.ibs)) { # loop through songs
  data.germany.neighbours[i,] = data.germany.ibs.similarity[order(data.germany.ibs.similarity[,i], decreasing=T),][i] %>%
    rownames %>% head(n=11) %>% t
}
data.germany.neighbours[1:6, 1:4] # head
## [,1] [,2]
## a.perfect.circle "a.perfect.circle" "tool"
## abba "abba" "madonna"
## ac.dc "ac.dc" "red.hot.chili.peppers"
## adam.green "adam.green" "the.libertines"
## aerosmith "aerosmith" "u2"
## afi "afi" "funeral.for.a.friend"
## [,3] [,4]
## a.perfect.circle "dredg" "deftones"
## abba "robbie.williams" "elvis.presley"
## ac.dc "metallica" "iron.maiden"
## adam.green "the.strokes" "babyshambles"
## aerosmith "led.zeppelin" "metallica"
## afi "rise.against" "fall.out.boy"
The steps to calculate the user score are:
- Choose an item
- Get its top 10 similar items and their similarity scores (from the item-based CF part, simⱼ, j = 1, 2, …, 10)
- Get the consumption records for those similar items (from the original dataset)
- Calculate the score:

score = Σⱼ histⱼ simⱼ / Σⱼ simⱼ

where j runs over the top 10 similar items, histⱼ is the purchase history for item j (retrieved from
data.germany
) and simⱼ is the similarity to item j (retrieved from data.germany.ibs.similarity)
- Check whether the user has already consumed that item
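On made-up numbers, the score is simply a similarity-weighted average of the purchase history:

```r
# Hypothetical values for one user and the top 3 neighbours of some item
played <- c(1, 0, 1)         # did the user play each neighbour song?
sim    <- c(0.6, 0.5, 0.3)   # similarity of each neighbour to the item

score <- sum(played * sim) / sum(sim)
score  # (0.6 + 0.3) / 1.4 = 0.6428571
```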
First, create a placeholder matrix for user score. The size of this matrix is #-of-users x #-of-songs.
data.germany.user.scores = matrix(NA, nrow=nrow(data.germany),ncol=ncol(data.germany)-1,
dimnames=list((data.germany$user),colnames(data.germany[-1])))
Steps 1 to 4 can be processed inside one loop
for (i in 1:ncol(data.germany.user.scores)){ # loop through songs (Step 1)
  # get the product's top 10 neighbours sorted by similarity (Step 2)
  topN = data.germany.ibs.similarity[order(data.germany.ibs.similarity[,i], decreasing=T),][i] %>%
    head(11)
  topN.names = as.character(rownames(topN))
  topN.similarities = as.numeric(topN[,1])
  # Drop the first one because it will always be the same song (Step 2)
  topN.names = topN.names[-1]
  topN.similarities = topN.similarities[-1]
  # We then get the users' purchase histories for those 10 items (Step 3)
  topN.userPurchases = data.germany[,topN.names]
  # Fill the user score matrix (Step 4)
  data.germany.user.scores[,i] = as.matrix(topN.userPurchases) %*% topN.similarities / sum(topN.similarities)
}
data.germany.user.scores[1:6,1:6] # head
## a.perfect.circle abba ac.dc adam.green aerosmith afi
## 1 0.0000000 0.00000000 0.20440540 0.00000000 0.26617451 0.00000000
## 33 0.0823426 0.00000000 0.09591153 0.29937093 0.08888522 0.00000000
## 42 0.0000000 0.08976655 0.00000000 0.00000000 0.00000000 0.00000000
## 51 0.0823426 0.08356811 0.00000000 0.08350177 0.00000000 0.09295439
## 62 0.0000000 0.00000000 0.11430459 0.09319688 0.08822137 0.09295439
## 75 0.0000000 0.00000000 0.00000000 0.00000000 0.00000000 0.22002398
Notice that the structure of data.germany.user.scores
is the same as data.germany
, except that in data.germany
the user IDs are in the first column rather than in the row names.
data.germany[1:6, 1:7]
## user a.perfect.circle abba ac.dc adam.green aerosmith afi
## 1 1 0 0 0 0 0 0
## 2 33 0 0 0 1 0 0
## 3 42 0 0 0 0 0 0
## 4 51 0 0 0 0 0 0
## 5 62 0 0 0 0 0 0
## 6 75 0 0 0 0 0 0
This implies that if you drop the user ID column from data.germany
, then you get data.germany.ibs
. Therefore, data.germany.user.scores[i,j]
and data.germany.ibs[i,j]
refer to the same user i and the same item j.
Then, make sure that items are not recommended if a user has already consumed them (i.e., we recommend item j to user i iff `data.germany.ibs[i,j]` is zero)
data.germany.user.scores[data.germany.ibs!=0]= 0 # Step5
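This masking step indexes one matrix with a logical matrix of the same dimensions; on a made-up 2×2 example it works like this:

```r
# Made-up 2x2 example: predicted scores and play history for 2 users x 2 songs
scores <- matrix(c(0.4, 0.2,
                   0.1, 0.9), nrow = 2, byrow = TRUE)
played <- matrix(c(1, 0,
                   0, 1), nrow = 2, byrow = TRUE)

scores[played != 0] <- 0  # zero out songs each user already played
scores
#      [,1] [,2]
# [1,]  0.0  0.2
# [2,]  0.1  0.0
```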
Finally, make the recommendation pretty.
# Placeholder. Top 100 songs
data.germany.user.scores.holder = matrix(NA, nrow=nrow(data.germany.user.scores), ncol=100,
dimnames=list(rownames(data.germany.user.scores)))
for(i in 1:nrow(data.germany.user.scores)) {
data.germany.user.scores.holder[i,] =
data.germany.user.scores[, order(data.germany.user.scores[i,], decreasing=T)][i,] %>%
head(n=100) %>% names
}
data.germany.user.scores.holder[1:6,1:4] # head
## [,1] [,2] [,3]
## 1 "flogging.molly" "coldplay" "aerosmith"
## 33 "peter.fox" "gentleman" "red.hot.chili.peppers"
## 42 "oomph." "lacuna.coil" "rammstein"
## 51 "the.subways" "the.kooks" "the.hives"
## 62 "mando.diao" "the.fratellis" "jack.johnson"
## 75 "hoobastank" "papa.roach" "the.prodigy"
## [,4]
## 1 "the.beatles"
## 33 "kings.of.leon"
## 42 "schandmaul"
## 51 "franz.ferdinand"
## 62 "incubus"
## 75 "sum.41"
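The key idiom inside these loops is `order()`, which returns the permutation that sorts a vector; indexing the names by it gives a ranked list. A small example with hypothetical scores:

```r
# Hypothetical scores for one user across four songs
user.scores <- c(abba = 0.1, ac.dc = 0.7, afi = 0.0, air = 0.3)

# Song names sorted by descending score
names(user.scores)[order(user.scores, decreasing = TRUE)]
# "ac.dc" "air" "abba" "afi"
```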
In this gist, I replicated the user-based collaborative filtering implemented by Salem using vectors and matrices instead of scalars and loops as in the original post. I have found that this vectorisation does not change the result and reduces the runtime dramatically (from 70 minutes to 20 seconds in my environment).
# import library
library(magrittr)
# import data
data.germany = read.csv("http://www.salemmarafi.com/wp-content/uploads/2014/04/lastfm-matrix-germany.csv")
#### Item-Based ####
data.germany.ibs= as.matrix(data.germany[,!(names(data.germany) %in% c("user"))])
# Compute cosine similarity
data.germany.ibs.similarity.num= t(data.germany.ibs) %*% data.germany.ibs
sq_diag=sqrt(diag(data.germany.ibs.similarity.num))
data.germany.ibs.similarity.denom= sq_diag %*% t(sq_diag)
data.germany.ibs.similarity= data.germany.ibs.similarity.num/data.germany.ibs.similarity.denom
# back to dataframe
data.germany.ibs.similarity=as.data.frame(data.germany.ibs.similarity) %>%
set_rownames(names(data.germany[-1]))
# neat recommendation
data.germany.neighbours = matrix(NA, nrow=ncol(data.germany.ibs.similarity),ncol=11,
dimnames=list(colnames(data.germany.ibs.similarity)))
for(i in 1:ncol(data.germany.ibs)) {
  data.germany.neighbours[i,] = data.germany.ibs.similarity[order(data.germany.ibs.similarity[,i], decreasing=T),][i] %>%
    rownames %>% head(n=11) %>% t
}
#### User-Based ####
data.germany.user.scores = matrix(NA, nrow=nrow(data.germany),ncol=ncol(data.germany)-1,
dimnames=list((data.germany$user),colnames(data.germany[-1])))
# compute user score
for (i in 1:ncol(data.germany.user.scores)){ # loop through songs (Step 1)
  # get the product's top 10 neighbours sorted by similarity (Step 2)
  topN = data.germany.ibs.similarity[order(data.germany.ibs.similarity[,i], decreasing=T),][i] %>%
    head(11)
  topN.names = as.character(rownames(topN))
  topN.similarities = as.numeric(topN[,1])
  # Drop the first one because it will always be the same song (Step 2)
  topN.names = topN.names[-1]
  topN.similarities = topN.similarities[-1]
  # We then get the users' purchase histories for those 10 items (Step 3)
  topN.userPurchases = data.germany[,topN.names]
  # Fill the user score matrix (Step 4)
  data.germany.user.scores[,i] = as.matrix(topN.userPurchases) %*% topN.similarities / sum(topN.similarities)
}
data.germany.user.scores[data.germany.ibs!=0]= 0 # Step5
# Placeholder. Top 100 songs
data.germany.user.scores.holder = matrix(NA, nrow=nrow(data.germany.user.scores), ncol=100,
dimnames=list(rownames(data.germany.user.scores)))
for(i in 1:nrow(data.germany.user.scores)) {
data.germany.user.scores.holder[i,] =
data.germany.user.scores[, order(data.germany.user.scores[i,], decreasing=T)][i,] %>%
head(n=100) %>% names
}