Last active: March 19, 2018 04:47
## Collaborative Filtering by R
This post implements user-based collaborative filtering in R. The idea comes from [the blog post by Salem](http://www.salemmarafi.com/code/collaborative-filtering-r/). I vectorise the code, which significantly reduces the runtime. Visit [his blog](http://www.salemmarafi.com/) for more information.
(\*Output is not visible for Rmarkdown files on GitHub. Please see the [markdown version](https://gist.github.com/yoshiki146/31d4a46c3d8e906c3cd24f425568d34e) or download and knit in your environment.)
#### Import library and dataset
```{r, message=F}
library(magrittr)
```
```{r}
data.germany = read.csv("http://www.salemmarafi.com/wp-content/uploads/2014/04/lastfm-matrix-germany.csv")
data.germany[1:6,1:8] # show head of data
```
#### Item-Based Collaborative Filtering
First, let's implement item-based CF to get the similarity matrix.
[Cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity) is defined as:
$$\text{similarity}= \cos(\theta)= \frac{A \cdot B}{|A||B|} = \frac{\sum_{i=1}^n A_i B_i}{\sqrt{\sum_{i=1}^n A_i^2} \sqrt{\sum_{i=1}^n B_i^2}}$$
where A and B denote vectors of items/songs and $A_i$ represents the i-th element of A (whether user i played song A). To calculate this, I drop the `user` column and convert the dataframe into a matrix.
```{r}
data.germany.ibs= as.matrix(data.germany[,!(names(data.germany) %in% c("user"))])
```
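As a quick sanity check of the definition above, cosine similarity for two songs can be computed directly from the formula. The play vectors below are made up for illustration and are not part of the dataset:

```{r}
# Hypothetical play indicators for two songs across five users
a <- c(1, 0, 1, 1, 0)
b <- c(1, 1, 0, 1, 0)
cos_ab <- sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
cos_ab  # 2 / (sqrt(3) * sqrt(3)) = 2/3
```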
Then $A\cdot B$ (the numerator) is calculated as
```{r}
data.germany.ibs.similarity.num= t(data.germany.ibs) %*% data.germany.ibs
```
To see why, denote the items A, B, ..., Z. Then the numerator in matrix form can be written as:
\[
\left[
\begin{array}{rrrr}
A'A & A'B & \cdots & A'Z \\
B'A & B'B & \cdots & B'Z \\
\vdots & \vdots & \ddots & \vdots \\
Z'A & Z'B & \cdots & Z'Z
\end{array}
\right]
\]
where A' is the transpose of A.
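This identity is easy to verify on a toy matrix. The values below are hypothetical play indicators (three songs as columns, four users as rows), not taken from the dataset:

```{r}
X <- matrix(c(1, 0, 1,
              1, 1, 0,
              0, 1, 1,
              1, 1, 1), nrow = 4, byrow = TRUE)
num <- t(X) %*% X                  # 3x3 matrix of all pairwise dot products
num[1, 2] == sum(X[, 1] * X[, 2])  # TRUE: entry (1,2) is the dot product A.B
```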
Notice that the square root of each diagonal element equals the norm of the corresponding vector ($|A|$). Therefore the denominator is
```{r}
sq_diag=sqrt(diag(data.germany.ibs.similarity.num))
data.germany.ibs.similarity.denom= sq_diag %*% t(sq_diag)
```
Now we can get the similarity matrix by simply dividing $A \cdot B$ by $|A||B|$ (this is an element-wise operation)
```{r}
data.germany.ibs.similarity= data.germany.ibs.similarity.num/data.germany.ibs.similarity.denom
data.germany.ibs.similarity=as.data.frame(data.germany.ibs.similarity) %>%
  set_rownames(names(data.germany[-1]))
data.germany.ibs.similarity[1:6, 1:4] # head
```
Notice that `data.germany.ibs.similarity` is a symmetric matrix with ones on the diagonal by definition.
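Both properties can be checked on a toy matrix built the same way (made-up play indicators, not the real data):

```{r}
X <- matrix(c(1, 0, 1,
              1, 1, 0,
              0, 1, 1,
              1, 1, 1), nrow = 4, byrow = TRUE)
num <- t(X) %*% X
sq_diag <- sqrt(diag(num))
sim <- num / (sq_diag %*% t(sq_diag))
all(abs(sim - t(sim)) < 1e-12)   # TRUE: symmetric
all(abs(diag(sim) - 1) < 1e-12)  # TRUE: ones on the diagonal
```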
Next, we sort the similarities in decreasing order to get the top 10 neighbours for each song. This process is the same as in the original post, except that I prefer the pipe operator for readability.
```{r}
# Placeholder matrix
data.germany.neighbours = matrix(NA, nrow=ncol(data.germany.ibs.similarity),ncol=11,
                                 dimnames=list(colnames(data.germany.ibs.similarity)))
# Get the names of the songs
for(i in 1:ncol(data.germany.ibs)) { # loop through songs
  data.germany.neighbours[i,] = data.germany.ibs.similarity[order(data.germany.ibs.similarity[,i],decreasing=T),][i] %>%
    rownames %>% head(n=11) %>% t
}
data.germany.neighbours[1:6, 1:4] # head
```
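The `order(..., decreasing=T)` call is what drives the neighbour lookup: it returns the indices that sort the similarities from largest to smallest. On a toy named vector (made-up values):

```{r}
sims <- c(a = 0.2, b = 0.9, c = 0.5)
names(sims)[order(sims, decreasing = TRUE)]  # "b" "c" "a"
```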
#### User-Based Collaborative Filtering
The steps to calculate the user score are:
1. Choose an item
2. Get the top 10 similar items and their similarity scores (from the item-based CF part, $simil_j,\ (j=1,2,\dots,10)$)
3. Get the consumption record for those similar items (from the original dataset)
4. Calculate the score $\frac{\sum_{j=1}^n hist_j \times simil_j}{\sum_{j=1}^n simil_j}$, where j indexes the top 10 similar items, $hist_j$ is the purchase history for item j (retrieved from `data.germany`) and $simil_j$ is the similarity to item j (retrieved from `data.germany.ibs.similarity`)
5. Check whether the user has already consumed that item
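The score in Step 4 is just a similarity-weighted average of the user's history. A minimal sketch with made-up similarities and history (three neighbours instead of ten, for brevity):

```{r}
# Hypothetical neighbour similarities and one user's play history
simil <- c(0.80, 0.60, 0.40)
hist  <- c(1, 0, 1)
score <- sum(hist * simil) / sum(simil)
score  # (0.8 + 0.4) / 1.8 = 0.666...
```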
First, create a placeholder matrix for user scores. The size of this matrix is \#-of-users x \#-of-songs.
```{r}
data.germany.user.scores = matrix(NA, nrow=nrow(data.germany),ncol=ncol(data.germany)-1,
                                  dimnames=list((data.germany$user),colnames(data.germany[-1])))
```
Steps 1 to 4 can be processed inside one loop
```{r}
for (i in 1:ncol(data.germany.user.scores)){ # loop through songs (Step 1)
  # get a song's top 10 neighbours sorted by similarity (Step 2)
  topN=data.germany.ibs.similarity[order(data.germany.ibs.similarity[,i], decreasing=T),][i] %>%
    head(11)
  topN.names=as.character(rownames(topN))
  topN.similarities= as.numeric(topN[,1])
  # Drop the first one because it will always be the same song (Step 2)
  topN.names= topN.names[-1]
  topN.similarities= topN.similarities[-1]
  # We then get the user's purchase history for those 10 items (Step 3)
  topN.userPurchases= data.germany[,topN.names]
  # Fill the user score matrix (Step 4)
  data.germany.user.scores[,i]= as.matrix(topN.userPurchases) %*% topN.similarities / sum(topN.similarities)
}
data.germany.user.scores[1:6,1:6] # head
```
Notice that the structure of `data.germany.user.scores` is the same as `data.germany`, except that user IDs are stored in the first column of `data.germany` instead of as rownames.
```{r}
data.germany[1:6, 1:7]
```
This implies that if you drop the user ID column from `data.germany`, you get `data.germany.ibs`. Therefore, `data.germany.user.scores[i,j]` and `data.germany.ibs[i,j]` refer to the same user i and the same item j.
Then, make sure that items are not recommended if a user has already consumed them (i.e., we recommend item j to user i iff `data.germany.ibs[i,j]` is zero).
```{r}
data.germany.user.scores[data.germany.ibs!=0]= 0 # Step 5
```
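The logical-indexing trick used above can be illustrated on toy matrices (all values made up): every position where the consumption matrix is non-zero gets its score zeroed out.

```{r}
scores   <- matrix(c(0.9, 0.2, 0.5, 0.7), nrow = 2)  # hypothetical user scores
consumed <- matrix(c(1, 0, 0, 1), nrow = 2)           # 1 = already consumed
scores[consumed != 0] <- 0
scores  # positions where consumed == 1 are now 0
```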
Finally, make the recommendation pretty.
```{r}
# Placeholder. Top 100 songs
data.germany.user.scores.holder = matrix(NA, nrow=nrow(data.germany.user.scores), ncol=100,
                                         dimnames=list(rownames(data.germany.user.scores)))
for(i in 1:nrow(data.germany.user.scores)) {
  data.germany.user.scores.holder[i,] =
    data.germany.user.scores[, order(data.germany.user.scores[i,], decreasing=T)][i,] %>%
    head(n=100) %>% names
}
data.germany.user.scores.holder[1:6,1:4] # head
```
#### Summary
In this gist, I replicated the user-based collaborative filtering implemented by [Salem](http://www.salemmarafi.com/code/collaborative-filtering-r/) using vectors and matrices instead of scalars and loops as in the original post. This vectorisation does not change the result and dramatically reduces the runtime (from 70 minutes to 20 seconds in my environment).
#### Entire Code
```{r, eval=F}
# import library
library(magrittr)
# import data
data.germany = read.csv("http://www.salemmarafi.com/wp-content/uploads/2014/04/lastfm-matrix-germany.csv")
#### Item-Based ####
data.germany.ibs= as.matrix(data.germany[,!(names(data.germany) %in% c("user"))])
# Compute cosine similarity
data.germany.ibs.similarity.num= t(data.germany.ibs) %*% data.germany.ibs
sq_diag=sqrt(diag(data.germany.ibs.similarity.num))
data.germany.ibs.similarity.denom= sq_diag %*% t(sq_diag)
data.germany.ibs.similarity= data.germany.ibs.similarity.num/data.germany.ibs.similarity.denom
# back to dataframe
data.germany.ibs.similarity=as.data.frame(data.germany.ibs.similarity) %>%
  set_rownames(names(data.germany[-1]))
# neat recommendation
data.germany.neighbours = matrix(NA, nrow=ncol(data.germany.ibs.similarity),ncol=11,
                                 dimnames=list(colnames(data.germany.ibs.similarity)))
for(i in 1:ncol(data.germany.ibs)) {
  data.germany.neighbours[i,] = data.germany.ibs.similarity[order(data.germany.ibs.similarity[,i],decreasing=T),][i] %>%
    rownames %>% head(n=11) %>% t
}
#### User-Based ####
data.germany.user.scores = matrix(NA, nrow=nrow(data.germany),ncol=ncol(data.germany)-1,
                                  dimnames=list((data.germany$user),colnames(data.germany[-1])))
# compute user score
for (i in 1:ncol(data.germany.user.scores)){ # loop through songs (Step 1)
  # get a song's top 10 neighbours sorted by similarity (Step 2)
  topN=data.germany.ibs.similarity[order(data.germany.ibs.similarity[,i], decreasing=T),][i] %>%
    head(11)
  topN.names=as.character(rownames(topN))
  topN.similarities= as.numeric(topN[,1])
  # Drop the first one because it will always be the same song (Step 2)
  topN.names= topN.names[-1]
  topN.similarities= topN.similarities[-1]
  # We then get the user's purchase history for those 10 items (Step 3)
  topN.userPurchases= data.germany[,topN.names]
  # Fill the user score matrix (Step 4)
  data.germany.user.scores[,i]= as.matrix(topN.userPurchases) %*% topN.similarities / sum(topN.similarities)
}
data.germany.user.scores[data.germany.ibs!=0]= 0 # Step 5
# Placeholder. Top 100 songs
data.germany.user.scores.holder = matrix(NA, nrow=nrow(data.germany.user.scores), ncol=100,
                                         dimnames=list(rownames(data.germany.user.scores)))
for(i in 1:nrow(data.germany.user.scores)) {
  data.germany.user.scores.holder[i,] =
    data.germany.user.scores[, order(data.germany.user.scores[i,], decreasing=T)][i,] %>%
    head(n=100) %>% names
}
```