Skip to content

Instantly share code, notes, and snippets.

@yoshiki146
Last active March 19, 2018 04:47
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save yoshiki146/21cc9a76c366c880b25e1e4c10bc44fd to your computer and use it in GitHub Desktop.
Save yoshiki146/21cc9a76c366c880b25e1e4c10bc44fd to your computer and use it in GitHub Desktop.
## Collaborative Filtering by R
This post impliments user-based collaborative filtering using R. Idea comes from [the blog post by Salem](http://www.salemmarafi.com/code/collaborative-filtering-r/). I vectorise the code thus significantly reducing the runtime. Visit his blog for more information. (http://www.salemmarafi.com/)
(\*Output is not visible for Rmarkdown file in Github. Please find [markdown version](https://gist.github.com/yoshiki146/31d4a46c3d8e906c3cd24f425568d34e) or download and knit in your environment)
#### Import library and dataset
```{r, message=F}
library(magrittr)
```
```{r}
data.germany = read.csv("http://www.salemmarafi.com/wp-content/uploads/2014/04/lastfm-matrix-germany.csv")
data.germany[1:6,1:8] # show head of data
```
#### Item-Based Collaborative Filtering
Then, let's start implementing item-based CF to get similarity matrix.
[Cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity) is defined as:
$$\text{similarity}= cos(\theta)= \frac{A \cdot B}{|A||B|} = \frac{\sum_{i=1}^n A_i B_i}{\sqrt{\sum_{i-1}^n A_i^2} \sqrt{\sum_{i-1}^n B_i^2}}$$
where A and B denote vectors of items/songs and $A_i$ represents i-th element of A (whether user i played song A). To calculate, I drop `name` column and convert dataframe into matrix
```{r}
data.germany.ibs= as.matrix(data.germany[,!(names(data.germany) %in% c("user"))])
```
Then $A\cdot B$ (numerator) is calculated as
```{r}
data.germany.ibs.similarity.num= t(data.germany.ibs) %*% data.germany.ibs
```
To understand, let's denote items A, B,..., Z. Then numerator in matrix form can be written as:
\[
\left[
\begin{array}{rrrr}
A'A & A'B & \cdots & A'Z \\
B'A & B'B & \cdots & B'Z \\
\vdots & \vdots & \ddots & \vdots \\
Z'A & Z'B & \cdots & Z'Z
\end{array}
\right]
\]
where A' is the transpose of A.
Notice that square root of the diagonal element is equivalent to the distance of the vector (|A|). Therefore denominator is
```{r}
sq_diag=sqrt(diag(data.germany.ibs.similarity.num))
data.germany.ibs.similarity.denom= sq_diag %*% t(sq_diag)
```
Now we can get similarity matrix by simply dividing $A \cdot B$ by $|A||B|$ (this is an element-wise operation)
```{r}
data.germany.ibs.similarity= data.germany.ibs.similarity.num/data.germany.ibs.similarity.denom
data.germany.ibs.similarity=as.data.frame(data.germany.ibs.similarity) %>%
set_rownames(names(data.germany[-1]))
data.germany.ibs.similarity[1:6, 1:4] # head
```
Notice that `data.germany.ibs.similarity` is a synmetric matrix with ones on diagonal by definition.
Next, we sort the similarity in a decreasing order to get top 10 neighbours for each song. This process is the same as the original post except that I prefer to use pipe operator to look neat.
```{r}
# Placeholder matrix
data.germany.neighbours = matrix(NA, nrow=ncol(data.germany.ibs.similarity),ncol=11,
dimnames=list(colnames(data.germany.ibs.similarity)))
# Get the names of the song
for(i in 1:ncol(data.germany.ibs)) { # loop through songs
data.germany.neighbours[i,] = data.germany.ibs.similarity[order(data.germany.ibs.similarity[,i],decreasing=T),][i] %>%
rownames %>% head(n=11) %>% t
}
data.germany.neighbours[1:6, 1:4] # head
```
#### User-Based Collaborative Filtering
The steps to calculate the user score are:
1. Choose an item
2. Get the top 10 similar items and its similarity score (from item-based CF part, $simil_j,\ (j=1,2,\dots,10)$)
3. Get the consumption record to the similar items (from original dataset)
4. Calculate the score $\frac{\sum_{j=1}^n hist_j \times simil_j}{\sum_{j=1}^n simil_j}$, where j is the top 10 similar items, $hist_j$ is the purchase history for item j (retreived from `data.germany`) and $simil_j$ is the similarity to item j (retreive from 'data.germany.ibs.similarity')
5. Check if a user already consumed that item
First, create a placeholder matrix for user score. The size of this matrix is \#-of-users x \#-of-songs.
```{r}
data.germany.user.scores = matrix(NA, nrow=nrow(data.germany),ncol=ncol(data.germany)-1,
dimnames=list((data.germany$user),colnames(data.germany[-1])))
```
Step 1 to 4 can be processed inside one loop
```{r}
for (i in 1:ncol(data.germany.user.scores)){ # loop through songs (Step1)
# get a product's top 10 neighbours sorted by similarity(Step2)
topN=data.germany.ibs.similarity[order(data.germany.ibs.similarity[,i], decreasing=T),][i] %>%
head(11)
topN.names=as.character(rownames(topN))
topN.similarities= as.numeric(topN[,1])
# Drop the first one because it will always be the same song (Step2)
topN.names= topN.names[-1]
topN.similarities= topN.similarities[-1]
# We then get the user's purchase history for those 10 items (Step3)
topN.userPurchases= data.germany[,topN.names]
# Fill the user score matrix (Step4)
data.germany.user.scores[,i]= as.matrix(topN.userPurchases) %*% topN[-1,1] / sum(topN[-1,1])
}
data.germany.user.scores[1:6,1:6] # head
```
Notice that the structure of `data.germany.user.score` is the same as `data.germany` except that user IDs is in the first column in `data.germany` instead of rownames.
```{r}
data.germany[1:6, 1:7]
```
This implies that if you drop the user ID from `data.gernamy`, then you get `data.germany.ibs`. Therefore, `data.germany.user.score[i,j]` and `data.germany.ibs[i,j]` refers to the same user i and the same item j.
Then, make sure that items are not recommended if a user has already consumed one. (i.e., we recommend item j for user i iff 'data.germany.ibs[i,j]` is zero)
```{r}
data.germany.user.scores[data.germany.ibs!=0]= 0 # Step5
```
Finally, make the recommendation pretty.
```{r}
# Placeholder. Top 100 songs
data.germany.user.scores.holder = matrix(NA, nrow=nrow(data.germany.user.scores), ncol=100,
dimnames=list(rownames(data.germany.user.scores)))
for(i in 1:nrow(data.germany.user.scores)) {
data.germany.user.scores.holder[i,] =
data.germany.user.scores[, order(data.germany.user.scores[i,], decreasing=T)][i,] %>%
head(n=100) %>% names
}
data.germany.user.scores.holder[1:6,1:4] # head
```
#### Summary
In this gist, I replicated the use-based collaborative filtering implemented by [Salem](http://www.salemmarafi.com/code/collaborative-filtering-r/) using vectors and matrices, instead of scalers and loops as in the original post. I have found that this vectorisation does not change the result and reduces the runtime more than (from 70 minutes to 20 seconds in my environment)
#### Entire Code
```{r, eval=F}
# import library
library(magrittr)
# import data
data.germany = read.csv("http://www.salemmarafi.com/wp-content/uploads/2014/04/lastfm-matrix-germany.csv")
#### Item-Based ####
data.germany.ibs= as.matrix(data.germany[,!(names(data.germany) %in% c("user"))])
# Compute cosine similarity
data.germany.ibs.similarity.num= t(data.germany.ibs) %*% data.germany.ibs
sq_diag=sqrt(diag(data.germany.ibs.similarity.num))
data.germany.ibs.similarity.denom= sq_diag %*% t(sq_diag)
data.germany.ibs.similarity= data.germany.ibs.similarity.num/data.germany.ibs.similarity.denom
# back to dataframe
data.germany.ibs.similarity=as.data.frame(data.germany.ibs.similarity) %>%
set_rownames(names(data.germany[-1]))
# neat recommendation
data.germany.neighbours = matrix(NA, nrow=ncol(data.germany.ibs.similarity),ncol=11,
dimnames=list(colnames(data.germany.ibs.similarity)))
for(i in 1:ncol(data.germany.ibs)) {
data.germany.neighbours[i,] = data.germany.ibs.similarity[order(data.germany.ibs.similarity[,i],decreasing=T),][i] %>%
rownames %>% head(n=11) %>% t
}
#### User-Based ####
data.germany.user.scores = matrix(NA, nrow=nrow(data.germany),ncol=ncol(data.germany)-1,
dimnames=list((data.germany$user),colnames(data.germany[-1])))
# compute user score
for (i in 1:ncol(data.germany.user.scores)){ # loop through songs (Step1)
# get a product's top 10 neighbours sorted by similarity(Step2)
topN=data.germany.ibs.similarity[order(data.germany.ibs.similarity[,i], decreasing=T),][i] %>%
head(11)
topN.names=as.character(rownames(topN))
topN.similarities= as.numeric(topN[,1])
# Drop the first one because it will always be the same song (Step2)
topN.names= topN.names[-1]
topN.similarities= topN.similarities[-1]
# We then get the user's purchase history for those 10 items (Step3)
topN.userPurchases= data.germany[,topN.names]
# Fill the user score matrix (Step4)
data.germany.user.scores[,i]= as.matrix(topN.userPurchases) %*% topN[-1,1] / sum(topN[-1,1])
}
data.germany.user.scores[data.germany.ibs!=0]= 0 # Step5
# Placeholder. Top 100 songs
data.germany.user.scores.holder = matrix(NA, nrow=nrow(data.germany.user.scores), ncol=100,
dimnames=list(rownames(data.germany.user.scores)))
for(i in 1:nrow(data.germany.user.scores)) {
data.germany.user.scores.holder[i,] =
data.germany.user.scores[, order(data.germany.user.scores[i,], decreasing=T)][i,] %>%
head(n=100) %>% names
}
```
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment