Recommender job. This is an R Markdown document. Open it in RStudio and knit it to HTML, or run something like: `library(knitr); knit2html(input='report.Rmd', output='report.html')`.
Item-based Collaborative Filtering
========================================================
The task is to use the 3003377s-offsides.tsv file of drugs and side effects (url below) and use the R **recommenderlab library** or similar to create code that will let us do item-based collaborative filtering, which means:

* we can identify drugs with similar side effects
* we can possibly predict unknown side effects by finding the average relative risk of the side effects of the K-nearest neighbors under some similarity metric.

Basically this means building a matrix where cells are the relative risk (the "rr" or "log2rr" column), columns are side effects, and rows are drugs (using either the "stitch_id" or the "drug" column). Then we can randomly omit values from cells and ask which distance metric, or which transformation of the relative risk, lets us predict the omitted values most accurately.
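A minimal sketch of that matrix layout (assuming the `stitch_id`, `umls_id` and `rr` columns described above, and that the data has already been read into a data frame `dat` as in the next section; not evaluated when knitting):

```{r matrixSketch, eval=FALSE}
## rows = drugs, columns = side effects, cells = relative risk
library(reshape2)
wide <- dcast(dat, stitch_id ~ umls_id, value.var = "rr")
rownames(wide) <- wide$stitch_id
mat <- as.matrix(wide[, -1])   # NA marks unobserved drug/side-effect pairs
```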
Reading data and levelplot
----------------
I read the data using `fread` from **data.table**, which is faster than `read.table`.
```{r readData,message=FALSE}
## you should change the path here
fileName <- "C:/Users/agbranding/Downloads/3003377s-offsides.txt"
library(data.table)
dat <- fread(fileName,verbose=FALSE)
dat <- as.data.frame(dat)
```
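To back the speed claim, one can time both readers on the same file (a rough sketch, assuming a tab-separated file with a header row; not evaluated when knitting):

```{r benchRead, eval=FALSE}
## rough timing comparison; fread usually wins by a wide margin on large files
system.time(read.table(fileName, header = TRUE, sep = "\t"))
system.time(fread(fileName))
```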
My first idea is to do some visual inspection using lattice's `levelplot`, to get some intuition about how sparse the data is.
```{r levelplot,message=FALSE}
library(latticeExtra)
items <- sample(dat$umls_id, 100, rep = FALSE)
users <- sample(dat$stitch_id, 100, rep = FALSE)
levelplot(rr ~ factor(stitch_id) * factor(umls_id),
          data = subset(dat, umls_id %in% items &
                             stitch_id %in% users),
          scales = list(x = list(rot = 90, cex = 0.8)))
```
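The levelplot only looks at a 100 × 100 sample. To put a single number on the sparsity, we can compare the observed drug/side-effect pairs against the full grid (a minimal sketch):

```{r sparsity, eval=FALSE}
## fraction of the full drug x side-effect grid that is actually observed
n.drugs  <- length(unique(dat$stitch_id))
n.events <- length(unique(dat$umls_id))
nrow(dat) / (n.drugs * n.events)
```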
recommenderlab package
------------------------------
### Inspection of data set properties
```{r histo,message=FALSE}
library(recommenderlab)
## construct rating matrix
m <- dat[,c('stitch_id','umls_id','rr')]
r <- as(m, "realRatingMatrix")
## normalize
r_m <- normalize(r)
## distributions
layout(rbind(c(1,1),c(2,3),c(4,5)))
hist(getRatings(r), breaks=100,main='risk distribution')
hist(getRatings(r_m), breaks=100,main='normalized risk distribution')
hist(getRatings(normalize(r, method = "Z-score")),
     breaks=100, main='Risk normalized with z-score')
hist(colMeans(r), breaks=20, main='Average risk per drug')
hist(rowCounts(r), breaks=20, main='Number of events per drug')
layout(c(1))
```
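Note that `normalize(r)` with the default method centers each row (drug) by subtracting its mean risk, which is why the normalized distribution is symmetric around zero. A quick check of that interpretation:

```{r checkCentering, eval=FALSE}
## after centering, the row means should all be (close to) zero
summary(rowMeans(normalize(r)))
```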
### Creating a recommender
Next, we create a recommender which generates recommendations solely based on the popularity of items (the number of users who have the item in their profile). We build the recommender from the first 1000 users (i.e. 1000 drugs) of the data set as training data.
```{r model,message=FALSE}
## TODO test with IBCF , RANDOM ,UBCF
r.fit <- Recommender(r[1:1000], method = "POPULAR")
```
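The fitted recommender can be inspected with `getModel()`, which for POPULAR exposes the components used to produce the top-N lists:

```{r inspectModel, eval=FALSE}
## peek at the internal model of the POPULAR recommender
names(getModel(r.fit))
```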
Recommendations are generated with `predict()`. The result is a set of recommendations in the form of an object of class `TopNList`. Here we create top-5 recommendation lists for two users who were not used to learn the model.
```{r model.predict,message=FALSE}
recom <- predict(r.fit, r[1001:1002], n=5)
as(recom, "list")
```
Many recommender algorithms can also predict ratings. This is also implemented using
predict() with the parameter type set to "ratings".
```{r predict.ratings,message=FALSE}
recom <- predict(r.fit, r[1001:1002], type="ratings")
as(recom, "matrix")[,1:10]
```
Predicted ratings are returned as an object of realRatingMatrix. The prediction contains
NA for the items rated by the active users. In the example we show the predicted ratings for
the first 10 items for the two active users.
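We can verify that claim directly: the prediction should be NA exactly where an active user already has an observed risk (a minimal sketch):

```{r checkNA, eval=FALSE}
## items already rated by user 1001 must come back as NA in the prediction
pred  <- as(recom, "matrix")
known <- as(r[1001], "matrix")
all(is.na(pred[1, !is.na(known[1, ])]))
```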
### Evaluation of predicted ratings
1. Create an evaluation scheme that determines what and how data is used for training and evaluation. Here we create an evaluation scheme which splits the first 100 users into a training set (90%) and a test set (10%). For the test set, 15 items are given to the recommender algorithm and the remaining items are held out for computing the error.
2. We compute predicted ratings for the known part of the test data (15 items for each
user) using the two algorithms.
3. Finally, we can calculate the error between the prediction and the unknown part of the
test data.
```{r schema}
# e <- evaluationScheme(r[1:100], method="split", train=0.9, given=15)
# methods <- c('UBCF','IBCF','POPULAR','RANDOM')
# error <- lapply(methods, function(x){
#   r1 <- Recommender(getData(e, "train"), x)
#   p1 <- predict(r1, getData(e, "known"), type="ratings")
#   ## note: the accuracy function in recommenderlab is calcPredictionAccuracy
#   calcPredictionAccuracy(p1, getData(e, "unknown"))
# })
# names(error) <- methods
# error
```
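`calcPredictionAccuracy()` returns RMSE, MSE and MAE per method, so once the loop above has been run the algorithms can be compared side by side:

```{r compareError, eval=FALSE}
## assumes the (commented) evaluation loop above has been run
do.call(rbind, error)
```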
### Comparing recommender algorithms
For the comparison, `evaluate()` is used again. The only change is to call `evaluate()` with a list of algorithms together with their parameters instead of a single method name. In the following we use the evaluation scheme created above to compare three recommender algorithms: random items, popular items, and user-based CF. Note that when running the following code, the CF-based algorithms are very slow (item-based CF was so slow that it is commented out below).
```{r compareAlgorithms}
scheme <- evaluationScheme(r[1:100], method="split", train=.9,
                           k=1, given=20, goodRating=5)
algorithms <- list(
  "random items"  = list(name="RANDOM",  param=NULL),
  "popular items" = list(name="POPULAR", param=NULL),
  "user-based CF" = list(name="UBCF",
                         param=list(method="Cosine", nn=50, minRating=5))
  ## adding this one is really very slow! I abandoned it after 30 minutes!
  # ,"item-based CF" = list(name="IBCF",
  #                         param=list(method="Cosine", minRating=5))
)
results <- evaluate(scheme, algorithms,
                    n=c(1, 3, 5, 10, 15, 20))
plot(results, annotate=c(1,3), legend="topleft")
plot(results, "prec/rec", annotate=3)
results
```
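Beyond the built-in plots, the raw confusion matrices behind these curves are accessible per algorithm; the ggplot2 translation below builds on exactly these accessors:

```{r rawConfusion, eval=FALSE}
## averaged confusion matrix (per top-N cutoff) for one algorithm,
## and the raw matrix of the first (and here only) run
avg(results[["popular items"]])
getConfusionMatrix(results[["popular items"]])[[1]]
```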
### Translation of plots to ggplot2
In order to reproduce the previous plots in ggplot2, I created two functions. I tried to keep the same logic and arguments as the original package plots; the trickiest part was the `annotate` argument.

The first function takes the result of `evaluate()` for a single algorithm and plots it as a ROC (or precision/recall) curve.
```{r plotEvaluationResult}
ggplot.evaluationResult <- function (x, y, avg = TRUE, annotate = FALSE, title = '')
{
  if (missing(y)) y <- NULL
  plot_type <- match.arg(y, c("ROC", "prec/rec"))
  ## choose the axes according to the curve type
  take <- if (plot_type == "ROC") c("FPR", "TPR") else c("recall", "precision")
  ## either average over the evaluation runs or take the first confusion matrix
  x <- if (avg) avg(x) else getConfusionMatrix(x)[[1]]
  x <- as.data.frame(x[, take])
  x <- transform(x, text = rownames(x))
  ggplot(data = x, aes_string(x = take[1], y = take[2], label = "text")) +
    geom_line() + geom_point() +
    ## annotate=FALSE hides the point labels by drawing them with colour NA
    geom_text(vjust = -1, col = ifelse(annotate, TRUE, NA)) +
    ylim(extendrange(x[, take[2]])) +
    ggtitle(paste(plot_type, "curve", title))
}
```
And the same function for a list of evaluation results. The plotting part is nearly the same, except that group/colour aesthetics are used to superpose multiple curves.
```{r}
ggplot.evaluationResultList <- function (x, y,
                                         avg = TRUE, type = "b", annotate = FALSE)
{
  if (missing(y)) y <- NULL
  plot_type <- match.arg(y, c("ROC", "prec/rec"))
  take <- if (plot_type == "ROC") c("FPR", "TPR") else c("recall", "precision")
  ## stack the (averaged) confusion matrices of all algorithms in one data frame
  res_list <- x
  dat <- do.call(rbind, lapply(names(res_list), function(name) {
    res <- res_list[[name]]
    res <- if (avg) avg(res) else getConfusionMatrix(res)[[1]]
    d <- as.data.frame(res)
    transform(d, text = rownames(d),
              method = name,
              ## labels are only drawn for the methods selected by `annotate`
              col = ifelse(name %in% names(res_list)[annotate], TRUE, NA))
  }))
  col.v <- dat$col
  ggplot(data = dat, aes_string(x = take[1], y = take[2], label = "text",
                                group = "method", col = "method")) +
    geom_line() + geom_point() +
    geom_text(vjust = -1, col = col.v) +
    ylim(extendrange(dat[, take[2]]))
}
```
Then you can call them like this:
```{r}
require(ggplot2)
ggplot.evaluationResult(results[[2]], 'prec', annotate = FALSE, avg = FALSE,
                        title = names(results)[2])
ggplot.evaluationResultList(results, 'ROC', annotate = FALSE, avg = TRUE)
```