Last active
December 22, 2015 08:29
outliers.Rmd final version and .Rprofile for running knitrBootstrap in RStudio. Edit: added knitr::render_html() to the Rmd file and removed it from .Rprofile after re-reading https://github.com/jimhester/knitrBootstrap/issues/20
## For knitr bootstrap
## More info at http://www.rstudio.com/ide/docs/authoring/markdown_custom_rendering
options(rstudio.markdownToHTML =
  function(inputFile, outputFile) {
    library(knitrBootstrap)
    knit_bootstrap_md(input=inputFile, output=outputFile, code_style="Brown Paper", chooser=c("boot", "code"), show_code=FALSE)
  }
)
Remove outliers
===============

The goal is to remove outliers (by variable) by marking them as NA while keeping a record of which observations were outliers.

# Data

First, let's create a sample data set.

```{r data}
library(knitr)
render_html()
set.seed(20130828)
rawData <- data.frame(
  X = c(NA, rnorm(1000), runif(20, -20, 20)),
  Y = c(runif(1000), rnorm(20, 2), NA),
  Z = c(rnorm(1000, 1), NA, runif(20))
)
```

Here you can browse it interactively:

<link rel="stylesheet" href="http://ajax.aspnetcdn.com/ajax/jquery.dataTables/1.9.4/css/jquery.dataTables.css" />
<script src="http://ajax.aspnetcdn.com/ajax/jquery.dataTables/1.9.4/jquery.dataTables.min.js"></script>

Notice, for example, that the first observation in variable __X__ is an NA. This means that we will be dealing with both _original_ NAs and new NAs.

```{r dataInt, results="asis"}
library(rCharts)
library(data.table)
## Add the index
d <- data.table(cbind("row" = 1:nrow(rawData), rawData))
t1 <- dTable(d, sPaginationType= 'full_numbers', iDisplayLength=10, sScrollX='100%')
t1$print("chart1", cdn=TRUE)
```

# Find outliers

We will mark as an outlier any observation that lies more than 3 standard deviations (sd) from the mean of its variable. The next function finds the cells of the data frame that are considered outliers.

```{r findOutlier}
findOutlier <- function(data, cutoff=3) {
  ## Calculate the mean and sd of each column
  means <- sapply(data, mean, na.rm=TRUE)
  sds <- sapply(data, sd, na.rm=TRUE)
  ## Identify the cells more than cutoff * sd away from the column mean
  ## SIMPLIFY=FALSE keeps the result a list even if all columns have the same number of outliers
  result <- mapply(function(d, m, s) {
    which(abs(d - m) > cutoff * s)
  },
  data, means, sds, SIMPLIFY=FALSE
  )
  result
}
outliers <- findOutlier(rawData)
outliers
```
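
The outlier record is, in the usual case, a list with one integer vector of row indices per variable. As a small illustration, the number of flagged cells per variable can be summarised along these lines. The list `toyOutliers` is made up and only mimics that structure; it is not output from the data above.

```{r outlierSummary}
## Made-up record with the same structure: row indices per variable
toyOutliers <- list(X = c(1002L, 1005L), Y = integer(0), Z = 7L)
## Number of flagged observations per variable
lengths(toyOutliers)
## Long-format table with one row per flagged cell
flagged <- data.frame(
  variable = rep(names(toyOutliers), lengths(toyOutliers)),
  row = unlist(toyOutliers, use.names = FALSE)
)
flagged
```

A long-format table like this can be easier to scan than the nested list when there are many variables.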

# Remove outliers

Next, we can remove the outliers by setting the flagged cells to NA.

```{r removeOutlier}
removeOutlier <- function(data, outliers) {
  result <- mapply(function(d, o) {
    ## Copy the column and mark the outlier positions as NA
    res <- d
    res[o] <- NA
    return(res)
  }, data, outliers)
  return(as.data.frame(result))
}
dataFilt <- removeOutlier(rawData, outliers)
```
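
Because the raw data already contains NAs, it can help to check which NAs were introduced by the filtering step. The following is a minimal sketch on a small toy data frame. The names `toyRaw`, `toyFilt` and `newNA` are hypothetical and the toy values are made up for illustration.

```{r newNAs}
## Toy data (made up for illustration): one original NA per column
toyRaw <- data.frame(X = c(NA, 1, 2, 50), Y = c(0.1, NA, 0.3, 0.2))
## Pretend the filtering step marked row 4 of X as an outlier
toyFilt <- toyRaw
toyFilt$X[4] <- NA
## A cell is a new NA if it is NA after filtering but was not NA before
newNA <- is.na(toyFilt) & !is.na(toyRaw)
## Row and column positions of the new NAs
which(newNA, arr.ind = TRUE)
## Number of new NAs per variable
colSums(newNA)
```

The same comparison should work on any pair of data frames with identical dimensions, such as the raw and filtered data.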

Here is how the data looks after the filtering step. Use the row indices stored in the outliers to find the data entries that were filtered. For example, on page 101 (when showing 10 entries per page) you can see entries 1,001 to 1,010.

```{r dataFilt, results="asis"}
## Add the index
d2 <- data.table(cbind("row" = 1:nrow(dataFilt), dataFilt))
t2 <- dTable(d2, sPaginationType= 'full_numbers', iDisplayLength=10, sScrollX='100%')
t2$print("chart2")
```

# Iterate

If you want to, you can iterate the procedure. However, note that the standard deviations of the filtered data will be smaller than in the original data set, so a second pass can potentially flag many more observations as outliers.

```{r iterate}
outliers2 <- findOutlier(dataFilt)
outliers2
dataFilt2 <- removeOutlier(dataFilt, outliers2)
```
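
If you prefer to repeat the procedure until no observation is flagged, a loop along these lines could be used. This is only a sketch. The names `flagOutliers`, `iterateFilter` and `maxIter` are hypothetical. `flagOutliers` is a compact stand-in for the outlier-finding step that measures distance from the column mean, and the cap on the number of iterations is an assumption added to guarantee termination.

```{r iterateLoop}
## Hypothetical stand-in for the outlier-finding step:
## indices of values more than cutoff * sd away from the column mean
flagOutliers <- function(data, cutoff = 3) {
  lapply(data, function(d) {
    which(abs(d - mean(d, na.rm = TRUE)) > cutoff * sd(d, na.rm = TRUE))
  })
}

## Repeat the filtering until nothing is flagged or maxIter is reached
iterateFilter <- function(data, cutoff = 3, maxIter = 10) {
  for (i in seq_len(maxIter)) {
    flagged <- flagOutliers(data, cutoff)
    if (sum(lengths(flagged)) == 0) break
    ## Mark the flagged cells as NA, column by column
    for (v in names(data)) data[[v]][flagged[[v]]] <- NA
  }
  data
}

## Toy data (made up): a few extreme values appended to well-behaved ones
set.seed(1)
toy <- data.frame(X = c(rnorm(100), 30, -30), Y = c(runif(100), 15, NA))
toyClean <- iterateFilter(toy)
## Number of NAs per variable after iterating
colSums(is.na(toyClean))
```

Keep in mind that each pass shrinks the standard deviations, so iterating to convergence can remove far more observations than a single pass.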

Here is the result after two iterations.

```{r dataFilt2, results="asis"}
## Add the index
d3 <- data.table(cbind("row" = 1:nrow(dataFilt2), dataFilt2))
t3 <- dTable(d3, sPaginationType= 'full_numbers', iDisplayLength=10, sScrollX='100%')
t3$print("chart3")
```

# Reproducibility

```{r reproducibility}
Sys.time()
proc.time()
sessionInfo()
```

This report was written by [L. Collado Torres](bit.ly/13MBoy8) and was generated using [knitrBootstrap](https://github.com/jimhester/knitrBootstrap).

* Showing multiple tables was implemented and fixed by [Ramnath Vaidyanathan](https://github.com/ramnathv) in this [issue](https://github.com/ramnathv/rCharts/issues/227).
* The knitrBootstrap and rCharts interaction was fixed by [Jim Hester](https://github.com/jimhester), [TimelyPortfolio](https://github.com/timelyportfolio) and [Ramnath Vaidyanathan](https://github.com/ramnathv) in [rCharts issue 233](https://github.com/ramnathv/rCharts/issues/233) and [knitrBootstrap issue 21](https://github.com/jimhester/knitrBootstrap/issues/21).
* Running knitrBootstrap correctly in RStudio is addressed by [Jim Hester](https://github.com/jimhester) in [knitrBootstrap issue 20](https://github.com/jimhester/knitrBootstrap/issues/20).

Thank you!