@lcolladotor
Last active December 22, 2015 08:29
outliers.Rmd final version and .Rprofile for running knitrBootstrap in RStudio. Edit: added knitr::render_html() to the Rmd file and removed it from .Rprofile after re-reading https://github.com/jimhester/knitrBootstrap/issues/20
## For knitr bootstrap
## More info at http://www.rstudio.com/ide/docs/authoring/markdown_custom_rendering
options(rstudio.markdownToHTML =
    function(inputFile, outputFile) {
        library(knitrBootstrap)
        knit_bootstrap_md(input=inputFile, output=outputFile, code_style="Brown Paper",
            chooser=c("boot", "code"), show_code=FALSE)
    }
)
Remove outliers
===============
The goal is to remove outliers (by variable) by marking them as NA and keeping a record of which were outliers.
# Data
First, let's create a sample data set.
```{r data}
library(knitr)
render_html()
set.seed(20130828)
rawData <- data.frame(
    X = c(NA, rnorm(1000), runif(20, -20, 20)),
    Y = c(runif(1000), rnorm(20, 2), NA),
    Z = c(rnorm(1000, 1), NA, runif(20))
)
```
Here you can browse it interactively:
<link rel="stylesheet" href="http://ajax.aspnetcdn.com/ajax/jquery.dataTables/1.9.4/css/jquery.dataTables.css" />
<script src="http://ajax.aspnetcdn.com/ajax/jquery.dataTables/1.9.4/jquery.dataTables.min.js"></script>
Notice, for example, that the first observation in variable __X__ is an NA. This means that we will be dealing with both _original_ NAs and new NAs introduced by the filtering.
```{r dataInt, results="asis"}
library(rCharts)
library(data.table)
## Add the index
d <- data.table(cbind("row" = 1:nrow(rawData), rawData))
t1 <- dTable(d, sPaginationType= 'full_numbers', iDisplayLength=10, sScrollX='100%')
t1$print("chart1", cdn=TRUE)
```
# Find outliers
We will mark as an outlier any observation outside 3 standard deviations (sd). The next function finds the cells of the data set that are considered outliers.
```{r findOutlier}
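findOutlier <- function(data, cutoff=3) {
    ## Calculate the sd of each column, ignoring NAs
    sds <- apply(data, 2, sd, na.rm=TRUE)
    ## Identify the cells with value greater than cutoff * sd (column wise).
    ## The raw value is compared, so only large positive values are flagged.
    ## SIMPLIFY=FALSE always returns a list, even if every column has the
    ## same number of outliers
    result <- mapply(function(d, s) {
            which(d > cutoff * s)
        },
        data, sds, SIMPLIFY=FALSE
    )
    result
}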
outliers <- findOutlier(rawData)
outliers
```
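The comparison `d > cutoff * s` in `findOutlier` looks at the raw value, so a large negative value is never flagged. If a two-sided rule centered on the column mean is what you want, a minimal sketch could look like the following. The name `findOutlierTwoSided` and the toy data are hypothetical and are not part of the original analysis.

```r
## Hypothetical variant: flag values far from the column mean in either direction
findOutlierTwoSided <- function(data, cutoff = 3) {
    lapply(data, function(d) {
        center <- mean(d, na.rm = TRUE)
        s <- sd(d, na.rm = TRUE)
        ## which() drops NA comparisons, so original NAs are never flagged
        which(abs(d - center) > cutoff * s)
    })
}

## Small self-contained example with one low and one high extreme value
toy <- data.frame(A = c(rep(0, 20), -50, NA), B = c(rep(1, 21), 100))
toyOutliers <- findOutlierTwoSided(toy)
toyOutliers
```

The result is a named list with one integer vector of row indices per column, the same shape that `removeOutlier` expects.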
# Remove outliers
Next we can remove the outliers.
```{r removeOutlier}
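removeOutlier <- function(data, outliers) {
    ## For each column, set the cells listed in outliers to NA
    result <- mapply(function(d, o) {
        res <- d
        res[o] <- NA
        return(res)
    }, data, outliers)
    ## All columns have the same length, so mapply returns a matrix
    return(as.data.frame(result))
}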
dataFilt <- removeOutlier(rawData, outliers)
```
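Because the data already contained NAs before filtering, it can help to keep an explicit record of which cells were set to NA by the filter. One possible sketch compares the data before and after filtering. The helper name `markRemoved` and the small example data frames are assumptions made for illustration.

```r
## Hypothetical helper: TRUE where a value became NA only because of filtering
markRemoved <- function(before, after) {
    as.data.frame(is.na(after) & !is.na(before))
}

## Toy data: X[1] and Y[2] are original NAs, X[3] was removed by a filter
before <- data.frame(X = c(NA, 1, 50), Y = c(2, NA, 3))
after  <- data.frame(X = c(NA, 1, NA), Y = c(2, NA, 3))
removed <- markRemoved(before, after)
removed
## Number of removed observations per variable
colSums(removed)
```

In this report the same call would be `markRemoved(rawData, dataFilt)`; the list returned by `findOutlier` carries the same information as row indices.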
Here is how the data looks after the filtering step. Use the information from the outliers to find the data entries that were filtered. For example, on page 101 (when showing 10 entries per page) you can see entries 1,001 to 1,010.
```{r dataFilt, results="asis"}
## Add the index
d2 <- data.table(cbind("row" = 1:nrow(dataFilt), dataFilt))
t2 <- dTable(d2, sPaginationType= 'full_numbers', iDisplayLength=10, sScrollX='100%')
t2$print("chart2")
```
# Iterate
If you want to, you can iterate the procedure. However, note that the standard deviations of the filtered data will be smaller than in the original data set, so a second pass can flag many more observations as outliers.
```{r iterate}
outliers2 <- findOutlier(dataFilt)
outliers2
dataFilt2 <- removeOutlier(dataFilt, outliers2)
```
Here is the result after two iterations.
```{r dataFilt2, results="asis"}
## Add the index
d3 <- data.table(cbind("row" = 1:nrow(dataFilt2), dataFilt2))
t3 <- dTable(d3, sPaginationType= 'full_numbers', iDisplayLength=10, sScrollX='100%')
t3$print("chart3")
```
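The two passes above could be generalized to repeat until a pass finds nothing new. The sketch below is only an illustration: it assumes a two-sided rule centered on the column mean (not the comparison used by `findOutlier`), and the names `filterUntilStable`, `maxIter` and `passes` are hypothetical.

```r
## Hypothetical helper: repeat the filtering until a pass flags nothing,
## or until maxIter passes have run
filterUntilStable <- function(data, cutoff = 3, maxIter = 10) {
    for (i in seq_len(maxIter)) {
        ## Flag values more than cutoff sds away from the current column mean
        flagged <- lapply(data, function(d) {
            which(abs(d - mean(d, na.rm = TRUE)) > cutoff * sd(d, na.rm = TRUE))
        })
        ## Stop as soon as a pass finds nothing new
        if (all(lengths(flagged) == 0)) {
            return(list(data = data, passes = i - 1))
        }
        for (v in names(flagged)) {
            data[[v]][flagged[[v]]] <- NA
        }
    }
    list(data = data, passes = maxIter)
}

## Toy data: 200 hides 20 in the first pass, so two passes remove values
toy <- data.frame(A = c(rep(c(0, 2), 15), 20, 200))
res <- filterUntilStable(toy)
res$passes
tail(res$data$A, 2)
```

The `maxIter` cap is there because each pass shrinks the standard deviation, so an uncapped loop can keep removing observations from well-behaved data.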
# Reproducibility
```{r reproducibility}
Sys.time()
proc.time()
sessionInfo()
```
This report was written by [L. Collado Torres](http://bit.ly/13MBoy8) and was generated using [knitrBootstrap](https://github.com/jimhester/knitrBootstrap).
* Showing multiple tables was implemented and fixed by [Ramnath Vaidyanathan](https://github.com/ramnathv) in this [issue](https://github.com/ramnathv/rCharts/issues/227).
* The knitrBootstrap and rCharts interaction was fixed by [Jim Hester](https://github.com/jimhester), [TimelyPortfolio](https://github.com/timelyportfolio) and [Ramnath Vaidyanathan](https://github.com/ramnathv) in [rCharts issue 233](https://github.com/ramnathv/rCharts/issues/233) and [knitrBootstrap issue 21](https://github.com/jimhester/knitrBootstrap/issues/21).
* Running knitrBootstrap correctly in RStudio is addressed by [Jim Hester](https://github.com/jimhester) in [knitrBootstrap issue 20](https://github.com/jimhester/knitrBootstrap/issues/20).
Thank you!