Created
August 28, 2013 19:31
Rmd file for generating http://rpubs.com/lcollado/7901. The main issue is that only the last DataTable will show in the HTML.
Remove outliers
===============
The goal is to remove outliers (by variable) by marking them as NA and keeping a record of which were outliers.
# Data
First, let's create a sample data set.
```{r data}
set.seed(20130828)
data <- data.frame(
    X = c(NA, rnorm(1000), runif(20, -20, 20)),
    Y = c(runif(1000), rnorm(20, 2), NA),
    Z = c(rnorm(1000, 1), NA, runif(20))
)
```
Here you can browse it interactively:
```{r dataInt, results="asis"}
library(rCharts)
library(data.table)
## Add the index
d <- data.table(cbind("row" = 1:nrow(data), data))
t1 <- dTable(d, sPaginationType = 'full_numbers', iDisplayLength = 10, sScrollX = '100%')
t1$print("chart1", include_assets = TRUE, cdn = TRUE)
```
Notice, for example, that the first observation in variable __X__ is an NA. This means that we will be dealing with both _original_ NAs and new NAs introduced by the filtering.
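One way to keep the two kinds of NA apart is to record where the original NAs are before any filtering takes place. The following is only a minimal sketch: the toy data frame `toy` and the name `origNA` are assumptions made for illustration and are not part of the original analysis.

```r
## Toy data frame standing in for `data` (illustrative values only)
toy <- data.frame(X = c(NA, 0.5, -1.2), Y = c(0.1, 0.9, NA))
## Logical matrix: TRUE where a value was already missing
origNA <- is.na(toy)
## Number of original NAs per variable
colSums(origNA)
```

Storing this mask up front makes it possible to tell later which NAs were present from the start.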
# Find outliers
We will mark as an outlier any observation that lies more than 3 standard deviations (sd) away from the mean of its variable. The next function finds the cells of the data frame that are considered outliers.
```{r findOutlier}
findOutlier <- function(data, cutoff = 3) {
    ## Calculate the mean and the sd of each column, ignoring NAs
    means <- sapply(data, mean, na.rm = TRUE)
    sds <- sapply(data, sd, na.rm = TRUE)
    ## Identify the cells more than cutoff * sd away from the column mean
    ## SIMPLIFY = FALSE always returns a list with one index vector per column
    result <- mapply(function(d, m, s) {
        which(abs(d - m) > cutoff * s)
    }, data, means, sds, SIMPLIFY = FALSE)
    result
}
outliers <- findOutlier(data)
outliers
```
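The 3 sd rule can be read as a two-sided rule centered on the mean of each variable. The sketch below applies that reading to a single toy vector; the vector `x` and the names `m`, `s` and `flagged` are made up for illustration, and the two-sided interpretation is itself an assumption about what "outside 3 sd" means.

```r
## Toy vector with one extreme negative value and one NA (illustrative only)
x <- c(rep(c(-1, 1), 10), -50, NA)
m <- mean(x, na.rm = TRUE)
s <- sd(x, na.rm = TRUE)
## Two-sided rule: flag values more than 3 sd away from the mean
flagged <- which(abs(x - m) > 3 * s)
flagged
## One-sided comparison against 3 * sd alone misses the negative value
which(x > 3 * s)
```

Note that `which()` drops the NA comparison, so original NAs are never reported as outliers.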
# Remove outliers
Next we can remove the outliers by setting them to NA.
```{r removeOutlier}
removeOutlier <- function(data, outliers) {
    result <- mapply(function(d, o) {
        res <- d
        res[o] <- NA
        return(res)
    }, data, outliers)
    return(as.data.frame(result))
}
dataFilt <- removeOutlier(data, outliers)
```
Here is how the data look after the filtering step. Use the information from the outliers to find the data entries that were filtered. For example, on page 101 (when showing 10 entries per page) you can see entries 1,001 to 1,010.
```{r dataFilt, results="asis"}
## Add the index
d2 <- data.table(cbind("row" = 1:nrow(dataFilt), dataFilt))
t2 <- dTable(d2, sPaginationType = 'full_numbers', iDisplayLength = 10, sScrollX = '100%')
t2$print("chart2", cdn = TRUE)
```
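Since the goal includes keeping a record of which values were outliers, it can help to compare the NA pattern before and after filtering. This is a hedged sketch on toy data; the data frames `before` and `after` and the name `newNA` are hypothetical and only stand in for the data before and after the filtering step.

```r
## Toy data before and after filtering (illustrative values only)
before <- data.frame(X = c(NA, 0.2, 40), Y = c(0.5, NA, 0.1))
after  <- data.frame(X = c(NA, 0.2, NA), Y = c(0.5, NA, 0.1))
## TRUE only where filtering introduced a new NA
newNA <- is.na(after) & !is.na(before)
## Row and column positions of the values that were removed
which(newNA, arr.ind = TRUE)
```

Original NAs are FALSE in `newNA`, so the mask marks only the cells removed as outliers.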
# Iterate
If you want to, you can iterate the procedure. However, note that the standard deviations of the filtered data will be smaller than those of the original data set, so a second pass can flag many more observations as outliers.
```{r iterate}
outliers2 <- findOutlier(dataFilt)
outliers2
dataFilt2 <- removeOutlier(dataFilt, outliers2)
```
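To illustrate why iterating can keep finding outliers, here is a small sketch that repeats a two-sided 3 sd filter on one vector until nothing new is flagged. The helper `iterateFilter`, its `maxIter` argument and the toy vector `x` are assumptions for illustration, not part of the original report.

```r
## Hypothetical helper: repeat the 3 sd filter until nothing new is flagged
iterateFilter <- function(x, cutoff = 3, maxIter = 20) {
    for (i in seq_len(maxIter)) {
        dev <- abs(x - mean(x, na.rm = TRUE))
        out <- which(dev > cutoff * sd(x, na.rm = TRUE))
        ## Stop as soon as a pass finds no outliers
        if (length(out) == 0) break
        x[out] <- NA
    }
    x
}
## Toy vector: 50 is extreme, 8 only looks extreme once 50 is gone
x <- c(rep(c(-1, 1), 10), 50, 8)
sum(is.na(iterateFilter(x, maxIter = 1)))
sum(is.na(iterateFilter(x)))
```

With this toy vector, a single pass removes only the value 50, while iterating to convergence also removes the value 8 because the sd shrinks after the first pass.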
Here is the result after two iterations.
```{r dataFilt2, results="asis"}
## Add the index
d3 <- data.table(cbind("row" = 1:nrow(dataFilt2), dataFilt2))
t3 <- dTable(d3, sPaginationType = 'full_numbers', iDisplayLength = 10, sScrollX = '100%')
t3$print("charts3", cdn = TRUE)
```
# Reproducibility
```{r reproducibility}
Sys.time()
proc.time()
sessionInfo()
```
This report was written by [L. Collado Torres](http://bit.ly/13MBoy8) and was generated using [knitrBootstrap](https://github.com/jimhester/knitrBootstrap).