@lcolladotor
Created August 28, 2013 19:31
Rmd file for generating http://rpubs.com/lcollado/7901. The main issue is that only the last DataTable will show in the HTML.
Remove outliers
===============
The goal is to remove outliers (by variable) by marking them as NA and keeping a record of which were outliers.
# Data
First, let's create a sample data set.
```{r data}
set.seed(20130828)
data <- data.frame(
    X = c(NA, rnorm(1000), runif(20, -20, 20)),
    Y = c(runif(1000), rnorm(20, 2), NA),
    Z = c(rnorm(1000, 1), NA, runif(20))
)
```
Here you can browse it interactively:
```{r dataInt, results="asis"}
library(rCharts)
library(data.table)
## Add the index
d <- data.table(cbind("row" = 1:nrow(data), data))
t1 <- dTable(d, sPaginationType= 'full_numbers', iDisplayLength=10, sScrollX='100%')
t1$print("chart1", include_assets=TRUE, cdn=TRUE)
```
Notice, for example, that the first observation in variable __X__ is an NA. This means that we will be dealing with both _original_ NAs and new NAs.
# Find outliers
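We will mark as an outlier any observation more than 3 standard deviations (sd) away from its column mean. The next function finds the cells of the data frame that are considered outliers.
```{r findOutlier}
findOutlier <- function(data, cutoff = 3) {
    ## Calculate the mean and sd of each column, ignoring NAs
    means <- sapply(data, mean, na.rm = TRUE)
    sds <- sapply(data, sd, na.rm = TRUE)
    ## Identify the cells more than cutoff * sd away from the column mean
    ## (column wise); which() skips NAs
    result <- mapply(function(d, m, s) {
        which(abs(d - m) > cutoff * s)
    }, data, means, sds, SIMPLIFY = FALSE)
    result
}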
outliers <- findOutlier(data)
outliers
```
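To see the rule on a small case, here is a minimal sketch on a made-up vector. It assumes an outlier is a value more than 3 sd away from the mean; the names `x`, `center`, `spread` and `toyOutliers` are only for illustration and are not used elsewhere in this report.

```{r outlierToy}
## Toy vector: 30 well-behaved values, one extreme value, and an original NA
x <- c(rep(c(-1, 0, 1), 10), 50, NA)
center <- mean(x, na.rm = TRUE)
spread <- sd(x, na.rm = TRUE)
## Indices more than 3 sd away from the mean; which() drops the NA
toyOutliers <- which(abs(x - center) > 3 * spread)
toyOutliers
```

Only position 31 (the value 50) is flagged. The original NA at position 32 is never reported, because `which()` ignores NA comparisons.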
# Remove outliers
Next we can remove the outliers.
```{r removeOutlier}
removeOutlier <- function(data, outliers) {
    ## Set the flagged cells of each column to NA
    result <- mapply(function(d, o) {
        res <- d
        res[o] <- NA
        return(res)
    }, data, outliers)
    return(as.data.frame(result))
}
dataFilt <- removeOutlier(data, outliers)
```
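Since the filtered data mixes _original_ NAs with the new ones, it can help to compare the NA pattern before and after filtering. The sketch below does this on a small made-up data frame; `before`, `after` and `newNA` are hypothetical names, and the flagged cell is chosen by hand rather than computed.

```{r naRecordToy}
## Hypothetical toy data: one original NA in each column
before <- data.frame(A = c(NA, 1, 2, 100), B = c(5, NA, 6, 7))
## Pretend row 4 of A was flagged as an outlier and set to NA
after <- before
after$A[4] <- NA
## TRUE only where a value existed before filtering and is NA afterwards
newNA <- is.na(after) & !is.na(before)
## Number of removed outliers per column
colSums(newNA)
## Row and column position of each removed outlier
which(newNA, arr.ind = TRUE)
```

The same comparison should work for the real objects by replacing `before` and `after` with `data` and `dataFilt`.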
Here is how the data looks after the filtering step. Use the information from the outliers to find the data entries that were filtered. For example, on page 101 (when showing 10 entries per page) you can see entries 1,001 to 1,010.
```{r dataFilt, results="asis"}
## Add the index
d2 <- data.table(cbind("row" = 1:nrow(dataFilt), dataFilt))
t2 <- dTable(d2, sPaginationType= 'full_numbers', iDisplayLength=10, sScrollX='100%')
t2$print("chart2", cdn=TRUE)
```
# Iterate
You can iterate the procedure if you want to. Note, however, that the standard deviations of the filtered data will be smaller than those of the original data set, so each pass can flag observations that were within the cutoff before, potentially finding many more outliers.
```{r iterate}
outliers2 <- findOutlier(dataFilt)
outliers2
dataFilt2 <- removeOutlier(dataFilt, outliers2)
```
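If you would rather repeat the procedure until nothing new is flagged, one option is a small loop with a cap on the number of passes. This is only a sketch on a single made-up vector: `filterUntilStable`, `maxIter` and `toy` are hypothetical names, and the rule assumed here is "more than `cutoff` sd away from the mean".

```{r iterateToy}
## Hypothetical helper: repeatedly set values more than `cutoff` sd from the
## mean to NA, stopping when nothing new is flagged or after `maxIter` passes
filterUntilStable <- function(x, cutoff = 3, maxIter = 10) {
    for (i in seq_len(maxIter)) {
        flagged <- which(abs(x - mean(x, na.rm = TRUE)) > cutoff * sd(x, na.rm = TRUE))
        if (length(flagged) == 0) {
            ## i - 1 passes actually removed something
            return(list(values = x, passes = i - 1))
        }
        x[flagged] <- NA
    }
    list(values = x, passes = maxIter)
}
## 60 well-behaved values plus a moderate (8) and an extreme (100) value
toy <- c(rep(c(-1, 0, 1), 20), 8, 100)
res <- filterUntilStable(toy)
res$passes
sum(is.na(res$values))
```

In this toy case the first pass removes only the value 100, because it inflates the sd enough to hide the 8. The second pass, with a much smaller sd, removes the 8, and the third pass finds nothing, so two passes removed two values.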
Here is the result after two iterations.
```{r dataFilt2, results="asis"}
## Add the index
d3 <- data.table(cbind("row" = 1:nrow(dataFilt2), dataFilt2))
t3 <- dTable(d3, sPaginationType= 'full_numbers', iDisplayLength=10, sScrollX='100%')
t3$print("charts3", cdn=TRUE)
```
# Reproducibility
```{r reproducibility}
Sys.time()
proc.time()
sessionInfo()
```
This report was written by [L. Collado Torres](http://bit.ly/13MBoy8) and was generated using [knitrBootstrap](https://github.com/jimhester/knitrBootstrap).