## OutliersUnwin.Rmd
---
title: "Outliers are a matter of opinion?"
author: "Antony Unwin"
date: ' '
output: html_document
---
```{r include=FALSE}
library(ggplot2)
library(ggthemes)
library(OutliersO3)
```
```{r global_options, include=FALSE}
knitr::opts_chunk$set(warning=FALSE, message=FALSE)
update_geom_defaults("bar",   list(fill = "grey70", colour = "grey40"))
```

```{r fig.width=10, fig.height=5, out.width='0.8\\linewidth', fig.align='center', echo=FALSE}
s1 <- O3prep(stackloss)
O3s1 <- O3plotT(s1)
O3s1$gO3
```

Fig 1: An O3 plot of the stackloss dataset.  There is one row for each variable combination (defined by the columns to the left) for which outliers were found, and one column for each case identified as an outlier (the columns to the right).

There are many different methods for identifying outliers and a lot of them are available in **R**.  Do they all give the same results?

Articles on outlier methods use a mixture of theory and practice.  Theory is all very well, but outliers are outliers because they don't follow theory.  Practice involves testing methods on data, sometimes on data simulated from theory, better on 'real' datasets.  A method can be considered successful if it finds the outliers we all agree on, but do we all agree on which cases are outliers?

The Overview Of Outliers (O3) plot is designed to help compare and understand the results of outlier methods.  It is implemented in the **OutliersO3** package (<https://CRAN.R-project.org/package=OutliersO3>) and was presented at last year's useR! in Brussels.  Six methods from other **R** packages are included (and, as usual, thanks are due to the authors for making their functions available in packages).

The starting point was a recent proposal of Wilkinson's, his HDoutliers algorithm.  Figure 1 shows the default O3 plot for this method applied to the stackloss dataset.  (Detailed explanations of O3 plots are in the **OutliersO3** vignettes.)  The stackloss dataset is a small example (21 cases and 4 variables) and there is an illuminating and entertaining article (@dodge:1996) that tells you a lot about it.

Wilkinson's algorithm finds 6 outliers for the whole dataset (the bottom row of the plot).  Overall, for various combinations of variables, 14 of the cases are found to be potential outliers (out of 21!).  There are no rows for 11 of the possible 15 combinations of variables because no outliers are found with them.  If using a tolerance level of 0.05 seems a little bit lax, using 0.01 finds no outliers at all for any variable combination.
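
The figure of 15 possible combinations follows from the four variables: every non-empty subset of them is a candidate, giving $2^4 - 1 = 15$.  A minimal base-**R** sketch of that enumeration is given here (the object names are illustrative, and this is not necessarily how **OutliersO3** builds its combinations internally):

```r
# Enumerate every non-empty subset of the stackloss variables.
vars <- names(stackloss)
p <- length(vars)

# combn() lists the subsets of each size k = 1..p; flatten them into one list.
combos <- unlist(
  lapply(seq_len(p), function(k) combn(vars, k, simplify = FALSE)),
  recursive = FALSE
)

# Total number of variable combinations, equal to 2^p - 1.
n_combos <- length(combos)
n_combos

# One readable label per combination.
vapply(combos, paste, character(1), collapse = " + ")
```

Only the combinations for which at least one outlier is found get a row in the O3 plot, which is why most of them are missing from Figure 1.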

```{r fig.width=10, fig.height=5, out.width='0.8\\linewidth', fig.align='center', echo=FALSE}
s2 <- O3prep(stackloss, method=c("HDo", "BAC"), tolHDo=0.05, tolBAC=0.05)
O3s2 <- O3plotM(s2)
O3s2$gO3
```

Fig 2: An O3 plot comparing outliers identified by *HDoutliers* and *mvBACON* in the stackloss dataset.

Trying another method with tolerance level=0.05 (*mvBACON* from **robustX**) identifies 5 outliers, all ones found for more than one variable combination by *HDoutliers*.  However, no outliers are found for the whole dataset and only one of the three variable combinations where outliers are found is a combination where *HDoutliers* finds outliers.  Of course, the two methods are quite different and it would be strange if they agreed completely.  Is it strange that they do not agree more?

There are four other methods available in **OutliersO3**, and using all six methods on stackloss with a tolerance level of 0.05 identifies the following numbers of outliers:
```{r fig.width=9, fig.height=6, out.width='0.8\\linewidth', fig.align='center', echo=FALSE}
s6 <- O3prep(stackloss,
             method=c("HDo", "PCS", "BAC", "adjOut", "DDC", "MCD"),
             tolHDo=0.05, tolPCS=0.05, tolBAC=0.05,
             toladj=0.05, tolDDC=0.05, tolMCD=0.05)
O3s6 <- O3plotM(s6)
print(O3s6$nOut)
O3s6$gO3
```
Fig 3: An O3 plot of stackloss using the methods *HDoutliers*, *FastPCS*, *mvBACON*, *adjOutlyingness*, *DetectDeviatingCells*, *covMCD*.  The darker the cell, the more methods agree.  If they all agree, the cell is coloured red and if all but one agree then orange.  No case is identified by all the methods as an outlier for any combination of variables when the tolerance level is set at 0.05 for all.

Each method uses what I have called the tolerance level in a rather different way.  Sometimes it is called alpha and sometimes (1-alpha).  As so often with **R**, you start wondering if more consistency would not be out of place, even at the expense of a little individuality.   **OutliersO3** transforms where necessary to ensure that lower tolerance level values mean fewer outliers for all methods, but no attempt has been made to calibrate them equivalently.  This is probably why *adjOutlyingness* finds few or no outliers (results of this method are mildly random).  The default value, according to *adjOutlyingness*'s help page, is an alpha of 0.25.
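
The convention that a lower tolerance level means fewer outliers can be illustrated with a rough base-**R** sketch.  It uses classical Mahalanobis distances with a chi-squared cutoff, which is *not* one of the six methods compared here, and the helper `flag_outliers()` is a hypothetical name introduced only for this illustration.  A method whose own argument is (1-alpha) would need the tolerance mapped to `1 - tol` in this way before being passed on:

```r
# Classical (non-robust) squared Mahalanobis distances for stackloss.
# This is an illustrative stand-in, not a method used by OutliersO3.
x <- as.matrix(stackloss)
d2 <- mahalanobis(x, center = colMeans(x), cov = cov(x))

# Hypothetical helper: flag cases whose squared distance exceeds the
# chi-squared quantile at 1 - tol, so a lower tol gives a higher cutoff.
flag_outliers <- function(d2, tol, df) {
  which(d2 > qchisq(1 - tol, df = df))
}

out_05 <- flag_outliers(d2, tol = 0.05, df = ncol(x))
out_01 <- flag_outliers(d2, tol = 0.01, df = ncol(x))

# Every case flagged at the stricter level is also flagged at the laxer one.
all(out_01 %in% out_05)
length(out_01) <= length(out_05)
```

Putting every method on this common direction makes the levels comparable in sign only; it does not make a value of 0.05 equally strict across methods.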

The stackloss dataset is an odd dataset and small enough that each individual case can be studied in detail (cf. Dodge's paper for just how much detail).  However, similar results have been found with other datasets (milk, Election2005, diamonds, ...). The main conclusion so far is that different outlier methods identify different numbers of different cases for different combinations of variables as different from the bulk of the data (i.e. as potential outliers)---or are these datasets just outlying examples?

There are other outlier methods available in **R** and they will doubtless give yet more different results.  The recommendation has to be to proceed with care.  Outliers may be interesting in their own right, they may be errors of some kind---and we may not agree whether they are outliers at all.