Heril/CopulaControl2.md Secret

## CopulaControl2.md

      
    Raw
  

              CopulaControl2.md
            
          
    Earlier this year, I posted the first part of a write-up for a project I worked on during a project I worked on last year.
As a reminder, my project involved taking 4 different types of correlation, test the baseline and 8 different data shifts, generating 1000 simulations with 100 points of data each. I then compared how three different multivariate control charts, Hotelling T^2, MEWMA, and MCUSUM, were in detecting a shift while holding the false rate fixed over all simulations.
For this I use the pretty standard for in control average run length (ARL) of 300. This means, on average when the process is in control, you can expect to see an out of control data point in any 300 sequential points. The simple way to do this is to calculate the control limits for each control chart using the in-control data so that this constraint is kept. For this code, the variable cARL is set to 300 to represent constraining the ARL to be 300 in this case.
t2ucl <- simData %>%
  filter(shift == "0/0") %>%
  group_by(copula, rho) %>%
  summarize(UCLt2 = quantile(t2, probs = (cARL - 1)/cARL)) %>%
  select(copula, rho, UCLt2) %>%
  spread(rho, UCLt2)%>%
  column_to_rownames(var="copula")

mewmaucl <- simData %>%
  filter(shift == "0/0") %>%
  group_by(copula, rho) %>%
  summarize(UCLme = quantile(mewma, probs = (cARL - 1)/cARL)) %>%
  select(copula, rho, UCLme) %>%
  spread(rho, UCLme)%>%
  column_to_rownames(var="copula")

mcusumucl <- simData %>%
  filter(shift == "0/0") %>%
  group_by(copula, rho) %>%
  summarize(UCLmc = quantile(mcusum, probs = (cARL - 1)/cARL)) %>%
  select(copula, rho, UCLmc) %>%
  spread(rho, UCLmc)%>%
  column_to_rownames(var="copula")
From these control limits, we can test the out of control data, and determine the ARL, where the lower the value, the quicker a shift is detected. For ease of use, if a run has zero points out of control, I will treat that as an ARL equal to the size of the run, in each of these cases that being 1000.
tmpData <- list()
index <- 1
for(i in 1:Ncopula) {
  for(j in 1:Nrho) {
    tmpData[[index]] <- splitData[[index]] %>%
      group_by(copula, rho, shift, iteration) %>%
      summarize(t2ARL = ifelse(shift == "0/0", size/sum(t2 > t2ucl[i,j]),
                              detect_index(t2, function(z)(z>t2ucl[i,j]))),
                meARL = ifelse(shift == "0/0", size/sum(mewma > mewmaucl[i,j]),
                              detect_index(mewma, function(z)(z>mewmaucl[i,j]))),
                mcARL = ifelse(shift == "0/0", size/sum(mcusum > mcusumucl[i,j]),
                              detect_index(mcusum, function(z)(z>mcusumucl[i,j])))) %>%
      mutate(t2ARL = ifelse(is.infinite(t2ARL), size, t2ARL),
             meARL = ifelse(is.infinite(meARL), size, meARL),
             mcARL = ifelse(is.infinite(mcARL), size, mcARL))
    index <- index + 1
  }
}
rm(splitData)
All that remains is to get those ARL values in a way that is easy to visualize, as well as generate confidence intervals for the simulation as a whole.
ARL <- bind_rows(tmpData) %>%
  group_by(copula, rho, shift) %>%
  summarize(t2ci = list(mean_cl_normal(t2ARL) %>%
                          rename(t2ARLmean=y, t2ARLlwr=ymin, t2ARLupr=ymax)),
            meci = list(mean_cl_normal(meARL) %>%
                          rename(meARLmean=y, meARLlwr=ymin, meARLupr=ymax)),
            mcci = list(mean_cl_normal(mcARL) %>%
                          rename(mcARLmean=y, mcARLlwr=ymin, mcARLupr=ymax))) %>%
  unnest(cols = c(t2ci, meci, mcci))
Now that we have our control chart ARL metrics, let's check that the in-control ARL is approximately 300. This corresponds when the column shift is 0/0, meaning the mean for both variables is 0. I'm selecting columns to exclude the upper and lower values for the confidence intervals.
ARL %>%
	filter(shift == "0/0") %>%
	select(c(1:4,7,10))
# A tibble: 16 x 6
# Groups:   copula, rho [16]
   copula  rho   shift t2ARLmean meARLmean mcARLmean
   <fct>   <fct> <fct>     <dbl>     <dbl>     <dbl>
 1 Normal  0.6   0/0        300.      299.      299.
 2 Normal  0.2   0/0        300.      300.      299.
 3 Normal  -0.2  0/0        300.      300.      299.
 4 Normal  -0.6  0/0        300.      299.      301.
 5 Frank   0.6   0/0        300.      299.      300.
 6 Frank   0.2   0/0        300.      299.      300.
 7 Frank   -0.2  0/0        300.      300.      298.
 8 Frank   -0.6  0/0        299.      299.      301.
 9 Clayton 0.6   0/0        300.      300.      300.
10 Clayton 0.2   0/0        300.      299.      299.
11 Clayton -0.2  0/0        299.      300.      299.
12 Clayton -0.6  0/0        299.      300.      299.
13 Gumbel  0.6   0/0        300.      301.      299.
14 Gumbel  0.2   0/0        300.      299.      300.
15 Gumbel  -0.2  0/0        299.      299.      301.
16 Gumbel  -0.6  0/0        299.      300.      299.

Things are looking good, with everything being right around 300. We've got a lot of plots we can make to visualize this, but for illustrating this simply, let's first look at the simplest, the Hotelling T^2 control chart, looking at a sample control chart as well as the ARL values for both in control and out of control for the different correlation.
Let's take a look at two of the runs for the Normal copula with a correlation of 0.6, one for when there is no shift, and one where just one of the two variables has shifted by 3.


We see in this case with no shift, that only 3 points fall above the upper control limit (UCL), which for 1000 points, that is what we would roughly expect to encounter.


When simulating one of the variables being out of control with a shift of 3, we see we get a lot of points above the UCL and registered as out of control. This is exactly what we would like to happen, as in a production environment, we would want to detect this change and take action to correct it.
Finally, let's look at an overall summary of the performance for Hotelling T^2 charts for these different copulas and correlations.


The curious thing we see with these four graphs is that how well the T^2 performs for detecting different shifts varies between copula type and the strength of the correlation. For my next and final post on this project, I want to delve into the different types of correlations we're working with here and how they are structured and how that changes the performance of the Hotelling T^2 control chart.