adrianolszewski/Olszewski_quartet.md

## Olszewski_quartet.md

      
    Raw
  

              Olszewski_quartet.md
            
          
    Olszewski's vs. Anscombe's quartet

You probably heard about the Anscombe's quartet.
It's almost a textbook justification for looking at the data first and not trusting solely descriptive statistics!
I decided to make my own, Olszewski's quartet! It shows 4 faces in different moods. The mean and variance of the Y coordinate is exactly (NOT approximately!) the same for all 4 faces.
Also, the Pearson's correlation is almost 0.

How did I make it?

First I needed to make somehow datasets storing sketches of faces in four different moods: happy, sad, angry and surprised.
The moods are expressed just by the shape of mouth and eyebrows. Simple, but does the job.
I decided to use the awesome drawdata.xyz/ service.
It allows one to draw 4 datasets (A, B, C, D), but I needed 5: 1 for the overall face (without mouth and eyebrows) and 4 different moods.
So I decided to split it into two datasets: a common dataset with face, and another dataset storing the elements of moods.
Yep, it was a bit challenging to determine the right position of mouth and eyebrows not seeing the original face - I had to reset it to start another dataset.
Fortunately, I found a way to match them: I put an adhesive card on the screen, took a pencil and sketched the face on it.
Then, after resetting the board, I used the sketched face as a template to decide where to put mouth and eyebrows :)

Finally I ended up with the following datasets:

I downloaded them as CSV files. They look like this:
> head(mood)
    x     y z
1 257 111.6 a
2 257 110.6 a
3 257 110.6 a
4 257 109.6 a
5 258 109.6 a
6 258 108.6 a
 
> tail(mood)
       x     y z
1460 498 411.6 d
1461 499 409.6 d
1462 499 408.6 d
1463 499 408.6 d
1464 500 408.6 d
1465 499 408.6 d
The "z" column represents the group. They are named: a, b, c, d
Let's print them in R:
face <- read.csv("faces.csv")
mood <- read.csv("moods.csv")

layout(t(1:2))
plot(face$x, face$y)
plot(mood$x, mood$y)

That's it! But means and variances of their Y coordinates, of course, vary:
> tapply(mood$y, mood$z, function(x) c(mean=mean(x), var=var(x)))
$a
      mean        var 
  262.8756 29554.2238 
$b
      mean        var 
  220.5737 24246.4995 
$c
      mean        var 
  164.9827 19558.0978 
$d
     mean       var 
  234.329 30944.038 
So I needed some hocus-pocus to make them perfectly equal.
The simples way is to just standardize the Y coordinate: $Y_{std} = \frac{Y - mean(Y)}{variance(Y)}$
This gives Y cordinate with exactly 0 mean and unit variance.
That's all! But I wanted to make guessing a little bit more difficult, and decided to shift the mean and "explode" variance a bit, to not suggest the obvious solution 😉
inflate_variance <- function(x, factor) {
  mean_x <- mean(x)
  scaled_var <- (x - mean_x) * sqrt(factor)
  x <- (mean_x + scaled_var)
}

> var( inflate_variance( rnorm(n = 10000, mean = 0, sd = 1), factor = 5))
[1] 5.045869

> var( inflate_variance( rnorm(n = 10000, mean = 0, sd = 1), factor = 2.5))
[1] 2.497489

> var( inflate_variance( rnorm(n = 10000, mean = 0, sd = 1), factor = 11.6))
[1] 11.45029
It works!
Now I can combine all data, re-scale the Y coordinate and finally plot the faces and their descriptive statistics.
BTW, the ~0 Pearson's correlation comes out of the box.
Just look how the facecs are drawn: all elements except nose are symmetric (more or less - I'm not a good sketcher...).
Consequently, all increases in Y compensate (cancel) corresponding decreases, so the overall change is 0. So is the Pearson's R.
And that's all. Enjoy!
library(dplyr)
library(ggplot2)

do.call(bind_rows,
        lapply(letters[1:4], function(mood_id) {
          
          bind_cols("Face" = paste("Face", toupper(mood_id)),
                bind_rows(face, mood[mood$z == mood_id,])) %>% 
          mutate(y = scale(y, center = TRUE, scale = TRUE), 
                 y = y + 5,
                 y = inflate_variance(y, factor=2.2));
          })
        ) %>% 
ggplot() + 
  geom_point(aes(x=x, y=y, col=Face), show.legend = FALSE) +
  facet_wrap(~Face, ncol = 2) +
  theme_bw() +
  geom_label(data = . %>% 
                      group_by(Face) %>% 
                      summarize(Mean = mean(y),
                                Var  = var(y),
                                R    = cor(x, y)),
             aes(x=0, y=1.8, fill = Face, 
                 label = sprintf("Mean: %.3f | Variance: %.3f | Pearson's R: %.2f", Mean, Var, R)),
             hjust = 0, vjust = 0.5, show.legend = FALSE,
             fontface = "bold", col = "white") +
  labs(title = "Olszewski's quartet :-)",
       subtitle = "Always look at the data first! Don't rely solely on  descriptive statistics!",
       caption = "Adrian Olszewski, 2023\nhttps://www.linkedin.com/in/adrianolszewski/") +
  theme(plot.title = element_text(size = 20),
        plot.caption = element_text(face = "italic"))