denis-bz/0-Stripy.md

## 0-Stripy.md

      
Stripy: percentile stripes for scatterplots

Keywords: scatterplot, percentiles, quantiles, visualize, regression, nonparametric

## What does this show?

Consider a fat vertical line at a given x in one of these plots.
The colored bands along it are, from low to high:

- yellow: percentiles 10 to 25
- blue: percentiles 25 to 75, the middle half
- yellow again: percentiles 75 to 90

For example, in the top plot of Ozone vs. Solar_radiation,
the wide yellow band shows high Ozone values above Solar_radiation 200.
The intent is to guide the eye to what might be interesting.

## How it works

1. Take a scatterplot.

2. Slice it into vertical slices, at x percentiles xq = [ 5 10 25 50 75 90 95 ].

3. Within each slice, take the y percentiles yq = [ 10 25 50 75 90 ].
   For the top plot, Ozone vs. Solar_radiation, this gives for each slice
   its x range, its number of points nx, and its y percentile values:

       xy_slices x range, y percentile values --
       x:     22 ..     37  nx   6  y: [6.5 9 9 12 16]
       x:     37 ..    114  nx  17  y: [7 13 20 32 49]
       x:    114 ..    207  nx  28  y: [14 22 38 76 87]
       x:    207 ..    256  nx  28  y: [20 27 48 85 116]
       x:    256 ..    285  nx  17  y: [15 21 37 61 97]
       x:    285 ..    310  nx   6  y: [17 26 48 68 76]

4. One could just color the 4 percentile blocks within each vertical slice.
   But that looks blocky, so instead smooth them: put each slice's y percentiles
   at the average of its x range, and interpolate between slices:

       x: av( 22 37 ),   y: [6.5 9 9 12 16]
       x: av( 37 114 ),  y: [7 13 20 32 49]
       ...


(The user can of course specify xq, yq, and the colors.)
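
The slicing steps above can be sketched in a few lines of plain Python. This is only a sketch, not Stripy's own code: the helper names `percentile` and `stripy_bands` are made up for illustration, the data is synthetic rather than the Ozone data, and the percentile rule (linear interpolation between sorted neighbours) is one convention among several.

```python
import random

def percentile(values, p):
    """Percentile p (0..100), interpolating linearly between sorted neighbours."""
    s = sorted(values)
    pos = (len(s) - 1) * p / 100.0      # fractional index into the sorted list
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)

def stripy_bands(x, y, xq=(5, 10, 25, 50, 75, 90, 95), yq=(10, 25, 50, 75, 90)):
    """Return one (x_mid, nx, y_percentiles) tuple per vertical slice."""
    edges = [percentile(x, q) for q in xq]
    bands = []
    for i, (x0, x1) in enumerate(zip(edges[:-1], edges[1:])):
        last = (i == len(edges) - 2)
        # points with x0 <= x < x1; the last slice also keeps x == x1
        ys = [yi for xi, yi in zip(x, y) if x0 <= xi < x1 or (last and xi == x1)]
        if ys:
            bands.append(((x0 + x1) / 2, len(ys), [percentile(ys, q) for q in yq]))
    return bands

# Synthetic scatter (an assumption): 111 points whose spread grows with x
rng = random.Random(0)
x = [rng.uniform(0, 300) for _ in range(111)]
y = [0.2 * xi + rng.gauss(0, 10 + 0.1 * xi) for xi in x]

bands = stripy_bands(x, y)
for x_mid, nx, ys in bands:
    print(f"x mid {x_mid:6.1f}  nx {nx:3d}  y: {[round(v, 1) for v in ys]}")
```

To draw the stripes, one option would be to join each y percentile across the slice midpoints and shade between neighbouring curves, for instance with matplotlib's `fill_between`.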

## Data scatter, model scatter

Stripy just adds colored bands to scatterplots -- no model, no math.

In contrast, "confidence intervals" and "prediction intervals"
are calculated as follows:

- Make a model of the data, often a linear model -- a line or curve.
- Assume normally-distributed (Gaussian) errors.
- "Confidence intervals", the red-shaded regions around the red line fits above,
  show where model lines fall with some probability, typically 95 %.
- "Prediction intervals" show where data points might fall around the lines.

Be careful:

- The normality assumption is nice math, but may not hold for your data.
  (Say "homoscedasticity" quickly 3 times.)
- "Confidence intervals" can easily be "OVER-confidence intervals".
  I prefer to plot, say, 5 lines, each fitted to a random half of the data.
- The names can be confusing; I'd prefer
  "Model scatter bands" and "Data scatter bands".

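The "lines on random halves" idea can be tried without any statistics package. The sketch below is an illustration only: `fit_line` is a hypothetical helper doing plain ordinary least squares, and the data is synthetic (a known line plus Gaussian noise), so the spread of the 5 fitted lines is just what this made-up example produces.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares fit of y ~ a + b*x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((xi - mx) ** 2 for xi in xs)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

# Synthetic data (an assumption): y = 2 + 0.5 x plus noise
rng = random.Random(1)
x = [rng.uniform(0, 100) for _ in range(100)]
y = [2 + 0.5 * xi + rng.gauss(0, 15) for xi in x]

lines = []
for _ in range(5):
    half = rng.sample(range(len(x)), len(x) // 2)   # a random half, without replacement
    lines.append(fit_line([x[i] for i in half], [y[i] for i in half]))

for a, b in lines:
    print(f"intercept {a:6.2f}  slope {b:5.3f}")
```

Plotting these 5 lines over the scatterplot shows directly how much the fit moves when the data changes, with no normality assumption involved.
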
## Data vs. model: what's important?

The bottom plot above shows y: Ozone vs x: linear model of Ozone,
from linear regression / ordinary least squares, OLS.
The stripes show how the data scatters --
the line model overestimates middle Ozone and underestimates high Ozone.
What's important?
If the goal is to predict high Ozone from Solar_radiation and Wind,
then we should fit for that, not for the whole range.
An easy, not to say crude, way is to look only at Ozone >= 50, say:
df[ df.Ozone >= 50 ].
That leaves only 33 of the 111 data points, but fits better at high Ozone.
Or one could add quadratic terms, or do piecewise-linear a.k.a. hockey-stick regression, or ...
There are many, many ways to trade off
model simplicity against prediction accuracy, with no clear "best".
(The best plot wins?)
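
The "fit only the high values" idea can be sketched as follows. Everything here is an assumption for illustration: the data is synthetic and convex (not the real Ozone data), there is a single predictor instead of Solar_radiation and Wind, and `fit_line` and `mse` are hypothetical helpers. The list filter plays the role of `df[ df.Ozone >= 50 ]`.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares fit of y ~ a + b*x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((xi - mx) ** 2 for xi in xs)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def mse(a, b, xs, ys):
    """Mean squared error of the line a + b*x on the points (xs, ys)."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(xs, ys)) / len(xs)

# Synthetic convex data (an assumption): y grows like x squared, plus noise
rng = random.Random(2)
x = [rng.uniform(0, 100) for _ in range(111)]
y = [0.01 * xi ** 2 + rng.gauss(0, 8) for xi in x]

# Keep only the high-y points, like df[ df.Ozone >= 50 ]
high = [(xi, yi) for xi, yi in zip(x, y) if yi >= 50]
hx, hy = zip(*high)

a_all, b_all = fit_line(x, y)        # line fitted to the whole range
a_high, b_high = fit_line(hx, hy)    # line fitted to the high points only
mse_all = mse(a_all, b_all, hx, hy)
mse_high = mse(a_high, b_high, hx, hy)

print("points kept:", len(high), "of", len(x))
print("MSE on high points, line fitted to all data :", round(mse_all, 1))
print("MSE on high points, line fitted to high only:", round(mse_high, 1))
```

The second error can never exceed the first, because least squares minimizes exactly that quantity on the kept points. The comparison shows the trade-off, not that the subset line is better anywhere else.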

## Appendix: percentiles, 3-number-summary, 5-number-summary, 7-number-summary

A nice way to look at a list of numbers such as prices or temperatures is to sort them,
then pick 3, or 5, or 7 numbers that summarize the lot:


3-number-summary: min, median and max.
For example, [10 200 1000 9 1] sorts to [1 9 10 200 1000], so its 3-number summary is [1 10 1000].
The "median", the value in the middle, splits the sorted list
into two halves: those below, and those above.
Any list that sorts like
[1 .......... 10 .......... 1000]
half 1 .. 10  |  half 10 .. 1000


has min median max = [1 10 1000] .

Exercise: what if the top half runs 10 .. 10000 ?  10 .. 1000000 ?
Exercise: estimate the median age, weight, EQ, income of some people you know.

In a list of 101 numbers, percentile p is the p-th in sorted order, counting from 0:
percentile 0 is the min, 50 the median, 100 the max.
(Off-by-one errors are lurking here; see Percentile.)


5-number-summary, percentiles 0 25 50 75 100:
these split 100 numbers, after sorting, into 4 groups of 25 each, bottom quarter .. top quarter.
The blue stripes above mark the middle half (middle two quarters), percentiles 25 to 75,
of data points in vertical stripes.


7-number-summary, percentiles 0 10 25 50 75 90 100:
bottom 10 % .. top 10 %.
The yellow stripes above mark percentiles 10 to 25 and 75 to 90
of data points in vertical stripes;
the bottom 10 % and top 10 % are not colored.
(For normally-distributed a.k.a. Gaussian-distributed data,
the 4 bands yellow - blue - blue - yellow will be about equally wide;
see Seven-number_summary.)
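
The summaries above are easy to compute by hand. The sketch below uses a hypothetical `percentile` helper that interpolates linearly between sorted neighbours (one convention among several, which is where the off-by-one questions come from) and a made-up list of numbers. The last part checks the Gaussian remark with the standard library's `NormalDist`.

```python
from statistics import NormalDist

def percentile(values, p):
    """Percentile p (0..100), interpolating linearly between sorted neighbours."""
    s = sorted(values)
    pos = (len(s) - 1) * p / 100.0      # fractional index into the sorted list
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)

def summary(values, ps):
    """List of the requested percentiles of values."""
    return [percentile(values, p) for p in ps]

prices = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5]             # made-up numbers
print(summary(prices, (0, 50, 100)))                   # 3-number: [1.0, 4.0, 9.0]
print(summary(prices, (0, 25, 50, 75, 100)))           # 5-number: [1.0, 2.5, 4.0, 5.0, 9.0]
print(summary(prices, (0, 10, 25, 50, 75, 90, 100)))   # 7-number: [1.0, 1.0, 2.5, 4.0, 5.0, 6.0, 9.0]

# With the 101 numbers 0..100, percentile p is simply the p-th, counting from 0
hundred_one = list(range(101))
print(summary(hundred_one, (0, 50, 100)))              # [0.0, 50.0, 100.0]

# Gaussian check: widths of the 4 bands between percentiles 10, 25, 50, 75, 90
z = [NormalDist().inv_cdf(p / 100) for p in (10, 25, 50, 75, 90)]
widths = [b - a for a, b in zip(z, z[1:])]
print([round(w, 2) for w in widths])                   # roughly [0.61, 0.67, 0.67, 0.61]
```

For a standard normal the yellow bands come out about 10 % narrower than the blue ones, which is what "about equally wide" amounts to here.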


Comments are welcome

and test cases most welcome.
cheers

-- denis
Last change: 2016-06-08 June