Keywords: scatterplot, percentiles, quantiles, visualize, regression, nonparametric
What does this show ?
Consider a fat vertical line at a given x
in one of these plots.
The colored bands are, from low to high,
- yellow: percentiles 10 to 25
- blue: percentiles 25 to 75, the middle half
- yellow again: percentiles 75 to 90
For example, in the top plot of Ozone vs. Solar_radiation, the wide yellow band shows high Ozone values above Solar_radiation 200. The intent is to guide the eye to what might be interesting.
Take a scatterplot.
- Slice it into vertical slices, at x percentiles xq = [ 5 10 25 50 75 90 95 ]
- Within each slice, take the y percentiles yq = [ 10 25 50 75 90 ]. This gives e.g. for the top plot, Ozone vs. Solar_radiation,

      xy_slices x range, y percentile values --
      x:  22 ..  37   nx  6   y: [6.5 9 9 12 16]
      x:  37 .. 114   nx 17   y: [7 13 20 32 49]
      x: 114 .. 207   nx 28   y: [14 22 38 76 87]
      x: 207 .. 256   nx 28   y: [20 27 48 85 116]
      x: 256 .. 285   nx 17   y: [15 21 37 61 97]
      x: 285 .. 310   nx  6   y: [17 26 48 68 76]

- One could just color the 4 percentile blocks within each vertical slice. But that looks blocky, so smooth, i.e. interpolate, the blocks, here,

      x: av( 22 37 ),  y: [6.5 9 9 12 16]
      x: av( 37 114 ), y: [7 13 20 32 49]
      ...

(The user can of course specify xq, yq, and the colors.)
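The slicing steps above can be sketched in a few lines of plain Python. This is only an illustration, not Stripy's actual code: the names `percentile` and `slice_percentiles` are made up here, percentiles use simple linear interpolation, and the slices are taken as half-open, which may differ from how Stripy treats points on a slice boundary.

```python
def percentile(sorted_vals, p):
    """Percentile p (0..100) of an already sorted list, by linear interpolation."""
    pos = (len(sorted_vals) - 1) * p / 100
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (pos - lo)


def slice_percentiles(x, y, xq=(5, 10, 25, 50, 75, 90, 95), yq=(10, 25, 50, 75, 90)):
    """Return a list of (x_lo, x_hi, n, y_percentiles), one per non-empty x slice."""
    xs = sorted(x)
    edges = [percentile(xs, q) for q in xq]  # slice boundaries along x
    slices = []
    for i in range(len(edges) - 1):
        lo, hi = edges[i], edges[i + 1]
        last = i == len(edges) - 2
        # assumption: half-open slices [lo, hi), except the last, which includes hi
        ys = sorted(yi for xi, yi in zip(x, y) if lo <= xi < hi or (last and xi == hi))
        if ys:  # tied x values can leave a slice empty
            slices.append((lo, hi, len(ys), [percentile(ys, q) for q in yq]))
    return slices
```

To smooth, one might then place each slice's y percentiles at the slice midpoint `(x_lo + x_hi) / 2` and interpolate between neighbouring midpoints before filling the bands.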
Stripy just adds colored bands to scatterplots -- no model, no math.
In contrast, "confidence intervals" and "prediction intervals" are calculated as follows:
- Make a model of the data, often a linear model -- a line or curve.
- Assume normally-distributed (Gaussian) errors.
- "Confidence intervals", the red-shaded regions around the red line fits above, show where model lines fall with some probability, typically 95 %.
- "Prediction intervals" show where data points might fall around the lines.
Be careful:
- The normality assumption is nice math, but may not hold for your data. (Say "homoscedasticity" quickly 3 times.)
- "Confidence intervals" can easily be "OVER-confidence intervals". I prefer to plot, say, 5 lines, each fitted to a random half of the data.
- The names can be confusing; I'd prefer "Model scatter bands" and "Data scatter bands" .
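As one possible reading of "lines on random halves" above, here is a minimal sketch in plain Python. The function names `fit_line` and `lines_on_random_halves` are hypothetical, and the fit is an ordinary least-squares straight line; the original may well do this differently.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares fit y ~ a + b*x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def lines_on_random_halves(xs, ys, nlines=5, seed=0):
    """Fit nlines straight lines, each to a random half of the data."""
    rng = random.Random(seed)  # fixed seed, so the picture is reproducible
    idx = list(range(len(xs)))
    fits = []
    for _ in range(nlines):
        half = rng.sample(idx, len(idx) // 2)  # half the points, without replacement
        fits.append(fit_line([xs[i] for i in half], [ys[i] for i in half]))
    return fits
```

Drawing each returned `(a, b)` as a line `y = a + b*x` over the scatterplot shows directly how much the fit depends on which points happen to be in the sample, without assuming normal errors.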
The bottom plot above shows y: Ozone vs. x: linear model of Ozone, from linear regression (ordinary least squares, OLS).
The stripes show how the data scatters -- the line model overestimates middle Ozone and underestimates high Ozone.
What's important ?
If the goal is to predict high Ozone from Solar_radiation and Wind,
then we should go for that, not fit the whole range.
An easy, not to say crude, way is to look only at Ozone >= say 50, df[ df.Ozone >= 50 ].
That leaves only 33 of the 111 data points, but is better at high Ozone.
Or, one could add quadratic terms; or do piecewise-linear a.k.a. hockey-stick regression; or ...
There are many many ways to trade off
model simplicity against prediction accuracy, with no clear "best".
(The best plot wins ?)
A nice way to look at a list of numbers such as prices or temperatures is to sort them, then pick 3, or 5, or 7 numbers that summarize the lot:
- 3-number-summary: min, median and max. For example, [10 200 1000 99 1] sorts to [1 10 99 200 1000], so its 3-number summary is [1 99 1000]. The "median", the value in the middle, splits the sorted list into two halves: those below, and those above. Any list that sorts like

      [1 .......... 10 .......... 1000]
       half 1 .. 10 | half 10 .. 1000

  has min median max = [1 10 1000].
Exercise: what if the top half runs 10 .. 10000 ? 10 .. 1000000 ?
Exercise: estimate the median age, weight, EQ, income of some people you know.
In a sorted list of 101 numbers, percentile "p" is the "p"th number, counting from 0: percentile 0 is the min, 50 the median, 100 the max. (Off-by-one errors are lurking here, see Percentile.)
- 5-number-summary, percentiles 0 25 50 75 100: these split 100 numbers, after sorting, into 4 groups of 25 each, bottom quarter .. top quarter. The blue stripes above mark the middle half (middle two quarters), percentiles 25 to 75, of data points in vertical stripes.
- 7-number-summary, percentiles 0 10 25 50 75 90 100: bottom 10 % .. top 10 %. The yellow stripes above mark percentiles 10 to 25 and 75 to 90 of data points in vertical stripes; the bottom 10 % and top 10 % are not colored. (For normally-distributed a.k.a. Gaussian-distributed data, the 4 bands yellow - blue - blue - yellow will be about equally wide; see Seven-number_summary.)
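The three summaries differ only in which percentiles they pick, so a small sketch can cover all of them. The names `percentile`, `summary`, `THREE`, `FIVE` and `SEVEN` are made up for this illustration, and linear interpolation between neighbours is just one of several percentile conventions.

```python
def percentile(sorted_vals, p):
    """Percentile p (0..100) of an already sorted list, by linear interpolation."""
    pos = (len(sorted_vals) - 1) * p / 100
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (pos - lo)


def summary(values, percents):
    """Sort the values, then pick the given percentiles."""
    s = sorted(values)
    return [percentile(s, p) for p in percents]


THREE = (0, 50, 100)                 # min, median, max
FIVE = (0, 25, 50, 75, 100)          # adds the quarters
SEVEN = (0, 10, 25, 50, 75, 90, 100) # adds the bottom and top 10 %

# For the 101 numbers 0 .. 100, percentile p is just p:
print(summary(range(101), SEVEN))
```

With this convention a list of 101 numbers needs no interpolation at all, which is why it makes a convenient mental model; other list lengths fall between two neighbours and that is where the off-by-one choices come in.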
Comments and test cases are most welcome.
cheers
-- denis
Last change: 2016-06-08