@denis-bz
Created January 8, 2021 11:23
Stripy: percentile stripes for scatterplots


Keywords: scatterplot, percentiles, quantiles, visualize, regression, nonparametric

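[Figure ozone-stripy-4june: scatterplots with percentile stripes; top plot Ozone vs. Solar_radiation, bottom plot Ozone vs. a linear model of Ozone]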

What does this show ? Consider a fat vertical line at a given x in one of these plots. The colored bands are, from low to high,

  • yellow: percentiles 10 to 25
  • blue: percentiles 25 to 75, the middle half
  • yellow again: percentiles 75 to 90

For example, in the top plot of Ozone vs. Solar_radiation, the wide yellow band shows high Ozone values above Solar_radiation 200. The intent is to guide the eye to what might be interesting.

How it works

Take a scatterplot.

  1. Slice it into vertical slices, at x percentiles xq = [ 5 10 25 50 75 90 95 ]

  2. Within each slice, take the y percentiles yq = [ 10 25 50 75 90 ]. This gives e.g. for the top plot, Ozone vs. Solar_radiation,

    xy_slices  x range, y percentile values --
    x:  22 ..  37  nx  6  y: [6.5 9 9 12 16]
    x:  37 .. 114  nx 17  y: [7 13 20 32 49]
    x: 114 .. 207  nx 28  y: [14 22 38 76 87]
    x: 207 .. 256  nx 28  y: [20 27 48 85 116]
    x: 256 .. 285  nx 17  y: [15 21 37 61 97]
    x: 285 .. 310  nx  6  y: [17 26 48 68 76]

  3. One could just color the 4 percentile blocks within each vertical slice, but that looks blocky. So smooth the blocks by interpolating between the slice midpoints; here,

    x: av( 22 37 ),  y: [6.5 9 9 12 16]
    x: av( 37 114 ), y: [7 13 20 32 49]
    ...

(The user can of course specify xq, yq, and the colors.)
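
The steps above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the actual Stripy code: the names `percentile` and `xy_slices`, the linear-interpolation percentile rule, and the synthetic data are all made up here.

```python
import random

def percentile(sorted_vals, p):
    """Percentile p (0..100) of an already sorted list, interpolating linearly."""
    pos = (len(sorted_vals) - 1) * p / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac

def xy_slices(x, y, xq=(5, 10, 25, 50, 75, 90, 95), yq=(10, 25, 50, 75, 90)):
    """Return one (x_lo, x_hi, n_points, y_percentiles) tuple per vertical slice."""
    xs = sorted(x)
    edges = [percentile(xs, q) for q in xq]          # step 1: slice edges in x
    slices = []
    for i in range(len(edges) - 1):
        x0, x1 = edges[i], edges[i + 1]
        last = (i == len(edges) - 2)                 # last slice keeps its right edge
        ys = sorted(yj for xj, yj in zip(x, y)
                    if x0 <= xj < x1 or (last and xj == x1))
        if ys:                                       # step 2: y percentiles per slice
            slices.append((x0, x1, len(ys), [percentile(ys, q) for q in yq]))
    return slices

# Synthetic data (an assumption, not the Ozone data set)
random.seed(0)
x = [random.uniform(0, 300) for _ in range(200)]
y = [0.2 * xi + random.gauss(0, 10) for xi in x]

for x0, x1, n, ypct in xy_slices(x, y):
    xmid = (x0 + x1) / 2                             # step 3: av( x0 x1 )
    print(f"x: {x0:5.0f} .. {x1:5.0f}  mid {xmid:5.0f}  nx {n:3d}  "
          f"y: {[round(v) for v in ypct]}")
```

To draw the bands, one option is to connect the y percentiles at the slice midpoints and fill between neighbouring curves, for instance with matplotlib's `fill_between`.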

Data scatter, model scatter

Stripy just adds colored bands to scatterplots -- no model, no math.
In contrast, "confidence intervals" and "prediction intervals" are calculated as follows:

  1. Make a model of the data, often a linear model -- a line or curve.
  2. Assume normally-distributed (Gaussian) errors.
  3. "Confidence intervals", the red-shaded regions around the red line fits above, show where model lines fall with some probability, typically 95 %.
  4. "Prediction intervals" show where data points might fall around the lines.

Be careful:

  1. The normality assumption is nice math, but may not hold for your data. (Say "homoscedasticity" quickly 3 times.)
  2. "Confidence intervals" can easily be "OVER-confidence intervals". I prefer to plot, say, 5 lines, each fit to a random half of the data.
  3. The names can be confusing; I'd prefer "Model scatter bands" and "Data scatter bands" .
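
A minimal sketch of the "lines on random halves" idea in point 2, assuming a straight-line least-squares fit and synthetic data. The helper names `fit_line` and `lines_on_random_halves` are hypothetical.

```python
import random

def fit_line(x, y):
    """Ordinary least squares line y ~ a + b*x; returns (a, b)."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b

def lines_on_random_halves(x, y, nlines=5, seed=0):
    """Fit nlines lines, each to a random half of the data points."""
    rng = random.Random(seed)
    idx = list(range(len(x)))
    fits = []
    for _ in range(nlines):
        half = rng.sample(idx, len(idx) // 2)        # a random half, without replacement
        fits.append(fit_line([x[i] for i in half], [y[i] for i in half]))
    return fits

# Synthetic data (an assumption): true slope 0.2, Gaussian noise
rng = random.Random(0)
x = [rng.uniform(0, 300) for _ in range(200)]
y = [0.2 * xi + rng.gauss(0, 10) for xi in x]

fits = lines_on_random_halves(x, y)
for a, b in fits:
    print(f"intercept {a:6.2f}  slope {b:.3f}")
```

Plotting the 5 lines on top of the scatterplot shows directly how much the fit moves when the data changes; if they fan out widely, a narrow confidence band deserves some suspicion.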

Data vs. model: what's important ?

The bottom plot above shows y: Ozone vs x: linear model of Ozone, from linear regression / ordinary least squares, OLS. The stripes show how the data scatters -- the line model overestimates middle Ozone and underestimates high Ozone.

What's important ? If the goal is to predict high Ozone from Solar_radiation and Wind, then we should go for that, not fit the whole range. An easy, not to say crude, way is to look only at Ozone >= say 50, df[ df.Ozone >= 50 ]. That leaves only 33 of the 111 data points, but is better at high Ozone. Or, one could add quadratic terms; or do piecewise-linear a.k.a. hockey-stick regression; or ... There are many many ways to trade off model simplicity against prediction accuracy, with no clear "best". (The best plot wins ?)
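
The crude "fit only the high values" approach can be sketched like this, with made-up curved data standing in for Ozone and the threshold 50 taken from the text. `fit_line` and `rms_error` are hypothetical helpers, and the comparison is in-sample only.

```python
import random

def fit_line(x, y):
    """Ordinary least squares line y ~ a + b*x; returns (a, b)."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b

def rms_error(a, b, x, y):
    """Root-mean-square error of the line a + b*x on the points (x, y)."""
    return (sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y)) / len(x)) ** 0.5

# Made-up curved data (an assumption, not the Ozone data set)
rng = random.Random(1)
x = [rng.uniform(0, 300) for _ in range(111)]
y = [5 + 0.001 * xi ** 2 + rng.gauss(0, 8) for xi in x]

# Keep only the high values, like df[ df.Ozone >= 50 ]
high = [(xi, yi) for xi, yi in zip(x, y) if yi >= 50]
xh = [p[0] for p in high]
yh = [p[1] for p in high]

fit_all = fit_line(x, y)
fit_high = fit_line(xh, yh)
print(f"{len(xh)} of {len(x)} points are high")
print(f"rms error on high points, line fit to all data:  {rms_error(*fit_all, xh, yh):.1f}")
print(f"rms error on high points, line fit to high only: {rms_error(*fit_high, xh, yh):.1f}")
```

On the high points themselves the high-only line can never do worse than the all-data line, since least squares minimizes exactly that error. Whether it also predicts new high values better is a separate question.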

Appendix: percentiles, 3-number-summary, 5-number-summary, 7-number-summary

A nice way to look at a list of numbers such as prices or temperatures is to sort them, then pick 3, or 5, or 7 numbers that summarize the lot:

  • 3-number-summary: min, median and max. For example, [10 200 1000 9 1] sorts to [1 9 10 200 1000], so its 3-number summary is [1 10 1000]. The "median", the value in the middle, splits the sorted list into two halves: those below, and those above. Any list that sorts like

    [1 .......... 10 .......... 1000]
      half 1 .. 10 | half 10 .. 1000

has min median max = [1 10 1000] .
Exercise: what if the top half runs 10 .. 10000 ? 10 .. 1000000 ?
Exercise: estimate the median age, weight, EQ, income of some people you know.

In a list of 101 numbers, percentile "p" is the "p" th in sorted order, counting from 0: percentile 0 is the min, 50 the median, 100 the max. (Off-by-one errors are lurking here, see Percentile .)

  • 5-number-summary, percentiles 0 25 50 75 100: these split 100 numbers, after sorting, into 4 groups of 25 each, bottom quarter .. top quarter. The blue stripes above mark the middle half (middle two quarters), percentiles 25 to 75, of data points in vertical stripes.

  • 7-number-summary, percentiles 0 10 25 50 75 90 100: bottom 10 % .. top 10 %. The yellow stripes above mark 10 - 25 % and 75 - 90 % percentiles of data points in vertical stripes; the bottom 10 % and top 10 % are not colored. (For normally-distributed a.k.a. Gaussian-distributed data, the 4 bands yellow - blue - blue - yellow will be about equally wide; see Seven-number_summary .)
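
As a rough check of the summaries above, here is a small sketch in plain Python. The `percentile` helper uses one common linear-interpolation rule, which is an assumption; other percentile definitions give slightly different numbers. `summary` is a hypothetical name.

```python
import random
from statistics import NormalDist

def percentile(sorted_vals, p):
    """Percentile p (0..100) of an already sorted list, interpolating linearly."""
    pos = (len(sorted_vals) - 1) * p / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac

def summary(values, percents):
    """Sort the values, then pick out the given percentiles."""
    s = sorted(values)
    return [percentile(s, p) for p in percents]

THREE = (0, 50, 100)
FIVE = (0, 25, 50, 75, 100)
SEVEN = (0, 10, 25, 50, 75, 90, 100)

# 101 numbers 0 .. 100 in random order: percentile p is just p
data = list(range(101))
random.Random(0).shuffle(data)
print(summary(data, THREE))   # [0.0, 50.0, 100.0]
print(summary(data, FIVE))    # [0.0, 25.0, 50.0, 75.0, 100.0]
print(summary(data, SEVEN))   # [0.0, 10.0, 25.0, 50.0, 75.0, 90.0, 100.0]

# Widths of the 4 bands yellow - blue - blue - yellow for a standard normal
z = [NormalDist().inv_cdf(p / 100) for p in (10, 25, 50, 75, 90)]
widths = [hi - lo for lo, hi in zip(z, z[1:])]
print([round(w, 2) for w in widths])
```

The band widths come out at roughly 0.61 and 0.67 standard deviations, which is what "about equally wide" means here.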

Comments are welcome

and test cases most welcome.

cheers
-- denis

Last change: 2016-06-08 June
