
Constructing Chill Graphics

Models facilitate two fundamental acts of science: description and comparison.

Models: ideas and explanations that do two things: describe past observations and predict future outcomes.

Unlike models of hard scientific law, which describe deterministic behavior, statistical models help us understand sources of variation: "What is causing the differences in outcomes that I see?"

So understanding distributions is at the center of statistical or probabilistic thinking. A distribution has two components: (1) a support set, which tells what outcomes are possible, and (2) a mass (probability) function, which tells how likely each outcome is. Variation is something we will encounter in the process of describing and comparing -- in the process of doing science -- and understanding distributions helps us replace noise with explanation.
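(A toy sketch in Python of that two-part definition -- the support set and mass function here are made up for illustration:)

```python
# A toy discrete distribution: a support set plus a probability mass function.
support = [0, 1, 2, 3]                  # which outcomes are possible
pmf = {0: 0.1, 1: 0.4, 2: 0.4, 3: 0.1}  # how likely each outcome is

assert set(pmf) == set(support)
assert abs(sum(pmf.values()) - 1.0) < 1e-9   # probabilities must sum to 1

mean = sum(x * p for x, p in pmf.items())    # a summary measure of the distribution
print(mean)  # 1.5
```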

Summary measures -- like mean and variance -- can easily summarize a bell curve. But when things are more complex, the best thing is to show all the data -- summary measures don't always tell the whole story.
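(A quick illustration of that point, with hypothetical numbers: two samples can share the same mean and standard deviation while having completely different shapes.)

```python
import numpy as np

rng = np.random.default_rng(1)
bell = rng.normal(0, 1, 1000)                        # one bell curve
clumps = np.concatenate([rng.normal(-1, 0.1, 500),   # two tight clumps with...
                         rng.normal(1, 0.1, 500)])   # ...nearly identical summaries

for name, d in [("bell", bell), ("clumps", clumps)]:
    print(name, round(d.mean(), 2), round(d.std(), 2))  # means ~0, SDs ~1 for both
```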

Example 1

The conclusion of the first example -- the "sparklines" the author draws -- is actually pretty similar, at first glance, to my draft timeline. The idea is that the average of a set of ages isn't that relevant -- what you really want to understand is the distribution and what it implies. Once you graphically represent the distribution in a useful way, it's easy to see what the data imply and also what questions might be relevant to ask about the data.

Indeed, the author makes the argument that the average is only interesting when the total is interesting. "Mean if, and only if, total". If the mean and variance don't describe what's interesting about the distribution, then they're not useful summaries.

Example 2

An example is given of three summaries -- two that try to describe the data without precisely revealing it, and a third that includes the data. Once you see the third, the first two seem woefully unhelpful, as the data itself does not closely resemble the lines drawn therein.

"If you agree or disagree with the presenter with regard to the interpretation, the data remain unchanged. Show the atoms; show the data."

Mean and variance are good for Gaussian curves, but otherwise they can't really tell you much. They hinge on the assumption that the underlying distribution is normal and that we have a random sample. "What if we don't have that?" The challenge: come up with summaries that actually carry the information we want to convey.

Example 3

The third example shows "another example of increasing resolution, from summaries to data points." The vertical axis shows the fraction of water acquired in shallow soil, from 0 to 1, and the horizontal axis uses "categorical classifications" -- large, medium, and small sagebrush -- instead of a continuous predictor.

The figures: simple mean -> "dynamite plot" -> the entire distribution of responses.

At issue here: "the visual perception of this plot." "Are we showing the distribution? Where is the data?" With regard to the center plot, the dynamite plot -- "If we are clever, we notice that the small sagebrush category on the right has a longer variability measure, so that group is more variable. But what can we infer about the distribution? We can maybe infer that the variability stems are standard deviations, so you could assume that 95% of the data fall within two such standard deviations of the mean."

So: it forces you to do some visual compensation in order to really understand the distribution. You have to double the length of the stem, add and subtract it from the top of the black bar, and guess where the data might be.

Instead, a simple rule of thumb: "each datum gets one glob of ink and add extra ink only to increase understanding." The third panel sticks to this: each individual datum, from each plant, is visible, with lines marking the means, as in the earlier example. The bar plots are removed, since they don't even indicate where the data are -- at the bottom of each bar there are no actual data points, so there's bar where there is no data. It's confusing.
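(A rough matplotlib sketch of that third panel, with made-up stand-in values for the three sagebrush classes -- one glob of ink per datum, plus a plain line for each mean:)

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
groups = {"large": rng.beta(2, 5, 30),   # hypothetical stand-in data,
          "medium": rng.beta(2, 3, 30),  # one array per sagebrush size class
          "small": rng.beta(2, 2, 30)}

fig, ax = plt.subplots()
for i, (name, vals) in enumerate(groups.items()):
    x = i + rng.uniform(-0.08, 0.08, len(vals))   # jitter so points don't overlap
    ax.plot(x, vals, "o", alpha=0.5)              # one glob of ink per datum
    ax.hlines(vals.mean(), i - 0.2, i + 0.2)      # a plain line for the mean
ax.set_xticks(range(len(groups)))
ax.set_xticklabels(groups.keys())
ax.set_ylabel("fraction of water acquired in shallow soil")
plt.show()
```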

In summary: "Individual dots on the plot show us the distribution; the bar plots hide the distribution from us or even lie to us as to where the data are."

They write: "standard error calculations address the amount of certainty in finding the mean -- a very specific piece of information that might or might not be even slightly relevant." If we're not making inferrences about a mean, instead about individuals, use standard error. For individuals, use standard deviations.

OK -- at some point they've lost me. What's the difference between standard error and standard deviation?
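(For my own notes: the standard deviation measures the spread of the individual observations; the standard error is the SD divided by the square root of n, and measures uncertainty in the estimated mean. A quick numpy sketch, with made-up data, to convince myself:)

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=10, scale=2, size=100)  # 100 draws; true SD is 2

sd = data.std(ddof=1)          # spread of the individual observations
se = sd / np.sqrt(len(data))   # uncertainty in the estimated mean

print(f"standard deviation: {sd:.2f}")  # ~2; stays put as n grows
print(f"standard error:     {se:.2f}")  # ~0.2; shrinks like 1/sqrt(n)
```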

Get lost, mean! More disrespect for summary measures

"Analysis": from Greek, implies "breaking things down into component parts so as to understand the whole." Opposite of synthesis, bringing parts together to construct the whole. If we're trying to use graphics to analyze data, better focus on the former. Summaries like means, medians, percentiles, standard deviations, F, chi-square, T stats and P values -- not analytic. Synthesis! And synthesis often obscures underlying data.

Summary measures exist because of "sufficient statistics": if the data conform to a known shape, the summaries tell you everything there is to know about the data. In those cases the sum (and thus the mean) is actually relevant, and medians are similarly useful. But usually the data don't actually conform, and then the sufficient statistic becomes the data themselves.

So that's why we plot raw data -- "the atoms" -- first. The outliers are always the most interesting part of the data, so we look for those. And then we try to understand their source. "The goal is understanding the distribution of the data, therefore, make every rational attempt to show all of the data."

Yet another all-caps section dedicated to being mean to the mean

Dynamite plots are stupid, and if you asked a million people to confirm this, the author is entirely certain that they would. I have no reason to doubt them and shall take them at their word for now.

Another chart shows a cute cone graphic to denote sets of 30 cones sold by 5 ice cream carts. It's pretty easy to read if you notice the right things, but the person who had to answer a test question based on the chart didn't, and did a bunch of extra math. The author notes this was totally avoidable. And also, this chart doesn't do anything to explain why the data vary. "Why would one go through the trouble to write down these five data? Why are all these carts selling multiples of 30 cones? Are they only selling by the case? And why is cart 4 selling six times as much as cart 2? What is the source of the variation?" Also, the writer is mad that his kids have dumb teachers who give these sorts of questions.

Good graphics should reveal data. Tufte's nine "shoulds" for graphical displays. They should:

  • show the data
  • induce the viewer to think about the substance rather than about the methodology, graphic design, the technology of graphic production, or something else
  • avoid distorting what the data have to say
  • present many numbers in a small space
  • make large data sets coherent
  • encourage the eye to compare different pieces of data
  • reveal the data at several levels of detail, from a broad overview to fine structure
  • serve a reasonably clear purpose: description, exploration, tabulation, or decoration
  • be closely integrated with the statistical and verbal descriptions of a data set

The cone data display probably fails at all of these, except it might not distort what the data have to say.

But hey, there's another question that also sucks, this one about ticket sales. It converts groups of 20 tickets into ticket-stub graphics, including a half-stub to indicate 10. (Why is 10 the basic unit of ticket sales?) To answer the question, the student has to convert the graphics back into numbers -- that's the whole exercise: how to read stupid bar charts in all their stupid forms. Thinking about the distribution isn't really even possible with this setup, except for little inklings. At least, though, "we are seeing evidence about" what is possible and how often the possible things occurred.

What we are not seeing is any reasoning about the data itself. "The scientific component is being stifled." While we are seeing a distribution akin to the data from the first example, the presentation treats the whole thing as a toy.

Another disaster, and then suggestions

The table contains information on the planets, including Pluto (sorry Neal).

  1. Subtle inconsistencies in the column headers: planets and diameters are plural, while "distance from the sun" and "length of one year" are singular.
  2. The unit of distance, km, is written in the table instead of the header.
  3. Lengths of years are presented in two different quantities, days and years. (Consistency in the display helps make sure people read it correctly.)
  4. An Earth year is measured with the precision of six hours but Jupiter-Pluto are measured to nearest year, a drop in precision of 1400 times, over three orders of magnitude. ("No wonder the kids don't understand significant digits!" OK ranty)
  5. Same thing with planet diameters: Jupiter's is rounded to nearest 1000km; other planets enjoy precision to nearest km, except Pluto where it's 100km.

The question the student is asked just entails subtracting some figures from the data; the chart is just a riddle. It doesn't help you understand -- it's designed to be confusing.

The author provides an alternative: nine rows, representing nine "multivariate data". Huh? OK. Hmm. "A collection of even hastily-drawn scatter plots (using only default graphics settings in R) reveals the relationship that bears Kepler's name".

I'm enormously confused by this plot.

"Here we see the multivariate data in projections onto the three bivariate planes". OK WAIT WHAT?

"Here we see the multivariate data in projections onto the three bivariate planes. We can examine the relationship between distance and diameter and detect the four small inner planets and then see that Pluto is small too. The gas giants Jupiter and Saturn are truly giant compared to the small ones, while Neptune and Uranus fill out a niche in the middle. Planetary period ("Length of One Year") as a function of distance from the sun is seen in the middle plot in the bottom row. The smooth relationship is driven by the fact that the period increases with the 3/2 power of the semimajor axis, or, approximately, the average distance to the sun. The plot also shows the inverse relation (right, middle), that the distance is related to the 2/3 power of the period."

OK. So, I find this plot to be incredibly confusing -- not sure if that's a mark against the point the author is making or just an indicator of my lack of mastery of statistics. Why are the labels not in the same places as the data? The best I can infer is that the labels "diameter", "distance", and "period" correspond to the data in their rows, but the data scales for one of the boxes in each row are absent and instead printed around the empty box containing the label. I could read the analysis to try to decode the graph, but isn't this basically exactly what the author wants to avoid?
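(If I'm reading it right, this is the standard scatterplot-matrix layout that R's default pairs() produces: the variable names sit on the diagonal, and each off-diagonal panel plots one pair of variables. A rough Python equivalent, with approximate planetary values:)

```python
import pandas as pd
from pandas.plotting import scatter_matrix
import matplotlib.pyplot as plt

# Approximate values: distance in AU, diameter in km, period in Earth years.
df = pd.DataFrame(
    {"distance": [0.39, 0.72, 1.0, 1.52, 5.20, 9.54, 19.2, 30.1, 39.5],
     "diameter": [4879, 12104, 12756, 6792, 142984, 120536, 51118, 49528, 2376],
     "period":   [0.24, 0.62, 1.0, 1.88, 11.9, 29.4, 84.0, 164.8, 248.0]},
    index=["Mercury", "Venus", "Earth", "Mars", "Jupiter",
           "Saturn", "Uranus", "Neptune", "Pluto"])

scatter_matrix(df)  # names land on the diagonal; data fill the off-diagonal panels
plt.show()
```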

Still, I like their final proposal: a simple animation of the planets orbiting the sun. This actually shows all of the same information, but in a way that shows you what the data describe so you can immediately understand it: the planets' sizes, the speeds of their orbits, their distances from the sun. And you can see the system they describe as a whole. Yeah, that makes plenty of sense.

Finally, the original data table is reproduced without the incongruities noted previously. The lesson: stay out of the way of the data. This shows up within Tufte's principles above, the writer says: "

  • above all else show the data;
  • maximize the data-ink ratio;
  • erase non-data ink;
  • erase redundant data-ink;
  • revise and edit"

-- as the antithesis of some of Wainer's "How To Display Data Poorly" rules:

  • show as little data as possible,
  • hide what data you do show,
  • emphasize the trivial, and
  • label illegibly, incompletely, incorrectly, and ambiguously.

Also compares to Cleveland's Clear Vision: "

  • make the data stand out,
  • avoid superfluity,
  • use visually prominent graphical elements to show the data".

Redrawing the planet table helped keep the focus on the data. Stupid icons get in the way of the data. Plots that focus on the distribution of the data while keeping the data atoms themselves in view are the best, yay.

Another Table

OK, one more table, initially as presented by some researchers studying ROP -- retinopathy of prematurity, a disease of the eye that affects prematurely born babies.

Table shows infants as ROP-free, less-than-prethreshold ROP, prethreshold ROP, or threshold ROP. These are ordered categories of disease, "making the data ordered categorical data, sometimes called ordinal data". OK.

The data are also grouped by birth weight classes, as an index of how premature the babies are, since children born earlier tend to weigh less.

The table is meant to compare ROP "then" and "now", across a 10-year gap. Has the ROP situation improved?

The writer suggests that determining this from the table provided is difficult, "if not impossible". First, the gridlines are overbearing. But, if the fundamental comparison of interest is that of distributions (then and now), this table, he says, "prohibits" its investigation.

He lists 3 comparisons that can be made, w/r/t incidence of ROP:

  1. Between levels of ROP -- the comparison between disease categories within each weight class.
  2. Between three birth weight classes.
  3. Between time frames, then and now.

The first comparison operates within a distribution; what we want is to compare distributions as wholes. Since the comparison of interest is between time frames, the author suggests we start with just that one, setting aside for now the fact that we already know the birth weights. Using only the ROP levels in each time period, the author generates a much simpler chart that actually gives you a percentage in each category, making it pretty clear that the incidence of ROP has massively decreased.
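(The computation behind that simpler chart is just column percentages. A sketch with hypothetical counts -- not the paper's actual numbers:)

```python
import pandas as pd

# Hypothetical counts by ROP level and time period (not the study's actual data).
counts = pd.DataFrame(
    {"then": [120, 60, 30, 20], "now": [200, 30, 10, 5]},
    index=["no ROP", "less-than-prethreshold", "prethreshold", "threshold"])

# Integer percentages within each period, as in the author's refined table.
pct = counts.div(counts.sum(axis=0), axis=1).mul(100).round(0).astype(int)
print(pct)
```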

The author still has beef with "extra typographic paraphernalia" clogging up the table, though, so we refine further. We keep the percentages (in fact, that's all we really need to keep), we ditch the weirdly thick table lines ("daunting partition cage"), we get rid of the parentheses and percent signs, use integer percentages instead of decimal ones, we use two-digit years, and then retitle the thing to explain better what's going on. The resulting table is much easier to read and interpret.

But: we haven't included the birth weight classes from the original table, so we're missing a piece of the data. Do the differences still show up if you adjust for that? Author says this now looks like a job for a Cochran-Mantel-Haenszel test -- "a way to compare the distributions of ordered categorical data while controlling for a possibly-explanatory stratification variable".

But basically all that means is adding additional lines (strata, layers) to the table, comparing the years within each weight class. We can easily see that the improvements in ROP are consistent across the weight classes.
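(Extending the percentage sketch above with a hypothetical weight stratum, again with made-up counts:)

```python
import pandas as pd

# Hypothetical counts stratified by birth weight class (made-up numbers).
data = pd.DataFrame(
    {"weight": ["<1000g"] * 4 + ["1000-1500g"] * 4,
     "rop": ["none", "less-than-pre", "pre", "threshold"] * 2,
     "then": [30, 25, 15, 10, 90, 35, 15, 10],
     "now":  [60, 15, 5, 2, 140, 15, 5, 3]}).set_index(["weight", "rop"])

# Integer percentages within each (weight class, period) column.
pct = (data.groupby(level="weight")
           .transform(lambda col: 100 * col / col.sum())
           .round(0).astype(int))
print(pct)  # the then-vs-now improvement should show up within every stratum
```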

Lessons learned here: "erase non-data ink; eliminate redundant ink". Stay out of the way of the data and expose the distribution.

So what now?

So, while the timeline we're developing for the Twitter archiver doesn't pertain to raw numerical data, I suspect that many of the points are the same:

  • expose the data and the distribution
  • erase non-data ink
  • eliminate redundant ink
  • avoid distorting what the data have to say
  • induce viewer to think about substance, not technique or graphics
  • make the data set coherent
  • encourage the eye to compare several points of data
  • reveal the data at several levels of detail -- from macro to micro
  • serve a clear purpose
  • integrate with verbal descriptions