@timfitzzz
Created July 1, 2015 01:43

Fundamental Statistical Concepts in Presenting Data

Notes, Part 2 (Part 1)

Starting from Page 24, we now examine a plot created by the author. This graphic is a line plot with a linear y axis, a logarithmic x axis, and a variable 'range' that sweeps along with the curved lines that are being plotted. Within the graphic itself there appear three significant paragraphs of very small text.

We learn from the commentary that this chart pertains to "the buildup of fatty deposits in artery walls", known as "peripheral artery disease." Patients have one of three classifications: no disease, claudication (pain in the legs), or critical limb ischemia (dangerously restricted blood supply to the legs). The study being charted considered the overall health of the patients' arteries, the degree of tibial artery calcification scored as "TAC", and other covariates: age, race, smoking status, etc.

The examination of the data that produced this plot was meant to illuminate the relationship between the category of disease, the TAC score, and the covariates. "How do changes in TAC affect the probability that a patient will be in a particular peripheral artery disease state?"

So the plot shows "model probabilities" for each disease state based on the TAC, using an "ordinal logistic regression" model. The covariates are "held at mean level", which I believe is not graphically represented. The three levels are 'ordinal' in that they have a natural ordering. The sum of the probabilities of each of the three levels is 100% for any given TAC score.
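That sum-to-unity property falls straight out of the cumulative-logit form of ordinal logistic regression. A minimal sketch in Python -- every coefficient here is invented for illustration, since the paper's fitted values aren't reproduced in these notes:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def state_probabilities(log_tac, beta=-1.2, cutpoints=(-1.0, 1.5)):
    """Cumulative-logit (proportional odds) model for three ordered states.

    P(Y <= j) = sigmoid(theta_j + beta * x); the three state probabilities
    are differences of adjacent cumulative probabilities. All parameter
    values are made up for illustration.
    """
    c1 = sigmoid(cutpoints[0] + beta * log_tac)  # P(no disease)
    c2 = sigmoid(cutpoints[1] + beta * log_tac)  # P(no disease or claudication)
    return (c1, c2 - c1, 1.0 - c2)  # (none, claudication, critical limb ischemia)

probs = state_probabilities(log_tac=0.5)
assert abs(sum(probs) - 1.0) < 1e-9  # probabilities sum to unity at any TAC
```

With a negative slope on the TAC term, raising the score drains probability out of "no disease" and into the worse states, which matches the inverse relationship the plot shows.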

Donahue calls out the "pertinent features" of his graphic:

  • At the top, a title which describes what the plot represents. "Don't underestimate the value of titles in graphics which might need to stand on their own."
  • A prominent byline w/ author's affiliation, so as to lend some level of authority and credibility to the information being presented.
  • The three estimated probabilities are presented with measures of uncertainty (those are the light gray 95% confidence intervals), indicating and overtly acknowledging the imprecision of the estimations.
  • The estimated curves are labeled on the plot, so you instantly understand which curve goes with which disease level. (This is a feature my timeline draft with the dots lacked, even with labels potentially close by.)
    • Ah, yes: "Rather than a legend floating nearby and using different, visually-distinct lines to show the curves, the on-field labeling allows the viewer quick and easy access to the categories without having to withdraw attention from the plot in order to determine what is what."
    • The labels require no arrows, connectors, etc.; the words' relationship to the lines is clear.
  • The text on the plot (which I actually would have guessed was bad technique) describes:
    • The method and the model
    • An introduction to the problem
    • How to read the plot and what to look for.
    • What direction to read in and the basic inverse relationship between TAC and the levels.
    • An example reading from the graphic itself, of TAC score 100
    • That the probabilities for a given TAC sum to unity.
    • How the analysis was carried out and who was responsible for measuring and collecting the data.

Donahue notes that while it doesn't tell us "all we need to know about TAC score and peripheral artery disease", it tells us a) "quite a bit" and b) who's responsible for the data. It shows distributions of disease at TAC, and enables comparisons across those levels. It shows you what model is being used to describe the data set, and can therefore help you make predictions.

"And it shows us that these models can be quite pretty, even in grey scale." Oh, do go on, Rafe!

Oh, you do. OK:

  • All relevant data-related elements are prominent
  • Supporting information takes a visually subordinate role
  • Bold black curves at front for point estimates for probability curves
  • Curves are atop 95% confidence interval grey ranges
  • Ranges, labels and textual narrative are on top of the grid of reference lines.
  • Support elements "humble and unassuming", don't distract from the data-related elements
  • "Like good type-setting or a good wait-staff," support elements don't interfere -- they only support.

Shouts out to his time and effort in creating this, and also the richness of his work. It's nearly standalone and tells a story.

A few guidelines "from this plot" that he bullets:

  • Take time to document and explain
  • Integrate picture and text
  • Put words on your displays
  • Tell viewer what the plot is and what to see
  • Engage the viewer.

He notes that the data for the study are "complex multivariate data in patients with a complex disease using a complex data analysis method." Obviously a graphic representing this will need explanation, and won't be super simple or small. Pictures are great, but good descriptions and attention to detail won't hurt.

OK, next graphic. (p. 26)

This one very obviously charts eruptions of the geyser Old Faithful, which I guess is a sample data set provided by R. The Y axis is "Duration (sec)" and the X axis is "Time til next eruption (min)". It shows a definite relationship between them, as well as a clustering of data. Most of the data points fit into two vague groupings, with a gap between them. This is highlighted with a display of a raw tally of data at each point along each axis, and is also visible in the chart itself.

The label is "Old Faithful Eruptions (271 samples)", highlighting the number of samples so that the viewer knows that this isn't taken way out of context.

The most recent data point -- the duration of the most recent eruption -- is given special prominence, as a line that goes across the chart. This could be used to predict the most likely amount of time until the next eruption.

Some dots are blue and some are red, and the explanation for this isn't immediately apparent to me. There are also breaks in the axis lines and the sections of each line aren't completely aligned. Not sure what that's about.

OK, on to his analysis.

He says that the design employs some "automated techniques for computing summary statistics for the marginal distributions." For both marginals, summaries are shown: extrema, quartiles, and mean. OK, I think the quartiles are the breaks in the line; the extrema might be the highest and lowest data points, and the mean appears as a little diamond along the axis line, which I hadn't noticed until now.

There's no key, so I can't help but think one would have been useful here?

OK, reading on, he clarifies:

  • Minima and maxima are the labels at the ends of the axes, which aren't in the same intervals as the axes overall. You can read them and know exactly what the first and last data points are.
  • First and third quartiles are indeed the shifts in the axis bars; the medians, he says, are actually the breaks in the bars. OK, wait, shifts versus breaks. Yeah, I think that means each axis bar shifts inward at the first and third quartiles and breaks at the median, but I'm still a little confused about that.
  • The means are indeed the diamonds just below the axes.
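So the decoded marginal summaries are just the five-number summary plus the mean. A sketch with made-up duration values in two clusters, standing in for one marginal of the eruption data:

```python
import statistics

# Made-up eruption durations (seconds), bimodal like the real marginal.
durations = [96, 108, 110, 115, 120, 240, 250, 255, 260, 270, 275]

summary = {
    "min": min(durations),                          # extrema: the axis end labels
    "q1": statistics.quantiles(durations, n=4)[0],  # first shift in the axis bar
    "median": statistics.median(durations),         # break in the bar
    "q3": statistics.quantiles(durations, n=4)[2],  # second shift
    "max": max(durations),
    "mean": statistics.mean(durations),             # the little diamond
}
```

Telling detail: the mean here (about 191 seconds) lands squarely in the empty gap between the two clusters -- exactly the summaries-conceal-bimodality point that comes up below.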

And yes adjacent to each axis is a "histogram", similar to the piano data class (and actually similar to my first draft of the timeline, with the dots for the tags, except that all the data was blurred together.)

He says "the data are (at least) bivariate" -- how are they "at least" bivariate? What variables are there aside from the duration and time til eruption? Donahue confirms that there's a positive correlation. The coloration of the dots, turns out, shows "whether or not the duration of the previous eruption was more (red) or less (blue) than an arbitrarily-selected 180 seconds, indicated by the horizontal dotted line".

OHHH. So the line that says "previous duration" which has blue and red dots on either side turns out to be the key for the colors. If the previous duration is more than 180 seconds, the dot appears red. If the previous duration is less than 180 seconds, it appears blue. So the graphic actually shows that longer eruptions tend to alternate with short ones, more often than not -- but definitely not always. He calls this a "negative autocorrelation".
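That alternation is easy to quantify as a lag-1 autocorrelation; a stdlib-only sketch on made-up durations that alternate long/short:

```python
# Made-up eruption durations (seconds) that alternate long and short.
durations = [250, 110, 260, 105, 255, 120, 245, 115, 270, 100]

mean = sum(durations) / len(durations)
dev = [d - mean for d in durations]

# Lag-1 autocorrelation: how each value correlates with its successor.
lag1 = sum(dev[i] * dev[i + 1] for i in range(len(dev) - 1)) / sum(x * x for x in dev)
assert lag1 < 0  # alternating long/short eruptions -> negative autocorrelation
```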

With regard to the automatic axes and visual guides: he calls them "triumphs of programming and subtle visual information encoding, providing summaries without having to compute and plot them by hand" but he says that they actually demonstrate why that's not enough "to tell the whole story". He then notes the thing I said before about the clustering of the data into two distinct groups, which he calls "bimodality" -- and he also calls out that you can see this on the "marginal distributions", as I said!

He notes that while you could examine the data summaries and decide that the mean eruption interval is 70 minutes and be correct, it 'corrupts' the interpretation because the bimodality is only revealed by examining the data points themselves. "Very rarely are the eruptions an hour apart; in fact, if it has been exactly 60 minutes and you forgot to put new film in the camera, you are more likely to wait for more than 15 minutes than less than 15 minutes; you might have time to reload your film!" Very useful! So, if you had only the summaries and couldn't see the atomic level data, you might have missed this. Mean and median wouldn't have accurately communicated the distribution. "All we would know would be the location of the center point, regardless of the amount of mass living there."

He also notes the gap between the clusters, which shows that in the data collected there was no eruption at 61 minutes. There's at least one at every other integer. The reason for this is unclear, but Donahue suggests that it may reveal a human digit preference, since a data collector might prefer 60 over 61. Or maybe it just worked out this way, but humans r stoopid so hard to say. Either way, you wouldn't see this without looking at the individual atoms of data.

More boldface:

  • Attempt to show each datum; show the atomic-level data.
  • Avoid arbitrary summarization, particularly across sources of variation.

OK here we go again. Next graphic, p28.

Takin' it back to grade skool again: a photo of a chart on a piece of posterboard! Nice one. It's a terrible picture too, though he promises a better version on the next page, which seems kind of silly but whatever.

I skipped to the zoomed-in plot to make some initial observations:

  • The bottom axis is "Bedtime," from 8:15pm to 10:00pm (though there doesn't seem to be any data after 9:25pm).
  • The top axis is "Wake-up time", between 6:00am and 7:30am.
  • Some of the data points are marked with a + instead of a dot, not sure why.
  • There are diagonal lines connecting some points, where there appears to be a trend? Not sure.

Honestly can't really see shit else on this. OK, his observations:

  • The plot shows 54 multivariate data, obviously bedtime-waketime pairs
  • Individual points also code for weeknight (dot) or weekend (cross). Ah, k.
  • Diagonal lines show nights of equal amounts of sleep, ahh. Bottom line: 9.5 hours of sleep; then middle is 10 hours and top is 10.5.
  • Three data points are specifically annotated, to be reviewed momentarily.
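Those diagonals are just level curves of (waketime minus bedtime); a small sketch of the arithmetic, with the helper name invented here:

```python
from datetime import datetime, timedelta

def hours_of_sleep(bedtime, waketime):
    """Sleep duration from a bedtime one evening to a waketime the next morning."""
    fmt = "%I:%M%p"
    start = datetime.strptime(bedtime, fmt)
    end = datetime.strptime(waketime, fmt) + timedelta(days=1)
    return (end - start).total_seconds() / 3600

# Points on the same diagonal share a duration; this one sits on the
# middle, 10-hour line.
assert hours_of_sleep("8:45PM", "6:45AM") == 10.0
```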

Plot is designed to answer question of whether or not early bedtimes yield early waketimes, to hopefully (one would guess) disprove a parental claim that the kid needs to go to bed early because they have a big day tomorrow.

The plot really doesn't provide a solid answer to this question, but hey, it's a teachable moment:

  • All the data are plotted; it's easy to see that there are two variables that relate: bedtime and waketime.
  • The variation in bedtimes is tight -- all but 8 occur between 8:30 and 9, weeknight or weekend -- so you can guess the parents probably don't accept any talking back, or at the very least would revoke all rock and rolling privileges in the case that talking back did occur.
  • Waketimes vary much more -- a similarly proportioned range would be an hour wide -- which suggests the kid gets to wake up on their own.
  • Three annotated points describe outliers: at the far right, a bedtime after 10:30 on a weeknight: "Super Bowl". Another really late night, 9:30, "ACC Tournament". Then the earliest bedtime, outside the plotting region, barely 8:00pm: "Got in trouble!"
  • No equivalent notes for early or late awakenings, so we are led to guess they aren't noteworthy.

Overall, no consistent relationship between bedtime and waketime; the appearance is a child who sleeps until they're not tired and then wakes up on their own. Donahue knows this to be true because -- hey, wait a minute. That's not fair. It's his son? All right Donahue. I see you, you and your wacky twist endings.

Thing is, my interpretation of this kid is someone who has a sense of humor that his dad does not appear to fully appreciate, using his father's beloved graphics against him to make a case that his bedtimes are too strict, and actually doing a great deal to challenge the idea that going to bed early makes a substantial difference in when he wakes up.

OK, next graphic: Text starts on page 30.

OK now we're talking about cell-staining and its use to identify types of cells in a tissue sample.

The process of staining takes a while. Researchers asked Donahue to help them examine the relationship between the amount of time spent on this process and the way tumors reacted to the stains. They designed an experiment: 7 tumor types, each with 8 stains, each at 4 time points (0, 5, 15, 60 min). At each time point, they took three readings, recorded as values between 0 and 100.

Total datums: 672. Questions he asks:

  • How do we show them and their distribution? (That is a lot of points of data to show.)
  • What are the sources of variation that we want to reveal?
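Quick sanity check on that count, straight from the design as described:

```python
# Full factorial design: 7 tumor types x 8 stains x 4 time points x 3 replicates.
tumors, stains, time_points, replicates = 7, 8, 4, 3
design_size = tumors * stains * time_points * replicates
assert design_size == 672
```

(His plot later reports 669 datums, so presumably a few readings went missing.)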

He says the problem looks like "a classic analysis of variance model, with sources of variation being tumor, stain, time, replicates, and appropriate interactions". You could just generate the "ANOVA", which I guess is the analysis-of-variance table? But is that what the datums say at the atomic level?

Well hey, here's a graphic. He goes on to analyze it, but first let's see what we see...

  • A series of gray squares in a thickly-padded-white-border grid. 8 across, 9 down = 72 squares.
  • The squares across are the 7 tumor types, plus a last column for "All Types", and along the left are the 8 stains, plus a 9th row for "All Stains".
  • Within the squares are tiny speckish datums, plotted within the squares according to their values. But it's not clear what values they represent, as there is no scale marked for the gray boxes' Y axis, and the X is simply marked 0-60. But: based on the intro summary, we know that's minutes, and we can surmise that the Y values are 0 to 100.
  • The lower-right corner box contains all of the data points, and forms 4 pretty thick lines encompassing all the stains and all the types. But I don't know quite what the X values of these lines are, since they're not labelled with specificity, and I can't do anything to inquire into the reason there are gaps along their Y axes where no datum points exist.
  • If we do know what this data represents, though, we can see a few more things. The 8th stain registers lower scores regardless of the data type, though is slightly better with type 4. The rest of the stains form very similar lines in their own summary squares, and if you look closely you can see a lot of info about which stains work best with which types.
  • The blue specks are so plentiful, and the zoom is so wide, that it's impossible to see with any precision whether there are clusters of many, many specks close together. A strong cluster and a massive cluster would be tough to tell apart, and my guess is that any predictive value of this plot gets muddied as a result.

OK, back to his analysis of the plot:

  • 669 datums
  • columns: 7 types of tumors + summary
  • rows: 8 types of stains + summary
  • In each box, time flows left to right (0-60 minutes) and percent of cells stained is shown vertically (0 to 100%).
  • Three replicates at each time point in each regular square (duh, that's why there are bands, they only used certain time points).
  • "Main effects" of the tumor are shown in the bottom row. I'm guessing that "main" means aggregate? And likewise effects of the stains are in the right-most column. So, intersections represent interactions between tumor/stain.
  • "Our data display is the model," he writes. "We see the distributions as functions of the levels of the sources of variation."

He directs us to the top row, where all but the fourth type shows high levels of staining. The fourth type seems to interact with every stain but the last one; he points out that the aggregate effects don't give us much when our interest is in interactions, which are separately mapped here. "Compare the marginal distributions to the interaction plots: what does the main effect mean when it depends on the level of a second source of variation?"

OK, gonna have to look up "main effect". Hmm... "a main effect is the effect of an independent variable on a dependent variable averaging across the levels of any other independent variables." Still not entirely sure what that means. Oh, ok. The dependent variable here is the percentage of cells stained, which depends on the independent variables, timing, tumor and strain. So what he is saying is, look at the aggregate distributions vs the interaction plots. Oh, "averaging across". So yeah it is aggregate, right -- and the point is that if there is another source of variation that it depends on, it doesn't mean much because you can't tell which variable was determinative?
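A tiny made-up example of why a main effect can mislead when interactions dominate -- two stains that each work on only one of two tumor types (all numbers invented):

```python
from statistics import mean

# Staining percentages: each stain works on exactly one tumor type.
data = {
    ("stain1", "tumorA"): [95, 97, 96],
    ("stain1", "tumorB"): [5, 4, 6],
    ("stain2", "tumorA"): [6, 5, 4],
    ("stain2", "tumorB"): [94, 96, 95],
}

def main_effect(stain):
    """Mean staining for one stain, averaging across tumor types."""
    return mean(v for (s, _), reps in data.items() if s == stain for v in reps)
```

Both stains come out with a main effect near 50%, yet neither ever stains anywhere near 50% of cells in any actual condition: the tumor-by-stain interaction carries all the information, which is Donahue's point (and it foreshadows his later question about the ~50% grand mean).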

Now he looks at the first plot, which has "little clusters of the three replicates at each time point", which are fairly consistent, but have some variation, which he calls the residual error -- but it's small. But in stain4/tumor1, the error is massive -- and this happens in a lot of the plots. The process isn't different, so Donahue suggests that we might want to be sure we understand the source of this variation before we try to understand the effects that time has on the outcome.

And yeah, "what of time?" It's a flat circle; Donahue is such a Rust Cohle. Anyway: the effect of time can't be more than minimal, at least relative to the differences between the tumors and stains in their interactions. And time would also have to interact with those factors as well, because some of the plots show no real difference based on time, and others show enormous effects.

Finally, he notes that the "overall grand mean" is in the neighborhood of 50%. "In light of many drastic 'all or none' experiences in the data, what does such a mean represent? Is reporting the mean even a rational thing to do?"

Well, earlier he said that the mean is only important if the total is... does that apply here, or maybe it doesn't relate to this? But that formulation was a little esoteric, seemed like. Maybe it does matter because we are concerned with the total number of cells that get stained, even if that's not what we're specifically concerned with here. But like, that matters. So maybe reporting the mean is useful here.

He notes that, again, the initial description seems like a "classic three-way ANOVA with replicates", but looking at the data reveals issues that need to be addressed before any such presentation would be useful:

  • Interactions between tumor type, stain type, and time
  • Inconsistent variances in the distributions of the replication process
  • Highly non-normal distributions.

"When data is complex, ignoring interactions and inconsistencies will not make them go away." By focusing on the atomic-level data, and starting from there, we're helped to see the undiscovered complexities of the topic.

