
@dkapitan
Created January 30, 2021 10:27
comet-chart-flight-delays-post

Zan Armstrong's comet chart has been on my list of hobby projects for a while now. I think it is an elegant way to visualize statistical mix effects and address Simpson's paradox, and it is particularly useful when working with longitudinal data involving different sub-populations. Recently I found a good excuse to spend some time actually using it as part of an exploratory data analysis on a project.

Since I mostly work in Python and have recently fallen in love with Altair - for the same reasons as Fernando explains here - I wondered how the comet chart could be implemented using the grammar of interactive graphics. It took me a while to figure out how to actually plot the comets. In a previous version, I had drawn glyphs using Bokeh. While Altair allows you to plot any SVG path in a graph, this felt a bit hacky and not quite in line with the philosophy of using a grammar of graphics.

Thankfully Mattijn was quick to suggest using trail marks, after which it was almost as easy as pie. So here's an example using a dataset of 20,000 flights for 59 destination airports.

In the example shown here, each comet represents one destination airport. The head of the comet corresponds to the most recent observation of the number of flight arrivals (x-axis, shown as logarithmic scale to accommodate the wide range of observations) against the mean delay of those flights (y-axis). The tail of the comet represents a similar (x,y) datum, but from an earlier point in time. Finally, the colour of the comet is encoded to show the change in the mean delay for each airport. A tooltip with a summary of the data is shown when hovering over the head of the comet.

So-called mix effects can often lead to misinterpretation of aggregate numbers. In the example of flight delays, the fact that only a small change is observed in the mean delay across all airports - visualized with the right-most comet outlined in black - hides the underlying variance between airports. Note that in this example the size of each sub-population (number of flights per airport) remains relatively constant, hence the comets here only go up and down. As explained in the original article, mix effects become harder to interpret when the relative sizes of the sub-populations change as well as their relative values. In the most extreme case this may lead to Simpson's paradox.
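To make the extreme case concrete, here is a small worked example in plain Python. The two airports and all numbers are invented purely for illustration. The mean delay improves at both airports, yet the aggregate mean delay gets worse, because the share of flights shifts towards the airport with the longer delays.

```python
# Invented numbers: (flights, mean delay in minutes) per airport and period.
before = {"A": (900, 10.0), "B": (100, 30.0)}
after = {"A": (300, 9.0), "B": (700, 28.0)}


def aggregate_mean_delay(groups):
    """Flight-weighted mean delay across all airports."""
    total_flights = sum(n for n, _ in groups.values())
    total_delay = sum(n * delay for n, delay in groups.values())
    return total_delay / total_flights


agg_before = aggregate_mean_delay(before)
agg_after = aggregate_mean_delay(after)

# Every airport improves individually...
each_improves = all(after[k][1] < before[k][1] for k in before)

# ...but the aggregate worsens because the mix of flights changed.
print(each_improves, agg_before, agg_after)
```

In a comet chart, these two airports would show up as comets moving sideways as well as slightly down, while the aggregate comet moves up.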

With this base implementation of comet charts in Altair, you can really go to town and combine it with other interactive graphs. Using the overview-detail pattern, you could plot an accompanying density plot of all the flights for a given airport. That way you can quickly zoom in to the lowest level of detail and get a better understanding of the underlying mix effects.

For now, I will leave you with the Python code to make the plot.
