
@pete-murphy
Last active March 22, 2023 23:26
Data Visualization Course

Scatterplot

Simple(st?) chart: just maps two values from the dataset to position to show their relationship (e.g. fuel economy (mpg) vs. power (hp))

If the values aren't continuous, it's probably not the chart to use (e.g. cylinder count, though a number, is not continuous)

Difference between bar chart and scatterplot

A bar chart's category axis can be sorted arbitrarily; a scatterplot's axes have to be continuous scales

Bar charts

Precision doesn't matter so much for data vis

Good for quickly comparing values by category

Bad practice: starting at non-zero for bar chart value axis misrepresents data (makes it hard to make comparison between bars)
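
A quick arithmetic sketch of why a non-zero baseline misleads: the eye compares bar lengths, so the drawn ratio between two bars changes once the axis stops starting at zero. The function name `drawnRatio` and the sample values are made up for illustration.

```javascript
// Ratio of the drawn bar lengths for two values, given where the value axis starts.
// With baseline 0 the drawn ratio equals the true ratio of the values.
function drawnRatio(a, b, baseline = 0) {
  return (a - baseline) / (b - baseline);
}

const trueRatio = drawnRatio(50, 40);          // 1.25: a is 25% larger than b
const truncatedRatio = drawnRatio(50, 40, 30); // 2: a now looks twice as large as b
```

Same data, but starting the axis at 30 turns a 25% difference into an apparent 100% difference.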

Pie charts

Unlike bar charts, need to show whole dataset because it represents parts-to-whole

Not good for comparison between the parts (use bar chart for that)

Pie chart & bar chart are similar in that they both have two mappings, to category and to value

Line chart

Generally a "continuous process" that changes over time

Almost always the horizontal axis is time

When can line charts go wrong?

  • compressing the aspect ratio can skew the representation of slope

Rule of thumb for line chart (not based on research, but common practice): "bank to 45 degrees"
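
One common way to turn "bank to 45 degrees" into a number is to pick the aspect ratio at which the median absolute slope of the line segments is drawn at 45 degrees. The sketch below assumes that heuristic; `bankedAspectRatio` is a hypothetical helper (not something from the course or from Plot), and it assumes the points are sorted by x, have distinct x values, and are not all flat.

```javascript
// Height/width ratio at which the median absolute segment slope is drawn at 45 degrees.
// points: array of [x, y] pairs sorted by x, with distinct x values.
function bankedAspectRatio(points) {
  const xs = points.map(([x]) => x);
  const ys = points.map(([, y]) => y);
  const xRange = Math.max(...xs) - Math.min(...xs);
  const yRange = Math.max(...ys) - Math.min(...ys);

  // Slope of each segment after rescaling both axes to the unit square.
  const slopes = [];
  for (let i = 1; i < points.length; i++) {
    const dx = (xs[i] - xs[i - 1]) / xRange;
    const dy = (ys[i] - ys[i - 1]) / yRange;
    slopes.push(Math.abs(dy / dx));
  }
  slopes.sort((a, b) => a - b);
  const mid = Math.floor(slopes.length / 2);
  const median = slopes.length % 2 ? slopes[mid] : (slopes[mid - 1] + slopes[mid]) / 2;

  // On screen, slope = unit-square slope * (height / width); solve for slope = 1.
  return 1 / median;
}

// Mostly shallow segments with one steep jump: a taller chart keeps the shallow part readable.
const ratio = bankedAspectRatio([[0, 0], [1, 1], [2, 2], [3, 12]]);
```

A result above 1 suggests a chart taller than it is wide; below 1, wider than tall.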

Other

Instructor answered question about "radar charts": "haven't seen a good use case for radar chart" :)

Bad practices are collected in "How to lie with statistics"

Charts are not meant to be used for high-precision, just to get a rough sense of what the numbers are

Using retinal variables to encode data: e.g. simple scatterplot, one variable would be position

Retinal variables also called visual encodings

... discussion of JavaScript, how Observable works, and Plot API ...

Marks

Marks are specified by function calls. For example, Plot.dot:

Plot.plot({
  marks: [
//                   👇 This is mapping cars data to the position retinal variable
    Plot.dot(cars, { x: "power (hp)", y: "economy (mpg)" })
  ]
})

Plot & vega-lite (and similar libraries) share terminology (such as "marks") from common ancestry in:

  • Grammar of Graphics by Leland Wilkinson (1999)
  • Semiology of Graphics by Jacques Bertin (1967)
Plot.plot({
  marks: [
//                                                       👇 This is a constant color
    Plot.dot(cars, { x: "power (hp)", y: "economy (mpg)", fill: "steelblue" })
  ]
})

Plot.plot({
  marks: [
//                                                       👇 This is another encoding (color mapped to data)
    Plot.dot(cars, { x: "power (hp)", y: "economy (mpg)", fill: "cylinders" })
  ]
})

In general, it's not a good idea to map too many data variables to visual variables in one chart; it becomes impossible to read

Distinguishing between continuous and categorical scales

Plot.plot({
  marks: [
    Plot.dot(cars, { x: "power (hp)", y: "economy (mpg)", fill: "cylinders" })
  ],
  color: {
    // By default this will be a _continuous_ color scheme
    // (because Plot inspects the cylinders values and sees numbers)
    legend: true
  }
})

Plot.plot({
  marks: [
    Plot.dot(cars, { x: "power (hp)", y: "economy (mpg)", fill: "cylinders" })
  ],
  color: {
    legend: true,
    // This makes the categories of color scheme explicit
    // (and changes the color scheme, because no longer continuous)
    domain: [3, 4, 6, 8]
  }
})

Dot plot

Different from scatterplot: has one categorical axis (like car make) and, within each category, a continuous scale that the values in that category are mapped to

Looking at cars dataset, can see the breakdown of different "tiers" of performance within car models: for Pontiac, a couple at the high-end, three or so in mid-high-range, then rest clustered towards the lower end

Table view

Observable's table view is really nice! Gives DB-table view but with some smart histograms/filtering/sorting baked in.

Line chart

Different from dot: for any given date (x value) there can only be one (and should be exactly one) y value

Force a line chart to include 0 with Plot.ruleY([0]) mark

Data types (categorical vs continuous)

Important distinction between different types of data:

  • There's nothing between Kia & Honda, or nothing between letters "A" & "B", they're just names
  • can be unsorted, lots of ways they can be sorted, but they don't have an inherent ordering
  • this is categorical data

People tend to get this wrong (e.g., showing alphabet frequency as line chart, doesn't make sense because it's showing continuity as if there were values between individual letters)

One of the reasons radar charts are bad: the axes are categorical, so their order is arbitrary, but reordering them changes the shape of the chart

Distinction is not always clear: for example, when aggregating continuous data, you might want to show large bins as categorical (years as bars)

Session 3

Pop-out effect (pre-attentive vision): its importance is overstated in data vis

The thought is that a single color bar stands out and is processed before anything else

Motion is effective, and so is a distinctive color (single blue bar in an otherwise-gray bar chart)

Specifically useful for presentation, when pointing people to something

In early phases of data exploration don't want that (want to avoid any kind of "bias")

Early stages of data vis (exploration) should be boring

"Don't use presets because they often have colors" (this is true of a lot of chart libraries, Plot's defaults try to be boring)

Unusual charts (e.g. Sankey) are memorable, but take longer to process

Comparison

Not biasing, but helping to understand the chart: e.g., adding a 1/26 line for the alphabet frequency (those bars extending above are more frequent than if they were all equally probable) or adding reference line such as average among many lines in line chart
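
A small sketch of the 1/26 reference line, in plain JavaScript rather than Plot: compute each letter's relative frequency and check which bars would extend above the "all letters equally probable" line. The helper name `letterFrequencies` and the sample sentence are made up for illustration.

```javascript
// Relative frequency of each letter a-z in a text (other characters are ignored).
function letterFrequencies(text) {
  const counts = new Map();
  let total = 0;
  for (const ch of text.toLowerCase()) {
    if (ch >= "a" && ch <= "z") {
      counts.set(ch, (counts.get(ch) ?? 0) + 1);
      total++;
    }
  }
  return new Map([...counts].map(([letter, n]) => [letter, n / total]));
}

const uniform = 1 / 26; // height of the reference line
const freqs = letterFrequencies("the quick brown fox jumps over the lazy dog");

// Letters whose bars would extend above the reference line.
const aboveUniform = [...freqs]
  .filter(([, f]) => f > uniform)
  .map(([letter]) => letter)
  .sort();
```

The reference value does not change the data; it only gives each bar something to be compared against.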

For comparisons between spiky data, useful to smooth the values (makes higher-level pattern easier to parse)

Session 4: Transformations

Grouping

Binning & Histogram

Binning creates a histogram, shows distribution of data

If looking at a bunch of overlapping data, could use dodge which will push plotted points so they stack

dodge "pushes apart" plotted data points (dots)

Sometimes affected by quantization (or reveals quantization, you can see higher aggregations around specific points, like multiples of 10)

(Same as beeswarm chart? I think, but I missed what he said there)

Binning does something similar to dodge, but slices data into ranges (bins); can control the granularity/size of bins to reduce the spikiness/noise of raw data

Can also pass in explicit threshold values instead of number of bins
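
To make the two options concrete, here is a plain-JavaScript sketch of binning, once with a requested number of evenly sized bins and once with explicit thresholds. It only illustrates the idea; it is not how Plot implements its bin transform, and the helper names and mpg values are made up.

```javascript
// Count values per bin. `thresholds` are the bin edges in ascending order;
// a value v falls in bin i when thresholds[i] <= v < thresholds[i + 1].
function binCounts(values, thresholds) {
  const bins = [];
  for (let i = 0; i < thresholds.length - 1; i++) {
    bins.push({ x0: thresholds[i], x1: thresholds[i + 1], count: 0 });
  }
  for (const v of values) {
    const bin = bins.find((b) => v >= b.x0 && v < b.x1);
    if (bin) bin.count++; // values outside the edges are ignored
  }
  return bins;
}

// Evenly spaced edges for a requested number of bins over [min, max].
function evenThresholds(min, max, n) {
  return Array.from({ length: n + 1 }, (_, i) => min + ((max - min) * i) / n);
}

const mpg = [12, 15, 18, 18, 21, 24, 24, 25, 31, 38];
const coarse = binCounts(mpg, evenThresholds(10, 40, 3));  // 3 bins: edges 10, 20, 30, 40
const fine = binCounts(mpg, [10, 15, 20, 25, 30, 35, 40]); // explicit thresholds
```

Fewer, wider bins give a smoother but less detailed histogram; narrower bins show more of the raw spikiness.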

"Noise can be interesting (might point to errors in data gathering), but could just be noise"

Stacking

Stacking can be a bad idea.

(Mentions bad idea to have too many categories)

Comparison can be hard with stacked bar charts (bars aren't starting at same point on axis). People have a harder time with stacked bar charts than pie charts (comparison also difficult with pie charts).
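
A sketch of what stacking does to the numbers, with a hypothetical `stack` helper and made-up values: each series is drawn as an interval that starts where the previous one ended, so only the bottom series shares a baseline across columns.

```javascript
// Turn one column's per-series values into [y1, y2] intervals stacked on top of each other.
function stack(values) {
  let y = 0;
  return values.map((v) => {
    const interval = [y, y + v];
    y += v;
    return interval;
  });
}

// Three series (A, B, C) in two columns; B is 5 in both.
const column1 = stack([10, 5, 8]);
const column2 = stack([4, 5, 8]);
```

Series B has the same value in both columns, but it is drawn from 10 to 15 in the first and from 4 to 9 in the second, which is why equal segments are hard to recognize as equal.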

Stacked area: same issue, can read the area on the bottom pretty well, but once they're stacked it's hard to compare.

Hard to read the vertical width of each section, especially when at extreme angles, distorts the vertical width (which is actual value)

Other views (silhouette, wiggle) try to make the stack easier to interpret

Windowing

For continuous value, smooth by taking average over window (neighborhood of values)
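
A minimal sketch of that smoothing as a centered moving average. The name `movingAverage` is hypothetical, the window size `k` is assumed to be odd, and positions without a full window are left undefined rather than padded (other edge-handling choices are possible).

```javascript
// Centered moving average with an odd window size k.
// Positions without a full window are returned as undefined.
function movingAverage(values, k) {
  const half = Math.floor(k / 2);
  return values.map((_, i) => {
    if (i - half < 0 || i + half >= values.length) return undefined;
    let sum = 0;
    for (let j = i - half; j <= i + half; j++) sum += values[j];
    return sum / k;
  });
}

const spiky = [1, 9, 2, 8, 3, 7, 4];
const smoothed = movingAverage(spiky, 3);
```

A larger window smooths more aggressively, at the cost of hiding short-lived changes and losing more points at the edges.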

Session 5:

(I was very tired and didn't take many notes on this one)

Facets

Grouped bar charts give you something that is difficult in other representations:

  • comparison within a group
  • comparison across groups

Index chart

Stock prices, e.g.: can show every series starting from the same collapsed single point (0% change, i.e. 1× the initial value) and then plot multiples of the initial value

Shows change (not comparison of absolute value)

In Plot, you can use normalizeY for index chart
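
The arithmetic behind an index chart is just a division by each series' first value. The sketch below shows it in plain JavaScript with a hypothetical `indexToFirst` helper and made-up prices; it illustrates the idea rather than Plot's own implementation.

```javascript
// Express each value as a multiple of the series' first value.
function indexToFirst(values) {
  const basis = values[0];
  return values.map((v) => v / basis);
}

// Closing prices for two stocks at very different price levels.
const cheap = indexToFirst([10, 12, 15]);
const pricey = indexToFirst([200, 210, 300]);
```

Both series now start at 1 and end at 1.5, so their relative growth can be compared even though the absolute prices differ by a factor of 20.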
