@tiffehr
Last active October 20, 2020 22:15
Tableau PDF example (covid-19 data acquisition team)

Tableau's generated PDF challenges

For many Tableau dashboards, we generate PDFs of specific tabular layouts, e.g. Idaho's Public Health District #5.

An example dashboard table

PHD5's table

Its PDF generation

generated PDF

The PDF contents

The Case total.pdf itself looks fine: PDF contents

But when parsing the PDF's digital text layer with PDF Parse for Node, we see:

Total Cases by County
Fields without a number equal zero. Total cases in
Twin Falls county include outbreak in county jail.
County Confirmed Probable
Twin Falls
Cassia
Minidoka
Jerome
Blaine
Gooding
Lincoln
Camas 7
24
83
40
126
94
98
388
32
111
369
773
851
985
1,078
2,970

Significant Parsing Challenges

The column order "snakes" around, with no clear repeated order within the visible columns or rows. There are no clean row entries, like Twin Falls 2,970 388\r\n. We can put some assumptions in place to untangle the headers, the county names, and the column data, but that approach is very fragile compared to parsing a more standard PDF structure order.
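To make the fragility concrete, here is a minimal Python sketch of the kind of assumptions involved. It works on the extracted text only, so it applies equally to output from PDF Parse for Node. The function name `unravel_snaked_table` is hypothetical. The ordering rule it hard-codes (numbers arrive last column first, bottom row first) is an assumption inferred from the one row confirmed above, Twin Falls 2,970 388, and has not been checked against the dashboard for the other counties.

```python
import re

RAW = """Total Cases by County
Fields without a number equal zero. Total cases in
Twin Falls county include outbreak in county jail.
County Confirmed Probable
Twin Falls
Cassia
Minidoka
Jerome
Blaine
Gooding
Lincoln
Camas 7
24
83
40
126
94
98
388
32
111
369
773
851
985
1,078
2,970
"""


def unravel_snaked_table(text, columns=("Confirmed", "Probable")):
    """Rebuild rows from the snaked text (hypothetical helper).

    Assumptions, all unverified beyond the Twin Falls row:
    - the header line reads "County <col> <col> ...";
    - county names follow in top-to-bottom order;
    - numbers follow last column first, bottom row first.
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    header = "County " + " ".join(columns)
    body = lines[lines.index(header) + 1:]

    names, numbers = [], []
    for ln in body:
        head, _, tail = ln.rpartition(" ")
        if re.fullmatch(r"\d[\d,]*", tail):
            # A trailing number, possibly glued to a name ("Camas 7").
            numbers.append(int(tail.replace(",", "")))
            if head:
                names.append(head)
        else:
            names.append(ln)

    # A blank cell (which the dashboard treats as zero) breaks this count,
    # so fail loudly instead of silently shifting every value.
    if len(numbers) != len(names) * len(columns):
        raise ValueError("cell count does not match rows x columns")

    ordered = numbers[::-1]  # now first column top-to-bottom, then the next
    n = len(names)
    return {
        name: {col: ordered[c * n + r] for c, col in enumerate(columns)}
        for r, name in enumerate(names)
    }


table = unravel_snaked_table(RAW)
print(table["Twin Falls"])
```

Note that a single empty cell makes the count check fail, which is the realistic case here since the dashboard leaves zero-valued fields blank. That is why this approach cannot be trusted without a human check.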

More Typical PDF contents

Here is a PDF from the Green River Health Department, Kentucky, which also includes a data table.

Green River table

The PDF contents

Parsing the PDF's digital text layer, we get:

GRDHD COVID-19 Case Summary as of 9:00 AM October 20, 2020
County Confirmed
Cases
Recovered
Cases
Current
Hospitalizations
Ever
Hospitalized
Deaths
Daviess 1,785 1,580 13 127 27
Hancock 112 87 1 7 1
Henderson 1,200 873 12 94 24
McLean 150 103 3 15 3
Ohio 584 494 4 40 9
Union 481 395 3 34 6
Webster 282 228 1 21 5
Total 4,594 3,760 37 338 75

Fewer parsing challenges

With normalized spacing in our PDF-reading code, this is much closer to a semi-reliable extraction. Once we hard-code some (validated) assumptions, we can safely separate and extract figures for our data-collection workflow without asking humans to hand-enter each figure.
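As a rough illustration of what such hard-coded assumptions might look like, the Python sketch below splits each line on whitespace, treats any line ending in five numeric tokens as a data row, and cross-checks the Total row against the county sums. The names `FIELDS`, `parse_rows`, and `totals_mismatch` are hypothetical labels chosen for this example, not part of our actual workflow code.

```python
import re

RAW = """GRDHD COVID-19 Case Summary as of 9:00 AM October 20, 2020
County Confirmed
Cases
Recovered
Cases
Current
Hospitalizations
Ever
Hospitalized
Deaths
Daviess 1,785 1,580 13 127 27
Hancock 112 87 1 7 1
Henderson 1,200 873 12 94 24
McLean 150 103 3 15 3
Ohio 584 494 4 40 9
Union 481 395 3 34 6
Webster 282 228 1 21 5
Total 4,594 3,760 37 338 75
"""

# Hypothetical field labels, taken from the header lines in reading order.
FIELDS = (
    "confirmed",
    "recovered",
    "current_hospitalizations",
    "ever_hospitalized",
    "deaths",
)
NUMBER = re.compile(r"\d[\d,]*")


def parse_rows(text):
    """Return {row label: {field: int}} for lines ending in five numbers."""
    rows = {}
    for line in text.splitlines():
        tokens = line.split()  # also normalizes runs of spaces
        if len(tokens) <= len(FIELDS):
            continue
        label, values = tokens[:-len(FIELDS)], tokens[-len(FIELDS):]
        if not all(NUMBER.fullmatch(v) for v in values):
            continue  # e.g. the title line, which ends in a date
        rows[" ".join(label)] = {
            field: int(v.replace(",", "")) for field, v in zip(FIELDS, values)
        }
    return rows


def totals_mismatch(rows, total_label="Total"):
    """List the fields whose county sum differs from the Total row."""
    counties = {k: v for k, v in rows.items() if k != total_label}
    total = rows[total_label]
    return [
        f for f in FIELDS
        if sum(r[f] for r in counties.values()) != total[f]
    ]


rows = parse_rows(RAW)
print(rows["Daviess"], totals_mismatch(rows))
```

The totals check is one cheap way to validate the assumptions on every run: if the layout changes and columns shift, the sums are unlikely to still match.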
