I recently came across a common problem: visualizing simple state-level data, captured in PDFs, as a choropleth. The data is from a well-researched report on housing data.
tabulapdf/tabula works quite well for extracting the data. Yeah! Even on a Windows machine!
Now that we have the data, let us create the state choropleth.
Making basic state-level choropleths is a breeze with the CRAN package choroplethr, available at arilamstein/choroplethr.
Let us create a static version of the map like the one available at Out Of Reach: National Low Income Housing Coalition.
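# Project directory (specific to the author's machine)
sDir <- "~/Dropbox/pandora/My-Projects/repos/hackery/"
setwd(sDir)
library(choroplethr)
library(choroplethrMaps)
# state.regions ships with choroplethrMaps and lists the region names the maps expect
data(state.regions)
head(state.regions)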
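For orientation, here is a minimal sketch of the call that draws the map. It uses `df_pop_state`, the example dataset bundled with choroplethr, as a stand-in for the housing data (an assumption, since the report's figures are not reproduced here). The title and legend strings are placeholders, and the arguments assume the choroplethr 3.x interface.

```r
library(choroplethr)
library(choroplethrMaps)

# Example data shipped with choroplethr; it has the two columns
# state_choropleth() looks for: region (lowercase state name) and value.
data(df_pop_state)
data(state.regions)

# Region names the map would not recognise; for the bundled data
# this is expected to be empty.
unmatched <- setdiff(df_pop_state$region, state.regions$region)
print(unmatched)

# Draw a static state-level choropleth (placeholder title and legend).
state_choropleth(df_pop_state,
                 title      = "Placeholder title",
                 legend     = "Placeholder legend",
                 num_colors = 7)
```

Running the same `setdiff()` check on your own data frame is a quick way to see which names need fixing before plotting.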
Now, the data in state.regions does not match exactly with the dataset we have at hand.
So, instead of manually correcting the data so that it matches, let us try an algorithmic approach.
The R packages that seem to be available to accomplish this task are:
- markvanderloo/stringdist. Well explained at Approximate text matching with the stringdist package
- R: String Metrics
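As a rough sketch of the approximate-matching idea, `amatch()` from stringdist returns, for each input string, the index of the closest entry in a lookup table. The misspelled names below are made up for illustration, `tolower(state.name)` from base R stands in for `state.regions$region`, and the `method` and `maxDist` values are assumptions you would tune for your own data.

```r
library(stringdist)

# Stand-in for state.regions$region: lowercase names of the 50 states
# (base R's state.name does not include the District of Columbia).
canonical <- tolower(state.name)

# Hypothetical misspellings of the kind a PDF extraction might produce.
messy <- c("alabma", "missisippi", "new yrok")

# For each messy name, find the closest canonical name within an edit
# distance of 2 (optimal string alignment); NA if nothing is close enough.
idx <- amatch(messy, canonical, method = "osa", maxDist = 2)
matched <- canonical[idx]

print(data.frame(messy, matched))
```

Names that come back as `NA` are worth checking by hand rather than loosening `maxDist` until everything matches.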
Replacing the data worked quite well with gsub, explained at http://biostat.mc.vanderbilt.edu/wiki/pub/Main/SvetlanaEdenRFiles/regExprTalk.pdf
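A small sketch of the kind of `gsub` clean-up this refers to, using only base R. The raw names are hypothetical examples of extraction noise, not values taken from the report, and the specific patterns are assumptions about what needs fixing.

```r
# Hypothetical state names as they might come out of a PDF table.
raw_names <- c(" Alabama ", "Dist. of Columbia", "New  York*", "north carolina1")

clean <- tolower(raw_names)
clean <- gsub("[^a-z. ]", "", clean)               # drop footnote markers such as * or digits
clean <- gsub("[[:space:]]+", " ", clean)          # collapse repeated whitespace
clean <- gsub("^ | $", "", clean)                  # trim leading and trailing space
clean <- gsub("^dist\\. of", "district of", clean) # expand a known abbreviation

print(clean)
```

The result uses the lowercase full names that `state.regions$region` uses, so it can be compared against that column directly.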
- Look into more coloring options here: ggplot2: axis manipulation and themes
- Choropleth in R: custom breaks and plotting - Geographic Information Systems Stack Exchange
- Report on Housing Data: nlihc.org/sites/default/files/oor/OOR_2015_FULL.pdf
- How to extract data from a PDF - #Interhacktives
- pdftables – a Python library for getting tables out of PDF files | ScraperWiki
- screen scraping - Extracting tables from PDF files programmatically? - Stack Overflow
- tabulapdf/tabula