@JakeRuss
Created December 17, 2018 16:09
table-izing the output from pdftools 2.0
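library(pdftools)
library(tidyverse)
library(janitor)

pdf_file <- "https://github.com/ropensci/tabulizer/raw/master/inst/examples/data.pdf"

# pdf_data() returns one data frame per page, with one row per text box
# and its x/y position on the page; keep the first page only.
df <- pdf_data(pdf_file)[[1]]

# Table-ize ----

# The header text sits on the row at y = 126; prepend "y" as the join key.
headers <- df %>%
  filter(y == 126) %>%
  pull(text) %>%
  c("y", .)

# Car names lie left of x = 253 and can span several words,
# so paste the words of each row (same y) back together.
car_names <- df %>%
  filter(x < 253) %>%
  group_by(y) %>%
  summarise(car = str_c(text, collapse = " "))

final <- df %>%
  filter(x >= 253, y > 126) %>%
  select(y, x, text) %>%
  spread(x, text) %>%            # one column per distinct x position
  filter(!is.na(`254`)) %>%      # keep rows with a value at x = 254
  remove_empty("cols") %>%
  # Two columns are split across two x positions (308/313 and 342/347); merge them.
  mutate(`308` = coalesce(`308`, `313`),
         `342` = coalesce(`342`, `347`)) %>%
  select(-`313`, -`347`) %>%
  set_names(headers) %>%
  mutate_at(.vars = vars(mpg:carb), parse_number) %>%
  left_join(x = car_names,
            y = .,
            by = "y") %>%
  select(-y)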
@JakeRuss
Author

I have this working now via the Jenks natural breaks optimization technique.
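
The comment above does not show the actual implementation, so the following is only a minimal sketch of how natural breaks could replace the hard-coded `coalesce()` calls. It assumes the idea is to group the x positions of the text boxes into k classes, one per table column. The function name `jenks_classes` and the toy x values are hypothetical, and the dynamic program is a plain base R version of Fisher-Jenks, not necessarily what the author used.

```r
# Hypothetical helper: assign each value to one of k natural-breaks classes
# (Fisher-Jenks, minimising the within-class sum of squared deviations).
# Assumes k is no larger than the number of distinct values.
jenks_classes <- function(values, k) {
  sorted <- sort(values)
  n <- length(sorted)
  stopifnot(k >= 1, k <= n)

  # Prefix sums give the squared deviation of sorted[i..j] in constant time.
  s1 <- c(0, cumsum(sorted))
  s2 <- c(0, cumsum(sorted^2))
  ssd <- function(i, j) {
    total <- s1[j + 1] - s1[i]
    (s2[j + 1] - s2[i]) - total^2 / (j - i + 1)
  }

  # cost[j, cls]: best cost for the first j sorted values split into cls classes.
  # start[j, cls]: index at which the last of those classes begins.
  cost <- matrix(Inf, nrow = n, ncol = k)
  start <- matrix(1L, nrow = n, ncol = k)
  for (j in seq_len(n)) cost[j, 1] <- ssd(1, j)
  for (cls in seq_len(k)[-1]) {
    for (j in cls:n) {
      for (i in cls:j) {
        candidate <- cost[i - 1, cls - 1] + ssd(i, j)
        if (candidate < cost[j, cls]) {
          cost[j, cls] <- candidate
          start[j, cls] <- i
        }
      }
    }
  }

  # Walk backwards to recover the upper bound of every class.
  upper <- numeric(k)
  j <- n
  for (cls in k:1) {
    upper[cls] <- sorted[j]
    j <- start[j, cls] - 1
  }

  # Label each value with the first class whose upper bound is >= the value.
  findInterval(values, upper, left.open = TRUE) + 1L
}

# Toy x positions (made up for illustration): 308/313 and 342/347 should merge.
x_positions <- c(254, 308, 342, 254, 313, 347, 254, 308, 342)
column_id <- jenks_classes(x_positions, k = 3)
```

With class labels like `column_id` in place of the raw `x` values, the `spread()` step would produce one column per class, so the manual merging of neighbouring x positions would no longer be needed. The number of classes `k` still has to be chosen, for example from the number of headers.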
