@JakeRuss
Created December 17, 2018 16:09
table-izing the output from pdftools 2.0
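library(pdftools)   # pdf_data()
library(tidyverse)
library(janitor)    # remove_empty()

pdf_file <- "https://github.com/ropensci/tabulizer/raw/master/inst/examples/data.pdf"

# One row per word on page 1, with its x/y position on the page
df <- pdf_data(pdf_file)[[1]]

# Table-ize ----

# Column headers sit on the line at y = 126; "y" is prepended as the row key
headers <- df %>%
  filter(y == 126) %>%
  pull(text) %>%
  c("y", .)

# Car names are the words left of x = 253, pasted together per line
car_names <- df %>%
  filter(x < 253) %>%
  group_by(y) %>%
  summarise(car = str_c(text, collapse = " "))

final <- df %>%
  filter(x >= 253, y > 126) %>%
  select(y, x, text) %>%
  spread(x, text) %>%
  # Keep only lines that have a value in the first data column (x = 254)
  filter(!is.na(`254`)) %>%
  remove_empty("cols") %>%
  # Two columns show up at two x positions each; merge each pair
  mutate(`308` = coalesce(`308`, `313`),
         `342` = coalesce(`342`, `347`)) %>%
  select(-`313`, -`347`) %>%
  set_names(headers) %>%
  mutate_at(.vars = vars(mpg:carb), parse_number) %>%
  left_join(x  = car_names,
            y  = .,
            by = "y") %>%
  select(-y)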
@JakeRuss
Author

JakeRuss commented Dec 17, 2018

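purrr and the development version of dplyr 0.8 (for group_split()) get me closer to a generic solution. Splitting the data by row and dropping the empty columns within each row eliminates the coalesce() calls, and the whole thing could be wrapped into a function.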

final <- df %>%
  filter(x >= 253, y > 126, y < 695) %>%
  select(y, x, text) %>%
  spread(x, text) %>%
  group_by(y) %>%
  group_split() %>%
  map_df(~ .x %>% remove_empty("cols") %>% set_names(., nm = headers)) %>%
  mutate_at(.vars = vars(mpg:carb), parse_number) %>%
  left_join(x  = car_names, 
            y  = ., 
            by = "y") %>%
  select(-y)

@JakeRuss
Author

One problem is going to be missing cells in the table. I think complete() might help here, but then how do I make sure the NAs it adds for genuinely empty cells are kept, while the NAs that spread() creates (because one column's text sits at several x positions) are dropped?
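A minimal sketch of one way this could work, assuming each word has already been assigned a row number and a column label (the `row`, `col`, and `text` names and the toy values here are made up for illustration). If complete() runs while the data are still long, the only NAs it adds are for genuinely empty cells, and spread() has nothing spurious left to fill in.

```r
library(dplyr)
library(tidyr)

# Toy long-format cells: row 2 has no value in column "b" (a truly empty cell).
cells <- tibble(
  row  = c(1, 1, 2),
  col  = c("a", "b", "a"),
  text = c("1", "2", "3")
)

# complete() adds the missing row/col combination as an explicit NA
# before any spreading happens.
full <- cells %>%
  complete(row, col)

# Every row/col pair now exists, so spread() only reshapes.
wide <- full %>%
  spread(col, text)
```

This only helps once the raw x positions have been mapped to column labels; with raw x values as keys, complete() would add NAs for every x position.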

@JakeRuss
Author

In case anyone reads this, inferring the column positions is my current struggle (i.e., deciding that these N x positions all belong to column 1). Once I can reliably assign row and column numbers, I can group_by() + complete() and I'll have the whole rectangle.
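As a rough illustration of the problem (not the approach the gist ends up using), one simple heuristic is to sort the distinct x positions and start a new column wherever the gap to the previous position is large. The `assign_columns()` helper and its `max_gap` threshold are hypothetical, and the threshold would need tuning per PDF.

```r
# Hypothetical helper: start a new column whenever the gap between
# consecutive sorted x positions exceeds `max_gap` (an assumed threshold).
assign_columns <- function(x, max_gap = 10) {
  ux <- sort(unique(x))
  # TRUE marks the first position of each column; cumsum() numbers the columns.
  col_of_ux <- cumsum(c(TRUE, diff(ux) > max_gap))
  col_of_ux[match(x, ux)]
}

# x positions taken from the script above: 308/313 and 342/347 are pairs.
x_pos   <- c(254, 308, 313, 342, 347, 254, 313)
col_idx <- assign_columns(x_pos)
col_idx
# 1 2 2 3 3 1 2
```

A fixed gap breaks down when columns are closer together than the jitter within a column, which is presumably why a clustering method is more robust.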

@JakeRuss
Author

I have this working now via the Jenks natural breaks optimization technique.
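The gist does not show the final code or say which Jenks implementation was used, so the following is only a sketch of the idea in base R. It splits the distinct x positions into k contiguous classes that minimise the within-class sum of squared deviations (the Jenks criterion), using a small dynamic program. The `jenks_classes()` name is hypothetical, and k is assumed to be known, for example from the number of header columns.

```r
# Hypothetical helper: label each x position with one of k classes so that
# the within-class sum of squared deviations over the distinct positions
# is minimal (the criterion behind Jenks natural breaks).
jenks_classes <- function(x, k) {
  v <- sort(unique(x))
  n <- length(v)
  stopifnot(k >= 1, k <= n)

  # Sum of squared deviations of v[i..j], computed from prefix sums.
  s1 <- c(0, cumsum(v))
  s2 <- c(0, cumsum(v^2))
  ssd <- function(i, j) {
    (s2[j + 1] - s2[i]) - (s1[j + 1] - s1[i])^2 / (j - i + 1)
  }

  # cost[cl, j]:  best cost of splitting v[1..j] into cl classes.
  # start[cl, j]: index at which the last of those classes begins.
  cost  <- matrix(Inf, k, n)
  start <- matrix(1L, k, n)
  for (j in seq_len(n)) cost[1, j] <- ssd(1, j)
  for (cl in seq_len(k)[-1]) {
    for (j in cl:n) {
      for (i in cl:j) {
        cand <- cost[cl - 1, i - 1] + ssd(i, j)
        if (cand < cost[cl, j]) {
          cost[cl, j]  <- cand
          start[cl, j] <- i
        }
      }
    }
  }

  # Walk back through `start` to label each distinct value with its class.
  cls <- integer(n)
  j <- n
  for (cl in k:1) {
    i <- start[cl, j]
    cls[i:j] <- cl
    j <- i - 1
  }
  cls[match(x, v)]
}

x_positions <- c(254, 308, 313, 342, 347, 254, 313)
jenks_cols  <- jenks_classes(x_positions, k = 3)
jenks_cols
# 1 2 2 3 3 1 2
```

With the class labels in place of the raw x values, the 308/313 and 342/347 pairs collapse into single columns without any hand-written coalesce() calls.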
