@jeremybmerrill
Last active January 3, 2016 17:29
Script to output the four tables from page 1 and page 3 of an NYC EDC report using Tabula.
require 'tabula'
require 'fileutils'

folder_name = "EDC"             # folder containing the input PDFs
output_folder_name = "EDCcsvs"  # folder the CSVs are written to
#########################################################################
#########################################################################
FileUtils.mkdir_p(output_folder_name + "/")
pdf_file_paths = Dir.glob(folder_name + "/*.pdf")
pdf_file_paths.each do |pdf_file_path|
  # File name template; PAGE and TYPE are placeholders filled in per table.
  outfilename = File.join(output_folder_name, File.basename(pdf_file_path) + ".PAGE.TYPE.csv")
  # Only pages 1 and 3 are extracted.
  extractor = Tabula::Extraction::ObjectExtractor.new(pdf_file_path, [1, 3] ) #:all ) # 1..2643
  extractor.extract.each do |pdf_page| #(:line_color_filter => color )
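    # One CSV per table: the file is opened only once the table type is known,
    # and the page number is converted to a String before substitution.
    if pdf_page.number == 1
      # Ignore small spreadsheets (fewer than 10 cells); the first remaining
      # one is treated as employment, the next as unemployment.
      pdf_page.spreadsheets.reject{|spr| spr.cells.size < 10 }.each_with_index do |spreadsheet, index|
        type = index == 0 ? "employment" : "unemployment"
        File.open(outfilename.gsub("PAGE", pdf_page.number.to_s).gsub("TYPE", type), 'w') do |out|
          out << spreadsheet.to_csv
        end
      end
    else
      areas = [
        ["office_vacancy_rates", [200, 47, 460, 331]], #crime complaints
        ["construction_starts" , [533, 47, 653, 331]] #historical perspective
      ]
      areas.each do |type, area|
        File.open(outfilename.gsub("PAGE", pdf_page.number.to_s).gsub("TYPE", type), 'w') do |out|
          pdf_page.get_area(area).spreadsheets.each do |spreadsheet|
            spreadsheet.fill_in_cells!
            out << spreadsheet.to_csv
            out << "\n\n"
          end
        end
      end
    end
  end
end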
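
The output file names come from a template in which PAGE and TYPE are placeholders. The sketch below shows that substitution in isolation, assuming an integer page number. The helper name `output_path` and the file name `report.pdf` are hypothetical and do not appear in the script above.

```ruby
# Hypothetical helper (not part of the original script): builds the CSV path
# for one table from the PDF path, the page number, and the table type.
def output_path(output_folder, pdf_path, page_number, type)
  template = File.join(output_folder, File.basename(pdf_path) + ".PAGE.TYPE.csv")
  # String#gsub needs a String replacement, so the page number is converted first.
  template.gsub("PAGE", page_number.to_s).gsub("TYPE", type)
end

puts output_path("EDCcsvs", "EDC/report.pdf", 1, "employment")
# prints "EDCcsvs/report.pdf.1.employment.csv"
```

Because the substitution is a plain gsub over the whole path, a PDF whose name itself contains "PAGE" or "TYPE" would be altered as well.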