@jeremybmerrill
Last active October 2, 2023 20:49
scrape a folder of NYPD CompStat PDFs to CSVs.
require 'tabula'
require 'fileutils'

folder_name = "compstat"
output_folder_name = "compstat_csvs"

# Page regions to extract, keyed by the type name used in the output filename.
areas = [
  ["crime_complaints", [200, 0, 460, 1000]],
  ["historical_perspective", [500, 0, 650, 1000]]
]
#########################################################################
#########################################################################
FileUtils.mkdir_p(output_folder_name + "/")
pdf_file_paths = Dir.glob(folder_name + "/*.pdf")

pdf_file_paths.each do |pdf_file_path|
  # Open one CSV per area type for this PDF. Opening inside the page loop
  # with 'w' would truncate the file on every page and keep only the last one.
  outs = {}
  areas.each do |type, _area|
    outfilename = File.join(output_folder_name, "#{File.basename(pdf_file_path)}.#{type}.csv")
    outs[type] = File.open(outfilename, 'w')
  end

  # :all processes every page (a page range such as 1..2643 can be passed instead).
  extractor = Tabula::Extraction::ObjectExtractor.new(pdf_file_path, :all)
  begin
    extractor.extract.each do |pdf_page|
      areas.each do |type, area|
        table = pdf_page.get_area(area).get_table
        outs[type] << table.to_csv
      end
    end
  ensure
    # Close the output files even if extraction fails partway through.
    outs.each_value(&:close)
  end
end
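
To check which CSVs the script would write before running a long extraction, the output naming scheme can be sketched on its own. This is a minimal sketch using only the Ruby standard library. The helper name `output_paths_for` and the example PDF name `cs001pct.pdf` are hypothetical and are not part of the gist or of Tabula.

```ruby
# Hypothetical helper (not part of the gist): lists the CSV paths the script
# would write for one PDF, one path per extracted area type.
def output_paths_for(pdf_file_path, output_folder_name, types)
  base = File.basename(pdf_file_path) # keeps the ".pdf" suffix, as the script does
  types.map { |type| File.join(output_folder_name, "#{base}.#{type}.csv") }
end

types = ["crime_complaints", "historical_perspective"]
# "compstat/cs001pct.pdf" is an assumed example filename.
paths = output_paths_for("compstat/cs001pct.pdf", "compstat_csvs", types)
puts paths
```

Each PDF therefore produces two files, one per area, so the two tables on a page never end up mixed in the same CSV.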