Scrape a folder of NYPD CompStat PDFs to CSVs, one CSV per PDF and table region.
require 'tabula'
require 'fileutils'

folder_name = "compstat"               # folder containing the source PDFs
output_folder_name = "compstat_csvs"   # folder the CSVs are written to

# Regions to extract from each page, as [top, left, bottom, right] in PDF points.
areas = [
  ["crime_complaints",       [200, 0, 460, 1000]],
  ["historical_perspective", [500, 0, 650, 1000]]
]

FileUtils.mkdir_p(output_folder_name)

Dir.glob(File.join(folder_name, "*.pdf")).each do |pdf_file_path|
  extractor = Tabula::Extraction::ObjectExtractor.new(pdf_file_path, :all)
  extractor.extract.each_with_index do |pdf_page, page_index|
    areas.each do |type, area|
      # One CSV per PDF and area: <output folder>/<pdf file name>.<area name>.csv
      outfilename = File.join(output_folder_name, "#{File.basename(pdf_file_path)}.#{type}.csv")
      # Truncate on the first page and append on later pages, so a multi-page
      # PDF does not overwrite its own earlier pages.
      File.open(outfilename, page_index.zero? ? 'w' : 'a') do |out|
        out << pdf_page.get_area(area).get_table.to_csv
      end
    end
  end
end
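
The file-naming and per-page writing steps can be tried without Tabula installed. The sketch below uses only the Ruby standard library and stands in plain strings for the CSV text that `table.to_csv` would return. The helper names `csv_path_for` and `write_page_csv` and the file name `example.pdf` are assumptions made for illustration; they are not part of the script or of Tabula. It shows one way to keep rows from every page: open the file in `'w'` mode for the first page only and in `'a'` mode afterwards, since opening in `'w'` for each page would leave just the last page's rows.

```ruby
require 'fileutils'
require 'tmpdir'

# Hypothetical helper: CSV path for one PDF and one named area,
# following the "<pdf file name>.<area name>.csv" pattern.
def csv_path_for(output_dir, pdf_path, type)
  File.join(output_dir, "#{File.basename(pdf_path)}.#{type}.csv")
end

# Hypothetical helper: write one page's CSV text. The first page truncates
# the file; later pages append to it.
def write_page_csv(path, csv_text, page_index)
  File.open(path, page_index.zero? ? 'w' : 'a') { |out| out << csv_text }
end

Dir.mktmpdir do |dir|
  path = csv_path_for(dir, "compstat/example.pdf", "crime_complaints")
  # Two made-up "pages" of CSV text standing in for real table output.
  ["a,b\n", "c,d\n"].each_with_index do |csv_text, page_index|
    write_page_csv(path, csv_text, page_index)
  end
  puts File.basename(path)
  puts File.read(path)
end
```

Truncating on the first page keeps re-runs idempotent: running the script twice over the same folder yields the same files rather than doubled rows.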