Jeremy B. Merrill jeremybmerrill
jeremybmerrill / PhotostreamJob.rb
Created August 10, 2012 22:39
PhotostreamJob demonstrates inheritance as a way of dynamically assigning a Resque job to a queue
class PhotostreamJob < BaseJob
  @queue = :photostreamphotos
end
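The idea in this gist can be sketched without Resque installed. The block below is a minimal illustration, assuming the queue name is read from the `@queue` class-level instance variable of each job class. `ThumbnailJob` and the `queue_for` helper are hypothetical names added for the example, and the `BaseJob` body is a stand-in rather than the author's implementation.

```ruby
# Stand-in for the gist's BaseJob: the shared work is defined once here.
class BaseJob
  def self.perform(*args)
    "#{name} performed with #{args.inspect}"
  end
end

# Each subclass only declares which queue it belongs to.
class PhotostreamJob < BaseJob
  @queue = :photostreamphotos
end

# Hypothetical second job, added only to show a different queue.
class ThumbnailJob < BaseJob
  @queue = :thumbnails
end

# Hypothetical helper: reads the queue name off the job class itself.
# Class-level instance variables are not inherited, so every subclass
# carries its own @queue while sharing BaseJob.perform.
def queue_for(job_class)
  job_class.instance_variable_get(:@queue)
end

puts queue_for(PhotostreamJob)
puts queue_for(ThumbnailJob)
puts PhotostreamJob.perform(42)
```

Because `@queue` belongs to each class object, `BaseJob` itself has no queue in this sketch; only the subclasses are meant to be enqueued.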
jeremybmerrill / Resque.rake
Created August 10, 2012 22:40
Sample resque.rake for workers on multiple servers
require 'yaml'
require 'resque/tasks'
require 'resque_scheduler'
require 'resque_scheduler/tasks'
require 'resque_scheduler/server'
rails_root = ENV['RAILS_ROOT'] || File.dirname(__FILE__) + '/../..'
rails_env = ENV['RAILS_ENV'] || 'development'
# config/resque.yml contains Redis's location on the network for each Rails environment
resque_config = YAML.load_file(rails_root + '/config/resque.yml')
Resque.redis = resque_config[rails_env]
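The rake file expects `config/resque.yml` to map each Rails environment to a Redis location. The gist does not show that file, so the sketch below assumes a simple `host:port` string per environment. The hostnames, the `SAMPLE_RESQUE_YML` constant, and the `redis_location` helper are all hypothetical.

```ruby
require 'yaml'

# Hypothetical contents of config/resque.yml: one "host:port" per environment.
SAMPLE_RESQUE_YML = <<~YAML
  development: localhost:6379
  production: redis.example.internal:6379
YAML

# Hypothetical helper: look up the Redis location for one environment and
# fail loudly if that environment is missing from the file.
def redis_location(yaml_text, rails_env)
  config = YAML.safe_load(yaml_text)
  config.fetch(rails_env) do
    raise KeyError, "no Redis location configured for #{rails_env}"
  end
end

puts redis_location(SAMPLE_RESQUE_YML, 'development')
puts redis_location(SAMPLE_RESQUE_YML, 'production')
```

With a file shaped like this, workers on different servers can share one rake task and differ only in `RAILS_ENV`.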
jeremybmerrill / BaseJob.rb
Created August 10, 2012 22:41
BaseJob: the base class inherited by jobs that are assigned to different queues
class BaseJob
  def self.perform
    # Do the job.
  end
end
jeremybmerrill / astrazeneca.rb
Last active December 20, 2015 17:19
How to scrape AstraZeneca's ASP.net disclosure page with Upton
require 'upton'
class AstraZenecaScraper < Upton::Scraper
  ROWS_PER_PAGE = 50
  def initialize(index_url_array, site_meta)
    @sleep_time_between_requests = 15
    @site_meta = site_meta
    @total_pages = @site_meta[:total_pages]
    @az_time_period_identifier = @site_meta[:az_time_period_identifier]
jeremybmerrill / count_scraper.rb
Created September 8, 2013 23:37
Scrape the Los Angeles Review of Books for contributors and the authors of reviewed books, then classify those by gender by pronouns in their biographies (or statistical probability, if it's clear)
require 'upton'
require 'date'
require 'guess'
GLOBAL_VERBOSE = true
# - any lowercased pronoun is okay
# - capitalized pronouns are okay unless they're in a book title, which is a series of capitalized words;
# that is, capitalized pronouns are okay if there are zero alphabetic characters between them and a sentence-final punct
FEMALE_REGEXES = [/ she[\.,\s!?\' ]/, / her[\.,\s!?\' ]/,
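The preview above is cut off, so the following is only a rough sketch of pronoun-based classification, not the gist's actual logic. It assumes a simplified rule: count whole-word pronouns case-insensitively and pick the majority. The constants `FEMALE_PRONOUNS` and `MALE_PRONOUNS` and the method `classify_by_pronouns` are hypothetical, and the sketch deliberately ignores the book-title caveat described in the comments.

```ruby
# Simplified, hypothetical patterns: whole-word pronouns, case-insensitive.
FEMALE_PRONOUNS = /\b(?:she|her|hers|herself)\b/i
MALE_PRONOUNS   = /\b(?:he|him|his|himself)\b/i

# Count pronouns of each kind in a biography and return the majority,
# or :unknown when the counts tie (including when there are none).
def classify_by_pronouns(bio)
  female = bio.scan(FEMALE_PRONOUNS).size
  male   = bio.scan(MALE_PRONOUNS).size
  return :unknown if female == male

  female > male ? :female : :male
end

puts classify_by_pronouns("She teaches in Ohio. Her first novel appeared in 2010.")
puts classify_by_pronouns("He lives in Los Angeles with his dog.")
puts classify_by_pronouns("A poet and critic.")
```

A biography with no pronouns falls through to `:unknown`, which is where a name-based statistical fallback would take over.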
jeremybmerrill / gender.rb
Last active December 24, 2015 22:59
First pass at a Ruby version of global name data
require 'csv'
require 'set'
class Gender
  def initialize(options={})
    countries = Set.new([:us, :uk])
    @threshold = options[:threshold] || 0.99
    @names_counts = {}
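The rest of the class is not shown in the preview. As a hedged illustration of how a threshold such as `0.99` might be applied to name counts, the sketch below assumes a name is labeled only when one gender's share of its recorded uses meets the threshold. The `NAME_COUNTS` data, the `guess_gender` method, and the `>=` comparison are all assumptions made for this example.

```ruby
# Hypothetical counts of how often each name was recorded per gender.
NAME_COUNTS = {
  "mary"   => { female: 9_990, male: 10 },
  "james"  => { female: 20,    male: 9_980 },
  "jordan" => { female: 300,   male: 700 }
}.freeze

# Label a name only when one gender's share reaches the threshold;
# unknown or ambiguous names return :unknown.
def guess_gender(name, counts: NAME_COUNTS, threshold: 0.99)
  tally = counts[name.downcase]
  return :unknown if tally.nil?

  total = tally[:female] + tally[:male]
  return :unknown if total.zero?
  return :female if tally[:female].to_f / total >= threshold
  return :male   if tally[:male].to_f / total >= threshold

  :unknown
end

puts guess_gender("Mary")
puts guess_gender("James")
puts guess_gender("Jordan")
```

A high threshold trades coverage for precision: names like "Jordan" in this made-up data stay unclassified rather than being guessed.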
jeremybmerrill / tabula_basic.rb
Created January 18, 2014 05:12
A snippet to extract spreadsheet data from a PDF using Tabula's tabula-extractor
require 'tabula'
pdf_file_path = "czechmaybe.pdf"
outfilename = "czechmaybe.csv"
out = open(outfilename, 'w')
extractor = Tabula::Extraction::ObjectExtractor.new(pdf_file_path, [5] ) #:all ) # 1..2643
extractor.extract.each do |pdf_page|
  pdf_page.spreadsheets.each do |spreadsheet|
jeremybmerrill / compstat.rb
Last active October 2, 2023 20:49
Scrape a folder of NYPD CompStat PDFs into CSVs.
require 'tabula'
require 'fileutils'
folder_name = "compstat"
output_folder_name = "compstat_csvs"
#########################################################################
#########################################################################
FileUtils.mkdir_p(output_folder_name + "/")
jeremybmerrill / edc.rb
Last active January 3, 2016 17:29
Script to output the four tables from page 1 and page 3 of an NYC EDC report using Tabula.
require 'tabula'
require 'fileutils'
folder_name = "EDC"
output_folder_name = "EDCcsvs"
#########################################################################
#########################################################################
FileUtils.mkdir_p(output_folder_name + "/")

Keybase proof

I hereby claim:

  • I am jeremybmerrill on github.
  • I am jeremybmerrill (https://keybase.io/jeremybmerrill) on keybase.
  • I have a public key whose fingerprint is 441A 05CC B462 AF95 45FA 95B5 CDF7 BBEF F5A7 B374

To claim this, I am signing this object: