Jeremy B. Merrill jeremybmerrill

@jeremybmerrill
jeremybmerrill / astrazeneca.rb
Last active December 20, 2015 17:19
How to scrape AstraZeneca's ASP.NET disclosure page with Upton
require 'upton'
class AstraZenecaScraper < Upton::Scraper
ROWS_PER_PAGE = 50
def initialize(index_url_array, site_meta)
@sleep_time_between_requests = 15
@site_meta = site_meta
@total_pages = @site_meta[:total_pages]
@az_time_period_identifier = @site_meta[:az_time_period_identifier]
jeremybmerrill / count_scraper.rb
Created September 8, 2013 23:37
Scrape the Los Angeles Review of Books for contributors and the authors of reviewed books, then classify them by gender using the pronouns in their biographies (or statistical probability, if it's clear)
require 'upton'
require 'date'
require 'guess'
GLOBAL_VERBOSE = true
# - any lowercased pronoun is okay
# - capitalized pronouns are okay unless they're in a book title, which is a series of capitalized words;
# that is, capitalized pronouns are okay if there are zero alphabetic characters between them and a sentence-final punct
FEMALE_REGEXES = [/ she[\.,\s!?\' ]/, / her[\.,\s!?\' ]/,
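The preview above is cut off mid-array, so the gist's full regex list is not shown here. As a rough sketch of the rule the comments describe (lowercase pronouns always count; capitalized pronouns count only when no letters sit between them and the preceding sentence-final punctuation), something like the following could work. The constant names, the pronoun lists, and `guess_gender_from_bio` are all hypothetical and are not taken from the gist, which also uses a different regex style.

```ruby
# Hypothetical sketch; none of these names come from the original gist.

# Lowercase pronouns are always counted.
LOWER_FEMALE = /\b(?:she|her|hers|herself)\b/
LOWER_MALE   = /\b(?:he|him|his|himself)\b/

# Capitalized pronouns are counted only at the start of the text or right
# after sentence-final punctuation with no letters in between, so that a
# capitalized pronoun inside a book title is ignored.
CAP_FEMALE = /(?:\A|[.!?])[^A-Za-z]*(?:She|Her)\b/
CAP_MALE   = /(?:\A|[.!?])[^A-Za-z]*(?:He|His)\b/

# Returns :female, :male, or :unknown based on simple pronoun counts.
def guess_gender_from_bio(bio)
  female = bio.scan(LOWER_FEMALE).size + bio.scan(CAP_FEMALE).size
  male   = bio.scan(LOWER_MALE).size + bio.scan(CAP_MALE).size
  return :female if female > male
  return :male if male > female
  :unknown
end
```

A bio such as "John Roe is the author of What He Said." yields `:unknown` under this sketch, because the only pronoun is capitalized and sits inside a run of title words rather than at the start of a sentence.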
jeremybmerrill / gender.rb
Last active December 24, 2015 22:59
First pass at a Ruby version of global name data
require 'csv'
require 'set'
class Gender
def initialize(options={})
countries = Set.new([:us, :uk])
@threshold = options[:threshold] || 0.99
@names_counts = {}
jeremybmerrill / tabula_basic.rb
Created January 18, 2014 05:12
A snippet to extract spreadsheet data from a PDF using Tabula's tabula-extractor
require 'tabula'
pdf_file_path = "czechmaybe.pdf"
outfilename = "czechmaybe.csv"
out = open(outfilename, 'w')
extractor = Tabula::Extraction::ObjectExtractor.new(pdf_file_path, [5] ) #:all ) # 1..2643
extractor.extract.each do |pdf_page|
pdf_page.spreadsheets.each do |spreadsheet|
jeremybmerrill / edc.rb
Last active January 3, 2016 17:29
Script to output the four tables from page 1 and page 3 of an NYC EDC report using Tabula.
require 'tabula'
require 'fileutils'
folder_name = "EDC"
output_folder_name = "EDCcsvs"
#########################################################################
#########################################################################
FileUtils.mkdir_p(output_folder_name + "/")
jeremybmerrill / code_tos.rb
Created January 20, 2016 21:11
quick-and-dirty code things, CSV backend
require 'sinatra'
require 'csv'
$csv_read_path = "my_thing.uncoded.csv"
$csv_write_path = "my_thing.coded.csv"
$data = CSV.read($csv_read_path, headers: true) # keyword argument; a positional options hash raises ArgumentError on Ruby 3+
def write_csv!
CSV.open($csv_write_path, 'wb') do |csv|
jeremybmerrill / airplanes.sql
Created March 13, 2016 02:27
color lines by gradient
create table flight_segments as
SELECT hexid,start_time,end_time,callsign,point,
-- take a substring if the length remaining in the segment is greater than 5280 feet (1609.34 m)
-- otherwise take the remainder
ST_LineSubstring(geom, 1609.34*n/length,
CASE
WHEN 1609.34*(n+1) < length THEN 1609.34*(n+1)/length
ELSE 1
END) as geom
FROM
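The preview stops at `FROM`, so the source of `n` and `length` is not shown. As a worked illustration of the arithmetic only, here is a Ruby sketch of the start and end fractions that the `ST_LineSubstring` call receives for each segment. It assumes `n` counts up from 0 and that `length` is the line length in meters; `segment_fractions` and `SEGMENT_METERS` are hypothetical names.

```ruby
# Hypothetical sketch of the fraction arithmetic; not part of the gist.
SEGMENT_METERS = 1609.34 # 5280 feet, as in the SQL comment

# Returns [start_fraction, end_fraction] pairs that cover a line of the
# given length in one-mile pieces, with the last piece taking the remainder.
def segment_fractions(length_m, segment_m = SEGMENT_METERS)
  return [] unless length_m.positive?

  # Assumption: n runs from 0 up to the number of (possibly partial) segments.
  count = (length_m / segment_m).ceil
  (0...count).map do |n|
    start_frac = segment_m * n / length_m
    # Same branch as the SQL CASE: a full segment if it fits, else run to the end.
    end_frac = segment_m * (n + 1) < length_m ? segment_m * (n + 1) / length_m : 1.0
    [start_frac, end_frac]
  end
end
```

For a 4000 m line this gives three pairs: two full one-mile segments and a shorter final segment whose end fraction is exactly 1.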
jeremybmerrill / demo.html
Created October 3, 2016 18:10
a demonstration of the shenanigans caused by the isTrusted attribute
<!DOCTYPE html>
<html>
<head>
<title>How To Cause Trouble With Events' isTrusted Attribute</title>
<meta charset="UTF-8">
<script
src="https://code.jquery.com/jquery-2.2.4.min.js"
integrity="sha256-BbhdlvQf/xTY9gja0Dq3HiwQF8LaCRTXxZKRutelT44="
crossorigin="anonymous"></script>
</head>
jeremybmerrill / atlantacrime2012until2017.small.csv
Last active February 28, 2018 22:08
resources for an observable notebook
offense_id,occur_date,UC2 Literal,neighborhood,npu
110171050,01/14/2012,LARCENY-NON VEHICLE,Sweet Auburn,M
110181057,08/22/2011,LARCENY-NON VEHICLE,Glenrose Heights,Z
112032439,07/22/2011,AUTO THEFT,Downtown,M
112152334,08/03/2011,AUTO THEFT,Perkerson,X
113491709,12/07/2011,LARCENY-FROM VEHICLE,Hills Park,D
120010023,01/01/2012,AGG ASSAULT,The Villages at Carver,Y
120010069,12/31/2011,LARCENY-FROM VEHICLE,Old Fourth Ward,M
120010072,12/31/2011,LARCENY-FROM VEHICLE,English Avenue,L
120010086,01/01/2012,LARCENY-FROM VEHICLE,Morningside/Lenox Park,F
jeremybmerrill / gist:1d058424aca5ebe2eb3d
Created April 7, 2015 16:55
export from RDS mysql to a CSV
#
# On macOS, replace TABGOESHERE with a literal tab by typing Ctrl-V and then the Tab key.
#
mysql -u USERNAME --database=dbname --host=HOST --batch -e "select * from tablename" |
sed 's/TABGOESHERE/","/g' | sed 's/^/"/' | sed 's/$/"/' > destination.csv
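The `sed` pipeline wraps each field in quotes but does not escape quotes that appear inside a value. A possible alternative, sketched here in Ruby with the standard `csv` library, is to split each tab-separated line and let the library handle quoting. `tsv_to_csv` is a hypothetical helper name, and the sketch assumes the input is the tab-separated text that `mysql --batch` prints; it does not undo any backslash escaping that `mysql` applies within values.

```ruby
require 'csv'

# Hypothetical helper: converts tab-separated text into CSV text,
# quoting every field and doubling any embedded double quotes.
def tsv_to_csv(tsv_text)
  CSV.generate(force_quotes: true) do |csv|
    tsv_text.each_line(chomp: true) do |line|
      # The -1 limit keeps trailing empty columns instead of dropping them.
      csv << line.split("\t", -1)
    end
  end
end
```

To use it in place of the `sed` steps, one option is a small script that passes `$stdin.read` to `tsv_to_csv` and prints the result, with the `mysql` output piped into it.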