@tylerpearson
Created December 30, 2015 00:31
Ruby scraper for bills introduced in the West Virginia legislature
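#!/usr/bin/env ruby
# Usage: ./scraper.rb 2015

require 'nokogiri'
require 'open-uri'
require 'json'
require 'fileutils'

YEAR = ARGV.first

if YEAR.nil?
  puts "Please pass a year to scrape"
  puts "e.g. ./scraper.rb 2015"
  exit 1 # non-zero status signals the missing argument
end

### Helper methods

# Serializes results to JSON and writes them to filename,
# creating the parent directory first if it does not exist.
def save_results(results, filename)
  FileUtils.mkdir_p(File.dirname(filename))
  File.open(filename, "w") do |f|
    f.write(results.to_json)
  end
end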

### Scraper

# Kernel#open no longer opens URLs as of Ruby 3.0; open-uri provides URI.open instead.
page = Nokogiri::HTML(URI.open("http://www.legis.state.wv.us/Bill_Status/Bills_all_bills.cfm?year=#{YEAR}&sessiontype=rs&btype=bill"))
bills = []

page.css('.tabborder tr').each_with_index do |row, i|
  next if i == 0 # skip the first info row

  cells = row.css('td')
  bills << {
    number: cells[0].text.strip,
    title: cells[1].text.strip,
    url: 'http://www.legis.state.wv.us/Bill_Status/' + cells[0].css('a')[0]['href'],
    status: cells[2].text.gsub(/\s+/, '')
  }
end

puts "Scraped bill index"
## Save an index file for the bills
save_results(bills, "wv-bills-#{YEAR}-index.json")

bills.each do |bill|
  bill_page = Nokogiri::HTML(URI.open(bill[:url]))

  # Use the last link labelled "html"; skip bills that have none.
  html_link = bill_page.css('a').select { |link| link.text.strip.downcase == "html" }.last
  if html_link.nil?
    puts "No HTML version found for bill #{bill[:number]}, skipping"
    next
  end

  html_bill_link = "http://www.legis.state.wv.us/Bill_Status/" + html_link['href'].gsub(' ', '%20')
  bill_text_page = Nokogiri::HTML(URI.open(html_bill_link))
  number = bill[:number].gsub(/\s+/, '').downcase

  puts "Scraping bill #{bill[:number]} at #{html_bill_link}"

  # Collapse runs of whitespace so each paragraph is a single clean line.
  paragraphs = bill_text_page.css('#wrapper p').map do |para|
    para.text.gsub(/\s+/, ' ').strip
  end

  info = {
    number: number,
    title: bill[:title],
    url: html_bill_link,
    paragraphs: paragraphs,
    text: paragraphs.join(' ')
  }

  save_results(info, "wv-bills/#{YEAR}/#{number}.json")
end
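
The script builds absolute URLs by concatenating the `Bill_Status/` base with each `href` and replacing spaces with `%20`. A possible alternative, sketched below, resolves the `href` with `URI.join` from Ruby's standard library, which would also cope with absolute or `../`-style links. The constant `BASE_URL`, the helper name `absolute_bill_url`, and the sample `href` values are assumptions made for illustration; they do not appear in the original script, and the real site's link formats have not been checked here.

```ruby
require 'uri'

# Assumed base for relative links, copied from the string literal used in the script.
BASE_URL = 'http://www.legis.state.wv.us/Bill_Status/'

# Hypothetical helper: resolves an href against a base URL.
# URI.join raises on raw spaces, so they are percent-encoded first,
# mirroring the gsub(' ', '%20') in the script.
def absolute_bill_url(href, base = BASE_URL)
  URI.join(base, href.strip.gsub(' ', '%20')).to_s
end

# Made-up sample hrefs, for illustration only.
puts absolute_bill_url('bills_text.cfm?billdoc=hb 2001 intr.htm')
puts absolute_bill_url('http://example.com/a.htm')
```

With a helper like this, the two places that prepend the base string could call `absolute_bill_url(link['href'])` instead. Whether that is worth the extra indirection depends on how uniform the site's links turn out to be.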