Create a scraper on morph.io that, for each NSW fire area, collects:
- the name of the area
- the fire danger level and total fire ban status for today and tomorrow
- the list of councils affected
Set the scraper to run every day.
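Before writing the scraper it helps to sketch the record you'll save for each area. A minimal sketch using the ScraperWiki helper covered later in this guide; the field names, sample values, and the choice of the area name as the unique key are assumptions, not part of the challenge:
require 'scraperwiki'

# Hypothetical field names and sample values - one record per fire area
record = {
  name: 'Greater Sydney Region',
  danger_today: 'High',
  danger_tomorrow: 'Severe',
  total_fire_ban_today: false,
  total_fire_ban_tomorrow: true,
  councils: 'Blacktown, Blue Mountains, Penrith'
}
# Keying on :name means the daily run updates each area's row rather
# than adding duplicates
ScraperWiki.save_sqlite([:name], record)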
Create a scraper that collects bills introduced into NSW Parliament. Collect every bill introduced since 1997. For each bill, collect:
- the bill's name
- the URL for the bill on parliament.nsw.gov.au
- the house the bill originated in
Set the scraper to run every day so that it stays up to date.
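As with the fire danger challenge, deciding the record shape up front helps. A minimal sketch, assuming hypothetical field names and placeholder values, keyed on the bill's URL so the daily run updates existing rows rather than duplicating them:
require 'scraperwiki'

# Hypothetical field names and placeholder values - one record per bill
record = {
  name: 'Example Amendment Bill 2015',
  url: 'https://www.parliament.nsw.gov.au/bills/example',
  house: 'Legislative Assembly'
}
ScraperWiki.save_sqlite([:url], record)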
- Mechanize - http://mechanize.rubyforge.org/Mechanize.html
- Nokogiri - http://www.rubydoc.info/github/sparklemotion/nokogiri/
- morph.io docs - https://morph.io/documentation
- Useful Ruby bits:
  - Ruby Regular Expressions - http://rubular.com/
- Get help with scraping at the morph.io help forum - https://help.morph.io/
Clone your scraper to your local machine:
git clone https://github.com/morph-test-scrapers/australian_federal_members_of_parliament_tutorial.git
Check and install any missing dependencies:
bundle
Start an IRB session:
bundle exec irb
Run your scraper on your local machine:
bundle exec ruby scraper.rb
Make the Mechanize and ScraperWiki libraries available:
require 'scraperwiki'
require 'mechanize'
Get the page to scrape using Mechanize:
agent = Mechanize.new
page = agent.get('https://www.yourpageurl.org.au/')
Return the first matching element from the page using .at():
page.at(:h1)
Return all matching elements (an Array-like NodeSet) using .search():
page.search(:h2)
Get the text from an element:
page.at(:h1).text
Get the value of an attribute on an element:
page.at(:img).attr('src')
Collect your data into a Hash:
record = {
  name: page.at(:h1).text,
  url: page.at(:h1).at(:a)[:href]
}
Save the record you've collected (the first argument names the fields that uniquely identify a record, so re-running the scraper updates existing rows rather than duplicating them):
ScraperWiki.save_sqlite([:url], record)
Loop through a series of elements:
page.search(:h2).each do |item|
  # get the text of the second paragraph in this element
  item.search(:p)[1].text
end
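Putting the snippets above together, a whole scraper is just a fetch, a loop that builds a record per element, and a save. A sketch only - the :li and :a selectors and the field names assume a page where each list item contains a link, so swap in whatever the real page uses:
require 'scraperwiki'
require 'mechanize'

agent = Mechanize.new
page = agent.get('https://www.yourpageurl.org.au/')

page.search(:li).each do |item|
  link = item.at(:a)
  next unless link # skip list items that don't contain a link

  record = {
    name: link.text.strip,
    url: link[:href]
  }
  # :url is the unique key, so re-running the scraper updates
  # existing rows instead of adding duplicates
  ScraperWiki.save_sqlite([:url], record)
end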
- Civic tech monthly newsletter
- OpenAustralia Foundation Monthly Sydney Pub Meet & Lightning Talks