@mattSpell
Created September 8, 2014 18:24
Web Content Scrapers
#Web Content Scrapers:
##Recommendation:
- open-uri - http://ruby-doc.org/stdlib-2.1.0/libdoc/open-uri/rdoc/OpenURI.html to fetch the page
to be used along with:
- Nokogiri - http://nokogiri.org/ to parse the HTML
##Notes:
- In the Ruby Toolbox, the top 2 are Anemone and Pismo, but they are geared toward crawling sites and pulling metadata from them, not necessarily toward grabbing specific HTML content.
- Nokogiri can be a pain to install, but most of us should have already crossed that bridge with our other in-class projects.
- Also, to help spot the CSS selector(s) that you want to grab, use http://selectorgadget.com/. There is a quick 1.5-minute video that explains exactly how to use it.
- A recommended best practice is to keep the controllers light and put the scraping task into a model.
##Other Resources:
- http://railscasts.com/episodes/190-screen-scraping-with-nokogiri - Great Video!
- http://ruby.bastardsbook.com/chapters/html-parsing/
- Not the best, but one more example of the basic syntax you might use to scrape web content:
https://teamtreehouse.com/forum/im-stuck-on-how-to-integrate-a-nokogiri-scrape-into-my-rails-application