@atduskgreg
Created November 26, 2011 21:34
get the text of all the headlines on nytimes.com
require 'rubygems'
require 'nokogiri'
require 'open-uri'
# fetch the HTML of the page
# (URI.open is needed on Ruby 3.0+, where Kernel#open no longer accepts URLs)
page = URI.open("http://nytimes.com").read
# parse the page
doc = Nokogiri::HTML(page)
# use css selectors to get all the headlines on the page
# which, discovered by viewing source, are marked up as links inside of h2s, h3s, and h5s
headlines = doc.css("h2 a, h3 a, h5 a")
# print out the inner_html of each headline link
headlines.each do |headline|
  puts headline.inner_html
end