@sporsh
Forked from jimweirich/html_parser.rb
Created February 9, 2012 12:54
Vital Ruby Advance Lab 2
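require "nokogiri"
require "uri"

# Extracts absolute HTTP(S) links from an HTML document.
class HtmlParser
  # source      - URL the HTML came from; used to resolve relative links.
  # html_string - the HTML markup to scan.
  # Returns an Array of absolute URL Strings.
  def parse(source, html_string)
    base_uri = URI.parse(source)
    html_doc = Nokogiri::HTML(html_string)

    # Select anchors with a non-empty href that is not a same-page fragment.
    anchors = html_doc.xpath('//a[@href!="" and not(starts-with(@href, "#"))]')
    links = anchors.map { |elem| elem.attribute('href').value }

    # Resolve each link against the base URI; links that fail to parse become nil.
    absolute_links = links.map { |link| base_uri.merge(link) rescue nil }

    # Keep only HTTP and HTTPS URIs (URI::HTTPS is a subclass of URI::HTTP).
    http_links = absolute_links.select { |link| link.is_a?(URI::HTTP) }

    http_links.map(&:to_s)
  end
end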
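
The link-resolution and filtering steps can be tried without Nokogiri. The following is a minimal sketch, not part of the original gist: `resolve_http_links` is a hypothetical helper name, and the `example.com` URLs are made-up sample input. It mirrors the same idea as `HtmlParser#parse`: merge each href into the base URI, drop anything that fails to parse, and keep only HTTP(S) results.

```ruby
require "uri"

# Hypothetical helper (not in the original gist): resolve hrefs against a
# base URL and keep only the ones that end up as http or https URIs.
def resolve_http_links(base, hrefs)
  base_uri = URI.parse(base)
  resolved = hrefs.map do |href|
    begin
      base_uri.merge(href)
    rescue URI::Error
      nil # unparseable hrefs are dropped, like `rescue nil` in the class
    end
  end
  # URI::HTTPS inherits from URI::HTTP, so one check covers both schemes.
  resolved.select { |u| u.is_a?(URI::HTTP) }.map(&:to_s)
end

# Sample input (assumed values, for illustration only).
links = resolve_http_links(
  "http://example.com/docs/index.html",
  [
    "page2.html",                  # relative to the base document
    "/about",                      # relative to the host root
    "https://other.example/x",     # already absolute
    "mailto:someone@example.com",  # absolute, but not HTTP(S)
    "bad uri with spaces"          # invalid, fails to parse
  ]
)
puts links
```

Rescuing `URI::Error` here is narrower than the `rescue nil` modifier in the class, which swallows any `StandardError`; which one is preferable depends on how much you want to hide.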