@markprovan
Created February 1, 2012 20:45
Crawler
require 'net/http'
require 'open-uri'
require 'nokogiri'
require 'uri'
class Crawler
  attr_accessor :visited_links, :links_to_visit

  def initialize(starting_link)
    self.visited_links = []
    self.links_to_visit = []
    crawl(starting_link)
  end
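  # Returns true when a GET request to +link+ answers with HTTP 200.
  # Not called by #crawl; kept as a standalone helper.
  def valid_link?(link)
    response = Net::HTTP.get_response(URI.parse(link))
    response.code == "200"
  rescue StandardError
    # Malformed URLs and network failures count as invalid links.
    false
  end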
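  # Crawls iteratively from +url+, following every absolute http(s) link found.
  # Nothing restricts the crawl to the starting host, so it can run for a long time.
  def crawl(url)
    links_to_visit << url
    until links_to_visit.empty?
      current = links_to_visit.shift
      # Skip missing hrefs, non-HTTP links (mailto:, relative paths) and pages already seen.
      next unless current.to_s.start_with?("http")
      next if visited_links.include?(current)

      visited_links << current
      puts "Crawling: #{current}"
      begin
        # Kernel#open no longer accepts URLs on Ruby 3; open-uri provides URI.open.
        doc = Nokogiri::HTML(URI.open(current))
      rescue StandardError => e
        warn "Skipping #{current}: #{e.message}"
        next
      end
      doc.css('a').each do |link|
        href = link['href']
        links_to_visit << href unless href.nil? || visited_links.include?(href) || links_to_visit.include?(href)
      end
    end
  end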
end
c = Crawler.new("http://www.vamosa.com")
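
The crawler above only follows hrefs that already start with "h", so relative links such as `/about` are dropped. A minimal sketch of one way to handle them, assuming the crawl should stay on the starting host and stop after a page budget (neither is in the original). The names `absolute_link`, `crawl_order`, `fetch_links` and `max_pages` are hypothetical, and the page fetcher is passed in as a callable so the traversal can be tried without network access:

```ruby
require 'set'
require 'uri'

# Resolves +href+ against +base+ and returns an absolute http(s) URL string,
# or nil when the href is missing, malformed, or uses another scheme.
def absolute_link(base, href)
  return nil if href.nil? || href.strip.empty?

  uri = URI.join(base, href.strip)
  return nil unless uri.is_a?(URI::HTTP) # URI::HTTPS is a subclass of URI::HTTP

  uri.fragment = nil # "#section" anchors point at the same page
  uri.to_s
rescue URI::Error
  nil
end

# Breadth-first traversal. +fetch_links+ is a hypothetical seam: any callable
# that takes a URL and returns the raw href values found on that page.
# Assumption: only pages on the starting host are followed.
def crawl_order(start_url, fetch_links, max_pages: 50)
  start_host = URI.parse(start_url).host
  visited = Set.new
  queue = [start_url]
  order = []

  until queue.empty? || order.size >= max_pages
    url = queue.shift
    next unless visited.add?(url) # add? returns nil when url was already present

    order << url
    fetch_links.call(url).each do |href|
      link = absolute_link(url, href)
      next if link.nil? || URI.parse(link).host != start_host

      queue << link unless visited.include?(link)
    end
  end
  order
end

# Usage with an in-memory link graph (made-up hosts, no network involved).
pages = {
  "http://example.test/" => ["/a", "b", "#top", "mailto:someone@example.test", "http://other.test/x", nil],
  "http://example.test/a" => ["/", "/b"],
  "http://example.test/b" => ["/a"]
}
fetch = ->(url) { pages.fetch(url, []) }
puts crawl_order("http://example.test/", fetch)
```

To run this against real pages, the callable could wrap the same Nokogiri call the gist uses, for example `->(url) { Nokogiri::HTML(URI.open(url)).css('a').map { |a| a['href'] } }`. A `Set` keeps the visited check constant-time, and the explicit queue avoids the deep recursion of calling `crawl` from inside itself.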