@dkubb
Created June 27, 2014 23:19
Create a sitemap for a supplied hostname and list of URLs
#!/usr/bin/ruby
require 'rubygems'
require 'mechanize'
require 'addressable/uri'
ROOT_URL = Addressable::URI.parse(ARGV.fetch(0)).freeze
agent = Mechanize.new
# Disable SSL certificate verification for this agent only, instead of
# redefining the global OpenSSL::SSL::VERIFY_PEER constant.
# This is insecure; use it only against hosts you trust.
agent.verify_mode = OpenSSL::SSL::VERIFY_NONE
# Seed the work list with one URL per line from the file given as the second argument
uris = File.readlines(ARGV.fetch(1)).map(&:chomp).uniq
seen = {}

while (uri = uris.pop)
  seen[uri] = true

  begin
    agent.get(uri) do |page|
      # Only HTML pages are printed and scanned for links
      next unless page.respond_to?(:content_type) &&
                  page.content_type =~ %r{\Atext/html\b}

      puts page.uri

      # Follow root-relative links only, with the fragment stripped
      page.links_with(href: %r{\A/}).each do |link|
        link_uri = ROOT_URL.join(link.href).merge!(fragment: nil).normalize!
        next if link_uri.host != ROOT_URL.host || seen.key?(link_uri.to_s)

        uris << link_uri.to_s unless uris.include?(link_uri.to_s)
      end
    end
  rescue Mechanize::UnauthorizedError, Mechanize::ResponseCodeError
    # Skip pages that require authentication or return an error status
  end
end
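
The crawl loop in the script can be tried without network access. The following is a minimal sketch of the same idea (pop a URL, record it, canonicalize each link, push the ones not yet seen), not the gist's own code. It swaps in the standard library's `URI` for Addressable and a hypothetical in-memory `SITE` hash for the Mechanize fetches. The names `SITE`, `canonical_link`, and `crawl` are assumptions made for illustration. Unlike the script, it filters duplicates when a URL is popped rather than when it is pushed.

```ruby
require 'uri'
require 'set'

# Hypothetical in-memory "site": page URL => hrefs found on that page.
# It stands in for the HTTP fetches that Mechanize performs in the script.
SITE = {
  'http://example.com/'      => ['/about', '/blog#latest', '//other.example/x'],
  'http://example.com/about' => ['/', '/blog'],
  'http://example.com/blog'  => ['/about#team']
}.freeze

# Resolve href against root, drop the fragment, and normalize.
# Returns nil when the result points at a different host.
def canonical_link(root, href)
  uri = URI.join(root, href)
  uri.fragment = nil
  uri.normalize!
  uri.host == URI(root).host ? uri.to_s : nil
end

# Work-list crawl: pop a URL, record it, push links that are not yet seen.
def crawl(root, seeds, site)
  pending = seeds.dup
  seen    = Set.new
  visited = []

  while (url = pending.pop)
    next unless seen.add?(url) # skip URLs that were already visited

    visited << url
    site.fetch(url, []).each do |href|
      link = canonical_link(root, href)
      pending << link if link && !seen.include?(link)
    end
  end

  visited
end

puts crawl('http://example.com/', ['http://example.com/'], SITE)
```

With this sample data the sketch prints three URLs: the root, `/blog`, and `/about`. The protocol-relative link `//other.example/x` matches a leading-slash pattern but resolves to another host, which is why a host comparison is still needed after filtering on root-relative hrefs.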