Script to crawl a YAML list of broken URLs and determine where each one was moved to.

To run this script:

  1. Add crawl_broken_urls.rb to your directory.
  2. At the same directory level, add a broken_urls.yml (a sample is shown below).
  3. Then run ruby crawl_broken_urls.rb.
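
The script iterates over the parsed file with .each and hands each entry straight to fetch, so broken_urls.yml is assumed to be a top-level YAML sequence of URL strings (the entries below are hypothetical placeholders):

# broken_urls.yml
- http://example.com/old-page
- http://example.com/moved-post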
require 'yaml'
require 'net/http'
require 'uri'
require 'timeout'
# Debugging helpers used while developing the gist; safe to remove.
require 'pry'
require 'rb-readline'

broken_urls = YAML.load_file("./broken_urls.yml")
def fetch(uri_str, limit = 10)
  default_error = 'HTTPError'
  raise ArgumentError, 'HTTP redirect too deep' if limit == 0

  url = URI.parse(uri_str)
  # request_uri is never empty (url.path is "" for bare hosts), and use_ssl
  # is required for https URLs, which redirect targets frequently are.
  req = Net::HTTP::Get.new(url.request_uri)
  response = Net::HTTP.start(url.host, url.port, use_ssl: url.scheme == 'https') do |http|
    begin
      Timeout.timeout(3) { http.request(req) }
    rescue Timeout::Error
      puts 'That took too long, exiting...'
      nil
    end
  end

  begin
    case response
    when Net::HTTPSuccess
      response
    when Net::HTTPRedirection
      # Follow the redirect unless it points back at the requested URL or is
      # a temporary (302) redirect, which this script deliberately ignores.
      if uri_str != response['location'] && response.code != '302'
        fetch(response['location'], limit - 1)
      else
        response
      end
    else
      # nil (the request timed out above) or an error response lands here.
      default_error
    end
  rescue
    "TimeoutError"
  end
end
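
As a quick sanity check of fetch on its own (network access assumed):

fetch('http://github.com')
# follows the 301 redirect and returns a Net::HTTPOK for https://github.com/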
broken_urls.each do |link|
  # puts "#{link}"
  # puts `curl -I #{link}`
  puts "#{link}: #{fetch(link)}"
end
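
Note that fetch returns the final Net::HTTPResponse, which prints as something like #<Net::HTTPOK 200 OK readbody=true> rather than the address a URL moved to. Below is a minimal sketch of a helper that reports the end of the redirect chain instead; final_location is a hypothetical addition, not part of the original gist, and it assumes servers send absolute Location headers:

# Hypothetical helper: walk Location headers and return the last URL visited.
def final_location(uri_str, limit = 10)
  return uri_str if limit == 0
  url = URI.parse(uri_str)
  response = Net::HTTP.start(url.host, url.port, use_ssl: url.scheme == 'https') do |http|
    http.request(Net::HTTP::Get.new(url.request_uri))
  end
  # Keep walking only while the server hands back a new absolute Location.
  if response.is_a?(Net::HTTPRedirection) && response['location'] && response['location'] != uri_str
    final_location(response['location'], limit - 1)
  else
    uri_str
  end
end

broken_urls.each { |link| puts "#{link} -> #{final_location(link)}" }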