80legs URL Crawlability Validator
# 80legs Bulk Robots Checker
# 80legs.com 2014
# Usage: ruby crawlable.rb <path_to_url_list> [-r]
#   (pass -r when the URL list is delimited by carriage returns instead of newlines)
# Example: $> ruby crawlable.rb /Users/nick/Documents/url_list_1.txt > output.csv
#   #=> creates a CSV file containing each URL, and whether or not it's crawlable
#
# Installation/Requirements:
#   gem install robotstxt
#
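# An input file is assumed to contain one URL per line, e.g.:
#   http://example.com/
#   http://example.org/some-page
# and the redirected output.csv would then contain, e.g.:
#   URL,Crawlable
#   http://example.com/, true
#   http://example.org/some-page, false
#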
require 'robotstxt'

class String
  # Checks this URL against its robots.txt for both the generic user-agent ('*')
  # and the 80legs crawler ('008'); the URL counts as crawlable only if neither
  # rule explicitly disallows it.
  def crawlability
    all        = Robotstxt.allowed?(self, '*')   != false
    eightylegs = Robotstxt.allowed?(self, '008') != false
    "#{self}, #{all && eightylegs}"
  end
end
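
# For illustration, calling the method on a URL string directly might return, e.g.:
#   "http://www.example.com/".crawlability  #=> "http://www.example.com/, true"
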
puts "URL,Crawlable"
if ARGV[1] == "-r"
File.open(ARGV[0], "r").each_line("\r"){ |url| puts url.crawlability }
else
file = File.open(ARGV[0], "r").to_a.map{|url| url.split("\n").first.split("\r").first}
file.each{|url| puts url.crawlability}
end