80legs_Robots_Parser
# 80legs Bulk Robots Checker
# 80legs.com 2014
#
# Usage:   ruby robots.rb <path_to_url_list>
# Example: $> ruby robots.rb /Users/nick/Documents/url_list_1.txt
#
# Installation/Requirements:
#   gem install rest-client
#   (json and uri ship with the Ruby standard library)
#   Set MASHAPE_API_KEY below to your Mashape key for the robotstxt API.
#
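# The URL list is assumed to be a plain-text file with one absolute URL per
# line. A hypothetical url_list_1.txt might look like:
#
#   http://example.com/some/page
#   https://example.org/
#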
require 'json'
require 'uri'
require 'rest-client'

class RobotsParser
  MASHAPE_API_KEY = "put_key_here"

  # Build the robots.txt URL for a given page URL, e.g.
  # "http://example.com/some/page" -> "http://example.com/robots.txt".
  # Returns "" if the URL cannot be parsed.
  def robots_file_for(url)
    parsed_url = URI.parse(url)
    "#{parsed_url.scheme}://#{parsed_url.host.downcase}/robots.txt"
  rescue StandardError
    ""
  end

  # Fetch and parse the robots.txt rules via the Mashape robotstxt API.
  # Returns a hash of user-agent name => { allow:, disallow: } rule lists,
  # or an empty hash if the request or parsing fails.
  def parse_robots_file(url_to_parse)
    agents = {}
    response = RestClient.get(
      "https://robotstxt.p.mashape.com/site/robots/?url=#{URI.encode_www_form_component(url_to_parse)}",
      :"X-Mashape-Authorization" => MASHAPE_API_KEY
    )
    parsed = JSON.parse(response.body)
    parsed["agents"].each do |a|
      agents[a["name"]] = { allow: a["allow"], disallow: a["disallow"] }
    end
    agents
  rescue StandardError
    agents
  end
end
app = RobotsParser.new
file = ARGV[0]

# Check each URL in the input file and print the rules that apply to the
# wildcard agent ("*") and to 80legs' crawler ("008").
File.open(file, "r").each_line do |line|
  url = line.strip
  next if url.empty?
  begin
    robots_file = app.robots_file_for(url)
    agents = app.parse_robots_file(robots_file)
    puts "Site: #{URI.parse(url).host} ->"
    puts "\t#{agents["*"]}"
    puts "\t#{agents["008"]}"
  rescue StandardError
    puts "InvalidURL"
  end
end
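
# Illustrative output (the actual rule lists depend on each site's robots.txt
# and on the fields the robotstxt API returns):
#
#   Site: example.com ->
#       {:allow=>["/"], :disallow=>[]}
#       {:allow=>[], :disallow=>["/private"]}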