Last active
December 13, 2015 17:28
-
-
Save heisters/4948255 to your computer and use it in GitHub Desktop.
Check all the links in a wordpress export (or really any text file) against Google Safe Browsing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
# This is a version of https://github.com/juliensobrier/google-safe-browsing-lookup-ruby, | |
# modified to work against Ruby 1.9.2 and not blow up when malformed URLs are encountered. | |
# | |
# The library requires an API key from Google. | |
# Sign up for a free key a http://code.google.com/apis/safebrowsing/key_signup.html | |
# | |
# See README.rdoc for more information about Google Safe Browsing v2 API | |
# and this library. | |
# | |
# Author:: Julien Sobrier (mailto:julien@sobrier.net) | |
# Copyright:: Copyright (c) 2011 Julien Sobrier | |
# License:: Distributes under the same terms as Ruby | |
require 'uri' | |
require 'net/https' | |
class SafeBrowsingLookup | |
# API key | |
attr_reader :key | |
# Enable debug & error output to the standard output | |
attr_reader :debug | |
# Enable error output to the standard output | |
attr_reader :error | |
# Contain last error | |
attr_reader :last_error | |
# Library version | |
attr_reader :version | |
# Google API version | |
attr_reader :api_version | |
# New client | |
# | |
# +key+:: API key | |
# +debug+:: Set to true to print debug & error output to the standard output. false (disabled) by default. | |
# +error+:: Set to true to print error output to the standard output. false (disabled) by default. | |
def initialize(key='', debug=false, error=false) | |
@key = key || '' | |
@debug = debug || false | |
@error = error || false | |
@last_error = '' | |
@version = '0.1' | |
@api_version = '3.0' | |
raise ArgumentError, "Missing API key" if (@key == '') | |
end | |
# Lookup a list of URLs against the Google Safe Browsing v2 lists. | |
# | |
# Returns a hash <url>: <Gooogle match>. The possible values for <Gooogle match> are: "ok" (no match), "malware", "phishing", "malware,phishing" (match both lists) and "error". | |
# | |
# +urls+:: List of URLs to lookup. The Lookup API allows only 10,000 URL checks a day. If you need more, find a Ruby implementation of the full Google Safe Browsing v2 API. Each requests must contain 500 URLs at most. The lookup() method will split the list of URLS in blocks of 500 URLs if needed. | |
def lookup(urls='') | |
if (urls.respond_to?('each') == false) | |
urls = Array.new(1, urls) | |
end | |
# urls_copy = Array.new(urls) | |
results = { } | |
# while (urls_copy.length > 0) | |
# inputs = urls_copy.slice!(0, 500) | |
count = 0 | |
while (count * 500 < urls.length) | |
inputs = urls.slice(count * 500, 500) | |
body = inputs.length.to_s | |
inputs.each do |url| | |
c_url = canonical(url) | |
unless c_url | |
warn "Skipping #{url.inspect}" | |
next | |
end | |
body = body + "\n" + c_url | |
end | |
debug("BODY:\n#{body}\n\n") | |
uri = URI.parse("https://sb-ssl.google.com/safebrowsing/api/lookup?client=ruby&apikey=#{@key}&appver=#{@version}&pver=#{@api_version}") | |
http = Net::HTTP.new(uri.host, uri.port) | |
http.open_timeout = 30 | |
http.read_timeout = 30 | |
http.use_ssl = true | |
http.verify_mode = OpenSSL::SSL::VERIFY_NONE | |
response = http.request_post("#{uri.path}?#{uri.query}", body) | |
case response | |
when Net::HTTPOK # 200 | |
debug("At least 1 match\n") | |
results.merge!( parse(inputs, response.body) ) | |
when Net::HTTPNoContent # 204 | |
debug("No match\n") | |
results.merge!( ok(inputs) ) | |
when Net::HTTPBadRequest # 400 | |
error("Invalid request") | |
results.merge( errors(inputs) ) | |
when Net::HTTPUnauthorized # 401 | |
error("Invalid API key") | |
results.merge!( errors(inputs) ) | |
when Net::HTTPServiceUnavailable # 503 | |
error("Server error, client may have sent too many requests") | |
results.merge!( errors(inputs) ) | |
else | |
self.error("Unexpected server response: #{response.code}") | |
results.merge!( errors(inputs) ) | |
end | |
count = count + 1 | |
end | |
return results | |
end | |
private | |
# Not much is actually done, full URL canonicalization is not required with the Lookup library according to the API documentation | |
def canonical(url='') | |
# remove leading/ending white spaces | |
url.strip! | |
# make sure whe have a scheme | |
if (url !~ /^https?\:\/\//i) | |
url = "http://#{url}" | |
end | |
begin | |
uri = URI.parse(url) | |
rescue URI::InvalidURIError | |
return nil | |
end | |
return uri.to_s | |
end | |
def parse(urls=[], response) | |
lines = response.split("\n") | |
if (urls.length != lines.length) | |
error("Number of URLs in the reponse does not match the number of URLs in the request") | |
debug("#{urls.length} / #{lines.length}") | |
debug(response); | |
return errors(urls); | |
end | |
results = { } | |
for i in (0..lines.length - 1) | |
results[urls[i]] = lines[i] | |
debug(urls[i] + " => " + lines[i]) | |
end | |
return results | |
end | |
def errors(urls=[]) | |
return Hash[*urls.map {|url| [url, 'error']}.flatten] | |
end | |
def ok(urls=[]) | |
return Hash[*urls.map {|url| [url, 'ok']}.flatten] | |
end | |
def debug(message='') | |
puts message if (@debug == true) | |
end | |
def error(message='') | |
puts "#{message}\n" if (@debug == true or @error == true) | |
@last_error = message | |
end | |
end |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
require 'google-safe-browsing-lookup' | |
require 'yaml' | |
google = SafeBrowsingLookup.new '<your key>' | |
xml = File.read('<the file>') | |
output = google.lookup xml.scan(%r{(https?://[^/<"]*)}).flatten.uniq.sort | |
puts output.to_yaml |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment