@Techbrunch
Created June 8, 2016 09:57
Scrape stackshare.io to get a list of hosts
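require 'http'
require 'nokogiri'
require 'uri'

# urls.txt holds one stackshare.io page URL per line.
File.open('urls.txt', 'r') do |file_handle|
  file_handle.each_line do |line|
    url = line.strip # drop the trailing newline before making the request
    next if url.empty?

    begin
      html = HTTP.get(url).to_s
      doc = Nokogiri::HTML(html)
      # href of the sidebar link, which is expected to redirect to the external site
      href = doc.xpath('//*[@id="stp-sidebar"]/div/div[1]/div[1]/a/@href').to_s
      if href.empty?
        File.open('bad.txt', 'a+') { |f| f.puts(url) }
      else
        # HTTP.get does not follow redirects, so the target is in the Location header
        location = HTTP.get(URI.join('http://stackshare.io/', href).to_s)['Location']
        host = location && URI(location).host
        if host.nil? || host.empty?
          File.open('bad.txt', 'a+') { |f| f.puts(url) }
        else
          File.open('good.txt', 'a+') { |f| f.puts(host) }
        end
      end
    rescue StandardError => e
      puts "could not parse: #{url} (#{e.message})"
    end
  end
end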
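
The step that turns the redirect's Location header into a host can be tried without any network access. The sketch below is a hypothetical helper, not part of the original script: the name `host_from_location` is an assumption, and it only uses Ruby's standard `uri` library. It returns `nil` for a missing, blank, relative, or unparseable value, which are the cases the script would want to log to bad.txt.

```ruby
require 'uri'

# Hypothetical helper (not in the original gist): return the host named in a
# Location header value, or nil when the value is missing, blank, relative,
# or not a valid URI.
def host_from_location(location)
  return nil if location.nil? || location.strip.empty?

  host = URI.parse(location.strip).host
  host.nil? || host.empty? ? nil : host
rescue URI::InvalidURIError
  nil
end

puts host_from_location('http://example.com/path').inspect # "example.com"
puts host_from_location('/relative/path').inspect          # nil
puts host_from_location(nil).inspect                       # nil
```

Keeping this logic in one small method would make the nil and invalid-URI cases explicit instead of relying on a blanket rescue around the whole loop body.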