@rajeshg
Created October 15, 2010 19:46
Quick little script to crawl a website, fetch the keywords and description meta tags from each page, and export them to Excel
#!/usr/bin/ruby -w
require 'rubygems'
require 'hpricot'
require 'open-uri'
require 'spreadsheet'
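# Crawls a site and records the keywords and description meta tags of each page.
class MetaReader
  # attr_accessor defines both readers and writers, so no separate attr_reader is needed.
  attr_accessor :url, :outFile, :links, :book, :sheet1, :rownum

  def initialize(url, outFile)
    @url = url          # base URL of the site being crawled
    @outFile = outFile  # open file handle for the plain-text log
    @links = {}         # URLs that have already been queued for a visit
    @rownum = 0         # next spreadsheet row to fill
    @book = Spreadsheet::Workbook.new
    @sheet1 = book.create_worksheet
  end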
  # Logs the meta tags of tempurl, then recurses into every in-site link on the page.
  def writeMeta(tempurl)
    puts tempurl
    outFile.puts "#{tempurl}\n\n"
    keywords = ''
    desc = ''
    begin
      doc = Hpricot(open(tempurl)) # open web page (Ruby 3.0+ requires URI.open here)
      doc.search('meta').each do |meta|
        if meta.attributes['name'] =~ /keywords|description/
          outFile.puts "#{meta.attributes['name']}: #{meta.attributes['content']}"
          if meta.attributes['name'] == 'keywords'
            keywords = meta.attributes['content']
          else
            desc = meta.attributes['content']
          end
        end
      end
      # one spreadsheet row per page: URL, keywords, description
      sheet1.row(rownum).push tempurl, keywords, desc
      @rownum += 1
      doc.search("a[@href]").each do |param| # for every link do
        href = param.attributes['href']
        # skip media files, excluded partner sites, and non-HTTP schemes
        unless (href =~ /\.pdf|\.jpg|\.mp3|\.flv|\.jpeg|\.swf|netreturns|staywellsolutions|javascript\:|mailto/i)
          href = href.gsub(/\.\./,'')
          if(href =~ /http|www/ )
            # absolute link: follow it only if it stays on nch.org (hard-coded site filter)
            if (!links["#{href}"] && href =~/nch\.org/)
              links["#{href}"] = 1
              writeMeta("#{href}")
            end
          elsif href=~/^\//
            # root-relative link: prepend the base URL
            href = href.gsub(/\/\//,'')
            unless (links["#{url}#{href}"])
              links["#{url}#{href}"] = 1
              writeMeta("#{url}#{href}")
            end
          else
            # page-relative link: treated as relative to the base URL
            href = href.gsub(/\/\//,'')
            unless (links["#{url}/#{href}"])
              links["#{url}/#{href}"] = 1
              writeMeta("#{url}/#{href}")
            end
          end
        end
      end
      # save the workbook (rewritten after every crawled page)
      book.write "links.xls"
    # ECONNRESET lives in the Errno namespace, so it must be written Errno::ECONNRESET.
    rescue OpenURI::HTTPError, SocketError, Errno::ECONNRESET => e
      puts "The '#{tempurl}' page is not accessible, error #{e}"
      # record the failure in the fourth column of the page's row
      sheet1.row(rownum).push tempurl, '', '', "#{e}"
      @rownum += 1
    end
  end
end
myfile = File.new('links.html', 'w') # the output file
MetaReader.new("http://www.example.com", myfile).writeMeta('http://www.example.com')
myfile.close # flush the log to disk once the crawl finishes
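
The script builds each follow-up URL by stripping `..` and `//` from the href and gluing the result onto the base URL. As a possible alternative, here is a minimal sketch that leaves the path arithmetic to `URI.join` from Ruby's standard library. The helper name `resolve_link`, the `SKIP_PATTERN` constant, and the `allowed_host` parameter are hypothetical and not part of the original script, and the skip list is a simplified version of the script's regex.

```ruby
require 'uri'

# Hypothetical skip list (assumption): media files and non-HTTP schemes.
SKIP_PATTERN = /\.(pdf|jpe?g|mp3|flv|swf)\z|\Ajavascript:|\Amailto:/i

# Hypothetical helper: resolves href against the page it was found on.
# Returns an absolute URL string, or nil when the link should be skipped.
def resolve_link(page_url, href, allowed_host)
  return nil if href.nil? || href =~ SKIP_PATTERN
  absolute = URI.join(page_url, href)   # handles "/a", "a", "../a" and full URLs
  return nil unless absolute.host == allowed_host
  absolute.fragment = nil               # "page#top" and "page" are the same document
  absolute.to_s
rescue URI::InvalidURIError
  nil                                   # malformed href: skip it
end

puts resolve_link("http://www.example.com/news/index.html", "../about.html", "www.example.com")
# prints http://www.example.com/about.html
```

Comparing hosts instead of matching a regex against the whole href keeps the crawl on one site even when an external URL happens to contain the site's name, and dropping the fragment avoids fetching the same page once per anchor.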