@raygunsix
Created May 11, 2011 00:13
Scrapes URLs and puts the HTML into an S3 bucket
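require 'rubygems'
require 'open-uri'
require 'aws/s3'
require 'json'

# Fill in your AWS credentials before running.
AWS::S3::Base.establish_connection!(
  :access_key_id     => '',
  :secret_access_key => ''
)

# urls.json maps an id to a permalink containing a [server_name] placeholder.
parsed_data = JSON.parse(File.read('urls.json'))

parsed_data.each do |id, permalink|
  url = permalink.sub(/\[server_name\]/, 'suite101.com')
  # Fetch via open-uri. On Ruby 3.0+ open-uri no longer patches Kernel#open;
  # use URI.open(url) there.
  html = open(url) { |f| f.read }

  # Name the local file (and the S3 key) after the article slug.
  filename = url.sub('http://www.suite101.com/content/', '') + '.html'
  File.open(filename, 'w') { |f| f.write(html) }

  # Upload the saved file; the block form closes the handle afterwards.
  File.open(filename) do |file|
    AWS::S3::S3Object.store(filename, file, 'www-s3.suite101.com/content/')
  end
end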
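The two string transformations inside the loop above (expanding the `[server_name]` placeholder and deriving the file name) can be checked on their own, without touching the network or S3. This is a minimal sketch, not part of the original script: the helper names `expand_permalink` and `s3_key_for` and the constants `SERVER_NAME` and `CONTENT_PREFIX` are hypothetical and introduced only for illustration.

```ruby
# Hypothetical helpers (not in the original script) that isolate the two
# string transformations performed inside the scraping loop.
SERVER_NAME    = 'suite101.com'
CONTENT_PREFIX = "http://www.#{SERVER_NAME}/content/"

# Replace the literal "[server_name]" placeholder with the real host.
def expand_permalink(permalink, server_name = SERVER_NAME)
  permalink.sub('[server_name]', server_name)
end

# Derive the local file name / S3 key: drop the content prefix, add ".html".
def s3_key_for(url, prefix = CONTENT_PREFIX)
  url.sub(prefix, '') + '.html'
end

url = expand_permalink('http://www.[server_name]/content/how-to-use-vocabulary-activities-a31127')
puts url              # http://www.suite101.com/content/how-to-use-vocabulary-activities-a31127
puts s3_key_for(url)  # how-to-use-vocabulary-activities-a31127.html
```

Keeping these steps as pure functions makes it easy to confirm the resulting keys before any upload is attempted.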
{
"1":"http://www.[server_name]/content/southampton-promoted-as-players-and-fans-revel-in-delight-a370378",
"2":"http://www.[server_name]/content/hvar-marks-start-of-2011-tourist-season-with-st-prosper-festival-a370324",
"3":"http://www.[server_name]/content/how-to-use-vocabulary-activities-a31127",
"4":"http://www.[server_name]/content/beaconsfield-mine-miracle-rescue-remembered---5-years-on-a370342"
}
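The JSON above is a single object keyed by id, so `JSON.parse` returns a Hash with String keys and String values, and `each` yields one `(id, permalink)` pair per entry. A small sketch of that shape, assuming two entries copied from the data above and inlined in a hypothetical `SAMPLE` constant so it runs without the file:

```ruby
require 'json'

# Two entries copied from urls.json, inlined so the sketch is self-contained.
SAMPLE = '{
"3":"http://www.[server_name]/content/how-to-use-vocabulary-activities-a31127",
"4":"http://www.[server_name]/content/beaconsfield-mine-miracle-rescue-remembered---5-years-on-a370342"
}'

entries = JSON.parse(SAMPLE) # Hash of id (String) => permalink (String)

entries.each do |id, permalink|
  # The slug is the last path segment of the permalink.
  slug = permalink.split('/').last
  puts "#{id}: #{slug}"
end
```

Note that the ids arrive as strings ("3", not 3); the scraping loop never uses them, so no conversion is needed there.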