@mnutt
Created June 20, 2013 02:31
Grabs streeteasy.com listings based on availability dates and reformats them into a single HTML page.
#!/usr/bin/env ruby
require 'mechanize'
require 'yaml'
unless ARGV[0] =~ /^http/ && ARGV[1] =~ /(now|[\d\/]+)/
  puts "Usage: scraper.rb [streeteasy search URL] [date(s)]"
  puts " streeteasy search url: just perform a search on streeteasy.com"
  puts " and grab the URL from the location bar"
  puts ""
  puts " date(s): can either be now, unknown, or dates in the"
  puts " format mm/dd/yyyy, and can be multiple separated by a comma"
  puts ""
  exit 1
end
@listing_url = ARGV[0] # || "http://streeteasy.com/nyc/rentals/downtown-manhattan/rental_type:frbo,brokernofee,brokerfee%7Cprice:2500-3500%7Cbeds:1?page=1&sort_by=price_desc"
@available_date = ARGV[1].split(",") # || "08/01/2013,07/31/2013"
@available_date.shift if @available_date == ['']
agent = Mechanize.new
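# Turn a site-relative path into an absolute streeteasy.com URL.
def full_url(path)
  if path =~ /^\//
    "http://streeteasy.com#{path}"
  else
    path
  end
end
# Build the search URL for the given results page.
def page_url(page_number)
  if @listing_url =~ /page=/
    @listing_url.gsub(/page=\d+/, "page=#{page_number}")
  else
    # Use "?" when the URL has no query string yet, "&" otherwise.
    separator = @listing_url.include?("?") ? "&" : "?"
    "#{@listing_url}#{separator}page=#{page_number}"
  end
end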
html = ""
first_page = agent.get page_url(1)
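# Fall back to a single page when the pager text contains no "of N".
pager_match = first_page.search("div.pager_top").text.scan(/of (\d+)/).first
page_count = pager_match ? pager_match.first.to_i : 1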
(1..page_count).each { |page_number|
  sleep 0.5 # pause between requests
  results_page = agent.get page_url(page_number)
  urls = results_page.search("div.unsponsored div.item_inner div.photo a").map{|e| e.attribute("href").to_s }.uniq
  urls.each {|url|
    sleep 0.5
    listing = agent.get url
    availability = listing.search("div.price div.small").first.to_s.scan(/available( on)?\s+([now\/0-9]+)/m).flatten[1] || "unknown"
    # Keep the listing only if its availability matches one of the requested dates.
    if @available_date.empty? or @available_date.any? {|a| a == availability }
      images = listing.search("a.shadowbox_badge").map{|e| e.attribute("href").to_s }
      title = listing.search("h1")
      html += "<h1>#{title.to_html}</h1>"
      images.each do |image|
        html += "<a href='#{full_url(url)}'><img src='#{image}'/></a>"
      end
      puts full_url(url)
      html += "<br><hr><br>"
    end
  }
}
File.open("/tmp/listings.html", "w") do |f|
  f.write("<html><body>")
  f.write(html)
  f.write("</body></html>")
end
`open /tmp/listings.html`
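
The script relies on two regular-expression checks: the argument guard at the top and the availability match inside the listing loop. The sketch below pulls both into small helpers so they can be tried without hitting the site. The helper names (`valid_args?`, `availability_from`, `wanted?`) and the sample strings are hypothetical and not part of the gist; only the patterns are copied from it.

```ruby
# Hypothetical helpers (not in the gist) that mirror its regex checks.

# Same test as the guard at the top of the script: the first argument
# must start with "http" and the second must contain "now" or a run of
# digits and slashes.
def valid_args?(argv)
  !!(argv[0] =~ /^http/ && argv[1] =~ /(now|[\d\/]+)/)
end

# Same pattern the script applies to the "div.price div.small" markup.
# Returns the matched date (or "now"), or "unknown" when nothing matches.
def availability_from(text)
  text.scan(/available( on)?\s+([now\/0-9]+)/m).flatten[1] || "unknown"
end

# Same comparison the script uses to decide whether to keep a listing.
def wanted?(availability, dates)
  dates.any? { |d| d == availability }
end

dates = "08/01/2013,now".split(",")
puts valid_args?(["http://streeteasy.com/nyc/rentals", "08/01/2013,now"])
puts availability_from("available on 08/01/2013")
puts wanted?(availability_from("available now"), dates)
```

Under these assumptions, the guard rejects any invocation where the second argument is missing, which is the case in which the script prints its usage text and exits.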
@davidsshamilton

Hey, this seems to do what I have been wanting for a while, but I am struggling to run it. Any chance you could offer some guidance? I am on OS X and installed ruby, brew, mechanize, yaml, etc., but when I run this in Terminal, it prints the instructions for the URL and date and then exits. Am I doing something wrong? Thanks!
