@kamui
Last active January 2, 2016 18:59
A Ruby script that polls a page and watches a specific DOM node for changes. When the node's content changes, it logs the new content and, if enabled, emails it to you. Requires the `nokogiri` and `pony` gems. http://jackchu.com/articles/2014/01/10/web-scraping-content-changes-with-nokogiri/
require 'nokogiri'
require 'open-uri'
require 'pony'
require 'digest'

# Page to poll and the CSS selector of the node to watch.
URL = 'https://encrypted.google.com/'.freeze
CSS_SELECTOR = '#lga'.freeze
INTERVAL = 600 # seconds between checks
USER_AGENT = "Ruby/#{RUBY_VERSION}".freeze

# Email settings; mail is only sent when SEND_EMAIL is true.
SEND_EMAIL = false
MAIL_FROM = 'u1@example.com'.freeze
MAIL_TO = 'u2@example.com'.freeze
MAIL_OPTIONS = {
  address: 'smtp.gmail.com',
  port: '587',
  enable_starttls_auto: true,
  user_name: 'user',
  password: 'password',
  authentication: :plain, # :plain, :login, :cram_md5, no auth by default
  domain: 'localhost.localdomain' # the HELO domain provided by the client to the server
}.freeze

Pony.options = {
  from: MAIL_FROM,
  via: :smtp,
  via_options: MAIL_OPTIONS
}

# Digest of the node as last seen; nil until the first successful fetch.
target_digest = nil

loop do
  # URI.open is needed on Ruby 3.0+, where Kernel#open no longer handles URLs.
  doc = Nokogiri::HTML(URI.open(URL, 'User-Agent' => USER_AGENT))
  content = doc.css(CSS_SELECTOR).first
  # A hex digest is a plain string, so it can be compared without stripping raw bytes.
  content_digest = Digest::MD5.hexdigest(content.to_s)

  if target_digest.nil?
    # First pass: remember the current content as the baseline.
    target_digest = content_digest
    puts "#{Time.now}: Seeded content: #{content}"
  elsif content_digest != target_digest
    puts "#{Time.now}: #{content}"
    if SEND_EMAIL
      Pony.mail(
        to: MAIL_TO,
        subject: 'Web Scrape Script: Target website has been updated!',
        body: "Target URL: #{URL}\n\nChange: #{content}"
      )
    end
    target_digest = content_digest
  else
    puts "#{Time.now}: No change"
  end

  sleep INTERVAL
end
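
The seed/compare/update logic in the loop above can be tried without any network access or email. The following is a minimal sketch, not part of the original script: `fragment_digest` and `check_fragment` are hypothetical helper names, and the HTML strings stand in for what `doc.css(CSS_SELECTOR).first` would return on successive polls.

```ruby
require 'digest'

# Hypothetical helper (not in the original script): hex MD5 digest of a
# fragment's string form. A nil fragment is hashed as the empty string.
def fragment_digest(fragment)
  Digest::MD5.hexdigest(fragment.to_s)
end

# Hypothetical helper: compares a fragment against the last seen digest.
# Returns [status, digest], where status is :seeded, :changed or :unchanged.
def check_fragment(last_digest, fragment)
  current = fragment_digest(fragment)
  status =
    if last_digest.nil?
      :seeded
    elsif current != last_digest
      :changed
    else
      :unchanged
    end
  [status, current]
end

# Simulate three polls with fixed strings instead of live page fetches.
polls = [
  '<div id="lga">A</div>',
  '<div id="lga">A</div>',
  '<div id="lga">B</div>'
]

digest = nil
statuses = polls.map do |html|
  status, digest = check_fragment(digest, html)
  status
end

p statuses # prints [:seeded, :unchanged, :changed]
```

Keeping the comparison in a small function like this makes it straightforward to check the three branches in isolation before pointing the script at a real page.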