@mtsuszycki
Last active June 10, 2016 14:05
Find and print DOM element from html, ruby, mechanize
#!/usr/bin/ruby
# examples
# ./htmlgrep.rb "input#gbv" "http://www.google.co.uk"
# ./htmlgrep.rb "div#gbv" "http://www.google.co.uk"
# ./htmlgrep.rb "[@class='gbm']" "http://www.google.co.uk"
# ./htmlgrep.rb "[@class='gbmc']/ol/li" "http://www.google.co.uk"
# ~/htmlgrep.rb '.postblockcat_whitesquare/a' HERE\ -\ Nokia\ Conversations.html | grep -Eo '<a href[^>]+' | sed 's/title=/,/g; s/<a href=//g;' > HERE\ -\ Nokia\ Conversations.csv
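require 'rubygems'
require 'hpricot'
require 'open-uri'

# Usage: htmlgrep.rb SELECTOR [URL_OR_FILE]
abort "usage: #{$0} SELECTOR [URL_OR_FILE]" unless ARGV[0]

# Read from the URL or file given as the second argument, else from STDIN.
# URI.open handles both URLs and local paths; a bare Kernel#open no longer
# fetches URLs on Ruby 3.0+.
file = ARGV[1] ? URI.open(ARGV[1]) : STDIN
doc = Hpricot(file)
# Hpricot's `/` operator searches the document with a CSS or XPath expression.
puts doc / ARGV[0]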
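
The last example in the header comments pipes the script's output through grep and sed to turn each link into a `href ,title` row. As a rough sketch of that post-processing step in plain Ruby (standard library only), something like the following could be used. The helper name `anchors_to_csv_lines` and the sample HTML are hypothetical and not part of the original script.

```ruby
# Hypothetical helper (not in the original gist). It mimics:
#   grep -Eo '<a href[^>]+' | sed 's/title=/,/g; s/<a href=//g;'
# Unlike grep, which works line by line, the regex here may match
# across line breaks inside a tag.
def anchors_to_csv_lines(html)
  html.scan(/<a href[^>]+/).map do |tag|
    # Same two substitutions as the sed expression, in the same order.
    tag.gsub('title=', ',').gsub('<a href=', '')
  end
end

# Made-up input standing in for the output of htmlgrep.rb.
sample = '<li><a href="http://example.com/a" title="First">A</a></li>' \
         '<li><a href="http://example.com/b" title="Second">B</a></li>'

puts anchors_to_csv_lines(sample)
```

Like the shell pipeline, this is plain text substitution, so it only works when `href` is the first attribute of the tag and `title` follows it.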