@takuya
Last active November 17, 2021 00:17
#!/usr/bin/env ruby
require 'nokogiri'
require 'pp'
require 'nkf'
require 'open-uri'
require "optparse"
$verbose = true
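# Evaluate an XPath expression against a document and print the result.
# The input is parsed with Nokogiri's HTML parser.
def xpath_main(xml_str, expression)
  doc = Nokogiri::HTML(xml_str)
  ret = doc.xpath(expression)
  if ret.is_a?(Nokogiri::XML::NodeSet)
    # Node-set result: print the (empty) set itself when nothing matched,
    # otherwise print each matching node.
    puts ret.inspect if ret.empty?
    ret.each { |e| puts e }
    if $verbose
      $stderr.puts "--" * 10
      $stderr.puts " xpath: #{expression} "
      $stderr.puts "result: #{ret.size} found"
    end
  else
    # Scalar result (number, string or boolean), e.g. from count().
    puts ret.inspect
    if $verbose
      $stderr.puts "--" * 10
      $stderr.puts " xpath: #{expression} "
      $stderr.puts "result: #{ret.class}"
    end
  end
end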
if ARGV.size < 1
  str = <<-"EOS"
Usage:
  #{File.expand_path(__FILE__)} [filename or url] xpath_expression
  If no filename is given, supply HTML or XML on STDIN.

Sample XPath expressions:
  //*                                All nodes.
  (//*)[1]                           The first node in the document.
  (//a[1])                           Every a node that is the first a child of its parent.
  //a//span                          span nodes that are descendants of an a node.
  //a/@href                          The href attribute of every a node.
  //a[@href="/index.html"]           a nodes whose href is exactly "/index.html".
  //a[contains(@href,"index.html")]  a nodes whose href contains "index.html".
  //title | //meta                   All <title> and <meta> nodes (union of both results).
  //img[contains(@src,'jpg') or contains(@src,'png')]
                                     img nodes whose src contains "jpg" or "png".
  //form[.//input[@name="username"]] form nodes that contain an input named "username".
  //div[@id="main"]//form            form nodes that are descendants of the div with id "main".
  //div/*                            All child nodes of div nodes.
  //div//*                           All descendant nodes of div nodes.
  //table//td[2]                     Inside a table, every td that is the second td child of its parent.

Example functions:
  count(//a)                         Number of a nodes.
  substring(//title, 1, 2)           First two characters of the title text.
  EOS
  puts str
  exit
end
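# The first argument is a file name or URL when two are given; otherwise read STDIN.
# URLs go through URI.open, since Kernel#open stopped fetching them in Ruby 3.0.
src = ARGV[0]
f = (ARGV.size > 1) ? ((src =~ %r{\Ahttps?://}) ? URI.open(src) : File.open(src)) : STDIN
expression = (ARGV.size > 1) ? ARGV[1] : ARGV[0]
str = f.read
str = NKF.nkf("-w", str) # convert the input to UTF-8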
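# If the input is nothing but a list of URLs (one per line), fetch each URL
# and run the expression against it; otherwise treat the input as the document.
urls = str.split("\n")
if str =~ %r{^https?://} && urls.all? { |e| e =~ /^http/ }
  urls.each do |url|
    body = URI.open(url.strip).read
    xpath_main(body, expression)
  end
else
  xpath_main(str, expression)
end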
# Copyright (C) 2014 takuya_1st
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <http://www.gnu.org/licenses/>.
#
#
#
#