@inutano
Created April 20, 2012 03:55
nature_parser.rb
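# -*- coding: utf-8 -*-
require "open-uri"
require "nokogiri"

# Extracts the abstract, main text, and methods of a Nature article page.
class NatureParser
  def initialize(page_url)
    # URI.open is needed on Ruby 3.0+, where Kernel#open no longer accepts URLs.
    @nkgr_main = Nokogiri::HTML(URI.open(page_url))
  end

  def abstract
    @nkgr_main.css("#first-paragraph p")
  end

  # Paragraphs whose parent is the main content block itself.
  def main_text_nodeset
    @nkgr_main.css("#main .content p").select { |n| n.parent.attr("class") == "content" }
  end

  # Headings and paragraphs of the methods section, in document order.
  def methods_nodeset
    @nkgr_main.css("#methods .content").children.select { |n| ["h2", "p"].include?(n.name) }
  end
end

# Joins the text of each node, dropping <sup> elements whose first child is a
# link (citation markers). Safe navigation guards against an empty <sup>.
def exclude_sup(nodeset)
  nosup = nodeset.flat_map { |p| p.children.reject { |n| n.name == "sup" && n.child&.name == "a" } }
  nosup.map(&:inner_text).join
end

if __FILE__ == $PROGRAM_NAME
  url = "http://www.nature.com/nature/journal/vaop/ncurrent/full/nature10750.html"
  np = NatureParser.new(url)
  [np.abstract, np.main_text_nodeset, np.methods_nodeset].each do |nodeset|
    puts exclude_sup(nodeset)
  end
end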