Nokogiri results
# Get a URL's content to play with
content = contents[contents.keys.first]
# => [#<Nokogiri::XML::Element:0x48a9a86 name="p" children=[#<Nokogiri::XML::Text:0x48a9810 "The autobiography of Yohwan Lim, ">, #<Nokogiri::XML::Element:0x48a96ee name="i" children=[#<Nokogiri::XML::Text:0x48a9536 "Crazy As Me">]>, #<Nokogiri::XML::Text:0x48a931a " was released in Korea by BookRoad Publishers in October 25, 2004. This is my translation of the book, except the following four sections which were translated by BinaryStar of Teamliquid.net, which I have made minor changes: \"Hope on the Road Not Taken,\" \"Chapter One: The Game-crazed Kid,\" \"The Birth of the Emperor,\" and \"The Little Prince with Three Sisters.\"">]>, #<Nokogiri::XML::Element:0x48a9194 name="p" children=[#<Nokogiri::XML::Text:0x48a9022 "As of October 4, 2004:">]>, #<Nokogiri::XML::Element:0x48a8e92 name="p">, #<Nokogiri::XML::Element:0x48a8ce4 name="p" children=[#<Nokogiri::XML::Text:0x48a8af0 "The addition of e-sports organizations to major companies, with spectators in the hundreds of thousands, and the advance of e-sports led by the government, is a phenomenon that displays our country's blooming vision of e-sports. The vital function to this e-sports renaissance is the PC game known as Starcraft. Since its first appearance to the world in April 1998, it has kept its throne for over 6 years among many other PC, online, and arcade games. Aided by the increase of PC cafes and their mutual benefits, with 6 million copies of the game sold in our country alone, and over 10 million users which is enough to reach the Guinness Book of Records, it has received nationwide affection.">, #<Nokogiri::XML::Element:0x48a8a14 name="br">, #<Nokogiri::XML::Element:0x48a885c name="br">, #<Nokogiri::XML::Text:0x48a8654 "E-Sports, with the representation of Starcraft, has increasingly expanded its territory and created at least 200,000 related occupations, completely rejuvenating the related industries. 
Moreover, it has had extensive effects socially, economically, and culturally, enough for professional gaming to be the youth's most desired occupation. The person who has played a crucial role in intensifying such love for Starcraft is the progamer Lim Yohwan.">, #<Nokogiri::XML::Element:0x48a855a name="br">, #<Nokogiri::XML::Element:0x48a83d4 name="br">, #<Nokogiri::XML::Text:0x48a8230 "Receiving affection from the fans and media, which could be considered as the most important factor to e-sports, ">, #<Nokogiri::XML::Element:0x48a81ae name="strong" children=[#<Nokogiri::XML::Text:0x48a185e "Lim Yohwan, with the thorough mentality of a professional as his foundation, has imprinted on the minds of the public through his sincere games that progamers are not \"game-addicts without any prudence,\" but \"hard-working professionals.\"">]>, #<Nokogiri::XML::Element:0x48a161a name="br">, #<Nokogiri::XML::Element:0x48a13cc name="br">, #<Nokogiri::XML::Text:0x48a11e2 "The unrelenting efforts of Lim Yohwan that are placed in this book vividly portray the movement and evolution of our country's e-sports. 
Furthermore, by uncovering a realistic view of the spectacular progamers, I believe that the book acts as a compass to the youth, telling them what they need to keep in mind if they are to realize their dreams of becoming progamers.">, #<Nokogiri::XML::Element:0x48a10ac name="br">, #<Nokogiri::XML::Element:0x48a0e90 name="br">, #<Nokogiri::XML::Text:0x48a0ada "As a fellow e-sports member, I would like to again congratulate the publication of this book, and hope that through it many people will be able to have the correct understanding of e-sports and progamers.">, #<Nokogiri::XML::Element:0x48a0a12 name="br">, #<Nokogiri::XML::Element:0x48a05bc name="br">]>, #<Nokogiri::XML::Element:0x48a0198 name="p" attributes=[#<Nokogiri::XML::Attr:0x48a0166 name="align" value="right">] children=[#<Nokogiri::XML::Text:0x489fcc0 "October 2004">, #<Nokogiri::XML::Element:0x489fc02 name="br">, #<Nokogiri::XML::Text:0x489fa36 "Korea E-Sports Association President">, #<Nokogiri::XML::Element:0x489f928 name="br">, #<Nokogiri::XML::Text:0x489dbc8 "Kim Yungman">]>, #<Nokogiri::XML::Element:0x489d966 name="p" children=[#<Nokogiri::XML::Element:0x489d65a name="br">, #<Nokogiri::XML::Element:0x489d4a2 name="br">]>]
puts content.collect { |c| just_text c }.join("\n")
#The autobiography of Yohwan Lim, Crazy As Me was released in Korea by BookRoad Publishers in October 25, 2004. This is my translation of the book, except the following four sections which were translated by BinaryStar of Teamliquid.net, which I have made minor changes: "Hope on the Road Not Taken," "Chapter One: The Game-crazed Kid," "The Birth of the Emperor," and "The Little Prince with Three Sisters."
#As of October 4, 2004:
#
#The addition of e-sports organizations to major companies, with spectators in the hundreds of thousands, and the advance of e-sports led by the government, is a phenomenon that displays our country's blooming vision of e-sports. The vital function to this e-sports renaissance is the PC game known as Starcraft. Since its first appearance to the world in April 1998, it has kept its throne for over 6 years among many other PC, online, and arcade games. Aided by the increase of PC cafes and their mutual benefits, with 6 million copies of the game sold in our country alone, and over 10 million users which is enough to reach the Guinness Book of Records, it has received nationwide affection.E-Sports, with the representation of Starcraft, has increasingly expanded its territory and created at least 200,000 related occupations, completely rejuvenating the related industries. Moreover, it has had extensive effects socially, economically, and culturally, enough for professional gaming to be the youth's most desired occupation. The person who has played a crucial role in intensifying such love for Starcraft is the progamer Lim Yohwan.Receiving affection from the fans and media, which could be considered as the most important factor to e-sports, Lim Yohwan, with the thorough mentality of a professional as his foundation, has imprinted on the minds of the public through his sincere games that progamers are not "game-addicts without any prudence," but "hard-working professionals."The unrelenting efforts of Lim Yohwan that are placed in this book vividly portray the movement and evolution of our country's e-sports. 
#Furthermore, by uncovering a realistic view of the spectacular progamers, I believe that the book acts as a compass to the youth, telling them what they need to keep in mind if they are to realize their dreams of becoming progamers.As a fellow e-sports member, I would like to again congratulate the publication of this book, and hope that through it many people will be able to have the correct understanding of e-sports and progamers.
#October 2004Korea E-Sports Association PresidentKim Yungman
# Each paragraph consists of a bunch of children. Some are text, some are tags, like <i>, with their own children (the content of the tags).
#
# So to get *just* the text, without any markup, we need a recursive function that descends into each child and collects the text nodes.
def just_text(el)
  s = ""
  el.children.each do |child|
    s += (child.name == "text") ? child.text : just_text(child)
  end
  s
end
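# A minimal sketch of the same recursion on a stand-in tree, so it can be run without Nokogiri or a network connection.
# FakeNode, text_node, element and just_text_demo are hypothetical names used only for illustration; FakeNode mimics
# just the three methods the recursion relies on (name, text, children) and is not part of Nokogiri.

```ruby
# Hypothetical stand-in for a Nokogiri node: only name, text and children.
FakeNode = Struct.new(:name, :text, :children)

# Build a leaf that behaves like a text node (its name is "text").
def text_node(str)
  FakeNode.new("text", str, [])
end

# Build an element such as <p> or <i> with any number of children.
def element(name, *children)
  FakeNode.new(name, nil, children)
end

# Same idea as just_text: keep text nodes, descend into everything else.
def just_text_demo(el)
  parts = el.children.map do |child|
    child.name == "text" ? child.text : just_text_demo(child)
  end
  parts.join
end

para = element("p",
               text_node("The autobiography of Yohwan Lim, "),
               element("i", text_node("Crazy As Me")),
               text_node(" was released in Korea."),
               element("br"))

puts just_text_demo(para)
# prints: The autobiography of Yohwan Lim, Crazy As Me was released in Korea.
```

# The empty <br> contributes nothing, which is why sentences separated only by <br> tags run together in the output.
# Real Nokogiri nodes also have a built-in text method that should give the same concatenated result.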
# The "top level" elements are the paragraph tags, because that's what we asked for.
content.length
# => 6
# Each paragraph has a number of children; let's look at one.
content.first.inspect
# => "#<Nokogiri::XML::Element:0x48a9a86 name=\"p\" children=[#<Nokogiri::XML::Text:0x48a9810 \"The autobiography of Yohwan Lim, \">, #<Nokogiri::XML::Element:0x48a96ee name=\"i\" children=[#<Nokogiri::XML::Text:0x48a9536 \"Crazy As Me\">]>, #<Nokogiri::XML::Text:0x48a931a \" was released in Korea by BookRoad Publishers in October 25, 2004. This is my translation of the book, except the following four sections which were translated by BinaryStar of Teamliquid.net, which I have made minor changes: \\\"Hope on the Road Not Taken,\\\" \\\"Chapter One: The Game-crazed Kid,\\\" \\\"The Birth of the Emperor,\\\" and \\\"The Little Prince with Three Sisters.\\\"\">]>"
# Now we'll combine it all into one giant thingie.
require 'rubygems'
require 'nokogiri'
require 'open-uri'
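# Recursively collect the text nodes under el, skipping markup.
def just_text(el)
  txt = ""
  el.children.each do |child|
    txt += (child.name == "text") ? child.text : just_text(child)
  end
  txt
end

root_url = "http://boxerbiography.blogspot.com/2006/11/table-of-contents.html"
# URI.open comes from open-uri; on Ruby 3.0+ a bare open(url) no longer fetches URLs.
root_site = Nokogiri::HTML(URI.open(root_url))
link_contents = {}
root_site.css(".entry a").each do |link|
  link_url = link[:href]
  puts "Fetching #{link_url}..."
  full_contents = Nokogiri::HTML(URI.open(link_url))
  # One line per paragraph. Appending "\n" inside just_text would also break lines after every inline tag.
  paragraphs = full_contents.css("#top p")
  link_contents[link_url] = paragraphs.map { |para| just_text(para) }.join("\n")
end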
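# The loop above assumes every link carries an absolute href. A hedged sketch of how the hrefs could be cleaned up first,
# using only the standard library: absolute_links is a hypothetical helper, and the sample hrefs below are made up for
# illustration (only the table-of-contents URL comes from the script).

```ruby
require 'uri'

# Hypothetical helper: turn raw href values into absolute URLs,
# dropping missing/empty values and in-page anchors such as "#top".
def absolute_links(base_url, hrefs)
  hrefs.compact
       .map(&:strip)
       .reject { |href| href.empty? || href.start_with?("#") }
       .map { |href| URI.join(base_url, href).to_s }
       .uniq
end

base = "http://boxerbiography.blogspot.com/2006/11/table-of-contents.html"

# Made-up sample values: a site-relative path, a page-relative path,
# a missing href and an in-page anchor.
sample_hrefs = ["/2006/11/sample-one.html", "sample-two.html", nil, "#top"]

puts absolute_links(base, sample_hrefs)
```

# With Nokogiri this could be fed from root_site.css(".entry a").map { |a| a[:href] }.
# URI.join raises URI::InvalidURIError for malformed hrefs, so a real run may want a rescue around it.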
# Now dump it all to files based on the URL.
link_contents.each do |url, text|
  fname = url[url.rindex("/")+1..-1].gsub(".html", ".txt")
  File.open(fname, "w") { |f| f.write(text) }
end
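# The filename above is cut out of the URL by hand, which would misbehave if a URL had a query string or ended in "/".
# A small sketch of an alternative using URI and File.basename; txt_filename is a hypothetical helper, and the fallback
# name "index" and the example.com URLs are assumptions made for illustration.

```ruby
require 'uri'

# Hypothetical helper: derive a .txt filename from a page URL.
def txt_filename(url)
  path = URI.parse(url).path          # path only: no ?query or #fragment
  base = File.basename(path, ".*")    # last path segment without its extension
  base = "index" if base.empty? || base == "/"  # URLs with no usable segment
  "#{base}.txt"
end

puts txt_filename("http://boxerbiography.blogspot.com/2006/11/table-of-contents.html")
# prints: table-of-contents.txt
```

# Usage would be File.open(txt_filename(url), "w") { |f| f.write(text) } inside the loop above.
# Two pages with the same basename would still overwrite each other, just as in the original.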