require "rubygems"
require "nokogiri"

# SAX handler that collects the text found between the
# <!-- someComment --> and <!-- /someComment --> markers.
class PlainTextExtractor < Nokogiri::XML::SAX::Document
  attr_reader :plaintext

  # Start outside the region of interest, with an empty buffer.
  def initialize
    @interesting = false
    @plaintext = String.new
  end

  # Called whenever a comment occurs; the comment's text
  # is passed in as a string.
  def comment(string)
    case string.strip # strip leading and trailing whitespace
    when /^someComment/ # match the opening comment
      @interesting = true
    when /^\/someComment/ # match the closing comment
      @interesting = false
    end
  end

  # Called with every run of text found between tags.
  def characters(string)
    @plaintext << string if @interesting
  end
end

pte = PlainTextExtractor.new
parser = Nokogiri::HTML::SAX::Parser.new(pte)
parser.parse_file ARGV[0]
puts pte.plaintext
<title>Some Title</title>
<h2>Here goes some heading we are not interested in.</h2>
<!-- someComment -->
Here it goes. We are interested in this text. <br>
But <b>some</b> words are wrapped with HTML-Tags we are <i>not</i>
interested in.
<a href="bar">Or links,..</a>
<td>Or a Table,...</td>
<!-- /someComment -->
But we do NOT care about this.
<!-- foo -->
Even if it is wrapped in another comment.
<!-- /foo -->
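
The comment-toggle idea used by the extractor can also be tried without Nokogiri. The following is only a rough sketch using Ruby's standard-library `StringScanner`; the method name `extract_marked_text` and its `marker` parameter are hypothetical and not part of the original script. It is a regex-based scan, not a real HTML parser, so it does not decode entities or repair broken markup the way the SAX parser does.

```ruby
require "strscan"

# Hypothetical helper (an assumption, not from the original script):
# returns the text between <!-- marker --> and <!-- /marker -->,
# with any tags in between dropped.
def extract_marked_text(html, marker = "someComment")
  scanner = StringScanner.new(html)
  interesting = false
  plaintext = String.new

  until scanner.eos?
    if scanner.scan(/<!--(.*?)-->/m)
      # A comment: flip the flag when it is one of the two markers.
      comment = scanner[1].strip
      if comment.start_with?("/#{marker}")
        interesting = false
      elsif comment.start_with?(marker)
        interesting = true
      end
    elsif scanner.scan(/<[^>]*>/)
      # A tag: skip it, only the text is wanted.
    else
      # Plain text up to the next "<" (or a stray "<" on its own).
      text = scanner.scan(/[^<]+/) || scanner.getch
      plaintext << text if interesting
    end
  end

  plaintext
end

sample = "skip <!-- someComment -->keep <b>this</b><!-- /someComment --> drop"
puts extract_marked_text(sample)
```

As in the SAX version, a single boolean tracks whether the scanner is currently inside the marked region, so other comments such as `<!-- foo -->` leave the state untouched.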