@hoehrmann
Created October 23, 2013 20:48
#!perl -w
use Modern::Perl;
use XML::LibXML;
use Algorithm::Diff::XS;
#####################################################################
# Merges the Lynx rendering of an HTML document with the
# libxml2-parsed DOM representation of the same document to figure
# out which parts of the text are inside a blockquote element. This
# is useful when an html2text program has no option for formatting
# such parts differently, or, as in my case, when taking the
# text/plain alternative of an HTML mail and using the HTML markup
# to decide which parts have been quoted, since that may not be
# apparent from the text/plain part. The same approach could also be
# used to compare two HTML documents and link together the parts
# that are unmodified between them.
#####################################################################
my $path = $ARGV[0] or die "Usage: ...";
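# Run lynx through a list-form pipe open so that $path is passed as a
# single argument and is never interpreted by a shell.
open(my $lynx, '-|', 'lynx', '-dump', '-width', '10000000', '-nolist', $path)
  or die "Cannot run lynx: $!";
my $text = do { local $/; <$lynx> };
close($lynx) or warn "lynx exited with status $?\n";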
my $doc = XML::LibXML->load_html(location => $path);
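# Split every relevant text node into tokens (a run of word characters
# or a single non-word character) and remember the node of each token.
my @from_html = map {
  my $node = $_;
  map {
    {
      word => $_,
      node => $node,
    }
  } $_->nodeValue =~ /([^\W_]+|[\W_])/gs;
} $doc->findnodes('//text()[not(ancestor::head | ancestor::script
  | ancestor::style )]');
# Tokenize the Lynx output in the same way.
my @from_text = map {
  {
    word => $_,
  }
} $text =~ /([^\W_]+|[\W_])/gs;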
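# Align the two token streams, comparing tokens by their word.
my @diff = Algorithm::Diff::sdiff(
  \@from_html,
  \@from_text,
  sub {
    return $_[0]->{word};
  });
# For every unchanged ('u') pair, attach the DOM node of the HTML
# token to the matching token from the Lynx output.
for (@diff) {
  next unless $_->[0] eq 'u';
  $_->[2]->{corresponding_node} = $_->[1]->{node};
}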
# Print one token per line, marking tokens that come from a blockquote.
for my $word (@from_text) {
  my $in_blockquote = $word->{corresponding_node}
    && $word->{corresponding_node}->exists('ancestor::blockquote');
  if ($in_blockquote) {
    print "[quoted] ";
  }
  say $word->{word};
}
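
#####################################################################
# Sketch only, not part of the original script: one possible way to
# apply the same token alignment to two HTML documents, as the header
# comment suggests. tokens_of() and link_unmodified() are hypothetical
# helper names introduced for this illustration; nothing above calls
# them. The two `use` lines repeat modules that are already loaded, so
# that the sketch can be lifted out of this file on its own.
#####################################################################
use XML::LibXML;
use Algorithm::Diff;

# Returns one { word, node } hash per token of the document's visible
# text, using the same tokenization as the main script.
sub tokens_of {
  my ($doc) = @_;
  return map {
    my $node = $_;
    map { +{ word => $_, node => $node } }
      $node->nodeValue =~ /([^\W_]+|[\W_])/gs;
  } $doc->findnodes(
    '//text()[not(ancestor::head | ancestor::script | ancestor::style)]');
}

# Returns a list of [ node_in_a, node_in_b ] pairs, one pair per token
# that sdiff reports as unchanged between the two documents. The same
# pair of nodes is therefore repeated once per shared token.
sub link_unmodified {
  my ($doc_a, $doc_b) = @_;
  my @tokens_a = tokens_of($doc_a);
  my @tokens_b = tokens_of($doc_b);
  my @pairs;
  for my $hunk (Algorithm::Diff::sdiff(
      \@tokens_a, \@tokens_b, sub { $_[0]->{word} })) {
    my ($flag, $left, $right) = @$hunk;
    next unless $flag eq 'u';
    push @pairs, [ $left->{node}, $right->{node} ];
  }
  return @pairs;
}

# Possible usage (the file names are placeholders):
#   my @pairs = link_unmodified(
#     XML::LibXML->load_html(location => 'old.html'),
#     XML::LibXML->load_html(location => 'new.html'));
#   print $_->[0]->nodePath, ' <=> ', $_->[1]->nodePath, "\n" for @pairs;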