@kasei
Created August 5, 2017 23:31
HTML::HTML5::Parser charset issue (Debian bug report #750946)
#!/usr/bin/env perl
# Regarding https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=750946
# There are at least two issues with the code in the bug report.
# The first looks like a bug in HTML::HTML5::Parser (or its
# dependencies), which misrecognizes the charset of the file being
# opened.
#
# The second is a bug in the reported code itself: even with a
# properly loaded $doc object (here parsed from a string literal),
# `print $doc->toString()` won't work as expected, because toString()
# returns a byte string while STDOUT has been configured to
# UTF-8-encode all output, so the bytes get encoded a second time.
# If STDOUT keeps its UTF-8 encoding layer, the bytes must be decoded
# to a character string before printing:
use strict;
use warnings;
use HTML::HTML5::Parser;
use Encode qw(encode_utf8 decode_utf8);
use utf8; # for the characters in the script.
binmode STDOUT, ':encoding(UTF-8)'; # for stdout.
my $parser = HTML::HTML5::Parser->new;
my $doc = $parser->parse_string(encode_utf8(<<"END"));
<?xml version="1.0" encoding="utf-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>title</title>
</head>
<body>
<p>é↓</p>
</body>
</html>
END
print "Charset: '", $parser->charset($doc), "'\n";
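my $bytes = $doc->toString();   # byte string in the document's encoding (UTF-8 here)
my $str = decode_utf8($bytes);  # back to characters, so the STDOUT layer encodes only once
print $str;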
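
The double-encoding problem described in the script's comments can be reproduced without the parser at all. The following is a minimal sketch using only the core Encode module. It assumes that an `:encoding(UTF-8)` layer behaves like a call to `encode_utf8` on whatever string it is handed, and the variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8 decode_utf8);

my $chars = "\x{e9}\x{2193}";        # character string "é↓": 2 characters
my $bytes = encode_utf8($chars);     # byte string: 2 + 3 = 5 octets

# Assumption: stand in for the output layer by calling encode_utf8 directly.
# Handing it bytes makes each octet look like a Latin-1 character,
# so every octet >= 0x80 is expanded to two octets.
my $double  = encode_utf8($bytes);

# Decoding first means the text is encoded exactly once on the way out.
my $correct = encode_utf8(decode_utf8($bytes));

printf "chars=%d bytes=%d double=%d correct=%d\n",
    length($chars), length($bytes), length($double), length($correct);
# prints "chars=2 bytes=5 double=10 correct=5"
```

All five octets of the UTF-8 form are above 0x7F, so the double-encoded string is exactly twice as long, which shows up as mojibake on a terminal.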

kasei commented Aug 6, 2017

I think the following patch might fix the problem, but I'm not at all familiar with the HTML5 parser code. It still passes the test suite, but I have no idea whether it might break something else.

diff -ru HTML-HTML5-Parser-0.301/lib/HTML/HTML5/Parser.pm HTML-HTML5-Parser-0.301-patched/lib/HTML/HTML5/Parser.pm
--- HTML-HTML5-Parser-0.301/lib/HTML/HTML5/Parser.pm	2013-07-08 07:12:25.000000000 -0700
+++ HTML-HTML5-Parser-0.301-patched/lib/HTML/HTML5/Parser.pm	2017-08-06 12:42:58.000000000 -0700
@@ -13,6 +13,7 @@
 use HTML::HTML5::Parser::TagSoupParser;
 use Scalar::Util qw(blessed);
 use URI::file;
+use Encode qw(encode_utf8);
 use XML::LibXML;
 
 BEGIN {
@@ -102,6 +103,11 @@
 	{
         # XXX AGAIN DO THIS TO STOP ENORMOUS MEMORY LEAKS
         my ($errh, $errors) = @{$self}{qw(error_handler errors)};
+        
+        if (utf8::is_utf8($text)) {
+        	$text	= encode_utf8($text);
+        }
+        
 		$self->{parser}->parse_byte_string(
             $opts->{'encoding'}, $text, $dom,
             sub {
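
The guard added by the patch can be tried in isolation. This is a rough sketch only: `to_utf8_bytes` is a hypothetical helper name, not part of HTML::HTML5::Parser, and it mirrors just the `utf8::is_utf8` / `encode_utf8` check from the diff above.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

# Hypothetical helper (not part of HTML::HTML5::Parser): returns UTF-8 octets
# for a flagged character string and passes any other string through unchanged.
sub to_utf8_bytes {
    my ($text) = @_;
    $text = encode_utf8($text) if utf8::is_utf8($text);
    return $text;
}

my $chars = "\x{e9}\x{2193}";        # character string "é↓" (UTF8 flag on)
my $bytes = encode_utf8($chars);     # the same text as 5 UTF-8 octets (flag off)

# Both inputs yield the same octets; the byte string is not encoded twice.
print to_utf8_bytes($chars) eq to_utf8_bytes($bytes)
    ? "same octets\n"
    : "different octets\n";
# prints "same octets"
```

One caveat with this approach: `utf8::is_utf8` reports Perl's internal string flag rather than whether the caller meant characters or bytes. A character string containing only code points below 256 may not carry the flag, in which case it would be passed through unencoded.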
