@dtonhofer
Last active May 7, 2019 10:32
Comparing behaviour of Perl Data::Dumper when using "Pure Perl" and "XS" mode for non-iso-8859-1 codepoints
#!/usr/bin/perl
# ===
# Testing what Perl's Data::Dumper does with "high" characters e.g.
#
# å -> iso-8859-1 : 0xE5
# Unicode UTF-16 : 0x00E5
# Unicode UTF-8 : 0xC3A5
#
# See also:
#
# https://stackoverflow.com/questions/50489062/how-to-display-readable-utf-8-strings-with-datadumper
#
# Note that the "utf8" pragma is on.
# The character 'å' is encoded in UTF-8 *in this program file* and
# the pragma tells Perl that this is so!
#
# We also tell Perl that STDERR and STDOUT and the test files to write
# are/shall-be UTF-8 encoded.
#
# We find:
# ========
#
# The string:
#
# 'Nuuk (Godthåb)'
#
# - is written by Data::Dumper, pure Perl implementation, as UTF-8 (not ISO-8859-1)
# but for higher characters, e.g. "Ч" (Cyrillic Che), the implementation switches
# to ASCII-based escaping (Perl string escaping) of the Unicode code point: "\x{427}" (U+0427)
#
# - is written by Data::Dumper, XS implementation, as ASCII-based escaping of
# the code point: "\x{e5}" (same value as the ISO-8859-1 byte, not the UTF-8 bytes).
# Higher characters are escaped the same way, e.g. "\x{427}".
# The implementation seems to eagerly escape anything beyond 7 bit.
#
# In both cases, reading back as UTF-8 works!
# ===
use strict;
use warnings;
use utf8; # Meaning "This lexical scope (i.e. file) contains utf8"
use File::Temp qw(tempfile tempdir);
use File::Spec::Functions qw(catfile);
use Data::Dumper;
# ---
# To print accented chars correctly to STDERR/STDOUT, supposed to be in UTF-8.
# https://perldoc.perl.org/perlunifaq.html
# ---
binmode STDERR, ':encoding(UTF-8)';
binmode STDOUT, ':encoding(UTF-8)';
# ---
# Just call dat main!
# ---
_main();
# ===
# 1) Define data
# 2) Create temporary directory
# 3) For Perl and XS implementation of Data::Dumper:
# 1) Dump data to file in said temporary directory
# 2) Read file back and eval it
# 3) Compare original data and data resulting from eval
# ===
sub _main {
    my $data = { EGKK => "London Gatwick"
                ,BGAA => "Aasiaat (Egedesminde)"
                ,BGSF => "Kangerlussuaq (Søndre Strømfjord)"
                ,BGGH => "Nuuk (Godthåb)"
                ,USHQ => "Белоя́рский"
                ,USCC => "Челя́бинск"
                ,TFFR => "Aéroport Guadeloupe - Pôle Caraïbes"
                ,BKPR => "Aeroporti Ndërkombëtar i Prishtinës 'Adem Jashari'"
               };
    determineUtf8Flags($data);
    my $outdir = makeTmpDir();
    # $usePerl = 1 --> use pure Perl implementation
    # $usePerl = 0 --> use XS implementation
    my $names = { 1 => 'pure_perl', 0 => 'xs' };
    for my $usePerl ( qw( 0 1 ) ) {
        my $fqfn = makeFullyQualifiedFilename($outdir,$$names{$usePerl});
        {
            open(my $fh,">:encoding(UTF-8)", $fqfn) || die "Could not open file '$fqfn' for writing: $!";
            $$data{used} = "Data::Dumper, $$names{$usePerl}";
            $$data{file} = $fqfn;
            print $fh Data::Dumper->new([$data])->Useperl($usePerl)->Purity(1)->Sortkeys(1)->Dump;
            # parentheses + "or": "close $fh || die" would parse as "close($fh || die ...)"
            close($fh) or die "Could not close file '$fqfn' after writing: $!";
        }
        my $reData = slurpAndEval($fqfn,"data");
        determineUtf8Flags($reData);
        for my $key (sort keys %$data) {
            next if ($key eq 'used' || $key eq 'data' || $key =~ /utf8/);
            my $orig = $$data{$key};
            die "No key '$key' in data extracted from '$fqfn'" unless exists $$reData{$key};
            my $reValue = $$reData{$key};
            if ($reValue ne $orig) {
                print STDERR "Key '$key': Previously '$orig', afterwards '$reValue'\n"
            }
            else {
                print STDERR "Key '$key': No change\n"
            }
        }
    }
    print STDERR "Running a 'diff --side-by-side'!\n";
    system ("diff", "--side-by-side", makeFullyQualifiedFilename($outdir, $$names{0}), makeFullyQualifiedFilename($outdir, $$names{1}));
}
sub makeTmpDir {
    my $outdir = tempdir("test_XXXX", DIR => '/tmp') || die "Could not create temporary directory: $!";
    print STDERR "Output goes to files in directory '$outdir' (this directory will not be automatically removed later!)\n";
    return $outdir
}
sub makeFullyQualifiedFilename {
    my($dir,$impl) = @_;
    return catfile($dir,"$impl.dump")
}
sub slurpAndEval {
    my($fn,$name) = @_;
    my $txt;
    {
        open(my $fh, '<:encoding(UTF-8)', $fn) or die "Could not open file '$fn' for reading: $!";
        # https://perlmaven.com/slurp
        # - undefine the record terminator to NOT break apart input!
        # - make sure this is a local variable so as not to stress anyone else
        local $/ = undef;
        $txt = <$fh>;
        close $fh;
        # no need to reset the record terminator by hand: "local" restores
        # the previous value of $/ when this block ends
    }
    # Danger Will Robinson!! We are using EVAL, so the data better be gud (i.e. not include a call to rm -rf for example)!
    # Assume the text to eval assigns $VAR1
    # >>>
    my $VAR1;
    eval($txt);
    # <<<
    die "Error in eval of $name content from file '$fn': $@" unless $VAR1;
    my $len = scalar (keys %$VAR1);
    print STDERR "Read '$name' content from file '$fn' ($len elements found in undumped hash)\n";
    return $VAR1
}
sub determineUtf8Flags {
    my($data) = @_;
    for my $key (sort keys %$data) {
        next if ($key eq 'used' || $key eq 'data' || $key =~ /utf8/);
        my $str = $$data{$key};
        my $val;
        if (utf8::is_utf8($str)) { $val = 'yes' } else { $val = 'no' }
        $$data{"${key}_utf8"} = $val
    }
}
@dtonhofer
Author

When run (on Unix), the program produces the following output.

After the first three lines, a "diff" of the output created with the "pure Perl" implementation and the "XS" implementation is shown. The files created during the test remain in the indicated tmp directory.

The "pure Perl" implementation writes UTF-8. The text "Nuuk (Godthåb)" is correctly displayed on a UTF-8 terminal.

The "XS" implementation writes characters above 7 bit as \x{...} escapes, so the same text is shown as "Nuuk (Godth\x{e5}b)" -- 0xE5 being the codepoint for "å" in iso-8859-1 (and in Unicode). In the diff below, the left column (with the escapes) is the XS output.

Output goes to files in directory '/tmp/test_H7Xm'
Read 'data' content from file '/tmp/test_H7Xm/perl.dump' (8 elements found in undumped hash)
Read 'data' content from file '/tmp/test_H7Xm/xs.dump' (8 elements found in undumped hash)
$VAR1 = {                                                       $VAR1 = {
          'BGAA' => 'Aasiaat (Egedesminde)',                              'BGAA' => 'Aasiaat (Egedesminde)',
          'BGGH' => "Nuuk (Godth\x{e5}b)",                    |           'BGGH' => 'Nuuk (Godthåb)',
          'BGSF' => "Kangerlussuaq (S\x{f8}ndre Str\x{f8}mfjo |           'BGSF' => 'Kangerlussuaq (Søndre Strømfjord)',
          'BKPR' => "Aeroporti Nd\x{eb}rkomb\x{eb}tar i Prish |           'BKPR' => 'Aeroporti Ndërkombëtar i Prishtinës \'Ad
          'EGKK' => 'London Gatwick',                                     'EGKK' => 'London Gatwick',
          'TFFR' => "A\x{e9}roport Guadeloupe - P\x{f4}le Car |           'TFFR' => 'Aéroport Guadeloupe - Pôle Caraïbes',
          'USCC' => "\x{427}\x{435}\x{43b}\x{44f}\x{301}\x{43             'USCC' => "\x{427}\x{435}\x{43b}\x{44f}\x{301}\x{43
          'USHQ' => "\x{411}\x{435}\x{43b}\x{43e}\x{44f}\x{30             'USHQ' => "\x{411}\x{435}\x{43b}\x{43e}\x{44f}\x{30
        };                                                              };
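
A smaller way to see the same difference, without temporary files, is to dump single strings with each implementation. This is only a sketch: `dump_one` and `@results` are names made up for this example, and `Useperl(0)` silently falls back to pure Perl if the XS code is not available on the installation. According to the results above, the XS lines should show `\x{...}` escapes for both strings, while pure Perl should keep "å" as a character and escape only the Cyrillic string.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                             # the literals below are UTF-8 in this file
use Data::Dumper;

binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical helper: dump one value on a single line.
# $use_perl = 1 --> pure Perl implementation, 0 --> XS implementation
sub dump_one {
    my ($use_perl, $value) = @_;
    return Data::Dumper->new([$value])
                       ->Useperl($use_perl)
                       ->Terse(1)     # no "$VAR1 = " prefix
                       ->Indent(0)    # no line breaks
                       ->Dump;
}

my @strings = ('Nuuk (Godthåb)', 'Челя́бинск');
my @results;                          # each entry: [ implementation name, dumped text ]

for my $str (@strings) {
    for my $use_perl (0, 1) {
        my $name = $use_perl ? 'pure_perl' : 'xs';
        my $out  = dump_one($use_perl, $str);
        push @results, [ $name, $out ];
        print "$name: $out\n";
    }
}
```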

@ernix

ernix commented Jan 19, 2019

Both "Nuuk (Godth\x{e5}b)" and 'Nuuk (Godthåb)' are exactly the same string under the utf8 pragma. Try:

#!/usr/bin/perl
use strict;
use warnings;
use utf8; # << NEED THIS

if ("Nuuk (Godth\x{e5}b)" eq 'Nuuk (Godthåb)') {
    print "same\n";
}
else {
    print "nah...\n";
}

1;

So the two files dumped by the script eventually evaluate to hashes with exactly the same contents.

@dtonhofer
Author

dtonhofer commented Jan 19, 2019

Thanks ernix.

I found this: https://blog.summercat.com/perl-and-character-encoding.html and so:

use utf8; - tells Perl that the file it is reading is encoded using UTF-8

  • 'Nuuk (Godthåb)' (when displayed in an editor that expects UTF-8) is actually the UTF-8 encoded bytes of Nuuk (Godthåb)
  • "Nuuk (Godth\x{e5}b)" is an escaped sequence in Perl string syntax. The \x{e5} names a character by its Unicode code point (U+00E5), which is the character å.

So both representations are the same (again, in a UTF-8 encoded file and with Perl being told about that using use utf8;)

The interesting part is that "pure Perl", when dumping to a stream that has been marked as being UTF-8 encoded, writes characters up to 0xFF as UTF-8 (and so the dump result can be read in an editor that expects UTF-8), whereas "XS" writes every character beyond 7 bit as a \x{...} escape in Perl string syntax.

Reading the dump back from a stream that has been marked as being UTF-8 encoded and eval-ing the text yields the correct data in both cases, so everything is well (unless one wants to read the XS dump as plain UTF-8 text).

I thought there was a problem with the reading-back, but apparently not.
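
The round trip can also be checked in memory, without writing files. The following is a minimal sketch under a few assumptions: it needs Perl 5.16 or later for the `unicode_eval` feature (used here so that string `eval` treats the dump as characters regardless of the UTF8 flag), the hash `%round_trip_ok` is a name invented for this example, and, as in the script above, `eval` is only acceptable because the dumped text is our own.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                             # the literals below are UTF-8 in this file
use feature 'unicode_eval';           # Perl 5.16+: string eval works on characters
use Data::Dumper;

my $original = { BGGH => 'Nuuk (Godthåb)', USCC => 'Челя́бинск' };

my %round_trip_ok;                    # implementation flag => 1 if all values survived

for my $use_perl (0, 1) {
    my $text = Data::Dumper->new([$original])
                           ->Useperl($use_perl)
                           ->Purity(1)
                           ->Sortkeys(1)
                           ->Dump;

    # The dump assigns to $VAR1; declare it so "use strict" is satisfied.
    my $VAR1;
    eval $text;
    die "eval failed: $@" if $@;

    my $same = 1;
    for my $key (keys %$original) {
        $same = 0 unless defined $VAR1->{$key} && $VAR1->{$key} eq $original->{$key};
    }
    $round_trip_ok{$use_perl} = $same;

    my $name = $use_perl ? 'pure_perl' : 'xs';
    print "$name: ", ($same ? "round trip ok" : "round trip CHANGED the data"), "\n";
}
```

The comparison uses `eq`, which compares characters, so it does not matter whether the re-read strings carry the UTF8 flag.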

@ernix

ernix commented Jan 20, 2019

Your problem might be also related to https://perldoc.perl.org/perlunifaq.html#Data%3a%3aDumper-doesn't-restore-the-UTF8-flag%3b-is-it-broken%3f

Here's what happens: when Perl reads in a string literal, it sticks to 8 bit encoding as long as it can. (But perhaps originally it was internally encoded as UTF-8, when you dumped it.) When it has to give that up because other characters are added to the text string, it silently upgrades the string to UTF-8.

'Nuuk (Godthåb)' is a UTF-8 encoded byte string in the source file, and it ends up as a string with the UTF8 flag as long as the utf8 pragma is on, because the multi-byte character å makes Perl give up sticking to the 8-bit (ASCII/ISO-8859-1) encoding.

$ echo 'Nuuk (Godthåb)' | xxd
00000000: 4e75 756b 2028 476f 6474 68c3 a562 290a  Nuuk (Godth..b).
                                     ^^^^^
                                       å

On the other hand, "Nuuk (Godth\x{e5}b)" is, once parsed, a legit ISO-8859-1 string that doesn't have any multi-byte characters. So it won't be upgraded even with the utf8 pragma.

Perl itself internally treats them differently, depending on whether the UTF8 flag is set. But the UTF8 flag is tied to Perl variables and is completely irrelevant to strings/literals/Unicode code points/etc. in a general sense. Data::Dumper is just trying to dump a string; I guess that's why Data::Dumper doesn't restore the UTF8 flag.

#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
use utf8;
use open qw(:std :utf8);

my ($s1, $s2) = ('Nuuk (Godthåb)', "Nuuk (Godth\x{e5}b)");

if (utf8::is_utf8($s1)) {
    print qq{\$s1 has UTF-8 flag\n};
    print Dumper $s1;
}
else {
    print qq{\$s1 doesn't have UTF-8 flag\n};
    print Dumper $s1;
}

if (utf8::is_utf8($s2)) {
    print qq{\$s2 has UTF-8 flag\n};
    print Dumper $s2;
}
else {
    print qq{\$s2 doesn't have UTF-8 flag\n};
    print Dumper $s2;
}

# Force add UTF-8 flag to $s2
utf8::upgrade($s2);

if (utf8::is_utf8($s2)) {
    print qq{\$s2 has UTF-8 flag\n};
    print Dumper $s2;
}
else {
    print qq{\$s2 doesn't have UTF-8 flag\n};
    print Dumper $s2;
}

1;
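
To illustrate that the UTF8 flag is an internal detail and not part of the string's value, here is a small sketch that compares the two literals in ways that do not look at the flag. The variable names and the `%facts` hash are made up for this example; `Encode::encode` is the core module function for turning characters into bytes.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                             # the literal with "å" is UTF-8 in this file
use Encode qw(encode);

my $from_literal = 'Nuuk (Godthåb)';          # expected to carry the UTF8 flag
my $from_escape  = "Nuuk (Godth\x{e5}b)";     # expected to stay an 8-bit string

# Encode both strings to UTF-8 bytes for output or storage.
my $bytes_literal = encode('UTF-8', $from_literal);
my $bytes_escape  = encode('UTF-8', $from_escape);

my %facts = (
    equal           => ($from_literal eq $from_escape)                 ? 1 : 0,
    same_length     => (length($from_literal) == length($from_escape)) ? 1 : 0,
    same_utf8_bytes => ($bytes_literal eq $bytes_escape)               ? 1 : 0,
);

# The flags are printed for information only; all output here is plain ASCII.
printf "flag on literal: %s, flag on escape: %s\n",
    (utf8::is_utf8($from_literal) ? 'yes' : 'no'),
    (utf8::is_utf8($from_escape)  ? 'yes' : 'no');
printf "%-16s %s\n", $_, ($facts{$_} ? 'yes' : 'no') for sort keys %facts;
printf "UTF-8 bytes:     %s\n", unpack('H*', $bytes_literal);
```

Whatever the two flags turn out to be, the strings compare equal, have the same length in characters, and encode to the same UTF-8 bytes (with "å" as c3 a5, matching the xxd output above).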

@dtonhofer
Author

test script updated a bit
