@deepakg
Last active December 2, 2015 22:17
Synopsis:
Perl 6 normalizes code points into graphemes when it reads text.
Unfortunately, this means some code-point sequences do not round-trip
between reading and writing a text file.
Consider a text file:
source.txt
with contents:
café
$ hexdump source.txt
0000000 63 61 66 65 cc 81
Notice that é is present as 0x65 0xcc 0x81, i.e. e + COMBINING ACUTE
ACCENT
Now consider the following snippet:
my $contents = "source.txt".IO.slurp;
spurt "dest.txt", $contents;
dest.txt contains:
$ hexdump dest.txt
0000000 63 61 66 c3 a9
é has been converted to 0xc3 0xa9, i.e. the single code point LATIN
SMALL LETTER E WITH ACUTE (U+00E9).
In other words, we lost the original bytes without explicitly touching
anything.
I understand why this happens, and that it might be the result of a
conscious design choice, but I find it concerning because it will be a
source of hard-to-track-down bugs.
John Haltiwanger at work suggested trying this:
git init io.git
cd io.git
cp ~/Desktop/source.txt ./
git add .
git commit -a -m "Initial commit"
cp ../dest.txt ./source.txt
git diff
diff --git a/source.txt b/source.txt
index fa04539..1c2e52c 100644
--- a/source.txt
+++ b/source.txt
@@ -1 +1 @@
-café
\ No newline at end of file
+café
\ No newline at end of file
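The two lines in the diff above render identically, so the change is easy to miss by eye. A byte-level comparison makes it visible. The following is only a minimal sketch: it assumes a scratch directory and recreates both byte sequences from the hexdumps above with printf octal escapes, rather than using the real files, and it uses od in place of hexdump.

```shell
#!/bin/sh
# Sketch only: recreate the two byte sequences from the hexdumps in a
# scratch directory (an assumption; the real files are not touched).
workdir=$(mktemp -d)

# "caf" + e (0x65) + COMBINING ACUTE ACCENT (0xcc 0x81), in octal escapes
printf 'cafe\314\201' > "$workdir/source.txt"
# "caf" + LATIN SMALL LETTER E WITH ACUTE (0xc3 0xa9), in octal escapes
printf 'caf\303\251' > "$workdir/dest.txt"

# Byte counts: the decomposed form is one byte longer
src_bytes=$(wc -c < "$workdir/source.txt" | tr -d ' ')
dst_bytes=$(wc -c < "$workdir/dest.txt" | tr -d ' ')

# Hex dumps with addresses and whitespace stripped
src_hex=$(od -An -tx1 "$workdir/source.txt" | tr -d ' \n')
dst_hex=$(od -An -tx1 "$workdir/dest.txt" | tr -d ' \n')

# cmp -s is silent and reports the result through its exit status
if cmp -s "$workdir/source.txt" "$workdir/dest.txt"; then
    same=yes
else
    same=no
fi

echo "source.txt: $src_bytes bytes, $src_hex"
echo "dest.txt:   $dst_bytes bytes, $dst_hex"
echo "identical:  $same"

rm -rf "$workdir"
```

Running the same wc, od, and cmp commands against the real source.txt and dest.txt should show the same kind of mismatch, even though both files display as café.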
P.S. Here is the Perl 5 script used to create source.txt:
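use strict;
use warnings;
use 5.16.1;
# "e" (U+0065) followed by COMBINING ACUTE ACCENT (U+0301), left decomposed
my $contents = "caf\x{65}\x{301}";
open my $fh, ">:encoding(UTF-8)", "source.txt"
    or die "Cannot open source.txt: $!";
print $fh $contents;
close($fh) or die "Cannot close source.txt: $!";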