Synopsis:

Perl 6 converts code points to graphemes when it decodes text.
Unfortunately, this means some code-point sequences do not round-trip
between reading and writing a text file.

Consider a text file:

    source.txt

with contents:

    café

    $ hexdump source.txt
    0000000 63 61 66 65 cc 81

Notice that é is present as 0x65 0xcc 0x81, i.e. e + COMBINING ACUTE
ACCENT.

Now consider the following snippet:

    my $contents = "source.txt".IO.slurp;
    spurt "dest.txt", $contents;

dest.txt contains:

    $ hexdump dest.txt
    0000000 63 61 66 c3 a9

é has been converted to 0xc3 0xa9, i.e. the single code point LATIN
SMALL LETTER E WITH ACUTE. In other words, the original bytes were lost
even though the code never explicitly modified the contents.

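The byte change is consistent with Unicode canonical composition (NFC)
being applied somewhere between decoding and encoding. As a rough
illustration outside Perl 6, the following Python sketch uses the
standard unicodedata module to reproduce the same two byte sequences.
It only mimics the observed effect and makes no claim about how Perl 6
implements it; the variable names are made up for this example.

```python
import unicodedata

# "caf" + "e" (U+0065) + COMBINING ACUTE ACCENT (U+0301), as in source.txt
decomposed = "caf\u0065\u0301"

# NFC composes e + combining acute into the single code point U+00E9
composed = unicodedata.normalize("NFC", decomposed)

source_bytes = decomposed.encode("utf-8")
dest_bytes = composed.encode("utf-8")

print(source_bytes.hex(" "))  # same bytes as the source.txt hexdump
print(dest_bytes.hex(" "))    # same bytes as the dest.txt hexdump
```
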
I understand why this happens, and that it may be the result of a
conscious design choice, but I find it concerning because it will be a
source of hard-to-track-down bugs.

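One way to catch this class of bug early is to check, before a file
passes through a normalizing reader, whether its bytes would survive the
trip. The Python sketch below assumes the normalization behaves like NFC
on output, which matches the hexdumps above but is an assumption rather
than a statement about Perl 6 internals. The helper name
survives_nfc_round_trip is hypothetical.

```python
import unicodedata


def survives_nfc_round_trip(data: bytes) -> bool:
    """Return True if decode -> NFC normalize -> encode leaves the bytes unchanged."""
    text = data.decode("utf-8")
    return unicodedata.normalize("NFC", text).encode("utf-8") == data


# Decomposed é (e + U+0301), as stored in source.txt
print(survives_nfc_round_trip(b"cafe\xcc\x81"))
# Precomposed é (U+00E9), as stored in dest.txt
print(survives_nfc_round_trip(b"caf\xc3\xa9"))
```

Files for which the check returns False are the ones whose bytes would
silently change in a read-then-write cycle like the one shown earlier.
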
John Haltiwanger at work suggested trying this:

    git init io.git
    cd io.git
    cp ~/Desktop/source.txt ./
    git add .
    git commit -a -m "Initial commit"
    cp ../dest.txt ./source.txt
    git diff

    diff --git a/source.txt b/source.txt
    index fa04539..1c2e52c 100644
    --- a/source.txt
    +++ b/source.txt
    @@ -1 +1 @@
    -café
    \ No newline at end of file
    +café
    \ No newline at end of file

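The removed and added lines in that diff usually render identically, so
the diff alone does not show what changed. A small Python sketch like
the one below can make the difference visible by listing the code points
on each side. The strings are typed in with escapes to match the two
hexdumps, and the helper name describe is made up for this example.

```python
import unicodedata


def describe(text: str) -> list[str]:
    """List each code point in text as 'U+XXXX NAME'."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}" for ch in text]


old_line = "cafe\u0301"  # committed source.txt: e + COMBINING ACUTE ACCENT
new_line = "caf\u00e9"   # copied from dest.txt: precomposed é

for label, line in (("-", old_line), ("+", new_line)):
    print(label, "; ".join(describe(line)))
```

The old line lists five code points and the new line four, with the
final entries naming the combining accent and the precomposed letter
respectively.
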
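P.S. Here is the Perl 5 script to create source.txt:

    use strict;
    use warnings;
    use 5.16.1;

    # "e" (U+0065) followed by COMBINING ACUTE ACCENT (U+0301)
    my $contents = "caf\x{65}\x{301}";

    # Write UTF-8 with no trailing newline, and fail loudly on I/O errors
    open my $fh, ">:encoding(UTF-8)", "source.txt"
        or die "Cannot open source.txt: $!";
    print $fh $contents;
    close($fh) or die "Cannot close source.txt: $!";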