Created
December 6, 2012 18:15
-
-
Save rurban/4226700 to your computer and use it in GitHub Desktop.
Day 7 – MIME::Base64 – On encoded strings
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
<title>Day 7 - MIME::Base64 - On encoded strings</title> | |
<h2>parrot MIME::Base64 FixedIntegerArray: index out of bounds!</h2> | |
Ronaldxs created the following parrot <a href="https://github.com/parrot/parrot/issues/813">ticket #813</a> 4 months ago: | |
<i>"Was playing with p6 MIME::Base64 and utf8 sampler page when I came across this. It seems that the parrot MIME Base64 library can't handle some UTF-8 characters as demonstrated below."</i> | |
<pre> | |
.sub go :main | |
load_bytecode 'MIME/Base64.pbc' | |
.local pmc enc_sub | |
enc_sub = get_global [ "MIME"; "Base64" ], 'encode_base64' | |
.local string result_encode | |
result_encode = enc_sub(utf8:"\x{203e}") | |
say result_encode | |
.end</pre> | |
<code>FixedIntegerArray: index out of bounds!</code> | |
<code>current instr.: 'parrot;MIME;Base64;encode_base64' | |
pc 163 (runtime/parrot/library/MIME/Base64.pir:147)</code> | |
<code>called from Sub 'go' pc 11 (die_utf8_base64.pir:8)</code> | |
This was interesting, because parrot strings store the encoding information in the string. The user does not need to store the string encoding information somewhere else as in perl5, nor have to do educated guesses about the encoding. parrot supports ascii, latin1, binary, utf-8, ucs-2, utf-16 and ucs-4 string encodings natively. | |
So we thought we the hell cannot parrot handle simple utf-8 encoded strings? | |
As it turned out, the parrot implementation of MIME::Base64, which can be shared to all languages which use parrot as VM, stored the character codepoints for each character as array of integers. On multibyte encodings such as UTF-8 this leads to different data held in memory than a normal multibyte string which is encoded as the byte buffer and the additional encoding information. | |
<h2>Internal string representations</h2> | |
For example an overview of different internal string representations for the utf-8 string <b>"\x{203e}"</b>: | |
perl5 strings: | |
<pre>len=3, utf-8 flag, "\342\200\276" buf=[e2 80 be]</pre> | |
parrot strings: | |
<pre>len=1, bufused=3, encoding=utf-8, buf=[e2 80 be]</pre> | |
The Unicode tables: | |
<pre>U+203E ‾ e2 80 be OVERLINE</pre> | |
<h2>gdb perl5</h2> | |
Let's check it out: | |
<pre> | |
$ gdb --args perl -e'print "\x{203e}"' | |
(gdb) start | |
(gdb) b Perl_pp_print | |
(gdb) c | |
(gdb) n | |
<i>.. until if (!do_print(*MARK, fp))</i> | |
(gdb) p **MARK | |
$1 = {sv_any = 0x404280, sv_refcnt = 1, sv_flags = 671106052, sv_u = { | |
svu_pv = <b>0x426dd0 "‾"</b>, svu_iv = 4353488, svu_uv = 4353488, | |
svu_rv = 0x426dd0, svu_array = 0x426dd0, svu_hash = 0x426dd0, | |
svu_gp = 0x426dd0, svu_fp = 0x426dd0}, ...} | |
(gdb) p Perl_sv_dump(*MARK) | |
ALLOCATED at -e:1 for stringify (parent 0x0); serial 301 | |
SV = PV(0x404280) at 0x4239a8 | |
REFCNT = 1 | |
FLAGS = (POK,READONLY,pPOK,<strong>UTF8</strong>) | |
PV = 0x426dd0 "\342\200\276" [UTF8 "\x{203e}"] | |
CUR = <strong>3</strong> | |
LEN = 16 | |
$2 = void | |
(gdb) x/3x 0x426dd0 | |
0x426dd0: <strong>0xe2 0x80 0xbe</strong> | |
</pre> | |
We see that perl5 does store the utf-8 flag, but not the length of the string, the utf8 length (=1), only the length of the buffer (=3). | |
Any other multi-byte encoded string, such as UCS-2 is stored differently. We suppose as utf-8. | |
We are already in the debugger, so let's try the different cmdline argument. | |
<pre> | |
(gdb) run -e'use Encode; print encode("UCS-2", "\x{203e}")' | |
The program being debugged has been started already. | |
Start it from the beginning? (y or n) y | |
Breakpoint 2, Perl_pp_print () at pp_hot.c:712 | |
712 dVAR; dSP; dMARK; dORIGMARK; | |
(gdb) p **MARK | |
$3 = {sv_any = 0x404b30, sv_refcnt = 1, sv_flags = 541700, sv_u = { | |
svu_pv = <strong>0x563a50 " >"</strong>, svu_iv = 5651024, svu_uv = 5651024, | |
svu_rv = 0x563a50, svu_array = 0x563a50, svu_hash = 0x563a50, svu_gp = 0x563a50, | |
svu_fp = 0x563a50}, ...} | |
(gdb) p Perl_sv_dump(*MARK) | |
ALLOCATED at -e:1 by return (parent 0x0); serial 9579 | |
SV = PV(0x404b30) at 0x556fb8 | |
REFCNT = 1 | |
FLAGS = (TEMP,POK,pPOK) | |
PV = 0x563a50 " >" | |
CUR = 2 | |
LEN = 16 | |
$4 = void | |
(gdb) x/2x 0x563a50 | |
0x563a50: <strong>0x20 0x3e</strong> | |
</pre> | |
But we don't see the UTF8 flag in encode("UCS-2", "\x{203e}"), just the simple ascii string " >", which is the UCS-2 representation of [20 3e]. | |
Because " >" is perfectly representable as non-utf8 ASCII string. | |
UCS-2 is much much nicer than UTF-8, is has a fixed size, it is readable, Windows uses it, but it cannot represent all Unicode characters. | |
<a title="Encode::Unicode" href="http://perldoc.perl.org/Encode/Unicode.html">Encode::Unicode</a> contains this nice cheatsheet: | |
<pre> | |
<b>Quick Reference</b> | |
Decodes from ord(N) Encodes chr(N) to... | |
octet/char BOM S.P d800-dfff ord > 0xffff \x{1abcd} == | |
---------------+-----------------+------------------------------ | |
UCS-2BE 2 N N is bogus Not Available | |
UCS-2LE 2 N N bogus Not Available | |
UTF-16 2/4 Y Y is S.P S.P BE/LE | |
UTF-16BE 2/4 N Y S.P S.P 0xd82a,0xdfcd | |
UTF-16LE 2/4 N Y S.P S.P 0x2ad8,0xcddf | |
UTF-32 4 Y - is bogus As is BE/LE | |
UTF-32BE 4 N - bogus As is 0x0001abcd | |
UTF-32LE 4 N - bogus As is 0xcdab0100 | |
UTF-8 1-4 - - bogus >= 4 octets \xf0\x9a\af\8d | |
---------------+-----------------+------------------------------ | |
</pre> | |
<h2>gdb parrot</h2> | |
Back to parrot: | |
If you debug parrot with gdb you get a gdb pretty-printer thanks to Nolan Lum, which displays the string and encoding information automatically. | |
In perl5 you have to call <code>Perl_sv_dump</code> with or without the <code>my_perl</code> as first argument, if threaded or not. With a threaded perl, e.g. on Windows you'd need to call <code>p Perl_sv_dump(my_perl, *MARK)</code>. | |
In parrot you just ask for the value and the formatting is done with a gdb pretty-printer plugin. | |
The string length is called <code>strlen</code> (of the encoded string), the buffer size is called <code>bufused</code>. | |
Even in a backtrace the string arguments are displayed abbrevated like this: | |
<pre> | |
#3 0x00007ffff7c29fc4 in utf8_iter_get_and_advance (interp=0x412050, str="utf8:� [1/2]", | |
i=0x7fffffffdd00) at src/string/encoding/utf8.c:551 | |
#4 0x00007ffff7a440f6 in Parrot_str_escape_truncate (interp=0x412050, src="utf8:� [1/2]", | |
limit=20) at src/string/api.c:2492 | |
#5 0x00007ffff7b02fb3 in trace_op_dump (interp=0x412050, code_start=0x63a1c0, pc=0x63b688) | |
at src/runcore/trace.c:450 | |
</pre> | |
[1/2] means strlen=1 bufused=2 | |
Each non-ascii or non latin-1 encoded string is printed with the encoding prefix. | |
Internally the encoding is of course a index or pointer in the table of supported encodings. | |
You can set a breakpoint to <code>utf8_iter_get_and_advance</code> and watch the strings. | |
<pre> | |
(gdb) r t/library/mime_base64u.t | |
Breakpoint 1, utf8_iter_get_and_advance (interp=0x412050, str="utf8:\\x{00c7} [8/8]", | |
i=0x7fffffffcd40) at src/string/encoding/utf8.c:544 | |
(gdb) p str | |
$1 = "utf8:\\x{00c7} [8/8]" | |
(gdb) p str->bufused | |
$3 = 8 | |
(gdb) p str->strlen | |
$4 = 8 | |
(gdb) p str->strstart | |
$5 = 0x5102d7 "\\x{00c7}" | |
</pre> | |
This is escaped. Let's advance to a more interesting utf8 string in this test, i.e. until str="utf8:Ā [1/2]" | |
You get the members of a struct with tab-completion, i.e. press <strong><TAB></strong> after <strong>p str-></strong> | |
<pre> | |
(gdb) p str-> | |
<strong>_buflen _bufstart bufused encoding flags hashval strlen strstart</strong> | |
(gdb) p str->strlen | |
$9 = 8 | |
(gdb) dis 1 | |
(gdb) b utf8_iter_get_and_advance if str->strlen == 1 | |
(gdb) c | |
Breakpoint 2, utf8_iter_get_and_advance (interp=0x412050, str="utf8:Ā [1/2]", | |
i=0x7fffffffcd10) at src/string/encoding/utf8.c:544 | |
544 ASSERT_ARGS(utf8_iter_get_and_advance) | |
(gdb) p str->strlen | |
$10 = 1 | |
(gdb) p str->strstart | |
$11 = 0x7ffff7faeb58 "Ā" | |
(gdb) x/2x str->strstart | |
0x7ffff7faeb58: <strong>0xc4 0x80</strong> | |
(gdb) p str->encoding | |
$12 = (const struct _str_vtable *) 0x7ffff7d882e0 | |
(gdb) p *str->encoding | |
$13 = {num = 3, name = 0x7ffff7ce333f "utf8", name_str = "utf8", bytes_per_unit = 1, | |
max_bytes_per_codepoint = 4, to_encoding = 0x7ffff7c292b0 <utf8_to_encoding>, chr = | |
0x7ffff7c275c0 <unicode_chr>, equal = 0x7ffff7c252e0 <encoding_equal>, compare = | |
0x7ffff7c254e0 <encoding_compare>, index = 0x7ffff7c25690 <encoding_index>, rindex | |
= 0x7ffff7c257a0 <encoding_rindex>, hash = 0x7ffff7c25a20 <encoding_hash>, scan = | |
0x7ffff7c29380 <utf8_scan>, partial_scan = 0x7ffff7c29460 <utf8_partial_scan>, ord | |
= 0x7ffff7c297e0 <utf8_ord>, substr = 0x7ffff7c25de0 <encoding_substr>, is_cclass = | |
0x7ffff7c26000 <encoding_is_cclass>, find_cclass = | |
0x7ffff7c260e0 <encoding_find_cclass>, find_not_cclass = | |
0x7ffff7c26220 <encoding_find_not_cclass>, get_graphemes = | |
0x7ffff7c263d0 <encoding_get_graphemes>, compose = | |
0x7ffff7c27680 <unicode_compose>, decompose = 0x7ffff7c26450 <encoding_decompose>, | |
upcase = 0x7ffff7c27b20 <unicode_upcase>, downcase = | |
0x7ffff7c27be0 <unicode_downcase>, titlecase = 0x7ffff7c27ca0 <unicode_titlecase>, | |
upcase_first = 0x7ffff7c27d60 <unicode_upcase_first>, downcase_first = | |
0x7ffff7c27dc0 <unicode_downcase_first>, titlecase_first = | |
0x7ffff7c27e20 <unicode_titlecase_first>, iter_get = | |
0x7ffff7c29c40 <utf8_iter_get>, iter_skip = 0x7ffff7c29d60 <utf8_iter_skip>, | |
iter_get_and_advance = 0x7ffff7c29eb0 <utf8_iter_get_and_advance>, | |
iter_set_and_advance = 0x7ffff7c29fd0 <utf8_iter_set_and_advance>} | |
</pre> | |
<h2>encode_base64(str)</h2> | |
<pre> | |
$ perl -MMIME::Base64 -lE'$x="20e3";$s="\x{20e3}"; | |
printf "0x%s\t%s=> %s",$x,$s,encode_base64($s)' | |
Wide character in subroutine entry at -e line 1. | |
</pre> | |
Oops, I'm clearly a unicode perl5 newbie. Does my term not understand utf-8? | |
<pre> | |
$ echo $TERM | |
xterm | |
</pre> | |
No, it should. encode_base64 does not understand unicode. | |
<code>perldoc MIME::Base64</code> | |
<i>"The base64 encoding is only defined for single-byte characters. Use the Encode module to select the byte encoding you want."</i> | |
Oh my! But it is just perl5. It just works on byte buffers, not on strings. | |
perl5 strings can be utf8 and non-utf8. Why on earth an utf8 encoded string is disallowed and only byte buffers of unknown encodings are allowed goes beyond my understanding, but what can you do. Nothing. base64 is a binary only protocol, based on byte buffers. So we decode it manually to byte buffers. The Encode API for decoding is called <i>encode</i>. | |
<pre> | |
$ perl -MMIME::Base64 -MEncode -lE'$x="20e3";$s="\x{20e3}"; | |
printf "0x%s\t%s=> %s",$x,$s,encode_base64(decode('utf8',$s))' | |
Wide character in printf at -e line 1. | |
0x20e3 => 4oOj | |
</pre> | |
This is now the term warning I know. We need <strong>-C</strong> | |
<pre> | |
$ <strong>perldoc perluniintro</strong> | |
$ perl -C -MMIME::Base64 -MEncode -lE'$x="20e3";$s="\x{20e3}"; | |
printf "0x%s\t%s=> %s",$x,$s,encode_base64(encode('utf8',$s))' | |
0x20e3 => 4oOj | |
</pre> | |
Over to rakudo/perl6 and parrot: | |
<pre> | |
$ cat >m.pir << EOP | |
.sub main :main | |
load_bytecode 'MIME/Base64.pbc' | |
$P1 = get_global [ "MIME"; "Base64" ], 'encode_base64' | |
$S1 = utf8:"\x{203e}" | |
$S2 = $P1(s1) | |
say $S1 | |
say $S2 | |
.end | |
EOP | |
$ parrot m.pir | |
FixedIntegerArray: index out of bounds! | |
current instr.: 'parrot;MIME;Base64;encode_base64' | |
pc 163 (runtime/parrot/library/MIME/Base64.pir:147) | |
</pre> | |
The perl6 test, using the parrot library, from <a href="https://github.com/ronaldxs/perl6-Enc-MIME-Base64/">https://github.com/ronaldxs/perl6-Enc-MIME-Base64/</a> | |
<pre> | |
$ git clone git://github.com/ronaldxs/perl6-Enc-MIME-Base64.git | |
Cloning into 'perl6-Enc-MIME-Base64'... | |
$ PERL6LIB=perl6-Enc-MIME-Base64/lib perl6 <<EOP | |
use Enc::MIME::Base64; | |
say encode_base64_str("\x203e"); | |
EOP | |
> use Enc::MIME::Base64; | |
Nil | |
> say encode_base64_str("\x203e"); | |
FixedIntegerArray: index out of bounds! | |
... | |
</pre> | |
The pure perl6 workaround: | |
<pre> | |
$ PERL6LIB=perl6-Enc-MIME-Base64/lib perl6 <<EOP | |
use PP::Enc::MIME::Base64; | |
say encode_base64_str("\x203e"); | |
EOP | |
> use PP::Enc::MIME::Base64; | |
Nil | |
> say encode_base64_str("\x203e"); | |
4oC+ | |
</pre> | |
Wait. perl6 creates a different enoding than perl5? | |
What about coreutils <a href="http://www.gnu.org/software/coreutils/manual/html_node/base64-invocation.html">base64</a> command. | |
<pre> | |
$ echo -n "‾" > m.raw | |
$ od -x m.raw | |
0000000 80e2 00be | |
0000003 | |
$ ls -al m.raw | |
-rw-r--r-- 1 rurban rurban 3 Dec 6 10:23 m.raw | |
$ base64 m.raw | |
4oC+ | |
</pre> | |
<code>[80e2 00be]</code> is the little-endian version of <code>[e2 80 be]</code>, 3 bytes, flipped. | |
Ok, at least base64 agrees with perl6, and I must have made some encoding mistake with perl5. | |
Back to debugging our parrot problem: | |
parrot unlike perl6 has no debugger yet. So we have to use <code>gdb</code>, and we need to know in which function the error occured. We use the parrot <code>-t</code> trace flag, which is like the perl5 debugging <code>-Dt</code> flag, but it is always enabled, even in optimized builds. | |
<pre> | |
$ parrot --help | |
... | |
-t --trace [flags] | |
--help-debug | |
... | |
$ parrot --help-debug | |
... | |
--trace -t [Flags] ... | |
0001 opcodes | |
0002 find_method | |
0004 function calls | |
$ parrot -t7 m.pir | |
... | |
009f band I9, I2, 63 I9=0 I2=0 | |
00a3 set I10, P0[I5] I10=0 P0=<strong>FixedIntegerArray</strong>=PMC(0xff7638) I5=[<strong>2063</strong>] | |
016c get_results PC2 (1), P2 PC2=FixedIntegerArray=PMC(0xedd178) P2=PMCNULL | |
016f finalize P2 P2=Exception=PMC(0x16ed498) | |
0171 pop_eh | |
<i>lots of error handling</i> | |
... | |
0248 callmethodcc P0, "print" P0=FileHandle=PMC(0xedcca0) | |
FixedIntegerArray: index out of bounds! | |
</pre> | |
We finally see the problem, which matches the run-time error. | |
<pre>00a3 <strong>set I10, P0[I5]</strong> I10=0 P0=<strong>FixedIntegerArray</strong>=PMC(0xff7638) I5=[<strong>2063</strong>] | |
</pre> | |
We want to set <tt>I10</tt> to the <tt>I5=2063</tt>'th element in the FixedIntegerArray <tt>P0</tt>, and the array is not big enough. | |
After several hours of analyzing I came to the conclusion that the parrot library MIME::Base64 was wrong by using <strong>ord</strong> of every character in the string. It should use a <strong>bytebuffer</strong> instead. | |
Which was fixed with <a href="https://github.com/parrot/parrot/commit/3a48e6b462d8fff501cb16a2f92a857baee0df53">commit 3a48e6</a>. ord can return integers > 255, but base64 can only handle chars < 255. | |
The fixed parrot library was now correct: | |
<pre> | |
$ parrot m.pir | |
‾ | |
4oC+ | |
</pre> | |
But then the tests started failing. I spent several weeks trying to understand why the parrot testsuite was wrong with the mime_base64 tests, the testdata came from perl5. I came up with different implementation hacks which would match the testsuite, but finally had to bite the bullet, changing the tests to match the implementation. | |
And I had to special case the tests for big-endian, as base64 is endian agnostic. You cannot decode a base64 encoded powerpc file on an intel machine, when you use multi-byte characters. And utf-8 is even more multi-byte than ucs-2. I had to accept the fact the big-endian will return a different encoding. Before the results were the same. The tests were written to return the same encoding on little and big-endian. | |
<h2>Summary</h2> | |
The first reason why I wrote this blog post was to show how to debug into crazy problems like this, when you are not sure if the core implementation, the library, the spec or the tests are wrong. It turned out, that the library and the tests were wrong. | |
You saw who easily you could use gdb to debug into such problems, as soon as you find out a proper breakpoint. | |
The internal string representations looked like this: | |
MIME::Base64 internally: | |
<pre>len=1, encoding=utf-8, buf=[3e20]</pre> | |
and inside the parrot imcc compiler the SREG | |
<pre>len=8, buf="utf-8:\"\x{203e}\""</pre> | |
parrot is a register based runtime, and a SREG is the string representation of the register value. Unfortunately a SREG cannot hold the encoding info yet, so we prefix the encoding in the string, and unquote it back. This is not the reason why parrot is still slower than the perl5 VM. I <a href="https://github.com/parrot/parrot/commit/9c8159314dd2d26365653fbcd8627b0f8fbb0559">benchmarked</a> it. parrot still uses too much sprintf's internally and the encoding quote/unquoting counts only for a 4th of the time of the sprintf gyrations. | |
And parrot function calls are awfully slow and de-optimized. | |
The second reason is to explain the new decode_base64() API, which only parrot - and therefore all parrot based languages like rakudo - now have got. | |
<h2>decode_base64(str, ?:encoding)</h2> | |
<p><i>"Decode a base64 string by calling the decode_base64() function. | |
This function takes as first argument the string to decode, | |
as optional second argument the encoding string for the decoded data. | |
It returns the decoded data.</i> | |
<i>Any character not part of the 65-character base64 subset is silently ignored. | |
Characters occurring after a '=' padding character are never decoded."</i></p> | |
So decode_base64 got now a second optional encoding argument. The src string for encode_base64 can be any encoding and is automatically decoded to a bytebuffer. You can easily encode an image or unicode string without any trouble, and for the decoder you can define the wanted encoding beforehand. The result can be the encoding <strong>binary</strong> or <strong>utf-8</strong> or any encoding you prefer, no need for additional decoding of the result. The default encoding of the decoded string is either ascii, latin-1 or utf-8. parrot will upgrade the encoding automatically. | |
You can compare the new examples of <a href="https://github.com/parrot/parrot/blob/master/examples/library/utf8_base64.pir">pir</a> against the <a href="https://github.com/parrot/parrot/blob/master/examples/library/utf8_base64.pl">perl5</a> version: | |
parrot: | |
<pre> | |
.sub main :main | |
load_bytecode 'MIME/Base64.pbc' | |
.local pmc enc_sub | |
enc_sub = get_global [ "MIME"; "Base64" ], 'encode_base64' | |
.local string result_encode | |
# GH 814 | |
result_encode = enc_sub(utf8:"\x{a2}") | |
say "encode: utf8:\"\\x{a2}\"" | |
say "expected: wqI=" | |
print "result: " | |
say result_encode | |
# GH 813 | |
result_encode = enc_sub(utf8:"\x{203e}") | |
say "encode: utf8:\"\\x{203e}\"" | |
say "expected: 4oC+" | |
print "result: " | |
say result_encode | |
.end | |
</pre> | |
perl5: | |
<pre> | |
use MIME::Base64 qw(encode_base64 decode_base64); | |
use Encode qw(encode); | |
my $encoded = encode_base64(encode("UTF-8", "\x{a2}")); | |
print "encode: utf-8:\"\\x{a2}\" - ", encode("UTF-8", "\x{a2}"), "\n"; | |
print "expected: wqI=\n"; | |
print "result: $encoded\n"; | |
print "decode: ",decode_base64("wqI="),"\n\n"; # 302 242 | |
my $encoded = encode_base64(encode("UTF-8", "\x{203e}")); | |
print "encode: utf-8:\"\\x{203e}\" -> ",encode("UTF-8", "\x{203e}"),"\n"; | |
print "expected: 4oC+\n"; | |
print "result: $encoded\n"; # 342 200 276 | |
print "decode: ",decode_base64("4oC+"),"\n"; | |
for ([qq(a2)],[qq(c2a2)],[qw(203e)],[qw(3e 20)],[qw(1000)],[qw(00c7)],[qw(00ff 0000)]){ | |
$s = pack "H*",@{$_}; | |
printf "0x%s\t=> %s", join("",@{$_}), encode_base64($s); | |
} | |
</pre> | |
perl6: | |
<pre> | |
use Enc::MIME::Base64; | |
say encode_base64_str("\xa2"); | |
say encode_base64_str("\x203e"); | |
</pre> |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
This is great! One tiny suggestion I have is: perhaps mention that Nolan Lum ( @nol888 ) was one of our most productive Google Code-In students and link to this Google blog post: http://google-opensource.blogspot.com/2011/03/parrot-foundation-reflects-on-google.html ?