@rurban
Created December 6, 2012 18:15
Day 7 – MIME::Base64 – On encoded strings
<title>Day 7 - MIME::Base64 - On encoded strings</title>
<h2>parrot MIME::Base64 FixedIntegerArray: index out of bounds!</h2>
Ronaldxs created the following parrot <a href="https://github.com/parrot/parrot/issues/813">ticket #813</a> 4 months ago:
<i>"Was playing with p6 MIME::Base64 and utf8 sampler page when I came across this. It seems that the parrot MIME Base64 library can't handle some UTF-8 characters as demonstrated below."</i>
<pre>
.sub go :main
load_bytecode 'MIME/Base64.pbc'
.local pmc enc_sub
enc_sub = get_global [ "MIME"; "Base64" ], 'encode_base64'
.local string result_encode
result_encode = enc_sub(utf8:"\x{203e}")
say result_encode
.end</pre>
<code>FixedIntegerArray: index out of bounds!</code>
<code>current instr.: 'parrot;MIME;Base64;encode_base64'
pc 163 (runtime/parrot/library/MIME/Base64.pir:147)</code>
<code>called from Sub 'go' pc 11 (die_utf8_base64.pir:8)</code>
This was interesting, because parrot strings store the encoding information in the string. The user does not need to store the string encoding somewhere else, as in perl5, nor make educated guesses about the encoding. parrot supports ascii, latin1, binary, utf-8, ucs-2, utf-16 and ucs-4 string encodings natively.
So we thought: why the hell can't parrot handle simple utf-8 encoded strings?
As it turned out, the parrot implementation of MIME::Base64, which can be shared by all languages that use parrot as their VM, stored the codepoint of each character in an array of integers. With multibyte encodings such as UTF-8 this yields different data in memory than a normal multibyte string, which is stored as a byte buffer plus the additional encoding information.
<h2>Internal string representations</h2>
For example, here is an overview of the different internal string representations of the utf-8 string <b>"\x{203e}"</b>:
perl5 strings:
<pre>len=3, utf-8 flag, "\342\200\276" buf=[e2 80 be]</pre>
parrot strings:
<pre>len=1, bufused=3, encoding=utf-8, buf=[e2 80 be]</pre>
The Unicode tables:
<pre>U+203E ‾ e2 80 be OVERLINE</pre>
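The difference between the character length and the buffer length can be reproduced in a few lines. This is only a sketch in Python, not parrot or perl5 code; the names <code>strlen</code> and <code>bufused</code> are borrowed from the parrot struct members purely for illustration.

```python
# One code point, three UTF-8 octets: the two "lengths" a string can have.
s = "\u203e"                 # OVERLINE
buf = s.encode("utf-8")      # the byte buffer, e2 80 be

strlen = len(s)              # length in characters (code points)
bufused = len(buf)           # length in bytes

summary = f"strlen={strlen} bufused={bufused} buf=[{buf.hex(' ')}]"
print(summary)
```

perl5's <code>CUR</code> corresponds to the byte count here, parrot keeps both numbers.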
<h2>gdb perl5</h2>
Let's check it out:
<pre>
$ gdb --args perl -e'print "\x{203e}"'
(gdb) start
(gdb) b Perl_pp_print
(gdb) c
(gdb) n
<i>.. until if (!do_print(*MARK, fp))</i>
(gdb) p **MARK
$1 = {sv_any = 0x404280, sv_refcnt = 1, sv_flags = 671106052, sv_u = {
svu_pv = <b>0x426dd0 "‾"</b>, svu_iv = 4353488, svu_uv = 4353488,
svu_rv = 0x426dd0, svu_array = 0x426dd0, svu_hash = 0x426dd0,
svu_gp = 0x426dd0, svu_fp = 0x426dd0}, ...}
(gdb) p Perl_sv_dump(*MARK)
ALLOCATED at -e:1 for stringify (parent 0x0); serial 301
SV = PV(0x404280) at 0x4239a8
REFCNT = 1
FLAGS = (POK,READONLY,pPOK,<strong>UTF8</strong>)
PV = 0x426dd0 "\342\200\276" [UTF8 "\x{203e}"]
CUR = <strong>3</strong>
LEN = 16
$2 = void
(gdb) x/3x 0x426dd0
0x426dd0: <strong>0xe2 0x80 0xbe</strong>
</pre>
We see that perl5 does store the utf-8 flag, but not the length of the string in characters (the utf8 length, =1), only the length of the buffer in bytes (CUR=3).
How is a string in any other multi-byte encoding, such as UCS-2, stored? We would suppose as utf-8 as well.
We are already in the debugger, so let's try a different command-line argument.
<pre>
(gdb) run -e'use Encode; print encode("UCS-2", "\x{203e}")'
The program being debugged has been started already.
Start it from the beginning? (y or n) y
Breakpoint 2, Perl_pp_print () at pp_hot.c:712
712 dVAR; dSP; dMARK; dORIGMARK;
(gdb) p **MARK
$3 = {sv_any = 0x404b30, sv_refcnt = 1, sv_flags = 541700, sv_u = {
svu_pv = <strong>0x563a50 " &gt;"</strong>, svu_iv = 5651024, svu_uv = 5651024,
svu_rv = 0x563a50, svu_array = 0x563a50, svu_hash = 0x563a50, svu_gp = 0x563a50,
svu_fp = 0x563a50}, ...}
(gdb) p Perl_sv_dump(*MARK)
ALLOCATED at -e:1 by return (parent 0x0); serial 9579
SV = PV(0x404b30) at 0x556fb8
REFCNT = 1
FLAGS = (TEMP,POK,pPOK)
PV = 0x563a50 " &gt;"
CUR = 2
LEN = 16
$4 = void
(gdb) x/2x 0x563a50
0x563a50: <strong>0x20 0x3e</strong>
</pre>
But there is no UTF8 flag on the result of encode("UCS-2", "\x{203e}"), just the simple ascii string " &gt;", i.e. the UCS-2 bytes [20 3e].
That is because " &gt;" is perfectly representable as a non-utf8 ASCII string.
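The same two bytes can be checked outside of gdb. A small sketch, assuming Python; Python has no codec named UCS-2, so <code>utf-16-be</code> is used as a stand-in, which yields the same bytes for any character in the Basic Multilingual Plane.

```python
# U+203E in big-endian two-byte form is 0x20 0x3e, which reads as " >" in ASCII.
ucs2_be = "\u203e".encode("utf-16-be")   # stand-in for UCS-2BE (assumption, see above)

as_hex = ucs2_be.hex(" ")
as_ascii = ucs2_be.decode("ascii")

print(as_hex)
print(repr(as_ascii))
```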
UCS-2 is much, much nicer than UTF-8: it has a fixed size, it is readable and Windows uses it, but it cannot represent all Unicode characters.
<a title="Encode::Unicode" href="http://perldoc.perl.org/Encode/Unicode.html">Encode::Unicode</a> contains this nice cheatsheet:
<pre>
<b>Quick Reference</b>
              Decodes from ord(N)           Encodes chr(N) to...
     octet/char BOM S.P d800-dfff  ord > 0xffff     \x{1abcd} ==
---------------+-----------------+------------------------------
UCS-2BE       2   N   N  is bogus                  Not Available
UCS-2LE       2   N   N     bogus                  Not Available
UTF-16      2/4   Y   Y  is   S.P           S.P            BE/LE
UTF-16BE    2/4   N   Y       S.P           S.P    0xd82a,0xdfcd
UTF-16LE    2/4   N   Y       S.P           S.P    0x2ad8,0xcddf
UTF-32        4   Y   -  is bogus         As is            BE/LE
UTF-32BE      4   N   -     bogus         As is       0x0001abcd
UTF-32LE      4   N   -     bogus         As is       0xcdab0100
UTF-8       1-4   -   -     bogus   >= 4 octets \xf0\x9a\xaf\x8d
---------------+-----------------+------------------------------
</pre>
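The last column of the cheatsheet is easy to verify. A sketch in Python (the codec names are Python's, not Encode's) that encodes U+1ABCD in each of the fixed-endianness encodings:

```python
# Encode the cheatsheet's sample character U+1ABCD and show the raw octets.
ch = "\U0001abcd"
codecs = ["utf-16-be", "utf-16-le", "utf-32-be", "utf-32-le", "utf-8"]

table = {name: ch.encode(name).hex() for name in codecs}

for name, octets in table.items():
    print(f"{name:10} {octets}")
```

UCS-2 is missing from the list on purpose: a code point above 0xffff needs a surrogate pair, which is exactly what UCS-2 cannot express.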
<h2>gdb parrot</h2>
Back to parrot:
If you debug parrot with gdb you get a gdb pretty-printer thanks to Nolan Lum, which displays the string and encoding information automatically.
In perl5 you have to call <code>Perl_sv_dump</code>, with <code>my_perl</code> as the first argument if perl is threaded and without it if not. With a threaded perl, e.g. on Windows, you'd need to call <code>p Perl_sv_dump(my_perl, *MARK)</code>.
In parrot you just ask for the value and the formatting is done with a gdb pretty-printer plugin.
The string length is called <code>strlen</code> (of the encoded string), the buffer size is called <code>bufused</code>.
Even in a backtrace the string arguments are displayed abbreviated like this:
<pre>
#3 0x00007ffff7c29fc4 in utf8_iter_get_and_advance (interp=0x412050, str="utf8:� [1/2]",
i=0x7fffffffdd00) at src/string/encoding/utf8.c:551
#4 0x00007ffff7a440f6 in Parrot_str_escape_truncate (interp=0x412050, src="utf8:� [1/2]",
limit=20) at src/string/api.c:2492
#5 0x00007ffff7b02fb3 in trace_op_dump (interp=0x412050, code_start=0x63a1c0, pc=0x63b688)
at src/runcore/trace.c:450
</pre>
[1/2] means strlen=1 bufused=2
Each non-ascii or non latin-1 encoded string is printed with the encoding prefix.
Internally the encoding is of course an index or pointer into the table of supported encodings.
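To make the <code>[strlen/bufused]</code> notation concrete, here is a rough imitation of that summary format. It is a sketch in Python and not the actual gdb pretty-printer; the helper name <code>summarize</code> and the rule for dropping the prefix are assumptions for illustration only.

```python
def summarize(text, encoding):
    """Imitate the 'encoding:text [strlen/bufused]' summary (hypothetical helper)."""
    buf = text.encode(encoding)
    # Assumption: ascii and latin-1 strings are shown without an encoding prefix.
    plain = encoding in ("ascii", "latin-1")
    prefix = "" if plain else encoding.replace("-", "") + ":"
    return f"{prefix}{text} [{len(text)}/{len(buf)}]"

print(summarize("\u0100", "utf-8"))        # one character, two bytes
print(summarize("\\x{00c7}", "utf-8"))     # the escaped form: eight characters, eight bytes
```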
You can set a breakpoint to <code>utf8_iter_get_and_advance</code> and watch the strings.
<pre>
(gdb) r t/library/mime_base64u.t
Breakpoint 1, utf8_iter_get_and_advance (interp=0x412050, str="utf8:\\x{00c7} [8/8]",
i=0x7fffffffcd40) at src/string/encoding/utf8.c:544
(gdb) p str
$1 = "utf8:\\x{00c7} [8/8]"
(gdb) p str->bufused
$3 = 8
(gdb) p str->strlen
$4 = 8
(gdb) p str->strstart
$5 = 0x5102d7 "\\x{00c7}"
</pre>
This is escaped. Let's advance to a more interesting utf8 string in this test, i.e. until str="utf8:Ā [1/2]"
You get the members of a struct with tab-completion, i.e. press <strong>&lt;TAB&gt;</strong> after <strong>p str-&gt;</strong>
<pre>
(gdb) p str->
<strong>_buflen _bufstart bufused encoding flags hashval strlen strstart</strong>
(gdb) p str->strlen
$9 = 8
(gdb) dis 1
(gdb) b utf8_iter_get_and_advance if str->strlen == 1
(gdb) c
Breakpoint 2, utf8_iter_get_and_advance (interp=0x412050, str="utf8:Ā [1/2]",
i=0x7fffffffcd10) at src/string/encoding/utf8.c:544
544 ASSERT_ARGS(utf8_iter_get_and_advance)
(gdb) p str->strlen
$10 = 1
(gdb) p str->strstart
$11 = 0x7ffff7faeb58 "Ā"
(gdb) x/2x str->strstart
0x7ffff7faeb58: <strong>0xc4 0x80</strong>
(gdb) p str->encoding
$12 = (const struct _str_vtable *) 0x7ffff7d882e0
(gdb) p *str->encoding
$13 = {num = 3, name = 0x7ffff7ce333f "utf8", name_str = "utf8", bytes_per_unit = 1,
max_bytes_per_codepoint = 4, to_encoding = 0x7ffff7c292b0 &lt;utf8_to_encoding&gt;, chr =
0x7ffff7c275c0 &lt;unicode_chr&gt;, equal = 0x7ffff7c252e0 &lt;encoding_equal&gt;, compare =
0x7ffff7c254e0 &lt;encoding_compare&gt;, index = 0x7ffff7c25690 &lt;encoding_index&gt;, rindex
= 0x7ffff7c257a0 &lt;encoding_rindex&gt;, hash = 0x7ffff7c25a20 &lt;encoding_hash&gt;, scan =
0x7ffff7c29380 &lt;utf8_scan&gt;, partial_scan = 0x7ffff7c29460 &lt;utf8_partial_scan&gt;, ord
= 0x7ffff7c297e0 &lt;utf8_ord&gt;, substr = 0x7ffff7c25de0 &lt;encoding_substr&gt;, is_cclass =
0x7ffff7c26000 &lt;encoding_is_cclass&gt;, find_cclass =
0x7ffff7c260e0 &lt;encoding_find_cclass&gt;, find_not_cclass =
0x7ffff7c26220 &lt;encoding_find_not_cclass&gt;, get_graphemes =
0x7ffff7c263d0 &lt;encoding_get_graphemes&gt;, compose =
0x7ffff7c27680 &lt;unicode_compose&gt;, decompose = 0x7ffff7c26450 &lt;encoding_decompose&gt;,
upcase = 0x7ffff7c27b20 &lt;unicode_upcase&gt;, downcase =
0x7ffff7c27be0 &lt;unicode_downcase&gt;, titlecase = 0x7ffff7c27ca0 &lt;unicode_titlecase&gt;,
upcase_first = 0x7ffff7c27d60 &lt;unicode_upcase_first&gt;, downcase_first =
0x7ffff7c27dc0 &lt;unicode_downcase_first&gt;, titlecase_first =
0x7ffff7c27e20 &lt;unicode_titlecase_first&gt;, iter_get =
0x7ffff7c29c40 &lt;utf8_iter_get&gt;, iter_skip = 0x7ffff7c29d60 &lt;utf8_iter_skip&gt;,
iter_get_and_advance = 0x7ffff7c29eb0 &lt;utf8_iter_get_and_advance&gt;,
iter_set_and_advance = 0x7ffff7c29fd0 &lt;utf8_iter_set_and_advance&gt;}
</pre>
<h2>encode_base64(str)</h2>
<pre>
$ perl -MMIME::Base64 -lE'$x="20e3";$s="\x{20e3}";
printf "0x%s\t%s=> %s",$x,$s,encode_base64($s)'
Wide character in subroutine entry at -e line 1.
</pre>
Oops, I'm clearly a unicode perl5 newbie. Does my term not understand utf-8?
<pre>
$ echo $TERM
xterm
</pre>
No, it should. encode_base64 does not understand unicode.
<code>perldoc MIME::Base64</code>
<i>"The base64 encoding is only defined for single-byte characters. Use the Encode module to select the byte encoding you want."</i>
Oh my! But it is just perl5. It just works on byte buffers, not on strings.
perl5 strings can be utf8 and non-utf8. Why on earth a utf8 encoded string is disallowed and only byte buffers of unknown encoding are allowed goes beyond my understanding, but what can you do? Nothing. base64 is a binary-only protocol, based on byte buffers. So we convert the string manually to a byte buffer. In the Encode API, the function that turns a character string into bytes is called <i>encode</i>.
<pre>
$ perl -MMIME::Base64 -MEncode -lE'$x="20e3";$s="\x{20e3}";
printf "0x%s\t%s=> %s",$x,$s,encode_base64(encode('utf8',$s))'
Wide character in printf at -e line 1.
0x20e3 => 4oOj
</pre>
This is now the term warning I know. We need <strong>-C</strong>
<pre>
$ <strong>perldoc perluniintro</strong>
$ perl -C -MMIME::Base64 -MEncode -lE'$x="20e3";$s="\x{20e3}";
printf "0x%s\t%s=> %s",$x,$s,encode_base64(encode('utf8',$s))'
0x20e3 => 4oOj
</pre>
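The "bytes first, then base64" rule is not a perl5 peculiarity. As a comparison, a sketch in Python, whose <code>base64.b64encode</code> also refuses a character string and wants the caller to pick the byte encoding:

```python
import base64

s = "\u20e3"   # the same code point as in the perl5 one-liner above

# Passing the character string directly is rejected, much like "Wide character".
try:
    base64.b64encode(s)
    error = None
except TypeError as exc:
    error = exc

# Choosing the byte encoding explicitly works.
encoded = base64.b64encode(s.encode("utf-8")).decode("ascii")
print(type(error).__name__, encoded)
```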
Over to rakudo/perl6 and parrot:
<pre>
$ cat >m.pir &lt;&lt; 'EOP'
.sub main :main
load_bytecode 'MIME/Base64.pbc'
$P1 = get_global [ "MIME"; "Base64" ], 'encode_base64'
$S1 = utf8:"\x{203e}"
$S2 = $P1($S1)
say $S1
say $S2
.end
EOP
$ parrot m.pir
FixedIntegerArray: index out of bounds!
current instr.: 'parrot;MIME;Base64;encode_base64'
pc 163 (runtime/parrot/library/MIME/Base64.pir:147)
</pre>
The perl6 test, using the parrot library, from <a href="https://github.com/ronaldxs/perl6-Enc-MIME-Base64/">https://github.com/ronaldxs/perl6-Enc-MIME-Base64/</a>
<pre>
$ git clone git://github.com/ronaldxs/perl6-Enc-MIME-Base64.git
Cloning into 'perl6-Enc-MIME-Base64'...
$ PERL6LIB=perl6-Enc-MIME-Base64/lib perl6 &lt;&lt;EOP
use Enc::MIME::Base64;
say encode_base64_str("\x203e");
EOP
> use Enc::MIME::Base64;
Nil
> say encode_base64_str("\x203e");
FixedIntegerArray: index out of bounds!
...
</pre>
The pure perl6 workaround:
<pre>
$ PERL6LIB=perl6-Enc-MIME-Base64/lib perl6 &lt;&lt;EOP
use PP::Enc::MIME::Base64;
say encode_base64_str("\x203e");
EOP
> use PP::Enc::MIME::Base64;
Nil
> say encode_base64_str("\x203e");
4oC+
</pre>
Wait. perl6 creates a different encoding than perl5?
What about the coreutils <a href="http://www.gnu.org/software/coreutils/manual/html_node/base64-invocation.html">base64</a> command?
<pre>
$ echo -n "‾" > m.raw
$ od -x m.raw
0000000 80e2 00be
0000003
$ ls -al m.raw
-rw-r--r-- 1 rurban rurban 3 Dec 6 10:23 m.raw
$ base64 m.raw
4oC+
</pre>
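<code>[80e2 00be]</code> is <code>[e2 80 be]</code> as <code>od -x</code> prints it on a little-endian machine: 3 bytes, grouped into 16-bit words with the two bytes of each word swapped and the odd last byte padded with 00.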
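The word-wise view can be reproduced by hand. A sketch in Python, assuming a little-endian host as in the transcript above (the <code>&lt;</code> in the struct format forces that byte order):

```python
import struct

buf = "\u203e".encode("utf-8")              # e2 80 be
padded = buf + b"\x00" * (len(buf) % 2)     # pad the odd byte, like od does for display

# Read the buffer as little-endian 16-bit words.
words = struct.unpack(f"<{len(padded) // 2}H", padded)
od_view = " ".join(f"{w:04x}" for w in words)
print(od_view)
```

The file itself still holds the three bytes in their original order; only the display is flipped.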
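Ok, at least base64 agrees with perl6, and I must have made some mistake with perl5. Indeed: the perl5 one-liners above encode <code>"\x{20e3}"</code> (UTF-8 bytes e2 83 a3, hence 4oOj), not <code>"\x{203e}"</code>.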
Back to debugging our parrot problem:
parrot, unlike perl6, has no debugger yet. So we have to use <code>gdb</code>, and we need to know in which function the error occurred. We use the parrot <code>-t</code> trace flag, which is like the perl5 debugging <code>-Dt</code> flag, but it is always enabled, even in optimized builds.
<pre>
$ parrot --help
...
-t --trace [flags]
--help-debug
...
$ parrot --help-debug
...
--trace -t [Flags] ...
0001 opcodes
0002 find_method
0004 function calls
$ parrot -t7 m.pir
...
009f band I9, I2, 63 I9=0 I2=0
00a3 set I10, P0[I5] I10=0 P0=<strong>FixedIntegerArray</strong>=PMC(0xff7638) I5=[<strong>2063</strong>]
016c get_results PC2 (1), P2 PC2=FixedIntegerArray=PMC(0xedd178) P2=PMCNULL
016f finalize P2 P2=Exception=PMC(0x16ed498)
0171 pop_eh
<i>lots of error handling</i>
...
0248 callmethodcc P0, "print" P0=FileHandle=PMC(0xedcca0)
FixedIntegerArray: index out of bounds!
</pre>
We finally see the problem, which matches the run-time error.
<pre>00a3 <strong>set I10, P0[I5]</strong> I10=0 P0=<strong>FixedIntegerArray</strong>=PMC(0xff7638) I5=[<strong>2063</strong>]
</pre>
We want to set <tt>I10</tt> to the <tt>I5=2063</tt>'th element in the FixedIntegerArray <tt>P0</tt>, and the array is not big enough.
After several hours of analyzing I came to the conclusion that the parrot library MIME::Base64 was wrong by using <strong>ord</strong> of every character in the string. It should use a <strong>bytebuffer</strong> instead.
This was fixed with <a href="https://github.com/parrot/parrot/commit/3a48e6b462d8fff501cb16a2f92a857baee0df53">commit 3a48e6</a>. ord can return integers > 255, but base64 can only handle byte values from 0 to 255.
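Why a codepoint breaks a table-driven encoder can be shown with a minimal base64 encoder. This is a sketch in Python and not the parrot library code; <code>encode_units</code> and <code>ALPHABET</code> are names made up here. Note that 0x203e >> 2 is 2063, which happens to be the index shown in the trace above.

```python
import base64

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

def encode_units(units):
    """Base64-encode integers that are assumed, but not checked, to be octets."""
    out = []
    for i in range(0, len(units), 3):
        chunk = list(units[i:i + 3])
        pad = 3 - len(chunk)
        a, b, c = chunk + [0] * pad
        # Split 3 x 8 bits into 4 x 6 bits.
        sextets = [a >> 2, ((a & 3) << 4) | (b >> 4), ((b & 15) << 2) | (c >> 6), c & 63]
        # If the first unit of a group is above 255, a >> 2 exceeds 63 and the lookup fails.
        quad = [ALPHABET[s] for s in sextets]
        out.append("".join(quad[:4 - pad]) + "=" * pad)
    return "".join(out)

s = "\u203e"
from_bytes = encode_units(list(s.encode("utf-8")))     # the bytebuffer approach

try:
    encode_units([ord(ch) for ch in s])                # the ord-per-character approach
    failure = None
except IndexError as exc:
    failure = exc

print(from_bytes, type(failure).__name__, 0x203e >> 2)
```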
The fixed parrot library was now correct:
<pre>
$ parrot m.pir
4oC+
</pre>
But then the tests started failing. I spent several weeks trying to understand why the parrot testsuite was wrong with the mime_base64 tests; the test data came from perl5. I came up with different implementation hacks which would match the testsuite, but finally had to bite the bullet and change the tests to match the implementation.
And I had to special-case the tests for big-endian, as base64 knows nothing about endianness: it just encodes the bytes it is given. You cannot decode a base64 encoded powerpc file on an intel machine when you use multi-byte characters. And utf-8 is even more multi-byte than ucs-2. I had to accept the fact that big-endian will return a different encoding. Before, the results were the same: the tests were written to return the same encoding on little- and big-endian.
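How byte order leaks into the base64 output can be illustrated with a fixed-width two-byte encoding. This is only a sketch in Python, not the parrot test code; <code>utf-16-be</code> and <code>utf-16-le</code> stand in for the two byte orders of UCS-2.

```python
import base64

ch = "\u203e"

# The same character, stored with the two possible byte orders.
big = base64.b64encode(ch.encode("utf-16-be")).decode("ascii")      # bytes 20 3e
little = base64.b64encode(ch.encode("utf-16-le")).decode("ascii")   # bytes 3e 20

print(big, little)
```

base64 faithfully round-trips both buffers; it is the reader who has to know which byte order went in.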
<h2>Summary</h2>
The first reason why I wrote this blog post was to show how to debug into crazy problems like this, when you are not sure if the core implementation, the library, the spec or the tests are wrong. It turned out, that the library and the tests were wrong.
You saw how easily you can use gdb to debug into such problems, as soon as you find a proper breakpoint.
The internal string representations looked like this:
MIME::Base64 internally:
<pre>len=1, encoding=utf-8, buf=[3e20]</pre>
and inside the parrot imcc compiler the SREG
<pre>len=8, buf="utf-8:\"\x{203e}\""</pre>
parrot is a register-based runtime, and a SREG is the string representation of the register value. Unfortunately a SREG cannot hold the encoding info yet, so we prefix the encoding in the string, and unquote it back. This is not the reason why parrot is still slower than the perl5 VM. I <a href="https://github.com/parrot/parrot/commit/9c8159314dd2d26365653fbcd8627b0f8fbb0559">benchmarked</a> it. parrot still uses too many sprintfs internally, and the encoding quoting/unquoting accounts for only a quarter of the time of the sprintf gyrations.
And parrot function calls are awfully slow and de-optimized.
The second reason is to explain the new decode_base64() API, which only parrot - and therefore all parrot based languages like rakudo - now has.
<h2>decode_base64(str, ?:encoding)</h2>
<p><i>"Decode a base64 string by calling the decode_base64() function.
This function takes as first argument the string to decode,
as optional second argument the encoding string for the decoded data.
It returns the decoded data.</i>
<i>Any character not part of the 65-character base64 subset is silently ignored.
Characters occurring after a '=' padding character are never decoded."</i></p>
So decode_base64 now has a second, optional encoding argument. The source string for encode_base64 can be in any encoding and is automatically converted to a bytebuffer. You can easily encode an image or a unicode string without any trouble, and for the decoder you can specify the wanted encoding beforehand. The result can be in the encoding <strong>binary</strong> or <strong>utf-8</strong> or any encoding you prefer, with no need for additional decoding of the result. The default encoding of the decoded string is either ascii, latin-1 or utf-8; parrot will upgrade the encoding automatically.
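The shape of such an API can be sketched as follows. This is Python, not the parrot implementation: the function below is a hypothetical look-alike that follows the quoted description (ignore characters outside the alphabet, never decode past a '='). Unlike parrot, it returns raw bytes when no encoding is given instead of picking ascii, latin-1 or utf-8 automatically.

```python
import base64
import string

B64_CHARS = set(string.ascii_letters + string.digits + "+/")

def decode_base64(text, encoding=None):
    """Decode base64 text; optionally decode the bytes with the given encoding."""
    payload = text.split("=", 1)[0]                       # nothing after '=' is decoded
    cleaned = "".join(ch for ch in payload if ch in B64_CHARS)
    if len(cleaned) % 4 == 1:                             # a lone sextet cannot form a byte
        cleaned = cleaned[:-1]
    cleaned += "=" * (-len(cleaned) % 4)                  # restore the padding
    raw = base64.b64decode(cleaned)
    return raw if encoding is None else raw.decode(encoding)

print(decode_base64("4oC+", "utf-8"))
print(decode_base64("wqI="))
```

Passing the encoding up front saves the caller the separate decoding step that perl5 needs after its <code>decode_base64</code>.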
You can compare the new examples of <a href="https://github.com/parrot/parrot/blob/master/examples/library/utf8_base64.pir">pir</a> against the <a href="https://github.com/parrot/parrot/blob/master/examples/library/utf8_base64.pl">perl5</a> version:
parrot:
<pre>
.sub main :main
load_bytecode 'MIME/Base64.pbc'
.local pmc enc_sub
enc_sub = get_global [ "MIME"; "Base64" ], 'encode_base64'
.local string result_encode
# GH 814
result_encode = enc_sub(utf8:"\x{a2}")
say "encode: utf8:\"\\x{a2}\""
say "expected: wqI="
print "result: "
say result_encode
# GH 813
result_encode = enc_sub(utf8:"\x{203e}")
say "encode: utf8:\"\\x{203e}\""
say "expected: 4oC+"
print "result: "
say result_encode
.end
</pre>
perl5:
<pre>
use MIME::Base64 qw(encode_base64 decode_base64);
use Encode qw(encode);
my $encoded = encode_base64(encode("UTF-8", "\x{a2}"));
print "encode: utf-8:\"\\x{a2}\" - ", encode("UTF-8", "\x{a2}"), "\n";
print "expected: wqI=\n";
print "result: $encoded\n";
print "decode: ",decode_base64("wqI="),"\n\n"; # 302 242
$encoded = encode_base64(encode("UTF-8", "\x{203e}")); # reuse the variable declared above
print "encode: utf-8:\"\\x{203e}\" -> ",encode("UTF-8", "\x{203e}"),"\n";
print "expected: 4oC+\n";
print "result: $encoded\n"; # 342 200 276
print "decode: ",decode_base64("4oC+"),"\n";
for ([qq(a2)],[qq(c2a2)],[qw(203e)],[qw(3e 20)],[qw(1000)],[qw(00c7)],[qw(00ff 0000)]){
$s = pack "H*",@{$_};
printf "0x%s\t=> %s", join("",@{$_}), encode_base64($s);
}
</pre>
perl6:
<pre>
use Enc::MIME::Base64;
say encode_base64_str("\xa2");
say encode_base64_str("\x203e");
</pre>
@leto

leto commented Dec 6, 2012

This is great! One tiny suggestion I have is: perhaps mention that Nolan Lum ( @nol888 ) was one of our most productive Google Code-In students and link to this Google blog post: http://google-opensource.blogspot.com/2011/03/parrot-foundation-reflects-on-google.html ?