## Thanks mauke!
21:13 < dbolser_> On another issue... I'm using LWP::Simple to grab this:
                  https://letstalkbitcoin.com/api/v1/forum/threads, which is
                  "Content-Type:application/json", however, when I decode_json
                  (using JSON), I get the error: malformed JSON string, neither
                  array, object, number, string or atom, at character offset 0
                  (before "\x{ef}\x{bb}\x{bf}{"...") at ./get_and_load_data.plx
                  line 24
21:14 < dngor> Maybe it's compressed.
21:14 < mauke> no, UTF-8 BOM
21:14 < mauke> a.k.a. malformed JSON
21:14 < dbolser_> https://gist.github.com/anonymous/a24ff7317bdd7dda54b8
21:14 < dbolser_> mauke: you mean it's a server side issue?
21:15 < mauke> "issue" ... I guess
21:15 < mauke> do you know what a BOM is?
21:15 < dbolser_> no
21:15 < dngor> Something you can't talk about at airports or in municipal
               buildings.
21:15 < dbolser_> FREEDOM!...
21:16 < thrig> also, gunpowder tea
21:16 < mauke> ok, this is going to be fun
21:16 < mauke> dbolser_: do you know what unicode is?
21:16 < dngor> I curl'd it through head -c and hexdump -C and I see what you
               mean about the BOM.
21:16 < dbolser_> mauke: only vaguely... as something I have to work around
                  when things stop being ascii
21:17 < dngor> tl;dr: $content =~ s/^[^{]*// first.
21:17 < dbolser_> ahhh...
21:17 < mauke> $content =~ s/^\x{ef}\x{bb}\x{bf}//;  # better
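
A minimal sketch of the fix suggested above, assuming the response body is held as raw bytes in `$content`. The payload here is a made-up stand-in for the real API response, and `JSON::PP` is used only because it ships with Perl; the original script used `JSON`.

```perl
use strict;
use warnings;
use JSON::PP qw(decode_json);

# Hypothetical stand-in for the bytes returned by LWP::Simple's get():
# a UTF-8 BOM followed by an otherwise ordinary JSON object.
my $content = "\xEF\xBB\xBF" . '{"threads":[]}';

# Strip exactly the three BOM bytes, and only at the very start.
$content =~ s/^\xEF\xBB\xBF//;

my $data = decode_json($content);
print ref($data), "\n";                    # prints "HASH"
print scalar @{ $data->{threads} }, "\n";  # prints "0"
```

Unlike `s/^[^{]*//`, this removes nothing but the BOM, so a response that starts with `[` or with unexpected junk is left alone and still fails loudly.
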
21:17  * dbolser_ runs off ignorant but happy
21:17 < mauke> and maybe report a bug to them
21:17 < mauke> because their "JSON" api returns shit
21:18 < dbolser_> mauke: what words should I pretend to understand in my bug
                  report?
21:18 < dngor> And I suppose pray that the payload isn't otherwise corrupt.
21:18 < mauke> dbolser_: unicode is a character set. it assigns numbers to
               characters
21:18 < mauke> it's a superset of ascii, so 'A' = 65 in both ascii and unicode
21:19 < dbolser_> what do you know! it works
21:19 < dbolser_> ok
21:19 < mauke> the difference is that ascii only has 128 characters (7 bits)
               but unicode has a lot more (21 bits)
21:19 < blooney> 21 bits?
21:19 < mauke> so the problem is: how do you actually turn those numbers into
               bytes so you can store them in files?
21:20 < cfedde> yeah. funny number.
21:20 < blooney> I thought that it was all in bytes
21:20 < mauke> this is where encodings come in
21:20 < dbolser_> ok... so far.. I think...
21:20  * dbolser_ goes to put daughter back to bed... she doesn't sleep!
21:20 < mauke> UTF-32 pads every 21-bit number with zeroes until you have a
               32-bit number, which is 4 bytes
21:21 < blooney> I mean, I was pretty sure that they just took the eight bit
                 that was used in other encoding and pushed there their weird
                 logic to indicate multi-byte characters and that stuff
21:21 < mauke> which you can then write to a file
21:21 < blooney> oh damn
21:21 < kerframil> dbolser: tell them to read the section on encoding in rfc
                   4627, and mention that utf-8 is always little endian
21:21  * blooney now has to rethink everything
21:22 < mauke> UTF-16 is a bit more complicated. characters that fit in 16 bits
               are kept as is; other characters are encoded as "surrogate pairs"
21:22 < cfedde> or just read the wikipedia page unless you need the gross
                details.
21:22 < mauke> that is, there's a special range of unicode codepoints that are
               not used for characters
21:22 < cfedde> utf-8 is pretty much the winner. for a number of reasons.
21:22 < mauke> but whatever
21:22 < mauke> UTF-8 is both trickier and simpler
21:23 < Grinnz_> just ask IRC
21:23 < mauke> 7-bit characters (i.e. ascii) are stored as is
21:23 < Grinnz_> well, IRC clients :)
21:23 < cfedde> At one end it is "just ascii" but it gets silly after than.
21:23 < cfedde> that
21:23 < ttkai> mmm ascii
21:23 < mauke> other characters are stored according to some variable-width
               encoding scheme; details omitted
21:24 < Grinnz_> IRC clients generally send that windows version of latin1, but
                 utf-8 encodes it if there's characters > 256
21:24 < Grinnz_> so the decoding is fun
21:24 < mauke> the issue with UTF-32 and UTF-16 is that they deal with 4 byte /
               2 byte entities, but there are two different ways to store them
               in files
21:24 < mauke> big endian and little endian!
21:24 < Grinnz_> oh god endianness
21:25 < mauke> so let's say your character has the number 43794 in unicode
21:25 < mauke> that's 0xAB12 in hex
21:25 < cfedde> things get messy when you try to preserve backward
                compatibility while supporting extension.
21:25 < blooney> why can't we just decide which endianness everyone will use?
21:25 < blue_sky> Grinnz_: female endians
21:26 < mauke> serializing that to bytes can give you either {AB, 12} or {12,
               AB}, depending on which endianness you're using
21:26 < cfedde> blooney: history.
21:26 < mauke> so there are two variants, UTF-16LE and UTF-16BE (same for
               UTF-32)
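
The byte orders described above can be checked with the core `Encode` module. This is a sketch only; the helper `hex_bytes` is a name invented for this illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Render a byte string as space-separated hex pairs.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf('%02X', ord($_)) } split //, $bytes;
}

my $char = "\x{AB12}";    # the code point 43794 from the discussion

print hex_bytes(encode('UTF-16BE', $char)), "\n";    # AB 12
print hex_bytes(encode('UTF-16LE', $char)), "\n";    # 12 AB
print hex_bytes(encode('UTF-32BE', $char)), "\n";    # 00 00 AB 12
print hex_bytes(encode('UTF-32LE', $char)), "\n";    # 12 AB 00 00
print hex_bytes(encode('UTF-8',    $char)), "\n";    # EA AC 92
```

The UTF-8 line has no LE/BE counterpart: the encoding is defined as a sequence of single bytes, so there is nothing to swap.
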
21:27 < mauke> so the next problem is, given a document that is in "UTF-16",
               how do you tell which endianness was used?
21:27 < average> mauke: I recently opened the Unicode book and I was horrified
                 by the many variants
21:27 < cfedde> It would have been nice if the authors of the encoding had put
                in a marker for this.
21:27 < average> mauke: about your question with the endianness to use, there
                 was some specific byte for that
21:27 < average> mauke: like cfedde says, the marker
21:27 < mauke> the trick that was used is to prepend the character 0xFEFF to
               the document
21:27 < average> BOM
21:27 < average> I think it was called BOM byte
21:28 < mauke> 0xFEFF is a "zero width no-break space", i.e. an invisible space
21:28 < average> http://en.wikipedia.org/wiki/Byte_order_mark
21:28 < mauke> so when you're reading the document and you see the bytes { FE,
               FF } you know it's big endian
21:28 < blue_sky> average: mauke isn't exactly being obtuse in his explanation,
                  let him get on with it.
21:28 < mauke> and if it's { FF, FE }, it's little endian
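
The BOM check just described could be sketched as follows. The function name `utf16_byte_order` is hypothetical, and the sketch deliberately ignores UTF-32, whose little-endian BOM also begins with FF FE.

```perl
use strict;
use warnings;

# Guess the byte order of a "UTF-16" document from its first two bytes.
# Caveat: FF FE 00 00 is also the UTF-32LE BOM; this sketch ignores UTF-32.
sub utf16_byte_order {
    my ($bytes) = @_;
    return 'big endian'    if $bytes =~ /\A\xFE\xFF/;
    return 'little endian' if $bytes =~ /\A\xFF\xFE/;
    return 'unknown';
}

print utf16_byte_order("\xFE\xFF\x00\x41"), "\n";    # big endian
print utf16_byte_order("\xFF\xFE\x41\x00"), "\n";    # little endian
print utf16_byte_order("\x00\x41"),         "\n";    # unknown
```
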
21:28 < blooney> "The Unicode Standard permits the BOM in UTF-8"
21:29 < _AxS_> kerframil: pink_mist: thanks!
21:29 < mauke> 0xFFFE is an invalid codepoint so there's no ambiguity
21:29 < cfedde> hed go BOM
21:30 < mauke> 0xFEFF at the start of the document is called a "byte order
               mark" (BOM)
21:30 < mauke> and it's a hack
21:30 < dbolser_> OK
21:30 < Juerd> It keeps popping up :(
21:30 < mauke> ok, so what happens if you add the character 0xFEFF to a
               document, but then encode it as UTF-8?
21:30 < anno> _AxS_: there's no need to switch to the Slic3r package.
              $Slic3r::var accesses it from anywhere
21:30 < dbolser_> I think I'm just going to paste this whole thread to the
                  website dev...
21:30  * average had to deal with this sort of thing recently, then realized
          there were libraries already handling this type of thing, so he just
          used those..
21:30 < mauke> the result is a string starting with the bytes {EF, BB, BF}
21:31 < mauke> it's valid UTF-8 and all
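
A small sketch of that claim using the core `Encode` module: U+FEFF becomes the three bytes EF BB BF in UTF-8, and decoding the bytes back to characters keeps the BOM as an ordinary first character. The two-character `{}` document is just a placeholder.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $text  = "\x{FEFF}" . '{}';    # BOM character plus a tiny placeholder document
my $bytes = encode('UTF-8', $text);

print join(' ', map { sprintf('%02X', ord($_)) } split //, $bytes), "\n";
                                                # EF BB BF 7B 7D
print length($bytes), "\n";                     # 5 (bytes)

my $chars = decode('UTF-8', $bytes);
print length($chars), "\n";                     # 3 (code points)
printf "U+%04X\n", ord(substr($chars, 0, 1));   # U+FEFF

# On a decoded string the BOM is one character, so strip it as such.
$chars =~ s/\A\x{FEFF}//;
print length($chars), "\n";                     # 2
```

So the document is shifted by one code point, which is three bytes on the wire.
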
21:31 < _AxS_> anno: the issue i was having is that I couldn't find where that
               path (the 'var' path) was set; for some reason grep failed me.
               I'm trying to override that as i don't want to put these image
               files in a subdir of /usr/bin
21:31 < dbolser_> ahh, but it throws the whole doc off by one byte
21:31 < dbolser_> ?
21:31 < mauke> it's just pointless as a BOM because UTF-8 has no byte order
               issues. there are no variants and no ambiguity
21:32 < Juerd> dbolser_: One codepoint, several bytes.
21:32 < blooney> mauke: "The Unicode Standard permits the BOM in UTF-8"
21:32 < mauke> dbolser_: the problem is that it's invalid in JSON
21:32 < pink_mist> blooney: so? it's still utterly useless in utf-8
21:32 < dbolser_> I can imagine!
21:33 < blooney> pink_mist: umm, but it's a standard...
21:33 < anno> _AxS_: yes, i know. kerframil's suggestion should work, but is a
              bit long-winded
21:33 < pink_mist> blooney: what? no it isn't. it's just permitted.
21:33 < tm604> blooney: The Unicode standard permits many things that aren't
               valid in JSON
21:33 < blooney> pink_mist: I mean it is permitted by standard. And if it is,
                 tools should not break when they see it
21:34 < mauke> blooney: nothing is breaking
21:34 < Juerd> blooney: "a" is valid Unicode. Just not valid JSON.
21:34 < Juerd> Without the quotes.
21:34 < Juerd> Otherwise it would be valid JSON :P
21:34 < blooney> ooh right
21:34 < pink_mist> haha
21:34 < blooney> ok then, kinda makes sense
21:34 < mauke> JSON only allows tabs, spaces, line feed, carriage return
               between tokens
21:34 < _AxS_> anno: i'm actually going to patch the 'our $var' setting in the
               .pm directly before I install it.  It uses FindBin, and swapping
               it to use ::RealBin instead of ::Bin will work just fine
21:34 < mauke> so the json decoder skips those and checks what the next
               character is
21:35 < Altreus> wait, the BOM counts as a character?
21:35 < sproingie> yes and no.  it's zero-width.
21:35 < mauke> and instead of [ or { it sees a "zero width no-break space", so
               it reports a syntax error
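
The decoder behaviour described above can be imitated in a few lines. This is not how any particular JSON module is implemented; `check_start` is a hypothetical helper that only performs the first step of skipping JSON whitespace and inspecting the next character.

```perl
use strict;
use warnings;

# Hypothetical helper: skip the four JSON whitespace characters, then
# report whether the next character can start an object or an array.
sub check_start {
    my ($text) = @_;
    my ($char) = $text =~ /\A[ \t\n\r]*(.)/s;
    return 'syntax error: empty document' unless defined $char;
    return "ok: starts with $char" if $char eq '{' or $char eq '[';
    return sprintf 'syntax error: unexpected U+%04X', ord($char);
}

print check_start(" \n\t" . '{"a":1}'),    "\n";    # ok: starts with {
print check_start("\x{FEFF}" . '{"a":1}'), "\n";    # syntax error: unexpected U+FEFF
```

The zero width no-break space is not in the whitespace list, so it is treated like any other stray character.
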
21:35 < Juerd> Altreus: "Character" is a confusing term. Usually in Unicode
               stuff, character means codepoint.
21:35 < Altreus> I would have thought turning utf8 into chars would remove the
                 BOM
21:35 < sproingie> it counts as a code unit, not a glyph
21:35 < mauke> Altreus: in UTF-8, yes. because UTF-8 has no BOM
21:35 < sproingie> er codepoint that is
21:35 < Juerd> See also "control characters" in ASCII. You may not consider
               them characters, but they're just called that anyway.
21:35 < Altreus> that's well confusing :P
21:36 < Altreus> I'm just going to never use it
21:36 < mauke> correct
21:36 < thrig> some of them are quite alarming
21:36 < Juerd> Altreus: Yes, the term "character" is a source of a lot of pain
               and confusion.
21:36 < mauke> BOMs also break unix scripts
21:36  * blue_sky is taking Altreus' side on UTF
21:36 < Altreus> 7?
21:36 < sproingie> a "character" is an abstract glyph in unicode-ese
21:36 < Altreus> 8 is OP. Nerf UTF8
21:37 < mst> mauke: but they make a great excuse for humming the start of the
             Toccata from Fugue in D minor
21:37 < Altreus> yea but utf8 is the layer above unicode
21:37 < mst> BOM BOM BOM .... BOM BOM BOM BOM *BOMMMM* *BOM*
21:37 < Juerd> Altreus: Perl 6 will have a configurable definition of
               "character". You can tell it whether you want graphemes,
               codepoints, bytes, ...
21:37 < sproingie> BOMbast
21:37 < mst> Juerd: because what unicodes needs is even more ways to do it
             wrong :D
21:37 < Juerd> In Perl 5, typically, a character is a codepoint, and in that
               way, a BOM is definitely a character.
21:37 < anno> Juerd: nice
21:37 < Altreus> Isn't it tocatta *and* fugue
21:37  * sproingie just listened to the Pirates of the Caribbean soundtrack,
          now there's nice bombastic tunes
21:38 < mauke> grapheme clusterbomb
21:38 < sproingie> strangely it's by Klaus Badelt, i always thought it was Hans
                   Zimmer
21:38 < Juerd> mst: Perl 6 will at the same time make doing it right a lot
               easier though :)
21:38 < Altreus> sproingie: he invented the walking aid
21:38 < Juerd> mst: But yea, I guess much more rope will be provided than ever
               before.
21:38 < mst> Juerd: I'm sure I'll still find a way to fuck it up
21:38 < dbolser_> mauke: many thanks