@dbolser
Created February 12, 2015 23:21
BOM
21:13 < dbolser_> On another issue... I'm using LWP::Simple to grab this:
https://letstalkbitcoin.com/api/v1/forum/threads, which is
"Content-Type:application/json", however, when I decode_json
(using JSON), I get the error: malformed JSON string, neither
array, object, number, string or atom, at character offset 0
(before "\x{ef}\x{bb}\x{bf}{"...") at ./get_and_load_data.plx
line 24
21:14 < dngor> Maybe it's compressed.
21:14 < mauke> no, UTF-8 BOM
21:14 < mauke> a.k.a. malformed JSON
21:14 < dbolser_> https://gist.github.com/anonymous/a24ff7317bdd7dda54b8
21:14 < dbolser_> mauke: you mean it's a server side issue?
21:15 < mauke> "issue" ... I guess
21:15 < mauke> do you know what a BOM is?
21:15 < dbolser_> no
21:15 < dngor> Something you can't talk about at airports or in municipal
buildings.
21:15 < dbolser_> FREEDOM!...
21:16 < thrig> also, gunpowder tea
21:16 < mauke> ok, this is going to be fun
21:16 < mauke> dbolser_: do you know what unicode is?
21:16 < dngor> I curl'd it through head -c and hexdump -C and I see what you
mean about the BOM.
21:16 < dbolser_> mauke: only vaguely... as something I have to work around
when things stop being ascii
21:17 < dngor> tl;dr: $content =~ s/^[^{]*// first.
21:17 < dbolser_> ahhh...
21:17 < mauke> $content =~ s/^\x{ef}\x{bb}\x{bf}//; # better
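A minimal sketch of mauke's substitution in context. The sample payload and the helper name strip_utf8_bom are made up for illustration, and JSON::PP is assumed here only because it ships with Perl; the log itself uses decode_json from JSON.

```perl
use strict;
use warnings;
use JSON::PP qw(decode_json);

# Hypothetical helper: remove a leading UTF-8 BOM (bytes EF BB BF), if present.
sub strip_utf8_bom {
    my ($bytes) = @_;
    $bytes =~ s/^\x{ef}\x{bb}\x{bf}//;
    return $bytes;
}

# Made-up stand-in for the raw bytes fetched from the API: BOM first, then JSON.
my $content = "\x{ef}\x{bb}\x{bf}" . '{"threads":[]}';

my $data = decode_json(strip_utf8_bom($content));
print scalar(@{ $data->{threads} }), " threads\n";
```

Unlike s/^[^{]*//, this only removes the three BOM bytes, so a response that starts with [ or with unexpected junk is left untouched and still fails loudly.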
21:17 * dbolser_ runs off ignorant but happy
21:17 < mauke> and maybe report a bug to them
21:17 < mauke> because their "JSON" api returns shit
21:18 < dbolser_> mauke: what words should I pretend to understand in my bug
report?
21:18 < dngor> And I suppose pray that the payload isn't otherwise corrupt.
21:18 < mauke> dbolser_: unicode is a character set. it assigns numbers to
characters
21:18 < mauke> it's a superset of ascii, so 'A' = 65 in both ascii and unicode
21:19 < dbolser_> what do you know! it works
21:19 < dbolser_> ok
21:19 < mauke> the difference is that ascii only has 128 characters (7 bits)
but unicode has a lot more (21 bits)
21:19 < blooney> 21 bits?
21:19 < mauke> so the problem is: how do you actually turn those numbers into
bytes so you can store them in files?
21:20 < cfedde> yeah. funny number.
21:20 < blooney> I thought that it was all in bytes
21:20 < mauke> this is where encodings come in
21:20 < dbolser_> ok... so far.. I think...
21:20 * dbolser_ goes to put daughter back to bed... she doesn't sleep!
21:20 < mauke> UTF-32 pads every 21-bit number with zeroes until you have a
32-bit number, which is 4 bytes
21:21 < blooney> I mean, I was pretty sure that they just took the eight bit
that was used in other encoding and pushed there their weird
logic to indicate multi-byte characters and that stuff
21:21 < mauke> which you can then write to a file
21:21 < blooney> oh damn
21:21 < kerframil> dbolser: tell them to read the section on encoding in rfc
4627, and mention that utf-8 has no byte order, so it needs no BOM
21:21 * blooney now has to rethink everything
21:22 < mauke> UTF-16 is a bit more complicated. characters that fit in 16 bits
are kept as is; other characters are encoded as "surrogate pairs"
21:22 < cfedde> or just read the wikipedia page unless you need the gross
details.
21:22 < mauke> that is, there's a special range of unicode codepoints that are
not used for characters
21:22 < cfedde> utf-8 is pretty much the winner. for a number of reasons.
21:22 < mauke> but whatever
21:22 < mauke> UTF-8 is both trickier and simpler
21:23 < Grinnz_> just ask IRC
21:23 < mauke> 7-bit characters (i.e. ascii) are stored as is
21:23 < Grinnz_> well, IRC clients :)
21:23 < cfedde> At one end it is "just ascii" but it gets silly after that.
21:23 < ttkai> mmm ascii
21:23 < mauke> other characters are stored according to some variable-width
encoding scheme; details omitted
21:24 < Grinnz_> IRC clients generally send that windows version of latin1, but
utf-8 encode it if there are characters > 255
21:24 < Grinnz_> so the decoding is fun
21:24 < mauke> the issue with UTF-32 and UTF-16 is that they deal with 4 byte /
2 byte entities, but there are two different ways to store them
in files
21:24 < mauke> big endian and little endian!
21:24 < Grinnz_> oh god endianness
21:25 < mauke> so let's say your character has the number 43794 in unicode
21:25 < mauke> that's 0xAB12 in hex
21:25 < cfedde> things get messy when you try to preserve backward
compatibility while supporting extension.
21:25 < blooney> why can't we just decide which endianness everyone will use?
21:25 < blue_sky> Grinnz_: female endians
21:26 < mauke> serializing that to bytes can give you either {AB, 12} or {12,
AB}, depending on which endianness you're using
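The two byte orders for 0xAB12 can be reproduced with pack, as a rough sketch. It treats the number as a single 16-bit unit, which matches UTF-16 only for characters that fit in 16 bits.

```perl
use strict;
use warnings;

my $codepoint = 0xAB12;   # 43794, the example number from the log

# 'n' packs a 16-bit unsigned integer big-endian, 'v' packs it little-endian.
my $be = pack('n', $codepoint);
my $le = pack('v', $codepoint);

printf "big endian:    %s\n", join ' ', map { sprintf '%02X', $_ } unpack('C*', $be);
printf "little endian: %s\n", join ' ', map { sprintf '%02X', $_ } unpack('C*', $le);
# big endian:    AB 12
# little endian: 12 AB
```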
21:26 < cfedde> blooney: history.
21:26 < mauke> so there are two variants, UTF-16LE and UTF-16BE (same for
UTF-32)
21:27 < mauke> so the next problem is, given a document that is in "UTF-16",
how do you tell which endianness was used?
21:27 < average> mauke: I recently opened the Unicode book and I was horrified
by the many variants
21:27 < cfedde> It would have been nice if the authors of the encoding had put
in a marker for this.
21:27 < average> mauke: about your question with the endianness to use, there
was some specific byte for that
21:27 < average> mauke: like cfedde says, the marker
21:27 < mauke> the trick that was used is to prepend the character 0xFEFF to
the document
21:27 < average> BOM
21:27 < average> I think it was called BOM byte
21:28 < mauke> 0xFEFF is a "zero width no-break space", i.e. an invisible space
21:28 < average> http://en.wikipedia.org/wiki/Byte_order_mark
21:28 < mauke> so when you're reading the document and you see the bytes { FE,
FF } you know it's big endian
21:28 < blue_sky> average: mauke isn't exactly being obtuse in his explanation,
let him get on with it.
21:28 < mauke> and if it's { FF, FE }, it's little endian
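The check described here, looking at the first two bytes of a "UTF-16" document, might be sketched like this. The helper name utf16_byte_order and the sample inputs are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: guess the UTF-16 byte order from the first two bytes.
sub utf16_byte_order {
    my ($bytes) = @_;
    return 'big endian'    if $bytes =~ /^\xFE\xFF/;
    return 'little endian' if $bytes =~ /^\xFF\xFE/;
    return 'unknown (no BOM)';
}

print utf16_byte_order("\xFE\xFF\x00\x41"), "\n";   # BOM, then 'A' stored as 00 41
print utf16_byte_order("\xFF\xFE\x41\x00"), "\n";   # BOM, then 'A' stored as 41 00
print utf16_byte_order("\x00\x41"), "\n";           # no BOM at all
```

This only covers UTF-16. A UTF-32LE document also starts with FF FE (followed by 00 00), so a real detector would need to look at four bytes.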
21:28 < blooney> "The Unicode Standard permits the BOM in UTF-8"
21:29 < _AxS_> kerframil: pink_mist: thanks!
21:29 < mauke> 0xFFFE is an invalid codepoint so there's no ambiguity
21:29 < cfedde> hed go BOM
21:30 < mauke> 0xFEFF at the start of the document is called a "byte order
mark" (BOM)
21:30 < mauke> and it's a hack
21:30 < dbolser_> OK
21:30 < Juerd> It keeps popping up :(
21:30 < mauke> ok, so what happens if you add the character 0xFEFF to a
document, but then encode it as UTF-8?
21:30 < anno> _AxS_: there's no need to switch to the Slic3r package.
$Slic3r::var accesses it from anywhere
21:30 < dbolser_> I think I'm just going to paste this whole thread to the
website dev...
21:30 * average had to deal with this sort of thing recently, then realized
there were libraries already handling this type of thing, so he just
used those..
21:30 < mauke> the result is a string starting with the bytes {EF, BB, BF}
21:31 < mauke> it's valid UTF-8 and all
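To see where {EF, BB, BF} comes from, here is a small sketch that encodes the character 0xFEFF with the core Encode module. The hex_bytes helper is made up for display purposes.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: render a byte string as space-separated hex pairs.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02X', $_ } unpack('C*', $bytes);
}

my $bom = "\x{FEFF}";   # the character used as a byte order mark

for my $encoding ('UTF-16BE', 'UTF-16LE', 'UTF-8') {
    printf "%-8s %s\n", $encoding, hex_bytes(encode($encoding, $bom));
}
# UTF-16BE FE FF
# UTF-16LE FF FE
# UTF-8    EF BB BF
```

The two UTF-16 results differ, which is what makes the mark useful there. UTF-8 has only one possible result, so the three bytes carry no byte-order information.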
21:31 < _AxS_> anno: the issue i was having is that I couldn't find where that
path (the 'var' path) was set; for some reason grep failed me.
I'm trying to override that as i don't want to put these image
files in a subdir of /usr/bin
21:31 < dbolser_> ahh, but it throws the whole doc off by one byte
21:31 < dbolser_> ?
21:31 < mauke> it's just pointless as a BOM because UTF-8 has no byte order
issues. there are no variants and no ambiguity
21:32 < Juerd> dbolser_: One codepoint, several bytes.
21:32 < blooney> mauke: "The Unicode Standard permits the BOM in UTF-8"
21:32 < mauke> dbolser_: the problem is that it's invalid in JSON
21:32 < pink_mist> blooney: so? it's still utterly useless in utf-8
21:32 < dbolser_> I can imagine!
21:33 < blooney> pink_mist: umm, but it's a standard...
21:33 < anno> _AxS_: yes, i know. kerframil's suggestion should work, but is a
bit long-winded
21:33 < pink_mist> blooney: what? no it isn't. it's just permitted.
21:33 < tm604> blooney: The Unicode standard permits many things that aren't
valid in JSON
21:33 < blooney> pink_mist: I mean it is permitted by standard. And if it is,
tools should not break when they see it
21:34 < mauke> blooney: nothing is breaking
21:34 < Juerd> blooney: "a" is valid Unicode. Just not valid JSON.
21:34 < Juerd> Without the quotes.
21:34 < Juerd> Otherwise it would be valid JSON :P
21:34 < blooney> ooh right
21:34 < pink_mist> haha
21:34 < blooney> ok then, kinda makes sense
21:34 < mauke> JSON only allows tabs, spaces, line feed, carriage return
between tokens
21:34 < _AxS_> anno: i'm actually going to patch the 'our $var' setting in the
.pm directly before I install it. It uses FindBin, and swapping
it to use ::RealBin instead of ::Bin will work just fine
21:34 < mauke> so the json decoder skips those and checks what the next
character is
21:35 < Altreus> wait, the BOM counts as a character?
21:35 < sproingie> yes and no. it's zero-width.
21:35 < mauke> and instead of [ or { it sees a "zero width no-break space", so
it reports a syntax error
21:35 < Juerd> Altreus: "Character" is a confusing term. Usually in Unicode
stuff, character means codepoint.
21:35 < Altreus> I would have thought turning utf8 into chars would remove the
BOM
21:35 < sproingie> it counts as a code unit, not a glyph
21:35 < mauke> Altreus: in UTF-8, yes. because UTF-8 has no BOM
21:35 < sproingie> er codepoint that is
21:35 < Juerd> See also "control characters" in ASCII. You may not consider
them characters, but they're just called that anyway.
21:35 < Altreus> that's well confusing :P
21:36 < Altreus> I'm just going to never use it
21:36 < mauke> correct
21:36 < thrig> some of them are quite alarming
21:36 < Juerd> Altreus: Yes, the term "character" is a source of a lot of pain
and confusion.
21:36 < mauke> BOMs also break unix scripts
21:36 * blue_sky is taking Altreus' side on UTF
21:36 < Altreus> 7?
21:36 < sproingie> a "character" is an abstract glyph in unicode-ese
21:36 < Altreus> 8 is OP. Nerf UTF8
21:37 < mst> mauke: but they make a great excuse for humming the start of the
Toccata from Fugue in D minor
21:37 < Altreus> yea but utf8 is the layer above unicode
21:37 < mst> BOM BOM BOM .... BOM BOM BOM BOM *BOMMMM* *BOM*
21:37 < Juerd> Altreus: Perl 6 will have a configurable definition of
"character". You can tell it whether you want graphemes,
codepoints, bytes, ...
21:37 < sproingie> BOMbast
21:37 < mst> Juerd: because what unicodes needs is even more ways to do it
wrong :D
21:37 < Juerd> In Perl 5, typically, a character is a codepoint, and in that
way, a BOM is definitely a character.
21:37 < anno> Juerd: nice
21:37 < Altreus> Isn't it toccata *and* fugue
21:37 * sproingie just listened to the Pirates of the Caribbean soundtrack,
now there's nice bombastic tunes
21:38 < mauke> grapheme clusterbomb
21:38 < sproingie> strangely it's by Klaus Badelt, i always thought it was Hans
Zimmer
21:38 < Juerd> mst: Perl 6 will at the same time make doing it right a lot
easier though :)
21:38 < Altreus> sproingie: he invented the walking aid
21:38 < Juerd> mst: But yea, I guess much more rope will be provided than ever
before.
21:38 < mst> Juerd: I'm sure I'll still find a way to fuck it up
21:38 < dbolser_> mauke: many thanks