@sipa
Last active March 17, 2017 09:06
@petertodd

The human readable part "bc" for mainnet, and "bctest" for testnet.

A problem with making the testnet prefix significantly longer than the mainnet prefix is that testnet UIs have to deal with significantly longer addresses than mainnet UIs.

I think we should use a same-length prefix such as "tb" instead.

@petertodd

petertodd commented Feb 4, 2017

@sipa "2" is also left out of zbase32, due to its similarity to "z".

How about we make the separation character "2" instead? Rationale: while "0" can be confused with "O" in handwriting, printed fonts, and speech ("oh"), the number "2" is only confused with "z" in handwriting.

@petertodd

Also, re: zbase32's argument against "2", I wonder if that's backed by any science; "5" and "s" can be confused for similar reasons.

And for that matter, "q" and "9" can be confused too, though probably not as often.

@gmaxwell

gmaxwell commented Feb 4, 2017

We've been talking about running a brief study on character errors. E.g. generate random base-36 strings and have users enter them, prompting them to go fast enough that some errors are made.

After a fair amount of effort I've been unable to find well-studied behavior on this (there is a paper from 1989 studying the visual similarity of an Apple II charset, some NIST tool for similarities with a seemingly random out-of-the-author's-rear confusion table, and so on).

What data we have doesn't support the zbase32 decisions very strongly, and even if we stay with zbase32, it looks like we can get a not-irrelevant gain from choosing the ordering correctly.

@sipa
Author

sipa commented Feb 4, 2017

Switched to separator '2', and fixed the 'errodes'.

@ecdsa

ecdsa commented Feb 5, 2017

Thanks @sipa.
Not completely offtopic: Are there similar plans to replace base58 with zbase32 for BIP32 extended public keys?

@sipa
Author

sipa commented Feb 5, 2017

Not completely offtopic: Are there similar plans to replace base58 with zbase32 for BIP32 extended public keys?

Not as part of this spec, as it's limited to 89 characters (which, minus 6 checksum characters, means only 83 data characters or 415 bits... not enough for an extended pubkey).

I'm looking to find a good 12-character checksum that is usable for longer data or data that needs stronger error detection (like extended pubkeys, private keys, stealth addresses, CT addresses, ...). Such a code would have random detection failure chance of one in a quintillion, and guarantee detecting 6 or 7 errors.
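The arithmetic behind these figures can be checked with a rough sketch. It assumes 5 bits per character (a 32-symbol alphabet) and takes the 78-byte size of a serialized extended key from BIP32, which is not stated in this thread; the variable names are illustrative only.

```python
# Sketch of the capacity arithmetic, assuming 5 bits per base-32 character.
BITS_PER_CHAR = 5

max_chars = 89            # overall length limit mentioned above
checksum_chars = 6        # checksum length mentioned above
data_chars = max_chars - checksum_chars
data_bits = data_chars * BITS_PER_CHAR

# Assumption taken from BIP32, not from this thread: a serialized
# extended key is 78 bytes long.
xpub_bits = 78 * 8

# A 12-character checksum spans 32**12 = 2**60 possible values, so a random
# corruption would go undetected with a chance of roughly 1 in 1.15e18.
checksum_space = 32 ** 12

print(data_chars, data_bits, xpub_bits, checksum_space)
```

Under these assumptions the extended key needs 624 bits, well over the 415 available.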

@gmaxwell

gmaxwell commented Feb 6, 2017

A permutation that optimizes for bit error propagation for ASCII is "jhz8nd6tki39om74bxrpfewucys1ga5q". Hamming-1 errors in the ASCII result in Hamming distance 1 43 times, 2 19 times, and 3 once.

The current ordering results in {1: 11, 2: 22, 3: 21, 4: 7, 5: 2}
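One plausible reading of this metric, sketched below: take every pair of characters in the set whose ASCII codes differ in exactly one bit, and histogram the Hamming distance between their 5-bit values (their positions in the ordering). The function name and this interpretation are assumptions, not something the thread spells out.

```python
from collections import Counter
from itertools import combinations


def bit_error_propagation(charset):
    """Histogram of 5-bit Hamming distances caused by single ASCII bit flips.

    For each unordered pair of characters whose ASCII codes differ in exactly
    one bit, count the number of differing bits between their indices in
    `charset`. This is an assumed interpretation of the metric.
    """
    histogram = Counter()
    for (i, a), (j, b) in combinations(enumerate(charset), 2):
        if bin(ord(a) ^ ord(b)).count("1") == 1:
            histogram[bin(i ^ j).count("1")] += 1
    return histogram


ordering = "jhz8nd6tki39om74bxrpfewucys1ga5q"
print(sorted(bit_error_propagation(ordering).items()))
```

Under this reading the character set has 63 such pairs, which matches the totals of both quoted distributions (43 + 19 + 1 and 11 + 22 + 21 + 7 + 2).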

@sipa
Author

sipa commented Feb 7, 2017

Short notice: I've found a much faster way to verify that a code has good qualities, and there may be a code that guarantees detecting 5 errors when searching in a wider class. I'm starting such a search now, and it may take a few days. The actual checksum may change then (or not).

@sipa
Author

sipa commented Feb 9, 2017

I have changed:

  • Testnet HRP changed from "bctest" to "tb" (and added rationale)
  • Changed z-base-32 ordering to @gmaxwell's ordering (and added rationale).
  • Fixed the erroneous claim that we were using a^5 + a^2 + 1 as modulus for the base field (it is a^5 + a^3 + 1).
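
For readers unfamiliar with the notation, here is a minimal sketch of multiplication in the 32-element base field with modulus a^5 + a^3 + 1. Elements are represented as 5-bit integers whose bits are polynomial coefficients; the function name is hypothetical and this is not the reference implementation.

```python
MODULUS = 0b101001  # a^5 + a^3 + 1


def gf32_mul(x, y):
    """Multiply two GF(32) elements given as integers in range(32)."""
    # Carry-less (polynomial) multiplication over GF(2).
    product = 0
    for i in range(5):
        if (y >> i) & 1:
            product ^= x << i
    # Reduce the degree-8-or-lower result modulo a^5 + a^3 + 1.
    for bit in range(8, 4, -1):
        if (product >> bit) & 1:
            product ^= MODULUS << (bit - 5)
    return product


# a * a^4 = a^5, which reduces to a^3 + 1 (binary 01001).
print(gf32_mul(0b00010, 0b10000))
```

Because the modulus is irreducible, every nonzero element has a multiplicative inverse, which is easy to confirm by brute force over the 31 nonzero values.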

After analysing around 5% of the whole space of 2^30 6-character generators, it doesn't seem like there are any with significantly better properties, so I think I'll stay with the current BCH code.

@petertodd

@sipa Looking good!

@sipa
Author

sipa commented Feb 11, 2017

I removed the reference implementations apart from those that actually specify the checksum. I'll add some reference implementations in a few languages separately later.

@petertodd

@sipa Minor nit: in the examples section, I'd remove the trailing periods at the end of each address type.

@petertodd

@sipa There doesn't seem to be a license on https://github.com/sipa/ezbase32/

I'd suggest you add one for clarity prior to publication. Doesn't need to be the same license as this document - heck it could even be proprietary as it's not part of the standard - but we should have something to make things clear.

@rustyrussell

Dislike tb since every altcoin will copy and use a 2-char prefix for their testnet. But I said I'd stop bike-shedding :)

@petertodd

@rustyrussell That may be better than every altcoin copying us and using bc-test because they're too lazy to change it. :)

@sipa
Author

sipa commented Feb 14, 2017

@rustyrussell @petertodd I think both your points are interesting, but I don't know. For GUI/Mobile applications, the size of addresses may become an issue, and needing to support testnet if it's a different size may require extra UI changes, which would be unfortunate. On the other hand, 2 characters is not much to distinguish with.

@gmaxwell found https://hissa.nist.gov/~black/GTLD/ with numbers of visual similarity between characters (though not particularly well researched, it's more extensive than anything else). We used that to both optimize the character set and the ordering. As a result, the separator character is now a '1'. We're still running an analysis to find a better ordering, but unless there are complaints about this approach, I expect to stop changing the scheme today or tomorrow.

@gmaxwell

We should require the base32 part be all upper or all lower and reject if it is mixed.

@sipa
Author

sipa commented Feb 15, 2017

We should require the base32 part be all upper or all lower and reject if it is mixed.

Already done.
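
A minimal sketch of such a mixed-case check, for illustration only (the function name is hypothetical and this is not the spec's reference code):

```python
def has_consistent_case(text):
    """Return True if `text` is entirely lowercase or entirely uppercase.

    Digits have no case, so they are accepted in either form.
    """
    return text == text.lower() or text == text.upper()


print(has_consistent_case("abc1"), has_consistent_case("ABC1"), has_consistent_case("aBc1"))
```

A decoder would reject the input when this returns False, and otherwise normalize to one case before further checks.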

@sipa There doesn't seem to be a license on https://github.com/sipa/ezbase32/

Right, good point. My plan was to create a new repo (called bech32), and just put the latest and relevant code snippets and reference code there, but not all historical analyses and data that would just be confusing.

@petertodd

@sipa Currently testnet/mainnet addresses are distinguished by just one character, so arguably two is twice as good. :)

Re: visual similarity, that NIST site isn't optimizing for the right things: as far as I can tell it's based on printed fonts, rather than handwriting. We're concerned about both here, and probably more the latter than the former.

Re: the bech32 repo, it wouldn't be a bad idea to leave the historical analyses/data in Git history, while leaving it out of the more recent commit tree.

@petertodd

I'm really puzzled as to why "b" is now left out - it's very distinct in handwriting, while "2"/"z" definitely isn't.

Also, note that that NIST site says: "The algorithm, consisting of the distance measure, the scoring formula, and the character similarities are mostly just my estimations." <- i.e. there's no science behind the character similarity scores.

@sipa
Author

sipa commented Feb 15, 2017

@petertodd From the data, there are potential confusions between 'b' and ['6', 'd', 'h'], and between 'B' and ['8', 'E', 'P', 'R']. I'm aware that there is nothing scientific about it, but it's the most extensive similarity data found anywhere.

@gmaxwell

gmaxwell commented Feb 16, 2017

I don't think handwriting is at all our primary target: the primary lossy path is to read from a screen, say it out loud, hear it, and type it in. I think handwriting is certainly secondary, especially since any handwritten representation will start with a screen printed step and end with typing.

The data we really want doesn't exist. I'd be happy to fund mechanical turking it (and I'm sure that many in our community would be willing to participate to create ground truth data...) but I know nothing about the relevant APIs and loathe web programming.

The experiment would be something like showing random base36 strings (each string all upper or all lower) and asking users to type them in, encouraging them to be fast so the error rate isn't zero. Handwriting could be covered by having them write the strings and take pictures, with someone else transcribing, but I think we should consider handwriting more out of scope. (Also, 2/Z is completely unambiguous handwritten if you stroke the Zs.)
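
The core of such an experiment could be sketched as below: generate single-case random base-36 prompts and tally which characters get substituted for which. Everything here (function names, the position-by-position comparison, ignoring entries of the wrong length) is an illustrative assumption, not an agreed design.

```python
import random
import string
from collections import Counter

ALPHABET = string.digits + string.ascii_lowercase  # 36 symbols


def random_base36_string(length, rng):
    """Return a random base-36 string that is all lowercase or all uppercase."""
    text = "".join(rng.choice(ALPHABET) for _ in range(length))
    return text.upper() if rng.random() < 0.5 else text


def tally_confusions(shown, typed):
    """Count (shown, typed) character substitutions, ignoring case.

    Only same-length entries are compared position by position; insertions
    and deletions would need an alignment step that is omitted here.
    """
    confusions = Counter()
    if len(shown) != len(typed):
        return confusions
    for expected, actual in zip(shown.lower(), typed.lower()):
        if expected != actual:
            confusions[(expected, actual)] += 1
    return confusions


rng = random.Random(2017)
prompt = random_base36_string(12, rng)
print(prompt, tally_confusions("ab0x", "abox"))
```

Summing these tallies over many participants would give an empirical confusion table of the kind the thread says is missing.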

@petertodd

Hmm, I think you both made good points, so I'll accept it the way it is.

Anyway, come to think of it the real case where handwritten addresses come up is private keys - not public keys - and for that we already use safer and much more verbose encoding; I probably had that use-case (incorrectly) in the back of my head.

@petertodd

petertodd commented Feb 16, 2017

@gmaxwell Can I quote you on that? Specifically, the exact phrase "if you stroke the Zs"? :P
FWIW I asked around my design friends for research on this stuff, and all I got back was some papers on OCR! They did know of similar visual similarity problems, but it sounds like in other fields it's more focused on issues like shapes of switches and knobs and the like (e.g. aviation). I wonder if part of the thinking is that if you're transcribing text, all hope is lost already by their standards...

@baryluk

baryluk commented Feb 17, 2017

What about the use case of entering addresses by hand (without QR scanning) on ATM-style machines for buying bitcoin? It might be error-prone to do it correctly on the first go, especially on low-quality touchscreens. I wouldn't say that speaking (i.e. over the phone) or handwriting are the only good use cases. Also, having just 32 characters to choose from on a screen, instead of lower and upper case and a full alphanumeric keyboard, would allow for much quicker input and bigger on-screen keys, reducing the risk of touching the wrong one. In fact this can apply to all touchscreen-based systems where we do not copy the address automatically (i.e. QR scan, copy-paste, NFC, etc.) but get it from somewhere else (be it somebody speaking to us, or us reading it from paper, even a printed one with clear labels, etc.).

I am still not sure what the exact layout of the 32 characters on a screen would be. Or maybe it should be a full QWERTY keyboard (most likely with upper-case characters) + digits (+ backspace), doing the full ambiguity conversion in software transparently.

/me friend of sipa.

@gmaxwell

gmaxwell commented Feb 24, 2017

You need a passing vector that begins with at least 8 zero bits in the witness hash. (maybe an all zeros one would be good)

@sipa
Author

sipa commented Feb 26, 2017

@gmaxwell Added one with 3 zeroes.

@gmaxwell

gmaxwell commented Mar 3, 2017

Immunity to easily confused non-address sequences.

Transaction IDs or other similar encoded data may be easily confused for addresses by users. This specification provides for improved resistance to this common class of confusions beyond what is provided by the checksum.

The minimal padding requirement in this specification means that no input with length under 11, over 74, or congruent to 0, 3, or 5 mod 8 can ever be mistaken as a valid Bitcoin address. This means that no common hex sequence length (8, 16, 32, 40, or 64 characters) would be accepted by this specification.
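
The length rule as worded in this draft can be written as a small predicate. This is only a sketch of the rule exactly as stated above (the function name is hypothetical), not a full validity check.

```python
def length_may_be_valid(length):
    """Apply the draft's length rule: reject lengths under 11, over 74,
    or congruent to 0, 3, or 5 modulo 8."""
    if length < 11 or length > 74:
        return False
    return length % 8 not in (0, 3, 5)


# Common hex string lengths are all multiples of 8, so all are rejected.
for hex_length in (8, 16, 32, 40, 64):
    print(hex_length, length_may_be_valid(hex_length))
```

Passing this predicate is necessary but not sufficient: the character set, prefix, and checksum still have to match.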

Similarly, no string with the common base64 maximum line length of 76 characters can be interpreted as an address.

A short base64-encoded string (with length 12, 16, 20, 24, 28, 32, 36, 40, 44, 48, 52, 56, 60, 64, 68, or 72) could potentially be misinterpreted as an address; however, the probability of this happening for any uniformly selected random base64 string is never greater than 1 : 2^60, due to the improbability of matching the prefix and the partial overlap of the character sets.
.....

or something like that.

@gmaxwell

gmaxwell commented Mar 7, 2017

I just supported someone today who was running into an invalid address error while trying to send funds. It looks like his chat software was adding invisible characters, which the RPC was then rejecting, but which were just ignored when sent to bc.i. We might want to include a vector with such a character and advise that UIs should do something useful in those cases.
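
One "useful" thing a UI could do is point out the offending characters instead of silently ignoring them or failing with a generic error. A minimal sketch, assuming addresses should contain only visible ASCII; the function name and the sample string are made up for illustration.

```python
import unicodedata


def find_suspicious_characters(text):
    """Return (index, code point, Unicode category) for every character
    outside the visible ASCII range 0x21..0x7E."""
    suspicious = []
    for index, ch in enumerate(text):
        if not 0x21 <= ord(ch) <= 0x7E:
            suspicious.append((index, f"U+{ord(ch):04X}", unicodedata.category(ch)))
    return suspicious


# Hypothetical pasted input containing a zero-width space (U+200B).
pasted = "tb1examp\u200ble"
print(find_suspicious_characters(pasted))
```

The UI could then either strip such characters with a visible warning or reject the input and name the position, rather than leaving the user with an unexplained failure.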
