Skip to content

Instantly share code, notes, and snippets.

@defuse
Last active August 29, 2015 13:57
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save defuse/9907281 to your computer and use it in GitHub Desktop.
Save defuse/9907281 to your computer and use it in GitHub Desktop.
Random Characters to Random Bits
Goal:
You're given a sequence of random alphanumeric characters (0-9a-zA-Z, 62
possible characters), for example from a password generator. Convert it into
a sequence of random *bits*.
The output should have the property:
The alphanumeric character RNG can be distinguished from random if and
only if the alphanumeric character RNG, with the conversion algorithm
attached, can be distinguished from random.
Attempt #1:
output_lsb(n, b) { /* outputs the b least-significant bits of n */ }
while(1) {
// Get the next character from the alphanumeric generator.
c = get_next_alphanumeric_character();
// Get the 'index of the character in the character set'
x = get_character_number(c); // a=0, b=1, ... A=26 ... 0=52 ... 9=62
if (x < 32) {
// x is random in [0,31]
output_lsb(x, 5);
} else {
x = x - 32; // x is random in [0,29]
if (x < 16) {
// x is random in [0,15]
output_lsb(x, 4);
} else {
x = x - 16; // x is random in [0,13]
if (x < 8) {
// x is random in [0,7]
output_lsb(x, 3);
} else {
x = x - 8; // x is random in [0,5]
if (x < 4) {
// x is random in [0,3]
output_lsb(x, 2);
} else {
x = x - 4; // x is random in [0,1]
output_lsb(x, 1)
}
}
}
}
}
// Obviously you'd implement the inner part as a loop, but I unrolled it to
// make it easier to understand.
Analysis:
It is NOT true for this algorithm that if the output is secure then the
underlying generator is secure.
Proof:
Counterexample: If the underlying generator always returned a random
character between between 0 and 31 (a and F), the output of this
algorithm would be random, but the underlying generator clearly isn't.
(Note: This is only true if you see the output stream as a whole, and
can't tell which branches were taken by the algorithm. Otherwise you
could tell if you always saw 5 bits getting output at a time).
Hint: Maybe we can make this true by explicitly including which
branch was taken in the output?
A trivial way to guarantee it is to make it easy to make it possible
to invert the output stream back into the input characters, except
I doubt this is possible because it would imply a way to generate
random alphanumeric characters more efficiently than
try-and-throw-away.
(Question: Does the property in general imply that too?)
But maybe it is...
There are 5 calls to output_lsb(), with probabilities:
[ 0,31] | 32/62 | 0.51613
[32,47] | 16/62 | 0.25806
[48,55] | 8/62 | 0.12903
[56,59] | 4/62 | 0.06452
[60,61] | 2/62 | 0.03226
This seems well suited for a Huffman coding....
Wait... if you collect 62 samples, counting the number for
each branch, the expected values are 32, 16, 8, 4, 2. Now,
by some kind of symmetry argument, it should be 50/50 that
the actual value is above/below the expected. So, after 62
samples, you could output 5 bits, one for each branch,
that's 0 if actual < expected, 1 if actual > expected.
But then what to do about actual == expected?
Ugh... seems like this is going to be an infinite
regress of things-not-being-powers-of-two.
Conjecture: It IS true that if the underlying generator is secure then the
output of this algorithm is secure.
Proof:
Assumption: If 0 < C < N is some constant and X is a random number
between 0 and N (inclusive), then:
if X < C
X is a random number between 0 and C-1
otherwise
X-C is a random number between 0 and N-C
... then just follow the logic in the comments above, and you'll see
that all the output_lsb() are outputing random bits.
Notes
It's impossible for any *finite* sequence of bits, for the following reason:
The algorithm outputs a discrete number of bits, and gets a discrete number
of characters from the underlying generator. Say it has output X bits from
K characters. Then one of the following is true:
1. K * lg(62) > X
This says that the output is ambiguous. There are at least two
things the underlying RNG could have spat out to give the same
output, and the bias in the underlying RNG can lie there, without
there ever being a bias in the bit output.
So algorithm security does not imply underling RNG security.
(Note: Did I just unintentionally switch from computationally
secure to information-theoretically secure?)
2. K * lg(62) < X
This says that X is non-random because it only contains K*lg(62)
bits of entropy yet is actually more than that many bits.
So underling RNG security does not imply algorithm security.
(Note: It's not really an algorithm, since it never halts).
Proof:
(By contradiction)
Suppose K * lg(62) = X for some integers K and X.
Then lg(62) = X/K.
Then 62 = 2^(X/K).
Then 62^K = 2^(X/K)^K
Then 62^K = 2^X
This is impossible, since 2^X is not divisible by 31 (since
there is no exponent on the 31 in the prime factorization), but
62^K is, since 31 divides 62.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment