defuse/algorithm.txt

## algorithm.txt
Goal:
    You're given a sequence of random alphanumeric characters (0-9a-zA-Z, 62
    possible characters), for example from a password generator. Convert it into
    a sequence of random *bits*.

    The output should have the property:

        The alphanumeric character RNG can be distinguished from random if and
        only if the alphanumeric character RNG, with the conversion algorithm
        attached, can be distinguished from random.

Attempt #1:

    output_lsb(n, b) { /* outputs the b least-significant bits of n */ }

    while(1) {
        // Get the next character from the alphanumeric generator.
        c = get_next_alphanumeric_character();
        // Get the 'index of the character in the character set'
        x = get_character_number(c); // a=0, b=1, ... A=26 ... 0=52 ... 9=62
        if (x < 32) {
            // x is random in [0,31]
            output_lsb(x, 5);
        } else {
            x = x - 32; // x is random in [0,29]
            if (x < 16) {
                // x is random in [0,15]
                output_lsb(x, 4);
            } else {
                x = x - 16; // x is random in [0,13]
                if (x < 8) {
                    // x is random in [0,7]
                    output_lsb(x, 3);
                } else {
                    x = x - 8; // x is random in [0,5]
                    if (x < 4) {
                        // x is random in [0,3]
                        output_lsb(x, 2);
                    } else {
                        x = x - 4; // x is random in [0,1]
                        output_lsb(x, 1)
                    }
                }
            }
        }
    }

    // Obviously you'd implement the inner part as a loop, but I unrolled it to
    // make it easier to understand.

Analysis:

    It is NOT true for this algorithm that if the output is secure then the
    underlying generator is secure.

    Proof:

        Counterexample: If the underlying generator always returned a random
        character between between 0 and 31 (a and F), the output of this
        algorithm would be random, but the underlying generator clearly isn't.

        (Note: This is only true if you see the output stream as a whole, and
        can't tell which branches were taken by the algorithm. Otherwise you
        could tell if you always saw 5 bits getting output at a time).

            Hint: Maybe we can make this true by explicitly including which
            branch was taken in the output?

            A trivial way to guarantee it is to make it easy to make it possible
            to invert the output stream back into the input characters, except
            I doubt this is possible because it would imply a way to generate
            random alphanumeric characters more efficiently than
            try-and-throw-away.

                (Question: Does the property in general imply that too?)

            But maybe it is...

                There are 5 calls to output_lsb(), with probabilities:

                    [ 0,31] | 32/62 | 0.51613
                    [32,47] | 16/62 | 0.25806
                    [48,55] |  8/62 | 0.12903
                    [56,59] |  4/62 | 0.06452
                    [60,61] |  2/62 | 0.03226

                This seems well suited for a Huffman coding....

                    Wait... if you collect 62 samples, counting the number for
                    each branch, the expected values are 32, 16, 8, 4, 2. Now,
                    by some kind of symmetry argument, it should be 50/50 that
                    the actual value is above/below the expected. So, after 62
                    samples, you could output 5 bits, one for each branch,
                    that's 0 if actual < expected, 1 if actual > expected.

                        But then what to do about actual == expected?

                            Ugh... seems like this is going to be an infinite
                            regress of things-not-being-powers-of-two.


    Conjecture: It IS true that if the underlying generator is secure then the
    output of this algorithm is secure.

    Proof:

        Assumption: If 0 < C < N is some constant and X is a random number
        between 0 and N (inclusive), then:
            if X < C
                X is a random number between 0 and C-1
            otherwise
                X-C is a random number between 0 and N-C

        ... then just follow the logic in the comments above, and you'll see
        that all the output_lsb() are outputing random bits.

Notes

    It's impossible for any *finite* sequence of bits, for the following reason:

    The algorithm outputs a discrete number of bits, and gets a discrete number
    of characters from the underlying generator. Say it has output X bits from
    K characters. Then one of the following is true:

        1. K * lg(62) > X
            This says that the output is ambiguous. There are at least two
            things the underlying RNG could have spat out to give the same
            output, and the bias in the underlying RNG can lie there, without
            there ever being a bias in the bit output.

            So algorithm security does not imply underling RNG security.

                (Note: Did I just unintentionally switch from computationally
                secure to information-theoretically secure?)

        2. K * lg(62) < X
            This says that X is non-random because it only contains K*lg(62)
            bits of entropy yet is actually more than that many bits.

            So underling RNG security does not imply algorithm security.

                (Note: It's not really an algorithm, since it never halts).

        Proof:

            (By contradiction)

                Suppose K * lg(62) = X for some integers K and X.

                Then lg(62) = X/K.

                Then 62 = 2^(X/K).

                Then 62^K = 2^(X/K)^K

                Then 62^K = 2^X

                This is impossible, since 2^X is not divisible by 31 (since
                there is no exponent on the 31 in the prime factorization), but
                62^K is, since 31 divides 62.
	Goal:
	You're given a sequence of random alphanumeric characters (0-9a-zA-Z, 62
	possible characters), for example from a password generator. Convert it into
	a sequence of random bits.

	The output should have the property:

	The alphanumeric character RNG can be distinguished from random if and
	only if the alphanumeric character RNG, with the conversion algorithm
	attached, can be distinguished from random.

	Attempt #1:

	output_lsb(n, b) { /* outputs the b least-significant bits of n */ }

	while(1) {
	// Get the next character from the alphanumeric generator.
	c = get_next_alphanumeric_character();
	// Get the 'index of the character in the character set'
	x = get_character_number(c); // a=0, b=1, ... A=26 ... 0=52 ... 9=62
	if (x < 32) {
	// x is random in [0,31]
	output_lsb(x, 5);
	} else {
	x = x - 32; // x is random in [0,29]
	if (x < 16) {
	// x is random in [0,15]
	output_lsb(x, 4);
	} else {
	x = x - 16; // x is random in [0,13]
	if (x < 8) {
	// x is random in [0,7]
	output_lsb(x, 3);
	} else {
	x = x - 8; // x is random in [0,5]
	if (x < 4) {
	// x is random in [0,3]
	output_lsb(x, 2);
	} else {
	x = x - 4; // x is random in [0,1]
	output_lsb(x, 1)
	}
	}
	}
	}
	}

	// Obviously you'd implement the inner part as a loop, but I unrolled it to
	// make it easier to understand.

	Analysis:

	It is NOT true for this algorithm that if the output is secure then the
	underlying generator is secure.

	Proof:

	Counterexample: If the underlying generator always returned a random
	character between between 0 and 31 (a and F), the output of this
	algorithm would be random, but the underlying generator clearly isn't.

	(Note: This is only true if you see the output stream as a whole, and
	can't tell which branches were taken by the algorithm. Otherwise you
	could tell if you always saw 5 bits getting output at a time).

	Hint: Maybe we can make this true by explicitly including which
	branch was taken in the output?

	A trivial way to guarantee it is to make it easy to make it possible
	to invert the output stream back into the input characters, except
	I doubt this is possible because it would imply a way to generate
	random alphanumeric characters more efficiently than
	try-and-throw-away.

	(Question: Does the property in general imply that too?)

	But maybe it is...

	There are 5 calls to output_lsb(), with probabilities:

	[ 0,31] \| 32/62 \| 0.51613
	[32,47] \| 16/62 \| 0.25806
	[48,55] \| 8/62 \| 0.12903
	[56,59] \| 4/62 \| 0.06452
	[60,61] \| 2/62 \| 0.03226

	This seems well suited for a Huffman coding....

	Wait... if you collect 62 samples, counting the number for
	each branch, the expected values are 32, 16, 8, 4, 2. Now,
	by some kind of symmetry argument, it should be 50/50 that
	the actual value is above/below the expected. So, after 62
	samples, you could output 5 bits, one for each branch,
	that's 0 if actual < expected, 1 if actual > expected.

	But then what to do about actual == expected?

	Ugh... seems like this is going to be an infinite
	regress of things-not-being-powers-of-two.


	Conjecture: It IS true that if the underlying generator is secure then the
	output of this algorithm is secure.

	Proof:

	Assumption: If 0 < C < N is some constant and X is a random number
	between 0 and N (inclusive), then:
	if X < C
	X is a random number between 0 and C-1
	otherwise
	X-C is a random number between 0 and N-C

	... then just follow the logic in the comments above, and you'll see
	that all the output_lsb() are outputing random bits.

	Notes

	It's impossible for any finite sequence of bits, for the following reason:

	The algorithm outputs a discrete number of bits, and gets a discrete number
	of characters from the underlying generator. Say it has output X bits from
	K characters. Then one of the following is true:

	1. K * lg(62) > X
	This says that the output is ambiguous. There are at least two
	things the underlying RNG could have spat out to give the same
	output, and the bias in the underlying RNG can lie there, without
	there ever being a bias in the bit output.

	So algorithm security does not imply underling RNG security.

	(Note: Did I just unintentionally switch from computationally
	secure to information-theoretically secure?)

	2. K * lg(62) < X
	This says that X is non-random because it only contains K*lg(62)
	bits of entropy yet is actually more than that many bits.

	So underling RNG security does not imply algorithm security.

	(Note: It's not really an algorithm, since it never halts).

	Proof:

	(By contradiction)

	Suppose K * lg(62) = X for some integers K and X.

	Then lg(62) = X/K.

	Then 62 = 2^(X/K).

	Then 62^K = 2^(X/K)^K

	Then 62^K = 2^X

	This is impossible, since 2^X is not divisible by 31 (since
	there is no exponent on the 31 in the prime factorization), but
	62^K is, since 31 divides 62.