@jan-g
Last active January 20, 2020 13:38
#!/usr/bin/env python3
# From the introduction, pp. 5-6.
# Two approaches to generate anagrams.
from collections import Counter
from functools import cmp_to_key
from math import log2, factorial
from string import ascii_lowercase # not very internationally-minded, but it'll do for these purposes
import random
import statistics # because we'll be doing some
# Pick some random string or other.
s = ascii_lowercase[:8] # There are 40320 potential anagrams here
def approach_1(s):
    s = list(s)
    for i in range(len(s)):
        j = random.randrange(len(s))  # This introduces bias
        s[i], s[j] = s[j], s[i]
    return "".join(s)
def random_cmp(a, b):
    # A coin-flip comparator: positive or negative at random, so each of the
    # sort's decisions is random. (cmp_to_key only inspects the sign.)
    return 0.5 - random.random()
def approach_2(s):
    return "".join(sorted(list(s), key=cmp_to_key(random_cmp)))
def uniform(s):
    """A better (fsvo) way that should produce a uniform distribution"""
    s = list(s)
    random.shuffle(s)  # A uniform shuffle; the implementation is *very* nearly the same as approach #1
    return "".join(s)
results = {alg: Counter() for alg in (approach_1, approach_2, uniform)}
# Give it a million attempts
for _ in range(1000000):
    for alg in results:
        results[alg][alg(s)] += 1
# Summarise results
# How many different results did we get?
print(len(s), "characters have a total potential anagram set of", factorial(len(s)),
      "; version one selects from a space of", len(s) ** len(s),
      "; approach 2 from a space of", 2 ** (len(s) * log2(len(s))))
for alg, r in results.items():
    print(alg.__name__, "gives", len(r), "unique outputs after a million trials with a std dev of",
          statistics.stdev(r.values()))
# Sample a uniform distribution and work out its std dev for comparison; it should come out close to the `uniform` approach.
u = random.choices(range(40320), k=1000000)
cu = Counter(u)
print("uniform distribution over 40320 possible outcomes, 1000000 selections, gives ", statistics.stdev(cu.values()))
There's a simple counting argument for the bias in each of the two approaches; either would make
a suitable follow-up question during an interview.
For the first approach, we make `n` swaps, each picking a target from among `n` elements. There are
thus `n^n` equally likely paths through the algorithm, but only `n!` distinct outcomes. Because `n!`
does not divide `n^n` evenly (for `n > 2`), some outcomes *must* occur more often than others. (This
is evident even when producing anagrams of a string as simple as "abc"; see the sketch below.)
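As a quick check of that claim (a sketch, not part of the original gist), we can enumerate all
3^3 = 27 equally likely swap sequences for "abc" and tally the outcomes:

from collections import Counter
from itertools import product

def swap_path(s, js):
    # Apply approach 1's swaps, with js fixing each "random" choice of j.
    s = list(s)
    for i, j in enumerate(js):
        s[i], s[j] = s[j], s[i]
    return "".join(s)

print(Counter(swap_path("abc", js) for js in product(range(3), repeat=3)))
# Counts: {'acb': 5, 'bac': 5, 'bca': 5, 'abc': 4, 'cab': 4, 'cba': 4} (display order may vary)

Three of the six permutations each arise from 5 of the 27 paths, the other three from only 4:
27 paths can't spread evenly over 6 outcomes.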
For the second approach, we rely on the fact that the sorting algorithm was likely chosen to make
on the order of `n log_2 n` comparisons. Each time the sort makes a comparison, it performs one of
two operations: swap, or don't swap. Thus there are at most `2^(n log_2 n)` potential paths through
the algorithm; that is, `n^n` again. However, sorting algorithms try hard not to make comparisons
whose results they already "know"; as a result, this approach produces a much stronger bias in the
sampled outcomes (a quick instrumentation follows the footnote). Indeed, some potential anagrams of
eight characters don't show up at all during the sample run[*].
[*] Assuming the sort function isn't broken, all permutations *can* turn up; we'd just need far more
runs before we expected to see the extreme cases.
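We can also count how many comparisons the sort actually makes (again a sketch, not part of the
gist; `counted_cmp` and `calls` are our names):

from functools import cmp_to_key
import random

calls = [0]

def counted_cmp(a, b):
    # random_cmp, instrumented to count how often the sort consults it.
    calls[0] += 1
    return 0.5 - random.random()

sorted("abcdefgh", key=cmp_to_key(counted_cmp))
print(calls[0])  # typically below n * log2(n) = 24 comparisons for n = 8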
You might also question whether a uniform distribution is even desirable; at the other end of the
spectrum, sorting all the characters (or just returning the input string) may suffice. (If the
requirement is that the order of the characters differ from the input, then of course there are
some inputs, such as "aaa", for which that requirement is unsatisfiable.) There may be other
versions of "better": for instance, preferring letter-orderings that make more challenging anagrams
for a human solver.
In typical i18n-blindness, none of the approaches here deals correctly with combining characters
and other tricky Unicode features.
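For example (a sketch of the problem, not part of the original script): in NFD normalisation, "é"
is an 'e' followed by a combining acute accent (U+0301), and a character-level shuffle can detach
the accent:

import random
import unicodedata

s = unicodedata.normalize("NFD", "café")  # ['c', 'a', 'f', 'e', '\u0301']
chars = list(s)
random.shuffle(chars)
print("".join(chars))  # the accent may now sit on a different letter entirely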
On the question of which of the two approaches is "stronger": the first produces outputs that are
closer to uniform. It also needs only a one-line change to produce a strictly uniform distribution,
as sketched below. The second approach may be "clever", but it's by no means clear how it could be
adjusted, should uniformity of output be desired.
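For the record, the one-line change: draw each swap target `j` from the not-yet-fixed suffix rather
than the whole list. That turns approach 1 into the Fisher-Yates shuffle, which is essentially what
`random.shuffle` does. A sketch (`approach_1_fixed` is our name, not the gist's):

def approach_1_fixed(s):
    s = list(s)
    for i in range(len(s)):
        j = random.randrange(i, len(s))  # j drawn from [i, n) rather than [0, n)
        s[i], s[j] = s[j], s[i]
    return "".join(s)

Now there are n * (n-1) * ... * 1 = n! equally likely paths, exactly one per permutation, so the
output is uniform.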
# A sample run
8 characters have a total potential anagram set of 40320 ; version one selects from a space of 16777216 ; approach 2 from a space of 16777216.0
approach_1 gives 40320 unique outputs after a million trials with a std dev of 9.796132039833688
approach_2 gives 40205 unique outputs after a million trials with a std dev of 81.79639616460702
uniform gives 40320 unique outputs after a million trials with a std dev of 4.98372732942099
uniform distribution over 40320 possible outcomes, 1000000 selections, gives 4.997166037344494