This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
#!/usr/bin/env python3 | |
# From the introduction, pp. 5-6. | |
# Two approaches to generate anagrams. | |
from collections import Counter | |
from functools import cmp_to_key | |
from math import log2, factorial | |
from string import ascii_lowercase # not very internationally-minded, but it'll do for these purposes | |
import random | |
import statistics # because we'll be doing some | |
# Pick some random string or other. | |
s = ascii_lowercase[:8] # There are 40320 potential anagrams here | |
def approach_1(s): | |
s = list(s) | |
for i in range(len(s)): | |
j = random.randrange(len(s)) # This introduces bias | |
s[i], s[j] = s[j], s[i] | |
return "".join(s) | |
def random_cmp(a, b): | |
return 0.5 - random.random() | |
def approach_2(s): | |
return "".join(sorted(list(s), key=cmp_to_key(random_cmp))) | |
def uniform(s): | |
"""A better (fsvo) way that should produce a uniform distribution""" | |
s = list(s) | |
random.shuffle(s) # A uniform shuffle; the implementation is *very* nearly the same as approach #1 | |
return "".join(s) | |
results = {alg: Counter() for alg in (approach_1, approach_2, uniform)} | |
# Give it a million attempts | |
for _ in range(1000000): | |
for alg in results: | |
results[alg][alg(s)] += 1 | |
# Summarise results | |
# How many different results did we get? | |
print(len(s), "characters have a total potential anagram set of", factorial(len(s)), | |
"; version one selects from a space of", len(s) ** len(s), | |
"; approach 2 from a space of", 2 ** (len(s) * log2(len(s)))) | |
for alg, r in results.items(): | |
print(alg.__name__, "gives", len(r), "unique outputs after a million trials with a std dev of", statistics.stdev(r.values())) | |
# Sample a uniform distribution and work out its std dev for comparison; it should come out close to the `uniform` approach. | |
u = random.choices(range(40320), k=1000000) | |
cu = Counter(u) | |
print("uniform distribution over 40320 possible outcomes, 1000000 selections, gives ", statistics.stdev(cu.values())) |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
There's a simple counting argument here as to the bias in the two approaches. Both might be suitable for | |
a follow-up question during an interview. | |
For the first approach, we make `n` swaps, each picking a target from `n` elements. There are thus `n^n` | |
potential paths through the algorithm. However, there are `n!` distinct outcomes. Because the former doesn't | |
divide cleanly by the latter, the argument is that there *must* be some bias in the outcome. (This is evident | |
even when we're producing anagrams of a simple string like "abc".) | |
For the second approach, we rely on the fact that the sorting algorithm is likely to have been picked with a | |
runtime of `n . log_2 n`. That is, that's the number of comparisons it'll make, give-or-take. Each time the | |
sort algorithm makes a comparison, it'll perform one of two operations: swap, or don't swap. Thus, there are | |
`2 ^ (n log_2 n)` potential paths through the algorithm - or again, `n^n`. However, sorting algorithms try | |
hard to not make comparisons when they "know" what the result would be; thus, this approach produces a much | |
stronger bias in the sampled outcomes. Indeed, some potential anagrams of eight characters don't show up at all | |
during the sample run[*]. | |
[*] Assuming the sort function isn't broken, all permutations *can* turn up - we'd just require far more runs | |
before we expected to see the extreme cases. | |
I guess you might also question whether the requirement for uniform distribution is desirable; at the other | |
end of the spectrum, sorting all characters (or just returning the input string) may suffice. (If the requirement | |
that the order of characters is different from the input, then of course there are some inputs for which | |
that requirement is unsatisfiable.) There may be other versions of "better," for instance, producing letter-orderings | |
which produce more challenging anagrams for a human solver. | |
In typical i18n-blindness, all approaches here don't deal correctly with combining characters and other tricky | |
unicode features. | |
On the question of which of those approaches is "stronger": the first approach produces outputs which are closer to | |
uniform. Additionally, it requires a far smaller change to bring its output to a strictly uniform distribution. | |
The second approach may be "clever", but it's by no means clear that it could be adjusted, should uniformity of | |
output be desired. |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
# A sample run | |
8 characters have a total potential anagram set of 40320 ; version one selects from a space of 16777216 ; approach 2 from a space of 16777216.0 | |
approach_1 gives 40320 unique outputs after a million trials with a std dev of 9.796132039833688 | |
approach_2 gives 40205 unique outputs after a million trials with a std dev of 81.79639616460702 | |
uniform gives 40320 unique outputs after a million trials with a std dev of 4.98372732942099 | |
uniform distribution over 40320 possible outcomes, 1000000 selections, gives 4.997166037344494 |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment