@jan-g
Last active January 20, 2020 13:38
#!/usr/bin/env python3
# From the introduction, pp. 5-6.
# Two approaches to generate anagrams.
from collections import Counter
from functools import cmp_to_key
from math import log2, factorial
from string import ascii_lowercase # not very internationally-minded, but it'll do for these purposes
import random
import statistics # because we'll be doing some
# Pick some random string or other.
s = ascii_lowercase[:8] # There are 40320 potential anagrams here
def approach_1(s):
    s = list(s)
    for i in range(len(s)):
        j = random.randrange(len(s))  # This introduces bias
        s[i], s[j] = s[j], s[i]
    return "".join(s)
def random_cmp(a, b):
    # A coin-flip comparator: positive or negative at random, so each of the
    # sort's decisions is random. (cmp_to_key only inspects the sign.)
    return 0.5 - random.random()
def approach_2(s):
    return "".join(sorted(list(s), key=cmp_to_key(random_cmp)))
def uniform(s):
    """A better (fsvo) way that should produce a uniform distribution"""
    s = list(s)
    random.shuffle(s)  # A uniform shuffle; the implementation is *very* nearly the same as approach #1
    return "".join(s)
results = {alg: Counter() for alg in (approach_1, approach_2, uniform)}
# Give it a million attempts
for _ in range(1000000):
    for alg in results:
        results[alg][alg(s)] += 1
# Summarise results
# How many different results did we get?
print(len(s), "characters have a total potential anagram set of", factorial(len(s)),
      "; version one selects from a space of", len(s) ** len(s),
      "; approach 2 from a space of", 2 ** (len(s) * log2(len(s))))
for alg, r in results.items():
    print(alg.__name__, "gives", len(r), "unique outputs after a million trials with a std dev of",
          statistics.stdev(r.values()))
# Sample a uniform distribution and work out its std dev for comparison; it should come out close to the `uniform` approach.
u = random.choices(range(40320), k=1000000)
cu = Counter(u)
print("uniform distribution over 40320 possible outcomes, 1000000 selections, gives ", statistics.stdev(cu.values()))
There's a simple counting argument for the bias in each of the two approaches; either would make
a suitable follow-up question during an interview.
For the first approach, we make `n` swaps, each picking a target from among `n` elements. There are
thus `n^n` equally likely paths through the algorithm, but only `n!` distinct outcomes. Because `n!`
does not divide `n^n` evenly (for `n > 2`), some outcomes *must* occur more often than others. (This
is evident even when producing anagrams of a string as simple as "abc"; see the sketch below.)
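As a quick check of that claim (a sketch, not part of the original gist), we can enumerate all
3^3 = 27 equally likely swap sequences for "abc" and tally the outcomes:

from collections import Counter
from itertools import product

def swap_path(s, js):
    # Apply approach 1's swaps, with js fixing each "random" choice of j.
    s = list(s)
    for i, j in enumerate(js):
        s[i], s[j] = s[j], s[i]
    return "".join(s)

print(Counter(swap_path("abc", js) for js in product(range(3), repeat=3)))
# Counts: {'acb': 5, 'bac': 5, 'bca': 5, 'abc': 4, 'cab': 4, 'cba': 4} (display order may vary)

Three of the six permutations each arise from 5 of the 27 paths, the other three from only 4:
27 paths can't spread evenly over 6 outcomes.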
For the second approach, we rely on the fact that the sorting algorithm was likely chosen to make
on the order of `n log_2 n` comparisons. Each time the sort makes a comparison, it performs one of
two operations: swap, or don't swap. Thus there are at most `2^(n log_2 n)` potential paths through
the algorithm; that is, `n^n` again. However, sorting algorithms try hard not to make comparisons
whose results they already "know"; as a result, this approach produces a much stronger bias in the
sampled outcomes (a quick instrumentation follows the footnote). Indeed, some potential anagrams of
eight characters don't show up at all during the sample run[*].
[*] Assuming the sort function isn't broken, all permutations *can* turn up; we'd just need far more
runs before we expected to see the extreme cases.
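We can also count how many comparisons the sort actually makes (again a sketch, not part of the
gist; `counted_cmp` and `calls` are our names):

from functools import cmp_to_key
import random

calls = [0]

def counted_cmp(a, b):
    # random_cmp, instrumented to count how often the sort consults it.
    calls[0] += 1
    return 0.5 - random.random()

sorted("abcdefgh", key=cmp_to_key(counted_cmp))
print(calls[0])  # typically below n * log2(n) = 24 comparisons for n = 8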
You might also question whether a uniform distribution is even desirable; at the other end of the
spectrum, sorting all the characters (or just returning the input string) may suffice. (If the
requirement is that the order of the characters differ from the input, then of course there are
some inputs, such as "aaa", for which that requirement is unsatisfiable.) There may be other
versions of "better": for instance, preferring letter-orderings that make more challenging anagrams
for a human solver.
In typical i18n-blindness, none of the approaches here deals correctly with combining characters
and other tricky Unicode features.
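For example (a sketch of the problem, not part of the original script): in NFD normalisation, "é"
is an 'e' followed by a combining acute accent (U+0301), and a character-level shuffle can detach
the accent:

import random
import unicodedata

s = unicodedata.normalize("NFD", "café")  # ['c', 'a', 'f', 'e', '\u0301']
chars = list(s)
random.shuffle(chars)
print("".join(chars))  # the accent may now sit on a different letter entirely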
On the question of which of the two approaches is "stronger": the first produces outputs that are
closer to uniform. It also needs only a one-line change to produce a strictly uniform distribution,
as sketched below. The second approach may be "clever", but it's by no means clear how it could be
adjusted, should uniformity of output be desired.
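For the record, the one-line change: draw each swap target `j` from the not-yet-fixed suffix rather
than the whole list. That turns approach 1 into the Fisher-Yates shuffle, which is essentially what
`random.shuffle` does. A sketch (`approach_1_fixed` is our name, not the gist's):

def approach_1_fixed(s):
    s = list(s)
    for i in range(len(s)):
        j = random.randrange(i, len(s))  # j drawn from [i, n) rather than [0, n)
        s[i], s[j] = s[j], s[i]
    return "".join(s)

Now there are n * (n-1) * ... * 1 = n! equally likely paths, exactly one per permutation, so the
output is uniform.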
# A sample run
8 characters have a total potential anagram set of 40320 ; version one selects from a space of 16777216 ; approach 2 from a space of 16777216.0
approach_1 gives 40320 unique outputs after a million trials with a std dev of 9.796132039833688
approach_2 gives 40205 unique outputs after a million trials with a std dev of 81.79639616460702
uniform gives 40320 unique outputs after a million trials with a std dev of 4.98372732942099
uniform distribution over 40320 possible outcomes, 1000000 selections, gives 4.997166037344494