I decided to measure two related algorithms that came up in a job interview. The idea is to randomly pick a country from a list of countries with their population, but with biased randomness towards higher poulations.
Algorithms are in Algorithm1.hs and Algorithm2.hs respectively.
Expectations: Algorithm1 is faster but significantly more biased. (Wrong! See bottom)
Compiled with ghc -o countries -threaded -O2 Main.hs
[vincent@tazbook countries]$ time ./countries 4000000 +RTS -N4 | sort | uniq -c
15716 ("Bangladesh",152518015)
125087 ("Brazil",202550000)
1999377 ("China",1364410000)
8 ("Democratic Republic of the Congo",67514000)
135 ("Egypt",86446400)
220 ("Ethiopia",86613986)
3 ("France",65885000)
69 ("Germany",80716000)
999985 ("India",1244010000)
250622 ("Indonesia",247424598)
36 ("Iran",77427000)
3853 ("Japan",127140000)
1947 ("Mexico",119713203)
31223 ("Nigeria",173615000)
62077 ("Pakistan",186435000)
1008 ("Philippines",99554400)
7675 ("Russia",143700000)
1 ("Thailand",64456700)
18 ("Turkey",76667864)
2 ("United Kingdom",63705000)
500457 ("United States",318030000)
481 ("Vietnam",89708900)
real 2m15.730s
user 3m0.523s
sys 1m29.130s
I ran it a few more times, yielding these timings:
[vincent@tazbook countries]$ time ./countries 4000000 +RTS -N4 > /dev/null
real 2m16.750s
user 2m48.253s
sys 1m44.527s
[vincent@tazbook countries]$ time ./countries 4000000 +RTS -N4 > /dev/null
real 2m20.478s
user 3m5.947s
sys 1m35.910s
Compiled with ghc -o countries -threaded -O2 Main.hs
[vincent@tazbook countries]$ time ./countries 4000000 +RTS -N4 | sort | uniq -c
44124 ("Afghanistan",25500100)
3608 ("Algeria",38700000)
21878 ("Angola",20609294)
7286 ("Argentina",42669500)
43784 ("Australia",23488300)
43850 ("Bangladesh",152518015)
175573 ("Brazil",202550000)
10967 ("Burkina Faso",17322796)
21917 ("Burma",53259000)
11038 ("Cameroon",20386799)
44206 ("Canada",35344962)
22145 ("Chile",17620000)
350319 ("China",1364410000)
44089 ("Colombia",47588000)
88017 ("Democratic Republic of the Congo",67514000)
43873 ("Egypt",86446400)
43883 ("Ethiopia",86613986)
43951 ("France",65885000)
21904 ("Germany",80716000)
10922 ("Ghana",24658823)
1819 ("Guatemala",15806675)
175428 ("India",1244010000)
87465 ("Indonesia",247424598)
175801 ("Iran",77427000)
87408 ("Iraq",34035000)
44074 ("Italy",60021955)
21832 ("Ivory Coast",23202000)
87582 ("Japan",127140000)
21834 ("Kazakhstan",17221000)
21941 ("Kenya",44354000)
22089 ("Madagascar",21263403)
3808 ("Malawi",16363000)
88224 ("Malaysia",30113000)
88199 ("Mexico",119713203)
44043 ("Morocco",33257800)
87250 ("Mozambique",23700715)
21800 ("Nepal",26494504)
11070 ("Netherlands",16850300)
10912 ("Niger",17138707)
88108 ("Nigeria",173615000)
22105 ("North Korea",24895000)
87717 ("Pakistan",186435000)
43682 ("Peru",30475144)
44041 ("Philippines",99554400)
175561 ("Poland",38496000)
22067 ("Romania",20121641)
176142 ("Russia",143700000)
43621 ("Saudi Arabia",29994272)
87856 ("South Africa",52981991)
44119 ("South Korea",50219669)
21972 ("Spain",46609700)
43970 ("Sri Lanka",20277597)
88045 ("Sudan",37289406)
43854 ("Syria",21898000)
43920 ("Taiwan",23382948)
21944 ("Tanzania",44928923)
87904 ("Thailand",64456700)
87425 ("Turkey",76667864)
87592 ("Uganda",35357000)
43696 ("Ukraine",45395604)
43791 ("United Kingdom",63705000)
175480 ("United States",318030000)
21846 ("Uzbekistan",30183400)
43913 ("Venezuela",28946101)
87996 ("Vietnam",89708900)
21720 ("Yemen",25235000)
real 0m51.041s
user 1m32.100s
sys 0m24.013s
As expected, the distribution in the second algorithm is more uniform while preserving a certain bias towards the higher population entries.
The first algorithm has a much stronger bias.
The second algorithm scales reliably with larger datasets but can never complete until the whole tree has been walked, whilst the first algorithm can (50% chance) be 0(1), but also worst-case O(n).
Doing this has brought no new information to the computer science world, but it was fun!