RapGenius has an interesting post about Heroku's randomized load balancing, complaining about how random placement degrades performance compared to prior omniscient approaches. RapGenius ran some simulations, including an experiments with a "Choice of Two" method:
Choice of two routing is the naive method from before, with the twist that when you assign a request to a random dyno, if that dyno is already busy then you reassign the request to a second random dyno, with no regard for whether the second dyno is busy
This differs subtly but substantially from the standard "Power of Two Choices" randomized load balancing:
each [request] is placed in the least loaded of d >= 2 [Dynos] chosen independently and uniformly at random
Take a look at the difference in queue lengths below, for 200 Dynos, 1000 timesteps, and a constant 200 requests per timestep (assuming requests are fulfilled in one timestep):
Strategy Average 99th Percentile Random 16.87119 58.746 RapGenius "Choice of Two" 9.063825 23.613 Standard "Power of Two Choices" 3.734270 5.402
Note the 3x difference on average between the two "two choice" strategies (and an almost 5x difference at the 99th percentile).
There are all sorts of fun variants on "Power of Two Choices" and their theoretical implications. I'm hardly an expert, but I recall that the standard "Two Choices" gets you pretty close to optimal, and adding more choices doesn't help nearly as much as the first one. Of course, querying two dynos takes time (although the requests are in parallel), so you can play games like "query 5, check the first two responses." There's some cool work by my friends at Berkeley on doing this for cluster schedulers (no paper yet).
Quick and Dirty Python below:
''' Compares randomized load balancing, RapGenius's "choice of two" and traditional "power of two" randomized load balancing for a fixed arrival rate (determined by the "LOAD" parameter) ''' from random import randint TRIALS=1000 DYNOS=200 LOAD=1 PCTILE=.99 def average(l): return float(sum(l))/len(l) def randint_wo_replacement(minrange, maxrange, skip): ret = randint(minrange, maxrange) if ret >= skip: ret = ((ret+1) % maxrange) return ret for strategy in ["random", "rapgenius", "pwrtwo"]: queues=*DYNOS avgs= pctiles= for trial in range(0, TRIALS): for dyno in range(0, DYNOS): queues[dyno] = max(queues[dyno]-1, 0) for request in range(0, int(DYNOS*LOAD)): if strategy == "rapgenius": first_check = randint(0, DYNOS-1) if(queues[first_check] == 0): queues[first_check] += 1 else: queues[randint_wo_replacement(0, DYNOS-1, first_check)] += 1 elif strategy == "pwrtwo": first_check = randint(0, DYNOS-1) second_check = randint_wo_replacement(0, DYNOS-1, first_check) if(queues[first_check] < queues[second_check]): queues[first_check] += 1 else: queues[second_check] += 1 elif strategy == "random": queues[randint(0, DYNOS-1)] += 1 avgs.append(average(queues)) #get 99th pctile queues.sort() pctiles.append(queues[int(DYNOS*PCTILE-1)]) print "%s: avg %f, %f percentile %f" % (strategy, average(avgs), PCTILE, average(pctiles))