One of the many reasons I love working with Ruby is it has a rich vocabulary that allows you to accomplish your goals with a minimal amount of code. If there isn't a method that does exactly what you want, it's usually possible to build an elegant solution yourself.
Let's take the example of simulating the rolling of a die.
We can represent a die as an array of its faces.
die = [*?⚀..?⚅]
# => ["⚀", "⚁", "⚂", "⚃", "⚄", "⚅"]
We can simulate a single roll of the die using Array#sample
.
die.sample
# => "⚁"
There isn't any built-in method for rolling our die several times (i.e., sampling with replacement), but we can use Enumerable#map
for this:
5.times.map { die.sample }
# => ["⚄", "⚂", "⚃", "⚁", "⚂"]
10.times.map { die.sample }
# => ["⚂", "⚀", "⚄", "⚁", "⚃", "⚁", "⚁", "⚁", "⚄", "⚀"]
We can also use Enumerable#each_with_object
to observe the frequency distribution of rolling our die a large number of times:
360.times.each_with_object(Hash.new(0)) { |_, freq| freq[die.sample] += 1 }
# => {"⚃"=>55, "⚄"=>56, "⚂"=>67, "⚁"=>64, "⚅"=>65, "⚀"=>53}
3600.times.each_with_object(Hash.new(0)) { |_, freq| freq[die.sample] += 1 }
# => {"⚂"=>612, "⚃"=>600, "⚅"=>571, "⚄"=>593, "⚁"=>652, "⚀"=>572}
freq = 36_000.times.each_with_object(Hash.new(0)) { |_, freq| freq[die.sample] += 1 }
# => {"⚂"=>6061, "⚀"=>6018, "⚄"=>6087, "⚃"=>5940, "⚅"=>5891, "⚁"=>6003}
freq.values.map { |roll_count| roll_count / 36_000.0 }
# => [0.1683611111111111, 0.16716666666666666, 0.16908333333333334, 0.165, 0.1636388888888889, 0.16675]
Here we can see that as the number of rolls tends towards infinity, the likelihood of rolling any die face is equal. This is an example of uniform random sampling, and Ruby's Array#sample
makes it very easy to simulate.
But what if we want to simulate rolling a loaded die? Given an array of associated probabilities for each die face, is there an elegant way to calculate weighted random samples in Ruby?
My inclination would be to construct the cumulative distribution function from the probabilities and work from there. But there is an even simpler way using Enumerable#max_by
and a remarkable result due to Efraimidis and Spirakis.
Suppose we want to simulate our loaded die with the probabilities below:
ps = [0.05, 0.05, 0.1, 0.1, 0.2, 0.5]
ps.reduce(:+)
# => 1.0
Note that our weights here sum to 1. In the event we have raw weights, we can normalize them like so:
weights = [5, 5, 10, 10, 20, 50]
ps = weights.map { |w| (Float w) / weights.reduce(:+) }
# => [0.05, 0.05, 0.1, 0.1, 0.2, 0.5]
Now we can represent our loaded die as a hash where each face is a key whose value is the probability of rolling it.
loaded_die = die.zip(ps).to_h
# => {"⚀"=>0.05, "⚁"=>0.05, "⚂"=>0.1, "⚃"=>0.1, "⚄"=>0.2, "⚅"=>0.5}
A single roll of the loaded die can be simulated like this:
loaded_die.max_by { |_, weight| rand ** (1.0 / weight) }.first
# => "⚅"
Weighted random sampling with replacement can be achieved like this:
wrs = -> (freq) { freq.max_by { |_, weight| rand ** (1.0 / weight) }.first }
5.times.map { wrs[loaded_die] }
# => ["⚅", "⚅", "⚁", "⚁", "⚁"]
10.times.map { wrs[loaded_die] }
# => ["⚀", "⚄", "⚀", "⚄", "⚅", "⚅", "⚁", "⚅", "⚁", "⚅"]
And to confirm that our loaded die behaves like we expect, we can inspect the frequency distribution like we did above:
freq = 100_000.times.each_with_object(Hash.new(0)) { |_, freq| freq[wrs[loaded_die]] += 1 }
# => {"⚅"=>49980, "⚁"=>5045, "⚄"=>20171, "⚂"=>9840, "⚀"=>5031, "⚃"=>9933}
freq.map { |face, roll_count| [face, roll_count / 100_000.0] }.to_h
# => {"⚅"=>0.4998, "⚁"=>0.05045, "⚄"=>0.20171, "⚂"=>0.0984, "⚀"=>0.05031, "⚃"=>0.09933}
If you found this anywhere near as intriguing as I did, I'd encourage you to peruse the links below. The paper that introduces the algorithm used is a svelte 4 pages and worth a read.
Further reading:
- Ruby-Doc for
Enumerable#max_by
— specifically thewsample
example - Weighted Random Sampling by Efraimidis and Spirakis (2005) which introduces the algorithm
- New features for Array#sample, Array#choice which mentions the intention of adding weighted random sampling to
Array#sample
and reintroducingArray#choice
for sampling with replacement.
Thanks for this brilliant application sampling algorithm!
A sidenote about performance.
If you have to deal with
BigDecimal
(IE loaded from a database) weights it's very critical to coerce them to floats. A little benchmark to illustrate the issue:This happened because
BigDecimal
is a more accurate type and when you have to deal with large numbers it's pretty slow.So I would say normalization is very critical in performance terms also and coercion should be applied to both parts of equations to have
Float
type in weights:ps = weights.map { |w| (Float w) / (Float weights.reduce(:+)) }