Birthday Paradox for Random Collision Avoidance in Elixir
failures = fn population, bytes, max_duration_s ->
  # Draw `population` random IDs of `bytes` bytes each per round; a round
  # fails if any two IDs collide. Stop at 10,000 rounds or the deadline.
  max_rounds = 10_000
  deadline = System.monotonic_time(:millisecond) + 1000 * max_duration_s

  gather = fn gather, rounds, failures ->
    cond do
      rounds >= max_rounds ->
        {rounds, failures}

      System.monotonic_time(:millisecond) > deadline ->
        {rounds, failures}

      true ->
        1..population
        |> Enum.map(fn _ -> :crypto.strong_rand_bytes(bytes) end)
        |> Enum.uniq()
        |> length()
        |> case do
          ^population -> gather.(gather, rounds + 1, failures)
          _n -> gather.(gather, rounds + 1, failures + 1)
        end
    end
  end

  {rounds, failures} = gather.(gather, 0, 0)

  # Report the failure rate with precision appropriate to the sample size.
  decimals = max(0, floor(:math.log10(rounds)) - 2)

  failure_rate =
    ((failures / rounds * 100.0) |> :erlang.float_to_binary(decimals: decimals)) <> "%"

  %{
    population: population,
    bits: bytes * 8,
    rounds_total: rounds,
    rounds_failed: failures,
    failure_rate: failure_rate
  }
end
# Quick one-second smoke test first…
max_duration_s = 1
failures.(100, 4, max_duration_s)

# …then the full ten-minute runs:
max_duration_s = 600
[
failures.(100, 4, max_duration_s),
failures.(1_000, 4, max_duration_s),
failures.(10_000, 4, max_duration_s),
failures.(65_536, 4, max_duration_s),
failures.(100_000, 4, max_duration_s),
failures.(1_000_000, 4, max_duration_s),
failures.(1_000, 5, max_duration_s),
failures.(10_000, 5, max_duration_s),
failures.(100_000, 5, max_duration_s),
failures.(1_000_000, 5, max_duration_s),
failures.(1_048_576, 5, max_duration_s)
]

Collisions in Key-Sortable Unique Identifiers (KSUIDs)

Expanding on my thread, it’s been nagging me that the random tail on small key-sortable unique IDs (KSUIDs) might not provide as much protection against collisions as I’d trusted. Segment were able to throw bits at the problem, but I needed to pack my identifiers into smaller spaces.

After sparing 32 bits for the timestamp, I might have room for only 32, 40, or 96 random bits. There’s that cryptographers’ rule of thumb that collisions become likely once you’ve drawn about 2^(n/2) values from an n-bit space… but how likely is it that we’ll get at least one collision?

My math being a bit rusty, I decided to brute-force it (birthday.exs). I didn’t like my results (results.exs):

  • 39% failure for 2¹⁶ random choices from 2³² possible keys (3443 rounds)
  • 45% failure for 2²⁰ random choices from 2⁴⁰ possible keys (163 rounds)
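
In hindsight, the closed-form birthday approximation, P(collision) ≈ 1 − e^(−n(n−1)/2s) for n draws from a space of s keys, predicts the same ballpark. A quick cross-check I’ve added here; it isn’t part of birthday.exs:

```elixir
# P(at least one collision) ≈ 1 - exp(-n(n-1) / 2s)
p_collision = fn population_bits, key_bits ->
  n = :math.pow(2, population_bits)
  s = :math.pow(2, key_bits)
  1.0 - :math.exp(-n * (n - 1) / (2 * s))
end

IO.puts(Float.round(p_collision.(16, 32), 3))  # 2^16 draws from 2^32 keys → 0.393
IO.puts(Float.round(p_collision.(20, 40), 3))  # 2^20 draws from 2^40 keys → 0.393
```

Both cases sit near the measured 39%; the measured 45% came from only 163 rounds, which plausibly puts it within sampling noise of the same figure.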

Searching around again, worried, I found “Understanding the Birthday Paradox” and played with its calculator. Clearly, I’d been cocky: I’d expected to need the occasional retry at my “peak throughput” of half the bits, but I’d actually need to retry about half the time.

To nail down my safety margin once and for all, I've adapted Appendix A:

odds_of_missing_collision = fn population_bits, random_identifier_bits ->
  population = :math.pow(2, population_bits)
  space = :math.pow(2, random_identifier_bits)

  # e raised to -n(n-1)/2s: the standard birthday-problem approximation
  :math.exp(population * (population - 1) / (-2 * space))
end

for sacrificed_bits <- 0..8 do
  confidence = odds_of_missing_collision.(16 - sacrificed_bits, 32)
  rounds_for_half_chance_of_collision = floor(:math.log(0.5) / :math.log(confidence))
  hours_at_one_round_per_second = Float.round(rounds_for_half_chance_of_collision / 3600, 1)
  {sacrificed_bits, confidence, rounds_for_half_chance_of_collision, hours_at_one_round_per_second}
end

[
  {0, 0.6065352871919841, 1, 0.0},
  {1, 0.8825002690495376, 5, 0.0},
  {2, 0.9692350831437918, 22, 0.0},
  {3, 0.9922188845134587, 88, 0.0},
  {4, 0.9980492570143334, 354, 0.1},
  {5, 0.9995120762421099, 1420, 0.4},
  {6, 0.999878056332523, 5683, 1.6},
  {7, 0.9999695424903592, 22757, 6.3},
  {8, 0.9999924004366679, 91208, 25.3}
]
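
As a spot check, the first tuple above can be reproduced directly from the formula (my addition, using `:math.exp` on the negated exponent, which is equivalent to the `:math.pow` form above):

```elixir
# No sacrificed bits: 2^16 draws from 2^32 keys.
n = :math.pow(2, 16)
s = :math.pow(2, 32)
confidence = :math.exp(-n * (n - 1) / (2 * s))

IO.puts(confidence)  # ≈ 0.6065352871919841, matching the first tuple
IO.puts(floor(:math.log(0.5) / :math.log(confidence)))  # → 1 round to even odds
```

So at full peak throughput a single round already gives worse-than-even odds of avoiding a collision.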

Sacrificing eight bits gives you a peak traffic level at which you’ve got half a chance of a collision only after roughly ninety thousand peaks.

If you squeeze a Segment-style KSUID into 64 bits, you'll have 32 bits for the timestamp in seconds and another 32 bits for the random component. If your peak traffic is 256 attempts per second, it's a coin-flip whether you survive peaking for a day.
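
Checking that claim with the same formulas (a quick sketch, not one of the gist’s files):

```elixir
# 256 IDs per second drawn from a 32-bit random space (sacrificed_bits = 8):
q = :math.exp(-256 * 255 / (2 * :math.pow(2, 32)))  # per-second odds of no collision
seconds_to_even_odds = floor(:math.log(0.5) / :math.log(q))
hours = Float.round(seconds_to_even_odds / 3600, 1)

IO.inspect({seconds_to_even_odds, hours})  # → {91208, 25.3}
```

About 25 hours of sustained peak: hence the coin flip for surviving a day.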

[
  %{bits: 32, population: 100, rounds_total: 10000, rounds_failed: 0, failure_rate: "0.00%"},
  %{bits: 32, population: 1000, rounds_total: 10000, rounds_failed: 0, failure_rate: "0.00%"},
  %{bits: 32, population: 10000, rounds_total: 10000, rounds_failed: 107, failure_rate: "1.07%"},
  %{bits: 32, population: 65536, rounds_total: 3443, rounds_failed: 1340, failure_rate: "38.9%"},
  %{bits: 32, population: 100000, rounds_total: 2197, rounds_failed: 1508, failure_rate: "68.6%"},
  %{bits: 32, population: 1000000, rounds_total: 176, rounds_failed: 176, failure_rate: "100%"},
  %{bits: 40, population: 1000, rounds_total: 10000, rounds_failed: 0, failure_rate: "0.00%"},
  %{bits: 40, population: 10000, rounds_total: 10000, rounds_failed: 0, failure_rate: "0.00%"},
  %{bits: 40, population: 100000, rounds_total: 2159, rounds_failed: 10, failure_rate: "0.5%"},
  %{bits: 40, population: 1000000, rounds_total: 176, rounds_failed: 73, failure_rate: "41%"},
  %{bits: 40, population: 1048576, rounds_total: 163, rounds_failed: 74, failure_rate: "45%"}
]