Skip to content

Instantly share code, notes, and snippets.

@jpbetz
Last active January 24, 2024 02:13
Show Gist options
  • Save jpbetz/fa7db5ec344274039f74999271c44d07 to your computer and use it in GitHub Desktop.
Save jpbetz/fa7db5ec344274039f74999271c44d07 to your computer and use it in GitHub Desktop.

Problem

Kubernetes generateName randomly generates a 5 char sequence that is appended as a "hash" to the generateName prefix (spoiler: It's not really a hash, just an encoded random number).

Each char in the sequence is from the "bcdfghjklmnpqrstvwxz2456789" range.

This leads to $27^5 = 14,348,907$ possible names per generateName prefix. For context, we have seen clusters store 100,000..1,000,000 CRDs, so the total number of names that can be generated is not currently a problem.

But there is a problem with collisions. There is a 50% chance of hitting a collision before generating even 5000 names, and a 0.1% chance at only 500 names! This is unacceptably BAD. For comparison a 160 bit git hash has a 50% probability of a single collision at about $2^{80}$ (1 million billion billion).

Today a HTTP 409 respose returned to the client if a collision happens. It is possible for clients to detect the error and retry, but we've seen that most clients don't realize to do this and hit this problem in production.

Proposed Solution

Modify the apiserver to retry name generation. After generating a name, the apiserver checks if the name already exists in etcd, and retries name generation if needed.

If we retry up to 7 times, we can generate up to 1 million names before reaching a 0.1% chance of a collision. This is roughtly the same probability of collision we would get if we were to increased the number of chars per hash to 11.

This doesn't fundamentally fix the problem. If the last retry results in a collision, the collision will still be reported back to the client, but it reduces the severity of the problem dramatically, buying time for longer term solutions to be developed.

See the prototype for implementation details.

Analysis

Note that the probability of a collision follows normal hash collision probabiblities:

Screenshot from 2024-01-19 13-53-04

The probability changes dramatically if we retry when a generated name conflicts with an existing name. We can calculate the probability when retries are performed iteratively with:

$C_n=C_{n-1}+(1-C_{n-1})(n-1/N)^t$

Where $C_n$ is the probablity of a collision (ignoring retries) when creating $n$ hashes, $N$ is the total number of hashes, $t$ is the number of tries performed.

Graphing this out for $t=2$ highlights how much better this is than what we have today:

Screenshot from 2024-01-19 13-53-11

And the more retry attempts we allow, the lower the probability of collisions:

Screenshot from 2024-01-19 13-53-19

After 7 attempts, the probability of a collision for 1 million entries drops below 0.1%.

For comparison, the probability of a collision if we were instead to increase the number of chars are:

number of chars # of possible names 50% probability of collision when # of names added 0.1% probability of collision when # of names added
5 14348907 5000 500
6 387420489 25000 900
7 10460353203 120000 5000
8 282429536481 650000 8000
9 7625597484987 3300000 120000
10 205891132094649 17000000 600000
11 5559060566555523 24000000 3400000

(based on https://kevingal.com/apps/collision.html)

Here, we can see that we need 11 chars to for the "0.1% probability of collision when # of names added" to exceed 1 million. The gives us a way to compare approaches. This is how we can conclude that 7 retries provides roughly the same probability of collision that we would get if we were to increased the number of chars per hash to 11.

Sanity Check

I compared the equations used in this doc with actual conflicts on apiserver where the hash size was set to 2 and a range of retries were used, the resulting in:

Screenshot from 2024-01-19 13-51-54

retries # of hashes at first collision (observed) # of hashes at 50% probabiliy of collision (calculated)
0 55 33
1 158 104
2 117 54
3 202 182
4 368 251
5 346 309
6 444 357
7 365 397
8 450 430
@jpbetz
Copy link
Author

jpbetz commented Jan 18, 2024

Thank you @siyuanfoundation for help on the math!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment