jpbetz/generatename-collisions.md

## generatename-collisions.md

      
    Raw
  

              generatename-collisions.md
            
          
    Problem

Kubernetes generateName randomly generates a 5 char sequence that is appended as a "hash" to the generateName prefix (spoiler: It's not really a hash, just an encoded random number).
Each char in the sequence is from the "bcdfghjklmnpqrstvwxz2456789" range.
This leads to $27^5 = 14,348,907$ possible names per generateName prefix. For context, we have seen clusters store 100,000..1,000,000 CRDs, so the total number of names that can be generated is not currently a problem.
But there is a problem with collisions. There is a 50% chance of hitting a collision before generating even 5000 names, and a 0.1% chance at only 500 names!  This is unacceptably BAD. For comparison a 160 bit git hash has a 50% probability of a single collision at about $2^{80}$ (1 million billion billion).
Today a HTTP 409 respose returned to the client if a collision happens.  It is possible for clients to
detect the error and retry, but we've seen that most clients don't realize to do this and hit this problem in production.
Proposed Solution

Modify the apiserver to retry name generation. After generating a name, the apiserver checks if the name already exists in etcd, and retries name generation if needed.
If we retry up to 7 times, we can generate up to 1 million names before reaching a 0.1% chance of a collision. This is roughtly the same probability of collision we would get if we were to increased the number of chars per hash to 11.
This doesn't fundamentally fix the problem.  If the last retry results in a collision, the collision will still be reported back to the client, but it reduces the severity of the problem dramatically, buying time for longer term solutions to be developed.
See the prototype for implementation details.
Analysis

Note that the probability of a collision follows normal hash collision probabiblities:

The probability changes dramatically if we retry when a generated name conflicts with an existing name. We can calculate the probability when retries are performed iteratively with:
$C_n=C_{n-1}+(1-C_{n-1})(n-1/N)^t$
Where $C_n$ is the probablity of a collision (ignoring retries) when creating $n$ hashes, $N$ is the total number of hashes, $t$ is the number of tries performed.
Graphing this out for $t=2$ highlights how much better this is than what we have today:

And the more retry attempts we allow, the lower the probability of collisions:

After 7 attempts, the probability of a collision for 1 million entries drops below 0.1%.
For comparison, the probability of a collision if we were instead to increase the number of chars are:


number of chars
# of possible names
50% probability of collision when # of names added
0.1% probability of collision when # of names added


5
14348907
5000
500


6
387420489
25000
900


7
10460353203
120000
5000


8
282429536481
650000
8000


9
7625597484987
3300000
120000


10
205891132094649
17000000
600000


11
5559060566555523
24000000
3400000


(based on https://kevingal.com/apps/collision.html)
Here, we can see that we need 11 chars to for the "0.1% probability of collision when # of names added" to exceed 1 million. The gives us a way to compare approaches. This is how we can conclude that 7 retries provides roughly the same probability of collision that we would get if we were to increased the number of chars per hash to 11.
Sanity Check

I compared the equations used in this doc with actual conflicts on apiserver where the hash size was set to 2 and a range of retries were used, the resulting in:


retries
# of hashes at first collision (observed)
# of hashes at 50% probabiliy of collision (calculated)


0
55
33


1
158
104


2
117
54


3
202
182


4
368
251


5
346
309


6
444
357


7
365
397


8
450
430
number of chars	# of possible names	50% probability of collision when # of names added	0.1% probability of collision when # of names added
5	14348907	5000	500
6	387420489	25000	900
7	10460353203	120000	5000
8	282429536481	650000	8000
9	7625597484987	3300000	120000
10	205891132094649	17000000	600000
11	5559060566555523	24000000	3400000
retries	# of hashes at first collision (observed)	# of hashes at 50% probabiliy of collision (calculated)
0	55	33
1	158	104
2	117	54
3	202	182
4	368	251
5	346	309
6	444	357
7	365	397
8	450	430