An implementation of an ID generator, built against these predefined constraints:
- We want to persist requests against an API for deferred processing. There are ~1,000 request/second against the API during peak times. Each request should have a unique identifier that can be persisted.
- We want to be able to work on large collections of these identifiers in memory (i.e., we expect to use a maximum of 1.2GB memory for processing and manipulating 50 million IDs for analytics).
- We want to create large batches of identifiers when processing requests.
- We want to be able to create these identifiers on distributed nodes without requiring communication between nodes (clashes can be possible but should be unlikely).
- We want to be able to apply a total order to entities that were created across nodes for later auditing.
The final solution is Minid.java and MinidTest.java. Fetch, compile, and run it using:
git clone https://gist.github.com/d648925cbcc3553d2ebf.git minid && cd minid
javac *.java
java -Xms500m -Xmx10g MinidTest
Note: per these arguments, the program starts with a 500MB heap and gradually grows it to 10GB of RAM due to the nature of the tests it performs. If your machine doesn't have that much RAM, the program will fail.
On successful execution, the output looks like this:
CPU architecture: 64
Max allowed heap size: 7022MB
------------Performing footprint test------------------------------------------------
Checking heap size, current size is 479MB
Generating 50 millions of IDs...
Checking heap size, current size is 880MB
Are we really generated 50 millions of IDs? What about printing last 10 of 'em?
LTU3NDIxMDg1MjQxOTcyNTEzNzA=
LTI1NjU5MDM1NTM2MTA3NDY3
NzEwMDY4NTUxMTQyMDkzMDEyOA==
LTcxNzIzNjc5NDA4NzgwNzcxNDU=
MjI1MTk0MDA0MjQ0NTQ1MjAwMQ==
NTc0ODI3NDE2MjIyMjk0ODk2NA==
NjYzOTc1OTQzNTYxOTM0MzE1MA==
LTgwMjczNzMzMDc4Mjk3Nzc3Mzg=
LTU4ODkyOTc3OTEwNTg2NTIyNjc=
NTc3MTM2NDM3NjUwNDIwOTc1
footprint test is passed
------------Performing uniqueness test-----------------------------------------------
0 IDs generated, collisions so far 0, memory used: 880MB
1000000 IDs generated, collisions so far 0, memory used: 880MB
2000000 IDs generated, collisions so far 0, memory used: 880MB
3000000 IDs generated, collisions so far 0, memory used: 942MB
...
48000000 IDs generated, collisions so far 0, memory used: 3731MB
49000000 IDs generated, collisions so far 0, memory used: 3731MB
50000000 IDs generated, collisions so far 0, memory used: 3755MB
51000000 IDs generated, collisions so far 0, memory used: 4267MB
...
93000000 IDs generated, collisions so far 0, memory used: 6344MB
94000000 IDs generated, collisions so far 0, memory used: 6344MB
95000000 IDs generated, collisions so far 0, memory used: 6344MB
...
Uniqueness test is passed
Out of the many files I included, the final solution is a bit disappointing: just Minid.java. But I believe how and why I came up with that solution is important, so I included the rest as well, namely LightID and LightIDTest.
Basically, generating a unique ID is simple: pick a number at random from a massive set of numbers. If you pick from a set of just 4 numbers, each new pick collides with any given earlier pick with probability 1/4, so collisions pile up almost immediately. In practice, collisions can be made negligible by increasing the set size and using a good, fair method of picking.
To reduce the footprint, we have to trim the set size. My first educated guess is 64 bits. The size is pretty small, 64 bits a.k.a. 8 bytes, and yet the number of combinations is amazing. How big is 64 bits? Assume our CPU can increment a number 1 billion times per second. With a 32-bit number, we would get an overflow in just 4 seconds: from zero, incremented one by one, up to four billion in 4 seconds. Had we used a 64-bit number, the overflow would have occurred only after 585 years. Quite a difference, eh?
If the set is 64 bits wide and we pick numbers from it at random (with a fair way of picking), how soon do we collide? Not after 585 years: by the birthday bound, at a billion picks per second the first collision is expected after roughly 2^32 picks, i.e. within seconds. What saves us is scale: for realistic volumes such as n = 100 million IDs, the expected number of colliding pairs is about n^2 / 2^65 ≈ 0.0003, so in practice we will see none.
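A quick sanity check on the collision odds using birthday-bound arithmetic (my own sketch, not part of the gist): drawing n random values from a set of size N, the expected number of colliding pairs is about n(n-1)/2N.

```java
import java.util.Locale;

// Birthday-bound estimate: drawing n random values from a set of size N,
// the expected number of colliding pairs is roughly n * (n - 1) / (2 * N).
public class CollisionEstimate {
    public static void main(String[] args) {
        double N = Math.pow(2, 64);      // size of the 64-bit set
        double n = 100_000_000;          // IDs drawn in the uniqueness test
        double expected = n * (n - 1) / (2 * N);
        System.out.printf(Locale.ROOT,
                "expected colliding pairs for 100M IDs: %.6f%n", expected);
    }
}
```

Which is why the uniqueness test below reports zero collisions even at 100 million IDs.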
Java provides a decent random generator in SecureRandom. By drawing a 64-bit random number from it, we have just implemented 'picking a number from the 64-bit set'.
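A minimal version of that picking step, using nothing beyond the JDK (the class name here is mine, not from the gist):

```java
import java.security.SecureRandom;

// 'Pick a number from the 64-bit set': nextLong() draws 8 cryptographically
// strong random bytes and returns them as a long, covering the full range.
public class PickDemo {
    public static void main(String[] args) {
        SecureRandom random = new SecureRandom();
        for (int i = 0; i < 5; i++)
            System.out.println(random.nextLong());
    }
}
```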
First, I came up with LightID.java and LightIDTest.java, but the footprint test kept failing (it threw an OutOfMemoryError on a 1GB heap). This was a shocking experience given that the class contains just one (1) primitive field of type long. The test passes if we change long to int, but believe me, you'll get thousands of collisions every second by doing that.
Another shocking find: just using Long, the boxed version of long, and instantiating it 50 million times allocates more than 1.2GB of heap, as the code below demonstrates:
Long[] longs = new Long[50_000_000];
for (int i = 0; i < longs.length; ++i)
    // note: the JVM autoboxes via Long.valueOf, which only caches values in
    // -128..127, so every iteration here allocates a fresh Long object
    longs[i] = (long) i;
The overhead of creating an object in Java is just too much. On a typical 64-bit JVM with compressed oops, a Long costs 24 bytes (12-byte header plus the 8-byte value, padded) plus a 4-byte reference in the array, roughly 28 bytes per ID versus 8 for a bare long; 50 million of those is about 1.4GB, which is exactly why the snippet above blows past 1.2GB. So basically, we cannot wrap the ID inside a class.
Our final ID is the long itself, alone and unencapsulated. The real meat of the solution is the Minid class, which wraps a bunch of static methods for generating unique long IDs, printing them, and generating batches. Sometimes we have to sacrifice good programming practice (our LightID, a well-encapsulated object) to overcome the footprint constraint.
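I can't show the gist's code here, but a class with that shape might look like the following sketch. The names and signatures are my guesses, not the author's actual code:

```java
import java.security.SecureRandom;

// Sketch of a Minid-style API: static methods only, and the ID is a bare
// long, so generating an ID never allocates an object on the heap.
public final class MinidSketch {
    private static final SecureRandom RANDOM = new SecureRandom();

    private MinidSketch() {}  // no instances: the ID is the long itself

    public static long generateId() {
        return RANDOM.nextLong();
    }

    // batch generation straight into a primitive array: 8 bytes per ID
    public static long[] generateBatch(int size) {
        long[] batch = new long[size];
        for (int i = 0; i < size; i++)
            batch[i] = generateId();
        return batch;
    }

    public static void main(String[] args) {
        System.out.println("generated " + generateBatch(1_000).length + " ids");
    }
}
```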
We want to persist requests against an API for deferred processing. There are ~1,000 request/second against the API during peak times. Each request should have a unique identifier that can be persisted.
Checked: we can generate 1 million unique IDs every second. In uniquenessTest, we generate 100 million unique IDs. Oh, and please don't run it on your laptop, because it eats both CPU and RAM (10GB of the latter).
We want to be able to work on large collections of these identifiers in memory (i.e., we expect to use a maximum of 1.2GB memory for processing and manipulating 50 million IDs for analytics).
Checked: after calling footprintTest, the heap expanded from 500MB to 923.7MB. But be aware that any use of the standard Java collections (Set, List, Map, etc.) for analyzing 50 million of our IDs will break the memory constraint, because they automatically box the long primitive into the Long reference type, and as we've discussed, the boxing (and instantiation) overhead is too much. We can only use native arrays (e.g. long[]). Friends often ask me why we use native arrays instead of Java collections; now we know the answer.
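For example, counting duplicates in a batch needs no collection at all: sort the primitive array and scan adjacent neighbours. This is my own sketch, scaled down to 1 million IDs, with the expected result of zero collisions holding only with overwhelming probability:

```java
import java.security.SecureRandom;
import java.util.Arrays;

// Collision counting without java.util collections: a long[] costs exactly
// 8 bytes per ID, versus roughly 28+ bytes per entry once Long boxing and
// collection bookkeeping are involved.
public class PrimitiveDedup {
    public static void main(String[] args) {
        SecureRandom random = new SecureRandom();
        long[] ids = new long[1_000_000];        // small demo size
        for (int i = 0; i < ids.length; i++)
            ids[i] = random.nextLong();

        Arrays.sort(ids);                        // primitive sort, no boxing
        int collisions = 0;
        for (int i = 1; i < ids.length; i++)
            if (ids[i] == ids[i - 1])
                collisions++;
        System.out.println("collisions: " + collisions);
    }
}
```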
We want to create large batches of identifiers when processing of requests.
Checked: we can just call Minid.generateId inside a loop. A further enhancement is to spread the work across multiple CPU cores. Implementing that kind of parallelism in Java 8 is trivial; see how it is done in Minid.generateBatches.
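I haven't reproduced generateBatches here, but a Java 8 parallel version could be as small as this sketch. It swaps in ThreadLocalRandom to avoid contention on one shared SecureRandom instance; trade that back if cryptographic strength matters more than throughput:

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.stream.LongStream;

// Parallel batch generation: the fork-join pool splits the range across
// cores, and toArray() collects straight into a primitive long[].
public class ParallelBatch {
    public static void main(String[] args) {
        long[] ids = LongStream.range(0, 5_000_000)
                .parallel()
                .map(i -> ThreadLocalRandom.current().nextLong())
                .toArray();
        System.out.println("generated " + ids.length + " ids");
    }
}
```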
We want to be able to create these identifiers on distributed nodes without requiring communication between nodes (clashes can be possible but should be unlikely).
Checked: clashes are very unlikely, as we witnessed when our solution survived the 100-million-unique-IDs test. Each node can independently generate IDs without clashing.
We want to be able to apply a total order to entities that were created across nodes for later auditing.
Not checked yet. I seriously suggest not ordering entities by this ID: you will only get a randomized order, because our solution does not produce monotonically increasing IDs. Which brings us to another serious suggestion: don't make the generated ID the primary index in your SQL database, or every insert will land at a random position in the index and be inefficient.