Lightweight Random Generator

An implementation of an ID generator with these predefined constraints:

  • We want to persist requests against an API for deferred processing. There are ~1,000 request/second against the API during peak times. Each request should have a unique identifier that can be persisted.
  • We want to be able to work on large collections of these identifiers in memory (i.e., we expect to use a maximum of 1.2GB memory for processing and manipulating 50 million IDs for analytics).
  • We want to create large batches of identifiers when processing requests.
  • We want to be able to create these identifiers on distributed nodes without requiring communication between nodes (clashes can be possible but should be unlikely).
  • We want to be able to apply a total order to entities that were created across nodes for later auditing.

The final solution is Minid.java and MinidTest.java. Fetch, compile, and run it using:

git clone https://gist.github.com/d648925cbcc3553d2ebf.git minid && cd minid
javac *.java
java -Xms500m -Xmx10g MinidTest

Note that, per these JVM arguments, the program starts with a 500MB heap and gradually grows it up to 10GB due to the nature of the tests it performs. If you don't have that much RAM, the program will fail.

On successful execution, the output looks like this:

CPU architecture: 64
Max allowed heap size: 7022MB
------------Performing footprint test------------------------------------------------
Checking heap size, current size is 479MB
Generating 50 million IDs...
Checking heap size, current size is 880MB
Did we really generate 50 million IDs? What about printing the last 10 of 'em?
LTU3NDIxMDg1MjQxOTcyNTEzNzA=
LTI1NjU5MDM1NTM2MTA3NDY3
NzEwMDY4NTUxMTQyMDkzMDEyOA==
LTcxNzIzNjc5NDA4NzgwNzcxNDU=
MjI1MTk0MDA0MjQ0NTQ1MjAwMQ==
NTc0ODI3NDE2MjIyMjk0ODk2NA==
NjYzOTc1OTQzNTYxOTM0MzE1MA==
LTgwMjczNzMzMDc4Mjk3Nzc3Mzg=
LTU4ODkyOTc3OTEwNTg2NTIyNjc=
NTc3MTM2NDM3NjUwNDIwOTc1
Footprint test passed
------------Performing uniqueness test-----------------------------------------------
0 IDs generated, collisions so far 0, memory used: 880MB
1000000 IDs generated, collisions so far 0, memory used: 880MB
2000000 IDs generated, collisions so far 0, memory used: 880MB
3000000 IDs generated, collisions so far 0, memory used: 942MB
...
48000000 IDs generated, collisions so far 0, memory used: 3731MB
49000000 IDs generated, collisions so far 0, memory used: 3731MB
50000000 IDs generated, collisions so far 0, memory used: 3755MB
51000000 IDs generated, collisions so far 0, memory used: 4267MB
...
93000000 IDs generated, collisions so far 0, memory used: 6344MB
94000000 IDs generated, collisions so far 0, memory used: 6344MB
95000000 IDs generated, collisions so far 0, memory used: 6344MB
...
Uniqueness test passed

Basic idea

Out of the many files I've included, the final solution is a bit anticlimactic: just Minid.java. But how and why I arrived at that solution is important, hence I've included the rest as well, namely LightID and LightIDTest.

Basically, generating a unique ID is simple: pick a number at random from a massive set of numbers. If you pick from a set of 4 numbers, the collision probability is 1/4, which means that on average every 4 picks produce 1 collision. Practically, collisions can be made negligible by increasing the set size and using a good method of picking a number fairly.

To reduce the footprint, we have to trim the set size. My first educated guess is 64 bits. That's pretty small, just 8 bytes, and yet the number of combinations is staggering. How big is the 64-bit space? Assume our CPU can increment a number 1 billion times per second. With a 32-bit number, we would overflow in just over 4 seconds: from zero, incremented one by one, all the way past four billion. Had we used a 64-bit number, overflow would have taken about 585 years. Quite a difference, eh?
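
To make the arithmetic concrete, here is a throwaway back-of-the-envelope check (the class name is mine, not part of the gist):

public class OverflowMath {
    public static void main(String[] args) {
        double perSecond = 1_000_000_000d;               // 1 billion increments per second
        double seconds32 = Math.pow(2, 32) / perSecond;  // time to exhaust 32 bits
        double years64 = Math.pow(2, 64) / perSecond / (60 * 60 * 24 * 365);
        System.out.printf("32-bit overflow after %.1f seconds%n", seconds32); // ~4.3
        System.out.printf("64-bit overflow after %.0f years%n", years64);     // ~585
    }
}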

If the set size is 64 bits and the picking is fair, how soon should we expect a collision? Sooner than the 585 years above, thanks to the birthday paradox: the expected number of collisions among N random 64-bit values is roughly N² / 2^65, and collisions only become likely once we have accumulated on the order of 2^32 (about 4 billion) IDs. For the 100 million IDs our test generates, the expected count is only about 0.0003, so in practice the bigger risk is a picking algorithm that isn't fair.
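
To sanity-check that estimate, another small throwaway calculation (again, the class name is mine):

public class BirthdayBound {
    public static void main(String[] args) {
        double space = Math.pow(2, 64);
        // expected collisions among n uniform picks ~ n * (n - 1) / (2 * space)
        for (long n : new long[] {100_000_000L, 4_000_000_000L}) {
            double expected = (double) n * (n - 1) / (2 * space);
            System.out.printf("n = %,d -> expected collisions ~ %.4f%n", n, expected);
        }
    }
}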

Approach

Java provides a decent random generator in SecureRandom. By drawing a 64-bit random number from it, we have implemented 'picking a number from the 64-bit set'.
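
That picking step, in isolation, is a one-liner (this tiny demo class is mine, not part of the gist):

import java.security.SecureRandom;

public class Pick {
    public static void main(String[] args) {
        SecureRandom random = new SecureRandom();
        // nextLong() returns a uniformly distributed 64-bit value,
        // i.e. one fair pick from the 2^64-element set
        System.out.println(random.nextLong());
    }
}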

First I came up with LightID.java and LightIDTest.java, but the footprint test kept failing (it threw OutOfMemoryError on a 1GB heap). This is a shocking experience given that the class contains just 1 (one) primitive field of type long. The test passes if we change long to int, but believe me, you'll get thousands of collisions every second by doing that.

Another shock: merely using Long, the boxed version of long, instantiated 50 million times, allocates more than 1.2GB of heap, as demonstrated by the code below:

Long[] longs = new Long[50_000_000];
for (int i = 0; i < longs.length; ++i)
    // the compiler auto-boxes the value via Long.valueOf; note the cast,
    // since Java won't widen an int and box it to Long in one implicit step
    longs[i] = (long) i;

The per-object overhead in Java (an object header plus alignment padding, typically around 16 bytes on a 64-bit JVM, plus a reference to reach each object) is simply too much. So basically, we cannot wrap the ID inside a class.
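
A rough estimate of the two layouts, using typical HotSpot figures (header and padding sizes vary by JVM and flags, so treat these as ballpark numbers; the class is mine):

public class FootprintEstimate {
    public static void main(String[] args) {
        long n = 50_000_000L;
        // long[]: 8 bytes per element plus one small array header
        long primitiveBytes = n * 8;
        // Long[]: ~24 bytes per boxed Long (12-byte header + 8-byte value,
        // padded to 8-byte alignment) plus a 4-byte compressed reference per slot
        long boxedBytes = n * (24 + 4);
        System.out.printf("long[] : ~%d MB%n", primitiveBytes / (1024 * 1024)); // ~381 MB
        System.out.printf("Long[] : ~%d MB%n", boxedBytes / (1024 * 1024));     // ~1335 MB
    }
}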

Final solution

Our final ID is the long itself, alone and unencapsulated. The real meat of the solution is the Minid class, which wraps a handful of static methods for generating a unique long ID, printing it, and generating batches. Sometimes we have to sacrifice good programming practice (our well-encapsulated LightID object) to overcome the footprint constraint.

Final solution, broken down by the predefined constraints

We want to persist requests against an API for deferred processing. There are ~1,000 request/second against the API during peak times. Each request should have a unique identifier that can be persisted.

Checked: we can generate a million unique IDs every second. In uniquenessTest we generate 100 million unique IDs. Oh, and please don't run it on your laptop, because it is hungry for both CPU and RAM (10GB of the latter).
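
A rough way to see the generation rate for yourself (not a rigorous benchmark; JIT warm-up and SecureRandom seeding would skew a naive timing, hence the warm-up loop; the class is mine):

public class ThroughputSketch {
    public static void main(String[] args) {
        // warm up the JIT and seed the SecureRandom instance first
        for (int i = 0; i < 1_000_000; i++) Minid.generateId();
        int n = 1_000_000;
        long start = System.nanoTime();
        for (int i = 0; i < n; i++) Minid.generateId();
        double seconds = (System.nanoTime() - start) / 1e9;
        System.out.printf("%,d IDs in %.2fs (%,.0f IDs/second)%n", n, seconds, n / seconds);
    }
}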

We want to be able to work on large collections of these identifiers in memory (i.e., we expect to use a maximum of 1.2GB memory for processing and manipulating 50 million IDs for analytics).

Checked: after calling footprintTest, our heap has expanded from 500MB to 923.7MB. But be aware that any use of the standard Java collections (Set, List, Map, etc.) for analyzing 50 million of our IDs will break the memory constraint, because they automatically box the primitive long into the Long reference type. As we've discussed, that boxing (and instantiation) overhead is too much; we can only use a native array (e.g. long[]). Friends often ask me why we would use a native array instead of a Java collection. Now we know the answer.
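
Analytics can still be done comfortably while staying in primitive land; for instance, Arrays and LongStream operate on long[] without boxing (a sketch, assuming sorting and summary statistics are the kind of manipulation we need):

import java.util.Arrays;
import java.util.LongSummaryStatistics;

public class AnalyticsSketch {
    public static void main(String[] args) {
        long[] ids = Minid.generateBatches(50_000_000); // ~400MB of heap
        Arrays.sort(ids);                               // in-place, no boxing
        LongSummaryStatistics stats = Arrays.stream(ids).summaryStatistics();
        System.out.printf("min=%d max=%d count=%d%n",
                stats.getMin(), stats.getMax(), stats.getCount());
    }
}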

We want to create large batches of identifiers when processing requests.

Checked: we can just call Minid.generateId inside a loop. A further enhancement is to spread the work over multiple CPU cores; parallelizing this in Java 8 is trivial, see how it is done in Minid.generateBatches and in the usage sketch below.
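
Using it is a one-liner (a minimal usage sketch; the class name is mine):

public class BatchExample {
    public static void main(String[] args) {
        // one call fans the work out over the common ForkJoinPool
        long[] batch = Minid.generateBatches(1_000_000);
        System.out.println("generated " + batch.length + " IDs, first: " + Minid.toString(batch[0]));
    }
}

One caveat on the design: SecureRandom is thread-safe but internally synchronized, so the parallel workers all contend on the single shared instance; a ThreadLocal<SecureRandom> would likely scale better if batch generation ever became the bottleneck.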

We want to be able to create these identifiers on distributed nodes without requiring communication between nodes (clashes can be possible but should be unlikely).

Checked: clashes are very unlikely, as we've witnessed our solution survive a test of 100 million unique IDs. Each node can generate IDs independently without clashing.

We want to be able to apply a total order to entities that were created across nodes for later auditing.

Not checked yet. I seriously suggest not ordering entities by ID, because you will only get a randomized order: our solution doesn't produce monotonically increasing IDs. Which brings us to another serious suggestion: don't make the generated ID the primary index in your SQL database, or every insert will land at a random spot in the index and be inefficient.
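
If a total order ever becomes a hard requirement, one common technique (not implemented in this gist) is a Snowflake-style layout: put a timestamp in the high bits and random bits below, so IDs sort by creation time across nodes. A hypothetical sketch, trading some collision resistance for sortability:

import java.security.SecureRandom;

public class OrderedIdSketch {
    private static final SecureRandom Random = new SecureRandom();
    // hypothetical layout: 41 bits of milliseconds since a custom epoch
    // (good for ~69 years), then 22 random bits; with only 2^22 values per
    // millisecond, clashes are far more likely than with a full random long
    private static final long EPOCH = 1_400_000_000_000L;

    public static long generateOrderedId() {
        long millis = System.currentTimeMillis() - EPOCH;
        long randomBits = Random.nextLong() & ((1L << 22) - 1);
        return (millis << 22) | randomBits;
    }
}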

LightID.java

import java.security.SecureRandom;

@Deprecated
public class LightID implements Cloneable {

    private static final SecureRandom Random = new SecureRandom();

    private long id;

    public LightID() {
        id = Random.nextLong();
    }

    @Override
    public LightID clone() {
        LightID newId = new LightID();
        newId.id = id;
        return newId;
    }

    @Override
    public boolean equals(Object other) {
        if (other == null || !other.getClass().equals(getClass()))
            return false;
        LightID that = (LightID) other;
        return id == that.id;
    }

    @Override
    public int hashCode() {
        // Long.hashCode(long) avoids allocating a boxed Long just for its hash
        return Long.hashCode(id);
    }
}

LightIDTest.java

@Deprecated
public class LightIDTest {

    public static void main(String... args) {
        equalityTest();
        footprintTest();
    }

    // allocates 50 million LightID objects; throws OutOfMemoryError on a 1GB heap
    static void footprintTest() {
        LightID[] ids = new LightID[50_000_000];
        for (int i = 0; i < ids.length; i++) {
            if (i % 1_000_000 == 0)
                System.out.println(i);
            ids[i] = new LightID();
        }
    }

    static void equalityTest() {
        LightID id = new LightID();
        LightID sameId = id.clone();
        LightID differentId = new LightID();
        System.out.println(id.equals(sameId));      // true
        System.out.println(id.equals(differentId)); // false, barring a collision
        System.out.printf("id: %d; sameId: %d; differentId: %d\n",
                id.hashCode(), sameId.hashCode(), differentId.hashCode());
    }
}

Minid.java

import java.security.SecureRandom;
import java.util.Base64;
import java.util.stream.IntStream;

public class Minid {

    private static final SecureRandom Random = new SecureRandom();

    // one ID = one fair pick from the 64-bit set
    public static long generateId() {
        return Random.nextLong();
    }

    // generates count IDs in parallel across the available cores
    public static long[] generateBatches(int count) {
        return IntStream.range(0, count).parallel().mapToLong(i -> Minid.generateId()).toArray();
    }

    // encodes the ID's decimal representation as Base64 for printing
    public static String toString(long id) {
        return Base64.getEncoder().encodeToString(String.valueOf(id).getBytes());
    }
}

MinidTest.java

import java.util.HashSet;
import java.util.Set;
import java.util.stream.LongStream;

public class MinidTest {

    private static final int ONE_MEG = 1024 * 1024;

    public static void main(String[] args) throws Exception {
        System.out.printf("CPU architecture: %s\nMax allowed heap size: %dMB\n",
                System.getProperty("sun.arch.data.model"), Runtime.getRuntime().maxMemory() / ONE_MEG);
        System.out.println("------------Performing footprint test------------------------------------------------");
        footprintTest();
        System.out.println("------------Performing uniqueness test-----------------------------------------------");
        uniquenessTest();
    }

    static void footprintTest() throws Exception {
        ensureMemorySize("Make sure memory is less than 1.2GB before footprint test");
        System.out.println("Generating 50 million IDs...");
        long[] ids = Minid.generateBatches(50_000_000);
        ensureMemorySize("Memory has expanded beyond 1.2GB during processing");
        System.out.println("Did we really generate 50 million IDs? What about printing the last 10 of 'em?");
        LongStream.range(0, 10).map(n -> ids[ids.length - ((int) n + 1)])
                .forEach(id -> System.out.println(Minid.toString(id)));
        System.out.println("Footprint test passed");
    }

    static void uniquenessTest() {
        int collisions = 0;
        Set<Long> set = new HashSet<>();
        for (int i = 1; i <= 10; i++) {
            set.clear();
            for (int j = 0; j < 100_000_000; j++) {
                if (j % 1_000_000 == 0)
                    System.out.printf("%d IDs generated, collisions so far %d, memory used: %dMB\n",
                            j, collisions, Runtime.getRuntime().totalMemory() / ONE_MEG);
                long newId = Minid.generateId();
                // Set.add returns false if the value was already present, i.e. a clash;
                // note %s for the Base64 string (the original %d threw at runtime)
                if (!set.add(newId))
                    System.out.printf("Clash! Clashing ID is %s, collisions so far %d\n",
                            Minid.toString(newId), ++collisions);
            }
        }
        System.out.println(collisions == 0 ? "Uniqueness test passed"
                : "Uniqueness test failed, there are " + collisions + " collisions");
    }

    static void ensureMemorySize(String message) throws Exception {
        long heapSize = Runtime.getRuntime().totalMemory();
        System.out.printf("Checking heap size, current size is %dMB\n", heapSize / ONE_MEG);
        if (heapSize > 1200L * ONE_MEG) throw new Exception(message);
    }
}