Skip to content

Instantly share code, notes, and snippets.

@jamarparris
Last active March 20, 2024 00:04
Show Gist options
  • Star 89 You must be signed in to star a gist
  • Fork 19 You must be signed in to fork a gist
  • Save jamarparris/6100413 to your computer and use it in GitHub Desktop.
Save jamarparris/6100413 to your computer and use it in GitHub Desktop.
Create ObjectIds in PostGres following the MongoDB semantics. Very similar to the Instagram approach linked below.http://docs.mongodb.org/manual/reference/object-id/http://instagram-engineering.tumblr.com/post/10853187575/sharding-ids-at-instagram
The MIT License (MIT)
Copyright (c) 2013 Jamar Parris
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
CREATE OR REPLACE FUNCTION generate_object_id() RETURNS varchar AS $$
DECLARE
time_component bigint;
machine_id bigint := FLOOR(random() * 16777215);
process_id bigint;
seq_id bigint := FLOOR(random() * 16777215);
result varchar:= '';
BEGIN
SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp())) INTO time_component;
SELECT pg_backend_pid() INTO process_id;
result := result || lpad(to_hex(time_component), 8, '0');
result := result || lpad(to_hex(machine_id), 6, '0');
result := result || lpad(to_hex(process_id), 4, '0');
result := result || lpad(to_hex(seq_id), 6, '0');
RETURN result;
END;
$$ LANGUAGE PLPGSQL;
Example Use:
CREATE TABLE IF NOT EXISTS users
(
id varchar(24) NOT NULL default generate_object_id(),
name varchar(255) NOT NULL,
);
@DataGreed
Copy link

how do you generate the same id as bigint?

@danilodeveloper
Copy link

@DataGreed you can't. The main objective here is generate an ID like Mongodb does (letters and numbers).
If you want a bigint ID you should take a look in the instagram article as @jamarparris said.

@danilodeveloper
Copy link

danilodeveloper commented Sep 11, 2016

@jamarparris did you already used the generate_object_id() function in a shard enviroment (like Instagram did)? I'm not sure if the generate_object_id() will generate an unique ID in a distributed enviroment. Looking the code I have the impression that will generate an unique ID but I'm not sure...

Tank you for share your code 👍

@danilodeveloper
Copy link

Hey @DataGreed, take a look here.

@holsted
Copy link

holsted commented Sep 14, 2016

@jamarparris - Is there a license for this code? If you don't mind others using it, it would be really helpful if you could add an MIT license to it.

@jamarparris
Copy link
Author

jamarparris commented Sep 21, 2016

@andrewholsted - I updated it with an MIT license. Sorry for the delay.

@danilodeveloper - I've never used it in a sharded environment. While the random bits will help somewhat with uniqueness (especially in conjunction with the timestamp component), it doesn't guarantee it at all. It is still possible that two records could be generated at the same second (epoch doesn't account for milliseconds) and for the random generator to generate the same numbers for machine_id and seq_id for both records. Unlikely but possible.

You could probably ensure greater uniqueness across shards (and within the same database) by assigning each shard a unique fixed identifier and using it as the machine id and then using a PostGres SEQUENCE as the seq_id component.

https://www.postgresql.org/docs/9.1/static/functions-sequence.html

@osuushi
Copy link

osuushi commented Oct 7, 2016

If you want higher quality random numbers (postgres apparently uses an LCG for its random()), you can use pgcrypto and gen_random_bytes like so:

CREATE EXTENSION IF NOT EXISTS pgcrypto;
CREATE OR REPLACE FUNCTION generate_object_id() RETURNS varchar AS $$
    DECLARE
        time_component bigint;
        machine_id text := encode(gen_random_bytes(3), 'hex');
        process_id bigint;
        seq_id text := encode(gen_random_bytes(3), 'hex');
        result varchar:= '';
    BEGIN
        SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp())) INTO time_component;
        SELECT pg_backend_pid() INTO process_id;

        result := result || lpad(to_hex(time_component), 8, '0');
        result := result || machine_id;
        result := result || lpad(to_hex(process_id), 4, '0');
        result := result || seq_id;
        RETURN result;
    END;
$$ LANGUAGE PLPGSQL;

@andrewfinnell
Copy link

@osuushi Can't this generate duplicate identifiers? There is a chance the gen_random_bytes(3) returns the same result AND EPOCH from clock_timestamp doesn't have enough precision such that it's called twice in a row and returns the same number as well. There's nothing in here that would ensure the ID's being generated can never conflict. Perhaps the precision of a Postgres timestamp is such that this could never happen as it takes long enough to run the function itself without there being a problem. Unless two connections ran the function at the same time. One could account for this by adding a simple rolling sequence. It's similar work I did for the ACE C++ Network framework for the UUID generation.

By using a sequence in the function we guarantee that we can have at least 10 connections running this function at the 'exact' same time, and if the random byte generate were to happen to create the same data each time this would not result in conflicts. When creating ID's one has to ensure they will never conflict, even across shards. This isn't the same as hash'ing something for a bucket lookup. I'm curious about your thoughts?

-- This adaptation is released under the MIT License.
CREATE EXTENSION IF NOT EXISTS pgcrypto;
CREATE SEQUENCE epoch_seq INCREMENT BY 1 MAXVALUE 9 CYCLE;
CREATE OR REPLACE FUNCTION generate_object_id() RETURNS varchar AS $$
    DECLARE
        time_component bigint;
        epoch_seq int;
        machine_id text := encode(gen_random_bytes(3), 'hex');
        process_id bigint;
        seq_id text := encode(gen_random_bytes(3), 'hex');
        result varchar:= '';
    BEGIN
        SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp())) INTO time_component;
        SELECT nextval('epoch_seq') INTO epoch_seq;
        SELECT pg_backend_pid() INTO process_id;

        result := result || lpad(to_hex(time_component), 8, '0');
        result := result || machine_id;
        result := result || lpad(to_hex(process_id), 4, '0');
        result := result || seq_id;
        result := result || epoch_seq;
        RETURN result;
    END;
$$ LANGUAGE PLPGSQL;

@amcgregor
Copy link

As a quick note (this was linked in ##webdev on irc.freenode.net) ObjectIDs are meant to be generated client-side, that is, within the application connecting to the DB, not from within the DB itself. The ObjectID fields for machine and counter are not random. Additionally, the latest revision of the specification (since 3.2) has redefined the fields to not have a distinction between the "fields", now it is just a random chunk, but it's a simpler format overall.

  • The machine ID is usually generated from the socket.gethostname(), hashed. It is most useful for this to reference the application worker that constructed the record, not the DB node (which will be largely consistent) that processed the insert.
    Python: int(md5(gethostname()).hexdigest()[:6], 16)

  • The counter is an actual counter, counting within that process, requiring coordination (and locking) within that process.
    E.g. https://github.com/marrow/web.session/blob/develop/web/session/util.py#L32-L49

Server-side generated IDs can not benefit from this additional data, which is useful, esp. for aggregating things like logging records. ("Group by machine", "clear all sessions generated by machine X", etc…)

Can't this generate duplicate identifiers?

Yes, the method described above can generate duplicates, but their creation is a statistical problem. You'd need six bytes generating the same random value twice (highly unlikely, but possible esp. if using something like a time-based PRNG seed… note osuushi's comment above) and with a sequence cycle (maximum, looping) of 9… only ten records would be needed to duplicate that component. (If I grok that line correctly.)

This compares to the "correct" generation approach, which would require you generate 16.777 million IDs within the same second on the same worker process and machine. Best I've ever reached (with a capped collection, five years ago) was an insertion rate of ~5.7 million records per second, which basically makes a collision impossible.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment