
Automatic UUID Reflection in Postgres

You have a system backed by a Postgres database that uses UUIDs as primary keys for everything. This is nice:

  • You can generate ids on clients before transmitting writes to servers if you wish.
  • You don't have to worry about exhausting the integer size allocated to your auto-incrementing primary keys.1
  • You don't have to worry about enumeration attacks from auto-incrementing primary keys.
  • You can identify and index on data that has no natural key, without the challenges of auto-incrementing primary keys.

The Problem

However, now you have large, illegible UUIDs permeating your system's logs. This sucks when interpreting logs and debugging errors, especially for data that does have a natural key you could have chosen as a primary key instead, like any table with a global uniqueness constraint (e.g. user.email). Homogeneity is attractive; having to guess which table a record lives in from its UUID alone is not.

This is especially salient if you expose UUIDs in URLs, your Customer Support team requires URLs in bug reports, and some of your URLs are polymorphic on ID: for example, an example.com/edit/<thing-id> URL that can operate on data from different tables.

This debugging and observability challenge can be addressed with better logging—but that is true of most debugging and observability challenges. You could also implement your own custom UUID generation method in all your systems and give up some of those collision-avoiding bits to tag them with metadata. Instead, let's explore solving it at the storage level.


One Solution

Use Postgres inheritance and a generated column to automatically maintain a registry of all the UUIDs in your system.2

The Entity Table

Let's create a base table named entity where we can discover the location of any record in our system:

CREATE TABLE entity (
  -- Every inheritor of this table gets a uuid PK
  id uuid PRIMARY KEY DEFAULT gen_random_uuid(),
  -- Every inheritor of this table auto-generates the table name it comes from
  source_table text GENERATED ALWAYS AS (tableoid::regclass) STORED
);

Creating Registered Tables

We can now create tables that inherit from the entity registry for this core functionality:

-- "user" is a reserved word in Postgres, so the table name must be quoted.
-- citext comes from the citext extension: CREATE EXTENSION IF NOT EXISTS citext;
CREATE TABLE "user" (
  email citext PRIMARY KEY
) INHERITS (entity);
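
For a quick sanity check, an insert like this (assuming the "user" table above) should come back with both inherited columns filled in:

-- id comes from the inherited default; source_table is generated from tableoid.
INSERT INTO "user" (email)
VALUES ('user@example.com')
RETURNING id, source_table, email;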

Registering Existing Tables

Existing tables can also sign up to the registry (per the docs; I have not tried this on a non-greenfield database):

ALTER TABLE existing_table INHERIT entity;
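
Per the ALTER TABLE docs, the table being added must already contain columns matching the parent's, so a pre-existing table likely needs them added before the INHERIT above will succeed. A minimal, untested sketch for a hypothetical existing_table that lacks them:

-- Add columns matching entity's definitions before ALTER TABLE ... INHERIT.
ALTER TABLE existing_table
  ADD COLUMN id uuid NOT NULL DEFAULT gen_random_uuid(),
  ADD COLUMN source_table text GENERATED ALWAYS AS (tableoid::regclass) STORED;

Since gen_random_uuid() is volatile, adding the id column should rewrite the table and evaluate the default for each existing row, giving every pre-existing record its own UUID.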

Querying the Entity Registry

Now if you want to know what table a UUID in your system comes from, you can query the entity registry:

-- Found uuid 'd4c0fc8c-72b7-44ab-94a7-a070d59357e0' in a log file?
SELECT * FROM entity WHERE id = 'd4c0fc8c-72b7-44ab-94a7-a070d59357e0';
-- | id                                   | source_table |
-- |--------------------------------------|--------------|
-- | d4c0fc8c-72b7-44ab-94a7-a070d59357e0 | "user"       |

With the source table in hand, you can check out the data associated with that UUID:

SELECT * FROM "user" WHERE id = 'd4c0fc8c-72b7-44ab-94a7-a070d59357e0';
-- | id                                   | source_table | email            |
-- |--------------------------------------|--------------|------------------|
-- | d4c0fc8c-72b7-44ab-94a7-a070d59357e0 | "user"       | user@example.com |

Trade-offs

  • Per this footnote1, this approach further exacerbates the probabilistic risks of using UUIDs as primary keys.

    You now have to worry about UUID generation collisions in write activity across your entire table-space, instead of per-table. If your write volume is high enough that this risk concerns you, hopefully you have already studied how to calculate collision probabilities arising from the birthday paradox, and can apply them to this approach (see the back-of-the-envelope sketch after this list).3

  • You have to pay the cost of storing the table name for every record.

    Postgres generated columns only support the STORED qualifier for GENERATED ALWAYS AS columns, and AFAICT as of 2023 there are no plans to introduce support for a hypothetical VIRTUAL qualifier. This cost can be somewhat mitigated by using a smaller type than text for the source column.
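
For a back-of-the-envelope sense of scale (the numbers here are illustrative, not from any particular system): a random v4 UUID carries 122 random bits, so the birthday bound puts the collision probability for n ids at roughly n² / 2¹²³. You can even eyeball it in psql:

-- Birthday-bound approximation for n random v4 UUIDs (122 random bits):
-- P(collision) ≈ n^2 / (2 * 2^122). For example, ten billion ids:
SELECT (1e10::numeric ^ 2) / (2 * 2::numeric ^ 122) AS approx_collision_probability;
-- ≈ 9.4e-18: still vanishingly small, even registry-wide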

Bonus Columns

If you're interested in enforcing audit timestamps for every record in your system, you can now add these columns to your base entity table.

Unfortunately, triggers are not inherited from the base entity table, so you must remember to execute DDLs that attach these triggers to every table in your system, new or existing.

Inserted At

  1. Add a new column to your base table:

    ALTER TABLE entity ADD COLUMN inserted_at timestamptz DEFAULT CURRENT_TIMESTAMP NOT NULL;

    Existing records in inheriting tables will receive values corresponding to the time the DDL was run. New records will receive values corresponding to when they were inserted.

Updated At

  1. Add a new column to your base table.

    ALTER TABLE entity ADD COLUMN updated_at timestamptz DEFAULT CURRENT_TIMESTAMP NOT NULL;

    Existing records in inheriting tables will receive values corresponding to the time the DDL was run. New records will receive values corresponding to when they were inserted; the trigger below keeps them current on subsequent updates.

  2. Create a function that any table can use to update their updated_at timestamp.

    CREATE OR REPLACE FUNCTION update_updated_at_timestamp()
    RETURNS trigger AS $$
      BEGIN
        NEW.updated_at := CURRENT_TIMESTAMP;
        RETURN NEW;
      END;
    $$ LANGUAGE plpgsql;
    
  3. Add the trigger to your inheriting <existing_table>s, including every time you create a new one. (A sketch for back-filling every existing inheritor at once follows this list.)

    CREATE TRIGGER update_updated_at_timestamp_on_<existing_table>
      BEFORE UPDATE ON <existing_table>
      FOR EACH ROW
      EXECUTE PROCEDURE update_updated_at_timestamp();
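
Since that statement has to be repeated per table, one way (a sketch, untested beyond toy schemas) to back-fill the trigger onto every current inheritor of entity is to loop over pg_inherits:

-- Attach the trigger to every direct child of entity that exists right now.
-- Trigger names only need to be unique per table, so a constant name suffices.
-- Tables created later still need the trigger added as part of their DDL.
DO $$
DECLARE
  child regclass;
BEGIN
  FOR child IN
    SELECT inhrelid::regclass
    FROM pg_inherits
    WHERE inhparent = 'entity'::regclass
  LOOP
    EXECUTE format(
      'CREATE TRIGGER update_updated_at_timestamp
         BEFORE UPDATE ON %s
         FOR EACH ROW
         EXECUTE PROCEDURE update_updated_at_timestamp()',
      child
    );
  END LOOP;
END;
$$;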

Soft Deleted At

  1. Add a new column to your base table.

    ALTER TABLE entity ADD COLUMN soft_deleted_at timestamptz DEFAULT NULL;
  2. Enforce inserting a timestamp into this field when your application wants to "soft delete" data.

    To query only "not-soft-deleted" records, only select where soft_deleted_at IS NULL everywhere, including in joins.

  3. Add an index over the null-ness of this field (e.g. a partial index; see the sketch after this list).

    This way, if soft-deleted rows accrete and become a disproportionate presence in your table, the not-soft-deleted ones remain rapidly queryable.

  4. Consider a periodic sweep of soft-deleted records.

    This must align with your data retention and right-to-be-forgotten policies. It must also support your "undo" and audit log features. Within those constraints, however, regularly purge soft-deleted records to keep your dataset minimal.
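
Putting steps 2 through 4 together, a hedged sketch for a hypothetical <existing_table> (step 3 is read here as a partial index over non-deleted rows, and the 90-day sweep window is purely illustrative):

-- 2. "Soft delete" by stamping the row instead of removing it.
UPDATE <existing_table> SET soft_deleted_at = CURRENT_TIMESTAMP WHERE id = $1;

-- 2. Filter soft-deleted rows out everywhere, including in joins.
SELECT * FROM <existing_table> WHERE soft_deleted_at IS NULL;

-- 3. A partial index so live-row queries stay fast as soft deletes accrete.
CREATE INDEX <existing_table>_live_idx
  ON <existing_table> (id)
  WHERE soft_deleted_at IS NULL;

-- 4. A periodic sweep, within your retention and right-to-be-forgotten policies.
DELETE FROM <existing_table>
WHERE soft_deleted_at < CURRENT_TIMESTAMP - INTERVAL '90 days';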

Footnotes

  1. Technically, you now have to worry about your write volume exhausting the probabilistic "guarantees" of UUIDs.

  2. Fair warning: I have only tried this approach in one meaningfully write-heavy production system, and have since re-implemented it in several hobby projects. After a year and a half in production before I left that team, it was still working well, and I still really love using it in my side projects.

  3. If you are even just an order of magnitude below the write volumes that could produce even occasional probabilistic UUID generation collisions across your database, let alone within single tables, you should be anxiously researching non-table-based multi-tenancy or sharding approaches for your storage.
