@yashap
Last active September 13, 2023 04:10
Example of how sometimes a normalized schema + joins kills performance vs. denormalizing

Normalized/join vs. denormalized/no-join performance example

An example of how sometimes you have to denormalize your schema, so that you can build compound indexes, to get acceptable performance. In some cases a normalized schema with join(s) just won't be fast enough.

Note: I think in almost all cases you should start with a normalized schema and use joins. But when you hit cases like the above, it's fine to denormalize just for these specific cases - often you'll just have one or a few such cases in your entire app, where the combination of data size/shape/queries means you cannot have efficient queries without denormalizing.

The problem case

Two tables, both large, where you sort on one table, filter on the other, reading out one page of data. Filtering eliminates a lot of records, but also still leaves you with a lot of records.

As an example, imagine a transportation app, where you have trips that have a pickup time and a rider, and the rider could be a wheelchair or ambulatory rider. We want to get 1 page of wheelchair riders, sorted by pickup time.

First, let's create a Postgres DB, with some persistent storage at ~/pg_example_data:

docker run \
  --name pg_example \
  -v "$HOME/pg_example_data":/var/lib/postgresql/data \
  --cpus 2 \
  --memory 8g \
  -e POSTGRES_USER=example_user \
  -e POSTGRES_PASSWORD=example_pwd \
  -e POSTGRES_DB=example \
  -p 5440:5432 \
  -d \
  postgres

Connect to it:

docker exec -it pg_example /bin/bash

# Note: may have to wait a bit before this command works, as the DB comes up
psql "postgres://example_user:example_pwd@localhost:5432/example?sslmode=disable"

Set up our schema:

CREATE TYPE "rider_type" AS ENUM ('ambulatory', 'wheelchair');

CREATE TABLE "rider" (
  "id" SERIAL PRIMARY KEY NOT NULL,
  "type" "rider_type" NOT NULL
);
CREATE INDEX "rider_type_idx" ON "rider"("type");

CREATE TABLE "trip" (
  "id" SERIAL PRIMARY KEY NOT NULL,
  "pickup_at" TIMESTAMP NOT NULL,
  "rider_id" INTEGER NOT NULL REFERENCES "rider"("id"),
  -- Note: this is a denormalization, but we'll use it to show how denormalizing can
  -- lead to faster queries than joins, because it lets you create efficient compound indexes
  "rider_type" "rider_type" NOT NULL
);
CREATE INDEX "trip_rider_id_idx" ON "trip"("rider_id");
CREATE INDEX "trip_pickup_at_idx" ON "trip"("pickup_at");
-- Index for the denormalized data
-- Depending on the data, sometimes sort key first/filter second is best, sometimes the opposite;
-- try both with your data/queries
CREATE INDEX "trip_rider_type_pickup_at_idx" ON "trip"("rider_type", "pickup_at");

Create a bunch of fake data:

-- Create 1,000,000 riders, where 5% are wheelchair riders, 95% ambulatory
-- This will take a long time to run
INSERT INTO "rider"
SELECT
  "i"
  , CASE WHEN RANDOM() < 0.05 THEN 'wheelchair'::"rider_type" ELSE 'ambulatory'::"rider_type" END
FROM
  GENERATE_SERIES(1, 1000000) AS "i"
;

-- Create 3,000,000 trips, 3 per rider
-- This will take a long time to run as well (substantially longer than creating the riders)
INSERT INTO "trip" ("pickup_at", "rider_id", "rider_type")
SELECT
  NOW() + (RANDOM() * (INTERVAL '60 DAYS'))
  , "id"
  , "type"
FROM
  (
    SELECT * FROM "rider"
    UNION ALL
    SELECT * FROM "rider"
    UNION ALL
    SELECT * FROM "rider"
  ) AS "x"
;
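
If you'd like to sanity-check the generated data (roughly 5% wheelchair riders, and 3,000,000 trips), a quick count works; this isn't part of the original steps:

-- Check the rider split and the trip count
SELECT "type", COUNT(*) FROM "rider" GROUP BY "type";
SELECT COUNT(*) FROM "trip";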

Ensure a cold cache, then reconnect to PG:

exit
exit
docker stop pg_example
docker start pg_example
docker exec -it pg_example /bin/bash

# Note: may have to wait a bit before this command works, as the DB comes up
psql "postgres://example_user:example_pwd@localhost:5432/example?sslmode=disable"

Try out the performance of queries we'd have to run with a normalized schema + join:

-- First, turn on timing
\timing

-- Then
SELECT
  "trip"."id"
FROM
  "trip"
JOIN
  "rider" ON "rider"."id" = "trip"."rider_id"
WHERE
  "rider"."type" = 'wheelchair'
ORDER BY
  "trip"."pickup_at" DESC
LIMIT
  50
;

For me, this takes ~3.3 seconds, and this is not really THAT much data. With larger data sets you can quickly get to 10s of seconds, even minutes.
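
If you want to see the query plan/costs for this query yourself, you can prefix it with EXPLAIN. Note that ANALYZE actually executes the query and warms the cache, so repeat the cold-cache restart before re-timing (this step isn't in the original gist):

-- View the plan, costs, and buffer usage for the join query
EXPLAIN (ANALYZE, BUFFERS)
SELECT
  "trip"."id"
FROM
  "trip"
JOIN
  "rider" ON "rider"."id" = "trip"."rider_id"
WHERE
  "rider"."type" = 'wheelchair'
ORDER BY
  "trip"."pickup_at" DESC
LIMIT
  50
;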

While a completely cold cache is not realistic, moderately cold caches are common in practice, in my experience:

  • If you can fit 100% of indexes and hot data into memory, caches will be very warm quite consistently
  • However, this is rare in practice at scale. As you scale, your DB indexes + hot data grow to 100s of GBs, even TBs, and it's just too expensive to pay for DBs with that much RAM
    • Especially if you've got a managed cloud DB (like most startups), DBs with massive amounts of RAM are crazy expensive
    • So you're left with a DB where the indexes + hot data do not fit into RAM. The absolute hottest parts of indexes/data do, but decently often you're paging some of your indexes/data from disk into memory
    • Indexes are stored basically the same way on disk as in memory, so this is fine performance-wise as long as you only need to read a small part of the index to satisfy the query. But if you have to read the whole thing, it gets slow (a quick way to check your own table/index sizes is shown below)
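
A rough way (not from the original gist) to see how close you are to that situation is to compare your table and index sizes against the RAM you've given Postgres:

-- How big are the example tables and their indexes, vs. the 8 GB given to the container?
SELECT
  PG_SIZE_PRETTY(PG_TABLE_SIZE('trip')) AS "trip_table_size"
  , PG_SIZE_PRETTY(PG_INDEXES_SIZE('trip')) AS "trip_indexes_size"
  , PG_SIZE_PRETTY(PG_TABLE_SIZE('rider')) AS "rider_table_size"
  , PG_SIZE_PRETTY(PG_INDEXES_SIZE('rider')) AS "rider_indexes_size"
;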

Then we can ensure a cold cache again, and reconnect to PG:

exit
exit
docker stop pg_example
docker start pg_example
docker exec -it pg_example /bin/bash

# Note: may have to wait a bit before this command works, as the DB comes up
psql "postgres://example_user:example_pwd@localhost:5432/example?sslmode=disable"

Then try out the performance of queries we can run with a denormalized schema (no join, and using a compound index that's only possible to create because we denormalized):

-- First, turn on timing
\timing

-- Then
SELECT
  "trip"."id"
FROM
  "trip"
WHERE
  "trip"."rider_type" = 'wheelchair'
ORDER BY
  "trip"."pickup_at" DESC
LIMIT
  50
;

For me, this takes ~65 milliseconds, or ~50x faster than the join version.
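
As with the join version, you can inspect the query plan/costs yourself with EXPLAIN (not an original step). With the compound index in place you should typically see a backwards index scan on trip_rider_type_pickup_at_idx rather than a large join plus sort. Again, this warms the cache, so redo the cold-cache restart if you want to re-time:

-- View the plan for the denormalized, no-join query
EXPLAIN (ANALYZE, BUFFERS)
SELECT
  "trip"."id"
FROM
  "trip"
WHERE
  "trip"."rider_type" = 'wheelchair'
ORDER BY
  "trip"."pickup_at" DESC
LIMIT
  50
;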

The key takeaway: for specific data sets/queries, a normalized schema + joins sometimes means you cannot write an efficient query; you need some form of denormalization to achieve it. I definitely wouldn't start with denormalization + no joins. Start with normalized schemas and joins; they're fast enough the vast majority of the time. But as you scale, you'll probably run into the odd query like the above that you simply cannot make consistently fast, and that causes slow experiences for users. For those, you'll likely find that denormalizing and adding a compound index (like in the above example) is the only reasonable solution.

And finally, clean up:

exit
exit
docker rm -f pg_example --volumes && rm -rf ~/pg_example_data/
@shvargon

A very good example, but here are a few tips that could improve it.

There is no reason to publish the port when starting the container, since it isn't used in this example: the -p 5440:5432 switch can be dropped from the docker run command.

When connecting to a Docker container, there is no need to start bash and then launch psql. You can run psql directly, which has the following advantages:

  • The command is easy to find and re-run via terminal history search (Ctrl+R), since it's a single line
  • A double exit is not required, as in your example, since we go straight into psql without entering bash
docker exec -it pg_example psql "postgres://example_user:example_pwd@localhost:5432/example?sslmode=disable"

Or, if you are not happy with that notation and want to specify the full path, you can do it this way (although I like the short notation better):

docker exec -it pg_example /usr/bin/psql "postgres://example_user:example_pwd@localhost:5432/example?sslmode=disable"

By the way, you can find the path to the executable by connecting to the container and running which:

docker exec -it pg_example bash

root@5821a936485d:/# which psql
/usr/bin/psql

It's also better to stop and start the container this way, again for the sake of terminal history search: it's easier to repeat one command than to find and run two.

docker stop pg_example && docker start pg_example

rm -rf will most likely have to be run with sudo if the commands are executed as a non-root user.

Also, for those who don't know, it would be good to mention deleting the postgres image if it's no longer used by other containers.

You can look at my fork, where I've applied everything mentioned above.
