@yashap
Last active September 13, 2023 04:10
Example of how sometimes a normalized schema + joins kills performance vs. denormalizing

Normalized/join vs. denormalized/no-join performance example

An example of how sometimes you have to denormalize your schema, so that you can build compound indexes, to get acceptable performance. In some cases a normalized schema with join(s) just won't be fast enough.

Note: I think in almost all cases you should start with a normalized schema and use joins. But when you hit cases like the above, it's fine to denormalize just for these specific cases - often you'll just have one or a few such cases in your entire app, where the combination of data size/shape/queries means you cannot have efficient queries without denormalizing.

The problem case

Two tables, both large, where you sort on one table, filter on the other, reading out one page of data. Filtering eliminates a lot of records, but also still leaves you with a lot of records.

As an example, imagine a transportation app, where you have trips that have a pickup time and a rider, and the rider could be a wheelchair or ambulatory rider. We want to get 1 page of wheelchair riders, sorted by pickup time.

First, let's create a Postgres DB, with some persistent storage at ~/pg_example_data:

docker run \
  --name pg_example \
  -v "$HOME/pg_example_data":/var/lib/postgresql/data \
  --cpus 2 \
  --memory 8g \
  -e POSTGRES_USER=example_user \
  -e POSTGRES_PASSWORD=example_pwd \
  -e POSTGRES_DB=example \
  -p 5440:5432 \
  -d \
  postgres

Connect to it:

docker exec -it pg_example /bin/bash

# Note: may have to wait a bit before this command works, as the DB comes up
psql "postgres://example_user:example_pwd@localhost:5432/example?sslmode=disable"

Set up our schema:

CREATE TYPE "rider_type" AS ENUM ('ambulatory', 'wheelchair');

CREATE TABLE "rider" (
  "id" SERIAL PRIMARY KEY NOT NULL,
  "type" "rider_type" NOT NULL
);
CREATE INDEX "rider_type_idx" ON "rider"("type");

CREATE TABLE "trip" (
  "id" SERIAL PRIMARY KEY NOT NULL,
  "pickup_at" TIMESTAMP NOT NULL,
  "rider_id" INTEGER NOT NULL REFERENCES "rider"("id"),
  -- Note: this is a denormalization, but we'll use it to show how denormalizing can
  -- lead to faster queries than joins, because it lets you create efficient compound indexes
  "rider_type" "rider_type" NOT NULL
);
CREATE INDEX "trip_rider_id_idx" ON "trip"("rider_id");
CREATE INDEX "trip_pickup_at_idx" ON "trip"("pickup_at");
-- Index for the denormalized data
-- Depending on the data, sometimes sort key first/filter second is best, sometimes the opposite;
-- try both with your data/queries
CREATE INDEX "trip_rider_type_pickup_at_idx" ON "trip"("rider_type", "pickup_at");

Create a bunch of fake data:

-- Create 1,000,000 riders, where 5% are wheelchair riders, 95% ambulatory
-- This will take a long time to run
INSERT INTO "rider"
SELECT
  "i"
  , CASE WHEN RANDOM() < 0.05 THEN 'wheelchair'::"rider_type" ELSE 'ambulatory'::"rider_type" END
FROM
  GENERATE_SERIES(1, 1000000) AS "i"
;

-- Create 3,000,000 trips, 3 per rider
-- This will take a long time to run as well (substantially longer than creating the riders)
INSERT INTO "trip" ("pickup_at", "rider_id", "rider_type")
SELECT
  NOW() + (RANDOM() * (INTERVAL '60 DAYS'))
  , "id"
  , "type"
FROM
  (
    SELECT * FROM "rider"
    UNION ALL
    SELECT * FROM "rider"
    UNION ALL
    SELECT * FROM "rider"
  ) AS "x"
;
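
If you'd like to sanity-check the generated data (roughly 5% wheelchair riders, and 3,000,000 trips), a quick count works; this isn't part of the original steps:

-- Check the rider split and the trip count
SELECT "type", COUNT(*) FROM "rider" GROUP BY "type";
SELECT COUNT(*) FROM "trip";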

Ensure a cold cache, then reconnect to PG:

exit
exit
docker stop pg_example
docker start pg_example
docker exec -it pg_example /bin/bash

# Note: may have to wait a bit before this command works, as the DB comes up
psql "postgres://example_user:example_pwd@localhost:5432/example?sslmode=disable"

Try out the performance of queries we'd have to run with a normalized schema + join:

-- First, turn on timing
\timing

-- Then
SELECT
  "trip"."id"
FROM
  "trip"
JOIN
  "rider" ON "rider"."id" = "trip"."rider_id"
WHERE
  "rider"."type" = 'wheelchair'
ORDER BY
  "trip"."pickup_at" DESC
LIMIT
  50
;

For me, this takes ~3.3 seconds, and this is not really THAT much data. With larger data sets you can quickly get to 10s of seconds, even minutes.
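
If you want to see the query plan/costs for this query yourself, you can prefix it with EXPLAIN. Note that ANALYZE actually executes the query and warms the cache, so repeat the cold-cache restart before re-timing (this step isn't in the original gist):

-- View the plan, costs, and buffer usage for the join query
EXPLAIN (ANALYZE, BUFFERS)
SELECT
  "trip"."id"
FROM
  "trip"
JOIN
  "rider" ON "rider"."id" = "trip"."rider_id"
WHERE
  "rider"."type" = 'wheelchair'
ORDER BY
  "trip"."pickup_at" DESC
LIMIT
  50
;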

While a completely cold cache is not realistic, moderately cold caches are common in practice, in my experience:

  • If you can fit 100% of indexes and hot data into memory, caches will be very warm quite consistently
  • However, this is rare in practice at scale. As you scale, your DB indexes + hot data grow to 100s of GBs, even TBs, and it's just too expensive to pay for DBs with that much RAM
    • Especially if you've got a managed cloud DB (like most startups), DBs with massive amounts of RAM are crazy expensive
    • So you're left with a DB where the indexes + hot data do not fit into RAM. The absolute hottest parts of indexes/data do, but decently often you're paging some of your indexes/data from disk into memory
    • Indexes are stored basically the same way on disk as in memory, so this is fine performance-wise as long as you only need to read a small part of the index to satisfy the query. But if you have to read the whole thing, it gets slow (a quick way to check your own table/index sizes is shown below)
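
A rough way (not from the original gist) to see how close you are to that situation is to compare your table and index sizes against the RAM you've given Postgres:

-- How big are the example tables and their indexes, vs. the 8 GB given to the container?
SELECT
  PG_SIZE_PRETTY(PG_TABLE_SIZE('trip')) AS "trip_table_size"
  , PG_SIZE_PRETTY(PG_INDEXES_SIZE('trip')) AS "trip_indexes_size"
  , PG_SIZE_PRETTY(PG_TABLE_SIZE('rider')) AS "rider_table_size"
  , PG_SIZE_PRETTY(PG_INDEXES_SIZE('rider')) AS "rider_indexes_size"
;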

Then we can ensure a cold cache again, and reconnect to PG:

exit
exit
docker stop pg_example
docker start pg_example
docker exec -it pg_example /bin/bash

# Note: may have to wait a bit before this command works, as the DB comes up
psql "postgres://example_user:example_pwd@localhost:5432/example?sslmode=disable"

Then try out the performance of queries we can run with a denormalized schema (no join, and using a compound index that's only possible to create because we denormalized):

-- First, turn on timing
\timing

-- Then
SELECT
  "trip"."id"
FROM
  "trip"
WHERE
  "trip"."rider_type" = 'wheelchair'
ORDER BY
  "trip"."pickup_at" DESC
LIMIT
  50
;

For me, this takes ~65 milliseconds, or ~50x faster than the join version.
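
As with the join version, you can inspect the query plan/costs yourself with EXPLAIN (not an original step). With the compound index in place you should typically see a backwards index scan on trip_rider_type_pickup_at_idx rather than a large join plus sort. Again, this warms the cache, so redo the cold-cache restart if you want to re-time:

-- View the plan for the denormalized, no-join query
EXPLAIN (ANALYZE, BUFFERS)
SELECT
  "trip"."id"
FROM
  "trip"
WHERE
  "trip"."rider_type" = 'wheelchair'
ORDER BY
  "trip"."pickup_at" DESC
LIMIT
  50
;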

The key takeaway: for specific data sets/queries, a normalized schema + joins sometimes means you cannot write an efficient query; you need some form of denormalization to achieve it. I definitely wouldn't start with denormalization + no joins. Start with normalized schemas and joins; they're fast enough the vast majority of the time. But as you scale, you'll probably run into the odd query like the above that you simply cannot make consistently fast, and that causes slow experiences for users. For those, you'll likely find that denormalizing and adding a compound index (like in the above example) is the only reasonable solution.

And finally, clean up:

exit
exit
docker rm -f pg_example --volumes && rm -rf ~/pg_example_data/
@shvargon

A very good example, but here are a few tips that could improve it.

There is no reason to publish the port when starting the container, since it isn't used in this example: the -p 5440:5432 switch can be dropped from the docker run command.

When connecting to a Docker container, there is no need to start bash and then launch psql. You can run psql directly, which has the following advantages:

  • The command is easy to find and re-run via terminal history search (Ctrl+R), since it's a single line
  • A double exit is not required, as in your example, since we go straight into psql without entering bash
docker exec -it pg_example psql "postgres://example_user:example_pwd@localhost:5432/example?sslmode=disable"

Or, if you are not happy with that notation and want to specify the full path, you can do it this way (although I like the short notation better):

docker exec -it pg_example /usr/bin/psql "postgres://example_user:example_pwd@localhost:5432/example?sslmode=disable"

By the way, you can find the path to the executable by connecting to the container and running which:

docker exec -it pg_example bash

root@5821a936485d:/# which psql
/usr/bin/psql

It's also better to stop and start the container this way, again for the sake of terminal history search: it's easier to repeat one command than to find and run two.

docker stop pg_example && docker start pg_example

rm -rf will most likely have to be run with sudo if the commands are executed as a non-root user.

Also, for those who don't know, it would be good to mention deleting the postgres image if it's no longer used by other containers.

You can look at my fork, where I've applied everything mentioned above.
