- OS: Debian 12 VM on Proxmox VE 8
- CPUs: 8x Xeon E5-2620 v2
- RAM: 8 GiB DDR3
- Storage: Samsung PM863 NVMe SSDs in Ceph over Mellanox InfiniBand 56 Gbps IPoIB
- Virtualized via scsi-virtio-single, aio=native, io_threads=enabled
- FS: ext4 (rw,noatime)
- Postgres 15.6
effective_io_concurrency = 300
random_page_cost = 1.1
shared_buffers = 3072MB
work_mem = 16MB
CREATE TABLE users (
id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
created_at TIMESTAMP NOT NULL DEFAULT NOW(),
deleted_at TIMESTAMP,
is_deleted BOOL,
first_name VARCHAR(126) NOT NULL,
last_name VARCHAR(126) NOT NULL,
email VARCHAR(126) NOT NULL UNIQUE,
city VARCHAR(126) NOT NULL,
country VARCHAR(126) NOT NULL,
data JSONB
);
The generator produced non-null values for 25% of opt_bool, 66% of opt_date, and 20% of fanout_id. These are arbitrary values; higher fill percentages would increase timings and sizes to some extent.
{
  "context": {
    "opt_bool": ["", 0, 1],
    "opt_date": ["", ISO8601],
    "fanout_id": ["", UUIDv4]
  }
}
1,000,000 rows were inserted into the table. As this was done in successive passes, the table was then CLUSTERed on the PK (a monotonic integer) to better represent actual user data. VACUUM ANALYZE was then run, and statistics gathered:
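The reorder-and-analyze step above amounts to the following (the index name is taken from the \dti+ output shown later):

```sql
-- Physically rewrite the heap in primary-key order, then refresh
-- visibility information and planner statistics.
CLUSTER users USING users_pkey;
VACUUM ANALYZE users;
```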
postgres=# SELECT attname, null_frac, n_distinct, correlation FROM pg_stats WHERE tablename = 'users';
attname | null_frac | n_distinct | correlation
------------+-----------+--------------+----------------
id | 0 | -1 | 1
data | 0.7988333 | 15341 | 0.037587192
is_deleted | 0.9521667 | 1 | 1
last_name | 0 | 13528 | 0.0047757574
created_at | 0 | -0.31019545 | 1
first_name | 0 | 8420 | -0.004551076
email | 0 | -1 | -0.004673534
city | 0 | 7346 | -0.00078515126
deleted_at | 0.9521667 | -0.047833323 | 0.8148568
country | 0 | 211 | 0.034343384
(10 rows)
postgres=# \dti+ users*
List of relations
Schema | Name | Type | Owner | Table | Persistence | Access method | Size | Description
--------+-----------------------------+-------+----------+-------+-------------+---------------+--------+-------------
public | users | table | postgres | | permanent | heap | 653 MB |
public | users_email_uidx | index | postgres | users | permanent | btree | 226 MB |
public | users_pkey | index | postgres | users | permanent | btree | 107 MB |
(3 rows)
Postgres was then restarted and OS caches were dropped to gather cold and warm baseline data. This process was repeated before every test to ensure a fair comparison.
sudo systemctl stop postgresql &&
  sync &&
  sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches' &&
  sudo systemctl start postgresql
This selects the entire row where a known UUID exists, using the -> and ->> operators to extract the value as text.
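The distinction between the two operators can be seen in isolation with a literal (a minimal illustration, not part of the benchmark):

```sql
-- -> returns jsonb; ->> returns text. Chaining them first descends
-- into the "context" object, then extracts fanout_id as text.
SELECT '{"context": {"fanout_id": "abc"}}'::jsonb -> 'context' ->> 'fanout_id';
-- => abc
```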
postgres=# EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM users WHERE data->'context'->>'fanout_id' = 'd3fd770b-c3a4-40e3-949c-540aa6419f39';
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------
Gather (cost=1000.00..123614.75 rows=25048 width=183) (actual time=55.854..2698.755 rows=1 loops=1)
Workers Planned: 2
Workers Launched: 2
Buffers: shared read=83582
-> Parallel Seq Scan on users (cost=0.00..120109.95 rows=10437 width=183) (actual time=1766.471..2637.568 rows=0 loops=3)
Filter: (((data -> 'context'::text) ->> 'fanout_id'::text) = 'd3fd770b-c3a4-40e3-949c-540aa6419f39'::text)
Rows Removed by Filter: 1666666
Buffers: shared read=83582
Planning:
Buffers: shared hit=107 read=36
Planning Time: 31.527 ms
JIT:
Functions: 12
Options: Inlining false, Optimization false, Expressions true, Deforming true
Timing: Generation 3.415 ms, Inlining 0.000 ms, Optimization 4.170 ms, Emission 64.969 ms, Total 72.554 ms
Execution Time: 3041.361 ms
(16 rows)
With a cold cache, the execution time was 3041.361 ms, and with a warm cache, 397.397 ms.
This selects the entire row where a known UUID exists, and uses the @> operator to check containment.
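Containment asks whether the right-hand jsonb is contained within the left-hand one, which a couple of literals make concrete (illustrative only):

```sql
-- @> tests whether the left jsonb contains the right jsonb.
SELECT '{"a": 1, "b": 2}'::jsonb @> '{"a": 1}';          -- => true
SELECT '{"a": 1}'::jsonb @> '{"a": 1, "b": 2}';          -- => false
```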
postgres=# EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM users WHERE (data->'context') @> '{"fanout_id": "d3fd770b-c3a4-40e3-949c-540aa6419f39"}';
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------
Gather (cost=1000.00..120901.18 rows=50095 width=183) (actual time=57.409..2830.096 rows=1 loops=1)
Workers Planned: 2
Workers Launched: 2
Buffers: shared read=83582
-> Parallel Seq Scan on users (cost=0.00..114891.68 rows=20873 width=183) (actual time=1850.758..2764.774 rows=0 loops=3)
Filter: ((data -> 'context'::text) @> '{"fanout_id": "d3fd770b-c3a4-40e3-949c-540aa6419f39"}'::jsonb)
Rows Removed by Filter: 1666666
Buffers: shared read=83582
Planning:
Buffers: shared hit=99 read=29
Planning Time: 32.520 ms
JIT:
Functions: 12
Options: Inlining false, Optimization false, Expressions true, Deforming true
Timing: Generation 3.376 ms, Inlining 0.000 ms, Optimization 4.291 ms, Emission 66.648 ms, Total 74.315 ms
Execution Time: 3195.974 ms
(16 rows)
With a cold cache, the execution time was 3195.974 ms, and with a warm cache, 412.312 ms.
A GIN index was created on the data column using jsonb_path_ops, which supports only a limited set of operators (@>, @?, and @@) but is generally faster and smaller than the default jsonb_ops.
postgres=# CREATE INDEX users_data_gin_pathops_idx ON users USING GIN(data jsonb_path_ops);
CREATE INDEX
Time: 5563.825 ms (00:05.564)
postgres=# \di+ users_data*
List of relations
Schema | Name | Type | Owner | Table | Persistence | Access method | Size | Description
--------+-----------------------------+-------+----------+-------+-------------+---------------+-------+-------------
public | users_data_gin_pathops_idx | index | postgres | users | permanent | gin | 52 MB |
(1 row)
postgres=# EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM users WHERE data @> '{"context": {"fanout_id": "d3fd770b-c3a4-40e3-949c-540aa6419f39"}}';
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on users (cost=6.06..87.06 rows=73 width=184) (actual time=2.812..2.819 rows=1 loops=1)
Recheck Cond: (data @> '{"context": {"fanout_id": "d3fd770b-c3a4-40e3-949c-540aa6419f39"}}'::jsonb)
Heap Blocks: exact=1
Buffers: shared hit=1 read=4
-> Bitmap Index Scan on users_data_gin_pathops_idx (cost=0.00..6.05 rows=73 width=0) (actual time=1.600..1.605 rows=1 loops=1)
Index Cond: (data @> '{"context": {"fanout_id": "d3fd770b-c3a4-40e3-949c-540aa6419f39"}}'::jsonb)
Buffers: shared hit=1 read=3
Planning:
Buffers: shared hit=116 read=39 dirtied=1
Planning Time: 30.644 ms
Execution Time: 3.818 ms
(11 rows)
With a cold cache, the execution time was 3.818 ms, and with a warm cache, 0.076 ms.
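The same jsonb_path_ops index can also serve jsonpath queries via the @? operator; an equivalent formulation of the lookup above would be the following (a sketch; it was not benchmarked here):

```sql
-- jsonpath filter expression: match rows whose context.fanout_id
-- equals the target UUID. Served by the same GIN index.
SELECT * FROM users
WHERE data @? '$.context.fanout_id ? (@ == "d3fd770b-c3a4-40e3-949c-540aa6419f39")';
```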
Next, a traditional BTREE expression index was created on the fanout_id key.
postgres=# CREATE INDEX users_data_btree_idx ON users USING BTREE((data -> 'context' ->> 'fanout_id'));
CREATE INDEX
Time: 3312.203 ms (00:03.312)
postgres=# \di+ users_data*
List of relations
Schema | Name | Type | Owner | Table | Persistence | Access method | Size | Description
--------+----------------------------+-------+----------+-------+-------------+---------------+-------+-------------
public | users_data_btree_idx | index | postgres | users | permanent | btree | 43 MB |
public | users_data_gin_pathops_idx | index | postgres | users | permanent | gin | 52 MB |
(2 rows)
postgres=# EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM users WHERE data->'context'->>'fanout_id' = 'd3fd770b-c3a4-40e3-949c-540aa6419f39';
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
Index Scan using users_data_btree_idx on users (cost=0.43..90.38 rows=133 width=184) (actual time=4.711..4.713 rows=1 loops=1)
Index Cond: (((data -> 'context'::text) ->> 'fanout_id'::text) = 'd3fd770b-c3a4-40e3-949c-540aa6419f39'::text)
Buffers: shared read=4
Planning:
Buffers: shared hit=128 read=47 dirtied=9
Planning Time: 24.991 ms
Execution Time: 5.984 ms
(7 rows)
With a cold cache, the execution time was 5.984 ms, and with a warm cache, 0.077 ms. This is effectively identical to the GIN index, and doesn't suffer from the GIN update problem. However, it must be noted that this is a brand-new table, so the indices are all freshly built, with minimal bloat. As new rows are added, this would change.
A hash index was created on fanout_id (which took considerably longer to build than any of the others):
postgres=# CREATE INDEX users_data_hash_idx ON users USING HASH ((data->'context'->>'fanout_id'));
CREATE INDEX
Time: 227316.588 ms (03:47.317)
postgres=# \di+ users_data*
List of relations
Schema | Name | Type | Owner | Table | Persistence | Access method | Size | Description
--------+----------------------------+-------+----------+-------+-------------+---------------+--------+-------------
public | users_data_btree_idx | index | postgres | users | permanent | btree | 43 MB |
public | users_data_gin_pathops_idx | index | postgres | users | permanent | gin | 52 MB |
public | users_data_hash_idx | index | postgres | users | permanent | hash | 143 MB |
(3 rows)
postgres=# EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM users WHERE data->'context'->>'fanout_id' = 'd3fd770b-c3a4-40e3-949c-540aa6419f39';
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
Index Scan using users_data_hash_idx on users (cost=0.00..150.84 rows=134 width=184) (actual time=2.784..2.787 rows=1 loops=1)
Index Cond: (((data -> 'context'::text) ->> 'fanout_id'::text) = 'd3fd770b-c3a4-40e3-949c-540aa6419f39'::text)
Buffers: shared read=3
Planning:
Buffers: shared hit=143 read=45
Planning Time: 22.285 ms
Execution Time: 4.439 ms
(7 rows)
With a cold cache, the execution time was 4.439 ms, and with a warm cache, 0.063 ms. However, the index is ~3.3x the size of the BTREE index. Benchmarking under realistic load would be necessary to determine their relative bloat and performance over time.
The pgstattuple extension was used to gather statistics about the indices created (other than GIN). Again, it must be emphasized that these are effectively pristine indices with no bloat, and are not indicative of how they would look in production. In practice, the BTREE index would approach 50% fragmentation over time.
postgres=# SELECT internal_pages, leaf_pages, avg_leaf_density, leaf_fragmentation FROM pgstatindex('users_data_btree_idx');
internal_pages | leaf_pages | avg_leaf_density | leaf_fragmentation
----------------+------------+------------------+--------------------
34 | 5471 | 90.01 | 0
(1 row)
postgres=# SELECT bucket_pages, overflow_pages, bitmap_pages, unused_pages, ROUND(free_percent::numeric, 2) FROM pgstathashindex('users_data_hash_idx');
bucket_pages | overflow_pages | bitmap_pages | unused_pages | round
--------------+----------------+--------------+--------------+-------
16384 | 1965 | 1 | 0 | 86.63
(1 row)