@stephanGarland
Created October 5, 2023 11:40
Demonstrating how indexing booleans is often not helpful

System Specifications

  • OS: Debian Bullseye 5.10.0-23-amd64
  • Virtualized: Yes (Proxmox)
  • CPU: E5-2650 v2 @ 2.60GHz
  • Allocated Core Count: 16
  • Allocated RAM: 64 GiB PC3-12800R
  • Disk: Samsung PM983 1.92 TiB via Ceph
  • Filesystem: XFS
  • Mount Options: defaults,noatime
  • Postgres Version: 15.3
  • Postgres Options (non-default; a quick check is shown after this list):
    • shared_buffers = 16GB
    • max_wal_size = 8GB
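
Not part of the original gist, but if you want to confirm these settings on a running instance, one way is:

SELECT name, current_setting(name) AS value
FROM pg_settings
WHERE name IN ('shared_buffers', 'max_wal_size');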

Schema

The table is left over from a class on normalization that I taught at my last job (this version is deliberately unnormalized).

CREATE TABLE public.todo_unf (
    id bigint NOT NULL PRIMARY KEY,
    user_id uuid NOT NULL,
    user_email character varying(254) NOT NULL,
    created_at timestamp with time zone DEFAULT now() NOT NULL,
    tags jsonb DEFAULT '[]'::jsonb NOT NULL,
    shared_with jsonb DEFAULT '[]'::jsonb NOT NULL,
    due_date timestamp with time zone,
    is_completed boolean DEFAULT false NOT NULL,
    is_overdue boolean DEFAULT false NOT NULL,
    description text NOT NULL,
    title character varying(255) NOT NULL,
    user_name character varying(255) NOT NULL
);

The table has 25,000,000 rows, precisely half of which have is_completed = TRUE:

SELECT COUNT(*) FROM todo_unf;
  count   
----------
 25000000
(1 row)

SELECT COUNT(*) FROM todo_unf WHERE is_completed;
  count   
----------
 12500000
(1 row)
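
The load script isn't included in the gist. Purely as a sketch, a table of the same shape with an exact 50/50 split of is_completed could be generated with something like the following (the real data presumably differs; the email, title, and description values here are placeholders):

INSERT INTO todo_unf (id, user_id, user_email, due_date, is_completed,
                      description, title, user_name)
SELECT g,
       gen_random_uuid(),                        -- built-in since Postgres 13
       format('user%s@example.com', g),
       now() + (g % 30) * interval '1 day',
       g % 2 = 0,                                -- exactly half TRUE
       'placeholder description',
       format('todo #%s', g),
       format('user %s', g)
FROM generate_series(1, 25000000) AS g;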

There are other indices as well, but the two below were created specifically for this demonstration.

CREATE INDEX todo_unf_is_comp ON public.todo_unf USING btree (is_completed);
CREATE INDEX todo_unf_is_comp_false ON public.todo_unf USING btree (is_completed) WHERE (NOT is_completed);
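
As a side note (not shown in the original gist), you can check how much space each of these indexes occupies; even with a single boolean key, a B-tree over 25 million rows is not free:

SELECT pg_size_pretty(pg_relation_size('todo_unf_is_comp'))       AS full_idx_size,
       pg_size_pretty(pg_relation_size('todo_unf_is_comp_false')) AS partial_idx_size;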

Demonstration

EXPLAIN (ANALYZE, BUFFERS, COSTS) SELECT * FROM todo_unf WHERE NOT is_completed;
                                                        QUERY PLAN                                                         
---------------------------------------------------------------------------------------------------------------------------
 Seq Scan on todo_unf  (cost=0.00..946414.00 rows=12446518 width=186) (actual time=73.242..4992.580 rows=12500000 loops=1)
   Filter: (NOT is_completed)
   Rows Removed by Filter: 12500000
   Buffers: shared hit=696417
 Planning Time: 0.459 ms
 JIT:
   Functions: 2
   Options: Inlining true, Optimization true, Expressions true, Deforming true
   Timing: Generation 1.107 ms, Inlining 3.020 ms, Optimization 44.039 ms, Emission 25.878 ms, Total 74.045 ms
 Execution Time: 5679.940 ms
(10 rows)

As you can see, the DB has been warmed up and the entire table is in shared buffers (all shared hits, no disk reads). Even so, Postgres chose a sequential scan, which took 5.7 seconds.
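
The gist doesn't show how the cache was warmed (presumably just by prior runs of the same query). If you want to reproduce the warm-cache condition explicitly, the pg_prewarm contrib extension can load a relation into shared buffers:

CREATE EXTENSION IF NOT EXISTS pg_prewarm;
SELECT pg_prewarm('todo_unf');   -- returns the number of blocks loaded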

I'll now disable sequential scans and re-run:

SET enable_seqscan=OFF;
SET
                                                                       QUERY PLAN                                                                       
--------------------------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on todo_unf  (cost=107648.65..1048156.09 rows=12446518 width=186) (actual time=1123.364..6877.809 rows=12500000 loops=1)
   Recheck Cond: (NOT is_completed)
   Rows Removed by Index Recheck: 12392772
   Heap Blocks: exact=38348 lossy=658069
   Buffers: shared hit=706941
   ->  Bitmap Index Scan on todo_unf_is_comp_false  (cost=0.00..104537.02 rows=12446518 width=0) (actual time=1084.098..1084.099 rows=12500000 loops=1)
         Buffers: shared hit=10524
 Planning Time: 0.385 ms
 JIT:
   Functions: 2
   Options: Inlining true, Optimization true, Expressions true, Deforming true
   Timing: Generation 2.142 ms, Inlining 1.748 ms, Optimization 17.962 ms, Emission 10.057 ms, Total 31.908 ms
 Execution Time: 7563.283 ms
(13 rows)

This time it used the partial index todo_unf_is_comp_false, resulting in a bitmap index scan plus a bitmap heap scan (expected, since the query selects columns the index doesn't cover), and a total time of 7.6 seconds, slower than the plain sequential scan.
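
For contrast (not one of the original runs), a query that needs only data stored in the index itself is at least a candidate for an index-only scan, which avoids most heap visits; whether Postgres actually chooses one depends on the visibility map and statistics:

EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*) FROM todo_unf WHERE NOT is_completed;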

Now I'll disable the partial index, forcing Postgres to fall back to the full index. I use a function for this (sketched below), but for a one-off you can run UPDATE pg_index SET indisvalid = FALSE WHERE indexrelid = 'your_index_name'::regclass; (this modifies a system catalog, so it requires superuser and is best kept to a test instance). Alternatively, for this case you could simply test the opposite predicate (WHERE is_completed), since that can't use the partial index.
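
The toggle_idx() function itself isn't included in the gist; a hypothetical sketch of such a helper might look like this. It flips pg_index.indisvalid via a direct catalog update, so again: superuser only, throwaway instances only.

CREATE OR REPLACE FUNCTION toggle_idx(idx_name text) RETURNS text AS $$
DECLARE
    new_state boolean;
BEGIN
    -- Flip the index's validity flag and capture the new state
    UPDATE pg_index
       SET indisvalid = NOT indisvalid
     WHERE indexrelid = idx_name::regclass
    RETURNING indisvalid INTO new_state;

    RETURN format('index %s is %s', idx_name,
                  CASE WHEN new_state THEN 'ENABLED' ELSE 'DISABLED' END);
END;
$$ LANGUAGE plpgsql;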

SELECT toggle_idx('todo_unf_is_comp_false');
                toggle_idx                
------------------------------------------
 index todo_unf_is_comp_false is DISABLED
(1 row)

                                                                   QUERY PLAN                                                                   
------------------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on todo_unf  (cost=138580.94..1079088.38 rows=12446518 width=186) (actual time=923.364..6981.017 rows=12500000 loops=1)
   Filter: (NOT is_completed)
   Rows Removed by Filter: 12392772
   Heap Blocks: exact=38348 lossy=658069
   Buffers: shared hit=706941
   ->  Bitmap Index Scan on todo_unf_is_comp  (cost=0.00..135469.32 rows=12446517 width=0) (actual time=884.125..884.126 rows=12500000 loops=1)
         Index Cond: (is_completed = false)
         Buffers: shared hit=10524
 Planning:
   Buffers: shared hit=8
 Planning Time: 0.686 ms
 JIT:
   Functions: 2
   Options: Inlining true, Optimization true, Expressions true, Deforming true
   Timing: Generation 1.548 ms, Inlining 1.763 ms, Optimization 18.547 ms, Emission 10.064 ms, Total 31.922 ms
 Execution Time: 7735.204 ms
(16 rows)

This run, at 7.7 seconds, was close to the previous partial-index run, and both were slower than the plain sequential scan; a still larger table would likely show even more amplification.
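
Not shown in the gist, but if you're following along, remember to undo the session and catalog changes afterwards:

RESET enable_seqscan;
SELECT toggle_idx('todo_unf_is_comp_false');   -- re-enable the partial index via the same helper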
