
Burning Glass query analysis

(TL;DR: see "Recommendations" at the end)

OK, so. There are a few things going on here. The example query as originally written:

explain analyze
select count(*)
  from "burning_glass_jobs"
 inner join (
        select "burning_glass_jobs"."id" as pg_search_id
             , (ts_rank((setweight(to_tsvector('simple', coalesce("burning_glass_jobs"."title"::text, '')), 'A') || to_tsvector('simple', coalesce("burning_glass_jobs"."description"::text, ''))), (to_tsquery('simple', ''' ' || 'business' || ' ''' || ':*')), 0)) as rank
          from "burning_glass_jobs"
         where (((setweight(to_tsvector('simple', coalesce("burning_glass_jobs"."title"::text, '')), 'A') || to_tsvector('simple', coalesce("burning_glass_jobs"."description"::text, ''))) @@ (to_tsquery('simple', ''' ' || 'business' || ' ''' || ':*'))))
       ) as pg_search_9049415c020507460935a8
    on "burning_glass_jobs"."id" = pg_search_9049415c020507460935a8.pg_search_id
 where "burning_glass_jobs"."internship" = 'false';

Note that the outer query (the first three lines, plus the final where clause) joins the entire table to the subquery just to apply the not internship filter, and performs the count there. This forces the planner to add the table to the plan twice:

 Finalize Aggregate  (cost=2432267.16..2432267.17 rows=1 width=8) (actual time=463552.349..463919.615 rows=1 loops=1)
   ->  Gather  (cost=2432266.74..2432267.15 rows=4 width=8) (actual time=463552.334..463919.605 rows=5 loops=1)
         Workers Planned: 4
         Workers Launched: 4
         ->  Partial Aggregate  (cost=2431266.74..2431266.75 rows=1 width=8) (actual time=463519.194..463519.196 rows=1 loops=5)
               ->  Nested Loop  (cost=19484.41..2429784.21 rows=593012 width=0) (actual time=446.430..463445.663 rows=492875 loops=5)
                     ->  Parallel Bitmap Heap Scan on burning_glass_jobs burning_glass_jobs_1  (cost=19483.98..1754566.06 rows=597112 width=8) (actual time=445.764..459919.893 rows=497542 loops=5)
                           Recheck Cond: ((setweight(to_tsvector('simple'::regconfig, COALESCE((title)::text, ''::text)), 'A'::"char") || to_tsvector('simple'::regconfig, COALESCE((description)::text, ''::text))) @@ '''business'':*'::tsquery)
                           Rows Removed by Index Recheck: 489458
                           Heap Blocks: exact=38371 lossy=78236
                           ->  Bitmap Index Scan on index_burning_glass_jobs_full_search  (cost=0.00..18886.86 rows=2388449 width=0) (actual time=435.088..435.088 rows=1208534 loops=1)
                                 Index Cond: ((setweight(to_tsvector('simple'::regconfig, COALESCE((title)::text, ''::text)), 'A'::"char") || to_tsvector('simple'::regconfig, COALESCE((description)::text, ''::text))) @@ '''business'':*'::tsquery)
                     ->  Index Scan using burning_glass_jobs_pkey on burning_glass_jobs  (cost=0.43..1.13 rows=1 width=8) (actual time=0.007..0.007 rows=1 loops=2487712)
                           Index Cond: (id = burning_glass_jobs_1.id)
                           Filter: (NOT internship)
                           Rows Removed by Filter: 0
 Planning time: 0.794 ms
 Execution time: 463932.878 ms
(18 rows)

For burning_glass_jobs_1, it's actually using the GIN index that Josh suggested adding in the gist. But it's ALSO doing an Index Scan using burning_glass_jobs_pkey on burning_glass_jobs -- and while that says "Index Scan", using the PK index here is essentially like scanning the table in index order (note the 2,487,712 loops on that node). Then it does a nested loop to join the text-search matches to the non-internship rows. The whole thing takes about 8 minutes.

We can eliminate the extra scan and join somewhat naively, by turning the inner query into a CTE, moving the not internship filter inline there, and then counting the result in the outer query, like so:

explain analyze
with jobs as
(
    select "burning_glass_jobs"."id" as pg_search_id
             , (ts_rank((setweight(to_tsvector('simple', coalesce("burning_glass_jobs"."title"::text, '')), 'A') || to_tsvector('simple', coalesce("burning_glass_jobs"."description"::text, ''))), (to_tsquery('simple', ''' ' || 'business' || ' ''' || ':*')), 0)) as rank
          from "burning_glass_jobs"
         where (((setweight(to_tsvector('simple', coalesce("burning_glass_jobs"."title"::text, '')), 'A') || to_tsvector('simple', coalesce("burning_glass_jobs"."description"::text, ''))) @@ (to_tsquery('simple', ''' ' || 'business' || ' ''' || ':*'))))
           and "burning_glass_jobs"."internship" = 'false'
)
select count(*) from jobs

Unfortunately, although this eliminates the extra join processing, it actually takes longer to run (nearly 12 minutes). Part of the cost is that the CTE is materialized and then scanned again to do the count (note the CTE Scan over ~2.5 million rows in the plan below):

 Aggregate  (cost=3249952.20..3249952.21 rows=1 width=8) (actual time=709430.211..709767.028 rows=1 loops=1)
   CTE jobs
     ->  Gather  (cost=20479.88..3196581.12 rows=2372048 width=12) (actual time=499.822..708800.279 rows=2464374 loops=1)
           Workers Planned: 4
           Workers Launched: 4
           ->  Parallel Bitmap Heap Scan on burning_glass_jobs  (cost=19479.88..2055515.55 rows=593012 width=12) (actual time=466.644..708803.210 rows=492875 loops=5)
                 Recheck Cond: ((setweight(to_tsvector('simple'::regconfig, COALESCE((title)::text, ''::text)), 'A'::"char") || to_tsvector('simple'::regconfig, COALESCE((description)::text, ''::text))) @@ '''business'':*'::tsquery)
                 Rows Removed by Index Recheck: 489458
                 Filter: (NOT internship)
                 Rows Removed by Filter: 4668
                 Heap Blocks: exact=37880 lossy=77991
                 ->  Bitmap Index Scan on index_burning_glass_jobs_full_search  (cost=0.00..18886.86 rows=2388449 width=0) (actual time=462.664..462.665 rows=1208534 loops=1)
                       Index Cond: ((setweight(to_tsvector('simple'::regconfig, COALESCE((title)::text, ''::text)), 'A'::"char") || to_tsvector('simple'::regconfig, COALESCE((description)::text, ''::text))) @@ '''business'':*'::tsquery)
   ->  CTE Scan on jobs  (cost=0.00..47440.96 rows=2372048 width=0) (actual time=499.826..709241.085 rows=2464374 loops=1)
 Planning time: 0.500 ms
 Execution time: 709785.697 ms
(16 rows)

But if we just need the count here, we can do without the CTE, as well as the select list from the inner query:

explain analyze
select count(*)
  from "burning_glass_jobs"
 where (((setweight(to_tsvector('simple', coalesce("burning_glass_jobs"."title"::text, '')), 'A') || to_tsvector('simple', coalesce("burning_glass_jobs"."description"::text, ''))) @@ (to_tsquery('simple', ''' ' || 'business' || ' ''' || ':*'))))
   and not burning_glass_jobs.internship;

This simpler query performs slightly better than the original query, but not by much (roughly 7.5 minutes):

Finalize Aggregate  (cost=1757044.91..1757044.92 rows=1 width=8) (actual time=453874.883..454209.525 rows=1 loops=1)
   ->  Gather  (cost=1757044.49..1757044.90 rows=4 width=8) (actual time=453871.338..454209.513 rows=5 loops=1)
         Workers Planned: 4
         Workers Launched: 4
         ->  Partial Aggregate  (cost=1756044.49..1756044.50 rows=1 width=8) (actual time=453855.653..453855.654 rows=1 loops=5)
               ->  Parallel Bitmap Heap Scan on burning_glass_jobs  (cost=19479.88..1754561.96 rows=593012 width=0) (actual time=478.233..453767.649 rows=492875 loops=5)
                     Recheck Cond: ((setweight(to_tsvector('simple'::regconfig, COALESCE((title)::text, ''::text)), 'A'::"char") || to_tsvector('simple'::regconfig, COALESCE((description)::text, ''::text))) @@ '''business'':*'::tsquery)
                     Rows Removed by Index Recheck: 489458
                     Filter: (NOT internship)
                     Rows Removed by Filter: 4668
                     Heap Blocks: exact=38345 lossy=80465
                     ->  Bitmap Index Scan on index_burning_glass_jobs_full_search (cost=0.00..18886.86 rows=2388449 width=0) (actual time=449.767..449.767 rows=1208534 loops=1)
                           Index Cond: ((setweight(to_tsvector('simple'::regconfig, COALESCE((title)::text, ''::text)), 'A'::"char") || to_tsvector('simple'::regconfig, COALESCE((description)::text, ''::text))) @@ '''business'':*'::tsquery)
 Planning time: 0.448 ms
 Execution time: 454221.392 ms
(15 rows)

Given the number of records here, the query planner is going to prefer a bitmap scan on the index to a straight index scan. But the bitmap here goes lossy (see the Heap Blocks: exact=... lossy=... lines): for the lossy pages, the first pass (Bitmap Index Scan) only keeps track of which pages have matches, not where the matching rows are on them. So the second pass (Bitmap Heap Scan) has to revisit those pages and recheck every row on them, recalculating the setweight() and to_tsvector() expressions at execution time in order to confirm the matches. (In the CTE version, it was also calculating ts_rank() because it was in the select list, which I think accounts for the extra processing time on the recheck.)
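
One knob that might be worth an experiment (not something tested as part of this analysis): the lossy heap blocks appear when the bitmap outgrows work_mem, so raising work_mem for the session may keep the bitmap exact and cut down on the recheck work. The 256MB value below is an illustrative guess, not a tuned setting:

-- Untested idea: "lossy" heap blocks mean the bitmap exceeded work_mem and was
-- degraded to page granularity, which is what forces the per-row recheck.
-- (256MB is an arbitrary example value, not something tuned for this server.)
set work_mem = '256MB';

explain analyze
select count(*)
  from "burning_glass_jobs"
 where (((setweight(to_tsvector('simple', coalesce("burning_glass_jobs"."title"::text, '')), 'A') || to_tsvector('simple', coalesce("burning_glass_jobs"."description"::text, ''))) @@ (to_tsquery('simple', ''' ' || 'business' || ' ''' || ':*'))))
   and not burning_glass_jobs.internship;

reset work_mem;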

So: we can improve this by caching the to_tsvector() calculations in columns on the table:

alter table burning_glass_jobs add column weighted_title tsvector;
alter table burning_glass_jobs add column text_search_description tsvector;
update burning_glass_jobs set weighted_title = setweight(to_tsvector('simple', coalesce("burning_glass_jobs"."title"::text, '')), 'A');
update burning_glass_jobs set text_search_description = to_tsvector('simple', coalesce("burning_glass_jobs"."description"::text, ''));
create index index_burning_glass_jobs_full_search_cached on burning_glass_jobs using gin ((weighted_title || text_search_description));

(Note that all this takes several hours to run.) Having done this, we need to recast the query to use the cached columns instead of calling the to_tsvector() functions at query time:

explain analyze
select count(*)
  from "burning_glass_jobs"
 where (weighted_title || text_search_description) @@ (to_tsquery('simple', ''' ' || 'business' || ' ''' || ':*'))
   and not "burning_glass_jobs"."internship"

This yields a query plan that looks very similar to the one immediately above, but runs about four times faster (~2 minutes):

Finalize Aggregate  (cost=1457002.60..1457002.61 rows=1 width=8) (actual time=124083.565..124579.657 rows=1 loops=1)
   ->  Gather  (cost=1457002.18..1457002.59 rows=4 width=8) (actual time=124081.292..124579.633 rows=5 loops=1)
         Workers Planned: 4
         Workers Launched: 4
         ->  Partial Aggregate  (cost=1456002.18..1456002.19 rows=1 width=8) (actual time=124003.803..124003.806 rows=1 loops=5)
               ->  Parallel Bitmap Heap Scan on burning_glass_jobs  (cost=19486.48..1454519.65 rows=593012 width=0) (actual time=388.094..123791.944 rows=492875 loops=5)
                     Recheck Cond: ((weighted_title || text_search_description) @@ '''business'':*'::tsquery)
                     Rows Removed by Index Recheck: 489458
                     Filter: (NOT internship)
                     Rows Removed by Filter: 4668
                     Heap Blocks: exact=39066 lossy=81968
                     ->  Bitmap Index Scan on index_burning_glass_jobs_full_search_cached  (cost=0.00..18893.46 rows=2388449 width=0) (actual time=426.368..426.372 rows=1208534 loops=1)
                           Index Cond: ((weighted_title || text_search_description) @@ '''business'':*'::tsquery)
 Planning time: 0.475 ms
 Execution time: 124588.481 ms
(15 rows)

Obviously, 2 minutes still isn't great, and I think this is about as efficient as this query is going to get: it's using the index to find all matches, and we've minimized the amount of calculation required at execution time. Fundamentally, the issue is this: out of ~6 million rows in the table, a search term like business matches almost 2.5 million -- Postgres has to find all of them (and essentially write them to temp space) before they can be counted. As noted above, adding the caching columns to the table took several hours. In theory, we could keep the cached columns up-to-date for new inserts using triggers on the table, but in practice we bulk-load this table regularly -- trying to update the cache columns after each load seems like it would be onerous and expensive.
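
For completeness, such a trigger might look something like the sketch below (untested; the function and trigger names are made up, and it assumes the weighted_title and text_search_description columns added above):

create or replace function burning_glass_jobs_tsvector_refresh() returns trigger as $$
begin
  -- recompute the cached tsvector columns for the incoming row
  new.weighted_title := setweight(to_tsvector('simple', coalesce(new.title::text, '')), 'A');
  new.text_search_description := to_tsvector('simple', coalesce(new.description::text, ''));
  return new;
end
$$ language plpgsql;

create trigger burning_glass_jobs_tsvector_refresh_trg
before insert or update of title, description on burning_glass_jobs
for each row execute procedure burning_glass_jobs_tsvector_refresh();

With regular bulk loads, though, this fires once per loaded row, which is exactly the kind of per-row overhead we'd rather avoid.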

I have found that if I try a term with fewer matches (e.g. crypto which has about 8k), then just the original GIN index without the caching column is sufficient to return a count in ~100 milliseconds (assuming the simplified form of the query without ts_rank() in the select list is used).
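
For reference, that fast case is just the simplified count query from above with the more selective term swapped in, e.g.:

explain analyze
select count(*)
  from "burning_glass_jobs"
 where (((setweight(to_tsvector('simple', coalesce("burning_glass_jobs"."title"::text, '')), 'A') || to_tsvector('simple', coalesce("burning_glass_jobs"."description"::text, ''))) @@ (to_tsquery('simple', ''' ' || 'crypto' || ' ''' || ':*'))))
   and not burning_glass_jobs.internship;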

Recommendations

  • In all cases, the fundamental issue is the difficulty of count()ing a large number of matches
  • Due to bulk-loading of the table, caching columns for the tsvector values are probably too expensive to be worth the slight benefit we get from them
  • Use the concatenated index suggested in Josh's gist:
        CREATE INDEX CONCURRENTLY index_burning_glass_jobs_full_search
        ON burning_glass_jobs
        USING GIN ((
            setweight(
              to_tsvector('simple', coalesce("burning_glass_jobs"."title"::text, '')
             ), 'A')
             || to_tsvector('simple', coalesce("burning_glass_jobs"."description"::text, '')
           )
        ));
  • Count query should be of this form if possible, to avoid the extra join and ts_rank() calculation:
       select count(*)
         from "burning_glass_jobs"
        where (((setweight(to_tsvector('simple', coalesce("burning_glass_jobs"."title"::text, '')), 'A') || to_tsvector('simple', coalesce("burning_glass_jobs"."description"::text, ''))) @@ (to_tsquery('simple', ''' ' || 'business' || ' ''' || ':*'))))
          and not "burning_glass_jobs"."internship"
  • This should yield acceptable performance for reasonably selective searches, but terms with more than a few thousand matches will probably continue to time out. I don't know how much more is doable there, but I would suggest setting some reasonable timeout interval (a few seconds?) and catching it, so that instead of an exact count you can show that the results are "too many to count" (see the sketch after this list).
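
A rough sketch of the database side of that idea (the 5-second value is arbitrary; if the timeout fires, the error arrives as SQLSTATE 57014, query_canceled, which the application would rescue and render as "too many to count"):

-- Arbitrary example value; SET LOCAL inside a transaction would scope the
-- timeout to just the count statement instead of the whole session.
set statement_timeout = '5s';

select count(*)
  from "burning_glass_jobs"
 where (((setweight(to_tsvector('simple', coalesce("burning_glass_jobs"."title"::text, '')), 'A') || to_tsvector('simple', coalesce("burning_glass_jobs"."description"::text, ''))) @@ (to_tsquery('simple', ''' ' || 'business' || ' ''' || ':*'))))
   and not "burning_glass_jobs"."internship";

reset statement_timeout;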