Skip to content

Instantly share code, notes, and snippets.

@jimfulton
Last active August 16, 2017 17:47
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save jimfulton/4104b5387cf7dce3754bef342a9f84c5 to your computer and use it in GitHub Desktop.
Save jimfulton/4104b5387cf7dce3754bef342a9f84c5 to your computer and use it in GitHub Desktop.
Comparison of GIN and RUM indexes for ranked searches of a database with ~220K documents and searches returning ~50K documents
=> create index pgtextindex_text_vector_idx on pgtextindex using gin (text_vector);
CREATE INDEX
=> analyze pgtextindex;
ANALYZE
=> \d pgtextindex
Table "public.pgtextindex"
Column | Type | Modifiers
-------------------+-----------------------------+----------------------
docid | integer | not null
community_docid | character varying(100) |
content_type | character varying(30) |
creation_date | timestamp without time zone |
modification_date | timestamp without time zone |
coefficient | real | not null default 1.0
marker | character varying[] |
text_vector | tsvector |
Indexes:
"pgtextindex_pkey" PRIMARY KEY, btree (docid)
"pgtextindex_community_docid_index" btree (community_docid, content_type, creation_date)
"pgtextindex_text_vector_idx" gin (text_vector)
SELECT docid,
coefficient * (
ts_rank_cd('{.1, 0.2, 0.4, 1.0}',
text_vector,
to_tsquery('english', :q))
) AS rank,
count(*) over()
FROM pgtextindex
WHERE text_vector @@ to_tsquery('english', :q)
ORDER BY rank DESC
limit 9;
Limit (cost=14260.66..14260.68 rows=9 width=16) (actual time=420.212..420.215 rows=9 loops=1)
-> Sort (cost=14260.66..14339.40 rows=31496 width=16) (actual time=420.210..420.210 rows=9 loops=1)
Sort Key: ((coefficient * ts_rank_cd('{0.1,0.2,0.4,1}'::real[], text_vector, '''euro'''::tsquery))) DESC
Sort Method: top-N heapsort Memory: 25kB
-> WindowAgg (cost=324.10..13603.98 rows=31496 width=16) (actual time=59.031..414.574 rows=33988 loops=1)
-> Bitmap Heap Scan on pgtextindex (cost=324.10..13052.80 rows=31496 width=252) (actual time=13.476..40.841 rows=33988 loops=1)
Recheck Cond: (text_vector @@ '''euro'':*'::tsquery)
Heap Blocks: exact=9679
-> Bitmap Index Scan on pgtextindex_text_vector_idx (cost=0.00..316.22 rows=31496 width=0) (actual time=11.823..11.823 rows=33988 loops=1)
Index Cond: (text_vector @@ '''euro'':*'::tsquery)
Planning time: 0.424 ms
Execution time: 420.661 ms
=> create index pgtextindex_rum_text_vector_idx on pgtextindex using rum (text_vector rum_tsvector_ops);
CREATE INDEX
=> analyze pgtextindex;
ANALYZE
=> \d pgtextindex;
Table "public.pgtextindex"
Column | Type | Modifiers
-------------------+-----------------------------+----------------------
docid | integer | not null
community_docid | character varying(100) |
content_type | character varying(30) |
creation_date | timestamp without time zone |
modification_date | timestamp without time zone |
coefficient | real | not null default 1.0
marker | character varying[] |
text_vector | tsvector |
Indexes:
"pgtextindex_pkey" PRIMARY KEY, btree (docid)
"pgtextindex_community_docid_index" btree (community_docid, content_type, creation_date)
"pgtextindex_rum_text_vector_idx" rum (text_vector)
SELECT docid, (text_vector <=> to_tsquery('english', :q)) / coefficient as rank, count(*) over()
FROM pgtextindex
WHERE text_vector @@ to_tsquery('english', :q)
ORDER BY rank
limit 9;
Limit (cost=14897.28..14897.31 rows=9 width=16) (actual time=470.662..470.665 rows=9 loops=1)
-> Sort (cost=14897.28..15002.50 rows=42087 width=16) (actual time=470.652..470.654 rows=9 loops=1)
Sort Key: (((text_vector <=> '''euro'':*'::tsquery) / coefficient))
Sort Method: top-N heapsort Memory: 25kB
-> WindowAgg (cost=422.17..14019.78 rows=42087 width=16) (actual time=92.669..464.930 rows=33988 loops=1)
-> Bitmap Heap Scan on pgtextindex (cost=422.17..13283.26 rows=42087 width=260) (actual time=37.685..72.324 rows=33988 loops=1)
Recheck Cond: (text_vector @@ '''euro'':*'::tsquery)
Heap Blocks: exact=9679
-> Bitmap Index Scan on pgtextindex_rum_text_vector_idx (cost=0.00..411.65 rows=42087 width=0) (actual time=35.995..35.995 rows=33988 loops=1)
Index Cond: (text_vector @@ '''euro'':*'::tsquery)
Planning time: 6.881 ms
Execution time: 471.571 ms
Query count GIN RUM
euro:* 33988 420ms 430ms
afric:* 40169 410ms 430ms
open:* 56497 580ms 550ms
europe 20452 280ms 270ms
africa 33086 360ms 340ms
open 56120 560ms 530ms
Times are best of many (>30). Basically, for each test, I ran until I didn't see improvements.
I also ran explain analyze for each query before doing timing runs.
@jimfulton
Copy link
Author

jimfulton commented Feb 28, 2017

Some notes on the data:

220K documents. (31% 9 words or less)

Unique words:
P30 9
Median 57
P90 656
p99 1837
p99.9 2928
max 6398

~130K null records (so ~350 records total)

Run on Single-CPU VM w 4G of RAM

@jimfulton
Copy link
Author

I was wondering if prefix search was distorting results, so I added some non-prefix variants. For these RUM seem to a a teeny bit faster, but that could also be noise, as these numbers can vary quite a bit.

@pholly
Copy link

pholly commented Aug 16, 2017

The count(*) over is why not just the RUM indexes are hit. See postgrespro/rum#23 and postgrespro/rum#25

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment