Skip to content

Instantly share code, notes, and snippets.

@keegancsmith
Last active August 6, 2020 22:21
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save keegancsmith/ecdb45c6d0390911fef1aea1ef1f01a4 to your computer and use it in GitHub Desktop.
Save keegancsmith/ecdb45c6d0390911fef1aea1ef1f01a4 to your computer and use it in GitHub Desktop.
exploring pg_trgm

Postgres trigram exploration

  • State “DONE” from “TODO” [2020-08-07 Fri 00:19]
[2020-08-06 Thu 21:30]

I went a bit off the deep end and compiled postgres to better understand the gin + pg_trgm indexes. I was seeing weird behaviour where =~ ‘github.com/sourcegraph’= was slower than =~ ‘sourcegraph’=. Ended up reading a bunch of source code and developer documentation. Postgres architecture and sourcecode is actually quite nice.

GIN is really nice, it stands for Generalized Inverted Index. Basically it abstracts inverted indexes. The developer provides the following functions:

  • extract the list of values in a document. So for pg_trgm that is a list of trigrams.
  • extract the list of values from a query (of a given type). So for pg_trgm if doing a regex it is a list of trigrams from the regex. You can smuggle extra data, this is used later in consistent.
  • consistent a function which given list of values returns true if it matches the query. So if searching a regex like (A|B) consistent will check that the trigrams match in A or B.
  • A way to compare values. GIN uses a btree on the values. So for pg_trgrm it is just comparison on the trigrams, which are int4’s (ie 4 byte int, only uses first 3 bytes).

See https://www.postgresql.org/docs/12/gin-extensibility.html for the details.

export PSQL_SRC=/Users/keegan/src/git.postgresql.org/git/postgresql

# build
cd $PSQL_SRC
./configure --prefix=$PWD/local
make install
make -C contrib/citext install
make -C contrib/pg_trgm COPT=-DTRGM_REGEXP_DEBUG install

# Start DB
./local/bin/initdb -D $PWD/local/pgsql/data
# modify port to be 5000
vim local/pgsql/data/postgresql.conf
./local/bin/pg_ctl -D $PWD/local/pgsql/data -l logfile start
./local/bin/createdb -p 5000 keegan

# Connect
./local/bin/psql -p 5000

# Insert 434497 repos into table
git clone https://github.com/sourcegraph/ghdump.git
cd ghdump/api_response_dump
find . -type f | xargs cat | jq -r '.items | .[].full_name' | awk '{ print "github.com/" $1 }' | sort | uniq | $PSQL_SRC/local/bin/psql -p 5000 -c 'COPY repo (name) FROM STDIN'
create table repo (id serial primary key, name citext not null);
ALTER TABLE ONLY repo ADD CONSTRAINT repo_name_unique UNIQUE (name) DEFERRABLE;
CREATE INDEX repo_name_trgm ON repo USING gin (lower((name)::text) gin_trgm_ops);
vacuum analyse repo;
keegan=# explain analyse select name from repo where lower(name) ~ 'github\.com/sourcegraph';
                                                        QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on repo  (cost=176.34..335.19 rows=43 width=35) (actual time=8.032..8.186 rows=34 loops=1)
   Recheck Cond: (lower((name)::text) ~ 'github\.com/sourcegraph'::text)
   Heap Blocks: exact=1
   ->  Bitmap Index Scan on repo_name_trgm  (cost=0.00..176.33 rows=43 width=0) (actual time=8.011..8.011 rows=34 loops=1)
         Index Cond: (lower((name)::text) ~ 'github\.com/sourcegraph'::text)
 Planning Time: 4.998 ms
 Execution Time: 8.296 ms
(7 rows)

keegan=# explain analyse select name from repo where lower(name) ~ 'sourcegraph';
                                                        QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on repo  (cost=112.34..271.19 rows=43 width=35) (actual time=1.506..1.655 rows=34 loops=1)
   Recheck Cond: (lower((name)::text) ~ 'sourcegraph'::text)
   Heap Blocks: exact=1
   ->  Bitmap Index Scan on repo_name_trgm  (cost=0.00..112.33 rows=43 width=0) (actual time=1.475..1.475 rows=34 loops=1)
         Index Cond: (lower((name)::text) ~ 'sourcegraph'::text)
 Planning Time: 1.950 ms
 Execution Time: 1.701 ms
(7 rows)

I haven’t gotten to the root of why it is slower to search sourcegraph vs github\.com/sourcegraph. My best guess is when confirming when a regex matches a value it is faster to check the former due to not matching the github.com prefix.

I also investigated not using lower(name) as the index. I couldn’t get the postgres planner to use just name. Tried disabling different planners, ensured I was using case insensitive ~*. No dice. We have had ideas of using postgres as the backend for code search, this makes me a bit less confident in that idea.

Compiling pg_trgrm with regex debugging didn’t help that much. It does produce very cool dot diagrams to understand the NFA it uses though. Very nice.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment