Search on repo listings is slow (2+ seconds) for the most common/obvious queries (e.g., `repo:github.com/sourcegraph` or `repo:github.com/facebook`), while similar queries (like `repo:ithub.com/sourcegraph` and `repo:sourcegraph`) are instant.

I strongly suspect this is due to a bad interaction between trigram indexing and `repo_name` strings. Long explanation below.
Basically, any `repo_name` starting with `github.com` generates trigrams for `github` and `com`. When searching for something like `github.com/sourcegraph`, only the trigrams generated by `sourcegraph` distinguish it from all the other entries (and their trigrams) prefixed with `github.com`. It turns out that searching over entries containing these common `github.com` trigrams (since there are many of them, and they appear frequently in the trigram ordering) makes the indexed search expensive. Because every `repo_name` shares these prefix substrings, trigram search suffers. The bad news is that the more `github.com/...` entries are added, the slower `repo:github.com/...` searches will get.
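To see the overlap concretely, here's a rough Python sketch of `pg_trgm`-style extraction (lowercase, split on non-alphanumerics, pad each word with two leading spaces and one trailing space, slide a 3-char window); the exact rules live in the `pg_trgm` module, so treat this as an approximation:

```python
import re

def trigrams(s):
    """Approximate pg_trgm trigram extraction."""
    out = set()
    for word in re.split(r'[^a-z0-9]+', s.lower()):
        if word:
            padded = '  ' + word + ' '  # pg_trgm-style padding
            out.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return out

# Every github.com/... repo carries the same 'github' and 'com' trigrams;
# only the part after the prefix is distinctive.
shared = trigrams('github.com/sourcegraph') & trigrams('github.com/facebook')
print(sorted(shared))  # the trigrams contributed by 'github.com'
```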
Conceptually, it would be faster to (trigram) search only on the (unique) org and repo names (the part after `github.com/`) and then filter by the prefix (if any). But this needs code and/or DB changes.
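A toy sketch of that idea (the split storage and all names here are hypothetical, not the current schema): keep the host prefix and the org/name part separately, trigram-search only the distinctive part, then filter by prefix:

```python
# Hypothetical split: (host_prefix, org_and_name) instead of one string.
rows = [
    ("github.com/", "sourcegraph/sourcegraph"),
    ("github.com/", "facebook/react"),
    ("gitlab.com/", "sourcegraph/mirror"),
]

def search(prefix, needle):
    # Step 1: match only the org/name part -- the piece a trigram
    # index would cover under this scheme.
    hits = [(p, rest) for p, rest in rows if needle in rest]
    # Step 2: filter by the prefix, if one was given.
    return [p + rest for p, rest in hits if prefix is None or p == prefix]
```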
I did some experimenting, and the most effective (and immediate) solution I came up with is to add a BTREE index to the table. PSQL is smart enough to use the BTREE index for `LIKE 'github.com/...%'` queries (i.e., the most obvious and common ones) and fall back on the `trgm` index for regexes. Based on my experiments, generating the BTREE index on prod-sized data should take under a minute and occupy under 1 GB.
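Why a BTREE handles `LIKE 'github.com/...%'` so well: the index keeps names sorted, so a prefix query becomes a range scan between the prefix and its smallest upper bound. A minimal Python analogue using a sorted list (repo names made up for illustration):

```python
import bisect

# Sorted names stand in for the BTREE index.
names = sorted([
    "github.com/facebook/react",
    "github.com/oplqs/aaa",
    "github.com/oplqs/zzz",
    "github.com/sourcegraph/sourcegraph",
])

def prefix_scan(names, prefix):
    lo = bisect.bisect_left(names, prefix)
    # Smallest string greater than every string with this prefix
    # (bumping the last character is fine for ASCII repo names).
    upper = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    hi = bisect.bisect_left(names, upper)
    return names[lo:hi]  # O(log n) to locate, then only the matches
```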
I tested a variety of queries on sourcegraph.com, which behaved oddly depending on the query. After looking at the repo listing code and the PSQL conversions, I became convinced the issue is in Postgres, not in the Go code. I started looking into the trigram index, and grew suspicious about how the behavior changed as the queries changed. The table below summarizes, for each query: the PSQL conversion, the trigrams generated (using `SELECT show_trgm('github.com/sourcegraph');`), the time it took, and a description. Although the queries are not equivalent, each returns the same results in this case (just to rule out result-set size as a factor).
Query | PSQL Conversion | TRGM | Prod (s) | Description |
---|---|---|---|---|
`repo:^github\.com/sourcegraph` | `lower(name) LIKE 'github.com/sourcegraph%'` | `{"  c","  g","  s"," co"," gi"," so",aph,ceg,com,egr,git,gra,hub,ith,"om ",our,"ph ",rap,rce,sou,thu,"ub ",urc}` | 2.30 | What you get when you open in Sourcegraph from GH and erase the repo name: `repo:^github\.com/sourcegraph/sourcegraph$` |
`repo:github.com/sourcegraph` | `lower(name) LIKE 'github.com/sourcegraph%'` | Same as above | 2.54 | What you get when you type `repo:github.com/sourcegraph` on the landing page |
`repo:ithub.com/sourcegraph` | `lower(name) ~* 'ithub.com/sourcegraph'` | `{"  c","  i","  s"," co"," it"," so",aph,ceg,com,egr,gra,hub,ith,"om ",our,"ph ",rap,rce,sou,thu,"ub ",urc}` | 0.30 | Remove the 'g' prefix from the previous query |
`repo:ithub.com/sourcegrap` | `lower(name) ~* 'ithub.com/sourcegrap'` | `{"  c","  i","  s"," co"," it"," so","ap ",ceg,com,egr,gra,hub,ith,"om ",our,rap,rce,sou,thu,"ub ",urc}` | 1.62 | Remove the 'h' suffix from the previous query |
`repo:github..om/sourcegraph` | `lower(name) ~* 'github..om/sourcegraph'` | Same as Row 2 but excludes `"  c"` | 0.22 | Replace the 'c' (which generates the first trigram) with `.` in the Row 2 query |
`repo:sourcegraph` | `lower(name) ~* 'sourcegraph'` | `{"  s"," so",aph,ceg,egr,gra,our,"ph ",rap,rce,sou,urc}` | 0.10 | Remove the `github.com` prefix from the Row 2 query |
The first two queries represent really common and obvious things a user might type (this is how I ran into the issue). They are surprisingly slow (> 2s) compared to what's possible if we tweak the query just a little. For example, the 3rd row removes just the "g" in "github". Woah, 0.3s! What changed? A big part of it is that a number of trigrams disappeared compared to the first query (especially " gi" and "git"). I'm guessing that the removal of these trigrams short-circuits a lot of trigram index comparisons. Interestingly, taking the same query and removing the "h" at the end of "sourcegraph" slows things down again (although not as much). My guess here is that the trigram "aph" helped short-circuit search in the previous fast query, and removing it causes a lot of other comparisons to fire again. Just a guess, and the underlying reason is not so important; the point is that even with a trick like removing the "g", the trigram search is still sensitive to the other "ithub" and "com" parts.
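My mental model of what the GIN trigram index is doing, as a toy Python sketch (the real planner's cost model is more subtle, and the row names here are invented): each trigram maps to a posting list of row ids, and a query intersects the posting lists of its trigrams. A rare trigram shrinks the candidate set immediately; trigrams that occur in every row contribute nothing but work:

```python
import re

def trigrams(s):
    """Approximate pg_trgm trigram extraction."""
    out = set()
    for word in re.split(r'[^a-z0-9]+', s.lower()):
        if word:
            padded = '  ' + word + ' '
            out.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return out

# Toy table: every row shares the github.com prefix.
rows = ['github.com/%s/repo%d' % (org, i)
        for i, org in enumerate(['alpha', 'beta', 'gamma', 'sourcegraph'])]

# Inverted index: trigram -> set of row ids (posting list).
index = {}
for rid, name in enumerate(rows):
    for t in trigrams(name):
        index.setdefault(t, set()).add(rid)

def candidates(pattern):
    postings = [index.get(t, set()) for t in trigrams(pattern)]
    return set.intersection(*postings)
```

Here `candidates('github.com')` is every row (those trigrams are useless for narrowing), while adding the `sourcegraph` trigrams collapses the set to one row.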
In row 5 I change the "c" in "com" to match any character in the regex. The result is bizarre: much faster search, even though "any char" is strictly more general than "c". I'm not entirely sure how PSQL converts regexes to trigrams, but my guess is that it simply stops producing the " c" and other "com" trigrams for the wildcard, in which case this isn't too big a surprise: it is probably avoiding a bunch of comparisons again.
As a final example, just "sourcegraph" returns instantaneous results, and hopefully it's clear by now why.
I tested locally to try to come up with a solution. It was really hard to observe the slowdown on small databases, so I had to build a big one that simulated prod. Once I did, I saw the same performance issues as above. I experimented with a BTREE index, since that's one of the options PSQL gives and it should, in theory, be smart about common string prefixes.
My understanding is that there are about 3.5 million repos indexed in prod. In my local testing I used a DB with about 20 million rows. I took one row from the `repo` table in my local dev environment and used it to generate a large DB with randomized org and repo names (all prefixed with `github.com`). Summary:

- 14 GB file
- 19.2M rows
- Randomly generated 1 to 400 repos per org (~100K orgs)
- Repo and org names are random strings, 5 to 11 chars long
- Picked a random org `oplqs` and ran the queries that search would generate (full query below, but basically `lower(name) LIKE 'github.com/oplqs%'`)
- Selectively disabled indexes to test against, e.g., `update pg_index set indisvalid = false where indexrelid = 'repo_name_trgm'::regclass`
- Confirmed which index is being used with `EXPLAIN ANALYZE` on the queries
Results:
Mode | Time (s) | Rows |
---|---|---|
TRGM IDX Scan (Prod simulation) | 19.1 | 183 |
No IDX (Seq Scan) | 34.3 | 183 |
BTREE IDX Scan | 0.33 | 183 |
BTREE is much, much faster. Note that a complete sequential scan is not that much slower than the trigram scan for this DB.
Relative index sizes:
Index | Index Size | Table Size |
---|---|---|
`repo_name_trgm` | 1675 MB | 14 GB |
`repo_name_idx` | 988 MB | 14 GB |
It takes 54 seconds to generate `repo_name_idx` (it should be less for prod, I suspect, because prod has fewer rows). So I propose we add a BTREE index to solve this. It's computationally inexpensive, not large (comparable to `repo_name_trgm`), and requires no table schema changes.
One thing to note is that I used the "C" locale for `LC_COLLATE`, as indexing with `pg_catalog.default` (`en_US.UTF-8`) causes PSQL to ignore the BTREE index. I didn't look much further into this once some searching revealed that the "C" locale still works fine for UTF-8 here. Here's the statement I used to create the BTREE index on the `repo` table:
```sql
CREATE INDEX repo_name_idx
    ON public.repo USING btree
    (lower(name::text) COLLATE pg_catalog."C")
    TABLESPACE pg_default;
```
PSQL is smart enough to automatically prefer this index for `github.com/…%` queries, so nothing else needs to be done. I checked that it falls back to the `trgm` index when it thinks that is better. The generated search query I tested against:
```sql
SELECT id, name, description, language, enabled,
       created_at, updated_at, external_id,
       external_service_type, external_service_id
FROM repo
WHERE deleted_at IS NULL
  AND lower(name) LIKE 'github.com/oplqs%'
  AND enabled
ORDER BY id ASC LIMIT 1073741824 OFFSET 0
```
The script used to generate the test data (it reads one exported `repo` row from `in-table.csv` as a template and writes randomized rows to stdout):

```python
import base64
import random
import string
import sys

ROW = 5
ORGS = 100000
MAX_REPOS = 400
MAX_ORG_LEN = 11
MAX_REPO_LEN = 11

# One exported repo row serves as the template for every generated row.
with open("in-table.csv") as fo:
    line = fo.readlines()[0]

def random_string(length=10):
    """Generate a random lowercase string of the given length."""
    return ''.join(random.choice(string.ascii_lowercase) for _ in range(length))

for _ in range(ORGS):
    l2 = line.split(",")
    org = random_string(random.randint(5, MAX_ORG_LEN))
    num_repos = random.randint(1, MAX_REPOS)
    for _ in range(num_repos):
        l2[0] = str(ROW)
        # external_id column: base64 of "010:Repository<row id>".
        encoding = base64.b64encode("010:Repository{}".format(ROW).encode()).decode()
        ROW += 1
        name = random_string(random.randint(5, MAX_REPO_LEN))
        l2[1] = "github.com/{}/{}".format(org, name)
        l2[7] = encoding
        sys.stdout.write(','.join(l2))
```
The query used to measure index sizes and usage:

```sql
SELECT
    t.tablename,
    indexname,
    c.reltuples AS num_rows,
    pg_size_pretty(pg_relation_size(quote_ident(t.tablename)::text)) AS table_size,
    pg_size_pretty(pg_relation_size(quote_ident(indexrelname)::text)) AS index_size,
    CASE WHEN indisunique THEN 'Y'
         ELSE 'N'
    END AS UNIQUE,
    idx_scan AS number_of_scans,
    idx_tup_read AS tuples_read,
    idx_tup_fetch AS tuples_fetched
FROM pg_tables t
LEFT OUTER JOIN pg_class c ON t.tablename = c.relname
LEFT OUTER JOIN
    ( SELECT c.relname AS ctablename, ipg.relname AS indexname,
             x.indnatts AS number_of_columns, idx_scan, idx_tup_read,
             idx_tup_fetch, indexrelname, indisunique
      FROM pg_index x
      JOIN pg_class c ON c.oid = x.indrelid
      JOIN pg_class ipg ON ipg.oid = x.indexrelid
      JOIN pg_stat_all_indexes psai
        ON x.indexrelid = psai.indexrelid AND psai.schemaname = 'public' )
    AS foo
  ON t.tablename = foo.ctablename
WHERE t.schemaname = 'public'
ORDER BY 1, 2;