Search on repo listings is slow (2+ seconds) for the most common/obvious queries (e.g., `repo:github.com/sourcegraph` or `repo:github.com/facebook`), while similar queries (like `repo:ithub.com/sourcegraph` and `repo:sourcegraph`) are instant.

I strongly suspect this is due to a bad interaction between trigram indexing and `repo_name` strings. Long explanation below.
Basically, any `repo_name` starting with `github.com` generates trigrams for `github` and `com`. When searching for something like `github.com/sourcegraph`, only the trigrams generated by `sourcegraph` distinguish it from all the other entries (and their trigrams) prefixed with `github.com`. It turns out that searching over entries containing these common `github.com` trigrams (since there are many of them, and they appear frequently in the trigram ordering) makes the indexed search expensive. Because every `repo_name` shares these prefix substrings, trigram search suffers. The bad news is that the more `github.com/...` entries are added, the slower `repo:github.com/...` searches will get.
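To see the overlap concretely, here's a rough Python sketch of `pg_trgm`-style extraction (lowercase, split on non-alphanumerics, pad each word with two leading spaces and one trailing space, slide a 3-char window); the exact rules live in the `pg_trgm` module, so treat this as an approximation:

```python
import re

def trigrams(s):
    """Approximate pg_trgm trigram extraction."""
    out = set()
    for word in re.split(r'[^a-z0-9]+', s.lower()):
        if word:
            padded = '  ' + word + ' '  # pg_trgm-style padding
            out.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return out

# Every github.com/... repo carries the same 'github' and 'com' trigrams;
# only the part after the prefix is distinctive.
shared = trigrams('github.com/sourcegraph') & trigrams('github.com/facebook')
print(sorted(shared))  # the trigrams contributed by 'github.com'
```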
Conceptually, it would be faster to (trigram) search only on the (unique) org and repo names (the part after `github.com/`) and then filter by the prefix (if any). But this needs code and/or DB changes.
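A toy sketch of that idea (the split storage and all names here are hypothetical, not the current schema): keep the host prefix and the org/name part separately, trigram-search only the distinctive part, then filter by prefix:

```python
# Hypothetical split: (host_prefix, org_and_name) instead of one string.
rows = [
    ("github.com/", "sourcegraph/sourcegraph"),
    ("github.com/", "facebook/react"),
    ("gitlab.com/", "sourcegraph/mirror"),
]

def search(prefix, needle):
    # Step 1: match only the org/name part -- the piece a trigram
    # index would cover under this scheme.
    hits = [(p, rest) for p, rest in rows if needle in rest]
    # Step 2: filter by the prefix, if one was given.
    return [p + rest for p, rest in hits if prefix is None or p == prefix]
```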
I did some experimenting, and the most effective (and immediate) solution I came up with is to add a BTREE index to the table. PSQL is smart enough to use the BTREE index for `LIKE 'github.com/...%'` queries (i.e., the most obvious and common ones) and fall back on the `trgm` index for regexes. Based on my experiments, generating the BTREE index on prod-sized data should take under a minute and occupy under 1 GB.
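Why a BTREE handles `LIKE 'github.com/...%'` so well: the index keeps names sorted, so a prefix query becomes a range scan between the prefix and its smallest upper bound. A minimal Python analogue using a sorted list (repo names made up for illustration):

```python
import bisect

# Sorted names stand in for the BTREE index.
names = sorted([
    "github.com/facebook/react",
    "github.com/oplqs/aaa",
    "github.com/oplqs/zzz",
    "github.com/sourcegraph/sourcegraph",
])

def prefix_scan(names, prefix):
    lo = bisect.bisect_left(names, prefix)
    # Smallest string greater than every string with this prefix
    # (bumping the last character is fine for ASCII repo names).
    upper = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    hi = bisect.bisect_left(names, upper)
    return names[lo:hi]  # O(log n) to locate, then only the matches
```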
I tested a variety of queries on sourcegraph.com, which behaved oddly depending on the query. After looking at the repo listing code and the PSQL conversions, I became convinced the issue is in Postgres, not in the Go code. I started looking into the trigram index, and grew suspicious about how the behavior changed as the queries changed. The table below summarizes, for each query: the PSQL conversion, the trigrams generated (using `SELECT show_trgm('github.com/sourcegraph');`), the time it took, and a description. Although the queries are not equivalent, each returns the same results in this case (just to rule out result-set size as a factor).
Query | PSQL Conversion | TRGM | Prod (s) | Description |
---|---|---|---|---|
`repo:^github\.com/sourcegraph` | `lower(name) LIKE 'github.com/sourcegraph%'` | `{"  c","  g","  s"," co"," gi"," so",aph,ceg,com,egr,git,gra,hub,ith,"om ",our,"ph ",rap,rce,sou,thu,"ub ",urc}` | 2.30 | What you get when you open in Sourcegraph from GH and erase the repo name: `repo:^github\.com/sourcegraph/sourcegraph$` |
`repo:github.com/sourcegraph` | `lower(name) LIKE 'github.com/sourcegraph%'` | Same as above | 2.54 | What you get when you type `repo:github.com/sourcegraph` on the landing page |
`repo:ithub.com/sourcegraph` | `lower(name) ~* 'ithub.com/sourcegraph'` | `{"  c","  i","  s"," co"," it"," so",aph,ceg,com,egr,gra,hub,ith,"om ",our,"ph ",rap,rce,sou,thu,"ub ",urc}` | 0.30 | Remove the 'g' prefix from the previous query |
`repo:ithub.com/sourcegrap` | `lower(name) ~* 'ithub.com/sourcegrap'` | `{"  c","  i","  s"," co"," it"," so","ap ",ceg,com,egr,gra,hub,ith,"om ",our,rap,rce,sou,thu,"ub ",urc}` | 1.62 | Remove the 'h' suffix from the previous query |
`repo:github..om/sourcegraph` | `lower(name) ~* 'github..om/sourcegraph'` | Same as Row 2 but excludes `"  c"` | 0.22 | Replace the 'c' (which generates the first trigram) with `.` in the Row 2 query |
`repo:sourcegraph` | `lower(name) ~* 'sourcegraph'` | `{"  s"," so",aph,ceg,egr,gra,our,"ph ",rap,rce,sou,urc}` | 0.10 | Remove the `github.com` prefix from the Row 2 query |
The first two queries represent really common and obvious things a user might type (this is how I ran into the issue). They are surprisingly slow (> 2s) compared to what's possible if we tweak the query just a little. For example, the 3rd row removes just the "g" in "github". Woah, 0.3s! What changed? A big part of it is that a number of trigrams disappeared compared to the first query (especially " gi" and "git"). I'm guessing that the removal of these trigrams short-circuits a lot of trigram index comparisons. Interestingly, taking the same query and removing the "h" at the end of "sourcegraph" slows things down again (although not as much). My guess here is that the trigram "aph" helped short-circuit search in the previous fast query, and removing it causes a lot of other comparisons to fire again. Just a guess, and the underlying reason is not so important; the point is that even with a trick like removing the "g", the trigram search is still sensitive to the other "ithub" and "com" parts.
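My mental model of what the GIN trigram index is doing, as a toy Python sketch (the real planner's cost model is more subtle, and the row names here are invented): each trigram maps to a posting list of row ids, and a query intersects the posting lists of its trigrams. A rare trigram shrinks the candidate set immediately; trigrams that occur in every row contribute nothing but work:

```python
import re

def trigrams(s):
    """Approximate pg_trgm trigram extraction."""
    out = set()
    for word in re.split(r'[^a-z0-9]+', s.lower()):
        if word:
            padded = '  ' + word + ' '
            out.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return out

# Toy table: every row shares the github.com prefix.
rows = ['github.com/%s/repo%d' % (org, i)
        for i, org in enumerate(['alpha', 'beta', 'gamma', 'sourcegraph'])]

# Inverted index: trigram -> set of row ids (posting list).
index = {}
for rid, name in enumerate(rows):
    for t in trigrams(name):
        index.setdefault(t, set()).add(rid)

def candidates(pattern):
    postings = [index.get(t, set()) for t in trigrams(pattern)]
    return set.intersection(*postings)
```

Here `candidates('github.com')` is every row (those trigrams are useless for narrowing), while adding the `sourcegraph` trigrams collapses the set to one row.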
In row 5 I change the "c" in "com" to match any character in the regex. The result is bizarre: much faster search, even though "any char" is strictly more general than "c". I'm not entirely sure how PSQL converts regexes to trigrams, but my guess is that it simply stops producing the " c" and other "com" trigrams for the wildcard, in which case this isn't too big a surprise: it is probably avoiding a bunch of comparisons again.
As a final example, just "sourcegraph" returns instantaneous results, and hopefully it's clear by now why.
I tested locally to try to come up with a solution. It was really hard to observe the slowdown on small databases, so I had to build a big one that simulated prod. Once I did, I saw the same performance issues as above. I experimented with a BTREE index, since that's one of the options PSQL gives and it should, in theory, be smart about common string prefixes.
My understanding is that there are about 3.5 million repos indexed in prod. In my local testing I used a DB with about 20 million rows. I took one row from the `repo` table in my local dev environment and used it to generate a large DB with randomized org and repo names (all prefixed with `github.com`). Summary:

- 14 GB file
- 19.2M rows
- Randomly generated 1 to 400 repos per org (~100K orgs)
- Repo and org names are random strings, 5 to 11 chars long
- Picked a random org `oplqs` and ran the queries that search would generate (full query below, but basically `lower(name) LIKE 'github.com/oplqs%'`)
- Selectively disabled indexes to test against, e.g., `update pg_index set indisvalid = false where indexrelid = 'repo_name_trgm'::regclass`
- Confirmed which index is being used with `EXPLAIN ANALYZE` on the queries
Results:
Mode | Time (s) | Rows |
---|---|---|
TRGM IDX Scan (Prod simulation) | 19.1 | 183 |
No IDX (Seq Scan) | 34.3 | 183 |
BTREE IDX Scan | 0.33 | 183 |
BTREE is much, much faster. Note that a complete sequential scan is not that much slower than the trigram scan for this DB.
Relative index sizes:
Index | Index Size | Table Size |
---|---|---|
`repo_name_trgm` | 1675 MB | 14 GB |
`repo_name_idx` | 988 MB | 14 GB |
It takes 54 seconds to generate `repo_name_idx` (it should be less for prod, I suspect, because prod has fewer rows). So I propose we add a BTREE index to solve this. It's computationally inexpensive, not large (comparable to `repo_name_trgm`), and requires no table schema changes.
One thing to note is that I used the "C" locale for `LC_COLLATE`, as indexing with `pg_catalog.default` (`en_US.UTF-8`) causes PSQL to ignore the BTREE index. I didn't look much further into this once some searching revealed that the "C" locale still works fine for UTF-8 here. Here's the statement I used to create the BTREE index on the `repo` table:
```sql
CREATE INDEX repo_name_idx
    ON public.repo USING btree
    (lower(name::text) COLLATE pg_catalog."C")
    TABLESPACE pg_default;
```
PSQL is smart enough to automatically prefer this index for `github.com/…%` queries, so nothing else needs to be done. I checked that it falls back to the `trgm` index when it thinks that is better. The generated search query I tested against:
```sql
SELECT id, name, description, language, enabled,
       created_at, updated_at, external_id,
       external_service_type, external_service_id
FROM repo
WHERE deleted_at IS NULL
  AND lower(name) LIKE 'github.com/oplqs%'
  AND enabled
ORDER BY id ASC LIMIT 1073741824 OFFSET 0
```
The script used to generate the test data (it reads one exported `repo` row from `in-table.csv` as a template and writes randomized rows to stdout):

```python
import base64
import random
import string
import sys

ROW = 5
ORGS = 100000
MAX_REPOS = 400
MAX_ORG_LEN = 11
MAX_REPO_LEN = 11

# One exported repo row serves as the template for every generated row.
with open("in-table.csv") as fo:
    line = fo.readlines()[0]

def random_string(length=10):
    """Generate a random lowercase string of the given length."""
    return ''.join(random.choice(string.ascii_lowercase) for _ in range(length))

for _ in range(ORGS):
    l2 = line.split(",")
    org = random_string(random.randint(5, MAX_ORG_LEN))
    num_repos = random.randint(1, MAX_REPOS)
    for _ in range(num_repos):
        l2[0] = str(ROW)
        # external_id column: base64 of "010:Repository<row id>".
        encoding = base64.b64encode("010:Repository{}".format(ROW).encode()).decode()
        ROW += 1
        name = random_string(random.randint(5, MAX_REPO_LEN))
        l2[1] = "github.com/{}/{}".format(org, name)
        l2[7] = encoding
        sys.stdout.write(','.join(l2))
```
The query used to measure index sizes and usage:

```sql
SELECT
    t.tablename,
    indexname,
    c.reltuples AS num_rows,
    pg_size_pretty(pg_relation_size(quote_ident(t.tablename)::text)) AS table_size,
    pg_size_pretty(pg_relation_size(quote_ident(indexrelname)::text)) AS index_size,
    CASE WHEN indisunique THEN 'Y'
         ELSE 'N'
    END AS UNIQUE,
    idx_scan AS number_of_scans,
    idx_tup_read AS tuples_read,
    idx_tup_fetch AS tuples_fetched
FROM pg_tables t
LEFT OUTER JOIN pg_class c ON t.tablename = c.relname
LEFT OUTER JOIN
    ( SELECT c.relname AS ctablename, ipg.relname AS indexname,
             x.indnatts AS number_of_columns, idx_scan, idx_tup_read,
             idx_tup_fetch, indexrelname, indisunique
      FROM pg_index x
      JOIN pg_class c ON c.oid = x.indrelid
      JOIN pg_class ipg ON ipg.oid = x.indexrelid
      JOIN pg_stat_all_indexes psai
        ON x.indexrelid = psai.indexrelid AND psai.schemaname = 'public' )
    AS foo
  ON t.tablename = foo.ctablename
WHERE t.schemaname = 'public'
ORDER BY 1, 2;