@rvantonder
Last active July 28, 2020 00:12
Slow repo search diagnosis and solution

Search on repo listings is slow (2+ seconds) for the most common and obvious queries (e.g., repo:github.com/sourcegraph or repo:github.com/facebook), while similar queries (like repo:ithub.com/sourcegraph and repo:sourcegraph) are near-instant.

I strongly suspect this is due to a bad interaction between trigram indexing and repo_name strings. Long explanation below.

TLDR Problem

Basically, any repo_name starting with github.com generates the trigram sets for github and com. When searching for something like github.com/sourcegraph, only the trigrams generated by sourcegraph distinguish it from all other entries (and their trigrams) prefixed with github.com. It turns out that searching over entries containing these common github.com trigrams is what makes the indexed search expensive: there are many such entries, and these trigrams appear frequently in the index. Because every repo_name shares these prefix substrings, trigram search degrades. The bad news is that the more github.com/... entries are added, the slower repo:github.com/... searches will get.
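You can see the overlap concretely with pg_trgm's show_trgm (the second repo name here is just an illustrative example):

```sql
-- pg_trgm splits on non-alphanumerics, so every repo under github.com
-- contributes the trigrams of "github" and "com":
SELECT show_trgm('github.com/sourcegraph/sourcegraph');
SELECT show_trgm('github.com/facebook/react');
-- Both sets contain " g"," gi",git,ith,thu,hub,"ub " and " c"," co",com,"om ";
-- only the trigrams from the org/repo part distinguish the two rows.
```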

TLDR Proposed Solution

Conceptually, it would be faster to (trigram) search only the (unique) repo orgs and names (the part after github.com/) and then filter by the prefix (if any). But this needs code and/or DB changes.
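A rough sketch of that idea, purely hypothetical (the path column and both index names below are made up, not part of the current schema):

```sql
-- Store the part after the host separately and trigram-index only that.
ALTER TABLE repo ADD COLUMN path text;  -- e.g. 'sourcegraph/sourcegraph'
CREATE INDEX repo_path_trgm ON repo USING gin (lower(path) gin_trgm_ops);

-- A search like repo:github.com/sourcegraph would then roughly become:
SELECT id, name
FROM repo
WHERE lower(path) LIKE 'sourcegraph/%'   -- selective match, no shared prefix
  AND lower(name) LIKE 'github.com/%';   -- cheap prefix filter
```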

I did some experimenting, and the most effective (and immediate) solution I came up with is to add a BTREE index to the table. PSQL is smart enough to use the BTREE index for LIKE 'github.com/...%' queries (i.e., the most obvious and common ones) and fall back on the trgm index for regex.

Based on my experiments, generating the BTREE index for prod-sized data should take under a minute, and the index should be under 1GB in size.

Problem and Experiments

I tested a variety of queries on sourcegraph.com, which behaved weirdly depending on the query. After looking at the repo listing code and the PSQL conversions, I eventually became convinced that the issue was a postgres thing and not something in the Go code.

I started looking into the trigram index, and became suspicious about how the behavior changed with the query. The summary table below shows, for each Query: its PSQL conversion, the trigrams generated (using SELECT show_trgm('github.com/sourcegraph');), the time it took in prod, and a description.

Although the queries are not equivalent, each returns the same results in this case (this rules out result-set size as a factor here).

| Query | PSQL Conversion | TRGM | Prod (s) | Description |
| --- | --- | --- | --- | --- |
| repo:^github\.com/sourcegraph | lower(name) LIKE github.com/sourcegraph% | {" c"," g"," s"," co"," gi"," so",aph,ceg,com,egr,git,gra,hub,ith,"om ",our,"ph ",rap,rce,sou,thu,"ub ",urc} | 2.30 | What you get when you open in Sourcegraph from GH and erase the repo name from "repo:^github\.com/sourcegraph/sourcegraph$" |
| repo:github.com/sourcegraph | lower(name) LIKE github.com/sourcegraph% | same as above | 2.54 | What you get when you type "repo:github.com/sourcegraph" on the landing page |
| repo:ithub.com/sourcegraph | lower(name) ~* ithub.com/sourcegraph | {" c"," i"," s"," co"," it"," so",aph,ceg,com,egr,gra,hub,ith,"om ",our,"ph ",rap,rce,sou,thu,"ub ",urc} | 0.30 | Remove 'g' prefix from previous |
| repo:ithub.com/sourcegrap | lower(name) ~* ithub.com/sourcegrap | {" c"," i"," s"," co"," it"," so","ap ",ceg,com,egr,gra,hub,ith,"om ",our,rap,rce,sou,thu,"ub ",urc} | 1.62 | Remove 'h' suffix from previous |
| repo:github..om/sourcegraph | lower(name) ~* github..om/sourcegraph | …excludes " c" | 0.22 | Replace the 'c' (which generates the first trigram) with . in the Row 2 query |
| repo:sourcegraph | lower(name) ~* sourcegraph | {" s"," so",aph,ceg,egr,gra,our,"ph ",rap,rce,sou,urc} | 0.10 | Remove 'github.com' prefix from Row 2 query |

The first two queries represent really common and obvious things a user might type (this is how I ran into the issue). They are surprisingly slow (> 2s) compared to what's possible if we tweak the query just a little. For example, the 3rd row removes just the "g" in github. Woah, 0.3s! What changed? A big part of it is that a number of trigrams disappeared relative to the first query (especially " gi" and "git"). I'm guessing that removing these trigrams short-circuits a lot of trigram index comparisons. Interestingly, taking the same query and removing the 'h' suffix from "sourcegraph" slows things down again (although not as much). My guess here is that the trigram "aph" helped short-circuit search in the previous fast query, and removing it causes a lot of other comparisons to fire again. Just a guess, and the underlying reason is not so important; the point is that even with a trick like removing the "g", the trigram search is still sensitive to the remaining "ithub" and "com" parts.

In row 5 I change the "c" in "com" to match any char in the regex. The result is bizarre: much faster search results, even though "any char" is much more general than "c". I'm not entirely sure how PSQL converts regex to trigrams, but my guess is that it handles "any char" specially and, critically, no longer produces the " c" and other 'com' trigrams. In that case this isn't too big a surprise, since it is probably avoiding a bunch of comparisons again.

As a final example, just "sourcegraph" returns instantaneous results, and hopefully it's clear by now why.

Solution

I tested locally to try to come up with a solution. It was really hard to observe the slowdown on small databases, so I had to make a big one that simulated prod. Once I did, though, I saw the same performance issues as above. I experimented with a BTREE index, since that's one of the options PSQL gives me and it should, in theory, be smart about common string prefixes.

My understanding is that there are about 3.5 million repos indexed in prod. For my local testing I used a DB with about 20 million rows: I took one row from the repo table in my local dev environment and used it to generate a large DB with randomized org and repo names (all prefixed with github.com). Summary:

  • 14GB file

  • 19.2M rows

  • Randomly generated 1 to 400 repos per org (~100K orgs)

  • Repo and org names are random strings, 5 to 11 chars long

  • Picked a random org, oplqs, and ran the queries that search would generate

  • Full query below, but basically: lower(name) LIKE github.com/oplqs%

  • Selectively disabled indexes to test against, e.g.: update pg_index set indisvalid = false where indexrelid = 'repo_name_trgm'::regclass

  • Confirmed which index is being used with EXPLAIN ANALYZE on the queries
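Putting the last two bullets together, the test procedure looked roughly like this (scratch database only; flipping pg_index.indisvalid by hand is not something to do in prod):

```sql
-- Hide the trigram index from the planner:
UPDATE pg_index SET indisvalid = false
WHERE indexrelid = 'repo_name_trgm'::regclass;

-- Check which index (if any) the planner now uses, and how long the scan takes:
EXPLAIN ANALYZE
SELECT id, name FROM repo
WHERE lower(name) LIKE 'github.com/oplqs%';

-- Re-enable the index afterwards:
UPDATE pg_index SET indisvalid = true
WHERE indexrelid = 'repo_name_trgm'::regclass;
```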

Results:

| Mode | Time (s) | Rows |
| --- | --- | --- |
| TRGM IDX Scan (Prod simulation) | 19.1 | 183 |
| No IDX (Seq Scan) | 34.3 | 183 |
| BTREE IDX Scan | 0.33 | 183 |

BTREE is much, much faster. Note that a complete sequential scan is not that much slower than trigram search for this DB.

Relative index sizes:

| Index | Index Size | Table Size |
| --- | --- | --- |
| repo_name_trgm | 1675 MB | 14 GB |
| repo_name_idx | 988 MB | 14 GB |

It takes 54 seconds to generate repo_name_idx (I suspect it would be less for prod, since prod has fewer rows). So I propose we add a BTREE index to solve this. It is computationally inexpensive, not large (comparable to repo_name_trgm), and no changes to the table schema are needed.

One thing to note is that I used the "C" locale for LC_COLLATE, since indexing with pg_catalog.default (en_US.UTF-8) causes PSQL to ignore the BTREE index. I didn't dig much further into this once Google searches revealed that the "C" locale still works fine for UTF-8. Here's the script I used to generate the BTREE index for the repo table:

CREATE INDEX repo_name_idx
    ON public.repo USING btree
    (lower(name::text) COLLATE pg_catalog."C")
    TABLESPACE pg_default;

PSQL is smart enough to automatically prefer this index for "github.com/…%" queries, so nothing else needs to be done. I checked that it falls back to the trgm index when it thinks that is better.
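A quick way to sanity-check the planner's choice (the exact plans will depend on your data and settings):

```sql
-- Anchored prefix: should use the BTREE index (repo_name_idx).
EXPLAIN SELECT id FROM repo
WHERE lower(name) LIKE 'github.com/sourcegraph%';

-- General regex: should fall back to the trigram index (repo_name_trgm).
EXPLAIN SELECT id FROM repo
WHERE lower(name) ~ 'ithub\.com/sourcegraph';
```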

Scripts

Query on table (simulate search)

SELECT id
        , name
        , description
        , language
        , enabled
        , created_at
        , updated_at
        , external_id
        , external_service_type
        , external_service_id
        FROM repo WHERE deleted_at IS NULL
        AND (lower(name) LIKE 'github.com/oplqs%') AND enabled ORDER BY id ASC LIMIT 1073741824 OFFSET 0

Repo generation

import base64
import random
import string
import sys

ROW = 5

ORGS = 100000
MAX_REPOS = 400
MAX_ORG_LEN = 11
MAX_REPO_LEN = 11

# Template row: one real row exported from the local repo table.
fo = open("in-table.csv")
line = fo.readlines()[0]
fo.close()

def randomString(stringLength=10):
    """Generate a random lowercase string of the given length."""
    letters = string.ascii_lowercase
    return ''.join(random.choice(letters) for i in range(stringLength))


for i in range(ORGS):
    l2 = line.split(",")
    N = random.randint(5, MAX_ORG_LEN)
    org = randomString(N)
    num_repos = random.randint(1, MAX_REPOS)
    for x in range(num_repos):
        l2[0] = str(ROW)
        encoding = base64.b64encode("010:Repository{}".format(ROW).encode()).decode()
        ROW = ROW + 1
        M = random.randint(5, MAX_REPO_LEN)
        name = randomString(M)  # was randomString(MAX_REPO_LEN); M was computed but unused
        l2[1] = "github.com/{}/{}".format(org, name)
        l2[7] = encoding
        sys.stdout.write(','.join(l2))

How big are the indexes and tables?

SELECT
    t.tablename,
    indexname,
    c.reltuples AS num_rows,
    pg_size_pretty(pg_relation_size(quote_ident(t.tablename)::text)) AS table_size,
    pg_size_pretty(pg_relation_size(quote_ident(indexrelname)::text)) AS index_size,
    CASE WHEN indisunique THEN 'Y'
       ELSE 'N'
    END AS UNIQUE,
    idx_scan AS number_of_scans,
    idx_tup_read AS tuples_read,
    idx_tup_fetch AS tuples_fetched
FROM pg_tables t
LEFT OUTER JOIN pg_class c ON t.tablename=c.relname
LEFT OUTER JOIN
    ( SELECT c.relname AS ctablename, ipg.relname AS indexname, x.indnatts AS number_of_columns, idx_scan, idx_tup_read, idx_tup_fetch, indexrelname, indisunique FROM pg_index x
           JOIN pg_class c ON c.oid = x.indrelid
           JOIN pg_class ipg ON ipg.oid = x.indexrelid
           JOIN pg_stat_all_indexes psai ON x.indexrelid = psai.indexrelid AND psai.schemaname = 'public' )
    AS foo
    ON t.tablename = foo.ctablename
WHERE t.schemaname='public'
ORDER BY 1,2;