Skip to content

Instantly share code, notes, and snippets.

@mgartner
Created August 22, 2022 21:16
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save mgartner/62e43beb2c0de45ad787e495f4bc2171 to your computer and use it in GitHub Desktop.
Save mgartner/62e43beb2c0de45ad787e495f4bc2171 to your computer and use it in GitHub Desktop.

QA Plan - Trigrams

Note: Any bugs or findings of interested are documented inline with ‼️.

Run Postgres Trigram Examples from pg_trgm Docs

Run the examples from the pg_trgm docs. Ensure that the results match the expected results in the posts. This serves as a rough check for compatibility with Postgres in basic cases.

SELECT word_similarity('word', 'two words');

SELECT strict_word_similarity('word', 'two words'), similarity('word', 'words');

‼️ The built-ins above that shown in the PG docs are both unimplemented in CRDB. This is not a bug.

CREATE TABLE test_trgm (t text);

INSERT INTO test_trgm VALUES
  ('foo'),
  ('bar'),
  ('baz'),
  ('hoard'),
  ('wordy'),
  ('world'),
  ('weird'),
  ('weird'),
  ('xxxfooyyybar'),
  ('wofooord');
INSERT INTO test_trgm SELECT 'empty' FROM generate_series(1, 10000);

SELECT t, similarity(t, 'word') AS sml
  FROM test_trgm
  WHERE t % 'word'
  ORDER BY sml DESC, t;

SELECT * FROM test_trgm WHERE t LIKE '%foo%bar';

SELECT * FROM test_trgm WHERE t ~ '(foo|bar)';

CREATE INDEX trgm_idx ON test_trgm USING GIST (t gist_trgm_ops);

-- This should be faster with the index.
SELECT t, similarity(t, 'word') AS sml
  FROM test_trgm
  WHERE t % 'word'
  ORDER BY sml DESC, t;

-- This should be faster with the index.
SELECT * FROM test_trgm WHERE t LIKE '%foo%bar';

-- This should not be faster with the index - we don't accelerate this yet.
SELECT * FROM test_trgm WHERE t ~ '(foo|bar)';

‼️ The first SELECT above without any indexes takes ~6ms in PG and ~26ms in CRDB, running in a single-node cockroach demo. Changing the % comparison to = speeds up the query to ~1.2ms in PG and 8ms in CRDB, which are both about as fast as a simple full table scan over test_trgm like SELECT * FROM test_trgm WHERE t = 'word';. This suggests that there may be optimization opportunities in our implementation of %. Created #86610 to track this.

(Same as above, but with GIN instead of GIST, which are the same in CRDB.)

CREATE TABLE test_trgm (t text);

INSERT INTO test_trgm VALUES
  ('foo'),
  ('bar'),
  ('baz'),
  ('hoard'),
  ('wordy'),
  ('world'),
  ('weird'),
  ('weird'),
  ('xxxfooyyybar'),
  ('wofooord');
INSERT INTO test_trgm SELECT 'empty' FROM generate_series(1, 10000);

SELECT t, similarity(t, 'word') AS sml
  FROM test_trgm
  WHERE t % 'word'
  ORDER BY sml DESC, t;

SELECT * FROM test_trgm WHERE t LIKE '%foo%bar';

SELECT * FROM test_trgm WHERE t ~ '(foo|bar)';

CREATE INDEX trgm_idx ON test_trgm USING GIN (t gin_trgm_ops);

-- This should be faster with the index.
SELECT t, similarity(t, 'word') AS sml
  FROM test_trgm
  WHERE t % 'word'
  ORDER BY sml DESC, t;

-- This should be faster with the index.
SELECT * FROM test_trgm WHERE t LIKE '%foo%bar';

-- This should not be faster with the index - we don't accelerate this yet.
SELECT * FROM test_trgm WHERE t ~ '(foo|bar)';

‼️ It's possible to use gin_trgm_ops and gist_trgm_ops with the opposite type of index. There's no reason these need to fail in CRDB - GIST and GIN are aliases for our general-purpose inverted indexes. However, we may want to be consistent with Postgres here. For example, both of these succeed in CRDB but fail in PG: Created #86611 to track this.

CREATE INDEX trgm_idx ON test_trgm USING GIST (t gin_trgm_ops);

CREATE INDEX trgm_idx ON test_trgm USING GIN (t gist_trgm_ops);

Extension

Test that CREATE EXTENSION pg_trgm works. It should be a no-op.

Built-ins

Test the following built-ins manually and compare the results to Postgres.

show_trgm

SELECT show_trgm('sql queries team');
SELECT show_trgm('sql queries team'::TEXT);
SELECT show_trgm('sql queries team'::VARCHAR);
SELECT show_trgm('sql queries team'::CHAR);
SELECT show_trgm('');
SELECT show_trgm(' ');
SELECT show_trgm('  ');
SELECT show_trgm('   ');
SELECT show_trgm('a');
SELECT show_trgm('ab');
SELECT show_trgm('foo%bar');
SELECT show_trgm('foo%.***%%$#@bar');

similarity

SELECT similarity('sql queries team', 'sql queries team');
SELECT similarity('sql queries team', 'sql team queries');
SELECT similarity('sql queries team'::TEXT, 'sql team queries'::VARCHAR);
SELECT similarity('foo bar baz', 'foo baz');
SELECT similarity('foo bar baz', 'baz bar');
SELECT similarity('foobarbaz', 'ooaraz');

% (similarity built-in)

SELECT 'foo bar baz' % 'foo baz';
SELECT 'foo bar baz' % 'baz bar';
SELECT 'foobarbaz' % 'ooaraz';

Test changing the session setting pg_trgm.similarity_threshold.

SET pg_trgm.similarity_threshold = 0.05;
SELECT 'foo bar baz' % 'foo baz';
SELECT 'foo bar baz' % 'baz bar';
SELECT 'foobarbaz' % 'ooaraz';

CREATE Tests

These should succeed:

CREATE TABLE test_trgm (s STRING, INVERTED INDEX (s gist_trgm_ops));
SHOW CREATE TABLE test_trgm;
DROP TABLE test_trgm;

CREATE TABLE test_trgm (s STRING, INVERTED INDEX (s gin_trgm_ops));
SHOW CREATE TABLE test_trgm;
DROP TABLE test_trgm;
CREATE TABLE test_trgm (s STRING);
CREATE INVERTED INDEX ON test_trgm (s gist_trgm_ops);
SHOW CREATE TABLE test_trgm;
DROP TABLE test_trgm;
CREATE TABLE test_trgm (s STRING);
CREATE INDEX ON test_trgm USING GIN (s gin_trgm_ops);
SHOW CREATE TABLE test_trgm;
DROP TABLE test_trgm;
CREATE TABLE test_trgm (s STRING);
CREATE INDEX ON test_trgm USING GIST (s gist_trgm_ops);
SHOW CREATE TABLE test_trgm;
DROP TABLE test_trgm;

‼️ We display gin_trgm_ops in SHOW CREATE TABLE regardless of the op class used when the table was created. However, I'm not sure whether this is considered a bug given that the behave the same in CRDB.

These should fail:

CREATE TABLE test_trgm (s STRING, INVERTED INDEX (s));
CREATE TABLE test_trgm (s STRING, INVERTED INDEX (s));
CREATE TABLE test_trgm (s STRING, INDEX (s gist_trgm_ops));
CREATE TABLE test_trgm (s STRING, INDEX (s gin_trgm_ops));

‼️ We allow gist_trgm_ops and gin_trgm_ops for non-inverted indexes. This is a known bug: cockroachdb/cockroach#84512

Index Feature Combinations

Re-create this table to test with before each section.

DROP TABLE IF EXISTS test_trgm;
CREATE TABLE test_trgm (
  k INT PRIMARY KEY,
  a INT,
  b INT,
  x TEXT
);
INSERT INTO test_trgm
SELECT i, i % 29, i * 100,
  chr(ascii('A')+(i%3)) || chr(ascii('A')+(i%5)) ||
  chr(ascii('A')+(i%7)) || chr(ascii('A')+(i%11)) ||
  chr(ascii('A')+(i%13)) || chr(ascii('A')+(i%17)) ||
  chr(ascii('A')+(i%19)) || chr(ascii('A')+(i%23)) ||
  chr(ascii('A')+(i%29)) || chr(ascii('A')+(i%31)) ||
  chr(ascii('A')+(i%37)) || chr(ascii('A')+(i%41)) ||
  chr(ascii('A')+(i%43)) || chr(ascii('A')+(i%47))
FROM generate_series (1, 10000) g(i);
ANALYZE test_trgm;

Partial Trigram Indexes

CREATE INDEX ON test_trgm USING GIST (x gist_trgm_ops) WHERE a > 10;
SELECT k FROM test_trgm WHERE a > 10 AND x LIKE '%FOO%';
SELECT k, similarity(x, 'FOOBAR') FROM test_trgm WHERE a > 10 AND x LIKE '%FOO%' ORDER BY 2;
SELECT k, a, similarity(x, 'FOOBAR') FROM test_trgm WHERE a > 20 AND x LIKE '%FOO%' ORDER BY 3;
UPDATE test_trgm SET a = a + 5 WHERE true;
SELECT k, similarity(x, 'FOOBAR') FROM test_trgm WHERE a > 10 AND x LIKE '%FOO%' ORDER BY 2;
SELECT k, a, similarity(x, 'FOOBAR') FROM test_trgm WHERE a > 15 AND x LIKE '%FOO%' ORDER BY 3;
UPDATE test_trgm SET a = a + 1, x = x || 'world' WHERE true;
SELECT k, a, x, similarity(x, 'BAD world') FROM test_trgm WHERE a > 10 AND x LIKE '%FO%world%' ORDER BY 4, 2, 1 LIMIT 20;

Multi-column Trigram Indexes

CREATE INDEX ON test_trgm USING GIST (a, x gist_trgm_ops);
SELECT k, a, x FROM test_trgm WHERE a = 11 AND x LIKE '%FO%';

‼️ We cannot plan a scan of multi-column trigram indexes when there are not enough characters on the RHS of the LIKE expression. This leads to a confusing error "ERROR: index "i" is inverted and cannot be used for this query". Created #86612 to track this.

‼️ SHOW CREATE TABLE incorrectly shows an opclass for non-inverted prefix columns: INVERTED INDEX test_trgm_a_x_idx (a gin_trgm_ops ASC, x gin_trgm_ops). Created #86614 to track this.

Expression Trigram Indexes

CREATE INDEX ON test_trgm USING GIST ((lower(x)) gist_trgm_ops);
SELECT k, a, x FROM test_trgm WHERE a = 14 AND lower(x) LIKE '%foo%';

Trigram Indexes on Collated Strings

DROP TABLE IF EXISTS test_trgm;
CREATE TABLE test_trgm (s TEXT COLLATE "de_DE");
INSERT INTO test_trgm VALUES
  ('foo'),
  ('Fussball'),
  ('Fußball'),
  ('bar'),
  ('ball');
SELECT show_trgm(s) FROM test_trgm;

‼️ The query above results in an error: "unknown signature: show_trgm(collatedstring{de_DE})". Postgres can produce the trigrams without error. Created #86615 to track this.

SELECT show_trgm(s::text) FROM test_trgm;
SELECT s % 'fussball', similarity(s, 'fussball')
FROM test_trgm;

‼️ The query above results in an error: "unknown signature: show_trgm(collatedstring{de_DE})". Postgres can produce the trigrams without error. Created #86615 to track this.

SELECT s::TEXT % 'fussball', similarity(s::TEXT, 'fussball')
FROM test_trgm;
CREATE INDEX i ON test_trgm USING GIST (s gist_trgm_ops);

‼️ The index creation fails with a confusing error:

ERROR: column s of type string is not allowed as the last column in an inverted index
SQLSTATE: 0A000
HINT: see the documentation for more information about inverted indexes: https://www.cockroachlabs.com/docs/v22.2/inverted-indexes.html

Created #86617 to track this.

I believe the actual problem is that trigrams on collated strings are not supported. The error message should be more clear about that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment