Skip to content

Instantly share code, notes, and snippets.

@janko
Last active June 6, 2019 23:58
Show Gist options
  • Star 7 You must be signed in to star a gist
  • Fork 1 You must be signed in to fork a gist
  • Save janko/1e592367e474baa0f394 to your computer and use it in GitHub Desktop.
Save janko/1e592367e474baa0f394 to your computer and use it in GitHub Desktop.
PostgreSQL full-text search capabilites (my presentation from our local Ruby meetup)

Full-text search

  • Keywords

  • Typos

  • Stemming

  • Stopword ignore

  • Synonym expansion

  • Unaccenting

  • Queries

    • boolean (lunch AND dessert)
    • phrases ("cherry pie")
  • Ranking

    • how much does a document match a query
    • column weights

Document & Query

SELECT 'Winter is coming' ILIKE '%winter%'

SELECT 'Winter is coming' ILIKE '%winter coming%'

SELECT 'Winter is coming' @@ 'winter coming'

SELECT to_tsvector('Winter is coming') @@ plainto_tsquery('winter coming')
  • tokenization
  • downcasing
  • stemming
  • unaccenting
  • stopword removal
SELECT plainto_tsquery('winter coming')

SELECT to_tsquery('winter & coming')

SELECT to_tsvector('Winter is coming is')
SELECT to_tsvector('Winter is coming') @@ to_tsquery('winters')

SELECT to_tsvector('english', 'Winter is coming') @@ to_tsquery('english', 'winters')

SELECT to_tsvector('english', 'Winter is coming') @@ to_tsquery('english', 'come')

SELECT to_tsvector('Winter is coming'), to_tsvector('english', 'Winter is coming')

Ranking

SELECT ts_rank(to_tsvector('Winter is coming'), to_tsquery('winter'))

SELECT ts_rank(to_tsvector('Winter is coming'), to_tsquery('winter & is & coming'))
SELECT setweight(to_tsvector('Winter is coming'), 'D')

SELECT ts_rank(setweight(to_tsvector('Winter is coming'), 'D'), to_tsquery('winter & is & coming'))

SELECT ts_rank(setweight(to_tsvector('Winter is coming'), 'C'), to_tsquery('winter & is & coming'))

SELECT ts_rank(setweight(to_tsvector('Winter is coming'), 'A'), to_tsquery('winter & is & coming'))

SELECT * FROM posts WHERE to_tsvector(title) || to_tsvector(body) @@ to_tsquery('winter') ORDER BY ts_rank(setweight(to_tsvector(title), 'A') || setweight(to_tsvector(body), 'C'), to_tsquery('winter'))
SELECT ts_rank_cd(to_tsvector('Winter is not coming'), to_tsquery('winter & is & coming'))

SELECT ts_rank_cd(to_tsvector('Winter is not coming'), to_tsquery('winter & is & not'))

Highlighting

SELECT ts_headline('english', 'Winter is coming', to_tsquery('winter'))

SELECT ts_headline('english', 'Winter is coming', to_tsquery('winter'), 'StartSel = <strong>, StopSel = </strong>')

SELECT ts_headline('english', 'Lorem ipsum dolor sit amet, consectetur adipiscing elit. Phasellus sed ex sit amet lacus facilisis posuere nec non nibh. Nam venenatis mattis ante, ultricies fermentum tortor blandit a. Morbi tellus libero, elementum non turpis quis, lobortis hendrerit magna. Nullam venenatis, lorem eu efficitur posuere, diam tellus finibus massa, fringilla sodales nibh nisi et mauris. Nullam sed mi eu lectus laoreet ultrices. Vestibulum porta eu justo sit amet porta. Fusce vitae dolor bibendum, aliquam risus id, hendrerit sapien.', to_tsquery('eu'), 'HighlightAll = true')

SELECT ts_headline('english', 'Lorem ipsum dolor sit amet, consectetur adipiscing elit. Phasellus sed ex sit amet lacus facilisis posuere nec non nibh. Nam venenatis mattis ante, ultricies fermentum tortor blandit a. Morbi tellus libero, elementum non turpis quis, lobortis hendrerit magna. Nullam venenatis, lorem eu efficitur posuere, diam tellus finibus massa, fringilla sodales nibh nisi et mauris. Nullam sed mi eu lectus laoreet ultrices. Vestibulum porta eu justo sit amet porta. Fusce vitae dolor bibendum, aliquam risus id, hendrerit sapien.', to_tsquery('eu'), 'MinWords = 1, MaxWords = 5, MaxFragments = 10')

Indices

ALTER TABLE posts ADD COLUMN searchable_content tsvector
UPDATE posts SET searchable_content = to_tsvector(title, 'A') || to_tsvector(body, 'C')

CREATE INDEX post_search ON posts USING gin(searchable_content)

SELECT * FROM posts WHERE searchable_content @@ 'winter' ORDER BY ts_rank(searchable_content, to_tsquery('winter'))

pg_search

require "pg_search"

enable_extension "pg_trgm"
enable_extension "unaccent"

class Post < ActiveRecord::Base
  include PgSearch

  belongs_to :author
  has_many :comments

  pg_search_scope :search,
    against: {title: "A", body: "C"},
    associated_against: {
      author:   {name: "B"},
      comments: {content: "D"},
    },
    using: {
      tsearch: {prefix: true},
      trigram: {},
      # dmetaphone: {},
    },
    ignoring: :accents,
end
  • Indexing is really tricky

    SELECT COUNT(title) FROM posts WHERE to_tsvector(title || ' ' || body) @@ to_tsquery('anatomija')
    • You need to create a separate column

      CREATE INDEX post_idx ON posts USING gin(to_tsvector(title || ' ' || body))
      
      CREATE INDEX post_idx ON posts USING gin(to_tsvector(title) || to_tsvector(body))
      
      CREATE INDEX post_idx ON posts USING gin(setweight(to_tsvector(title), 'A') || setweight(to_tsvector(body), 'B'))
      
      SELECT COUNT(title) FROM posts WHERE content @@ to_tsquery('anatomija')
      • effectively doubles database size

      • need trigger for auto-updating

      CREATE TRIGGER posts_tsv BEFORE INSERT OR UPDATE
      ON posts FOR EACH ROW EXECUTE PROCEDURE
      tsvector_update_trigger(content, 'croatian', title, body);
    • How to index columns from multiple tables (associations)?

  • Unaccent

    • Doesn't work with stemming

      SELECT to_tsvector('naranče'), to_tsvector(unaccent('naranče'))
      
      SELECT unaccent(to_tsvector('naranče'))
      
      SELECT to_tsvector(unaccent('Sok od crvenih naranči')) @@ plainto_tsquery(unaccent('crvene naranče'))
      • theoretically could be included in dictionary, but actually not
    • Can't be indexed out-of-the-box

      SELECT * FROM posts WHERE unaccent(content) @@ query
      
      CREATE INDEX post_idx ON posts USING gin(unaccent(title || ' ' || body))
      
      CREATE OR REPLACE FUNCTION immutable_unaccent(text)
        RETURNS text AS
      $func$
      SELECT unaccent('unaccent', $1)
      $func$  LANGUAGE sql IMMUTABLE SET search_path = public, pg_temp;
      • or can be thrown into a custom trigger, but we still lose stemming
  • Typos correction

    SELECT similarity('naranča',     'naranča');
    
    SELECT similarity('Sok naranča', 'naranča');
    • Table with unique words, and repeat the query with a similar word
  • No out-of-the-box support for some more advanced queries

    • boolean queries (although can be worked around)

      "blue yellow OR red" ⇩ "blue & yellow | red"

    • phrases

  • Indexing:

    • Tokenization
    • Downcasing
    • Stopword removal
    • Unaccenting
    • Stemming
  • Query:

    • Keywords
    • [/] Phrases
    • Typos correction
    • [/] Boolean operators
    • Wildcards & regexes
    • Faceting
  • Ranking:

    • Column weights
    • TF-IDF
  • Display:

    • Highlighting
    • Pagination
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment