⚠️ UPDATE: v0.25.0 is now generally available. Please upgrade to 0.25.0 and follow the official documentation.

Hybrid Search + Built-in Embeddings in Typesense

⚠️ This is an early alpha preview and the API might change in the final release. Use with caution.

Run Typesense

On Typesense Cloud: Reach out to support at typesense dot org to have your cluster upgraded to an RC build that supports this feature.

When self-hosting:

export TYPESENSE_API_KEY=xyz

mkdir -p "$(pwd)/typesense-data"

docker run --platform linux/amd64 -p 8108:8108 -v "$(pwd)/typesense-data:/data" typesense/typesense:0.25.0.rc65 \
  --data-dir /data --api-key=$TYPESENSE_API_KEY --enable-cors

ℹ If you're running this amd64 Docker build on an Apple M1 Mac, performance might be degraded, since the amd64 image runs under emulation on ARM.
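Once the container is up, you can confirm the server is ready with the health endpoint, which returns {"ok":true}:

curl "http://localhost:8108/health"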

Create a collection

Create a collection using one of the schema options below:

Remote Embedding Models

Using OpenAI embeddings

curl "http://localhost:8108/collections" \
       -X POST \
       -H "Content-Type: application/json" \
       -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \
       -d '
{
  "name": "products",
  "fields": [
    {
      "name": "product_name",
      "type": "string"
    },
    {
      "name": "description",
      "type": "string"
    },
    {
      "name": "embedding",
      "type": "float[]",
      "embed": {
        "from": [
          "product_name",
          "description"
        ],
        "model_config": {
          "model_name": "openai/text-embedding-ada-002",
          "api_key": "your_openai_api_key"
        }
      }
    }
  ]
}'

Using PaLM API embeddings

curl "http://localhost:8108/collections" \
       -X POST \
       -H "Content-Type: application/json" \
       -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \
       -d '
{
  "name": "products",
  "fields": [
    {
      "name": "product_name",
      "type": "string"
    },
    {
      "name": "description",
      "type": "string"
    },
    {
      "name": "embedding",
      "type": "float[]",
      "embed": {
        "from": [
          "product_name",
          "description"
        ],
        "model_config": {
          "model_name": "google/embedding-gecko-001",
          "api_key": "your_palm_api_key_from_makersuite.google.com"
        }
      }
    }
  ]
}
'

Built-in embedding models

Using S-BERT

⚠️ This model currently uses CPU and is quite slow to generate embeddings at scale. We're actively working on adding GPU support, which will improve performance significantly.

As with the remote models above, POST this schema to the /collections endpoint:

{
  "name": "products",
  "fields": [
    {
      "name": "product_name",
      "type": "string"
    },
    {
      "name": "description",
      "type": "string"
    },
    {
      "name": "embedding",
      "type": "float[]",
      "embed": {
        "from": [
          "product_name",
          "description"
        ],
        "model_config": {
          "model_name": "ts/all-MiniLM-L12-v2"
        }
      }
    }
  ]
}

Using E5

⚠️ This model currently uses CPU and is quite slow to generate embeddings at scale. We're actively working on adding GPU support, which will improve performance significantly.

POST this schema to the /collections endpoint as before:

{
  "name": "products",
  "fields": [
    {
      "name": "product_name",
      "type": "string"
    },
    {
      "name": "description",
      "type": "string"
    },
    {
      "name": "embedding",
      "type": "float[]",
      "embed": {
        "from": [
          "product_name",
          "description"
        ],
        "model_config": {
          "model_name": "ts/e5-small"
        }
      }
    }
  ]
}

Using your own custom models

You can also use your own models, but they must be in the ONNX file format, since Typesense uses ONNX Runtime for inference. If you have a model in PyTorch format, for example, look up instructions on how to convert a PyTorch model to ONNX.
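As a sketch of one common conversion path (assuming a Hugging Face model and the optimum package, neither of which is part of this guide):

# Install the Hugging Face Optimum exporter
pip install "optimum[exporters]"

# Export a model to ONNX format (the model id here is illustrative)
optimum-cli export onnx --model sentence-transformers/all-MiniLM-L12-v2 ./all-MiniLM-L12-v2-onnx/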

Once you have the model in ONNX format, create a directory under <typesense_data_dir>/models/<model_name> and store your ONNX model file, vocab file, and JSON model config file there.

Note: Your model file MUST be named model.onnx and the config file MUST be named config.json.

Model config file

This file contains information about the type of model you want to use. The JSON must contain the model_type key (the type of the model; bert and xlm_roberta are supported at the moment) and the vocab_file_name key.

Example directory layout for a custom model named test_model:
<data_dir>/models/test_model/model.onnx
<data_dir>/models/test_model/vocab.txt
<data_dir>/models/test_model/config.json

Here's an example model: https://huggingface.co/typesense/models/tree/main/all-MiniLM-L12-v2

Contents of config.json:

{
    "model_type": "bert",
    "vocab_file_name": "vocab.txt"
}

Then create an embedding field, using the directory name as the model_name in model_config:

{
  "name": "products",
  "fields": [
    {
      "name": "product_name",
      "type": "string"
    },
    {
      "name": "embedding",
      "type": "float[]",
      "embed": {
        "from": [
          "product_name"
        ],
        "model_config": {
          "model_name": "test_model"
        }
      }
    }
  ]
}

Optional Model Parameters

These optional model parameters may be required when using your own custom models.

Indexing prefix and query prefix

Some models require a prefix to distinguish queries from the passages being searched (see intfloat/e5-small, for example). If you set these properties in model_config, the indexing_prefix is prepended to the text used to generate embeddings when you index a document, and the query_prefix is prepended to the query before it is embedded. Example:

{
  "name": "products",
  "fields": [
    {
      "name": "product_name",
      "type": "string"
    },
    {
      "name": "embedding",
      "type": "float[]",
      "embed": {
        "from": [
          "product_name"
        ],
        "model_config": {
          "model_name": "e5-base",
          "indexing_prefix": "passage:",
          "query_prefix": "query:"
        }
      }
    }
  ]
}

For this example, when you index a document:

{
   "product_name": "ABCD"
}

The text used to generate embeddings for the embedding field will be passage: ABCD instead of ABCD. Likewise, when you search, a query of EFGH will be embedded as query: EFGH instead of EFGH.

Index documents

When you index a document like this:

curl "http://localhost:8108/collections/products/documents/import?action=create" \
        -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \
        -H "Content-Type: text/plain" \
        -X POST \
        -d '
{"product_name": "ABCD","description": "This is some description text"}
'

The values of product_name and description are automatically concatenated and sent to the embedding model specified in the collection schema, and the resulting embeddings are stored in the embedding field.
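To verify that an embedding was generated, you can fetch the document back by id (a minimal sketch, assuming the document was indexed with an explicit "id": "123"; the response will include the generated vector in the embedding field):

curl "http://localhost:8108/collections/products/documents/123" \
        -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}"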

Semantic Search

You can set query_by directly to the auto-embedding field to do a semantic search on it:

curl "http://localhost:8108/multi_search" \
        -X POST \
        -H "Content-Type: application/json" \
        -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \
        -d '
{
  "searches": [
    {
      "collection": "products",
      "q": "shoes with white sole",
      "query_by": "embedding",
      "prefix": false
    }
  ]
}
'

This will automatically embed the query shoes with white sole with the same model used for the embedding field and perform a nearest-neighbor vector search.
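Since the hits include the stored documents, the full embedding vectors can make responses large. A minimal sketch of omitting them with Typesense's exclude_fields search parameter:

curl "http://localhost:8108/multi_search" \
        -X POST \
        -H "Content-Type: application/json" \
        -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \
        -d '
{
  "searches": [
    {
      "collection": "products",
      "q": "shoes with white sole",
      "query_by": "embedding",
      "exclude_fields": "embedding",
      "prefix": false
    }
  ]
}
'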

Hybrid Search Query

You can combine semantic search with keyword search and do a hybrid search by specifying both regular fields and vector fields in query_by:

curl "http://localhost:8108/multi_search" \
        -X POST \
        -H "Content-Type: application/json" \
        -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \
        -d '
{
  "searches": [
    {
      "collection": "products",
      "q": "shoes with white sole",
      "query_by": "product_name,embedding",
      "prefix": false
    }
  ]
}
'

This will do a keyword search on the product_name field, a semantic search on the vectors in the embedding field, and rank both sets of results together.

Hybrid Search Rank Fusion

The score of a document for hybrid search is calculated by rank fusion:

score = 0.7 * (1 / (keyword_search_rank_of_document + 1)) + 0.3 * (1 / (semantic_search_rank_of_document + 1))

keyword_search_rank_of_document is the rank of the document in the results for keyword search, and semantic_search_rank_of_document is the rank of the document in the results for semantic search.
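For example (assuming ranks are 0-based, so the top result has rank 0): a document ranked first in keyword search and third in semantic search would score 0.7 * (1/1) + 0.3 * (1/3) = 0.7 + 0.1 = 0.8.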

Known Issues 🚧

These issues are known and are being worked on:

  • Altering the schema in-place and changing the field definition for an auto-embedding field makes the field no longer vector-searchable
  • Typesense attempts to highlight numeric values in vector fields
  • The built-in models currently use CPU and are quite slow to generate embeddings at scale. We're actively working on adding GPU support, which will improve performance significantly.