Smarter filtering of location and other strings for vector search

Fast Fuzzy Filtering (F3)

Algorithm for pre-filtering locations and other similar strings during vector search.

The Problem

We built vector databases of World Bank project documents, each with location, topic, and theme metadata fields, using Marqo, AWS Kendra, Weaviate, and Pinecone. A typical query looks like:

What are some lessons learned from rural water supply projects in Northern Africa?
What are the effects on girls' education due to sanitation projects in Asia and South America?

When we used vector similarity on metadata fields, strings like Southern Africa and Eastern Africa scored as highly similar even though they should not match. Further, in the popular vector search engines, vector similarity on metadata only influences the ranking and does not actually filter anything out.

Lexical matching - the default pre-filtering strategy in almost all vector search engines - works slightly better but cannot handle more complex cases. For example, if the query requests Africa and the document metadata contains South Africa, it should be a match, but lexical matching returns False.

We could go one step further and look for string inclusion. After all, query: South Africa should not match metadata: Africa, but query: Africa should match metadata: South Africa, since South Africa is in Africa. However, this fails for cases like query: East Africa and metadata: Eastern Africa, and we also need query: Africa and South America to match metadata: Eastern Africa, as illustrated below. It is possible to use an LLM or another text classification model for this, but that operation would run once per document and would directly impact search time.
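To see concretely where substring inclusion breaks down, here is a minimal illustration (inclusion_match is a throwaway helper, not part of the pipeline):

def inclusion_match(query: str, meta: str) -> bool:
    # Naive containment: does the query string appear inside the metadata?
    return query.lower() in meta.lower()

print(inclusion_match("Africa", "South Africa"))                      # True  (correct)
print(inclusion_match("East Africa", "Eastern Africa"))               # False (should be True)
print(inclusion_match("Africa and South America", "Eastern Africa"))  # False (should be True)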

The Solution

This solution is built on top of the rapidfuzz fuzzy string matching library.

from typing import Dict, List, Union
from rapidfuzz import fuzz
import os
import json
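As a quick sanity check of the underlying scorer (exact scores vary by rapidfuzz version), note that WRatio alone cannot separate the matches we want from near-misses we don't, which is why the filter below adds an exact-match guard for single tokens:

from rapidfuzz import fuzz

# Both pairs score well above a 60 threshold, even though only the first
# is a match we want ("East Africa" vs "Eastern Africa") and the second
# is not ("Niger" vs "Nigeria").
print(fuzz.WRatio("eastern", "east"))
print(fuzz.WRatio("nigeria", "niger"))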

Optional: Synonyms

Since we are going to do fuzzy lexical matching, it is worth using a domain-specific synonym list like:

wastewater, sewage, sludge, septage, sullage, effluents, sewerage
peers, contacts, contact point, for more information
waste, MSW, excreta, HCW, litter, feces, refuse
construction, renovation, installation, upgrading
dismantling, demolition
excavation, digging, emptying
development, transformation, integration, reform, strengthening

We can load it into a hashmap that maps every word in a synset to a base word, giving us constant-time conversion. Here is some example code for that:

def jsonifySynsets() -> None:
    # Map every synonym to the first word of its synset, which acts
    # as the canonical base word.
    synonym_dict = {}
    filepath = "../data/Synonyms.txt"
    try:
        with open(filepath, "r") as file:
            for line in file:
                line = " ".join(line.split())  # collapse stray whitespace
                synset = line.strip().split(", ")
                base_word = synset[0]
                for synonym in synset:
                    synonym_dict[synonym] = base_word
    except OSError as e:
        print(f"Error: {e}")
        return  # do not write an empty map if reading failed
    output_file_path = "../objects/Synonyms.json"
    os.makedirs(os.path.dirname(output_file_path), exist_ok=True)
    with open(output_file_path, "w") as output_file:
        json.dump(synonym_dict, output_file, indent=4)
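A quick check of the resulting map (the values follow from the synonym list above):

import json

jsonifySynsets()
with open("../objects/Synonyms.json", "r") as f:
    synonyms = json.load(f)

print(synonyms["sewage"])                    # wastewater
print(synonyms["demolition"])                # dismantling
print(synonyms.get("unlisted", "unlisted"))  # unknown words fall back to themselves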

Filtering

This function supports natural language strings as well as lists. For it to work optimally, make sure to first extract the relevant field you are trying to match from the query. We are using an LLM for this, and it works well 99.99% of the time. It also does not affect latency, as it is a one-time operation per query. For example:

query: What are some lessons learned from rural water supply projects in Northern Africa?
-->
metadata: {'request': ['lessons learned'], 'project types': ['rural water supply'], 'locations': ['Northern Africa']}

Main function:

def f3(query_phrase: Union[str, List[str]], meta_phrase: str) -> bool:
    # Accept either a natural-language string or a pre-split list of phrases.
    if isinstance(query_phrase, str):
        queries = (
            query_phrase.lower()
            .replace(" and ", ",")
            .replace(" or ", ",")
            .replace(";", ",")
            .split(",")
        )
    else:
        queries = [q.lower() for q in query_phrase]
    # Load the synonym map once, building it first if it does not exist.
    try:
        with open("../objects/Synonyms.json", "r") as json_file:
            synonyms = json.load(json_file)
    except (FileNotFoundError, json.JSONDecodeError):
        jsonifySynsets()
        with open("../objects/Synonyms.json", "r") as json_file:
            synonyms = json.load(json_file)
    tokenset2 = meta_phrase.lower().split()
    for query in queries:
        tokenset1 = query.strip().split()
        if not tokenset1:
            continue
        # Two lone tokens must match exactly; this blocks near-misses
        # like "Niger" vs "Nigeria" that fuzzy matching would accept.
        if len(tokenset1) == 1 and len(tokenset2) == 1 and tokenset1[0] != tokenset2[0]:
            continue
        # Map every token to its base word so synonyms compare as equals.
        base_words = {
            token: synonyms.get(token, token)
            for token in set(tokenset1) | set(tokenset2)
        }
        # A query matches when each of its words has a good fuzzy match
        # against some word in the metadata phrase.
        scores = []
        for query_word in tokenset1:
            ratios = [
                fuzz.WRatio(base_words[query_word], base_words[meta_word])
                for meta_word in tokenset2
            ]
            scores.append(max(ratios))
        if min(scores) > 60:
            return True
    return False
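As a sketch of how f3 might slot into the pipeline (the prefilter helper and the document shape here are hypothetical, mirroring the extraction example above), a document is kept only when every extracted query value matches at least one of the document's values for that field:

def prefilter(docs: List[Dict], extracted: Dict[str, List[str]]) -> List[Dict]:
    # Hypothetical pre-filtering step: keep a document only if every
    # requested value passes f3 against some metadata value on the same field.
    kept = []
    for doc in docs:
        if all(
            any(f3(value, meta) for meta in doc["metadata"].get(field, []))
            for field, values in extracted.items()
            for value in values
        ):
            kept.append(doc)
    return kept

docs = [
    {"metadata": {"locations": ["Eastern Africa"]}},
    {"metadata": {"locations": ["South America"]}},
]
print(len(prefilter(docs, {"locations": ["East Africa"]})))  # 1: only Eastern Africa passes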

The theoretical complexity is O(max_query_field_tokens * max_metadata_field_tokens) per query phrase; in practice these strings will not contain more than 10-15 words.

Results

Below are some example outputs. It has produced the expected result on every case we have tried so far. Spelling errors may degrade performance, but spell-correcting the query is a one-time operation.

[Query: China and Brazil] [Metadata: West Brazil] ----> True
[Query: Pacific] [Metadata: Eastern Asia] ----> False
[Query: Niger] [Metadata: Nigeria] ----> False
[Query: East Africa] [Metadata: South Africa] ----> False
[Query: Africa] [Metadata: Southern Africa] ----> True
[Query: East Africa] [Metadata: Africa] ----> False
[Query: East Africa] [Metadata: Eastern Africa] ----> True
[Query: Africa] [Metadata: Africa] ----> True
[Query: sewerage] [Metadata: sludge treatment] ----> True
[Query: sustainable sewerage treatment] [Metadata: sludge treatment] ----> False
[Query: insights] [Metadata: lessons learned and recommendations] ----> True
[Query: lesson learned] [Metadata: lessons learned and recommendations] ----> True
[Query: South Tarawa] [Metadata: Kiribati - South Tarawa Rural Growth Project] ----> True

Conclusion

This does not solve the problem completely, but it is a significant improvement over both vector similarity and exact matching on metadata fields during vector search.

License

MIT License

Copyright (c) 2023 theNullCrown

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.