Last active April 26, 2022 10:14
Search Cliff Notes


Learning to Rank

  • Applies supervised machine learning to search relevance ranking.
  • A model is trained on preprocessed data whose results carry hand-labeled relevance ranks.
  • The trained model is then used to rank (optimize) future search results.

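A minimal pointwise sketch of the idea, assuming a simple linear model and made-up features (a text-match score and a click rate); production learning-to-rank systems use richer features and models (e.g. gradient-boosted trees), but the shape is the same: fit on hand-labeled relevance, then re-rank.

```python
# Pointwise learning-to-rank sketch (features and labels are made up for
# illustration). Each document is a feature vector; the hand-labeled
# relevance grade is the regression target.

def train(features, labels, lr=0.1, epochs=500):
    """Fit w in score = w . x by gradient descent on squared error."""
    w = [0.0] * len(features[0])
    n = len(features)
    for _ in range(epochs):
        for j in range(len(w)):
            grad = 0.0
            for x, y in zip(features, labels):
                pred = sum(wi * xi for wi, xi in zip(w, x))
                grad += (pred - y) * x[j]
            w[j] -= lr * grad / n
    return w

def rank(docs, features, w):
    """Score each document with the learned model and sort best-first."""
    scored = [(sum(wi * xi for wi, xi in zip(w, x)), d)
              for d, x in zip(docs, features)]
    return [d for _, d in sorted(scored, reverse=True)]

# Hypothetical features: [text-match score, historical click rate]
train_X = [[0.9, 0.8], [0.7, 0.1], [0.2, 0.9], [0.1, 0.05]]
train_y = [1.0, 0.6, 0.5, 0.0]   # hand-labeled relevance grades
w = train(train_X, train_y)

docs = ["doc_a", "doc_b", "doc_c"]
print(rank(docs, [[0.1, 0.1], [0.8, 0.7], [0.5, 0.4]], w))
# ['doc_b', 'doc_c', 'doc_a']
```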
Vector Based Search

  • Each word is represented as a numerical vector, using tools like word2vec (Google) or GloVe (Stanford).
    • This representation is called a text (word) embedding.
  • Each vector captures information about what a word means (semantics).
  • Words with similar meanings get similar vectors and can be projected onto a 2-dimensional graph for easy visualization.
  • These vectors are trained so that words with similar meanings end up near each other in the vector space.

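A toy illustration of "similar meanings get similar vectors", using hand-picked 2-D vectors rather than real word2vec/GloVe embeddings (which have hundreds of dimensions). Cosine similarity compares the directions of two vectors:

```python
import math

# Hand-picked toy "embeddings" (not from a real model).
embeddings = {
    "king":  [0.9, 0.8],
    "queen": [0.85, 0.82],
    "apple": [0.1, 0.95],
}

def cosine(a, b):
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest(word):
    """Return the other vocabulary word with the highest cosine similarity."""
    return max((w for w in embeddings if w != word),
               key=lambda w: cosine(embeddings[word], embeddings[w]))

print(nearest("king"))  # queen
```

With these toy vectors, "king" and "queen" point in almost the same direction, so their cosine similarity is near 1, while "apple" points elsewhere.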

Query Rewriting

  • Query expansion is where the query is broadened by introducing additional tokens or phrases.
  • Query relaxation is where tokens are ignored to make the query less restrictive and increase recall.
  • Query translation is where a text-translation paradigm is applied to convert tail (i.e., unpopular, 20th percentile) queries into head queries that account for the bulk of traffic, helping to increase recall.

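Expansion and relaxation can be sketched as simple token-list transformations; the synonym table and stopword list below are made up for illustration:

```python
# Hypothetical synonym and stopword tables, for illustration only.
SYNONYMS = {"sofa": ["couch", "settee"]}
STOPWORDS = {"a", "the", "for"}

def expand(query):
    """Query expansion: broaden the query by adding synonyms for known tokens."""
    tokens = query.split()
    extra = [s for t in tokens for s in SYNONYMS.get(t, [])]
    return tokens + extra

def relax(query):
    """Query relaxation: drop low-information tokens to increase recall."""
    return [t for t in query.split() if t not in STOPWORDS]

print(expand("red sofa"))           # ['red', 'sofa', 'couch', 'settee']
print(relax("a sofa for the den"))  # ['sofa', 'den']
```

Query translation would replace this table lookup with a learned model that maps a rare (tail) query onto a popular (head) one, but the input/output shape is the same: tokens in, rewritten tokens out.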

Relevance Metrics

  • Precision is the fraction of relevant search results among the total search results.
    • If the engine returns 30 results, out of which 15 are relevant, then precision is 15/30 (0.5).
  • Recall is the ability of a search engine to retrieve all the relevant results from the corpus.
    • If there are 50 documents that are relevant, but the search engine returns only 30 of those, then the recall is 30 out of 50.
  • F score (specifically the F1 score) is a single number that represents both precision and recall; it is their harmonic mean.
  • Mean Reciprocal Rank (MRR) guides the search engine to put the most-desired result on the top.
    • It gives a score of 1 if the user clicks the first result, ½ for the second result, ⅓ for the third, and so on.
  • Mean Average Precision (MAP) quantifies the relevance of the top returned results.
  • Normalized Discounted Cumulative Gain (nDCG) is like MAP, but weights the relevance of the result by its position.

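The metrics above can be sketched directly; document IDs and relevance grades below are made up. Note that MRR and MAP are means over many queries, so only the per-query pieces (reciprocal rank, precision) are shown:

```python
import math

def precision_recall(returned, relevant):
    """Precision and recall of one result list against a relevant set."""
    hits = len(set(returned) & set(relevant))
    return hits / len(returned), hits / len(relevant)

def f1(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if p + r else 0.0

def reciprocal_rank(returned, relevant):
    """1/rank of the first relevant result; MRR averages this over queries."""
    for i, doc in enumerate(returned, start=1):
        if doc in relevant:
            return 1 / i
    return 0.0

def dcg(gains):
    """Discounted cumulative gain: relevance discounted by log2 of position."""
    return sum(g / math.log2(i + 1) for i, g in enumerate(gains, start=1))

def ndcg(gains):
    """DCG normalized by the ideal (best possible) ordering of the same gains."""
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal else 0.0

p, r = precision_recall(["d1", "d2", "d3", "d4"], {"d2", "d4", "d9"})
print(round(p, 2), round(r, 2))               # 0.5 0.67
print(reciprocal_rank(["d1", "d2"], {"d2"}))  # 0.5
```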

Terminology

  • Corpus is the complete set of documents that need to be searched.
  • Shingles are word-ngrams. Given a stream of tokens, the shingle filter will create new tokens by concatenating adjacent terms.
    • They give you the ability to pre-bake phrase matching. By building phrases into the index, you can avoid creating phrases at query time and save processing time.
    • The downside is that you have larger indices and potentially more memory usage.
  • Named-entity recognition (NER) locates and classifies named entities mentioned in unstructured text into pre-defined categories such as person names, organizations, and locations.
    • A named entity is a real-world object, such as persons, locations, organizations, products, etc.
    • In the sentence "Biden is the president of the United States", both "Biden" and the "United States" are named entities.
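A sketch of what a shingle (word n-gram) filter produces from a token stream, mirroring the index-time behavior described above; the token stream is made up:

```python
def shingles(tokens, n=2):
    """Create word n-grams ('shingles') by concatenating adjacent tokens,
    as a shingle filter would do at index time."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(shingles(["please", "divide", "this", "sentence"]))
# ['please divide', 'divide this', 'this sentence']
```

Each shingle is indexed as a single term, which is why phrase queries can then be answered with plain term lookups, at the cost of a larger index.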