Last active April 26, 2022 10:14
Search Cliff Notes


Learning to Rank

  • Applies supervised machine learning to search relevance ranking.
  • A model is trained on preprocessed data whose results carry hand-labeled relevance ranks.
  • The trained model is then used to rank (optimize) future search results.

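A minimal pointwise sketch of the idea, assuming a simple linear model and made-up features (a text-match score and a click rate); production learning-to-rank systems use richer features and models (e.g. gradient-boosted trees), but the shape is the same: fit on hand-labeled relevance, then re-rank.

```python
# Pointwise learning-to-rank sketch (features and labels are made up for
# illustration). Each document is a feature vector; the hand-labeled
# relevance grade is the regression target.

def train(features, labels, lr=0.1, epochs=500):
    """Fit w in score = w . x by gradient descent on squared error."""
    w = [0.0] * len(features[0])
    n = len(features)
    for _ in range(epochs):
        for j in range(len(w)):
            grad = 0.0
            for x, y in zip(features, labels):
                pred = sum(wi * xi for wi, xi in zip(w, x))
                grad += (pred - y) * x[j]
            w[j] -= lr * grad / n
    return w

def rank(docs, features, w):
    """Score each document with the learned model and sort best-first."""
    scored = [(sum(wi * xi for wi, xi in zip(w, x)), d)
              for d, x in zip(docs, features)]
    return [d for _, d in sorted(scored, reverse=True)]

# Hypothetical features: [text-match score, historical click rate]
train_X = [[0.9, 0.8], [0.7, 0.1], [0.2, 0.9], [0.1, 0.05]]
train_y = [1.0, 0.6, 0.5, 0.0]   # hand-labeled relevance grades
w = train(train_X, train_y)

docs = ["doc_a", "doc_b", "doc_c"]
print(rank(docs, [[0.1, 0.1], [0.8, 0.7], [0.5, 0.4]], w))
# ['doc_b', 'doc_c', 'doc_a']
```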
Vector Based Search

  • Each word is represented as a numerical vector, using tools like word2vec (Google) or GloVe (Stanford).
    • This representation is called a text (word) embedding.
  • Each vector captures information about what a word means (semantics).
  • Words with similar meanings get similar vectors and can be projected onto a 2-dimensional graph for easy visualization.
  • These vectors are trained so that words with similar meanings end up near each other in the vector space.

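A toy illustration of "similar meanings get similar vectors", using hand-picked 2-D vectors rather than real word2vec/GloVe embeddings (which have hundreds of dimensions). Cosine similarity compares the directions of two vectors:

```python
import math

# Hand-picked toy "embeddings" (not from a real model).
embeddings = {
    "king":  [0.9, 0.8],
    "queen": [0.85, 0.82],
    "apple": [0.1, 0.95],
}

def cosine(a, b):
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest(word):
    """Return the other vocabulary word with the highest cosine similarity."""
    return max((w for w in embeddings if w != word),
               key=lambda w: cosine(embeddings[word], embeddings[w]))

print(nearest("king"))  # queen
```

With these toy vectors, "king" and "queen" point in almost the same direction, so their cosine similarity is near 1, while "apple" points elsewhere.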

Query Rewriting

  • Query expansion is where the query is broadened by introducing additional tokens or phrases.
  • Query relaxation is where tokens are ignored to make the query less restrictive and increase recall.
  • Query translation is where a text-translation paradigm is applied to convert tail (i.e., unpopular, 20th percentile) queries into head queries that account for the bulk of traffic, helping to increase recall.

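Expansion and relaxation can be sketched as simple token-list transformations; the synonym table and stopword list below are made up for illustration:

```python
# Hypothetical synonym and stopword tables, for illustration only.
SYNONYMS = {"sofa": ["couch", "settee"]}
STOPWORDS = {"a", "the", "for"}

def expand(query):
    """Query expansion: broaden the query by adding synonyms for known tokens."""
    tokens = query.split()
    extra = [s for t in tokens for s in SYNONYMS.get(t, [])]
    return tokens + extra

def relax(query):
    """Query relaxation: drop low-information tokens to increase recall."""
    return [t for t in query.split() if t not in STOPWORDS]

print(expand("red sofa"))           # ['red', 'sofa', 'couch', 'settee']
print(relax("a sofa for the den"))  # ['sofa', 'den']
```

Query translation would replace this table lookup with a learned model that maps a rare (tail) query onto a popular (head) one, but the input/output shape is the same: tokens in, rewritten tokens out.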

Relevance Metrics

  • Precision is the fraction of relevant search results among the total search results.
    • If the engine returns 30 results, out of which 15 are relevant, then precision is 15/30 (0.5).
  • Recall is the ability of a search engine to retrieve all the relevant results from the corpus.
    • If there are 50 documents that are relevant, but the search engine returns only 30 of those, then the recall is 30 out of 50.
  • F score (specifically the F1 score) is a single number that represents both precision and recall; it is their harmonic mean.
  • Mean Reciprocal Rank (MRR) guides the search engine to put the most-desired result on the top.
    • It gives a score of 1 if the user clicks the first result, ½ for the second result, ⅓ for the third, and so on.
  • Mean Average Precision (MAP) quantifies the relevance of the top returned results.
  • Normalized Discounted Cumulative Gain (nDCG) is like MAP, but weights the relevance of the result by its position.

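The metrics above can be sketched directly; document IDs and relevance grades below are made up. Note that MRR and MAP are means over many queries, so only the per-query pieces (reciprocal rank, precision) are shown:

```python
import math

def precision_recall(returned, relevant):
    """Precision and recall of one result list against a relevant set."""
    hits = len(set(returned) & set(relevant))
    return hits / len(returned), hits / len(relevant)

def f1(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if p + r else 0.0

def reciprocal_rank(returned, relevant):
    """1/rank of the first relevant result; MRR averages this over queries."""
    for i, doc in enumerate(returned, start=1):
        if doc in relevant:
            return 1 / i
    return 0.0

def dcg(gains):
    """Discounted cumulative gain: relevance discounted by log2 of position."""
    return sum(g / math.log2(i + 1) for i, g in enumerate(gains, start=1))

def ndcg(gains):
    """DCG normalized by the ideal (best possible) ordering of the same gains."""
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal else 0.0

p, r = precision_recall(["d1", "d2", "d3", "d4"], {"d2", "d4", "d9"})
print(round(p, 2), round(r, 2))               # 0.5 0.67
print(reciprocal_rank(["d1", "d2"], {"d2"}))  # 0.5
```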

Terminology

  • Corpus is the complete set of documents that need to be searched.
  • Shingles are word-ngrams. Given a stream of tokens, the shingle filter will create new tokens by concatenating adjacent terms.
    • They give you the ability to pre-bake phrase matching. By building phrases into the index, you can avoid creating phrases at query time and save processing time.
    • The downside is that you have larger indices and potentially more memory usage.
  • Named-entity recognition (NER) locates and classifies named entities mentioned in unstructured text into pre-defined categories such as person names, organizations, and locations.
    • A named entity is a real-world object, such as persons, locations, organizations, products, etc.
    • In the sentence "Biden is the president of the United States", both "Biden" and the "United States" are named entities.
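A sketch of what a shingle (word n-gram) filter produces from a token stream, mirroring the index-time behavior described above; the token stream is made up:

```python
def shingles(tokens, n=2):
    """Create word n-grams ('shingles') by concatenating adjacent tokens,
    as a shingle filter would do at index time."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(shingles(["please", "divide", "this", "sentence"]))
# ['please divide', 'divide this', 'this sentence']
```

Each shingle is indexed as a single term, which is why phrase queries can then be answered with plain term lookups, at the cost of a larger index.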