Elasticsearch basic definitions

Basic Concepts

  • Node - a server that stores data
  • Cluster - a collection of nodes
  • Index - collection of similar documents

Distributed Data

A shard is a subset of the index data.

Sharding solves the problem where the size of an index exceeds the hardware limits of a single node.

The default number of shards for an index is 5 (1 as of Elasticsearch 7.0). After the index is created, the number of shards can't be changed.
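
A minimal sketch of setting the shard count when creating an index (the index name products is just a placeholder):

```
PUT /products
{
  "settings": {
    "number_of_shards": 3
  }
}
```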

Replication

The purpose of having replicas in ES:

  1. High availability (in case a shard or node fails)
  2. Increase performance

The default number of replicas is 1 per shard.
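
Unlike the shard count, the number of replicas can be changed on an existing index. A sketch against a hypothetical products index:

```
PUT /products/_settings
{
  "index": {
    "number_of_replicas": 2
  }
}
```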

Routing - handled automatically by default. Ensures that documents are distributed evenly across shards.

Mapping

A mapping defines how documents and their fields are stored and indexed. A field can have multiple mappings.
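
As an illustration, a sketch of a field with multiple mappings: a text mapping for full-text search plus a keyword sub-field for exact matching and aggregations (the index and field names are made up; assumes a version without mapping types):

```
PUT /products
{
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "fields": {
          "raw": { "type": "keyword" }
        }
      }
    }
  }
}
```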

Field Data Types

Can be divided into 4 categories:

  1. Core Data Types
  2. Complex Data Types
  3. Geo Data Types
  4. Specialized Data Types

Core Data Types

  • Text Data Type - used to index full-text values (e.g. descriptions)
  • Keyword Data Type - used for structured data (e.g. tags, categories). Typically used for filtering and aggregations.
  • Numeric Data Type
  • Date Data Type
  • Boolean Data Type
  • Binary Data Type
  • Range Data Type
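
A sketch of a mapping that uses several of the core data types above (field names are hypothetical):

```
PUT /products
{
  "mappings": {
    "properties": {
      "description": { "type": "text" },
      "tags":        { "type": "keyword" },
      "price":       { "type": "float" },
      "in_stock":    { "type": "boolean" },
      "created_at":  { "type": "date" }
    }
  }
}
```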

Complex Data Types

  • Object Data Type
  • Array Data Type
  • Nested Data Type

Geo Data Types

  • Geo-Point Data Type
  • Geo-Shape Data Type
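
For illustration, an object field next to a nested field in a mapping; nested keeps each array element as a separate hidden document so its sub-fields can be queried together (names are made up):

```
PUT /products
{
  "mappings": {
    "properties": {
      "manufacturer": {
        "properties": {
          "name":    { "type": "text" },
          "country": { "type": "keyword" }
        }
      },
      "reviews": {
        "type": "nested",
        "properties": {
          "author": { "type": "keyword" },
          "rating": { "type": "integer" }
        }
      }
    }
  }
}
```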

Analysis

New Document -> Analysis -> Store Document

Keyword fields DO NOT go through this process; text fields do.

You can control which analyzer to use. Results are added to the inverted index.

There's one inverted index per text field. It allows ES to efficiently perform full-text searches. It's basically a mapping of a field's terms and which documents contain each term.
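
To see which terms would end up in the inverted index, the _analyze API runs a piece of text through an analyzer (the text here is just an example):

```
POST /_analyze
{
  "analyzer": "standard",
  "text": "The QUICK brown fox!"
}
```

The response lists the tokens the, quick, brown, fox - the terms that would be stored in the inverted index.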

Analyzer

Process has 3 steps:

  1. Character filter - can add, remove, or change characters.
  2. Tokenizer - splits the text into words and removes punctuation such as , and ; (e.g. the standard tokenizer).
  3. Token filter - may add, change, or remove tokens (e.g. the lowercase token filter, synonym token filter, stemmer token filter, or stop token filter, which removes words like and, at, the).

Of these token filters, only the lowercase filter is enabled by default.

The standard analyzer removes punctuation and lowercases words. Optionally, the stop token filter can be enabled.
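
A sketch of a custom analyzer that combines the three steps - a character filter, a tokenizer, and token filters (the index and analyzer names are made up):

```
PUT /products
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "char_filter": ["html_strip"],
          "tokenizer": "standard",
          "filter": ["lowercase", "stop"]
        }
      }
    }
  }
}
```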

Relevance Scores

TF/IDF Algorithm

Term Frequency/ Inverse Document Frequency

  1. Term Frequency - the more times a term appears in a field for a given document, the more relevant it is.

  2. Inverse Document Frequency - how often the term appears within the index (across all documents). The logic here is that if a term appears in a lot of documents it has a lower weight. This means that words that appear many times have less significance (e.g. this, the, etc.). If a document contains a term that is not frequent in the index, it's a signal that the document is relevant.

  3. Field Length Norm - the longer the field, the less relevant the term (e.g. nature in a 50-character title is more relevant than nature in a 1,000-character description). A term appearing in a short field has more weight than a term appearing in a long field.

These 3 values are calculated and stored at index time (when a document is added or updated). These values are used to calculate the weight of a given term for a particular document.
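
Not from these notes, but as a rough reference, the classic Lucene TF/IDF components look roughly like this (my assumption of the default formulas, simplified):

```latex
\mathrm{tf}(t, d) = \sqrt{\mathrm{freq}(t, d)} \qquad
\mathrm{idf}(t) = 1 + \ln\frac{N}{\mathrm{df}(t) + 1} \qquad
\mathrm{norm}(d) = \frac{1}{\sqrt{\text{terms in the field}}}
```

Here N is the total number of documents and df(t) is the number of documents containing the term t; the three factors correspond to the three points above.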

Okapi BM25 Algorithm

  • Handles stop words better. Although the value of stop words is limited, they do have some value, and it's no longer common or recommended to remove them. In the TF/IDF algorithm, stop words are artificially boosted in longer fields (where they tend to appear more often, e.g. a description).

To solve this problem, BM25 uses nonlinear term frequency saturation, meaning there's an upper limit on how much a term can be boosted based on how many times it appears. As the number of occurrences increases, the additional boost becomes less and less significant (see the formula sketch below).

  • Improves the field-length norm factor. Instead of treating a field the same way across all documents, it takes the average field length into consideration.

  • Can be configured with parameters
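
As a sketch (the standard BM25 form, not taken from these notes), the per-term score shows both the saturation and the field-length factor, with k1 and b as the configurable parameters:

```latex
\mathrm{score}(t, d) = \mathrm{idf}(t) \cdot
\frac{\mathrm{tf}(t, d) \cdot (k_1 + 1)}
     {\mathrm{tf}(t, d) + k_1 \cdot \left(1 - b + b \cdot \frac{|d|}{\mathrm{avgdl}}\right)}
```

Here |d| is the field length and avgdl the average field length. As tf grows, the score approaches an upper bound controlled by k1, and b controls how strongly the field length is taken into account.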

Search

Query Context

How well do the documents match? Affects the relevance score.

Filter Context

Do the documents match? A boolean yes/no evaluation. ES can cache filters.
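
A sketch showing both contexts in one bool query: the must clause runs in the query context (scored), the filter clause in the filter context (yes/no, cacheable). Index and field names are hypothetical:

```
GET /products/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "description": "red wine" } }
      ],
      "filter": [
        { "range": { "price": { "lte": 20 } } }
      ]
    }
  }
}
```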

Term level queries

Search for exact matches (case sensitive; the query is not analyzed). Better suited for matching enums, numbers, and dates than full sentences.
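
A minimal term query sketch against a hypothetical keyword field (exact value, not analyzed):

```
GET /products/_search
{
  "query": {
    "term": {
      "status": "active"
    }
  }
}
```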

Range queries

Used with number or date fields. Date math can be used in range values (see the Elasticsearch Date Math docs).
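
A range query sketch using date math (the field name is made up):

```
GET /products/_search
{
  "query": {
    "range": {
      "created_at": {
        "gte": "now-1y/d",
        "lte": "now/d"
      }
    }
  }
}
```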

Full-text queries (match queries)

They are analyzed using the analyzer defined for the search field or the standard analyzer if none is defined.

Match query - It's a boolean query (default operator is OR). The query goes through the analyzer specified in the mapping.
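
A match query sketch; switching operator to and requires all terms to match instead of the default OR behavior (the field name is hypothetical):

```
GET /products/_search
{
  "query": {
    "match": {
      "description": {
        "query": "red wine",
        "operator": "and"
      }
    }
  }
}
```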

Bool queries (must, filter, should, must_not)

Should Query - Its behavior depends on the bool query as a whole and what else is in the query:

  1. If the bool query is in a query context and contains a must or filter clause, the should queries don't need to match for the bool query as a whole to match; their only purpose is to influence the relevance score of the matching documents.

  2. If the bool query is in the filter context, or if it doesn't have a must or filter clause, at least one of the should queries must match.
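
A sketch of case 1 above: the must clause decides which documents match, while the should clause only boosts documents that also contain the extra term (names are hypothetical):

```
GET /products/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "description": "wine" } }
      ],
      "should": [
        { "match": { "description": "organic" } }
      ]
    }
  }
}
```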
