Elasticsearch basic definitions

Basic Concepts

  • Node - a server that stores data
  • Cluster - a collection of nodes
  • Index - collection of similar documents

Distributed Data

A shard is a subset of the index data.

Sharding solves the problem where the size of an index exceeds the hardware limits of a single node.

The default number of shards for an index is 5 (1 as of Elasticsearch 7.0). After the index is created, the number of shards can't be changed.
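
A minimal sketch of setting the shard count when creating an index (the index name products is just a placeholder):

```
PUT /products
{
  "settings": {
    "number_of_shards": 3
  }
}
```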

Replication

The purpose of having replicas in ES:

  1. High availability (in case a shard or node fails)
  2. Increase performance

The default number of replicas is 1 per shard.
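
Unlike the shard count, the number of replicas can be changed on an existing index. A sketch against a hypothetical products index:

```
PUT /products/_settings
{
  "index": {
    "number_of_replicas": 2
  }
}
```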

Routing - handled automatically by default. Ensures that documents are distributed evenly across shards.

Mapping

A mapping defines how documents and their fields are stored and indexed. A field can have multiple mappings.
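
As an illustration, a sketch of a field with multiple mappings: a text mapping for full-text search plus a keyword sub-field for exact matching and aggregations (the index and field names are made up; assumes a version without mapping types):

```
PUT /products
{
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "fields": {
          "raw": { "type": "keyword" }
        }
      }
    }
  }
}
```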

Field Data Types

Can be divided into 4 categories:

  1. Core Data Types
  2. Complex Data Types
  3. Geo Data Types
  4. Specialized Data Types

Core Data Types

  • Text Data Type - used to index full-text values (e.g. descriptions)
  • Keyword Data Type - used for structured data (e.g. tags, categories). Typically used for filtering and aggregations.
  • Numeric Data Type
  • Date Data Type
  • Boolean Data Type
  • Binary Data Type
  • Range Data Type
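
A sketch of a mapping that uses several of the core data types above (field names are hypothetical):

```
PUT /products
{
  "mappings": {
    "properties": {
      "description": { "type": "text" },
      "tags":        { "type": "keyword" },
      "price":       { "type": "float" },
      "in_stock":    { "type": "boolean" },
      "created_at":  { "type": "date" }
    }
  }
}
```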

Complex Data Types

  • Object Data Type
  • Array Data Type
  • Nested Data Type

Geo Data Types

  • Geo-Point Data Type
  • Geo-Shape Data Type
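
For illustration, an object field next to a nested field in a mapping; nested keeps each array element as a separate hidden document so its sub-fields can be queried together (names are made up):

```
PUT /products
{
  "mappings": {
    "properties": {
      "manufacturer": {
        "properties": {
          "name":    { "type": "text" },
          "country": { "type": "keyword" }
        }
      },
      "reviews": {
        "type": "nested",
        "properties": {
          "author": { "type": "keyword" },
          "rating": { "type": "integer" }
        }
      }
    }
  }
}
```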

Analysis

New Document -> Analysis -> Store Document

Keyword fields DO NOT go through this process; text fields do.

You can control which analyzer to use. Results are added to the inverted index.

There's one inverted index per text field. It allows ES to efficiently perform full-text searches. It's basically a mapping of a field's terms and which documents contain each term.
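
To see which terms would end up in the inverted index, the _analyze API runs a piece of text through an analyzer (the text here is just an example):

```
POST /_analyze
{
  "analyzer": "standard",
  "text": "The QUICK brown fox!"
}
```

The response lists the tokens the, quick, brown, fox - the terms that would be stored in the inverted index.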

Analyzer

Process has 3 steps:

  1. Character filter - can add, remove, or change characters.
  2. Tokenizer - splits the text into words and removes punctuation such as , and ; (e.g. the standard tokenizer).
  3. Token filter - may add, change, or remove tokens (e.g. the lowercase token filter, synonym token filter, stemmer token filter, or stop token filter, which removes words like and, at, the).

Of these token filters, only the lowercase filter is enabled by default.

The standard analyzer removes punctuation and lowercases words. Optionally, the stop token filter can be enabled.
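
A sketch of a custom analyzer that combines the three steps - a character filter, a tokenizer, and token filters (the index and analyzer names are made up):

```
PUT /products
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "char_filter": ["html_strip"],
          "tokenizer": "standard",
          "filter": ["lowercase", "stop"]
        }
      }
    }
  }
}
```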

Relevance Scores

TF/IDF Algorithm

Term Frequency/ Inverse Document Frequency

  1. Term Frequency - the more times a term appears in a field for a given document, the more relevant it is.

  2. Inverse Document Frequency - how often the term appears within the index (across all documents). The logic here is that if a term appears in a lot of documents it has a lower weight. This means that words that appear many times have less significance (e.g. this, the, etc.). If a document contains a term that is not frequent in the index, it's a signal that the document is relevant.

  3. Field Length Norm - the longer the field, the less relevant the term (e.g. nature in a 50-character title is more relevant than nature in a 1,000-character description). A term appearing in a short field has more weight than a term appearing in a long field.

These 3 values are calculated and stored at index time (when a document is added or updated). These values are used to calculate the weight of a given term for a particular document.
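
Not from these notes, but as a rough reference, the classic Lucene TF/IDF components look roughly like this (my assumption of the default formulas, simplified):

```latex
\mathrm{tf}(t, d) = \sqrt{\mathrm{freq}(t, d)} \qquad
\mathrm{idf}(t) = 1 + \ln\frac{N}{\mathrm{df}(t) + 1} \qquad
\mathrm{norm}(d) = \frac{1}{\sqrt{\text{terms in the field}}}
```

Here N is the total number of documents and df(t) is the number of documents containing the term t; the three factors correspond to the three points above.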

Okapi BM25 Algorithm

  • Handles stop words better. Although the value of stop words is limited, they do have some value, and it's no longer common or recommended to remove them. In the TF/IDF algorithm, stop words are artificially boosted in longer fields (where they tend to appear more often, e.g. a description).

To solve this problem, BM25 uses nonlinear term frequency saturation, meaning there's an upper limit on how much a term can be boosted based on how many times it appears. As the number of occurrences increases, the additional boost becomes less and less significant (see the formula sketch below).

  • Improves the field-length norm factor. Instead of treating a field the same way across all documents, it takes the average field length into consideration.

  • Can be configured with parameters
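
As a sketch (the standard BM25 form, not taken from these notes), the per-term score shows both the saturation and the field-length factor, with k1 and b as the configurable parameters:

```latex
\mathrm{score}(t, d) = \mathrm{idf}(t) \cdot
\frac{\mathrm{tf}(t, d) \cdot (k_1 + 1)}
     {\mathrm{tf}(t, d) + k_1 \cdot \left(1 - b + b \cdot \frac{|d|}{\mathrm{avgdl}}\right)}
```

Here |d| is the field length and avgdl the average field length. As tf grows, the score approaches an upper bound controlled by k1, and b controls how strongly the field length is taken into account.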

Search

Query Context

How well do the documents match? Affects the relevance score.

Filter Context

Do the documents match? A boolean yes/no evaluation. ES can cache filters.
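
A sketch showing both contexts in one bool query: the must clause runs in the query context (scored), the filter clause in the filter context (yes/no, cacheable). Index and field names are hypothetical:

```
GET /products/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "description": "red wine" } }
      ],
      "filter": [
        { "range": { "price": { "lte": 20 } } }
      ]
    }
  }
}
```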

Term level queries

Search for exact matches (case sensitive; the query is not analyzed). Better suited for matching enums, numbers, and dates than full sentences.
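
A minimal term query sketch against a hypothetical keyword field (exact value, not analyzed):

```
GET /products/_search
{
  "query": {
    "term": {
      "status": "active"
    }
  }
}
```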

Range queries

Used with number or date fields. Date math can be used in range values (see the Elasticsearch Date Math docs).
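
A range query sketch using date math (the field name is made up):

```
GET /products/_search
{
  "query": {
    "range": {
      "created_at": {
        "gte": "now-1y/d",
        "lte": "now/d"
      }
    }
  }
}
```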

Full-text queries (match queries)

They are analyzed using the analyzer defined for the search field or the standard analyzer if none is defined.

Match query - It's a boolean query (default operator is OR). The query goes through the analyzer specified in the mapping.
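
A match query sketch; switching operator to and requires all terms to match instead of the default OR behavior (the field name is hypothetical):

```
GET /products/_search
{
  "query": {
    "match": {
      "description": {
        "query": "red wine",
        "operator": "and"
      }
    }
  }
}
```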

Bool queries (must, filter, should, must_not)

Should Query - Its behavior depends on the bool query as a whole and what else is in the query:

  1. If the bool query is in a query context and contains a must or filter clause, the should queries don't need to match for the bool query as a whole to match; their only purpose is to influence the relevance score of the matching documents.

  2. If the bool query is in the filter context, or if it doesn't have a must or filter clause, at least one of the should queries must match.
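
A sketch of case 1 above: the must clause decides which documents match, while the should clause only boosts documents that also contain the extra term (names are hypothetical):

```
GET /products/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "description": "wine" } }
      ],
      "should": [
        { "match": { "description": "organic" } }
      ]
    }
  }
}
```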
