Skip to content

Instantly share code, notes, and snippets.

@umutseven92
Last active May 13, 2022 10:41
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save umutseven92/07f62ee999e44672915565a14f11db3b to your computer and use it in GitHub Desktop.
Save umutseven92/07f62ee999e44672915565a14f11db3b to your computer and use it in GitHub Desktop.
Elasticsearch Cliff Notes

Elasticsearch

Architecture

  • Nodes join a cluster (named elasticsearch by default).
    • One node becomes the master node.
  • Each index data is divided into shards.
    • Internally, an index is a logical namespace that points to one or more shards.
    • 5 shards by default.
    • Due to how routing works, these shards cannot be increased later; you would need to create a new index.
  • These shards are distributed among nodes.
    • When you index a document, Elasticsearch will determine which shard the document should be routed to for indexing.
    • Same routing will be used when Elasticsearch is retrieving the document.
    • You can enable custom routing, mainly for performance reasons.
    • A good rule-of-thumb is to ensure you keep the number of shards per node below 20 per GB heap it has configured. A node with a 30GB heap should therefore have a maximum of 600 shards.
  • When a new node is added or removed, Elasticsearch will redistribute the shards.
  • Each shard is a Lucene index.
    • A Lucene index can be tought as a self-contained search engine, with its own syntax different to Elasticsearch.
  • Each shard gets replicated, and stored in another node. These are replica shards.
    • The primary shard is responsible for keeping replica shards updated.
    • Replica shards are used for both availability and searching.
  • Lucene index contains segments.
    • A segment is an inverted index.
    • A search in a shard will search each segment in turn, then combine their results into the final results for that shard.
    • These segments are immutable.
  • When data is written, it is published into segments.
    • While you are indexing documents, Elasticsearch collects them in memory (and in the transaction log, for safety) then every second or so, writes a new small segment to disk (Flush), and refreshes the search (Refresh).
    • A Refresh is what makes the data in the new segment visible to search.
      • This makes the in-memory data searchable, without needing to write it to disk.
      • A Refresh is done every second, but only on indices that have received one search request or more in the last 30 seconds.
    • A Flush is what actually writes the data into disk.
      • A Flush triggeres the Lucene commit and empties the transaction log.
      • Flush happens automatically depending on how many operations get added to the transaction log, how big they are, and when the last Flush happened.
    • Both refresh and flush can be triggered by _refresh and _flush APIs respectively.
  • As segments grow, they get merged into bigger segments.
    • This is called a Merge.
    • Once the new bigger segment is written, the old segments are dropped.
    • Merging can be quite resource intensive, especially with respect to disk I/O.
    • Maximum segment size is 5GBs; segments bigger than this won't be considered for merging.
    • Merging can be forced by the _forcemerge endpoint.
      • This should only be done on read-only indices, as it can result in >5GB segments to be produced, which means will no longer be considered by future merge requests. Any documents you add to the segment after running forcemerge will never get cleaned up (for example, if they are marked for deletion).
  • When a document is deleted, it gets marked to be deleted, and is not visible. It actually gets deleted during a merge.
  • Each shard copy writes operations (index and delete operations) into its transaction log (translog).
    • This is to prevent data loss on operations that haven't been written to disk yet with a Flush.
    • If a crash happens before the flush, the operations in the transaction log are restored next time Elasticsearch boots up.

Node Types

  • All node types are determined by the roles it gets assigned.
  • All nodes know about all the other nodes in the cluster and can forward client requests to the appropriate node.

Master Node

  • Role: master
  • Only one gets selected as a master node in a cluster.
  • They are responsible for creating or deleting indexes, tracking nodes, and allocating shards to nodes.

Data Nodes

  • Roles: data, data_content, data_content, data_hot, data_warm, data_cold, data_frozen
  • Data nodes are responsible for holding data and performing data-related CRUD operations, like indexing, search, and aggregations.
  • You can configure data nodes so that they only do search and aggregation, not any indexing, to reduce the load in the individual nodes.

Ingest Node

  • Role: ingest
  • Applies ingest pipelines to a documents in order to transform and enrich the document before indexing.
  • With a heavy ingest load, it makes sense to use dedicated ingest nodes.

Coordinating Only Node

  • Role: None
  • These nodes act as load balancers; they know where specific documents can reside and serve search requests only to those nodes.
  • A node is assigned this role if it has no roles defined.

Documents

  • Fields are the smallest individual unit of data in Elasticsearch.
  • These fields are customizable and could include, for example: title, author, date, summary, team, score, etc.
    • Multi-fields are fields that can be indexed in more than one way to produce more search results.
    • Meta-fields deal with a document’s metadata and usually start with an underscore.
  • Mapping defines the different types that reside within an index.
  • Mapping can be done via the API, or via Index Templates.
  • You can’t change the mapping or field type of an existing field.
  • The original document body is stored in the _source field.

Analysis

  • Result of the analysis gets stored in the inverted index.
    • An inverted index is a data structure that maps terms to the ID's of the documents they are found in.
      • ID Term Documents (Posting List)
        1 blue 1
        2 butterfly 2,3
        3 brutus 3
    • When a search is performed, this inverted index is what gets searched.
    • The original document is then retrieved from disk and returned as _source.
  • In most cases, the same analyzer is used at index and search time.
  • Analysis steps are Character Filter -> Tokenizer -> Token Filter
    • Character filters receive the original text as a stream of characters and transform the stream by adding, removing, or changing characters.
      • Example Character filters are html_strip, mapping and pattern_replace.
    • Tokenizers receive a stream of characters, break it up into individual tokens (usually individual words), and output a stream of tokens.
      • They also record the position of each term, start & end character offsets, and token type.
      • Example Tokenizers are standart, whitespace and n-gram.
    • Token filters accept a stream of tokens from a tokenizer and can modify tokens, delete tokens or add tokens.
      • Example Token filters are lowercase, stop and synonym.
  • The default analyser (standard) contains the Standard tokenizer and the Lowercase token filter.
  • Analysers can be debugged via the _analyze endpoint.

Queries

  • Term queries do not do analysis; it matches exact terms.
  • Match queries analyze the search input.
  • track_total_hits controls the computation of total hit counts.
    • Setting it to true will show the exact count, with the expense of a slower query.
    • Setting it to an integer will show the exact count, up to the defined value.
  • timeout is used to set a time out for queries.
  • match_all matches all documents, giving them a _score of 1.
  • match query returns documents matching a provided field.
  • match_phrase query returns documents that match the phrase provided.
    • slop paramter is supported.
  • multi_match allows for multi-field queries.
    • best_fields finds documents which match any field, but uses the _score from the best field. This is equivalent to using a dis_max query with match on every field.
    • most_fields finds documents which match any field and combines the _score from each field. This is equivalent to using a boolean query with should and match on every field.
    • cross_fields treats fields with the same analyzer as though they were one big field, and looks for each word in any field. All terms must be present in at least one field for a document to match.
  • query_string uses a syntax to return documents.
    • Very versatile, but the query is strict and returns an error if the query string includes any invalid syntax.
  • simple_query_string is similar, but uses a simpler syntax.
    • Unlike query_string, does not return errors for invalid syntax- it ignores any invalid parts of the query string.
  • dis_max returns documents that match defined multiple sub-queries.
    • If tie_breaker is 0, the highest scoring match is used.
    • Else, the lesser scoring matches are multiplied by tie_breaker, and added to the highest scoring match.
    • This equivalent to using a multi_match query with best_fields enabled.
  • combined_fields query supports searching multiple text fields as if their contents had been indexed into one combined field.
    • It analyzes the query string into individual terms, then looks for each term in any of the fields.
    • Only supports text fields that have the same analyser, unlike multi_match that supports multiple types of fields with different analysers.

Boolean Queries

  • must: The query must appear in matching documents and will contribute to the score.
  • filter: The query must appear in matching documents, but will not contribute to the score.
  • should: The should appear in matching documents and will contribute to the score.
  • must_not: The query must not appear in matching documents, and will not contribute to the score.
  • The score from each matching must or should clause will be added together to provide the final _score for each document.
  • minimum_should_match specifies the number or percentage of should clauses returned documents must match.
  • Each query accepts a _name in its top level definition.
    • If named queries are used, the response includes a matched_queries property for each hit.

Relevance

  • Elasticsearch uses the BM25 similarity algorithm for relevance scoring, more specifically, the Okapi BM25 implementation.
  • There are many similarity models available, and most settings can be configured.
  • The _explain endpoint can be used to debug the relevance score.
  • Per document, score = qi ∑ boost * idf * tf
  • For performance reasons, Elasticsearch doesn’t calculate scores across all documents in the index. Instead, each shard calculates their own local TF/IDF for the documents contained in that shard.
    • This means that local score and global score can be different, which can produce incorrect results.
    • This difference diminish the more documents that you add to the index, so in practice this is not a problem.
    • dfs_query_then_fetch can be used to solve this problem, but is not recommended.

Boost (boost)

  • boost is the boosting done during query time.
  • Index boosting is strongly discouraged due to adverse effects, like having to reindex all documents to change an index-time boost.

Inverse Document Frequency (idf)

  • Calculated as log(1 + (N - n + 0.5) / (n + 0.5))
  • N is the total amount of documents, and n is the amount of documents which contain the term qi.
  • Queries containing rarer terms have a higher multiplier, so they contribute more to the final score.
  • It penalizes terms that are common.
  • Unlike tf, idf cannot be disabled.

Term Frequency (tf)

  • Calculated as freq / (freq + k1 * (1 - b + b * dl / avgdl))
  • freq is the amount of times the term appears in the document.
    • The more times the query term(s) occur a document, the higher its score will be. T
  • k1 is the term frequency saturation paramater.
    • The default value is 1.2.
    • It limits how much a single query term can affect the score of a given document.
    • It curbs the increase in freq.
    • k1 is typically evaluated in the 0 to 3 range, with the optimal range being 0.5 to 2.0.
  • b is the field length normalisation parameter.
    • The default value is 0.75.
    • If b is bigger, the effects of the length of the document compared to the average length are more amplified.
    • If b is 0, the length of the document would have no bearing on the score.
    • b needs to be between 0 and 1, with the optimal range being 0.3 to 0.9.
  • dl is the length of the field being searched.
  • avgdl is the average length of the field being searched in all documents.
  • dl/avgdl means that the more terms in the document that do not match the query, the lower the score for the document.

Searching

Query Then Fetch

  • This is the default search type.
  • Can cause relevance disrepancies, especially with small number of documents.

Steps

  1. Send the query to each shard
  2. Find all matching documents and calculate scores using local TF/IDF
  3. Build a priority queue of results
  4. Return metadata about the results to requesting node
  5. Scores from all the shards are merged and sorted on the requesting node, documents are selected according to query
  6. The actual docs are retrieved from individual shards where they reside
  7. Results are returned to the client

Distributed Frequency Serach Query Then Fetch

  • Can be activated by setting dfs_query_then_fetch.
  • Not recommended in practice due to low performance

Steps

  1. Prequery each shard asking about TF/IDF
  2. Send the query to each shard
  3. Find all matching documents and calculate scores using global TF/IDF, calculated from the prequery
  4. Build a priority queue of results
  5. Return metadata about the results to requesting node
  6. Scores from all the shards are merged and sorted on the requesting node, documents are selected according to query
  7. The actual docs are retrieved from individual shards where they reside
  8. Results are returned to the client

Performance

  • Writing a proper search query is the main factor influencing search performance in Elasticsearch.
  • To increase the speed of the search, there are two important methods that can be used:
    • Custom routing: Makes it possible to store the chosen data on the same shard. Only one shard will thereby be searched in order to satisfy the query.
    • Force merging: Merging segments continuously until the value of max_num_segments in a shard is reduced to 1.
      • When the number of segments and shards is high, the force merging process will be slow. For example, merging 10,000 segments to 5,000 segments takes less time than merging 10,000 segments to one.
      • This will affect the resources required to perform the process, which will also affect the search requests.
      • In that case, it is recommended to schedule Force Merging during non-busy hours.
  • Assets like the CPU, RAM, and operating system will also affect performance.

Parameters

  • index.refresh_interval: How often to perform a refresh operation, which makes recent changes to the index visible to search. Default is 1s.
    • If not set, shards that haven’t seen search traffic for at least index.search.idle.after seconds will not receive background refreshes until they receive a search request.
  • index.number_of_replicas: Number of replicas each primary shard has. Default 1.
  • indices.memory.index_buffer_size, indices.memory.min_index_buffer_size, indices.memory.max_index_buffer_size: The indexing buffer is used to store newly indexed documents. When it fills up, the documents in the buffer are written to a segment on disk. It is divided between all shards on the node.
  • index.translog.flush.threshold.size: Make a flush after reaching specific size
  • index.translog.retention.age: Duration for keeping a translog file
  • index.translog.sync.interval: How of the translog is synced to disk
  • index.number_of_shards: Number of primary shards per index. Default 1, used to be 5 pre version 7.0.
    • The best practice is to set the number of shards on an index-by-index basis based on the number of nodes you run, your ingest method, your search style, and any data rollover being done.
  • index.shard.check_on_startup: Whether shards should be checked for corruption before opening. Default is false.
    • This is an expensive operation and should only be used when debugging a problem.
    • Elasticsearch automatically performs integrity checks on the contents of shards at various points during their lifecycle. For instance, it verifies the checksum of every file transferred when recovering a replica or taking a snapshot.

Metrics

  • Search Performance Metrics
    • Query Load
    • Query Latency
  • Indexing Performance Metrics
    • Indexing Latency
    • Flush Latency
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment