Notes from Elasticsearch Engineer II training
  • Elasticsearch Internals
    • Lucene Indexing
      • Java Library
      • Shard
        • A shard is an instance of Lucene
        • max # of documents in a shard is Integer.MAX_VALUE - 128
        • clients do not refer to shards directly (they use the index instead)
        • What's in a shard?
          • indexed document gets analyzed, put in buffer
            • buffer defaults to 10% of node heap (set with indices.memory.index_buffer_size)
            • buffer is shared by all shards allocated to that node
          • after buffer is full, or some time passes (refresh_interval), a "lucene flush" occurs
          • documents are not searchable until flushed to a "segment"
        • refresh parameter on index, update, delete, and bulk requests
          • default: false
          • true: force a lucene flush
          • wait_for: wait for next flush
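          • a minimal sketch of indexing with the refresh parameter (the index name my_index and the document body are assumptions):

              PUT my_index/_doc/1?refresh=wait_for
              {
                "title": "visible to search before the request returns"
              }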
    • Understanding Segments
      • shard = Lucene index = collection of segments
        • think of segment as an immutable "mini index"
      • Each segment is a fully independent index
        • Each segment is composed of a number of files
        • Segment files are written to disk
        • Segment files are never updated (read only)
      • a search request is distributed to the shards and on each shard it is performed sequentially over the segments
    • Segment Merges
      • segments are periodically merged into larger segments
      • keeps index size and number of segments manageable
      • deleted documents get expunged during a merge
      • force a merge with POST index_name/_forcemerge
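      • for example (the index name my_index and the target segment count are assumptions):

          POST my_index/_forcemerge?max_num_segments=1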
    • Elasticsearch Indexing
      • index = 1 or more shards
      • a shard consists of segments
      • segments are fsynced to disk during a Lucene commit
        • a relatively heavy operation
        • should not be performed after every index or delete operation
      • until a commit fsyncs segment data to disk, it is susceptible to loss
      • to prevent this data loss, ES implements a transaction log for each shard
      • transaction log is written at the same time as buffer
      • in the event of a failure, translog will be replayed
      • force es flush with POST index_name/_flush
      • synced flush
        • performs a normal flush, then adds a generated unique marker (sync_id) to all shards
        • sync_id provides a quick way to check if two shards are identical.
        • happens every minute or so
        • consider doing it before a cluster restart
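        • for example, synced-flushing a single index or the whole cluster (the index name my_index is an assumption):

            POST my_index/_flush/synced
            POST _flush/synced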
    • Doc Values
      • a data structure that store the values of a document on-disk in a column-oriented fashion, at index time
        • which makes sorting and aggregations faster
    • Caching
      • one cache per node that is shared by all shards
        • LRU eviction policy
        • it only caches queries inside a filter context
      • one static node-level setting
        • indices.queries.cache.size: "5%"
      • the segment-level cache uses bit sets
        • array of bits, where each position represents a document
          • super efficient
        • Only built for segments that are "big enough"
      • Shared request cache
        • one cache per node that is shared by all shards
        • LRU eviction policy
        • Data modification invalidates the cache
        • Good fit for indices that you don't write to anymore
        • search requests return results almost instantly
        • static setting that must be configured on every node:
          • indices.requests.cache.size: "5%"
        • shard level request cache
        • enabled on every index by default
  • Field Modeling
    • use cases for granular fields
      • suppose a version number
      • allow queries like 5.4 or 6.x
      • instead of simple version number, index major, minor, and bug fix number + "display name"
      • now we can search at any desired level
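      • a minimal mapping sketch for this idea (the index name software, the _doc type, and the field names are assumptions):

          PUT software
          {
            "mappings": {
              "_doc": {
                "properties": {
                  "major":        { "type": "integer" },
                  "minor":        { "type": "integer" },
                  "bug_fix":      { "type": "integer" },
                  "display_name": { "type": "keyword" }
                }
              }
            }
          }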
    • Range data types introduced in 5.2
      • integer_range
      • float_range
      • long_range
      • double_range
      • date_range
      • ip_range
    • use range query
      • intersects
      • contains
      • within
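      • a sketch combining a range field with a range query (the index name events and the field name availability are assumptions):

          PUT events
          {
            "mappings": {
              "_doc": {
                "properties": {
                  "availability": { "type": "date_range", "format": "yyyy-MM-dd" }
                }
              }
            }
          }

          GET events/_search
          {
            "query": {
              "range": {
                "availability": {
                  "gte": "2018-01-01",
                  "lte": "2018-12-31",
                  "relation": "within"
                }
              }
            }
          }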
    • Mapping parameters
      • https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-params.html
      • index: false (field will still be in _source)
        • doc_values: false (field will still be in _source, field can still be used in queries, but not in aggregations or sorting)
      • enabled: false (can not query or aggregate, will still be in _source)
      • disabling _all
        • _all will be removed from future versions of ES
        • can not use it in new indexes, as of ES6
        • disable with "_all":{"enabled": false} in mappings
      • copy_to
        • use case example: multiple fields that represent location (city, region, country), want to search them in a single query
          • could just run a bool query across the fields
          • or copy all the values to a single array field during indexing
      • null: treated as if the field has no value
        • the average of 5.0 and null is 5.0
        • use null_value to set a default
      • coercing data
        • disable with coerce:false
      • _meta field
        • custom metadata in your mapping
    • Dynamic templates
      • dealing with a large number of fields
      • dynamic field names not known at the time of mapping definition
      • define a field's mapping based on
        • datatype
        • name of the field
          • for example, map f_* as float (see the sketch after this list)
        • path to the field
      • controlling dynamic fields
        • strict mappings
          • by default, if a document is indexed with an unexpected field, the mapping is dynamically modified
          • in prod, you will likely define mappings prior to indexing
            • you probably do not want your mappings to change
            • dynamic: strict
              • do not index a doc that contains fields not already in mapping
              • you can still modify the mapping directly
              • other values:
                • true, default behavior
                • false, new fields will be ignored
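      • a minimal dynamic template sketch for the f_* example above (the index name my_index and the template name floats_by_name are assumptions):

          PUT my_index
          {
            "mappings": {
              "_doc": {
                "dynamic_templates": [
                  {
                    "floats_by_name": {
                      "match": "f_*",
                      "mapping": { "type": "float" }
                    }
                  }
                ]
              }
            }
          }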
  • Fixing Data
    • Tools for fixing data
      • Painless is a scripting language designed specifically for use with ES
        • fast, secure, groovy-like syntax
        • supports all of Java's data types and a subset of the Java API
        • use cases
          • accessing document fields
            • update/delete a field in a document
            • perform computations on fields before returning them
            • customize the score of a doc
            • working with aggregations
          • Ingest node
            • execute a script within an ingest pipeline
          • Reindex API
            • manipulate data as it is getting reindexed
          • Basic example: updating a field
            • increment the number of views on a blog post (sketch after this list)
        • Can be defined inline, or stored (POST /_scripts/script_name)
        • scripts are cached (inline or stored)
          • new script can evict a cached script
          • default cache is 100 scripts
            • configure using script.cache.max_size and script.cache.expire
          • compilation can be expensive
            • default: 75 compilations per 5-minute window (75/5m)
            • configured by script.max_compilations_rate
            • avoid by using parameters
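        • a minimal sketch of the "increment views" example above, using params to avoid recompilation (the index name blogs and the field name views are assumptions):

            POST blogs/_doc/1/_update
            {
              "script": {
                "lang": "painless",
                "source": "ctx._source.views += params.count",
                "params": { "count": 1 }
              }
            }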
      • Languages like Groovy, Javascript, and Python are no longer available as of ES6
      • Reindex API
        • take data from one index, optionally manipulate, and write to a new one.
      • Update by query
        • same idea as the Reindex API, but updates the index in place
      • delete by query
        • like SQL DELETE ... WHERE
    • fixing mappings
      • you can not change the data type in a mapping
      • must define a new index with the desired data types
      • post _reindex, define source and dest indexes
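      • for example (the index names blogs and blogs_fixed are assumptions):

          POST _reindex
          {
            "conflicts": "proceed",
            "source": { "index": "blogs" },
            "dest":   { "index": "blogs_fixed" }
          }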
    • reindexing tips
      • conflicts parameter
    • picking up mapping changes
      • you can always add fields, but updating the mapping won't update the index
      • POST blogs_fixed/_update_by_query
      • reindex batch pattern: use a field with a batch ID number; to continue after a failure, use update_by_query to finish where you left off
    • fixing fields
      • splitting a field
      • Ingest Node
        • provide the ability to preprocess a doc right before it gets indexed
        • intercepts an index or bulk API request
        • applies transformations
        • passes the document back to the index or bulk API
        • when indexing a doc, you specify a pipeline
        • centralized, pipelines stored in cluster state
        • by default, all nodes can ingest. node.ingest=true
        • pipeline
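        • a minimal pipeline sketch for the "splitting a field" case above (the pipeline name, index name, field name, and separator are assumptions):

            PUT _ingest/pipeline/split_locations
            {
              "description": "split a comma-separated locations field into an array",
              "processors": [
                { "split": { "field": "locations", "separator": "," } }
              ]
            }

            PUT blogs/_doc/1?pipeline=split_locations
            { "locations": "Annapolis,Maryland,US" }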
  • Advanced Search & Aggregations
    • Searching for patterns
      • wildcard query is a term-based search that contains a wildcard
        • * = anything
        • ? = any single character
        • can be expensive, especially if you have leading wildcards
      • regex query
      • dealing with null values
        • documents will likely have missing or null fields
        • exists query
          • wrap exists in a must_not to find docs that are missing a field
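          • for example (the index name blogs and the field name author are assumptions):

              GET blogs/_search
              {
                "query": {
                  "bool": {
                    "must_not": { "exists": { "field": "author" } }
                  }
                }
              }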
      • scripted queries
        • script queries must compute a boolean value
        • test a query with the _execute API
        • contexts: filter, or scoring
      • scripted fields
        • add fields to a query response that are generated in a script
        • script_fields in query body
        • expensive, consider adding as a field, calculating on ingest
      • Search templates
        • like stored procedures
        • mustache templates, script.lang=mustache
        • does not have if/else logic
          • but you can define a section that gets skipped if the parameter is false or not defined
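        • a minimal sketch of storing and using a template (the template id blog_search, index, field, and parameter names are assumptions):

            POST _scripts/blog_search
            {
              "script": {
                "lang": "mustache",
                "source": {
                  "query": { "match": { "{{my_field}}": "{{my_value}}" } }
                }
              }
            }

            GET blogs/_search/template
            {
              "id": "blog_search",
              "params": { "my_field": "title", "my_value": "elasticsearch" }
            }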
      • percentiles aggregation
        • default percentiles, but they can be overridden
      • percentiles_rank (reverse percentiles)
      • top_hits
      • missing
      • scripted aggregations
        • combine fields for a terms agg
      • significant terms aggregation
        • terms aggregations + noise filter
          • discards the common terms that a plain terms agg would return
        • low frequency terms in the background data pop out as high frequency terms in the foreground data
      • Pipeline aggregations
  • Cluster Management
    • Architecture Recap
      • largest unit is a cluster
      • cluster made up of nodes
      • each node runs an ES process, and typically there is a 1:1 relationship between a server and a node
      • indexes and shards
        • cluster contains one or more indices
        • each index is sharded and its shards are distributed over nodes
        • ES automatically distributes the shards evenly across the nodes
      • maybe you want more control
        • your servers' specifications
        • what nodes shards are allocated to
          • data management strategy? some data on less expensive hardware?
    • Dedicated Nodes
      • several roles a node can have
        • master eligible
        • data
        • ingest
        • machine learning
        • coordinating
      • nodes can be dedicated only to a single role
      • by default, a node is master-eligible, data, and ingest
        • node.master=true
        • node.data=true
        • node.ingest=true
      • why use?
        • dedicated master eligible nodes
          • focus on cluster state
          • machines with low CPU, RAM, and disk resources
        • dedicated data nodes
          • fast disks, lots of ram
          • can focus on data storage and processing client requests
        • dedicated ingest nodes
          • can focus on data processing
          • machines with low disk, medium RAM, and high CPU
        • coordinating-only node
          • machines with low disk, medium/high RAM, and medium/high CPU resources
          • behave like smart load balancers
          • perform the gather/reduce phase of search requests
          • lightens the load on data nodes
        • coordinating + ingest not uncommon
    • Hot/Warm architecture
      • you can configure the dedicated data nodes in your cluster to use a hot/warm arch
        • useful for scenarios where you want to control which nodes perform indexing vs query handling
      • fine-grained control over data allocation
      • dedicated data nodes can be used as
        • hot nodes for indexing
        • warm nodes for older, read-only indices
          • larger amounts of data may require additional nodes to meet performance requirements
    • Shard Filtering
      • the ability to control which nodes the shards for an index are allocated to
        • use node.attr to tag your nodes
        • use index.routing.allocation to assign indexes to nodes
        • three types of rules for assigning indexes to nodes
          • index.routing.allocation.include
          • index.routing.allocation.exclude
          • index.routing.allocation.require
        • PUT _settings on index to configure
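        • a sketch, assuming nodes are tagged in elasticsearch.yml with a custom attribute named my_temp and an index named logs-2023-01:

            # elasticsearch.yml on the tagged nodes
            node.attr.my_temp: hot

            # pin the index to those nodes
            PUT logs-2023-01/_settings
            {
              "index.routing.allocation.require.my_temp": "hot"
            }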
    • Shard filtering for Hardware
    • Shard allocation awareness
      • make nodes aware of where they are (racks, availability zone)
      • you can make ES aware of the physical configuration of your hardware
      • cluster.routing.allocation.awareness
      • useful when more than one node shares the same resource (disk, host machine, network switch, rack, etc)
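      • a sketch, assuming a custom node attribute named rack_id set in elasticsearch.yml:

          # elasticsearch.yml
          node.attr.rack_id: rack_one
          cluster.routing.allocation.awareness.attributes: rack_id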
    • Forced Awareness
      • suppose you have a rack or zone that fails
      • the remaining rack will try to reassign all the missing replicas
      • the single rack might not be able to handle that type of volume
      • you might not want ES to panic when you restart a rack
      • stay yellow during a temporary event
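      • a sketch, assuming the rack_id attribute from the previous section and two racks:

          # elasticsearch.yml
          cluster.routing.allocation.awareness.force.rack_id.values: rack_one,rack_two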
  • Capacity Planning
    • designing for scale
      • default settings can take you a long way
      • proper design can make scaling easier
    • one shard does not scale well
    • two shards can scale if we add a node
    • shard overallocation
      • if you are expecting your cluster to grow, it's good to plan for that by overallocating shards
        • num shards > num nodes
      • be careful
        • unit of scaling is the shard
        • but you should not have too many
      • especially under-utilized shards
      • each shard uses resources from the machine
        • if you have too many, it will seriously slow things down
      • too many shards, ES will be slow
      • "too much"
      • do not overshard
    • so how many shards should I configure?
      • no simple formula
      • "it depends!"
    • too many factors involved:
      • use case: metrics, logging, search, api etc
      • hardware
      • num of docs
      • size and complexity of docs
    • before trying to determine your capacity, determine your SLAs
      • docs/second
      • queries/second
      • maximum response time for queries
    • get some production data
      • actual documents
      • actual queries
      • actual mappings
    • create a one-node (one shard?) cluster
    • push it until it breaks (breaks SLAs)
    • Rally automates this: https://esrally.readthedocs.io/en/stable/quickstart.html
      • run it in docker
    • number of primary shards
      • not an exact science, but you can
        • estimate the total amount of data for your index
        • leave room for growth
        • divide by the maximum capacity of a single shard
      • the number of primary shards is usually determined by indexing speed
      • for searching, what matters is the total number of shards (regardless of index)
        • note we are not talking about replicas here, just number of primary shards
    • measure indexing capacity of a node
      • index documents in a similar way as your app will
      • index in parallel until error 429
      • use this to calculate the number of docs/second
      • you can calculate how to scale your nodes for ingestion based on your SLA
    • scaling with replicas
      • adding replicas provides an additional level of scaling
        • we know replicas provide high availability
        • they also provide scaling for reads and search (IF YOU ADD ADDITIONAL HARDWARE)
        • cost of replicas:
          • slower indexing speed (although replicas are indexed in parallel)
          • more storage on disk
          • larger associated heap memory footprint of the extra shard(s)
    • scaling with indices
      • using multiple indices also provides scaling
        • if you need to add capacity, consider just creating a new index
      • then search across both indices to search new and old data
        • you could even define a single alias for the multiple indices
      • searching 1 index with 50 shards is equivalent to searching 50 indices with 1 shard each
    • Capacity planning use cases
      • understand
        • what your data looks like
        • how that data is going to be searched
      • two common use cases are:
        • searching fixed-size data: a large dataset that may grow slowly
          • care about relevancy, not so much timeliness
          • planning
            • size your indices based on the maximum shard capacity vs the amount of data you anticipate
              • if you need to increase capacity, either reindex or add multiple indices (preferably not often)
              • easy to increase throughput by adding nodes and replicas
        • time based data: grows rapidly, like logs, social media streams, time-based events
          • all docs have a timestamp, and likely do not change
          • planning
            • searching usually involves a timestamp
            • older docs become less important
            • data ingestion is a key factor
              • typically a lot of data coming in
              • you do not want indexing to become a bottleneck (it has to keep up)
            • common pattern: time-based indices
              • new index per day, week, month, or year
              • add the date to the name of the index
              • search using index wildcards or aliases
            • data ingestion
              • hardcoding the index name in the app is not scalable
              • you can use date math in your index name, like <logstash-{now/d}>
            • update aliases with curator: https://www.elastic.co/guide/en/elasticsearch/client/curator/5.6/index.html
            • aliases
              • simple to define
              • need a tool to update the alias
              • no need to redeploy your application
            • managing time-based indices
              • for optimal ingest rates
                • spread shards of your active index over as many nodes as possible
              • optimal search
                • shrink older indices down to the optimal number of shards
                • close indices that are no longer being searched
              • use hot/warm architecture
                • hot for indexing, warm for querying
  • Document Modeling
    • the need for doc modeling
      • in ES, you typically denormalize your data
      • flat world has advantages
        • indexing and searching is fast
        • no need to join tables or lock rows
      • sometimes relationships matter
    • we need to bridge the gap between normal and flat
    • four common techniques for managing relational data in ES
      • denormalization
        • store redundant copies of data in each document
        • example, tweet + user
        • optimizing for reads
        • no mechanism for keeping denormalized data up to date; avoid denormalizing data that is likely to change frequently
        • when denormalized data does change, ensure that there is one and only one authoritative source for the data
        • if you do denormalize data that might change, it can help to also denormalize a field that does not change, like an _id.
        • try not to maintain state in ES, instead try to record events
      • app-side joins (run multiple queries)
        • not recommended
      • nested objects (for working with arrays of objects)
      • parent/child relationships
    • denormalization
    • the need for nested types
      • ES flattens objects (tags.key, tags.value)
      • allows arrays of objects to be indexed and queried independently of each other
      • use nested if you need to maintain the relationship of each object in the array
      • nested objects are stored separately
    • nested types
    • querying a nested type
      • {query: {nested: {path: tags, query:{}}}}
      • inner_hits to see why a document matched
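      • a sketch, assuming an index my_products with a nested tags field holding key/value objects:

          GET my_products/_search
          {
            "query": {
              "nested": {
                "path": "tags",
                "query": {
                  "bool": {
                    "must": [
                      { "match": { "tags.key":   "color" } },
                      { "match": { "tags.value": "red" } }
                    ]
                  }
                },
                "inner_hits": {}
              }
            }
          }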
    • the nested aggregation
      • puts nested objects in a bucket
        • useful for performing sub-aggregations
    • parent/child relationship
      • 1-N relationships
      • updating a nested object requires reindexing the root object and a complete reindexing of all its nested objects. nested objects are a READ optimization.
      • using a join datatype, you can completely separate two objects while maintaining their relationship
        • parent and child are completely separate documents
        • the parent can be updated without reindexing the children
        • children can be added/changed/deleted without affecting the parent or other children.
      • it's a write optimization.
      • configured in your mappings using the join datatype
      • parent and all children must live in the same shard
      • overrides the default shard routing algorithm, uses the parent's _id
      • therefore, every time you refer to a child document, you must specify its parent's id
      • defining the relationship
        • define the mapping, type: join.
        • index some parent documents
        • index some child documents, ?routing=parent_id
        • index/_search will return parents and children
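        • a minimal sketch (the index name my_index, relation names question/answer, and the _doc type are assumptions):

            PUT my_index
            {
              "mappings": {
                "_doc": {
                  "properties": {
                    "my_join_field": {
                      "type": "join",
                      "relations": { "question": "answer" }
                    }
                  }
                }
              }
            }

            PUT my_index/_doc/1
            { "text": "parent document", "my_join_field": "question" }

            PUT my_index/_doc/2?routing=1
            { "text": "child document", "my_join_field": { "name": "answer", "parent": "1" } }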
    • the has_child query
      • has_child type: relation_name
      • filter that accepts a query and the child type to run against
      • results in parent document
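      • a sketch using the question/answer relation assumed above:

          GET my_index/_search
          {
            "query": {
              "has_child": {
                "type": "answer",
                "query": { "match": { "text": "child" } }
              }
            }
          }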
    • the has_parent query
      • matches children that have a parent matching the query
    • always need id of parent when you query children
    • Kibana has very limited support for nested types and parent/child relationships
  • Monitoring and Alerting
    • Stats API
    • Task management API
      • show how busy the server is
    • _cat API
      • wrapper around many JSON APIs
      • command-line tool friendly
      • helpful for simple monitoring
      • v parameter to show headers
      • h parameter to specify column, * for all
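      • for example:

          GET _cat/indices?v
          GET _cat/nodes?v&h=name,heap.percent,cpu,load_1m
          GET _cat/thread_pool?v&h=node_name,name,active,queue,rejected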
    • Diagnosing performance issues
      • thread pool queue
        • GET _nodes/thread_pool
        • GET _nodes/stats/thread_pool
        • when a queue is full, error 429
        • _cat/thread_pool?v
        • a full queue can be good or bad
          • OK if bulk indexing requests arrive faster than ES can handle
          • bad if search queue is full
      • the hot_threads API
        • allows you to view current hot threads on each node
        • GET _nodes/hot_threads
        • GET _nodes/node/hot_threads
      • The indexing slow log
        • captures long-running index operations to a log file
        • logs indexing events that take longer than configured thresholds
        • defaults set in log4j2.properties
        • can be overridden in index settings, index.indexing.slowlog
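        • a sketch of overriding the thresholds on one index (the index name and threshold values are assumptions):

            PUT my_index/_settings
            {
              "index.indexing.slowlog.threshold.index.warn": "10s",
              "index.indexing.slowlog.threshold.index.info": "5s"
            }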
      • search slow logs
        • disabled by default
          • usefulness can be limited since it logs per shard
          • Packetbeat may be a better solution
      • Profile API
        • just set profile to true in your query
        • profiler response: hard to read, info on each shard that processed the search
        • nice support in Kibana
    • Elastic monitoring component
      • use elasticsearch to monitor elasticsearch
      • xpack.monitoring.collection.enabled defaults to false
      • the stats of all nodes are indexed into Elasticsearch
      • common settings
      • dedicated monitoring cluster
        • reduce the load and storage on your other clusters
        • access to monitoring even when other clusters are unhealthy
        • separate security levels for monitoring and production clusters
        • if elastic security is enabled, then provide credentials
        • you can also use http exporters, xpack.monitoring.exporters
      • monitoring multiple clusters
        • many clusters can be monitored with a single monitoring cluster (gold/platinum)
      • monitoring UI
        • you can see heap pressure
      • Clusters dashboard
    • Configure alerts
      • for unexpected or extreme situations
      • elastic alerting
        • watch for changes or anomalies in your data
        • perform the necessary actions in response
        • for example:
          • open a helpdesk ticket when servers are running out of free space
          • track network activity to detect malicious activity
          • send immediate notification if nodes leave the cluster
        • watch
          • trigger: when the watch is executed
        • input
          • loads data into the watch payload
        • condition
          • whether the watch actions are executed
        • actions
        • PUT _xpack/watcher/watch/match_name
        • watch executions are stored in ES too (.watcher-history/_search)
        • users can have watcher-specific role
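        • a minimal 6.x watch sketch (the watch name, interval, and logging action are assumptions):

            PUT _xpack/watcher/watch/cluster_health_watch
            {
              "trigger":   { "schedule": { "interval": "1m" } },
              "input":     { "http": { "request": { "host": "localhost", "port": 9200, "path": "/_cluster/health" } } },
              "condition": { "compare": { "ctx.payload.status": { "eq": "red" } } },
              "actions":   { "log_it": { "logging": { "text": "Cluster health is RED" } } }
            }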
  • From Dev to Production
    • disabling dynamic indexes
      • put a doc into an index that doesn't exist, and it'll be created
      • disable in a prod environment
      • PUT _cluster/settings
        • action.auto_create_index: false
      • You can whitelist certain patterns, usually time-series indexes
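      • for example, disabling auto-creation while whitelisting some time-series patterns (the patterns are assumptions):

          PUT _cluster/settings
          {
            "persistent": {
              "action.auto_create_index": "logstash-*,metricbeat-*,-*"
            }
          }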
    • dev vs prod mode
      • HTTP vs. Transport
        • HTTP- how APIs are exposed
          • ports in 9200-9299 range
        • transport: used for internal communication between nodes in cluster
          • ports in 9300-9399 range
      • binds to localhost by default
      • dev mode: if it does not bind to an external interface
      • prod mode: if it does bind transport to an external interface
      • bootstrap tests
        • dev mode: any bootstrap checks that fail will generate warnings
        • prod mode: will refuse to start
    • best practices
      • avoid running over WAN links between datacenters
        • not officially supported by elastic
      • try to have 0 or few hops between nodes
      • If you have multiple network cards, seperate transport and http traffic
        • bind to different interfaces
        • use separate firewall rules for each kind of traffic
      • use long-lived HTTP connections
        • client libraries support this
        • use a proxy/load-balancer
      • storage best practices
        • segments are immutable, so the write amplification factor approaches one and is a non-issue
        • local disk is king!
          • avoid network storage
        • Elasticsearch does not need redundant storage
          • replicas/software provide HA
          • local disks are better than SAN
          • RAID 1/5/10 is not necessary
        • if you have multiple disks in the same server, use RAID 0 or path.data
        • RAID 0
          • splits data evenly across two or more disks
          • perfect distribution of the data across the disks
          • if you lose one disk, you lose them all
        • path.data
          • distribute index across multiple SSD's
          • potential for an unbalanced distribution
          • if you lose one disk, the data on the other disks are preserved
        • SSD: use the noop or deadline scheduler in the OS https://www.elastic.co/guide/en/elasticsearch/guide/current/hardware.html#_disks
        • spinning disks are OK for warm nodes, but discourage concurrent merges
          • index.merge.scheduler.max_thread_count: 1
        • Trim your SSD's https://www.elastic.co/blog/is-your-elasticsearch-trimmed
      • Hardware selection
        • in general, choose medium machines over large machines
        • loss of a large node has greater impact
          • prefer six machines with 4 CPUs, 64 GB RAM, and 4 x 1 TB drives
          • avoid two machines with 12 CPUs, 256 GB RAM, and 12 x 1 TB drives
        • avoid running multiple nodes on one server
          • one ES instance can fully consume a machine
        • larger machines can be helpful as warm nodes
          • configure shard allocation filtering as previously discussed.
      • Cloud strategies
        • use discovery plugin
          • dynamically configure the unicast hosts
        • span the cluster across more than one AZ
        • prefer ephemeral storage over network storage
          • EBS: maybe provision IOPS?
        • snapshot to cloud storage with repository plugins
        • use shard awareness and forced awareness
        • avoid instances marked with low networking performance.
      • Throttles
        • relocation and recovery throttles ensure these tasks do not have a negative impact
        • Recovery
          • for faster recovery, temporarily increase the number of concurrent recoveries
          • _cluster/settings
            • cluster.routing.allocation.node_concurrent_recoveries
        • Relocation
          • for faster rebalancing of shards, increase cluster.routing.allocation.cluster_concurrent_rebalance
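        • a sketch of temporarily raising both settings (the values are assumptions):

            PUT _cluster/settings
            {
              "transient": {
                "cluster.routing.allocation.node_concurrent_recoveries": 5,
                "cluster.routing.allocation.cluster_concurrent_rebalance": 4
              }
            }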
      • JVM configuration
        • only 64-bit JVMs are supported
        • in general, give half of the machine's memory to the JVM heap
        • configure in config/jvm.options
        • what goes in the heap?
          • indexing buffer
          • completion suggester
          • cluster state
          • caches
            • node query cache
            • shard query cache
            • fielddata
          • and more
        • HEAP size
          • default is 1GB, not high enough for production
          • configure with Xms (min heap) and Xmx(max heap)
            • like -Xms8g -Xmx8g
            • set Xmx to no more than 50% of your physical RAM
          • rule of thumb for setting the JVM heap: stay below ~32 GB so the JVM can use compressed object pointers
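          • for example, in config/jvm.options (8g is an assumption, suitable for a 16 GB machine):

              -Xms8g
              -Xmx8g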
      • common causes of poor query performance
        • use filters to benefit from caching
        • limit the scope of aggregations by adding a query
        • use the sampler bucket aggregation
        • don't try to use ES like a relational database
          • consider denormalization, or polyglot architecture
          • avoid parent/child relationships
        • too many shards
          • more shards == slower query performance
          • the default of 5 shards can be too high for some scenarios
          • having thousands of 100MB indices with 5 shards each is not good
          • solution
            • shards can be fairly large (40GB is good)
            • create as few of them as needed
          • unnecessary scripting
            • try to run scripts at index time instead of query time
          • regex are slow
            • use them less
            • use clever indexing; if you need to search the end of a string, try the reverse token filter
    • Cross-cluster search
      • allow any node to act as a client for executing queries across multiple clusters
      • introduced in ES 5.3, replaces "tribe node"
      • remote clusters are configured in the cluster settings
        • using the cluster.remote property
        • seeds is a list of nodes in the remote cluster used to retrieve the cluster state when registering the remote cluster
      • search with GET cluster_name:index-name/_search
      • search multiple clusters with GET index_name,cluster_name:blog/_search
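      • a sketch of registering a remote cluster and querying it (the cluster alias cluster_one, the seed address, and the index name blogs are assumptions):

          PUT _cluster/settings
          {
            "persistent": {
              "cluster.remote.cluster_one.seeds": ["10.0.1.1:9300"]
            }
          }

          GET cluster_one:blogs/_search
          {
            "query": { "match_all": {} }
          }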
      • how it works
        • from a search execution perspective, there is no difference between local indices and remote indices
        • as long as local nodes can reach remote nodes
        • coordinating node gets size hits from each shard (local and remote) and performs reduction
        • top hits are fetched from the relevant shards and returned to client
      • you can use wildcards in cluster names
    • Overview of Upgrades
      • Versions are major.minor.patch/maintenance
      • ES can use indices created in the previous major version, but older indices must be reindexed or deleted
        • 6.x can use indices created in 5.x
        • 6.x cannot use indices created in 2.x or before
        • 5.x can use indices created in 2.x
        • 5.x cannot use indices created in 1.x or before
      • ES will fail to start if an incompatible index is found
      • modern versions, you can do rolling upgrades
      • https://www.elastic.co/products/upgrade_guide
      • Cluster restart strategies
        • rolling restart
          • zero downtime
          • reads and writes continue to operate normally
        • Full cluster
          • can be faster
        • steps for a rolling restart
          • stop non-essential indexing (if possible)
          • disable shard allocation
            • transient: cluster.routing.allocation.enable: none
            • POST _flush/synced
          • stop and update one node
          • restart the node
            • check node health
          • re-enable shard allocation and wait
            • check cluster health
          • go back to "disable shard allocation"
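          • a sketch of the disable/re-enable settings and synced flush used above:

              PUT _cluster/settings
              { "transient": { "cluster.routing.allocation.enable": "none" } }

              POST _flush/synced

              # after the node is back and healthy
              PUT _cluster/settings
              { "transient": { "cluster.routing.allocation.enable": "all" } }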
        • full cluster restart
          • stop indexing
          • disable shard allocation
            • make sure to use "persistent"
          • synced flush
          • shutdown and update all nodes
          • start all dedicated master nodes
            • wait for master election
          • start the other nodes
          • wait for yellow
          • re-enable shard allocation
  • Resources