Elasticsearch Cheatsheet - My Elasticsearch Commands, Queries, and Config Notes


Delete all documents from index

curl -X POST "localhost:9200/index_name/_delete_by_query" -H 'Content-Type: application/json' -d'
{
  "query": {
    "match_all": {}
  }
}'

Health / Stats

curl -XGET "http://localhost:9200/_cluster/stats?human&pretty"
curl -XGET "http://localhost:9200/_cat/shards?v"
curl -XGET "http://localhost:9200/_cat/indices?v"
curl -XGET "http://localhost:9200/_cat/allocation?v"

Get all

curl -H 'Content-Type: application/json' -XGET 'localhost:9200/index_name/_search' -d '
{
    "query" : {
        "match_all" : {}
    }
}'

Search for string across all fields

curl -H 'Content-Type: application/json' -XPOST 'localhost:9200/index_name/_search' -d '
{
    "query" : {
      "query_string": { "query": "heart" }
    }
}' | jq . | head -n25

Count documents

curl -H 'Content-Type: application/json' -XGET 'localhost:9200/index_name/_count'

Get mapping

curl -H 'Content-Type: application/json' -XGET 'localhost:9200/index_name/_mapping'

Get one record

curl -H 'Content-Type: application/json' -XGET "localhost:9200/index_name/index_name/MnjvwGkBD86Op0uG1ix5" | jq

Create one record

curl -X PUT "localhost:9200/index_name/index_name/1/_create" -H 'Content-Type: application/json' -d'
{
  "create": "2015/09/02"
}'

Sort output, search for non-empty values in a field whose name contains a space, and use jq to extract the values.

curl -H 'Content-Type: application/json' -XGET 'localhost:9200/index_name/_search' -d '{ "sort":[{"Scheduled Date.keyword" : {"order":"asc"}}], "query" : {"query_string" : {"query": "Scheduled\\ Date:/.*/"}}}'  | jq --raw-output '.hits.hits[]._source."Scheduled Date"'

Disable automatic date detection

curl -s -H 'Content-Type: application/json' -X PUT 'localhost:9200/index_name' -d '
{
  "mappings": {
    "$ELASTIC_DOC_TYPE": {
      "date_detection": false,
      "properties": {
        "orig": {
          "type": "text"
        }
      }
    }
  }
}'

Search by several fields

curl -H 'Content-Type: application/json' -XGET 'localhost:9200/index_name/_search' -d '
{
  "query": {
    "bool": {
      "must": [
        {
          "term": {
            "AccessionNumber.keyword": "123456789"
          }
        },
        {
          "term": {
            "SeriesNumber.keyword": "1"
          }
        },
        {
          "term": {
            "SeriesDescription.raw": "FMRI-AX"
          }
        },
        {
          "term": {
            "InstanceNumber.keyword": "71"
          }
        }
      ]
    }
  }
}'

Get All (actually retrieve all pages of results)

elasticdump \
  --input=https://elasticindex_names.ccm.sickkids.ca \
  --input-index=index_name \
  --output=$ \
  --searchBody='{"_source": ["Report"], "query" : {"match_all" : {} } }' \
| jq '{id:._id,Report:._source.Report}'

outputs:
{
  "id": "CnZVSWsBHWg-PhjNBSxI",
  "Report": "Flexion/extension viewso…"
}
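
As an alternative to elasticdump, the scroll API pages through all results with plain curl. A minimal sketch, assuming the default localhost setup used elsewhere in these notes; the scroll_id placeholder must be copied from the first response:

# Open a scroll context and fetch the first page
curl -s -H 'Content-Type: application/json' -XPOST 'localhost:9200/index_name/_search?scroll=1m' -d '
{
  "size": 1000,
  "query": { "match_all": {} }
}' | jq .

# Fetch the next page using the _scroll_id returned by the previous call;
# repeat until hits come back empty
curl -s -H 'Content-Type: application/json' -XPOST 'localhost:9200/_search/scroll' -d '
{
  "scroll": "1m",
  "scroll_id": "<_scroll_id from the previous response>"
}' | jq .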

Get one field multiple records

curl -s -H 'Content-Type: application/json' -XGET 'http://127.0.0.1:9200/index_name/index_name/_search' -d '
{
  "_source": "PatientID",
  "from": 1,
  "size": 5
}
' | jq ".hits.hits[]._source.PatientID"

Note: "from" is index position "size" is number of records

Increase the number of allowed fields (a.k.a. columns)

curl -H 'Content-Type: application/json' -XPUT 'localhost:9200/index_name/_settings' -d '
{
  "index.mapping.total_fields.limit": 100000
}'

List indexes

curl http://localhost:9200/_aliases?pretty=true

Info about an index

curl -s -X GET http://localhost:9200/index_name | jq

Count docs in an index

curl -s -X GET http://localhost:9200/index_name/index_name/_count | jq

Search with query stored in file

curl -v -H 'Content-Type: application/x-ndjson' -H 'Accept: application/json' -XPOST 'https://elasticindex_names.ccm.sickkids.ca/index_name/_msearch' --data-binary @data.json

# required file data.json (must have new line at end of file)
{"preference":"bodypart-list"}
{"query":{"match_all":{}},"highlight":{"pre_tags":["<mark>"],"post_tags":["</mark>"],"fields":{}},"size":0,"from":0,"aggs":{"BodyPartExamined.raw":{"terms":{"field":"BodyPartExamined.raw","size":100,"order":{"_count":"desc"}}}}}

Count number of fields on index

ubuntu@index_names:~$ curl -s -XGET localhost:9200/index_name/_mapping?pretty | grep type | grep text | wc -l
822

For each unique value in one field, count how many unique values there are in another field

curl -H 'Content-Type: application/json' -XPOST 'localhost:9200/index_name/_search' -d '{
  "aggs": {
    "count_index_names_by_modality": {
      "terms": {
        "field": "Modality.raw",
        "size": 20,
        "order": { "_count": "desc" }
      },
      "aggs": {
        "exam_count_per_modality": {
          "cardinality": {
            "field": "AccessionNumber.keyword"
          }
        }
      }
    }
  }
}' | jq

Get names of fields

curl -s -XGET localhost:9200/index_name/_mapping | jq .index_name.mappings.index_name.properties | jq 'keys'

Process names of fields in loop

for i in $(curl -s -XGET localhost:9200/index_name/_mapping | jq .index_name.mappings.index_name.properties | jq 'keys' | jq .[])
do
  echo "key: $i"
done

Count the not-null values for every field in the index

for FIELD_NAME in $(curl -s -XGET localhost:9200/index_name/_mapping | jq .index_name.mappings.index_name.properties | jq 'keys' | jq .[])
do
  NUM_NOT_NULL=$(curl -s -H 'Content-Type: application/json' -XGET 'http://127.0.0.1:9200/index_name/index_name/_search' -d '
  {
     "query" : {
        "constant_score" : {
           "filter" : {
              "exists" : {
                 "field" : '"$FIELD_NAME"'
              }
           }
        }
     }
  }' | jq .hits.total)
  echo "$FIELD_NAME: $NUM_NOT_NULL"
done | tee out.json

Count number of distinct values

curl -H 'Content-Type: application/json' -XGET 'http://127.0.0.1:9200/linking/linking/_search' -d '
{
    "size" : 0,
    "aggs" : {
        "distinct_orig" : {
            "cardinality" : {
              "field" : "orig.keyword"
            }
        }
    }
}' | jq

Note: "size": 0 here means "perform the aggregation without returning the matching documents themselves".

Get the list of unique values in a field and count the occurrences of each distinct value

curl -H 'Content-Type: application/json' -XGET 'http://127.0.0.1:9200/index_name/index_name/_search' -d '
{
    "size": 0,
    "aggs" : {
        "count_orig" : {
            "terms" : { "field" : "ProtocolName.keyword", "size": 2147483647}
        }
    }
}' | jq

Update a field for all records that match a query

es.update_by_query(index='index_name', doc_type='index_name', body={
  'query': {'term': {'AccessionNumber.keyword': 'FUJI95714'}},
  'script': {"inline": "ctx._source.A_new_attribute = 'NEWVALUE'"}}
)

Note: ".keyword" is important to guarantee an exact match, otherwise values are broken by the analyzer into term subsets, More info

Count the unique values across two fields

data.aggs = {
  "exam_count": {
    "cardinality": {
      "script": "doc['AccessionNumber.raw'].value + ' ' + doc['SeriesNumber.raw'].value"
    }
  }
};

Test an analyzer

curl -H 'Content-Type: application/json' -XGET 'http://127.0.0.1:9200/_analyze' -d '
{
  "analyzer": "standard",
  "text": "3-plane"
}' | jq .

resulting tokens: [3, plane]
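
For comparison, the built-in whitespace analyzer keeps the hyphenated token intact (a quick sketch against the same endpoint):

curl -H 'Content-Type: application/json' -XGET 'http://127.0.0.1:9200/_analyze' -d '
{
  "analyzer": "whitespace",
  "text": "3-plane"
}' | jq .

resulting token: [3-plane]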

Sorted Euclidean distance

curl -X GET "localhost:9200/index_name/_search" -H 'Content-Type: application/json' -d'
{
  "sort": [
    {
      "_script": {
        "type": "number",
        "script": {
          "lang": "painless",
          "source": "return Math.sqrt(Math.pow(Integer.parseInt(doc[\u0027Rows.keyword\u0027].value) - 499, 2) + Math.pow(Integer.parseInt(doc[\u0027Columns.keyword\u0027].value) - 499, 2))"
        },
        "order": "asc"
      }
    }
  ],
  "size": 8,
  "_source": ["Rows", "Columns",'"dicom]
}
' | jq .

Euclidean distance

curl -X PUT "localhost:9200/my_index/_doc/1" -H 'Content-Type: application/json' -d'
{
 "x1": 3.0,
 "y1": 3.0,
 "x2": 0.0,
 "y2": 0.0
}
'
curl -X GET "localhost:9200/my_index/_search" -H 'Content-Type: application/json' -d'
{
 "script_fields": {
   "my_doubled_field": {
     "script": {
       "lang":   "painless",
       "source":
       "return Math.sqrt(Math.pow(doc[\u0027x1\u0027].value - doc[\u0027x2\u0027].value, 2) + Math.pow(doc[\u0027y1\u0027].value - doc[\u0027y2\u0027].value, 2))"
     }
   }
 }
}'

Note: \u0027 means ' and is used to embed a quote inside a quoted string.

Explain Query

curl -v -H 'Content-Type: application/json' -X GET 'http://localhost:9200/index_name/index_name/DczUL2wBssoKtfgQuNfg/_explain/' -d '
{
     "query" : {
       "query_string" : {"query":"dcm"}
     }
}' | jq

Get one document for each unique value in a field

curl -H 'Content-Type: application/json' -XPOST 'localhost:9200/index_name/_search' -d '{
  "query" : {
    "query_string": { "query": "heart" }
  },
  "collapse" : {
    "field" : "AccessionNumber.raw"
  }
}' | jq '.hits.hits[]._source.AccessionNumber'

Note: The collapsing is done by selecting only the top sorted document per collapse key. For instance, the example in the official docs retrieves the best tweet for each user and sorts them by number of likes.

From https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-body.html#request-body-search-collapse
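
A hedged sketch combining collapse with an explicit sort, so the lowest SeriesNumber document wins per accession (beware that keyword sorts are lexicographic, so "10" sorts before "2"):

curl -H 'Content-Type: application/json' -XPOST 'localhost:9200/index_name/_search' -d '{
  "query": { "query_string": { "query": "heart" } },
  "collapse": { "field": "AccessionNumber.raw" },
  "sort": [ { "SeriesNumber.keyword": { "order": "asc" } } ]
}' | jq .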

Python test ElasticSearch connection

from elasticsearch import Elasticsearch
INDEX_NAME='index_name'
ELASTIC_IP='127.0.0.1'
ELASTIC_PORT=9200
es = Elasticsearch([{'host': ELASTIC_IP, 'port': ELASTIC_PORT}])
print(es.indices.exists(index=INDEX_NAME))


# Lookup dicom by path
result = es.search(
    index=INDEX_NAME,
    doc_type=DOC_TYPE,
    size=1,
    body={'query': {'term': {'filepath_orig.keyword': filepath_orig}}}
)

if result['hits']['total'] == 0:
    return

Python query

from elasticsearch import Elasticsearch

INDEX_NAME = 'index_name'
ELASTIC_IP = 'localhost'
ELASTIC_PORT = 9200

es = Elasticsearch([{'host': ELASTIC_IP, 'port': ELASTIC_PORT}])

query = {
  "query" : {
    "term" : { "filepath.keyword" : "/hpf/projects/file.txt" }
  }
}

# Or, to match everything:
query = {"query": {"match_all": {}}}

res = es.search(index=INDEX_NAME, body=query)

Query string query

from elasticsearch import Elasticsearch

INDEX_NAME = 'index_name'
ELASTIC_IP = '127.0.0.1'
ELASTIC_PORT = 9200

es = Elasticsearch([{'host': ELASTIC_IP, 'port': ELASTIC_PORT}])

query = {
    "query" : {
      "query_string": { "query": "heart" }
    }
}

res = es.search(index=INDEX_NAME, body=query)

Setup a multi-node cluster

https://www.elastic.co/guide/en/elasticsearch/guide/master/distributed-cluster.html
https://dzone.com/articles/elasticsearch-tutorial-creating-an-elasticsearch-c

elasticsearch.yml

cluster.name: "docker-cluster"
network.host: 0.0.0.0
discovery.zen.minimum_master_nodes: 2
discovery.zen.ping.unicast.hosts: ["127.0.0.1", "127.0.0.1", "127.0.0.1"]

Start cluster

docker run -d --name elasticsearch1 -p 9200:9200 -p 9300:9300 -v `pwd`/elasticsearch.yml:/usr/share/elasticsearch/config/elasticsearch.yml docker.elastic.co/elasticsearch/elasticsearch:6.7.1
docker run -d --name elasticsearch2 -p 9201:9200 -p 9301:9300 -v `pwd`/elasticsearch.yml:/usr/share/elasticsearch/config/elasticsearch.yml docker.elastic.co/elasticsearch/elasticsearch:6.7.1
docker run -d --name elasticsearch3 -p 9202:9200 -p 9302:9300 -v `pwd`/elasticsearch.yml:/usr/share/elasticsearch/config/elasticsearch.yml docker.elastic.co/elasticsearch/elasticsearch:6.7.1

ElasticSearch-Dump

Install elasticdump using the node package manager:

npm install elasticdump -g

Option 1: Download Metadata

Download all metadata associated with the index_names in your current search results. Run this command and wait for all metadata to be downloaded to the file output.json:

elasticdump \
  --input=https://elasticindex_names.ccm.sickkids.ca \
  --output=output.json \
  --searchBody='{"query":{"bool":{"must":[{"bool":{"must":[{"range":{"PatientAgeInt":{"gte":0,"lte":30,"boost":2}}}]}}]}}}'

Option 2: Download File Paths

Download the file path locations of all the index_names in your current search results. This also requires the jq tool. Run this command and wait for all the file paths to be downloaded to the file output.txt:

elasticdump \
  --input=https://elasticindex_names.ccm.sickkids.ca \
  --output=$ \
  --searchBody='{"query":{"bool":{"must":[{"bool":{"must":[{"range":{"PatientAgeInt":{"gte":0,"lte":30,"boost":2}}}]}}]}}}' \
| jq ._source.dicom_filepath | tee output.txt

Security

It IS possible to terminate SSL and set up (simple) authentication for the open-source version of Elasticsearch and/or Kibana completely for free; you just have to reverse proxy it with something like Nginx or Apache. It is however correct that if you'd like a nice UI and SSL termination directly on the standalone ES instance, you have to pay.

From https://stackoverflow.com/questions/33242197/is-elasticsearch-is-free-or-costly
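
A minimal Nginx sketch of that reverse-proxy approach (hostname and certificate paths are hypothetical; create the htpasswd file with e.g. htpasswd -c /etc/nginx/.htpasswd myuser):

# /etc/nginx/conf.d/elasticsearch.conf (assumed location)
server {
    listen 443 ssl;
    server_name es.example.com;                 # hypothetical hostname

    ssl_certificate     /etc/nginx/ssl/es.crt;  # assumed cert paths
    ssl_certificate_key /etc/nginx/ssl/es.key;

    location / {
        auth_basic           "Elasticsearch";
        auth_basic_user_file /etc/nginx/.htpasswd;
        proxy_pass           http://127.0.0.1:9200;
        proxy_set_header     Host $host;
    }
}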

You can use the free ReadonlyREST plugin to enable HTTP authentication, SSL, and ACLs: https://readonlyrest.com/free/

SearchGuard

For free Elasticsearch authentication, look up SearchGuard

ReactiveSearch Simple Custom Security Proxy

It’s also possible to secure your Elasticsearch cluster’s access with a middleware proxy server that is connected to ReactiveSearch. This allows you to set up custom authorization rules, prevent misuse, only pass back non-sensitive data, etc. Here’s an example app where we show this using a Node.JS / Express middleware:

• Proxy Server https://github.com/appbaseio-apps/reactivesearch-proxy-server/blob/master/index.js (can hand implement custom ACLs here)

• Proxy Client https://github.com/appbaseio-apps/reactivesearch-proxy-client/blob/master/src/App.js

Scripts

The scripting module enables you to use scripts to evaluate custom expressions. For example, you could use a script to return "script fields" as part of a search request or evaluate a custom score for a query.

https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-scripting.html

https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-scripting-using.html

https://www.elastic.co/guide/en/elasticsearch/painless/7.3/painless-walkthrough.html
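
For example, a custom score can be computed per document with a Painless script. A sketch using the 6.x function_score API (the 2x multiplier is arbitrary):

curl -X GET "localhost:9200/index_name/_search" -H 'Content-Type: application/json' -d'
{
  "query": {
    "function_score": {
      "query": { "query_string": { "query": "heart" } },
      "script_score": {
        "script": {
          "lang": "painless",
          "source": "_score * 2"
        }
      }
    }
  }
}' | jq .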

Nodes

Nodes are servers. Data can be sharded to split it up and increase performance. Replicas ensure availability in case of an outage and allow faster searching in parallel across different replicas. A replica duplicates shards on different nodes (servers).
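
Shard and replica counts are set per index at creation time; a quick sketch (5 shards / 1 replica were the 6.x defaults):

curl -X PUT "localhost:9200/index_name" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "number_of_shards": 5,
    "number_of_replicas": 1
  }
}'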

Query Performance Improvement Ideas

• By default, Elasticsearch will put 5 shards on one server. However, 5 servers each with one shard will be faster, mainly because of 5x the disk IO.

• The rule of thumb is that shards should consist of 20–40 GB of data.

• Store everything in RAM

"I was able to store the entire index to RAM by using the setting below while indexing the data but now the problem is the RAM usage by ElasticSearch is almost three times the size of the index.  Lucene will often use up to three times the size of the index due to the merging of existing segments."

store.type" : "memory"
gateway.type: fs

From https://discuss.elastic.co/t/how-to-move-the-whole-data-to-main-memory/8766/3

• Reduce amount of data

• Lower number of replicas from 1 (default) to 0. Usually, the setup that has fewer shards per node in total will perform better. The reason for that is that it gives a greater share of the available filesystem cache to each shard, and the filesystem cache is probably Elasticsearch’s number 1 performance factor. At the same time, beware that a setup that does not have replicas is subject to failure in case of a single node failure, so there is a trade-off between throughput and availability.

curl -H 'Content-Type: application/json' -XPUT http://$HOST_IP:$ELASTIC_PORT/$ELASTIC_INDEX/_settings -d '
{
    "index" : {
        "number_of_replicas" : 0
    }
}'

From https://www.elastic.co/guide/en/elasticsearch/reference/master/tune-for-search-speed.html#_warm_up_the_filesystem_cache

• More specific queries to reduce number of fields. A common technique to improve search speed over multiple fields is to copy their values into a single field at index time, and then use this field at search time. This can be automated with the copy-to directive of mappings without having to change the source of documents. Here is an example: https://www.elastic.co/guide/en/elasticsearch/reference/master/tune-for-search-speed.html
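
A hedged sketch of that copy-to technique (field names are hypothetical; on 6.x the properties block sits under the doc type):

curl -X PUT "localhost:9200/new_index" -H 'Content-Type: application/json' -d'
{
  "mappings": {
    "properties": {
      "StudyDescription":  { "type": "text", "copy_to": "all_text" },
      "SeriesDescription": { "type": "text", "copy_to": "all_text" },
      "all_text":          { "type": "text" }
    }
  }
}'

Then search the single all_text field instead of listing every source field in the query.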

• Harder: Pre-load elastic data into memory. From https://www.elastic.co/guide/en/elasticsearch/reference/master/_pre_loading_data_into_the_file_system_cache.html#_pre_loading_data_into_the_file_system_cache
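
A sketch of the index.store.preload setting described on that page (the file extensions to preload are workload-dependent; this is a static setting, so set it at index creation or on a closed index):

curl -X PUT "localhost:9200/new_index" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "index.store.preload": ["nvd", "dvd"]
  }
}'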
