Elasticsearch Cheatsheet - My Elasticsearch Commands, Queries, and Config Notes


Delete all documents from index

curl -X POST "localhost:9200/index_name/_delete_by_query" -H 'Content-Type: application/json' -d'
{
  "query": {
    "match_all": {}
  }
}'

Health / Stats

curl -XGET "http://localhost:9200/_cluster/stats?human&pretty"
curl -XGET "http://localhost:9200/_cat/shards?v"
curl -XGET "http://localhost:9200/_cat/indices?v"
curl -XGET "http://localhost:9200/_cat/allocation?v"

Get all

curl -H 'Content-Type: application/json' -XGET 'localhost:9200/index_name/_search' -d '
{
    "query" : {
        "match_all" : {}
    }
}'

Search for string across all fields

curl -H 'Content-Type: application/json' -XPOST 'localhost:9200/index_name/_search' -d '
{
    "query" : {
      "query_string": { "query": "heart" }
    }
}' | jq . | head -n25

Count documents

curl -H 'Content-Type: application/json' -XGET 'localhost:9200/index_name/_count'

Get mapping

curl -H 'Content-Type: application/json' -XGET 'localhost:9200/index_name/_mapping'

Get one record

curl -H 'Content-Type: application/json' -XGET "localhost:9200/index_name/index_name/MnjvwGkBD86Op0uG1ix5" | jq

Create one record

curl -X PUT "localhost:9200/index_name/index_name/1/_create" -H 'Content-Type: application/json' -d'
{
  "create": "2015/09/02"
}'

Sort output, search for non-empty values in a field whose name contains a space, and use jq to extract the values.

curl -H 'Content-Type: application/json' -XGET 'localhost:9200/index_name/_search' -d '{ "sort":[{"Scheduled Date.keyword" : {"order":"asc"}}], "query" : {"query_string" : {"query": "Scheduled\\ Date:/.*/"}}}'  | jq --raw-output '.hits.hits[]._source."Scheduled Date"'

Disable automatic date detection

curl -s -H 'Content-Type: application/json' -X PUT 'localhost:9200/index_name' -d '
{
  "mappings": {
    "$ELASTIC_DOC_TYPE": {
      "date_detection": false,
      "properties": {
        "orig": {
          "type": "text"
        }
      }
    }
  }
}'

Search by several fields

curl -H 'Content-Type: application/json' -XGET 'localhost:9200/index_name/_search' -d '
{
  "query": {
    "bool": {
      "must": [
        {
          "term": {
            "AccessionNumber.keyword": "123456789"
          }
        },
        {
          "term": {
            "SeriesNumber.keyword": "1"
          }
        },
        {
          "term": {
            "SeriesDescription.raw": "FMRI-AX"
          }
        },
        {
          "term": {
            "InstanceNumber.keyword": "71"
          }
        }
      ]
    }
  }
}'

Get All (actually retrieve all pages of results)

elasticdump \
  --input=https://elasticindex_names.ccm.sickkids.ca \
  --input-index=index_name \
  --output=$ \
  --searchBody='{"_source": ["Report"], "query" : {"match_all" : {} } }' \
| jq '{id:._id,Report:._source.Report}'

outputs:
{
  "id": "CnZVSWsBHWg-PhjNBSxI",
  "Report": "Flexion/extension viewso…"
}
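
As an alternative to elasticdump, the scroll API pages through all results with plain curl. A minimal sketch, assuming the default localhost setup used elsewhere in these notes; the scroll_id placeholder must be copied from the first response:

# Open a scroll context and fetch the first page
curl -s -H 'Content-Type: application/json' -XPOST 'localhost:9200/index_name/_search?scroll=1m' -d '
{
  "size": 1000,
  "query": { "match_all": {} }
}' | jq .

# Fetch the next page using the _scroll_id returned by the previous call;
# repeat until hits come back empty
curl -s -H 'Content-Type: application/json' -XPOST 'localhost:9200/_search/scroll' -d '
{
  "scroll": "1m",
  "scroll_id": "<_scroll_id from the previous response>"
}' | jq .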

Get one field multiple records

curl -s -H 'Content-Type: application/json' -XGET 'http://127.0.0.1:9200/index_name/index_name/_search' -d '
{
  "_source": "PatientID",
  "from": 1,
  "size": 5
}
' | jq ".hits.hits[]._source.PatientID"

Note: "from" is index position "size" is number of records

Increase the number of allowed fields (a.k.a. columns)

curl -H 'Content-Type: application/json' -XPUT 'localhost:9200/index_name/_settings' -d '
{
  "index.mapping.total_fields.limit": 100000
}'

List indexes

curl http://localhost:9200/_aliases?pretty=true

Info about an index

curl -s -X GET http://localhost:9200/index_name | jq

Count docs in an index

curl -s -X GET http://localhost:9200/index_name/index_name/_count | jq

Search with query stored in file

curl -v -H 'Content-Type: application/x-ndjson' -H 'Accept: application/json' -XPOST 'https://elasticindex_names.ccm.sickkids.ca/index_name/_msearch' --data-binary @data.json

# required file data.json (must have new line at end of file)
{"preference":"bodypart-list"}
{"query":{"match_all":{}},"highlight":{"pre_tags":["<mark>"],"post_tags":["</mark>"],"fields":{}},"size":0,"from":0,"aggs":{"BodyPartExamined.raw":{"terms":{"field":"BodyPartExamined.raw","size":100,"order":{"_count":"desc"}}}}}

Count number of fields on index

ubuntu@index_names:~$ curl -s -XGET localhost:9200/index_name/_mapping?pretty | grep type | grep text | wc -l
822

For each unique value in one field, count how many unique values there are in another field

curl -H 'Content-Type: application/json' -XPOST 'localhost:9200/index_name/_search' -d '{
  "aggs": {
    "count_index_names_by_modality": {
      "terms": {
        "field": "Modality.raw",
        "size": 20,
        "order": { "_count": "desc" }
      },
      "aggs": {
        "exam_count_per_modality": {
          "cardinality": {
            "field": "AccessionNumber.keyword"
          }
        }
      }
    }
  }
}' | jq

Get names of fields

curl -s -XGET localhost:9200/index_name/_mapping | jq .index_name.mappings.index_name.properties | jq 'keys'

Process names of fields in loop

for i in $(curl -s -XGET localhost:9200/index_name/_mapping | jq .index_name.mappings.index_name.properties | jq 'keys' | jq .[])
do
  echo "key: $i"
done

Count the not-null values for every field in the index

for FIELD_NAME in $(curl -s -XGET localhost:9200/index_name/_mapping | jq .index_name.mappings.index_name.properties | jq 'keys' | jq .[])
do
  NUM_NOT_NULL=$(curl -s -H 'Content-Type: application/json' -XGET 'http://127.0.0.1:9200/index_name/index_name/_search' -d '
  {
     "query" : {
        "constant_score" : {
           "filter" : {
              "exists" : {
                 "field" : '"$FIELD_NAME"'
              }
           }
        }
     }
  }' | jq .hits.total)
  echo "$FIELD_NAME: $NUM_NOT_NULL"
done | tee out.json

Count number of distinct values

curl -H 'Content-Type: application/json' -XGET 'http://127.0.0.1:9200/linking/linking/_search' -d '
{
    "size" : 0,
    "aggs" : {
        "distinct_orig" : {
            "cardinality" : {
              "field" : "orig.keyword"
            }
        }
    }
}' | jq

Note: "size": 0 here means "perform the aggregation without returning the matching documents themselves".

Get the list of unique values in a field and count the occurrences of each distinct value

curl -H 'Content-Type: application/json' -XGET 'http://127.0.0.1:9200/index_name/index_name/_search' -d '
{
    "size": 0,
    "aggs" : {
        "count_orig" : {
            "terms" : { "field" : "ProtocolName.keyword", "size": 2147483647}
        }
    }
}' | jq

Update a field for all records that match a query

es.update_by_query(index='index_name', doc_type='index_name', body={
  'query': {'term': {'AccessionNumber.keyword': 'FUJI95714'}},
  'script': {"inline": "ctx._source.A_new_attribute = 'NEWVALUE'"}}
)

Note: ".keyword" is important to guarantee an exact match, otherwise values are broken by the analyzer into term subsets, More info

Count the unique values across two fields

data.aggs = {
  "exam_count": {
    "cardinality": {
      "script": "doc['AccessionNumber.raw'].value + ' ' + doc['SeriesNumber.raw'].value"
    }
  }
};

Test an analyzer

curl -H 'Content-Type: application/json' -XGET 'http://127.0.0.1:9200/_analyze' -d '
{
  "analyzer": "standard",
  "text": "3-plane"
}' | jq .

resulting tokens: [3, plane]
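
For comparison, the built-in whitespace analyzer keeps the hyphenated token intact (a quick sketch against the same endpoint):

curl -H 'Content-Type: application/json' -XGET 'http://127.0.0.1:9200/_analyze' -d '
{
  "analyzer": "whitespace",
  "text": "3-plane"
}' | jq .

resulting token: [3-plane]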

Sorted Euclidean distance

curl -X GET "localhost:9200/index_name/_search" -H 'Content-Type: application/json' -d'
{
  "sort": [
    {
      "_script": {
        "type": "number",
        "script": {
          "lang": "painless",
          "source": "return Math.sqrt(Math.pow(Integer.parseInt(doc[\u0027Rows.keyword\u0027].value) - 499, 2) + Math.pow(Integer.parseInt(doc[\u0027Columns.keyword\u0027].value) - 499, 2))"
        },
        "order": "asc"
      }
    }
  ],
  "size": 8,
  "_source": ["Rows", "Columns",'"dicom]
}
' | jq .

Euclidean distance

curl -X PUT "localhost:9200/my_index/_doc/1" -H 'Content-Type: application/json' -d'
{
 "x1": 3.0,
 "y1": 3.0,
 "x2": 0.0,
 "y2": 0.0
}
'
curl -X GET "localhost:9200/my_index/_search" -H 'Content-Type: application/json' -d'
{
 "script_fields": {
   "my_doubled_field": {
     "script": {
       "lang":   "painless",
       "source":
       "return Math.sqrt(Math.pow(doc[\u0027x1\u0027].value - doc[\u0027x2\u0027].value, 2) + Math.pow(doc[\u0027y1\u0027].value - doc[\u0027y2\u0027].value, 2))"
     }
   }
 }
}'

Note: \u0027 means ' and is used to embed a quote inside a quoted string.

Explain Query

curl -v -H 'Content-Type: application/json' -X GET 'http://localhost:9200/index_name/index_name/DczUL2wBssoKtfgQuNfg/_explain/' -d '
{
     "query" : {
       "query_string" : {"query":"dcm"}
     }
}' | jq

Get one document for each unique value in a field

curl -H 'Content-Type: application/json' -XPOST 'localhost:9200/index_name/_search' -d '{
  "query" : {
    "query_string": { "query": "heart" }
  },
  "collapse" : {
    "field" : "AccessionNumber.raw"
  }
}' | jq '.hits.hits[]._source.AccessionNumber'

Note: The collapsing is done by selecting only the top sorted document per collapse key. For instance, the example in the official docs retrieves the best tweet for each user and sorts them by number of likes.

From https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-body.html#request-body-search-collapse
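
A hedged sketch combining collapse with an explicit sort, so the lowest SeriesNumber document wins per accession (beware that keyword sorts are lexicographic, so "10" sorts before "2"):

curl -H 'Content-Type: application/json' -XPOST 'localhost:9200/index_name/_search' -d '{
  "query": { "query_string": { "query": "heart" } },
  "collapse": { "field": "AccessionNumber.raw" },
  "sort": [ { "SeriesNumber.keyword": { "order": "asc" } } ]
}' | jq .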

Python test ElasticSearch connection

from elasticsearch import Elasticsearch
INDEX_NAME='index_name'
ELASTIC_IP='127.0.0.1'
ELASTIC_PORT=9200
es = Elasticsearch([{'host': ELASTIC_IP, 'port': ELASTIC_PORT}])
print(es.indices.exists(index=INDEX_NAME))


# Lookup dicom by path
result = es.search(
    index=INDEX_NAME,
    doc_type=DOC_TYPE,
    size=1,
    body={'query': {'term': {'filepath_orig.keyword': filepath_orig}}}
)

if result['hits']['total'] == 0:
    return

Python query

from elasticsearch import Elasticsearch

INDEX_NAME = 'index_name'
ELASTIC_IP = 'localhost'
ELASTIC_PORT = 9200

es = Elasticsearch([{'host': ELASTIC_IP, 'port': ELASTIC_PORT}])

query = {
  "query" : {
    "term" : { "filepath.keyword" : "/hpf/projects/file.txt" }
  }
}

# Or, to match everything:
query = {"query": {"match_all": {}}}

res = es.search(index=INDEX_NAME, body=query)

Query string query

from elasticsearch import Elasticsearch

INDEX_NAME = 'index_name'
ELASTIC_IP = '127.0.0.1'
ELASTIC_PORT = 9200

es = Elasticsearch([{'host': ELASTIC_IP, 'port': ELASTIC_PORT}])

query = {
    "query" : {
      "query_string": { "query": "heart" }
    }
}

res = es.search(index=INDEX_NAME, body=query)

Setup a multi-node cluster

https://www.elastic.co/guide/en/elasticsearch/guide/master/distributed-cluster.html
https://dzone.com/articles/elasticsearch-tutorial-creating-an-elasticsearch-c

elasticsearch.yml

cluster.name: "docker-cluster"
network.host: 0.0.0.0
discovery.zen.minimum_master_nodes: 2
discovery.zen.ping.unicast.hosts: ["127.0.0.1", "127.0.0.1", "127.0.0.1"]

Start cluster

docker run -d --name elasticsearch1 -p 9200:9200 -p 9300:9300 -v `pwd`/elasticsearch.yml:/usr/share/elasticsearch/config/elasticsearch.yml docker.elastic.co/elasticsearch/elasticsearch:6.7.1
docker run -d --name elasticsearch2 -p 9201:9200 -p 9301:9300 -v `pwd`/elasticsearch.yml:/usr/share/elasticsearch/config/elasticsearch.yml docker.elastic.co/elasticsearch/elasticsearch:6.7.1
docker run -d --name elasticsearch3 -p 9202:9200 -p 9302:9300 -v `pwd`/elasticsearch.yml:/usr/share/elasticsearch/config/elasticsearch.yml docker.elastic.co/elasticsearch/elasticsearch:6.7.1

ElasticSearch-Dump

Install elasticdump using the node package manager:

npm install elasticdump -g

Option 1: Download Metadata

Download all metadata associated with the index_names in your current search results. Run this command and wait for all metadata to be downloaded to the file output.json:

elasticdump \
  --input=https://elasticindex_names.ccm.sickkids.ca \
  --output=output.json \
  --searchBody='{"query":{"bool":{"must":[{"bool":{"must":[{"range":{"PatientAgeInt":{"gte":0,"lte":30,"boost":2}}}]}}]}}}'

Option 2: Download File Paths

Download the file path locations of all the index_names in your current search results. This also requires the jq tool. Run this command and wait for all the file paths to be downloaded to the file output.txt:

elasticdump \
  --input=https://elasticindex_names.ccm.sickkids.ca \
  --output=$ \
  --searchBody='{"query":{"bool":{"must":[{"bool":{"must":[{"range":{"PatientAgeInt":{"gte":0,"lte":30,"boost":2}}}]}}]}}}' \
| jq ._source.dicom_filepath | tee output.txt

Security

It IS possible to terminate SSL and set up (simple) authentication for the open-source version of Elasticsearch and/or Kibana completely for free; you just have to reverse proxy it with something like Nginx or Apache. It is however correct that if you'd like a nice UI and SSL termination directly on the standalone ES instance, you have to pay.

From https://stackoverflow.com/questions/33242197/is-elasticsearch-is-free-or-costly
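
A minimal Nginx sketch of that reverse-proxy approach (hostname and certificate paths are hypothetical; create the htpasswd file with e.g. htpasswd -c /etc/nginx/.htpasswd myuser):

# /etc/nginx/conf.d/elasticsearch.conf (assumed location)
server {
    listen 443 ssl;
    server_name es.example.com;                 # hypothetical hostname

    ssl_certificate     /etc/nginx/ssl/es.crt;  # assumed cert paths
    ssl_certificate_key /etc/nginx/ssl/es.key;

    location / {
        auth_basic           "Elasticsearch";
        auth_basic_user_file /etc/nginx/.htpasswd;
        proxy_pass           http://127.0.0.1:9200;
        proxy_set_header     Host $host;
    }
}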

You can use the free ReadonlyREST plugin to enable HTTP authentication, SSL, and ACLs: https://readonlyrest.com/free/

SearchGuard

For free Elasticsearch authentication, look up SearchGuard

ReactiveSearch Simple Custom Security Proxy

It’s also possible to secure your Elasticsearch cluster’s access with a middleware proxy server that is connected to ReactiveSearch. This allows you to set up custom authorization rules, prevent misuse, only pass back non-sensitive data, etc. Here’s an example app where we show this using a Node.JS / Express middleware:

• Proxy Server https://github.com/appbaseio-apps/reactivesearch-proxy-server/blob/master/index.js (can hand implement custom ACLs here)

• Proxy Client https://github.com/appbaseio-apps/reactivesearch-proxy-client/blob/master/src/App.js

Scripts

The scripting module enables you to use scripts to evaluate custom expressions. For example, you could use a script to return "script fields" as part of a search request or evaluate a custom score for a query.

https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-scripting.html

https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-scripting-using.html

https://www.elastic.co/guide/en/elasticsearch/painless/7.3/painless-walkthrough.html
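
For example, a custom score can be computed per document with a Painless script. A sketch using the 6.x function_score API (the 2x multiplier is arbitrary):

curl -X GET "localhost:9200/index_name/_search" -H 'Content-Type: application/json' -d'
{
  "query": {
    "function_score": {
      "query": { "query_string": { "query": "heart" } },
      "script_score": {
        "script": {
          "lang": "painless",
          "source": "_score * 2"
        }
      }
    }
  }
}' | jq .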

Nodes

Nodes are servers. Data can be sharded to split it up and increase performance. Replicas ensure availability in case of an outage and allow faster searching in parallel across different replicas. A replica duplicates shards on different nodes (servers).
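
Shard and replica counts are set per index at creation time; a quick sketch (5 shards / 1 replica were the 6.x defaults):

curl -X PUT "localhost:9200/index_name" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "number_of_shards": 5,
    "number_of_replicas": 1
  }
}'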

Query Performance Improvement Ideas

• By default, Elasticsearch will put 5 shards on one server. However, 5 servers each with one shard will be faster, mainly because of 5x the disk IO.

• The rule of thumb is that shards should consist of 20–40 GB of data.

• Store everything in RAM

"I was able to store the entire index to RAM by using the setting below while indexing the data but now the problem is the RAM usage by ElasticSearch is almost three times the size of the index.  Lucene will often use up to three times the size of the index due to the merging of existing segments."

store.type" : "memory"
gateway.type: fs

From https://discuss.elastic.co/t/how-to-move-the-whole-data-to-main-memory/8766/3

• Reduce amount of data

• Lower number of replicas from 1 (default) to 0. Usually, the setup that has fewer shards per node in total will perform better. The reason for that is that it gives a greater share of the available filesystem cache to each shard, and the filesystem cache is probably Elasticsearch’s number 1 performance factor. At the same time, beware that a setup that does not have replicas is subject to failure in case of a single node failure, so there is a trade-off between throughput and availability.

curl -H 'Content-Type: application/json' -XPUT http://$HOST_IP:$ELASTIC_PORT/$ELASTIC_INDEX/_settings -d '
{
    "index" : {
        "number_of_replicas" : 0
    }
}'

From https://www.elastic.co/guide/en/elasticsearch/reference/master/tune-for-search-speed.html#_warm_up_the_filesystem_cache

• More specific queries to reduce number of fields. A common technique to improve search speed over multiple fields is to copy their values into a single field at index time, and then use this field at search time. This can be automated with the copy-to directive of mappings without having to change the source of documents. Here is an example: https://www.elastic.co/guide/en/elasticsearch/reference/master/tune-for-search-speed.html
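
A hedged sketch of that copy-to technique (field names are hypothetical; on 6.x the properties block sits under the doc type):

curl -X PUT "localhost:9200/new_index" -H 'Content-Type: application/json' -d'
{
  "mappings": {
    "properties": {
      "StudyDescription":  { "type": "text", "copy_to": "all_text" },
      "SeriesDescription": { "type": "text", "copy_to": "all_text" },
      "all_text":          { "type": "text" }
    }
  }
}'

Then search the single all_text field instead of listing every source field in the query.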

• Harder: Pre-load elastic data into memory. From https://www.elastic.co/guide/en/elasticsearch/reference/master/_pre_loading_data_into_the_file_system_cache.html#_pre_loading_data_into_the_file_system_cache
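
A sketch of the index.store.preload setting described on that page (the file extensions to preload are workload-dependent; this is a static setting, so set it at index creation or on a closed index):

curl -X PUT "localhost:9200/new_index" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "index.store.preload": ["nvd", "dvd"]
  }
}'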
