@newtonapple
Last active January 18, 2017 04:30
Elasticsearch Notes

Elasticsearch runs on:

doc-search01.lo:19200

Create or update a record

curl -XPUT  "doc-search01.lo:19200/megacorp/employee/5" -d '
{
    "first_name" :  "Jane",
    "last_name" :   "Smith",
    "age" :         32,
    "about" :       "I like to collect rock albums",
    "interests":  [ "music" ]
}'

Get the record

curl -XGET  "doc-search01.lo:19200/megacorp/employee/2"

Simple search

Lists first 10 entries:

http://doc-search01.lo:19200/megacorp/employee/_search

Lists first 1000 entries:

http://doc-search01.lo:19200/megacorp/employee/_search?size=1000

Lists all entries with "Smith":

http://doc-search01.lo:19200/megacorp/employee/_search?q=Smith

Lists all entries with "music" in interests:

http://doc-search01.lo:19200/megacorp/employee/_search?q=interests:music

More complex search with the search DSL

You can perform the same interests/music search using the DSL. You just need to pass a JSON request body:

curl -XGET  "doc-search01.lo:19200/megacorp/employee/_search" -d '
  {
      "query" : {
          "match" : {
              "interests" : "music"
          }
      }
  }
'

It's more verbose! Why use it? Because the DSL can express much richer queries than a query string, like the full-text and phrase queries below.

Full text search

Try this:

curl -XGET  "doc-search01.lo:19200/megacorp/employee/_search?pretty=1" -d '
{
    "query" : {
        "match" : {
            "about" : "rock climbing"
        }
    }
}
'

It will match people who have "rock", "climbing", or "rock climbing" in their about section, sorted by relevance. Nice!

If you want to search for "rock climbing" exact match, use "match_phrase":

GET /megacorp/employee/_search
{
    "query" : {
        "match_phrase" : {
            "about" : "rock climbing"
        }
    }
}

The match query is the go-to query—the first query that you should reach for whenever you need to query any field. It is a high-level full-text query, meaning that it knows how to deal with both full-text fields and exact-value fields.

Controlling full text matches

By default, if you search for "brown dog", ES will return docs that contain "brown" OR "dog". You can change that to an AND, so it only returns docs that contain both "brown" AND "dog": http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/match-multi-word.html#match-improving-precision

Or you can specify that "at least 1/2 the search words should be in a doc": http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/match-multi-word.html#match-precision
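From those two pages, the knobs are the operator and minimum_should_match parameters on the match query. A sketch, reusing the megacorp "about" field as a stand-in:

```
GET /megacorp/employee/_search
{
    "query": {
        "match": {
            "about": {
                "query":    "brown dog",
                "operator": "and"
            }
        }
    }
}

GET /megacorp/employee/_search
{
    "query": {
        "match": {
            "about": {
                "query":                "quick brown dog",
                "minimum_should_match": "75%"
            }
        }
    }
}
```

The first returns only docs containing both words; the second requires at least 75% of the terms to match.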

Suppose you don't want the full stored doc

You can retrieve just the title and author of an indexed book with:

GET /books_test/author/_search?_source=title,author

Or if you just want the indexed content with none of the metadata (like "found", "_index", "_type" etc):

GET /books_test/author/1/_source

Updating docs

A doc is immutable. So to "update" it, you just PUT a new version of that doc. What if you just want to make an incremental change to the existing doc? Elasticsearch has an API for that:

http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/partial-updates.html

But behind the scenes, it is just grabbing the current doc data, changing it, and then PUTting that as a new version.

If you do a partial update like this:

POST /website/blog/1/_update
{
   "doc" : {
      "tags" : [ "testing" ],
      "views": 0
   }
}

It will merge this new data with the old data. You can also run some code IN ELASTICSEARCH to make the update:

http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/partial-updates.html#_using_scripts_to_make_partial_updates
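From that page, a scripted partial update runs a small script against the stored doc, e.g. the counter-increment example:

```
POST /website/blog/1/_update
{
   "script" : "ctx._source.views+=1"
}
```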

Retrieving multiple docs

You can get multiple docs if you know their ids:

http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/_retrieving_multiple_documents.html
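The shape of a multi-get (_mget) request, per that page (the website/blog index and ids are the docs' example values, not ours):

```
GET /_mget
{
   "docs" : [
      { "_index" : "website", "_type" : "blog",      "_id" : 2 },
      { "_index" : "website", "_type" : "pageviews", "_id" : 1 }
   ]
}
```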

Maybe there is a nice way to search several indexes at once? YUP!

/gb,us/_search

search all types in the gb and us indices

/g*,u*/_search

search all types in any indices beginning with g or beginning with u

Bulk indexing

How big is too big?

http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/bulk.html#_how_big_is_too_big

They recommend batches of 1k - 5k to start, or around 5 - 15mb in size.
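The bulk API itself takes newline-delimited action/metadata lines, each followed by its document body (ids here are hypothetical). Every line, including the last, must end with a newline:

```
POST /_bulk
{ "index": { "_index": "megacorp", "_type": "employee", "_id": "6" }}
{ "first_name": "John", "last_name": "Doe", "age": 25, "interests": [ "sports" ] }
{ "delete": { "_index": "megacorp", "_type": "employee", "_id": "2" }}
```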

Turn off refreshes completely when bulk indexing:

http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/indices-update-settings.html#bulk
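Per that page, you disable refreshes by setting refresh_interval to -1 on the index, then restore it after the bulk load (index name hypothetical):

```
PUT /my_index/_settings
{ "refresh_interval": -1 }

PUT /my_index/_settings
{ "refresh_interval": "1s" }
```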

Mapping

You specify a mapping for a type. You can think of it as:

  • index = database
  • type = table
  • mapping = schema for table
  • document = row in a table

An index has multiple documents. A document has a type. Each document has one or more fields. For example, you might have an index "scribd". That might have types word_document and user. A document of type word_document might have a title and an author. A user might have a name and a login.

You don't need to specify a mapping at all. If you don't, ES will figure it out automatically: http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/dynamic-mapping.html

But it might be useful for you to specify a mapping, like this:

"name": {
    "type":     "string",
    "analyzer": "whitespace"
}

"name" is a field in your document. This says that name is a string, and before you index name into the inverted index, analyze it with the whitespace analyzer.
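In context, that field definition sits under a type's "properties" when you create the index. A sketch, with my_index and the user type as hypothetical names:

```
PUT /my_index
{
    "mappings": {
        "user": {
            "properties": {
                "name": {
                    "type":     "string",
                    "analyzer": "whitespace"
                }
            }
        }
    }
}
```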

Here are the types you can have: http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/mapping-intro.html

IMPORTANT!

Quin strongly recommends that you only have one type per index. You can run into issues if you have multiple types with the same field name on the same index: http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/mapping.html#_avoiding_type_gotchas

According to the ES documentation itself:

To ensure that you don't run into these conflicts, it is advisable to ensure that fields with the same name are mapped in the same way in every type in an index.

Specify index settings and mappings on creation

Create an index with the settings and mappings:

PUT /my_index
{
    "settings": { ... any settings ... },
    "mappings": {
        "type_one": { ... any mappings ... },
        "type_two": { ... any mappings ... },
        ...
    }
}

No replicas

Turn off replicas and only one shard for an index:

PUT /index_name
{
    "settings": {
        "number_of_shards" :   1,
        "number_of_replicas" : 0
    }
}

What is stemming?

"fox" and "foxes" are basically the same thing. When you store these terms in your inverted index, you don't want to store both separately. Instead, "foxes" can be stemmed, i.e. reduced to its root form, which is "fox".

What is an analyzer?

Take this phrase: "the lord of the rings". We need to store this in our inverted index. An analyzer will:

  1. tokenize this phrase (typically ["the", "lord", "of", "the", "rings"]). These are called tokenizers.
  2. normalize each token (lowercase, remove whitespace, stemming, remove stopwords like "a" and "the"). These are called token filters.

Then you can put those normalized tokens in your inverted index. ES contains some built-in analyzers. It also contains a bunch of tokenizers and token filters that you can mix and match to create custom analyzers.
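A sketch of wiring a tokenizer and token filters into a custom analyzer at index-creation time (my_index and my_analyzer are hypothetical names):

```
PUT /my_index
{
    "settings": {
        "analysis": {
            "analyzer": {
                "my_analyzer": {
                    "type":      "custom",
                    "tokenizer": "standard",
                    "filter":    [ "lowercase", "stop" ]
                }
            }
        }
    }
}
```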

Built-in analyzers: http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/analysis-intro.html#_built_in_analyzers

Language-specific analyzers are interesting too:

Language-specific analyzers are available for many languages. They are able to take the peculiarities of the specified language into account. For instance, the english analyzer comes with a set of English stopwords (common words like and or the that don't have much impact on relevance), which it removes, and it is able to stem English words because it understands the rules of English grammar.

Testing Analyzers

GET /_analyze?analyzer=standard
Text to analyze

Will show you how "Text to analyze" will get analyzed by the standard analyzer.

The _all field

In Search Lite we introduced the _all field: a special field that indexes the values from all other fields as one big string. The query_string query clause (and searches performed as ?q=john) defaults to searching in the _all field if no other field is specified.

Search all fields at once:

GET /_search
{
    "query": {
        "match": {
            "_all": "john smith marketing"
        }
    }
}

Set timeouts for search queries!

The timed_out value tells us whether the query timed out or not. By default, search requests do not time out. If low response times are more important to you than complete results, you can specify a timeout as 10 or "10ms" (10 milliseconds), or "1s" (1 second):

GET /_search?timeout=10ms

Warning: timeout is not a circuit breaker

It should be noted that this timeout does not halt the execution of the query, it merely tells the coordinating node to return the results collected so far and to close the connection. In the background, other shards may still be processing the query even though results have been sent.

Use the timeout because it is important to your SLA, not because you want to abort the execution of long running queries.

Good work Elasticsearch.

"Give me all the docs that are NOT available in the US"

{
    "bool": {
        "must_not": { "match": { "geo": "US" }}
    }
}

In general, you have must, must_not, and should (i.e. nice-to-have):

"bool": {
    "must":     { "match": { "tweet": "elasticsearch" }},
    "must_not": { "match": { "name":  "mary" }},
    "should":   { "match": { "tweet": "full text" }}
}

The bool filter is used to combine multiple filter clauses using Boolean logic. It accepts three parameters:

For exact matches:

  • must These clauses must match, like and
  • must_not These clauses must not match, like not
  • should At least one of these clauses must match, like or

For full text matches, should is instead:

  • should If these clauses match, then they increase the _score, otherwise they have no effect. They are simply used to refine the relevance score for each document.
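Wrapped in a full search request, the bool fragment above looks like:

```
GET /_search
{
    "query": {
        "bool": {
            "must":     { "match": { "tweet": "elasticsearch" }},
            "must_not": { "match": { "name":  "mary" }},
            "should":   { "match": { "tweet": "full text" }}
        }
    }
}
```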

SEE THE MOST IMPORTANT QUERIES AND FILTERS HERE:

http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/_most_important_queries_and_filters.html

Query vs filter

query = full text search, inexact (how WELL does this match?)
filter = exact match (does "2014-06-01" match "2014-06-02" EXACTLY?)

Use query for text search, use filter for geo

As a general rule, use query clauses for full text search or for any condition that should affect the relevance score, and use filter clauses for everything else.

Filters are important because they are very, very fast. Filters do not calculate relevance (avoiding the entire scoring phase) and are easily cached.

"you should use filters as often as you can"

Filters could be used for ISBNs? What else? If you use a filter for text, make sure you read this caveat: http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/_finding_exact_values.html#_term_filter_with_text

searching with filters in depth: http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/_finding_exact_values.html
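A sketch of a term filter inside a filtered query, as in those pages (the isbn field and value are hypothetical):

```
GET /_search
{
    "query": {
        "filtered": {
            "query":  { "match_all": {} },
            "filter": { "term": { "isbn": "0553573403" }}
        }
    }
}
```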

Caching/internals

Internally, when you filter docs, the results are cached in a bitset in memory: http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/_finding_exact_values.html#_internal_filter_operation

When you search, filters are applied before full-text searches. ES gets all the docs that matched using these bitsets. So it is able to narrow down the # of docs it has to run a full-text search on.

Updates to these bitsets happen incrementally as you add documents to your index.

Sorting

You can ask ElasticSearch to sort the results:

GET /_search
{
    "query" : {
        "filtered" : {
            "filter" : { "term" : { "user_id" : 1 }}
        }
    },
    "sort": { "date": { "order": "desc" }}
}

Or sort by date, then sort by the _score that elasticsearch automatically assigns the results:

"sort": [
    { "date":   { "order": "desc" }},
    { "_score": { "order": "desc" }}
]

Weighting by relevance

Suppose you are searching for "lord of the rings". Documents that have all 4 words will be given more weight than docs that only have 2 of the 4 words. This is called "coordination": http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/practical-scoring-function.html#coord

Boosting fields

When you search, you can boost a particular field like "weight author match heavier than translator match here": http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/multi-query-strings.html#prioritising-clauses

Not sure if useful for us since we will most likely just do a full-text search with _all.

You can also boost matches with a "should" clause: e.g. "must" match "lord of the rings", "should" match "tolkien". Books with "tolkien" in them will then be ranked higher.

You can also specify how much to boost:

{ "match": {
    "content": {
        "query": "Lucene",
        "boost": 2
    }
}}

Prefix query

Looks like:

GET /my_index/books/_search
{
    "query": {
        "prefix": {
            "title": "the lor"
        }
    }
}

finds all books with titles beginning with "the lor".

The prefix query is a low-level query that works at the term level. It doesn't analyze the query string before searching. It assumes that you have passed it the exact prefix that you want to find.

By default, the prefix query does no relevance scoring. It just finds matching documents and gives them all a score of 1. Really, it behaves more like a filter than a query. The only practical difference between the prefix query and the prefix filter is that the filter can be cached.

This explains how a prefix query works:

http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/prefix-query.html

IMPORTANT:

the shorter the prefix, the more terms need to be visited. If we were to look for the prefix W instead of W1, perhaps we would match 10 million terms instead of just one million.

You can also use "match_phrase_prefix":

{
    "match_phrase_prefix" : {
        "brand" : "johnnie walker bl"
    }
}
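Since the final term expands into many possible completions, the guide recommends capping the expansion with max_expansions:

```
{
    "match_phrase_prefix" : {
        "brand" : {
            "query":          "johnnie walker bl",
            "max_expansions": 50
        }
    }
}
```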

Slop

http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/slop.html

Not sure if this is useful for us. Suppose you set slop to 2. Now instead of searching for "lord of the rings", you can search for "of lord the rings". Slop just makes it so that even if words are out of order, it will still match. The higher the slop, the more words can be out of order.
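A sketch of the syntax (the title field is hypothetical): slop goes on a match_phrase query alongside the query string.

```
GET /_search
{
    "query": {
        "match_phrase": {
            "title": {
                "query": "lord of the rings",
                "slop":  2
            }
        }
    }
}
```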

ngrams

Suppose you are searching for "the lor". You can't use a full-text search, since that would search your inverted index for "the" and "lor". "lor" wouldn't match anything. So you can do a prefix search for "the lor" and that will match terms in your index that START WITH "lor". But prefix search is slower! You have to SCAN your index! What's an alternative?

Convert your input into edge ngrams: http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/_ngrams_for_partial_matching.html

And then do a full-text search on that.

Here's the walkthrough: http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/_index_time_search_as_you_type.html
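That walkthrough boils down to defining an edge_ngram token filter and wiring it into a custom analyzer (names like autocomplete are the guide's example names):

```
PUT /my_index
{
    "settings": {
        "analysis": {
            "filter": {
                "autocomplete_filter": {
                    "type":     "edge_ngram",
                    "min_gram": 1,
                    "max_gram": 20
                }
            },
            "analyzer": {
                "autocomplete": {
                    "type":      "custom",
                    "tokenizer": "standard",
                    "filter":    [ "lowercase", "autocomplete_filter" ]
                }
            }
        }
    }
}
```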

IMPORTANT!

If you index something with your custom edge n-gram analyzer, make sure the search query you use later specifies the "standard" analyzer! Otherwise it will run the edge n-gram analyzer on your search query too! Search for "The name:f condition is satisfied by the second document".
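The guide's example of overriding the analyzer at search time (my_index/my_type and the name field are its example names):

```
GET /my_index/my_type/_search
{
    "query": {
        "match": {
            "name": {
                "query":    "brown fo",
                "analyzer": "standard"
            }
        }
    }
}
```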

ES docs suggest using the completion suggester:

This data structure lives in memory and makes prefix lookups extremely fast, much faster than any term-based query could be. It is an excellent match for autocompletion of names and brands, whose words are usually organized in a common order: "Johnny Rotten" rather than "Rotten Johnny."

When word order is less predictable, edge n-grams can be a better solution than the completion suggester. This particular cat may be skinned in myriad ways.

Tokenizer vs token filter

Given a string like "lord of the rings",

Tokenizer: converts it into tokens: ["lord", "of", "the", "rings"]
Token filter: runs a filter on each one of those tokens, like "lowercase" or "n-gram"

Testing analyzers

After you create your own custom analyzer, you can test how it will perform on a string (i.e. what it will index):

curl -XGET 'doc-search01.lo:19200/test/_analyze?analyzer=your_analyzer&pretty' -d 'Foo Bar'

(where your_analyzer is a placeholder for the name of your custom analyzer)

Things to try

Benchmark the following on speed and quality of results:

  • use a prefix search. Advantages: built-in solution, no special indexing required. Disadvantages: could be slow!

  • use full-text search with an edge n-gram analyzer. Advantages: hopefully FAST to query since it does exact match. Disadvantages: index size could be large, indexing will take time.

  • use the completion suggester I was using. Advantages: supposedly blazing fast. Disadvantages: inflexible, indexing takes a lot of time, we have to compute every possible input ourselves.

On inflexibility:

No filtering, or advanced queries. In many, and perhaps most, autocomplete applications, no advanced querying is required. Let's suppose, however, that I only want auto-complete results to conform to some set of filters that have already been established (by the selection of category facets on an e-commerce site, for example). There is no way to handle this with completion suggest. If I need more advanced querying capabilities I will need to set up a different sort of autocomplete system.

Monitoring

GET _cluster/health
GET _cluster/health?level=indices
GET _cluster/health?level=shards

Other resources
