@yuriybash
Created August 24, 2018 18:43

Elasticsearch Tips & Tricks, Part 2: Risks of Using Dynamic Mapping

In my previous post, I discussed speeding up Elasticsearch migrations and reindexing operations. In this one, I'll discuss one of the most important parts of using ES effectively: mappings.

Mappings define the structure of, and fields in, documents in a given index. All fields have a mapping, and this mapping determines both how the field gets stored, and how it is indexed.

Elasticsearch is able to infer the type of field it should use based on the data, but this carries several risks one should be aware of. This post describes some of those risks.

tl;dr - don't use dynamic mappings in prod; see the bottom of this post for how to turn them off.

Risk #1: Incorrect inference of data type

By default, when you index a document with a new field whose type is not explicitly set, ES detects the type of field and adds that field to its mapping table. For example, if you add the following document:

https://gist.github.com/a858b77c993b48554b21173308bb20c6

it infers the field type and creates a mapping that will look something like this:

https://gist.github.com/47fc34e13d6c9b1a565b746d376be438
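(The embedded gists above aren't reproduced here; the following is a minimal sketch of the idea, using hypothetical field names and Kibana console-style requests - exact syntax varies slightly across ES versions.)

```
PUT my_index/_doc/1
{
  "user_name": "jane",
  "login_count": 7,
  "signup_date": "2018-08-24"
}

GET my_index/_mapping
```

The dynamically generated mapping would look roughly like:

```
{
  "properties": {
    "user_name": {
      "type": "text",
      "fields": {
        "keyword": { "type": "keyword", "ignore_above": 256 }
      }
    },
    "login_count": { "type": "long" },
    "signup_date": { "type": "date" }
  }
}
```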

and something similar for the other fields. This is called dynamic mapping, and while it can be a useful feature, it poses two problems: incorrect field types and mapping explosions.

In any production environment, using dynamic mappings is probably too risky: there is a significant chance ES will infer the type incorrectly, and you'll face problems down the line. For example, let's say you add a document with a new field, conversion_factor, that looks like this:

https://gist.github.com/d2e51ebcc2eae1ac450265fbcb370b75

A half_float offers sufficient precision and ES could conceivably save it as that, though it's more likely to save it as a float instead. You end up with a mapping that looks like this:

https://gist.github.com/80f8257b007342d5c97d14342c9bea06
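(Again, a sketch rather than the original gist: dynamic mapping maps JSON floating-point values to float, so the hypothetical conversion_factor field ends up with a mapping fragment like the one below. Declaring the mapping yourself, you could instead pick half_float to save space, or double if you genuinely need the extra range.)

```
{
  "properties": {
    "conversion_factor": { "type": "float" }
  }
}
```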

If you then try to save a subsequent document with a conversion_factor value larger than roughly 3.4e+38 (the maximum a 32-bit float can represent), you will (hopefully) get an error because it does not fit. Or, worse, the value is saved but silently loses precision, and you don't even notice the problem.

Risk #2: Mapping explosion

Let's say the data you are indexing does not always follow the same schema. For example, this is document #1:

https://gist.github.com/b3367bda99dbb204e00af481193d9e25

and this is document #2:

https://gist.github.com/b73c5d4cafc3d324e5df3560bdc49ace

This is dangerous. The first two fields are the same in both documents, but the third field is unique to each, so ES will create a separate mapping entry for each one. If more documents come in, each with its own custom field, every new field adds another entry, the size of your mappings will explode, ES may start OOM'ing, and all sorts of other problems will follow.
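To illustrate the shape of the problem with hypothetical documents (not the ones from the gists above): each document below shares user_id and created_at but introduces its own one-off field, and each one-off field becomes a permanent entry in the index mapping.

```
{ "user_id": 1, "created_at": "2018-08-24", "utm_campaign_summer": "clicked" }
{ "user_id": 2, "created_at": "2018-08-25", "referral_code_xyz": "FRIEND10" }
```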

What you probably want to do instead is add documents in the following format:

https://gist.github.com/c78406e5d6723b0d713ce6fe510fc491

and

https://gist.github.com/6c959c2dabcebf9e957ff8107cef0312

this will create only two new fields in the mapping registry, something along the lines of:

https://gist.github.com/72287ae4ea92067c6272e9ffc6607f0d
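One common way to do this (the author's exact structure is in the gists above; treat the names below as illustrative) is to fold the variable part into a fixed pair of fields, e.g. custom_field holding the name and custom_field_value holding the value:

```
{
  "user_id": 1,
  "created_at": "2018-08-24",
  "custom_field": "utm_campaign_summer",
  "custom_field_value": "clicked"
}
```

No matter how many distinct custom fields your documents carry, the mapping only ever gains these two entries, which you would typically declare as keyword so they can be aggregated on:

```
{
  "properties": {
    "custom_field":       { "type": "keyword" },
    "custom_field_value": { "type": "keyword" }
  }
}
```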

By default, index.mapping.total_fields.limit caps the total number of fields in an index at 1000. Note that you can still have nested fields (index.mapping.depth.limit allows objects up to 20 levels deep by default), so this is something to watch out for.
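Both limits are ordinary index settings, so they can be raised if you genuinely need more fields - though hitting the default is usually a symptom of the mapping-explosion problem above rather than a reason to raise it. A sketch:

```
PUT my_index/_settings
{
  "index.mapping.total_fields.limit": 2000
}
```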

Risk #3: Inability to use ES aggregations

This risk is a bit more subtle, but is an extension of Risk #2.

One of my personal favorite features in Elasticsearch is aggregations. Aggregations let you bucket your data by any field (and sub-fields), and you can run several of them in series. We used them extensively at Percolate for collecting metrics on social data.

An aggregation query typically looks something like:

https://gist.github.com/dfb893f42b7d782fe6695b41fe186502

and you get a result that looks something like:

https://gist.github.com/2dd8112a517c42f31c110727b8887a1f
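For reference (a sketch, not the original gists), a terms aggregation on the age field referenced below would look roughly like:

```
GET my_index/_search
{
  "size": 0,
  "aggs": {
    "group_by_age": {
      "terms": { "field": "age" }
    }
  }
}
```

and the response comes back shaped something like:

```
{
  "aggregations": {
    "group_by_age": {
      "buckets": [
        { "key": 31, "doc_count": 12 },
        { "key": 27, "doc_count": 8 }
      ]
    }
  }
}
```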

This is a really useful feature for analyzing your data.

The problem is, if you have many fields (as described in Risk #2), it becomes difficult to run aggregations because you have no a priori knowledge of what the fields are - i.e. in "terms" : { "field" : "age" }, you don't know ahead of time what to put in place of age.

If, on the other hand, you use the mapping technique described above, you can run an aggregation with "terms" : { "field" : "custom_field" } and receive a response in this form:

https://gist.github.com/3801db4399ab478e1c7f3b97d2fe3e41
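With the key/value structure sketched earlier (hypothetical field names again), that aggregation buckets documents by the name of their custom field:

```
GET my_index/_search
{
  "size": 0,
  "aggs": {
    "by_custom_field": {
      "terms": { "field": "custom_field" }
    }
  }
}
```

The bucket keys are then the custom field names themselves (e.g. "utm_campaign_summer", "referral_code_xyz"), each with its document count.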

Turning off dynamic mapping

Field mappings can't be updated once they are created - so avoid the problem altogether by turning dynamic mapping off.

Dynamic mapping can be turned off at the document and object level by setting dynamic to false or strict. false merely ignores new fields (they are not added to the mapping or indexed), which may result in silent data loss - so set it to strict instead.
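A minimal sketch of creating an index with dynamic mapping turned off (in 6.x the mappings block is additionally nested under a type name):

```
PUT my_index
{
  "mappings": {
    "dynamic": "strict",
    "properties": {
      "user_id":    { "type": "long" },
      "created_at": { "type": "date" }
    }
  }
}
```

With "dynamic": "strict", indexing a document that contains any field not declared in properties fails outright instead of being silently accepted.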

Defining mappings explicitly

Instead, define your mappings explicitly. There are a few subtleties here; for more information, see the Elasticsearch mapping documentation.
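As a sketch (the path syntax differs slightly by version; 6.x expects the mapping type in the URL), adding an explicitly typed field to an existing index looks like:

```
PUT my_index/_mapping
{
  "properties": {
    "conversion_factor": { "type": "half_float" }
  }
}
```

New fields can be added this way at any time; it's only changing the type of an existing field that requires reindexing.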
