@purbon
Last active May 27, 2019 09:43
Handling wrong date formats in Elasticsearch

Options to handle wrong date formats in Elasticsearch

The first option maps the date field without any special handling of incorrect values. In this case, every date that does not conform to the Joda-based default format (ISO 8601) is reported as incorrect and an exception is thrown.

DELETE kafka

PUT kafka
{
    "settings" : {
        "number_of_shards" : 1
    },
    "mappings" : {
        "_doc" : {
            "properties" : {
                "content" : { "type" : "text" },
                "date": { "type": "date" }
            }
        }
    }
}
GET kafka/_mapping

This document will be processed, as the date is in a correct ISO format.

PUT kafka/_doc/1
{ "date": "2015-01-01" }

This document will fail, as an empty string is not a valid date.

PUT kafka/_doc/2
{ "date": "" }

Strings containing only spaces will fail as well.

PUT kafka/_doc/3
{ "date": " " }
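
Since the plain mapping rejects such values with an exception, it can help to check dates on the client before indexing. The following is a minimal Python sketch under stated assumptions, not Elasticsearch's own parser: the helper name `is_valid_iso_date` is hypothetical, and it only approximates the `yyyy-MM-dd` subset of what the default date mapping accepts.

```python
import re
from datetime import datetime

# Hypothetical helper: approximates a strict yyyy-MM-dd check on the client.
# It does not reproduce every format the default date mapping accepts.
_ISO_DATE = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")


def is_valid_iso_date(value):
    """Return True only for strings shaped like 2015-01-01 that name a real day."""
    if not isinstance(value, str) or not _ISO_DATE.fullmatch(value):
        return False
    try:
        datetime.strptime(value, "%Y-%m-%d")  # rejects e.g. 2015-02-30
    except ValueError:
        return False
    return True


# The three values used in documents 1 to 3 above.
for candidate in ["2015-01-01", "", " "]:
    print(repr(candidate), is_valid_iso_date(candidate))
```

Only the first candidate passes the check, which mirrors which of the three documents above gets indexed.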

Another option is to allow for null values. This could be used as follows.

DELETE kafka_with_null_value

In this mapping we introduce two things: a specific format for the dates, which includes a special format that accepts a blank string (a single space, which is clearly wrong as a date), and a null_value that acts as the default for null dates.

PUT kafka_with_null_value
{
    "settings" : {
        "number_of_shards" : 1,
        "number_of_replicas": 0
    },
    "mappings" : {
        "_doc" : {
            "properties" : {
                "content" : { "type" : "text" },
                "date": { 
                  "type": "date", 
                  "format": "yyyy-MM-dd||' '",
                   "null_value": "1900-01-01"
                }
            }
        }
    }
}
GET kafka_with_null_value/_mapping

This document will succeed, as the date is in a valid ISO format.

PUT kafka_with_null_value/_doc/1
{ "date": "2015-01-01" }

Creating this document will succeed as well, since we specifically allowed a blank string as a date format (which is wrong by definition).

PUT kafka_with_null_value/_doc/2
{ "date": " " }

If, for example, a pre-processing step converted all empty strings to null, then with the current mapping this null would be translated to the default value.

PUT kafka_with_null_value/_doc/3
{ "date": null }
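
The pre-processing step mentioned above could look roughly like the following Python sketch. The function name `blank_to_null` and the default field name are assumptions made for illustration; the only behavior relied on is that Python's `None` serializes to JSON `null`, which is what lets the mapping's null_value apply.

```python
import json


def blank_to_null(doc, field="date"):
    """Return a copy of doc where a blank string in `field` becomes None."""
    cleaned = dict(doc)  # shallow copy, the input document is left untouched
    value = cleaned.get(field)
    if isinstance(value, str) and value.strip() == "":
        cleaned[field] = None  # serialized as JSON null, so null_value applies
    return cleaned


# A blank date becomes null; a proper date is passed through unchanged.
print(json.dumps(blank_to_null({"date": " "})))
print(json.dumps(blank_to_null({"date": "2015-01-01"})))
```

The resulting JSON can then be sent as the document body, in the same way as document 3 above.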

Searching for the blank date will return only document two.

GET kafka_with_null_value/_search
{
  "query": {
    "match": {
      "date": " "
    }
  }
}

Searching for the default value will return the documents whose date is null. If the default value were now, we could, for example, create documents where a null date becomes the current date.

GET kafka_with_null_value/_search
{
  "query": {
    "match": {
      "date": "1900-01-01"
    }
  }
}

The important takeaway from this case is deciding how I would like my wrong values to be searchable. We should always aim for our indices to contain reasonably formatted values and keep this target in mind. Another option would be to use ignore_malformed; in that case the mapping will look like:

PUT kafka_with_malformed_value
{
    "settings" : {
        "number_of_shards" : 1,
        "number_of_replicas": 0
    },
    "mappings" : {
        "_doc" : {
            "properties" : {
                "content" : { "type" : "text" },
                "date": { 
                  "type": "date", 
                  "ignore_malformed": true
                }
            }
        }
    }
}

In that case all document creations will succeed: setting ignore_malformed to true allows the exception to be ignored. The malformed field is not indexed, but the other fields in the document are processed normally. This means wrong dates will not be searchable, but the document will be processed accordingly.

Recommendations

The use of ignore_malformed is acceptable for many use cases, bearing in mind that the option will not make the ignored fields (the ones without the correct date format) searchable. However, in some use cases, especially those that depend heavily on the date field being operational, a user might prefer a different approach.

In the end, which option to choose depends on what you want to do with this field later on in your application logic.

  • If it is a mandatory field, use null values and enforce a proper date format.
  • If it is an optional field, you can either:
      • use the ignore_malformed option, or
      • pre-process the documents.
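
As a rough illustration of the pre-processing route, the Python sketch below treats the two cases differently. Everything in it is an assumption for illustration: `prepare_document` and `_looks_like_date` are hypothetical names, the date check only approximates `yyyy-MM-dd`, and dropping the field is merely a client-side analogue of ignore_malformed, not the option itself.

```python
import re
from datetime import datetime

_ISO_DATE = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")


def _looks_like_date(value):
    """Approximate yyyy-MM-dd check; not Elasticsearch's own parser."""
    if not isinstance(value, str) or not _ISO_DATE.fullmatch(value):
        return False
    try:
        datetime.strptime(value, "%Y-%m-%d")
    except ValueError:
        return False
    return True


def prepare_document(doc, mandatory, field="date"):
    """Return a cleaned copy of doc, ready to be indexed."""
    cleaned = dict(doc)  # shallow copy, the input document is left untouched
    if _looks_like_date(cleaned.get(field)):
        return cleaned
    if mandatory:
        # Mandatory field: send an explicit null so the mapping's null_value applies.
        cleaned[field] = None
    else:
        # Optional field: drop the bad value, so the field is simply not indexed.
        cleaned.pop(field, None)
    return cleaned


print(prepare_document({"content": "a", "date": " "}, mandatory=True))
print(prepare_document({"content": "a", "date": " "}, mandatory=False))
```

Doing this on the client keeps the index mapping strict, at the cost of maintaining the date check outside Elasticsearch.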

I hope this is of help,

-- Pere
