purbon/dates-in-elasticsearch.md

## dates-in-elasticsearch.md

      
Options to handle wrong date formats in Elasticsearch

Mapping a date field without special handling of incorrect values: in this case all dates that do not conform to the default Joda format (ISO-8601, or milliseconds since the epoch) are reported as incorrect and an exception is thrown.
DELETE kafka

PUT kafka
{
    "settings" : {
        "number_of_shards" : 1
    },
    "mappings" : {
        "_doc" : {
            "properties" : {
                "content" : { "type" : "text" },
                "date": { "type": "date" }
            }
        }
    }
}
GET kafka/_mapping
This document will be indexed, as the date is valid ISO-8601.
PUT kafka/_doc/1
{ "date": "2015-01-01" }
This document will fail, as an empty string is not a valid date.
PUT kafka/_doc/2
{ "date": "" }
Strings containing only spaces will fail as well.
PUT kafka/_doc/3
{ "date": " " }
Another option is to allow for null values, using the null_value mapping parameter.
DELETE kafka_with_null_value
In this mapping we introduce two things: a specific format for the dates, including a special format that accepts a single blank space (which is clearly wrong as a date), and a null_value that acts as the default when the field is null.
PUT kafka_with_null_value
{
    "settings" : {
        "number_of_shards" : 1,
        "number_of_replicas": 0
    },
    "mappings" : {
        "_doc" : {
            "properties" : {
                "content" : { "type" : "text" },
                "date": { 
                  "type": "date", 
                  "format": "yyyy-MM-dd||' '",
                   "null_value": "1900-01-01"
                }
            }
        }
    }
}
GET kafka_with_null_value/_mapping
This document will succeed, as the date is valid ISO-8601.
PUT kafka_with_null_value/_doc/1
{ "date": "2015-01-01" }
This document creation will succeed as well, since we specifically allowed a blank string (a single space) as a date format, which is wrong by definition.
PUT kafka_with_null_value/_doc/2
{ "date": " " }
If, for example, a pre-processing step converted all empty strings to null, then with the current mapping this null is indexed as the default value (the null_value).
PUT kafka_with_null_value/_doc/3
{ "date": null }
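A pre-processing step of this kind could look roughly like the following. This is a minimal sketch, not part of the original setup: the function name `blank_dates_to_null` and the default field name are assumptions, and the code only rewrites the document before it is sent to Elasticsearch.

```python
from typing import Any, Dict


def blank_dates_to_null(doc: Dict[str, Any], field: str = "date") -> Dict[str, Any]:
    """Return a copy of doc where an empty or whitespace-only date becomes None.

    Hypothetical helper: with the mapping above, a None (JSON null) date is
    then indexed using the configured null_value.
    """
    cleaned = dict(doc)  # shallow copy, the input document is left untouched
    value = cleaned.get(field)
    if isinstance(value, str) and value.strip() == "":
        cleaned[field] = None
    return cleaned


docs = [{"date": "2015-01-01"}, {"date": ""}, {"date": " "}, {"date": None}]
cleaned_docs = [blank_dates_to_null(d) for d in docs]
```

Only blank strings are touched here; other malformed values (for example "01/01/2015") would still be rejected by the mapping.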
A search for the blank date will only return document two.
GET kafka_with_null_value/_search
{
  "query": {
    "match": {
      "date": " "
    }
  }
}
A search for the default value will return the documents whose date is null. If the default value were now, we could for example create documents where null is treated as the current date.
GET kafka_with_null_value/_search
{
  "query": {
    "match": {
      "date": "1900-01-01"
    }
  }
}
The important takeaway from this case is to decide how I would like my wrong values to be searchable. We should always aim for our indices to contain reasonably formatted values, and keep this target in mind.
Another option would be to use ignore_malformed; in that case the mapping will look like:
PUT kafka_with_malformed_value
{
    "settings" : {
        "number_of_shards" : 1,
        "number_of_replicas": 0
    },
    "mappings" : {
        "_doc" : {
            "properties" : {
                "content" : { "type" : "text" },
                "date": { 
                  "type": "date", 
                  "ignore_malformed": true
                }
            }
        }
    }
}
In that case all document creations will succeed: setting ignore_malformed to true allows the exception to be ignored.
The malformed field is not indexed, but other fields in the document are processed normally.
This means wrong dates will not be searchable, but the rest of the document will be processed as usual.
Recommendations

The use of ignore_malformed is acceptable for many use cases, keeping in mind that the option will not make the ignored fields (the ones without the correct date format) searchable.
However, in some use cases, the ones that are especially dependent on the date field being operational, a user might like to:

* use the null value, for example to point it to a reasonable default such as now.
* do a pre-processing step, to ensure all documents with wrongly formed dates are put in a DLQ (dead letter queue) to await correction.
* since version 6.5, use the _ignored field (https://www.elastic.co/guide/en/elasticsearch/reference/6.5/mapping-ignored-field.html), which is also recommended so users know which fields have been ignored.
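To illustrate the _ignored field, the sketch below builds the two search bodies that could be sent to `GET kafka_with_malformed_value/_search`: a term query for documents where one given field was ignored, and an exists query for documents where any field was ignored. The helper names are hypothetical; the code only builds the JSON and does not talk to a cluster.

```python
import json
from typing import Any, Dict


def ignored_field_query(field: str) -> Dict[str, Any]:
    """Search body matching documents where `field` was ignored as malformed."""
    return {"query": {"term": {"_ignored": field}}}


def any_ignored_query() -> Dict[str, Any]:
    """Search body matching documents with at least one ignored field."""
    return {"query": {"exists": {"field": "_ignored"}}}


# Bodies to paste after GET kafka_with_malformed_value/_search
date_ignored_body = json.dumps(ignored_field_query("date"), indent=2)
any_ignored_body = json.dumps(any_ignored_query(), indent=2)
print(date_ignored_body)
print(any_ignored_body)
```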

In the end, which option to choose depends on what you want to do with this field later on in your application logic.

* If it is a mandatory field, use null values and enforce a proper date format.
* If it is an optional field, you can either
** use the ignore_malformed option.
** do a pre-processing of the documents.
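The pre-processing option can be sketched as a small validation step that separates well-formed documents from the ones destined for a DLQ. This is only an illustration under stated assumptions: the names `is_valid_date` and `split_for_dlq` are hypothetical, the accepted format is assumed to be yyyy-MM-dd, null dates are treated as invalid, and Python's parser is used as a stand-in for the Elasticsearch one, so edge cases may differ.

```python
import re
from datetime import datetime
from typing import Any, Dict, Iterable, List, Tuple

# Strict yyyy-MM-dd shape; strptime alone would also accept "2015-1-1".
DATE_SHAPE = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")


def is_valid_date(value: Any) -> bool:
    """True only for strings that are real calendar dates in yyyy-MM-dd form."""
    if not isinstance(value, str) or not DATE_SHAPE.fullmatch(value):
        return False
    try:
        datetime.strptime(value, "%Y-%m-%d")  # rejects e.g. 2015-02-30
    except ValueError:
        return False
    return True


def split_for_dlq(
    docs: Iterable[Dict[str, Any]], field: str = "date"
) -> Tuple[List[Dict[str, Any]], List[Dict[str, Any]]]:
    """Split docs into (to_index, dead_letter) based on the date field."""
    to_index: List[Dict[str, Any]] = []
    dead_letter: List[Dict[str, Any]] = []
    for doc in docs:
        target = to_index if is_valid_date(doc.get(field)) else dead_letter
        target.append(doc)
    return to_index, dead_letter


incoming = [
    {"date": "2015-01-01"},
    {"date": ""},
    {"date": " "},
    {"date": "2015-02-30"},
    {"date": None},
]
to_index, dead_letter = split_for_dlq(incoming)
```

Documents in `to_index` can be sent to Elasticsearch as they are, while the ones in `dead_letter` wait for correction, so the index only ever contains well-formed dates.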

I hope this is of help,
-- Pere