flash1293/proposal.md

## proposal.md

      
    Raw
  

              proposal.md
            
          
    Time shifts

A common use case in Kibana is it to compare different time ranges. This is partially implemented in some places (TSVB and Timelion) by doing separate requests, then merging the results together in Kibana.
This approach works fine for some types of data, but it is problematic when used together with the terms aggregation. Consider the following query: "Give me the top 3 best selling products per year, along with the relative growth in comparison to the previous year"
If this is modelled as a yearly date histogram with a nested terms agg ordered by the sum of sales, there's no straighforward mechanism to make sure to get the same terms for the previous year (along with the actual top 3 products of that year):
Desired table:


Year
Product
Revenue
Revenue -1y


2020
Product A
456
122


2020
Product B
123
6


2020
Product C
54
34


2019
Product A
122
78


2019
Product D
112
1


2019
Product E
55
3


One way to do so from the client is to fetch the revenue for the current year, then turn the bucket hierarchy into a large list of filters and use those as a second request to fetcht the data for the previous year:

First query: yearly date histogram of terms with revenue
Second query: fetch the following filters:

2019 Product A
2019 Product B
2019 Product C
2018 Product A
2018 Product D
2018 Product E


If the number of buckets is high, this request can become very complex (and probably pretty inefficient)
If this request can be formulated as a single Elasticsearch query, this logic could be handled within Elasticsearch in a much more efficient way (and especially without an additional roundtrip)
API suggestion

Introduce a time_shift aggregation which works similar to the filter aggregation, except it takes a field and a time shift - this will cause the set of documents of the bucket to change by shifting all imposed filters (by the query of the request and buckets of parent aggregations) by the specified time.
GET kibana_sample_data_logs/_search
{
  "query": {
    "match_all": {}
  },
  "size": 0, 
  "aggs": {
    "year": {
      "date_histogram": {
        "field": "timestamp",
        "interval": "day"
      },
      "aggs": {
        "product": {
          "terms": {
            "field": "product",
            "size": 3,
            "order": { "revenue": "desc" }
          },
          "aggs": {
            "last_year": {
              "time_shift": {
                  "field": "timestamp", // to get this metric, all filters on the timestamp field have to be moved by the shift, then the metric has to be calculated
                  "shift": "-1y"
              },
              "aggs": {
                "revenue": {
                  "sum": {
                    "field": "sales"
                  }
                }
              }
            },
            "revenue": {
              "sum": {
                  "field": "sales"
              }
            }
          }
        }
      }
    }
  }
}

Response
"aggregations" : {
    "year" : {
      "buckets" : [
        {
          "key_as_string" : "2020-01-01T00:00:00.000Z",
          "key" : 1614556800000,
          "doc_count" : 4165,
          "product" : {
            "doc_count_error_upper_bound" : 0,
            "sum_other_doc_count" : 1137,
            "buckets" : [
              {
                "key" : "Product A",
                "doc_count" : 1574,
                "last_year": {
                  "doc_count" : 2574,
                  "revenue": {
                    "value": 123
                  }
                },
                "revenue": {
                  "value": 567
                }
              },
              {
                "key" : "Product B",
                "doc_count" : 773,
                "last_year": {
                  "doc_count" : 2574,
                  "revenue": {
                    "value": 123
                  }
                },
                "revenue": {
                  "value": 567
                }
              }
            ]
          }
        }
      ]
    }
  }

Affected filters

Time shifts will sometimes be used together with date histograms and sometimes without them. While it's possible in most cases for Kibana to craft a query that will return the right data if there is no date histogram, this would mean a lot of additional complexity on the Kibana side (and probably other consumers), because two paths for time shift have to maintained - with date histogram and without. Also there are some edge cases which are not easily reproducable (and would require Kibana to fall back to the "frozen buckets" approach of client side multi pass) like using variable width histogram.
There are two places in the query which have to be adjusted based on the time shift aggregation:
All ancestor date histogram aggs which are built on the field specified in the time_shift aggregation (everything beyond one date histogram is an edge case, walking up the aggregation tree and picking the next date histogram will work just as fine for the Lens use case - similar to how moving average etc. work)
The "main time filter" in the query of the request (all other filters that happen to operate on the same field can be ignored for the Kibana use case). Kibana will always build requests like this
{
    "aggs": {
        ...
    },
    "query": {
        "bool": {
            "must": [],
            "filter": [
                {
                    "range": {
                        "@timestamp": {
                            "gte": "2021-04-12T14:46:21.191Z",
                            "lte": "2021-04-19T14:46:21.191Z",
                            "format": "strict_date_optional_time"
                        }
                    }
                }
            ],
            "should": [],
            "must_not": []
        }
    }
}

The query building logic within Kibana knows which filter to touch for shifting because it's handled separately from all other filters due to the deep integration into the UI. Elasticsearch however might have a problem to determine which top level filter to consider for shifting. There are a few options for how to communicate to Elasticsearch which filter to use:

Use the first range filter on the specified field in the query - no changes required by Kibana, but seems very brittle
"Mark" the filter in some way with a meta attribute. Just a single range would be allowed to be marked like this - not sure whether that's a direction we want to go into

                    "range": {
                        "@timestamp": {
                            "gte": "2021-04-12T14:46:21.191Z",
                            "lte": "2021-04-19T14:46:21.191Z",
                            "format": "strict_date_optional_time",
                            "time_shift": true
                        }
                    }


Specify the range in the time_shift aggregation instance (only relevant if there is no date histogram - in that case treat the reference_range as the only bucket of a root date histogram aggregation and shift it according to shift):

"time_shift": {
  "field": "timestamp",
  "shift": "-1y",
  "reference_range": {
    "gte": "2021-04-12T14:46:21.191Z",
    "lte": "2021-04-19T14:46:21.191Z",
    "format": "strict_date_optional_time",
  }
}

It seems like the last option is the most explicit one which is also relatively simple to implement on the client side.
Year	Product	Revenue	Revenue -1y
2020	Product A	456	122
2020	Product B	123	6
2020	Product C	54	34
2019	Product A	122	78
2019	Product D	112	1
2019	Product E	55	3