Time offset splitted-filters

This write-up explains the idea behind the splitted-filters approach to time offsets and how it could be implemented in the current system.

Consider a user wanting to create a table like this (top 3 products with yearly revenue and year-over-year growth):

Product     Year   Revenue   Revenue -1y
Product A   2020   456       122
Product A   2019   123       6
Product B   2020   54        34
Product B   2019   122       78
Product C   2020   112       1
Product C   2019   55        3

Without the year-over-year column, this is a fairly straightforward query to build in Kibana today:

GET kibana_sample_data_logs/_search
{
  "query": {
    "range": {
      "timestamp": {
        "gte": "2019",
        "lte": "2020"
      }
    }
  },
  "size": 0, 
  "aggs": {
    "product": {
      "terms": {
        "field": "product",
        "size": 3,
        "order": { "overall_revenue": "desc" }
      },
      "aggs": {
        "overall_revenue": {
          "sum": {
            "field": "sales"
          }
        },
        "year": {
          "date_histogram": {
            "field": "timestamp",
            "interval": "year"
          },    
          "aggs": {
            "revenue": {
              "sum": {
                  "field": "sales"
              }
            }
          }
        }
      }
    }
  }
}

How do we get the revenue for the previous year to do the year-over-year comparison? It's not possible to simply do a second request with a shifted time range, because the top products might change, leading to mismatched pairs.

The splitted-filters approach solves this in the following way:

GET kibana_sample_data_logs/_search
{
  "query": {
    "bool": {
      "should": [{
        "range": { // match both time ranges at the same time (shifted and not shifted)
          "timestamp": {
            "gte": "2019",
            "lte": "2020"
          }
        }
      }, {
        "range": {
          "timestamp": {
            "gte": "2018",
            "lte": "2019"
          }
        }
      }],
      "minimum_should_match": 1
    }
  },
  "size": 0, 
  "aggs": {
    "product": {
      "terms": {
        "field": "product",
        "size": 3,
        "order": { "first_year>overall_revenue": "desc" }
      },
      "aggs": {
        "first_year": { // for sorting use a filter agg to scope to the first time range
          "filter": {
            "range": { 
              "timestamp": {
                "gte": "2019",
                "lte": "2020"
              }
            }
          },
          "aggs": {
            "overall_revenue": {
              "sum": {
                "field": "sales"
              }
            }
          }
        },
        "time_offset_split": {
          "filters": { // split by different time ranges
            "filters": {
              "regular": {
                "range": { 
                  "timestamp": {
                    "gte": "2019",
                    "lte": "2020"
                  }
                }
              },
              "-1y": {
                "range": {
                  "timestamp": {
                    "gte": "2018",
                    "lte": "2019"
                  }
                }
              }
            }
          },
          "aggs": { // nest the date histogram and all metrics in this filters
            "year": {
              "date_histogram": {
                "field": "timestamp",
                "interval": "year"
              },    
              "aggs": {
                "revenue": {
                  "sum": {
                      "field": "sales"
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}

The following things are happening here:

  • Right before the date histogram (or before the metrics if there is no date histogram), a filters agg is added to the agg hierarchy, splitting the two time ranges we need data for.
  • The overall time range is changed to match both time ranges, the original one and the shifted one (see the sketch after this list).
  • For metrics above this filters agg splitting the time ranges, a filter agg is needed to "select" the right time range (like for sorting metrics of terms aggs, or overall metrics if we choose to support them).
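
A minimal sketch of how the expanded query and the split filters could be derived from the original time range and a year-based offset. All names and helpers here are hypothetical and only illustrate the idea, they are not existing Kibana code:

// Hypothetical helper: given the original time range and a shift of N years,
// produce the expanded bool query and the named filters for the time_offset_split agg.
interface TimeRange {
  gte: string; // year strings / date math as used in the examples, e.g. "2019"
  lte: string;
}

// Sketch only: shifts a plain year string back by `years`. A real implementation
// would operate on parsed dates / configured durations instead.
function shiftRange(range: TimeRange, years: number): TimeRange {
  return {
    gte: String(Number(range.gte) - years),
    lte: String(Number(range.lte) - years),
  };
}

function buildTimeOffsetQuery(field: string, range: TimeRange, offsetYears: number) {
  const shifted = shiftRange(range, offsetYears);
  const rangeFilter = (r: TimeRange) => ({ range: { [field]: { gte: r.gte, lte: r.lte } } });

  return {
    // the overall query has to match both time ranges at once
    query: {
      bool: {
        should: [rangeFilter(range), rangeFilter(shifted)],
        minimum_should_match: 1,
      },
    },
    // named filters for the time_offset_split agg ("regular" and e.g. "-1y")
    splitFilters: {
      regular: rangeFilter(range),
      [`-${offsetYears}y`]: rangeFilter(shifted),
    },
  };
}

// buildTimeOffsetQuery("timestamp", { gte: "2019", lte: "2020" }, 1) yields the
// bool query and the "regular" / "-1y" filters used in the request above.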

The result looks like this:

"aggregations" : {
    "product" : {
      "buckets" : [
        {
          "key" : "Product A",
          "doc_count" : 512,
          "first_year" : {
            "doc_count" : 418,
            "overall_revenue" : {
              "value" : 2163634.0
            }
          },
          "time_offset_split" : {
            "buckets" : {
              "-1y" : {
                "doc_count" : 420,
                "year" : {
                  "buckets" : [
                    {
                      "key_as_string" : "2018",
                      "doc_count" : 81,
                      "revenue" : {
                        "value" : 505124.0
                      }
                    },
                    {
                      "key_as_string" : "2019",
                      "doc_count" : 78,
                      "revenue" : {
                        "value" : 363058.0
                      }
                    }
                  ]
                }
              },
              "regular" : {
                "doc_count" : 418,
                "year" : {
                  "buckets" : [
                    {
                      "key_as_string" : "2019",
                      "doc_count" : 65,
                      "revenue" : {
                        "value" : 302736.0
                      }
                    },
                    {
                      "key_as_string" : "2020",
                      "doc_count" : 84,
                      "revenue" : {
                        "value" : 392924.0
                      }
                    }
                  ]
                }
              }
            }
          }
        },
        {
          "key" : "Product B",
          "doc_count" : 248,
          "first_year" : {
            "doc_count" : 215,
            "overall_revenue" : {
              "value" : 1315547.0
            }
          },
          "time_offset_split" : {
            "buckets" : {
              "-1y" : {
                "doc_count" : 211,
                "year" : {
                  "buckets" : [
                    {
                      "key_as_string" : "2018",
                      "key" : 1618963200000,
                      "doc_count" : 28,
                      "revenue" : {
                        "value" : 156543.0
                      }
                    },
// ....

To turn this into a table, the time offset split buckets (regular vs. -1y) have to be merged into a single structure. This can happen either by changing the tabify logic to make it aware of this extra layer in the response, or by following the "other bucket" approach of rewriting the response so the shifted metrics look like regular metrics (see the sketch after the following example):

"aggregations" : {
    "product" : {
      "buckets" : [
        {
          "key" : "Product A",
          "doc_count" : 512,
          "first_year" : {
            "doc_count" : 418,
            "overall_revenue" : {
              "value" : 2163634.0
            }
          },
          "year" : {
            "buckets" : [
              {
                "key_as_string" : "2019",
                "doc_count" : 65,
                "revenue_regular" : {
                  "value" : 302736.0
                },
                "revenue_-1y" : {
                  "value" : 505124.0
                }
              },
              {
                "key_as_string" : "2020",
                "doc_count" : 84,
                "revenue_regular" : {
                  "value" : 392924.0
                },
                "revenue_-1y" : {
                  "value" : 363058.0
                }
              }
// ....
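
A rough sketch of what this rewrite could look like per product bucket, assuming the merge also shifts the -1y bucket keys forward so the shifted values line up with the regular ones. The types and the mergeTimeOffsetSplit function are made up for illustration, not existing Kibana code:

// Hypothetical shape of a single date histogram bucket in the response.
interface HistogramBucket {
  key_as_string: string;
  doc_count: number;
  [metric: string]: unknown;
}

interface OffsetSplit {
  buckets: Record<string, { year: { buckets: HistogramBucket[] } }>;
}

// Merge the "regular" branch and the shifted branches back into one bucket list,
// mapping each shifted bucket onto the unshifted year it should be compared to
// (e.g. the -1y value for 2018 ends up as "revenue_-1y" on the 2019 bucket).
function mergeTimeOffsetSplit(
  split: OffsetSplit,
  metricName: string,
  offsetYears: number
): HistogramBucket[] {
  const merged = new Map<string, HistogramBucket>();

  for (const [branch, content] of Object.entries(split.buckets)) {
    const isRegular = branch === 'regular';
    for (const bucket of content.year.buckets) {
      const targetKey = isRegular
        ? bucket.key_as_string
        : String(Number(bucket.key_as_string) + offsetYears);
      const target = merged.get(targetKey) ?? { key_as_string: targetKey, doc_count: 0 };
      if (isRegular) {
        target.doc_count = bucket.doc_count;
      }
      // "revenue" becomes "revenue_regular" / "revenue_-1y"
      target[`${metricName}_${isRegular ? 'regular' : branch}`] = bucket[metricName];
      merged.set(targetKey, target);
    }
  }

  return [...merged.values()].sort((a, b) => a.key_as_string.localeCompare(b.key_as_string));
}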

Limitations

  • Can't use a terms agg inside the date histogram while using a time shift; trying to do so would cause an error.
  • Can't use date range aggs (this would require special logic and is not exposed in Lens); this would be an error as well.
  • Can't filter on the bound time field in the KQL of a filters agg (this could also be validated on the Lens side).
  • No metrics on all levels.
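
A sketch of what validating these limitations could look like on the agg config level. The AggConfigLike shape and validateTimeShift are made up for illustration and heavily simplify the real agg config structures:

// Hypothetical, simplified view of an agg configuration.
interface AggConfigLike {
  type: string;                 // e.g. "terms", "date_histogram", "date_range", "filters"
  timeShift?: string;           // e.g. "-1y", set on shifted metrics
  params?: { query?: string };  // e.g. the KQL of a filters agg
}

// Returns human-readable errors for the combinations listed above. Real validation
// would inspect the parsed KQL instead of doing a substring check on the time field.
function validateTimeShift(aggs: AggConfigLike[], timeFieldName: string): string[] {
  const errors: string[] = [];
  if (!aggs.some((agg) => agg.timeShift != null)) return errors;

  const histogramIndex = aggs.findIndex((agg) => agg.type === 'date_histogram');
  const termsIndex = aggs.findIndex((agg) => agg.type === 'terms');
  if (histogramIndex !== -1 && termsIndex > histogramIndex) {
    errors.push('A terms agg cannot be nested inside the date histogram when a time shift is used.');
  }
  if (aggs.some((agg) => agg.type === 'date_range')) {
    errors.push('Date range aggs are not supported together with time shifts.');
  }
  if (aggs.some((agg) => agg.type === 'filters' && agg.params?.query?.includes(timeFieldName))) {
    errors.push(`A filters agg must not filter on the bound time field "${timeFieldName}".`);
  }
  return errors;
}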

Implementation

I think this is how we could implement this:

  • all metrics get a time shift attribute
  • toDsl of agg configs - add the filters agg in the right position if there are time-shifted metrics (see the sketch after this list)
  • order-by of terms - wrap with a filter agg for the respective time range
  • overall metrics - wrap in a filter agg (optional - we don't use this in Lens, so we could/should omit it)
  • time range in the request handler - becomes a bool query matching both ranges
  • tabify logic - remap through the filters split, picking the right branch per time shift (or a pre-processing step mapping everything back together)
  • validation logic in agg configs to check whether an unsupported combination is used
  • check the other bucket logic - I took a look at the code and I think it would work out of the box
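
As a rough illustration of the toDsl step, this is how the filters agg could be spliced around the innermost aggs when time-shifted metrics are present. The helper below is a sketch under assumed input shapes, not the actual agg configs code:

// Hypothetical shape of the generated agg DSL subtree (date histogram + metrics).
type AggDsl = Record<string, unknown>;

interface NamedRange {
  name: string; // "regular", "-1y", ...
  gte: string;
  lte: string;
}

// Wraps the innermost part of the agg tree in a filters agg keyed by shift name,
// producing the time_offset_split structure shown in the request above.
function wrapWithTimeOffsetSplit(innerAggs: AggDsl, timeField: string, ranges: NamedRange[]): AggDsl {
  const filters: Record<string, unknown> = {};
  for (const r of ranges) {
    filters[r.name] = { range: { [timeField]: { gte: r.gte, lte: r.lte } } };
  }
  return {
    time_offset_split: {
      filters: { filters },
      aggs: innerAggs,
    },
  };
}

// Usage: wrapWithTimeOffsetSplit(
//   { year: { date_histogram: { field: 'timestamp', interval: 'year' }, aggs: { revenue: { sum: { field: 'sales' } } } } },
//   'timestamp',
//   [{ name: 'regular', gte: '2019', lte: '2020' }, { name: '-1y', gte: '2018', lte: '2019' }]
// );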