@timothyb89
Last active February 28, 2018 15:40
Threshold Engine Proposal

System Overview

This is a proposal for a new, simplified alarm evaluation engine for Monasca. It builds on the existing aggregation engine (https://github.com/monasca/monasca-aggregator), which allows for a simpler implementation of a thresholding engine that shares syntax with the query language.

It would involve two or three new components:

  • transformation engine: evaluates rules stored in a database
    • rules can be dynamically created
    • evaluates rules of two forms:
      • function(timeseries, constant...)
        • e.g. delta(metric{foo=bar})
      • function(timeseries_a, timeseries_b)
        • e.g. metric{foo=bar} > metric{foo=baz}
        • e.g. metric{foo=bar} / metric{foo=baz}
    • each rule executes a single function
    • each rule outputs a single time series
    • could be implemented as an extension or modification to the aggregation engine
  • alarm evaluation engine: triggers events when a metric meets a condition
    • only one form: function(timeseries, constant)
    • more complex rules will be decomposed into the supported form as needed
  • expression compiler: complex expressions are compiled and decomposed into transformation and alarm rules at creation time
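
The two transformation rule forms and the single alarm rule form listed above could be modelled as plain records. The following is a minimal sketch only: the class names, field names, and the COMPARATORS table are hypothetical and do not come from Monasca or the aggregation engine.

```python
import operator
from dataclasses import dataclass
from typing import Tuple

# Hypothetical comparator table; the proposal does not list the supported operators.
COMPARATORS = {">": operator.gt, ">=": operator.ge, "<": operator.lt, "<=": operator.le}


@dataclass(frozen=True)
class TransformRule:
    """A transformation rule: one function, one output time series."""
    output: str                       # name of the time series this rule emits
    function: str                     # e.g. "delta", ">", "/"
    inputs: Tuple[str, ...]           # one or two input time series
    constants: Tuple[float, ...] = ()

    def __post_init__(self):
        if len(self.inputs) == 1:
            return  # form 1: function(timeseries, constant...)
        if len(self.inputs) == 2 and not self.constants:
            return  # form 2: function(timeseries_a, timeseries_b)
        raise ValueError("rule must be f(series, constant...) or f(series_a, series_b)")


@dataclass(frozen=True)
class AlarmRule:
    """An alarm rule: the only supported form is function(timeseries, constant)."""
    series: str
    comparator: str
    threshold: float

    def fires(self, value: float) -> bool:
        # Compare one sample of the series against the constant threshold.
        return COMPARATORS[self.comparator](value, self.threshold)


delta_rule = TransformRule("temp1", "delta", ("metric{foo=bar}",))
ratio_rule = TransformRule("ratio", "/", ("metric{foo=bar}", "metric{foo=baz}"))
alarm = AlarmRule("ratio", ">", 5.0)
```

Keeping the alarm form this narrow means any richer condition has to be expressed as transformation rules first, which is the job the proposal assigns to the expression compiler.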

Chart

[chart image not included]

Goal

sum(delta(request_total_time{app="ms-api-api"}[1h])) by (path, method)
 / 
sum(delta(request_count{app="ms-api-api"}[1h])) by (path, method)
 > 5.0

The expression above is written in Prometheus syntax.

Decompose expression

The complex expression is decomposed into five transformation rules and one alarm rule.

temp1 = delta(request_total_time{app="ms-api-api"}[1h])
temp2 = sum(temp1) by (path, method)

temp3 = delta(request_count{app="ms-api-api"}[1h])
temp4 = sum(temp3) by (path, method)

temp5 = temp2 / temp4

if temp5 > 5.0:
    alarm()
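
The decomposition above can be reproduced with a post-order walk over the expression tree. This is a sketch under stated assumptions: the nested-tuple representation and the decompose helper are hypothetical, parsing of the expression text is left out, and the "sum by (path, method)" label is treated as an opaque function name.

```python
def decompose(expr):
    """Flatten a nested expression into single-function rules.

    `expr` is either a string (a leaf selector, used as-is) or a tuple
    (function, operand, ...) whose operands are themselves expressions.
    Returns (result_name, rules); each rule is (output, function, operands).
    """
    rules = []

    def visit(node):
        if isinstance(node, str):
            return node
        function, *operands = node
        # Operands are visited first, so every rule's inputs already exist
        # by the time the rule itself is appended.
        args = tuple(visit(operand) for operand in operands)
        name = f"temp{len(rules) + 1}"
        rules.append((name, function, args))
        return name

    return visit(expr), rules


# The goal expression without its "> 5.0" comparison, which becomes the alarm rule.
goal = (
    "/",
    ("sum by (path, method)", ("delta", 'request_total_time{app="ms-api-api"}[1h]')),
    ("sum by (path, method)", ("delta", 'request_count{app="ms-api-api"}[1h]')),
)

result, rules = decompose(goal)
alarm_rule = (result, ">", 5.0)

for output, function, operands in rules:
    print(f"{output} = {function}({', '.join(operands)})")
```

Because operands are emitted before the rule that consumes them, the resulting list is already in a valid evaluation order, with temp5 dividing temp2 by temp4 as in the listing above.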

Benefits

  • Rules are implicitly evaluated in order
  • Rules are implicitly deduplicated
  • Potentially simplified clustering
    • Can expect cost of each rule to be roughly equal
    • Without changes to keying in the API, all cluster members would need to see every metric
      • this puts an upper bound on performance with confluent_kafka at ~250k metrics/sec
  • Builds on existing aggregation engine. New requirements would include:
    • Dynamic rules
    • New functions, particularly of the function(timeseries_a, timeseries_b) form
  • Expression compiler would probably be part of the API
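
One way the deduplication benefit could fall out of the design is to key every transformation rule on its function and operands, so that two alarms sharing a sub-expression reuse the same rule. The sketch below is an assumption about how that might work, not a description of the aggregation engine; RuleStore and its methods are hypothetical names.

```python
class RuleStore:
    """Stores single-function rules, reusing a rule when an identical one exists."""

    def __init__(self):
        # (function, operands) -> output name; dicts keep insertion order.
        self._by_key = {}

    def add(self, function, operands):
        """Return the output name for this rule, creating it only if it is new."""
        key = (function, tuple(operands))
        if key not in self._by_key:
            self._by_key[key] = f"temp{len(self._by_key) + 1}"
        return self._by_key[key]

    def rules(self):
        """Rules in creation order, so inputs always precede their consumers."""
        return [(name, fn, ops) for (fn, ops), name in self._by_key.items()]


store = RuleStore()

# First alarm: the rate of change of the request count.
first = store.add("delta", ['request_count{app="ms-api-api"}[1h]'])

# Second alarm: reuses the same delta, then sums it.
shared = store.add("delta", ['request_count{app="ms-api-api"}[1h]'])
summed = store.add("sum by (path, method)", [shared])

print(store.rules())
```

Here the second delta request returns the existing rule name, so only two rules are stored for the two alarms.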