jgracenin/xrate.md

## xrate.md

      
    Re: prometheus-developers@ Proposal rate() / increase() should cover all points between <metric offset $range>
tl;dr: Here is an unbiased opinion from someone who didn't write xrate or rate, but has used both with prod data. I've found the 4 benefits Alin listed a few replies ago to be accurate, and xrate gives me the Borgmon-like rate functions that I need. As such, I unfortunately will need to keep this fork indefinitely.
I'm a former Xoogler and Borgmon user. I think Prometheus is a better system in many ways -- the 2.x TSDB and PromQL are spectacular. Thanks for writing this. I know Prometheus isn't sold as a 100% accurate event system, but from what I've seen with my data, a lot of that has to do with rate's extrapolation and not the TSDB.
No matter how I drew a Borgmon graph, rates didn't randomly change or disappear, which is very important for data like network errors. As a result, maybe (?) it doesn't handle rolling restarts as well as Prometheus, but this doesn't apply much to my data. I realize Grafana is likely somewhat at fault too, but the console graph has the same behavior. Also, Alin's xrate fixed the problem for me completely without changes to Grafana.
I patched Alin's code on top of 2.0.0, which I'm running in prod. Here is a comparison of the two on one of my routers that has a known SNMP bug (it randomly returns 0 errors without resetting counters, which is a good stress test):

- router (verified via snmpwalk): "186 output errors", "0 output errors", "186 output errors"
- `xincrease(ifOutErrors[15s])`: 186, 0, 186 (repeated thousands of times over days without any deviation)
- `increase(ifOutErrors[45s])/3` (hack since @scrape returns no data): 0, 72, 0, 84, 0, 74, 0, 91, 0, 80 ...
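
For readers wondering where an exact 186 can come from when no real errors occur, here is a minimal Python sketch of the usual counter-reset convention (any drop in a counter is treated as a restart from zero). The function name and the sample readings are hypothetical, modelled on the router above; this is not the Prometheus or xrate code.

```python
def per_scrape_increase(readings):
    """Increase between consecutive counter readings.

    Any drop is assumed to be a counter reset, so the whole new reading
    counts as the increase (the usual counter convention).
    """
    increases = []
    for prev, cur in zip(readings, readings[1:]):
        increases.append(cur if cur < prev else cur - prev)
    return increases


# Hypothetical polls: the real counter stays at 186, but the buggy agent
# sometimes answers 0 instead.
readings = [186, 0, 186, 0, 186]
print(per_scrape_increase(readings))  # [0, 186, 0, 186]
```

Under that assumption, each bogus 0 looks like a reset, and the next good reading shows up as a jump of exactly 186, which is consistent with the xincrease output above.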

My observations:

- increase misled me into thinking that real errors were continuously occurring (and is always X% off of the real value regardless of what I do).
- xincrease accurately told me that something was injecting exactly 186 errors several times a minute (a router bug since it never showed more than 186).
- xincrease gives me much more confidence that all errors are displayed (e.g. I've seen several accurate "1 error" cases reported where increase shows nothing, unless I get lucky when I hit refresh several times).
- xincrease works very accurately @scrape, even in alerts.
- Both increase/xincrease handled the constant 0 resets from the buggy router well.
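
To make the difference concrete, here is a rough Python sketch of the two approaches as I understand them. It is deliberately simplified: `simple_increase` scales the observed delta up to the full window (the real increase() limits how far it extrapolates), and `simple_xincrease` follows the idea in the proposal of using the last sample at or before the window start as a baseline. All names and sample values are made up for illustration, and samples are assumed to be sorted by timestamp.

```python
def counter_delta(values):
    """Total increase across consecutive values, treating drops as resets."""
    total = 0.0
    for prev, cur in zip(values, values[1:]):
        total += cur if cur < prev else cur - prev
    return total


def simple_increase(samples, t, window):
    """Extrapolating estimate: uses only samples inside (t - window, t] and
    scales the observed delta up to the full window length."""
    inside = [(ts, v) for ts, v in samples if t - window < ts <= t]
    if len(inside) < 2:
        return None  # too few samples inside the window
    covered = inside[-1][0] - inside[0][0]
    return counter_delta([v for _, v in inside]) * window / covered


def simple_xincrease(samples, t, window):
    """Non-extrapolating estimate: the last sample at or before the window
    start is the baseline, so every sample in the window contributes."""
    before = [(ts, v) for ts, v in samples if ts <= t - window]
    inside = [(ts, v) for ts, v in samples if t - window < ts <= t]
    points = before[-1:] + inside
    if len(points) < 2:
        return None
    return counter_delta([v for _, v in points])


# (timestamp in seconds, counter value), scraped every 15s; one real error.
samples = [(-15, 10), (0, 10), (15, 10), (30, 11), (45, 11)]

print(simple_increase(samples, t=50, window=60))   # about 1.33
print(simple_xincrease(samples, t=50, window=60))  # 1.0

# With the window equal to the scrape interval, only one sample falls inside.
print(simple_increase(samples, t=30, window=15))   # None
print(simple_xincrease(samples, t=30, window=15))  # 1.0
```

In this toy model a single error shows up as roughly 1.33 with extrapolation and as exactly 1 without it, and a window as short as the scrape interval yields no data for the extrapolating version because it never sees two samples.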

If xrate/rate can't be merged (and/or you don't want to make it a core PromQL function), has any thought been given to better contrib/extended support (namespaced with x_ or x)? It could still go through your design process, maybe guarded by a build or runtime flag (e.g. --promql.extensions=xrate). That would beat having to merge code every release. There's a lot of power in Prometheus, and it would be nice to support more data science functions (within reason) if someone has a good idea.
Regardless, I'm very happy with Prometheus and I know a lot went into the current rate implementation (I read the thread and watched the video). It's a good default, but the tradeoffs chosen don't work well for my data, hence the need for a new function. It would help a lot of people:

- https://stackoverflow.com/questions/38665904/why-does-increase-return-a-value-of-1-33-in-prometheus
- https://www.stroppykitten.com/technical/prometheus-grafana-statistics
- grafana/grafana#9705
...