tl;dr: Here is an unbiased opinion from someone who didn't write xrate or rate, but used both with prod data. I've found Alin's 4 benefits listed a few replies ago to be accurate (and give me the Borgmon-like rate functions that I need). As such, I unfortunately will need to keep this fork indefinitely.
I'm a former Xoogler and Borgmon user. I think Prometheus is a better system in many ways -- the 2.x TSDB and PromQL are spectacular. Thanks for writing this. I know Prometheus isn't sold as a 100% accurate event system, but from what I've seen with my data, a lot of that has to do with rate's extrapolation and not the TSDB.
No matter how I drew a Borgmon graph, rates didn't randomly change or disappear, which is very important for data like network errors. As a result, maybe (?) it doesn't handle rolling restarts as well as Prometheus, but this doesn't apply much to my data. I realize Grafana is likely somewhat at fault too, but the console graph has the same behavior. Also, Alin's xrate fixed the problem for me completely without changes to Grafana.
I patched Alin's code on top of 2.0.0, which I'm running in prod. Here is a comparison of them with one of my routers that has a known SNMP bug (randomly returns 0 errors without resetting counters, which is a good stress test):
- router (verified via snmpwalk): "186 output errors", "0 output errors", "186 output errors"
- xincrease(ifOutErrors[15s]): 186, 0, 186 (repeated thousands of times over days without any deviation)
- increase(ifOutErrors[45s])/3 (hack since @scrape returns no data): 0, 72, 0, 84, 0, 74, 0, 91, 0, 80 ...
My observations:
- increase misled me into thinking that real errors were continuously occurring (and is always X% off of the real value regardless of what I do).
- xincrease accurately told me that something was injecting exactly 186 errors several times a minute (a router bug since it never showed more than 186).
- xincrease gives me much more confidence that all errors are displayed (e.g. I've seen several accurate "1 error" cases reported where increase shows nothing, unless I get lucky when I hit refresh several times).
- xincrease works very accurately @scrape, even in alerts.
- Both increase/xincrease handled the constant 0 resets from the buggy router well.
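To make the difference concrete, here is a minimal Python sketch with made-up samples (a 15s scrape interval and a single error). It is a simplification under stated assumptions, not the Prometheus or xrate code: `extrapolated_increase` scales the in-window delta up to the full window and omits the limits the real increase puts on extrapolation, and `xincrease_style` assumes the approach is a plain reset-corrected difference that also uses the last sample before the window. All function names are hypothetical.

```python
def counter_delta(values):
    """Sum the increases between consecutive counter values.

    A drop is treated as a counter reset, so the value seen after the
    drop counts as new increase since the reset.
    """
    total = 0.0
    for prev, cur in zip(values, values[1:]):
        total += cur if cur < prev else cur - prev
    return total


def extrapolated_increase(samples, start, end):
    """Simplified increase(): in-window delta scaled to the window length.

    Assumption: the real implementation caps how far it extrapolates;
    that is left out here. Samples are (timestamp, value) pairs sorted
    by distinct timestamps.
    """
    in_window = [(t, v) for t, v in samples if start <= t <= end]
    if len(in_window) < 2:
        return None  # not enough samples in the window to compute anything
    sampled = in_window[-1][0] - in_window[0][0]
    delta = counter_delta([v for _, v in in_window])
    return delta * (end - start) / sampled


def xincrease_style(samples, start, end):
    """Assumed xincrease-like behaviour: no extrapolation, but the last
    sample before the window is included as the starting point."""
    before = [(t, v) for t, v in samples if t < start]
    in_window = [(t, v) for t, v in samples if start <= t <= end]
    points = before[-1:] + in_window
    if len(points) < 2:
        return None
    return counter_delta([v for _, v in points])


# One error occurs between t=15 and t=30.
samples = [(0, 100), (15, 100), (30, 101), (45, 101)]

print(extrapolated_increase(samples, 20, 50))  # 0.0 (the error is missed)
print(extrapolated_increase(samples, 10, 40))  # 2.0 (the error is doubled)
print(xincrease_style(samples, 20, 50))        # 1.0
print(xincrease_style(samples, 10, 40))        # 1.0
```

In this toy case the extrapolated result depends on how the 30s window lines up with the scrapes, which is the "refresh and get a different answer" effect, while the difference-based version reports exactly one error either way.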
If xrate/rate can't be merged (and/or you don't want to make it a core PromQL function), has any thought been given to better contrib/extended support (namespaced with x_ or x)? It could still go through your design process, maybe guarded by a build or runtime flag (e.g. --promql.extensions=xrate). It beats me having to merge code every release. There's a lot of power with Prometheus, and it would be nice to support more data science functions (within reason) if someone has a good idea.
Regardless, I'm very happy with Prometheus and I know a lot went into the current rate implementation (I read the thread and watched the video). It's a good default, but the tradeoffs chosen don't work well for my data, hence the need for a new function. It would help a lot of people:
Hey Justin,
Xoogler here too, and former Borgmon dev 10+ years ago. :o)
Because of the spam issue on Google Groups, I've moved the discussion back to the Prometheus issue tracker -- prometheus/prometheus#3806. I'm not particularly confident that it's going to go anywhere. Brian seems dead set on rejecting it; Björn is so busy that he only drops a couple of lines a month; and everyone else seems afraid of touching that code with a ten-foot pole.
But feel free to post this there, maybe it helps. And thanks for the support, I've started feeling like a clueless troll because of the lack of feedback.
Cheers,
Alin.