@jgracenin
Created March 27, 2018 12:54

Re: prometheus-developers@ Proposal rate() / increase() should cover all points between <metric offset $range>

tl;dr: Here is an unbiased opinion from someone who didn't write xrate or rate, but used both with prod data. I've found Alin's 4 benefits listed a few replies ago to be accurate (and give me the Borgmon-like rate functions that I need). As such, I unfortunately will need to keep this fork indefinitely.

I'm a former Xoogler and Borgmon user. I think Prometheus is a better system in many ways -- the 2.x TSDB and PromQL are spectacular. Thanks for writing this. I know Prometheus isn't sold as a 100% accurate event system, but from what I've seen with my data, a lot of the inaccuracy comes from rate's extrapolation and not from the TSDB.

No matter how I drew a Borgmon graph, rates didn't randomly change or disappear, which is very important for data like network errors. As a result, maybe (?) it doesn't handle rolling restarts as well as Prometheus, but this doesn't apply much to my data. I realize Grafana is likely somewhat at fault too, but the console graph has the same behavior. Also, Alin's xrate fixed the problem for me completely without changes to Grafana.

I patched Alin's code on top of 2.0.0, which I'm running in prod. Here is a comparison of the two on one of my routers that has a known SNMP bug (it randomly returns 0 errors without resetting counters, which makes for a good stress test):

  • router (verified via snmpwalk): "186 output errors", "0 output errors", "186 output errors"
  • xincrease(ifOutErrors[15s]): 186, 0, 186 (repeated thousands of times over days without any deviation)
  • increase(ifOutErrors[45s])/3 (a hack, since increase over a range equal to the scrape interval returns no data): 0, 72, 0, 84, 0, 74, 0, 91, 0, 80 ...

My observations:

  • increase misled me into thinking that real errors were continuously occurring (and is always X% off of the real value regardless of what I do).
  • xincrease accurately told me that something was injecting exactly 186 errors several times a minute (a router bug since it never showed more than 186).
  • xincrease gives me much more confidence that all errors are displayed (e.g. I've seen several accurate "1 error" cases reported where increase shows nothing, unless I get lucky when I hit refresh several times).
  • xincrease works very accurately @scrape, even in alerts.
  • Both increase/xincrease handled the constant 0 resets from the buggy router well.
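
To make the difference concrete, here is a minimal Python sketch of the two approaches as I understand them. It is not the Prometheus or xrate source: the function names, the sample data, and the window boundaries are all made up for illustration, and the extrapolation rules are a simplified model of what rate/increase did around 2.0 (scale the observed increase out toward the window edges). The xincrease-style function assumes the behaviour described in the proposal title, i.e. it also uses the last sample before the window and does no scaling.

```python
from typing import List, Optional, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, counter value)


def raw_increase(samples: List[Sample]) -> float:
    """Sum of sample-to-sample deltas; a drop is treated as a counter reset."""
    total = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        total += cur - prev if cur >= prev else cur
    return total


def extrapolated_increase(samples: List[Sample], start: float, end: float) -> Optional[float]:
    """Simplified model of increase(): only samples inside the window, then scaled."""
    window = [s for s in samples if start <= s[0] <= end]
    if len(window) < 2:
        return None  # a single sample gives no slope, hence "no data"
    first_t, first_v = window[0]
    last_t = window[-1][0]
    delta = raw_increase(window)
    sampled = last_t - first_t
    avg_gap = sampled / (len(window) - 1)
    to_start = first_t - start
    to_end = end - last_t
    # A counter cannot be negative, so never extrapolate back past zero.
    if delta > 0 and first_v >= 0:
        to_start = min(to_start, sampled * first_v / delta)
    # Extrapolate to a window edge only if it is close to the nearest sample,
    # otherwise by half an average scrape gap (assumed rule, see lead-in).
    threshold = avg_gap * 1.1
    covered = sampled
    covered += to_start if to_start < threshold else avg_gap / 2
    covered += to_end if to_end < threshold else avg_gap / 2
    return delta * covered / sampled


def sample_to_sample_increase(samples: List[Sample], start: float, end: float) -> Optional[float]:
    """xincrease-style (assumed): include the last sample before the window, no scaling."""
    before = [s for s in samples if s[0] <= start]
    inside = [s for s in samples if start < s[0] <= end]
    points = before[-1:] + inside
    if len(points) < 2:
        return None
    return raw_increase(points)


# Hypothetical counter scraped every 15s; 186 errors appear between t=17 and t=32.
samples = [(2.0, 1000.0), (17.0, 1000.0), (32.0, 1186.0), (47.0, 1186.0), (62.0, 1186.0)]

print(extrapolated_increase(samples, 15.0, 60.0))      # 279.0, not 186
print(sample_to_sample_increase(samples, 30.0, 45.0))  # 186.0
print(sample_to_sample_increase(samples, 45.0, 60.0))  # 0.0
```

In this toy data the 45s window sees a real jump of 186 but reports 279 after scaling (93 after the /3 hack), while the sample-to-sample version reports the jump exactly once over consecutive 15s windows. The exact numbers depend on where the scrapes fall relative to the window, which would be consistent with the values changing on refresh.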

If xrate/rate can't be merged (and/or you don't want to make it a core PromQL function), has any thought been given to better contrib/extended support (namespaced with x_ or x)? It could still go through your design process, maybe guarded by a build or runtime flag (e.g. --promql.extensions=xrate). That beats me having to merge code every release. There's a lot of power in Prometheus, and it would be nice to support more data science functions (within reason) if someone has a good idea.

Regardless, I'm very happy with Prometheus and I know a lot went into the current rate implementation (I read the thread and watched the video). It's a good default, but the tradeoffs chosen don't work well for my data, hence the need for a new function. It would help a lot of people:

free commented Mar 29, 2018

Hey Justin,

Xoogler here too, and former Borgmon dev 10+ years ago. :o)

Because of the spam issue on Google Groups, I've moved the discussion back to the Prometheus issue tracker -- prometheus/prometheus#3806. I'm not particularly confident that it's going to go anywhere. Brian seems dead set on rejecting it; Björn is so busy that he only drops a couple of lines a month; and everyone else seems unwilling to touch that code with a ten-foot pole.

But feel free to post this there, maybe it helps. And thanks for the support, I've started feeling like a clueless troll because of the lack of feedback.

Cheers,
Alin.
