@toschneck
Last active February 12, 2023 20:35
downtime calc prometheus

Metric collection

For reasons not yet understood, Grafana currently does not return the same results as a direct Thanos query, so first connect to the prometheus-thanos-query service with a port-forward:

kubectl port-forward svc/prometheus-thanos-query 10902:10902
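
With the port-forward running, the queries in this note can also be sent to the Prometheus-compatible HTTP API that Thanos Query serves. The following is a minimal sketch, assuming the forward listens on localhost:10902 and the API is reachable without authentication; the helper names `build_query_range_url` and `fetch_range` are illustrative and not part of the original setup.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the port-forward is running and listens on this address.
THANOS_URL = "http://localhost:10902"


def build_query_range_url(base_url, query, start, end, step="1h"):
    """Build a /api/v1/query_range URL; start and end are Unix timestamps."""
    params = urllib.parse.urlencode(
        {"query": query, "start": start, "end": end, "step": step}
    )
    return f"{base_url}/api/v1/query_range?{params}"


def fetch_range(query, start, end, step="1h"):
    """Run a range query and return the list of result series."""
    url = build_query_range_url(THANOS_URL, query, start, end, step)
    with urllib.request.urlopen(url, timeout=30) as resp:
        body = json.load(resp)
    if body.get("status") != "success":
        raise RuntimeError(f"query failed: {body}")
    return body["data"]["result"]
```

Calling `fetch_range(query, start, end)` with one of the expressions from this note returns one entry per series, each with a `metric` label set and a list of `[timestamp, value]` pairs.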

[Screenshot: Thanos query UI ("Thanos | Highly available Prometheus setup"), 2022-06-13 07-56-11]

Without accounting for scrape errors of blackbox-exporter

downtime in seconds per hour

Attention: Thanos is currently at 1h data retention, so select a step of 1h.

single service

(1-avg_over_time(
probe_success{job="blackbox-exporter-user-cluster-apiservers"}
[1h]
))

*60*60
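
The arithmetic behind this query can be sketched in Python. `probe_success` is 1 for a successful probe and 0 for a failed one, so its average over the window is the availability ratio, and the remainder scaled by 3600 is the downtime in seconds. The sample data below is hypothetical, and the sketch assumes probes are evenly spaced within the hour.

```python
def downtime_seconds(samples, window_seconds=3600):
    """Mirror (1 - avg_over_time(probe_success[1h])) * 60 * 60."""
    if not samples:
        raise ValueError("no samples in window")
    availability = sum(samples) / len(samples)
    return (1 - availability) * window_seconds


# Hypothetical hour: 60 probes, 3 of them failed.
samples = [1] * 57 + [0] * 3
print(round(downtime_seconds(samples), 1))  # 180.0
```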

weighted downtime in seconds across all selected services

(1-
  avg(
    avg_over_time(
      probe_success{job="blackbox-exporter-user-cluster-apiservers"}
      [1h])
  ) 
)
*60*60
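
As a sketch of what the outer `avg(...)` does: each service is first reduced to its own availability ratio, and those ratios are then averaged, so every service counts equally regardless of how many probes it has. The result is therefore the average downtime per service, not the sum over all services. Service names and samples below are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)


def fleet_downtime_seconds(per_service_samples, window_seconds=3600):
    """Mirror (1 - avg(avg_over_time(probe_success[1h]))) * 60 * 60."""
    per_service_availability = [
        mean(samples) for samples in per_service_samples.values()
    ]
    return (1 - mean(per_service_availability)) * window_seconds


# Hypothetical services: one fully up, one down for half of the hour.
samples = {
    "api-a": [1] * 60,
    "api-b": [1] * 30 + [0] * 30,
}
print(fleet_downtime_seconds(samples))  # 900.0
```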

weighted uptime per 1h range (a ratio from 0 to 1; multiply by 100 for %)

avg(
    avg_over_time(
      probe_success{job="blackbox-exporter-user-cluster-apiservers"}
      [1h])
)

Filter for KKP API

 probe_success{instance=~"(https://mgmt\\.kkp\\.customer\\.corp/rest-api)"}
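
Two details of this matcher are easy to miss: the doubled backslashes are consumed by the PromQL string literal, so the regex engine sees `\.`, and Prometheus anchors label regexes at both ends. The Python sketch below approximates that behaviour with `re.fullmatch`; Python's `re` is not the same engine as the one Prometheus uses, but the two agree for a pattern this simple. The test URLs are hypothetical.

```python
import re

# The pattern as the regex engine sees it after PromQL string unescaping.
INSTANCE_REGEX = r"(https://mgmt\.kkp\.customer\.corp/rest-api)"


def matches_instance(instance):
    """Approximate the =~ matcher: the whole label value must match."""
    return re.fullmatch(INSTANCE_REGEX, instance) is not None


print(matches_instance("https://mgmt.kkp.customer.corp/rest-api"))     # True
print(matches_instance("https://mgmt.kkp.customer.corp/rest-api/v1"))  # False
print(matches_instance("https://mgmtXkkp.customer.corp/rest-api"))     # False
```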

Ignore times of scrape errors of blackbox-exporter

Calculate downtime in seconds for each API server

# result is a 0.x value => fraction of the 1h window spent down
(1 -
    (
      # returns 1 (counted as fully up) while the alert is present, nothing otherwise
      (vector (1) and on ()
          # so probe results are only counted if no PromScrapeFailed alert is present
          (ALERTS{alertname="PromScrapeFailed",job="blackbox-exporter-user-cluster-apiservers"}))
       OR  on ()
         # otherwise take the average over the 1h window
         avg_over_time(
             probe_success{job="blackbox-exporter-user-cluster-apiservers"}
         [1h])
    )
)
# convert the fraction to seconds
*60*60
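
The set logic of this query can be emulated in a few lines of Python. Because both operators use `on ()`, the left side is either empty or a single label-less sample with value 1. As soon as any matching alert series exists for the job, that sample replaces every `probe_success` series, so the result is one label-less value of 0 seconds for all services. The function and argument names below are illustrative, and the availability values are hypothetical.

```python
def masked_downtime_seconds(per_service_availability, alert_present,
                            window_seconds=3600):
    """Emulate
    (1 - ((vector(1) and on() ALERTS{...}) or on() avg_over_time(...))) * 3600.
    """
    if alert_present:
        # Left side of `or on ()` is non-empty: one label-less sample of 1,
        # which suppresses every series from the right side.
        return {"": (1 - 1) * window_seconds}
    # Left side is empty: the per-service averages pass through unchanged.
    return {
        service: (1 - availability) * window_seconds
        for service, availability in per_service_availability.items()
    }


availability = {"api-a": 1.0, "api-b": 0.75}
print(masked_downtime_seconds(availability, alert_present=False))
print(masked_downtime_seconds(availability, alert_present=True))
```

The first call yields per-service downtime (0 and 900 seconds here); the second yields a single unlabeled 0, which illustrates that per-service labels are lost while the alert is present.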

Calculate weighted downtime in seconds of all selected services

# result is a 0.x value => fraction of the 1h window spent down
avg(
    (1 -
        (
          # returns 1 (counted as fully up) while the alert is present, nothing otherwise
          (vector (1) and on ()
              # so probe results are only counted if no PromScrapeFailed alert is present
              (ALERTS{alertname="PromScrapeFailed",job="blackbox-exporter-user-cluster-apiservers"}))
           OR  on ()
             # otherwise take the average over the 1h window
             avg_over_time(
                 probe_success{job="blackbox-exporter-user-cluster-apiservers"}
             [1h])
        )
    )
)
# convert the fraction to seconds
*60*60

KKP API

# result is a 0.x value => fraction of the 1h window spent down
(1 -
    (
      # returns 1 (counted as fully up) while the alert is present, nothing otherwise
      (vector (1) and on ()
          # so probe results are only counted if no PromScrapeFailed alert is present
          (ALERTS{alertname="PromScrapeFailed",instance=~"(.*.rest-api)"}))
       OR  on ()
         # otherwise take the average over the 1h window
         avg_over_time(
             probe_success{instance=~"(.*.rest-api)"}
         [1h])
    )
)
# convert the fraction to seconds
*60*60
