toschneck/promql-query-downtime-sla.md

## promql-query-downtime-sla.md

      
Metric collection

For reasons not yet understood, Grafana currently does not return the same results as a direct Thanos Query, so first port-forward to the prometheus-thanos-query service:
kubectl port-forward svc/prometheus-thanos-query 10902:10902
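
With the port-forward running, the queries in this document can also be sent to the Prometheus-compatible HTTP API that Thanos Query serves. The sketch below only builds a `query_range` URL and does not send it. The helper name, the `localhost:10902` base URL, and the timestamps are illustrative assumptions; the 3600-second default mirrors the 1h step used in this document.

```python
from urllib.parse import urlencode

# Assumed base URL: the local end of the port-forward.
THANOS_URL = "http://localhost:10902"


def build_range_query_url(query, start, end, step_seconds=3600, base_url=THANOS_URL):
    """Build a URL for the Prometheus-compatible /api/v1/query_range endpoint.

    start and end are Unix timestamps in seconds; step_seconds is the
    query resolution step (3600 = 1h).
    """
    params = urlencode({
        "query": query,
        "start": start,
        "end": end,
        "step": step_seconds,
    })
    return f"{base_url}/api/v1/query_range?{params}"


# Hypothetical 24h window; replace the timestamps with the range of interest.
url = build_range_query_url(
    'avg_over_time(probe_success{job="blackbox-exporter-user-cluster-apiservers"}[1h])',
    start=1700000000,
    end=1700086400,
)
print(url)
```

The resulting URL can be opened with any HTTP client while the port-forward is active.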

Without accounting for scrape errors of blackbox-exporter

Downtime in seconds per hour

Attention: Thanos is currently at 1h data retention, so select a step of 1h.
Single service:
(1 - avg_over_time(
  probe_success{job="blackbox-exporter-user-cluster-apiservers"}[1h]
)) * 60 * 60
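
As a worked example of the arithmetic in this query, the sketch below averages a handful of made-up `probe_success` samples and converts the result to seconds of downtime per hour. The sample values and the function name are illustrative and not taken from a real cluster.

```python
def downtime_seconds_per_hour(samples):
    """Mirror (1 - avg_over_time(probe_success[1h])) * 60 * 60 for one series."""
    if not samples:
        raise ValueError("no samples in the 1h window")
    availability = sum(samples) / len(samples)  # fraction of successful probes
    return (1 - availability) * 60 * 60


# Hypothetical 1h window with four probes, one of which failed.
samples = [1, 1, 0, 1]
print(downtime_seconds_per_hour(samples))  # 900.0
```

One failed probe out of four gives an availability of 0.75, which corresponds to a quarter of the hour, or 900 seconds, of downtime.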

Weighted downtime in seconds across all selected services:
(1 -
  avg(
    avg_over_time(
      probe_success{job="blackbox-exporter-user-cluster-apiservers"}[1h]
    )
  )
) * 60 * 60

Weighted uptime per 1h range (the result is a fraction between 0 and 1; multiply by 100 for a percentage):
avg(
  avg_over_time(
    probe_success{job="blackbox-exporter-user-cluster-apiservers"}[1h]
  )
)
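
To make the aggregation concrete: the outer `avg()` takes the mean of the per-service 1h averages, so every service counts equally. The sketch below uses invented service names and sample values, and multiplies the resulting fraction by 100 to express it as a percentage.

```python
def uptime_percent(per_service_samples):
    """Mirror avg(avg_over_time(probe_success[1h])) and scale to percent."""
    per_service_avg = [
        sum(samples) / len(samples)  # avg_over_time for one service
        for samples in per_service_samples.values()
    ]
    # avg() across services, then fraction -> percent
    return sum(per_service_avg) / len(per_service_avg) * 100


# Hypothetical services: one fully up, one up for half of its probes.
windows = {
    "apiserver-a": [1, 1, 1, 1],
    "apiserver-b": [1, 0, 1, 0],
}
print(uptime_percent(windows))  # 75.0
```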

Filter for the KKP API:
probe_success{instance=~"(https://mgmt\\.kkp\\.customer\\.corp/rest-api)"}

Ignore times of scrape errors of blackbox-exporter

Calculate the downtime in seconds for each API server

# result is a 0.x value => fraction of the 1h window
(1 -
    (
      # yields 1 (counted as up) while a PromScrapeFailed alert is active
      (vector(1) and on ()
          # so probe results are only counted while PromScrapeFailed is not active
          (ALERTS{alertname="PromScrapeFailed",job="blackbox-exporter-user-cluster-apiservers"}))
       or on ()
         # otherwise take the 1h average of probe_success
         avg_over_time(
             probe_success{job="blackbox-exporter-user-cluster-apiservers"}[1h])
    )
)
# convert the fraction to seconds
* 60 * 60
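
The effect of the `and on ()` / `or on ()` combination can be sketched outside PromQL. In this simplified model (function and variable names are assumptions made for illustration), an active `PromScrapeFailed` alert replaces every per-instance availability with a single value of 1, so no downtime is counted; otherwise the per-instance 1h averages are used unchanged.

```python
def downtime_seconds(alert_active, availability_by_instance):
    """Simplified model of the scrape-error-aware downtime query.

    alert_active: True if any matching ALERTS series exists at evaluation time.
    availability_by_instance: 1h average of probe_success per instance.
    """
    if alert_active:
        # vector(1) and on () ALERTS{...} yields a single, label-less 1,
        # and `or on ()` then drops every probe_success series.
        effective = {"": 1.0}
    else:
        # The left-hand side is empty, so `or` falls through to the averages.
        effective = dict(availability_by_instance)
    return {key: (1 - value) * 60 * 60 for key, value in effective.items()}


# Hypothetical instances and 1h averages.
averages = {"https://api-a.example": 0.75, "https://api-b.example": 1.0}
print(downtime_seconds(False, averages))
print(downtime_seconds(True, averages))
```

Note that in this model the alert masks all instances at once, because `on ()` matches on an empty label set rather than per instance.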

Calculate the weighted downtime in seconds across all selected services

# result is a 0.x value => fraction of the 1h window
avg(
    (1 -
        (
          # yields 1 (counted as up) while a PromScrapeFailed alert is active
          (vector(1) and on ()
              # so probe results are only counted while PromScrapeFailed is not active
              (ALERTS{alertname="PromScrapeFailed",job="blackbox-exporter-user-cluster-apiservers"}))
           or on ()
             # otherwise take the 1h average of probe_success
             avg_over_time(
                 probe_success{job="blackbox-exporter-user-cluster-apiservers"}[1h])
        )
    )
)
# convert the fraction to seconds
* 60 * 60

KKP API

# result is a 0.x value => fraction of the 1h window
(1 -
    (
      # yields 1 (counted as up) while a PromScrapeFailed alert is active
      (vector(1) and on ()
          # so probe results are only counted while PromScrapeFailed is not active
          (ALERTS{alertname="PromScrapeFailed",instance=~"(.*.rest-api)"}))
       or on ()
         # otherwise take the 1h average of probe_success
         avg_over_time(
             probe_success{instance=~"(.*.rest-api)"}[1h])
    )
)
# convert the fraction to seconds
* 60 * 60