Currently, for some reason, Grafana does not return the same results as a direct Thanos Query, so first connect to the prometheus-thanos-query
service:
kubectl port-forward svc/prometheus-thanos-query 10902:10902
Attention: Thanos currently has 1h data retention,
so select a step of 1h.
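With the port-forward running, the queries below can also be sent to the Prometheus-compatible HTTP API that Thanos Query exposes. The following is a minimal sketch that only builds the request URL. The helper name build_query_range_url and the example timestamps are hypothetical, and the base URL assumes the local port 10902 from the port-forward command.

```python
from urllib.parse import urlencode

# Assumption: the port-forward to prometheus-thanos-query is running locally.
THANOS_URL = "http://localhost:10902"


def build_query_range_url(query, start, end, step="1h", base_url=THANOS_URL):
    """Build a query_range URL; the step defaults to 1h as noted above."""
    params = {"query": query, "start": start, "end": end, "step": step}
    return f"{base_url}/api/v1/query_range?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical time range, for illustration only.
    query = (
        "(1-avg_over_time("
        'probe_success{job="blackbox-exporter-user-cluster-apiservers"}[1h]))*60*60'
    )
    print(build_query_range_url(query, "2024-01-01T00:00:00Z", "2024-01-02T00:00:00Z"))
```

The printed URL can be opened with any HTTP client; the response is JSON in the usual Prometheus format.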
single service (downtime in seconds per 1h range)
(1-avg_over_time(
probe_success{job="blackbox-exporter-user-cluster-apiservers"}
[1h]
))
*60*60
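The arithmetic behind this query can be sketched in a few lines: avg_over_time of probe_success over 1h is the fraction of successful probes, and the unavailable fraction is scaled to the 3600 seconds of the window. The function name downtime_seconds is hypothetical and only illustrates the calculation.

```python
def downtime_seconds(availability, window_seconds=3600):
    """Convert an availability fraction (0..1) into seconds of downtime."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    # Same as (1 - avg_over_time(probe_success[1h])) * 60 * 60 in the query.
    return (1.0 - availability) * window_seconds


if __name__ == "__main__":
    # Half of the probes in the hour failed.
    print(downtime_seconds(0.5))
```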
weighted downtime in seconds across all selected services
(1-
avg(
avg_over_time(
probe_success{job="blackbox-exporter-user-cluster-apiservers"}
[1h])
)
)
*60*60
weighted uptime per 1h range (a fraction between 0 and 1; multiply by 100 for %)
avg(
avg_over_time(
probe_success{job="blackbox-exporter-user-cluster-apiservers"}
[1h])
)
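To make the aggregation concrete: avg() over the per-service avg_over_time values gives every selected service the same weight. The sketch below mirrors that with plain numbers; the function names and the three sample values are hypothetical.

```python
def weighted_uptime(per_service_availability):
    """Mean of the per-service availability fractions, like avg(avg_over_time(...))."""
    if not per_service_availability:
        raise ValueError("at least one service is required")
    return sum(per_service_availability) / len(per_service_availability)


def weighted_downtime_seconds(per_service_availability, window_seconds=3600):
    """Downtime in seconds derived from the averaged availability."""
    return (1.0 - weighted_uptime(per_service_availability)) * window_seconds


if __name__ == "__main__":
    # Hypothetical values: two services fully up, one up 70% of the hour.
    sample = [1.0, 1.0, 0.7]
    print(weighted_uptime(sample))
    print(weighted_downtime_seconds(sample))
```

With these sample values the uptime is roughly 0.9 and the downtime roughly 360 seconds.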
Filter for KKP API
probe_success{instance=~"(https://mgmt\\.kkp\\.customer\\.corp/rest-api)"}
# results in 0.x values => fraction of the 1h window
(1 -
(
# return 1 while the alert is firing (instead of nothing)
(vector (1) and on ()
# count probe results only while no PromScrapeFailed alert is firing
(ALERTS{alertname="PromScrapeFailed",job="blackbox-exporter-user-cluster-apiservers"}))
OR on ()
# take average of 1h time
avg_over_time(
probe_success{job="blackbox-exporter-user-cluster-apiservers"}
[1h])
)
)
# calculate seconds out of value
*60*60
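The masking logic of this query can be read as a simple rule: while the PromScrapeFailed alert is present, the value 1 is used, so the period is not counted as downtime; otherwise the averaged probe_success value is used. The sketch below restates that rule for a single value. The function name masked_downtime_seconds is hypothetical. Note that in PromQL, "or on ()" drops every series on the right-hand side as soon as the left-hand side returns anything, so the result is then a single unlabeled value.

```python
def masked_downtime_seconds(avg_probe_success, scrape_alert_firing, window_seconds=3600):
    """Downtime in seconds, ignoring periods where the scrape alert is present."""
    # Left side of "or": 1 while the alert exists; right side: the probe average.
    effective = 1.0 if scrape_alert_firing else avg_probe_success
    return (1.0 - effective) * window_seconds


if __name__ == "__main__":
    print(masked_downtime_seconds(0.5, scrape_alert_firing=True))
    print(masked_downtime_seconds(0.5, scrape_alert_firing=False))
```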
# results in 0.x values => fraction of the 1h window
avg(
(1 -
(
# return 1 while the alert is firing (instead of nothing)
(vector (1) and on ()
# count probe results only while no PromScrapeFailed alert is firing
(ALERTS{alertname="PromScrapeFailed",job="blackbox-exporter-user-cluster-apiservers"}))
OR on ()
# take average of 1h time
avg_over_time(
probe_success{job="blackbox-exporter-user-cluster-apiservers"}
[1h])
)
)
)
# calculate seconds out of value
*60*60
# same calculation, filtered to instances ending in rest-api
(1 -
(
# return 1 while the alert is firing (instead of nothing)
(vector (1) and on ()
# count probe results only while no PromScrapeFailed alert is firing
(ALERTS{alertname="PromScrapeFailed",instance=~"(.*.rest-api)"}))
OR on ()
# take average of 1h time
avg_over_time(
probe_success{instance=~"(.*.rest-api)"}
[1h])
)
)
# calculate seconds out of value
*60*60