Federating Kong Mesh Metrics in OpenShift

Federate Kong Mesh Metrics

We will federate the Kong Mesh Prometheus into the OpenShift Prometheus and then leverage OpenShift Monitoring to handle the aggregated metrics.

Slides

Pending

  • Define more Params in ServiceMonitor to dedupe metrics from Kong
  • Create PrometheusRules for Alerting when something goes down
  • Create Dashboards

Install Kong Mesh

Install control plane on kong-mesh-system namespace

kumactl install control-plane --cni-enabled --license-path=license | oc apply -f -

Wait for the Control Plane pod to be ready

oc wait --for=condition=ready pod -l app.kubernetes.io/instance=kong-mesh -n kong-mesh-system --timeout=240s

Expose the service

oc expose svc/kong-mesh-control-plane -n kong-mesh-system --port http-api-server

Verify the Installation

http -h `oc get route kong-mesh-control-plane -n kong-mesh-system -ojson | jq -r .spec.host`/gui/

output

HTTP/1.1 200 OK
accept-ranges: bytes
cache-control: private
content-length: 5962
content-type: text/html; charset=utf-8
date: Thu, 05 May 2022 15:56:54 GMT
set-cookie: 559045d469a6cf01d61b4410371a08e0=1cb25fd8a600fb50f151da51bc64109c; path=/; HttpOnly

Kuma Demo application

Apply the anyuid SCC to the kuma-demo namespace's service accounts

oc adm policy add-scc-to-group anyuid system:serviceaccounts:kuma-demo

Install resources in kuma-demo ns

oc apply -f https://raw.githubusercontent.com/kumahq/kuma-demo/master/kubernetes/kuma-demo-aio.yaml

Wait for the demo app to be ready

oc wait --for=condition=ready pod -l app=kuma-demo-frontend -n kuma-demo --timeout=240s

oc wait --for=condition=ready pod -l app=kuma-demo-backend -n kuma-demo --timeout=240s

oc wait --for=condition=ready pod -l app=postgres -n kuma-demo --timeout=240s

oc wait --for=condition=ready pod -l app=redis -n kuma-demo --timeout=240s

Expose the frontend service

oc expose svc/frontend -n kuma-demo

Validate the deployment

http -h `oc get route frontend -n kuma-demo -ojson | jq -r .spec.host` 

output

HTTP/1.1 200 OK
cache-control: max-age=3600
cache-control: private
content-length: 862
content-type: text/html; charset=UTF-8
date: Tue, 10 May 2022 11:11:11 GMT
etag: W/"251702827-862-2020-08-16T00:52:19.000Z"
last-modified: Sun, 16 Aug 2020 00:52:19 GMT
server: envoy
set-cookie: 7132be541f54d5eca6de5be20e9063c8=d64fd1cc85da2d615f07506082000ef8; path=/; HttpOnly

The namespace has the kuma.io/sidecar-injection annotation; confirm this:

oc get ns kuma-demo -ojsonpath='{ .metadata.annotations.kuma\.io\/sidecar-injection }'

output

enabled

Check that sidecar injection has been performed (each pod should include a kuma-sidecar container):

oc -n kuma-demo get po -ojson | jq '.items[] | .spec.containers[] | .name'

output

"kuma-fe"
"kuma-sidecar"
"kuma-be"
"kuma-sidecar"
"master"
"kuma-sidecar"
"master"
"kuma-sidecar"

Enable mTLS on the default Mesh

oc apply -f -<<EOF
apiVersion: kuma.io/v1alpha1
kind: Mesh
metadata:
  name: default
spec:
  mtls:
    enabledBackend: ca-1
    backends:
    - name: ca-1
      type: builtin
EOF
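
Optionally spot-check that the Mesh now has mTLS enabled; this simply reads back the spec applied above and should print ca-1:

oc get mesh default -ojsonpath='{.spec.mtls.enabledBackend}'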

Take down Redis in the kuma-demo namespace (for Alert Demo)

oc scale deploy/redis-master --replicas=0 -n kuma-demo

Configure Metrics

Note: If you have already configured mTLS in your mesh, the default installation won't work because the Grafana deployment has an initContainer that pulls the dashboards from a GitHub repository. Ruben built a custom Grafana image that works around the issue.

Apply the nonroot SCC to the kong-mesh-metrics namespace

oc adm policy add-scc-to-group nonroot system:serviceaccounts:kong-mesh-metrics

Install metrics

kumactl install metrics | oc apply -f -

Fix Grafana by removing the initContainer and patching the Grafana image on the deployment

oc patch deployment -n kong-mesh-metrics grafana -p='[{"op": "remove", "path": "/spec/template/spec/initContainers"}]' --type=json

oc set image deploy/grafana -n kong-mesh-metrics grafana=quay.io/ruben/grafana:8.3.3-kong

Because strict mTLS is enabled, Prometheus would otherwise need to be configured to scrape using certificates. Instead, remove sidecar injection from the metrics namespace and recreate its pods:

oc label ns kong-mesh-metrics kuma.io/sidecar-injection-

oc delete po -n kong-mesh-metrics --force --grace-period=0 --all

Wait for the metrics pods to be ready

oc wait --for=condition=ready pod -l app=grafana -n kong-mesh-metrics --timeout=240s

oc wait --for=condition=ready pod -l app=prometheus -n kong-mesh-metrics --timeout=240s
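
The metrics pods should now be running without the kuma-sidecar container, since sidecar injection was removed from the namespace. A quick check, mirroring the one used for kuma-demo earlier:

oc -n kong-mesh-metrics get po -ojson | jq '.items[] | .spec.containers[] | .name'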

Configure the metrics in the existing mesh

oc apply -f -<<EOF
apiVersion: kuma.io/v1alpha1
kind: Mesh
metadata:
  name: default
spec:
  mtls:
    enabledBackend: ca-1
    backends:
    - name: ca-1
      type: builtin
  metrics:
    enabledBackend: prometheus-1
    backends:
    - name: prometheus-1
      type: prometheus
      conf:
        port: 5670
        path: /metrics
        skipMTLS: true
EOF
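
Once the Mesh is updated, every sidecar exposes plaintext Prometheus metrics on port 5670 at /metrics (skipMTLS: true means no client certificate is needed). A quick spot-check against one of the demo dataplanes (a sketch; stop the port-forward afterwards):

POD=$(oc get po -n kuma-demo -l app=kuma-demo-frontend -o jsonpath='{.items[0].metadata.name}')
oc port-forward -n kuma-demo "$POD" 5670 &
curl -s http://localhost:5670/metrics | head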

Create a ClusterRoleBinding to Allow Scraping

We need to allow the prometheus-k8s service account to scrape the kong-mesh-metrics resources.

Create the ClusterRole and ClusterRoleBinding

oc apply -f -<<EOF
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  creationTimestamp: null
  name: kong-prom
rules:
- apiGroups:
  - ""
  resources:
  - endpoints
  - services
  - pods
  - pods/status
  verbs:
  - get
  - list
  - watch
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  creationTimestamp: null
  name: kong-prom-binding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: kong-prom
subjects:
- kind: ServiceAccount
  name: prometheus-k8s
  namespace: openshift-monitoring
EOF

Check permissions for prometheus-k8s service-account

oc auth can-i get pods --namespace=kong-mesh-metrics --as system:serviceaccount:openshift-monitoring:prometheus-k8s

oc auth can-i get endpoints --namespace=kong-mesh-metrics --as system:serviceaccount:openshift-monitoring:prometheus-k8s

oc auth can-i get services --namespace=kong-mesh-metrics --as system:serviceaccount:openshift-monitoring:prometheus-k8s 

output

yes
yes
yes

Federate Metrics to OpenShift Monitoring

A ServiceMonitor tells Prometheus which endpoints to scrape, and typically we create one ServiceMonitor or PodMonitor per application. A ServiceMonitor can also federate another Prometheus instance by scraping its /federate endpoint, which is what we do here.
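
Before creating the ServiceMonitor, you can spot-check the /federate endpoint it will scrape. This is a sketch; it assumes the Kong Mesh Prometheus Service is named prometheus-server (the Service the app=prometheus,component=server selector below is meant to match):

oc port-forward svc/prometheus-server -n kong-mesh-metrics 9091:9090 &
curl -sG 'http://localhost:9091/federate' --data-urlencode 'match[]={job=~"kuma-dataplanes"}' | head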

Create a ServiceMonitor in OpenShift Monitoring to federate Kong Metrics

oc apply -f -<<EOF
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: kong-federation
  namespace: openshift-monitoring
  labels:
    app: kong-mesh
spec:
  jobLabel: exporter-kong-mesh
  namespaceSelector:
    matchNames:
    - kong-mesh-metrics
  selector:
    matchLabels:
      app: prometheus
      component: server
  endpoints:
  - interval: 2s # 2s is only for the demo; use 30s in practice
    scrapeTimeout: 2s # likewise, use 30s in practice
    path: /federate
    targetPort: 9090
    port: http
    params:
      'match[]':
      - '{job=~"kuma-dataplanes"}'
      - '{job=~"kubernetes-service-endpoints",kubernetes_namespace=~"kong-mesh-system"}'
    honorLabels: true
EOF

Make sure OCP Prom logs are clean

oc logs prometheus-k8s-1  -n openshift-monitoring --since=1m -c prometheus | grep kong-mesh-metrics

Verify you are scraping Metrics

Go to the OpenShift Prometheus and take a look at the targets

Go to Prom:

oc port-forward svc/prometheus-operated -n openshift-monitoring 9090 

Open the Prometheus Targets page (http://localhost:9090/targets).

Do Ctrl-F (or however you search in your browser) and search for kong-federation. The target may take about 30 seconds to become healthy.
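
You can also check the target from the API instead of the UI (assuming the port-forward above is still running):

http localhost:9090/api/v1/targets | jq '.data.activeTargets[] | select(.scrapePool | contains("kong-federation")) | {scrapePool, health, lastError}'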

Create PrometheusRules

Prometheus rules alert us when things go wrong or go down. The simplest and most important rules you can have are ones that fire when a service goes down and stays down. We are going to define just a few rules:

  1. ControlPlane Down
  2. Federation Down (Kong's Prom Server is down)
  3. Kong Demo Backend Down
  4. Kong Demo Frontend Down
  5. Kong Demo Postgres Down
  6. Kong Demo Redis Down
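
Each of these rules relies on absent(): the expression returns a single series with value 1 when no matching up series exists in the federated metrics, and returns nothing while the target is healthy. You can sanity-check an expression against the OpenShift Prometheus before creating the rules (assuming the port-forward from the previous section is still running; Redis was already scaled down earlier, so this should return a result):

http localhost:9090/api/v1/query query=='absent(up{app="redis",job="kuma-dataplanes"} == 1)' | jq '.data.result'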

Let's create the PrometheusRule:

oc apply -f -<<EOF
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    role: alert-rules
  name: mesh-rules
  namespace: openshift-monitoring
spec:
  groups:
  - name: dataplane-rules
    rules:
    - alert: KongDemoBackendDown
      annotations:
        description: Demo Backend pod has not been ready for the defined time period.
        summary: Demo Backend is down.
      expr: absent(up{app="kuma-demo-backend",job="kuma-dataplanes"} == 1)
      for: 60s
      labels:
        severity: critical
    - alert: KongDemoFrontendDown
      annotations:
        description: Demo Frontend pod has not been ready for the defined time period.
        summary: Demo Frontend is down.
      expr: absent(up{app="kuma-demo-frontend",job="kuma-dataplanes"} == 1)
      for: 5s
      labels:
        severity: critical
    - alert: KongDemoDBDown
      annotations:
        description: Demo DB pod has not been ready for the defined time period.
        summary: Demo DB is down.
      expr: absent(up{app="postgres",job="kuma-dataplanes"} == 1)
      for: 5s
      labels:
        severity: critical
    - alert: KongDemoCacheDown
      annotations:
        description: Demo Cache pod has not been ready for the defined time period.
        summary: Demo Cache is down.
      expr: absent(up{app="redis",job="kuma-dataplanes"} == 1)
      for: 5s
      labels:
        severity: critical
  - name: mesh-rules
    rules:
    - alert: KongControlPlaneDown
      annotations:
        description: ControlPlane pod has not been ready for over a minute.
        summary: CP is down.
      expr: absent(kube_pod_container_status_ready{namespace="kong-mesh-system"})
      for: 5s
      labels:
        severity: critical
    - alert: KongMetricsDown
      annotations:
        description: Kong Metrics not being federated.
        summary: Kong Prometheus is down.
      expr: absent(kube_pod_container_status_ready{container="prometheus-server",namespace="kong-mesh-metrics"})
      for: 1m
      labels:
        severity: critical
EOF

We want to be alerted when a critical service goes down, so let's test an alert to make sure we will be notified when these incidents occur. To keep this brief, we will only test the KongDemoCacheDown rule.

Take down the Cache in Kuma Demo (this should have already been done in a previous step)

oc scale deploy/redis-master -n kuma-demo --replicas=0

Check the alerts in the OpenShift Prometheus. Go to Prometheus and wait until you see the alert for KongDemoCacheDown.

oc port-forward svc/prometheus-operated -n openshift-monitoring 9090 

Open the Prometheus Alerts page (http://localhost:9090/alerts).

You can also get the alerts directly from the API:

http localhost:9090/api/v1/alerts | jq '.data.alerts | .[] | select(.labels.alertname | contains("KongDemoCacheDown"))'

output

{
  "labels": {
    "alertname": "KongDemoCacheDown",
    "severity": "critical"
  },
  "annotations": {
    "description": "Demo Cache pod has not been ready the defined time period.",
    "summary": "Demo Cache is down."
  },
  "state": "firing",
  "activeAt": "2022-05-17T14:31:00.456457993Z",
  "value": "1e+00"
}

Bring up the Cache in Kuma Demo

oc scale deploy/redis-master -n kuma-demo --replicas=1

Create Grafana Dashboards

Since the built-in Grafana is deprecated in OCP 4.10, we use the Grafana Operator for ease of use and configuration.

https://access.redhat.com/solutions/6615991

Create the Grafana namespace and install the Grafana Operator

oc apply -f -<<EOF
apiVersion: v1
kind: Namespace
metadata:
  name: grafana
spec: {}
status: {}
---
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: grafana
  namespace: grafana
spec:
  targetNamespaces:
  - grafana
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  labels:
    operators.coreos.com/grafana-operator.grafana: ""
  name: grafana-operator
  namespace: grafana
spec:
  channel: v4
  installPlanApproval: Automatic
  name: grafana-operator
  source: community-operators
  sourceNamespace: openshift-marketplace
  startingCSV: grafana-operator.v4.4.1
EOF

Wait for Grafana Controller to become ready

oc wait --for=condition=Ready --timeout=180s pod -l control-plane=controller-manager -n grafana
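
Optionally confirm that the operator's CSV reached the Succeeded phase:

oc get csv -n grafana -o custom-columns=NAME:.metadata.name,PHASE:.status.phase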

Create an instance of Grafana

oc apply -f -<<EOF
apiVersion: integreatly.org/v1alpha1
kind: Grafana
metadata:
  name: grafana
  namespace: grafana
spec:
  baseImage: grafana/grafana:8.3.3 # same as Kong's Grafana
  client:
    preferService: true
  config:
    security:
      admin_user: "admin"
      admin_password: "admin"
    users:
      viewers_can_edit: True
    log:
      mode: "console"
      level: "error"
    log.frontend:
      enabled: true
    auth:
      disable_login_form: True
      disable_signout_menu: True
    auth.anonymous:
      enabled: True
  service:
    name: "grafana-service"
    labels:
      app: "grafana"
      type: "grafana-service"
  dashboardLabelSelector:
    - matchExpressions:
        - { key: app, operator: In, values: [grafana] }
  resources:
    # Optionally specify container resources
    limits:
      cpu: 200m
      memory: 200Mi
    requests:
      cpu: 100m
      memory: 100Mi
EOF

Wait for Grafana to become ready

oc wait --for=condition=Ready --timeout=180s pod -l app=grafana -n grafana

Connect Prometheus to Grafana

  • Grant the grafana-serviceaccount the cluster-monitoring-view ClusterRole.
  • Get a Bearer Token for the grafana-serviceaccount.
  • Create an instance of GrafanaDataSource with the Bearer Token.

oc apply -f -<<EOF
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  creationTimestamp: null
  name: cluster-monitoring-view
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-monitoring-view
subjects:
- kind: ServiceAccount
  name: grafana-serviceaccount
  namespace: grafana
---
apiVersion: integreatly.org/v1alpha1
kind: GrafanaDataSource
metadata:
  name: prometheus-grafanadatasource
  namespace: grafana
spec:
  datasources:
    - access: proxy
      editable: true
      isDefault: true
      jsonData:
        httpHeaderName1: 'Authorization'
        timeInterval: 5s
        tlsSkipVerify: true
      name: Prometheus
      secureJsonData:
        httpHeaderValue1: 'Bearer $(oc serviceaccounts get-token grafana-serviceaccount -n grafana)'
      type: prometheus
      url: 'https://thanos-querier.openshift-monitoring.svc.cluster.local:9091'
  name: prometheus-grafanadatasource.yaml
EOF
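
To verify the wiring end to end, you can replay what Grafana will do: query the Thanos Querier with the grafana-serviceaccount token. A hypothetical spot-check (expects "success"; stop the port-forward afterwards):

TOKEN=$(oc serviceaccounts get-token grafana-serviceaccount -n grafana)
oc port-forward svc/thanos-querier -n openshift-monitoring 9091 &
curl -sk -H "Authorization: Bearer $TOKEN" 'https://localhost:9091/api/v1/query?query=up' | jq '.status'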

Create the Dashboards

oc apply -f mesh/dashboards
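
If you don't have the mesh/dashboards files handy, here is a hypothetical minimal GrafanaDashboard showing what the operator expects: it carries the app: grafana label required by spec.dashboardLabelSelector above and references the data source by the uid Prometheus (see Gotchas):

oc apply -f -<<EOF
apiVersion: integreatly.org/v1alpha1
kind: GrafanaDashboard
metadata:
  name: kong-mesh-overview
  namespace: grafana
  labels:
    app: grafana
spec:
  json: |
    {
      "title": "Kong Mesh Overview (sketch)",
      "panels": [
        {
          "type": "timeseries",
          "title": "Dataplanes up",
          "datasource": { "type": "prometheus", "uid": "Prometheus" },
          "targets": [ { "expr": "sum(up{job=\"kuma-dataplanes\"})" } ]
        }
      ]
    }
EOF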

Bring up the Grafana Instance

oc port-forward svc/grafana-service 3000 -n grafana 

Open Grafana (http://localhost:3000).

Clean up

Clean up Grafana

oc delete grafanadashboard,grafanadatasource,grafana -n grafana --all --force --grace-period=0

oc delete subs,og,csv -n grafana --all --force --grace-period=0

Uninstall Demo App

oc delete -f https://raw.githubusercontent.com/kumahq/kuma-demo/master/kubernetes/kuma-demo-aio.yaml
oc adm policy remove-scc-from-group anyuid system:serviceaccounts:kuma-demo
oc delete routes -n kuma-demo --force --all
oc delete mesh default

Uninstall metrics

oc delete servicemonitor -n openshift-monitoring kong-federation 
oc delete prometheusrules -n openshift-monitoring mesh-rules
oc adm policy remove-scc-from-group nonroot system:serviceaccounts:kong-mesh-metrics
kumactl install metrics | oc delete -f -
oc delete pvc,po --force --grace-period=0 --all -n kong-mesh-metrics

Uninstall Kong Mesh

kumactl install control-plane --cni-enabled | oc delete -f -
sleep 3;
oc delete route -n kong-mesh-system --all
oc delete po,pvc --all -n kong-mesh-system --force --grace-period=0
sleep 3;
oc delete clusterrolebinding cluster-monitoring-view   
oc delete clusterrolebinding kong-prom-binding
oc delete clusterrole kong-prom 

oc delete ns grafana

Gotchas

In the Grafana dashboards, the Prometheus data source uid must be Prometheus, with a capital "P".
