Context

Due to a lack of control plane observability, I recently re-integrated the Datadog helm chart on our kops-provisioned k8s cluster. kops is definitely not the most popular k8s solution, and the official control plane monitoring guide doesn't cover the detailed steps.

Throughout the process, I ran into one major issue (details later), potentially caused by a compatibility problem between kops and datadog-agent. The investigation kept me busy, and I still don't have a definitive answer on how to "fix" it. However, I came up with a way to bypass the issue and still get full visibility coverage for the control plane.

Overview

A k8s control plane has 4 major components:

  • kube-apiserver
  • etcd
  • kube-scheduler
  • kube-controller-manager

all of which are supported by native Datadog integrations (they ship with datadog-agent). The recommended integration guide relies on kubernetes integration auto-discovery, which does not work on kops-provisioned control planes.

I'll walk through the issue and findings, and follow up with a step-by-step guide on how to bypass it.

The details covered here are based on the following setup:

For brevity, I'll refer to "kops-provisioned control plane node(s)" as "control node(s)" unless explicitly specified.

Problem

I'll use kube-scheduler to illustrate the problem (the same problem applies to all 4 components).

Example integration (values.yaml for datadog helm chart):

datadog:
  apiKey: <DATADOG_API_KEY>
...
  ignoreAutoConfig:
  - kube_scheduler
...
  confd:
    kube_scheduler.yaml: |-
      ad_identifiers:
        - kube-scheduler
      instances:
        - prometheus_url: https://%%host%%:10259/metrics
          ssl_verify: false
          bearer_token_auth: true

This is the recommended approach from the official control plane monitoring guide, and it relies on kubernetes integration auto-discovery.

On a control node, the configuration above does NOT turn on the integration(s):

  • a valid configuration file for the integration exists under /etc/datadog-agent/conf.d/
  • the integration does not show up as running in the agent status output (see the commands below)
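
Both symptoms can be checked from inside the datadog-agent pod running on the control node. A minimal sketch, assuming the agent pod is named datadog-agent-xxxxx (a placeholder; adjust for your release and namespace):

# confirm the rendered configuration file made it into the agent container
kubectl exec -it datadog-agent-xxxxx -- ls /etc/datadog-agent/conf.d/

# check whether the kube_scheduler check shows up as running
kubectl exec -it datadog-agent-xxxxx -- agent status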

Investigation and Findings

The first thing I checked was the configuration content: I walked through the helm chart values reference and did not see anything wrong.

I've set up the Datadog helm chart for control plane monitoring before, on EKS clusters and on docker-desktop / minikube. There, the identical configuration doesn't work 100%, but at least the integrations are detected correctly via auto-discovery. The container names I saw when running docker ps on control nodes have the right short name & image name (which datadog-agent uses to derive the ad_identifier), so I'm confident the configuration (especially the ad_identifiers section) is not the problem.

The next thing I did was turn on debug logging (datadog.logLevel: debug; logs available at /var/log/datadog/agent.log) for the datadog helm chart on both my kops cluster and docker-desktop / minikube. From the debug log I worked out roughly how auto-discovery in datadog-agent works:

  • file-based configurations (/etc/datadog-agent/conf.d/) are loaded into memory, and running containers & processes are detected
  • each detected container/process gets an identifier, which is compared against the configurations of integrations that have auto-discovery turned on (via ad_identifiers)
  • once an ad_identifier is matched, the rest of the yaml configuration is used for the integration.

The process above can be verified via the debug log, as sketched below.
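
A minimal sketch of how to enable and inspect the debug log, assuming the chart release is named datadog and datadog-agent-xxxxx is the agent pod on a control node (both placeholders):

# turn on debug logging via the helm chart (equivalent to datadog.logLevel: debug in values.yaml)
helm upgrade datadog datadog/datadog --reuse-values --set datadog.logLevel=debug

# follow the agent log on the pod scheduled on a control node
kubectl exec -it datadog-agent-xxxxx -- tail -f /var/log/datadog/agent.log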

On a control node, the desired container (kube-scheduler; same for the other 3 components) is NOT identified as kube-scheduler. I noticed many containers were identified by container id (in the format "docker://<container_id>"), but none of those ids matched the actual container id of kube-scheduler (you can find the container id via kubectl describe pod/<kube-scheduler-pod-name>, or by ssh-ing to the control node and running docker ps).

Either kube-scheduler (same for the other 3 components) is not detected at all, or it is detected under a container id that doesn't match its own.
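
To cross-check, here is a sketch for comparing the actual kube-scheduler container id against what appears in the debug log (pod and node names are placeholders):

# container id as reported by kubernetes
kubectl -n kube-system describe pod kube-scheduler-<control-node-name> | grep "Container ID"

# container id as reported by the docker runtime on the control node
ssh <control-node> docker ps | grep kube-scheduler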

This is where I realized there wasn't much more I could do with this approach. Fortunately, my goal is to get the integrations working for control nodes one way or another, and I was able to come up with an alternative solution.

Solution

The TL;DR version of the solution is: use file-based configuration without auto-discovery.

Integrations are driven by configuration files (located under /etc/datadog-agent/conf.d/). The helm-native approach mentioned above works by converting the datadog.confd key-value pairs into one auto_conf.yaml per integration. The non-helm way to configure an integration is to provision your own conf.yaml for it.
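
To illustrate the layout the agent expects, a quick sketch (pod name is a placeholder) of where these files live inside the agent container:

# each check reads its configuration from its own <check_name>.d directory;
# conf.yaml is a static file-based configuration, while auto_conf.yaml is the
# autodiscovery template rendered from datadog.confd (as described above)
kubectl exec -it datadog-agent-xxxxx -- ls /etc/datadog-agent/conf.d/kube_scheduler.d/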

To bypass the auto-discovery issue on a kops-provisioned cluster, we can:

  • provision a ConfigMap with the desired configurations
  • mount the ConfigMap as volume(s) into datadog-agent: agents.volumes + agents.volumeMounts
  • replace template variables with ones that resolve without auto-discovery
  • disable auto-config (a synonym for "autodiscovery") for these integrations: datadog.ignoreAutoConfig

Datadog configuration k8s ConfigMap

apiVersion: v1
kind: ConfigMap
metadata:
  name: my-datadog-configmap

data:
  kube_apiserver_metrics.yaml: |+
    init_config:
    instances:
      - prometheus_url: https://%%env_DD_KUBERNETES_KUBELET_HOST%%:443/metrics
        tls_verify: false
        bearer_token_auth: true
        bearer_token_path: /var/run/secrets/kubernetes.io/serviceaccount/token
  etcd.yaml: |+
    init_config:
    instances:
      # etcd-manager-main
      - prometheus_url: "https://%%env_DD_KUBERNETES_KUBELET_HOST%%:4001/metrics"
        tls_verify: false
        tls_cert: /host/etc/kubernetes/pki/etcd-manager-main/etcd-clients-ca.crt
        tls_private_key: /host/etc/kubernetes/pki/etcd-manager-main/etcd-clients-ca.key
      # etcd-manager-events
      - prometheus_url: "https://%%env_DD_KUBERNETES_KUBELET_HOST%%:4002/metrics"
        tls_verify: false
        tls_cert: /host/etc/kubernetes/pki/etcd-manager-events/etcd-clients-ca.crt
        tls_private_key: /host/etc/kubernetes/pki/etcd-manager-events/etcd-clients-ca.key
  kube_scheduler.yaml: |+
    init_config:
    instances:
      - prometheus_url: "http://%%env_DD_KUBERNETES_KUBELET_HOST%%:10251/metrics"
        ssl_verify: false
  kube_controller_manager.yaml: |+
    init_config:
    instances:
      - prometheus_url: "http://%%env_DD_KUBERNETES_KUBELET_HOST%%:10252/metrics"
        ssl_verify: false

Explanation

  • template variables are specific to the autodiscovery feature. In the context of non-autodiscovery configurations, not all template variables can be resolved, e.g. %%host%% does not resolve. Fortunately, %%env_<ENV_VAR>%% resolves fine.
  • kops provisions 2 etcd clusters: main and events. 2 instances of the etcd integration are required, with slightly different tls_cert and tls_private_key (although I've verified these are interchangeable).
  • kops uses etcd-manager as the parent process for etcd. Ports 2380/2381 are for peer communication (server-to-server), and 4001/4002 are for client communication (client-to-server). The agent acts as a "client" of the etcd server, so ports 4001/4002 are the ones to use (instead of port 2379 in a normal etcd setup).
  • kube-scheduler serves HTTP on port 10251 and HTTPS on port 10259
  • kube-controller-manager serves HTTP on port 10252 and HTTPS on port 10257 (see the curl sketch after this list to confirm the ports on your cluster)
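
To confirm the ports and certificates before wiring up the agent, a hedged sketch run directly on a control node (ports and paths match the configuration above; if a component binds to the node IP rather than localhost, substitute the node's address):

# kube-scheduler and kube-controller-manager expose plain HTTP metrics
curl -s http://localhost:10251/metrics | head
curl -s http://localhost:10252/metrics | head

# etcd-manager-main requires client certificates (etcd-manager-events is the same on port 4002)
curl -sk https://localhost:4001/metrics \
  --cert /etc/kubernetes/pki/etcd-manager-main/etcd-clients-ca.crt \
  --key /etc/kubernetes/pki/etcd-manager-main/etcd-clients-ca.key | head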

values.yaml for datadog helm chart

datadog:
  ignoreAutoConfig:
  - etcd
  - kube_scheduler
  - kube_controller_manager
  - kube_apiserver_metrics

agents:
  volumes:
    - name: my-config
      configMap:
        name: my-datadog-configmap
    - name: etcd-pki
      hostPath:
        path: /etc/kubernetes/pki
  volumeMounts:
    - name: etcd-pki
      mountPath: /host/etc/kubernetes/pki
      readOnly: true
    - name: my-config
      mountPath: /etc/datadog-agent/conf.d/kube_apiserver_metrics.d/conf.yaml
      subPath: kube_apiserver_metrics.yaml
    - name: my-config
      mountPath: /etc/datadog-agent/conf.d/etcd.d/conf.yaml
      subPath: etcd.yaml
    - name: my-config
      mountPath: /etc/datadog-agent/conf.d/kube_scheduler.d/conf.yaml
      subPath: kube_scheduler.yaml
    - name: my-config
      mountPath: /etc/datadog-agent/conf.d/kube_controller_manager.d/conf.yaml
      subPath: kube_controller_manager.yaml

Explanation

  • Certificates and private keys (located under /etc/kubernetes/pki on the host) are required for etcd client-to-server communication.
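
Finally, a sketch of rolling this out, assuming the ConfigMap is saved as my-datadog-configmap.yaml, the chart release is named datadog, and everything lives in the datadog namespace (all placeholders; the ConfigMap must be in the same namespace as the agent DaemonSet):

# create / update the ConfigMap with the integration configurations
kubectl -n datadog apply -f my-datadog-configmap.yaml

# install or upgrade the chart with the values.yaml above
helm upgrade --install datadog datadog/datadog -n datadog -f values.yaml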