superseb/test-pleg.md

## test-pleg.md

      
    Raw
  

              test-pleg.md
            
          
    PLEG tester

A few commands to run to test what triggers PLEG.
Docker response time

When using Docker, all container statuses are compared and it needs to happen within 3 minutes. Else the following log will be shown:
Message:PLEG is not healthy: pleg was last seen active

To test for Docker responsiveness (or possibly a hanging container and/or pod stuck in Terminating), run the following command:
time docker ps --format "{{.ID}}\t{{.Names}}" | while read id name; do RESP=$(/usr/bin/time -f"%e" docker inspect $id 2>&1  > /dev/null); echo -n "${RESP}: "; echo "${name} (${id})"; done

This will print a per container runtime of docker inspect on that container, plus the total time it took to complete.
Get child processes of kubelet

If Docker is already slow, you can try to lookup child processes of kubelet that might hang/got stuck:
ps -A -o pid,ppid,comm | grep $(pidof kubelet)

kubelet metrics

The kubelet reports metrics on pleg (amongst others).
Add /opt/rke to the certificate files if you are running RancherOS or Container Linux.
curl -sLk --cacert /etc/kubernetes/ssl/kube-ca.pem --cert /etc/kubernetes/ssl/kube-node.pem --key /etc/kubernetes/ssl/kube-node-key.pem https://127.0.0.1:10250/metrics | grep pleg
# HELP kubelet_pleg_relist_interval_microseconds Interval in microseconds between relisting in PLEG.
# TYPE kubelet_pleg_relist_interval_microseconds summary
kubelet_pleg_relist_interval_microseconds{quantile="0.5"} 1.079013e+06
kubelet_pleg_relist_interval_microseconds{quantile="0.9"} 1.116376e+06
kubelet_pleg_relist_interval_microseconds{quantile="0.99"} 1.447572e+06
kubelet_pleg_relist_interval_microseconds_sum 2.4932124052e+10
kubelet_pleg_relist_interval_microseconds_count 24601
# HELP kubelet_pleg_relist_latency_microseconds Latency in microseconds for relisting pods in PLEG.
# TYPE kubelet_pleg_relist_latency_microseconds summary
kubelet_pleg_relist_latency_microseconds{quantile="0.5"} 78632
kubelet_pleg_relist_latency_microseconds{quantile="0.9"} 116205
kubelet_pleg_relist_latency_microseconds{quantile="0.99"} 446852
kubelet_pleg_relist_latency_microseconds_sum 3.2610447e+08
kubelet_pleg_relist_latency_microseconds_count 24601

The numbers shown are from a healthy node, with response times under or just above a second. These will show way higher on an overloaded system, possibly going into 5.xxxxxx or 6.xxxxxx, summing up on to 3+ minutes to make PLEG go unhealthy.