If you have a pod experiencing unexplained 503s, check the Linkerd proxy logs from that pod. If you see connection errors to IP addresses that do not correspond to any running pods, your Linkerd proxy likely has stale endpoints; those addresses typically belong to pods that were recently deleted.
Connection errors in the Linkerd logs may look like this:
```
linkerd2_proxy::app::errors unexpected error: error trying to connect: No route to host (os error 113) (address: 10.10.3.181:8080)
```
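One way to pull these logs, assuming the default `linkerd-proxy` sidecar container name (the pod name and namespace are placeholders):

```bash
# Tail the Linkerd sidecar's logs in the affected pod; the proxy
# container is named linkerd-proxy by default.
kubectl -n <ns> logs <affected-pod> linkerd-proxy --tail=100
```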
Use the `kubectl exec` command to run a shell in the affected pod. Manually curl other services in the cluster to see which ones are reachable and which ones return a 503. This tells you which services the proxy has stale endpoints for.
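A minimal check might look like the following; the pod, namespace, service, and port are all placeholders:

```bash
# Open a shell inside the affected pod (add -c <container> if the pod
# has more than one application container).
kubectl -n <ns> exec -it <affected-pod> -- /bin/sh

# From inside the pod, probe other services and inspect the status codes;
# this assumes curl is available in the pod's image.
curl -s -o /dev/null -w '%{http_code}\n' http://<other-svc>.<other-ns>:8080/
```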
Use the `kubectl get endpoints/<svc>` command to get the list of endpoints for the service according to Kubernetes. Ensure it matches the IP addresses of the pods of that service, and that the IP address Linkerd failed to connect to is not in the list.
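For instance, to compare the two lists side by side (the namespace, service name, and label selector are placeholders):

```bash
# Endpoints Kubernetes has recorded for the service.
kubectl -n <ns> get endpoints/<svc> -o wide

# IP addresses of the pods backing the service, for comparison.
kubectl -n <ns> get pods -l <selector> -o wide
```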
Use the `linkerd endpoints <svc>.<ns>.svc.<cluster-domain>` command to get a list of the endpoints for the service according to the destination controller. Ensure it matches the list from Kubernetes.
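For example, assuming the default cluster domain of `cluster.local` and a hypothetical service named `web` in the `emojivoto` namespace:

```bash
# Ask the Linkerd destination controller for its view of the endpoints;
# web, emojivoto, and cluster.local are example values.
linkerd endpoints web.emojivoto.svc.cluster.local
```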
Use the `kubectl -n linkerd port-forward deploy/linkerd-prometheus 9090` command to expose the Prometheus dashboard, and browse to http://localhost:9090. Look at the graph for the query `endpoints_updates{service="<svc>", namespace="<ns>"}`. This metric should increment every time there is an update from the Kubernetes API to this endpoint set. Check when the most recent update was: does it correspond to the most recent change to the service?
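If you prefer the command line to the graph UI, the same query can be run against the standard Prometheus HTTP API once the port-forward is in place; `<svc>` and `<ns>` are placeholders:

```bash
# Forward local port 9090 to Linkerd's Prometheus in the background.
kubectl -n linkerd port-forward deploy/linkerd-prometheus 9090 &

# Run the same PromQL query through the HTTP API.
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=endpoints_updates{service="<svc>", namespace="<ns>"}'
```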
Try making a change to the service, such as scaling it up by one pod. Does the `endpoints_updates` metric increment?
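For example, scaling a hypothetical deployment backing the service from two to three replicas should produce one more update:

```bash
# Scale the deployment behind the service up by one pod (example values),
# then re-run the endpoints_updates query and check that it incremented.
kubectl -n <ns> scale deploy/<deployment> --replicas=3
```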
Using the same Prometheus dashboard, look at the `endpoints_subscriber{service="<svc>", namespace="<ns>"}` metric. This shows the number of subscribers to this service. It's hard to know exactly what this number should be, because it depends on what is running at the moment, but it should definitely be greater than zero.
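The same HTTP API approach works here; a result of zero suggests that no proxy is subscribed to updates for this service:

```bash
# Query the subscriber count for the service; <svc> and <ns> are
# placeholders. A value of zero points to a lost subscription.
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=endpoints_subscriber{service="<svc>", namespace="<ns>"}'
```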