Linkerd Stale Discovery Runbook

Identifying if you have stale endpoints

If you have a pod experiencing unexplained 503s, check the proxy logs from that pod. If you see connection errors to IP addresses which do not correspond to running pods, your Linkerd proxy likely has stale endpoints. The IP addresses in the connection errors likely correspond to pods that were recently deleted.
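
One way to pull those logs is with kubectl, assuming the default sidecar container name of linkerd-proxy and substituting your own namespace and pod name:

kubectl -n <ns> logs <pod> -c linkerd-proxy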

Connection errors in the Linkerd logs may look like this:

linkerd2_proxy::app::errors unexpected error: error trying to connect: No route to host (os error 113) (address: 10.10.3.181:8080)

Debugging Steps

Manually Trigger the Error

Use the kubectl exec command to run a shell in the affected pod. Manually curl other services in the cluster to see which ones are reachable and which ones return a 503 error. This allows you to determine for which services you have stale endpoints.
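
As a rough sketch, assuming a shell and curl are available in the application container and substituting your own names and port:

kubectl -n <ns> exec -it <pod> -c <app-container> -- sh
curl -sv http://<svc>.<ns>.svc.cluster.local:<port>/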

Determine the endpoint set according to Kubernetes

Use the kubectl get endpoints/<svc> command to get a list of the endpoints for the service according to Kubernetes. Ensure it matches the IP addresses of the pods of that service. Ensure that the IP that Linkerd failed to connect to is not in the list.
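
For example, to compare the endpoint IPs with the pod IPs (assuming the service's pods carry a label such as app=<svc>):

kubectl -n <ns> get endpoints <svc> -o wide
kubectl -n <ns> get pods -l app=<svc> -o wide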

Determine the endpoint set according to the destination controller

Use the linkerd endpoints <svc>.<ns>.svc.<cluster-domain> command to get a list of the endpoints for the service according to the destination controller. Ensure it matches the list from Kubernetes.
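
For example, for a hypothetical web service in the emojivoto namespace on the default cluster domain:

linkerd endpoints web.emojivoto.svc.cluster.local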

Check the endpoint update metric

Use the kubectl -n linkerd port-forward deploy/linkerd-prometheus 9090 command to expose the Prometheus dashboard and open http://localhost:9090 in your browser. Look at the graph for the query: endpoints_updates{service="<svc>", namespace="<ns>"}. This metric should increment every time there is an update from the Kubernetes API for this endpoint set. Check when the most recent update was. Does it correspond to the most recent change to the service?
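
Since endpoints_updates behaves as a counter, a windowed query can show whether any updates arrived recently, for example over the last hour:

increase(endpoints_updates{service="<svc>", namespace="<ns>"}[1h])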

Try making a change to the service such as scaling it up by 1 pod. Does the endpoints_updates metric increment?
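
One way to trigger such a change, assuming the service is backed by a Deployment and substituting the current replica count plus one:

kubectl -n <ns> scale deploy/<deployment> --replicas=<current+1>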

Check the endpoint subscribers metric

Using the same Prometheus dashboard, look at the endpoints_subscriber{service="<svc>", namespace="<ns>"} metric. This shows the number of subscribers to this service. It's hard to know exactly what this number should be, since it depends on what else is running in the cluster at the moment, but it should definitely be greater than zero.
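
The same query can also be run against the Prometheus HTTP API through the port-forward from the previous step, for example:

curl -sG http://localhost:9090/api/v1/query --data-urlencode 'query=endpoints_subscriber{service="<svc>", namespace="<ns>"}'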
