If you have a pod experiencing unexplained 503s, check the Linkerd proxy logs from that pod. If you see connection errors to IP addresses that do not correspond to any running pods, your Linkerd proxy likely has stale endpoints; those addresses typically belong to pods that were recently deleted.
Connection errors in the Linkerd logs may look like this:
```
linkerd2_proxy::app::errors unexpected error: error trying to connect: No route to host (os error 113) (address: 10.10.3.181:8080)
```
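One way to pull these logs, assuming the default `linkerd-proxy` sidecar container name (the pod name and namespace are placeholders):

```bash
# Tail the Linkerd sidecar's logs in the affected pod; the proxy
# container is named linkerd-proxy by default.
kubectl -n <ns> logs <affected-pod> linkerd-proxy --tail=100
```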
Use the `kubectl exec` command to run a shell in the affected pod. Manually curl other services in the cluster to see which ones are reachable and which ones return a 503. This tells you which services the proxy has stale endpoints for.
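A minimal check might look like the following; the pod, namespace, service, and port are all placeholders:

```bash
# Open a shell inside the affected pod (add -c <container> if the pod
# has more than one application container).
kubectl -n <ns> exec -it <affected-pod> -- /bin/sh

# From inside the pod, probe other services and inspect the status codes;
# this assumes curl is available in the pod's image.
curl -s -o /dev/null -w '%{http_code}\n' http://<other-svc>.<other-ns>:8080/
```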
Use the `kubectl get endpoints/<svc>` command to get the list of endpoints for the service according to Kubernetes. Ensure it matches the IP addresses of the pods of that service, and that the IP address Linkerd failed to connect to is not in the list.
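For instance, to compare the two lists side by side (the namespace, service name, and label selector are placeholders):

```bash
# Endpoints Kubernetes has recorded for the service.
kubectl -n <ns> get endpoints/<svc> -o wide

# IP addresses of the pods backing the service, for comparison.
kubectl -n <ns> get pods -l <selector> -o wide
```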
Use the `linkerd endpoints <svc>.<ns>.svc.<cluster-domain>` command to get a list of the endpoints for the service according to the destination controller. Ensure it matches the list from Kubernetes.
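For example, assuming the default cluster domain of `cluster.local` and a hypothetical service named `web` in the `emojivoto` namespace:

```bash
# Ask the Linkerd destination controller for its view of the endpoints;
# web, emojivoto, and cluster.local are example values.
linkerd endpoints web.emojivoto.svc.cluster.local
```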
Use the `kubectl -n linkerd port-forward deploy/linkerd-prometheus 9090` command to expose the Prometheus dashboard, and browse to http://localhost:9090. Look at the graph for the query `endpoints_updates{service="<svc>", namespace="<ns>"}`. This metric should increment every time there is an update from the Kubernetes API to this endpoint set. Check when the most recent update was: does it correspond to the most recent change to the service?
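If you prefer the command line to the graph UI, the same query can be run against the standard Prometheus HTTP API once the port-forward is in place; `<svc>` and `<ns>` are placeholders:

```bash
# Forward local port 9090 to Linkerd's Prometheus in the background.
kubectl -n linkerd port-forward deploy/linkerd-prometheus 9090 &

# Run the same PromQL query through the HTTP API.
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=endpoints_updates{service="<svc>", namespace="<ns>"}'
```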
Try making a change to the service, such as scaling it up by one pod. Does the `endpoints_updates` metric increment?
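For example, scaling a hypothetical deployment backing the service from two to three replicas should produce one more update:

```bash
# Scale the deployment behind the service up by one pod (example values),
# then re-run the endpoints_updates query and check that it incremented.
kubectl -n <ns> scale deploy/<deployment> --replicas=3
```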
Using the same Prometheus dashboard, look at the `endpoints_subscriber{service="<svc>", namespace="<ns>"}` metric. This shows the number of subscribers to this service. It's hard to know exactly what this number should be, because it depends on what is running at the moment, but it should definitely be greater than zero.
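The same HTTP API approach works here; a result of zero suggests that no proxy is subscribed to updates for this service:

```bash
# Query the subscriber count for the service; <svc> and <ns> are
# placeholders. A value of zero points to a lost subscription.
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=endpoints_subscriber{service="<svc>", namespace="<ns>"}'
```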