We’re running linkerd 1.3.5 with namerd 1.3.5, using Consul as our service discovery backend. Linkerd communicates with namerd via namerd’s gRPC endpoint.
We’ve recently had 4 incidents caused by one or more linkerd processes having an incorrect instance set for a service.
During each incident, the instance set for one or more services was completely different from the instance set in Consul.
When I queried each namerd instance’s delegator API directly, I got the correct instance set for each service. When I queried namerd indirectly, via a degraded linkerd’s delegator API, I also got the correct instance set. But when I queried the /client_state.json API on the degraded linkerds, I got the incorrect instance set. This matches the degraded behavior: linkerd reports 100% Connection Refused for the service because it is trying to route to IPs and ports that no longer have instances of that service running on them.
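For reference, this is roughly the comparison we’ve been doing by hand: pull the address set a linkerd is actually routing to from its admin /client_state.json, pull the live set from Consul’s health API, and diff them. The admin port, the service name, the `/svc/...` client path, and the exact JSON shapes are assumptions here — adjust to whatever your versions actually return.

```python
# Sketch: diff a linkerd's client state against Consul's view of a service.
# Endpoint ports, service name, and JSON shapes are assumptions.
import json
import urllib.request


def fetch_json(url):
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)


def stale_addresses(linkerd_addrs, consul_addrs):
    """Addresses linkerd still routes to that Consul no longer lists."""
    return sorted(set(linkerd_addrs) - set(consul_addrs))


def check(linkerd_admin="http://localhost:9990",
          consul="http://localhost:8500", service="my-svc"):
    client_state = fetch_json(f"{linkerd_admin}/client_state.json")
    health = fetch_json(f"{consul}/v1/health/service/{service}?passing")
    # Assumed shape: client_state maps client paths to "host:port" lists.
    linkerd_addrs = client_state.get(f"/svc/{service}", [])
    consul_addrs = [f'{e["Service"]["Address"]}:{e["Service"]["Port"]}'
                    for e in health]
    return stale_addresses(linkerd_addrs, consul_addrs)
```

During an incident, a non-empty `stale_addresses` result against a degraded linkerd (while namerd’s delegator view is clean) would match what we’re seeing.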
The incidents “start” when a service deploys. Most linkerds pick up the new instance set, but some hold onto the old (and now gone) instance set, and never get the new instance set until we restart the linkerd process.
We’ve found a single service metric that correlates with the incidents: loadbalancer/rebuilds. We don’t believe it points to the root cause, though.
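For anyone reproducing this, here is roughly how we watch that metric across instances: scrape each linkerd’s admin metrics endpoint (/admin/metrics.json in linkerd 1.x) and keep the loadbalancer entries (rebuilds, size, available, etc.). The admin port and the exact metric key prefixes are assumptions — they depend on your router/client naming.

```python
# Sketch: scrape a linkerd 1.x admin metrics endpoint and keep only the
# loadbalancer stats. Admin port and key naming are assumptions.
import json
import urllib.request


def loadbalancer_metrics(metrics, substrings=("loadbalancer",)):
    """Filter a flat metrics dict down to entries whose key matches."""
    return {k: v for k, v in metrics.items()
            if any(s in k for s in substrings)}


def scrape(host="http://localhost:9990"):
    with urllib.request.urlopen(f"{host}/admin/metrics.json") as resp:
        return loadbalancer_metrics(json.load(resp))
```

Graphing these per-instance is how we noticed the rebuilds correlation in the first place.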
What additional metrics can we look at to try to find the root cause?
What other steps can we take to debug this issue (e.g. additional logging)?