Linkerd instance set for service diverges from namerd & consul


#1

We’re running linkerd 1.3.5, with namerd 1.3.5. We’re using Consul as service discovery backend. Linkerd communicates with namerd via namerd’s gRPC endpoint.

We’ve recently had 4 incidents caused by one or more linkerd processes having an incorrect instance set for a service.

During each incident, the instance set for one or more services was completely different from the instance set in consul.

When I queried each of our namerd instances’ delegator API directly, I got the correct instance set for each service. When I queried namerd indirectly, via the degraded linkerd’s delegator API, I also got the correct instance set for each service. But when I queried the /client_state.json API on the degraded linkerds, I got the incorrect instance set. This matches the degraded behavior: linkerd reports 100% Connection Refused for the service because it is trying to route to IPs and ports that no longer have instances of that service running on them.
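For reference, the comparison can be scripted roughly like this (a sketch only: the linkerd admin host, consul address, and service name are placeholders, and it just does a naive substring check of consul’s passing instances against the raw /client_state.json output):

```python
# Rough divergence check against a suspect linkerd (a sketch, not a polished tool).
# Assumptions: linkerd admin on :9990, a local consul agent on :8500, and that every
# healthy consul address should appear somewhere in the /client_state.json text.
import json
import urllib.request

LINKERD_ADMIN = "http://degraded-linkerd.example.com:9990"  # placeholder host
CONSUL = "http://127.0.0.1:8500"
SERVICE = "my-service"                                      # placeholder service name

def fetch(url):
    with urllib.request.urlopen(url, timeout=5) as resp:
        return resp.read().decode("utf-8")

# What linkerd believes: the raw client state dump from its admin endpoint.
client_state = fetch(f"{LINKERD_ADMIN}/client_state.json")

# What consul believes: passing instances for the service.
health = json.loads(fetch(f"{CONSUL}/v1/health/service/{SERVICE}?passing"))
consul_addrs = {
    f"{e['Service'].get('Address') or e['Node']['Address']}:{e['Service']['Port']}"
    for e in health
}

missing = {a for a in consul_addrs if a not in client_state}
print("consul instances:", sorted(consul_addrs))
print("missing from linkerd client state:", sorted(missing) or "none")
```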

The incidents “start” when a service deploys. Most linkerds pick up the new instance set, but some hold onto the old (and now gone) one and never pick up the new set until we restart the linkerd process.

We’ve found a single service metric that correlates with the incident: loadbalancer/rebuilds. We do not think it is connected to the root cause, however.
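For anyone else chasing this, a minimal sketch of how that counter can be watched across the fleet during a deploy (hostnames are placeholders; it pulls /admin/metrics.json from each linkerd’s admin port and prints any key containing loadbalancer/rebuilds):

```python
# Sketch: poll each linkerd's admin metrics endpoint and print loadbalancer/rebuilds
# counters, so instances can be compared before and after a deploy.
# Assumes the default linkerd admin port (9990); hostnames are placeholders.
import json
import urllib.request

LINKERDS = ["linkerd-01.example.com", "linkerd-02.example.com"]  # placeholder hosts
ADMIN_PORT = 9990

for host in LINKERDS:
    url = f"http://{host}:{ADMIN_PORT}/admin/metrics.json"
    with urllib.request.urlopen(url, timeout=5) as resp:
        metrics = json.load(resp)
    rebuilds = {k: v for k, v in metrics.items() if "loadbalancer/rebuilds" in k}
    print(host)
    for key, value in sorted(rebuilds.items()):
        print(f"  {key} = {value}")
```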

What additional metrics can we look at to try to find a root cause?

What other steps can we take to debug this issue (e.g. additional logging)?


#2

Thanks for this report, @chrismikehogan! From your description, it sounds like somehow Linkerd’s gRPC watch to Namerd is getting stuck and not receiving updates.

Some questions that will help us dig into the issue further:

  • Does the affected Linkerd continue to get updates for other services? Or does its state become frozen for all services?
  • Is it possible to get TCP dumps of communication between Linkerd and Namerd before/during/after the deploy? I understand that before may be difficult if you can’t predict which Linkerd instances will exhibit the issue.

The good news is that we’re currently working on adding a lot more logging and introspectability to Linkerd’s watches so it should be much easier to debug these types of issues in the next Linkerd release. Unfortunately, that doesn’t help much right now.


#3

Does the affected Linkerd continue to get updates for other services? Or does its state become frozen for all services?

Uncertain. We believe it affects all services, but we failed to do a test deploy before the last remaining degraded instance was restarted. If/when we encounter this again, we’ll be sure to do a deploy of our canary app to see.

Is it possible to get TCP dumps of communication between Linkerd and Namerd before/during/after the deploy? I understand that before may be difficult if you can’t predict which Linkerd instances will exhibit the issue.

We can get a TCP dump, but unfortunately we also failed to grab one while the issue was in-flight.

I’m attaching a metrics dump from a degraded linkerd, but it appears that by the time I got around to pulling the raw metrics, we’d stopped traffic to this linkerd for long enough that it had purged most of its service-specific metrics. It still has its metrics around namerd communication, however.

linkerd_metrics.json (23.1 KB)


#4

@Alex we had the issue happen again. This time it impacted 12 linkerd instances.

I was able to grab tcpdumps of linkerd-to-namerd communication, as well as netstat output showing how many namerd connections each linkerd had at the time.
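The connection counts came from plain netstat, but for anyone scripting this, something like the following psutil sketch gives the same information (NAMERD_PORT is a placeholder for whatever port namerd’s gRPC interface listens on):

```python
# Sketch of the netstat-style check: count established TCP connections from this
# host to namerd's gRPC interface. Requires the third-party psutil package and,
# on some platforms, root privileges to see other processes' sockets.
import psutil

NAMERD_PORT = 4321  # placeholder for your namerd gRPC (mesh) port

conns = [
    c for c in psutil.net_connections(kind="tcp")
    if c.raddr and c.raddr.port == NAMERD_PORT and c.status == psutil.CONN_ESTABLISHED
]
print(f"established connections to namerd: {len(conns)}")
for c in conns:
    print(f"  {c.laddr.ip}:{c.laddr.port} -> {c.raddr.ip}:{c.raddr.port} (pid={c.pid})")
```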

Unfortunately, I’m not able to share this directly with y’all until we have an NDA with you. I doubt this is the appropriate forum to start that discussion, so I’ll reach out to y’all separately about it.

In the meantime, would you be able to give guidance on what exactly to look for in these pcaps and metrics JSONs?


#5

Alex may have some better advice, but some of the upcoming features slated for the next Linkerd release (e.g. https://github.com/linkerd/linkerd/pull/1956) will be really helpful in debugging this.


#6

Namerd maintains a streaming gRPC response over which it pushes address updates to Linkerd. Therefore, I’d look for these types of things:

  • Is the TCP connection between Linkerd and Namerd alive and healthy?
  • Is the HTTP/2 response stream from Namerd still open? (i.e. the response has not sent an EOS frame and neither side has sent a stream reset frame)
  • Did Namerd send a message to Linkerd when the service deployed?
  • Did the Namerd message contain the correct IPs?

And which of the above differ between a healthy Linkerd that gets the update versus an unhealthy Linkerd that does not?
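If it helps, here’s a rough sketch of one way to tally the HTTP/2 frames in such a capture using pyshark (the pcap path and namerd port are placeholders, the traffic is assumed to be plaintext gRPC, and field names follow Wireshark’s http2 dissector):

```python
# Sketch: decode the linkerd<->namerd traffic in a pcap as HTTP/2 and tally
# frame types (DATA, RST_STREAM, GOAWAY, ...). Requires pyshark and tshark.
import collections
import pyshark

PCAP = "linkerd-namerd.pcap"   # placeholder path
NAMERD_PORT = 4321             # placeholder port for namerd's gRPC interface

# RFC 7540 frame type codes of interest.
FRAME_NAMES = {0: "DATA", 1: "HEADERS", 3: "RST_STREAM", 6: "PING", 7: "GOAWAY"}

cap = pyshark.FileCapture(
    PCAP,
    decode_as={f"tcp.port=={NAMERD_PORT}": "http2"},
    display_filter="http2",
)

counts = collections.Counter()
for pkt in cap:
    try:
        frame_type = int(pkt.http2.type)
    except (AttributeError, ValueError):
        continue  # packet carried no parseable http2 frame type
    counts[FRAME_NAMES.get(frame_type, f"type-{frame_type}")] += 1
cap.close()

# A healthy watch should show DATA frames arriving around the deploy; a stream
# that was torn down would show RST_STREAM / GOAWAY instead.
for name, n in counts.most_common():
    print(f"{name}: {n}")
```

Checking whether the DATA frames actually carried the correct IPs would still require decoding the gRPC/protobuf payload, which this sketch doesn’t attempt.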


#7

As William says, the diagnostics going into the next Linkerd release will make much of this information available through an admin endpoint rather than needing to gather it through tcpdumps.


#8

Thanks all. We’ll keep on the lookout for the improved logging.

We found something interesting, though. First of all, I had a typo in my original write-up: we’re actually running linkerd 1.3.5, talking to namerd 1.3.7.

When I downgraded our namerd cluster to 1.3.6, the issue went away (and degraded linkerds fixed themselves without requiring a restart).

This morning I upgraded back to namerd 1.3.7, and within an hour had some linkerds with services in an incorrect (i.e. different from consul) state.

When I again downgraded to namerd 1.3.6, the linkerd instances in a confused state fixed themselves within minutes, and we’ve not seen any more issues since.

I also tried 1.4.0, and still get the broken behavior.

Based on our observations over the last few days, it seems that a linkerd doesn’t end up in a totally bad state; rather, it becomes confused about a specific service’s state, while other services’ states remain accurate and other linkerds watching that same service don’t get confused.

We’ve been able to correlate this with deploys of a service, but other than that, haven’t come up with reproduction steps.

We’ll continue to keep an eye on it, and let you know if we come up with steps to reproduce, but I wanted to share the bit of information re: 1.3.6 vs 1.3.7. It definitely seems that 1.3.7 introduced whatever this bug is.


#9

Very interesting! 1.3.7 made some substantial changes to our H2 implementation (https://github.com/linkerd/linkerd/pull/1879) so it is always possible a bug was introduced.


#10

Hmmmmm… yeah. If you get any closer to a repro in the next few days, or even if you don’t, please file an issue. If you can pin it to a version change like that then this is almost definitely a bug.


#11

Linkerd 1.4.1 has been released with the diagnostics I mentioned earlier: https://github.com/linkerd/linkerd/releases/tag/1.4.1

That may help to shed some light on what is going on here.