Namerd timeouts when using mesh interpreter

We’re running linkerd 1.3.5, with namerd 1.3.5, using the mesh interpreter.
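For reference, our interpreter setup looks roughly like the following (a sketch, not our exact config; the hostname and server port are placeholders, and 4321 is namerd's default port for the mesh interface):

```yaml
# linkerd config sketch -- hostname/ports are illustrative placeholders
routers:
- protocol: http
  servers:
  - ip: 0.0.0.0
    port: 4140
  interpreter:
    kind: io.l5d.mesh                       # stream name bindings from namerd
    dst: /$/inet/namerd.example.com/4321    # 4321 is namerd's default mesh port
```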

Our linkerd instances are continually falling into a broken state, where requests to a particular service result in `exceeded 10.seconds to unspecified while dyn binding /svc/1.1/GET/secret-canary-stage.service.local. Remote Info: Not Available`.

We’re trying to track down why this is happening, and any help would be appreciated. Attached are the logs from a linkerd that
eventually fell into a bad state. Log level was set to ALL.

Some additional info: we have not been able to catch this happening “live” while having tcpdump running. While we were running tcpdump on a problem box, however, we noticed that we were only seeing packets from 2 of the 3 namerds in our cluster. We’re assuming that calling linkerd’s /delegator.json API results in a direct call to namerd (is that correct, by the way?)

Not sure why, but my uploaded files are not showing.

Thanks for the logs. I’m pasting in some additional context from Slack so that it doesn’t get lost.

We’ve been experiencing issues with our 1.3.0 namerd/linkerd deployment over the last couple of weeks. Linkerd instances are regularly falling into a state where requests for certain services result in an error like `com.twitter.finagle.RequestTimeoutException: exceeded 10.seconds to unspecified while dyn binding ...`.

Strangely, requests for other services via the same “broken” linkerd instance succeed. Even more strangely, requests to the /delegator.json API on the problematic linkerd instance succeed most of the time (sometimes they time out). The delegation API on every namerd in our cluster succeeds when curl’d manually from the problematic linkerd’s box.

Bouncing the namerd cluster does not resolve the issue; only bouncing the problematic linkerds resolves it (until they fall into a bad state again). The issue was observed with linkerd 1.3.0 + namerd 1.3.0, and with linkerd 1.3.5 + namerd 1.3.0.

We’re currently updating our namerd clusters to the latest (1.3.5), and have dialed up logging on the datacenter that’s been showing the problem to level=ALL.

Do y’all have any recommendations for particular errors to look for, or for particular metrics to track, while we’re monitoring this issue?
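In case it helps with triage, this is roughly how we’ve been counting the dyn-binding timeouts in a log file (the log contents below are a fabricated sample, not output from our actual deployment):

```shell
# Fabricated sample of a linkerd log, for illustration only.
cat > linkerd.log <<'EOF'
com.twitter.finagle.RequestTimeoutException: exceeded 10.seconds to unspecified while dyn binding /svc/1.1/GET/secret-canary-stage.service.local
I 0108 12:00:00.000 UTC THREAD1: unrelated log line
com.twitter.finagle.RequestTimeoutException: exceeded 10.seconds to unspecified while dyn binding /svc/1.1/GET/secret-canary-stage.service.local
EOF
# Count how many requests hit the dyn-binding timeout.
grep -c 'while dyn binding' linkerd.log
```

Grepping for the bound path as well (the `/svc/...` portion) tells you which services are affected.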

(FWIW we suspect network issues as the ultimate root cause, since the only cluster showing the problem is our local-datacenter cluster, which was recently re-IP’d among other changes. But I’ve had some trouble actually getting a log statement or metric from linkerd that would shed light on what error is actually occurring.)

@chrismikehogan you mentioned in Slack that you would share some of your results from a tcpdump between linkerd and namerd. Were you able to find anything strange?

The only thing I noticed while running tcpdump on a linkerd instance experiencing the issue, while curling the /delegator.json endpoint over and over, was that I was seeing chatter from only 2 of the 3 namerds in that environment’s cluster.

Looking at a different linkerd showing the problem at the same time, again while curling the /delegator.json endpoint, I saw it sending to and receiving from only 2 of the 3 namerds, but a different 2 than the first instance.
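For concreteness, this is roughly how I was checking which namerds a linkerd was actually talking to; the capture text and IPs below are made-up examples (in reality I was watching the live tcpdump output):

```shell
# Made-up tcpdump-style text; 10.0.0.1-3 stand in for the 3 namerds,
# 10.1.0.5 for the linkerd host, and 4321 for namerd's mesh port.
cat > capture.txt <<'EOF'
12:00:01.000 IP 10.0.0.1.4321 > 10.1.0.5.51000: Flags [P.], length 120
12:00:02.000 IP 10.0.0.2.4321 > 10.1.0.5.51002: Flags [P.], length 88
12:00:03.000 IP 10.0.0.1.4321 > 10.1.0.5.51000: Flags [.], length 0
EOF
# List the distinct namerd peers seen on the wire.
grep -oE '10\.0\.0\.[0-9]+' capture.txt | sort -u
```

In this sample only 10.0.0.1 and 10.0.0.2 show up, i.e. 2 of the 3 namerds, which mirrors what I saw on the real boxes.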

It seems that a given linkerd’s connection to a namerd would fail in some way that the linkerd did not recognize as a failure, and so would not retry.

Has anything changed in your efforts to resolve this, or are you still experiencing the issue?

Huge apologies for dropping off the face of the earth. Been busy over here.

We stopped seeing the issue once we ran linkerd in standalone mode (i.e. no namerd) in the problematic datacenter, and we confirmed that there is a firewall between the linkerds and the namerd cluster in that datacenter, which is not true of any other datacenter.

This has left us pretty confident that the firewall was provoking issues with linkerd. There’s possibly a bug in linkerd where it doesn’t handle the firewall terminating the connection to namerd, but we don’t have any direct proof of that, only the indirect evidence that things don’t work well when there’s a firewall in the mix.

We’re going to fix our network config so our linkerd doesn’t have to go through a firewall to reach namerd.

Glad that got sorted out! We’ll keep an eye out for this issue in the future in case there is an underlying issue with putting namerd behind a firewall.