One instance (out of 5) of linkerd is routing to an incorrect destination

We occasionally see unexpected linkerd behavior: one linkerd instance starts routing traffic to the wrong destination. We have a dtab like 100 * serviceA-blue + 0 * serviceA-green, and one of the linkerd instances starts routing traffic to green. I have tried resolving the problematic path on all namerd instances using the resolve HTTP API, and all of them return the correct instances for that path.

I was talking to @olix0r about it yesterday in the Slack channel and he asked for metrics. Here is the link to the metrics: https://gist.github.com/thedebugger/654dc113db2751f8c6e728214c5c19dc. More info is in the Slack channel: https://linkerd.slack.com/archives/C0JV5E7BR/p1504133694000165

We are running linkerd v0.9.1 in a Docker container. Let me know if you need more information.

Looking at those metrics, I can confirm that Linkerd is routing /serviceA to serviceA-green:

"rt/serviceA-mesh/dst/id/%/io.l5d.localhost/#/io.l5d.consul/bg-sjc1/serviceA-green/path/serviceA/requests": 79867

Are you able to share the full Linkerd config for that instance (you can get this from :9990/config.json) and the full dtab that it’s using from namerd?
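If it helps when comparing instances, the per-destination check above can be scripted. This is only a rough sketch: the key pattern is inferred from the single metric key quoted above, `requests_by_destination` is a made-up helper name, and `metrics` is assumed to be the flat JSON object from an instance's metrics dump (like the one in the gist), already parsed into a dict.

```python
import re

# Pattern inferred from the one metric key quoted in this thread:
#   rt/<router>/dst/id/<destination>/path/<path>/requests
# Other Linkerd versions or configs may name their keys differently.
DST_REQUESTS = re.compile(
    r"^rt/(?P<router>[^/]+)/dst/id/(?P<dst>.+)/path/(?P<path>.+)/requests$"
)


def requests_by_destination(metrics, path):
    """Return {destination id: request count} for one logical path.

    `metrics` is a dict of metric name -> value (assumed shape).
    `path` is the path segment as it appears in the key, e.g. "serviceA".
    """
    counts = {}
    for key, value in metrics.items():
        match = DST_REQUESTS.match(key)
        if match and match.group("path") == path:
            dst = match.group("dst")
            counts[dst] = counts.get(dst, 0) + value
    return counts
```

Running this against the metrics of each of the five instances and comparing the resulting destination ids should show at a glance which instance is sending serviceA traffic to the other color.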

Hi Alex,

Yes, it is currently routing to green because the dtab now gives a weight of 100 to green. We had to change it so that routing stays consistent. I think if we shift the weight back to blue, we would see the problem again.

Below is the gist with the config file of the problematic linkerd, along with the metrics and config of one of the working linkerd instances. At the end of the gist, you should see the current dtabs.

NOTE: we did restart the problematic linkerd to see if that would help, but it didn’t while the weight was set to blue. That is why you may see no stats for serviceA-blue in the problematic linkerd instance’s metrics, although they are there in the working linkerd instance’s metrics.

Let me know if you need more information.

It sounds like perhaps certain Linkerds aren’t getting the dtab update from namerd when you shift traffic from green to blue. Would you say that is accurate? There are certain edge conditions with the io.l5d.namerd interpreter that can cause Linkerd to miss updates, particularly if namerd is restarting or if Linkerd is load balancing over a cluster of namerd instances. Have you tried using the io.l5d.mesh interpreter instead? It uses gRPC and does not suffer from those edge conditions.

Perhaps the linkerds aren’t getting the dtab update. However, anecdotally, we have restarted them a few times and it didn’t solve the problem. So maybe namerd is at fault, but the namerd instances also seem to be returning the right state. Things don’t add up, and we would like to revalidate our observations. Can you help us figure out what we should capture when we see the problem, so that we can find its source? We are on v0.9.1 and are planning to upgrade, but we would also need evidence to support the hypothesis that an upgrade would fix it. Can you point to the said bug(s) in the namerd interpreter? A GitHub search didn’t yield anything fruitful for me.

Here’s the issue where Linkerd can miss namerd updates: https://github.com/linkerd/linkerd/issues/807
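As for what to capture while the problem is happening: one option is to snapshot the dtab that each namerd instance is serving at that moment and check whether they agree, alongside the metrics from each Linkerd. Below is a minimal sketch of the comparison step only. It assumes you have already fetched each dtab as text by whatever means you use today; the helper names are hypothetical, and the normalization is deliberately naive (it only splits on `;` and collapses whitespace).

```python
from collections import defaultdict


def normalize_dtab(text):
    """Split a dtab into dentries, ignoring blank entries and whitespace noise."""
    entries = (" ".join(entry.split()) for entry in text.split(";"))
    return tuple(entry for entry in entries if entry)


def group_by_dtab(snapshots):
    """Group hosts by the (normalized) dtab they returned.

    `snapshots` maps a host label to the dtab text captured from it.
    More than one group in the result means the hosts disagree.
    """
    groups = defaultdict(list)
    for host, text in snapshots.items():
        groups[normalize_dtab(text)].append(host)
    return dict(groups)
```

If `group_by_dtab` returns a single group while one Linkerd is still routing to the wrong color, that points away from namerd's stored state and toward the update path between namerd and that Linkerd.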
