We see unexpected linkerd behavior now and then, which is as follows: one of the linkerd instances starts routing traffic to the wrong instance. We have a dtab like 100 * serviceA-blue + 0 * serviceA-green, and one of the linkerd instances starts routing traffic to the green color. I have tried resolving the problematic path on all namerd instances using the resolve HTTP API, and all of them return the correct instances for that path.
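For anyone who wants to repeat the check across namerd instances, here is a rough Python sketch of it. The host names, the namespace and the path are placeholders, and the URL layout (`/api/1/resolve/<namespace>?path=...`) and the `addrs` field of the response are assumptions about namerd's HTTP controller that you should verify against your version.

```python
import json
import urllib.parse
import urllib.request
from collections import Counter


def resolve_url(host, namespace, path):
    """Build the resolve URL (assumed layout: /api/1/resolve/<namespace>?path=...)."""
    query = urllib.parse.urlencode({"path": path})
    return "http://%s/api/1/resolve/%s?%s" % (host, urllib.parse.quote(namespace), query)


def fetch_resolve(host, namespace, path, timeout=5):
    """Fetch and decode one resolve response from a single namerd instance."""
    with urllib.request.urlopen(resolve_url(host, namespace, path), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def address_set(resolve_response):
    """Reduce a response to a comparable set of (ip, port) pairs.

    Assumes the response carries a list under "addrs"; a response
    without that key yields an empty set.
    """
    addrs = resolve_response.get("addrs", [])
    return frozenset((a.get("ip"), a.get("port")) for a in addrs)


def disagreeing_hosts(responses):
    """Return the hosts whose address set differs from the most common one."""
    sets = {host: address_set(body) for host, body in responses.items()}
    if not sets:
        return []
    majority, _ = Counter(sets.values()).most_common(1)[0]
    return sorted(host for host, s in sets.items() if s != majority)
```

Usage would be something like `disagreeing_hosts({h: fetch_resolve(h, "default", "/svc/serviceA") for h in hosts})`, where `hosts` is your own list of namerd `host:port` strings. An empty result only means the namerd instances agree with each other at that moment; it says nothing about what a given linkerd has cached.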
Are you able to share the full Linkerd config for that instance (you can get this from :9990/config.json) and the full dtab that it’s using from namerd?
Yes. Currently it is routing to the green color because the dtab gives 100 weight to green. We had to change it so that routing stays consistent. I think if we shift the weight back to blue, we would see the problem again.
Below is the gist with the config file of the problematic linkerd, along with the metrics and config of one of the working linkerd instances. At the end of the gist, you should see the current dtabs.
NOTE: we did restart the problematic linkerd to see if that helps, but it didn’t while the weight was set to blue. That is why you may see no stats for serviceA-blue in the problematic linkerd instance’s metrics, while they are present in the working linkerd instance’s metrics.
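To make that metrics difference easy to spot, a small Python sketch like the one below could be run over the two metrics dumps. It assumes each dump is the flat JSON object (metric name to value) served by the linkerd admin port, and the metric names shown in the comments are made up for illustration.

```python
def keys_mentioning(metrics, needle):
    """Return the metric names that contain the given substring."""
    return {name for name in metrics if needle in name}


def missing_on(problem_metrics, healthy_metrics, needle):
    """List metric names mentioning `needle` that the healthy instance
    reports but the problematic instance does not."""
    healthy = keys_mentioning(healthy_metrics, needle)
    problem = keys_mentioning(problem_metrics, needle)
    return sorted(healthy - problem)


# Hypothetical metric names, for illustration only.
healthy = {
    "rt/http/dst/id/serviceA-blue/requests": 1200,
    "rt/http/dst/id/serviceA-green/requests": 0,
}
problem = {
    "rt/http/dst/id/serviceA-green/requests": 950,
}
blue_only_on_healthy = missing_on(problem, healthy, "serviceA-blue")
```

With the made-up data above, `blue_only_on_healthy` contains the single blue request counter, which matches the observation that the restarted instance never built a client for serviceA-blue.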
It sounds like perhaps certain Linkerds aren’t getting the dtab update from namerd when you shift traffic from green to blue. Would you say that is accurate? There are certain edge conditions with the io.l5d.namerd interpreter that can cause Linkerd to miss updates, particularly if namerd is restarting or if Linkerd is load balancing over a cluster of namerd instances. Have you tried using the io.l5d.mesh interpreter instead? It uses gRPC and does not suffer from those edge conditions.
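As a quick way to see which interpreter each router is actually using, something like the following Python sketch could be pointed at the JSON from :9990/config.json. It assumes the JSON mirrors the YAML config layout (a top-level `routers` list whose entries may have `label`, `protocol` and `interpreter.kind`); the sample config in the snippet is invented.

```python
def interpreter_kinds(config):
    """Map each router (by label, else protocol, else index) to its interpreter kind."""
    kinds = {}
    for index, router in enumerate(config.get("routers", [])):
        name = router.get("label") or router.get("protocol") or "router-%d" % index
        interpreter = router.get("interpreter") or {}
        kinds[name] = interpreter.get("kind", "(none configured)")
    return kinds


def routers_using(config, kind):
    """Return the names of routers whose interpreter kind equals `kind`."""
    return sorted(name for name, k in interpreter_kinds(config).items() if k == kind)


# Invented sample shaped like a linkerd config, for illustration only.
sample = {
    "routers": [
        {"protocol": "http", "label": "outgoing",
         "interpreter": {"kind": "io.l5d.namerd", "namespace": "default"}},
        {"protocol": "http", "label": "incoming"},
    ]
}
still_on_thrift = routers_using(sample, "io.l5d.namerd")
```

Any router that shows up in `routers_using(config, "io.l5d.namerd")` would be a candidate for switching to io.l5d.mesh.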
Perhaps the linkerds aren’t getting the dtab update. However, anecdotally speaking, we have restarted them a few times and it didn’t solve the problem. So maybe namerd is at fault, but the namerd instances also seem to be returning the right state. Things don’t add up, and we would like to revalidate our observations. Can you help us figure out what we should capture when we see the problem, so that we can find its source? We are on v0.9.1 and are planning to update, but we would also need evidence to support the hypothesis that an update would fix it. Can you point to the said bug(s) in the namerd interpreter? A GitHub search didn’t yield anything fruitful for me.