How to manage unexpected errors?


#1

Background: We’re running linkerd 1.x on Kubernetes as a DaemonSet, and we set the http_proxy=$(NODE_NAME):4440 environment variable in our service pods.

We have an important service (let’s call it “foo”) which end-users access directly over HTTP. This service needs to communicate with service “bar” over HTTP.

I’m not 100% sure why, but very occasionally we get this error from linkerd:

com.twitter.finagle.NoBrokersAvailableException: No hosts are available for /svc/$NODE_NAME:9990, Dtab.base=[...], Dtab.local=[]. Remote Info: Not Available

If I delete the l5d pod on the node and wait for it to start up again, it’s fine, so I presume it’s some sort of name resolution error local to that l5d pod? (We don’t use namerd, just the standard io.l5d.k8s namer.)

However, when that error happens it causes the user to receive a 500 as we’re not expecting it in our application code.

What would be the best practice here? Should the health_check for foo try to send requests to bar to make sure they don’t 502? Really, the user should never end up on an instance of a service that is on the same node as a broken l5d pod.


Or is this highlighting a larger problem in our l5d configuration? It usually runs without any issues and has done so for several months now.


Linkerd Config: https://gist.github.com/SamFleming/08e4fddb1d4956de2ef6e69466731ac3


#2

Hi @sam, the error looks like something is trying to access Linkerd’s admin dashboard through Linkerd, and because /svc/$NODE_NAME:9990 isn’t resolved through the dtab, Linkerd fails. It is strange that this causes 500s to be returned to a user.

You mentioned something about a health_check for foo. Do you mind elaborating on how that works? Is the health check trying to access /svc/$NODE_NAME:9990?


#3

Hi @dennis.ab, I’m sorry, the error was actually on :4140. Apologies for the confusion there.

But yes, my theory was that the health check for foo should check that l5d is up and running, but I’m not sure of the best way to achieve this.
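
One possible shape for such a check, as a minimal sketch only: foo's health endpoint could probe the node-local l5d admin port directly. This assumes the admin interface on :9990 answers `/admin/ping` with `pong`; the function name `linkerd_is_up` and the injectable `fetch` parameter are hypothetical. The probe bypasses any `http_proxy` setting so that it does not route itself through l5d. Note that it only shows the l5d process is responding, not that name resolution for bar works.

```python
import http.client
import urllib.request


def linkerd_is_up(node_name, admin_port=9990, timeout=1.0, fetch=None):
    """Return True if the node-local l5d admin endpoint answers the ping.

    Assumption: GET /admin/ping on the admin port returns 200 with "pong".
    `fetch` is a hypothetical hook (url -> (status, body)) used for testing.
    """
    url = f"http://{node_name}:{admin_port}/admin/ping"

    if fetch is None:
        # An empty ProxyHandler ignores http_proxy, so the probe talks to
        # the admin port directly instead of going through l5d itself.
        opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))

        def fetch(target):
            with opener.open(target, timeout=timeout) as resp:
                return resp.status, resp.read().decode("utf-8", "replace")

    try:
        status, body = fetch(url)
    except (OSError, http.client.HTTPException):
        # Connection refused, timeout, HTTP error status, or malformed reply.
        return False
    return status == 200 and body.strip() == "pong"
```

foo's health handler could then report unhealthy whenever `linkerd_is_up(os.environ["NODE_NAME"])` is false, so Kubernetes stops sending user traffic to that pod.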


#4

No worries! Just trying to understand the scenario better. So, let me see if I have this right: you have a service foo that talks to bar through l5d. Sometimes foo receives 500 errors because there might be something funky going on between Linkerd and foo. Is that right?

Is the NoBrokersAvailableException the only thing you see in the logs when this happens? That error usually shows up when an IP cannot be resolved for a particular service name. However, the service name Linkerd is trying to resolve is /svc/$NODE_NAME:9990, which I assume your service is calling? Is that the address foo’s health_check is using to see if Linkerd is up?

Additionally, are the requests to bar idempotent? You might be able to add a retry configuration to resend requests if you run into the issue. It could be that service bar is undergoing a rolling update, so an IP is temporarily unavailable, in which case a retry would resend the request to bar.
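
To illustrate the retry idea, here is a rough client-side sketch. It is not Linkerd's own retry configuration, and every name in it (`call_with_retries`, `send`, `base_delay`) is hypothetical. It resends a request only when the HTTP method is idempotent and the response is a 5xx, with a short exponential backoff between attempts.

```python
import time

# Methods that are safe to resend because repeating them has the same effect.
IDEMPOTENT_METHODS = {"GET", "HEAD", "PUT", "DELETE", "OPTIONS"}


def call_with_retries(send, method, max_attempts=3, base_delay=0.1,
                      sleep=time.sleep):
    """Call send() and retry 5xx responses for idempotent methods only.

    `send` is a hypothetical zero-argument callable returning (status, body).
    Non-idempotent methods such as POST are attempted exactly once.
    """
    retryable = method.upper() in IDEMPOTENT_METHODS
    attempts = max(1, max_attempts) if retryable else 1

    for attempt in range(attempts):
        status, body = send()
        last_attempt = attempt == attempts - 1
        if status < 500 or last_attempt:
            return status, body
        # Back off before the next try: base_delay, 2x, 4x, ...
        sleep(base_delay * 2 ** attempt)
```

Whether this belongs in the application or in the proxy layer is a design choice; doing it in the proxy keeps the application code unaware of transient failures.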

Are you able to confirm that service bar is available when this issue happens?