Background: We’re running linkerd 1.x on Kubernetes as a DaemonSet and set the http_proxy=$(NODE_NAME):4440 environment variable in our service pods.
We have an important service (let’s call it “foo”) which end-users access directly over HTTP. This service needs to communicate with service “bar” over HTTP.
I’m not sure why, but very occasionally we get this error from linkerd:
com.twitter.finagle.NoBrokersAvailableException: No hosts are available for /svc/$NODE_NAME:9990, Dtab.base=[...], Dtab.local=. Remote Info: Not Available
If I delete the l5d pod on the node and wait for it to start up again, it’s fine, so I presume it’s some sort of name resolution error local to the l5d pod? (We don’t use namerd, just the standard io.l5d.k8s namer.)
However, when that error happens it causes the user to receive a 500 as we’re not expecting it in our application code.
What would be the best practice here? Should foo try sending requests to bar to make sure they don’t 502? Really, the user should never end up on a service which is on the same node as a broken l5d pod.
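To make the idea concrete, here is a rough sketch of the kind of check I have in mind: foo sends a test request to bar through the node-local l5d proxy and reports itself as not ready if that fails, so that Kubernetes could stop routing users to it. Everything named here is hypothetical (the functions `probe_via_proxy` and `ready`, the `http://bar/health` URL); it is not our actual code and not a linkerd API, only an illustration.

```python
import urllib.error
import urllib.request


def is_proxy_failure(status):
    """Treat a missing response or any 5xx as a failed check."""
    return status is None or 500 <= status <= 599


def probe_via_proxy(target_url, proxy_url, timeout=2.0, opener=None):
    """GET target_url through proxy_url and return the HTTP status code.

    Returns None when no HTTP response arrived at all (connection refused,
    timeout, and so on). `opener` is injectable only to make testing easy.
    """
    if opener is None:
        handler = urllib.request.ProxyHandler({"http": proxy_url})
        opener = urllib.request.build_opener(handler)
    try:
        with opener.open(target_url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # Must be caught before URLError: HTTPError is a subclass of it.
        return err.code
    except (urllib.error.URLError, OSError):
        return None


def ready(target_url, proxy_url, **kwargs):
    """True if a request to target_url via the proxy did not fail."""
    return not is_proxy_failure(probe_via_proxy(target_url, proxy_url, **kwargs))


# Hypothetical usage inside foo's readiness endpoint:
#   node = os.environ["NODE_NAME"]
#   ok = ready("http://bar/health", "http://%s:4440" % node)
```

One obvious drawback is that a 5xx from bar itself would also mark foo as not ready, so the check cannot tell a broken l5d pod apart from a broken bar.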
Or is this highlighting a larger problem in our l5d configuration? It usually runs without any issues and has done so for multiple months now.