Linkerd does not sync up with namerd if the latter goes down and then comes back up

What seems to be happening is that:

  1. namerd and linkerd are up
  2. namerd goes down
  3. linkerd continues to work off the cached dtabs
  4. namerd comes back up again
  5. dtab in namerd is modified
  6. linkerd continues to use the stale cached dtab
  7. restarting linkerd causes it to pick up the fresh dtab from namerd

I think this is probably due to this issue: https://github.com/linkerd/linkerd/issues/807

The solution would be to use one of the streaming interfaces for namerd: either io.l5d.namerd.http (streaming HTTP/1.1) or io.l5d.mesh (gRPC).

I changed the interface between linkerd and namerd to io.l5d.mesh. That does not fix the syncing problem. Here is what I see now:

  1. When namerd goes down, linkerd continues to pass traffic using the cached dtab.
  2. PROBLEM: The linkerd dtab page does not show the cached dtab. Instead, it shows an error message saying that it cannot connect to namerd.
  3. PROBLEM: When namerd comes back up, linkerd remains disconnected from it and does not sync any changes made to the namerd dtab. The linkerd GUI dtab page continues to show the same error message.

Does this functionality work OK for you folks?

I’m wondering whether fail-fast and failure accrual are enabled for the gRPC interface…

How / where would I check / enable that?

Also, see above for how the linkerd dtab GUI is misbehaving. Would that be attributable to this as well?

I just tested this and verified that linkerd does reconnect to namerd and gets dtab updates. However, linkerd uses exponential backoff on retries to namerd with the time between retries going up to 10 minutes by default. This means that it can be up to 10 minutes before linkerd will reconnect to namerd once it comes back online. This backoff is configurable here: https://linkerd.io/config/1.1.0/linkerd/index.html#namerd-retry
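
To get a feel for why the reconnect can look like it never happens, here is a rough Python sketch of a capped exponential backoff. It is only an illustration, not linkerd's actual retry code: the doubling rule and the 5-second base are assumptions, the function names are made up for this example, and any jitter is left out. Only the 10-minute (600-second) cap comes from the default described above.

```python
from itertools import islice


def backoff_schedule(base_seconds=5, max_seconds=600):
    """Yield successive retry delays, doubling each time up to max_seconds.

    base_seconds=5 and plain doubling are assumptions for illustration;
    max_seconds=600 matches the 10-minute default cap mentioned above.
    """
    delay = base_seconds
    while True:
        yield min(delay, max_seconds)
        delay = min(delay * 2, max_seconds)


def delay_after_recovery(outage_seconds, base_seconds=5, max_seconds=600):
    """Return how long after namerd is back the next retry would fire.

    Assumes the first retry is scheduled when the outage starts (t = 0)
    and that namerd becomes reachable again at t = outage_seconds.
    """
    elapsed = 0
    for delay in backoff_schedule(base_seconds, max_seconds):
        elapsed += delay
        if elapsed >= outage_seconds:
            return elapsed - outage_seconds


if __name__ == "__main__":
    # First few delays under the assumed schedule.
    print(list(islice(backoff_schedule(), 9)))
    # Extra wait after a 1-minute outage and after a roughly 12-minute outage.
    print(delay_after_recovery(60), delay_after_recovery(700))
```

Under these assumptions, a short outage is picked up within seconds, while an outage longer than about ten minutes can leave linkerd waiting several more minutes after namerd is back. Lowering the maximum in the retry settings linked above should shorten that window.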

@Alex Which interface did you test this with? All three?

Just now I tested with io.l5d.mesh but it should work with all 3.

Thanks for all your help! [xxxx]
