Production high-availability/zero-downtime deployment questions

We’re just starting down the path to integrating Linkerd into our tech stack and I’m thinking ahead to how it will be deployed in production. Specifically, I’m thinking about how it will be deployed in a high-availability fashion and updated with zero downtime. After reading through the documentation and forums, I have a few questions about how to accomplish this:

  1. Is it possible/recommended to deploy a cluster of linkerd instances behind a load balancing VIP? In other words, requests would go through a central load balancer (e.g. F5, HAProxy, etc.) to one of a cluster of linkerd instances to be routed to the destination service. Any concerns or considerations when using this setup?

  2. In the per-host deployment model (https://linkerd.io/in-depth/deployment/), how does one upgrade or reconfigure the linkerd instances in a way that results in zero downtime? In other words, if we need to reconfigure or update the linkerd instances, is there a way to do so such that the services running on that host see no downtime in the linkerd instance?

  3. When running a separate namerd, are there examples or documentation for how to run that in a high-availability mode? How do I keep namerd from being a single point of failure? Can it be run in a cluster and load balanced across the linkerd instances? Other ways to provide redundancy in this layer?

Thanks!
David

Hi @dhayha! Good questions.

  1. Is it possible/recommended to deploy a cluster of linkerd instances behind a load balancing VIP?

While it's possible to deploy linkerd as a centralized cluster, you lose a lot of its advantages/features (e.g. TLS upgrading between hosts); it works better as an intra-cluster, per-host proxy/load balancer. Interested to hear more about your setup, though.

how does one upgrade or reconfigure the linkerd instances in a way that results in zero downtime?

Upgrading routing rules can be done dynamically using namerd (as you probably already know). For changes that do require a restart, rolling restarts are how I would handle reconfiguring linkerds so that users don't see any downtime. Relatedly, we've also got a couple more dynamic config changes in the pipeline.
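For reference, here's a minimal sketch of wiring linkerd to namerd so dtab changes take effect without a restart. The hostname, port, and namespace are hypothetical placeholders, not a prescribed setup:

```yaml
# linkerd config (sketch): routing rules are fetched from namerd at
# runtime, so dtab updates in namerd propagate without restarting linkerd.
routers:
- protocol: http
  interpreter:
    kind: io.l5d.namerd
    dst: /$/inet/namerd.example.com/4100   # namerd's thrift interpreter interface (hypothetical address)
    namespace: default                     # the dtab namespace this linkerd watches
  servers:
  - port: 4140
    ip: 0.0.0.0
```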

  3. When running a separate namerd, are there examples or documentation for how to run that in a high-availability mode? How do I keep namerd from being a single point of failure? Can it be run in a cluster and load balanced across the linkerd instances? Other ways to provide redundancy in this layer?

Hm, I forget if we’ve got a doc on this or not. In contrast to linkerd, I would definitely run namerd as a cluster. Each linkerd keeps a cache of the info namerd sends, so if namerd goes down, linkerd continues to route to previously known services. The Buoyant team has fought a ton of service discovery outage fires in the past, so resiliency is a priority :slight_smile:
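To make the cluster idea concrete, here's a rough sketch of a namerd config where multiple identical replicas share a ZooKeeper-backed dtab store, so no single replica is a point of failure. The addresses, paths, and the fs namer are hypothetical stand-ins for whatever you actually run:

```yaml
# namerd config (sketch): run several identical replicas of this; they
# share dtab state via ZooKeeper, so any replica can serve any linkerd.
storage:
  kind: io.l5d.zk
  zkAddrs:
  - host: zk-1.example.com
    port: 2181
  - host: zk-2.example.com
    port: 2181
  - host: zk-3.example.com
    port: 2181
  pathPrefix: /dtabs
namers:
- kind: io.l5d.fs            # stand-in; substitute your service discovery namer
  rootDir: /disco
interfaces:
- kind: io.l5d.thriftNameInterpreter   # what linkerd's io.l5d.namerd interpreter talks to
  port: 4100
- kind: io.l5d.httpController          # what namerctl talks to
  port: 4180
```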

Thanks for the answers. Some follow-up questions/comments:

  1. When you mention the TLS upgrading, I assume you’re referring to the ability for Linkerd to handle the SSL termination between hosts? I was mainly thinking of this as a way to do rolling restarts of linkerd in a way that would be transparent to the clients routing through those instances. It does seem like the host-to-host mechanism makes more sense, however.

  2. I do understand that namerd can be used to dynamically change the routing rules. However, I’m still unclear about how to do the rolling restarts in a way that would result in zero downtime. For example, suppose I have a host with half a dozen (or more) microservices running on it, all of which route their outbound requests through the local linkerd instance. How would I bounce that linkerd instance (e.g. due to a config change or version upgrade) in a way that is transparent to the microservices routing through that instance?

  3. Good to hear. If you can find any documentation or examples, I’d appreciate it; I didn’t immediately see how to configure a cluster of namerd instances when reading the linkerd documentation.

  4. In the linkerd to linkerd model, how do folks typically handle the service registration? Do the services themselves advertise the linkerd connection info to the service discovery registry? Or is another mechanism used to register services?

Thanks!
David

  1. That’s right, the ability to transparently encrypt cross-host traffic.

  2. Linkerd supports a graceful shutdown where it stops accepting new requests and waits a short amount of time for in-flight requests to complete before shutting down. However, the entire node will be effectively offline for the duration of the linkerd restart. Ideally your services should be replicated across multiple nodes so that restarting the linkerd on a single node doesn’t impact availability of the services. (A sketch of the admin hook involved follows this list.)

  3. There’s not much to say about namerd deployments other than: run at least 3 to ensure high availability. There’s a nice Kubernetes example of running namerd here: https://blog.buoyant.io/2016/11/04/a-service-mesh-for-kubernetes-part-iv-continuous-deployment-via-traffic-shifting/

  4. It depends on the environment. In scheduled environments like Kubernetes or Mesos, there’s a scheduler that knows where each service is running. Therefore, it’s not necessary for the services to register themselves because linkerd can simply ask the scheduler where things are. That said, if for some reason you need linkerd to register itself, it does support a few announcers (a rough sketch follows below): https://linkerd.io/config/1.1.0/linkerd/index.html#announcers
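Regarding the graceful shutdown in point 2, here's a minimal sketch of the relevant config, assuming the stock admin interface (9990 is linkerd's default admin port; the shutdown endpoint is the standard one exposed by the admin server):

```yaml
# linkerd config (sketch): the admin server exposes the graceful-shutdown
# hook. 9990 is the default admin port.
admin:
  port: 9990
# A rolling-restart script can then drain an instance with a POST to
# /admin/shutdown on that port; linkerd stops accepting new requests and
# gives in-flight ones time to finish before exiting.
```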
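And on the announcers from point 4, here's a rough sketch of what self-registration can look like, assuming the serversets announcer; the ZooKeeper address, path, and service name are hypothetical:

```yaml
# linkerd config (sketch): the io.l5d.serversets announcer registers this
# linkerd's server port in ZooKeeper-backed serversets.
announcers:
- kind: io.l5d.serversets
  zkAddrs:
  - host: zk.example.com
    port: 2181
  pathPrefix: /discovery
routers:
- protocol: http
  servers:
  - port: 4140
    announce:
    - serversets!web     # announce this server as the "web" service
```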

  2. That’s good to know about the graceful shutdown. But it doesn’t help if I have services that are, say, consuming from a message queue and making outbound requests. I’d need to make sure they pause their processing while the linkerd instance is bouncing.

3a) Thanks for the link. I must be missing something, as I didn’t see the linkerd configuration that points at a cluster of multiple namerd instances.

3b) It also wasn’t clear to me how the blue-green deploy was done. Am I reading things right that there’s a Jenkins plugin & script that is making calls to namerctl to update the dtab entries?

  4. We’d like to get to a Docker container world, but we’re not there yet. I’m a bit confused about how linkerd would route an incoming request from another linkerd instance to the correct service. Is there something that favors services sitting on the same host? That is, my client makes a request to its local linkerd. That instance locates and selects the host and port to route that request to. The remote linkerd instance accepts the request and then…what? How does it know to route to the local instance? Presumably it’s using the same service discovery mechanism as the original linkerd. Or does each linkerd instance necessarily have a unique configuration, depending on what’s running on the host? Or is that where announcers come in?

Thanks!
David

  2. Yes, I think you’re right. You’d want to pause those jobs before initiating the linkerd restart.

3a) A simple round-robin DNS entry is usually how linkerd addresses a cluster of namerds.

3b) That’s correct.
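For a sense of what that update looks like, the shift is just a weighted dtab edit pushed to namerd's httpController with something like `namerctl dtab update default <file>`. A sketch, with hypothetical service names (the `before`/`after` keys are purely illustrative labels):

```yaml
# Sketch of the dtab edit behind a blue-green traffic shift. Each dtab
# below would be written to namerd via namerctl; linkerds pick it up live.
before: |
  /svc/hello => /srv/hello-blue;
after: |
  /svc/hello => 9 * /srv/hello-blue & 1 * /srv/hello-green;
```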

  4. Great question: this is where transformers come in. The linkerd on the destination host will do a service discovery lookup for the target service, and then the io.l5d.localhost transformer will filter that list down to the entries running locally to that linkerd. This blog post describes in more detail how this works in Kubernetes. For non-Kubernetes environments it’s very similar, but the io.l5d.port transformer is used instead of the io.l5d.k8s.daemonset transformer, and the io.l5d.localhost transformer is used instead of the io.l5d.k8s.localnode transformer.
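To make that concrete, here's a rough sketch of a two-router, linkerd-to-linkerd config for a non-Kubernetes environment using those transformers. All addresses and ports are hypothetical placeholders:

```yaml
# Sketch of a linkerd-to-linkerd setup outside Kubernetes.
routers:
# Outgoing: local services send here; each looked-up endpoint is rewritten
# to target the remote host's linkerd port rather than the service's port.
- protocol: http
  label: outgoing
  servers:
  - port: 4140
    ip: 127.0.0.1
  interpreter:
    kind: io.l5d.namerd
    dst: /$/inet/namerd.example.com/4100
    namespace: default
    transformers:
    - kind: io.l5d.port
      port: 4141
# Incoming: remote linkerds send here; the lookup result is filtered down
# to only the service instances running on this host.
- protocol: http
  label: incoming
  servers:
  - port: 4141
    ip: 0.0.0.0
  interpreter:
    kind: io.l5d.namerd
    dst: /$/inet/namerd.example.com/4100
    namespace: default
    transformers:
    - kind: io.l5d.localhost
```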