Hi Linkerd Guys, how are you?
For the past month or so I’ve been struggling to configure a proper transparent SSL/TLS architecture for my DC/OS cluster. The goal is to achieve TLS communication internally and also when reaching another cluster in our network/organisation. It works something like this:
We have two Linkerd deployments, “internal” and “external”:
- Internal Linkerd sits on every agent in the cluster and acts as a proxy. Every request a service makes enters Linkerd via the incoming port 3128 and continues out via the outgoing port 3126. We have a dtab configuration that “intercepts” every outgoing request with the help of an fs namer and sends it to the “public”, or “external”, Linkerd. The dtab is built to recognise whether a request needs to stay in the cluster or reach another one “behind” the external Linkerd; if the request is meant for another cluster, the fs namer kicks in and the request is sent there (both configs are sketched right after this list).
- External Linkerd sits “in the middle” of the request path and acts as a “gateway” between the clusters. Every external Linkerd instance has a single IP address and a single router, which listens on port 3126. The fs namer in the internal Linkerd sends every outbound request to this IP:3126 (note that this configuration is static for now). In the external Linkerd’s configuration, the router uses a port transformer to change the destination port of every outgoing request to 3126 - which, as you may have noticed, is also the outgoing port of the internal Linkerd. That’s because the external Linkerd uses the Marathon namer, just like the internal one, and the port it gets after binding is just a dynamic Marathon port - but we need the request to continue to the static Linkerd port on the other end.
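For concreteness, here is a stripped-down sketch of the two configs in Linkerd 1.x YAML. The dtab rules, service names, cluster ids and file names are placeholders, not our exact setup; the namers (io.l5d.marathon, io.l5d.fs) and the io.l5d.port transformer are the stock ones:

```yaml
# Internal Linkerd (one per agent) - sketch, names are placeholders
namers:
- kind: io.l5d.marathon          # resolves services inside this cluster
- kind: io.l5d.fs                # static "host port" files (see below)
  rootDir: /disco

routers:
- protocol: http
  servers:
  - ip: 0.0.0.0
    port: 3128                   # incoming port every local service targets
  dtab: |
    /svc          => /#/io.l5d.marathon;          # default: stay in the cluster
    /svc/cluster2 => /#/io.l5d.fs/external-l5d;   # later entries win: leave via external Linkerd
  # (the second router, on the outgoing port 3126, is omitted for brevity)
```

```yaml
# External Linkerd (the "gateway") - sketch
namers:
- kind: io.l5d.marathon

routers:
- protocol: http
  servers:
  - ip: 0.0.0.0
    port: 3126                   # the single static port the fs namer points at
  interpreter:
    kind: default
    transformers:
    # Marathon hands back a dynamic port, so rewrite every resolved
    # address to the static Linkerd port on the destination agent:
    - kind: io.l5d.port
      port: 3126
  dtab: |
    /svc => /#/io.l5d.marathon;
```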
So every request trying to leave a cluster is redirected by the fs namer to the external Linkerd (when needed). Before it continues to the destination service in the second cluster, the external Linkerd rewrites the request’s port again so it can reach the Linkerd proxy on the destination cluster’s agent as well.
The fs namer is just a static IP:PORT file with the addresses of the external Linkerd instances (port 3126).
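Concretely, each file under the namer’s rootDir is just “host port” lines, one address per line (addresses here are placeholders):

```
# /disco/external-l5d - the external Linkerd instances, one "host port" per line
10.32.0.5 3126
10.32.0.6 3126
```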
So everything works, but the problem is the availability of this architecture. We have two main concerns.
First: certificate updates. I understand we can just inject a new certificate into every running Linkerd instance/container, and Linkerd is supposed to be able to pick up the new certificates online, on the go, with no downtime. I was just wondering whether there’s maybe a way to use a single location for the certificates, like an S3 bucket that all the instances would download the certificate from.
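For reference, this is roughly where the certificates sit in a Linkerd 1.x router config (the paths are placeholders, and the S3-syncing sidecar is just my idea, not something Linkerd ships):

```yaml
# TLS on a router's server - sketch, paths are placeholders
routers:
- protocol: http
  servers:
  - ip: 0.0.0.0
    port: 3126
    tls:
      certPath: /certs/linkerd.pem     # a sidecar could refresh these files
      keyPath: /certs/linkerd-key.pem  # in place, e.g. synced from an S3 bucket
```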
Second: upgrade policies/strategies. If I understand correctly, both the internal and the external Linkerd listen on static host ports. If I want to restart or upgrade the Linkerd service, the old instance has to be killed on an agent before a new instance can grab the port on that agent; otherwise the new instance can’t start, because the port is still in use by the old version of Linkerd. It’s a conflict. From tests I’ve made, when I do a rolling restart (killing every Linkerd instance one by one) I get about 1%-2% failed requests (requests that never reach the destination agent). The tests were rigorous, of course: I deployed two services across 8 agents (4 and 4), plus 8 internal Linkerd proxies. Those two services can reach each other with zero failures only while all 8 Linkerd instances are up. In a production environment, every service has X instances spread over the cluster, and Linkerd (the proxy) has as many instances as there are agents in the cluster, so a rolling update of Linkerd shouldn’t harm that many requests - but it’s still concerning.
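The direction I’m considering is pinning the Marathon upgrade strategy so that an old instance is always killed before its replacement starts (a sketch; the app id and numbers are placeholders):

```json
{
  "id": "/linkerd-internal",
  "instances": 8,
  "upgradeStrategy": {
    "minimumHealthCapacity": 0.9,
    "maximumOverCapacity": 0
  }
}
```

"maximumOverCapacity": 0 means Marathon never starts a new task while the old one still holds the host port, and "minimumHealthCapacity" bounds how many instances may be down at once during the deployment. Is this the recommended pattern, or is there a better way to do zero-downtime Linkerd upgrades on static host ports?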
Thank you for your help.