Upgrading Linkerd as a proxy on DC/OS


#1

Hi Linkerd Guys, how are you?
For the past month or so I’ve been struggling to configure a proper transparent SSL/TLS architecture for my DC/OS cluster. The goal is to achieve TLS communication internally and also when trying to reach another cluster in our network/organisation. It works something like this:

We have two Linkerds - “internal” and “external”:

  • The internal Linkerd sits on every agent in a cluster and acts as a proxy. Every request any service makes enters Linkerd via the incoming port 3128 and continues via the outgoing port 3126. We have a dtab configuration that “intercepts” every outgoing request with the help of an fs namer and sends it to the “public” or “external” Linkerd. The dtab is built to recognise whether a request needs to stay in the cluster or reach another one, “behind” the external Linkerd. If the request is meant for another cluster, it is sent there after the fs namer kicks in.
  • The external Linkerd sits “in the middle” of the request path and acts as a “gateway” between clusters. Every external Linkerd instance has a single IP address and, as I said, only one router, configured to listen on port 3126. The fs namer in the internal Linkerd sends every request to this IP:3126 (note that this configuration is static for now). In the external Linkerd’s configuration, the router uses a port transformer to change the destination port of every outgoing request to port 3126 - which, as you may have noticed, is also the outgoing port of the internal Linkerd. That’s because the external Linkerd uses the Marathon namer just like the internal one does, and the port it would get after binding is just a dynamic Marathon port - but we need the request to continue to the static Linkerd port on the other end.

So every request trying to leave a cluster is diverted via the fs namer to the external Linkerd (if needed). Before it continues to the destination service in the second cluster, the external Linkerd rewrites the request’s destination port again so it reaches the Linkerd proxy on the destination cluster’s agent as well.
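To make the port rewriting concrete, here is a stripped-down sketch of how the external Linkerd’s namer and router fit together (a simplified illustration rather than our exact config, and the dtab is shortened):

    # External ("gateway") Linkerd - simplified sketch
    namers:
    - kind: io.l5d.marathon        # resolve destination services via Marathon, like the internal Linkerd does
      transformers:
      - kind: io.l5d.port          # rewrite every resolved address to the static proxy port...
        port: 3126                 # ...so traffic lands on the internal Linkerd of the destination agent

    routers:
    - protocol: http
      label: external
      dtab: |
        /svc => /#/io.l5d.marathon ;
      servers:
      - ip: 0.0.0.0
        port: 3126                 # the single port the internal Linkerds' fs namer points at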

These are our configurations:
external.txt (1.3 KB)
internal.txt (2.1 KB)

The fs namer just reads a static IP:PORT file containing the addresses of the external Linkerd (port 3126).
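For illustration, the fs namer and the diverting dtab rule look roughly like this (the file name, address and dtab entries are made up for the example; the real dtab is longer):

    # Internal Linkerd - outgoing side, simplified sketch
    namers:
    - kind: io.l5d.fs
      rootDir: disco               # e.g. the file disco/external-gw contains the line "10.1.2.3 3126"

    # In the outgoing router's dtab, cross-cluster names are diverted to that file,
    # while everything else stays on the in-cluster Marathon namer:
    #
    #   /svc/other-cluster => /#/io.l5d.fs/external-gw ;
    #   /svc               => /#/io.l5d.marathon ;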

So everything works, but our problem is the availability of this architecture, and we have two main concerns.

First: certificate updates. I understand we can just inject a new certificate into every running instance/container of Linkerd, and it’s supposed to be able to pick up those certificates online, on the go, with no downtime. I was just wondering whether there’s maybe a way to use a single location for certificates, like an S3 bucket that all the instances would download the certificate from.

Second: upgrade policies / strategies. If I understand correctly, both the internal and the external Linkerd listen on static host ports. If I want to restart or upgrade the Linkerd service, an instance has to be down/killed on an agent before a new instance can grab the port on that same agent. Otherwise the new instance can’t start, since the port is already in use by the old version of Linkerd - a conflict.

From tests I’ve made, when I do a rolling restart (killing every instance of Linkerd one by one) I see about 1%-2% failed requests (meaning: requests that weren’t able to reach the destination agent). The tests I’ve made are rigorous, of course: I deployed two services across 8 agents (4 and 4) along with 8 internal Linkerd proxies. Those 2 services can reach each other with zero failures only if all 8 Linkerd instances are up. In a production environment, every service has X instances spread over the cluster, and Linkerd (the proxy) has as many instances as there are agents in the cluster. So of course a rolling update of Linkerd shouldn’t hurt that many requests - but it’s still concerning.
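For context, a rough, illustrative fragment of what the Marathon app definition for Linkerd looks like in this setup (field names from Marathon; values made up). With maximumOverCapacity set to 0, Marathon kills the old task first so the replacement can grab the same static host ports - which is exactly the window in which requests can fail:

    {
      "id": "/linkerd",
      "instances": 8,
      "constraints": [["hostname", "UNIQUE"]],
      "requirePorts": true,
      "portDefinitions": [{ "port": 3128 }, { "port": 3126 }],
      "upgradeStrategy": {
        "minimumHealthCapacity": 0.5,
        "maximumOverCapacity": 0
      }
    }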

Thank you for your help.


#2

Hi @jacobgo,
Thanks for submitting all the information. We will investigate and get back to you soon.
Franzi


#3

Hi @jacobgo,

Thanks for your patience while waiting for our reply. As for your questions:
1 - Linkerd doesn’t do this today, but it’s a reasonable feature request. Could you file a feature request, please? And if you would like to work on it, we always welcome PRs! :slight_smile:
2 - The safest thing to do here is to drain the node in DC/OS, do the Linkerd upgrade, and then add the node back in. However, if that’s too heavyweight, it should work the other way as well. A couple of questions:

  • How are you shutting down Linkerd?
  • Do you see the shutdown logs for Linkerd indicating that a graceful shutdown occurred?

#4

Thanks for the reply.

My test script sent a request that looks something like this to the Marathon API (I used the internal name, but it doesn’t really matter IMHO), which initiated a restart:

    curl -X POST "http://marathon.mesos:8080/v2/apps/linkerd/restart?force=true" -H "accept: application/json"

The log line in stderr afterwards was:

    W0112 01:37:47.058881 901 logging.cpp:91] RAW: Received signal SIGTERM from process 19140 of user 0; exiting

and in stdout:

    Received killTask for task linkerd.26cbb46c-f739-11e7-88f2-4a2ea8f63361

I’ve posted the logs for the Linkerd proxy here.

I don’t see anything unusual, but maybe I’m missing something.


#5

Also, I want to clarify something, because I think it got lost in translation: the main issue for us is the upgrade/restart process. We can manage certificates with the current state of things, but draining nodes just because of a Linkerd upgrade is somewhat extreme IMHO.

I believe it isn’t just a Linkerd issue, or a core Linkerd issue, because it’s somewhat connected to the Marathon namer as well. The problem with using Linkerd as a proxy in DC/OS is that when a Linkerd instance is killed (as part of an upgrade, say), Marathon and the Marathon namer won’t recognise that a certain DC/OS agent is effectively “down” - it has no proxy for a (hopefully very short) period of time. Linkerd pulls the health status for each service from Marathon itself, which will report that all services (besides Linkerd at that moment) are healthy. Thus, Linkerd will keep trying to route traffic to a service on an agent that can’t reply - just because a Linkerd instance is down. Hence downtime, or a risk of it.

It seems that maybe Linkerd should somehow know which of its instances is being killed/upgraded. I’m tempted to suggest almost a gossip algorithm and cluster-aware behaviour… since somehow Linkerd (and Marathon / the namer?) should not route requests to instances on an agent that has no Linkerd proxy (if one uses Linkerd as a proxy on each agent).

Have you seen my configs? Did I do something wrong? If so, I’d really like to hear from you and I’m totally open to a review (:

Thanks so much.


#6

Hi @jacobgo,

Apologies for the misunderstanding. It sounds like using Linkerd’s retry functionality could help with your issue. Here is a link to the documentation. It allows you to set which kinds of requests are allowed to be retried. Essentially it checks, even before the request gets to the load balancer, whether the request would be routed through. If an error occurs (you can set specific errors in the Linkerd config), it will retry on a different node.
For more detail, here is a blog post discussing retries, among other things, in a bit more depth.
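As a rough sketch (the exact placement of these keys has moved between Linkerd versions - in older releases they sat directly under the router and client sections - so please double-check against the docs linked above), the retry-related pieces of a router config could look something like this:

    routers:
    - protocol: http
      service:
        # classify read failures (e.g. a connection reset before any response bytes) as retryable
        responseClassifier:
          kind: io.l5d.http.retryableRead5XX
        # bound how much extra load retries are allowed to add
        retries:
          budget:
            ttlSecs: 15
            minRetriesPerSec: 5
            percentCanRetry: 0.5
          backoff:
            kind: jittered
            minMs: 10
            maxMs: 10000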
Does this sound like what you had in mind? Let me know if you have more questions!


#7

Amazing!
Indeed, it seems like something that could solve this. I’m going to bring it to testing. A wonderful post, BTW.

Thanks so much for your help.


#8

Should we be seeing responses with a 200 status but no body at all, i.e. Content-Length: 0? It seems that our (Linkerd) proxies pass requests to the external Linkerd, but it reports something like this during the proxies’ restart:

D 0129 12:50:01.409 UTC THREAD29 TraceId:e2ff4924300ad2f6: Exception propagated to the default monitor (upstream address: /SERVICE_A:PORT, downstream address: /SERVICE_B:PORT, label: %/io.l5d.port/PORT/#/io.l5d.marathon/SERVICE_B).
com.twitter.finagle.ChannelClosedException: Connection reset by peer at remote address: /SERVICE_B:PORT from service: %/io.l5d.port/PORT/#/io.l5d.marathon/SERVICE_B. Remote Info: Upstream Address: Not Available, Upstream id: Not Available, Downstream Address: /SERVICE_B, Downstream label: %/io.l5d.port/PORT/#/io.l5d.marathon/SERVICE_B, Trace Id: e2ff4924300ad2f6.15a3242d4381ad8a<:1eb698fb5806d7b8

Linkerd-proxy’s log is empty.

Is this com.twitter.finagle.ChannelClosedException: Connection reset by peer at remote address something you know about, in the context of empty bodies? It seems that services using the Linkerd proxy somehow always respond with 200 (maybe Linkerd itself always responds with 200 in this architecture? Or just one of them?), but during a restart some of the responses come back without a body. Because of that I can’t really configure the retries, since no 500 status ever occurs. All the impact I’ve reported is probably due to failing bodies, since unfortunately I checked only response bodies last week (January 21st). Today I changed the logic and separated failing statuses from failing bodies - and got 200 for 100% of the responses.

I’ll check the services we’ve created for testing, but just wanted to clarify this with you.

Thank you.


#9

Hi guys,

just wanted to update you that we’ve checked our test services and it seems it’s not a problem in our code - the service can throw a 500 when appropriate. Did you have time to check this out?

Also, about certificate updates - do you have a best practice for this in DC/OS? Like a cron job downloading from S3, for instance? Or an attached volume (I kinda think that’s not the use case here, but who knows)? Would be glad to hear.


#10

Hi @jacobgo, we apologize for not getting back to you immediately. We will check out the issues you are seeing and give recommendations on certificate updates.


#11

@jacobgo, just to validate my understanding of this issue: you are seeing that the external Linkerd proxy sends an empty 200 response to an internal Linkerd proxy during a restart. Is that correct? If so, which Linkerd proxy is undergoing the restart, the external or the internal? Linkerd does not generate any 200 responses but rather just forwards the responses it gets from its clients. The only responses Linkerd generates itself are 500 responses on Linkerd exceptions.

With regard to certificate updates, we don’t really have best practices for how to do them. Using S3 and a cron job sounds like a viable solution, but it really depends on the environment you have set up.


#12

  • Via a script, I’m sending a plain curl non-stop to a service running on an agent (“Service A” or “Service B” in my diagram). It’s supposed to respond with a message (body) and a 200 status.
  • While this script is running and sending requests, I restart the internal Linkerd proxy, meaning the Linkerd instance on the same agent that holds “Service A/B”. The public Linkerd is stable and doesn’t undergo any restarts or failures I’m aware of.
  • I believe the empty 200 responses come from the public Linkerd, since it reports the exceptions in its logs, but for now that’s an assumption.

So, regarding your question, Dennis, I believe you indeed got it right:

external Linkerd proxy is sending an empty 200 request to an internal Linkerd proxy during restart (of the internal Linkerd proxy)

I understand that Linkerd itself isn’t responsible for the statuses in the responses, but we’re concerned about the empty bodies and we see exceptions in the logs. Why do we get a 200 if there’s a Linkerd exception?


#13

I also did a similar test with a slightly different architecture - I created an endpoint that doesn’t need to call other services and just replies via the internal Linkerd proxy. When I send non-stop requests and restart Linkerd, I get 0 failures (200 with a full body). So it indeed seems like a linker-to-linker issue, since this only occurs when the internal Linkerd proxy needs to pass through the external one.


#14

I’m not sure of all the details of your latest test setup, but I think this is good progress towards isolating the issue. Obviously there are lots of things that could potentially go wrong in the linker-to-linker scenario when you’re restarting one of the linkerd instances.

Consider this scenario: the client (source) linkerd sends an HTTP request to the server linkerd. The service behind the server (destination) linkerd sets up its response, including the 200 OK status line and the response headers. Then it starts generating the body. Before the response body starts getting sent back to the client, you restart the client (source), which causes the client to close the connection. The server (destination) service notices the closed connection and flushes and closes the connection on its end. In that scenario you would end up with a (wrongly) truncated 200 OK response. Do you think this is what’s happening?

If so, then we can think about the following things:

  1. During a restart, is the client (source) linkerd not waiting around long enough to receive responses to the requests it has already sent? Is this a bug in linkerd, or is the restart of linkerd not being done gracefully?

  2. What should the client (source) linkerd do when it receives a truncated 200 OK response? In some cases it won’t be able to handle a truncated response gracefully, due to limitations of the HTTP protocol, but in some cases it probably could.

Anyway, again, I’m not sure I 100% understand your scenario. I’m mostly trying to reason this out from first principles, without even considering any linkerd specifics.


#15

First, see https://github.com/linkerd/linkerd/issues/1806 regarding a limitation of the live reloading of certificates.

As a security engineer, I’d advise against storing your certificates and private keys in S3 unless that’s the only possibility. I’m not familiar with DC/OS, but I know it has a “secrets” mechanism. In general I recommend using the “secrets” mechanism in DC/OS (and Kubernetes and other similar platforms) for private keys, and I also recommend storing the certificate and private key together.

From tests I’ve made, when I do a rolling restart (meaning killing every instance of Linkerd one by one) I have about 1%-2% of failed requests (meaning: requests that weren’t able to reach the destination agent).

How are you shutting down linkerd? By default linkerd will wait 10 seconds for all its request queues to drain. You should be able to control this timeout with shutdownGraceMs in the admin section of the configuration file. This isn’t documented; I filed an issue for the documentation at
https://github.com/linkerd/linkerd/issues/1807.
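For example, something like this in the config should give in-flight requests 30 seconds to drain instead of the default 10 (the port shown is linkerd’s default admin port; the timeout value is just an illustration):

    admin:
      ip: 0.0.0.0
      port: 9990
      shutdownGraceMs: 30000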


#16

I think I get what you’re saying, and it’s indeed reasonable and seems like a valid scenario. But it’s not just that I have a Linkerd client (what you call the “source”) and a Linkerd server (the “destination”). I have a Linkerd client talking to a server, which in turn talks to another Linkerd server, so the middle one also “looks like” a client :slight_smile:

Like in my diagram: ServiceA --> AgentA --> Linkerd internal proxy A --> Linkerd external --> Linkerd internal proxy B --> AgentB --> ServiceB. Linkerd internal B, AgentB and ServiceB may well be on another cluster, even. That’s the purpose of the “external” Linkerd - to shift traffic to other external locations. The only way ServiceA and AgentA can communicate with B is via the internal Linkerd proxy. So you see that when internal proxy A reaches the external Linkerd, A is the client and the external is the server; but when the external Linkerd opens a connection to internal proxy B, the external is the client and proxy B is the server.

I would expect that if one of those internal proxies closes a connection before generating a proper body, I wouldn’t get a 200 status back - preferably I’d get a 500, to allow retries to happen. That’s the problem: I don’t really care how gracefully Linkerd shuts down; but if I always get 200 with empty bodies, I’ll continue to lose 1%-2% of my total requests during an upgrade/restart (since we commit to returning both a status and a body).

regarding your questions:

  1. Linkerd is being restarted gracefully via the Marathon API (with the restart call shown in post #4 above).

  2. I think - but that’s really a personal view - that a 200 status with an empty or damaged body shouldn’t really be a 200… I don’t see myself as an expert in REST / the HTTP protocol, though. Actually, we just need a way to trigger retries when this happens. If changing the response status is the only way - good; if there’s another way - I would be glad to hear about it.

Regarding your suggestion on using the DC/OS secrets plugin: I’m not using the Enterprise edition, but we do have other ways of achieving this. I tried to simplify the architecture for this discussion, since we wanted to hear the best practices from you. I’ll take a look at the issues you linked to.

Thanks for your help.


#17

Thanks. I saw earlier that you reduced this down to a simpler configuration and that the simpler configuration didn’t have the same problem. I think at this point we need to find the simplest configuration that still does have the problem. I hope we can find a simpler configuration with fewer than 7 things involved! :slight_smile:

I agree with you that when a proxy sees that the upstream has given it a truncated 200 response, it should return a 5xx response whenever possible. It won’t always be possible because, once the proxy has written its own response header, the “200 OK” has already been sent on the network and we can’t change it to a 5xx. However, up until the time the proxy writes its response header, it can change the response code. Also, we could try to defer writing out the response header until we really, really need to, e.g. until we’ve received at least part of the upstream’s response body. Today I’ll do an experiment to see exactly what linkerd is doing here.


#18

Hi guys - how are you?
Have you had the chance to test this architecture?