AWS ELB, Marathon and Linkerd graceful shutdown synchronisation


#1

Hey guys,
how are you?

We’re using Linkerd on top of DC/OS. We also have an AWS ELB in front of the cluster. As you may well know, the ELB’s functionality is quite simple: it forwards certain ports (most commonly 80 and 443) to Linkerd’s ports, in this case. The ELB probes a pool of EC2 instances, and once it gets a TCP response from the port Linkerd is listening on, it registers the instance as “InService”. This port check is simply the ELB’s health check.

The health check has two important parameters: “Healthy Threshold” and “Unhealthy Threshold”. The Unhealthy Threshold is the number of consecutive failed health checks that must occur before an EC2 instance is declared unhealthy. You may have already guessed that the Healthy Threshold is the number of consecutive successful health checks that must occur before an instance is declared healthy. On the back end, we have Marathon’s grace period for each task it runs and the graceful shutdown grace period of Linkerd itself.
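For context, these are the knobs on the ELB health check itself. A minimal boto3 sketch (the load balancer name, region and Linkerd port below are placeholders, not our real values) would look something like this:

```python
import boto3

# Illustrative values only: load balancer name, region and port are placeholders.
elb = boto3.client("elb", region_name="us-east-1")

elb.configure_health_check(
    LoadBalancerName="my-linkerd-elb",
    HealthCheck={
        "Target": "TCP:4140",      # plain TCP probe against the port Linkerd listens on
        "Interval": 30,            # seconds between probes
        "Timeout": 5,              # per-probe timeout
        "UnhealthyThreshold": 2,   # consecutive failures before OutOfService
        "HealthyThreshold": 10,    # consecutive successes before InService
    },
)
```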

The point is that we have a certain mismatch in synchronising all these grace periods. Together, these time frames should ensure that no request fails during a Linkerd shutdown/restart. The problem is that as soon as Linkerd receives a SIGTERM, it blocks all new incoming requests. The ELB probes each instance every X seconds (most often 30), and because its thresholds can range from 2 to 10, it may well keep marking an instance as healthy (“InService”) even though Linkerd is effectively down, i.e. no longer accepting requests after the SIGTERM. Marathon also samples /admin/ping as a health check. It too can still mark a certain Linkerd instance as healthy even though Linkerd immediately blocks all requests before killing itself. In other words, Linkerd’s graceful shutdown process by itself isn’t enough to ensure a graceful shutdown when there are additional load balancers in the way.
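To put rough numbers on that window, assuming the common 30-second interval and an unhealthy threshold of 2:

```python
# Rough window during which the ELB keeps routing to a Linkerd that already
# rejects everything after SIGTERM (interval and threshold are assumed values).
interval_s = 30
unhealthy_threshold = 2

best_case_s = (unhealthy_threshold - 1) * interval_s   # SIGTERM lands just before a probe
worst_case_s = unhealthy_threshold * interval_s        # SIGTERM lands just after a probe

print(best_case_s, worst_case_s)   # 30 60 -> up to a minute of refused requests
```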

Could it be that we need /admin/ping to report some other status during the window between the SIGTERM being received and the final SIGKILL? Marathon and the AWS ELB would then sample this endpoint and immediately know that the instance isn’t healthy, even though the task hasn’t been killed yet.
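To illustrate the idea (this is a toy sketch, not Linkerd code): an admin endpoint that stays up but starts answering 503 once SIGTERM arrives, so the ELB and Marathon see the failure before traffic actually stops being served.

```python
# Toy illustration of the requested behaviour, not Linkerd's actual /admin/ping.
import signal
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

draining = threading.Event()

class Ping(BaseHTTPRequestHandler):
    def do_GET(self):
        body = b"draining" if draining.is_set() else b"pong"
        self.send_response(503 if draining.is_set() else 200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

def on_sigterm(signum, frame):
    # Keep the listener up, but start failing health checks so the ELB and
    # Marathon can drain the instance before the real shutdown happens.
    draining.set()

signal.signal(signal.SIGTERM, on_sigterm)
HTTPServer(("0.0.0.0", 9990), Ping).serve_forever()   # 9990: Linkerd's default admin port
```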


#2

@jacobgo Hi! :wave: One thing I would like to understand is: does the health check consider Connection Refused a failed health result? You are correct that when Linkerd receives a SIGTERM, all connections are blocked; in this case, every new connection is rejected with a Connection Refused error. You mentioned:

It too can still mark a certain Linkerd instance as healthy even though Linkerd immediately blocks all requests before killing itself

IIUC, the health check still considers Linkerd as healthy even after it receives a Connection Refused.


#3

Well, I’ll check this tomorrow (different time zone :)), but off the top of my head I think that because our AWS ELB health check is currently TCP based, even a “connection refused” would be considered a failed check.

The thing is that the ELB will wait for at least two health checks to fail before marking the instance running Linkerd “OutOfService”.

One can of course reduce the interval between health checks to, say, 1-2 seconds, but that may add load, be chatty, and still wouldn’t be accurate, since Linkerd closes connections immediately while the ELB still waits a few seconds (2 health checks * 2-second interval).
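For what it’s worth, this is roughly what the TCP check observes (host and port below are placeholders): a refused connection fails instantly, but the ELB still needs unhealthy-threshold consecutive failures, one interval apart, before it flips the instance.

```python
import socket

def tcp_check(host: str, port: int, timeout: float = 5.0) -> bool:
    """Roughly what a TCP health check does: succeed if the handshake completes."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:              # connection refused, timeout, unreachable, ...
        return False

print(tcp_check("10.0.0.12", 4140))   # placeholder instance IP and Linkerd port
```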


#4

I had a similar problem: our load balancer must get 3 consecutive failures (10s apart) on /admin/ping before marking the linkerd as down, so just gracefully stopping linkerd doesn’t help; the LB kept routing requests to it for 30s.

My solution was to run a TCP proxy service (I used socat) between the load balancer and linkerd’s admin port:

LB -> 8080 -> linkerd
LB -> 9080 -> socat -> 10800 -> linkerd
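
Not what I actually ran (socat did the job), but in case the mechanics help, a rough Python stand-in for that hop looks like this; the only property that matters is that it can be stopped independently of linkerd:

```python
# Rough stand-in for the socat hop: listen on 9080 and blindly forward bytes to
# linkerd's admin port on 10800 (ports from the example above). Stopping this
# process makes the LB health check fail while linkerd keeps serving on 8080.
import asyncio

LISTEN_PORT = 9080
LINKERD_ADMIN = ("127.0.0.1", 10800)

async def pipe(reader: asyncio.StreamReader, writer: asyncio.StreamWriter) -> None:
    try:
        while data := await reader.read(65536):
            writer.write(data)
            await writer.drain()
    finally:
        writer.close()

async def handle(client_reader, client_writer):
    upstream_reader, upstream_writer = await asyncio.open_connection(*LINKERD_ADMIN)
    await asyncio.gather(
        pipe(client_reader, upstream_writer),
        pipe(upstream_reader, client_writer),
    )

async def main():
    server = await asyncio.start_server(handle, "0.0.0.0", LISTEN_PORT)
    async with server:
        await server.serve_forever()

asyncio.run(main())
```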

To remove a linkerd from the LB pool: shut down the TCP proxy, wait for the LB to mark the linkerd as unavailable (requests to port 8080 are still being routed during this wait), then shut down linkerd itself.
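Spelled out as a script, the drain sequence is something like the following; the unit names and timings are assumptions for my setup, not anything standard:

```python
# Drain sequence sketch: stop the health-check proxy, wait for the LB to notice,
# then stop linkerd. Unit names and timings are assumptions; adjust to taste.
import subprocess
import time

LB_CHECK_INTERVAL_S = 10   # our LB probes every 10s
FAILS_TO_MARK_DOWN = 3     # and needs 3 consecutive failures

# 1. Take away the health-check port; the LB starts seeing failures on 9080.
subprocess.run(["systemctl", "stop", "linkerd-admin-proxy"], check=True)

# 2. Wait for the LB to record the failures and stop routing new requests here.
time.sleep(LB_CHECK_INTERVAL_S * FAILS_TO_MARK_DOWN + 5)

# 3. Traffic on 8080 has drained away by now, so linkerd can be stopped safely.
subprocess.run(["systemctl", "stop", "linkerd"], check=True)
```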

I have tested this by firing continuous requests (1 per second) through the LB to a simple echo process behind the linkerds (I have 3). Using the above to take linkerds in and out of service resulted in no request failures.
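The test itself needs nothing fancier than a loop along these lines (the URL is a placeholder for the LB/echo endpoint):

```python
# Minimal smoke test: one request per second through the LB, counting failures.
import time
import urllib.request

URL = "http://my-elb.example.com/echo"   # placeholder for the real LB/echo endpoint
failures = 0

for _ in range(600):                     # ~10 minutes of steady traffic
    try:
        with urllib.request.urlopen(URL, timeout=5) as resp:
            resp.read()
    except Exception:
        failures += 1
    time.sleep(1)

print(f"failed requests: {failures}")
```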

This issue has been discussed before; what’s really needed is a way to control the response of the ping without it being tied to a shutdown (which blocks inbound requests).


#5

Thanks @pjp!

Where was this issue discussed before, as you mentioned?


#6

Hi.

Just search for “shutdown” here and also look at this article.