Hi, how are you?
We’re using Linkerd on top of DC/OS, with an AWS ELB in front to route clients to our cluster. As you may well know, the ELB’s functionality is quite simple: it forwards certain ports (most commonly 443) to Linkerd’s ports, in this case. The ELB polls a pool of EC2 instances, and when it gets a TCP response from the port Linkerd is listening on, it registers the instance as “InService”. This port check is simply known as the ELB’s health check.
The health check has two important parameters: “healthy threshold” and “unhealthy threshold”.
Unhealthy Threshold is the number of consecutive failed health checks that must occur before an EC2 instance is declared unhealthy. As you may have guessed, Healthy Threshold is the number of consecutive successful health checks that must occur before an EC2 instance is declared healthy. On the back end, we have Marathon’s grace period for each task it runs, and Linkerd’s own graceful-shutdown grace period.
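To make the two thresholds concrete, here is a small Python model of how the consecutive-check logic behaves (the class and state names are illustrative, not an AWS API; real ELB internals may differ):

```python
# Hypothetical model of the ELB's consecutive-check thresholds: an instance
# flips to "OutOfService" only after `unhealthy_threshold` consecutive
# failures, and back to "InService" only after `healthy_threshold`
# consecutive successes.

class HealthCheckModel:
    def __init__(self, healthy_threshold=2, unhealthy_threshold=2):
        self.healthy_threshold = healthy_threshold
        self.unhealthy_threshold = unhealthy_threshold
        self.state = "InService"
        self._streak = 0  # consecutive results opposing the current state

    def observe(self, check_passed):
        if self.state == "InService":
            self._streak = 0 if check_passed else self._streak + 1
            if self._streak >= self.unhealthy_threshold:
                self.state, self._streak = "OutOfService", 0
        else:
            self._streak = self._streak + 1 if check_passed else 0
            if self._streak >= self.healthy_threshold:
                self.state, self._streak = "InService", 0
        return self.state
```

The key point for our problem is that a single failed check changes nothing: the instance keeps receiving traffic until the full run of consecutive failures is observed.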
The point is that there’s a certain mismatch in synchronising all these grace periods. Together, these time frames should ensure that no request fails during a Linkerd shutdown or restart. The problem is that the moment Linkerd receives a SIGTERM, it blocks all new incoming requests. The ELB polls each instance every X seconds (most commonly 30), and because its thresholds can range from 2 to 10, it may well still mark an instance as healthy (“InService”) even though Linkerd is effectively down, i.e. no longer accepting requests after the SIGTERM. Marathon likewise polls
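A rough back-of-the-envelope calculation shows how large that window can be. Assuming the worst case is simply the check interval times the unhealthy threshold (ignoring in-flight checks and propagation delay):

```python
# Worst-case time the ELB can keep routing traffic to a Linkerd instance
# that has already received SIGTERM: roughly the health-check interval
# multiplied by the unhealthy threshold.

def worst_case_detection_seconds(interval_s, unhealthy_threshold):
    return interval_s * unhealthy_threshold

# With a 30-second interval and thresholds between 2 and 10, the window
# ranges from 1 to 5 minutes:
print(worst_case_detection_seconds(30, 2))   # 60
print(worst_case_detection_seconds(30, 10))  # 300
```

So even with the most aggressive threshold, a minute of requests can be sent to an instance that is already refusing them.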
/admin/ping as its health check. It too can still mark a Linkerd instance as healthy even though Linkerd has already started refusing requests before killing itself. In other words, Linkerd’s graceful-shutdown process by itself isn’t enough to guarantee a graceful shutdown when additional load balancers sit in front of it.
Could it be that we need /admin/ping to report a different status during the window between when SIGTERM is received and when the final SIGKILL fires? Marathon and the AWS ELB would then poll this endpoint and know immediately that the instance isn’t healthy, even though the task hasn’t been killed yet.
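To make the request concrete, here is a minimal sketch of the behaviour we’re asking for (this is NOT what Linkerd does today, and `admin_ping` is a stand-in for the real endpoint handler): on SIGTERM, flip an internal flag so the health endpoint starts failing, keep serving in-flight traffic, and only exit after the load balancers have had time to notice.

```python
# Sketch: fail health checks on SIGTERM instead of refusing connections.
# `admin_ping` is a hypothetical stand-in for Linkerd's /admin/ping handler.

import signal

healthy = True

def on_sigterm(signum, frame):
    # Start reporting unhealthy so pollers drain us, but keep serving
    # in-flight requests until the final SIGKILL.
    global healthy
    healthy = False

signal.signal(signal.SIGTERM, on_sigterm)

def admin_ping():
    # The body and status a poller like Marathon or the ELB would see.
    return ("pong", 200) if healthy else ("shutting down", 503)
```

With something like this, the ELB’s and Marathon’s next health checks would start failing immediately after SIGTERM, and after the unhealthy-threshold run of failures the instance would be drained before the task is actually killed.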