Unable to prevent requests from reaching the (internal) endpoint after circuit breaking is in action

I am trying to verify linkerd’s circuit breaking configuration by sending requests through linkerd to a simple error-prone endpoint deployed as a pod in the same k8s cluster, where linkerd is deployed as a daemonset.

I can see circuit breaking happening in the logs, but when I hit the endpoint again, linkerd still makes the call to the endpoint and returns the response to the client.

Setup and Test

I used the configs below to set up linkerd and the endpoint:

https://raw.githubusercontent.com/linkerd/linkerd-examples/master/k8s-daemonset/k8s/linkerd-egress.yaml

https://raw.githubusercontent.com/zillani/kubex/master/examples/simple-err.yml

Endpoint behavior
The endpoint always returns 500 Internal Server Error.
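
For concreteness, the endpoint does nothing more than fail every request. A minimal sketch of that behavior (my own illustration only; the actual pod comes from the simple-err.yml manifest linked above):

import com.sun.net.httpserver.{HttpExchange, HttpServer}
import java.net.InetSocketAddress

// Minimal always-failing HTTP endpoint: every request to /simple-err gets a 500.
object SimpleErr {
  def main(args: Array[String]): Unit = {
    val server = HttpServer.create(new InetSocketAddress(8080), 0)
    server.createContext("/simple-err", (exchange: HttpExchange) => {
      val body = "boom".getBytes("UTF-8")
      exchange.sendResponseHeaders(500, body.length.toLong) // always 500 Internal Server Error
      exchange.getResponseBody.write(body)
      exchange.close()
    })
    server.start()
  }
}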

Failure accrual setting

client:
  kind: io.l5d.static
  configs:
  - prefix: "/$/io.buoyant.rinet/443/{service}"
    tls:
      commonName: "{service}"
    failureAccrual:
      kind: io.l5d.consecutiveFailures
      failures: 2
      backoff:
        kind: constant
        ms: 360000000
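
This is roughly how I understood these settings: two consecutive failures should mark the endpoint as dead, and the constant backoff controls how long to wait before probing it again. A simplified sketch of that expectation (my own illustration, not linkerd’s or Finagle’s actual code):

import scala.concurrent.duration._

// Sketch of io.l5d.consecutiveFailures with failures: 2 and a constant backoff:
// two failures in a row mark the endpoint dead; once the backoff elapses,
// a single probe request may be let through to test whether it has recovered.
final class ConsecutiveFailures(failures: Int, backoff: FiniteDuration) {
  private var consecutive = 0
  private var deadUntil: Option[Deadline] = None

  def recordSuccess(): Unit = { consecutive = 0; deadUntil = None }

  def recordFailure(): Unit = {
    consecutive += 1
    if (consecutive >= failures) deadUntil = Some(backoff.fromNow) // marked dead
  }

  // A request is admitted while the endpoint is alive, or as a probe once the
  // backoff has expired.
  def admits: Boolean = deadUntil match {
    case None           => true
    case Some(deadline) => deadline.isOverdue()
  }
}

// With the config above: new ConsecutiveFailures(failures = 2, backoff = 360000000.millis)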

Proxy curl

http_proxy=$(kubectl get svc l5d -o jsonpath="{.status.loadBalancer.ingress[0].*}"):4140 curl -L http://:8080/simple-err

note: I have changed the version in linkerd-egress.yml to 1.0.0 because version 1.0.2 gives a “hostname cannot be null” exception, which makes it difficult to trace the logs. I have set the backoff time to a very long value to prevent any probes.

Observations:

  • Observed probes despite the long backoff time

"rt/outgoing/client/$/io.buoyant.rinet/8080/ac99…093.us-west-2.elb.amazonaws.com/failure_accrual/probes" : 2,

l5d log

  • UTC THREAD23 TraceId:e57aa1baa5148cc5: FailureAccrualFactory marking connection to "$/io.buoyant.rinet/8080/" as dead

Problem
After the node is marked as dead, a new request to linkerd (the same http_proxy command above) still hits the endpoint and returns the response.

I had to dig into the Finagle code to understand exactly how this works. It’s pretty fascinating and I recommend taking a look for yourself if you’re interested in the details: https://github.com/twitter/finagle/blob/develop/finagle-core/src/main/scala/com/twitter/finagle/liveness/FailureAccrualFactory.scala

When failure accrual (circuit breaker) triggers, the endpoint is put into a state called Busy. This actually doesn’t guarantee that the endpoint won’t be used. Most load balancers (including the default P2CLeastLoaded) will simply pick the healthiest endpoint. In the case where failure accrual has triggered on all endpoints, this means it will have to pick one in the Busy state.
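
To make that concrete, here is a deliberately simplified sketch of a two-choices pick (my own illustration, not the actual P2CLeastLoaded implementation): Busy endpoints are avoided whenever an Open one is available, but when everything is Busy the balancer still has to return something.

import scala.util.Random

// Simplified illustration of power-of-two-choices endpoint selection.
object P2CSketch {
  sealed trait Status
  case object Open extends Status // healthy
  case object Busy extends Status // failure accrual has triggered

  final case class Endpoint(name: String, status: Status, load: Int)

  // Sample two distinct candidates and keep the one with the better
  // (status, load) rank. A Busy node can still win if both candidates are Busy.
  def pick(endpoints: Vector[Endpoint], rng: Random = new Random): Endpoint =
    if (endpoints.size == 1) endpoints.head
    else {
      val i = rng.nextInt(endpoints.size)
      val offset = 1 + rng.nextInt(endpoints.size - 1)
      val (a, b) = (endpoints(i), endpoints((i + offset) % endpoints.size))
      def rank(e: Endpoint): (Int, Int) = (if (e.status == Open) 0 else 1, e.load)
      if (Ordering[(Int, Int)].lteq(rank(a), rank(b))) a else b
    }

  def main(args: Array[String]): Unit = {
    // A single always-500 endpoint: every pick returns it even while Busy,
    // which is why the curl above still reaches the endpoint.
    println(pick(Vector(Endpoint("simple-err", Busy, load = 0))))

    // One healthy and one tripped replica: the Busy one is never chosen.
    println(pick(Vector(Endpoint("healthy", Open, 0), Endpoint("broken", Busy, 0))))
  }
}

In your single-endpoint test, every request necessarily lands on that endpoint, Busy or not; with a mix of healthy and unhealthy replicas you would see traffic shift to the healthy ones instead.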

Stepping back for a moment, I’d love to hear about what goals you’d like to accomplish using circuit breaking. The way we think about it, the purpose of circuit breaking is to improve client observed success rate by making it less likely that requests will be sent to unhealthy replicas. This does not mean guaranteeing that all traffic will be shut off to those unhealthy replicas. In fact, when all endpoints are marked as Busy, it’s better to at least try one than just give up immediately.

I hope this is informative and I look forward to hearing more about your use-case.


I am working on a POC of Linkerd’s circuit breaking. I need to go further and test with one healthy and one unhealthy instance behind my ELB. Thank you very much for taking the time to address this; it helped me understand circuit breaking better. You guys are cool!


@Alex we are migrating from Hystrix (https://github.com/Netflix/Hystrix/wiki#problem) and trying to verify equivalent functionality, including load shedding. At this point, after discussing with the team, we’re fine with the existing functionality in the underlying Finagle code.
