I am trying to verify linkerd’s circuit breaking configuration by sending requests through it to a simple error-prone endpoint, deployed as a pod in the same k8s cluster where linkerd runs as a daemonset.
From the logs I can see circuit breaking being triggered, but when I hit the endpoint again, linkerd still calls the endpoint and returns the response to the client.
Setup and Test
I used the configs below to set up linkerd and the endpoint:
http_proxy=$(kubectl get svc l5d -o jsonpath="{.status.loadBalancer.ingress[0].*}"):4140 curl -L http://:8080/simple-err
Note: I changed the version in linkerd-egress.yml to 1.0.0, because version 1.0.2 throws a “hostname cannot be null” exception that makes the logs difficult to trace. I also set the backoff time to a very long value to prevent any probes.
UTC THREAD23 TraceId:e57aa1baa5148cc5: FailureAccrualFactory marking connection to “$/io.buoyant.rinet/8080/” as dead
Problem
After the node is marked as dead, a new request to linkerd (the same http_proxy command as above) still hits the endpoint and returns the response.
When failure accrual (circuit breaker) triggers, the endpoint is put into a state called Busy. This actually doesn’t guarantee that the endpoint won’t be used. Most load balancers (including the default P2CLeastLoaded) will simply pick the healthiest endpoint. In the case where failure accrual has triggered on all endpoints, this means it will have to pick one in the Busy state.
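To make the selection behaviour concrete, here is a minimal Python sketch of a power-of-two-choices picker that prefers healthier endpoints but still falls back to a Busy one. This is a simplified illustration, not Finagle's actual implementation: the names Status, Endpoint and pick are hypothetical, and the real balancer has more states and retry logic.

```python
import random
from dataclasses import dataclass
from enum import IntEnum


class Status(IntEnum):
    # Hypothetical two-state model; a higher value means healthier.
    BUSY = 0
    OPEN = 1


@dataclass
class Endpoint:
    name: str
    status: Status = Status.OPEN
    load: int = 0  # number of outstanding requests


def pick(endpoints, rng=random):
    """Sample two endpoints, prefer the healthier one, break ties by load.

    Nothing here refuses to return a Busy endpoint: if every candidate
    is Busy, one of them is still selected.
    """
    if len(endpoints) == 1:
        return endpoints[0]
    a, b = rng.sample(endpoints, 2)
    if a.status != b.status:
        return a if a.status > b.status else b
    return a if a.load <= b.load else b


# A single replica that failure accrual has marked Busy is still chosen,
# because there is nothing healthier to choose instead.
chosen = pick([Endpoint("only-replica", Status.BUSY)])
print(chosen.name, chosen.status.name)
```

With a single endpoint behind the proxy, as in the test above, this is why requests keep reaching it after it has been marked dead. With one Open and one Busy endpoint, the same picker would send traffic to the Open one.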
Stepping back for a moment, I’d love to hear about what goals you’d like to accomplish using circuit breaking. The way we think about it, the purpose of circuit breaking is to improve client observed success rate by making it less likely that requests will be sent to unhealthy replicas. This does not mean guaranteeing that all traffic will be shut off to those unhealthy replicas. In fact, when all endpoints are marked as Busy, it’s better to at least try one than just give up immediately.
I hope this is informative and I look forward to hearing more about your use-case.
I am working on a POC of linkerd’s circuit breaking. I need to go further and test with one healthy and one unhealthy instance behind my ELB. tyvm for taking the time to address this, it helped me understand circuit breaking better, you guys are cool!
@Alex we are migrating from Hystrix (https://github.com/Netflix/Hystrix/wiki#problem) and trying to verify equivalent functionality, including load shedding. At this point, after discussing with the team, we’re fine with the existing functionality in the underlying Finagle code.