Struggling to get l5d mesh to scale

Originally posted to Slack: https://linkerd.slack.com/archives/C0JV5E7BR/p1503606264000614

We've been working on getting l5d running properly on k8s/AWS for a few months now and we're at our wits' end. Long story short: it does not appear to scale. Hopefully someone here will have ideas.

It appears that under load (at least 100 req/sec), l5d starts to mark connections to services as dead for no apparent reason. We have observed this during soak testing of a service with simple GET requests. There are no errors in the service's logs, the instances show no restarts, and they remain in the ready state throughout testing. Readiness and liveness probes continue to return 200, even while l5d is returning 503s to Gatling. Hitting the service directly yields more than triple the req/s that l5d can handle.

The first error that shows up in the l5d log is always the "FailureAccrualFactory marking connection … as dead" message:

[l5d-rxl7h l5d] I 0824 19:27:21.997 UTC THREAD25: FailureAccrualFactory marking connection to "%/io.l5d.k8s.daemonset/test-ns/http-incoming/l5d/#/io.l5d.k8s.http/test-ns/http/foo-service" as dead. Remote Address: Inet(/100.84.170.219:4141,Map(nodeName -> ip-172-24-33-64.ec2.internal))
[l5d-rxl7h l5d] I 0824 19:27:22.089 UTC THREAD31: FailureAccrualFactory marking connection to "%/io.l5d.k8s.daemonset/test-ns/http-incoming/l5d/#/io.l5d.k8s.http/test-ns/http/foo-service" as dead. Remote Address: Inet(/100.84.217.146:4141,Map(nodeName -> ip-172-24-113-122.ec2.internal))

We're using https://github.com/linkerd/linkerd-examples/blob/master/k8s-daemonset/k8s/servicemesh.yml almost as-is, though we've been through many different configurations trying to sort this out.

As noted in the Slack channel, we've checked metrics.json (we tried to attach it, but new users can't) and believe the data indicates a 100% success rate from the backend service, unless I'm misreading it. Here's a portion of the metrics referring to the service in question:

"rt/http-incoming/service/svc/foo-service/success": 198,
"rt/http-incoming/service/svc/foo-service/request_latency_ms.count": 0,
"rt/http-incoming/service/svc/foo-service/retries/per_request.count": 0,
"rt/http-incoming/service/svc/foo-service/retries/total": 0,
"rt/http-incoming/service/svc/foo-service/retries/budget_exhausted": 0,
"rt/http-incoming/service/svc/foo-service/retries/budget": 100.0,
"rt/http-incoming/service/svc/foo-service/failures": 32,
"rt/http-incoming/service/svc/foo-service/failures/com.twitter.finagle.RequestTimeoutException": 32,
"rt/http-incoming/service/svc/foo-service/requests": 230,
"rt/http-incoming/service/svc/foo-service/pending": 0.0,
"rt/http-incoming/client/%/io.l5d.k8s.localnode/100.84.217.146/#/io.l5d.k8s/test-ns/http/foo-service/connect_latency_ms.count": 0,
"rt/http-incoming/client/%/io.l5d.k8s.localnode/100.84.217.146/#/io.l5d.k8s/test-ns/http/foo-service/failed_connect_latency_ms.count": 0,
"rt/http-incoming/client/%/io.l5d.k8s.localnode/100.84.217.146/#/io.l5d.k8s/test-ns/http/foo-service/sent_bytes": 75240,
"rt/http-incoming/client/%/io.l5d.k8s.localnode/100.84.217.146/#/io.l5d.k8s/test-ns/http/foo-service/service_creation/failures": 21,
"rt/http-incoming/client/%/io.l5d.k8s.localnode/100.84.217.146/#/io.l5d.k8s/test-ns/http/foo-service/service_creation/failures/com.twitter.finagle.Failure": 21,
"rt/http-incoming/client/%/io.l5d.k8s.localnode/100.84.217.146/#/io.l5d.k8s/test-ns/http/foo-service/service_creation/failures/com.twitter.finagle.Failure/com.twitter.finagle.RequestTimeoutException": 21,
"rt/http-incoming/client/%/io.l5d.k8s.localnode/100.84.217.146/#/io.l5d.k8s/test-ns/http/foo-service/service_creation/service_acquisition_latency_ms.count": 0,
"rt/http-incoming/client/%/io.l5d.k8s.localnode/100.84.217.146/#/io.l5d.k8s/test-ns/http/foo-service/connection_received_bytes.count": 0,
"rt/http-incoming/client/%/io.l5d.k8s.localnode/100.84.217.146/#/io.l5d.k8s/test-ns/http/foo-service/connection_duration.count": 0,
"rt/http-incoming/client/%/io.l5d.k8s.localnode/100.84.217.146/#/io.l5d.k8s/test-ns/http/foo-service/failure_accrual/removals": 0,
"rt/http-incoming/client/%/io.l5d.k8s.localnode/100.84.217.146/#/io.l5d.k8s/test-ns/http/foo-service/failure_accrual/probes": 0,
"rt/http-incoming/client/%/io.l5d.k8s.localnode/100.84.217.146/#/io.l5d.k8s/test-ns/http/foo-service/failure_accrual/removed_for_ms": 0,
"rt/http-incoming/client/%/io.l5d.k8s.localnode/100.84.217.146/#/io.l5d.k8s/test-ns/http/foo-service/failure_accrual/revivals": 0,
"rt/http-incoming/client/%/io.l5d.k8s.localnode/100.84.217.146/#/io.l5d.k8s/test-ns/http/foo-service/connects": 187,
"rt/http-incoming/client/%/io.l5d.k8s.localnode/100.84.217.146/#/io.l5d.k8s/test-ns/http/foo-service/pool_num_waited": 0,
"rt/http-incoming/client/%/io.l5d.k8s.localnode/100.84.217.146/#/io.l5d.k8s/test-ns/http/foo-service/success": 198,
"rt/http-incoming/client/%/io.l5d.k8s.localnode/100.84.217.146/#/io.l5d.k8s/test-ns/http/foo-service/request_latency_ms.count": 0,
"rt/http-incoming/client/%/io.l5d.k8s.localnode/100.84.217.146/#/io.l5d.k8s/test-ns/http/foo-service/pool_waiters": 0.0,
"rt/http-incoming/client/%/io.l5d.k8s.localnode/100.84.217.146/#/io.l5d.k8s/test-ns/http/foo-service/retries/requeues_per_request.count": 0,
"rt/http-incoming/client/%/io.l5d.k8s.localnode/100.84.217.146/#/io.l5d.k8s/test-ns/http/foo-service/retries/request_limit": 0,
"rt/http-incoming/client/%/io.l5d.k8s.localnode/100.84.217.146/#/io.l5d.k8s/test-ns/http/foo-service/retries/budget_exhausted": 0,
"rt/http-incoming/client/%/io.l5d.k8s.localnode/100.84.217.146/#/io.l5d.k8s/test-ns/http/foo-service/retries/cannot_retry": 0,
"rt/http-incoming/client/%/io.l5d.k8s.localnode/100.84.217.146/#/io.l5d.k8s/test-ns/http/foo-service/retries/not_open": 0,
"rt/http-incoming/client/%/io.l5d.k8s.localnode/100.84.217.146/#/io.l5d.k8s/test-ns/http/foo-service/retries/budget": 100.0,
"rt/http-incoming/client/%/io.l5d.k8s.localnode/100.84.217.146/#/io.l5d.k8s/test-ns/http/foo-service/retries/requeues": 0,
"rt/http-incoming/client/%/io.l5d.k8s.localnode/100.84.217.146/#/io.l5d.k8s/test-ns/http/foo-service/received_bytes": 409266,
"rt/http-incoming/client/%/io.l5d.k8s.localnode/100.84.217.146/#/io.l5d.k8s/test-ns/http/foo-service/read_timeout": 0,
"rt/http-incoming/client/%/io.l5d.k8s.localnode/100.84.217.146/#/io.l5d.k8s/test-ns/http/foo-service/write_timeout": 0,
"rt/http-incoming/client/%/io.l5d.k8s.localnode/100.84.217.146/#/io.l5d.k8s/test-ns/http/foo-service/connection_sent_bytes.count": 0,
"rt/http-incoming/client/%/io.l5d.k8s.localnode/100.84.217.146/#/io.l5d.k8s/test-ns/http/foo-service/connection_requests.count": 0,
"rt/http-incoming/client/%/io.l5d.k8s.localnode/100.84.217.146/#/io.l5d.k8s/test-ns/http/foo-service/service/svc/foo-service/request_latency_ms.count": 0,
"rt/http-incoming/client/%/io.l5d.k8s.localnode/100.84.217.146/#/io.l5d.k8s/test-ns/http/foo-service/service/svc/foo-service/success": 198,
"rt/http-incoming/client/%/io.l5d.k8s.localnode/100.84.217.146/#/io.l5d.k8s/test-ns/http/foo-service/service/svc/foo-service/pending": 0.0,
"rt/http-incoming/client/%/io.l5d.k8s.localnode/100.84.217.146/#/io.l5d.k8s/test-ns/http/foo-service/service/svc/foo-service/requests": 198,
"rt/http-incoming/client/%/io.l5d.k8s.localnode/100.84.217.146/#/io.l5d.k8s/test-ns/http/foo-service/pool_num_too_many_waiters": 0,
"rt/http-incoming/client/%/io.l5d.k8s.localnode/100.84.217.146/#/io.l5d.k8s/test-ns/http/foo-service/socket_unwritable_ms": 0,
"rt/http-incoming/client/%/io.l5d.k8s.localnode/100.84.217.146/#/io.l5d.k8s/test-ns/http/foo-service/closes": 0,
"rt/http-incoming/client/%/io.l5d.k8s.localnode/100.84.217.146/#/io.l5d.k8s/test-ns/http/foo-service/pool_cached": 187.0,
"rt/http-incoming/client/%/io.l5d.k8s.localnode/100.84.217.146/#/io.l5d.k8s/test-ns/http/foo-service/nack_admission_control/dropped_requests": 0,
"rt/http-incoming/client/%/io.l5d.k8s.localnode/100.84.217.146/#/io.l5d.k8s/test-ns/http/foo-service/status/1XX": 0,
"rt/http-incoming/client/%/io.l5d.k8s.localnode/100.84.217.146/#/io.l5d.k8s/test-ns/http/foo-service/status/4XX": 0,
"rt/http-incoming/client/%/io.l5d.k8s.localnode/100.84.217.146/#/io.l5d.k8s/test-ns/http/foo-service/status/2XX": 198,
"rt/http-incoming/client/%/io.l5d.k8s.localnode/100.84.217.146/#/io.l5d.k8s/test-ns/http/foo-service/status/error": 0,
"rt/http-incoming/client/%/io.l5d.k8s.localnode/100.84.217.146/#/io.l5d.k8s/test-ns/http/foo-service/status/3XX": 0,
"rt/http-incoming/client/%/io.l5d.k8s.localnode/100.84.217.146/#/io.l5d.k8s/test-ns/http/foo-service/status/5XX": 0,
"rt/http-incoming/client/%/io.l5d.k8s.localnode/100.84.217.146/#/io.l5d.k8s/test-ns/http/foo-service/status/200": 198,
"rt/http-incoming/client/%/io.l5d.k8s.localnode/100.84.217.146/#/io.l5d.k8s/test-ns/http/foo-service/pool_size": 0.0,
"rt/http-incoming/client/%/io.l5d.k8s.localnode/100.84.217.146/#/io.l5d.k8s/test-ns/http/foo-service/available": 1.0,
"rt/http-incoming/client/%/io.l5d.k8s.localnode/100.84.217.146/#/io.l5d.k8s/test-ns/http/foo-service/time/1XX.count": 0,
"rt/http-incoming/client/%/io.l5d.k8s.localnode/100.84.217.146/#/io.l5d.k8s/test-ns/http/foo-service/time/4XX.count": 0,
"rt/http-incoming/client/%/io.l5d.k8s.localnode/100.84.217.146/#/io.l5d.k8s/test-ns/http/foo-service/time/2XX.count": 0,
"rt/http-incoming/client/%/io.l5d.k8s.localnode/100.84.217.146/#/io.l5d.k8s/test-ns/http/foo-service/time/error.count": 0,
"rt/http-incoming/client/%/io.l5d.k8s.localnode/100.84.217.146/#/io.l5d.k8s/test-ns/http/foo-service/time/3XX.count": 0,
"rt/http-incoming/client/%/io.l5d.k8s.localnode/100.84.217.146/#/io.l5d.k8s/test-ns/http/foo-service/time/5XX.count": 0,
"rt/http-incoming/client/%/io.l5d.k8s.localnode/100.84.217.146/#/io.l5d.k8s/test-ns/http/foo-service/socket_writable_ms": 0,
"rt/http-incoming/client/%/io.l5d.k8s.localnode/100.84.217.146/#/io.l5d.k8s/test-ns/http/foo-service/cancelled_connects": 0,
"rt/http-incoming/client/%/io.l5d.k8s.localnode/100.84.217.146/#/io.l5d.k8s/test-ns/http/foo-service/dtab/size.count": 0,
"rt/http-incoming/client/%/io.l5d.k8s.localnode/100.84.217.146/#/io.l5d.k8s/test-ns/http/foo-service/requests": 198,
"rt/http-incoming/client/%/io.l5d.k8s.localnode/100.84.217.146/#/io.l5d.k8s/test-ns/http/foo-service/loadbalancer/size": 1.0,
"rt/http-incoming/client/%/io.l5d.k8s.localnode/100.84.217.146/#/io.l5d.k8s/test-ns/http/foo-service/loadbalancer/rebuilds": 1,
"rt/http-incoming/client/%/io.l5d.k8s.localnode/100.84.217.146/#/io.l5d.k8s/test-ns/http/foo-service/loadbalancer/closed": 0.0,
"rt/http-incoming/client/%/io.l5d.k8s.localnode/100.84.217.146/#/io.l5d.k8s/test-ns/http/foo-service/loadbalancer/load": 0.0,
"rt/http-incoming/client/%/io.l5d.k8s.localnode/100.84.217.146/#/io.l5d.k8s/test-ns/http/foo-service/loadbalancer/meanweight": 1.0,
"rt/http-incoming/client/%/io.l5d.k8s.localnode/100.84.217.146/#/io.l5d.k8s/test-ns/http/foo-service/loadbalancer/adds": 1,
"rt/http-incoming/client/%/io.l5d.k8s.localnode/100.84.217.146/#/io.l5d.k8s/test-ns/http/foo-service/loadbalancer/updates": 1,
"rt/http-incoming/client/%/io.l5d.k8s.localnode/100.84.217.146/#/io.l5d.k8s/test-ns/http/foo-service/loadbalancer/algorithm/p2c_least_loaded": 1.0,
"rt/http-incoming/client/%/io.l5d.k8s.localnode/100.84.217.146/#/io.l5d.k8s/test-ns/http/foo-service/loadbalancer/available": 1.0,
"rt/http-incoming/client/%/io.l5d.k8s.localnode/100.84.217.146/#/io.l5d.k8s/test-ns/http/foo-service/loadbalancer/max_effort_exhausted": 0,
"rt/http-incoming/client/%/io.l5d.k8s.localnode/100.84.217.146/#/io.l5d.k8s/test-ns/http/foo-service/loadbalancer/busy": 0.0,
"rt/http-incoming/client/%/io.l5d.k8s.localnode/100.84.217.146/#/io.l5d.k8s/test-ns/http/foo-service/loadbalancer/removes": 0,
"rt/http-incoming/client/%/io.l5d.k8s.localnode/100.84.217.146/#/io.l5d.k8s/test-ns/http/foo-service/pending": 0.0,
"rt/http-incoming/client/%/io.l5d.k8s.localnode/100.84.217.146/#/io.l5d.k8s/test-ns/http/foo-service/connections": 1.0,

and here is the corresponding portion from the http-outgoing router:

"rt/http-outgoing/service/svc/foo-service/success": 119,
"rt/http-outgoing/service/svc/foo-service/request_latency_ms.count": 0,
"rt/http-outgoing/service/svc/foo-service/retries/per_request.count": 0,
"rt/http-outgoing/service/svc/foo-service/retries/total": 0,
"rt/http-outgoing/service/svc/foo-service/retries/budget_exhausted": 0,
"rt/http-outgoing/service/svc/foo-service/retries/budget": 100.0,
"rt/http-outgoing/service/svc/foo-service/failures": 34,
"rt/http-outgoing/service/svc/foo-service/failures/com.twitter.finagle.service.ResponseClassificationSyntheticException": 32,
"rt/http-outgoing/service/svc/foo-service/failures/com.twitter.finagle.NoBrokersAvailableException": 1,
"rt/http-outgoing/service/svc/foo-service/failures/restartable": 1,
"rt/http-outgoing/service/svc/foo-service/failures/restartable/com.twitter.finagle.Failure": 1,
"rt/http-outgoing/service/svc/foo-service/failures/restartable/com.twitter.finagle.Failure/com.twitter.finagle.ConnectionFailedException": 1,
"rt/http-outgoing/service/svc/foo-service/failures/restartable/com.twitter.finagle.Failure/com.twitter.finagle.ConnectionFailedException/io.netty.channel.ConnectTimeoutException": 1,
"rt/http-outgoing/service/svc/foo-service/failures/rejected": 1,
"rt/http-outgoing/service/svc/foo-service/failures/rejected/com.twitter.finagle.Failure": 1,
"rt/http-outgoing/service/svc/foo-service/failures/rejected/com.twitter.finagle.Failure/com.twitter.finagle.ConnectionFailedException": 1,
"rt/http-outgoing/service/svc/foo-service/failures/rejected/com.twitter.finagle.Failure/com.twitter.finagle.ConnectionFailedException/io.netty.channel.ConnectTimeoutException": 1,
"rt/http-outgoing/service/svc/foo-service/sourcedfailures//svc/foo-service": 1,
"rt/http-outgoing/service/svc/foo-service/sourcedfailures//svc/foo-service/com.twitter.finagle.NoBrokersAvailableException": 1,
"rt/http-outgoing/service/svc/foo-service/requests": 153,
"rt/http-outgoing/service/svc/foo-service/pending": 0.0,
"rt/http-outgoing/client/%/io.l5d.k8s.daemonset/test-ns/http-incoming/l5d/#/io.l5d.k8s.http/test-ns/http/foo-service/connect_latency_ms.count": 0,
"rt/http-outgoing/client/%/io.l5d.k8s.daemonset/test-ns/http-incoming/l5d/#/io.l5d.k8s.http/test-ns/http/foo-service/failed_connect_latency_ms.count": 0,
"rt/http-outgoing/client/%/io.l5d.k8s.daemonset/test-ns/http-incoming/l5d/#/io.l5d.k8s.http/test-ns/http/foo-service/sent_bytes": 58588,
"rt/http-outgoing/client/%/io.l5d.k8s.daemonset/test-ns/http-incoming/l5d/#/io.l5d.k8s.http/test-ns/http/foo-service/service_creation/failures": 4,
"rt/http-outgoing/client/%/io.l5d.k8s.daemonset/test-ns/http-incoming/l5d/#/io.l5d.k8s.http/test-ns/http/foo-service/service_creation/failures/com.twitter.finagle.NoBrokersAvailableException": 1,
"rt/http-outgoing/client/%/io.l5d.k8s.daemonset/test-ns/http-incoming/l5d/#/io.l5d.k8s.http/test-ns/http/foo-service/service_creation/failures/com.twitter.finagle.Failure": 3,
"rt/http-outgoing/client/%/io.l5d.k8s.daemonset/test-ns/http-incoming/l5d/#/io.l5d.k8s.http/test-ns/http/foo-service/service_creation/failures/com.twitter.finagle.Failure/com.twitter.finagle.ConnectionFailedException": 3,
"rt/http-outgoing/client/%/io.l5d.k8s.daemonset/test-ns/http-incoming/l5d/#/io.l5d.k8s.http/test-ns/http/foo-service/service_creation/failures/com.twitter.finagle.Failure/com.twitter.finagle.ConnectionFailedException/io.netty.channel.ConnectTimeoutException": 3,
"rt/http-outgoing/client/%/io.l5d.k8s.daemonset/test-ns/http-incoming/l5d/#/io.l5d.k8s.http/test-ns/http/foo-service/service_creation/service_acquisition_latency_ms.count": 0,
"rt/http-outgoing/client/%/io.l5d.k8s.daemonset/test-ns/http-incoming/l5d/#/io.l5d.k8s.http/test-ns/http/foo-service/connection_received_bytes.count": 0,
"rt/http-outgoing/client/%/io.l5d.k8s.daemonset/test-ns/http-incoming/l5d/#/io.l5d.k8s.http/test-ns/http/foo-service/connection_duration.count": 0,
"rt/http-outgoing/client/%/io.l5d.k8s.daemonset/test-ns/http-incoming/l5d/#/io.l5d.k8s.http/test-ns/http/foo-service/failure_accrual/removals": 1,
"rt/http-outgoing/client/%/io.l5d.k8s.daemonset/test-ns/http-incoming/l5d/#/io.l5d.k8s.http/test-ns/http/foo-service/failure_accrual/probes": 0,
"rt/http-outgoing/client/%/io.l5d.k8s.daemonset/test-ns/http-incoming/l5d/#/io.l5d.k8s.http/test-ns/http/foo-service/failure_accrual/removed_for_ms": 5000,
"rt/http-outgoing/client/%/io.l5d.k8s.daemonset/test-ns/http-incoming/l5d/#/io.l5d.k8s.http/test-ns/http/foo-service/failure_accrual/revivals": 0,
"rt/http-outgoing/client/%/io.l5d.k8s.daemonset/test-ns/http-incoming/l5d/#/io.l5d.k8s.http/test-ns/http/foo-service/connects": 152,
"rt/http-outgoing/client/%/io.l5d.k8s.daemonset/test-ns/http-incoming/l5d/#/io.l5d.k8s.http/test-ns/http/foo-service/pool_num_waited": 0,
"rt/http-outgoing/client/%/io.l5d.k8s.daemonset/test-ns/http-incoming/l5d/#/io.l5d.k8s.http/test-ns/http/foo-service/success": 119,
"rt/http-outgoing/client/%/io.l5d.k8s.daemonset/test-ns/http-incoming/l5d/#/io.l5d.k8s.http/test-ns/http/foo-service/request_latency_ms.count": 0,
"rt/http-outgoing/client/%/io.l5d.k8s.daemonset/test-ns/http-incoming/l5d/#/io.l5d.k8s.http/test-ns/http/foo-service/pool_waiters": 0.0,
"rt/http-outgoing/client/%/io.l5d.k8s.daemonset/test-ns/http-incoming/l5d/#/io.l5d.k8s.http/test-ns/http/foo-service/retries/requeues_per_request.count": 0,
"rt/http-outgoing/client/%/io.l5d.k8s.daemonset/test-ns/http-incoming/l5d/#/io.l5d.k8s.http/test-ns/http/foo-service/retries/request_limit": 0,
"rt/http-outgoing/client/%/io.l5d.k8s.daemonset/test-ns/http-incoming/l5d/#/io.l5d.k8s.http/test-ns/http/foo-service/retries/budget_exhausted": 0,
"rt/http-outgoing/client/%/io.l5d.k8s.daemonset/test-ns/http-incoming/l5d/#/io.l5d.k8s.http/test-ns/http/foo-service/retries/cannot_retry": 0,
"rt/http-outgoing/client/%/io.l5d.k8s.daemonset/test-ns/http-incoming/l5d/#/io.l5d.k8s.http/test-ns/http/foo-service/retries/not_open": 1,
"rt/http-outgoing/client/%/io.l5d.k8s.daemonset/test-ns/http-incoming/l5d/#/io.l5d.k8s.http/test-ns/http/foo-service/retries/budget": 100.0,
"rt/http-outgoing/client/%/io.l5d.k8s.daemonset/test-ns/http-incoming/l5d/#/io.l5d.k8s.http/test-ns/http/foo-service/retries/requeues": 3,
"rt/http-outgoing/client/%/io.l5d.k8s.daemonset/test-ns/http-incoming/l5d/#/io.l5d.k8s.http/test-ns/http/foo-service/received_bytes": 260891,
"rt/http-outgoing/client/%/io.l5d.k8s.daemonset/test-ns/http-incoming/l5d/#/io.l5d.k8s.http/test-ns/http/foo-service/read_timeout": 0,
"rt/http-outgoing/client/%/io.l5d.k8s.daemonset/test-ns/http-incoming/l5d/#/io.l5d.k8s.http/test-ns/http/foo-service/write_timeout": 0,
"rt/http-outgoing/client/%/io.l5d.k8s.daemonset/test-ns/http-incoming/l5d/#/io.l5d.k8s.http/test-ns/http/foo-service/connection_sent_bytes.count": 0,
"rt/http-outgoing/client/%/io.l5d.k8s.daemonset/test-ns/http-incoming/l5d/#/io.l5d.k8s.http/test-ns/http/foo-service/connection_requests.count": 0,
"rt/http-outgoing/client/%/io.l5d.k8s.daemonset/test-ns/http-incoming/l5d/#/io.l5d.k8s.http/test-ns/http/foo-service/service/svc/foo-service/success": 119,
"rt/http-outgoing/client/%/io.l5d.k8s.daemonset/test-ns/http-incoming/l5d/#/io.l5d.k8s.http/test-ns/http/foo-service/service/svc/foo-service/request_latency_ms.count": 0,
"rt/http-outgoing/client/%/io.l5d.k8s.daemonset/test-ns/http-incoming/l5d/#/io.l5d.k8s.http/test-ns/http/foo-service/service/svc/foo-service/failures": 32,
"rt/http-outgoing/client/%/io.l5d.k8s.daemonset/test-ns/http-incoming/l5d/#/io.l5d.k8s.http/test-ns/http/foo-service/service/svc/foo-service/failures/com.twitter.finagle.service.ResponseClassificationSyntheticException": 32,
"rt/http-outgoing/client/%/io.l5d.k8s.daemonset/test-ns/http-incoming/l5d/#/io.l5d.k8s.http/test-ns/http/foo-service/service/svc/foo-service/requests": 151,
"rt/http-outgoing/client/%/io.l5d.k8s.daemonset/test-ns/http-incoming/l5d/#/io.l5d.k8s.http/test-ns/http/foo-service/service/svc/foo-service/pending": 0.0,
"rt/http-outgoing/client/%/io.l5d.k8s.daemonset/test-ns/http-incoming/l5d/#/io.l5d.k8s.http/test-ns/http/foo-service/pool_num_too_many_waiters": 0,
"rt/http-outgoing/client/%/io.l5d.k8s.daemonset/test-ns/http-incoming/l5d/#/io.l5d.k8s.http/test-ns/http/foo-service/socket_unwritable_ms": 0,
"rt/http-outgoing/client/%/io.l5d.k8s.daemonset/test-ns/http-incoming/l5d/#/io.l5d.k8s.http/test-ns/http/foo-service/closes": 154,
"rt/http-outgoing/client/%/io.l5d.k8s.daemonset/test-ns/http-incoming/l5d/#/io.l5d.k8s.http/test-ns/http/foo-service/pool_cached": 0.0,
"rt/http-outgoing/client/%/io.l5d.k8s.daemonset/test-ns/http-incoming/l5d/#/io.l5d.k8s.http/test-ns/http/foo-service/nack_admission_control/dropped_requests": 0,
"rt/http-outgoing/client/%/io.l5d.k8s.daemonset/test-ns/http-incoming/l5d/#/io.l5d.k8s.http/test-ns/http/foo-service/status/1XX": 0,
"rt/http-outgoing/client/%/io.l5d.k8s.daemonset/test-ns/http-incoming/l5d/#/io.l5d.k8s.http/test-ns/http/foo-service/status/4XX": 0,
"rt/http-outgoing/client/%/io.l5d.k8s.daemonset/test-ns/http-incoming/l5d/#/io.l5d.k8s.http/test-ns/http/foo-service/status/2XX": 119,
"rt/http-outgoing/client/%/io.l5d.k8s.daemonset/test-ns/http-incoming/l5d/#/io.l5d.k8s.http/test-ns/http/foo-service/status/error": 0,
"rt/http-outgoing/client/%/io.l5d.k8s.daemonset/test-ns/http-incoming/l5d/#/io.l5d.k8s.http/test-ns/http/foo-service/status/3XX": 0,
"rt/http-outgoing/client/%/io.l5d.k8s.daemonset/test-ns/http-incoming/l5d/#/io.l5d.k8s.http/test-ns/http/foo-service/status/503": 32,
"rt/http-outgoing/client/%/io.l5d.k8s.daemonset/test-ns/http-incoming/l5d/#/io.l5d.k8s.http/test-ns/http/foo-service/status/5XX": 32,
"rt/http-outgoing/client/%/io.l5d.k8s.daemonset/test-ns/http-incoming/l5d/#/io.l5d.k8s.http/test-ns/http/foo-service/status/200": 119,
"rt/http-outgoing/client/%/io.l5d.k8s.daemonset/test-ns/http-incoming/l5d/#/io.l5d.k8s.http/test-ns/http/foo-service/failures": 32,
"rt/http-outgoing/client/%/io.l5d.k8s.daemonset/test-ns/http-incoming/l5d/#/io.l5d.k8s.http/test-ns/http/foo-service/failures/com.twitter.finagle.service.ResponseClassificationSyntheticException": 32,
"rt/http-outgoing/client/%/io.l5d.k8s.daemonset/test-ns/http-incoming/l5d/#/io.l5d.k8s.http/test-ns/http/foo-service/pool_size": 0.0,
"rt/http-outgoing/client/%/io.l5d.k8s.daemonset/test-ns/http-incoming/l5d/#/io.l5d.k8s.http/test-ns/http/foo-service/available": 0.0,
"rt/http-outgoing/client/%/io.l5d.k8s.daemonset/test-ns/http-incoming/l5d/#/io.l5d.k8s.http/test-ns/http/foo-service/time/1XX.count": 0,
"rt/http-outgoing/client/%/io.l5d.k8s.daemonset/test-ns/http-incoming/l5d/#/io.l5d.k8s.http/test-ns/http/foo-service/time/4XX.count": 0,
"rt/http-outgoing/client/%/io.l5d.k8s.daemonset/test-ns/http-incoming/l5d/#/io.l5d.k8s.http/test-ns/http/foo-service/time/2XX.count": 0,
"rt/http-outgoing/client/%/io.l5d.k8s.daemonset/test-ns/http-incoming/l5d/#/io.l5d.k8s.http/test-ns/http/foo-service/time/error.count": 0,
"rt/http-outgoing/client/%/io.l5d.k8s.daemonset/test-ns/http-incoming/l5d/#/io.l5d.k8s.http/test-ns/http/foo-service/time/3XX.count": 0,
"rt/http-outgoing/client/%/io.l5d.k8s.daemonset/test-ns/http-incoming/l5d/#/io.l5d.k8s.http/test-ns/http/foo-service/time/5XX.count": 0,
"rt/http-outgoing/client/%/io.l5d.k8s.daemonset/test-ns/http-incoming/l5d/#/io.l5d.k8s.http/test-ns/http/foo-service/socket_writable_ms": 0,
"rt/http-outgoing/client/%/io.l5d.k8s.daemonset/test-ns/http-incoming/l5d/#/io.l5d.k8s.http/test-ns/http/foo-service/cancelled_connects": 0,
"rt/http-outgoing/client/%/io.l5d.k8s.daemonset/test-ns/http-incoming/l5d/#/io.l5d.k8s.http/test-ns/http/foo-service/dtab/size.count": 0,
"rt/http-outgoing/client/%/io.l5d.k8s.daemonset/test-ns/http-incoming/l5d/#/io.l5d.k8s.http/test-ns/http/foo-service/requests": 151,
"rt/http-outgoing/client/%/io.l5d.k8s.daemonset/test-ns/http-incoming/l5d/#/io.l5d.k8s.http/test-ns/http/foo-service/loadbalancer/size": 2.0,
"rt/http-outgoing/client/%/io.l5d.k8s.daemonset/test-ns/http-incoming/l5d/#/io.l5d.k8s.http/test-ns/http/foo-service/loadbalancer/rebuilds": 9,
"rt/http-outgoing/client/%/io.l5d.k8s.daemonset/test-ns/http-incoming/l5d/#/io.l5d.k8s.http/test-ns/http/foo-service/loadbalancer/closed": 0.0,
"rt/http-outgoing/client/%/io.l5d.k8s.daemonset/test-ns/http-incoming/l5d/#/io.l5d.k8s.http/test-ns/http/foo-service/loadbalancer/load": 0.0,
"rt/http-outgoing/client/%/io.l5d.k8s.daemonset/test-ns/http-incoming/l5d/#/io.l5d.k8s.http/test-ns/http/foo-service/loadbalancer/meanweight": 1.0,
"rt/http-outgoing/client/%/io.l5d.k8s.daemonset/test-ns/http-incoming/l5d/#/io.l5d.k8s.http/test-ns/http/foo-service/loadbalancer/adds": 6,
"rt/http-outgoing/client/%/io.l5d.k8s.daemonset/test-ns/http-incoming/l5d/#/io.l5d.k8s.http/test-ns/http/foo-service/loadbalancer/updates": 9,
"rt/http-outgoing/client/%/io.l5d.k8s.daemonset/test-ns/http-incoming/l5d/#/io.l5d.k8s.http/test-ns/http/foo-service/loadbalancer/algorithm/p2c_least_loaded": 1.0,
"rt/http-outgoing/client/%/io.l5d.k8s.daemonset/test-ns/http-incoming/l5d/#/io.l5d.k8s.http/test-ns/http/foo-service/loadbalancer/available": 2.0,
"rt/http-outgoing/client/%/io.l5d.k8s.daemonset/test-ns/http-incoming/l5d/#/io.l5d.k8s.http/test-ns/http/foo-service/loadbalancer/max_effort_exhausted": 0,
"rt/http-outgoing/client/%/io.l5d.k8s.daemonset/test-ns/http-incoming/l5d/#/io.l5d.k8s.http/test-ns/http/foo-service/loadbalancer/busy": 0.0,
"rt/http-outgoing/client/%/io.l5d.k8s.daemonset/test-ns/http-incoming/l5d/#/io.l5d.k8s.http/test-ns/http/foo-service/loadbalancer/removes": 4,
"rt/http-outgoing/client/%/io.l5d.k8s.daemonset/test-ns/http-incoming/l5d/#/io.l5d.k8s.http/test-ns/http/foo-service/pending": 0.0,
"rt/http-outgoing/client/%/io.l5d.k8s.daemonset/test-ns/http-incoming/l5d/#/io.l5d.k8s.http/test-ns/http/foo-service/connections": 1.0,

Please advise. Thanks!

I've updated your trust level so you can post files; I think the rest of metrics.json would be useful here. It's true that the backend success rate is 100%, but the issue is that linkerd is having trouble establishing a connection to the service.

com.twitter.finagle.RequestTimeoutException, for example, shows up when linkerd is unable to bind a path to a client. You could try increasing bindingTimeoutMs (see https://linkerd.io/config/1.1.3/linkerd/index.html#router-parameters) and see if that alleviates anything, although the defaults are already pretty generous.

You could also try decreasing maxConcurrentRequests if your backends are not equipped to handle that many concurrent requests (see https://linkerd.io/config/1.1.3/linkerd/index.html#server-parameters).
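
For reference, both of those knobs live in the router section of the linkerd config: bindingTimeoutMs is a router-level parameter and maxConcurrentRequests is set per server. Here is a rough sketch against the 1.1.3 config schema (the values are purely illustrative, not recommendations):

  routers:
  - protocol: http
    label: http-outgoing
    # upper bound on time spent binding a logical name to a client
    bindingTimeoutMs: 10000
    servers:
    - ip: 0.0.0.0
      port: 4140
      # upper bound on requests this server will admit concurrently
      maxConcurrentRequests: 1000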

I'd also suggest running the load test for half an hour or longer, since linkerd and the JVM warm up and build a pool of open connections over time (your backends maintain open connections for efficiency, yes?).

I suspect this problem is related to either (1) a service discovery issue, where linkerd is attempting to talk to instances that are down, just coming up, or don't exist, or (2) a concurrency issue.

Hope that helps

Thanks esbie. Unfortunately that doesn’t seem to help. We set:

      client:
        failureAccrual:
          kind: none

and

  servers:
    maxConcurrentRequests: 1000

and

  bindingTimeoutMs: 60000

on the http routers.

The admin UI is reporting p50 latencies of ~83155ms. The first few errors we see are still

[l5d-s46dm l5d] I 0825 01:01:32.399 UTC THREAD27 TraceId:4796572172f4679a: FailureAccrualFactory marking connection to "%/io.l5d.k8s.daemonset/test-ns/http-incoming/l5d/#/io.l5d.k8s.http/test-ns/http/foo-service" as dead. Remote Address: Inet(/100.84.170.228:4141,Map(nodeName -> ip-172-24-33-64.ec2.internal))

We did note that one l5d instance restarted twice. We’re seeing thousands of

[l5d-29np7 l5d] E 0825 01:00:36.187 UTC THREAD25 TraceId:c3a588dc48a56983: service failure: com.twitter.finagle.NoBrokersAvailableException: No hosts are available for /svc/foo-service, Dtab.base=[/ph=>/$/io.buoyant.rinet;/svc=>/ph/80;/svc=>/$/io.buoyant.porthostPfx/ph;/k8s=>/#/io.l5d.k8s.http;/portNsSvc=>/#/portNsSvcToK8s;/host=>/portNsSvc/http/test-ns;/host=>/portNsSvc/http;/svc=>/$/io.buoyant.http.domainToPathPfx/host], Dtab.local=[]. Remote Info: Not Available

even though we can hit foo-service directly without any problem while under load.

The l5d admin UI won't respond.

As for the backend, we never see it struggling. When hitting it directly, we scale to over three times what we see at l5d's best. It doesn't restart or even error out. Whenever we pull l5d out of the mix, rps goes up substantially.

Please see attached metrics.json. Thanks!

metrics.json (101.4 KB)

OK, got it. This sounds more like a configuration issue than a scaling issue, but there are still a lot of unknowns. What's the linkerd version/k8s version? How many nodes? Does the admin dtab UI fail too when you plug /svc/foo-service into the http-outgoing router? Does linkerd 502 when called directly, or just under Gatling? Exactly what kind of call is being made to linkerd versus to foo-service directly? Anything interesting in the logs of the restarted linkerd instances (like an OOM or something)?

P.S. Setting failureAccrual to none means there should never be a single FailureAccrualFactory mention in the logs, so that's weird. Make sure that when updating the config you delete and recreate the l5d daemonset… just re-applying isn't enough.

Thanks again for following up with this!

What’s the linkerd version/k8s version? How many nodes?

We're using l5d 1.1.3 on k8s 1.7.2. We have 6 nodes in our test env, of which only 3 are used for app-level deployments. The l5d daemonset successfully deploys to all 3. We use RBAC.

Does the admin dtab ui fail too when you plug in…

Under load, the admin UI does not come up. After testing completed, I got this:

/svc/foo-servce    Possible Resolution Path 
    /host/foo-servce /svc => /$/io.buoyant.http.domainToPathPfx/host    Several Possible Resolution Paths 
        /portNsSvc/http/foo-servce /host => /portNsSvc/http    Possible Resolution Path
            (residual: /#/portNsSvcToK8s/http/foo-servce) /portNsSvc => /#/portNsSvcToK8s    No Further Branch Matches 
        /portNsSvc/http/test-ns/foo-servce /host => /portNsSvc/http/test-ns    Possible Resolution Path 
            /k8s/test-ns/http/foo-servce /portNsSvc => /#/portNsSvcToK8s    Possible Resolution Path 
                (residual: /#/io.l5d.k8s.http/test-ns/http/foo-servce) /k8s => /#/io.l5d.k8s.http    No Further Branch Matches 

Does linkerd 502 when called directly or just under gatling?

During a soak test I can hit the service via l5d without error, although latencies are off the charts, varying from 150ms to 109950ms.

The l5d logs are chock full of:

[l5d-559hn l5d] E 0825 16:55:17.002 UTC THREAD23 TraceId:ffd03d1e4bca262d: service failure: com.twitter.finagle.ChannelClosedException: ChannelException at remote address: /100.84.170.212:8080 from service: %/io.l5d.k8s.localnode/100.84.170.252/#/io.l5d.k8s/test-ns/http/foo-service. Remote Info: Upstream Address: /100.84.217.157:58188, Upstream Client Id: Not Available, Downstream Address: /100.84.170.212:8080, Downstream Client Id: %/io.l5d.k8s.localnode/100.84.170.252/#/io.l5d.k8s/test-ns/http/foo-service, Trace Id: ffd03d1e4bca262d.470e635f16438b62<:e57a28f45a501ff5

Exactly what kind of call is being made to linkerd versus the foo service directly?

When using l5d, we hit foo-service via l5d.test-ns.svc.cluster.local:4140 with the header Host: foo-service. When hitting it directly, we hit the k8s ClusterIP service at foo-service.test-ns.svc.cluster.local:8080. The backend service is a simple Dropwizard service.

Anything interesting in the logs of the restarted linkerd instances (like OOM or something)?

The instances don't restart. Everything is up: l5d and the backend services. l5d just (seemingly) decides to stop recognizing the backend services as being up. k8s shows no restarts. Everything is green and the backends respond to direct requests. The l5d logs show thousands of statements like the one noted above.

Setting failureAccrual to none means that there should never be a single FailureAccrualFactory mention in the logs, so that’s weird

Perhaps this was some kibble from previous runs. We completely removed the DS and all related resources before performing a fresh deployment and a fresh soak test of 50 rps for 30 min, the results of which are still disheartening. I've attached an updated metrics.json and our linkerd.yaml. The latter is essentially the example sans the gRPC and h2 routing. It also uses a linkerd image that we use in-house, which simply derives from the official 1.1.3 image but uses envsubst to do $NAMESPACE replacement in the config.

The tests showed an interesting distribution of failures, with the first third of the test being completely erratic before leveling out to the expected (hoped-for) performance.

Overall this was our best test yet, with only 19% hard failures. However, those are still pretty gross numbers when looking at that first third. We'll spin up another test shortly.

linkerd.yaml (9.4 KB)
metrics.json (54.1 KB)

Thanks for all this additional information! I asked @olix0r and @klingerf what they think might be going on.

Based on

"rt/http-outgoing/server/0.0.0.0/4140/connects": 33774,
"rt/http-outgoing/server/0.0.0.0/4140/requests": 33779,

we conclude that Gatling isn't reusing connections, and this has big performance implications. One pretty likely theory from @olix0r is that the host hits its maximum number of ports early on and then struggles to allocate any additional connections until the old, closed ones leave the connection table. This theory is supported by the fact that you can't make a connection to the admin UI while under load.

You could confirm this by running netstat -anl on the host under load; if the theory holds, you'll see tons and tons of TIME_WAIT entries.

This issue wouldn’t necessarily happen sans proxy, since proxies (almost by definition) open twice as many sockets as a server by itself.

Usually, opening and then closing 200 connections per second doesn't accurately reflect most applications' production traffic. If for some reason you're in the minority of operators that absolutely DOES need to open a truckton of connections every minute, there is some tuning that can be done, but it's at the Linux kernel level: linkerd can't do anything if the host won't open more connections.

Also, linkerd can handle much much higher qps in part because it’s so smart about connection reuse, and a lot of those benefits are lost if it’s forced to constantly throw away connections.

OK, enough proselytizing; here are my recommendations. Either reconfigure Gatling to reuse connections, or use a different load tester. At Buoyant we use slow_cooker, which is very explicit about concurrency vs. qps.

Thanks for this reply. We've been working on nailing down a maximum throughput to baseline against before coming back to this. We did update Gatling, as suggested, to share connections, since that is the more likely use case for us. We can confirm after numerous trials that without l5d in the picture we can sustain 500 req/s over a 30-minute duration with 0 failures and a 99th-percentile response time of 85ms.

Adding l5d back into the mix during a 50 req/s soak test over 30 min produced an 11% failure rate, with no timeouts configured and failureAccrual off. Unfortunately, that's an order of magnitude difference; it's not even comparable. We really wanted l5d to work, but it appears to be a massive impediment to our throughput.

Unless anyone has any other ideas (and we very much appreciate your ideas and time), the only conclusion we can draw is that l5d is not appropriate for moderate- to high-throughput systems. We will revisit l5d in the future to see if anything has changed. We are as sad as anyone: we've invested months getting the configurations "correct", nailing down images and k8s configs, bringing the team up to speed on the product, and so on. If nothing else, it was a learning experience.

Thanks for all your help and we look forward to revisiting this product in the future!

Hi @torr! Bummer we couldn’t get this working for you. I wonder what’s going on. We have many production users (some of whom are on the Slack and discourse) who use Linkerd at far greater request volumes than 500 RPS. In our internal cloud K8s-on-AWS testbed, we can do 30k-40k RPS on an individual instance with a p99 of 20-60ms. If you’re interested, we can share these setup scripts with you, and you can compare and contrast with your config.

I certainly would love to have a look! Knowing the basic behavior of these components and what they're trying to do, I find it confounding that we are struggling so much. It just doesn't seem to follow that without l5d the numbers are good, and then as soon as we go through l5d, even in a very targeted way isolating one simple call to one simple service, things go south. It just shouldn't be this way.

Sorry for the delay. We are reconstituting our perf harness for public consumption.

Hi @torr,

We've taken another look at your case and have some new suggestions to address the performance issues you are seeing. I concur with Esbie and William that this is a configuration issue, as linkerd (and its underlying technology, Finagle) scales to some of the largest sites on Earth. In principle, we should be able to achieve higher throughput and lower latency with linkerd than without it. I'm attaching a modified version of the linkerd.yaml file you uploaded earlier; we'd like you to try it out:

linkerd-modified.yaml (8.6 KB)

This file contains a number of modifications that we believe may alleviate the performance issues you are seeing. Specifically, it:

  • Removes the k8s resources/limits and resources/requests.
    • Setting resources/requests to 50m (0.05 CPUs) is too low for high-throughput applications; it's better to let k8s manage the resources.
  • Sets JVM_HEAP_MIN and JVM_HEAP_MAX to 512M.
    • These values initialize the JVM with a static amount of memory, avoiding the heap fragmentation and GC issues that can arise when memory is allocated and grown in many small increments. (A sketch of how these first two changes might look in the daemonset spec follows this list.)
  • Disables io.l5d.recentRequests and the zipkin-collector.
    • Sampling requests with these plugins does incur a performance penalty, and we'd like to rule them out.
  • Sets maxConcurrentRequests, bindingTimeoutMs, and failureAccrual back to their default values.
    • Even with these at their defaults, we'd expect to see better performance than you are reporting; we'd like to measure at the defaults again to get a baseline.
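
For reference, here is a minimal sketch of how those first two changes might look inside the daemonset's pod template. This is illustrative only: the stock 1.1.3 image name is shown (substitute your in-house derivative), and the actual changes are in the attached linkerd-modified.yaml.

      containers:
      - name: l5d
        # stock image shown for illustration; substitute your in-house derivative
        image: buoyantio/linkerd:1.1.3
        env:
        # pin the heap so the JVM allocates its memory once, up front
        - name: JVM_HEAP_MIN
          value: "512M"
        - name: JVM_HEAP_MAX
          value: "512M"
        # note: no resources/requests or resources/limits block here,
        # so k8s is left to manage CPU and memory for the pod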

Let us know if you have any questions; we look forward to your findings.
