Server connection limits


#1

Hello,

I’m trying to learn more about an issue we are having that we suspect is due to a high number of incoming connections to Linkerd.

We run Linkerd as an ingress controller in Kubernetes, and basic load testing from a small number of clients shows us comfortably handling 7-8k RPS, which gives ample headroom for expected load. The real-world traffic coming into the cluster comes from approximately 150 hosts, each opening 80 concurrent connections (roughly 12,000 connections in total, or ~2,400 per instance across the 5 nodes). We see throughput start to suffer once l5d instances reach about 1000-1200 server connections. CPU and memory are generally stable and well within limits.

We can (and will) improve connection pooling in the client, but meanwhile would like to understand more about the constraint we’re hitting.

Kubernetes configuration:

  • Kubernetes 1.9.6 on GKE
  • Linkerd deployed as a DaemonSet (during load testing, instances peak at around 2 CPUs per host)
  • 5 hosts: 8 vCPU, 30GB RAM
  • -Xms and -Xmx both set to 1024MB
  •   resources:
        limits:
          memory: 1.5Gi
        requests:
          memory: 1.5Gi
    (a sketch of how these settings fit into the DaemonSet spec follows this list)
    
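Roughly how these settings fit together in our DaemonSet container spec (paraphrased from memory, so the JVM_HEAP_MIN / JVM_HEAP_MAX env var names and the image tag are approximate rather than exact):

      # Illustrative only: container spec combining the heap and memory settings above.
      # JVM_HEAP_MIN / JVM_HEAP_MAX are assumed to map to -Xms / -Xmx in the image's
      # launcher; <version> is a placeholder for whatever tag we run.
      containers:
      - name: l5d
        image: buoyantio/linkerd:<version>
        env:
        - name: JVM_HEAP_MIN
          value: "1024M"
        - name: JVM_HEAP_MAX
          value: "1024M"
        resources:
          requests:
            memory: 1.5Gi
          limits:
            memory: 1.5Gi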

Linkerd Config:

  admin:
    ip: 0.0.0.0
    port: 9990

  namers:
  - kind: io.l5d.k8s

  telemetry:
  - kind: io.l5d.prometheus

  routers:
  - protocol: http
    identifier:
      kind: io.l5d.ingress
    servers:
    - port: 80
      ip: 0.0.0.0
    dtab: /svc => /#/io.l5d.k8s
    client:
      loadBalancer:
        kind: ewma
        maxEffort: 5
        decayTimeMs: 5000

I can give more detail about our configuration if necessary, but in the meantime I'm wondering about…

  • How many incoming connections would you expect Linkerd to be able to handle with the above configuration? (We previously fronted this service with nginx on the same Kubernetes infrastructure/config.)
  • Are there any best practices around connection reuse/pooling that we should be thinking about?

Thanks,
Andy.


#2

Hi Andy!

That is quite a bit of concurrency, so it’s not too surprising that you are seeing some slowdown. One thing to try would be to increase the number of Finagle worker threads to see if that helps deal with the concurrency. The default number of Finagle workers is 8, but setting this to 2x the number of available CPUs is sometimes a good heuristic. You can see how to configure the number of Finagle workers here: https://github.com/linkerd/linkerd/pull/1889
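If you want to experiment before that lands, the underlying knob is Finagle’s com.twitter.finagle.netty4.numWorkers global flag. Something like the sketch below might work, though the JVM_OPTIONS env var is just an assumption about how extra JVM arguments reach your l5d container, so double-check the flag name and mechanism against the PR and your image:

      # Sketch only: setting Finagle's Netty 4 worker count to ~2x vCPUs (8-vCPU nodes).
      # com.twitter.finagle.netty4.numWorkers is Finagle's global flag for its worker
      # pool; JVM_OPTIONS is assumed to be how extra JVM args reach the l5d container.
      env:
      - name: JVM_OPTIONS
        value: "-Dcom.twitter.finagle.netty4.numWorkers=16"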


#3

Hi Alex,

Thanks for this detail. As expected, we’ve mostly solved our issues by reducing the number of incoming concurrent connections. I’d still like to tune the Finagle settings to try to deal with the slowdown we see when the number of connections spikes suddenly and we can’t scale quite quickly enough.

It looks like the change you referenced above will be in v1.4.0, so I will probably wait until this version is released before experimenting further. I’ll keep this issue open until then.

Thanks again - hope to see you next week in Copenhagen!