No hosts available error with multi-router configuration

A load test of Linkerd with a multi-router setup is giving inconsistent results, with the error rate ranging from 0% to 50%.

Load test setup:

  • throughput: 500 req/sec
  • timeout: 60 sec

Setup

For the default namespace:
4140 -> outgoing
4141 -> incoming

For namespace ns1:
4143 -> outgoing-ns1
4144 -> incoming-ns1

For namespace ns2:
4145 -> outgoing-ns2
4146 -> incoming-ns2

Tested against a service named service1, deployed in the namespaces "default", "ns1", and "ns2":

http_proxy=L5D:4140 curl http://service1 (routes to service1 in namespace default)
http_proxy=L5D:4143 curl http://service1 (routes to service1 in namespace ns1)
http_proxy=L5D:4145 curl http://service1 (routes to service1 in namespace ns2)

Linkerd is unable to resolve the service name, and this happens only during load tests. The following errors appear frequently:

No hosts are available for /svc/service1, Dtab.base=[/k8s=>/#/io.l5d.k8s;/portNsSvc=>/#/portNsSvcToK8s;/host=>/portNsSvc/http/default;/host=>/portNsSvc/http;/svc=>/$/io.buoyant.http.domainToPathPfx/host], Dtab.local=[]. Remote Info: Not Available

No hosts are available for /svc/service1, Dtab.base=[/k8s=>/#/io.l5d.k8s;/portNsSvc=>/#/portNsSvcToK8s;/host=>/portNsSvc/http/ns1;/host=>/portNsSvc/http;/svc=>/$/io.buoyant.http.domainToPathPfx/host], Dtab.local=[]. Remote Info: Not Available

No hosts are available for /svc/service1, Dtab.base=[/k8s=>/#/io.l5d.k8s;/portNsSvc=>/#/portNsSvcToK8s;/host=>/portNsSvc/http/ns2;/host=>/portNsSvc/http;/svc=>/$/io.buoyant.http.domainToPathPfx/host], Dtab.local=[]. Remote Info: Not Available

linkerd.yml (10.4 KB)

Screenshots: admin-ui delegator, outgoing, outgoing-ns1, outgoing-ns2

Hi @zshaik. Would you mind providing a bit more detail:

  1. your linkerd config file
  2. screenshots of the admin dtab UI, with /svc/service1 route testing

@siggy updated, thanks!

Thanks for the config file @zshaik, it’s helpful.

A few comments:

  1. Looking at the failureAccrual policy:
failureAccrual:
  kind: io.l5d.successRate
  successRate: 0.9
  requests: 10
  backoff:
    kind: constant
    ms: 600000

This seems pretty aggressive: if two requests out of ten fail, Linkerd marks the host dead and won't attempt to send it traffic again for 10 minutes. Can you rerun your tests with the failureAccrual blocks completely disabled?
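For reference, disabling failure accrual for the test run would look roughly like this (a sketch of the relevant client block, assuming the usual router/client layout of a linkerd.yml; `kind: none` is linkerd's documented way to turn the policy off entirely):

```yaml
routers:
- protocol: http
  client:
    failureAccrual:
      kind: none   # disable failure accrual for this client
```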

  2. A CPU resource limit of 1000m is likely problematic. Linkerd detects how many CPUs are available on a system and then allocates threads based on that number. It is likely that Linkerd thinks it has many more CPUs available to it than Kubernetes is actually giving it. Can you rerun your tests with the resources block disabled?
resources:
  requests:
    memory: "500Mi"
    cpu: "400m"
  limits:
    memory: "2Gi"
    cpu: "1000m"
  3. Can you share /admin/metrics.json after test runs, both before and after making the changes per #1 and #2?

Hi @siggy, the errors are gone after removing the resource limits! Here are the metrics:

with_resources.json (168.4 KB)
without_resources.json (161.3 KB)

Also, can you advise on the retry budget settings and failure accrual? Are there suggested values for production?

    retries:
      budget:
        minRetriesPerSec: 1
        percentCanRetry: 0.1
        ttlSecs: 15
    responseClassifier:
      kind: io.l5d.http.retryableRead5XX

Hi @zshaik. Generally we recommend sticking with the default values unless you run into performance issues; the defaults are tuned for a high-performance production environment. If you do need to tune these values, we recommend setting up a reproducible benchmark (one that mimics your production conditions) and experimenting with different values.
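For comparison with your config above, the defaults (as I recall them from the linkerd configuration reference; please verify against the docs for your version) are roughly:

```yaml
retries:
  budget:
    minRetriesPerSec: 10   # always allow at least this many retries per second
    percentCanRetry: 0.2   # retries may add up to 20% extra load
    ttlSecs: 10            # window over which the retry budget is computed
```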

Thanks @siggy, one more question: if resource limits are not specified, how does Linkerd utilize the CPU per node, and what is the expected behaviour in terms of memory/CPU utilization?
For instance, I have a 3-node cluster where each node is a t2.medium (https://aws.amazon.com/ec2/instance-types), with Linkerd deployed as a DaemonSet and serving 10 endpoints in the same cluster at 500 req/sec.

Hi @zshaik. For memory, it’s fine to set limits, just ensure your JVM_HEAP_* flags are 33% less than the memory limit you give the container. For CPU, Linkerd’s load will generally track with request volume. The important thing is that Linkerd’s thread count aligns correctly with the CPU it has available to it. If you really do want to limit CPU, you can do it via taskset. Partially inspired by our topic here, we’ve created a Linkerd Performance Tuning guide, more info at:
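To make the 33% rule concrete, a container spec might look something like this (a sketch only; the env var names follow the JVM_HEAP_MIN/JVM_HEAP_MAX convention of the linkerd Docker image, and the image tag is hypothetical):

```yaml
containers:
- name: l5d
  image: buoyantio/linkerd:latest   # hypothetical tag; pin your actual version
  env:
  - name: JVM_HEAP_MIN
    value: "1024M"
  - name: JVM_HEAP_MAX
    value: "1024M"
  resources:
    limits:
      memory: "1536Mi"   # ~1.5x the 1024M heap, i.e. heap is ~33% below the limit
    # no cpu limit, per the recommendation in this thread
```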


That was really helpful, thank you @siggy. I tested Linkerd while monitoring p99 & p55 in linkerd-viz and got the expected results with the recommendations. I have a few more questions (I couldn't continue the conversation on Slack):

  1. Suppose I don't set resource limits in my k8s config and run Linkerd; what should the JVM_HEAP value in Linkerd be set to, given that it depends on RAM?
  2. What is the recommended resource (RAM) limit for Linkerd?
  3. Suppose all my nodes in the k8s cluster have 1 CPU each and Linkerd runs with no resource limits; does k8s over-allocate CPU to other containers, since no CPU limit is defined for Linkerd?

Hi @zshaik,
I hope these answers make it a bit clearer, let me know if you have any follow up questions:

  1. System memory should be about 1.5x the JVM_HEAP values. For example, if the nodes have 1.5GB of memory, JVM_HEAP should be ~1GB.
  2. We recommend JVM_HEAP values of around 1024M.
  3. I'm unfamiliar with exactly how k8s allocates CPU, but it tries to provide CPU to containers based on need.