Linkerd Performance Tuning


The goal of this post is to document various tuning strategies to achieve optimal Linkerd performance.

An optimally-tuned Linkerd with 1GB RAM and ample CPU should provide the following performance:

qps   p50    p95    p99
1K    <1ms   1ms    5ms
10K   1ms    2ms    4ms
20K   1ms    4ms    6ms
30K   2ms    6ms    10ms
40K   2ms    17ms   31ms

These numbers are highly dependent on the environment, configuration, and latency characteristics of the cluster Linkerd is running in.

Memory

In production, set the JVM_HEAP_MIN and JVM_HEAP_MAX environment variables to the same value, to avoid memory fragmentation. For high qps applications, we recommend at least 1024M.

By default we set JVM_HEAP_MIN to 32M and JVM_HEAP_MAX to 1024M. This is intended for a development environment where, for example, you’d want to easily boot Linkerd on your personal computer.

If Linkerd is running in a virtualized environment, be sure to leave 33% headroom above the JVM heap. For example, if your Docker container or Kubernetes config provides 1GB to Linkerd, give the JVM 768M.

Note that these environment variables are for convenience, and map to JVM flags -Xms and -Xmx.
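The headroom rule above can be sketched as a small shell calculation. This is a minimal sketch, not something Linkerd ships: heap_mb_for_limit is a hypothetical helper, and the 1024 input assumes the 1GB container from the example.

```shell
# Hypothetical helper: given a container memory limit in MB, print a heap size
# in MB that leaves 33% headroom above the heap (heap = limit * 3 / 4).
heap_mb_for_limit() {
  echo $(( $1 * 3 / 4 ))
}

# Assumed 1GB (1024 MB) container limit, as in the example above.
heap="$(heap_mb_for_limit 1024)M"

# Set min and max to the same value, as recommended for production.
export JVM_HEAP_MIN="$heap"
export JVM_HEAP_MAX="$heap"
echo "JVM_HEAP_MIN=$JVM_HEAP_MIN JVM_HEAP_MAX=$JVM_HEAP_MAX"
```

With a 1024 MB limit this yields 768M for both variables, matching the figure above.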

CPU

Avoid setting CPU limitations in virtualized/containerized environments. Linkerd tunes its thread count based on the number of CPUs it detects on the host. If your environment provides a smaller share of CPUs, Linkerd will allocate more threads than it should, causing performance degradation. To see how many CPUs Linkerd detected, check the jvm/num_cpus metric in Linkerd’s /admin/metrics.json endpoint.
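One way to pull that metric out from the command line is sketched below. num_cpus_from_metrics is a hypothetical helper, not a Linkerd tool; it assumes the metric appears in the JSON as a "jvm/num_cpus" key followed by a number, and the usage line assumes the admin interface is listening on localhost:9990.

```shell
# Hypothetical helper: read metrics JSON on stdin and print the jvm/num_cpus value.
# Assumes the key appears as "jvm/num_cpus": <number> somewhere in the input.
num_cpus_from_metrics() {
  sed -n 's/.*"jvm\/num_cpus"[[:space:]]*:[[:space:]]*\([0-9.]*\).*/\1/p'
}

# Usage against a running Linkerd (admin address assumed):
#   curl -s http://localhost:9990/admin/metrics.json | num_cpus_from_metrics
```

If the number printed is larger than the CPU share your container is actually allowed to use, Linkerd is likely over-allocating threads.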

If you need to limit the CPUs available to Linkerd, use taskset to specify the number of CPUs to give it. For example, to limit Linkerd to 4 CPUs in Kubernetes:

command: ["taskset"]
args:
- "--cpu-list"
- "0-3"
- "/io.buoyant/linkerd/1.3.3-SNAPSHOT/bundle-exec"
- "/io.buoyant/linkerd/config/config.yaml"

Telemeters

Avoid setting high sampleRate values for Linkerd Telemeters, specifically:

  • io.l5d.recentRequests
  • io.l5d.statsd
  • io.l5d.tracelog
  • io.l5d.zipkin

If you encounter performance issues, and have one or more of these telemeters enabled, try disabling them and re-running your benchmarks.
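Before re-running benchmarks, it can help to see which of these telemeters a config enables and what sample rates it sets. A minimal sketch: list_sample_rates is a hypothetical helper, it assumes telemeters are declared with kind: and sampleRate: keys in the YAML config, and the path in the usage comment is the one from the Kubernetes example above.

```shell
# Hypothetical helper: print the lines of a Linkerd config that name one of the
# sampled telemeters or set a sampleRate, prefixed with their line numbers.
list_sample_rates() {
  grep -nE '^[[:space:]-]*(kind:[[:space:]]*io\.l5d\.(recentRequests|statsd|tracelog|zipkin)|sampleRate:)' "$1"
}

# Usage:
#   list_sample_rates /io.buoyant/linkerd/config/config.yaml
```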

Config Defaults

Linkerd provides powerful performance tuning via Failure Accrual, Load-Balancing, and Retries. By default, these parameters are tuned for a high traffic production environment. We recommend you start with defaults, and carefully tune to your environment, re-running benchmarks with each change.

linkerd-viz

linkerd-viz provides a quick overview of top-line service and Linkerd performance. It builds on metrics available from /admin/metrics.json.

linkerd-viz includes two dashboards: a default dashboard for service-level performance, and a linkerd-health dashboard.

Default Dashboard

The default dashboard shows service-level performance, including success rate, latency, and request volume.

linkerd-health Dashboard

The linkerd-health dashboard shows the performance of Linkerd itself, including memory usage, GC, and uptime.

Flame Graphs

If Linkerd appears to be CPU bound, you can generate Flame Graphs from a running Linkerd to easily identify hot spots. The instructions below will generate an interactive SVG file, allowing you to browse code-paths by CPU usage.

# be sure to install the old gperftools, we explicitly need pprof from that package,
# not the one from https://github.com/google/pprof
brew install gperftools

$ pprof --version
pprof (part of gperftools 2.0)

Copyright 1998-2007 Google Inc.

This is BSD licensed software; see the source for copying conditions
and license information.
There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE.

git clone https://github.com/brendangregg/FlameGraph

# build flame graph
curl -s "http://localhost:9990/admin/pprof/profile?seconds=30&hz=100" > l5d.pprof
pprof --collapsed l5d.pprof > l5d.collapsed
FlameGraph/flamegraph.pl --color=java --hash l5d.collapsed > l5d.svg
open l5d.svg
