Note: this post is about Linkerd 1.x. For Linkerd 2.x performance benchmarks, please see the blog post: https://linkerd.io/2019/05/18/linkerd-benchmarks/
The goal of this post is to document various tuning strategies to achieve optimal Linkerd performance.
An optimally-tuned Linkerd with 1GB RAM and ample CPU should provide the following performance:
| qps | p50 | p95 | p99 |
|-----|-----|-----|-----|
| 1k  | <1ms | 1ms  | 5ms  |
| 10k | 1ms  | 2ms  | 4ms  |
| 20k | 1ms  | 4ms  | 6ms  |
| 30k | 2ms  | 6ms  | 10ms |
| 40k | 2ms  | 17ms | 31ms |
These numbers are highly dependent on the environment, configuration, and latency characteristics of the cluster Linkerd is running in.
Memory
In production, set the `JVM_HEAP_MIN` and `JVM_HEAP_MAX` environment variables to the same value, to avoid memory fragmentation. For high-qps applications, we recommend at least `1024M`.
By default we set `JVM_HEAP_MIN` to `32M` and `JVM_HEAP_MAX` to `1024M`. This is intended for a development environment where, for example, you'd want to easily boot Linkerd on your personal computer.
If Linkerd is running in a virtualized environment, be sure to give it 33% headroom above the JVM heap. For example, if your Docker container or Kubernetes config provides 1GB to Linkerd, give the JVM `768M`.
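As a sketch, here's how that headroom rule might look in a Kubernetes container spec (the container name and image tag are illustrative, not prescriptive):

```yaml
# Illustrative only: a 1Gi container limit with the JVM heap pinned at 768M,
# leaving roughly 33% headroom for off-heap memory.
containers:
- name: l5d
  image: buoyantio/linkerd:1.3.3   # tag is an example
  env:
  - name: JVM_HEAP_MIN
    value: "768M"
  - name: JVM_HEAP_MAX
    value: "768M"
  resources:
    limits:
      memory: "1Gi"
```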
Note that these environment variables are for convenience, and map to the JVM flags `-Xms` and `-Xmx`.
CPU
Avoid setting CPU limits in virtualized/containerized environments. Linkerd tunes its thread count based on the number of CPUs it detects on the host. If your environment provides a smaller share of CPUs, Linkerd will allocate more threads than it should, causing performance degradation. To verify this, check the `jvm/num_cpus` metric in Linkerd's `/admin/metrics.json` endpoint.
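For example, you can pull that metric from the admin endpoint with curl. The echoed JSON below is a stand-in for live output so the parsing step is runnable anywhere; port 9990 is Linkerd's default admin port:

```shell
# Against a live Linkerd, you would run:
#   curl -s http://localhost:9990/admin/metrics.json
# The echoed JSON below stands in for that output.
echo '{"jvm/num_cpus": 2.0}' \
  | python3 -c 'import json, sys; print(json.load(sys.stdin)["jvm/num_cpus"])'
# prints 2.0
```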
If you need to limit the CPUs available to Linkerd, use `taskset` to specify which CPUs to give it. For example, to limit Linkerd to 4 CPUs in Kubernetes:
```yaml
command: ["taskset"]
args:
- "--cpu-list"
- "0-3"
- "/io.buoyant/linkerd/1.3.3-SNAPSHOT/bundle-exec"
- "/io.buoyant/linkerd/config/config.yaml"
```
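To confirm that `taskset` pinning behaves as expected, note that `nproc` honors CPU affinity, so a command pinned to one core sees a single CPU. This is a generic Linux demo, not Linkerd-specific:

```shell
# pin a command to CPU 0 and check how many CPUs it sees
taskset --cpu-list 0 nproc
# prints 1
```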
Telemeters
Avoid setting high `sampleRate` values for Linkerd telemeters, specifically:

- `io.l5d.recentRequests`
- `io.l5d.statsd`
- `io.l5d.tracelog`
- `io.l5d.zipkin`
If you encounter performance issues, and have one or more of these telemeters enabled, try disabling them and re-running your benchmarks.
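If you do need one of these telemeters, keep its sample rate low. Here's a sketch of a zipkin telemeter stanza with a 2% sample rate (the host and port are placeholders):

```yaml
# Illustrative config: sample only 2% of requests to bound tracing overhead.
telemetry:
- kind: io.l5d.zipkin
  host: zipkin-collector.example.com   # placeholder
  port: 9410
  sampleRate: 0.02
```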
Config Defaults
Linkerd provides powerful performance tuning via Failure Accrual, Load-Balancing, and Retries. By default, these parameters are tuned for a high traffic production environment. We recommend you start with defaults, and carefully tune to your environment, re-running benchmarks with each change.
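For instance, failure accrual can be adjusted per-client in a router's config; the values below are purely illustrative starting points, not recommendations:

```yaml
# Illustrative only: mark a host dead after 5 consecutive failures, and
# re-probe it on a constant 10s backoff.
routers:
- protocol: http
  client:
    failureAccrual:
      kind: io.l5d.consecutiveFailures
      failures: 5
      backoff:
        kind: constant
        ms: 10000
```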
linkerd-viz
linkerd-viz provides a quick overview of top-line service and Linkerd performance. It builds on metrics available from `/admin/metrics.json`.
linkerd-viz includes two dashboards: a default dashboard for service-level performance, and a linkerd-health dashboard:
Default Dashboard
The default dashboard shows service-level performance, including success rate, latency, and request volume.
linkerd-health Dashboard
The linkerd-health dashboard provides Linkerd performance, including memory usage, GC, and uptime.
Flame Graphs
If Linkerd appears to be CPU bound, you can generate Flame Graphs from a running Linkerd to easily identify hot spots. The instructions below will generate an interactive SVG file, allowing you to browse code-paths by CPU usage.
```shell
# be sure to install the old gperftools; we explicitly need pprof from that
# package, not the one from https://github.com/google/pprof
$ brew install gperftools
$ pprof --version
pprof (part of gperftools 2.0)
Copyright 1998-2007 Google Inc.
This is BSD licensed software; see the source for copying conditions
and license information.
There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE.

$ git clone https://github.com/brendangregg/FlameGraph

# collect a 30-second CPU profile, collapse it, and render the flame graph
$ curl -s "http://localhost:9990/admin/pprof/profile?seconds=30&hz=100" > l5d.pprof
$ pprof --collapsed l5d.pprof > l5d.collapsed
$ FlameGraph/flamegraph.pl --color=java --hash l5d.collapsed > l5d.svg
$ open l5d.svg
```