This page describes some things you might find useful if you’re preparing to run (or running!) linkerd in production.
Metrics to monitor
We recommend setting an alert that checks that these metrics are 0:
rt/*/bindcache/bound/oneshots
rt/*/bindcache/client/oneshots
rt/*/bindcache/path/oneshots
rt/*/bindcache/tree/oneshots
When any of these is non-zero, a cache has been exhausted and parts of the client stack are being built in the request-serving path.
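As a rough illustration of such an alert check, here is a minimal Python sketch. It assumes you have already loaded the metrics into a flat mapping of metric name to numeric value; how you fetch them is up to your monitoring setup. The function name `exhausted_caches`, the router names `outgoing` and `incoming`, and the sample values are hypothetical, chosen only for the example.

```python
from fnmatch import fnmatchcase

# The four bindcache metrics recommended above; "*" stands for the router name.
ONESHOT_PATTERNS = [
    "rt/*/bindcache/bound/oneshots",
    "rt/*/bindcache/client/oneshots",
    "rt/*/bindcache/path/oneshots",
    "rt/*/bindcache/tree/oneshots",
]


def exhausted_caches(metrics):
    """Return {name: value} for every oneshots metric that is non-zero."""
    return {
        name: value
        for name, value in metrics.items()
        if value != 0 and any(fnmatchcase(name, p) for p in ONESHOT_PATTERNS)
    }


# Hypothetical sample data: router names and values are made up.
sample = {
    "rt/outgoing/bindcache/bound/oneshots": 0,
    "rt/outgoing/bindcache/path/oneshots": 3,
    "rt/incoming/bindcache/tree/oneshots": 0,
    "rt/outgoing/requests": 1200,  # unrelated metric, ignored by the check
}

offenders = exhausted_caches(sample)
if offenders:
    print("bindcache exhausted:", offenders)
```

An alert would fire whenever the returned mapping is non-empty. Matching on the pattern rather than on a fixed router name means routers added later are covered without changing the check.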
Other things our users have found helpful to keep an eye on:
failure_accrual:removals - tracks the number of times a host has been removed due to failure accrual.
loadbalancer:available - gauge of how many nodes the load balancer thinks are ready to receive traffic.
If true, connection failures are punished aggressively. This should be set to false on clients that talk to small clusters with fewer than ~3 nodes.