Request latency degradation at the Linkerd level significantly affects our end-user page response times. In other cases, when the degradation is on Linkerds connecting internal services (for background jobs and the like), it can cause timeouts. So even though handletime_us is measured in microseconds, I have observed it increasing whenever we see degraded request latency, which makes it a significant metric for us to track.
In my experience so far, the /admin endpoint worker patches in 1.4.4 fixed the timeouts I experienced when I started using Prometheus to scrape Linkerd metrics. However, general degradation of normal requests going through Linkerd is still occurring (we are running 1.4.6 throughout our infrastructure).
I’ll work on preparing metrics for you tomorrow. As an example, I’ll take the metrics used to produce the graphs above, which come from 4 Linkerds running on 4 dedicated machines.
I will also look at the other admin endpoints, take note of CPU/thread dumps and lock contention info, and wait for degradation to appear somewhere so I can compare the state of Linkerd at that time with normal operation.
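For the comparison, something along these lines is roughly what I have in mind. This is only a sketch, not a finished tool: it assumes the admin interface is on the default port 9990 and that /admin/metrics.json returns a flat JSON object of metric name to number. The helper names (`fetch_metrics`, `compare_snapshots`) and the 1.5x threshold are my own placeholders.

```python
import json
import urllib.request

# Assumption: Linkerd 1.x admin interface on its default port 9990, exposing
# a flat JSON object of metric name -> number at /admin/metrics.json.
ADMIN_URL = "http://localhost:9990/admin/metrics.json"


def fetch_metrics(url=ADMIN_URL, timeout=5.0):
    """Fetch one snapshot of the flat metrics JSON from the admin endpoint."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def compare_snapshots(baseline, degraded, needle="handletime_us", min_ratio=1.5):
    """Return metrics containing `needle` that grew by at least `min_ratio`.

    Each row is (name, baseline_value, degraded_value, ratio), sorted by
    ratio, largest first. Metrics that are missing from the degraded
    snapshot, non-numeric, or have a zero/negative baseline are skipped.
    """
    rows = []
    for name, before in baseline.items():
        if needle not in name:
            continue
        after = degraded.get(name)
        if not isinstance(before, (int, float)) or not isinstance(after, (int, float)):
            continue
        if before <= 0:
            continue
        ratio = after / before
        if ratio >= min_ratio:
            rows.append((name, before, after, ratio))
    rows.sort(key=lambda row: row[3], reverse=True)
    return rows
```

The idea is to save one `fetch_metrics()` result per Linkerd during normal operation, save another while degradation is happening, and pass both to `compare_snapshots` to see which handletime_us stats moved the most.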
Thank you for all the help.