Request latency degradation (correlated with server handletime_us)


#1

After Linkerd has been up and running for a week or two, we start seeing sudden degradation in request latency. The only metric I could correlate with the behavior for sure is Finagle’s handletime_us. Restarting Linkerd brings the performance metrics back to normal.

The graphs I am posting are for 4 Linkerds running on bare metal. This has been seen on random Linkerds throughout our infrastructure.

I have historical metrics available in either Whisper/Graphite or Prometheus form. I assume this will help with debugging, so please let me know what is needed.


#2


#3

@slowpoke thanks for opening this issue. Do you mind giving us a bit more information about your setup? What are your Linkerd configurations? What kind of services are you using, HTTP/1.1 or HTTP/2? What is the request load for an affected Linkerd? Does this happen under heavy load, or does that not matter?


#4

Consul as namer. HTTP/1.1. Exposes Prometheus metrics. Zipkin with a sample rate of 0.001. Admin endpoint handle time is affected in the same way. Config:

routers:
- protocol: http
  service:
    responseClassifier:
      kind: io.l5d.http.retryableRead5XX
  client:
    loadBalancer:
      kind: ewma
      maxEffort: 10
      decayTimeMs: 5000
    failureAccrual:
      kind: io.l5d.successRate
      successRate: 0.9
      requests: 1000
      backoff:
        kind: jittered
        minMs: 5000
        maxMs: 50000


#5


#6


#7

Load does not seem to matter in this or any other case when we’ve experienced “degradation”.

[graph]


#8

Interesting that you mention that the admin endpoint is affected as well. What Linkerd version are you using? As of 1.4.4, there was an improvement to the admin endpoint so that it serves metrics on its own thread rather than on the data path threads. If possible, it may be worth trying that out to see if there is an improvement in request latency.


#9

Yes, I am aware, and I upgraded to 1.4.4 for that reason as soon as it came out.


#10

Restarted affected Linkerds yesterday. Here’s the effect:


I’m looking for any pointers on how to approach debugging this inside Linkerd, please.


#11

Hi @slowpoke, we have had other users report this issue as well, and we are actively investigating how to fix it. My guess is that the PrometheusTelemeter begins to choke when it processes a large amount of stats, especially stats from a Linkerd configuration with numerous routers.


#12

In addition to noting the latency of these metrics requests, it would be valuable to know the size and contents of the payload. In particular, could you share the full response payload from the metrics endpoint:

  • while handletime is still fast
  • after handletime has become slow
  • after rebooting Linkerd

By comparing the metrics at these three points, we may be able to determine what is causing the slowness. @slowpoke, are you able to share metrics dumps either publicly or privately?
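
If it helps, here is a rough sketch of how you could capture a timestamped dump at each of those points (a minimal sketch that assumes the default admin port 9990 and the /admin/metrics.json endpoint; adjust the host and port to your deployment, and the Prometheus telemeter’s /admin/metrics/prometheus endpoint can be captured the same way):

import json
import time
import urllib.request

# Assumed admin address; adjust host/port to your deployment.
ADMIN = "http://localhost:9990"

def dump_metrics(label):
    # Fetch the full metrics payload and write it to a timestamped file.
    with urllib.request.urlopen(ADMIN + "/admin/metrics.json") as resp:
        payload = resp.read()
    name = "metrics-%s-%d.json" % (label, int(time.time()))
    with open(name, "wb") as f:
        f.write(payload)
    print("wrote %s: %d bytes, %d metrics" % (name, len(payload), len(json.loads(payload))))

# Capture with e.g. "fast", "slow", and "post-restart" as the labels.
dump_metrics("fast")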


#13

I am not sure we’re on the same page here. The graphs I am posting above are NOT for the /admin metrics endpoints. Request latency increases for normal requests going through Linkerd. Whenever that happens, the /admin endpoints are also affected in the same way, as one of the other graphs I posted above shows.

We hadn’t done any metric pulls from Linkerd using Prometheus before March or April. This “degradation” started at some point last year, and I have metrics to prove that. Therefore, it is a mistake to think that the degradation in request latency, and its relation to server handle time (which is a different metric - a Finagle metric - from the admin endpoint handle time), is a direct result of Prometheus scraping.

Please note these findings above. I’d prefer to share metrics with you privately. I will contact you on Slack tomorrow.


#14

Thanks for the clarification, and sorry for the confusion. Looking again at the graphs, I now see what you mean. I notice that the handletime degradation for requests is in the microseconds, and there doesn’t seem to be a noticeable impact on request latency. How serious a problem is this?

I do notice that admin handletime is in the milliseconds, which is concerning.

The metrics dumps I mentioned in the previous message will still be helpful for understanding what’s happening here and for helping us try to reproduce this issue. You can also take a look at some of Linkerd’s admin endpoints, which are useful for diagnosing performance issues (there is a rough capture sketch after the list):

  • /admin/contention: shows lock contention
  • /admin/pprof/profile: generates a CPU usage profile
  • /admin/threads: dump of current thread stacktraces
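
A rough way to snapshot all three at once (again assuming the default admin port; note that, if I remember correctly, the CPU profile endpoint blocks while it samples, on the order of ten seconds, so that request takes a while to return):

import time
import urllib.request

ADMIN = "http://localhost:9990"  # assumed admin address; adjust as needed

# /admin/pprof/profile returns a binary profile and blocks while sampling.
ENDPOINTS = ["/admin/contention", "/admin/threads", "/admin/pprof/profile"]

stamp = int(time.time())
for path in ENDPOINTS:
    with urllib.request.urlopen(ADMIN + path) as resp:
        data = resp.read()
    name = "%s-%d.out" % (path.strip("/").replace("/", "-"), stamp)
    with open(name, "wb") as f:
        f.write(data)
    print("saved %s (%d bytes)" % (name, len(data)))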

#15

Request latency degradation at the Linkerd level affects our end-user page response times significantly. In other cases, when the degradation is on Linkerds connecting internal services for, say, background jobs and whatnot, it can cause timeouts. So even though handletime_us is in microseconds, I have observed it increasing whenever we see a degradation in request latency, which makes it a significant metric for us to track.

In my experience so far, the /admin endpoint worker-related patches in 1.4.4 fixed the timeouts I experienced when I started using Prometheus to scrape Linkerd metrics. However, general degradation of normal requests going through Linkerd is still occurring (we are running 1.4.6 throughout our infrastructure).

I’ll work on preparing metrics for you tomorrow; as an example, I’ll take the metrics used to produce the graphs above, which are from 4 Linkerds running on 4 dedicated machines.

I will also take a look at the other admin endpoints, take note of CPU profiles, thread dumps, and lock contention info, and wait for degradation to appear somewhere so that I can compare the state of Linkerd at that time with normal operation.
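
For that comparison I’ll probably use something along these lines to diff a “normal” dump against a “degraded” one (just a rough sketch; the file names are placeholders for dumps captured as described above):

import json

# Placeholder file names for two /admin/metrics.json dumps.
with open("metrics-normal.json") as f:
    before = json.load(f)
with open("metrics-degraded.json") as f:
    after = json.load(f)

# Metrics that appear or disappear between the two dumps.
print("added:", sorted(set(after) - set(before)))
print("removed:", sorted(set(before) - set(after)))

# Largest absolute changes among metrics present in both dumps.
changes = []
for key in set(before) & set(after):
    b, a = before[key], after[key]
    if isinstance(b, (int, float)) and isinstance(a, (int, float)):
        changes.append((abs(a - b), key, b, a))

for delta, key, b, a in sorted(changes, reverse=True)[:20]:
    print("%s: %s -> %s" % (key, b, a))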

Thank you for all the help.