Suspected slow leak for Linkerd 1.1.2

We are observing a CPU spike after l5d has been running for 2+ days.
Restarting the l5d process returns everything to normal for another 2+ days.

I took a heap dump and sent it to olix0r; it seems to be related to the stats collector.

Note I have this in my settings:

telemetry:
  - kind: io.l5d.statsd
    experimental: true
    prefix: l5d_router_prod
    hostname: 127.0.0.1
    port: 8125
    gaugeIntervalMs: 10000
    sampleRate: 0.01
  - kind: io.l5d.recentRequests
    sampleRate: 1.0
On my side, I disabled statsd for a box and will continue to observe.

Thanks for providing your config details @leozc. A couple of follow-up questions:

  1. When you say "disabled statsd for a box and continue to observe", do you mean you disabled io.l5d.statsd and still observed the memory leak? Can you try with the telemetry section completely removed? Or with only io.l5d.recentRequests?
  2. We have a theory about some leaks in io.l5d.statsd. Can you try running with Docker image buoyantio/linkerd:statsd-leaks and verify if the issue persists? Multiple heap dumps are very helpful for comparison.
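
For reference, the "only io.l5d.recentRequests" variant from question 1 would look roughly like the fragment below. This is a minimal sketch that assumes the rest of the config stays unchanged and reuses the sampleRate from the config posted above. The "telemetry section completely removed" variant would drop this block altogether.

```yaml
# Sketch: telemetry with io.l5d.statsd removed, leaving only recentRequests.
# sampleRate is copied from the original config; adjust as needed.
telemetry:
  - kind: io.l5d.recentRequests
    sampleRate: 1.0
```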

Thanks again,
sig

  1. Only the recentRequests plugin is left enabled.
  2. Is this a new build? I have everything set up and will be able to direct some real traffic to it if I can have the binary (please ping me on Slack).

It seems it is leaking somewhere else.

Linkerd can now survive 4 days of JVM time. After that, we observed increasing CPU consumption, and eventually the box hit 100% CPU and was terminated by the autoscaling group due to health check failures.

I took the box out of the autoscaling group (in AWS) before termination. Without ingress traffic, the machine’s CPU dropped back to idle level (1-2%) without restarting the process.
However, when I routed traffic back to l5d, the CPU immediately went back to max.

So I am pretty sure it is a persistent issue somewhere.

Yes, restarting the JVM helps.

I sent the dump through a private channel.

@eliza you mentioned there were some reports related to this issue. May I know where I can look at them?

@leozc can you make a GitHub issue for this? We’ll track it with the other bugs we’re aiming to fix for the next release. Thanks

done