Thanks for providing your config details @leozc. A couple of follow-up questions:
When you say you disabled statsd for a box and continued to observe, do you mean you disabled io.l5d.statsd and still saw the memory leak? Can you try with the telemetry section removed completely, or with only io.l5d.recentRequests?
We have a theory about some leaks in io.l5d.statsd. Can you try running with the Docker image buoyantio/linkerd:statsd-leaks and check whether the issue persists? Multiple heap dumps are very helpful for comparison.
Linkerd can now survive 4 days of JVM time. After that, we observed CPU consumption increasing until the box reached 100% CPU and was terminated by the autoscaling group due to a health check failure.
I took the box out of the autoscaling group (in AWS) before termination. Without ingress traffic, the machine’s CPU dropped back to idle level (1-2%), without restarting the process.
However, when I routed traffic back to l5d, the CPU immediately went back to max.
So I am pretty sure it is a persistent issue somewhere.