Namerd CPU/memory rising over time

Since upgrading both linkerd and namerd from 1.1.3 to 1.3.1, I’ve been seeing namerd’s CPU and memory climb slowly under moderate traffic, with occasional odd spikes, until the pod crashes with out-of-memory errors. I haven’t changed any of my configs, just the version.

I’m not sure what info is useful for debugging this, so let me know what you need. I’m running l5d in daemonset mode in k8s.

Here’s a two-day profile:

namerd config:

admin:
  port: 9990

namers:
# k8s namer, resolving services via the kube API proxy on localhost:8001
- kind: io.l5d.k8s
  experimental: true
  host: localhost
  port: 8001
# same namer, but with destinations mapped onto the l5d daemonset pods
- kind: io.l5d.k8s
  prefix: /io.l5d.k8s.ds
  host: localhost
  port: 8001
  transformers:
  - kind: io.l5d.k8s.daemonset
    namespace: default
    port: incoming
    service: l5d

# dtab storage backed by the k8s API
storage:
  kind: io.l5d.k8s
  host: localhost
  port: 8001
  namespace: default

interfaces:
# thrift interface that linkerd's io.l5d.namerd interpreter connects to
- kind: io.l5d.thriftNameInterpreter
  ip: 0.0.0.0
  port: 4100
# HTTP API (used by namerctl)
- kind: io.l5d.httpController
  ip: 0.0.0.0
  port: 4180

Appreciate the detail, @aaronyoung. If possible, we’d love to get a heap dump from your namerd. To generate one, swap out namerd’s Docker image for buoyantio/namerd:1.3.2-SNAPSHOT-jdk. Once you’ve observed significant memory growth, run

jmap -dump:live,format=b,file=namerd.heap 1

in the container and send us the heap file (Google Drive or Dropbox both work). Let us know if you have any questions. We’ll dig in on our side as well. Thanks!
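For reference, here’s a minimal sketch of those steps with kubectl, assuming the namerd container is named namerd in a namerd deployment in the default namespace (adjust the names to match your setup):

# point the deployment at the jdk image so jmap is available in the container
kubectl set image deployment/namerd namerd=buoyantio/namerd:1.3.2-SNAPSHOT-jdk

# once memory has grown, dump the live heap of the JVM (PID 1 in the container)
kubectl exec <namerd-pod> -- jmap -dump:live,format=b,file=namerd.heap 1

# copy the heap file out of the pod
kubectl cp <namerd-pod>:namerd.heap ./namerd.heap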

Will do, thanks.

I actually typo’d up there: I upgraded to 1.3.1, not 1.3.2.

Got it. We did introduce several fixes in 1.3.2, specifically related to namerd and thriftNameInterpreter. Please test with 1.3.2 if possible.

A metrics.json dump taken at the same time as the heap dump would also be helpful.
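In case it’s useful, the metrics snapshot can be pulled from namerd’s admin interface on the admin port from the config above (9990); the port-forward step here is an assumption about how you reach the pod:

# forward the admin port locally, then grab the metrics snapshot
kubectl port-forward <namerd-pod> 9990:9990
curl -s http://localhost:9990/admin/metrics.json > metrics.json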

Ok, so with 1.3.2 it’s linkerd that’s running out of memory, while both the l5d and namerd pods are running at high CPU.

metrics.json (25.0 KB)
namerd heap

Hi @aaronyoung. Thanks for the heap dump! We’re about to merge a leak fix that looks closely related to this (https://github.com/linkerd/linkerd/pull/1714). We plan to release it as part of 1.3.3 later this week. If possible, can you re-run your tests on that branch (or on 1.3.3 once it’s released)?

@aaronyoung 1.3.3 is released. Would you mind confirming this has fixed your issue? Thanks!

I’ve had it in since yesterday morning, so far so good!

Hi @aaronyoung, just checking in again – is everything still looking good?