Linkerd CPU usage climbs over time

I have noticed that the CPU usage of linkerd grows over time (over a week, towards 100%). Restarting it brings it back to the normal range. The app that talks to linkerd lives on the same box and communicates via localhost. All routing is done via a consul lookup.

It’s linkerd 1.1.2. I captured a 2-minute JFR. I also have 4 machines that are getting close to hitting 100% if we want something more.
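For reference, I’ve been grabbing these recordings with roughly the following (pid and paths are illustrative; on Oracle JDK 8 the JVM needs the commercial-features flags for JFR to work):

```
# Start a 2-minute Java Flight Recorder capture against the linkerd JVM.
# Requires the JVM to have been started with:
#   -XX:+UnlockCommercialFeatures -XX:+FlightRecorder
jcmd <linkerd-pid> JFR.start duration=120s filename=/tmp/linkerd.jfr

# Check when the recording has finished, then collect /tmp/linkerd.jfr.
jcmd <linkerd-pid> JFR.check
```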

The config is below, and I’ll see if I can attach the JFR as a file.

Can’t attach; the JFR is here:
https://drive.google.com/file/d/0B7l2YR59dtHnSjk2RW9sM1drd0k/view?usp=sharing

Metrics: https://drive.google.com/file/d/0B7l2YR59dtHnX2szY1hkWnN2NHc/view?usp=sharing

```
admin:
  port: 9990
namers:
  - kind: io.l5d.consul
    host: dockerhost
    port: 8500
    prefix: /consul
    includeTag: true
routers:
- protocol: http
  label: cgp
  maxRequestKB: 53477
  maxHeadersKB: 12
  dtab: |
    /svc => /#/consul/.local;
    /svc/*/mot-v2-test => /#/consul/.local/live/mot-v2-test | /#/consul/us-east-1-vpc-88394aef/live/mot-v2-test;
    /svc/*/mot-v2-stage => /#/consul/.local/live/mot-v2-stage | /#/consul/us-east-1-vpc-35196a52/live/mot-v2-stage;
    /svc/hostAndPort => /$/inet;
  identifier:
  - kind: io.l5d.header
    header: X-Cloud-Gate-Destination-Id
#  service:
#    responseClassifier:
#      kind: io.l5d.http.retryableRead5XX
  client:
    requestAttemptTimeoutMs: 60000
    loadBalancer:
      kind: ewma
    failureAccrual:
      kind: io.l5d.successRate
      successRate: 0.9
      requests: 20
      backoff:
        kind: constant
        ms: 10000
  servers:
  - port: 4140
    ip: 0.0.0.0
  bindingCache:
    paths: 1000
    trees: 1000
    bounds: 1000
    clients: 1000
telemetry:
- kind: io.l5d.zipkin
  host: zipkin.service.consul
  port: 9410
- kind: io.l5d.prometheus
usage:
  enabled: false
```

Can you also provide a dump of :9990/admin/metrics.json from a linkerd instance in a high-CPU state?
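Something like this should capture it (host and output path are just examples):

```
# Dump the full admin metrics from a linkerd instance that is currently burning CPU.
curl -s http://localhost:9990/admin/metrics.json > metrics-high-cpu.json
```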

JFR here: https://drive.google.com/file/d/0B7l2YR59dtHnSjk2RW9sM1drd0k/view?usp=sharing
Metrics: https://drive.google.com/file/d/0B7l2YR59dtHnX2szY1hkWnN2NHc/view?usp=sharing

Hi @taer, the metrics you gave me don’t appear to line up with the config file in this post. For example, the config file defines a router called edge but the metrics only contain information about a router called cgp. Did these come from different linkerd configurations?

No, that’s right. I initially sanitized the config file, but wasn’t about to try to sanitize the metrics. :slight_smile:

cgp == edge

Updated the config in the initial post

Thanks for the info! Digging into it a bit, I see a lot of activity in the internals of the consul namer. This could be due to a bug where the namer is over-active, or it may simply be processing a high rate of updates from consul and not being able to keep up.

It’s difficult to tell exactly what is going on from a small 2 minute recording. If you have graphs of CPU, heap size, and any other interesting metrics over a multi-day period, it would help us determine if this is due to something like a memory leak leading to high amounts of GC or something else. I’d also like to know what happens if you leave Linkerd running in this state. Does CPU eventually level out or does the OS eventually kill the process? If you are able to take a heap snapshot using something like jmap, that would also help us to examine the current state of Linkerd and determine if anything is wrong.
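For example, a heap dump could be taken with something like this (pid and output path are placeholders):

```
# Capture a heap dump of live objects from the running linkerd JVM.
# "live" triggers a full GC first; drop it to keep unreachable objects too.
jmap -dump:live,format=b,file=/tmp/linkerd-heap.hprof <linkerd-pid>
```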

Another interesting experiment you could run would be to sever Linkerd’s connection to consul (perhaps by killing the consul agent on a test node) and see if Linkerd’s CPU goes back to normal and if it begins to climb again if the consul agent is restarted.
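If consul runs as its own container or service on the test node, the experiment might look roughly like this (the container name and pgrep pattern are assumptions):

```
# Sever linkerd's connection to consul, watch CPU, then restore the connection.
docker stop consul-agent            # or however the consul agent is run on that node
top -p "$(pgrep -f linkerd)"        # does linkerd CPU fall back to normal?
docker start consul-agent           # does CPU start climbing again once consul is back?
```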

I stopped traffic to a prod server while keeping the app up, then restarted consul. I resumed traffic, and the CPU went back to the high level.

If we leave this running, the machine hits 100%, and our latency spikes to 15+ second p99 response times.

Here is a shared folder with more diag info
https://drive.google.com/open?id=0B7l2YR59dtHnZVh2R3lJWGZsUXM

We don’t have heap monitoring on the linkerd process, and the CPU is for the box as a whole. The JFR for linkerd shows the relative percentages.

I screenshotted the CPU usage climb as linkerd-cpu (that includes our app, which also climbs during the day).

There is one called linkerdRestarted that shows CPU usage of an instance I restarted.

I grabbed 4 jstacks about 3 seconds apart to get a poor man’s thread profile, plus a jmap -histo.
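Roughly the following, against linkerd’s pid (3841, see below):

```
# Poor man's thread profiler: four stack dumps, ~3 seconds apart.
for i in 1 2 3 4; do
  jstack 3841 > /tmp/linkerd-jstack-$i.txt
  sleep 3
done

# Histogram of heap objects by class.
jmap -histo 3841 > /tmp/linkerd-histo.txt
```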

The JFR is a 10-minute one. I can go longer this afternoon when traffic spikes up, if that would be more useful.

More info:

linkerd’s pid is 3841. It’s running in a docker container, talking to consul in a different docker container. They’re both in host networking mode, but we use the docker bridge IP to stay consistent with bridge-mode apps, hence the 172.17.0.1 address.

```
# netstat -atpn | grep 3841 | grep 8500 | grep ESTABLISHED | wc
     75     525    7275
```

75 connections between linkerd and consul. A sample of them:

```
tcp        0      0 172.17.0.1:59110        172.17.0.1:8500         ESTABLISHED 3841/java
tcp        0      0 172.17.0.1:59064        172.17.0.1:8500         ESTABLISHED 3841/java
tcp        0      0 172.17.0.1:59080        172.17.0.1:8500         ESTABLISHED 3841/java
tcp        0      0 172.17.0.1:59158        172.17.0.1:8500         ESTABLISHED 3841/java
```
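For context, the container setup is roughly along these lines (image names, versions, and flags here are placeholders, not our exact invocation):

```
# Both containers share the host network namespace; apps still address consul via the
# docker0 bridge IP (172.17.0.1) to stay consistent with bridge-mode containers.
docker run -d --net=host --name consul-agent consul agent -client=172.17.0.1
docker run -d --net=host --name linkerd \
  -v $(pwd)/linkerd.yaml:/config.yaml \
  buoyantio/linkerd:1.1.2 /config.yaml
```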

One more thing that I noticed in the /admin/metrics.json that you provided is that there are no /namer metrics, only /rt and /jvm. Are the /namer metrics really missing from your Linkerd or do you not collect these?

That’s the raw dump. I don’t think we disabled them. We do use a baseDtab in the linkerd.yaml file and no external namerd. Would that make a difference?

That’s very strange. There should be stats like:

  "namer/#/io.l5d.consul/client/request_payload_bytes.count": 0,
  "namer/#/io.l5d.consul/client/response_payload_bytes.count": 0,
  "namer/#/io.l5d.consul/client/namer/nametreecache/expires": 0,
  "namer/#/io.l5d.consul/client/namer/nametreecache/evicts": 0,
  "namer/#/io.l5d.consul/client/namer/nametreecache/misses": 0,
  "namer/#/io.l5d.consul/client/namer/nametreecache/oneshots": 0,
  "namer/#/io.l5d.consul/client/namer/dtabcache/expires": 0,
  "namer/#/io.l5d.consul/client/namer/dtabcache/evicts": 0,
  "namer/#/io.l5d.consul/client/namer/dtabcache/misses": 0,
  "namer/#/io.l5d.consul/client/namer/dtabcache/oneshots": 0,
  "namer/#/io.l5d.consul/client/namer/namecache/expires": 0,
  "namer/#/io.l5d.consul/client/namer/namecache/evicts": 0,
  "namer/#/io.l5d.consul/client/namer/namecache/misses": 0,
  "namer/#/io.l5d.consul/client/namer/namecache/oneshots": 0,
  "namer/#/io.l5d.consul/#/io.l5d.consul/lookups": 0,
  "namer/#/io.l5d.consul/#/io.l5d.consul/service/opens": 0,
  "namer/#/io.l5d.consul/#/io.l5d.consul/service/updates": 0,
  "namer/#/io.l5d.consul/#/io.l5d.consul/service/closes": 0,
  "namer/#/io.l5d.consul/#/io.l5d.consul/service/errors": 0

They were there. Somehow I managed to not grab the whole file. This instance was restarted yesterday, so it’s not at CPU pain levels yet, but here are the metrics for it.

I can’t attach drive links anymore, so here’s a gist instead: https://gist.github.com/taer/5a1e3c44fb9896dabd4412815e4d20ce

Hi @taer, I’ve recently made some changes to Consul watch handling that should hopefully resolve this issue. I’ve pushed a snapshot tagged buoyantio/linkerd:1.2.1-CONSUL-FIX-SNAPSHOT and buoyantio/namerd:1.2.1-CONSUL-FIX-SNAPSHOT to Docker Hub, and I’d love to know if you see improved performance running these images.
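For anyone following along, pulling the snapshot images looks like:

```
# Snapshot builds containing the consul watch-handling changes.
docker pull buoyantio/linkerd:1.2.1-CONSUL-FIX-SNAPSHOT
docker pull buoyantio/namerd:1.2.1-CONSUL-FIX-SNAPSHOT
```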

Placed 1.2.1 into prod on 5 of the 6 instances, and the test patched linkerd on the 6th.

I’ll follow up in a few days once the other 5 start growing. Are there any stats I can pull from the metrics that would verify this is helping?

It’s looking better so far. The blue line is the test instance. I’ll let the test continue to make sure, but I wanted to give some early feedback.


@taer Encouraging! Keep us up to date as the test progresses.

The test instance is starting to catch up with the rest. There is more traffic now, so it’s likely legit CPU. But the gap has vanished.

Overnight, it baselined at 25% while the non-patched hovered in the low 30s. Now (11am), they’re about the same in the 50s. I’ve grabbed a metrics.json for the patched and non-patched. I’ll check the CPU usage when traffic dies down tonight and report more.

Thanks, @taer! Sounds like the new build is an improvement but not a total solution. I’m really looking forward to seeing this next round of CPU profiles.

I restarted the 5 1.2.1.GA servers, leaving the patched server running (up ~6 days). The restarted 1.2.1 GA servers are totaling about 20% CPU usage, while the 1.2.1.patch one is about 45% at the same time.

JFRs
1.2.1.GA
https://drive.google.com/open?id=0B7l2YR59dtHna3dLeWFFVWt3aE0

1.2.1.patch
https://drive.google.com/open?id=0B7l2YR59dtHnc1RXaHB4UmFoVk0