Linkerd listeners killed by OOM, possible to restart?

A linkerd installation got into a strange state on one of our servers. The main process was running, but it was not listening on any of the configured listeners, including the admin server. Looking at the output of dmesg, I saw:

[1610458.538405] Out of memory: Kill process 2755 (java) score 25 or sacrifice child
[1610458.543553] Killed process 2755 (java) total-vm:3372444kB, anon-rss:196340kB, file-rss:0kB

So that explains it. The host’s memory shot up to around 96%, so I assume linkerd's listener process was sacrificed.

In this case, since the master process was still alive, is there a way to configure linkerd to restart the worker processes/threads/listeners if they die?

Hi @amitsaha! We don’t have a way to restart worker processes. More concerning, linkerd should not OOM like this. Something you can try immediately is setting JVM_HEAP_MIN and JVM_HEAP_MAX to something like 512M or 1024M (and be sure to set them to the same value).
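
For example, a minimal sketch of what that looks like if you launch linkerd via the bundled linkerd-*-exec wrapper (the paths here are placeholders for your own):

  # pin the JVM heap floor and ceiling to the same value, then launch linkerd
  JVM_HEAP_MIN=1024M JVM_HEAP_MAX=1024M /path/to/linkerd-1.x.x-exec /path/to/linkerd.yaml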

Can you tell us a bit more about your environment? Specifically:

  1. linkerd version (we have fixed some issues related to this in 1.3.0 and 1.3.1).
  2. linkerd config
  3. environment (kubernetes/dcos/etc), and how much memory are you giving it?

Thanks, will try those settings. The information regarding our linkerd setup is as follows:

  • Version: 1.2.1
  • Config: Talking to Python thrift services discovered via consul (> 5 servers and clients)
  • Environment: AWS Ubuntu 16.04 VM, shared with the service that’s using linkerd to talk to other services. I haven’t configured the systemd unit file to give it any specific memory limit. Do you have any suggestions?

Please let me know if I can furnish more information.

Setting JVM_HEAP_MIN and JVM_HEAP_MAX to 1024M should give you plenty of headroom for high (~40,000 qps) traffic. Whatever you set this to, we recommend configuring systemd to give the process about 50% more memory than the JVM heap; in this case, 1.5GB.
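
For example, a minimal sketch of a systemd drop-in along those lines (the unit name and file path are placeholders; MemoryLimit= is the directive available on Ubuntu 16.04’s systemd 229):

  # /etc/systemd/system/linkerd.service.d/memory.conf (hypothetical path)
  [Service]
  # pin the JVM heap as recommended above
  Environment=JVM_HEAP_MIN=1024M
  Environment=JVM_HEAP_MAX=1024M
  # cap the whole process at roughly 50% more than the heap
  MemoryLimit=1536M

Then run systemctl daemon-reload and restart the linkerd unit to pick up the change.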

Let us know how the 1.3.1 upgrade goes. If you can share your actual linkerd config yaml, that can be helpful too.

Thanks. We are currently running linkerd with the following Xms and Xmx settings:

 /usr/bin/java -XX:+PrintCommandLineFlags -Djava.net.preferIPv4Stack=true -Dsun.net.inetaddr.ttl=60 -Xms32M -Xmx1024M -XX:+AggressiveOpts -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:+CMSClassUnloadingEnabled

So I think that’s the same as specifying the JVM_ variables above?

Here’s our current config:

admin:
  ip: 0.0.0.0
  port: 9990

telemetry:
- kind: io.l5d.prometheus


namers:
 - kind: io.l5d.consul
   host: consul.host.name
   useHealthCheck: true
   prefix: /default
 - kind: io.l5d.consul
   host: consul.host.name
   useHealthCheck: true
   healthStatuses:
   - warning
   prefix: /fallback

routers:
- protocol: thrift
  label: service1
  thriftProtocol: binary
  thriftMethodInDst: true
  dtab: |
    /svc => /#/fallback/.local/service1;
    /svc => /#/default/.local/service1;
  client:
    thriftFramed: false
    failureAccrual:
      kind: io.l5d.consecutiveFailures
      failures: 5
      backoff:
        kind: jittered
        minMs: 5000
        maxMs: 300000
  servers:
  - ip: 0.0.0.0
    port: 11010
    thriftFramed: false
....
<few more routers here - thrift Python>

Correct, the linkerd executable we ship maps JVM_HEAP_MIN and JVM_HEAP_MAX to -Xms and -Xmx, respectively: https://github.com/linkerd/linkerd/blob/master/project/LinkerdBuild.scala#L220
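
In other words (a sketch; the executable and config names here are only illustrative), launching with

  JVM_HEAP_MIN=512M JVM_HEAP_MAX=512M ./linkerd-1.3.1-exec linkerd.yaml

starts java with -Xms512M -Xmx512M, the same effect as passing those flags by hand.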
