Possible Memory leak

We’re seeing our RSS sizes for linkerd grow till it gets gets killed for being over it’s docker limit of 2G.

Here is one that is currently close
RSS: 2048536

Seems to be similar to Linkerd: Non-heap Memory Leak on 1.3.1

Attached 4 screenshots of memory graphs. All are over 2 weeks. One is RSS, the other 3 are from metrics.json
/jvm/heap/used
/jvm/nonheap/used
/jvm/mem/current/used

Attached metrics.json and config.

config.json (761 Bytes)

metrics.json (49.9 KB)

This is with linkerd 1.3.5

Adding same data the prior reporter did. I’ll look into getting NMT enabled as well.

jstat gc

root@aspmes021:/io.buoyant/linkerd# jstat -gc 13
 S0C    S1C    S0U    S1U      EC       EU        OC         OU       MC     MU    CCSC   CCSU   YGC     YGCT    FGC    FGCT     GCT   
1088.0 1088.0  0.0   560.0   8704.0   4356.2   287168.0   169919.8  59616.0 54080.3 9240.0 8725.3  99317 1352.729 3354    58.460 1411.188

And

# cat /proc//13/status
Name:	java
State:	S (sleeping)
Tgid:	13
Ngid:	0
Pid:	13
PPid:	1
TracerPid:	0
Uid:	0	0	0	0
Gid:	0	0	0	0
FDSize:	1024
Groups:	
NStgid:	13
NSpid:	13
NSpgid:	1
NSsid:	1
VmPeak:	16034000 kB
VmSize:	16023892 kB
VmLck:	       0 kB
VmPin:	       0 kB
VmHWM:	 2024476 kB
VmRSS:	 2019504 kB
VmData:	15949088 kB
VmStk:	     132 kB
VmExe:	       4 kB
VmLib:	   19408 kB
VmPTE:	    5104 kB
VmPMD:	      72 kB
VmSwap:	       0 kB
HugetlbPages:	       0 kB
Threads:	201
SigQ:	0/1031244
SigPnd:	0000000000000000
ShdPnd:	0000000000000000
SigBlk:	0000000000000000
SigIgn:	0000000000000000
SigCgt:	2000000181005ccf
CapInh:	00000000a80425fb
CapPrm:	00000000a80425fb
CapEff:	00000000a80425fb
CapBnd:	00000000a80425fb
CapAmb:	0000000000000000
Seccomp:	2
Cpus_allowed:	00000000,00000000,00000000,00000000,0000ffff,ffffffff
Cpus_allowed_list:	0-47
Mems_allowed:	00000000,00000003
Mems_allowed_list:	0-1
voluntary_ctxt_switches:	17
nonvoluntary_ctxt_switches:	2

OOM kill from syslog

Feb 28 17:39:14 aspmes025 kernel: java invoked oom-killer: gfp_mask=0x24000c0, order=0, oom_score_adj=0
Feb 28 17:39:14 aspmes025 kernel: java cpuset=a2317b65c5e61876dacbf79bfc3d1a58decc76c396760129bdb1f8f420a37e58 mems_allowed=0-1
Feb 28 17:39:14 aspmes025 kernel: CPU: 2 PID: 4903 Comm: java Not tainted 4.4.0-98-generic #121-Ubuntu
Feb 28 17:39:14 aspmes025 kernel: Hardware name: Dell Inc. PowerEdge R630/0CNCJW, BIOS 1.3.6 06/03/2015
Feb 28 17:39:14 aspmes025 kernel:  0000000000000286 9f461285ce665f59 ffff8838f425fc88 ffffffff813fb2c3
Feb 28 17:39:14 aspmes025 kernel:  ffff8838f425fd68 ffff883f3de64600 ffff8838f425fcf8 ffffffff8120ca7e
Feb 28 17:39:14 aspmes025 kernel:  ffff881ffec56e00 ffff8838f425fcc8 ffffffff81192e8b ffff881feaded400
Feb 28 17:39:14 aspmes025 kernel: Call Trace:
Feb 28 17:39:14 aspmes025 kernel:  [<ffffffff813fb2c3>] dump_stack+0x63/0x90
Feb 28 17:39:14 aspmes025 kernel:  [<ffffffff8120ca7e>] dump_header+0x5a/0x1c5
Feb 28 17:39:14 aspmes025 kernel:  [<ffffffff81192e8b>] ? find_lock_task_mm+0x3b/0x80
Feb 28 17:39:14 aspmes025 kernel:  [<ffffffff81193452>] oom_kill_process+0x202/0x3c0
Feb 28 17:39:14 aspmes025 kernel:  [<ffffffff81200654>] ? mem_cgroup_iter+0x204/0x390
Feb 28 17:39:14 aspmes025 kernel:  [<ffffffff812026b3>] mem_cgroup_out_of_memory+0x2b3/0x300
Feb 28 17:39:14 aspmes025 kernel:  [<ffffffff81203488>] mem_cgroup_oom_synchronize+0x338/0x350
Feb 28 17:39:14 aspmes025 kernel:  [<ffffffff811fe560>] ? kzalloc_node.constprop.49+0x20/0x20
Feb 28 17:39:14 aspmes025 kernel:  [<ffffffff81193b04>] pagefault_out_of_memory+0x44/0xc0
Feb 28 17:39:14 aspmes025 kernel:  [<ffffffff8106b302>] mm_fault_error+0x82/0x160
Feb 28 17:39:14 aspmes025 kernel:  [<ffffffff8106b7b8>] __do_page_fault+0x3d8/0x400
Feb 28 17:39:14 aspmes025 kernel:  [<ffffffff8106b802>] do_page_fault+0x22/0x30
Feb 28 17:39:14 aspmes025 kernel:  [<ffffffff81846838>] page_fault+0x28/0x30
Feb 28 17:39:14 aspmes025 kernel: Task in /docker/a2317b65c5e61876dacbf79bfc3d1a58decc76c396760129bdb1f8f420a37e58 killed as a result of limit of /docker/a2317b65c5e61876dacbf79bfc3d1a58decc76c396760129bdb1f8f420a37e58
Feb 28 17:39:14 aspmes025 kernel: memory: usage 2048000kB, limit 2048000kB, failcnt 368
Feb 28 17:39:14 aspmes025 kernel: memory+swap: usage 0kB, limit 9007199254740988kB, failcnt 0
Feb 28 17:39:14 aspmes025 kernel: kmem: usage 0kB, limit 9007199254740988kB, failcnt 0
Feb 28 17:39:14 aspmes025 kernel: Memory cgroup stats for /docker/a2317b65c5e61876dacbf79bfc3d1a58decc76c396760129bdb1f8f420a37e58: cache:52KB rss:2047948KB rss_huge:1798144KB mapped_file:0KB dirty:0KB writeback:0KB inactive_anon:12KB active_anon:2045920KB inactive_file:0KB active_file:0KB unevictable:0KB
Feb 28 17:39:14 aspmes025 kernel: [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
Feb 28 17:39:14 aspmes025 kernel: [ 4499]     0  4499     4519     2727      13       5        0             0 containerpilot
Feb 28 17:39:14 aspmes025 kernel: [ 4518]     0  4518  4022613   515645    1297      19        0             0 java
Feb 28 17:39:14 aspmes025 kernel: Memory cgroup out of memory: Kill process 4518 (java) score 1009 or sacrifice child
Feb 28 17:39:14 aspmes025 kernel: Killed process 4518 (java) total-vm:16090452kB, anon-rss:2042416kB, file-rss:20260kB

Jstack

jstac (162.5 KB)

I grabbed 2 baselines for the linkerd process. One after startup, which I then diffed once the heap got to it’s normal operating area(where it was no longer growing). That’s the first one. It’s not terribly interesting.

I grabbed a new baseline, and left it for 13 hours(overnight, low traffic time). In that time, the RSS grew from 1.05G to 1.3G. Looking at that NMT, you can see the only major growth was internal at +112,784KB. Will update again once it gets close to the 2G mark.

The JVM is

# java -version
java version "1.8.0_152"
Java(TM) SE Runtime Environment (build 1.8.0_152-b16)
Java HotSpot(TM) 64-Bit Server VM (build 25.152-b16, mixed mode)

NMT: Startup to heap fill

Total: reserved=2683234KB +66369KB, committed=804898KB +217801KB

-                 Java Heap (reserved=1048576KB, committed=258688KB +149796KB)
                            (mmap: reserved=1048576KB, committed=258688KB +149796KB)
 
-                     Class (reserved=1103438KB +2639KB, committed=62126KB +1999KB)
                            (classes #9334 +62)
                            (malloc=3662KB +591KB #15829 +1351)
                            (mmap: reserved=1099776KB +2048KB, committed=58464KB +1408KB)
 
-                    Thread (reserved=174472KB +8260KB, committed=174472KB +8260KB)
                            (thread #170 +8)
                            (stack: reserved=173732KB +8224KB, committed=173732KB +8224KB)
                            (malloc=542KB +27KB #890 +40)
                            (arena=198KB +9 #338 +16)
 
-                      Code (reserved=50781KB +238KB, committed=6061KB +1926KB)
                            (malloc=861KB +238KB #2112 -77)
                            (mmap: reserved=49920KB, committed=5200KB +1688KB)
 
-                        GC (reserved=77390KB +10KB, committed=74974KB +598KB)
                            (malloc=73970KB +10KB #448 +15)
                            (mmap: reserved=3420KB, committed=1004KB +588KB)
 
-                  Compiler (reserved=413KB +104KB, committed=413KB +104KB)
                            (malloc=283KB +104KB #588 +229)
                            (arena=130KB #2)
 
-                  Internal (reserved=169173KB +56479KB, committed=169173KB +56479KB)
                            (malloc=169141KB +56479KB #16560 +297)
                            (mmap: reserved=32KB, committed=32KB)
 
-                    Symbol (reserved=12346KB +15KB, committed=12346KB +15KB)
                            (malloc=10867KB +15KB #86423 +145)
                            (arena=1479KB #1)
 
-    Native Memory Tracking (reserved=1951KB +32KB, committed=1951KB +32KB)
                            (malloc=21KB +1KB #236 +10)
                            (tracking overhead=1930KB +31KB)
 
-               Arena Chunk (reserved=830KB -1407KB, committed=830KB -1407KB)
                            (malloc=830KB -1407KB)
 
-                   Unknown (reserved=43864KB, committed=43864KB)
                            (mmap: reserved=43864KB, committed=43864KB)

NMT post heap fill, 13 hours later

Total: reserved=2797899KB +114277KB, committed=939339KB +134053KB

-                 Java Heap (reserved=1048576KB, committed=277936KB +19248KB)
                            (mmap: reserved=1048576KB, committed=277936KB +19248KB)
 
-                     Class (reserved=1104022KB +584KB, committed=62966KB +840KB)
                            (classes #9346 +12)
                            (malloc=4246KB +584KB #16695 +868)
                            (mmap: reserved=1099776KB, committed=58720KB +256KB)
 
-                    Thread (reserved=174472KB, committed=174472KB)
                            (thread #170)
                            (stack: reserved=173732KB, committed=173732KB)
                            (malloc=542KB #890)
                            (arena=198KB #338)
 
-                      Code (reserved=50921KB +135KB, committed=6397KB +331KB)
                            (malloc=1001KB +135KB #2332 +215)
                            (mmap: reserved=49920KB, committed=5396KB +196KB)
 
-                        GC (reserved=77391KB +1KB, committed=75051KB +77KB)
                            (malloc=73971KB +1KB #455 +7)
                            (mmap: reserved=3420KB, committed=1080KB +76KB)
 
-                  Compiler (reserved=456KB +43KB, committed=456KB +43KB)
                            (malloc=326KB +43KB #643 +51)
                            (arena=130KB #2)
 
-                  Internal (reserved=282982KB +112784KB, committed=282982KB +112784KB)
                            (malloc=282950KB +112784KB #16739 +178)
                            (mmap: reserved=32KB, committed=32KB)
 
-                    Symbol (reserved=12349KB +4KB, committed=12349KB +4KB)
                            (malloc=10870KB +4KB #86459 +36)
                            (arena=1479KB #1)
 
-    Native Memory Tracking (reserved=1972KB +22KB, committed=1972KB +22KB)
                            (malloc=21KB #236)
                            (tracking overhead=1951KB +22KB)
 
-               Arena Chunk (reserved=894KB +703KB, committed=894KB +703KB)
                            (malloc=894KB +703KB)
 
-                   Unknown (reserved=43864KB, committed=43864KB)
                            (mmap: reserved=43864KB, committed=43864KB)

Looks like after providing the suggestion below, it has now stabilized for them. I found this through Evans blog where this option instructs java nio to not cache the buffers over 256KB.

-Djdk.nio.maxCachedBufferSize=262144

http://www.evanjones.ca/java-bytebuffer-leak.html

3 Likes

Thanks for tracking that down, @chrisgoffinet! Since this seems like a good idea to do by default, would you be interested in the fame and glory of submitting a PR for this flag for Linkerd?

This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.