Linkerd pods in strange state - possible memory leak

Hi, everyone!
I've run into an issue with our linkerd pods and wanted to share it here to find out why it's happening.

Our basic setup:
Linkerd: 1.1.2 (also happened with 1.2.0 and 1.3.0)
Kubernetes: 1.7.0 (also happened with 1.7.8)
gRPC: 1.5.0 (also happened with 1.3.0)

We run linkerd on the server side as a load balancer for our gRPC application on Kubernetes. The clients don't run linkerd themselves.

Our linkerd configuration:

apiVersion: v1
kind: ConfigMap
metadata:
  name: l5d-config
  namespace: (...)
data:
  config.yaml: |-
    admin:
      ip: 0.0.0.0
      port: 9990
    namers:
    - kind: io.l5d.k8s
      experimental: true
      host: localhost
      port: 8001
    telemetry:
    - kind: io.l5d.prometheus
    - kind: io.l5d.recentRequests
      sampleRate: 0.25
    usage:
      orgId: linkerd-daemonset-grpc
    routers:
    - protocol: h2
      label: outgoing
      experimental: true
      dtab: |
        (...)
      identifier:
        kind: io.l5d.header.path
        segments: 1
      servers:
      - port: 4140
        ip: 0.0.0.0

apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  labels:
    app: l5d
  name: l5d
  namespace: (...)
spec:
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
  template:
    metadata:
      labels:
        app: l5d
      annotations:
        linkerd.io/scrape: 'true'
        prometheus.io/scrape: 'true'
    spec:
      volumes:
      - name: l5d-config
        configMap:
          name: "l5d-config"
      containers:
      - name: l5d
        image: buoyantio/linkerd:1.1.2
        env:
        - name: POD_IP
          valueFrom:
            fieldRef:
              fieldPath: status.podIP
        args:
        - /io.buoyant/linkerd/config/config.yaml
        ports:
        - name: outgoing
          containerPort: 4140
          hostPort: 4140
        - name: admin
          containerPort: 9990
        volumeMounts:
        - name: "l5d-config"
          mountPath: "/io.buoyant/linkerd/config"
          readOnly: true
      - name: kubectl
        image: buoyantio/kubectl:v1.6.2
        args:
        - "proxy"
        - "-p"
        - "8001"

apiVersion: v1
kind: Service
metadata:
  annotations:
    "service.beta.kubernetes.io/aws-load-balancer-internal": 0.0.0.0/0
  name: l5d
  namespace: (...)
spec:
  selector:
    app: l5d
  type: LoadBalancer
  ports:
  - name: outgoing
    port: 4140
  - name: incoming
    port: 4141
  - name: admin
    port: 9990
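
For reference, the clients just open a plain gRPC channel to this l5d Service on port 4140; linkerd then routes on the first segment of the HTTP/2 :path header (the fully qualified gRPC service name), per the io.l5d.header.path identifier above. A minimal grpc-java sketch, where the hostname and the generated stubs (DemoServiceGrpc, ItemRequest, ItemReply) are hypothetical stand-ins for our actual API:

import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;

public class DemoClient {
  public static void main(String[] args) throws InterruptedException {
    // "l5d.example.internal" stands in for the DNS name of the internal
    // LoadBalancer created by the l5d Service above.
    ManagedChannel channel = ManagedChannelBuilder
        .forAddress("l5d.example.internal", 4140)
        .usePlaintext(true)  // grpc-java 1.5.x signature; newer releases use usePlaintext()
        .build();

    // The generated stub sets :path to /demo.DemoService/GetItem, so the
    // io.l5d.header.path identifier (segments: 1) routes this request by the
    // first path segment, "demo.DemoService".
    DemoServiceGrpc.DemoServiceBlockingStub stub = DemoServiceGrpc.newBlockingStub(channel);
    ItemReply reply = stub.getItem(ItemRequest.getDefaultInstance());
    System.out.println(reply);

    channel.shutdown();
  }
}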

Once the clients started using our application, the memory usage of our linkerd pods started climbing, one pod after another. None of the pods were killed; instead, each pod hit a threshold at which it stopped receiving requests. Once all pods reached this state, the clients couldn't reach the API at all.
We tried several version changes across linkerd, gRPC, and Kubernetes, but the issue persisted. Keep in mind that our other server applications use the same configuration and don't show this issue.
Our application deliberately returns a gRPC status exception for almost half of its requests. As a test, we removed that exception and the issue stopped happening.
Still, why do the linkerd pods reach that state when exceptions are returned? When an exception is thrown, how does linkerd handle that connection?
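
For reference, the deliberate error path looks roughly like this on the server side; a minimal grpc-java sketch, where the service and generated stubs (DemoServiceGrpc, ItemRequest, ItemReply) are hypothetical stand-ins for our actual API:

import io.grpc.Status;
import io.grpc.stub.StreamObserver;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical service implementation that completes roughly half of its
// calls with a gRPC status exception instead of a normal response.
public class DemoServiceImpl extends DemoServiceGrpc.DemoServiceImplBase {

  private final AtomicLong counter = new AtomicLong();

  @Override
  public void getItem(ItemRequest request, StreamObserver<ItemReply> responseObserver) {
    if (counter.incrementAndGet() % 2 == 0) {
      // Ends the RPC with a non-OK grpc-status in the response trailers.
      responseObserver.onError(
          Status.FAILED_PRECONDITION
              .withDescription("item not available")
              .asRuntimeException());
      return;
    }
    responseObserver.onNext(ItemReply.getDefaultInstance());
    responseObserver.onCompleted();
  }
}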

Hey @brendel, thanks for filing this! This looks like a new issue. (We had some earlier k8s memory-leak fixes, but those were included in 1.3.0, and I see you've already tried that version.) Can you file a bug report? Some metrics dumps from affected linkerds (in addition to the config you posted above) would be helpful. Thanks!!
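If it helps, one way to capture a dump is to port-forward to an affected pod and pull the admin metrics endpoint on port 9990 (pod name and namespace below are placeholders):

kubectl -n <namespace> port-forward <l5d-pod> 9990:9990
curl -s http://localhost:9990/admin/metrics.json > metrics.json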

Update: we’re currently tracking this issue at https://github.com/linkerd/linkerd/issues/1685