I ran into an issue with my linkerd pods and wanted to share it here to find out why it happened.
Our basic setup:
Linkerd 1.1.2 (also happened with 1.2.0 and 1.3.0)
Kubernetes 1.7.0 (also happened with 1.7.8)
gRPC 1.5.0 (also happened with 1.3.0)
We use linkerd on the server side, as a load balancer for our gRPC application running on Kubernetes. The clients don’t use linkerd; they connect to the l5d Service directly (see the client sketch after the configuration below).
Our linkerd configuration:
apiVersion: v1
kind: ConfigMap
metadata:
  name: l5d-config
  namespace: (...)
data:
  config.yaml: |-
    admin:
      ip: 0.0.0.0
      port: 9990
    namers:
    - kind: io.l5d.k8s
      experimental: true
      host: localhost
      port: 8001
    telemetry:
    - kind: io.l5d.prometheus
    - kind: io.l5d.recentRequests
      sampleRate: 0.25
    usage:
      orgId: linkerd-daemonset-grpc
    routers:
    - protocol: h2
      label: outgoing
      experimental: true
      dtab: |
        (...)
      identifier:
        kind: io.l5d.header.path
        segments: 1
      servers:
      - port: 4140
        ip: 0.0.0.0
---
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  labels:
    app: l5d
  name: l5d
  namespace: (...)
spec:
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
  template:
    metadata:
      labels:
        app: l5d
      annotations:
        linkerd.io/scrape: 'true'
        prometheus.io/scrape: 'true'
    spec:
      volumes:
      - name: l5d-config
        configMap:
          name: "l5d-config"
      containers:
      - name: l5d
        image: buoyantio/linkerd:1.1.2
        env:
        - name: POD_IP
          valueFrom:
            fieldRef:
              fieldPath: status.podIP
        args:
        - /io.buoyant/linkerd/config/config.yaml
        ports:
        - name: outgoing
          containerPort: 4140
          hostPort: 4140
        - name: admin
          containerPort: 9990
        volumeMounts:
        - name: "l5d-config"
          mountPath: "/io.buoyant/linkerd/config"
          readOnly: true
      - name: kubectl
        image: buoyantio/kubectl:v1.6.2
        args:
        - "proxy"
        - "-p"
        - "8001"
---
apiVersion: v1
kind: Service
metadata:
  annotations:
    "service.beta.kubernetes.io/aws-load-balancer-internal": 0.0.0.0/0
  name: l5d
  namespace: (...)
spec:
  selector:
    app: l5d
  type: LoadBalancer
  ports:
  - name: outgoing
    port: 4140
  - name: incoming
    port: 4141
  - name: admin
    port: 9990
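For context, this is roughly what the client side looks like. This is only a minimal sketch with grpc-java: the hostname is a placeholder for whatever resolves to the l5d LoadBalancer Service, and usePlaintext assumes h2 without TLS, which matches the router configuration above.

import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;

public class LinkerdClientExample {
    public static void main(String[] args) {
        // Clients point straight at the l5d Service (hostname is a placeholder).
        // The h2 router on port 4140 then routes to a backend pod based on the
        // io.l5d.header.path identifier (first segment of the gRPC request path).
        ManagedChannel channel = ManagedChannelBuilder
                .forAddress("l5d.example.internal", 4140)
                .usePlaintext(true)  // plaintext h2, matching the router config
                .build();

        // A generated stub would be created on top of this channel, e.g.:
        // MyServiceGrpc.MyServiceBlockingStub stub = MyServiceGrpc.newBlockingStub(channel);

        channel.shutdown();
    }
}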
Once the clients started using our application, the memory usage of the linkerd pods started increasing, one pod at a time. None of the pods were killed. Instead, at a certain point each pod simply stopped receiving requests. Once all pods reached this state, the clients couldn’t reach the API at all anymore.
We tried several version changes of linkerd, gRPC and Kubernetes, but the issue persisted. Keep in mind that our other server applications use the same configuration and didn’t show this issue.
Our application intentionally returns a gRPC status exception for almost half of the requests. As a test, we removed that exception and the issue stopped happening.
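For reference, the intentional error path looks roughly like this on our side. This is only a sketch with grpc-java; the class name, status code and message are illustrative, not our exact ones.

import io.grpc.Status;
import io.grpc.StatusRuntimeException;
import io.grpc.stub.StreamObserver;

public final class GrpcErrorExample {

    // Roughly how our handlers terminate about half of the calls: instead of
    // onNext/onCompleted, the stream is closed with a non-OK gRPC status.
    static void rejectCall(StreamObserver<?> responseObserver) {
        StatusRuntimeException error = Status.FAILED_PRECONDITION
                .withDescription("request rejected on purpose")
                .asRuntimeException();
        responseObserver.onError(error);
    }
}

As far as I understand, on the wire this ends up as an HTTP/2 response whose trailers carry a non-zero grpc-status, which is what linkerd sees for those calls.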
Still, why do the linkerd pods end up in that state when the application returns exceptions? When the server responds with a gRPC status exception, how does linkerd handle that stream/connection?