Hi, everyone!
I ran into an issue with my linkerd pods and wanted to share it here to find out why it happens.
Our basic setup:
Linkerd 1.1.2 (also happened with 1.2.0 and 1.3.0)
Kubernetes 1.7.0 (also happened with 1.7.8)
gRPC 1.5.0 (also happened with 1.3.0)
We use linkerd on the server side as a load balancer for our gRPC application running on Kubernetes. The clients don't use linkerd; they dial the l5d Service directly.
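To illustrate the client side, here is a minimal grpc-java sketch. The hostname and the commented-out stub are placeholders, not our real names; the point is just that the client dials linkerd's outgoing port and linkerd routes by gRPC service name via the io.l5d.header.path identifier:

import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;

public class ClientSketch {
    public static void main(String[] args) {
        // Placeholder hostname: in our setup this is the internal AWS ELB created by
        // the l5d LoadBalancer Service; 4140 is linkerd's "outgoing" h2 port.
        ManagedChannel channel = ManagedChannelBuilder
                .forAddress("l5d.internal.example", 4140)
                .usePlaintext(true)  // grpc-java 1.5.0 signature; cleartext h2 to linkerd
                .build();

        // The io.l5d.header.path identifier (segments: 1) names each request by the first
        // segment of the :path pseudo-header, i.e. the gRPC service name, and the dtab
        // maps that name to the backing Kubernetes service.
        // MyServiceGrpc.newBlockingStub(channel).myMethod(request);  // placeholder generated stub

        channel.shutdown();
    }
}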
Our linkerd configuration:
apiVersion: v1
kind: ConfigMap
metadata:
  name: l5d-config
  namespace: (...)
data:
  config.yaml: |-
    admin:
      ip: 0.0.0.0
      port: 9990
    namers:
    - kind: io.l5d.k8s
      experimental: true
      host: localhost
      port: 8001
    telemetry:
    - kind: io.l5d.prometheus
    - kind: io.l5d.recentRequests
      sampleRate: 0.25
    usage:
      orgId: linkerd-daemonset-grpc
    routers:
    - protocol: h2
      label: outgoing
      experimental: true
      dtab: |
        (...)
      identifier:
        kind: io.l5d.header.path
        segments: 1
      servers:
      - port: 4140
        ip: 0.0.0.0
---
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  labels:
    app: l5d
  name: l5d
  namespace: (...)
spec:
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
  template:
    metadata:
      labels:
        app: l5d
      annotations:
        linkerd.io/scrape: 'true'
        prometheus.io/scrape: 'true'
    spec:
      volumes:
      - name: l5d-config
        configMap:
          name: "l5d-config"
      containers:
      - name: l5d
        image: buoyantio/linkerd:1.1.2
        env:
        - name: POD_IP
          valueFrom:
            fieldRef:
              fieldPath: status.podIP
        args:
        - /io.buoyant/linkerd/config/config.yaml
        ports:
        - name: outgoing
          containerPort: 4140
          hostPort: 4140
        - name: admin
          containerPort: 9990
        volumeMounts:
        - name: "l5d-config"
          mountPath: "/io.buoyant/linkerd/config"
          readOnly: true
      - name: kubectl
        image: buoyantio/kubectl:v1.6.2
        args:
        - "proxy"
        - "-p"
        - "8001"
---
apiVersion: v1
kind: Service
metadata:
  annotations:
    "service.beta.kubernetes.io/aws-load-balancer-internal": 0.0.0.0/0
  name: l5d
  namespace: (...)
spec:
  selector:
    app: l5d
  type: LoadBalancer
  ports:
  - name: outgoing
    port: 4140
  - name: incoming
    port: 4141
  - name: admin
    port: 9990
Once clients started using our application, the memory usage of our linkerd pods started increasing, one pod at a time. None of the pods were killed; instead, each pod reached a threshold at which it stopped receiving requests. Once all pods reached that state, the clients could no longer reach the API.
We tried several version changes (linkerd, gRPC, and Kubernetes), but the issue persisted. Note that our other server applications use the same configuration and don't seem to have this issue.
Our application deliberately returns a gRPC status exception for almost half of the requests. As a test, we removed this exception and the issue stopped happening.
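For reference, the deliberate failure path looks roughly like the sketch below. The status code, message, and generic types are illustrative only, not our real ones; it just mirrors what our generated-stub handler does with the StreamObserver:

import io.grpc.Status;
import io.grpc.stub.StreamObserver;

public class RejectionSketch {
    // Roughly half of the calls are completed with onError(...) instead of a normal reply.
    public static <Reply> void handle(StreamObserver<Reply> responseObserver,
                                      boolean reject, Reply reply) {
        if (reject) {
            // Deliberate failure: the call ends with a gRPC status exception.
            responseObserver.onError(
                    Status.FAILED_PRECONDITION
                          .withDescription("rejected on purpose")
                          .asRuntimeException());
            return;
        }
        responseObserver.onNext(reply);
        responseObserver.onCompleted();
    }
}

On the wire, as far as I understand, this ends the stream with a non-zero grpc-status trailer, so that is what linkerd sees for roughly half the calls.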
Still, why do the linkerd pods reach that state when the application returns exceptions? When an exception is thrown, how does linkerd handle that connection?