Linkerd briefly fails to find services after kubernetes rolling deploy


#1

I have a Kubernetes setup with multiple microservices. Two of these services communicate over gRPC, so I have them set up to use linkerd for load balancing. I found that deploying a new version of the server microservice caused a temporary (~20-second) spike in the error rate (~75% of requests failing, measured via Datadog metrics emitted by the client when a request failed). Redeployment was done by running kubectl apply -f <file>, and I was running slow-cooker throughout the process to keep a ‘stable’ load on the service.

I’m pretty sure this isn’t supposed to happen, but I can’t find much information on how new deployments are supposed to be performed with linkerd. One suggestion I found was to create two deployments side by side, and to reroute traffic to the new deployment once it is stable, but this isn’t a good solution for me…

The error we saw from the grpc client was:
Could not get <resource>: rpc error: code = Unavailable desc = stream terminated by RST_STREAM with error code: REFUSED_STREAM
(note: we are not using grpc streaming)

The linkerd logs were as follows:
com.twitter.finagle.NoBrokersAvailableException: No hosts are available for <service>, Dtab.base=[/srv=>/#/io.l5d.k8s/default/grpc/<service>;/grpc=>/srv;/svc=>/$/io.buoyant.http.domainToPathPfx/grpc], Dtab.local=[]. Remote Info: Not Available

I 0418 09:23:26.232 UTC THREAD23: no available endpoints

I 0418 09:23:26.225 UTC THREAD24 TraceId:0df797a5d09775f9: %/io.l5d.k8s.localnode/10.60.2.71/#/io.l5d.k8s/default/grpc/<service>: name resolution is negative (local dtab: Dtab())

The linkerd config:

# runs linkerd in a daemonset, in linker-to-linker mode, routing gRPC requests
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: l5d-config
data:
  config.yaml: |-
    admin:
      ip: 0.0.0.0
      port: 9990
    namers:
    - kind: io.l5d.k8s
      host: localhost
      port: 8001
    routers:
    - protocol: h2
      label: incoming
      experimental: true
      identifier:
        kind: io.l5d.header.path
        segments: 1
      interpreter:
        kind: default
        transformers:
        - kind: io.l5d.k8s.localnode
      dtab: |
        /srv        => /#/io.l5d.k8s/default/grpc/myservice;
        /grpc/*     => /srv/myservice;
        /svc        => /$/io.buoyant.http.domainToPathPfx/grpc;
      servers:
      - port: 4141
        ip: 0.0.0.0
    - protocol: h2
      label: outgoing
      experimental: true
      dtab: |
        /srv        => /#/io.l5d.k8s/default/grpc/myservice;
        /grpc       => /srv;
        /svc        => /$/io.buoyant.http.domainToPathPfx/grpc;
      identifier:
        kind: io.l5d.header.path
        segments: 1
      interpreter:
        kind: default
        transformers:
        - kind: io.l5d.k8s.daemonset
          namespace: default
          port: incoming
          service: linkerd
      servers:
      - port: 4140
        ip: 0.0.0.0
---
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  labels:
    app: linkerd
  name: linkerd
spec:
  template:
    metadata:
      labels:
        app: linkerd
    spec:
      volumes:
      - name: l5d-config
        configMap:
          name: "l5d-config"
      containers:
      - name: linkerd
        image: buoyantio/linkerd:1.3.7
        env:
        - name: POD_IP
          valueFrom:
            fieldRef:
              fieldPath: status.podIP
        args:
        - /io.buoyant/linkerd/config/config.yaml
        ports:
        - name: outgoing
          containerPort: 4140
          hostPort: 4140
        - name: incoming
          containerPort: 4141
        - name: admin
          containerPort: 9990
        volumeMounts:
        - name: "l5d-config"
          mountPath: "/io.buoyant/linkerd/config"
          readOnly: true

      - name: kubectl
        image: buoyantio/kubectl:v1.8.5
        args:
        - "proxy"
        - "-p"
        - "8001"
---
apiVersion: v1
kind: Service
metadata:
  name: linkerd
  labels:
    app: myservice
spec:
  selector:
    app: linkerd
  type: ClusterIP
  ports:
  - name: incoming
    port: 4141
  - name: outgoing
    port: 4140
  - name: admin
    port: 9990

#2

Hi @Louisebc!

What kind of Kubernetes object manages the pods for your service? Is it a Deployment? ReplicationController? The NoBrokersAvailableException means that Linkerd couldn’t find any pods to route to. This might be because all of the pods are restarting at once. You may want to look into rolling updates so that there are always pods available, even during a deploy.
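For reference, a Deployment can be told never to take down all pods at once. A minimal sketch of the relevant fragment (the exact replica count and surge values here are illustrative, not taken from your setup):

```yaml
# Fragment of the server's Deployment spec: keep at least one pod Ready
# at all times during a rollout.
spec:
  replicas: 2
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0   # never remove a pod before its replacement is Ready
      maxSurge: 1         # add at most one extra pod during the rollout
```

With maxUnavailable set to 0, Kubernetes waits for each new pod to pass readiness before terminating an old one, so the endpoint set should never be empty.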

Hope this helps.


#3

Hi Alex,

The services are both Deployments with rolling updates. It has always worked as expected until we started using linkerd…


#4

Ah, ok. Very interesting. One thing you can do to investigate this is to watch (continuously refresh) the /client_state.json endpoint on linkerd’s admin server while you do the deploy. This endpoint shows the internal state of linkerd’s load balancers: you should see the old IP addresses disappear as the old pods are deleted, and the new IP addresses appear as the new pods are created. As long as the rolling deploy isn’t too fast, this list should never become empty. If the state in /client_state.json doesn’t match the current set of pods in Kubernetes, that could be the source of the problem.
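Concretely, something like this, polling the admin port from the config above (9990) on a node running a linkerd daemonset pod:

```shell
# Poll linkerd's load-balancer state once a second during the deploy.
# 9990 is the admin port from the ConfigMap above; run this on (or port-forward
# to) a node where a linkerd daemonset pod is running.
watch -n 1 'curl -s http://localhost:9990/client_state.json'
```

Old pod IPs should drop out of the per-client endpoint lists and new ones should appear as the rollout proceeds; if a list goes empty, that window should line up with the NoBrokersAvailableException errors.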


#5

@Louisebc did you get a chance to try Alex’s suggestion?


#6

I tried Alex’s suggestion, and was still seeing odd behaviour: the array that is supposed to contain the endpoint IP addresses remained empty after redeployment. It turns out this is only the case when using a single pod, so we’ve started running multiple pods by default. Thanks for your help!
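For anyone who finds this thread later, the change amounted to raising the replica count on the server’s Deployment. Roughly (the count of 2 is just an example):

```yaml
# Fragment of the server's Deployment spec: with a single replica, linkerd's
# balancer can briefly see zero endpoints during a rollout.
spec:
  replicas: 2   # any value > 1; we now run more than one pod by default
```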