Linkerd stops sending traffic to gRPC Kubernetes pods

Hi,

I have seen this behavior multiple times now. I am running linkerd:1.3.4. The full configuration is below:

apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: l5d
  labels:
    app: l5d
spec:
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
  minReadySeconds: 5
  selector:
    matchLabels:
      app: l5d
  template:
    metadata:
      labels:
        app: l5d
    spec:
      serviceAccount: default
      containers:
        - name: l5d
          image: buoyantio/linkerd:1.3.4
          imagePullPolicy: Always
          args: [/etc/l5d/config/config.yaml]
          env:
          - {name: POD_IP, valueFrom: {fieldRef: {fieldPath: status.podIP}}}
          lifecycle:
            preStop:
              exec:
                command: ["/usr/bin/curl", "-X", "POST", "http://localhost:9990/admin/shutdown"]
          livenessProbe:
            httpGet:
              port: admin
              path: /admin/ping
          readinessProbe:
            httpGet:
              port: admin
              path: /admin/ping
          ports:
          - {containerPort: 9990, name: admin}
          - {containerPort: 6100, hostPort: 6100, name: prod-h2-out}
          - {containerPort: 6200, hostPort: 6200, name: prod-h2-in}
          - {containerPort: 6300, hostPort: 6300, name: prod-h2-ing}
          - {containerPort: 6400, hostPort: 6400, name: prod-h1-in}
          - {containerPort: 6500, hostPort: 6500, name: prod-h1-ing}
          volumeMounts:
          - mountPath: /etc/l5d/config
            name: config
            readOnly: true
          - mountPath: /etc/l5d/tls/ca
            name: ca-tls
            readOnly: true
          - mountPath: /etc/l5d/tls/mesh-client
            name: mesh-client-tls
            readOnly: true
          - mountPath: /etc/l5d/tls/mesh-server
            name: mesh-server-tls
            readOnly: true
          - mountPath: /etc/l5d/tls/ingress-production
            name: production-tls
            readOnly: true
        - args:
          - proxy
          - -p
          - "8001"
          image: buoyantio/kubectl:v1.4.0
          name: kubectl
      volumes:
      - name: config
        configMap:
          name: l5d-config
      - name: ca-tls
        secret:
          secretName: ca.pki
      - name: mesh-client-tls
        secret:
          secretName: l5d.client-pki
      - name: mesh-server-tls
        secret:
          secretName: l5d.server-pki
      - name: production-tls
        secret:
          secretName: production-ingress.pki

Linkerd configuration:

apiVersion: v1
kind: ConfigMap
metadata:
  name: l5d-config
  namespace: l5d-system
data:
  config.yaml: |-
    admin:
      ip: 0.0.0.0
      port: 9990
    telemetry:
    - kind: io.l5d.prometheus
    - kind: io.l5d.recentRequests
      sampleRate: 0.01
      capacity: 10
    - kind: io.zipkin.http
      host: zipkin.default.svc.cluster.local:9411
      initialSampleRate: 0.02
    - kind: io.l5d.statsd
      experimental: true
      prefix: linkerd
      hostname: dd-agent.default.svc.cluster.local
      port: 8125
      gaugeIntervalMs: 10000
      sampleRate: 1.0
    usage:
      orgId: brightcove-rna
    namers:
    - kind: io.l5d.k8s
    routers:
    - label: production-h1-in
      protocol: http
      identifier:
      - kind: io.l5d.path
        segments: 3
      - kind: io.l5d.path
        segments: 2
      interpreter:
        kind: io.l5d.k8s.configMap
        experimental: true
        name: production-dtabs
        filename: h1-incoming
        namespace: l5d-system
        # todo: enable this when we have an answer for how to
        # correctly handle external destination requests
        # transformers:
        # - kind: io.l5d.k8s.localnode
      servers:
      - port: 6400
        ip: 0.0.0.0
        tls:
          caCertPath: /etc/l5d/tls/ca/l5d-system.crt
          certPath: /etc/l5d/tls/mesh-server/server.crt
          keyPath: /etc/l5d/tls/mesh-server/server.key
          requireClientAuth: true
      client:
        kind: io.l5d.static
        configs:
        - prefix: "/$/io.buoyant.rinet/443/{service}"
          tls:
            commonName: "{service}"
    - label: production-h2-in
      protocol: h2
      experimental: true
      identifier:
        kind: io.l5d.header.path
        segments: 1
      interpreter:
        kind: io.l5d.k8s.configMap
        experimental: true
        name: production-dtabs
        filename: h2-incoming
        namespace: l5d-system
        transformers:
        - kind: io.l5d.k8s.localnode
      servers:
      - port: 6200
        ip: 0.0.0.0
        tls:
          caCertPath: /etc/l5d/tls/ca/l5d-system.crt
          certPath: /etc/l5d/tls/mesh-server/server.crt
          keyPath: /etc/l5d/tls/mesh-server/server.key
          requireClientAuth: true
    - label: production-h2-out
      protocol: h2
      experimental: true
      originator: true
      identifier:
        kind: io.l5d.header.path
        segments: 1
      interpreter:
        kind: io.l5d.k8s.configMap
        experimental: true
        name: production-dtabs
        filename: h2-outgoing
        namespace: l5d-system
        transformers:
        - kind: io.l5d.k8s.daemonset
          namespace: l5d-system
          service: l5d
          port: prod-h2-in
      servers:
      - port: 6100
        ip: 0.0.0.0
      client:
        tls:
          commonName: l5d
          trustCerts:
          - /etc/l5d/tls/ca/l5d-system.crt
          clientAuth:
            certPath: /etc/l5d/tls/mesh-client/client.crt
            keyPath: /etc/l5d/tls/mesh-client/client.key
    - label: production-h1-ing
      protocol: http
      originator: true
      identifier:
      - kind: io.l5d.header.token
      interpreter:
        kind: io.l5d.k8s.configMap
        experimental: true
        name: production-dtabs
        filename: h1-ingress
        namespace: l5d-system
        transformers:
        - kind: io.l5d.k8s.daemonset
          namespace: l5d-system
          service: l5d
          port: prod-h1-in
      servers:
      - port: 6500
        ip: 0.0.0.0
        clearContext: true
      client:
        tls:
          commonName: l5d
          trustCerts:
          - /etc/l5d/tls/ca/l5d-system.crt
          clientAuth:
            certPath: /etc/l5d/tls/mesh-client/client.crt
            keyPath: /etc/l5d/tls/mesh-client/client.key
    - label: production-h2-ing
      protocol: h2
      experimental: true
      originator: true
      identifier:
        kind: io.l5d.header.path
        segments: 1
      interpreter:
        kind: io.l5d.k8s.configMap
        experimental: true
        name: production-dtabs
        filename: h2-incoming
        namespace: l5d-system
        transformers:
        - kind: io.l5d.k8s.daemonset
          namespace: l5d-system
          service: l5d
          port: prod-h2-in
      servers:
      - port: 6300
        ip: 0.0.0.0
        clearContext: true
        tls:
          caCertPath: /etc/l5d/tls/ca/production.crt
          certPath: /etc/l5d/tls/ingress-production/server.crt
          keyPath: /etc/l5d/tls/ingress-production/server.key
          requireClientAuth: true
      client:
        loadBalancer:
          kind: heap
        tls:
          commonName: l5d
          trustCerts:
          - /etc/l5d/tls/ca/l5d-system.crt
          clientAuth:
            certPath: /etc/l5d/tls/mesh-client/client.crt
            keyPath: /etc/l5d/tls/mesh-client/client.key

The dtabs look like this:

apiVersion: v1
kind: ConfigMap
metadata:
  name: production-dtabs
  namespace: l5d-system
data:
  h1-ingress: |
    /k8s => /#/io.l5d.k8s ;
    /srv => /k8s/production/http ;
    /svc/data.example.com => /srv/analytics-api-headless ;
    /svc/analytics.api.example.com => /srv/analytics-api-headless ;
  h1-incoming: |
    /k8s => /#/io.l5d.k8s ;
    /localAPI    => /k8s/production/http/analytics-api-headless ;
    /redirectAPI => /$/io.buoyant.rinet/443/REDACTED;
    /svc => /localAPI ;
    # overrides
    /svc/v1/reports            => /redirectAPI ;
    /svc/v1/quotas             => /redirectAPI ;
    /svc/*/private/log_sources => /redirectAPI ;
    /svc/*/private/hadoopjobs  => /redirectAPI ;
    /svc/*/private/loader      => /redirectAPI ;
  h2-incoming: |
    # env mappings -- these do not change unless you change the namespace or the port name ---
    /k8s => /#/io.l5d.k8s ;
    /srv => /k8s/production/grpc ;
    # service mappings -- these do not change unless you add more services ---
    /svc/example.api.Semaphore                => /srv/semaphore-headless ;
    /svc/example.api.TimezoneService          => /srv/oslo-headless ;
    /svc/example.api.HistoricalStatusAPI      => /srv/contra-headless ;
    /svc/example.api.MaterializedViewRegistry => /srv/contra-headless ;
    /svc/example.api.AnalyticsAPI             => /srv/bag-headless ;
  h2-outgoing: |
    # env mappings -- these do not change unless you change the namespace or the port name ---
    /k8s => /#/io.l5d.k8s ;
    /srv => /k8s/production/grpc ;
    # service mappings -- these do not change unless you add more services ---
    /svc/example.api.Semaphore                => /srv/semaphore-headless ;
    /svc/example.api.TimezoneService          => /srv/oslo-headless ;
    /svc/example.api.HistoricalStatusAPI      => /srv/contra-headless ;
    /svc/example.api.MaterializedViewRegistry => /srv/contra-headless ;
    /svc/example.api.AnalyticsAPI             => /srv/bag-headless ;
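
For example, a gRPC call to example.api.AnalyticsAPI (an h2 request whose :path begins with /example.api.AnalyticsAPI/) is identified as /svc/example.api.AnalyticsAPI; the dtab rewrites that to /srv/bag-headless and then to /#/io.l5d.k8s/production/grpc/bag-headless, i.e. the grpc port of the bag-headless service in the production namespace (the same name that appears in the logs below).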

This setup works pretty well for a few hours, and then, one by one, pods start to get marked as “dead” and receive no more traffic. Also, if we delete a pod, the replacement pod doesn’t receive any traffic either. It seems like there is some mismatch in the mapping.

Looking at the logs of one of the Linkerd pods, I see the following:

➜  ~ kubectl -n l5d-system logs -f l5d-btkkg l5d
-XX:+AggressiveOpts -XX:+CMSClassUnloadingEnabled -XX:CMSInitiatingOccupancyFraction=70 -XX:+CMSParallelRemarkEnabled -XX:+CMSScavengeBeforeRemark -XX:InitialHeapSize=33554432 -XX:MaxHeapSize=1073741824 -XX:MaxNewSize=357916672 -XX:MaxTenuringThreshold=6 -XX:OldPLABSize=16 -XX:+PrintCommandLineFlags -XX:+ScavengeBeforeFullGC -XX:-TieredCompilation -XX:+UseCMSInitiatingOccupancyOnly -XX:+UseCompressedClassPointers -XX:+UseCompressedOops -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:+UseStringDeduplication 
Dec 17, 2017 10:16:42 AM com.twitter.finagle.http.HttpMuxer$ $anonfun$new$1
INFO: HttpMuxer[/admin/metrics.json] = com.twitter.finagle.stats.MetricsExporter(<function1>)
Dec 17, 2017 10:16:42 AM com.twitter.finagle.http.HttpMuxer$ $anonfun$new$1
INFO: HttpMuxer[/admin/per_host_metrics.json] = com.twitter.finagle.stats.HostMetricsExporter(<function1>)
I 1217 10:16:43.032 UTC THREAD1: linkerd 1.3.4 (rev=c0e3455880151982842920659c2084c42d195e6b) built at 20171215-132634
I 1217 10:16:43.773 UTC THREAD1: Finagle version 7.1.0 (rev=37212517b530319f4ba08cc7473c8cd8c4b83479) built at 20170906-132024
W 1217 10:16:45.058 UTC THREAD1: Warning, you're using the `io.l5d.statsd` telemeter, which is unsupported and probably won't do what you expect. Use of this telemeter may lead to poor performance or decreased data quality.
Please see https://discourse.linkerd.io/t/deprecating-the-statsd-telemeter for more information.
I 1217 10:16:45.461 UTC THREAD1: Resolver[inet] = com.twitter.finagle.InetResolver(com.twitter.finagle.InetResolver@7e53339)
I 1217 10:16:45.461 UTC THREAD1: Resolver[fixedinet] = com.twitter.finagle.FixedInetResolver(com.twitter.finagle.FixedInetResolver@58fef7f7)
I 1217 10:16:45.461 UTC THREAD1: Resolver[neg] = com.twitter.finagle.NegResolver$(com.twitter.finagle.NegResolver$@31ddb930)
I 1217 10:16:45.461 UTC THREAD1: Resolver[nil] = com.twitter.finagle.NilResolver$(com.twitter.finagle.NilResolver$@12eedfee)
I 1217 10:16:45.462 UTC THREAD1: Resolver[fail] = com.twitter.finagle.FailResolver$(com.twitter.finagle.FailResolver$@3c3c4a71)
I 1217 10:16:45.462 UTC THREAD1: Resolver[flag] = com.twitter.server.FlagResolver(com.twitter.server.FlagResolver@1352434e)
I 1217 10:16:45.462 UTC THREAD1: Resolver[zk] = com.twitter.finagle.zookeeper.ZkResolver(com.twitter.finagle.zookeeper.ZkResolver@4f9a6c2d)
I 1217 10:16:45.462 UTC THREAD1: Resolver[zk2] = com.twitter.finagle.serverset2.Zk2Resolver(com.twitter.finagle.serverset2.Zk2Resolver@2b6fcb9f)
I 1217 10:16:46.697 UTC THREAD1: Tracer: com.twitter.finagle.zipkin.thrift.ScribeZipkinTracer
I 1217 10:16:46.697 UTC THREAD1: Tracer: zipkin.finagle.http.HttpZipkinTracer
I 1217 10:16:46.697 UTC THREAD1: Tracer: zipkin.finagle.kafka.KafkaZipkinTracer
I 1217 10:16:46.722 UTC THREAD1: connecting to usageData proxy at Set(Inet(stats.buoyant.io/104.28.22.233:443,Map()))
I 1217 10:16:46.748 UTC THREAD1: connecting to StatsD at dd-agent.default.svc.cluster.local:8125 as linkerd
I 1217 10:16:47.082 UTC THREAD1: tracer: io.buoyant.telemetry.recentRequests.RecentRequetsTracer@2a644938
I 1217 10:16:47.083 UTC THREAD1: tracer: zipkin.finagle.http.HttpZipkinTracer@6215509
I 1217 10:16:47.740 UTC THREAD1: serving http admin on /0.0.0.0:9990
I 1217 10:16:47.767 UTC THREAD1: serving qa-h1-in on /0.0.0.0:4400
I 1217 10:16:49.460 UTC THREAD1: serving qa-h2-in on /0.0.0.0:4200
I 1217 10:16:49.495 UTC THREAD1: serving qa-h2-out on /0.0.0.0:4100
I 1217 10:16:49.522 UTC THREAD1: serving qa-h1-ing on /0.0.0.0:4500
I 1217 10:16:49.545 UTC THREAD1: serving qa-h2-ing on /0.0.0.0:4300
I 1217 10:16:49.563 UTC THREAD1: serving staging-h1-in on /0.0.0.0:5400
I 1217 10:16:49.583 UTC THREAD1: serving staging-h2-in on /0.0.0.0:5200
I 1217 10:16:49.601 UTC THREAD1: serving staging-h2-out on /0.0.0.0:5100
I 1217 10:16:49.618 UTC THREAD1: serving staging-h1-ing on /0.0.0.0:5500
I 1217 10:16:49.638 UTC THREAD1: serving staging-h2-ing on /0.0.0.0:5300
I 1217 10:16:49.668 UTC THREAD1: serving production-h1-in on /0.0.0.0:6400
I 1217 10:16:49.726 UTC THREAD1: serving production-h2-in on /0.0.0.0:6200
I 1217 10:16:49.741 UTC THREAD1: serving production-h2-out on /0.0.0.0:6100
I 1217 10:16:49.758 UTC THREAD1: serving production-h1-ing on /0.0.0.0:6500
I 1217 10:16:49.775 UTC THREAD1: serving production-h2-ing on /0.0.0.0:6300
I 1217 10:16:49.788 UTC THREAD1: initialized
W 1217 10:17:03.679 UTC THREAD55 TraceId:a0ed64a8e501cd7d: Exception propagated to the default monitor (upstream address: /10.4.19.51:45564, downstream address: /10.4.20.36:8080, label: %/io.l5d.k8s.localnode/10.4.20.37/#/io.l5d.k8s/production/grpc/bag-headless).
Reset.Cancel

W 1217 10:17:03.701 UTC THREAD53 TraceId:988a6bb1b760ea02: Exception propagated to the default monitor (upstream address: /10.4.2.62:45700, downstream address: /10.4.20.36:8080, label: %/io.l5d.k8s.localnode/10.4.20.37/#/io.l5d.k8s/production/grpc/bag-headless).
Reset.Cancel

W 1217 10:17:03.918 UTC THREAD50 TraceId:2f2d676e3a0e306c: Exception propagated to the default monitor (upstream address: /10.4.16.137:43920, downstream address: /10.4.20.36:8080, label: %/io.l5d.k8s.localnode/10.4.20.37/#/io.l5d.k8s/production/grpc/bag-headless).
Reset.Cancel

W 1217 10:17:03.927 UTC THREAD50 TraceId:97df192998387a62: Exception propagated to the default monitor (upstream address: /10.4.16.137:43920, downstream address: /10.4.20.36:8080, label: %/io.l5d.k8s.localnode/10.4.20.37/#/io.l5d.k8s/production/grpc/bag-headless).
Reset.Cancel

I 1217 10:17:03.950 UTC THREAD50 TraceId:c1790bcb24411676: FailureAccrualFactory marking connection to "%/io.l5d.k8s.localnode/10.4.20.37/#/io.l5d.k8s/production/grpc/bag-headless" as dead. Remote Address: Inet(/10.4.20.36:8080,Map(nodeName -> gke-capybara-misty-silence-917-3c1a662f-b967))
W 1217 10:17:03.952 UTC THREAD50 TraceId:c1790bcb24411676: Exception propagated to the default monitor (upstream address: /10.4.16.137:43920, downstream address: /10.4.20.36:8080, label: %/io.l5d.k8s.localnode/10.4.20.37/#/io.l5d.k8s/production/grpc/bag-headless).
Reset.Cancel

W 1217 10:17:03.958 UTC THREAD50 TraceId:a42d1872d3a49ac1: Exception propagated to the default monitor (upstream address: /10.4.16.137:43920, downstream address: /10.4.20.36:8080, label: %/io.l5d.k8s.localnode/10.4.20.37/#/io.l5d.k8s/production/grpc/bag-headless).
Reset.Cancel

W 1217 10:19:17.660 UTC THREAD55 TraceId:fe818d744dfd00fd: Exception propagated to the default monitor (upstream address: /10.4.19.51:45564, downstream address: /10.4.20.36:8080, label: %/io.l5d.k8s.localnode/10.4.20.37/#/io.l5d.k8s/production/grpc/bag-headless).
Reset.Cancel

W 1217 10:19:17.676 UTC THREAD55 TraceId:e96433130c6f2a10: Exception propagated to the default monitor (upstream address: /10.4.19.51:45564, downstream address: /10.4.20.36:8080, label: %/io.l5d.k8s.localnode/10.4.20.37/#/io.l5d.k8s/production/grpc/bag-headless).
Reset.Cancel

W 1217 10:19:17.729 UTC THREAD55 TraceId:5d74bcba3ec4077e: Exception propagated to the default monitor (upstream address: /10.4.19.51:45564, downstream address: /10.4.20.36:8080, label: %/io.l5d.k8s.localnode/10.4.20.37/#/io.l5d.k8s/production/grpc/bag-headless).
Reset.Cancel

E 1217 10:19:19.955 UTC THREAD55: [S L:/10.4.20.37:6200 R:/10.4.19.51:45564] dispatcher failed
com.twitter.finagle.ChannelClosedException: ChannelException at remote address: /10.4.19.51:45564. Remote Info: Not Available
	at com.twitter.finagle.netty4.transport.ChannelTransport$$anon$1.channelInactive(ChannelTransport.scala:188)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:245)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:231)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:224)
	at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:245)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:231)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:224)
	at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:245)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:231)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:224)
	at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75)
	at com.twitter.finagle.netty4.channel.ChannelRequestStatsHandler.channelInactive(ChannelRequestStatsHandler.scala:35)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:245)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:231)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:224)
	at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:245)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:231)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:224)
	at io.netty.handler.codec.ByteToMessageDecoder.channelInputClosed(ByteToMessageDecoder.java:377)
	at io.netty.handler.codec.ByteToMessageDecoder.channelInactive(ByteToMessageDecoder.java:342)
	at io.netty.handler.codec.http2.Http2ConnectionHandler.channelInactive(Http2ConnectionHandler.java:391)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:245)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:231)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:224)
	at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:245)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:231)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:224)
	at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75)
	at com.twitter.finagle.netty4.channel.ChannelStatsHandler.channelInactive(ChannelStatsHandler.scala:131)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:245)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:231)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:224)
	at io.netty.handler.codec.ByteToMessageDecoder.channelInputClosed(ByteToMessageDecoder.java:377)
	at io.netty.handler.codec.ByteToMessageDecoder.channelInactive(ByteToMessageDecoder.java:342)
	at io.netty.handler.ssl.SslHandler.channelInactive(SslHandler.java:978)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:245)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:231)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:224)
	at io.netty.channel.DefaultChannelPipeline$HeadContext.channelInactive(DefaultChannelPipeline.java:1337)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:245)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:231)
	at io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:916)
	at io.netty.channel.AbstractChannel$AbstractUnsafe$7.run(AbstractChannel.java:744)
	at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:403)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463)
	at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at com.twitter.finagle.util.BlockingTimeTrackingThreadFactory$$anon$1.run(BlockingTimeTrackingThreadFactory.scala:23)
	at java.lang.Thread.run(Thread.java:748)

W 1217 10:20:22.500 UTC THREAD49 TraceId:5d5b4e0caf25df71: Exception propagated to the default monitor (upstream address: /10.4.12.143:41506, downstream address: /10.4.20.36:8080, label: %/io.l5d.k8s.localnode/10.4.20.37/#/io.l5d.k8s/production/grpc/bag-headless).
Reset.Cancel

W 1217 10:20:22.507 UTC THREAD49 TraceId:988dd7c85861a977: Exception propagated to the default monitor (upstream address: /10.4.12.143:41506, downstream address: /10.4.20.36:8080, label: %/io.l5d.k8s.localnode/10.4.20.37/#/io.l5d.k8s/production/grpc/bag-headless).
Reset.Cancel

W 1217 10:20:24.602 UTC THREAD49 TraceId:cae5aec8eec19390: Exception propagated to the default monitor (upstream address: /10.4.12.143:41506, downstream address: /10.4.20.36:8080, label: %/io.l5d.k8s.localnode/10.4.20.37/#/io.l5d.k8s/production/grpc/bag-headless).
Reset.Cancel

W 1217 10:40:28.355 UTC THREAD60 TraceId:7407c1b36006ac41: Exception propagated to the default monitor (upstream address: /10.240.0.3:41428, downstream address: /10.4.11.178:6200, label: %/io.l5d.k8s.daemonset/l5d-system/prod-h2-in/l5d/#/io.l5d.k8s/production/grpc/oslo-headless).
Reset.Cancel

E 1217 11:15:49.279 UTC THREAD73: service failure: com.twitter.finagle.NoBrokersAvailableException: No hosts are available for /svc/35.225.239.54, Dtab.base=[], Dtab.local=[]. Remote Info: Not Available
W 1217 11:18:08.777 UTC THREAD60 TraceId:3e177ef5ea8c9569: Exception propagated to the default monitor (upstream address: /10.240.0.3:41428, downstream address: /10.4.1.135:6200, label: %/io.l5d.k8s.daemonset/l5d-system/prod-h2-in/l5d/#/io.l5d.k8s/production/grpc/oslo-headless).
Reset.Cancel

There are also a bunch of errors like this:

W 1217 16:27:21.021 UTC THREAD60 TraceId:7493533dc326e686: Exception propagated to the default monitor (upstream address: /10.240.0.3:41428, downstream address: /10.4.1.135:6200, label: %/io.l5d.k8s.daemonset/l5d-system/prod-h2-in/l5d/#/io.l5d.k8s/production/grpc/oslo-headless).
Reset.Cancel

I 1217 16:27:21.165 UTC THREAD59 TraceId:df5c4438e5607258: FailureAccrualFactory marking connection to "%/io.l5d.k8s.daemonset/l5d-system/prod-h2-in/l5d/#/io.l5d.k8s/production/grpc/oslo-headless" as dead. Remote Address: Inet(/10.4.11.178:6200,Map(nodeName -> gke-capybara-still-voice-65-fe114934-nz8f))
W 1217 16:27:21.166 UTC THREAD59 TraceId:df5c4438e5607258: Exception propagated to the default monitor (upstream address: /10.240.0.92:44827, downstream address: /10.4.11.178:6200, label: %/io.l5d.k8s.daemonset/l5d-system/prod-h2-in/l5d/#/io.l5d.k8s/production/grpc/oslo-headless).
Reset.Cancel

W 1217 16:43:09.334 UTC THREAD59: Unhandled exception in connection with /10.240.0.92:44827, shutting down connection
io.netty.handler.codec.http2.Http2Exception$StreamException: Stream closed before write could take place

Finally, with reference to the Kubernetes setup, I see this error twice in the logs:

W 1218 16:07:20.907 UTC THREAD19 TraceId:b56dfcc59bc7f31d: Exception propagated to the default monitor (upstream address: /10.4.20.34:43646, downstream address: localhost/127.0.0.1:8001, label: client).
com.twitter.io.Reader$ReaderDiscarded: This writer's reader has been discarded
	at com.twitter.finagle.netty4.http.StreamTransports$$anon$1.discard(StreamTransports.scala:71)
	at com.twitter.finagle.http.DelayedReleaseService$$anon$2$$anon$3.discard(DelayedReleaseService.scala:58)
	at com.twitter.finagle.http.DelayedReleaseService$$anon$2$$anon$3.discard(DelayedReleaseService.scala:58)
	at io.buoyant.k8s.Watchable.$anonfun$watch$5(Watchable.scala:131)
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:12)
	at com.twitter.util.Try$.apply(Try.scala:15)
	at com.twitter.util.Future$.apply(Future.scala:166)
	at io.buoyant.k8s.Watchable.$anonfun$watch$4(Watchable.scala:131)
	at com.twitter.util.Closable$$anon$5.close(Closable.scala:144)
	at com.twitter.util.Closable$$anon$6.close(Closable.scala:148)
	at com.twitter.util.Var$$anon$4.$anonfun$closable$1(Var.scala:399)
	at com.twitter.util.Closable$$anon$5.close(Closable.scala:144)
	at com.twitter.util.Closable$.com$twitter$util$Closable$$safeClose(Closable.scala:67)
	at com.twitter.util.Closable$$anon$3.closeSeq(Closable.scala:125)
	at com.twitter.util.Closable$$anon$3.onTry$1(Closable.scala:122)
	at com.twitter.util.Closable$$anon$3.closeSeq(Closable.scala:128)
	at com.twitter.util.Closable$$anon$3.close(Closable.scala:134)
	at com.twitter.util.Closable$$anon$6.close(Closable.scala:148)
	at io.buoyant.k8s.Watchable.$anonfun$activity$10(Watchable.scala:285)
	at com.twitter.util.Closable$$anon$5.close(Closable.scala:144)
	at com.twitter.util.Var$$anon$4.$anonfun$closable$1(Var.scala:399)
	at com.twitter.util.Closable$$anon$5.close(Closable.scala:144)
	at com.twitter.util.Closable$.com$twitter$util$Closable$$safeClose(Closable.scala:67)
	at com.twitter.util.Closable$$anon$3.closeSeq(Closable.scala:125)
	at com.twitter.util.Closable$$anon$3.onTry$1(Closable.scala:122)
	at com.twitter.util.Closable$$anon$3.closeSeq(Closable.scala:128)
	at com.twitter.util.Closable$$anon$3.close(Closable.scala:134)
	at com.twitter.util.Closable$.com$twitter$util$Closable$$safeClose(Closable.scala:67)
	at com.twitter.util.Closable$$anon$3.closeSeq(Closable.scala:125)
	at com.twitter.util.Closable$$anon$3.close(Closable.scala:134)
	at com.twitter.util.Var$$anon$4.$anonfun$closable$1(Var.scala:399)
	at com.twitter.util.Closable$$anon$5.close(Closable.scala:144)
	at com.twitter.util.Closable$.com$twitter$util$Closable$$safeClose(Closable.scala:67)
	at com.twitter.util.Closable$$anon$3.closeSeq(Closable.scala:125)
	at com.twitter.util.Closable$$anon$3.onTry$1(Closable.scala:122)
	at com.twitter.util.Closable$$anon$3.closeSeq(Closable.scala:128)
	at com.twitter.util.Closable$$anon$3.close(Closable.scala:134)
	at com.twitter.util.Closable$.com$twitter$util$Closable$$safeClose(Closable.scala:67)
	at com.twitter.util.Closable$$anon$3.closeSeq(Closable.scala:125)
	at com.twitter.util.Closable$$anon$3.close(Closable.scala:134)
	at com.twitter.util.Closable$.com$twitter$util$Closable$$safeClose(Closable.scala:67)
	at com.twitter.util.Closable$$anon$3.closeSeq(Closable.scala:125)
	at com.twitter.util.Closable$$anon$3.close(Closable.scala:134)
	at com.twitter.util.Closable$.com$twitter$util$Closable$$safeClose(Closable.scala:67)
	at com.twitter.util.Closable$$anon$3.closeSeq(Closable.scala:125)
	at com.twitter.util.Closable$$anon$3.close(Closable.scala:134)
	at com.twitter.util.Closable$$anon$6.close(Closable.scala:148)
	at com.twitter.util.Closable$.com$twitter$util$Closable$$safeClose(Closable.scala:67)
	at com.twitter.util.Closable$$anon$3.closeSeq(Closable.scala:125)
	at com.twitter.util.Closable$$anon$3.onTry$1(Closable.scala:122)
	at com.twitter.util.Closable$$anon$3.closeSeq(Closable.scala:128)
	at com.twitter.util.Closable$$anon$3.close(Closable.scala:134)
	at com.twitter.util.Closable$$anon$6.close(Closable.scala:148)
	at com.twitter.util.Closable$.com$twitter$util$Closable$$safeClose(Closable.scala:67)
	at com.twitter.util.Closable$$anon$3.closeSeq(Closable.scala:125)
	at com.twitter.util.Closable$$anon$3.onTry$1(Closable.scala:122)
	at com.twitter.util.Closable$$anon$3.closeSeq(Closable.scala:128)
	at com.twitter.util.Closable$$anon$3.close(Closable.scala:134)
	at com.twitter.util.Closable$$anon$6.close(Closable.scala:148)
	at com.twitter.util.Closable$.com$twitter$util$Closable$$safeClose(Closable.scala:67)
	at com.twitter.util.Closable$$anon$3.closeSeq(Closable.scala:125)
	at com.twitter.util.Closable$$anon$3.onTry$1(Closable.scala:122)
	at com.twitter.util.Closable$$anon$3.closeSeq(Closable.scala:128)
	at com.twitter.util.Closable$$anon$3.close(Closable.scala:134)
	at com.twitter.util.Closable.close(Closable.scala:20)
	at com.twitter.util.Closable.close$(Closable.scala:20)
	at com.twitter.util.Closable$$anon$3.close(Closable.scala:101)
	at com.twitter.util.Closable$$anon$1.run(Closable.scala:161)

E 1218 16:07:20.910 UTC THREAD19: retrying k8s request to /api/v1/namespaces/qa/endpoints/oslo-headless on error com.twitter.io.Reader$ReaderDiscarded: This writer's reader has been discarded
W 1218 16:09:51.730 UTC THREAD67 TraceId:b523be958b833b86: Exception propagated to the default monitor (upstream address: /10.4.16.138:42146, downstream address: /10.4.20.36:8080, label: %/io.l5d.k8s.localnode/10.4.20.37/#/io.l5d.k8s/production/grpc/bag-headless).
Reset.Cancel

Any suggestions or workarounds?

@ankurcha once again, thanks so much for this detailed report. I do have a couple more questions for you. What version of Kubernetes are you using, and are you running on GKE? It would also be ideal if you could put together a minimal, self-contained Kubernetes configuration that triggers the issue you are running into, so we can debug it.

Hi,

Yes, I am running on GKE (1.8.3-gke.0). I don’t have a good idea for reproduction steps. What other information do you think I could include?

The setup is pretty standard (AFAIK):

client -> ingress l5d -> incoming l5d -> pod (:8080).
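In terms of the config above, the gRPC path is roughly: client -> l5d :6300 (production-h2-ing) -> l5d :6200 (production-h2-in, reached via the io.l5d.k8s.daemonset transformer) -> pod :8080.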

Hi @ankurcha, have you used the same configurations with Linkerd versions prior to 1.3.4 without seeing this error?

I have not. I can try 1.3.3 but it looks like https://github.com/linkerd/linkerd/pull/1751 is pretty critical.

Okay, thanks, that’s good to know – just trying to pin down a specific version where this issue was introduced. Since you’re using gRPC, you’re correct that you probably want to avoid version 1.3.3.

I am experiencing a very similar issue; I will post details later today.

Thanks for notifying us, @bourquep. We will watch out for more details.

I had a very similar issue a few weeks ago while testing Linkerd (v1.3.1, I believe) as an HTTPS ingress controller, but my cluster was running k8s 1.6 at the time, so I figured I would try it again on a 1.8 cluster before digging deeper into the Linkerd logs, to make sure it wasn’t a k8s issue in that older version.

So I had reverted to the nginx ingress controller and forgot about it.

Since then, I have upgraded my cluster to 1.8.4-gke.0 and got back to experimenting with Linkerd, this time for routing intra-cluster gRPC calls between services (no Linkerd ingress yet).

Yesterday evening I had a completely working setup, with my main rest-api service talking to 7 gRPC services via Linkerd.

Everything ran smoothly throughout the night, but this morning we started seeing RST_STREAM errors returned by the gRPC clients.

I was on the bus when this happened, so deep troubleshooting on my phone was not really an option; I simply killed all the l5d pods and let the DaemonSet recreate them. Right after restarting Linkerd, the errors went away and gRPC service communication resumed.
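The restart itself was nothing fancy; using the app: l5d selector and linkerd namespace from the manifests below, it was roughly:

kubectl -n linkerd delete pods -l app=l5d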

Now that I am back at my desk, here’s the data I’ve been able to gather regarding this morning’s outage.

Linkerd configuration

I am omitting the RBAC config and YAML comments for brevity.

apiVersion: v1
kind: ConfigMap
metadata:
  name: l5d-config
  namespace: linkerd
data:
  config.yaml: |-
    admin:
      ip: 0.0.0.0
      port: 9990
    namers:
    - kind: io.l5d.k8s
    - kind: io.l5d.k8s
      prefix: /io.l5d.k8s.http
      transformers:
      - kind: io.l5d.k8s.daemonset
        namespace: linkerd
        port: http-incoming
        service: l5d
    - kind: io.l5d.k8s
      prefix: /io.l5d.k8s.grpc
      transformers:
      - kind: io.l5d.k8s.daemonset
        namespace: linkerd
        port: grpc-incoming
        service: l5d
    - kind: io.l5d.rewrite
      prefix: /portNsSvcToK8s
      pattern: "/{port}/{ns}/{svc}"
      name: "/k8s/{ns}/{port}/{svc}"

    telemetry:
    - kind: io.l5d.prometheus
    - kind: io.l5d.recentRequests
      sampleRate: 0.25

    usage:
      orgId: studyo-co

    routers:
    - label: grpc-outgoing
      protocol: h2
      experimental: true
      servers:
      - port: 4340
        ip: 0.0.0.0
      identifier:
        kind: io.l5d.header.token
      dtab: |
        /k8s => /#/io.l5d.k8s.grpc ;
        /portNsSvc => /#/portNsSvcToK8s ;
        /host => /portNsSvc/grpc/studyo ; # If no namespace is provided, default to 'studyo'
        /host => /portNsSvc/grpc ;
        /svc => /$/io.buoyant.http.domainToPathPfx/host ;
      client:
        kind: io.l5d.static
        configs:
        - prefix: "/$/inet/{service}"
          tls:
            commonName: "{service}"

    - label: grpc-incoming
      protocol: h2
      experimental: true
      servers:
      - port: 4341
        ip: 0.0.0.0
      identifier:
        kind: io.l5d.header.token
      interpreter:
        kind: default
        transformers:
        - kind: io.l5d.k8s.localnode
      dtab: |
        /k8s => /#/io.l5d.k8s ;
        /portNsSvc => /#/portNsSvcToK8s ;
        /host => /portNsSvc/grpc/studyo ; # If no namespace is provided, default to 'studyo'
        /host => /portNsSvc/grpc ;
        /svc => /$/io.buoyant.http.domainToPathPfx/host ;

### DaemonSet ###
---
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  labels:
    app: l5d
  name: l5d
  namespace: linkerd
spec:
  template:
    metadata:
      labels:
        app: l5d
    spec:
      volumes:
      - name: l5d-config
        configMap:
          name: "l5d-config"
      containers:
      - name: l5d
        image: buoyantio/linkerd:1.3.3
        env:
        - name: POD_IP
          valueFrom:
            fieldRef:
              fieldPath: status.podIP
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
        args:
        - /io.buoyant/linkerd/config/config.yaml
        ports:
        - name: grpc-outgoing
          containerPort: 4340
          hostPort: 4340
        - name: grpc-incoming
          containerPort: 4341
        volumeMounts:
        - name: "l5d-config"
          mountPath: "/io.buoyant/linkerd/config"
          readOnly: true

      - name: kubectl
        image: buoyantio/kubectl:v1.8.5
        args:
        - "proxy"
        - "-p"
        - "8001"

### Service ###
---
apiVersion: v1
kind: Service
metadata:
  name: l5d
  namespace: linkerd
spec:
  selector:
    app: l5d
  type: ClusterIP
  ports:
  - name: grpc-outgoing
    port: 4340
  - name: grpc-incoming
    port: 4341
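
For reference, with this config a call carrying Host: generator should resolve on the outgoing side roughly as /svc/generator -> /host/generator -> /portNsSvc/grpc/studyo/generator -> /k8s/studyo/grpc/generator -> /#/io.l5d.k8s.grpc/studyo/grpc/generator, which matches the names that appear in the logs below.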

Linkerd logs

These are the logs right before the errors started showing up. It looks like all of a sudden all my services stopped being visible to Linkerd. Note that nothing changed on the cluster at that time (we didn’t update any deployments).

I 1219 12:55:59.192 UTC THREAD10: Reaping /svc/canvas-connector
I 1219 12:55:59.194 UTC THREAD10: Reaping /svc/blackbaud-connector
I 1219 12:55:59.210 UTC THREAD10: Reaping %/io.l5d.k8s.localnode/10.8.0.155/#/io.l5d.k8s/studyo/grpc/canvas-connector
I 1219 12:55:59.222 UTC THREAD10: Reaping %/io.l5d.k8s.localnode/10.8.0.155/#/io.l5d.k8s/studyo/grpc/blackbaud-connector
I 1219 12:56:06.377 UTC THREAD10: Reaping /svc/canvas-connector
I 1219 12:56:07.365 UTC THREAD10: Reaping %/io.l5d.k8s.daemonset/linkerd/grpc-incoming/l5d/#/io.l5d.k8s.grpc/studyo/grpc/canvas-connector
time="2017-12-19T12:56:24Z" level=info msg="Checkpointing in-memory metrics and chunks..." source="persistence.go:633"
time="2017-12-19T12:56:24Z" level=info msg="Done checkpointing in-memory metrics and chunks in 324.697422ms." source="persistence.go:665"
I 1219 12:56:32.508 UTC THREAD10: Reaping /svc/canvas-connector
I 1219 12:56:32.511 UTC THREAD10: Reaping /svc/blackbaud-connector
I 1219 12:56:33.145 UTC THREAD10: Reaping %/io.l5d.k8s.daemonset/linkerd/grpc-incoming/l5d/#/io.l5d.k8s.grpc/studyo/grpc/canvas-connector
I 1219 12:56:33.158 UTC THREAD10: Reaping %/io.l5d.k8s.daemonset/linkerd/grpc-incoming/l5d/#/io.l5d.k8s.grpc/studyo/grpc/blackbaud-connector
I 1219 13:00:53.447 UTC THREAD36 TraceId:45b399e1c51a46bc: %/io.l5d.k8s.daemonset/linkerd/grpc-incoming/l5d/#/io.l5d.k8s.grpc/studyo/grpc/generator: name resolution is negative (local dtab: Dtab())
I 1219 13:00:53.454 UTC THREAD36: no available endpoints
com.twitter.finagle.NoBrokersAvailableException: No hosts are available for /svc/generator, Dtab.base=[/k8s=>/#/io.l5d.k8s.grpc;/portNsSvc=>/#/portNsSvcToK8s;/host=>/portNsSvc/grpc/studyo;/host=>/portNsSvc/grpc;/svc=>/$/io.buoyant.http.domainToPathPfx/host], Dtab.local=[]. Remote Info: Not Available
I 1219 13:00:56.867 UTC THREAD33 TraceId:02526ca87928e0fb: %/io.l5d.k8s.daemonset/linkerd/grpc-incoming/l5d/#/io.l5d.k8s.grpc/studyo/grpc/classroom-connector: name resolution is negative (local dtab: Dtab())
I 1219 13:00:56.872 UTC THREAD33 TraceId:02526ca87928e0fb: %/io.l5d.k8s.daemonset/linkerd/grpc-incoming/l5d/#/io.l5d.k8s.grpc/studyo/grpc/classroom-connector: name resolution is negative (local dtab: Dtab())
I 1219 13:00:56.874 UTC THREAD33 TraceId:02526ca87928e0fb: %/io.l5d.k8s.daemonset/linkerd/grpc-incoming/l5d/#/io.l5d.k8s.grpc/studyo/grpc/classroom-connector: name resolution is negative (local dtab: Dtab())
I 1219 13:00:56.880 UTC THREAD33: no available endpoints
com.twitter.finagle.NoBrokersAvailableException: No hosts are available for /svc/classroom-connector, Dtab.base=[/k8s=>/#/io.l5d.k8s.grpc;/portNsSvc=>/#/portNsSvcToK8s;/host=>/portNsSvc/grpc/studyo;/host=>/portNsSvc/grpc;/svc=>/$/io.buoyant.http.domainToPathPfx/host], Dtab.local=[]. Remote Info: Not Available

And then the Linkerd log is filled with “Endpoint client is marked down” errors:

E 1219 14:14:44.050 UTC THREAD13: retrying k8s request to /api/v1/namespaces/studyo/endpoints/cron on error com.twitter.finagle.FailedFastException: Endpoint client is marked down. For more details see: https://twitter.github.io/finagle/guide/FAQ.html#why-do-clients-see-com-twitter-finagle-failedfastexception-s. Remote Info: Upstream Address: Not Available, Upstream id: Not Available, Downstream Address: localhost/127.0.0.1:8001, Downstream label: client, Trace Id: 4fe634abb25a0c96.4fe634abb25a0c96<:4fe634abb25a0c96
E 1219 14:14:44.052 UTC THREAD13: retrying k8s request to /api/v1/namespaces/studyo/endpoints/blackbaud-connector on error com.twitter.finagle.FailedFastException: Endpoint client is marked down. For more details see: https://twitter.github.io/finagle/guide/FAQ.html#why-do-clients-see-com-twitter-finagle-failedfastexception-s. Remote Info: Upstream Address: Not Available, Upstream id: Not Available, Downstream Address: localhost/127.0.0.1:8001, Downstream label: client, Trace Id: e75dff9234d7dc88.e75dff9234d7dc88<:e75dff9234d7dc88
E 1219 14:14:44.053 UTC THREAD13: retrying k8s request to /api/v1/namespaces/studyo/endpoints/canvas-connector on error com.twitter.finagle.FailedFastException: Endpoint client is marked down. For more details see: https://twitter.github.io/finagle/guide/FAQ.html#why-do-clients-see-com-twitter-finagle-failedfastexception-s. Remote Info: Upstream Address: Not Available, Upstream id: Not Available, Downstream Address: localhost/127.0.0.1:8001, Downstream label: client, Trace Id: b4dcd0bf79040744.b4dcd0bf79040744<:b4dcd0bf79040744
E 1219 14:14:44.055 UTC THREAD13: retrying k8s request to /api/v1/namespaces/studyo/endpoints/classroom-connector on error com.twitter.finagle.FailedFastException: Endpoint client is marked down. For more details see: https://twitter.github.io/finagle/guide/FAQ.html#why-do-clients-see-com-twitter-finagle-failedfastexception-s. Remote Info: Upstream Address: Not Available, Upstream id: Not Available, Downstream Address: localhost/127.0.0.1:8001, Downstream label: client, Trace Id: a259c3d6b9cb2add.a259c3d6b9cb2add<:a259c3d6b9cb2add
E 1219 14:14:44.057 UTC THREAD13: retrying k8s request to /api/v1/namespaces/studyo/endpoints/classroom-connector on error com.twitter.finagle.FailedFastException: Endpoint client is marked down. For more details see: https://twitter.github.io/finagle/guide/FAQ.html#why-do-clients-see-com-twitter-finagle-failedfastexception-s. Remote Info: Upstream Address: Not Available, Upstream id: Not Available, Downstream Address: localhost/127.0.0.1:8001, Downstream label: client, Trace Id: a259c3d6b9cb2add.a259c3d6b9cb2add<:a259c3d6b9cb2add
E 1219 14:14:44.058 UTC THREAD13: retrying k8s request to /api/v1/namespaces/studyo/endpoints/generator on error com.twitter.finagle.FailedFastException: Endpoint client is marked down. For more details see: https://twitter.github.io/finagle/guide/FAQ.html#why-do-clients-see-com-twitter-finagle-failedfastexception-s. Remote Info: Upstream Address: Not Available, Upstream id: Not Available, Downstream Address: localhost/127.0.0.1:8001, Downstream label: client, Trace Id: 9f8ff6b9198e438d.9f8ff6b9198e438d<:9f8ff6b9198e438d
E 1219 14:14:44.060 UTC THREAD13: retrying k8s request to /api/v1/namespaces/studyo/endpoints/canvas-connector on error com.twitter.finagle.FailedFastException: Endpoint client is marked down. For more details see: https://twitter.github.io/finagle/guide/FAQ.html#why-do-clients-see-com-twitter-finagle-failedfastexception-s. Remote Info: Upstream Address: Not Available, Upstream id: Not Available, Downstream Address: localhost/127.0.0.1:8001, Downstream label: client, Trace Id: dd932390496a5ff4.dd932390496a5ff4<:dd932390496a5ff4

If there is anything else I can provide you, I’ll be happy to oblige.

Also, since I don’t know how to reproduce the issue (it happened out of the blue), is there anything I can configure now (e.g. log levels) so that we have more details to chew on the next time the bug manifests itself?

Pascal

@bourquep there are a couple of settings you can use to change logging levels, which may help surface more information. You could restart Linkerd with the -log.level=DEBUG flag, or use the tracelog telemeter in your Linkerd config. Be advised: the tracelog telemeter samples at 100% by default, which can be a lot of information, so I would start with smaller values.
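Roughly, as a sketch against the DaemonSet and config you posted (the 0.02 sample rate is just an arbitrary starting point):

# DaemonSet container spec: pass the flag ahead of the config path
args:
- "-log.level=DEBUG"
- /io.buoyant/linkerd/config/config.yaml

# or, in config.yaml, add a tracelog telemeter next to the existing ones
telemetry:
- kind: io.l5d.tracelog
  sampleRate: 0.02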

I have enabled -log.level=DEBUG and will post those logs the next time the problem manifests itself.

Sounds good, @bourquep.

Got a failure on 1 pod from the DaemonSet (out of 3). Logs from that Linkerd pod:

D 0102 20:48:16.412 UTC THREAD31 TraceId:8f84692e2129033c: k8s restarting watch on /api/v1/watch/namespaces/studyo/endpoints/generator, resource version None was too old
D 0102 20:49:58.266 UTC THREAD37 TraceId:45cec1470328d5a0: k8s lookup: /studyo/grpc/generator /studyo/grpc/generator
D 0102 20:49:59.755 UTC THREAD34 TraceId:d62e936f5fb02c32: k8s returned 'too old resource version' error with incorrect HTTP status code, restarting watch
I 0102 20:50:08.280 UTC THREAD10: [S L:/10.8.0.195:4341 R:/10.8.0.195:48078 S:3] unexpected error; resetting remote: INTERNAL_ERROR
com.twitter.finagle.RequestTimeoutException: exceeded 10.seconds to 0.0.0.0/4341 while dyn binding /svc/generator. Remote Info: Not Available

D 0102 20:50:08.280 UTC THREAD10: [S L:/10.8.0.195:4341 R:/10.8.0.195:48078 S:3] resetting Reset.InternalError in com.twitter.finagle.buoyant.h2.netty4.Netty4StreamTransport$RemoteClosed@f2e51f5
D 0102 20:50:08.282 UTC THREAD10: [S L:/10.8.0.195:4341 R:/10.8.0.195:48078 S:3] stream reset from local; resetting remote: Reset.InternalError
D 0102 20:50:08.284 UTC THREAD36 TraceId:45cec1470328d5a0: [C L:/10.8.0.195:48078 R:/10.8.0.195:4341 S:3] stream reset from remote: Reset.InternalError
W 0102 20:50:08.285 UTC THREAD36 TraceId:45cec1470328d5a0: Exception propagated to the default monitor (upstream address: /10.8.0.168:52372, downstream address: /10.8.0.195:4341, label: %/io.l5d.k8s.daemonset/linkerd/grpc-incoming/l5d/#/io.l5d.k8s.grpc/studyo/grpc/generator).
Reset.InternalError

I have only included the logs that refer to the generator service, but the same errors are present for my other gRPC services (restarting watch, resource version too old).

Also, in the kubectl container in that same pod, I’m seeing a ton of these in the logs:

I0102 19:21:29.511038       1 logs.go:41] http: proxy error: context canceled
I0102 19:34:26.555429       1 logs.go:41] httputil: ReverseProxy read error during body copy: context canceled

linkerd image: buoyantio/linkerd:1.3.4
kubectl image: buoyantio/kubectl:v1.8.5
k8s version: v1.8.4-gke.0 (running on GKE with alpha features disabled)

Hi @bourquep!

I see that you linked to this from linkerd#1730. Are you reasonably convinced that this is the same issue, i.e. should we use your report to prioritise work there instead of treating this as a separate investigation? This is important to make sure we’re prioritising the most important issues for our users and not duplicating effort :slight_smile:

Cheers

I believe they are the same issue, or at least have the same root cause. What they have in common:

  • The io.l5d.k8s namer stops being able to resolve a k8s service/endpoint.
  • The problem is intermittent, with no obvious trigger (the OP of #1730 mentions that deploying a new version of a service sometimes triggers the bug; on my end it started occurring out of the blue: nothing was updated on my test cluster over the holidays, yet the problem manifested itself).
  • We both see “too old resource version” errors in the Linkerd logs, accompanied by “restarting watch” messages.
  • We both see “context canceled” errors in the kubectl container logs from the l5d pods.
  • Restarting l5d (killing the pods and letting the DaemonSet recreate them) fixes the issue.

Let me know if you’d like me to post this back to the issue.

Pascal

Thanks! I will consolidate the resolution there then!