Marking connection as dead + 0.0.0.0/4140 timeouts (Kubernetes 1.7 incompatibility?)

Hello

We’re using Linkerd for our microservices that use gRPC to communicate with each other. We’re running this on Kubernetes (Google Cloud) with the Linkerd daemonset configuration.

Currently we have two environments running. One for development and one for staging.
On development, all microservices have one replica and there are 4 nodes. This cluster is running on Kubernetes 1.6.4.
On staging, the microservices have two replicas and there are 3 nodes. This cluster is running on Kubernetes 1.7.3.
The only differences in configuration between the two are the log level and the sample rates for tracing (Zipkin and recent requests).

Recently we updated to Linkerd 1.3.0 on both environments and shortly after we began experiencing some weird errors on the staging environment. Linkerd on the development environment is working as expected.

The following is happening:
Linkerd is marking connections as dead every few minutes, and I can’t seem to figure out why. “Reset.InternalError” is also mentioned, but I’m not really sure what that means.

Example log:

I 1011 09:13:05.450 UTC THREAD10 TraceId:db99517bcf5efd0b: FailureAccrualFactory marking connection to "%/io.l5d.k8s.daemonset/[namespace]/incoming/linkerd/#/io.l5d.k8s/[namespace]/grpc/[name]-api" as dead. Remote Address: Inet(/10.52.0.64:4141,Map(nodeName -> gke-luna-pool-1-dfd37df9-bk48))
I 1011 09:13:05.452 UTC THREAD10: [S L:/10.52.1.15:4140 R:/10.52.1.33:59128 S:22089] unexpected error; resetting remote: INTERNAL_ERROR
com.twitter.finagle.GlobalRequestTimeoutException: exceeded 10.seconds to 0.0.0.0/4140 while waiting for a response for the request, including retries (if applicable). Remote Info: Not Available


W 1011 09:13:05.450 UTC THREAD10 TraceId:db99517bcf5efd0b: Exception propagated to the default monitor (upstream address: /10.52.1.33:59128, downstream address: /10.52.0.64:4141, label: %/io.l5d.k8s.daemonset/[namespace]/incoming/linkerd/#/io.l5d.k8s/[namespace]/grpc/[name]-api).
Reset.InternalError

I 1011 09:18:47.480 UTC THREAD10 TraceId:59b4c4ecb20c02e6: FailureAccrualFactory marking connection to "%/io.l5d.k8s.daemonset/[namespace]/incoming/linkerd/#/io.l5d.k8s/[namespace]/grpc/[name]-api" as dead. Remote Address: Inet(/10.52.0.64:4141,Map(nodeName -> gke-luna-pool-1-dfd37df9-bk48))
W 1011 09:18:47.481 UTC THREAD10 TraceId:59b4c4ecb20c02e6: Exception propagated to the default monitor (upstream address: /10.52.1.33:59128, downstream address: /10.52.0.64:4141, label: %/io.l5d.k8s.daemonset/[namespace]/incoming/linkerd/#/io.l5d.k8s/[namespace]/grpc/[name]-api).
Reset.InternalError

I 1011 09:18:47.483 UTC THREAD10: [S L:/10.52.1.15:4140 R:/10.52.1.33:59128 S:22165] unexpected error; resetting remote: INTERNAL_ERROR
com.twitter.finagle.GlobalRequestTimeoutException: exceeded 10.seconds to 0.0.0.0/4140 while waiting for a response for the request, including retries (if applicable). Remote Info: Not Available

There seems to be something wrong with the Linkerd instance running on that particular node (gke-luna-pool-1-dfd37df9-bk48). When requests end up going through that instance, they time out, even though I see them arriving in the microservice.

The Linkerd pod on that node is able to reach the microservice that’s being marked as dead just fine.
What could be the problem here?

I hope it’s clear enough.
Any help is highly appreciated!

Thanks
Dean

Hi @Dean – apologies for the delay in responding here. Just to clarify, are you seeing all of the linkerd processes in your staging environment marking connections as dead? Or is it happening with just one of the instances, gke-luna-pool-1-dfd37df9-bk48? Does restarting that instance fix the issue?

I can try to reproduce this in one of our Kubernetes environments. Would you mind sharing the linkerd config that you’re using in the staging environment?

Hi @klingerf !

No problem. Thanks for taking a look at this.

The FailureAccrualFactory marking connection to -snip- as dead error happens on other nodes as well, but it does indeed always seem to mention that one node: Remote Address: Inet(/10.52.4.78:4141,Map(nodeName -> gke-luna-pool-1-dfd37df9-bk48)).

I’ve tried restarting the whole daemonset, so every single instance of Linkerd, but it did not help much.
I’ve enabled the trace log level to get some more insight, but the logs are still unclear to me.

Sure, here’s the staging config:

apiVersion: v1
kind: ConfigMap
metadata:
  name: linkerd-config
data:
  config.yaml: |-
    admin:
      port: 9990

    namers:
    - kind: io.l5d.k8s
      host: localhost
      port: 8001

    telemetry:
    - kind: io.l5d.zipkin
      host: zipkin.[namespace].svc.cluster.local
      port: 9410
      sampleRate: 0.10
    - kind: io.l5d.recentRequests
      sampleRate: 0.10

    usage:
      orgId: snip

    routers:
    - protocol: h2
      label: outgoing
      experimental: true
      dtab: |
        /srv        => /#/io.l5d.k8s/[namespace]/grpc;
        /grpc       => /srv;
        /svc        => /$/io.buoyant.http.domainToPathPfx/grpc;
      identifier:
        kind: io.l5d.header.path
        segments: 1
      interpreter:
        kind: default
        transformers:
        - kind: io.l5d.k8s.daemonset
          namespace: [namespace]
          port: incoming
          service: linkerd
      servers:
      - port: 4140
        ip: 0.0.0.0
      service:
        kind: io.l5d.global
        totalTimeoutMs: 10000
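        # Note: this 10s total timeout appears to be what surfaces in the logs above as
        # "GlobalRequestTimeoutException: exceeded 10.seconds to 0.0.0.0/4140".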

    - protocol: h2
      label: incoming
      experimental: true
      dtab: |
        /srv        => /#/io.l5d.k8s/[namespace]/grpc;
        /grpc       => /srv;
        /svc        => /$/io.buoyant.http.domainToPathPfx/grpc;
      identifier:
        kind: io.l5d.header.path
        segments: 1
      interpreter:
        kind: default
        transformers:
        - kind: io.l5d.k8s.localnode
      servers:
      - port: 4141
        ip: 0.0.0.0

Here is some more logging at trace level. It seems that Linkerd is having trouble communicating with the Linkerd instance on the other node?

D 1012 19:35:52.852 UTC THREAD33 TraceId:47b5e7c4b940f34c: [L:/10.52.4.78:45298 R:/10.52.4.79:50051] write: DATA
D 1012 19:35:52.907 UTC THREAD33 TraceId:bfd9745dddff703a: [L:/10.52.4.78:45298 R:/10.52.4.79:50051] wrote: DATA: Return(())
D 1012 19:35:52.923 UTC THREAD33 TraceId:47b5e7c4b940f34c: [L:/10.52.4.78:45298 R:/10.52.4.79:50051] wrote: DATA: Return(())
D 1012 19:35:53.002 UTC THREAD27 TraceId:6b1d7c9f81934a59: [C L:/10.52.4.78:45298 R:/10.52.4.79:50051 S:841] initialized stream
D 1012 19:35:53.003 UTC THREAD33 TraceId:852cec267c5b73d6: [L:/10.52.4.78:45298 R:/10.52.4.79:50051] wrote: RST_STREAM: Return(())
D 1012 19:35:53.009 UTC THREAD27 TraceId:6b1d7c9f81934a59: [L:/10.52.4.78:45298 R:/10.52.4.79:50051] write: HEADERS
D 1012 19:35:53.024 UTC THREAD27: [S L:/10.52.4.78:4141 R:/10.52.6.67:57656 S:419] admitting DATA in Open(com.twitter.finagle.buoyant.h2.netty4.Netty4StreamTransport$RemoteStreaming@2daf2257)
D 1012 19:35:53.024 UTC THREAD27: [S L:/10.52.4.78:4141 R:/10.52.6.67:57656 S:413] admitting RST_STREAM in com.twitter.finagle.buoyant.h2.netty4.Netty4StreamTransport$RemoteClosed@2e2a4a10
D 1012 19:35:53.023 UTC THREAD33 TraceId:a8ee72f9d6f2155c: [L:/10.52.4.78:45298 R:/10.52.4.79:50051] wrote: RST_STREAM: Return(())
D 1012 19:35:53.029 UTC THREAD27: [C L:/10.52.4.78:45298 R:/10.52.4.79:50051 S:831] remote message interrupted: Reset.InternalError
D 1012 19:35:53.037 UTC THREAD27: [C L:/10.52.4.78:45298 R:/10.52.4.79:50051 S:831] resetting Reset.InternalError in Open(com.twitter.finagle.buoyant.h2.netty4.Netty4StreamTransport$RemotePending@a3a07c2)
D 1012 19:35:53.044 UTC THREAD27: [S L:/10.52.4.78:4141 R:/10.52.6.67:57656 S:413] stream reset from remote: Reset.InternalError
D 1012 19:35:53.057 UTC THREAD27 TraceId:a35643cb01d677e0: [C L:/10.52.4.78:45298 R:/10.52.4.79:50051 S:831] stream reset from local; resetting remote: Reset.InternalError
D 1012 19:35:53.067 UTC THREAD33 TraceId:a35643cb01d677e0: [L:/10.52.4.78:45298 R:/10.52.4.79:50051] wrote: DATA: Return(())
D 1012 19:35:53.074 UTC THREAD27 TraceId:a35643cb01d677e0: [L:/10.52.4.78:45298 R:/10.52.4.79:50051] write: RST_STREAM
W 1012 19:35:53.103 UTC THREAD27 TraceId:a35643cb01d677e0: Exception propagated to the default monitor (upstream address: /10.52.6.67:57656, downstream address: /10.52.4.79:50051, label: %/io.l5d.k8s.localnode/10.52.4.78/#/io.l5d.k8s/[namespace]/grpc/asset-api).
Reset.InternalError

D 1012 19:35:53.147 UTC THREAD33 TraceId:6b1d7c9f81934a59: [L:/10.52.4.78:45298 R:/10.52.4.79:50051] wrote: HEADERS: Return(())
D 1012 19:35:53.154 UTC THREAD33 TraceId:6b1d7c9f81934a59: [L:/10.52.4.78:45298 R:/10.52.4.79:50051] write: DATA
D 1012 19:35:53.178 UTC THREAD27: [S L:/10.52.4.78:4141 R:/10.52.6.67:57656 S:415] admitting RST_STREAM in com.twitter.finagle.buoyant.h2.netty4.Netty4StreamTransport$RemoteClosed@6056a829
D 1012 19:35:53.172 UTC THREAD33 TraceId:6b1d7c9f81934a59: [L:/10.52.4.78:45298 R:/10.52.4.79:50051] wrote: DATA: Return(())
D 1012 19:35:53.189 UTC THREAD27: [C L:/10.52.4.78:45298 R:/10.52.4.79:50051 S:837] remote message interrupted: Reset.InternalError
D 1012 19:35:53.192 UTC THREAD27: [C L:/10.52.4.78:45298 R:/10.52.4.79:50051 S:837] resetting Reset.InternalError in LocalClosed(com.twitter.finagle.buoyant.h2.netty4.Netty4StreamTransport$RemotePending@343f93ee)
D 1012 19:35:53.192 UTC THREAD27: [S L:/10.52.4.78:4141 R:/10.52.6.67:57656 S:415] stream reset from remote: Reset.InternalError
D 1012 19:35:53.195 UTC THREAD33 TraceId:a35643cb01d677e0: [L:/10.52.4.78:45298 R:/10.52.4.79:50051] wrote: RST_STREAM: Return(())
D 1012 19:35:53.193 UTC THREAD27 TraceId:bfd9745dddff703a: [C L:/10.52.4.78:45298 R:/10.52.4.79:50051 S:837] stream reset from local; resetting remote: Reset.InternalError
D 1012 19:35:53.197 UTC THREAD27 TraceId:bfd9745dddff703a: [L:/10.52.4.78:45298 R:/10.52.4.79:50051] write: RST_STREAM
D 1012 19:35:53.217 UTC THREAD33: [C L:/10.52.4.78:45298 R:/10.52.4.79:50051 S:839] admitting HEADERS in LocalClosed(com.twitter.finagle.buoyant.h2.netty4.Netty4StreamTransport$RemotePending@483925b3)
D 1012 19:35:53.298 UTC THREAD33: [C L:/10.52.4.78:45298 R:/10.52.4.79:50051 S:839] admitting DATA in LocalClosed(com.twitter.finagle.buoyant.h2.netty4.Netty4StreamTransport$RemoteStreaming@12e73f61)
D 1012 19:35:53.314 UTC THREAD33: [C L:/10.52.4.78:45298 R:/10.52.4.79:50051 S:839] admitting HEADERS in LocalClosed(com.twitter.finagle.buoyant.h2.netty4.Netty4StreamTransport$RemoteStreaming@12e73f61)
I 1012 19:35:53.319 UTC THREAD27 TraceId:bfd9745dddff703a: FailureAccrualFactory marking connection to "%/io.l5d.k8s.localnode/10.52.4.78/#/io.l5d.k8s/[namespace]/grpc/asset-api" as dead. Remote Address: Inet(/10.52.4.79:50051,Map(nodeName -> gke-luna-pool-1-dfd37df9-bk48))
D 1012 19:35:53.326 UTC THREAD33 TraceId:47b5e7c4b940f34c: [C L:/10.52.4.78:45298 R:/10.52.4.79:50051 S:839] stream closed
D 1012 19:35:53.377 UTC THREAD14 TraceId:47b5e7c4b940f34c: [L:/10.52.4.78:4141 R:/10.52.6.67:57656] write: WINDOW_UPDATE
D 1012 19:35:53.384 UTC THREAD14: [L:/10.52.4.78:4141 R:/10.52.6.67:57656] write: HEADERS
W 1012 19:35:53.394 UTC THREAD27 TraceId:bfd9745dddff703a: Exception propagated to the default monitor (upstream address: /10.52.6.67:57656, downstream address: /10.52.4.79:50051, label: %/io.l5d.k8s.localnode/10.52.4.78/#/io.l5d.k8s/[namespace]/grpc/asset-api).
Reset.InternalError

D 1012 19:35:53.849 UTC THREAD33: [C L:/10.52.4.78:45298 R:/10.52.4.79:50051 S:841] admitting HEADERS in LocalClosed(com.twitter.finagle.buoyant.h2.netty4.Netty4StreamTransport$RemotePending@47a5daf4)
D 1012 19:35:53.850 UTC THREAD27: [S L:/10.52.4.78:4141 R:/10.52.6.67:57656 S:417] admitting RST_STREAM in com.twitter.finagle.buoyant.h2.netty4.Netty4StreamTransport$RemoteClosed@fcd810d
D 1012 19:35:53.872 UTC THREAD27: [S L:/10.52.4.78:4141 R:/10.52.6.67:57656 S:417] stream reset from remote: Reset.InternalError
D 1012 19:35:53.880 UTC THREAD27: [S L:/10.52.4.78:4141 R:/10.52.6.67:57656 S:419] admitting RST_STREAM in com.twitter.finagle.buoyant.h2.netty4.Netty4StreamTransport$RemoteClosed@e5829fa
D 1012 19:35:53.888 UTC THREAD27: [S L:/10.52.4.78:4141 R:/10.52.6.67:57656 S:419] stream reset from remote: Reset.InternalError
D 1012 19:35:53.891 UTC THREAD27 TraceId:47b5e7c4b940f34c: [L:/10.52.4.78:4141 R:/10.52.6.67:57656] wrote: WINDOW_UPDATE: Return(())
D 1012 19:35:53.915 UTC THREAD27: [L:/10.52.4.78:4141 R:/10.52.6.67:57656] wrote: HEADERS: Throw(com.twitter.finagle.ChannelWriteException: com.twitter.finagle.UnknownChannelException: Stream no longer exists: 417 at remote address: /10.52.6.67:57656. Remote Info: Not Available. Remote Info: Not Available)
E 1012 19:35:53.921 UTC THREAD27: [S L:/10.52.4.78:4141 R:/10.52.6.67:57656 S:417] unexpected error
com.twitter.finagle.ChannelWriteException: com.twitter.finagle.UnknownChannelException: Stream no longer exists: 417 at remote address: /10.52.6.67:57656. Remote Info: Not Available. Remote Info: Not Available
Caused by: com.twitter.finagle.UnknownChannelException: Stream no longer exists: 417 at remote address: /10.52.6.67:57656. Remote Info: Not Available
        at com.twitter.finagle.ChannelException$.apply(Exceptions.scala:261)
        at com.twitter.finagle.netty4.transport.ChannelTransport$$anon$2.operationComplete(ChannelTransport.scala:106)
        at com.twitter.finagle.netty4.transport.ChannelTransport$$anon$2.operationComplete(ChannelTransport.scala:102)
        at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:507)
        at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:481)
        at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:420)
        at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:122)
        at io.netty.handler.codec.http2.DefaultHttp2ConnectionEncoder.writeHeaders(DefaultHttp2ConnectionEncoder.java:170)
        at io.netty.handler.codec.http2.DefaultHttp2ConnectionEncoder.writeHeaders(DefaultHttp2ConnectionEncoder.java:147)
        at io.netty.handler.codec.http2.H2FrameCodec.write(H2FrameCodec.scala:141)
        at io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:738)
        at io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:730)
        at io.netty.channel.AbstractChannelHandlerContext.access$1900(AbstractChannelHandlerContext.java:38)
        at io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.write(AbstractChannelHandlerContext.java:1089)
        at io.netty.channel.AbstractChannelHandlerContext$WriteAndFlushTask.write(AbstractChannelHandlerContext.java:1136)
        at io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.run(AbstractChannelHandlerContext.java:1078)
        at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
        at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:403)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463)
        at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at com.twitter.finagle.util.BlockingTimeTrackingThreadFactory$$anon$1.run(BlockingTimeTrackingThreadFactory.scala:23)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.IllegalStateException: Stream no longer exists: 417
        ... 17 more
Caused by: io.netty.handler.codec.http2.Http2Exception: Request stream 417 is not correct for server connection
        at io.netty.handler.codec.http2.Http2Exception.connectionError(Http2Exception.java:85)
        at io.netty.handler.codec.http2.DefaultHttp2Connection$DefaultEndpoint.checkNewStreamAllowed(DefaultHttp2Connection.java:867)
        at io.netty.handler.codec.http2.DefaultHttp2Connection$DefaultEndpoint.createStream(DefaultHttp2Connection.java:733)
        at io.netty.handler.codec.http2.DefaultHttp2Connection$DefaultEndpoint.createStream(DefaultHttp2Connection.java:653)
        at io.netty.handler.codec.http2.DefaultHttp2ConnectionEncoder.writeHeaders(DefaultHttp2ConnectionEncoder.java:167)
        ... 16 more

E 1012 19:35:53.922 UTC THREAD27: [S L:/10.52.4.78:4141 R:/10.52.6.67:57656 S:417] ignoring exception
com.twitter.finagle.ChannelWriteException: com.twitter.finagle.UnknownChannelException: Stream no longer exists: 417 at remote address: /10.52.6.67:57656. Remote Info: Not Available. Remote Info: Not Available
Caused by: com.twitter.finagle.UnknownChannelException: Stream no longer exists: 417 at remote address: /10.52.6.67:57656. Remote Info: Not Available
        at com.twitter.finagle.ChannelException$.apply(Exceptions.scala:261)
        at com.twitter.finagle.netty4.transport.ChannelTransport$$anon$2.operationComplete(ChannelTransport.scala:106)
        at com.twitter.finagle.netty4.transport.ChannelTransport$$anon$2.operationComplete(ChannelTransport.scala:102)
        at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:507)
        at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:481)
        at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:420)
        at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:122)
        at io.netty.handler.codec.http2.DefaultHttp2ConnectionEncoder.writeHeaders(DefaultHttp2ConnectionEncoder.java:170)
        at io.netty.handler.codec.http2.DefaultHttp2ConnectionEncoder.writeHeaders(DefaultHttp2ConnectionEncoder.java:147)
        at io.netty.handler.codec.http2.H2FrameCodec.write(H2FrameCodec.scala:141)
        at io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:738)
        at io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:730)
        at io.netty.channel.AbstractChannelHandlerContext.access$1900(AbstractChannelHandlerContext.java:38)
        at io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.write(AbstractChannelHandlerContext.java:1089)
        at io.netty.channel.AbstractChannelHandlerContext$WriteAndFlushTask.write(AbstractChannelHandlerContext.java:1136)
        at io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.run(AbstractChannelHandlerContext.java:1078)
        at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
        at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:403)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463)
        at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at com.twitter.finagle.util.BlockingTimeTrackingThreadFactory$$anon$1.run(BlockingTimeTrackingThreadFactory.scala:23)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.IllegalStateException: Stream no longer exists: 417
        ... 17 more
Caused by: io.netty.handler.codec.http2.Http2Exception: Request stream 417 is not correct for server connection
        at io.netty.handler.codec.http2.Http2Exception.connectionError(Http2Exception.java:85)
        at io.netty.handler.codec.http2.DefaultHttp2Connection$DefaultEndpoint.checkNewStreamAllowed(DefaultHttp2Connection.java:867)
        at io.netty.handler.codec.http2.DefaultHttp2Connection$DefaultEndpoint.createStream(DefaultHttp2Connection.java:733)
        at io.netty.handler.codec.http2.DefaultHttp2Connection$DefaultEndpoint.createStream(DefaultHttp2Connection.java:653)
        at io.netty.handler.codec.http2.DefaultHttp2ConnectionEncoder.writeHeaders(DefaultHttp2ConnectionEncoder.java:167)
        ... 16 more

D 1012 19:35:53.918 UTC THREAD33 TraceId:6b1d7c9f81934a59: [L:/10.52.4.78:4141 R:/10.52.6.67:57656] write: WINDOW_UPDATE
D 1012 19:35:53.932 UTC THREAD27 TraceId:6b1d7c9f81934a59: [L:/10.52.4.78:4141 R:/10.52.6.67:57656] wrote: WINDOW_UPDATE: Return(())
D 1012 19:35:53.944 UTC THREAD33: [S L:/10.52.4.78:4141 R:/10.52.6.67:57656 S:419] remote write failed: Reset.InternalError
D 1012 19:35:53.951 UTC THREAD33: [C L:/10.52.4.78:45298 R:/10.52.4.79:50051 S:841] admitting DATA in LocalClosed(com.twitter.finagle.buoyant.h2.netty4.Netty4StreamTransport$RemoteStreaming@3da3929a)
D 1012 19:35:54.670 UTC THREAD33: [C L:/10.52.4.78:45298 R:/10.52.4.79:50051 S:841] admitting HEADERS in LocalClosed(com.twitter.finagle.buoyant.h2.netty4.Netty4StreamTransport$RemoteStreaming@3da3929a)
D 1012 19:35:54.673 UTC THREAD33 TraceId:6b1d7c9f81934a59: [C L:/10.52.4.78:45298 R:/10.52.4.79:50051 S:841] stream closed
D 1012 19:35:54.689 UTC THREAD33 TraceId:bfd9745dddff703a: [L:/10.52.4.78:45298 R:/10.52.4.79:50051] wrote: RST_STREAM: Return(())

Weirdly enough, these errors haven’t occurred in the last two days. Previously they happened roughly every 1-5 minutes, all day long.

Now Linkerd has been acting up about something else, but that seems to be an already-known issue with the k8s namer?

D 1015 11:46:06.687 UTC THREAD27 TraceId:893732fbead98a25: k8s returned 'too old resource version' error with incorrect HTTP status code, restarting watch
D 1015 11:46:06.687 UTC THREAD27 TraceId:893732fbead98a25: k8s restarting watch on /api/v1/watch/namespaces/envisense-v2/endpoints/customer-api, resource version Some(47375837) was too old
D 1015 11:58:48.632 UTC THREAD31 TraceId:5589ac04a7ec2945: k8s returned 'too old resource version' error with incorrect HTTP status code, restarting watch
D 1015 11:58:48.632 UTC THREAD31 TraceId:5589ac04a7ec2945: k8s restarting watch on /api/v1/watch/namespaces/envisense-v2/endpoints/asset-type-api, resource version Some(47375817) was too old
D 1015 12:03:24.696 UTC THREAD28 TraceId:bcdf8d915024856b: k8s ns envisense-v2 service customer-api modified endpoints
D 1015 12:03:24.723 UTC THREAD28 TraceId:bcdf8d915024856b: k8s ns envisense-v2 service customer-api removed Endpoint(/10.52.6.73,Some(gke-luna-pool-1-dfd37df9-dpb4))
D 1015 12:03:24.703 UTC THREAD28 TraceId:bcdf8d915024856b: k8s ns envisense-v2 service customer-api removed Endpoint(/10.52.4.81,Some(gke-luna-pool-1-dfd37df9-bk48))
D 1015 12:03:24.724 UTC THREAD28 TraceId:bcdf8d915024856b: k8s ns envisense-v2 service customer-api removed port mapping from grpc to 50051
D 1015 12:03:24.846 UTC THREAD28 TraceId:bcdf8d915024856b: k8s ns envisense-v2 service customer-api added Endpoint(/10.52.4.81,Some(gke-luna-pool-1-dfd37df9-bk48))
D 1015 12:03:24.846 UTC THREAD28 TraceId:bcdf8d915024856b: k8s ns envisense-v2 service customer-api added Endpoint(/10.52.6.73,Some(gke-luna-pool-1-dfd37df9-dpb4))
D 1015 12:03:24.848 UTC THREAD28 TraceId:bcdf8d915024856b: k8s ns envisense-v2 service customer-api mapped port grpc to 50051
D 1015 12:03:24.846 UTC THREAD28 TraceId:bcdf8d915024856b: k8s ns envisense-v2 service customer-api added endpoints
D 1015 12:03:59.063 UTC THREAD27 TraceId:a4f2e948c5d971ca: k8s ns envisense-v2 service customer-api modified endpoints
D 1015 12:03:59.063 UTC THREAD27 TraceId:a4f2e948c5d971ca: k8s ns envisense-v2 service customer-api removed Endpoint(/10.52.4.81,Some(gke-luna-pool-1-dfd37df9-bk48))
D 1015 12:03:59.063 UTC THREAD27 TraceId:a4f2e948c5d971ca: k8s ns envisense-v2 service customer-api removed Endpoint(/10.52.6.73,Some(gke-luna-pool-1-dfd37df9-dpb4))
D 1015 12:03:59.063 UTC THREAD27 TraceId:a4f2e948c5d971ca: k8s ns envisense-v2 service customer-api removed port mapping from grpc to 50051
D 1015 12:03:59.172 UTC THREAD27 TraceId:a4f2e948c5d971ca: k8s ns envisense-v2 service customer-api added endpoints
D 1015 12:03:59.172 UTC THREAD27 TraceId:a4f2e948c5d971ca: k8s ns envisense-v2 service customer-api added Endpoint(/10.52.6.73,Some(gke-luna-pool-1-dfd37df9-dpb4))
D 1015 12:03:59.172 UTC THREAD27 TraceId:a4f2e948c5d971ca: k8s ns envisense-v2 service customer-api mapped port grpc to 50051
D 1015 12:03:59.172 UTC THREAD27 TraceId:a4f2e948c5d971ca: k8s ns envisense-v2 service customer-api added Endpoint(/10.52.4.81,Some(gke-luna-pool-1-dfd37df9-bk48))

The logs are full of this: the same endpoints being removed and then re-added.

We’re also occasionally getting com.twitter.io.Reader$ReaderDiscarded: This writer's reader has been discarded.

Thanks
Dean

Hey @Dean, great, thanks for the additional info. The logs help. I think unfortunately you’re encountering two separate outstanding linkerd issues.

We should hopefully have fixes for those soon. I can also poke around with a gRPC + Kubernetes 1.7 setup to see if I can repro the connection-close issue from the original post.

Hi @klingerf

Thanks for pointing that out! Hopefully they can be fixed soon.

Today we’ve been experiencing the initial issue again, and a different node is involved now.

See logs (unfortunately without trace level):

I 1019 12:37:38.763 UTC THREAD10 TraceId:aacd05e825b6b96f: FailureAccrualFactory marking connection to "%/io.l5d.k8s.daemonset/[namespace]/incoming/linkerd/#/io.l5d.k8s/[namespace]/grpc/rule-api" as dead. Remote Address: Inet(/10.52.5.76:4141,Map(nodeName -> gke-luna-pool-1-dfd37df9-c39h))
I 1019 12:37:38.765 UTC THREAD10: [S L:/10.52.6.67:4140 R:/10.52.6.81:53556 S:28417] unexpected error; resetting remote: INTERNAL_ERROR
com.twitter.finagle.GlobalRequestTimeoutException: exceeded 10.seconds to 0.0.0.0/4140 while waiting for a response for the request, including retries (if applicable). Remote Info: Not Available

W 1019 12:37:38.764 UTC THREAD10 TraceId:aacd05e825b6b96f: Exception propagated to the default monitor (upstream address: /10.52.6.81:53556, downstream address: /10.52.5.76:4141, label: %/io.l5d.k8s.daemonset/[namespace]/incoming/linkerd/#/io.l5d.k8s/[namespace]/grpc/rule-api).
Reset.InternalError

I 1019 12:44:22.754 UTC THREAD10 TraceId:a89ff0aff8444529: FailureAccrualFactory marking connection to "%/io.l5d.k8s.daemonset/[namespace]/incoming/linkerd/#/io.l5d.k8s/[namespace]/grpc/rule-api" as dead. Remote Address: Inet(/10.52.5.76:4141,Map(nodeName -> gke-luna-pool-1-dfd37df9-c39h))
W 1019 12:44:22.755 UTC THREAD10 TraceId:a89ff0aff8444529: Exception propagated to the default monitor (upstream address: /10.52.6.81:53556, downstream address: /10.52.5.76:4141, label: %/io.l5d.k8s.daemonset/[namespace]/incoming/linkerd/#/io.l5d.k8s/[namespace]/grpc/rule-api).
Reset.InternalError

I 1019 12:44:22.756 UTC THREAD10: [S L:/10.52.6.67:4140 R:/10.52.6.81:53556 S:28665] unexpected error; resetting remote: INTERNAL_ERROR
com.twitter.finagle.GlobalRequestTimeoutException: exceeded 10.seconds to 0.0.0.0/4140 while waiting for a response for the request, including retries (if applicable). Remote Info: Not Available

I 1019 12:49:57.453 UTC THREAD10 TraceId:eef2217a07ac76c1: FailureAccrualFactory marking connection to "%/io.l5d.k8s.daemonset/[namespace]/incoming/linkerd/#/io.l5d.k8s/[namespace]/grpc/rule-api" as dead. Remote Address: Inet(/10.52.5.76:4141,Map(nodeName -> gke-luna-pool-1-dfd37df9-c39h))
W 1019 12:49:57.454 UTC THREAD10 TraceId:eef2217a07ac76c1: Exception propagated to the default monitor (upstream address: /10.52.6.81:53556, downstream address: /10.52.5.76:4141, label: %/io.l5d.k8s.daemonset/[namespace]/incoming/linkerd/#/io.l5d.k8s/[namespace]/grpc/rule-api).
Reset.InternalError

I 1019 12:49:57.455 UTC THREAD10: [S L:/10.52.6.67:4140 R:/10.52.6.81:53556 S:29035] unexpected error; resetting remote: INTERNAL_ERROR
com.twitter.finagle.GlobalRequestTimeoutException: exceeded 10.seconds to 0.0.0.0/4140 while waiting for a response for the request, including retries (if applicable). Remote Info: Not Available

I 1019 12:55:31.724 UTC THREAD10 TraceId:a64df6ee1daf1452: FailureAccrualFactory marking connection to "%/io.l5d.k8s.daemonset/[namespace]/incoming/linkerd/#/io.l5d.k8s/[namespace]/grpc/rule-api" as dead. Remote Address: Inet(/10.52.5.76:4141,Map(nodeName -> gke-luna-pool-1-dfd37df9-c39h))
I 1019 12:55:31.726 UTC THREAD10: [S L:/10.52.6.67:4140 R:/10.52.6.81:53556 S:29169] unexpected error; resetting remote: INTERNAL_ERROR
com.twitter.finagle.GlobalRequestTimeoutException: exceeded 10.seconds to 0.0.0.0/4140 while waiting for a response for the request, including retries (if applicable). Remote Info: Not Available

W 1019 12:55:31.725 UTC THREAD10 TraceId:a64df6ee1daf1452: Exception propagated to the default monitor (upstream address: /10.52.6.81:53556, downstream address: /10.52.5.76:4141, label: %/io.l5d.k8s.daemonset/[namespace]/incoming/linkerd/#/io.l5d.k8s/[namespace]/grpc/rule-api).
Reset.InternalError

I 1019 13:02:54.824 UTC THREAD10 TraceId:38c23f2049a885ee: FailureAccrualFactory marking connection to "%/io.l5d.k8s.daemonset/[namespace]/incoming/linkerd/#/io.l5d.k8s/[namespace]/grpc/rule-api" as dead. Remote Address: Inet(/10.52.5.76:4141,Map(nodeName -> gke-luna-pool-1-dfd37df9-c39h))
W 1019 13:02:54.825 UTC THREAD10 TraceId:38c23f2049a885ee: Exception propagated to the default monitor (upstream address: /10.52.6.81:53556, downstream address: /10.52.5.76:4141, label: %/io.l5d.k8s.daemonset/[namespace]/incoming/linkerd/#/io.l5d.k8s/[namespace]/grpc/rule-api).
Reset.InternalError

I 1019 13:02:54.826 UTC THREAD10: [S L:/10.52.6.67:4140 R:/10.52.6.81:53556 S:29321] unexpected error; resetting remote: INTERNAL_ERROR
com.twitter.finagle.GlobalRequestTimeoutException: exceeded 10.seconds to 0.0.0.0/4140 while waiting for a response for the request, including retries (if applicable). Remote Info: Not Available

I 1019 13:04:31.911 UTC THREAD10: Reaping /svc/device-api
I 1019 13:04:32.197 UTC THREAD10: Reaping %/io.l5d.k8s.localnode/10.52.6.67/#/io.l5d.k8s/[namespace]/grpc/device-api
I 1019 13:08:30.294 UTC THREAD10 TraceId:dfb09cf4751cdc39: FailureAccrualFactory marking connection to "%/io.l5d.k8s.daemonset/[namespace]/incoming/linkerd/#/io.l5d.k8s/[namespace]/grpc/rule-api" as dead. Remote Address: Inet(/10.52.5.76:4141,Map(nodeName -> gke-luna-pool-1-dfd37df9-c39h))
I 1019 13:08:30.296 UTC THREAD10: [S L:/10.52.6.67:4140 R:/10.52.6.81:53556 S:29533] unexpected error; resetting remote: INTERNAL_ERROR
com.twitter.finagle.GlobalRequestTimeoutException: exceeded 10.seconds to 0.0.0.0/4140 while waiting for a response for the request, including retries (if applicable). Remote Info: Not Available

W 1019 13:08:30.296 UTC THREAD10 TraceId:dfb09cf4751cdc39: Exception propagated to the default monitor (upstream address: /10.52.6.81:53556, downstream address: /10.52.5.76:4141, label: %/io.l5d.k8s.daemonset/[namespace]/incoming/linkerd/#/io.l5d.k8s/[namespace]/grpc/rule-api).
Reset.InternalError

I 1019 13:14:22.674 UTC THREAD10 TraceId:dcec564da7079832: FailureAccrualFactory marking connection to "%/io.l5d.k8s.daemonset/[namespace]/incoming/linkerd/#/io.l5d.k8s/[namespace]/grpc/rule-api" as dead. Remote Address: Inet(/10.52.5.76:4141,Map(nodeName -> gke-luna-pool-1-dfd37df9-c39h))
I 1019 13:14:22.676 UTC THREAD10: [S L:/10.52.6.67:4140 R:/10.52.6.81:53556 S:29577] unexpected error; resetting remote: INTERNAL_ERROR
com.twitter.finagle.GlobalRequestTimeoutException: exceeded 10.seconds to 0.0.0.0/4140 while waiting for a response for the request, including retries (if applicable). Remote Info: Not Available

W 1019 13:14:22.675 UTC THREAD10 TraceId:dcec564da7079832: Exception propagated to the default monitor (upstream address: /10.52.6.81:53556, downstream address: /10.52.5.76:4141, label: %/io.l5d.k8s.daemonset/[namespace]/incoming/linkerd/#/io.l5d.k8s/[namespace]/grpc/rule-api).
Reset.InternalError

I 1019 13:20:23.893 UTC THREAD10 TraceId:35f1a066d55bb9ea: FailureAccrualFactory marking connection to "%/io.l5d.k8s.daemonset/[namespace]/incoming/linkerd/#/io.l5d.k8s/[namespace]/grpc/rule-api" as dead. Remote Address: Inet(/10.52.5.76:4141,Map(nodeName -> gke-luna-pool-1-dfd37df9-c39h))
I 1019 13:20:23.895 UTC THREAD10: [S L:/10.52.6.67:4140 R:/10.52.6.81:53556 S:29781] unexpected error; resetting remote: INTERNAL_ERROR
com.twitter.finagle.GlobalRequestTimeoutException: exceeded 10.seconds to 0.0.0.0/4140 while waiting for a response for the request, including retries (if applicable). Remote Info: Not Available

W 1019 13:20:23.894 UTC THREAD10 TraceId:35f1a066d55bb9ea: Exception propagated to the default monitor (upstream address: /10.52.6.81:53556, downstream address: /10.52.5.76:4141, label: %/io.l5d.k8s.daemonset/[namespace]/incoming/linkerd/#/io.l5d.k8s/[namespace]/grpc/rule-api).
Reset.InternalError

I 1019 13:27:19.764 UTC THREAD10 TraceId:02012b328150704c: FailureAccrualFactory marking connection to "%/io.l5d.k8s.daemonset/[namespace]/incoming/linkerd/#/io.l5d.k8s/[namespace]/grpc/rule-api" as dead. Remote Address: Inet(/10.52.5.76:4141,Map(nodeName -> gke-luna-pool-1-dfd37df9-c39h))
I 1019 13:27:19.766 UTC THREAD10: [S L:/10.52.6.67:4140 R:/10.52.6.81:53556 S:29915] unexpected error; resetting remote: INTERNAL_ERROR
com.twitter.finagle.GlobalRequestTimeoutException: exceeded 10.seconds to 0.0.0.0/4140 while waiting for a response for the request, including retries (if applicable). Remote Info: Not Available

W 1019 13:27:19.765 UTC THREAD10 TraceId:02012b328150704c: Exception propagated to the default monitor (upstream address: /10.52.6.81:53556, downstream address: /10.52.5.76:4141, label: %/io.l5d.k8s.daemonset/[namespace]/incoming/linkerd/#/io.l5d.k8s/[namespace]/grpc/rule-api).
Reset.InternalError

I 1019 13:32:30.294 UTC THREAD10 TraceId:ef1df7a30ded6c7d: FailureAccrualFactory marking connection to "%/io.l5d.k8s.daemonset/[namespace]/incoming/linkerd/#/io.l5d.k8s/[namespace]/grpc/rule-api" as dead. Remote Address: Inet(/10.52.5.76:4141,Map(nodeName -> gke-luna-pool-1-dfd37df9-c39h))
W 1019 13:32:30.295 UTC THREAD10 TraceId:ef1df7a30ded6c7d: Exception propagated to the default monitor (upstream address: /10.52.6.81:53556, downstream address: /10.52.5.76:4141, label: %/io.l5d.k8s.daemonset/[namespace]/incoming/linkerd/#/io.l5d.k8s/[namespace]/grpc/rule-api).
Reset.InternalError

I 1019 13:32:30.295 UTC THREAD10: [S L:/10.52.6.67:4140 R:/10.52.6.81:53556 S:30035] unexpected error; resetting remote: INTERNAL_ERROR
com.twitter.finagle.GlobalRequestTimeoutException: exceeded 10.seconds to 0.0.0.0/4140 while waiting for a response for the request, including retries (if applicable). Remote Info: Not Available

I 1019 13:34:31.941 UTC THREAD10: Reaping /svc/device-api
I 1019 13:34:32.236 UTC THREAD10: Reaping %/io.l5d.k8s.localnode/10.52.6.67/#/io.l5d.k8s/[namespace]/grpc/device-api
I 1019 13:38:02.574 UTC THREAD10 TraceId:db043bf457e96036: FailureAccrualFactory marking connection to "%/io.l5d.k8s.daemonset/[namespace]/incoming/linkerd/#/io.l5d.k8s/[namespace]/grpc/rule-api" as dead. Remote Address: Inet(/10.52.5.76:4141,Map(nodeName -> gke-luna-pool-1-dfd37df9-c39h))
I 1019 13:38:02.576 UTC THREAD10: [S L:/10.52.6.67:4140 R:/10.52.6.81:53556 S:30211] unexpected error; resetting remote: INTERNAL_ERROR
com.twitter.finagle.GlobalRequestTimeoutException: exceeded 10.seconds to 0.0.0.0/4140 while waiting for a response for the request, including retries (if applicable). Remote Info: Not Available

W 1019 13:38:02.575 UTC THREAD10 TraceId:db043bf457e96036: Exception propagated to the default monitor (upstream address: /10.52.6.81:53556, downstream address: /10.52.5.76:4141, label: %/io.l5d.k8s.daemonset/[namespace]/incoming/linkerd/#/io.l5d.k8s/[namespace]/grpc/rule-api).
Reset.InternalError

I 1019 13:43:15.323 UTC THREAD10 TraceId:496fbecfa9a7e9f9: FailureAccrualFactory marking connection to "%/io.l5d.k8s.daemonset/[namespace]/incoming/linkerd/#/io.l5d.k8s/[namespace]/grpc/rule-api" as dead. Remote Address: Inet(/10.52.5.76:4141,Map(nodeName -> gke-luna-pool-1-dfd37df9-c39h))
W 1019 13:43:15.324 UTC THREAD10 TraceId:496fbecfa9a7e9f9: Exception propagated to the default monitor (upstream address: /10.52.6.81:53556, downstream address: /10.52.5.76:4141, label: %/io.l5d.k8s.daemonset/[namespace]/incoming/linkerd/#/io.l5d.k8s/[namespace]/grpc/rule-api).
Reset.InternalError

I 1019 13:43:15.325 UTC THREAD10: [S L:/10.52.6.67:4140 R:/10.52.6.81:53556 S:30255] unexpected error; resetting remote: INTERNAL_ERROR
com.twitter.finagle.GlobalRequestTimeoutException: exceeded 10.seconds to 0.0.0.0/4140 while waiting for a response for the request, including retries (if applicable). Remote Info: Not Available

I 1019 13:49:28.454 UTC THREAD10 TraceId:62a7351a8e8e3d9d: FailureAccrualFactory marking connection to "%/io.l5d.k8s.daemonset/[namespace]/incoming/linkerd/#/io.l5d.k8s/[namespace]/grpc/rule-api" as dead. Remote Address: Inet(/10.52.5.76:4141,Map(nodeName -> gke-luna-pool-1-dfd37df9-c39h))
W 1019 13:49:28.455 UTC THREAD10 TraceId:62a7351a8e8e3d9d: Exception propagated to the default monitor (upstream address: /10.52.6.81:53556, downstream address: /10.52.5.76:4141, label: %/io.l5d.k8s.daemonset/[namespace]/incoming/linkerd/#/io.l5d.k8s/[namespace]/grpc/rule-api).
Reset.InternalError

I 1019 13:49:28.456 UTC THREAD10: [S L:/10.52.6.67:4140 R:/10.52.6.81:53556 S:30451] unexpected error; resetting remote: INTERNAL_ERROR
com.twitter.finagle.GlobalRequestTimeoutException: exceeded 10.seconds to 0.0.0.0/4140 while waiting for a response for the request, including retries (if applicable). Remote Info: Not Available

I 1019 13:54:55.634 UTC THREAD10 TraceId:3609161e7395ddad: FailureAccrualFactory marking connection to "%/io.l5d.k8s.daemonset/[namespace]/incoming/linkerd/#/io.l5d.k8s/[namespace]/grpc/rule-api" as dead. Remote Address: Inet(/10.52.5.76:4141,Map(nodeName -> gke-luna-pool-1-dfd37df9-c39h))
W 1019 13:54:55.635 UTC THREAD10 TraceId:3609161e7395ddad: Exception propagated to the default monitor (upstream address: /10.52.6.81:53556, downstream address: /10.52.5.76:4141, label: %/io.l5d.k8s.daemonset/[namespace]/incoming/linkerd/#/io.l5d.k8s/[namespace]/grpc/rule-api).
Reset.InternalError

I 1019 13:54:55.635 UTC THREAD10: [S L:/10.52.6.67:4140 R:/10.52.6.81:53556 S:30547] unexpected error; resetting remote: INTERNAL_ERROR
com.twitter.finagle.GlobalRequestTimeoutException: exceeded 10.seconds to 0.0.0.0/4140 while waiting for a response for the request, including retries (if applicable). Remote Info: Not Available

I 1019 14:00:14.044 UTC THREAD10 TraceId:7404720f766e6a54: FailureAccrualFactory marking connection to "%/io.l5d.k8s.daemonset/[namespace]/incoming/linkerd/#/io.l5d.k8s/[namespace]/grpc/rule-api" as dead. Remote Address: Inet(/10.52.5.76:4141,Map(nodeName -> gke-luna-pool-1-dfd37df9-c39h))
I 1019 14:00:14.047 UTC THREAD10: [S L:/10.52.6.67:4140 R:/10.52.6.81:53556 S:30687] unexpected error; resetting remote: INTERNAL_ERROR
com.twitter.finagle.GlobalRequestTimeoutException: exceeded 10.seconds to 0.0.0.0/4140 while waiting for a response for the request, including retries (if applicable). Remote Info: Not Available

W 1019 14:00:14.045 UTC THREAD10 TraceId:7404720f766e6a54: Exception propagated to the default monitor (upstream address: /10.52.6.81:53556, downstream address: /10.52.5.76:4141, label: %/io.l5d.k8s.daemonset/[namespace]/incoming/linkerd/#/io.l5d.k8s/[namespace]/grpc/rule-api).
Reset.InternalError

What does this FailureAccrualFactory error mean exactly? That Linkerd isn’t able to successfully communicate with the service in question? The fact that an internal error occurs right after seems to suggest that something within Linkerd itself is acting up, because I can’t find any trace of the service being down at that moment.

Thanks
Dean

Hey @Dean – thanks for the update.

We’ve published a release candidate today with the fix for linkerd#1669. If you’re using Docker, you can switch to buoyantio/linkerd:1.3.1-rc1 and buoyantio/namerd:1.3.1-rc1, and I think that should resolve at least some of the issues you’re seeing.

That log you pasted is interesting. Are you using namerd in your setup? We sometimes see those global timeouts when linkerd can’t reach namerd. FailureAccrual is how linkerd accomplishes circuit breaking: if it has encountered enough failed requests, linkerd will proactively remove the instance from the load balancer pool. And those Reset.InternalError errors are printed when an h2 stream is reset. They’re noisy, but not necessarily indicative of a problem if, for instance, your client or server is terminating the stream without properly closing it.
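If failure accrual itself turns out to be too aggressive for your setup, it can also be tuned (or disabled) per client in the router config. A rough sketch of what that could look like on one of your h2 routers is below; the threshold and backoff values are just illustrative examples, not recommendations:

routers:
- protocol: h2
  label: outgoing
  client:
    failureAccrual:
      # example: mark an endpoint dead after 5 consecutive failures
      kind: io.l5d.consecutiveFailures
      failures: 5
      backoff:
        # example: keep it out of the load balancer pool for 10s before probing again
        kind: constant
        ms: 10000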

I’d say give 1.3.1-rc1 a shot, and let us know if you’re still seeing the same issues.

Thanks for the explanation, @klingerf. Very helpful!

We’re not using namerd.

We’ve updated to 1.3.1-RC1 and the watch issues indeed seem to be solved. At first sight, no more FailureAccrual errors.
Still lots of com.twitter.io.Reader$ReaderDiscarded: This writer's reader has been discarded, but that’s a different issue. I’ll keep an eye on that.

Thanks for the support!

Today we’ve been experiencing the problem again (with 1.3.1-RC1): both the global timeouts and the FailureAccrual errors, for which more detailed logs can be found above.

...
I 1024 12:49:28.456 UTC THREAD10 TraceId:bbd6f66709beee92: FailureAccrualFactory marking connection to "%/io.l5d.k8s.daemonset/envisense-v2/incoming/linkerd/#/io.l5d.k8s/[namespace]/grpc/asset-api" as dead. Remote Address: Inet(/10.52.7.93:4141,Map(nodeName -> gke-luna-pool-1-dfd37df9-rzgk))
...

Restarting Linkerd resolved the unavailability of the microservices through Linkerd.
While Linkerd was initializing, it seemed to retry the k8s requests for every microservice multiple times. I don’t know if that’s related to the problem we’re having.

... lots of these
E 1024 13:17:43.375 UTC THREAD14: retrying k8s request to /api/v1/namespaces/[namespace]/endpoints/customer-api on error com.twitter.finagle.FailedFastException: Endpoint client is marked down. For more details see: https://twitter.github.io/finagle/guide/FAQ.html#why-do-clients-see-com-twitter-finagle-failedfastexception-s. Remote Info: Upstream Address: Not Available, Upstream id: Not Available, Downstream Address: localhost/127.0.0.1:8001, Downstream label: client, Trace Id: f1b23cf14a8c72f0.f1b23cf14a8c72f0<:f1b23cf14a8c72f0
... randomly in between the others
I 1024 13:17:43.605 UTC THREAD14 TraceId:8fb4b0e136a40935: FailureAccrualFactory marking connection to "client" as dead. Remote Address: Inet(localhost/127.0.0.1:8001,Map())
... again lots of these
E 1024 13:17:43.606 UTC THREAD14: retrying k8s request to /api/v1/namespaces/[namespace]/endpoints/asset-api on error com.twitter.finagle.FailedFastException: Endpoint client is marked down. For more details see: https://twitter.github.io/finagle/guide/FAQ.html#why-do-clients-see-com-twitter-finagle-failedfastexception-s. Remote Info: Upstream Address: Not Available, Upstream id: Not Available, Downstream Address: localhost/127.0.0.1:8001, Downstream label: client, Trace Id: 8fb4b0e136a40935.8fb4b0e136a40935<:8fb4b0e136a40935
...

These errors while initializing seem to be happening on every instance.

Thanks
Dean

Hey @Dean – those logs you pasted suggest that linkerd is having trouble talking to the Kubernetes API, which it should be reaching through a separate kubectl proxy container in the same pod. Can you verify that your linkerd pod spec includes a kubectl container? And can you see if there’s anything of note in the kubectl container’s logs?

Hello @klingerf

Yeah, every Linkerd instance has its own kubectl proxy using the buoyantio/kubectl:v1.4.0 image. It’s started using the arguments “proxy -p 8001”.
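For reference, the kubectl sidecar in our daemonset pod spec looks roughly like this (trimmed to the relevant container):

- name: kubectl
  image: buoyantio/kubectl:v1.4.0
  args:
  - "proxy"
  - "-p"
  - "8001"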

Running kubectl logs linkerd-pod-instance kubectl doesn’t seem to output anything. Does a container with the buoyantio/kubectl:v1.4.0 image output any logs at all?

I just noticed that there’s also a newer version of this image available: v1.6.2. Do you know if there’s anything notably different in that version (compared to v1.4.0)? Edit: oh, this probably just refers to the kubectl version. I’ll give updating it a try.

Thanks
Dean

Hmm, yeah, I’d try switching to v1.6.2. Which version of Kubernetes are you on? I don’t think there have been breaking changes that would affect this since 1.4, but it’s always possible.