Endpoints watch error

Hello again,

Yesterday I was playing with retries by adding this piece of configuration to the incoming router:

  service:
    retries:
      budget:
        percentCanRetry: 1000
      backoff:
        kind: constant
    responseClassifier:
      kind: io.l5d.http.retryableRead5XX
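
(If I read the docs right, percentCanRetry is the ratio of retries to original requests and defaults to 0.2, so 1000 effectively removes the budget; a constant backoff with no ms set means retries fire immediately, and retryableRead5XX marks any 5XX on a read request as retryable. In other words: hammer the failing endpoints as hard as possible.)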

My services were replying with HTTP 500 with 50% probability, triggered by the URL I used to call them.

And I’ve ended up in quite a weird state since then:

In the playground I can route to some services but not to others through the outgoing router, and through the incoming router even fewer services are reachable.

In the logs I see:

D 0914 08:38:28.146 UTC THREAD29 TraceId:192de2b4e560c525: k8s lookup: /test/http/a200 /test/http/a200
D 0914 08:38:30.738 UTC THREAD29 TraceId:7bbede3e82849f68: k8s lookup: /test/http/b200 /test/http/b200
D 0914 08:38:32.947 UTC THREAD29 TraceId:c8c7a94abd6db1a2: k8s lookup: /test/http/b500 /test/http/b500
D 0914 08:38:37.508 UTC THREAD36: Unhandled exception in connection with /10.128.0.1:58618, shutting down connection
java.io.IOException: Connection reset by peer
	at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
	at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
	at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
	at sun.nio.ch.IOUtil.read(IOUtil.java:192)
	at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
	at io.netty.buffer.UnpooledUnsafeDirectByteBuf.setBytes(UnpooledUnsafeDirectByteBuf.java:433)
	at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1100)
	at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:372)
	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:123)
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:644)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:579)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:496)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:458)
	at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at com.twitter.finagle.util.BlockingTimeTrackingThreadFactory$$anon$1.run(BlockingTimeTrackingThreadFactory.scala:24)
	at java.lang.Thread.run(Thread.java:748)
D 0914 08:38:37.508 UTC THREAD34: Unhandled exception in connection with /10.128.0.1:58612, shutting down connection
...
D 0914 08:39:53.502 UTC THREAD33 TraceId:b3591394496b9a9d: k8s watch cancelled

During the night linkerd kept hitting the same endpoints watch error:

W 0913 21:54:54.163 UTC THREAD37 TraceId:998f2c6a3911c9c7: k8s ns b200 service test endpoints watch error Status(Some(Status),Some(v1),Some(ObjectMeta(None,None,None,None,None,None,None,None,None,None,None)),Some(Failure),Some(too old resource version: 1132677 (1139149)),Some(Gone),None,Some(410))
W 0914 00:57:02.398 UTC THREAD37 TraceId:998f2c6a3911c9c7: k8s ns b200 service test endpoints watch error Status(Some(Status),Some(v1),Some(ObjectMeta(None,None,None,None,None,None,None,None,None,None,None)),Some(Failure),Some(too old resource version: 1132677 (1139149)),Some(Gone),None,Some(410))
W 0914 03:28:17.552 UTC THREAD37 TraceId:998f2c6a3911c9c7: k8s ns b200 service test endpoints watch error Status(Some(Status),Some(v1),Some(ObjectMeta(None,None,None,None,None,None,None,None,None,None,None)),Some(Failure),Some(too old resource version: 1132677 (1139149)),Some(Gone),None,Some(410))
W 0914 05:49:37.453 UTC THREAD37 TraceId:998f2c6a3911c9c7: k8s ns b200 service test endpoints watch error Status(Some(Status),Some(v1),Some(ObjectMeta(None,None,None,None,None,None,None,None,None,None,None)),Some(Failure),Some(too old resource version: 1132677 (1139149)),Some(Gone),None,Some(410))

It looks like linkerd has marked some endpoints as completely dead.

So the questions are: how do I stop the retries if needed, and how do I clear the cache of dead nodes/endpoints?
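
In the meantime my plan is to dial the retry config back to something like this (just a sketch based on my reading of the config docs: back to the default budget and the default classifier, so 500s are no longer retried), though that doesn't clear the stuck endpoints:

  service:
    retries:
      budget:
        percentCanRetry: 0.2   # documented default: retries may add ~20% extra load
    responseClassifier:
      kind: io.l5d.http.nonRetryable5XX   # default classifier: 5XX responses are not retried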

Thanks.

Hi @smartptr! Is this with Linkerd 1.2.0? 1.2.0 has a known bug where if a service gets scaled down or deleted, Linkerd will get stuck on a stale address set. https://github.com/linkerd/linkerd/issues/1626

We have a fix for this issue and will get a release out ASAP.

Hi Alex,
Yes, 1.2.0.
#1626 might be related, but I didn’t scale the pods here, so I haven’t attached this to that bug. Let’s see how it goes with the new release.

Hi @smartptr – just wanted to give you a heads-up that linkerd 1.3.0 was released a few days ago, and it includes the fix that @Alex mentioned. Mind giving it a shot to see if it fixes your issue as well?