Retry on ChannelClosedException


#1

This post was flagged by the community and is temporarily hidden.


#2

And the log contents:

May 28, 2018 12:44:38 PM com.twitter.finagle.http.codec.HttpClientDispatcher $anonfun$dispatch$2
DEBUG: Failed mid-stream. Terminating stream, closing connection
com.twitter.finagle.ChannelClosedException: ChannelException at remote address: www.xxx.fi/193.210.45.50:443. Remote Info: Not Available
at com.twitter.finagle.netty4.transport.ChannelTransport$$anon$2.channelInactive(ChannelTransport.scala:196)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:245)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:231)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:224)
at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:245)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:231)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:224)
at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:245)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:231)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:224)
at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75)
at io.netty.handler.codec.MessageAggregator.channelInactive(MessageAggregator.java:417)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:245)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:231)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:224)
at io.netty.channel.CombinedChannelDuplexHandler$DelegatingChannelHandlerContext.fireChannelInactive(CombinedChannelDuplexHandler.java:420)
at io.netty.handler.codec.ByteToMessageDecoder.channelInputClosed(ByteToMessageDecoder.java:377)
at io.netty.handler.codec.ByteToMessageDecoder.channelInactive(ByteToMessageDecoder.java:342)
at io.netty.handler.codec.http.HttpClientCodec$Decoder.channelInactive(HttpClientCodec.java:281)
at io.netty.channel.CombinedChannelDuplexHandler.channelInactive(CombinedChannelDuplexHandler.java:223)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:245)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:231)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:224)
at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:245)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:231)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:224)
at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75)
at com.twitter.finagle.netty4.channel.ChannelRequestStatsHandler.channelInactive(ChannelRequestStatsHandler.scala:41)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:245)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:231)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:224)
at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75)
at com.twitter.finagle.netty4.channel.ChannelStatsHandler.channelInactive(ChannelStatsHandler.scala:148)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:245)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:231)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:224)
at io.netty.handler.codec.ByteToMessageDecoder.channelInputClosed(ByteToMessageDecoder.java:377)
at io.netty.handler.codec.ByteToMessageDecoder.channelInactive(ByteToMessageDecoder.java:342)
at io.netty.handler.ssl.SslHandler.channelInactive(SslHandler.java:978)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:245)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:231)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:224)
at io.netty.channel.DefaultChannelPipeline$HeadContext.channelInactive(DefaultChannelPipeline.java:1354)

May 28, 2018 12:44:38 PM com.twitter.finagle.util.DefaultMonitor logWithRemoteInfo
ERROR: service failure: com.twitter.finagle.ChannelClosedException: ChannelException at remote address: www.xxx.fi/193.210.45.50:443 from service: #/io.l5d.fs/www.xxx.fi:443. Remote Info: Upstream Address: Not Available, Upstream id: Not Available, Downstream Address: www.xxx.fi/193.210.45.50:443, Downstream label: #/io.l5d.fs/www.xxx.fi:443, Trace Id: 884a937e557af1b2.bb4b9590a60d8dee<:884a937e557af1b2
DEBUG: Exception propagated to the default monitor (upstream address: /172.16.1.236:38166, downstream address: www.xxx.fi/193.210.45.50:443, label: #/io.l5d.fs/www.xxx.fi:443).
com.twitter.finagle.ChannelClosedException: ChannelException at remote address: www.xxx.fi/193.210.45.50:443. Remote Info: Upstream Address: Not Available, Upstream id: Not Available, Downstream Address: www.xxx.fi/193.210.45.50:443, Downstream label: #/io.l5d.fs/www.xxx.fi:443, Trace Id: 884a937e557af1b2.bb4b9590a60d8dee<:884a937e557af1b2
at com.twitter.finagle.netty4.transport.ChannelTransport$$anon$2.channelInactive(ChannelTransport.scala:196)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:245)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:231)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:224)
at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:245)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:231)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:224)
at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:245)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:231)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:224)
at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75)
at io.netty.handler.codec.MessageAggregator.channelInactive(MessageAggregator.java:417)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:245)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:231)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:224)
at io.netty.channel.CombinedChannelDuplexHandler$DelegatingChannelHandlerContext.fireChannelInactive(CombinedChannelDuplexHandler.java:420)
at io.netty.handler.codec.ByteToMessageDecoder.channelInputClosed(ByteToMessageDecoder.java:377)
at io.netty.handler.codec.ByteToMessageDecoder.channelInactive(ByteToMessageDecoder.java:342)
at io.netty.handler.codec.http.HttpClientCodec$Decoder.channelInactive(HttpClientCodec.java:281)
at io.netty.channel.CombinedChannelDuplexHandler.channelInactive(CombinedChannelDuplexHandler.java:223)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:245)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:231)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:224)
at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:245)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:231)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:224)
at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75)
at com.twitter.finagle.netty4.channel.ChannelRequestStatsHandler.channelInactive(ChannelRequestStatsHandler.scala:41)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:245)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:231)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:224)
at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75)
at com.twitter.finagle.netty4.channel.ChannelStatsHandler.channelInactive(ChannelStatsHandler.scala:148)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:245)

ERROR: service failure: com.twitter.finagle.ChannelClosedException: ChannelException at remote address: www.xxx.fi/193.210.45.50:443 from service: #/io.l5d.fs/www.xxx.fi:443. Remote Info: Upstream Address: Not Available, Upstream id: Not Available, Downstream Address: www.xxx.fi/193.210.45.50:443, Downstream label: #/io.l5d.fs/www.xxx.fi:443, Trace Id: 884a937e557af1b2.bb4b9590a60d8dee<:884a937e557af1b2


#4

This type of exception means that the connection was closed during a request. For Linkerd to retry that request, you must tell it that requests of this kind are safe to retry. This is done with response classifiers. For example, by marking read requests (GET, OPTIONS, HEAD, and TRACE) as retryable, Linkerd knows it is safe to retry them.
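A minimal sketch of what that looks like in the linkerd config (the kind below is one of the built-in classifiers; the surrounding router details are placeholders, adjust to your setup):

routers:
- protocol: http
  service:
    responseClassifier:
      # built-in classifier: read requests (GET, HEAD, OPTIONS, TRACE)
      # are treated as retryable on 5XX and connection-level failures
      kind: io.l5d.http.retryableRead5XX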

Request authorizers come after retries, which means that the request authorizer will be run on the original request AND on every retry.


#5

Thanks Alex, I actually managed to set up a test and saw that the RequestAuthorizer is invoked on the retries too.
I configured it with a requeue budget, based on this statement in the documentation:
“Requeues are for connection-level failures that are guaranteed to be idempotent. If a connection-level failure is encountered and there is requeue budget available, then the request will be retried. Requeue budgets are configured by the requeueBudget parameter on the client object. Requeues happen immediately with no backoff. Each client has its own requeue budget that is not shared with other clients or services.”

I thought that this is an example of a connection-level failure.
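For reference, this is roughly what I have on the client object (values illustrative; the parameter names mirror the retry budget ones):

routers:
- protocol: http
  client:
    requeueBudget:
      # fraction of traffic that may additionally be spent on requeues
      percentCanRetry: 0.5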

The examples I have seen of response classifiers are about, e.g., configuring that a read-type operation with a 5XX response is retryable. That is perfectly clear if you really get a 5XX response, but when the connection is closed during a request that is not the case, or is it?

So I actually don’t understand how to configure retries using a response classifier; if there were an exception classifier making it possible to state that retrying is OK on ChannelClosedException, it would be clear. Maybe I’m missing something, please enlighten me if that is the case…

However, using the requeue budget works. But I don’t know how to make the retry count be, e.g., 1. Using percentCanRetry as the only parameter gives 25 retries.

I would greatly appreciate being able to understand this topic in detail!

BR Niklas


#6

Great questions! I believe that the 5XX classifiers will still trigger if the connection is closed before the response is sent, as long as the request was fully sent and is of the correct type. So in that sense, the name of the classifier isn’t 100% accurate.

Linkerd uses retry/requeue budgets instead of fixed numbers of retries. This is to avoid overloading struggling services with a retry storm. You can constrain the volume of retries by setting the retry budget or by setting a totalTimeout (once the timeout is reached, no more retries will be attempted).
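For example, a total timeout goes on the service object (parameter name per the linkerd config docs for recent versions; the value is illustrative):

routers:
- protocol: http
  service:
    # after 5 seconds the request fails for good, including any retries
    totalTimeoutMs: 5000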


#7

Ok, thanks for the reply. Let me shift the perspective slightly: is there anything in favour of using the response classifier rather than the requeue configuration? After all, it works, and the description seems to fit the situation well. I don’t need the backoff feature, as far as I know.

This might be a stupid question, but how do I set the retry/requeue budget? (I see the number 100 in the metrics.json page; this is way too high for my situation if it’s in absolute terms.)

BR Niklas


#8

If the configuration is working as desired, I don’t think there’s any reason to change it.

Retry/requeue budgets are set as a percentage of requests: https://linkerd.io/config/head/linkerd/index.html#retry-budget-parameters
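Concretely, the parameters from that page look like this (values illustrative; percentCanRetry is a ratio of retries to real requests, not an absolute count):

routers:
- protocol: http
  service:
    retries:
      budget:
        # floor of retries per second, so low-traffic clients can still retry
        minRetriesPerSec: 5
        # additionally, 1 retry allowed per 5 "real" requests
        percentCanRetry: 0.2
        # how long a request contributes to the budget
        ttlSecs: 15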


#9

Ok, good. Then there wasn’t anything that I had missed, which I was afraid of.

Again, thanks and bye for now.

BR Niklas


#10

Unfortunately the requeue mechanism doesn’t kick in in the production scenario. But I have reconfirmed that I do indeed get requeues registered in the metrics when I provoke the failure on my development machine.

There is a difference though: In the lab scenario I used the shutter.go (which appears in other posts) that occasionally does:
bufrw.WriteString("HTTP/1.1 400 Ugly Request\r\n")
bufrw.WriteString("Connection: close\r\n")
bufrw.WriteString("Transfer-encoding: chunked\r\n")
bufrw.WriteString("\r\n")
bufrw.Flush()

And then I run 10 simultaneous client threads through Linkerd, eventually getting ChannelClosedExceptions at pretty high throughput.

But I also discovered another difference: the ChannelClosedException causes a ChannelWriteException:

I 0607 15:11:27.778 CEST THREAD20 TraceId:334cd747602d7348: Throwable details com.twitter.finagle.ChannelWriteException: com.twitter.finagle.ChannelClosedException: null at remote address: localhost/127.0.0.1:8080. Remote Info: Not Available. Remote Info: Upstream Address: Not Available, Upstream id: Not Available, Downstream Address: localhost/127.0.0.1:8080, Downstream label: #/io.l5d.fs/shutter:4141, Trace Id: 334cd747602d7348.71afcaa851f47224<:334cd747602d7348, message com.twitter.finagle.ChannelClosedException: null at remote address: localhost/127.0.0.1:8080. Remote Info: Not Available. Remote Info: Upstream Address: Not Available, Upstream id: Not Available, Downstream Address: localhost/127.0.0.1:8080, Downstream label: #/io.l5d.fs/shutter:4141, Trace Id: 334cd747602d7348.71afcaa851f47224<:334cd747602d7348, cause com.twitter.finagle.ChannelClosedException: null at remote address: localhost/127.0.0.1:8080. Remote Info: Not Available
I 060

In the production scenario the traffic is very sparse, so it’s not the same situation. I cannot determine how things differ from the Linkerd perspective though, i.e. what the relevant difference is.

In production there is no ChannelWriteException:

Jun 05, 2018 3:05:07 PM com.twitter.finagle.http.codec.HttpClientDispatcher $anonfun$dispatch$2
DEBUG: Failed mid-stream. Terminating stream, closing connection
com.twitter.finagle.ChannelClosedException: ChannelException at remote address: www.asiakastieto.fi/193.210.45.50:443. Remote Info: Not Available
at com.twitter.finagle.netty4.transport.ChannelTransport$$anon$2.channelInactive(ChannelTransport.scala:196)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:245)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:231)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:224)
at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerConte

The thing is that these are POST requests, so the standard response classifiers do not even make it possible to say these should be retried.

However, as you pointed out, Alex, the docs/naming might be a bit misleading.

Can anyone say whether I should expect a requeue upon a ChannelClosedException?


#11

Hi @niklasuhrberg! Is it necessary for your server to send Connection: close on responses? You should get much better performance by having Linkerd reuse connections instead of tearing them down on each request. This is potentially related to the channel exceptions that you’re seeing.

I think it would make total sense to add a response classifier that treats all requests as retryable, regardless of HTTP method. Are you interested in working on this?


#12

Hi Alex, the Connection: close is only used in the lab scenario to reproduce a ChannelClosedException. (I borrowed this idea from a discussion in 2016.) But anyway, I think it’s the case that one of our external integrations causes ChannelClosedExceptions much more frequently than the others do. I would take great pleasure in working on a response classifier, but the situation is tricky with regards to time! Let me see if I can get something done; it should really be a small effort given what is already available.


#13

I’ve looked at some code (why didn’t I do this earlier?) and guess what I found in ResponseClassifiersTest:

test(s"$classifier: retries $method channel closed") {
    testClassifier(
      classifier,
      method,
      Throw(new ChannelClosedException),
      Some(ResponseClass.RetryableFailure)
    )
  }

So indeed, if we have (as you suggested) a retryable classification for POST as well, there seems to be a straightforward way to achieve my goal.


#14

Hi @Alex

I have taken a shot at writing a new response classifier. I wanted to start with a separate plugin, not an addition to the Linkerd source code itself.
However, this fails because Linkerd seems to accept only the standard response classifiers as type ids.
See the stack trace below.

I have actually assembled a Linkerd executable where I have added my "com.resurs.linkerd.responseclassifier.CustomResponseClassifierInitializer" as one more type in the services file. But then I get a ClassNotFoundException. (I did put my jar file in the L5D_HOME/plugins folder.)

Please let me know if there is a way to add a response classifier as a separate plugin, as opposed to adding it to the source code. I want to start this way because we want to put this code into production.
(I don’t mind writing a pull request later.)

BR Niklas

com.fasterxml.jackson.databind.exc.InvalidTypeIdException: Could not resolve type id 'com.resurs.linkerd.responseclassifier.CustomResponseClassifierInitializer' into a subtype of [simple type, class io.buoyant.linkerd.ResponseClassifierConfig]: known type ids = [ResponseClassifierConfig, io.l5d.http.allSuccessful, io.l5d.http.nonRetryable5XX, io.l5d.http.retryableIdempotent5XX, io.l5d.http.retryableRead5XX]
at [Source: java.io.StringReader@44c13103; line: 32, column: 15] (through reference chain: io.buoyant.linkerd.Linker$LinkerConfig["routers"]->com.fasterxml.jackson.module.scala.deser.BuilderWrapper[0]->io.buoyant.linkerd.protocol.HttpConfig["service"]->io.buoyant.linkerd.protocol.HttpStaticSvc["configs"]->com.fasterxml.jackson.module.scala.deser.BuilderWrapper[0]->io.buoyant.linkerd.protocol.HttpSvcPrefixConfig["responseClassifier"])
at com.fasterxml.jackson.databind.exc.InvalidTypeIdException.from(InvalidTypeIdException.java:42)
at com.fasterxml.jackson.databind.DeserializationContext.unknownTypeIdException(DeserializationContext.java:1477)
at com.fasterxml.jackson.databind.DeserializationContext.handleUnknownTypeId(DeserializationContext.java:1170)
at com.fasterxml.jackson.databind.jsontype.impl.TypeDeserializerBase._handleUnknownTypeId(TypeDeserializerBase.java:282)
at com.fasterxml.jackson.databind.jsontype.impl.TypeDeserializerBase._findDeserializer(TypeDeserializerBase.java:156)
at com.fasterxml.jackson.databind.jsontype.impl.AsPropertyTypeDeserializer._deserializeTypedForId(AsPropertyTypeDeserializer.java:112)
at com.fasterxml.jackson.databind.jsontype.impl.AsPropertyTypeDeserializer.deserializeTypedFromObject(AsPropertyTypeDeserializer.java:97)
at com.fasterxml.jackson.databind.deser.AbstractDeserializer.deserializeWithType(AbstractDeserializer.java:142)
at com.fasterxml.jackson.module.scala.deser.OptionDeserializer.deserialize(OptionDeserializerModule.scala:61)
at com.fa


#15

At the same time, I am aware of linkerd-examples and its ResponseClassifier plugin (that one is in Java), so it seems clear that it should be possible to achieve this feature using a plugin.


#16

OK, @Alex, please don’t pay attention to the last two messages. I sorted it out; it was a stupid mistake - I forgot to put the services folder in META-INF. (Luckily I didn’t waste that much time on it…)

It’s working now.
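For anyone else who hits this: the plugin jar needs the SPI registration file under META-INF/services, named after the initializer base class (the exact class name below is my best guess; check it against the linkerd source for your version):

META-INF/services/io.buoyant.linkerd.ResponseClassifierInitializer

with one fully-qualified initializer class name per line, in my case:

com.resurs.linkerd.responseclassifier.CustomResponseClassifierInitializer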


#17

Phew! Glad that you got it working! :smile:


#18

Thanks, and soon we will confirm (or disconfirm) whether we get the desired retry for the ChannelClosedExceptions we see. In my test it is actually a little tricky to verify this; too complicated to explain exactly why in a short time. (But as I wrote above, I do see requeues on ChannelClosedExceptions; what remains is to really establish that the client does not see a 5XX.) If you know a way to reproduce a ChannelClosedException in a simple and deterministic way, just let me know…

I’ll return with the results.

I will try to find some time to see what would be a suitable extension to the existing response classifiers. I have given it some thought while working on this.


#19

Hi, it is now confirmed that the custom retry works in production. Thanks for the help!
The image is the graphical confirmation that a ChannelClosedException does not result in a 5XX reply to the client.

