Routing between nodes does not work on Tectonic


#1

Hi,

I am having problems getting linkerd to work on a Tectonic cluster.
Everything works when the services talking to each other are on the same node, but when they are on different nodes, the connection fails. linkerd correctly figures out which node to connect to, but the actual connection times out.

Here is the log:

I 0119 07:04:31.170 UTC THREAD28 TraceId:3c4e9ed9bf8fe034: FailureAccrualFactory marking connection to "%/io.l5d.k8s.daemonset/default/incoming/l5d/#/io.l5d.k8s/default/grpc/lb-test-server" as dead. Remote Address: Inet(/xx.xx.xx.xxx:4141,Map(nodeName -> ip-xx.xx.xx.xxx.ap-northeast-1.compute.internal))
I 0119 07:04:31.176 UTC THREAD28: [S L:/yy.yy.yy.yyy:4140 R:/zz.z.z.zz:56162 S:1] rejected; resetting remote: REFUSED
Failure(connection timed out: /xx.xx.xx.xxx:4141 at remote address: /xx.xx.xx.xxx:4141. Remote Info: Not Available, flags=0x09) with RemoteInfo -> Upstream Address: Not Available, Upstream id: Not Available, Downstream Address: /xx.xx.xx.xxx:4141, Downstream label: %/io.l5d.k8s.daemonset/default/incoming/l5d/#/io.l5d.k8s/default/grpc/lb-test-server, Trace Id: 3c4e9ed9bf8fe034.3c4e9ed9bf8fe034<:3c4e9ed9bf8fe034 with Service -> 0.0.0.0/4140 Caused by: com.twitter.finagle.ConnectionFailedException: connection timed out: /xx.xx.xx.xxx:4141 at remote address: /xx.xx.xx.xxx:4141. Remote Info: Not Available

This is an instance of the failure, occurring when one gRPC service calls the other. xx.xx.xx.xxx is the IP address of the node on which the receiving service runs, yy.yy.yy.yyy is the IP address of the node on which the sending service runs, and zz.z.z.zz is the IP address of the calling service's Pod. Although the log is from a hand-written service I was running, I have confirmed that the Hello World example fails in a similar manner.

The Kubernetes cluster was created with a recent version of Tectonic, so Flannel is used for networking. I have read and applied the changes mentioned in the Flavors of Kubernetes post (the "hostNetwork" part).

I have also tried pinging a node IP from within a running Pod, but that timed out as well.
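For reference, this is roughly how I tested connectivity from inside a Pod (the Pod name is a placeholder for one of my test Pods):

```shell
# Open a shell inside a running Pod (Pod name is a placeholder)
kubectl exec -it lb-test-client -- sh

# From inside the Pod, try to reach the other node's IP directly
ping -c 3 xx.xx.xx.xxx

# Also check the linkerd incoming port specifically
# (nc is available in most busybox-based images)
nc -zv -w 5 xx.xx.xx.xxx 4141
```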

Any pointers on how to get this working would be appreciated.


#2

Hi @tomoyat1, are you able to post the Linkerd config you’re using? That might help shine some light on what’s going on.


#3

Here are the configs I’m using.
https://gist.github.com/tomoyat1/814a4e2131cc8da9ece84c586c678094


#4

Hi @tomoyat1,
This looks like a networking issue. If pinging the node directly from within a running Pod doesn't work, then the nodes can't reach each other, so Linkerd can't communicate between them either.

We'd suggest looking into the network config of the Tectonic cluster you're using. Good luck, and let us know how it goes or if you have any Linkerd follow-up issues once the nodes are communicating.

Franzi


#5

Hi @franzi

My AWS VPC security group settings didn't allow incoming connections to ports 4140 and 4141.
I got linkerd working by allowing inbound traffic on those ports.
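For anyone who hits the same thing, the fix was roughly the following rule (the security group ID and CIDR below are placeholders; use your own group and your VPC's CIDR):

```shell
# Allow inbound TCP to linkerd's outgoing (4140) and incoming (4141)
# ports from within the VPC.
# sg-xxxxxxxx and 10.0.0.0/16 are placeholders.
aws ec2 authorize-security-group-ingress \
  --group-id sg-xxxxxxxx \
  --protocol tcp \
  --port 4140-4141 \
  --cidr 10.0.0.0/16
```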

Thanks for confirming that the linkerd configs were correct.


#6

Glad you got it working!