I am having problems getting linkerd to work on a Tectonic cluster.
Everything works when the services that are talking to each other are on the same node, but when they are on different nodes the connection fails. linkerd correctly figures out which node to connect to, but the actual connection times out.
Here is the log:
I 0119 07:04:31.170 UTC THREAD28 TraceId:3c4e9ed9bf8fe034: FailureAccrualFactory marking connection to “%/io.l5d.k8s.daemonset/default/incoming/l5d/#/io.l5d.k8s/default/grpc/lb-test-server” as dead. Remote Address: Inet(/xx.xx.xx.xxx:4141,Map(nodeName -> ip-xx.xx.xx.xxx.ap-northeast-1.compute.internal))
I 0119 07:04:31.176 UTC THREAD28: [S L:/yy.yy.yy.yyy:4140 R:/zz.z.z.zz:56162 S:1] rejected; resetting remote: REFUSED
Failure(connection timed out: /xx.xx.xx.xxx:4141 at remote address: /xx.xx.xx.xxx:4141. Remote Info: Not Available, flags=0x09) with RemoteInfo -> Upstream Address: Not Available, Upstream id: Not Available, Downstream Address: /xx.xx.xx.xxx:4141, Downstream label: %/io.l5d.k8s.daemonset/default/incoming/l5d/#/io.l5d.k8s/default/grpc/lb-test-server, Trace Id: 3c4e9ed9bf8fe034.3c4e9ed9bf8fe034<:3c4e9ed9bf8fe034 with Service -> 0.0.0.0/4140 Caused by: com.twitter.finagle.ConnectionFailedException: connection timed out: /xx.xx.xx.xxx:4141 at remote address: /xx.xx.xx.xxx:4141. Remote Info: Not Available
This is an instance of the failure, occurring when one gRPC service calls the other. xx.xx.xx.xxx is the IP address of the node on which the receiving service runs, yy.yy.yy.yyy is the IP address of the node on which the sending service runs, and zz.z.z.zz is the IP address of the calling service's Pod. Although the log is from a run of a hand-written service, I have confirmed that the Hello World example fails in a similar manner.
The Kubernetes cluster was created with a recent version of Tectonic, so Flannel is used for networking on the cluster. I have read the Flavors of Kubernetes post and applied the changes it describes (the “hostNetwork” part).
I have also tried pinging a node IP from within a running Pod, but that timed out as well.
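Since ping only exercises ICMP, a TCP-level check against the port linkerd is trying to reach (4141 in the log above) might help narrow this down. Below is a minimal Python sketch of such a check; the function name `can_connect` and the 3-second default timeout are illustrative choices of mine, not anything provided by linkerd or Tectonic.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        # create_connection resolves the address and performs the TCP handshake;
        # the socket is closed again as soon as the `with` block exits.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections, and unreachable hosts alike.
        return False
```

Run from inside a Pod on the sending node, a call such as `can_connect("xx.xx.xx.xxx", 4141)` (with the real node IP substituted) returning False would suggest the problem is in the Pod-to-node network path rather than in linkerd's routing.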
Any pointers on how to get this working would be appreciated.