Linkerd Proxy Error: "Failed to proxy request: request timed out"

Hi all,

I deployed the linkerd service mesh on a GKE cluster several weeks ago. Setup was straightforward, and the mesh had been running without any issues until today.

Today, several proxies started throwing the following errors (addresses blanked out):

WARN inbound:accept{peer.addr=***}:source{target.addr=***}: linkerd2_app_core::errors: Failed to proxy request: request timed out

WARN inbound:accept{peer.addr=***}:source{target.addr=***}: linkerd2_app_core::errors: Failed to proxy request: Service in fail-fast

So when service A tried to send a request to service B (both meshed with linkerd proxies), I would get a 503 error.

Restarting the pods seems to have resolved the issue, but why did it occur in the first place? Could the proxies be having trouble resolving the destination service?

Moreover, restarting the pods caused a different problem: the tap functionality no longer works for the restarted pods. The tap server logs show the following when I attempt to tap the restarted deployment:

time="2020-05-19T21:50:35Z" level=info msg="Tapping 1 pods for target: {Namespace:*** Type:deployment Name:*** XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}"
time="2020-05-19T21:50:35Z" level=info msg="Establishing tap on ***"
time="2020-05-19T21:50:35Z" level=error msg="[***] encountered an error: rpc error: code = Unauthenticated desc = not_provided_by_remote"

I tried restarting all the control plane components to no avail. If anyone can help explain what the root cause of this issue is that would be fantastic!

Some more notes about my environment:

  • Linkerd 2.7.1
  • Kubernetes 1.14
  • Linkerd deployed with the CNI plugin daemonset, so the proxy init container is not required.

@a-linkerd-newbie sorry to see that you’re running into this.

What is the log output from the linkerd-identity and linkerd-tap components?

It will help to run linkerd check to make sure that all the control plane components are in a good state.
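For reference, something like this is what I'd run (a sketch; flags as of linkerd 2.7, and the namespace is a placeholder for wherever your services live):

```shell
# Verify the control plane components are healthy
linkerd check

# Verify the data-plane proxies in a given namespace
linkerd check --proxy --namespace <your-app-namespace>
```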

Also, can you share the output of kubectl api-resources? There have been some reports of the apiserver becoming unavailable.

Hi cpretzer,

linkerd check showed that everything was in a good state when I ran it.

The linkerd-tap logs show this when I try to start a tap:

time="2020-05-19T21:50:35Z" level=info msg="Tapping 1 pods for target: {Namespace:*** Type:deployment Name:*** XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}"
time="2020-05-19T21:50:35Z" level=info msg="Establishing tap on ***"
time="2020-05-19T21:50:35Z" level=error msg="[***] encountered an error: rpc error: code = Unauthenticated desc = not_provided_by_remote"

As for linkerd-identity - I tried restarting every control plane component yesterday to see whether that would fix the tap issue, but unfortunately that broke everything. The linkerd proxies are no longer able to obtain a certificate and are stuck on startup. This is what the proxy startup logs look like (this is the proxy for the destination component):

time="2020-05-19T23:26:17Z" level=info msg="running version stable-2.7.1"
[     0.37525183s]  INFO linkerd2_proxy: Admin interface on
[     0.37639885s]  INFO linkerd2_proxy: Inbound interface on
[     0.37651701s]  INFO linkerd2_proxy: Outbound interface on
[     0.37656605s]  INFO linkerd2_proxy: Tap interface on
[     0.37661205s]  INFO linkerd2_proxy: Local identity is linkerd-destination.linkerd.serviceaccount.identity.linkerd.cluster.local
[     0.37667897s]  INFO linkerd2_proxy: Identity verified via linkerd-identity.linkerd.svc.cluster.local:8080 (linkerd-identity.linkerd.serviceaccount.identity.linkerd.cluster.local)
[     0.37673316s]  INFO linkerd2_proxy: Destinations resolved via localhost:8086
[     0.38333631s]  INFO inbound: linkerd2_app_inbound: Serving listen.addr=

And these are the identity server logs after I restarted it:

time="2020-05-20T07:09:35Z" level=info msg="running version stable-2.7.1"
time="2020-05-20T07:09:35Z" level=info msg="starting admin server on :9990"
time="2020-05-20T07:09:35Z" level=info msg="starting gRPC server on :8080"
time="2020-05-20T07:09:44Z" level=info msg="certifying linkerd-identity.linkerd.serviceaccount.identity.linkerd.cluster.local until 2020-05-21 07:10:04 +0000 UTC"

Is there a certain order control plane components must start up in?

Also, here is the output of kubectl api-resources:

NAME                              SHORTNAMES   APIGROUP                       NAMESPACED   KIND
bindings                                                                      true         Binding
componentstatuses                 cs                                          false        ComponentStatus
configmaps                        cm                                          true         ConfigMap
endpoints                         ep                                          true         Endpoints
events                            ev                                          true         Event
limitranges                       limits                                      true         LimitRange
namespaces                        ns                                          false        Namespace
nodes                             no                                          false        Node
persistentvolumeclaims            pvc                                         true         PersistentVolumeClaim
persistentvolumes                 pv                                          false        PersistentVolume
pods                              po                                          true         Pod
podtemplates                                                                  true         PodTemplate
replicationcontrollers            rc                                          true         ReplicationController
resourcequotas                    quota                                       true         ResourceQuota
secrets                                                                       true         Secret
serviceaccounts                   sa                                          true         ServiceAccount
services                          svc                                         true         Service
mutatingwebhookconfigurations                  admissionregistration.k8s.io   false        MutatingWebhookConfiguration
validatingwebhookconfigurations                admissionregistration.k8s.io   false        ValidatingWebhookConfiguration
customresourcedefinitions         crd,crds     apiextensions.k8s.io           false        CustomResourceDefinition
apiservices                                    apiregistration.k8s.io         false        APIService
applications                      apps         app.k8s.io                     true         Application
controllerrevisions                            apps                           true         ControllerRevision
daemonsets                        ds           apps                           true         DaemonSet
deployments                       deploy       apps                           true         Deployment
replicasets                       rs           apps                           true         ReplicaSet
statefulsets                      sts          apps                           true         StatefulSet
meshpolicies                                   authentication.istio.io        false        MeshPolicy
policies                                       authentication.istio.io        true         Policy
tokenreviews                                   authentication.k8s.io          false        TokenReview
localsubjectaccessreviews                      authorization.k8s.io           true         LocalSubjectAccessReview
selfsubjectaccessreviews                       authorization.k8s.io           false        SelfSubjectAccessReview
selfsubjectrulesreviews                        authorization.k8s.io           false        SelfSubjectRulesReview
subjectaccessreviews                           authorization.k8s.io           false        SubjectAccessReview
horizontalpodautoscalers          hpa          autoscaling                    true         HorizontalPodAutoscaler
cronjobs                          cj           batch                          true         CronJob
jobs                                           batch                          true         Job
certificatesigningrequests        csr          certificates.k8s.io            false        CertificateSigningRequest
backendconfigs                                 cloud.google.com               true         BackendConfig
adapters                                       config.istio.io                true         adapter
apikeys                                        config.istio.io                true         apikey
attributemanifests                             config.istio.io                true         attributemanifest
authorizations                                 config.istio.io                true         authorization
bypasses                                       config.istio.io                true         bypass
checknothings                                  config.istio.io                true         checknothing
circonuses                                     config.istio.io                true         circonus
deniers                                        config.istio.io                true         denier
edges                                          config.istio.io                true         edge
fluentds                                       config.istio.io                true         fluentd
handlers                                       config.istio.io                true         handler
httpapispecbindings                            config.istio.io                true         HTTPAPISpecBinding
httpapispecs                                   config.istio.io                true         HTTPAPISpec
instances                                      config.istio.io                true         instance
kubernetesenvs                                 config.istio.io                true         kubernetesenv
kuberneteses                                   config.istio.io                true         kubernetes
listcheckers                                   config.istio.io                true         listchecker
listentries                                    config.istio.io                true         listentry
logentries                                     config.istio.io                true         logentry
memquotas                                      config.istio.io                true         memquota
metrics                                        config.istio.io                true         metric
noops                                          config.istio.io                true         noop
opas                                           config.istio.io                true         opa
prometheuses                                   config.istio.io                true         prometheus
quotas                                         config.istio.io                true         quota
quotaspecbindings                              config.istio.io                true         QuotaSpecBinding
quotaspecs                                     config.istio.io                true         QuotaSpec
rbacs                                          config.istio.io                true         rbac
redisquotas                                    config.istio.io                true         redisquota
reportnothings                                 config.istio.io                true         reportnothing
rules                                          config.istio.io                true         rule
servicecontrolreports                          config.istio.io                true         servicecontrolreport
servicecontrols                                config.istio.io                true         servicecontrol
signalfxs                                      config.istio.io                true         signalfx
solarwindses                                   config.istio.io                true         solarwinds
stackdrivers                                   config.istio.io                true         stackdriver
statsds                                        config.istio.io                true         statsd
stdios                                         config.istio.io                true         stdio
templates                                      config.istio.io                true         template
tracespans                                     config.istio.io                true         tracespan
leases                                         coordination.k8s.io            true         Lease
bgpconfigurations                              crd.projectcalico.org          false        BGPConfiguration
clusterinformations                            crd.projectcalico.org          false        ClusterInformation
felixconfigurations                            crd.projectcalico.org          false        FelixConfiguration
globalbgpconfigs                               crd.projectcalico.org          false        GlobalBGPConfig
globalfelixconfigs                             crd.projectcalico.org          false        GlobalFelixConfig
globalnetworkpolicies                          crd.projectcalico.org          false        GlobalNetworkPolicy
globalnetworksets                              crd.projectcalico.org          false        GlobalNetworkSet
hostendpoints                                  crd.projectcalico.org          false        HostEndpoint
ippools                                        crd.projectcalico.org          false        IPPool
networkpolicies                                crd.projectcalico.org          true         NetworkPolicy
daemonsets                        ds           extensions                     true         DaemonSet
deployments                       deploy       extensions                     true         Deployment
ingresses                         ing          extensions                     true         Ingress
networkpolicies                   netpol       extensions                     true         NetworkPolicy
podsecuritypolicies               psp          extensions                     false        PodSecurityPolicy
replicasets                       rs           extensions                     true         ReplicaSet
serviceprofiles                   sp           linkerd.io                     true         ServiceProfile
nodes                                          metrics.k8s.io                 false        NodeMetrics
pods                                           metrics.k8s.io                 true         PodMetrics
alertmanagers                                  monitoring.coreos.com          true         Alertmanager
podmonitors                                    monitoring.coreos.com          true         PodMonitor
prometheuses                                   monitoring.coreos.com          true         Prometheus
prometheusrules                                monitoring.coreos.com          true         PrometheusRule
servicemonitors                                monitoring.coreos.com          true         ServiceMonitor
managedcertificates               mcrt         networking.gke.io              true         ManagedCertificate
destinationrules                               networking.istio.io            true         DestinationRule
envoyfilters                                   networking.istio.io            true         EnvoyFilter
gateways                                       networking.istio.io            true         Gateway
serviceentries                                 networking.istio.io            true         ServiceEntry
virtualservices                                networking.istio.io            true         VirtualService
ingresses                         ing          networking.k8s.io              true         Ingress
networkpolicies                   netpol       networking.k8s.io              true         NetworkPolicy
runtimeclasses                                 node.k8s.io                    false        RuntimeClass
updateinfos                       updinf       nodemanagement.gke.io          true         UpdateInfo
poddisruptionbudgets              pdb          policy                         true         PodDisruptionBudget
podsecuritypolicies               psp          policy                         false        PodSecurityPolicy
clusterrolebindings                            rbac.authorization.k8s.io      false        ClusterRoleBinding
clusterroles                                   rbac.authorization.k8s.io      false        ClusterRole
rolebindings                                   rbac.authorization.k8s.io      true         RoleBinding
roles                                          rbac.authorization.k8s.io      true         Role
rbacconfigs                                    rbac.istio.io                  true         RbacConfig
servicerolebindings                            rbac.istio.io                  true         ServiceRoleBinding
serviceroles                                   rbac.istio.io                  true         ServiceRole
scalingpolicies                                scalingpolicy.kope.io          true         ScalingPolicy
priorityclasses                   pc           scheduling.k8s.io              false        PriorityClass
trafficsplits                     ts           split.smi-spec.io              true         TrafficSplit
csidrivers                                     storage.k8s.io                 false        CSIDriver
csinodes                                       storage.k8s.io                 false        CSINode
storageclasses                    sc           storage.k8s.io                 false        StorageClass
volumeattachments                              storage.k8s.io                 false        VolumeAttachment
error: unable to retrieve the complete list of server APIs: the server is currently unable to handle the request

The last line says the server was unable to handle the request, presumably for the tap APIService. Could that be the issue?

Yeah, it looks like the apiserver is unreachable. Are there any other pods in a bad state or that have many restarts? In the kube-system namespace, perhaps?

If you can exec into a container running your service logic and send a curl request to the tap and identity services, that might help to understand what is happening. You can also use the debug container to test the connectivity to the apiserver.
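One way to get the debug container in place is to re-inject the workload with the debug sidecar enabled. This is a sketch (the namespace, deployment, and pod names are placeholders; the `--enable-debug-sidecar` flag is available in the linkerd 2.x CLI):

```shell
# Re-inject a deployment with the debug sidecar added
kubectl -n <your-app-namespace> get deploy <your-deploy> -o yaml \
  | linkerd inject --enable-debug-sidecar - \
  | kubectl apply -f -

# Then exec into the debug container and test connectivity to the apiserver
kubectl -n <your-app-namespace> exec -it <pod-name> -c linkerd-debug -- \
  curl -vk https://kubernetes.default.svc.cluster.local/version
```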

There aren’t any pods that appear to be in a failing state in the kube-system namespace.

These are the pods running in kube-system:

NAME                                                  READY   STATUS    RESTARTS   AGE
calico-node-4z66t                                     2/2     Running   0          5d9h
calico-node-d9k7f                                     2/2     Running   0          5d9h
calico-node-jqwvt                                     2/2     Running   0          5d9h
calico-node-vertical-autoscaler-b889c775f-f9gkz       1/1     Running   5          50d
calico-typha-79799497d7-jn7p2                         1/1     Running   0          5d9h
calico-typha-horizontal-autoscaler-d777c75b4-chptm    1/1     Running   0          50d
calico-typha-vertical-autoscaler-d9b7979f8-mthvb      1/1     Running   5          50d
event-exporter-v0.2.4-5f7d5d7dd4-2ksbz                2/2     Running   12         50d
fluentd-gcp-scaler-6965bb45c9-x78zr                   1/1     Running   0          50d
fluentd-gcp-v3.2.0-6b8sj                              2/2     Running   0          57d
fluentd-gcp-v3.2.0-g89vq                              2/2     Running   0          57d
fluentd-gcp-v3.2.0-nphx5                              2/2     Running   0          57d
heapster-gke-5977f6d8cb-4qq57                         3/3     Running   4          5d9h
ip-masq-agent-5brs5                                   1/1     Running   0          76d
ip-masq-agent-c72vz                                   1/1     Running   0          76d
ip-masq-agent-lffds                                   1/1     Running   0          76d
kube-dns-5995c95f64-kkthn                             4/4     Running   0          5d9h
kube-dns-5995c95f64-pgwdt                             4/4     Running   0          5d9h
kube-dns-autoscaler-8687c64fc-r4rdj                   1/1     Running   0          50d
kube-proxy-gke-development-pcf-pool-2-26b61618-jn8b   1/1     Running   0          50d
kube-proxy-gke-development-pcf-pool-2-26b61618-vsph   1/1     Running   0          50d
kube-proxy-gke-development-pcf-pool-2-26b61618-vxkm   1/1     Running   0          50d
l7-default-backend-7ff48cffd7-2qztn                   1/1     Running   4          50d
metrics-server-v0.3.1-5c6fbf777-7vqkj                 2/2     Running   0          50d
prometheus-to-sd-cqx2b                                2/2     Running   0          5d9h
prometheus-to-sd-cwvmd                                2/2     Running   0          5d9h
prometheus-to-sd-n5bxg                                2/2     Running   0          5d9h
tiller-deploy-79c4c54bc4-hgjb8                        1/1     Running   2          50d

I ran a curl command from a pod to the identity service.

Command was: curl -vvv linkerd-identity.linkerd.svc.cluster.local:8080

This was the result (IP addresses masked with ***):
* Rebuilt URL to: linkerd-identity.linkerd.svc.cluster.local:8080/
*   Trying ***...
* Connected to linkerd-identity.linkerd.svc.cluster.local (***) port 8080 (#0)
> GET / HTTP/1.1
> Host: linkerd-identity.linkerd.svc.cluster.local:8080
> User-Agent: curl/7.52.1
> Accept: */*
* Recv failure: Connection reset by peer
* Curl_http_done: called premature == 1
* stopped the pause stream!
* Closing connection 0
curl: (56) Recv failure: Connection reset by peer

I can confirm the IP address is correct, so DNS resolution appears to be working. I checked the logs of the identity pod (both the proxy and identity containers) to see whether the incoming request was logged, but didn’t see anything.

I can’t seem to start a pod with a debug container because the proxy-injector component is also failing to start properly (its proxy fails to obtain a certificate, I believe).

@a-linkerd-newbie ah, this is beginning to look familiar. Although the behavior is slightly different. Do you know if your nodes were restarted at any point? Do you have any custom autoscaling strategies configured?

There was an issue with the cluster vertical pod autoscaler that was addressed in the latest release.

Can you check the version of the image being used in the autoscaler?

kubectl get po calico-node-vertical-autoscaler-b889c775f-f9gkz -n kube-system -oyaml | grep image should do the trick.

The autoscaler’s image is:

Also, nodes weren’t restarted anytime before the issues started happening. We have no custom autoscaling strategies either.

@a-linkerd-newbie thanks for checking the image. I see that there were some restarts on the cluster vertical pod autoscaler pods. Did you happen to capture their logs to see if they have any output like that described in 3388?

Curious to know if you have seen this behavior again. If so, can you share the kubernetes event logs?

I scanned the autoscaler logs for errors or warnings over the period in which the issues occurred, but I can’t find anything related to what was reported in 3388.

An interesting thing happened yesterday, though. I ran into the issue described in 4029: the calico and linkerd CNI configurations got into a race condition, so pods that are restarted can no longer start up properly.

It seems 4029 and 3388 are related in that both are caused by a conflict between the linkerd CNI and calico CNI configurations. Do you think the solution proposed in 3388, setting the priorityClass of the linkerd-cni daemonset, could resolve both issues I’ve run into?

@a-linkerd-newbie, yeah, I think the priorityClass will work.

As I recall, one of the folks who commented on 3388 mentioned that they were not using the CNI plugin and so couldn’t apply the priorityClass configuration.
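If you do go the priorityClass route, a rough sketch of the change looks like this (the class name and value are illustrative, and your linkerd-cni daemonset may live in a different namespace):

```shell
# Create a high-priority class so the CNI pod is scheduled before meshed workloads
kubectl apply -f - <<'EOF'
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: linkerd-cni-critical      # illustrative name
value: 1000000                    # high value so the CNI pod wins scheduling
globalDefault: false
description: "Schedule linkerd-cni before meshed workloads on each node"
EOF

# Point the daemonset's pod template at the new class
kubectl -n linkerd patch daemonset linkerd-cni --type merge \
  -p '{"spec":{"template":{"spec":{"priorityClassName":"linkerd-cni-critical"}}}}'
```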

I’ve decided to re-deploy the linkerd control plane without the CNI plugin. Using the linkerd CNI and calico CNI simultaneously seems a bit finicky. So far things are looking good!
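In case it helps anyone else, the re-deploy without the plugin is roughly this (a sketch on linkerd 2.7.1, after removing the old control plane and the linkerd-cni daemonset):

```shell
# Install the control plane; without --linkerd-cni-enabled, the generated
# manifests use the proxy-init initContainer instead of the CNI plugin
linkerd install | kubectl apply -f -

# Confirm the control plane comes up healthy
linkerd check
```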

Thanks for the update. I’m glad to hear that it’s working.

If you have time, I’d appreciate a GitHub issue with the details so that I can try to reproduce it and dig into the details.