Linkerd Proxy Error: "Failed to proxy request: request timed out"

Hi all,

I deployed the linkerd service mesh on a GKE cluster several weeks ago. Setup was straightforward and the mesh has been running without any issues up until today.

Today, several proxies started throwing the following errors (addresses blanked out):

WARN inbound:accept{peer.addr=***}:source{target.addr=***}: linkerd2_app_core::errors: Failed to proxy request: request timed out

WARN inbound:accept{peer.addr=***}:source{target.addr=***}: linkerd2_app_core::errors: Failed to proxy request: Service in fail-fast

So when service A tried to send a request to service B (both with linkerd proxies injected), I would get a 503 error.

Restarting the pods seems to have resolved the issue, but why did it occur in the first place? Could the proxies be having trouble resolving the destination service?
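
(For reference, the warnings above came from the linkerd-proxy containers of the receiving pods; I pulled them and ran the proxy checks with something along these lines, where the pod name is a placeholder:)

kubectl logs <service-b-pod> -c linkerd-proxy | grep -i "failed to proxy"
linkerd check --proxy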

Moreover, restarting the pods caused a different problem - the tap functionality no longer works for the restarted pod. The tap server logs show the following when attempting to tap the restarted deployment:

time="2020-05-19T21:50:35Z" level=info msg="Tapping 1 pods for target: {Namespace:*** Type:deployment Name:*** XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}"
time="2020-05-19T21:50:35Z" level=info msg="Establishing tap on ***"
time="2020-05-19T21:50:35Z" level=error msg="[***] encountered an error: rpc error: code = Unauthenticated desc = not_provided_by_remote"

I tried restarting all the control plane components to no avail. If anyone can help explain what the root cause of this issue is, that would be fantastic!

Some more notes about my environment:

  • Linkerd stable-2.7.1
  • Kubernetes version 1.14
  • Linkerd deployed with the CNI plugin daemonset so that the init container is not required (install commands sketched below).
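
For completeness, the CNI plugin and the control plane were installed with roughly the following (standard linkerd CLI commands; the exact flags may differ slightly from what I ran):

linkerd install-cni | kubectl apply -f -
linkerd install --linkerd-cni-enabled | kubectl apply -f -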

@a-linkerd-newbie sorry to see that you’re running into this.

What is the log output from the linkerd-identity and linkerd-tap components?

It will help to run linkerd check to make sure that all the control plane components are in a good state.

Also, can you share the output of kubectl api-resources? There have been some reports of the apiserver becoming unavailable.
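
For reference, something like the following should gather all of that (the container names are the usual ones for a stable-2.7.x install, so adjust them if yours differ):

linkerd check
kubectl -n linkerd logs deploy/linkerd-identity -c identity
kubectl -n linkerd logs deploy/linkerd-tap -c tap
kubectl api-resources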

Hi cpretzer,

linkerd check showed that everything looked good when I ran it.

The linkerd-tap logs show this when I try to start a tap:

time="2020-05-19T21:50:35Z" level=info msg="Tapping 1 pods for target: {Namespace:*** Type:deployment Name:*** XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}"
time="2020-05-19T21:50:35Z" level=info msg="Establishing tap on ***"
time="2020-05-19T21:50:35Z" level=error msg="[***] encountered an error: rpc error: code = Unauthenticated desc = not_provided_by_remote"

As for linkerd-identity - I tried restarting every control plane component yesterday to see whether that would fix the tap issue, but unfortunately that caused everything to break. The linkerd proxies are no longer able to get a certificate and so are stuck on startup. This is what the proxy startup logs look like (this is the proxy for the linkerd-destination component):

time="2020-05-19T23:26:17Z" level=info msg="running version stable-2.7.1"
[     0.37525183s]  INFO linkerd2_proxy: Admin interface on 0.0.0.0:4191
[     0.37639885s]  INFO linkerd2_proxy: Inbound interface on 0.0.0.0:4143
[     0.37651701s]  INFO linkerd2_proxy: Outbound interface on 127.0.0.1:4140
[     0.37656605s]  INFO linkerd2_proxy: Tap interface on 0.0.0.0:4190
[     0.37661205s]  INFO linkerd2_proxy: Local identity is linkerd-destination.linkerd.serviceaccount.identity.linkerd.cluster.local
[     0.37667897s]  INFO linkerd2_proxy: Identity verified via linkerd-identity.linkerd.svc.cluster.local:8080 (linkerd-identity.linkerd.serviceaccount.identity.linkerd.cluster.local)
[     0.37673316s]  INFO linkerd2_proxy: Destinations resolved via localhost:8086
[     0.38333631s]  INFO inbound: linkerd2_app_inbound: Serving listen.addr=0.0.0.0:4143

And these are the identity server logs after I restarted it:

time="2020-05-20T07:09:35Z" level=info msg="running version stable-2.7.1"
time="2020-05-20T07:09:35Z" level=info msg="starting admin server on :9990"
time="2020-05-20T07:09:35Z" level=info msg="starting gRPC server on :8080"
time="2020-05-20T07:09:44Z" level=info msg="certifying linkerd-identity.linkerd.serviceaccount.identity.linkerd.cluster.local until 2020-05-21 07:10:04 +0000 UTC"

Is there a certain order in which the control plane components must start up?

Also, here is the output of kubectl api-resources:

NAME                              SHORTNAMES   APIGROUP                       NAMESPACED   KIND
bindings                                                                      true         Binding
componentstatuses                 cs                                          false        ComponentStatus
configmaps                        cm                                          true         ConfigMap
endpoints                         ep                                          true         Endpoints
events                            ev                                          true         Event
limitranges                       limits                                      true         LimitRange
namespaces                        ns                                          false        Namespace
nodes                             no                                          false        Node
persistentvolumeclaims            pvc                                         true         PersistentVolumeClaim
persistentvolumes                 pv                                          false        PersistentVolume
pods                              po                                          true         Pod
podtemplates                                                                  true         PodTemplate
replicationcontrollers            rc                                          true         ReplicationController
resourcequotas                    quota                                       true         ResourceQuota
secrets                                                                       true         Secret
serviceaccounts                   sa                                          true         ServiceAccount
services                          svc                                         true         Service
mutatingwebhookconfigurations                  admissionregistration.k8s.io   false        MutatingWebhookConfiguration
validatingwebhookconfigurations                admissionregistration.k8s.io   false        ValidatingWebhookConfiguration
customresourcedefinitions         crd,crds     apiextensions.k8s.io           false        CustomResourceDefinition
apiservices                                    apiregistration.k8s.io         false        APIService
applications                      apps         app.k8s.io                     true         Application
controllerrevisions                            apps                           true         ControllerRevision
daemonsets                        ds           apps                           true         DaemonSet
deployments                       deploy       apps                           true         Deployment
replicasets                       rs           apps                           true         ReplicaSet
statefulsets                      sts          apps                           true         StatefulSet
meshpolicies                                   authentication.istio.io        false        MeshPolicy
policies                                       authentication.istio.io        true         Policy
tokenreviews                                   authentication.k8s.io          false        TokenReview
localsubjectaccessreviews                      authorization.k8s.io           true         LocalSubjectAccessReview
selfsubjectaccessreviews                       authorization.k8s.io           false        SelfSubjectAccessReview
selfsubjectrulesreviews                        authorization.k8s.io           false        SelfSubjectRulesReview
subjectaccessreviews                           authorization.k8s.io           false        SubjectAccessReview
horizontalpodautoscalers          hpa          autoscaling                    true         HorizontalPodAutoscaler
cronjobs                          cj           batch                          true         CronJob
jobs                                           batch                          true         Job
certificatesigningrequests        csr          certificates.k8s.io            false        CertificateSigningRequest
backendconfigs                                 cloud.google.com               true         BackendConfig
adapters                                       config.istio.io                true         adapter
apikeys                                        config.istio.io                true         apikey
attributemanifests                             config.istio.io                true         attributemanifest
authorizations                                 config.istio.io                true         authorization
bypasses                                       config.istio.io                true         bypass
checknothings                                  config.istio.io                true         checknothing
circonuses                                     config.istio.io                true         circonus
deniers                                        config.istio.io                true         denier
edges                                          config.istio.io                true         edge
fluentds                                       config.istio.io                true         fluentd
handlers                                       config.istio.io                true         handler
httpapispecbindings                            config.istio.io                true         HTTPAPISpecBinding
httpapispecs                                   config.istio.io                true         HTTPAPISpec
instances                                      config.istio.io                true         instance
kubernetesenvs                                 config.istio.io                true         kubernetesenv
kuberneteses                                   config.istio.io                true         kubernetes
listcheckers                                   config.istio.io                true         listchecker
listentries                                    config.istio.io                true         listentry
logentries                                     config.istio.io                true         logentry
memquotas                                      config.istio.io                true         memquota
metrics                                        config.istio.io                true         metric
noops                                          config.istio.io                true         noop
opas                                           config.istio.io                true         opa
prometheuses                                   config.istio.io                true         prometheus
quotas                                         config.istio.io                true         quota
quotaspecbindings                              config.istio.io                true         QuotaSpecBinding
quotaspecs                                     config.istio.io                true         QuotaSpec
rbacs                                          config.istio.io                true         rbac
redisquotas                                    config.istio.io                true         redisquota
reportnothings                                 config.istio.io                true         reportnothing
rules                                          config.istio.io                true         rule
servicecontrolreports                          config.istio.io                true         servicecontrolreport
servicecontrols                                config.istio.io                true         servicecontrol
signalfxs                                      config.istio.io                true         signalfx
solarwindses                                   config.istio.io                true         solarwinds
stackdrivers                                   config.istio.io                true         stackdriver
statsds                                        config.istio.io                true         statsd
stdios                                         config.istio.io                true         stdio
templates                                      config.istio.io                true         template
tracespans                                     config.istio.io                true         tracespan
leases                                         coordination.k8s.io            true         Lease
bgpconfigurations                              crd.projectcalico.org          false        BGPConfiguration
clusterinformations                            crd.projectcalico.org          false        ClusterInformation
felixconfigurations                            crd.projectcalico.org          false        FelixConfiguration
globalbgpconfigs                               crd.projectcalico.org          false        GlobalBGPConfig
globalfelixconfigs                             crd.projectcalico.org          false        GlobalFelixConfig
globalnetworkpolicies                          crd.projectcalico.org          false        GlobalNetworkPolicy
globalnetworksets                              crd.projectcalico.org          false        GlobalNetworkSet
hostendpoints                                  crd.projectcalico.org          false        HostEndpoint
ippools                                        crd.projectcalico.org          false        IPPool
networkpolicies                                crd.projectcalico.org          true         NetworkPolicy
daemonsets                        ds           extensions                     true         DaemonSet
deployments                       deploy       extensions                     true         Deployment
ingresses                         ing          extensions                     true         Ingress
networkpolicies                   netpol       extensions                     true         NetworkPolicy
podsecuritypolicies               psp          extensions                     false        PodSecurityPolicy
replicasets                       rs           extensions                     true         ReplicaSet
serviceprofiles                   sp           linkerd.io                     true         ServiceProfile
nodes                                          metrics.k8s.io                 false        NodeMetrics
pods                                           metrics.k8s.io                 true         PodMetrics
alertmanagers                                  monitoring.coreos.com          true         Alertmanager
podmonitors                                    monitoring.coreos.com          true         PodMonitor
prometheuses                                   monitoring.coreos.com          true         Prometheus
prometheusrules                                monitoring.coreos.com          true         PrometheusRule
servicemonitors                                monitoring.coreos.com          true         ServiceMonitor
managedcertificates               mcrt         networking.gke.io              true         ManagedCertificate
destinationrules                               networking.istio.io            true         DestinationRule
envoyfilters                                   networking.istio.io            true         EnvoyFilter
gateways                                       networking.istio.io            true         Gateway
serviceentries                                 networking.istio.io            true         ServiceEntry
virtualservices                                networking.istio.io            true         VirtualService
ingresses                         ing          networking.k8s.io              true         Ingress
networkpolicies                   netpol       networking.k8s.io              true         NetworkPolicy
runtimeclasses                                 node.k8s.io                    false        RuntimeClass
updateinfos                       updinf       nodemanagement.gke.io          true         UpdateInfo
poddisruptionbudgets              pdb          policy                         true         PodDisruptionBudget
podsecuritypolicies               psp          policy                         false        PodSecurityPolicy
clusterrolebindings                            rbac.authorization.k8s.io      false        ClusterRoleBinding
clusterroles                                   rbac.authorization.k8s.io      false        ClusterRole
rolebindings                                   rbac.authorization.k8s.io      true         RoleBinding
roles                                          rbac.authorization.k8s.io      true         Role
rbacconfigs                                    rbac.istio.io                  true         RbacConfig
servicerolebindings                            rbac.istio.io                  true         ServiceRoleBinding
serviceroles                                   rbac.istio.io                  true         ServiceRole
scalingpolicies                                scalingpolicy.kope.io          true         ScalingPolicy
priorityclasses                   pc           scheduling.k8s.io              false        PriorityClass
trafficsplits                     ts           split.smi-spec.io              true         TrafficSplit
csidrivers                                     storage.k8s.io                 false        CSIDriver
csinodes                                       storage.k8s.io                 false        CSINode
storageclasses                    sc           storage.k8s.io                 false        StorageClass
volumeattachments                              storage.k8s.io                 false        VolumeAttachment
error: unable to retrieve the complete list of server APIs: tap.linkerd.io/v1alpha1: the server is currently unable to handle the request

The last line says that the server is unable to handle requests for the tap API (tap.linkerd.io/v1alpha1). Could that be the issue?

Yeah, it looks like the apiserver is unreachable. Are there any other pods in a bad state or that have many restarts? In the kube-system namespace, perhaps?

If you can exec into a container running your service logic and send a curl request to the tap and identity services, that might help to understand what is happening. You can also use the debug container to test the connectivity to the apiserver.
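
For example, something along these lines (the pod and deployment names are placeholders, and the identity address assumes a default install in the linkerd namespace):

kubectl exec -it <your-app-pod> -c <your-app-container> -- curl -v linkerd-identity.linkerd.svc.cluster.local:8080

# Add the debug sidecar to a workload to test connectivity from inside the mesh
kubectl get deploy <your-deploy> -o yaml | linkerd inject --enable-debug-sidecar - | kubectl apply -f -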

There aren’t any pods that appear to be in a failing state in the kube-system namespace.

These are the pods running in kube-system:

NAME                                                  READY   STATUS    RESTARTS   AGE
calico-node-4z66t                                     2/2     Running   0          5d9h
calico-node-d9k7f                                     2/2     Running   0          5d9h
calico-node-jqwvt                                     2/2     Running   0          5d9h
calico-node-vertical-autoscaler-b889c775f-f9gkz       1/1     Running   5          50d
calico-typha-79799497d7-jn7p2                         1/1     Running   0          5d9h
calico-typha-horizontal-autoscaler-d777c75b4-chptm    1/1     Running   0          50d
calico-typha-vertical-autoscaler-d9b7979f8-mthvb      1/1     Running   5          50d
event-exporter-v0.2.4-5f7d5d7dd4-2ksbz                2/2     Running   12         50d
fluentd-gcp-scaler-6965bb45c9-x78zr                   1/1     Running   0          50d
fluentd-gcp-v3.2.0-6b8sj                              2/2     Running   0          57d
fluentd-gcp-v3.2.0-g89vq                              2/2     Running   0          57d
fluentd-gcp-v3.2.0-nphx5                              2/2     Running   0          57d
heapster-gke-5977f6d8cb-4qq57                         3/3     Running   4          5d9h
ip-masq-agent-5brs5                                   1/1     Running   0          76d
ip-masq-agent-c72vz                                   1/1     Running   0          76d
ip-masq-agent-lffds                                   1/1     Running   0          76d
kube-dns-5995c95f64-kkthn                             4/4     Running   0          5d9h
kube-dns-5995c95f64-pgwdt                             4/4     Running   0          5d9h
kube-dns-autoscaler-8687c64fc-r4rdj                   1/1     Running   0          50d
kube-proxy-gke-development-pcf-pool-2-26b61618-jn8b   1/1     Running   0          50d
kube-proxy-gke-development-pcf-pool-2-26b61618-vsph   1/1     Running   0          50d
kube-proxy-gke-development-pcf-pool-2-26b61618-vxkm   1/1     Running   0          50d
l7-default-backend-7ff48cffd7-2qztn                   1/1     Running   4          50d
metrics-server-v0.3.1-5c6fbf777-7vqkj                 2/2     Running   0          50d
prometheus-to-sd-cqx2b                                2/2     Running   0          5d9h
prometheus-to-sd-cwvmd                                2/2     Running   0          5d9h
prometheus-to-sd-n5bxg                                2/2     Running   0          5d9h
tiller-deploy-79c4c54bc4-hgjb8                        1/1     Running   2          50d

I ran a curl command from within an application pod to the identity service.

The command was: curl -vvv linkerd-identity.linkerd.svc.cluster.local:8080

The result (IP addresses masked with ***):
* Rebuilt URL to: linkerd-identity.linkerd.svc.cluster.local:8080/
*   Trying ***...
* TCP_NODELAY set
* Connected to linkerd-identity.linkerd.svc.cluster.local (***) port 8080 (#0)
> GET / HTTP/1.1
> Host: linkerd-identity.linkerd.svc.cluster.local:8080
> User-Agent: curl/7.52.1
> Accept: */*
> 
* Recv failure: Connection reset by peer
* Curl_http_done: called premature == 1
* stopped the pause stream!
* Closing connection 0
curl: (56) Recv failure: Connection reset by peer

I can confirm the IP address is correct, so it looks like DNS resolution is working. I checked the logs of the identity pod (both the proxy container and the identity container) to see if the incoming request was logged, but didn’t see anything.
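
(For reference, I checked the identity pod with commands along these lines, where the pod name is a placeholder:)

kubectl -n linkerd logs <linkerd-identity-pod> -c identity
kubectl -n linkerd logs <linkerd-identity-pod> -c linkerd-proxy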

I can’t seem to start up a pod with a debug container because the proxy-injector component is also failing to start up properly (its proxy fails to get a certificate, I believe).

@a-linkerd-newbie ah, this is beginning to look familiar, although the behavior is slightly different. Do you know if your nodes were restarted at any point? Do you have any custom autoscaling strategies configured?

There was an issue with the cluster vertical pod autoscaler that was addressed in the latest release.

Can you check the version of the image being used in the autoscaler?

kubectl get po calico-node-vertical-autoscaler-b889c775f-f9gkz -n kube-system -oyaml | grep image should do the trick.

The autoscaler’s image is: gke.gcr.io/cpvpa-amd64:v0.7.1-gke.0

Also, the nodes weren’t restarted at any point before the issues started happening. We have no custom autoscaling strategies either.

@a-linkerd-newbie thanks for checking the image. I see that there were some restarts on the cluster vertical pod autoscaler pods. Did you happen to capture their logs to see whether they have any output like that described in 3388?

I’m curious to know whether you have seen this behavior again. If so, can you share the Kubernetes event logs?

I scanned the autoscaler logs for any errors or warnings over the time period in which the issues started occurring, but I can’t find anything related to what was reported in 3388.
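
(I pulled and filtered the autoscaler logs with something like this, over the window when the errors started:)

kubectl -n kube-system logs calico-node-vertical-autoscaler-b889c775f-f9gkz | grep -iE "error|warn"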

An interesting thing happened yesterday, though: I ran into the issue described in 4029. The calico and linkerd CNI configurations got into a race condition, so pods that get restarted can no longer start up properly.

It seems like 4029 and 3388 are related, in that both are caused by a conflict between the linkerd CNI and calico CNI configurations. Do you think the solution provided in 3388 - setting the priorityClass of the linkerd-cni daemonset - could resolve both issues that I’ve run into?

@a-linkerd-newbie, yeah, I think the priorityClass will work.

As I recall, one of the folks who commented in 3388 mentioned that they were not using the CNI plugin and couldn’t apply the priorityClass configuration.
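
If it helps, here is a minimal sketch of the priorityClass approach. The class name is made up, and the namespace assumes the daemonset lives in linkerd, so adjust both to your install:

cat <<EOF | kubectl apply -f -
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: linkerd-cni-priority
value: 1000000
description: "Schedule the linkerd-cni daemonset ahead of meshed workloads"
EOF

kubectl -n linkerd patch daemonset linkerd-cni -p '{"spec":{"template":{"spec":{"priorityClassName":"linkerd-cni-priority"}}}}'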

I’ve decided to re-deploy the linkerd control plane without the CNI plugin; using the linkerd CNI and calico CNI simultaneously seems a bit finicky. So far things are looking good!
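
(For anyone finding this thread later, the re-deploy was essentially the stock non-CNI path, roughly:)

linkerd install | kubectl apply -f -
linkerd check
kubectl get deploy <your-deploy> -o yaml | linkerd inject - | kubectl apply -f -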

Thanks for the update. I’m glad to hear that it’s working.

If you have time, I’d appreciate a GitHub issue with the details so that I can try to reproduce it and dig into the details. https://github.com/linkerd/linkerd2/issues