ERROR linkerd2_app::env: Not a valid identity name

My Linkerd installation repeatedly runs into this issue after working fine for some time.

I first notice that new pods stop getting a Linkerd proxy injected, so I go to re-deploy the proxy-injector pod, which then fails to start with the following errors:

time="2020-04-23T17:32:20Z" level=info msg="running version stable-2.7.1"
time="2020-04-23T17:32:20Z" level=info msg="Using with pre-existing key: /var/run/linkerd/identity/end-entity/key.p8"
time="2020-04-23T17:32:20Z" level=info msg="Using with pre-existing CSR: /var/run/linkerd/identity/end-entity/key.p8"
[     0.7317620s] ERROR linkerd2_app::env: Not a valid identity name: linkerd-identity.$(_l5d_ns).serviceaccount.identity.$(_l5d_ns).$(_l5d_trustdomain)
[     0.7362210s] ERROR linkerd2_app::env: LINKERD2_PROXY_IDENTITY_SVC_NAME="linkerd-identity.$(_l5d_ns).serviceaccount.identity.$(_l5d_ns).$(_l5d_trustdomain)" is not valid: NameError
[     0.7456501s] ERROR linkerd2_app::env: Not a valid identity name: $(_pod_sa).$(_pod_ns).serviceaccount.identity.$(_l5d_ns).$(_l5d_trustdomain)
[     0.7474491s] ERROR linkerd2_app::env: LINKERD2_PROXY_IDENTITY_LOCAL_NAME="$(_pod_sa).$(_pod_ns).serviceaccount.identity.$(_l5d_ns).$(_l5d_trustdomain)" is not valid: NameError
[     0.7498852s] ERROR linkerd2_app::env: Not a valid identity name: linkerd-destination.$(_l5d_ns).serviceaccount.identity.$(_l5d_ns).$(_l5d_trustdomain)
[     0.7512102s] ERROR linkerd2_app::env: LINKERD2_PROXY_DESTINATION_SVC_NAME="linkerd-destination.$(_l5d_ns).serviceaccount.identity.$(_l5d_ns).$(_l5d_trustdomain)" is not valid: NameError
[     0.7556233s] ERROR linkerd2_app::env: Not a valid identity name: linkerd-tap.$(_l5d_ns).serviceaccount.identity.$(_l5d_ns).$(_l5d_trustdomain)
[     0.7571993s] ERROR linkerd2_app::env: LINKERD2_PROXY_TAP_SVC_NAME="linkerd-tap.$(_l5d_ns).serviceaccount.identity.$(_l5d_ns).$(_l5d_trustdomain)" is not valid: NameError
Invalid configuration: invalid environment variable

Attempts to re-deploy any of the other Linkerd pods then produce the same results. In the past I’ve ended up destroying and redeploying all of Linkerd, but since it has happened multiple times, I’d like to find the cause and fix it instead. Any assistance would be appreciated.

@TechnoChimp, I can’t recall seeing that error before.

Where is this cluster running and how have you deployed it?

Did you use a custom cluster domain, by any chance?

It’s running in an EKS cluster in namespace: linkerd. We have not customized the cluster domain.

Thanks, that’s helpful. What is the Kubernetes version?

If you can run kubectl exec commands against the pod, can you send the output from kubectl exec <pod-name> -c linkerd-proxy -- env?

The first thing we want to do is verify the _l5d_ns and _l5d_trustdomain environment variables.
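
For example, something along these lines (the pod name and namespace are placeholders) should show whether those two variables made it into the proxy container and whether the LINKERD2_PROXY_* values got expanded:

    kubectl -n <app-namespace> exec <pod-name> -c linkerd-proxy -- env | grep -E '_l5d_|LINKERD2_PROXY'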

The fact that the behavior starts after linkerd has been running for a while suggests that there is some sort of reconciliation controller that is updating the namespace or cluster domain, or otherwise making the components in the linkerd namespace inaccessible.

If you can get the kubelet output, that might have some helpful information in it.

Server Version:
version.Info{Major:"1", Minor:"15+", GitVersion:"v1.15.10-eks-bac369", GitCommit:"bac3690554985327ae4d13e42169e8b1c2f37226", GitTreeState:"clean", BuildDate:"2020-02-26T01:12:54Z", GoVersion:"go1.12.12", Compiler:"gc", Platform:"linux/amd64"}

I’m not sure how I can run env on the pod. It fails pretty much immediately and doesn’t give me time to run anything before it exits. I don’t suppose you know of a way I can add something to the config that will run a command in the container before it crashes?

Anyway, I did at least get the variables you were asking about from a running pod which was started up before everything broke.

_l5d_ns=linkerd
_l5d_trustdomain=cluster.local

For some reason my previous reply got flagged as spam… So I’ll summarize with less verbose information in the hopes this message won’t get blocked…

K8s version is 1.15.

I can’t run any commands on the pod, it fails before I can exec anything. (Know of a way I can get the pod to output this before it crashes?)

A pod that is still running from before everything broke has the expected values for those variables.

You might be able to use kubectl logs <pod-name> --previous to see the logs.

The other thing I’d look at is the output from the kubectl get ev -n linkerd command.

That being said, something fishy is happening at a system level making those environment variables unavailable.

It sounds like pods get the proxy injected fine right after Linkerd is installed, then after some amount of time, new pods with the proxy fail with this error. The thing we need to try to understand is what is changing in the environment between the time that the initial pods start running and the time that the failing pods start.

What else is installed in the cluster? Flux? Helm? Anything that would modify the environment?

So I can get the logs from the pod, that’s not a problem. (See my first post with the attached logs from the crashed pod.)

We do leverage Helm charts to deploy some apps, but Linkerd is not deployed via Helm; we use the Linkerd CLI.

The rest of the stuff in the cluster is pretty standard: Prometheus, Grafana, Fluentd, and Jaeger. The app pods we’re deploying are pretty simple, just a deployment and a service. We are also using NGINX Ingress for load balancing.

I’m still digging in to see if I can find anything, but I’m pretty much going into it blind.

One question I had relates to the service account that the proxy is looking for. I’m wondering how Linkerd uses the value linkerd-tap.$(_l5d_ns).serviceaccount.identity.$(_l5d_ns).$(_l5d_trustdomain), and whether I can test that same logic in a different pod. Is this some sort of DNS lookup that’s happening?

@TechnoChimp, here’s a snippet of YAML from the proxy that is injected into a pod where we can see how those environment variables are used:

    - name: _l5d_ns
      value: linkerd
    - name: _l5d_trustdomain
      value: cluster.local
    - name: LINKERD2_PROXY_IDENTITY_LOCAL_NAME
      value: $(_pod_sa).$(_pod_ns).serviceaccount.identity.$(_l5d_ns).$(_l5d_trustdomain)
    - name: LINKERD2_PROXY_IDENTITY_SVC_NAME
      value: linkerd-identity.$(_l5d_ns).serviceaccount.identity.$(_l5d_ns).$(_l5d_trustdomain)
    - name: LINKERD2_PROXY_DESTINATION_SVC_NAME
      value: linkerd-destination.$(_l5d_ns).serviceaccount.identity.$(_l5d_ns).$(_l5d_trustdomain)
    - name: LINKERD2_PROXY_TAP_SVC_NAME
      value: linkerd-tap.$(_l5d_ns).serviceaccount.identity.$(_l5d_ns).$(_l5d_trustdomain)

Those variables are evaluated here.

The error message indicates that the environment variables aren’t being interpolated, which leads me to think that this is an EKS thing.

About how long after Linkerd is installed does this happen? Is there anything special about the deployments?

This installation was deployed 44 days ago. I only noticed the problem about 4 days ago, but it’s very possible it started before then; I happened to be checking in on my app deployment and realized that the proxy was missing.

Also, I did some extra testing by manually filling in the variable values in place of the variable references. I no longer got the errors in the logs for the ones I updated. I’m going to follow up with our K8s support team to see if they can help troubleshoot the variable interpolation issue further.
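
Concretely, the substitution looked roughly like this, using our namespace and trust domain from above:

    - name: LINKERD2_PROXY_IDENTITY_SVC_NAME
      value: linkerd-identity.linkerd.serviceaccount.identity.linkerd.cluster.local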

Thanks for your assistance!

Okay, please let me know what you learn. I’d be interested to find out if this is an EKS-specific issue.

I think it’s related to this: https://github.com/kubernetes/kubernetes/issues/48069

I moved this from the bottom of the env list to the top and it started working:

    - name: _l5d_ns
      value: linkerd
    - name: _l5d_trustdomain
      value: cluster.local
    - name: _pod_ns
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.namespace
    - name: _pod_sa
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: spec.serviceAccountName

Also, since that bug was closed I opened my own here: https://github.com/kubernetes/kubernetes/issues/90466

While that’s being looked at, is there some way to change this order of variables globally for Linkerd so that new proxies can deploy successfully?

Also, a little more information if it helps…

When I run linkerd upgrade, the variables are generated in the correct order. However, when I then do kubectl apply, the variables are rearranged. It seems they’re being sorted alphabetically, with the _ variables ending up at the bottom.
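
For what it’s worth, here’s how I’ve been checking the order that actually got applied (the linkerd-proxy container name is what I see in our injected deployments; adjust if yours differs):

    kubectl -n linkerd get deploy linkerd-proxy-injector \
      -o jsonpath='{.spec.template.spec.containers[?(@.name=="linkerd-proxy")].env[*].name}'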

I’m guessing this installation had broken at some point after an upgrade and I just didn’t immediately notice it.

And root cause found: we’re managing our cluster with Rancher, and the re-deploy option in Rancher re-orders the variables. Applying the linkerd upgrade YAML directly works fine.

@TechnoChimp, thanks for the ongoing updates on this.

First, I’m glad that you found the root cause of the issue. That is a very obscure bug, so you did some really good sleuthing.

To answer your question about changing the order of the environment variables, the best way to go about that would be to install using Helm and then modify the charts to fit your needs. That being said, this is something I would expect to be fixed in Kubernetes. The Kubernetes bug you pointed to is said to be fixed in 1.7, so I’m really surprised that it’s occurring in 1.15; it seems like there is something about EKS 1.15 that is not up to date.

The response my bug got makes sense:

since envvars can reference each other, and cycles are possible, this matches behavior of defining envvars in a shell, etc. this behavior should be documented.
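
In other words (a toy example I put together, not Linkerd’s actual manifest), a $(VAR) reference is only expanded when VAR appears earlier in the same env list; otherwise the literal string is passed through, which is exactly what the proxy was rejecting:

    # Toy illustration of Kubernetes env var expansion order, not Linkerd's manifest.
    env:
    - name: FIRST
      value: hello
    - name: EXPANDED
      value: $(FIRST) world      # becomes "hello world"
    - name: NOT_EXPANDED
      value: $(LAST) world       # LAST is defined below, so this stays literally "$(LAST) world"
    - name: LAST
      value: hello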

However, with that issue resolved, I’m still experiencing the first part of my issue. As stated before, I first noticed that new application pods stopped getting a proxy injected. This is still occurring even after figuring out the proxy-injector pod issue.

I’m going to check some of the same items as mentioned in this issue: https://github.com/linkerd/linkerd2/issues/2847

I’ll report back with what I find.

@TechnoChimp thanks for the update.

If you can share your config files, I’ll take a look at them to try to figure out why the Linkerd proxy is not injected.

You can set --log-level=debug on the linkerd-proxy-injector and it should provide some insight into how the annotations are being processed and whether the injection is actively being skipped.

The kubectl get events command also has some detail about proxy injection.
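
Roughly, the sequence I’d try is something like the following (the deployment and container names are what I’d expect for stable-2.7.1; yours may differ):

    # Add or adjust --log-level=debug in the proxy-injector container's args:
    kubectl -n linkerd edit deploy/linkerd-proxy-injector

    # Tail the injector logs while re-creating one of the application pods:
    kubectl -n linkerd logs deploy/linkerd-proxy-injector -c proxy-injector -f

    # Look for injection-related events in both namespaces:
    kubectl get events -n linkerd
    kubectl get events -n <application-namespace>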

We haven’t made any custom configuration. It’s my understanding from the user docs that it should be using the default linkerd.yaml file in the /config folder. Is there an easy way I can export this so I can see how it’s configured by default?

I have set the log level to debug on the proxy-injector; however, I’m not seeing any new logs after redeploying one of my application pods.

I’ve tried kubectl get events on both the linkerd and application namespaces; however, I’m not seeing anything relating to the proxy.

I did check the api-server logs and found these messages appear when I re-deploy my application pod:

E0427 18:04:17.926585 1 available_controller.go:409] v1alpha1.tap.linkerd.io failed with: failing or missing response from https://10.32.0.3:8089/apis/tap.linkerd.io/v1alpha1: Get https://10.32.0.3:8089/apis/tap.linkerd.io/v1alpha1: Address is not allowed

I0427 18:04:18.926293 1 controller.go:107] OpenAPI AggregationController: Processing item v1alpha1.tap.linkerd.io

W0427 18:04:18.926359 1 handler_proxy.go:91] no RequestInfo found in the context

E0427 18:04:18.926409 1 controller.go:114] loading OpenAPI spec for "v1alpha1.tap.linkerd.io" failed with: failed to retrieve openAPI spec, http error: ResponseCode: 503, Body: service unavailable , Header: map[Content-Type:[text/plain; charset=utf-8] X-Content-Type-Options:[nosniff]]

I0427 18:04:18.926415 1 controller.go:127] OpenAPI AggregationController: action for item v1alpha1.tap.linkerd.io: Rate Limited Requeue.

W0427 18:04:23.340311 1 dispatcher.go:136] Failed calling webhook, failing open linkerd-proxy-injector.linkerd.io: failed calling webhook "linkerd-proxy-injector.linkerd.io": Post https://linkerd-proxy-injector.linkerd.svc:443/?timeout=30s: context deadline exceeded

E0427 18:04:23.340355 1 dispatcher.go:137] failed calling webhook "linkerd-proxy-injector.linkerd.io": Post https://linkerd-proxy-injector.linkerd.svc:443/?timeout=30s: context deadline exceeded

I0427 18:04:23.350202 1 trace.go:81] Trace[445250923]: "Create /api/v1/namespaces/***application_ns***/pods" (started: 2020-04-27 18:03:53.339721771 +0000 UTC m=+2140548.817302162) (total time: 30.010464723s): Trace[445250923]: [30.001521448s] [30.001460532s] About to store object in database
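
That last pair of messages looks like the API server can’t reach the injector webhook within its 30s timeout and is failing open, which would explain pods being created without a proxy. Next I’m planning to check the webhook service, its endpoints, and the webhook registration, with something along these lines (the webhook config name is a placeholder):

    kubectl -n linkerd get svc,endpoints linkerd-proxy-injector
    kubectl get mutatingwebhookconfigurations
    kubectl get mutatingwebhookconfiguration <webhook-config-name> -o yaml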