K8s: Linkerd fails to resolve service

I have 4 instances of Linkerd and 1 of them suddenly stopped resolving a service. I saw this issue a couple of times when I used version 0.8.5, but at the moment v1.1.2 (with mesh) is used.

I collected all logs and metrics from that instance but don’t know how to attach them to the topic.

If someone is going to take a look, I’ll send the files directly.

l5d.zip (45.9 KB)

Hiya!

You can upload files to a message by clicking the upload button in the toolbar above the text input box and choosing a file.

Hope that helps!

The upload button only allows image uploads, but I want to attach a zip archive.
Of course I could take screenshots of every file’s contents, but I don’t think that would be useful 🙂

Ah my bad, that setting wasn’t turned on. Can you upload files now?

Attached the zip archive. It is protected with a password; I’ll send the password in a direct message.
Thanks for your help.

Thanks for the logs and metrics. It seems that the one misbehaving linkerd is not able to resolve the client names from namerd for some reason.

How many namerd instances are you running? Could you also include the logs and metrics from your namerd instances?

There are 3 Namerd instances installed on the masters.

Do you need logs with tracing enabled?

I don’t think tracing should be necessary. Mostly I’m just looking for error messages in the log that could point to what is wrong. If 1 of the 3 namerds is having issues, that would explain why not all linkerds were affected. Only the linkerd that happened to connect to the affected namerd would have the issue.

namerd.zip (158.0 KB)
Attached logs, metrics and configs. Password is the same.

Thanks! Looking through the logs and metrics, nothing looks obviously amiss. The next step to debug would be to do the same lookup through the dtab delegator on each of the namerd instances to see if they are able to find the service. If you find a namerd instance that fails to look up the service, try exec’ing into that container and running

kubectl describe endpoints/<service name>

to see if the endpoints can be retrieved from the kubernetes API.
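
If kubectl isn’t installed in the container, the same data can be fetched from the kubernetes API directly. A sketch, assuming the API server is reachable without auth on localhost:8080 (as is common on masters):

curl http://localhost:8080/api/v1/namespaces/<namespace>/endpoints/<service name>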

There was no kubectl inside the namerd container, so I queried the API directly (I think Namerd does the same) and all 3 instances resolved it correctly:

# curl -i http://localhost:8080/api/v1/namespaces/apigw-portal/endpoints/apigw-report-designer-v1-1-142a
HTTP/1.1 200 OK
Content-Type: application/json
Date: Fri, 04 Aug 2017 20:10:13 GMT
Content-Length: 952

{
  "kind": "Endpoints",
  "apiVersion": "v1",
  "metadata": {
    "name": "apigw-report-designer-v1-1-142a",
    "namespace": "apigw-portal",
    "selfLink": "/api/v1/namespaces/apigw-portal/endpoints/apigw-report-designer-v1-1-142a",
    "uid": "d3f253c2-779d-11e7-905f-0eabfb32a902",
    "resourceVersion": "59695303",
    "creationTimestamp": "2017-08-02T16:15:40Z"
  },
  "subsets": [
    {
      "addresses": [
        {
          "ip": "100.96.xx.xx",
          "nodeName": "ip-x-x-x-x.ec2.internal",
          "targetRef": {
            "kind": "Pod",
            "namespace": "apigw-portal",
            "name": "apigw-report-designer-service-v1-1-142a-1887560080-pwof5",
            "uid": "d43445ba-779d-11e7-905f-0eabfb32a902",
            "resourceVersion": "59695236"
          }
        }
      ],
      "ports": [
        {
          "name": "http",
          "port": 9001,
          "protocol": "TCP"
        }
      ]
    }
  ]
}

I also want to mention that those 3 Namerds are used by about 30 other Linkerd instances as well, and none of them have any issues at the moment.

Very strange. Did the issue recover on its own or is it still ongoing?

The issue is still there, and /delegator.json still shows negative resolution for IP addresses. We kept that instance around for investigation but excluded it from traffic, so if you need any additional information, just tell me what you need.
I’m sure a restart would help, but I don’t want to go that way.

Is there an endpoint in Namerd that does the same thing as Linkerd’s /delegator.json?

@Alex
I finally found the Namerd instance that doesn’t resolve IP addresses for the dtab namespace used by that Linkerd. The interesting thing is that it resolves addresses for other namespaces.

I used this endpoint:
curl -i -X GET "http://localhost:4182/api/1/delegate/namespace?path=/external/apigw-report-designer-v1&watch=false"

Very interesting. Sounds like you’ve got a pretty good handle on debugging this. Does it seem like the error is namerd reading the dtab from the k8s third party resource API? Or a problem with namerd doing a particular endpoints lookup?

There is a known issue where linkerd/namerd will stop getting updates from a k8s namespace if the namespace is deleted and recreated. Could you be hitting that, perhaps?
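
One way to rule that out (a sketch, assuming kubectl access from somewhere with cluster credentials) is to compare the namespace’s uid and creation timestamp against when it was originally created:

kubectl get namespace apigw-portal -o jsonpath='{.metadata.uid} {.metadata.creationTimestamp}'

If either value is newer than the original deployment, the namespace was deleted and recreated at some point.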

I don’t think it is related to reading the dtab, because the Namerd API shows the latest dtab state, and when I do /delegate it shows valid service names but without bound addresses.

So it may be related either to the endpoints lookup or to some caching issue.

I’m pretty sure that the namespace was not recreated. But there is a lot of activity around creating and deleting services/deployments in all namespaces; I don’t know whether that could have an effect.

The interesting thing is that Namerd doesn’t resolve services deployed after date X in that particular namespace. All services that were deployed before date X are still resolvable.

Is there any way to enable tracing in Namerd via some HTTP endpoint?

I believe you can turn on tracing in namerd with the -com.twitter.finagle.tracing.debugTrace=true flag, but I’m not sure how useful that will be. Another approach is to use wireshark or something similar to capture the HTTP API calls that namerd makes to the k8s API and make sure the responses that come back from k8s contain the correct set of endpoints.
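
If wireshark isn’t available, you could also subscribe directly to the watch stream that namerd relies on, straight against the k8s API. A sketch, reusing the service from your earlier curl (watch and fieldSelector are standard k8s API parameters, though it’s worth confirming they behave this way on your 1.4 API server):

curl -i "http://localhost:8080/api/v1/namespaces/apigw-portal/endpoints?watch=true&fieldSelector=metadata.name=apigw-report-designer-v1-1-142a"

Each change to the endpoints should arrive as a JSON event ("type": ADDED/MODIFIED/DELETED). If events stop arriving while pods come and go, the problem is on the API side rather than inside namerd.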

I still haven’t found the root cause. My assumption is that it may be related to the kube API (Kubernetes 1.4 is used at the moment). Maybe the issue will disappear after migrating to Kubernetes 1.7.

For now I have enabled tracing on the Namerd instances and added a validation step before every dtab update: try to resolve the new endpoints that the update would add to the dtab, on all Namerd instances. If all instances resolve the endpoints into IP addresses, allow the update; if not, reject it.
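
Roughly, the check looks like this (just a sketch: the namerd hostnames and the /external/<new service> path are placeholders, and grepping for a "neg" node assumes the delegate response marks unresolved names the same way linkerd’s delegator does):

for host in namerd-1 namerd-2 namerd-3; do
  # ask each namerd to delegate the new path; reject the update on a negative resolution
  if curl -s "http://$host:4182/api/1/delegate/namespace?path=/external/<new service>&watch=false" | grep -q '"neg"'; then
    echo "$host failed to resolve, rejecting dtab update"
    exit 1
  fi
done
echo "all namerds resolved the path, applying dtab update"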

@Alex thanks for your time.