Namerd timeouts when fetching DNS query to Consul cluster

We are running a Namerd and Consul cluster in our environment. We frequently get the error below, and we have to restart the Namerd service to mitigate the issue.

Namerd-01 namerd: E 0114 06:30:29.851 UTC THREAD31: Retrying Consul request ‘GET /v1/health/service/xxxx?dc=x&passing=true’ on NonFatal error: com.twitter.finagle.ChannelWriteException: com.twitter.finagle.ChannelClosedException: null at remote address: consul.service.x.x.internal.prod/x.x.x.x:8500. Remote Info: Not Available from service: client. Remote Info: Upstream Address: Not Available, Upstream id: Not Available, Downstream Address: consul.x.x.x.internal/x.x.x.x:8500, Downstream label: client, Trace Id: 6d298c434b2d50b7.6d298c434b2d50b7<:6d298c434b2d50b7

Please help if anyone has faced this issue. We are using the configuration below.
/opt/namerd/namerd.yaml
admin:
  port: 9991
  ip: 0.0.0.0

namers:
- kind: io.l5d.consul
  host: consul.service.x.x.internal.prod
  port: 8500
  useHealthCheck: true

storage:
  kind: io.l5d.consul
  host: consul.service.x.x.internal.prod
  port: 8500
  pathPrefix: /namerd/dtabs
  datacenter: ent

interfaces:
- kind: io.l5d.thriftNameInterpreter
  ip: 0.0.0.0
  port: 5100
  cache:
    bindingCacheActive: 2000
    bindingCacheInactive: 200
    addrCacheActive: 2000
    addrCacheInactive: 200
- kind: io.l5d.httpController
  ip: 0.0.0.0
  port: 5180

telemetry:
- kind: io.l5d.prometheus

=============================================

Did something change in the environment recently? How long have you been running Linkerd?

Did you check the list of endpoints known to namerd using the admin interface?

Hi,

We have been running Linkerd and Namerd since 2016, and nothing has changed in the environment. Below is the Namerd configuration file:

admin:
  port: 9991
  ip: 0.0.0.0

namers:
- kind: io.l5d.consul
  host: consul.service.x.x.internal.x
  port: 8500
  useHealthCheck: true

storage:
  kind: io.l5d.consul
  host: consul.service.x.x.internal.x
  port: 8500
  pathPrefix: /namerd/dtabs
  datacenter: ent

interfaces:
- kind: io.l5d.thriftNameInterpreter
  ip: 0.0.0.0
  port: 5100
  cache:
    bindingCacheActive: 2000
    bindingCacheInactive: 200
    addrCacheActive: 2000
    addrCacheInactive: 200
- kind: io.l5d.httpController
  ip: 0.0.0.0
  port: 5180

telemetry:
- kind: io.l5d.prometheus

@Rajredison I can only guess that namerd is getting some stale endpoints. You can use the namerd and consul APIs to make sure that the endpoints match between the two, if it happens again. Have a look at this issue to see if it describes similar behavior to what you’re seeing.
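
For anyone who wants to script that comparison, here is a rough Python sketch. It makes several assumptions, so treat it as a starting point, not a verified tool: the Consul side uses the same /v1/health/service/<name>?dc=...&passing=true request that appears in the log above; the namerd side assumes the httpController (port 5180 in the posted config) serves /api/1/addr/<namespace>?path=... and returns JSON with an "addrs" list of "ip"/"port" entries; and the path assumes the /#/io.l5d.consul/<dc>/<service> layout. The function names and the namespace value are made up for illustration.

```python
import json
import urllib.parse
import urllib.request


def consul_endpoints(health_entries):
    """Extract (ip, port) pairs from a Consul /v1/health/service/<name> response."""
    endpoints = set()
    for entry in health_entries:
        service = entry["Service"]
        # Consul leaves Service.Address empty when the service uses the node's address.
        ip = service.get("Address") or entry["Node"]["Address"]
        endpoints.add((ip, service["Port"]))
    return endpoints


def namerd_endpoints(addr_response):
    """Extract (ip, port) pairs from a namerd addr response (assumed JSON shape)."""
    return {(a["ip"], a["port"]) for a in addr_response.get("addrs", [])}


def diff_endpoints(consul, namerd):
    """Report endpoints that only one side knows about."""
    return {
        "stale_in_namerd": sorted(namerd - consul),
        "missing_in_namerd": sorted(consul - namerd),
    }


def fetch_json(url, timeout=5):
    """GET a URL and decode the JSON body."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def compare(consul_base, namerd_base, namespace, dc, service):
    """Fetch both views of one service and return their difference."""
    consul_query = urllib.parse.urlencode({"dc": dc, "passing": "true"})
    consul_url = f"{consul_base}/v1/health/service/{service}?{consul_query}"
    # Assumed namer path layout for the io.l5d.consul namer.
    namerd_query = urllib.parse.urlencode({"path": f"/#/io.l5d.consul/{dc}/{service}"})
    namerd_url = f"{namerd_base}/api/1/addr/{namespace}?{namerd_query}"
    return diff_endpoints(
        consul_endpoints(fetch_json(consul_url)),
        namerd_endpoints(fetch_json(namerd_url)),
    )
```

You would call it with something like compare("http://<consul-host>:8500", "http://<namerd-host>:5180", "<namespace>", "<dc>", "<service>"), filling in your own values. A non-empty "stale_in_namerd" list at the moment the errors start would support the stale-endpoint guess.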