Failures when there is a rule with a weighted union in namerd dtab

The error message I get is:

No hosts are available for /svc/qa5/tags, Dtab.base=[], Dtab.local=[]. Remote Info: Not Available

The error log from linkerd is:

0614 21:44:05.072 4243c69c8667b009.4243c69c8667b009<:4243c69c8667b009] Message(namer.success)
0614 21:44:05.073 4243c69c8667b009.4b47ee18e32febf2<:4243c69c8667b009] ClientAddr(/10.97.26.244:59612)
0614 21:44:05.074 4243c69c8667b009.40891eea44fc43e0<:4b47ee18e32febf2] Message(namer.success)
0614 21:44:05.076 4243c69c8667b009.40891eea44fc43e0<:4b47ee18e32febf2] BinaryAnnotation(io.buoyant.router.Failure,ClientAcquisition)
E 0614 21:44:05.076 UTC THREAD29 TraceId:4243c69c8667b009: service failure
com.twitter.finagle.NoBrokersAvailableException: No hosts are available for /svc/qa5/tags, Dtab.base=[], Dtab.local=[]. Remote Info: Not Available

The dtab in namerd when this happens is:

/svc=>/#/io.l5d.marathon;
/svc/qa5/qa01=>/svc/qa5/qa01.green;
/svc/qa5/tags=>7.00*/svc/qa5/tags.blue & 3.00*/svc/qa5/tags.green

The namerd config is:

admin:
  port: 9991
  ip: 0.0.0.0
storage:
  kind: io.l5d.zk
  experimental: true
  zkAddrs:
  - host: master.mesos
    port: 2181
  pathPrefix: /dtabs
  sessionTimeoutMs: 10000
namers:
- kind:         io.l5d.marathon
  experimental: true
  prefix:       /io.l5d.marathon
  host:         marathon.mesos
  port:         8080
interfaces:
- kind: io.l5d.thriftNameInterpreter
  ip:   0.0.0.0
  port: 4100
- kind: io.l5d.httpController
  ip:   0.0.0.0
  port: 4180

Linkerd config is:

admin:
  port: 9990

telemetry:
- kind: io.l5d.prometheus

routers:
- protocol: http
  label: outgoing
  #streamingEnabled: false
  #dtab: |
  #  /svc=>/#/io.l5d.marathon;
  identifier:
    kind: io.l5d.path
    consume: false
    segments: 2
  interpreter:
    kind: io.l5d.namerd
    experimental: true
    dst: /$/inet/namerd/4100
    namespace: sessionm
    transformers:
    - kind: io.l5d.port
      port: 4141
  servers:
  - port: 4140
    ip: 0.0.0.0
- protocol: http
  label: incoming
  #streamingEnabled: false
  #dtab: |
  #  /svc=>/#/io.l5d.marathon;
  identifier:
    kind: io.l5d.path
    consume: true
    segments: 2
  interpreter:
    kind: io.l5d.namerd
    experimental: true
    dst: /$/inet/namerd/4100
    namespace: sessionm
    transformers:
    - kind: io.l5d.localhost
  servers:
  - port: 4141
    ip: 0.0.0.0
  client:

Thanks for all the detail! Would you mind adding the namerd log also?

Sure. But there are no failures in there.

Yeah, I think the namerd log could still help.

A couple other things to try:

  1. Can you test the /svc/qa5/tags route using linkerd and namerd admin dtab pages, and then post a screenshot?

  2. Does it work if you use a dtab like this?

    /svc=>/#/io.l5d.marathon;
    /svc/qa5/qa01=>/svc/qa5/qa01.green;
    /svc/qa5/tags=>/svc/qa5/tags.blue

Yeah, I think the namerd log could still help.

Is there a way to attach the entire log file?

  1. Can you test the /svc/qa5/tags route using linkerd and namerd admin dtab pages, and then post a screenshot?

Namerd dtab resolution

Linkerd dtab resolution
How do I get this?

  2. Does it work if you use a dtab like this?

    /svc=>/#/io.l5d.marathon;
    /svc/qa5/qa01=>/svc/qa5/qa01.green;
    /svc/qa5/tags=>/svc/qa5/tags.blue

Yes. And this works as well.

/svc=>/#/io.l5d.marathon;
/svc/qa5/qa01=>/svc/qa5/qa01.green;
/svc/qa5/tags=>/svc/qa5/tags.green

That dtab resolution looks good from namerd’s perspective. Can you do the same thing from the linkerd admin dtab page?

You can include the entire namerd log inline, or post it as a gist on github.

linkerd’s admin page is similar to namerd’s admin, but runs on port 9990 by default. Navigate to the “dtab” section.

linkerd dtab resolution

namerd logs

linkerd stats

What port are you sending to when you get the “No hosts are available” error? Based on the metrics it looks like you may be sending to 4141, the incoming router. These requests will fail unless there happens to be an instance of the target service running on the same node as the linkerd you’re sending to. You should instead send your request to port 4140, the outgoing router. This will forward the request to a node where the target service is running.

Yes - there is a service instance running on the same node as linkerd.

We send requests only to 4140. The traffic into 4141 that you see should just be the requests being forwarded by linkerd to itself (meant for the local service instance).

Note that if I set the dtab to direct 100% of the traffic to the local service instance (or the remote service instance), things work just fine. The problem occurs ONLY when the dtab rule is a weighted union.

Looking at the dtab resolution screenshot from the outgoing router, it looks like linkerd should be sending traffic to a weighted union of 70% to 10.97.26.244:4141 and 30% to 10.97.25.51:4141 (I assume; the last line of the delegation is cut off in the screenshot).

The next step in debugging would be to go to the dtab playground of the linkerd on each of those nodes and look at the delegation for /svc/qa5/tags on the “incoming” router. (The router can be selected from a dropdown in the top right).

Hopefully that can shed some light as to what is going wrong.

Oh, looks like the linkerd incoming dtabs don’t resolve well. I suppose the rules need to be specified differently. A different prefix on the incoming router, that doesn’t have the weighted union?

linkerd 01 incoming

linkerd 02 incoming

I think I understand what’s going on with this. But my root problem is still not solved. I’ll make another post about it.

(edit: just remembered that you’re not using k8s. I edited the dtab in the config below to strip the port transformer prefix instead of the k8s daemonset transformer prefix)

What I believe is happening is that at the outgoing router, linkerd is evaluating the dtab, encountering a weighted union, and picking one of the branches to route to. When the request reaches the incoming router on the destination node, it’s evaluating the dtab again and can potentially pick a different branch of the union. If it does, there’s no guarantee that an instance of that service will be running on that node.
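To put rough numbers on that explanation, here is a small Python sketch. It is a simplified model, not a description of linkerd internals: it assumes both routers pick a branch independently for every request, and that each node hosts only one of the two colors. The function names are made up for illustration; only the 7.00/3.00 weights come from the dtab above.

```python
import random

# Weights taken from the dtab rule:
#   /svc/qa5/tags => 7.00*/svc/qa5/tags.blue & 3.00*/svc/qa5/tags.green
WEIGHTS = {"tags.blue": 7.0, "tags.green": 3.0}


def pick_branch(rng):
    """Pick one branch of the union with probability proportional to its weight."""
    names = list(WEIGHTS)
    return rng.choices(names, weights=[WEIGHTS[n] for n in names], k=1)[0]


def mismatch_probability():
    """Chance that two independent picks land on different branches."""
    total = sum(WEIGHTS.values())
    shares = [w / total for w in WEIGHTS.values()]
    # Both picks agree with probability sum(p^2); otherwise they differ.
    return 1.0 - sum(p * p for p in shares)


def simulate(requests, seed=0):
    """Fraction of simulated requests where the two routers disagree."""
    rng = random.Random(seed)
    misses = 0
    for _ in range(requests):
        outgoing = pick_branch(rng)  # choice made by the outgoing router
        incoming = pick_branch(rng)  # independent re-evaluation on the destination node
        if outgoing != incoming:
            misses += 1
    return misses / requests


if __name__ == "__main__":
    print(f"exact mismatch probability: {mismatch_probability():.2f}")
    print(f"simulated over 20000 requests: {simulate(20000):.3f}")
```

Under these assumptions the two picks disagree on 1 - (0.7² + 0.3²) = 0.42 of requests. With a single-branch dtab both routers always agree, which would be consistent with the 100% blue and 100% green dtabs working.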

To get around this problem you can have the incoming router configured to use the same client name that the outgoing router used (instead of evaluating the dtab all over again). You can do this by using the io.l5d.header identifier to read the l5d-dst-client header and changing the dtab to strip off the transformer prefix. You’ll also want to move path consumption to the outgoing router since the path identifier is no longer used on the incoming router. All together, your config would look something like this:

admin:
  port: 9990

telemetry:
- kind: io.l5d.prometheus

routers:
- protocol: http
  label: outgoing
  #streamingEnabled: false
  #dtab: |
  #  /svc=>/#/io.l5d.marathon;
  identifier:
    kind: io.l5d.path
    consume: true  # <-- notice consumption has been moved here
    segments: 2
  interpreter:
    kind: io.l5d.namerd
    experimental: true
    dst: /$/inet/namerd/4100
    namespace: sessionm
    transformers:
    - kind: io.l5d.port
      port: 4141
  servers:
  - port: 4140
    ip: 0.0.0.0
- protocol: http
  label: incoming
  #streamingEnabled: false
  dtab: | # <-- this dtab strips off the transformer prefix
    /svc/%/io.l5d.port/4141 => /;
  identifier:
    kind: io.l5d.header
    header: l5d-dst-client # <-- use the client name picked by the outgoing router
  interpreter:
    kind: default # <-- We're just using the client name from the outgoing router so no need to talk to namerd
    transformers:
    - kind: io.l5d.localhost
  servers:
  - port: 4141
    ip: 0.0.0.0
  client:

Let me know if this helps!

Yes, I figured out what was going on over the weekend too. Thanks.

I’ll try out your solution. I have a few questions on it though.

  1. Where are some of these things documented? E.g. the part about how linkerd puts the client in the header, and what that header looks like. Finagle?

  2. What exactly goes into the client header? What does it look like?

  3. Why does the path need to be consumed on the outgoing router? The incoming router constructs the identifier from a header (not path), right?

  4. One side effect of this solution is that you can’t hit the incoming router directly with a request. Right?

  5. Is the recipe of blue green deploy not used much?

Answers inline:

  1. https://linkerd.io/config/1.1.0/linkerd/index.html#informational-request-headers
  2. The concrete client id. In your case it looks something like: /%/io.l5d.port/4141/#/io.l5d.marathon/qa5/tags.blue
  3. It depends on if the target app expects the path to be stripped or not. The destination linkerd doesn’t need the path to be stripped but the destination app might.
  4. That’s correct.
  5. I can’t really say how much blue-green deploy is used. It’s definitely one of the more advanced of linkerd’s capabilities so it makes sense that it’s not as widely used.
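For anyone following along, here is a rough Python sketch of what the incoming router's dtab rule does to the client id from answer 2. It assumes the header identifier prepends /svc to the header value (which is what the `/svc/%/io.l5d.port/4141 => /` rule in the suggested config implies), and `rewrite` is a hypothetical helper that only models prefix replacement. It is not linkerd's or Finagle's actual delegation code.

```python
def split(path):
    """Split a path like '/svc/qa5/tags' into its non-empty segments."""
    return [seg for seg in path.split("/") if seg]


def rewrite(name, prefix, dst):
    """Apply a single 'prefix => dst' rule to name.

    Returns the rewritten path, or None when the prefix does not match.
    """
    name_segs, prefix_segs = split(name), split(prefix)
    if name_segs[:len(prefix_segs)] != prefix_segs:
        return None
    # Keep whatever follows the matched prefix and put dst in front of it.
    return "/" + "/".join(split(dst) + name_segs[len(prefix_segs):])


if __name__ == "__main__":
    # Example client id from answer 2, as carried in the l5d-dst-client header.
    header_value = "/%/io.l5d.port/4141/#/io.l5d.marathon/qa5/tags.blue"
    logical_name = "/svc" + header_value  # assumed /svc prefix, see note above
    print(rewrite(logical_name, "/svc/%/io.l5d.port/4141", "/"))
```

In this sketch the result is /#/io.l5d.marathon/qa5/tags.blue, i.e. the marathon name the outgoing router already chose, so the incoming router has no weighted union left to re-evaluate.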