What happens with failed linkerd requests?

Hi,

Setup:
We are using linkerd with Consul for service discovery, together with some gRPC services (.NET Core).

Story:
I accidentally ran a load test with a bad service name. The linkerd address was correct, but the service name was not.

When I look at the docker logs, I see this message repeated over and over:

linkerd_1  | I 0216 14:36:44.672 UTC THREAD29: no available endpoints
linkerd_1  | com.twitter.finagle.NoBrokersAvailableException: No hosts are available for /svc/xxx-grpcservice, Dtab.base=[/svc=>/#/io.l5d.consul/dc1], Dtab.local=[]. Remote Info: Not Available

which is fine, but the

Issue
is that when I run docker stats for the container, I see that memory consumption goes up.
I thought the logs were piling up, so I disabled docker logging, but the same thing continued to happen.

My question is: are these requests stored somewhere, and if they are, is there a way to purge them? This is an edge-case scenario, but I’d like to understand what’s happening.

Note: I cannot confirm this yet, but after I stop the test the memory does not seem to get freed. It could also be that it just needs more time; I waited around 10 minutes and it hadn’t gone down at all.
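
If it helps to correlate memory with the test, something like the following sketch could be used to poll linkerd’s heap via the admin endpoint. It is my own rough illustration, assuming the admin port 7998 from the config below is reachable on localhost and that the metrics JSON exposes the usual jvm/heap/used counter (adjust the names if yours differ):

import json
import time
import urllib.request

# Poll linkerd's twitter-server admin endpoint and print heap usage plus any
# retry-related counters, so memory growth can be lined up with the load test.
ADMIN_URL = "http://localhost:7998/admin/metrics.json"  # admin port from the config below

while True:
    with urllib.request.urlopen(ADMIN_URL) as resp:
        metrics = json.loads(resp.read())

    heap_used = metrics.get("jvm/heap/used")  # assumed metric name
    retries = {k: v for k, v in metrics.items() if "retries" in k}

    print(f"heap/used={heap_used} retries={retries}")
    time.sleep(10)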

The linkerd metrics JSON can be viewed here.

The linkerd configuration can be found below:

admin:
  ip: 0.0.0.0
  port: 7998

namers:
- kind: io.l5d.consul
  includeTag: false
  useHealthCheck: true
  healthStatuses: 
    - passing
  host: 10.0.75.1 
  port: 8500

routers:
- protocol: h2 
  label: /grpc-consul
  experimental: true
  service:
    responseClassifier:
      kind: io.l5d.h2.grpc.neverRetryable  
    retries:
      budget:
        minRetriesPerSec: 0
        percentCanRetry: 0
  identifier:
    kind: io.l5d.header.path
    header: uri
    segments: 1
  dtab: |
    /svc => /#/io.l5d.consul/dc1;
  servers:
  - port: 7999
    ip: 0.0.0.0
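
For illustration, here is a rough sketch (mine, not our actual client code) of how a gRPC call could flow through this config. EchoService and the echo_pb2 modules are hypothetical placeholders, and localhost:7999 assumes the server port is published. The point is that the io.l5d.header.path identifier takes the first segment of the uri header, so a typo there produces exactly the /svc/xxx-grpcservice name that fails in the logs above:

import grpc

# Hypothetical generated stubs -- placeholders for the real proto contracts.
from echo_pb2 import EchoRequest
from echo_pb2_grpc import EchoServiceStub

# Talk to linkerd's h2/gRPC server port (7999 in the config above).
channel = grpc.insecure_channel("localhost:7999")
stub = EchoServiceStub(channel)

# linkerd reads the "uri" header, takes its first path segment
# ("xxx-grpcservice"), and builds the name /svc/xxx-grpcservice, which the dtab
# rewrites to /#/io.l5d.consul/dc1/xxx-grpcservice. A typo in that segment means
# no Consul service matches, and every request ends in NoBrokersAvailableException.
response = stub.Echo(
    EchoRequest(message="ping"),
    metadata=(("uri", "/xxx-grpcservice/echo"),),
)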

Hey @MirzaMerdovic – thanks for reporting. I wouldn’t expect linkerd’s memory consumption to increase when it fails to find a service in Consul. It seems like you somehow managed to trigger a memory leak, possibly due to infinite retries looking up the service, but that’s just a guess. Possibly this is related to:

Which is still under investigation. From looking at the stats you pasted, I don’t see anything in particular that would explain this.

If you’re able to reproduce the memory leak reliably, it would be really helpful if you could open an issue on Github, with all of the steps to reproduce. That would make tracking down the leak a lot easier. Thanks!

Hi @klingerf,
I am able to reproduce the issue, and I will open a GitHub issue with details on how to reproduce it during the next week.

Note: When I run the load test with the correct service name, memory consumption is fine.

Thanks!

Hi @klingerf,
The issue has been created. Please let me know if you need any more details or if we can help in any way.
Thanks!

Thanks @MirzaMerdovic! We’ll take a look.
