Benchmarks for HTTP web servers in different languages

Hi,

I am testing out service meshes, in terms of features and performance cost.
I started with the easiest one to deploy out there: Linkerd v2 :).

I opened a repo with multiple scenarios and benchmarks: https://github.com/vincentserpoul/meshes

My problem is that, from the benchmarks I ran (I ran them multiple times), the performance cost seems far from negligible:

(see the Go, Rust and Spring Boot results).

The web server is reduced to a minimum: one handler that waits 26ms and writes a string as a response.
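Roughly, the Go version boils down to something like this (a simplified sketch, not the exact code from the repo; the port and response string are illustrative):

package main

import (
    "net/http"
    "time"
)

func main() {
    http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
        // Simulate ~26ms of work, then return a short string.
        time.Sleep(26 * time.Millisecond)
        w.Write([]byte("hello from the bench webserver"))
    })
    http.ListenAndServe(":8080", nil)
}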

The benchmarks are launched from within the cluster, using rakyll/hey.
You can run them yourself (on a cluster set up locally with k3d or on DigitalOcean) with cluster.sh.

The bench scripts are located in the scenarios folder.
The web servers are located in the application folder.

Am I missing anything here? Some config? Is the server response too fast?

Hey @vise, thanks for this info. It’s really interesting.

I’d love to see it within the context of the other service meshes that you plan to test. There was a larger scale test run by Kinvolk last year that has some interesting detail when comparing Linkerd with another service mesh.

Can you share more info about your testing methodology? As you might expect, there is always going to be some latency added with any service mesh. I suspect the additional time comes from encrypting/decrypting the traffic between the services. That’s a simple tradeoff of security for performance.

That being said, can you share the details of the performance metrics from the Grafana and Prometheus dashboards?

That would be really helpful in understanding the data.

For the testing methodology:
I run 1M requests with 500 concurrent workers from one “bench pod” to one of the web servers under test (the web server runs with 6 replicas on a 10-node k8s cluster).
I respawn the cluster every time: I start by preheating the service, launch the bench without Linkerd, then install Linkerd on the cluster, inject the proxies, preheat the service again, and relaunch the bench.

As for expectations, I definitely agree that security and functionality have a cost.

Which metrics would you be interested in exactly?

This sounds like a case where distributed tracing will help to understand the timings. You will have to instrument the web server logic with OpenCensus telemetry in order to get the span info from the web server.

The docs will be helpful in setting up the environment.
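For anyone following along, here is a rough sketch of what that instrumentation can look like in Go with OpenCensus and a Jaeger exporter (the service name, collector endpoint, and port are placeholders, not the exact values from the repo):

package main

import (
    "log"
    "net/http"
    "time"

    "contrib.go.opencensus.io/exporter/jaeger"
    "go.opencensus.io/plugin/ochttp"
    "go.opencensus.io/plugin/ochttp/propagation/b3"
    "go.opencensus.io/trace"
)

func main() {
    // Export spans to a Jaeger collector (endpoint is a placeholder).
    exporter, err := jaeger.NewExporter(jaeger.Options{
        CollectorEndpoint: "http://jaeger-collector:14268/api/traces",
        Process:           jaeger.Process{ServiceName: "go-trace-webserver"},
    })
    if err != nil {
        log.Fatal(err)
    }
    trace.RegisterExporter(exporter)
    trace.ApplyConfig(trace.Config{DefaultSampler: trace.AlwaysSample()})

    mux := http.NewServeMux()
    mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
        time.Sleep(26 * time.Millisecond)
        w.Write([]byte("hello"))
    })

    // Wrap the mux so every request gets a server span; B3 propagation is the
    // header format Linkerd's tracing support expects.
    handler := &ochttp.Handler{Handler: mux, Propagation: &b3.HTTPFormat{}}
    log.Fatal(http.ListenAndServe(":8080", handler))
}

(Sampling every request like this is only sensible for a benchmark; in production you would normally use a probability sampler.)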

Another interesting thing to try would be to run a similar test using one of the sample applications: emojivoto or booksapp.

After implementing tracing (using OpenCensus and Jaeger, same as emojivoto), here is what I get:

Thanks @vise, that’s interesting output.

I’m looking at your repository now to see what I’d have to do to replicate this. Are there any additional instructions that I need in order to run this test?

A few questions that I have about the test:

  • If you reduce the number of requests, does that make a difference in the overall output?
  • Are the response times consistently at 60+ms, or does that only happen at the end of the test? To ask another way, what are the response times at the beginning of the test?
  • What is the output from linkerd top --namespace linkerd deploy/linkerd-prometheus? It will be helpful to get the output at the start and end of the test.
  • Similarly, what is the output from kubectl top po -n linkerd --containers=true at the beginning and end of the test?

Hi @cpretzer,

I just updated my repo to make it more “user friendly”.

  1. Get a cluster (any cluster will do, you can use k3d locally and run ./cluster.sh)
  2. For the distributed tracing bench, simply run: ./scenarios/distributed-tracing/bench.sh

It’s quite straightforward, so I think it should be easily reproducible.

Thanks for the update @vise. I’ll work on setting it up.

@vise I’m still working on getting this set up. Will have an update for you this week.


I refactored a bit and started adding Istio (not an apples-to-apples comparison for now).

Thanks @vise! Still working to get this set up and will update you when I do.

Charles

@vise I spent about an hour trying to get these benchmarks running, and there are errors during the Istio tests:

Error: unknown command "manifest" for "istioctl"
Run 'istioctl --help' for usage.
error: no objects passed to apply

I’m using Istio 1.3.6. Which version are you using?

I ran into a few other requirements issues that aren’t mentioned in the docs, like the Helm install for jaegertracing/jaeger-operator.

Is there a way to run only the Linkerd tests?

Once you’ve got the scripts in a state where I can run them, I’ll give it another go.

@vise

I commented out the Istio runs just to get the Linkerd metrics, and found similar results for 1000 requests at concurrency 100.

The interesting thing is that when I scaled the number of requests to 100000 and concurrency to 1000, the histograms started to look similar:

Although the average times and RPS are better when there is no service mesh, the overall distributions take the same shape. We can expect some additional latency from encrypting and decrypting the traffic in the proxies, because we’re talking about the difference between handling plain HTTP traffic and HTTPS traffic. I also ran a third test where Linkerd is injected but mTLS is disabled.

I’m not a performance testing expert, so I’d be interested in your thoughts. One final thing to consider is that these tests were all run using k3d on my laptop with other applications running, so the results are speculative at best.

Linkerd Injected and mTLS enabled


Summary:
  Total:	39.8219 secs
  Slowest:	2.0120 secs
  Fastest:	0.0287 secs
  Average:	0.3894 secs
  Requests/sec:	2511.1782
  
  Total data:	1798938 bytes
  Size/request:	17 bytes

Response time histogram:
  0.029 [1]	|
  0.227 [14294]	|■■■■■■■■■■
  0.425 [55956]	|■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  0.624 [21287]	|■■■■■■■■■■■■■■■
  0.822 [4595]	|■■■
  1.020 [1851]	|■
  1.219 [862]	|■
  1.417 [510]	|
  1.615 [402]	|
  1.814 [76]	|
  2.012 [166]	|


Latency distribution:
  10% in 0.2082 secs
  25% in 0.2648 secs
  50% in 0.3504 secs
  75% in 0.4481 secs
  90% in 0.5936 secs
  95% in 0.7646 secs
  99% in 1.2746 secs

Details (average, fastest, slowest):
  DNS+dialup:	0.0026 secs, 0.0287 secs, 2.0120 secs
  DNS-lookup:	0.0017 secs, 0.0000 secs, 0.4657 secs
  req write:	0.0002 secs, 0.0000 secs, 0.2382 secs
  resp wait:	0.3364 secs, 0.0283 secs, 1.9005 secs
  resp read:	0.0276 secs, 0.0000 secs, 0.6408 secs

Status code distribution:
  [200]	99941 responses
  [503]	59 responses

Test without Linkerd

/ $ hey -n 100000 -c 1000 http://go-trace-webserver

Summary:
  Total:	21.8663 secs
  Slowest:	1.0972 secs
  Fastest:	0.0263 secs
  Average:	0.2113 secs
  Requests/sec:	4573.2498
  
  Total data:	1800000 bytes
  Size/request:	18 bytes

Response time histogram:
  0.026 [1]	|
  0.133 [22410]	|■■■■■■■■■■■■■■■■■■■■
  0.240 [45038]	|■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  0.348 [25736]	|■■■■■■■■■■■■■■■■■■■■■■■
  0.455 [4207]	|■■■■
  0.562 [1108]	|■
  0.669 [702]	|■
  0.776 [522]	|
  0.883 [238]	|
  0.990 [17]	|
  1.097 [21]	|


Latency distribution:
  10% in 0.1114 secs
  25% in 0.1378 secs
  50% in 0.2109 secs
  75% in 0.2561 secs
  90% in 0.3205 secs
  95% in 0.3788 secs
  99% in 0.6203 secs

Details (average, fastest, slowest):
  DNS+dialup:	0.0009 secs, 0.0263 secs, 1.0972 secs
  DNS-lookup:	0.0009 secs, 0.0000 secs, 0.1380 secs
  req write:	0.0008 secs, 0.0000 secs, 0.3434 secs
  resp wait:	0.0324 secs, 0.0262 secs, 0.2115 secs
  resp read:	0.0980 secs, 0.0000 secs, 0.8039 secs

Status code distribution:
  [200]	100000 responses

Linkerd Injected and mTLS disabled

/ $ hey -n 100000 -c 1000 http://go-trace-webserver.linkerd-go-trace-webserver 

Summary:
  Total:	31.0574 secs
  Slowest:	1.0621 secs
  Fastest:	0.0286 secs
  Average:	0.2982 secs
  Requests/sec:	3219.8460
  
  Total data:	1800000 bytes
  Size/request:	18 bytes

Response time histogram:
  0.029 [1]	|
  0.132 [3253]	|■■■■
  0.235 [33961]	|■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  0.339 [32098]	|■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  0.442 [17837]	|■■■■■■■■■■■■■■■■■■■■■
  0.545 [8152]	|■■■■■■■■■■
  0.649 [2888]	|■■■
  0.752 [1032]	|■
  0.855 [605]	|■
  0.959 [100]	|
  1.062 [73]	|


Latency distribution:
  10% in 0.1686 secs
  25% in 0.2082 secs
  50% in 0.2691 secs
  75% in 0.3646 secs
  90% in 0.4664 secs
  95% in 0.5369 secs
  99% in 0.7305 secs

Details (average, fastest, slowest):
  DNS+dialup:	0.0008 secs, 0.0286 secs, 1.0621 secs
  DNS-lookup:	0.0017 secs, 0.0000 secs, 0.2843 secs
  req write:	0.0005 secs, 0.0000 secs, 0.1633 secs
  resp wait:	0.2412 secs, 0.0280 secs, 0.9021 secs
  resp read:	0.0296 secs, 0.0000 secs, 0.4338 secs

Status code distribution:
  [200]	100000 responses

Hi Charles,

First of all, thanks for taking the time to run this. I know it’s not easy to dive into someone else’s code, especially as you were the first one to do so.
I took your remarks into account, improved the docs, and fixed the Jaeger install.

I agree with you: there is expected latency when mTLS is added, and the benefit/cost ratio is pretty clear; there is no free lunch.
I did most of my tests locally while iterating on the code, but I’m going to run the bench again on DigitalOcean, and I will add the Linkerd-without-mTLS case as well.

A quick question @cpretzer: what is the best way to disable mTLS?

@vise, there are a couple of ways:

Run linkerd install --disable-identity --disable-tap when you install Linkerd,

or add annotations to the namespace, which is how I tested:
kubectl annotate ns <your namespace> config.linkerd.io/disable-tap=true config.linkerd.io/disable-identity=true
