Debugging network performance


Because linkerd is a network proxy, linkerd performance issues are often
network performance issues hiding under the surface. To help you debug these
issues, we’ve come up with a series of notes covering common performance
problems we’ve seen.

Backend latencies

linkerd will shift traffic toward faster backends, but it can't do that if the
pool is too small or all of the backend hosts are similarly latent. Note that
during testing it's common to set the backend pool to only 1 or 2 hosts;
linkerd will not be able to perform optimally with such a small pool of
backends to load balance over.

Look at the loadbalancer stats in metrics.json to see how large the pool is
and whether balancing is happening appropriately.
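
For example, assuming linkerd's admin interface is listening on its default
port (9990) and that jq is installed, a quick way to pull just the load
balancer stats is:

curl -s http://localhost:9990/admin/metrics.json | jq 'with_entries(select(.key | contains("loadbalancer")))'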

Other metrics to look at in metrics.json are connects,
connections, and request_latency to your backend hosts.

connects is the number of connect attempts linkerd has made to a pool.

connections is the number of open connections to a pool.

request_latency contains a histogram breakdown of latency to your
pool. Combined with loadbalancer/size and connections, it should
help you diagnose whether your backend has enough concurrency or
responsiveness for a given amount of traffic.
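
To watch these counters change under load, a loop in the same spirit as the
SYN-drop check below works; this again assumes linkerd's default admin port
of 9990 and that jq is installed:

while true; do curl -s http://localhost:9990/admin/metrics.json | jq 'with_entries(select(.key | test("connects$|connections$")))'; sleep 5; done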

Dropped packets

Dropped packets are indicative of a congested network. Network congestion can
cause p999 latency to be periodically higher than expected. From linkerd’s
perspective, latency due to network congestion is indistinguishable from
latency introduced by a slow backend.

ethtool -S eth0 reports NIC statistics, including dropped packets (note that this doesn't work on VMs in GCE).

/sbin/ifconfig eth0 also reports dropped packets.

A symptom of this is p999 latency that's a multiple of 200ms, as
200ms is typically the operating system's timeout before it retransmits
a lost packet.
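
To watch drop counters accumulate in real time, something like the following
works (assuming an eth0 interface):

watch -n 1 "ethtool -S eth0 | grep -i drop"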

Dropped SYN packets

Linux maintains two queues for each listening TCP socket: an accept queue of
established connections waiting to be picked up by the application, and a
queue for incoming SYNs, the packet signaling the beginning of a TCP connection.

How to spot dropped SYNs:

while true; do netstat -s | grep TCPBacklogDrop; sleep 5; done

If the number is growing, you're dropping SYN packets because the SYN queue is overflowing.
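
If the SYN queue is overflowing, one common mitigation is to raise the
kernel's queue limits. This is only a sketch: tune the values to your
workload, and remember that the backlog an application passes to listen()
also caps the accept queue.

sysctl -w net.ipv4.tcp_max_syn_backlog=4096 (typical default is 1024)

sysctl -w net.core.somaxconn=4096 (typical default is 128)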

Socket types

We recommend looking at the state of sockets on hosts connecting to linkerd.

watch -n 1 ss -tan 'sport = :4140'

Look for lots of sockets in the TIME_WAIT state. Many TIME_WAIT sockets
combined with dropped SYN packets suggest that connection pooling is not
being used, or that systems connecting to linkerd are reusing connections
for only a very small number of requests.
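
For a quick count of TIME_WAIT sockets on linkerd's default port (the tail
strips ss's header row):

ss -tn state time-wait '( sport = :4140 )' | tail -n +2 | wc -l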

You can use linkerd's metrics.json to verify whether connections are being
reused: if the connects counter grows nearly as fast as the request count,
each connection is only serving a handful of requests.

Connection pooling often has the biggest effect for the smallest cost and we
always encourage customers to use it with systems that communicate via linkerd.

Interrupts

It’s important to ensure that interrupts are being distributed across cores.

Run cat /proc/interrupts and check that interrupts are being spread across
the CPUs on the system.

If all IRQs are being processed by a single CPU then it’s possible that your
system is misconfigured. Talk to your local system administrator about whether
that behavior is intentional.
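
For example, to see which CPUs are servicing your NIC's interrupts (assuming
an eth0 interface):

grep eth0 /proc/interrupts

If everything lands on one CPU, check whether the irqbalance service is
running. You can also pin an IRQ by hand by writing a CPU mask to its
smp_affinity file; the IRQ number here (42) is a hypothetical example,
pinned to CPU 1:

echo 2 > /proc/irq/42/smp_affinity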

SoftIRQ kernel threads

SoftIRQ kernel threads ([ksoftirqd/0], etc) handle turning network packets into
byte arrays that an application can read. If they are starved for resources,
packets get dropped.

watch -n 1 cat /proc/net/softnet_stat

If there's a lot of movement in the 2nd column (packets dropped because the backlog queue was full) then the backlog max isn't sufficient.

sysctl -w net.core.netdev_max_backlog=2048 (typical default is 1000)

If there's a lot of movement in the 3rd column (times the SoftIRQ handler
ran out of budget before finishing) then your kernel isn't giving the
SoftIRQ processing kernel threads enough time to run. You can increase the
budget using:

sysctl -w net.core.netdev_budget=600 (typical default is 300)
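
Note that the softnet_stat counters are hexadecimal, with one row per CPU. A
small sketch (assuming GNU awk, whose strtonum understands the 0x prefix) to
print the dropped and squeezed columns in decimal:

awk '{ printf "cpu%d dropped=%d squeezed=%d\n", NR-1, strtonum("0x" $2), strtonum("0x" $3) }' /proc/net/softnet_stat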

Apache Benchmark (AB)

Users have reported seeing lots of abandoned sockets and high connect counts
when benchmarking with ab. This may be due to ab not closing its connections
properly. If you encounter this, try another benchmarking tool such as
slow_cooker or wrk.
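
For example, a comparable wrk run against linkerd's default HTTP port might
look like this; the thread, connection, and duration values are illustrative:

wrk -t4 -c64 -d30s http://localhost:4140/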

Additional tools

If you're using RHEL or CentOS, try running Red Hat's fantastic
xsos tool to gain some insight into your
running system. We can help you interpret the output.

