Hitting Network Limits

I recently wrote a high-level blog post for NGINX outlining the new SO_REUSEPORT feature in NGINX. One of the problems I was having whilst doing the benchmarks for this was that I was hitting some kind of bottleneck before hitting the limit of what NGINX could process. This meant that it was difficult to get useful benchmarks.

The main reason was down to hitting a limit on the maximum number of interrupts per second that could be processed. Despite the evidence staring me in the face it didn't click with me that all the interrupts were being processed on just one CPU.

3     root      20   0       0      0      0 R  95.4  0.0   0:10.28 ksoftirqd/0
21076 nobody    20   0   57208  31140   2272 S  26.5  1.0   9:41.90 nginx
21075 nobody    20   0   57164  31200   2376 S  26.2  1.0   9:47.15 nginx
21078 nobody    20   0   57236  31272   2376 S  25.8  1.0   9:40.71 nginx
21077 nobody    20   0   57184  31112   2268 R  25.2  1.0   9:40.94 nginx

This is an example on my home test rig. The server is based on a Core2Quad CPU, nothing special but it helps pin bottlenecks a lot of the time. NGINX in this example has been configured with SO_REUSEPORT enabled and we are returning just a very small packet response. The thing to notice here is that each NGINX worker is only using around 25-26% CPU (theoretically this machine has 400% CPU possible), and 95% CPU is being used for ksoftirqd. This last part is the key, it basically means that CPU 0 is being used up nearly 100% dealing with network interrupts, hitting a limit of what the machine can process. Whilst I very quickly figured that network interrupts were the bottleneck the above didn't click with me at the time to be the issue.

It turns out in Linux there is a way to spread these interrupts across all CPUs, I stumbled upon it by accident earlier today (and am kicking myself that I wasn't aware of it before). It is called Receive Packet Steering (RPS). RedHat's Performance Tuning Guide has a good description of what it does and how to use it. To enable it on a quad-core machine simply do:

$ echo f > /sys/class/net/eth0/queues/rx-0/rps_cpus

Obviously replacing eth0 with the name of your network card. Once enabled I was getting much better results:

21076 nobody    20   0   56140  29988   2272 R  93.8  1.0   9:12.14 nginx
21075 nobody    20   0   56152  30008   2376 R  92.9  1.0   9:17.54 nginx
21077 nobody    20   0   55996  29844   2268 R  92.4  1.0   9:11.67 nginx
21078 nobody    20   0   56160  30196   2376 R  91.7  1.0   9:11.85 nginx

The client machine (which was also configured with this parameter) was reading 2.5x the traffic and nearly half the latency. In addition there is Receive Flow Steering (RFS) which can reduce latency when RPS is used, in my testing the average latency was the same but the standard deviation was reduced.

Network cards that can spread the hardware interrupts and stacks using DPDK or similar probably won't have this issue. But for cloud environments I suspect this option will help a great deal. As always "Your Mileage May Vary".

If you have any tips for increasing performance please feel free to leave them in the comments.

Comments !