As part of an ongoing study into DNS server performance, we wanted to establish a baseline figure for the absolute maximum throughput that can be achieved using standard APIs. To this end we have developed a tiny DNS server that does nothing except echo the received packet back to the client, albeit with the “QR” bit flipped to indicate that the packet is a response and not a query. The DNS echo server allows the user to specify how many times it should fork into separate processes, and how many threads (if any) to start within each process. It also supports a number of I/O models, including standard “blocking” I/O and non-blocking I/O using explicit poll(2) or select(2) calls, or via libevent. For multi-core systems it also supports locking processes or threads to a specific CPU core.
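To make the blocking I/O variant concrete, the sketch below shows roughly what such an echo loop looks like in C. It is our illustration rather than the actual test program: the process count, buffer size and port are placeholder assumptions, and the real server additionally supports threads, CPU pinning and the poll(2)/select(2)/libevent models described above.

    /*
     * Minimal sketch of a blocking-I/O UDP "DNS echo" responder: each packet
     * is sent straight back to its sender with only the QR bit set, so there
     * is no parsing or lookup cost involved.  Illustrative only.
     */
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/types.h>
    #include <unistd.h>

    #define NPROCS 4                /* forked worker processes - illustrative */

    int main(void)
    {
        int fd = socket(AF_INET, SOCK_DGRAM, 0);
        if (fd < 0) { perror("socket"); return 1; }

        struct sockaddr_in addr;
        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port = htons(53);  /* needs root; use a high port otherwise */

        if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
            perror("bind");
            return 1;
        }

        /* fork the remaining workers; all processes share the same socket */
        for (int i = 1; i < NPROCS; i++)
            if (fork() == 0)
                break;

        unsigned char buf[512];
        for (;;) {
            struct sockaddr_in src;
            socklen_t srclen = sizeof(src);
            ssize_t len = recvfrom(fd, buf, sizeof(buf), 0,
                                   (struct sockaddr *)&src, &srclen);
            if (len < 12)           /* shorter than a DNS header: ignore */
                continue;
            buf[2] |= 0x80;         /* set the QR bit: mark it as a response */
            sendto(fd, buf, (size_t)len, 0, (struct sockaddr *)&src, srclen);
        }
    }

In this sketch the workers simply inherit one shared socket across fork(), and the kernel decides which process is woken for each incoming packet.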
We have a test rig comprising a server machine with dual Intel Xeon X5355 quad-core processors running at 2.66 GHz and a client machine with dual Intel Xeon X5365 quad-core processors running at 3.0 GHz. The machines are connected via a 10 Gbps switch using Intel 82599ES-based Network Interface Cards. Both machines are running Fedora Server 21 with kernel 3.19.7-200, and the query generator is Nominum’s dnsperf 2.0.0 as packaged by Fedora, using the included sample query file.

The following graph shows the mean, minimum, and maximum throughput obtained for different numbers of server processes running with the blocking I/O model. Each data point is the result of ten 30-second runs of dnsperf using its default command line parameters. A bare minimum of tuning was done on the UDP read/write buffers on both client and server by having the corresponding sysctl variables both set to a maximum and default value of 16 MB each. In general the throughput is relatively constant on all three measures, although an explanation for the sudden fall-off when running 7 or 8 server processes in parallel is not immediately obvious. The otherwise flat throughput figure suggests that the client itself is limiting the performance. It should also be noted that the variability in the results is quite significant - the results are by no means deterministic.

In order to eliminate client-side effects from the testing a variety of dnsperf settings were tried, eventually settling empirically on apparently optimal settings of -c 8 to make dnsperf act as multiple clients, and -q 500 to allow for 500 maximum outstanding requests. It was also determined that using taskset to restrict dnsperf to only one of the two client CPUs was more optimal. The reason for the latter remains unclear.

With these settings, for two or three concurrent server processes both the average and maximum throughput are substantially increased (the latter by 45%, to 348 kqps from a previous high of 239 kqps). The variability, however, is for some tests even greater than before. It should also be noted that the increase in the maximum throughput further reinforces the theory that the untuned dnsperf parameters used to generate the first graph were themselves limiting the maximum throughput of the system as a whole.

Explaining the increased variability requires looking further into the network card architecture. The 82599ES NICs automatically configure themselves with eight separate TxRx queues, each with its own interrupt (or “IRQ”) number. Under Linux the association of each TxRx queue with CPU cores is handled automatically by the irqbalance service, which by default assigns each TxRx queue in sequence to one CPU core. The number of IRQs that have been handled for each combination of TxRx queue and CPU core can be examined in /proc/interrupts. As each 30-second run of dnsperf starts the counters start incrementing, but at most only 3 individual queues were observed getting involved in handling the network traffic, and therefore only 3 CPU cores. During each dnsperf run the queue assignments vary, as do the CPU cores allocated to the DNS echo server itself. As a result, just occasionally a highly optimal (or highly suboptimal) combination of queues and cores arises.

The lower performance when 4+ server processes are in use requires further investigation, as does the apparent limit of only 3 queues being used. The 82599ES cards do permit finer-grained “per flow” balancing of incoming traffic across the TxRx queues, and that may be looked at in a later study.
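As a footnote on methodology, the client-side tuning and measurements described above can be reproduced with commands along the following lines. The server address, query file name, core numbering and the eth<N>-TxRx-<queue> IRQ naming (the ixgbe driver’s usual convention) are illustrative assumptions rather than the exact values from our rig.

    # socket buffer tuning on both machines (standard Linux sysctls)
    sysctl -w net.core.rmem_max=16777216 net.core.rmem_default=16777216
    sysctl -w net.core.wmem_max=16777216 net.core.wmem_default=16777216

    # one 30-second run, pinned to the cores of a single client CPU package
    taskset -c 0-3 dnsperf -s 10.0.0.1 -d queries.txt -l 30 -c 8 -q 500

    # watch the per-queue interrupt counts accumulate during a run
    watch -d -n 1 'grep TxRx /proc/interrupts'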