Achieving 10Gbps Write-to-disk Performance (Part 4)
It’s been a busy six months at nPulse Technologies. We achieved a huge milestone this week: the release of HammerHead 3.0, our ultrafast packet capture solution. Below is an htop snapshot of a HammerHead streaming 10Gbps to disk, sustained for 30 minutes without packet loss while tracking, indexing, and exporting flows. Notice the distribution of threads across the processor cores with a CPU load of only 0.27.
With this, I’d like to continue with the next installment of my blog series on high-performance write-to-disk.
Basic Design Rules
In my previous posts, I shared two of the four design rules that we incorporate into our HammerHead products.
#1: Capture – packet acquisition with a Napatech NT20E2
#2: Controller – packet storage with a RAID adapter
Today I’d like to highlight our third design rule:
#3: Cores – packet processing with multi-threaded software
Think In Terms of Queues
Designing a stream-to-disk solution using commodity hardware is essentially designing a system of queues with three major queuing centers. The first stage in the pipeline is packet acquisition. This is handled by a purpose-built adapter like the Napatech NT20E2. The last stage in the pipeline is the storage array, which is managed by an intelligent RAID controller. The middle, and most critical, stage in the system of queues is the packet processing stage, which is serviced by multi-threaded software across multi-cores.
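To make the queueing idea concrete, here is a minimal sketch of the kind of bounded queue that can sit between the acquisition and processing stages: a lock-free single-producer/single-consumer ring. The names and sizes are illustrative assumptions, not the HammerHead implementation.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative bounded queue: one capture (producer) thread hands packet
 * descriptors to one processing (consumer) thread without locks.
 * QUEUE_SIZE must be a power of two so the index wrap is a cheap mask. */
#define QUEUE_SIZE 1024

struct pkt_queue {
    void *slots[QUEUE_SIZE];
    _Atomic size_t head;   /* next slot the producer writes */
    _Atomic size_t tail;   /* next slot the consumer reads  */
};

/* Producer side: returns false (backpressure) when the queue is full. */
static bool queue_push(struct pkt_queue *q, void *pkt)
{
    size_t head = atomic_load_explicit(&q->head, memory_order_relaxed);
    size_t tail = atomic_load_explicit(&q->tail, memory_order_acquire);
    if (head - tail == QUEUE_SIZE)
        return false;                       /* full: caller must slow down */
    q->slots[head & (QUEUE_SIZE - 1)] = pkt;
    atomic_store_explicit(&q->head, head + 1, memory_order_release);
    return true;
}

/* Consumer side: returns NULL when the queue is empty. */
static void *queue_pop(struct pkt_queue *q)
{
    size_t tail = atomic_load_explicit(&q->tail, memory_order_relaxed);
    size_t head = atomic_load_explicit(&q->head, memory_order_acquire);
    if (tail == head)
        return NULL;                        /* empty */
    void *pkt = q->slots[tail & (QUEUE_SIZE - 1)];
    atomic_store_explicit(&q->tail, tail + 1, memory_order_release);
    return pkt;
}
```

A failed `queue_push` is exactly the backpressure signal the acquisition stage must be designed to absorb — on a capture card that usually means on-board buffering, and eventually drops.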
The purpose of the packet processing stage is to achieve high Input/Output Operations Per Second (IOPS) without introducing backpressure to the packet acquisition queue or overwhelming the packet storage queue. For 10Gbps write-to-disk, this means processing up to 14.88 million packets per second — the worst case of minimum-size 64-byte Ethernet frames. That translates to an average service-time budget of approximately 67 nanoseconds per packet.
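The budget arithmetic is worth making explicit: one second divided by the packet rate gives the per-packet allowance. A trivial helper (illustrative, not from our code base):

```c
/* Per-packet time budget in nanoseconds at a given sustained packet rate. */
static double ns_per_packet(double packets_per_sec)
{
    return 1e9 / packets_per_sec;
}
```

At 14.88 million packets per second, `ns_per_packet(14.88e6)` works out to roughly 67 ns — only a few hundred CPU cycles per packet on a 3GHz core.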
A lot of factors come into play to achieve packet processing rates this high. However, based on our experience, the key determinants are 1) exploiting processor cores and 2) leveraging memory cache. Both can be achieved through well-designed and properly implemented multi-threaded software.
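One practical way to exploit both cores and cache is to pin each worker thread to its own core, so a thread's packet buffers stay resident in that core's private cache instead of bouncing between cores. A minimal Linux sketch using `pthread_setaffinity_np` — an illustrative fragment, not our production code:

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Pin the calling thread to a single core so its working set stays warm
 * in that core's cache (Linux-specific; returns 0 on success). */
static int pin_to_core(int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}
```

Each pipeline stage thread would call `pin_to_core()` with its assigned core at startup, before touching any packet data.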
Compare The Difference
To underscore the performance difference between an application limited to a single core and one that uses multiple cores, I will compare the performance of two open-source packet capture applications, Daemon Logger and Gulp. Both were tested on a HammerHead 320 platform. Using dstat for measurement, I focused on two metrics: write throughput and CPU load. Write throughput is an indicator of software effectiveness, while CPU load is a good indicator of software efficiency.
The first 30-minute test was with Daemon Logger. Below is an htop snapshot of Daemon Logger under a load of 10Gbps of traffic. As you can tell, it is a CPU-bound, single-threaded application.
The second 30-minute test was with Gulp. Below is a snapshot of how Gulp performed with 10Gbps of traffic. As you can tell, Gulp is a two-threaded application using a simple producer-consumer model.
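A producer-consumer capture pipeline in that style can be sketched in a few dozen lines: one "reader" thread deposits captured blocks into a shared ring buffer while a "writer" thread drains them to disk, decoupling capture timing from disk latency. This is an illustrative sketch, not Gulp's actual code; the ring size and the use of plain `int` block handles are assumptions.

```c
#include <pthread.h>
#include <string.h>

#define RING_SLOTS 8

/* Shared ring buffer between the capture (producer) and disk-writer
 * (consumer) threads, guarded by a mutex and two condition variables. */
struct ring {
    int slots[RING_SLOTS];
    int count, head, tail;
    pthread_mutex_t lock;
    pthread_cond_t not_full, not_empty;
};

static void ring_init(struct ring *r)
{
    memset(r, 0, sizeof(*r));
    pthread_mutex_init(&r->lock, NULL);
    pthread_cond_init(&r->not_full, NULL);
    pthread_cond_init(&r->not_empty, NULL);
}

/* Producer: blocks when the writer has fallen behind (backpressure). */
static void ring_put(struct ring *r, int block)
{
    pthread_mutex_lock(&r->lock);
    while (r->count == RING_SLOTS)
        pthread_cond_wait(&r->not_full, &r->lock);
    r->slots[r->head] = block;
    r->head = (r->head + 1) % RING_SLOTS;
    r->count++;
    pthread_cond_signal(&r->not_empty);
    pthread_mutex_unlock(&r->lock);
}

/* Consumer: blocks until the capture thread delivers data. */
static int ring_get(struct ring *r)
{
    pthread_mutex_lock(&r->lock);
    while (r->count == 0)
        pthread_cond_wait(&r->not_empty, &r->lock);
    int block = r->slots[r->tail];
    r->tail = (r->tail + 1) % RING_SLOTS;
    r->count--;
    pthread_cond_signal(&r->not_full);
    pthread_mutex_unlock(&r->lock);
    return block;
}
```

Note that the producer blocking in `ring_put` is the very backpressure the acquisition stage must absorb; a larger ring buys more time to ride out disk latency spikes.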
The performance difference between the two applications is striking.
Neither was able to capture 10Gbps at a sustained rate without dropping packets. Daemon Logger, because of its single-thread limitation, achieved an average throughput of 154 MB/s. On the other hand, Gulp, utilizing two cores, achieved an average throughput of 854 MB/s. That’s almost 6 times the write throughput using a simple multi-threaded model.
In terms of average CPU load, Daemon Logger’s 2.10 was almost twice Gulp’s 1.2.
What Is The Lesson Here?
The days of a “free lunch” enjoyed as a result of Moore’s Law are a thing of the past. Hence, single-threaded applications are legacy applications. Network monitoring applications capable of handling 10Gbps or more are intensely CPU-bound; therefore, performance gains will come from employing multi-cores with multi-threaded software.
Designing multi-threaded software capable of processing tens of millions of packets per second on commodity hardware is not easy. It is as much an art as it is a science. A lot is predicated on hardware choice, threading model, and software implementation. If you are looking for inspiration for your design, a good place to start is understanding queueing theory and analysis — particularly Little’s Law and Denning’s operational analysis.
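Little’s Law (L = λW) is directly useful here: the average number of packets resident in the system equals the arrival rate times the average time each packet spends in it. A one-line helper, with illustrative numbers of my own choosing:

```c
/* Little's Law: average occupancy L = lambda (arrival rate, packets/sec)
 * multiplied by W (average residence time per packet, seconds). */
static double littles_law_occupancy(double lambda_pps, double residence_sec)
{
    return lambda_pps * residence_sec;
}
```

At 14.88 million packets per second, a packet that spends an assumed 500 ns in the pipeline implies about 7.4 packets in flight on average; stretch W to a 1 ms disk-latency stall and the queues must absorb nearly 15,000 packets. That is how you size your buffers.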
In terms of reference examples, Gulp is a step in the right direction. For a current example, take a look at Suricata, which was designed from the ground up to be multi-threaded; and, looking ahead, keep an eye on the multi-threaded version of Bro.