Improving Price/Performance with Intelligent Server Adapters (Part 2)
This is the second article in a three-part series that takes an in-depth look at the promise and pitfalls of using general CPUs in host-based networking applications, and it explores the different server networking hardware technologies being used to deliver better price/performance for host-based networking. The first article focuses on the potential issues and the importance of hardware-accelerated host-based networking. This article explores the three fundamental technologies being used in these intelligent server adapters and provides guidance on solutions that deliver best price/performance for host-based networking.
Data center operators deploying or evaluating host-based networking applications face the challenge of cost effectively scaling networks to 10, 25, 40 and 50GbE using COTS-based server platforms. As outlined in our previous article, host-based networking functions such as vSwitches that are implemented purely in software are expensive and inefficient at 10GbE and higher speeds, owing to the sheer number of costly and power-hungry x86 CPU cores consumed by data-plane processing. This article compares host-based networking approaches in the context of price/performance.
The underlying problem with pure software implementations is a fundamental mismatch between the capabilities of the x86 architecture and the demands of packet, tunnel and flow processing, the basic ingredients of host-based networking. This is because x86, and other general-purpose processing architectures that target compute servers, are by design optimized for running complex and relatively long-running applications in an operating-system environment. This approach is great for server applications, but it's not necessarily the best strategy when it comes to implementing host-based networking. To understand these priorities for x86 servers, consider a chip-level view of the Sandy Bridge device shown below.
Figure 1: Intel x86: Optimized for servers, not for host-based networking.
The need to optimize for general compute applications running in a traditional OS environment drives a tremendous amount of complexity into the CPU architecture. The x86 cores are loaded with features such as very long and complex superscalar processing pipelines with speculative execution and branch prediction, large caches, and MMUs to support virtual memory, all resulting in a large die area per core. Massive L3 cache is needed to support very large programs and data sets in external memory, but host-based networking data-path programs and data are relatively small and would not need L3 in a standalone configuration. Likewise, graphics processing and floating-point units in x86 are unneeded for data-path processing. Since these server-class features dramatically drive up the size of the CPU chip while not increasing work output for host-based networking functions, they strongly reduce the price/performance ratio of the overall solution.
Indeed, applying x86 to host-based networking is like trying to haul firewood with a Ferrari: it's not the right tool for the job, and far too expensive to boot. This issue has now been widely recognized, and the natural reaction has been to pursue software optimization. Technologies like the Data Plane Development Kit (DPDK) are intended to improve x86 CPU performance in networking applications by improving cache utilization and eliminating interrupt-processing overhead. While this approach offers a modest improvement by reducing the share of cycles the processor spends idle, it cannot overcome the fundamental architectural limitations described above, so further gains will be quite limited.
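To get a feel for the per-packet budget behind this limitation, the following sketch computes the packet rate at line rate and the CPU cycles a single core can spend on each packet. The 20 bytes of per-frame wire overhead (preamble, start-of-frame delimiter and inter-frame gap) is standard Ethernet; the 3 GHz clock and the choice of 64-byte frames are illustrative assumptions, not figures from this article.

```python
# Standard Ethernet per-frame overhead on the wire:
# 7-byte preamble + 1-byte start-of-frame delimiter + 12-byte inter-frame gap.
WIRE_OVERHEAD_BYTES = 20


def packets_per_second(link_gbps: float, frame_bytes: int) -> float:
    """Maximum frames per second a link can carry at the given frame size."""
    bits_per_frame = (frame_bytes + WIRE_OVERHEAD_BYTES) * 8
    return link_gbps * 1e9 / bits_per_frame


def cycles_per_packet(link_gbps: float, frame_bytes: int, core_ghz: float) -> float:
    """CPU cycles one core can spend per packet while keeping up with line rate."""
    return core_ghz * 1e9 / packets_per_second(link_gbps, frame_bytes)


if __name__ == "__main__":
    CORE_GHZ = 3.0    # assumed clock rate, for illustration only
    FRAME_BYTES = 64  # minimum Ethernet frame size (worst case for packet rate)
    for gbps in (10, 25, 40, 50, 100):
        pps = packets_per_second(gbps, FRAME_BYTES)
        budget = cycles_per_packet(gbps, FRAME_BYTES, CORE_GHZ)
        print(f"{gbps:>3} GbE: {pps / 1e6:7.2f} Mpps, {budget:6.1f} cycles/packet")
```

Under these assumptions, a single core has roughly 200 cycles per minimum-size packet at 10GbE and about 20 at 100GbE, which leaves little room for the lookups, tunnel handling and flow processing that host-based networking requires.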
Alternatives to x86 Software-Based Implementations
The failure of host-based networking software on x86 CPUs to deliver desirable price/performance has motivated the pursuit of alternative solutions capable of scaling cost-effectively to 10, 25, 40, 50 and 100GbE line rates. One such alternative is the intelligent server adapter equipped with a MIPS- or ARM-based multicore system-on-a-chip (SoC). In this model, the SoC implements the host-based networking data path, entirely in software running on the SoC processing cores. Although doing so offloads the server, it does not change the fundamental processing paradigm: the SoC approach suffers from the same fundamental architectural limitations as x86. This is because the architecture of these SoC devices is first optimized for the server market and then repurposed to server adapters, so the same issues listed above for x86 inefficiency apply here as well. Effectively, this approach is just a redistribution of processing resources that does very little to improve the efficiency and price/performance of the overall solution. The shortcoming is evident in the fact that SoC-based server adapters struggle to achieve line rate at all but the largest packet sizes in current implementations at 20Gbps.
Another often-discussed alternative is also a familiar one: the use of field-programmable gate arrays (FPGAs). Recent papers have described the benefits of FPGAs for accelerating specific algorithms related to web search in the data center [1]. There have even been suggestions to apply the same approach to data-plane processing for host-based networking. But efficient and cost-effective use of FPGAs for host-based networking has yet to be proven, and the approach seems unlikely to see widespread adoption for the reasons noted below.
FPGAs are suited to well-defined tasks that are repetitive and fine-grained in nature, such as image and signal processing, compression/decompression, and cryptography, to name a few. They can often perform these tasks more efficiently than software running on general-purpose processors. FPGAs, however, are poorly suited to the complex, variable and irregular processing tasks that are the hallmark of packet processing. Required functions such as branching, bit manipulation, encapsulation and filtering are just a few features that cause great difficulty for FPGAs when trying to implement network data paths.
In addition, FPGAs incur a huge area-efficiency penalty compared with standard ASIC technology that is difficult to overcome outside of very specific use cases. The programmable interconnect infrastructure in FPGAs consumes a large amount of die area, leading to about 20–30x fewer effective logic gates per unit area and 12x more dynamic power per equivalent function compared with ASIC-based designs [2]. Given that the upper limit of die area for server adapters is driven by the common denominators of cost and power consumption, the FPGA is at a significant disadvantage in performance efficiency.
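The effect of this penalty under a fixed die-area budget can be made concrete with a little arithmetic. The sketch below uses only the ratios quoted above (a 20–30x area penalty and 12x dynamic power); normalizing the ASIC to 1.0 and the function names are assumptions made purely for illustration.

```python
def fpga_relative_logic(area_penalty: float) -> float:
    """Logic capacity of an FPGA relative to an ASIC of the same die area (ASIC = 1.0)."""
    if area_penalty <= 0:
        raise ValueError("area_penalty must be positive")
    return 1.0 / area_penalty


def fpga_dynamic_power(asic_power: float, power_penalty: float = 12.0) -> float:
    """Dynamic power an FPGA would draw for a function that costs asic_power on an ASIC."""
    return asic_power * power_penalty


if __name__ == "__main__":
    for penalty in (20.0, 30.0):
        share = fpga_relative_logic(penalty)
        print(f"{penalty:.0f}x area penalty -> {share:.1%} of the ASIC's logic in the same area")
    print(f"Dynamic power for an equivalent function: {fpga_dynamic_power(1.0):.0f}x the ASIC")
```

In other words, within the same area cap an FPGA would offer only a few percent of the logic that an ASIC could, which is the disadvantage the paragraph above describes.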
One of the main purported benefits of FPGAs, the ability to adapt the function via reprogramming, is also often quite limited in practice. Significant changes to the data path may not fit or route in the target device, or may fail to hit the same target operating frequency. Furthermore, FPGAs are typically programmed using esoteric hardware description languages such as Verilog or VHDL and require hand coding for good performance. Efforts to improve this process by supporting FPGA programming in C via OpenCL and other approaches do promise to simplify development, but only at the expense of efficiency, further eroding price/performance.
One need only look at history to observe that FPGAs have always been relegated to niche applications for networking data paths. Often, they serve as a stopgap until more-efficient purpose-built solutions become available. In fact, FPGA use is often a leading indicator that a product gap exists somewhere, and if the gap is in an area with sufficient market size, purpose-built solutions will inevitably be developed.
For all of the above reasons, intelligent server adapters equipped with multicore SoCs or FPGAs clearly lack the scalability and extensibility required to accommodate the host-based networking applications of today and tomorrow. This is, of course, a familiar theme. Time and again, while the industry has tried to reuse existing technologies for new applications, it has proven necessary to accommodate new network-acceleration and efficiency-of-scale requirements with new and purpose-built technologies.
Purpose-Built Solutions Evolving to Become Mainstream
When available solutions are unable to meet the demands of emerging and compelling use cases, purpose-built solutions inevitably evolve to supplement and sometimes supplant them. IP routers were implemented entirely in software on general-purpose CPUs in the 1990s, but the explosion in traffic growth driven by the Internet led to the need for higher performance and scale, and the network processor was born. ATM evolved as a purpose-built and targeted technology intended, in part, to accommodate the desire to converge multiple traffic types. MPLS evolved next as an extension to IP networking that incorporated the best of ATM as a superior solution for scaling Layer 2– and Layer 3–based VPNs. Initial implementations of these technologies often occurred in FPGAs, but very quickly ASSPs were developed that could perform these functions with better price/performance, leading to their mainstream adoption.
A similar evolution occurred with InfiniBand and RoCE. RoCE adapters were purpose-built for low latency and large-scale data transfers with low CPU utilization. Because the solution delivered superior price/performance and scalability, it was able to overcome what had been perceived as a significant hurdle: use of the InfiniBand transport layer and IBTA-defined verbs versus the far more familiar TCP/IP and traditional sockets interface. Its advantages prevailed and adoption gradually grew, and RoCE has now been enhanced in version 2 to add support for routing and deployment across Layer 3 networks. Although RoCE was initially implemented primarily in software on servers, the processing burden was very high, which drove specific solutions in the form of server-adapter ASSPs that supported RoCE offload directly in hardware, now mainstream.
The evolution of purpose-built technologies from specialized to mainstream deployments as a more cost-effective means to accommodate changing needs is depicted in Figure 2. In addition to IP/ATM/MPLS and RoCE, the figure also shows the evolution of purpose-built 3D-graphics technologies into mainstream GPU-based products that are now pervasive in PCs, providing another great example of a function that was initially implemented in software on servers, then moved to purpose-built accelerators, and finally to mainstream adoption in the form of the GPU adapter. The same evolutionary process is also beginning to occur for host-based networking use cases with the advent of a new purpose-built technology: the network flow processor (NFP).
Figure 2: Purpose-built technologies find their way to mainstream deployments.
Network Flow Processors: Purpose-Built for Host-Based Networking
Overcoming the performance and scalability limitations of multicore SoCs and FPGAs requires addressing the root cause of these limitations. Intelligent server adapters based on NFPs can efficiently scale from 10Gbps to 100Gbps of throughput, delivering more than an order-of-magnitude performance improvement over existing software-based solutions. Figure 3 shows a throughput comparison for a common host-based networking data-plane application, Open vSwitch (OVS). As shown, the NFP-based intelligent server adapter delivers more than a 20x improvement in packet throughput for the same amount of x86 CPU resources (single x86 core), dramatically improving the price/performance equation.
Figure 3: Host-based networking performance using NFP-based intelligent server adapter.
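One way to read this result is in terms of the x86 cores that a given traffic load ties up. The sketch below is a rough illustration only: the per-core software rate and the target load are hypothetical placeholders, the speedup takes the "more than 20x" figure above at its lower bound, and linear scaling across cores is an assumption rather than a measured result.

```python
import math


def cores_needed(target_mpps: float, mpps_per_core: float) -> int:
    """Whole x86 cores required to sustain target_mpps at a given per-core packet rate."""
    if target_mpps < 0 or mpps_per_core <= 0:
        raise ValueError("target_mpps must be >= 0 and mpps_per_core must be > 0")
    return math.ceil(target_mpps / mpps_per_core)


if __name__ == "__main__":
    SOFTWARE_MPPS_PER_CORE = 1.0  # hypothetical software-only OVS rate per core
    SPEEDUP = 20.0                # lower bound of the improvement cited for Figure 3
    TARGET_MPPS = 30.0            # hypothetical aggregate load on the server

    software_cores = cores_needed(TARGET_MPPS, SOFTWARE_MPPS_PER_CORE)
    offload_cores = cores_needed(TARGET_MPPS, SOFTWARE_MPPS_PER_CORE * SPEEDUP)
    print(f"Software-only data path: {software_cores} cores")
    print(f"With NFP-based adapter:  {offload_cores} cores")
    print(f"Cores returned to applications: {software_cores - offload_cores}")
```

With these placeholder inputs the load that would occupy 30 cores in software needs 2 cores when the data path is offloaded; the actual numbers depend entirely on the measured per-core rates for a given deployment.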
While we expect other intelligent server adapters based on multicore SoC or FPGA approaches to deliver at least some improvement in price/performance, not all such adapters are created equal. The purpose of the third and final article in this series, therefore, is to outline several important characteristics that are helpful when evaluating an intelligent server adapter for use in host-based networking applications.
Leading article image courtesy of Wikieditor243 under a Creative Commons license
[1] "A Reconfigurable Fabric for Accelerating Large-Scale Datacenter Services," Microsoft
[2] "Measuring the Gap between FPGAs and ASICs," Ian Kuon and Jonathan Rose, Department of Electrical and Computer Engineering, University of Toronto
About the Author
Nick Tausanovitch has over 25 years of experience in the electronics and networking industries in positions ranging from FPGA and silicon design to system architecture to product marketing. Nick is currently senior director of solutions architecture at Netronome, where he is responsible for data center applications of the company's intelligent server adapter products. Previously, he was responsible for the high-end network-processor product line at Broadcom. Before that, Nick was Director of Electronic Design at IDT, where he developed TCAMs and algorithmic search engines, and a system architect at Nortel, where he developed switches, routers and network processors. Nick holds a Bachelor of Science degree in electrical engineering from the University of Rochester and a Master of Science in electrical engineering from New York Polytechnic University.


