Intelligent Server Adapters: Key Ingredients for Success (Part 3)

This article is the third in a three-part series that takes an in-depth look at the promise and pitfalls of using general-purpose CPUs in host-based networking applications, and it explores the different server-networking hardware technologies being used to deliver better price/performance for host-based networking. The first article discussed host-based networking and the potential issues when using general-purpose CPUs to implement the data path. It also discussed the importance of intelligent server adapters for hardware acceleration and server offload. The second article explored the three fundamental technologies being proposed for these intelligent server adapters and provided guidance on the solutions that deliver the best price/performance for host-based networking applications. This final article examines the attributes and main ingredients for success that will enable the intelligent server adapter to become a mainstream technology in data centers for years to come.

Many instances of data-plane processing have evolved in recent years, both in the open-source community and in commercial deployments on the premises of data center operators. For example, the Open vSwitch (OVS[1]) user-space and kernel components have evolved to introduce more sophisticated and scalable tunneling and flow-processing capabilities. The OVS kernel modules handle tunneling and switching. The kernel also implements a fast caching mechanism for non-overlapping or exact-match flows. More recently, support for tunnels such as VXLAN[2] and for wildcard match-action rules has been added to the data path. In commercial deployments such as Microsoft Azure, the virtual switch supports network virtualization for tenants as well as sophisticated match-action processing for functions such as load balancing and security. In the Google Andromeda network, packet-processing nodes provide match-action flow and data-plane processing for firewall, security, rate limiting, routing and other functions.
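To make the relationship between an exact-match cache and wildcard match-action rules concrete, here is a minimal Python sketch. It is an illustration only and not the OVS implementation: the names FlowCache and slow_path_lookup, the 5-tuple key layout, the action strings and the default "drop" action are all assumptions made for this example.

```python
from typing import Dict, List, Optional, Tuple

# An exact flow key: (src_ip, dst_ip, ip_proto, src_port, dst_port).
FlowKey = Tuple[str, str, int, int, int]
# A wildcard pattern uses None for "match anything" in that field.
Pattern = Tuple[Optional[str], Optional[str], Optional[int], Optional[int], Optional[int]]
Rule = Tuple[Pattern, str]  # (pattern, action)


def slow_path_lookup(rules: List[Rule], key: FlowKey) -> str:
    """Scan the wildcard rules in priority order; the first match wins."""
    for pattern, action in rules:
        if all(p is None or p == k for p, k in zip(pattern, key)):
            return action
    return "drop"  # hypothetical default when no rule matches


class FlowCache:
    """Exact-match cache placed in front of a slower wildcard classifier."""

    def __init__(self, rules: List[Rule]) -> None:
        self.rules = rules
        self.cache: Dict[FlowKey, str] = {}
        self.hits = 0
        self.misses = 0

    def classify(self, key: FlowKey) -> str:
        action = self.cache.get(key)
        if action is not None:
            self.hits += 1  # fast path: one hash lookup
            return action
        self.misses += 1  # slow path: walk the wildcard rules
        action = slow_path_lookup(self.rules, key)
        self.cache[key] = action  # later packets of this flow hit the cache
        return action
```

In this sketch only the first packet of a flow pays for the rule scan; every later packet with the same key is resolved by a single dictionary lookup. A real data path would also need cache eviction and invalidation when rules change, which is omitted here.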

Data-plane processing for host-based networking has evolved in the host or server domain, implemented in software on general-purpose CPUs such as those based on the x86 instruction-set architecture. Unlike traditional processing tasks executed on such general-purpose CPUs, however, data-plane processing and, more specifically, tunnel encapsulation and decapsulation as well as match-action-based flow processing are unique. This uniqueness results in significant inefficiencies when they are executed on general-purpose CPUs. Alternative solutions have been proposed, such as running the data plane in a multicore SoC or FPGA, but each of these approaches provides only marginal gains, as the previous article in this series discussed.
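As an illustration of what tunnel encapsulation and decapsulation involve per packet, the following Python sketch adds and strips the 8-byte VXLAN header (an 8-bit flags field whose 0x08 bit marks the VNI as valid, 24 reserved bits, a 24-bit VNI and 8 more reserved bits). The function names are hypothetical, and the sketch deliberately leaves out the outer Ethernet, IP and UDP headers that a real tunnel endpoint must also build and validate.

```python
import struct
from typing import Tuple

VXLAN_FLAG_VNI_VALID = 0x08  # the "I" flag: the VNI field carries a valid ID
VXLAN_HEADER_LEN = 8


def vxlan_encap(vni: int, inner_frame: bytes) -> bytes:
    """Prepend a VXLAN header carrying the given 24-bit VNI."""
    if not 0 <= vni < (1 << 24):
        raise ValueError("VNI must fit in 24 bits")
    # Word 1: flags (8 bits) + reserved (24 bits).
    # Word 2: VNI (24 bits) + reserved (8 bits).
    header = struct.pack("!II", VXLAN_FLAG_VNI_VALID << 24, vni << 8)
    return header + inner_frame


def vxlan_decap(payload: bytes) -> Tuple[int, bytes]:
    """Return (vni, inner_frame) from a VXLAN payload."""
    if len(payload) < VXLAN_HEADER_LEN:
        raise ValueError("truncated VXLAN header")
    word1, word2 = struct.unpack("!II", payload[:VXLAN_HEADER_LEN])
    if not (word1 >> 24) & VXLAN_FLAG_VNI_VALID:
        raise ValueError("VNI-valid flag is not set")
    return word2 >> 8, payload[VXLAN_HEADER_LEN:]
```

Even in this reduced form, every packet requires a header build or parse plus a copy of the payload, which hints at why this work maps poorly onto a CPU tuned for general-purpose code.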

Implementing data-plane processing with efficiency and scale requires processing cores and product architectural elements that are purpose built with special ingredients. Next, we discuss seven such important ingredients for success.

Key Ingredient # 1: Processor Multithreading

Flow processing requires access to memory such as DDR3- or DDR4-based RAM. To assist processing in CPU cores, hardware-based accelerators handle repetitive or specialized functions such as cryptography and hashing. With single-threaded processing, as with general-purpose CPUs (like standard x86, MIPS and ARM cores), memory and accelerator access latencies waste CPU cycles. For example, accessing DDR3 memory could take hundreds of CPU cycles, and access to hardware accelerators can take even longer, leaving the CPU core idle and effectively useless for this extended period. This issue will often reduce the effective CPU utilization to 10–20% for typical host-based data-plane processing tasks. Software custom-coding techniques can fill latency gaps, but these changes are time consuming and cumbersome, and they reduce the portability of the software.

An ideal solution to the problem is to implement processing cores that are highly multithreaded. When processing cores are multithreaded (for example, eight threads per core), the processor pipeline can keep executing useful instructions from a ready thread instead of stalling or idling while another thread waits on memory or an accelerator. As a result, the multithreaded processing gain can be up to 800% compared with single-threaded machines when memory or hardware accelerator access requirements are significant, as in the case of typical data-plane processing in host-based networking and emerging NFV applications.
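The arithmetic behind this gain can be sketched with a simple idealized model. Assume each thread alternates between a burst of compute cycles and a stall while it waits on memory or an accelerator, and assume thread switching costs nothing. Both are simplifying assumptions, and the function name and the cycle counts used afterward are hypothetical rather than measured.

```python
def core_utilization(threads: int, compute_cycles: int, stall_cycles: int) -> float:
    """Fraction of cycles in which the pipeline does useful work.

    Idealized model: each thread computes for `compute_cycles`, then waits
    `stall_cycles`; while one thread waits, another ready thread runs, and
    switching between threads is free.
    """
    if threads < 1 or compute_cycles <= 0 or stall_cycles < 0:
        raise ValueError("need threads >= 1, compute_cycles > 0, stall_cycles >= 0")
    busy_fraction = threads * compute_cycles / (compute_cycles + stall_cycles)
    return min(1.0, busy_fraction)  # a pipeline cannot exceed 100% busy
```

With 25 compute cycles per 175 stall cycles, one thread keeps the core busy 12.5% of the time, which falls in the 10–20% range cited earlier. Eight threads under the same assumptions fill the pipeline completely, an eightfold improvement. Adding threads beyond that point yields nothing further in this model, because the pipeline is already saturated.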

Key Ingredient # 2: Many Processor Cores Better Than a Few Faster Cores

General-purpose CPUs are typically optimized for highest processor clock speeds at the expense of power and area. For example, big and complex pipelines with more than 15 stages, out-of-order execution and branch-prediction capabilities are common in such CPUs. Large caches are also needed owing to the lack of multithreading in order to reduce the effect of memory latency, as explained earlier. When such general-purpose CPU cores are packed into a single silicon die, as in MIPS- or ARM-based multicore SoCs, the effective performance gain is lower than that of packing a larger number of smaller processing cores in the same silicon die. In other words, using more optimized multithreaded processing cores in silicon is better for data-plane processing than fewer higher-performance general-purpose CPU cores with fewer or no threads and large caches. The use of large processor cores carries significant overhead when price and power constraints are placed on server adapter designs, as general-purpose servers used in compute nodes show.

Key Ingredient # 3: Memory and Accelerator Multithreading

Efficient access to memory and hardware accelerators is critical in data-intensive flow processing, and the challenge is only exacerbated with larger numbers of flows and complex processing (such as the number of tuples used in matches and complexity of actions). Such requirements are bound to become pervasive in data centers with the growing need to support more users, tenants and applications, as well as with the requirements for stringent policies related to security and service levels. Although faster access to memory is important, multithreaded access to memory is even more important. A multithreaded memory subsystem with hardware accelerators can ensure that the processing cores avoid stalling. An example of such an efficient design is one that uses multiple SRAM banks with multiple high-bandwidth crossbar inputs. Access to such SRAM banks is further accelerated using dedicated high-performance tightly coupled hardware engines that perform critical functions such as atomics, statistics, lookups and load balancing.
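A toy model shows why spreading accesses across banks matters more than raw speed alone. The sketch below assumes that each bank serves one request per cycle, that requests to different banks proceed in parallel, and that word addresses are interleaved across banks by a simple modulo. The function name, the 8-byte word size and the interleaving scheme are assumptions for illustration and do not describe any particular device.

```python
from collections import Counter
from typing import Iterable


def cycles_to_serve(addresses: Iterable[int], banks: int, word_bytes: int = 8) -> int:
    """Cycles needed to serve a batch of requests issued at the same time.

    Toy model: each bank completes one request per cycle, different banks
    work in parallel, and word addresses are interleaved across banks.
    """
    if banks < 1 or word_bytes < 1:
        raise ValueError("banks and word_bytes must be positive")
    per_bank = Counter((addr // word_bytes) % banks for addr in addresses)
    # The busiest bank determines how long the whole batch takes.
    return max(per_bank.values(), default=0)
```

Eight consecutive word addresses take eight cycles on a single bank but only one cycle when spread over eight banks. Eight addresses spaced 64 bytes apart all land in the same bank under this interleaving and again take eight cycles, which is one reason real designs pay attention to how flow state is distributed across banks.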

Key Ingredient # 4: High-Performance Distributed Mesh Fabric

The multithreaded processing cores, hardware accelerators and multibank memory units described above must be well synchronized and provide high performance while avoiding processing stalls. Traditional shared-bus architectures suffer from bandwidth saturation and contention issues under load when accessing shared resources. This problem can be avoided by using an efficient high-performance distributed mesh fabric with multiterabit bisectional bandwidth between processing elements. Such a distributed mesh fabric will avoid contention and bus-saturation issues that are common with shared-bus architectures in general-purpose CPU-based SoCs.

Key Ingredient # 5: Optimized Programming Tools for Host-Based Networking

Although at first glance general-purpose CPU cores seem easy to program (for example, by using standard C-based programming tools), the difficulty and complexity increase substantially when attempting to parallelize applications and scale performance. In this sense, they lack good support for development of optimized data-plane-processing applications. When programming multithreaded processing cores, it is critical to employ robust, easy-to-use, C-based programming tools that support parallel programming environments and that provide thread-level visibility during programming. They should also allow creation of data-plane-processing programs that are optimized for multithreaded operation.

In addition to C-based programming tools, it is now becoming possible to support high-level programming languages such as P4[3] that make the description and coding of data-path functions much simpler and less time consuming. Using the open P4 language, designers can write concise programs that can flexibly define match and action processing to quickly implement new protocols such as emerging network overlays. P4 is also hardware agnostic, so it can be retargeted to different technologies and implementations, provided that they support the P4 environment.

Key Ingredient # 6: Hitting Compute-Node Economics

Intelligent server adapters are evolving in a natural way, starting with low-volume niche applications, with the promise to grow into high-volume mainstream deployments. Initial implementations using multicore SoC silicon have found their way into appliances and purpose-built servers that are sometimes called service nodes or network nodes. Some instances use NPUs and FPGAs in such applications. Because the deployment volume for service nodes has not been very large, data center operators have been willing to pay a premium for programmable server adapters capable of data-plane processing.

With host-based software-defined networking (SDN) and network functions virtualization (NFV) technologies moving toward mainstream adoption, however, the need for intelligent server adapters in the much higher-volume compute nodes is expected to rise significantly. This situation will require intelligent server adapters that exhibit much better price/performance characteristics than the early service-node implementations. Specifically, such adapters will have to operate at wire speed within the PCI Express power envelope of 25 watts in most servers deployed today. And most importantly, they will have to be reasonably priced to support the volume economics of compute-node servers. Thus, silicon technology and data-plane-processing architectures in programmable server adapters must allow for performance, scale and economics. Key ingredients 1 through 5 highlighted above are required to meet data-plane-processing requirements at 25-, 40- and 50GbE bandwidths while hitting the compute-node economics expected by data center operators.
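To see what wire speed demands at these bandwidths, the sketch below estimates the worst-case packet rate for minimum-size Ethernet frames and the resulting cycle budget per packet. On the wire, each frame carries 20 extra bytes (8 bytes of preamble and start-of-frame delimiter plus a 12-byte inter-frame gap). The function names are hypothetical, and any core count or clock rate passed in is an assumption of the caller, not a figure for any specific adapter.

```python
ETHERNET_OVERHEAD_BYTES = 20  # 8-byte preamble/SFD + 12-byte inter-frame gap


def packets_per_second(link_gbps: float, frame_bytes: int = 64) -> float:
    """Worst-case packet rate on a link that is saturated with equal-size frames."""
    if link_gbps <= 0 or frame_bytes <= 0:
        raise ValueError("link speed and frame size must be positive")
    bits_on_wire = (frame_bytes + ETHERNET_OVERHEAD_BYTES) * 8
    return link_gbps * 1e9 / bits_on_wire


def cycles_per_packet(link_gbps: float, cores: int, core_ghz: float,
                      frame_bytes: int = 64) -> float:
    """Aggregate core cycles available per packet at wire speed."""
    if cores < 1 or core_ghz <= 0:
        raise ValueError("cores and core_ghz must be positive")
    total_cycles_per_second = cores * core_ghz * 1e9
    return total_cycles_per_second / packets_per_second(link_gbps, frame_bytes)
```

For 64-byte frames this model gives roughly 37 million packets per second at 25GbE, 60 million at 40GbE and 74 million at 50GbE. A single 1 GHz core would have fewer than 17 cycles per packet at 40GbE, far less than one DDR access, which is why the budget has to be met through many threads and cores working in parallel rather than through one fast core.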

Key Ingredient # 7: Ready Software Ecosystem for Mainstream Adoption

In addition to meeting performance, feature, price and power requirements, mainstream adoption of programmable server adapters will require a well-supported software ecosystem. Specifically, the server operating-system kernel, user space and virtual-switch networking software stacks must support installation and operation of such server adapters that can offload data-plane processing such as virtual network tunneling and match-action-related flow processing.

Ease of integration with existing server applications and open-source software, together with maximal feature velocity, is of paramount importance for a successful solution. As an example, offload architectures that work with existing open-source solutions such as Open vSwitch, as opposed to replacing them with proprietary or forked solutions, will undoubtedly prevail. A critical point is that any new features, when implemented in the open-source community, must be upstreamed into the appropriate open-source mainline trees (such as in www.kernel.org or www.openvswitch.org). Commercial operating systems and hypervisor distributions from well-known operating-system vendors should also include data-path offload capabilities such as the above. Operating-system vendors must support qualification of programmable-server-adapter device drivers and associated software with their distributions to enable seamless operation of such adapters and their mainstream adoption by data center operators.

Mainstream Market-Ready Intelligent Server Adapters

Host-based networking deployments are expected to drive mainstream adoption of intelligent server adapters. Such adapters have to be purpose-built and support the main ingredients for success. We discussed seven such ingredients that span architecture and technology in server adapters as well as volume economics and software-ecosystem requirements.

[1] Open vSwitch is an open-source community-based development. See www.openvswitch.org.

[2] VXLAN is a network-virtualization-related specification developed in the IETF; it stands for Virtual Extensible Local Area Network.

[3] P4 is an open-source higher-level programming language that is hardware agnostic. See www.P4.org.

Image courtesy of Sam Greenhalgh under a Creative Commons license

About the Author

Nick Tausanovitch has over 25 years of experience in the electronics and networking industries in positions ranging from FPGA and silicon design to system architecture to product marketing. Nick is currently senior director of solutions architecture at Netronome, where he is responsible for data center applications of the company's intelligent server adapter products. Previously, he was responsible for the high-end network-processor product line at Broadcom. Before that, Nick was Director of Electronic Design at IDT, where he developed TCAMs and algorithmic search engines, and a system architect at Nortel, where he developed switches, routers and network processors. Nick holds a Bachelor of Science degree in electrical engineering from the University of Rochester and a Master of Science in electrical engineering from New York Polytechnic University.
