Programmable network interface controllers, or SmartNICs, are an old concept, yet today they are seeing renewed interest and growing adoption in data centers and HPC systems. This blog post discusses the trends in modern computer networks that are driving a dramatic increase in the computing demands of I/O processing, highlights the challenges and downsides of using CPUs and GPUs to meet these demands, surveys recent developments in SmartNIC architectures and their applications, and speculates about future directions of this technology as well as emerging research opportunities.
Incommensurate scaling: network bandwidth is far ahead of the compute capacity
Server network bandwidth is growing rapidly. 200Gbps Ethernet NICs are already available, 400Gbps will become a reality soon, and 1Tbps is looming as the next technology frontier. On the other hand, the compute capacity headroom for processing network I/O is shrinking. For example, for a typical Key-Value Store using 32-byte keys/values, in order to keep up with the 400Gbps line rate with 100 x86 CPU cores and perfect scaling, a core has about 500 cycles to process each key/value pair. Even under these optimistic assumptions, this minuscule compute budget is barely enough to perform a few LLC or memory accesses in the network stack, leaving precious little for the application logic. In other words, future applications with line rate I/O processing requirements are destined to be CPU- and memory-bound.
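The budget estimate above can be reproduced with a short back-of-the-envelope calculation. The sketch below is only an illustration: the 3 GHz clock, the one-request-per-frame layout, and the UDP/IPv4 framing overhead are assumptions that do not come from the post, and the function name is made up for this example. Under these assumptions the result lands in the range of a few hundred cycles per request, the same order of magnitude as the figure quoted above.

```python
def cycles_per_request(line_rate_bps, wire_bytes_per_request, cores, core_hz):
    """Cycles one core can spend per request if load is spread perfectly."""
    requests_per_sec = line_rate_bps / (wire_bytes_per_request * 8)
    requests_per_core_per_sec = requests_per_sec / cores
    return core_hz / requests_per_core_per_sec


LINE_RATE_BPS = 400e9   # 400Gbps, from the text
CORES = 100             # from the text
CORE_HZ = 3e9           # assumption: 3 GHz cores

# Assumption: each request carries one 32-byte key and one 32-byte value.
PAYLOAD_BYTES = 32 + 32

# Assumption: one request per UDP/IPv4 Ethernet frame.
# 14 (Ethernet header) + 20 (IPv4) + 8 (UDP) + 4 (FCS)
# + 20 (preamble, start-of-frame delimiter, inter-frame gap)
FRAMING_BYTES = 14 + 20 + 8 + 4 + 20

# Lower bound: count payload bytes only, as if framing were free.
budget_payload_only = cycles_per_request(
    LINE_RATE_BPS, PAYLOAD_BYTES, CORES, CORE_HZ)

# More generous: framing consumes wire time, so fewer requests arrive per second.
budget_with_framing = cycles_per_request(
    LINE_RATE_BPS, PAYLOAD_BYTES + FRAMING_BYTES, CORES, CORE_HZ)

print(f"payload only: {budget_payload_only:.0f} cycles per request")
print(f"with framing: {budget_with_framing:.0f} cycles per request")
```

Changing the clock frequency or batching several requests per frame shifts the exact number, but not the conclusion: the per-request budget stays within a handful of cache misses.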
Over the years various low-level network layer functions have been pushed into NIC hardware. Today, these hardware offloads form the backbone of the network I/O processing, and range from simple scatter-gather I/O, checksum computation, and packet segmentation to full transport layer accelerators like TCP offload and RDMA.
The driving force for line rate processing on commodity systems
These days fixed-function offloads are no longer sufficient, however. There is an unprecedented demand for flexible low-latency processing of network I/O at line rate on commodity servers. Fixed-function hardware cannot keep up with rapid changes in data center networking workloads. For example, domain-specific compression techniques for accessing network-attached storage, new standards, and tunneling formats all evolve quickly, outpacing the life cycle of ASIC development and deployment. Moreover, the compute demand is fueled by the paradigm shift in data center networks from proprietary middle-boxes, e.g., firewalls and intrusion detection systems, to Virtual Network Functions (VNF) that implement the same middle-box functionality in virtual machines.
VNFs put the network processing burden on the CPU. Notably, computing demands in such systems are even higher: a VNF application running inside a virtual machine on a commodity server is expected to match the throughput and latency of purpose-built systems fine-tuned for line rate processing.
CPUs and GPUs are not enough
Network I/O performance has long been the subject of research in the systems community, with the focus on eliminating inefficiencies in the OS network stack and optimizing NIC-CPU interaction. A common approach for implementing VNFs today is to bypass the network stack altogether, accessing raw packets directly from user-level libraries, e.g., DPDK. Further, CPU and NIC hardware now provide several mechanisms to dramatically improve the efficiency of I/O processing, e.g., bringing data directly into the CPU LLC (Data Direct I/O (DDIO)), improving scalability via reduced cache contention among the CPU cores (e.g., Receive Side Scaling), and lowering interrupt frequency (e.g., Interrupt Moderation). But even with these enhancements, multiple CPU cores are needed to execute common Network Functions even at 10Gbps. Furthermore, existing systems experience increased latency and high variability in packet processing performance due to CPU resource contention, even though a solution to this problem has been recently proposed.
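To make the Receive Side Scaling idea concrete, here is a minimal sketch of how a NIC can steer all packets of a flow to the same receive queue, so that one core keeps that flow's state in its own cache. This is a simplification under stated assumptions: CRC32 stands in for the Toeplitz hash that NICs typically use, and the queue count, table size, and function name are hypothetical values chosen for illustration.

```python
import zlib

NUM_QUEUES = 4     # hypothetical: one receive queue per CPU core
TABLE_SIZE = 128   # hypothetical indirection table size

# The indirection table maps hash buckets to receive queues.
# Here it is filled round-robin; a driver could rebalance it at run time.
INDIRECTION_TABLE = [i % NUM_QUEUES for i in range(TABLE_SIZE)]


def select_queue(src_ip, dst_ip, src_port, dst_port):
    """Pick a receive queue from the flow tuple.

    CRC32 is a stand-in for the Toeplitz hash (assumption made for brevity).
    """
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}".encode()
    bucket = zlib.crc32(key) % TABLE_SIZE
    return INDIRECTION_TABLE[bucket]


# Packets of the same flow always map to the same queue, and hence the same core.
q1 = select_queue("10.0.0.1", "10.0.0.2", 40000, 80)
q2 = select_queue("10.0.0.1", "10.0.0.2", 40000, 80)
print(q1 == q2)
```

The hash only spreads flows across cores; it does not reduce the work done per packet, which is why multiple cores are still needed.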
GPUs have also been considered for accelerating network packet processing applications (PacketShader, SSLShader, SNAP, and GASPP are some notable examples). Unfortunately, GPUs introduce high latency overheads, primarily due to GPU control and PCIe data transfers. Furthermore, the TCO gains and power efficiency of GPUs in mostly I/O-intensive workloads such as routing have been questioned and partially debunked by work showing that clever latency hiding techniques on CPUs alone enable comparable performance with much lower latency. The debate on using GPUs for accelerating network processing has recently seen another round, now with integrated GPUs.
SmartNICs: old idea, new times
The idea of adding programmable hardware to the NIC was actively researched in the early 2000s with the introduction of Network Processors, such as Intel’s IXP series. These processors, however, were mainly used in specialized network appliances rather than commodity servers where SmartNICs are found today. While the research activity in this area dropped to almost zero after 2007, much of the work published then is becoming highly relevant again.
Modern SmartNICs target server networking workloads. There are two dominant design choices for the computing unit on the NIC: fully programmable network processors (e.g., Mellanox BlueField, Cavium LiquidIO, Netronome Agilio-CX) and FPGAs connected directly to the NIC ASIC over a high-speed interconnect (e.g., Mellanox InnovaFlex and the Microsoft Catapult board broadly deployed in Microsoft Azure). Conceptually, SoC-based SmartNICs, which are built around such network processors, are direct descendants of the early network processors. They rely on a custom highly-threaded CPU equipped with a wealth of fixed-function units and hardware-accelerated processing primitives.
On the other hand, FPGA-based SmartNICs are similar to a regular FPGA board, with one important difference: as in network-attached FPGAs, they feature low-latency, high-bandwidth data and control paths between the NIC and the FPGA that do not involve the CPU. In addition, they offer a fast data path from the FPGA to the host memory and other host resources. The most common design is called “bump-in-the-wire”: all the incoming traffic first arrives at the FPGA, then is passed to the NIC ASIC, which transfers the data to the host (the order is reversed for egress).
This design is attractive for several reasons. First, it requires no deep changes in the original NIC ASIC. Moreover, it allows reusing the optimized DMA hardware on the NIC and therefore makes the SmartNIC backward-compatible with the standard network stack on the host. The FPGA may also have its own DMA to the host memory (as in Catapult), and can be used as a regular look-aside FPGA.
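The bump-in-the-wire ordering can be pictured as a two-stage pipeline whose direction flips between ingress and egress. The toy model below only illustrates that ordering; the stage functions, the packet representation, and the port-blocking policy are all hypothetical and do not correspond to any vendor's interface.

```python
BLOCKED_PORTS = {23}  # hypothetical policy enforced by user logic on the FPGA


def fpga_stage(packet, trace):
    """User logic on the FPGA; it sees raw packets without involving the host CPU."""
    trace.append("fpga")
    if packet["dst_port"] in BLOCKED_PORTS:
        return None  # dropped inline
    return packet


def nic_asic_stage(packet, trace):
    """Unmodified NIC ASIC; it moves data to or from the host."""
    trace.append("nic_asic")
    return packet


INGRESS_PATH = [fpga_stage, nic_asic_stage]  # wire -> FPGA -> NIC ASIC -> host
EGRESS_PATH = [nic_asic_stage, fpga_stage]   # host -> NIC ASIC -> FPGA -> wire


def traverse(packet, path):
    """Run a packet through the stages in order; stop early if a stage drops it."""
    trace = []
    for stage in path:
        packet = stage(packet, trace)
        if packet is None:
            break
    return packet, trace


print(traverse({"dst_port": 80}, INGRESS_PATH))
print(traverse({"dst_port": 23}, INGRESS_PATH))
print(traverse({"dst_port": 80}, EGRESS_PATH))
```

In this model a packet dropped by the FPGA on ingress never reaches the NIC ASIC or the host, which is the property that makes inline filtering attractive.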
The recent paper on SmartNICs in Microsoft Azure provides an insightful comparison between the FPGA- and SoC-based SmartNICs.
FPGA-based SmartNICs offer more flexibility and are expected to scale better to higher network speeds than network processors, while providing high efficiency for classical networking workloads. For example, the ClickNP framework allows implementing Network Functions on the Catapult SmartNICs while abstracting away FPGA programming challenges for this application domain. This and other works highlighted one of the notable benefits of FPGA-based SmartNICs: predictable low latency. Incidentally, another project, NP-Click, pursued similar conceptual goals some 15 years ago for the Intel IXP 1200 processor.
However, FPGA-based NICs provide acceleration opportunities beyond traditional packet processing, in particular due to the SmartNIC’s ability to closely interact with the host. For example, Catapult SmartNICs were used to build a Key-Value Store accelerator that stores data in the host memory and also enables offloading simple user-defined arithmetic functions on the remote data. Similarly, they helped enhance RDMA security and accelerate remote atomic operations as part of the Fast Remote Memory (FaRM) project. In addition, many recent works in the FPGA community have been devoted to accelerating network applications on stand-alone FPGAs, and most of their results are naturally applicable to these SmartNICs as well.
Challenges and opportunities
As the boundary between network communications and computations is becoming increasingly blurred, the role of SmartNICs in computer systems is shifting from narrow packet processing offloads toward more versatile acceleration of network-centric applications. FPGA-based SmartNICs are likely to be particularly advantageous here, following the growing adoption of FPGAs in data centers in general.
Emerging disaggregated data center architectures are changing the traditional CPU-centric view of the system, and will put significant pressure on the interconnects. SmartNICs may serve as an ideal platform for providing unmediated access to remote peripherals, like GPUs and storage devices, for building in-network computing services and near-data computations that will help reduce the I/O to remote CPUs and memory.
However, notwithstanding the benefits of the inline network I/O acceleration offered by the bump-in-the-wire SmartNICs, from the software perspective this architecture breaks the traditional network stack layers, exposing raw network data to the application logic on the FPGA. Thus, to implement any meaningful application processing on the NIC, much of the functionality already implemented in the NIC ASIC needs to be re-implemented in the FPGA. Eliminating this inefficiency will most certainly require tighter integration of the NIC ASIC and the FPGA, allowing applications to better leverage the NIC’s native support for self-virtualization (SR-IOV), Quality of Service management, and reliable low-latency transport protocols, e.g., RDMA, which together will contribute to broader adoption of SmartNICs in virtualized cloud environments. The recently announced Mellanox Innova-2 NIC architecture is a step in this direction.
These trends call for innovation across multiple levels of the software stack. Today SmartNICs still lack the software ecosystem that would enable productivity and convenience. The development complexity is a long-standing problem with FPGAs, but it is exacerbated in SmartNICs both by the requirement to achieve high performance and by the need to integrate the user-provided logic into an existing SmartNIC environment. A common solution used by hardware vendors is to vertically integrate SmartNICs into existing systems, providing complete solutions, like IPSec offloads, while hiding the new hardware substrate from the application developer.
However, broadening the appeal of SmartNICs beyond packet processing will require true hardware-software co-design: hardware support for QoS and protection to allow SmartNIC sharing in virtualized systems; NIC-resident runtimes with direct, secure access to other peripherals; new high-level programming models with effective state sharing between the NIC and the host CPU; new NIC-resident and host-side Operating System abstractions and interfaces that embrace the merger of computation and communication; and new domain-specific programming frameworks such as ClickNP and AccelNet to hide the hardware complexity and allow application portability.
About the Author: Mark Silberstein is an Assistant Professor in the Electrical Engineering Department at the Technion – Israel Institute of Technology, where he heads the Accelerated Computer Systems Lab.
Disclaimer: These posts are written by individual contributors to share their thoughts on the Computer Architecture Today blog for the benefit of the community. Any views or opinions represented in this blog are personal, belong solely to the blog author and do not represent those of ACM SIGARCH or its parent organization, ACM.