In this blog post I share what I learned at the Seventh Workshop on Multi-core and Rack-scale Systems (MaRS), co-located with the EuroSys Conference on Computer Systems held in Belgrade in April. It is a small but vigorous interdisciplinary workshop that focuses on the systems aspects of emerging hardware, and brings together researchers and practitioners from a range of computer engineering areas, from architectures and operating systems to programming languages and parallel algorithms.
This year the workshop centered around the challenges of deploying and using programmable accelerators in data centers. We invited three prominent keynote speakers from Microsoft, IBM and Mellanox, each providing a unique perspective based on first-hand experience with accelerated systems in data centers. There was an unequivocal consensus among the speakers that accelerators were becoming essential for data center performance and power efficiency. Perhaps more surprisingly, however, all three presentations revealed a common vision for the role of FPGAs in data centers, in particular when an FPGA is coupled with a high-speed network adapter.
Aleksandar Dragojevic from Microsoft Research Cambridge described their ongoing work in the Catapult project. He started his talk by presenting the main conditions for making accelerators viable in the data center: economy of scale for amortizing the cost of development and deployment of a new accelerator architecture, and hardware homogeneity across nodes for taming the complexity of resource management. These two conditions dictate the need for a versatile accelerator platform which supports three different modes of operation: (1) local acceleration in the same compute node, (2) pooling of multiple (possibly remote) accelerators to cope with unpredictable demand, and (3) infrastructure (primarily network) acceleration for offloading network protocols, SDN and packet processing.
Catapult V2.0 is the building block of such a “Configurable cloud” platform, which will be (or already is) installed in every compute node in Azure. Catapult V2.0 is the latest generation of the FPGA-based accelerator, and it now adds support for standard high-speed networking hardware. Along with numerous insightful details about the platform and its performance in local, remote and infrastructure offload settings, Aleksandar mentioned his ongoing research on employing Catapult in the FaRM system. FaRM (Fast Remote Memory) is a sophisticated high-performance data processing platform which provides strongly consistent transactions and fault tolerance while using new hardware such as RDMA networking and non-volatile memory. Aleksandar showed promising results for using Catapult to implement an extra security layer for RDMA, and for accelerating remote atomic reads.
Christoph Hagleitner from IBM Research Zurich discussed recent developments in IBM’s accelerator-rich systems as part of OpenPower and OpenCAPI Foundations. He argued that today’s diversity of data center workloads calls for two different types of computing systems: disaggregated nodes interconnected with high-speed networks for scale-out workloads like MapReduce and Spark, and scale-up “Fat” nodes that are suitable for demanding high-performance applications.
Naturally, Fat nodes are accelerator-rich servers with a range of accelerators, from FPGAs and GPUs to near-memory, smart storage and smart network controllers. The key to the programmability and performance of Fat nodes is the Coherent Accelerator Processor Interface (CAPI), which provides memory coherence between discrete accelerators and the host CPU, as well as high bandwidth between the CPU and the accelerators (up to 300GB/s in Power9). By streamlining the accelerator’s accesses to system memory and by keeping the device driver and the OS out of the control path, CAPI dramatically reduces accelerator invocation overheads and enables efficient offloading of small-granularity tasks. CAPI has been adopted by many accelerator, storage and network hardware vendors, and has been prototyped with less traditional near-memory accelerators; a respectable lineup of upcoming products is set to gain CAPI support. Christoph also showed the envisioned platform for disaggregated systems, with extreme density of compute nodes (>1000 per rack). The compute node prototype is a small discrete card with an FPGA, a 10Gb NIC and 16GB of memory.
Dror Goldenberg, VP Software Architecture at Mellanox, outlined the company’s view on future intelligent networks, I/O accelerators and co-processors. The primary observation that motivates the need for accelerating network processing is the growing gap between network bandwidth and CPU processing rates. As an illustration, Dror pointed out that sustaining a 200Gb/s line rate with 64B packets leaves a single CPU core only about 3.3ns to process each packet. For comparison, an L3 cache access on an Intel Haswell i7-4770 CPU takes about 10ns, and a system call costs at least 50ns. With 200Gb/s networks at the doorstep, and 400Gb/s around the corner, it is clear that the CPU is becoming the primary bottleneck for packet processing alone, leaving no headroom for application logic.
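To make the per-packet budget above concrete, here is a small back-of-the-envelope sketch. The function name and the 20 bytes of per-frame wire overhead (Ethernet preamble plus inter-frame gap) are my own assumptions rather than something stated in the talk; with them, the 64B case at 200Gb/s comes out at about 3.36ns, close to the figure Dror quoted.

```python
# Assumption (not from the talk): each Ethernet frame carries 20 extra bytes
# on the wire: 8 bytes of preamble/start-of-frame plus a 12-byte inter-frame gap.
WIRE_OVERHEAD_BYTES = 20


def per_packet_budget_ns(line_rate_gbps: float, packet_bytes: int,
                         overhead_bytes: int = WIRE_OVERHEAD_BYTES) -> float:
    """Time available per packet, in nanoseconds, at full line rate.

    1 Gb/s equals 1 bit per nanosecond, so bits divided by Gb/s gives ns.
    """
    bits_on_wire = (packet_bytes + overhead_bytes) * 8
    return bits_on_wire / line_rate_gbps


if __name__ == "__main__":
    for rate in (200, 400):
        budget = per_packet_budget_ns(rate, 64)
        print(f"{rate} Gb/s, 64B packets: {budget:.2f} ns per packet")
```

Under the same assumptions the budget halves at 400Gb/s, which is well below the cost of a single L3 access mentioned above.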
Mellanox claims that offloading network processing logic to accelerators is essential, and to that end offers three complementary approaches: augmenting NICs with new fixed-function accelerators for host and network virtualization-related offloads; Smart NICs with bump-on-the-wire FPGAs or Network Processing Units for offloading complex protocols such as TLS, encryption and various network functions such as Deep Packet Inspection and firewalls; and finally in-network computations such as data reduction, performed in network switches.
Main challenge: programmability
In the panel that followed, the speakers addressed the main challenge of accelerators: the high complexity of software development. The more fundamental challenges discussed were the lack of appropriate tools for system profiling and debugging, the lack of performance models to evaluate the potential benefits of accelerators, and lastly the millions of lines of legacy code that would have to be manually ported to use accelerators. Interestingly, the speakers hardly mentioned software development as a challenge in their keynotes.
It seems that the companies are not really trying to make accelerator programming easier. Instead they are offering a simple workaround: vertical integration. Specifically, Mellanox advocates transparent acceleration, in which the original CPU-only implementations are replaced with fully compatible, but accelerated, vendor-provided libraries optimized for the specific hardware accelerator. For the Catapult platform, programmability is a concern, but since the Catapult hardware is not (yet) exposed for public use, the internal users seem to rely on in-house expertise. Even if Catapult becomes publicly available on Azure in the future, the most likely usage scenarios will include Catapult-accelerated libraries and domain-specific languages crafted for common applications, like machine learning and graph processing.
This vertically integrated system approach is clearly the most effective one for the immediate term, as it does not require maintaining a software development ecosystem. Vendors may keep the accelerator software development in-house, similarly to the way device drivers are developed today. Its downside, however, is the “glass ceiling” of backward compatibility with existing interfaces and applications, which prevents the use of accelerators in other scenarios.
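The transparent-acceleration idea can be illustrated with a minimal sketch: the application keeps calling the same function, and a vendor-provided backend is swapped in underneath when it is available. Everything here is hypothetical and for illustration only; in particular `vendor_accel` is a made-up module name, not a real Mellanox or Microsoft library, and CRC32 is just a stand-in for an offloadable function.

```python
import importlib
import zlib


def _load_crc32_backend():
    """Pick an accelerated crc32 if a vendor library is installed, else use the CPU.

    "vendor_accel" is a hypothetical module name used only for illustration.
    """
    try:
        vendor = importlib.import_module("vendor_accel")
        return vendor.crc32, "accelerated"
    except ImportError:
        # No accelerator library present: fall back to the CPU implementation.
        return zlib.crc32, "cpu"


_crc32_impl, BACKEND = _load_crc32_backend()


def crc32(data: bytes) -> int:
    """Same signature and result regardless of which backend is in use."""
    return _crc32_impl(data)


if __name__ == "__main__":
    print(BACKEND, hex(crc32(b"packet payload")))
```

The sketch also shows where the “glass ceiling” comes from: the accelerator can only do what the existing `crc32(data)` interface allows, because the caller-visible contract is fixed.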
All the speakers expressed the hope that the growing awareness of accelerator technologies, most notably FPGAs, and their broadening adoption in data centers, will finally create the critical mass of developers who will drive the innovation to accelerate the accelerator software development.
Several proposals presented by academic researchers were devoted to bridging the accelerator programmability gap. These included frameworks for using smart NICs to accelerate network applications, a system for multi-GPU graph processing, new OS abstractions for heterogeneous systems, a runtime for building scalable network servers on GPUs, and new highly concurrent languages.
The papers and the keynote slides are available online at https://sites.google.com/site/mars2017eurosys/Program
About the author
Mark Silberstein is an Assistant Professor in the Electrical Engineering Department at the Technion – Israel Institute of Technology, where he heads the Accelerated Computer Systems Lab.