This blog post gives a glimpse of the computer systems research papers presented at the USENIX Annual Technical Conference (ATC) 2019, with an emphasis on systems that use new hardware architectures.
USENIX ATC is a top-tier venue with a broad range of systems research papers from both industry and academia. As is the case for many high-quality computer systems conferences, the papers presented here involve a significant amount of engineering and experimentation on real hardware to convincingly evaluate innovative concepts end-to-end in a realistic setting. As a consequence, the vast majority of past papers have focused on conventional x86 or GPU-accelerated architectures.
ATC ’19 was refreshingly different. Alongside more traditional sessions such as Real-World Deployed Systems and Big Data Programming Frameworks, there were many papers focusing on emerging hardware architectures, including embedded multi-accelerator SoCs, in-network and in-storage computing, FPGAs, GPUs, and low-power devices.
A few papers demonstrated new techniques for performing dynamic binary translation (DBT) as well as new applications of this powerful tool, which enables execution of unmodified binaries across different ISAs. In particular, the best paper award went to the paper describing a high-performance system-level DBT that eases the tedious process of re-targeting the translation to new ISAs. Another work demonstrated a fairly unconventional technique that automatically learns translation rules between ISAs by analyzing semantically identical programs (a follow-up to the ASPLOS ’18 paper by the same authors). An interesting application of DBT was shown in Transkernel. Here, some of the OS kernel functions are dynamically offloaded to low-power micro-controller-like cores, thereby allowing hibernation of the main processor.
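To make the idea of rule-based translation a bit more concrete, here is a minimal Python sketch. It is not the mechanism of any of the papers above: both ISAs, the mnemonics, the `RULES` table, and the function names are hypothetical, and a real DBT works on machine code, handles control flow, and is far more involved. The sketch only shows the basic loop of translating a guest basic block through per-instruction rules and caching the result so that a hot block is translated once.

```python
# Toy rule-based translation between two made-up ISAs (all names hypothetical).
# Guest ISA is two-operand:  "add r1, r2"  means r1 = r1 + r2.
# Host ISA is three-operand: "ADD dst, src1, src2".

# Each rule maps one guest instruction to a list of host instructions.
RULES = {
    "mov": lambda dst, src: [f"MOV {dst}, {src}"],
    "add": lambda dst, src: [f"ADD {dst}, {dst}, {src}"],
    "sub": lambda dst, src: [f"SUB {dst}, {dst}, {src}"],
    # The host has no increment instruction, so the rule rewrites it as an add.
    "inc": lambda dst: [f"ADD {dst}, {dst}, #1"],
}


def translate_block(guest_block):
    """Translate a list of guest instructions into host instructions."""
    host = []
    for line in guest_block:
        mnemonic, _, rest = line.strip().partition(" ")
        operands = [op.strip() for op in rest.split(",") if op.strip()]
        rule = RULES.get(mnemonic)
        if rule is None:
            raise ValueError(f"no translation rule for {mnemonic!r}")
        host.extend(rule(*operands))
    return host


# Translation cache: a block is translated on first use and reused afterwards.
_cache = {}


def translate_cached(guest_block):
    key = tuple(guest_block)
    if key not in _cache:
        _cache[key] = translate_block(guest_block)
    return _cache[key]


if __name__ == "__main__":
    block = ["mov r1, r2", "add r1, r3", "inc r1"]
    for host_instruction in translate_cached(block):
        print(host_instruction)
```

In this picture, re-targeting to a new ISA amounts to supplying a different rule table, which is the part that is tedious to write by hand and that the rule-learning work aims to derive automatically.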
Accelerators and OS
A fairly large batch of papers focused on the study of novel OS architectures for systems with accelerators. GAIA proposed to expand the OS page cache into accelerator memory. The OS page cache serves as a cache for file accesses and plays an important role in core OS services, such as memory-mapped files. GAIA’s extension of the OS page cache into GPU memory enabled access to memory-mapped files from GPU kernels. The key design decision was to implement a distributed shared memory with lazy release consistency between CPUs and GPUs, built on GPU hardware page faults.
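For readers less familiar with lazy release consistency, the following Python sketch illustrates the general idea in a simplified, home-based form. It is not GAIA's protocol: the `Home` and `Device` classes, the per-page version numbers, and the single-writer assumption are all made up for illustration. Writes stay local until `release`, stale copies are dropped only at `acquire`, and the next access after an invalidation takes a (simulated) page fault that fetches the fresh page.

```python
# Simplified, home-based lazy release consistency (hypothetical model,
# one writer per page at a time, whole-page transfers, no diffs or merging).

class Home:
    """Authoritative copy of every page, tagged with a version number."""

    def __init__(self):
        self.pages = {}  # page number -> (version, data)

    def fetch(self, page):
        return self.pages.get(page, (0, b""))

    def publish(self, page, data):
        version, _ = self.fetch(page)
        self.pages[page] = (version + 1, data)
        return version + 1


class Device:
    """A CPU or GPU holding local copies of pages."""

    def __init__(self, name, home):
        self.name = name
        self.home = home
        self.local = {}   # page number -> (version, data)
        self.dirty = set()
        self.faults = 0   # number of simulated page faults

    def _fault_in(self, page):
        # A missing local copy plays the role of a hardware page fault.
        if page not in self.local:
            self.faults += 1
            self.local[page] = self.home.fetch(page)

    def read(self, page):
        self._fault_in(page)
        return self.local[page][1]

    def write(self, page, data):
        self._fault_in(page)
        version, _ = self.local[page]
        self.local[page] = (version, data)
        self.dirty.add(page)

    def release(self):
        # Make local writes available to others; nobody is notified yet.
        for page in sorted(self.dirty):
            data = self.local[page][1]
            self.local[page] = (self.home.publish(page, data), data)
        self.dirty.clear()

    def acquire(self):
        # Lazily drop clean copies that are older than the home version.
        stale = [p for p, (v, _) in self.local.items()
                 if p not in self.dirty and v < self.home.fetch(p)[0]]
        for page in stale:
            del self.local[page]


if __name__ == "__main__":
    home = Home()
    cpu, gpu = Device("cpu", home), Device("gpu", home)
    cpu.write(0, b"v1")
    cpu.release()
    gpu.acquire()
    print(gpu.read(0))   # the GPU faults the page in after acquiring
    cpu.write(0, b"v2")
    cpu.release()
    print(gpu.read(0))   # still the old copy: no acquire has happened yet
    gpu.acquire()
    print(gpu.read(0))   # stale copy dropped at acquire, so this read faults
```

The appeal of the lazy variant is that data moves only when a device actually touches a page after synchronizing, which matters when every transfer crosses the CPU-GPU interconnect.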
Two other papers discussed a new approach to OS architecture for multi-accelerator Systems on a Chip (SoCs) with hundreds of accelerators. These papers argued that the scalability of such multi-accelerator systems is severely limited by the centralized nature of the modern OS architecture running on a CPU. The proposed extensions to the novel M3 Operating System (presented originally at ASPLOS ’16) tackled the scalability challenge from two different angles. The first paper showed how one can eliminate the OS from the performance-critical data and control paths by introducing a novel hardware module called the Data Transfer Unit (DTU). The DTU can be seen as an advanced DMA controller attached to each accelerator in the SoC, used as the trusted gateway for all communication to and from the accelerator. The second work presented a novel scalable distributed capability mechanism for security and protection in such systems.
A different line of work considered speeding up existing OS services with accelerators. Intel QuickAssist Technology (QAT) was the focus of the QZFS paper, which used this new hardware device to speed up file system compression. This work showed how to reduce the overhead of using the accelerator on the critical path of file I/O.
Programmable I/O Devices
The conference featured a whole session with four papers on smart peripherals that can be used to implement near-network and near-storage processing. Two of the papers, NICA and INSIDER, considered inline processing on FPGA-based SmartNICs and Smart SSDs respectively. The inline processing paradigm offers many advantages in the context of I/O acceleration compared to the traditional look-aside acceleration as in GPUs. Both works focused on facilitating inline processing by integrating acceleration primitives with existing OS abstractions such as network sockets and files. Another important aspect discussed in NICA was SmartNIC virtualization, and in particular, performance isolation between different virtual machines that use both the compute and I/O capabilities of these devices.
Another paper, called E3, observed that ARM-based SmartNICs are capable enough to run cloud micro-services. This work provided a comprehensive energy and cost analysis of modern systems and modern micro-service-based applications, and showed that SmartNICs can run micro-services with higher power efficiency and lower cost compared to traditional CPUs. Finally, the Cognitive SSD paper demonstrated an interesting application of in-storage computing: accelerating content-based data retrieval from storage using advanced deep neural network hashing techniques.
The shift toward increased system heterogeneity and accelerator-rich architectures comes as no surprise to the readership of this blog. Yet, the wealth of systems papers at USENIX ATC that actually use such architectures is a remarkable sign of the growing adoption of new hardware in complex full-system scenarios by a community that is more conservative from the hardware perspective. Studying these systems-building efforts may therefore reveal blind spots in existing hardware designs, and potentially open new research opportunities for the architecture-focused crowd.
About the Author: Mark Silberstein is an Associate Professor in the Electrical Engineering Department at the Technion – Israel Institute of Technology, where he heads the Accelerated Computer Systems Lab.
Disclaimer: These posts are written by individual contributors to share their thoughts on the Computer Architecture Today blog for the benefit of the community. Any views or opinions represented in this blog are personal, belong solely to the blog author and do not represent those of ACM SIGARCH or its parent organization, ACM.