Computer Architecture Today

Informing the broad computing community about current activities, advances and future directions in computer architecture.

One of the most pressing challenges facing today’s digital society is how to curb the relentless increase in the energy consumption of computing. Without major action, this increase is likely to accelerate, as ubiquitous AI models are embedded in applications ranging from personalized content creation and extended reality to automation and control. A recent New York Times article that may prove prescient points out how US electric utilities are being overwhelmed by the demand from data centers. The message for computer architecture and systems researchers is clear: here is an area where your ideas can have a great impact.

The ACE Center for Evolvable Computing

Based on several studies that crystallized into two reports, the Semiconductor Research Corporation (SRC) established the ACE Center, focused on developing new architectures and paradigms for distributed computing with a radically new computing trajectory, to attain order-of-magnitude improvements in energy efficiency. The center is rooted in the academic community, with 21 faculty members [1] with diverse domain expertise and over 100 graduate students. 

On the surface, the center’s roadmap will look familiar to any researcher in our field: leverage hardware accelerators and integration, minimize data movement, co-design hardware and software innovations, and integrate security and correctness from the ground up so they do not have to be retrofitted later. The challenge is to go beyond many disjoint improvements and provide coordinated, multidisciplinary innovations with substantial combined impact. 

An idea that underpins ACE is evolvability: accelerator hardware, specialized communication stacks, or customized security mechanisms should be designed for extensibility and composability. They should have compatible interfaces, accommodate upgrades of their external environments, and be easily replaceable by (and co-exist with) next-generation designs of the same module. These principles have served us well with general-purpose processors; we should retain them as we move to an accelerator-centric era.

High-Performance Energy-Efficient Computing

To attain high energy efficiency, we envision data centers that contain a large number of hardware accelerators with different functionalities, organized into distributed ensembles. Smart compilers will generate executable code from different sections of an application for different types of accelerators. Then, the runtime will assign each of these binaries to the most appropriate accelerator in an ensemble. Such ensembles will be spatially and temporally shared by multiple tenants in a secure manner. Further, to attain the highest efficiency, new classes of general-purpose processors will be specialized for different workload domains.
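As a rough illustration of the assignment step described above, the following Python sketch picks, for a given kernel type, the idle accelerator with the lowest energy per operation. It is a toy model and not the ACE runtime: the Accelerator class, the pick_accelerator function, the device names, and the energy figures are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Accelerator:
    name: str
    kinds: frozenset      # kernel kinds this device can run (hypothetical labels)
    joules_per_op: float  # assumed energy cost per operation (made-up numbers)
    busy: bool = False


def pick_accelerator(kind, ensemble):
    """Return the idle accelerator that supports `kind` at the lowest
    energy per operation, or None if no such device exists."""
    candidates = [a for a in ensemble if kind in a.kinds and not a.busy]
    return min(candidates, key=lambda a: a.joules_per_op, default=None)


# A small hypothetical ensemble shared by several tenants.
ensemble = [
    Accelerator("gpu0", frozenset({"dense_matmul", "conv"}), 2.0e-11),
    Accelerator("sparse0", frozenset({"sparse_matmul"}), 0.5e-11),
    Accelerator("fpga0", frozenset({"sparse_matmul", "conv"}), 1.5e-11),
    Accelerator("sparse1", frozenset({"sparse_matmul"}), 0.3e-11, busy=True),
]

if __name__ == "__main__":
    for kind in ("sparse_matmul", "conv", "fft"):
        chosen = pick_accelerator(kind, ensemble)
        print(kind, "->", chosen.name if chosen else "no accelerator available")
```

A real runtime would also have to weigh queueing delay, data location, and tenant isolation; the sketch only captures the idea of matching each binary to the most suitable device.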

Toward this vision, we are developing Composable Compute Acceleration (COCA), a framework where multiple heterogeneous chiplets are integrated into a Multichip Module (MCM). Chiplets include general-purpose cores, accelerator ASICs, and FPGA dies connected with a synthesizable chip-to-chip UCIe (Universal Chiplet Interconnect Express) interface. Reconfigurability is attained offline by combining different mixes of chiplets in different MCM instances, and online by reprogramming the chiplets based on the needs of popular workloads.

Accelerators in different nodes of the data center are harnessed together to accelerate an operation with large data or compute needs, such as data center-wide sparse tensor operations. In this case, if the operation is heavily communication-bound, developing efficient algorithms for data transfer, possibly leveraging the sparse pattern of the data, is crucial.
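To make the benefit of exploiting sparsity concrete, here is a minimal Python sketch that chooses between a dense and an (index, value) wire format based on which one needs fewer bytes. The function names and the 4-byte value and index sizes are assumptions made for illustration; this is not an ACE algorithm.

```python
def dense_bytes(n_elems, value_bytes=4):
    """Bytes needed to send every element, zeros included."""
    return n_elems * value_bytes


def sparse_bytes(n_nonzero, value_bytes=4, index_bytes=4):
    """Bytes needed to send only the nonzeros as (index, value) pairs."""
    return n_nonzero * (value_bytes + index_bytes)


def encode(values, value_bytes=4, index_bytes=4):
    """Pick the cheaper wire format for a 1-D list of numbers."""
    nonzeros = [(i, v) for i, v in enumerate(values) if v != 0]
    if sparse_bytes(len(nonzeros), value_bytes, index_bytes) < dense_bytes(len(values), value_bytes):
        return ("sparse", len(values), nonzeros)
    return ("dense", len(values), list(values))


def decode(message):
    """Rebuild the original list from either wire format."""
    fmt, length, payload = message
    if fmt == "dense":
        return list(payload)
    out = [0] * length
    for i, v in payload:
        out[i] = v
    return out


if __name__ == "__main__":
    vec = [0, 0, 3, 0, 0, 0, 0, 5]
    msg = encode(vec)
    print(msg[0], decode(msg) == vec)
```

With equal value and index sizes, the sparse format wins only when fewer than half of the elements are nonzero, which is why the choice is made per message rather than fixed in advance.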

A new compiler infrastructure is key to this vision. We are developing an open-source unified ACE compiler stack to program diverse accelerators. The infrastructure includes specialized front-end compilers for large language models or graph neural networks that translate the code into a shared intermediate form. Then, back-end ML compilers generate code for various hardware accelerators. We believe that this infrastructure can catalyze the use of novel hardware accelerators.

We specialize CPUs for specific workload domains. For example, microservice environments execute short service requests that interact via remote procedure calls and are subject to tail latency constraints. In contrast, general-purpose processors are designed for traditional monolithic applications: they support global hardware cache coherence, incorporate microarchitectural features such as advanced prefetching that target long-running, predictable applications, and are optimized for average rather than tail latency. To address this mismatch, we propose μManycore, which casts out some of these features and is optimized for tail latency.

Communication and Coordination

A striking feature of modern data centers is that hardware remains highly underutilized. This is a major source of energy waste. Moreover, the software infrastructure that enables data center operation contains major inefficiencies resulting from the desire to remain general-purpose. These are some of the aspects that ACE is addressing. Our vision includes reconfigurable network topologies to enable efficient use of the resources, a nimble runtime that bundles computation in small buckets and ships them where the data is, flexible networking stacks specialized to the accelerators available in the data center, and computing in network switches and SmartNICs to efficiently offload processor tasks.

A contributor to hardware underutilization is the inflexibility of the data center network. Different workloads exhibit different communication patterns and, in some cases, the patterns are clear and periodic—such as in AI training. Yet the inter-node links and their bandwidth are fixed, which is suboptimal. In ACE, we dynamically reconfigure optical interconnects to adjust network topology and link bandwidth based on the workload. Moreover, we have developed LIBRA to perform design space exploration of networks for distributed AI training. LIBRA recommends the topology and bandwidth allocation for each level of the network.
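The kind of trade-off involved in allocating bandwidth per network level can be seen in a toy model. The Python sketch below assumes that the communication time at each level is traffic divided by bandwidth, that the times add up, and that bandwidth at each level has a fixed cost per unit under a total budget. Under these assumptions, the allocation that minimizes total time gives each level bandwidth proportional to sqrt(traffic / cost). This is an illustrative simplification and not LIBRA's actual model; the function names and numbers are hypothetical.

```python
import math


def allocate_bandwidth(traffic, cost_per_unit, budget):
    """Split `budget` across network levels to minimize sum(traffic[i] / bw[i])
    subject to sum(cost_per_unit[i] * bw[i]) == budget.
    Assumes every traffic and cost entry is positive."""
    weights = [math.sqrt(t * c) for t, c in zip(traffic, cost_per_unit)]
    scale = budget / sum(weights)
    return [math.sqrt(t / c) * scale for t, c in zip(traffic, cost_per_unit)]


def transfer_time(traffic, bw):
    """Total communication time under the additive toy model."""
    return sum(t / b for t, b in zip(traffic, bw))


if __name__ == "__main__":
    traffic = [400.0, 100.0]   # hypothetical bytes per step at two network levels
    cost = [1.0, 1.0]          # hypothetical cost per unit of bandwidth
    tuned = allocate_bandwidth(traffic, cost, budget=300.0)
    equal = [150.0, 150.0]
    print("tuned:", tuned, transfer_time(traffic, tuned))
    print("equal:", equal, transfer_time(traffic, equal))
```

In this example the tuned split spends the same budget as the equal split but gives more bandwidth to the level that carries more traffic, which lowers the total time.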

Current networking stacks are general-purpose, even though important workloads or hardware devices may not need many of the features. Moreover, they use the kernel for secure operation, which further adds to the execution overhead. Bypassing the kernel, e.g., with RDMA, results in fast but insecure communication. To address these issues, we propose using eBPF (Extended Berkeley Packet Filter) and customizing the network stack to particular uses. The result is fast and secure operation.

Even with the most advanced hardware and leanest communication stacks, performance will lag if accelerators are often idle because the scheduler fails to assign work to them. Similarly, energy savings will not be realized if accelerators often operate on remote data, as energy for data movement will remain dominant. To tackle these challenges, we propose a runtime that bundles computation in small buckets called Proclets that are easy to schedule and migrate. Proclets enable load rebalancing and consolidation across compute units. Further, by migrating computation, they minimize the need to move data.
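The intuition behind moving computation instead of data can be sketched with a simple cost comparison. The Python code below picks the node on which a small unit of computation should run so that the fewest bytes cross the network, counting the cost of migrating the unit's own state. It is a hypothetical illustration and not the Proclet API; the function names, node names, and sizes are assumptions.

```python
def bytes_moved(task_node, data_bytes_by_node):
    """Bytes that must cross the network if the task runs on `task_node`."""
    return sum(size for node, size in data_bytes_by_node.items() if node != task_node)


def best_node(data_bytes_by_node, state_bytes, current_node):
    """Pick the node that minimizes total traffic, including the cost of
    migrating the task's own state away from `current_node`."""
    def cost(node):
        migration = 0 if node == current_node else state_bytes
        return migration + bytes_moved(node, data_bytes_by_node)

    candidates = set(data_bytes_by_node) | {current_node}
    # Sort first so that ties are broken deterministically by node name.
    return min(sorted(candidates), key=cost)


if __name__ == "__main__":
    data = {"n1": 10_000_000, "n2": 1_000}  # hypothetical input sizes per node
    print(best_node(data, state_bytes=4096, current_node="n0"))
```

When the unit's state is small relative to its inputs, migrating it next to the largest input is the cheaper option; when the state is large, staying put can win, which is one reason small units of computation are attractive.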

Computing in SmartNICs and network switches will improve the performance and energy efficiency of distributed workloads. We are examining efficient host-NIC interfaces, including a cache-coherent one (CC-NIC), and use-cases of compute offload to NICs. Computing in switches can be highly efficient in some applications, such as straggler detection and handling in AI training. To generate efficient code to run on a switch for uses such as anomaly detection or traffic classification, we propose an automated system. The user specifies high-level directives and the system automatically generates efficient ML models to run on the switches.

Security and Correctness

This part of the center focuses on conceiving new security paradigms that are more effective and easier to use than current ones. It also develops techniques for security and correctness verification from the earliest stages of hardware design. A challenge involves finding a common framework where computer designers and formal verification experts can effectively cooperate. 

As accelerators will be routinely operating on sensitive information, it is necessary to rethink how tools and mechanisms for CPU security and correctness can be redesigned for an accelerator-rich environment. Some of the current efforts include using information flow control in multi-tenant accelerators, applying automatic RTL-level instrumentation and analysis to detect security vulnerabilities in accelerators, and developing domain-specific trusted execution environments (TEEs) for accelerators. In particular, we envision an automated framework to generate customized TEEs for accelerators in a programmer-friendly manner.
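The core rule of information flow control can be shown in a few lines. In the Python sketch below, each piece of data carries a label (the set of tenants whose secrets it may depend on), and a write is allowed only if the destination's label covers the source's label. This is a generic software illustration of the idea, not the mechanism used in ACE hardware; the Buffer class and the tenant names are hypothetical.

```python
def can_flow(src_label, dst_label):
    """Data may flow only to a destination whose label includes every
    tenant named in the source label."""
    return src_label <= dst_label


def join(*labels):
    """Label of a value computed from several inputs: the union of their labels."""
    return frozenset().union(*labels)


class Buffer:
    """A hypothetical accelerator buffer tagged with a security label."""

    def __init__(self, label):
        self.label = frozenset(label)
        self.data = None

    def write(self, value, value_label):
        if not can_flow(frozenset(value_label), self.label):
            raise PermissionError("flow from %s to %s is not allowed"
                                  % (sorted(value_label), sorted(self.label)))
        self.data = value


if __name__ == "__main__":
    a = frozenset({"tenantA"})
    b = frozenset({"tenantB"})
    shared = Buffer(join(a, b))
    shared.write(42, a)          # allowed: the shared buffer covers tenantA
    private_a = Buffer(a)
    try:
        private_a.write(7, join(a, b))  # rejected: would leak tenantB's data
    except PermissionError as err:
        print("blocked:", err)
```

Hardware approaches enforce the same kind of rule statically or at the signal level, so that a violation is caught at design time rather than at run time.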

We have developed several security and correctness techniques that we hope will be useful to the community. They include TEESec for pre-silicon security verification of TEEs, Untangle for high-performance and safe dynamic partitioning of hardware structures, SpecVerilog for design verification using a security-typed hardware description language, and G-QED for quick and thorough pre-silicon verification of hardware designs. Much work still needs to be done in this area.

Concluding Remarks

This blog post has shared a research agenda that we hope will be embraced, expanded, and driven by a large section of our community. If, as a community, we manage to address the energy challenge, the rewards will be high. Incremental progress as usual is not an option.

[1] ACE Center Researchers:

Josep Torrellas (U. of Illinois), Minlan Yu (Harvard), Tarek Abdelzaher (U. of Illinois), Mohammad Alian (U. of Kansas), Adam Belay (MIT), Manya Ghobadi (MIT), Rajesh Gupta (UCSD), Christos Kozyrakis (Stanford), Tushar Krishna (GA Tech), Arvind Krishnamurthy (U. of Washington), Jose Martinez (Cornell), Charith Mendis (U. of Illinois), Subhasish Mitra (Stanford), Muhammad Shahbaz (Purdue), Edward Suh (Cornell), Steven Swanson (UCSD), Michael Taylor (U. of Washington), Radu Teodorescu (Ohio State U.), Mohit Tiwari (U. of Texas), Mengjia Yan (MIT), Zhengya Zhang (U. of Michigan), Zhiru Zhang (Cornell).

About the Author

Josep Torrellas is the Saburo Muroga Professor of Computer Science at the University of Illinois at Urbana-Champaign and the Director of the SRC/DARPA ACE Center for Evolvable Computing. He has made contributions to shared-memory multiprocessor architectures and thread-level speculation.

Disclaimer: These posts are written by individual contributors to share their thoughts on the Computer Architecture Today blog for the benefit of the community. Any views or opinions represented in this blog are personal, belong solely to the blog author and do not represent those of ACM SIGARCH or its parent organization, ACM.