Over the last decade, flagship processors from Intel and AMD have eked out only marginal generation-over-generation gains in single-threaded (ST) performance. Instead, the focus has been on boosting aggregate performance by increasing core count. While throughput performance is important, especially in data centers, single-threaded performance remains a pivotal metric for a large swath of application domains (HPC, finance, etc.). In view of this, it is worth pondering why ST performance growth has stalled and what options exist to revive it.
The plateauing of ST performance is a consequence of design constraints, primarily power. The Thermal Design Power (TDP) of an SoC in a current Ultrabook design is constrained to 15W. In an i5-class machine this translates into a per-core power budget of 2-5 Watts, depending on the number of cores. The per-core power budget is similarly constrained for servers. Server processors have a power budget of around 170W, but they also have more cores per die in order to compete on throughput. A good rule of thumb is that 60% of the 170W, or ~100W, is budgeted for cores. With 16 or 32 cores per die being a common high-performance configuration, the per-core power is 3-6 Watts, which is similar to the client budget. This power budget has remained relatively unchanged over the last decade. In parallel, the slowing of Moore’s Law means that moving to a new process node no longer translates into a significant boost in frequency and/or device density. The combination of these two factors has led to the aforementioned stagnation of ST performance. The rest of the article presents some views on whether additional single-thread performance is feasible and, if so, some methods for extracting that performance in light of potential technology changes.
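The server arithmetic above can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the function name and the assumption that the core share of TDP is split evenly across cores are ours, while the 170W, 60%, and 16/32-core figures come from the text. Using the unrounded 60% share (102W rather than ~100W) gives roughly 6.4W and 3.2W per core, in line with the 3-6 Watt range quoted.

```python
def per_core_budget_watts(tdp_watts: float, core_fraction: float, core_count: int) -> float:
    """Split the share of TDP reserved for cores evenly across all cores.

    Assumes an even split, which is a simplification for illustration.
    """
    if core_count <= 0:
        raise ValueError("core_count must be positive")
    return tdp_watts * core_fraction / core_count


SERVER_TDP_WATTS = 170.0  # server power budget quoted in the text
CORE_FRACTION = 0.60      # rule-of-thumb share of TDP budgeted for cores

for cores in (16, 32):
    watts = per_core_budget_watts(SERVER_TDP_WATTS, CORE_FRACTION, cores)
    print(f"{cores} cores: about {watts:.1f} W per core")
```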
Extracting More ILP
Over the last two decades, processor designers have extracted significant performance by leveraging architectural solutions and process innovations. Does this mean that the current generation of out-of-order (OOO) architectures is approaching fundamental limits in ST performance? The two primary sources of ST performance are frequency and instruction-level parallelism (ILP). Most CPUs use various micro-architectural techniques (branch prediction, register renaming, speculative execution, load/store dependency prediction, etc.) to extract ILP during execution. Studies have demonstrated that current-generation OOO architectures extract significantly less ILP than is available in programs. However, the micro-architectural complexity of harnessing additional ILP increases significantly as the design moves up the ILP ladder. Aggressive performance features like value prediction, memory renaming, and control flow decoupling exhibit a power/performance ratio that significantly exceeds the traditional 1:1 metric, which states that each unit of performance gained should cost at most one unit of additional power. In other words, with current CMOS technology, aggressive performance features incur a disproportionately large power cost for the performance they deliver. However, if we relax the 1:1 power/performance requirement or procure technologies that lower the power overhead, a 2-3X ST performance speed-up over the current generation of CPUs is quite feasible.
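To make the 1:1 metric concrete, here is a minimal sketch of how a candidate feature might be screened against it. The function names and the two example features are hypothetical, and the percentages are made-up placeholders rather than measurements of any real design; the only thing taken from the text is the rule itself, that each unit of performance should cost at most one unit of power.

```python
def power_perf_ratio(perf_gain_pct: float, power_cost_pct: float) -> float:
    """Percent of extra power paid per percent of performance gained."""
    if perf_gain_pct <= 0:
        raise ValueError("perf_gain_pct must be positive")
    return power_cost_pct / perf_gain_pct


def meets_one_to_one(perf_gain_pct: float, power_cost_pct: float) -> bool:
    """True when the feature costs at most 1% power per 1% performance."""
    return power_perf_ratio(perf_gain_pct, power_cost_pct) <= 1.0


# Hypothetical placeholder features: (performance gain %, power cost %).
# These numbers are illustrative only, not data from any real processor.
candidates = {
    "feature_a": (5.0, 4.0),
    "feature_b": (5.0, 15.0),
}

for name, (perf, power) in candidates.items():
    verdict = "within" if meets_one_to_one(perf, power) else "exceeds"
    print(f"{name}: ratio {power_perf_ratio(perf, power):.1f}, {verdict} the 1:1 budget")
```

Under this screen, a feature like the second placeholder would be rejected in a power-limited CMOS design, yet it could become acceptable if the 1:1 requirement were relaxed or if a lower-power technology shrank its power cost.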
When CMOS first entered the mainstream in the early 1980s, it was touted as a solution to the power problem of bipolar technology. The low power dissipation and higher density of CMOS enabled the architectural innovations of the last few decades, which led to the single-thread improvements seen in the 1990s and early 2000s. However, CMOS at best delayed the power problem by about a decade, and other solutions must be found. The question now is how to address the technology barriers to improving performance. What will replace CMOS to tame power while boosting performance? And will it fundamentally change how we architect and implement processors?
There are already niche applications that utilize other process technologies. Quantum computing, for instance, can rely on Josephson junctions not only to implement the quantum bits, or qubits, but also to implement the control systems that manage them. Qubits are sensitive to temperature, so any technology used in the quantum realm must have extremely low power dissipation. Josephson junctions meet that requirement with orders of magnitude less power per computation than traditional CMOS. However, there are architectural limitations to this technology, such as significantly lower device density, smaller memory size, and fanout limits.
There is ongoing research that examines other process technology innovations. The paper "Can Beyond-CMOS Devices Illuminate Dark Silicon" evaluates a set of device configurations and characterizes them using three main metrics: power, performance, and area. The authors examine these devices for both high and low TDP scenarios and show that although these technologies provide some relief from the current limitations, they do not sustain Moore’s Law performance-scaling trends. Other research examines modifying existing technology to extract performance. The articles "Can a transient effect rescue silicon power scaling?" and "Transistor Options Beyond 3nm" examine ways to deliver performance by re-engineering traditional CMOS transistors. These articles note that these devices hold promise but that there are still significant roadblocks to achieving performance. However, companies such as Intel believe these technologies will prove viable in the future.
As architects, we need to determine how these devices will change the way we design processors and how they help or hinder the drive for single-thread performance. In addition to implementation complexity, these new transistors come with varying characteristics in terms of area, power, and performance, which will impact design decisions. For instance, it may be more difficult to create gates with varying threshold voltages, which are regularly used in current CMOS implementations. Or, because many of these new technologies focus on reducing power and power density rather than on performance, one may have to reduce clock frequency or super-pipeline the circuits to achieve higher frequencies. Interestingly, similar drive strength arguments were made to dissuade the transition from bipolar to CMOS over three decades ago. As in that transition, these new technologies are changing the power/performance ratios used to make decisions in current CMOS technologies. For instance, for similar performance to CMOS, they show lower power and power density, potentially allowing for more complex architectural solutions for extracting ST performance. Finally, the paper "An Expanded Benchmarking of Beyond-CMOS Devices Based on Boolean and Neuromorphic Representative Circuits" makes the case that some of these new technologies may outperform CMOS on alternative computing paradigms, such as non-Boolean circuits based on cellular neural networks. These cellular neural network circuits may be used to create efficient convolutional neural networks (CNNs) used in deep-learning applications, and may fundamentally change the way we design AI processors.
We have covered a number of techniques for moving beyond the plateau of current CMOS technology. Some of these technologies, such as superconducting logic, seem radical even though they are an integral part of quantum computer design. A parallel can be drawn to out-of-order processing. The first OOO machine was developed in the mid-1960s, but OOO designs did not dominate architecture for another 30 years due to their inherent complexity. Similarly, some of the technologies being considered for niche domains now may not be commercially viable for a number of years, but someday they may be leveraged for traditional architectures or accelerators in order to move to the next level of computation.
About the Authors: Dr. Srilatha (Bobbie) Manne has worked in the computer industry for over two decades in both industrial labs and product teams. She is currently a Principal Hardware Engineer at Microsoft in the Quantum Architecture Group. Dr. Muntaquim Chowdhury is currently a part of the Quantum Architecture Group at Microsoft. Prior to this he worked at Intel for 24 years developing flagship CPUs from P6 to Haswell and 3 years at Qualcomm leading the Server Architecture Team.