Processor architecture has incorporated many research ideas from the academic community, from SMT to cache allocation policies. The question we try to answer in this article is what makes an idea take hold. What is it about a paper that makes it interesting to product teams? What is the path from academic research to product reality? To answer these questions, we cover some examples of research ideas that were distilled down to implementable solutions in industry.
Architectural research papers describe 1.) the architectural insight behind the work and 2.) an implementation that takes advantage of that insight. Both may be useful to the research community, but the value to industry lies in the architectural insight the paper offers. What matters is the observation or unique characteristic that can be exploited and incorporated into a production design. The proposed implementation is generally less useful to product teams: if the idea has merit, the design team will figure out a solution that works for the specific product. The final hardware design may not match what is proposed in the paper, and it may not achieve the full benefits described once real-world constraints are added. However, the right ideas will find their way into products.
Some examples of research ideas that have made it into production systems are out-of-order (OoO) execution; SMT, as popularized by Tullsen et al. and implemented in both AMD and Intel systems; DVFS mechanisms prevalent in both high-performance and client parts; run-ahead execution in the IBM POWER6; memory disambiguation predictors; branch predictors (TAGE, perceptron); dependency schedulers used in high-performance issue logic; QoS policies and hard partitioning of caches in multicore processors; and maybe even cache allocation policies. What differentiates these ideas from others is 1.) the general nature of the solution, i.e., it impacts a large variety of applications/customers, and 2.) the inherent validity and simplicity of the fundamental architectural concept.
Research Impacting Products
What makes a research paper valuable is the nugget of insight it provides. The original matrix scheduler paper, at first glance, looks like an implementation paper: it describes how to reduce the critical wakeup path in the scheduler loop. But the key insight in the paper, and one that is leveraged in OoO schedulers today, is a two-pass scheduling mechanism in which dependency construction is decoupled from dependent-instruction wakeup. The actual implementation is likely unique to each processor, but the insight is universal. The merit of this work is that it simplifies a very tight critical path that affects all processor execution.
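The two-pass idea can be sketched in a few lines of Python. This is a behavioral toy model, not the paper's circuit: the class and field names (`MatrixScheduler`, `producer_of`, and so on) are invented for illustration. Pass 1 builds an entry's dependency row at dispatch; pass 2 clears a producer's column at completion, waking its dependents.

```python
class MatrixScheduler:
    """Toy two-pass dependency-matrix scheduler (illustrative names only)."""

    def __init__(self, num_entries=8):
        self.n = num_entries
        # matrix[i][j] is True when entry i waits on entry j's result
        self.matrix = [[False] * num_entries for _ in range(num_entries)]
        self.valid = [False] * num_entries
        self.dest = [None] * num_entries
        self.producer_of = {}   # register name -> entry currently producing it

    def dispatch(self, entry, dest, sources):
        """Pass 1 (at dispatch): record which in-flight entries we depend on."""
        for src in sources:
            prod = self.producer_of.get(src)
            if prod is not None:                 # source value still in flight
                self.matrix[entry][prod] = True
        self.producer_of[dest] = entry
        self.dest[entry] = dest
        self.valid[entry] = True

    def complete(self, entry):
        """Pass 2 (at completion): clear the producer's column to wake dependents."""
        for row in range(self.n):
            self.matrix[row][entry] = False
        if self.producer_of.get(self.dest[entry]) == entry:
            del self.producer_of[self.dest[entry]]
        self.valid[entry] = False

    def ready(self):
        """Entries whose rows are clear have no outstanding true dependencies."""
        return [i for i in range(self.n)
                if self.valid[i] and not any(self.matrix[i])]
```

For example, dispatching `r1 = ...` into entry 0 and `r2 = f(r1)` into entry 1 leaves only entry 0 ready; once entry 0 completes, clearing its column makes entry 1 ready. The point of the decoupling is that the wakeup pass is a simple column clear plus a row-is-zero check, keeping the critical loop tight.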
SMT has found its way into many processors today. The key insight of SMT is that execution resources can be shared across multiple threads to achieve higher utilization of the hardware. The ideas popularized in the Tullsen papers were simple and elegant, and they made a strong case for dynamically shared resources. However, dynamically sharing resources is a verification nightmare in real systems: there are numerous corner cases that can deadlock the system depending on how and which resources are shared. In addition, security is a deep concern with resources shared under SMT. For these reasons, SMT has had broad impact, but the actual implementations likely differ from those proposed in the original papers.
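The utilization argument can be illustrated with a toy model, which is our own sketch and not from the Tullsen papers; the stall probability and function name are assumed for illustration. Each thread independently stalls (say, on a cache miss) with some probability per cycle, and a cycle is wasted only if every thread stalls at once, so adding a second thread fills slots the first leaves idle.

```python
import random

def utilization(num_threads, cycles=10_000, stall_prob=0.4, seed=0):
    """Fraction of cycles in which at least one thread can issue.
    Each thread independently stalls with probability stall_prob per
    cycle; all numbers here are illustrative, not measured."""
    rng = random.Random(seed)
    busy = 0
    for _ in range(cycles):
        # the cycle is useful if any thread has work it can issue
        if any(rng.random() >= stall_prob for _ in range(num_threads)):
            busy += 1
    return busy / cycles
```

With a 40% per-thread stall rate, one thread keeps the machine busy about 60% of the time, while two threads together approach 1 - 0.4^2 = 84%. The model also hints at the downside the paragraph above describes: the threads' occupancy of shared structures is now entangled, which is exactly what makes verification and isolation hard.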
Sometimes an idea has merit, but its associated complexity might require a combination of fundamental developments and desperation to make it into products. One such idea is out-of-order execution. When first proposed in the 1960s, the idea was too complex to justify its cost. Thirty years later, however, process technology improvements, combined with an ever-increasing imbalance between processor and memory speeds, tipped the scales enough to justify the complexity. There are numerous varieties of OoO processor implementations that do not reflect the original Tomasulo implementation. However, the fundamental idea that only true dependencies matter is the common thread throughout.
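That fundamental idea can be shown with a minimal register-renaming sketch, again a hypothetical illustration rather than any real design: once every write is given a fresh physical register, the false write-after-read and write-after-write hazards disappear, and only true read-after-write dependencies constrain execution order.

```python
def rename(instructions):
    """instructions: list of (dest, [sources]) over architectural registers.
    Returns the same program over fresh physical registers (toy model)."""
    rat = {}            # register alias table: arch reg -> latest phys reg
    next_phys = 0
    renamed = []
    for dest, srcs in instructions:
        # read sources through the current mapping: true dependencies only
        phys_srcs = [rat.get(s, s) for s in srcs]
        # every write gets a fresh physical register, killing WAR/WAW hazards
        rat[dest] = f"p{next_phys}"
        next_phys += 1
        renamed.append((rat[dest], phys_srcs))
    return renamed

prog = [("r1", ["r2"]),   # i0: r1 = f(r2)
        ("r3", ["r1"]),   # i1: true RAW dependency on i0's r1
        ("r1", ["r4"])]   # i2: WAW on r1 with i0 -- a false dependency
```

After renaming, `prog` becomes `[("p0", ["r2"]), ("p1", ["p0"]), ("p2", ["r4"])]`: i2 no longer conflicts with i0 at all and can execute out of order, while i1's true dependency on `p0` survives. This is the common thread running from Tomasulo's reservation stations to modern physical register files, even though the mechanisms differ.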
The value of OoO execution is well understood. That said, nothing about an out-of-order design is easy. Verifying the design, especially with complex ISAs like x86, is an exceptionally difficult problem. Add features such as multi-core design, virtualization, and dynamic power management, and the result is systems with problems that surface months, if not years, down the road, as evidenced by Intel and AMD errata. However, the performance benefits are well worth the cost, even with the latest set of issues surrounding security holes and side-channel attacks.
Similarly, many ideas have sat on the research back-burner, so to speak, and may now find their way into products: neither technology nor processors are scaling to produce the performance improvements seen in decades past, and baseline microarchitectures, workloads, and underlying technologies have evolved since the research was originally proposed. For instance, ideas such as non-linear fetch, speculative threading, value prediction, and compiler hints may need to be re-examined as mechanisms to improve single-thread performance.
Value of Research & Recommendations
Processor design moves at a rapid pace, and the research produced by academia and industry is invaluable. As noted, it has had significant impact on existing designs. However, as also noted, much of the value is in the concept and insight the authors provide. The best papers are ones that can be summarized in a few sentences and which, in retrospect, seem like obvious solutions.
Tier-1 conferences are being inundated with an increasing number of papers, burdening large program committees (PCs) with a heavy review load. ASPLOS 2020, for example, received 479 submissions, a 37% increase over 2019 and a 58% increase over 2018. The multi-stage review process helps reduce the overhead, but each PC member still reviews twenty or more papers on average. One way to reduce the burden on reviewers, in our opinion, is to reduce the length of the papers. As noted in this article, the architectural concept and insight are what matter, and these can be expressed in a shorter form for review. Longer versions of a paper can be placed on arXiv or published in journals. Given the rapid pace of innovation in our field, it is essential that we expose the best ideas to the community as soon as possible in a sustainable manner. Based on our experience, we argue that shorter papers and a less burdensome review process for reviewers and authors would increase the pace of innovation and impact.
About the Authors: Srilatha (Bobbie) Manne has worked in the computer industry for over two decades in both industrial labs and product teams. She is currently a member of the AI & Advanced Architectures team at Microsoft. Muntaquim Chowdhury is currently a part of the AI & Advanced Architectures team at Microsoft. Prior to this he worked at Intel for 24 years developing flagship CPUs from P6 to Haswell and 3 years at Qualcomm leading the Server Architecture Team. The authors would also like to thank Rob Chappell from Microsoft for his input and feedback on the article.
Disclaimer: These posts are written by individual contributors to share their thoughts on the Computer Architecture Today blog for the benefit of the community. Any views or opinions represented in this blog are personal, belong solely to the blog author and do not represent those of ACM SIGARCH or its parent organization, ACM.