With Dennard scaling discontinued, application-specific hardware accelerators have become ubiquitous in modern computers, offering more efficient task processing. Famous examples include Google’s Tensor Processing Units (TPUs) and Apple’s Neural Engines for artificial intelligence (AI) and machine learning (ML) workloads, as well as NVIDIA’s RT Cores for ray tracing operations. Even general-purpose processors now contain Gaussian & Neural Accelerators (GNAs), as in Intel’s 12th-generation processors.
Two decades ago, Graphics Processing Units (GPUs) were merely accelerators for rendering graphics. With appropriate architectural support and programming platforms, however, GPUs have become the most intensively used general-purpose vector processors. The revolution of general-purpose computing on graphics processing units (GPGPU) accelerated existing applications and sparked the renaissance of neural networks (NNs). In the era of hardware accelerators, it is intriguing to ask: “Are there any types of modern accelerators with the potential to lead the next revolution of general-purpose computing?”
Several research projects have investigated this potential on commercialized hardware accelerators.
Tensor Processing Units (TPUs)
TPUs are Google’s proprietary accelerators, first deployed in Google’s cloud servers in 2016, with more details released in their ISCA 2017 paper. Cloud TPUs are now accessible through Google’s cloud services. Besides the four generations of cloud TPUs, Google also offers edge versions (i.e., edge TPUs) and embeds TPUs into mobile phone processors (i.e., the Pixel Neural Core). Each TPU version supports a set of matrix-based operations (e.g., 2D convolution, fully connected layers, ReLU) that AI/ML workloads frequently use.
Existing research projects have demonstrated the use of cloud TPUs in accelerating relational database operations, Fast Fourier Transforms [Lu ISBI’21, Lu HPEC’20], and physics/quantum simulations [Morningstar, Pederson]. The GPTPU project uses multiple edge TPUs with a customized programming platform to accelerate matrix-based compute kernels, including general matrix multiplication, PageRank, and Gaussian elimination.
Tensor Cores
Tensor Cores are NVIDIA’s response to Google’s TPUs for accelerating AI/ML workloads. Tensor Cores only support matrix multiply-accumulate (MMA) operations for now. The programmer can directly access Tensor Cores’ MMA functions through the low-level interface exposed by the CUDA programming platform or indirectly use Tensor Cores through highly optimized library functions (e.g., cuBLAS). Existing research projects have successfully demonstrated the performance gain of using Tensor Cores to accelerate relational database operations [Dakkak ICS’19, TCUDB], Fast Fourier Transforms [tcFFT, Durrani PACT’21], and cryptography.
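The numerical semantics of a single Tensor Core MMA operation can be sketched in plain numpy: 16-bit inputs, a matrix multiply, and 32-bit accumulation. The 16×16 tile size below mirrors the m16n16k16 fragment shape exposed by CUDA’s WMMA API; the function name and the pure-Python setting are illustrative, not NVIDIA’s interface.

```python
import numpy as np

# Minimal sketch of one Tensor Core MMA tile operation, D = A x B + C:
# fp16 inputs, fp32 accumulation (as Tensor Cores perform internally).
def mma_16x16(a_fp16: np.ndarray, b_fp16: np.ndarray, c_fp32: np.ndarray) -> np.ndarray:
    assert a_fp16.dtype == np.float16 and b_fp16.dtype == np.float16
    # Products and partial sums are kept in float32, not float16.
    return a_fp16.astype(np.float32) @ b_fp16.astype(np.float32) + c_fp32

rng = np.random.default_rng(0)
a = rng.standard_normal((16, 16)).astype(np.float16)
b = rng.standard_normal((16, 16)).astype(np.float16)
c = np.zeros((16, 16), dtype=np.float32)
d = mma_16x16(a, b, c)
```

Larger matrix multiplications are decomposed into many such tile-level MMAs, which is exactly what libraries like cuBLAS orchestrate on the programmer’s behalf.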
Ray Tracing Hardware
As the demand for more realistic graphics rendering grows, hardware accelerators for ray tracing algorithms have started to gain ground in modern GPU architectures. For example, NVIDIA has integrated RT Cores since the introduction of the Turing architecture. In contrast to the embarrassingly parallel, highly regular pixel-based rendering algorithms, ray tracing models light transport effects that are irregular and more complex to compute. RT Cores accelerate the bounding volume hierarchy (BVH) tree traversal process that tests whether a ray intersects with a bounding volume/object during ray tracing. Research projects have demonstrated the use of RT Cores in accelerating nearest neighbor search problems, Monte Carlo simulations, and the 3D mapping problem in autonomous robotic navigational tasks.
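The BVH traversal that RT Cores accelerate bottoms out in ray/bounding-volume intersection tests at every internal tree node. A minimal sketch of one such test, the classic “slab” method for an axis-aligned bounding box (the function and variable names are illustrative, not NVIDIA’s interface):

```python
import numpy as np

# Slab test: a ray hits an axis-aligned box iff the interval where it is
# inside all three coordinate slabs is non-empty and not entirely behind it.
def ray_hits_aabb(origin, direction, box_min, box_max):
    inv_d = 1.0 / direction                  # assumes no zero components
    t0 = (box_min - origin) * inv_d
    t1 = (box_max - origin) * inv_d
    t_near = np.minimum(t0, t1).max()        # latest entry across the 3 slabs
    t_far = np.maximum(t0, t1).min()         # earliest exit
    return bool(t_near <= t_far and t_far >= 0.0)

origin = np.array([0.0, 0.0, -5.0])
direction = np.array([0.2, 0.1, 1.0])        # roughly toward +z
hit = ray_hits_aabb(origin, direction,
                    np.array([-1.0, -1.0, -1.0]), np.array([1.0, 1.0, 1.0]))
```

Applications that repurpose RT Cores (e.g., for nearest neighbor search) encode their queries as rays and their data as bounding volumes so that the hardware performs millions of such tests per traversal.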
Though research projects reveal the strong potential for performance gains or better energy efficiency with the accelerators mentioned above, “efficiently” using these accelerators in applications is challenging for the following reasons.
Limited accelerator functions
Since these accelerators are application-specific by design, it is natural that they provide just enough operations/features, or optimize operations, for their target applications. Therefore, programmers must spend significant effort reengineering an algorithm to use these domain-specific functions. For example, both the GPTPU project (for matrix multiplication) and the FFT-acceleration projects revamped the original algorithms around the convolution operation, the most efficient operation the TPU architecture provides.
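This kind of reshaping can be illustrated in miniature: a general matrix multiplication can be recast so that each output column is one “valid”-mode correlation of the left operand with a 1×K kernel built from a column of the right operand. The numpy sketch below demonstrates the mapping; it is not GPTPU’s actual implementation.

```python
import numpy as np

# Recast C = A @ B as a set of "valid" 1xK cross-correlations: column j of C
# is A correlated with a 1xK kernel formed from column j of B.
def matmul_via_conv(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.empty((M, N))
    for j in range(N):
        kernel = B[:, j].reshape(1, K)       # one 1xK convolution kernel per column
        # Valid-mode correlation: each row of A reduces to a single dot product.
        C[:, j] = (A * kernel).sum(axis=1)
    return C

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 5))
B = rng.standard_normal((5, 3))
C = matmul_via_conv(A, B)
```

On real hardware, this rewrite lets the TPU’s heavily optimized convolution datapath do the work, at the cost of restructuring the algorithm around an operation it was never phrased in.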
The overhead of applying accelerators’ operations
Making compute kernels fit application-specific operations is typically not side-effect free. First, because existing memory and storage abstractions use linear addressing, datasets are generally serialized. Restoring datasets into matrices or tensors requires additional memory operations and memory space during the conversion process. Even when datasets are already in tensor or matrix formats, the application may still need to rearrange the data layout to fit the data formats of application-specific operators. For example, one of the inputs to an edge TPU must be encoded as a TFLite model, which requires re-evaluating and re-quantizing the original matrix values.
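The re-quantization step alone illustrates the overhead: every value must be rescanned, rescaled, and rounded. Below is a simplified sketch of affine uint8 quantization in the spirit of TFLite’s scale/zero-point scheme; the exact edge TPU model encoding differs and the function names are illustrative.

```python
import numpy as np

# Affine 8-bit quantization sketch: map real values to uint8 with a scale and
# zero point. Both directions require full passes over the data.
def quantize_uint8(x: np.ndarray):
    lo, hi = float(x.min()), float(x.max())
    scale = (hi - lo) / 255.0 if hi > lo else 1.0
    zero_point = int(round(-lo / scale))
    q = np.clip(np.round(x / scale) + zero_point, 0, 255).astype(np.uint8)
    return q, scale, zero_point

def dequantize(q: np.ndarray, scale: float, zero_point: int) -> np.ndarray:
    return (q.astype(np.float32) - zero_point) * scale

x = np.linspace(-1.0, 1.0, 16, dtype=np.float32).reshape(4, 4)
q, scale, zp = quantize_uint8(x)
x_hat = dequantize(q, scale, zp)   # lossy round trip: error bounded by ~scale
```

Beyond the extra memory traffic, the round trip is lossy: the application inherits a quantization error on every value it offloads.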
As the philosophy of accelerators is to provide just-enough hardware support for the target application, these accelerators do not accommodate the precision/accuracy demands of other applications. For example, edge TPUs only support 8-bit operations, and both Tensor Cores and TPUs natively support at most 16-bit inputs. The limited precision puts constraints on the value ranges of input datasets. If the application demands highly accurate results, the software must pay additional overhead, and the performance may still be sub-optimal.
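One way to see that overhead concretely is the precision-recovery trick used by several Tensor Core projects: split each fp32 operand into a fp16 “high” part plus a fp16 residual, then pay for three low-precision products instead of one to approximate the fp32 result. A numpy sketch of the idea (not any project’s actual code):

```python
import numpy as np

# Split each fp32 value into a fp16 approximation plus a fp16 residual.
def split_fp32(x: np.ndarray):
    hi = x.astype(np.float16)
    lo = (x - hi.astype(np.float32)).astype(np.float16)
    return hi, lo

# Approximate the fp32 product A @ B using only fp16 multiplies with fp32
# accumulation; the tiny lo @ lo term is dropped.
def matmul_split(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    a_hi, a_lo = split_fp32(A)
    b_hi, b_lo = split_fp32(B)
    f32 = np.float32
    return (a_hi.astype(f32) @ b_hi.astype(f32)
            + a_hi.astype(f32) @ b_lo.astype(f32)
            + a_lo.astype(f32) @ b_hi.astype(f32))

rng = np.random.default_rng(1)
A = rng.standard_normal((32, 32)).astype(np.float32)
B = rng.standard_normal((32, 32)).astype(np.float32)
```

The recovered accuracy costs roughly 3x the MMA work plus the splitting passes, which is precisely the “additional overhead” the limited-precision hardware forces on the software.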
Opacity of accelerator architectures
Finally, accelerators are proprietary to their vendors, with minimal architectural details or low-level programming interfaces exposed. For example, cloud TPUs are only accessible through Google’s cloud service and programmable through closed-source versions of TensorFlow or PyTorch. The lack of low-level details creates hurdles in optimizing performance and in extending the spectrum of these accelerators’ applications. For example, the closed-source edge TPU compiler forced the GPTPU project to reverse engineer the encoded TFLite model format in order to transform arbitrary matrices into TFLite models.
Future Perspective and Research Opportunities
With research projects showing the possibilities of using modern hardware accelerators beyond their original applications, there is clearly a research avenue toward generalized hardware accelerators.
More applications and algorithms
While existing research projects demonstrate quite a few examples of leveraging hardware accelerators in applications, there is still a lot to explore in this direction. We can certainly revisit the design of algorithms that are currently “in use” and optimize them as prior research projects did. In addition, there could be opportunities in making algorithms that are rarely used today, due to the curse of dimensionality, feasible and efficient.
Extension of hardware features
Architects can leverage existing hardware accelerators’ foundations and design minor extensions to make more applications feasible and efficient. For example, dynamic programming problems share the same computation pattern as MMAs (i.e., semiring structures) if we replace the multiplications with additions and the accumulations with minimum or maximum operations. Likewise, ray-tracing hardware can serve as a general-purpose tree or graph traversal engine if we can customize bounding box shapes. A precursor of this shift is EGEMM-TC, and hardware manufacturers have started making Tensor Cores support higher precisions beyond the original demands of AI/ML workloads.
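The semiring observation can be made concrete: substituting add for multiply and min for the accumulation in the MMA dataflow yields one relaxation step of all-pairs shortest paths (the min-plus, or tropical, semiring). A small numpy sketch of a min-plus MMA over an illustrative 3-node graph:

```python
import numpy as np

# Min-plus "matrix multiply": same dataflow as C = A @ B, but with
# D[i, j] = min_k (D[i, k] + W[k, j]) instead of sum_k (A[i, k] * B[k, j]).
def minplus_mma(D: np.ndarray, W: np.ndarray) -> np.ndarray:
    return np.min(D[:, :, None] + W[None, :, :], axis=1)

INF = np.inf
W = np.array([[0.0, 3.0, INF],     # edge weights of a tiny directed graph
              [INF, 0.0, 1.0],
              [2.0, INF, 0.0]])
D = W.copy()
for _ in range(2):                 # repeated "squaring" converges to APSP
    D = minplus_mma(D, W)
# D[0, 2] == 4.0: the shortest path 0 -> 1 -> 2 (cost 3 + 1)
```

An MMA unit extended with selectable semiring operators could run this with the same systolic dataflow it already implements for AI/ML matrix multiplication.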
Programming framework and hardware/software interface
In the era of general-purpose computing on conventional processors and GPUs, processors abstract hardware at the instruction level. This relatively slowly evolving hardware/software interface allows the programming interface, mainly the language and the API, to stay unchanged while programs automatically enjoy the benefits of new hardware features through recompilation or linking against new libraries.
In contrast, most modern hardware accelerators expose features only through domain-specific languages (DSLs) without revealing low-level details. As a result, it is now the programmers’ responsibility to stay “always” aware of the evolution of the underlying hardware and the corresponding changes to the exposed programming interface. Programmers also need to invest in using these DSLs or APIs intelligently to compose the desired algorithms. Consistent with insights from a prior post in Computer Architecture Today, some “new abstraction/contract” must be present to facilitate program development and to ensure that the investment in programming lasts long enough to pay off the effort of adopting new hardware accelerators.
In addition to making computing models “heterogeneous,” emerging hardware accelerators also complicate the dimensionalities and precisions of datasets. For example, because most CPU instructions compute on pairs of 64-bit numbers, GPUs work best on pairs of 32-bit vectors, and AI/ML accelerators operate on pairs of 8- or 16-bit matrices, a matrix may need to be present in several memory layouts and precisions within a single application. The conventional memory abstraction and architecture can lead to significant computation overhead and wasted bandwidth in locating data and casting types as the dimensionality of the target hardware accelerator grows. We desperately need an innovative memory abstraction that can easily match the data layouts of various computing units with minimal storage and transformation overhead – probably extensions to research projects like GS-DRAM, RC-NVM, and NDS.
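The layout and precision churn described above can be illustrated by the hand-off a single matrix goes through today: a flat, row-major fp64 array on the CPU side must be retiled and down-cast before a matrix accelerator can consume it. The 8×8 fp16 tile size below is an assumption for illustration only.

```python
import numpy as np

# Retile a row-major fp64 matrix into contiguous fp16 tiles: one extra copy
# and one type cast per hand-off between computing units.
def to_tiles(x_fp64: np.ndarray, tile: int = 8) -> np.ndarray:
    m, n = x_fp64.shape
    assert m % tile == 0 and n % tile == 0
    # (m, n) -> (m//tile, n//tile, tile, tile)
    t = x_fp64.reshape(m // tile, tile, n // tile, tile).swapaxes(1, 2)
    return np.ascontiguousarray(t, dtype=np.float16)

x = np.arange(256, dtype=np.float64).reshape(16, 16)
tiles = to_tiles(x)        # 2x2 grid of 8x8 fp16 tiles
```

Every such conversion consumes bandwidth and transient memory that a dimensionality-aware memory abstraction could avoid by serving each computing unit the layout it natively expects.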
We are now in a “new golden age for computer architecture”, where there are many emerging application-specific hardware accelerators. It is important that we keep inventing the most efficient hardware for critical workloads. Simultaneously, we should also keep our minds open to other opportunities implied and enabled by our innovations in hardware accelerators.
About the Author: Hung-Wei Tseng is an Assistant Professor in the Department of Electrical and Computer Engineering at the University of California, Riverside. Hung-Wei’s research group focuses on applications, system infrastructures, data storage, and the architecture of heterogeneous computers and hardware accelerators.
Disclaimer: These posts are written by individual contributors to share their thoughts on the Computer Architecture Today blog for the benefit of the community. Any views or opinions represented in this blog are personal, belong solely to the blog author and do not represent those of ACM SIGARCH or its parent organization, ACM.