This post is a follow-up to the recent post on the SIGARCH blog by Profs. Reetuparna Das and Tushar Krishna on architectures for deep neural networks (DNNs) (e.g., convolutional neural networks (CNNs), long short-term memories (LSTMs), multi-layer perceptrons (MLPs), and recurrent neural networks (RNNs)). While that post focuses on SIMD versus systolic architectures, it largely excludes the SIMT-based GPGPUs. Given that GPGPUs are the most prominent platforms today for DNN workloads, we focus on comparing the GPGPU and the TPU. We submit that GPGPUs have a fundamental disadvantage relative to the TPU for these workloads. Further, we argue that the systolic array (first proposed by IBM for DNNs) is a compute unit organization designed specifically to capture far more reuse than the GPGPU.
We re-affirm the previous blog post’s disclaimer that this post is an academic analysis and not a product endorsement.
(1) GPGPU’s multithreading overhead: It is not our intention to lessen GPGPUs’ huge contribution to ML’s recent success (e.g., AlexNet burst onto the ML scene by showing huge improvements in accuracy thanks in no small part to GPGPUs used for training and inference). Overhead or not, until other options came along there were no alternatives that could run ML workloads as fast as GPGPUs.
GPUs were originally built for graphics workloads (before the GPGPU era). These workloads have immense, fine-grain, regular parallelism (hence single-instruction, multiple-thread (SIMT) execution) but also unpredictable cache misses that cannot be prefetched and hence expose memory latency (texture cache misses are key examples). Many other accesses are predictable and hence are prefetched (e.g., vertex data). To combat the unpredictable misses, GPUs employ massive multithreading with numerous contexts to tolerate long memory latencies (e.g., one Streaming Multiprocessor (SM) can support 50 32-thread contexts to tolerate 500-cycle or longer memory latency whereas a CPU core typically supports 2-4 single-thread contexts).
In graphics workloads, the contexts have a high degree of fine-grain data sharing (otherwise adding more contexts would increase cache misses countering any latency tolerance). In contrast, typical server workloads have less sharing and demand at least around a MB of cache capacity per thread. Of course, the downside of the multithreading is the massive register file needed to hold the register state of all the contexts (e.g., 256 KB per SM, 8 MB for 32 SMs, 400 M transistors per die). While the latency of the register accesses can be tolerated as long as there is enough register bandwidth (multithreading tolerates the latency caused by itself!), there are opportunity, area and energy costs.
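The register-file figures above can be checked with a few lines of arithmetic. This is a rough sketch: the 6-transistor SRAM cell is our assumption (the post does not state how the transistor count was derived), and the function name is only illustrative.

```python
KB = 1024

def register_file_cost(kb_per_sm=256, num_sms=32, transistors_per_bit=6):
    """Return (total bytes, transistor count) for the register files on one die.

    transistors_per_bit=6 assumes a standard 6T SRAM cell (an assumption,
    not a figure from the post); peripheral circuitry is ignored.
    """
    total_bytes = kb_per_sm * KB * num_sms
    transistors = total_bytes * 8 * transistors_per_bit
    return total_bytes, transistors

total_bytes, transistors = register_file_cost()
print(total_bytes // (KB * KB), "MB of register file per die")
print(round(transistors / 1e6), "M transistors (cells only)")
```

Under these assumptions the result lands close to the 8 MB and roughly 400 M transistors quoted above.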
In contrast to graphics texture accesses, DNN models have simple, sequential accesses which are readily amenable to prefetching. Instead of multithreading, simple prefetching, where the compute units work on the current data as the next data is being prefetched, suffices to hide all the memory latency. The prefetch buffer can be much smaller than the multithreaded register file (e.g., 1-2 KB versus 256 KB) and hence is more efficient in terms of energy and area (which is better used for more multiply-accumulate (MAC) units in the TPU to meet DNNs’ voracious compute demand). Note that the TPU uses a large on-chip SRAM not to hide the latency but to avoid going to memory at all (prefetching or multithreading cannot avoid memory transfers but can only hide/tolerate the latency). Further, GPGPUs’ heavy multithreading also destroys DRAM row locality which simple prefetch preserves, improving memory bandwidth and energy.
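The idea of computing on the current data while the next data is prefetched can be illustrated with a small timing model. This is a minimal sketch, assuming two equal-sized buffers, a fixed fetch time and a fixed compute time per tile; the cycle counts in the example are hypothetical and do not describe any real chip.

```python
def serial_time(num_tiles, fetch_cycles, compute_cycles):
    """No overlap: every tile is fetched, then computed."""
    return num_tiles * (fetch_cycles + compute_cycles)

def double_buffered_time(num_tiles, fetch_cycles, compute_cycles):
    """Two buffers: tile i+1 is fetched while tile i is being computed."""
    fetch_end = 0          # when the most recent fetch finished
    compute_end = 0        # when the most recent compute finished
    prev_compute_end = 0   # compute end of the tile before that (frees a buffer)
    for _ in range(num_tiles):
        # A fetch needs the memory port and a free buffer.
        fetch_start = max(fetch_end, prev_compute_end)
        fetch_end = fetch_start + fetch_cycles
        # Compute needs its data and a free compute unit.
        compute_start = max(fetch_end, compute_end)
        prev_compute_end = compute_end
        compute_end = compute_start + compute_cycles
    return compute_end

# Hypothetical numbers: 100 tiles, 500-cycle fetch, 600-cycle compute.
print(serial_time(100, 500, 600), double_buffered_time(100, 500, 600))
```

When compute time per tile is at least the fetch time, only the first fetch is exposed; the remaining latency is hidden with just two tile-sized buffers rather than dozens of thread contexts.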
(2) Data reuse in systolic array: DNNs have enormous data reuse within each layer: each input feature map cell is reused with all the filters and each filter cell is reused for each output feature map cell of the corresponding channel. DNNs also have immense compute demand. Capturing the reuse means bringing the feature maps and filters once from memory (or an on-chip SRAM cache) and ensuring that all the data is reused fully. Note that reuse here means avoiding trips not only to off-chip memory but also to a large on-chip cache, by holding the data within the compute units. Further, capturing this reuse requires commensurately numerous compute units that can concurrently consume the reused data (in a parallel or pipelined manner). Numerous, concurrent consumers are key to avoiding the large buffers that would otherwise be needed to capture reuse spread over time. Assuming typical DNN model parameters, these requirements point to 16K-64K MAC units. While the MAC unit count should not exceed the typical amount of computation (else there will be underutilization), the count should not undershoot that amount either (otherwise, there will be many refetches or large near-MAC buffering of the reused data).
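To make the amount of within-layer reuse concrete, the sketch below counts MACs and reuse factors for one convolutional layer. It assumes a stride-1, unpadded convolution, and the layer dimensions in the example are hypothetical, chosen only for illustration.

```python
def conv_reuse(C, K, H, W, R, S):
    """Reuse in a stride-1, unpadded conv layer.

    C input channels, K filters (each C x R x S), H x W input feature maps.
    """
    H_out, W_out = H - R + 1, W - S + 1
    macs = K * C * R * S * H_out * W_out
    input_cells = C * H * W
    filter_cells = K * C * R * S
    return {
        "macs": macs,
        # Each filter cell is used once per output cell of its channel.
        "filter_reuse": macs // filter_cells,
        # Each input cell is used by all K filters at up to R*S positions
        # (fewer at the borders), so this is an average.
        "avg_input_reuse": macs / input_cells,
    }

# Hypothetical layer: 64 -> 128 channels, 56x56 inputs, 3x3 filters.
stats = conv_reuse(C=64, K=128, H=56, W=56, R=3, S=3)
print(stats)
```

For such a layer, every filter cell is reused thousands of times and every input cell roughly a thousand times, which is why the number of concurrently consuming MAC units matters so much.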
The concurrent consumption and fine-grained sharing suggest structured communication among the MAC units. Fortunately, DNNs’ regular, fine-grain parallelism is amenable to lock-step and pipelined computation and communication, giving rise to a systolic architecture, as originally proposed by IBM and followed by the TPU (the IBM design holds the partial sum in the MAC units and systolic-broadcasts the filters whereas the TPU does the reverse). However, such communication is not unique to systolic arrays: a GPGPU’s SM can also broadcast a scalar value to all its 128 lanes to capture 128-way reuse. That the GPGPU uses a physical bus for broadcast whereas the TPU uses a systolic, hop-by-hop, pipelined broadcast does not change the achieved reuse. Nevertheless, even if we remove multithreading from the GPGPU, it will need more buffering than the TPU because the GPGPU does not pipeline the partial-sum accumulation over several MAC units as the TPU does. The TPU holds only one byte each of the filter and temporary result per MAC whereas the GPU holds many bytes (e.g., 128) of the filter due to the lack of pipelining and one byte of accumulated result per MAC.
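The hop-by-hop pipelining of partial sums can be sketched with a small cycle-level model. This is a simplified illustration of a filter-stationary array in the spirit of the description above, not the TPU's actual microarchitecture: the skewed input schedule, the per-unit input register, and the function name are our assumptions.

```python
def systolic_matmul(X, W):
    """Compute X @ W on a K x N grid of MAC units, one step per cycle.

    Unit (k, n) permanently holds W[k][n]. Inputs hop left to right along
    each row; partial sums hop top to bottom along each column.
    """
    M, K, N = len(X), len(W), len(W[0])
    a = [[0] * N for _ in range(K)]   # input register in each unit
    p = [[0] * N for _ in range(K)]   # partial-sum register in each unit
    Y = [[0] * N for _ in range(M)]
    for t in range(M + K + N - 2):
        new_a = [[0] * N for _ in range(K)]
        new_p = [[0] * N for _ in range(K)]
        for k in range(K):
            for n in range(N):
                if n > 0:
                    a_in = a[k][n - 1]              # from the left neighbor
                else:
                    m = t - k                       # skewed injection at the edge
                    a_in = X[m][k] if 0 <= m < M else 0
                p_in = p[k - 1][n] if k > 0 else 0  # from the unit above
                new_a[k][n] = a_in
                new_p[k][n] = p_in + W[k][n] * a_in
        a, p = new_a, new_p
        # Finished sums leave the bottom row, one column per cycle of skew.
        for n in range(N):
            m = t - (K - 1) - n
            if 0 <= m < M:
                Y[m][n] = p[K - 1][n]
    return Y

X = [[1, 2], [3, 4], [5, 6]]
W = [[1, 0, 2], [0, 1, 3]]
Y = systolic_matmul(X, W)
print(Y)
```

Each unit in this model keeps only one filter value, one in-flight input, and one in-flight partial sum, while each input value fetched at the left edge is consumed by all N units in its row.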
The compact systolic organization holds only, say, 32 KB of data among 16K MAC units to capture most or all of the reuse. In contrast, each of GPGPUs’ SMs has only around 128 INT8 MAC units (+ 256 KB register file) which do not exploit enough reuse (each SM refetches the same data from the L2, whereas the systolic array fetches only once for many more MAC units). To capture cross-layer reuse (as opposed to the within-layer reuse above), the TPU has a large on-chip SRAM to hold the feature maps which are output from one DNN layer and input to the next layer, instead of going through memory.
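A quick calculation puts the two storage figures above on a per-MAC basis. It uses only the numbers quoted in this post; the comparison is deliberately crude, since the GPGPU register file exists mainly to support multithreading rather than reuse.

```python
KB = 1024

def bytes_per_mac(storage_bytes, mac_units):
    """Near-compute storage divided evenly over the MAC units it serves."""
    return storage_bytes / mac_units

systolic = bytes_per_mac(32 * KB, 16 * 1024)   # 32 KB across 16K MAC units
gpgpu_sm = bytes_per_mac(256 * KB, 128)        # 256 KB register file, 128 MAC units
print(systolic, gpgpu_sm, gpgpu_sm / systolic)
```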
(3) Other issues: We agree with the previous blog post that the TPU has higher compute and data distribution density. The blog post argues that the TPU’s compute unit utilization is lower for the matrix-vector operations prevalent in LSTMs and MLPs. Because the amount of compute in CNNs’ matrix-matrix operations is much higher than that in LSTMs and MLPs for a single inference, any architecture provisioned for CNNs may be underutilized for the others, and vice versa, on a per-inference basis. However, the common practice of batching multiple inferences leads to matrix-matrix computation even for LSTMs and MLPs, with the associated improvement in reuse and compute-bandwidth ratio compared to unbatched inference.
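The effect of batching on the compute-bandwidth ratio can be estimated with a simple count. This is a back-of-the-envelope sketch assuming 1-byte operands and that weights, inputs, and outputs are each moved exactly once; the 1024 x 1024 layer size is hypothetical.

```python
def macs_per_byte(batch, k, n):
    """MACs per byte moved for a (batch x k) @ (k x n) product, 1-byte operands."""
    macs = batch * k * n
    # Weights + inputs + outputs, each moved once (an idealized assumption).
    bytes_moved = k * n + batch * k + batch * n
    return macs / bytes_moved

for batch in (1, 8, 64):
    print(batch, round(macs_per_byte(batch, 1024, 1024), 2))
```

With a batch of one, each weight byte supports about one MAC; with batching, each weight fetched is reused once per inference in the batch, so the ratio grows nearly linearly with batch size until the input and output traffic becomes significant.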
Further, GPGPUs are more programmable than the TPU but incur some instruction overhead. GPGPUs can run many data-parallel workloads which may be harder to program for the TPU (e.g., graphics and HPC workloads).
Conclusion: While GPGPUs have been instrumental in the recent advances in ML, they incur high overhead in performance, area, and energy due to heavy multithreading, which is unnecessary for DNNs because their memory accesses are sequential and prefetchable. The systolic organization in the IBM and TPU architectures captures DNNs’ data reuse while remaining simple by avoiding multithreading.
Acknowledgments: We thank Profs. Reetuparna Das and Tushar Krishna for their comments on a draft version.
About the authors: Mithuna Thottethodi and T. N. Vijaykumar are computer architects in the School of Electrical and Computer Engineering at Purdue University.