Computer Architecture Today

Informing the broad computing community about current activities, advances and future directions in computer architecture.

Computer designers have traditionally separated the roles of storage and computation. Memories stored data. Processors computed on it. Is this distinction necessary? A human brain doesn’t separate the two so distinctly, so why should a computer? Before I address this question, let me start with the well-known memory wall problem.

What is the memory wall in today’s context? The memory wall originally referred to the growing disparity in speed between fast processors and slow memories. Since about 2005, as processor speeds flat-lined, memory latency has remained about the same. But as the number of processor cores per chip kept increasing, memory bandwidth and memory energy became the more dominant issues. A significant fraction of energy today is spent moving data back and forth between memory and compute units, a problem that is exacerbated in modern data-intensive systems.

How do we overcome the memory wall in today’s computing world, which is increasingly dominated by data-intensive applications? For well over two decades, architects have tried a variety of strategies to overcome the memory wall. Most of them have centered around exploiting locality.

Here is an alternative. What if we could move compute _closer_ to memory? So much that the line that divides compute and memory starts to blur…

The First Wave…

Processing In Memory (PIM) was discussed in the 1990s (initial suggestions date back to as early as the 1970s) as an alternative solution to scale the memory wall. The idea was to physically bring the compute and memory units closer together by placing computation units inside the main memory (DRAM). But this idea did not quite take off back then due to the high cost of integrating compute units within a DRAM die. Perhaps also because cheaper optimizations were still possible, thanks to Moore’s law and Dennard scaling.

The advent of commercially feasible 3D chip stacking technology, such as Micron’s Hybrid Memory Cube (HMC), has renewed our interest in PIM. HMC stacks layers of DRAM memory on top of a logic layer. Compute units in the logic layer can communicate with memory through high-bandwidth through-silicon vias (TSVs). Thanks to 3D integration technology, we can now take compute and DRAM dies implemented in different process technologies and stack them on top of each other.

The additional dimension in 3D PIM allows an order of magnitude more physical connections between the compute and memory units, and thereby provides massive memory bandwidth to the compute units. I would argue that the available memory bandwidth is so high that a general-purpose multi-core processor with tens of cores is a poor candidate to take advantage of 3D PIM; the bandwidth of cheaper conventional DRAM is mostly adequate for these processors. Better candidates are customized compute units that can truly take advantage of the abundant memory bandwidth in 3D PIM — data-parallel accelerators such as graphics processors (GPUs), or, even better, customized accelerators such as Google’s Tensor Processing Units (TPUs).

While 3D PIM is a clear winner in terms of memory bandwidth compared to conventional DRAM, I find that its latency and energy advantages are perhaps exaggerated in the literature. Remember, 3D PIM just brings compute _closer_ to DRAM memory. It has no effect on the energy spent accessing data within the DRAM layers, on DRAM refresh and leakage, or on the on-die interconnect in the logic layer, which together happen to be the dominant costs. To be clear, there is some memory latency and energy reduction, as integrating compute in 3D PIM’s logic layer eliminates communication over the off-chip memory channels. But this benefit is not likely to be a big win.

The Second Wave…

While PIM brings compute and memory units closer together, the functionality and design of memory units remains unchanged. An even more exciting technology is one that dissolves the line that distinguishes memory from compute units.

Nearly three-fourths of the silicon in processor and main memory dies exists simply to store and access data. What if we could take this memory silicon and repurpose it to do computation? Let us refer to the resulting unit as Compute Memory.

A Compute Memory re-purposes the memory structures that are traditionally used only to store data into active compute units, at almost zero area cost. The biggest advantage of Compute Memory is that its memory arrays morph into massive vector compute units (potentially one or two orders of magnitude wider than a GPU’s vector units), as data stored across hundreds of memory arrays can be operated on concurrently. Of course, if you do not have to move data in and out of memory, you naturally save the energy spent on those transfers, and memory bandwidth becomes a meaningless metric.

Micron’s Automata Processor (AP) is an example of Compute Memory. It transforms DRAM structures into a Non-deterministic Finite Automaton (NFA) computation unit. NFA processing occurs in two phases: state-match and state-transition. AP cleverly repurposes the DRAM array decode logic to enable state matches, so each of the several hundred memory arrays can perform state matches in parallel. The state-match logic is coupled with a custom interconnect to enable state transitions. We can process as many as 1053 regular expressions from Snort (a classic network intrusion detection system) in one go using little more than DRAM hardware. AP can be an order of magnitude more efficient than GPUs and nearly two orders of magnitude more efficient than general-purpose multi-core CPUs! Imagine the possibilities if we could sequence a genome within minutes using cheap DRAM hardware!
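To make the two-phase execution model concrete, here is a minimal software sketch of NFA processing with bitmasks, where one bit per state stands in for a memory array's match element. This is an illustrative simulation of the state-match / state-transition split, not the actual AP hardware design; the three-state automaton (roughly `[ab]bc`) and all names are made up for the example.

```python
# Each state is (symbol set it matches, bitmask of successor states).
# Toy automaton for patterns like "abc" / "bbc": [ab] -> b -> c
stes = [
    (set("ab"), 0b010),  # state 0 matches 'a' or 'b', enables state 1
    (set("b"),  0b100),  # state 1 matches 'b', enables state 2
    (set("c"),  0b000),  # state 2 matches 'c' (reporting state)
]
START_MASK = 0b001   # state 0 is always re-armed (scan mode)
REPORT_MASK = 0b100  # a match on state 2 reports a hit

def run(text):
    active = START_MASK
    for ch in text:
        # Phase 1, state-match: which active states match this symbol?
        # (On AP, all arrays perform this lookup in parallel.)
        matched = 0
        for i, (syms, _) in enumerate(stes):
            if (active >> i) & 1 and ch in syms:
                matched |= 1 << i
        # Phase 2, state-transition: enable successors of matched states.
        nxt = START_MASK
        for i, (_, succ) in enumerate(stes):
            if (matched >> i) & 1:
                nxt |= succ
        active = nxt
        if matched & REPORT_MASK:
            return True  # reporting state matched somewhere in the stream
    return False
```

For example, `run("zzbbc")` reports a hit, while `run("ac")` does not, because state 1 never sees its 'b'.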

AP repurposed just the decode logic in DRAMs. Could we do better? In our recent work on Compute Caches, we showed that it is possible to re-purpose SRAM array bit-lines and sense-amplifiers to perform in-place analog bit-line computation on the data stored in SRAM. A cache is typically organized as a set of sub-arrays; there can be as many as hundreds of them, depending on the cache level. These sub-arrays can all compute concurrently on the KBs of data stored in them, with small extensions to the existing cache structures and an overall area overhead of 4%. Thus, caches can effectively function as large vector computation units whose operand sizes are orders of magnitude larger than those of conventional SIMD units (KBs vs. bytes). Of course, this also eliminates the energy spent moving data in and out of caches. While our initial work supports a few useful operations (logical operations, search, and copy), we believe it is just a matter of time before we are able to support more complex operations (comparisons, addition, multiplication, sorting, etc.).
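A toy functional model helps show what bit-line computing buys you: activating two word-lines at once lets the bit-lines produce a wired AND of two rows, while the complementary bit-lines simultaneously yield NOR, and search and copy follow from the same machinery. The sketch below models only the logical behavior, under an assumed 16-bit row width; the class and method names are illustrative, not the Compute Caches interface.

```python
WORD_BITS = 16                 # assumed sub-array row width for this sketch
MASK = (1 << WORD_BITS) - 1

class SubArray:
    """Toy functional model of one SRAM sub-array with bit-line compute ops.

    Hundreds of such sub-arrays could run these operations concurrently,
    which is where the vector-unit analogy comes from.
    """
    def __init__(self, rows):
        self.rows = [r & MASK for r in rows]

    def and_rows(self, a, b):
        # Two word-lines active at once: bit-lines wire-AND the rows.
        return self.rows[a] & self.rows[b]

    def nor_rows(self, a, b):
        # The complementary bit-lines give NOR of the same two rows.
        return ~(self.rows[a] | self.rows[b]) & MASK

    def search_rows(self, a, b):
        # In-place equality search: rows match iff no bit position differs.
        return self.rows[a] == self.rows[b]

    def copy_row(self, src, dst):
        # In-place copy: sense a row on the bit-lines, write it back elsewhere.
        self.rows[dst] = self.rows[src]
```

Note the operands never leave the sub-array; only the (narrow) result does, which is where the data-movement energy savings come from.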

Supporting Compute Cache style in-place, analog bit-line computing in DRAMs is more challenging. The problem is that DRAM reads are destructive, which is one of the reasons DRAMs need periodic refresh. While in-place DRAM computing may not be possible, an interesting solution is to copy the data to a temporary row in DRAM and then do bit-line computing there. This approach incurs extra copies, but retains the massive parallelism benefits.
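The copy-then-compute workaround can be sketched the same way. In this toy model, an assumed pair of scratch rows receives copies of the operands, the bit-line operation destroys (overwrites) only the scratch copies, and the original rows survive; the reserved-scratch-row scheme and all names here are illustrative assumptions, not a specific proposal.

```python
class ToyDram:
    """Toy model of copy-then-compute bit-line AND in a DRAM array.

    Because DRAM reads are destructive, computing directly on the stored
    rows would corrupt them. Instead we copy operands to scratch rows
    (an assumed reserved region) and let the computation clobber those.
    """
    def __init__(self, rows):
        self.rows = list(rows)
        self.scratch = [0, 0]  # assumed reserved scratch rows

    def _copy_to_scratch(self, src, s):
        # Fast in-array row-to-row copy over the bit-lines.
        self.scratch[s] = self.rows[src]

    def and_rows(self, a, b):
        self._copy_to_scratch(a, 0)
        self._copy_to_scratch(b, 1)
        result = self.scratch[0] & self.scratch[1]
        # The activation leaves the result in the scratch rows;
        # the original rows a and b are untouched.
        self.scratch[0] = self.scratch[1] = result
        return result
```

The extra copies cost time and energy, but every sub-array in the DRAM could still run its own copy-then-compute sequence in parallel, preserving the massive data parallelism.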

Unlike in DRAMs, bit-line computing may work well in a diverse set of non-volatile memory technologies (RRAMs, STT-MRAMs, and flash). Researchers have already found success in re-purposing structures in emerging NVMs to build efficient ternary content-addressable memories (TCAMs) and neural networks.

Compute Memories can be massively data-parallel, potentially an order of magnitude more performant and energy-efficient than modern data-parallel accelerators such as GPUs. Such dramatic improvements could have a transformative effect on applications ranging from genome sequencing to deep neural networks. However, the capabilities of Compute Memories may not be as general-purpose as GPUs are today, and they may impose additional constraints on where data is stored. Application developers may have to re-work their algorithms to take full advantage of Compute Memory. Modern data-parallel programming frameworks such as CUDA and TensorFlow can be adapted to help these developers. It may also require runtime and system software support to meet Compute Memory constraints such as data placement.

Parting Thoughts…

As the efficiency of general-purpose cores flat-lined over the last decade, both industry and academia have wholeheartedly embraced customization of computation units. It is high time for us to think about customizing memory units as well. While there are many ways one could customize memory, turning it into a powerful accelerator is one of the most exciting avenues to pursue.

Until recently, we have viewed compute and memory units as two separate entities. Even within a processor, caches and compute logic have operated as two separate entities that served different roles. The time has come to dissolve the line that separates them.

Thanks to Rajeev Balasubramonian, Jason Flinn, Onur Mutlu, Satish Narayanasamy and Sudhakar Yalamanchili for providing feedback on this blog post.

About the Author
Reetuparna Das is an assistant professor at the University of Michigan. Feel free to contact her at reetudas@umich.edu.