Computer Architecture Today

Informing the broad computing community about current activities, advances and future directions in computer architecture.

A previous post, “Blurring the lines between memory and compute” by R. Das, gave a nice summary of the history of, and recent trends in, addressing the memory wall with processing-in-memory (PIM) ideas. This post further highlights the importance of intelligent memory architectures and the recent industry developments that build on new technologies such as die-stacking and emerging non-volatile memories.

As R. Das mentioned, the widening gap between processor performance and memory performance has created the “memory wall” problem, in which data movement between compute and memory becomes the bottleneck of computing systems ranging from cloud servers to end-user devices. For example, as reported recently by Nvidia and IBM, a data transfer between the CPU and off-chip memory consumes two orders of magnitude more energy than a floating-point operation, and while technology scaling helps reduce the total energy, data movement still dominates total energy consumption. As we enter the era of big data, emerging data-intensive workloads are becoming pervasive, and they demand very high bandwidth and heavy data movement between the computing units and the memory. Note that another blog post on near-data computing from a database system perspective also points out the importance of closing the gap between computing and storage, with a focus on near-storage computing.
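To make the energy argument concrete, here is a minimal back-of-the-envelope sketch. It takes only the "two orders of magnitude" ratio from the text; the normalized constants `E_FLOP` and `E_OFFCHIP` and the function `data_movement_share` are hypothetical names introduced for illustration, not measured values or anyone's published model.

```python
# Normalized energy costs. The 100x ratio mirrors the "two orders of
# magnitude" figure in the text; the absolute values are assumptions.
E_FLOP = 1.0       # energy of one floating-point operation (normalized)
E_OFFCHIP = 100.0  # energy of one off-chip operand transfer (normalized)


def data_movement_share(flops_per_operand: float) -> float:
    """Fraction of total energy spent moving data off-chip.

    flops_per_operand is the number of floating-point operations performed
    for each operand fetched from off-chip memory (arithmetic intensity).
    """
    if flops_per_operand <= 0:
        raise ValueError("flops_per_operand must be positive")
    compute = flops_per_operand * E_FLOP
    movement = E_OFFCHIP
    return movement / (compute + movement)


if __name__ == "__main__":
    for intensity in (1, 10, 100, 1000):
        share = data_movement_share(intensity)
        print(f"{intensity:>5} flops/operand -> {share:.1%} of energy is data movement")
```

Under these assumptions, a workload must perform about 100 operations per fetched operand before compute energy even matches data-movement energy, which is why data-intensive workloads with low reuse are dominated by the cost of moving data.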

To close the gap between computing and memory, extensive work has been done to explore possible solutions, which can be classified into two categories: memory-rich processing units and compute-capable memory.

Memory-Rich Processing. This approach increases the capacity of the memory that sits close to the computing units, so that data movement between the computing units and memory can be reduced. Emerging high-capacity 3D stacked memories such as High-Bandwidth Memory (HBM) and Hybrid Memory Cube (HMC) have been proposed for integration with CPUs/GPUs using either through-silicon-via (TSV) based 3D integration or interposer-based 2.5D integration. Interposer-based 2.5D integration has already been embraced by the industry for integrating 3D stacked memories with large-scale designs, because it is more feasible than true 3D integration in terms of cost and offers a better thermal profile. For example, as early as 2015, AMD’s Fury X GPU integrated 4GB of 3D stacked HBM; Nvidia’s more recent Tesla V100 GPU increases the HBM capacity to 16GB and 32GB; and Intel’s Knights Landing CPU integrates 16GB of high-bandwidth stacked DRAM named MCDRAM (Multi-Channel DRAM). Xilinx also recently announced the Virtex UltraScale+ FPGA, which integrates 8GB of HBM. I define such processing units (PUs), whether CPU, GPU, or FPGA, with GBs of 3D stacked memory as memory-rich processing units.

Compute-capable Memory. This approach moves computation inside, or closer to, the memory. Depending on the granularity and the location of the computing, it can be further classified into two categories: a coarse-grained coupling of compute and memory called near-data computing (NDC), built on die-stacked memory, and a finer-grained merging of compute and memory called processing-in-memory (PIM) or compute-in-memory (CIM). For example, in the HBM or HMC memory architecture, a logic layer is stacked at the bottom of multiple DRAM dies, and one can implement various accelerator designs on this logic layer, which is connected to the DRAM dies in the same package by a large number of through-silicon vias (TSVs) that provide extremely high bandwidth and low latency. Such a design is referred to as near-data computing (NDC), with a clear distinction between the logic die and the memory dies. The bandwidth between the logic die and the memory stack can be up to 15x that of traditional DDR3 memory with 70% less energy per bit, which can significantly accelerate data-intensive workloads with high energy efficiency. In contrast, PIM and CIM architectures envision intra-array (bitline or per-cell) computation that leverages voltage and charge properties (for DRAM/SRAM) or resistance-based analog computing in emerging non-volatile memories (such as MRAM/PCRAM/ReRAM) to enable simple bulk operations, such as vector-matrix multiplication, multiply-accumulate, Boolean logic, and TCAM or associative processing. One may also place some logic near the global sense amplifiers, a concept similar to the one proposed by the Berkeley Intelligent RAM (IRAM) project, to enable general-purpose computation at the edge of the arrays, most likely in the form of vector operations.
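The resistance-based analog computing mentioned above can be illustrated with an idealized functional model of a resistive crossbar. This is only a sketch under strong assumptions (ideal linear cells, no wire resistance, no noise, no ADC/DAC quantization), and the function name `crossbar_vmm` is hypothetical; it is not the design of any particular chip.

```python
from typing import List


def crossbar_vmm(voltages: List[float],
                 conductances: List[List[float]]) -> List[float]:
    """Idealized resistive-crossbar vector-matrix multiply.

    voltages[i]        : voltage applied to wordline (row) i
    conductances[i][j] : conductance of the cell at row i, column j
    Returns the current collected on each bitline (column) j:
        I[j] = sum_i voltages[i] * conductances[i][j]
    Each cell contributes V * G (Ohm's law), and currents sharing a bitline
    add up (Kirchhoff's current law), so a single read step produces one
    dot product per column.
    """
    if len(voltages) != len(conductances):
        raise ValueError("need exactly one voltage per crossbar row")
    n_cols = len(conductances[0]) if conductances else 0
    if any(len(row) != n_cols for row in conductances):
        raise ValueError("all crossbar rows must have the same number of columns")

    currents = [0.0] * n_cols
    for v, row in zip(voltages, conductances):
        for j, g in enumerate(row):
            currents[j] += v * g  # per-cell current accumulates on bitline j
    return currents


if __name__ == "__main__":
    # Weights are stored as conductances; inputs arrive as voltages
    # (both in normalized, illustrative units).
    weights = [[2.0, 0.0],
               [4.0, 2.0]]
    inputs = [1.0, 0.5]
    print(crossbar_vmm(inputs, weights))
```

The point of the model is that the weight matrix never leaves the array: only the input vector goes in and the accumulated column currents come out, which is what makes this style of computing attractive for multiply-accumulate-heavy workloads.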

Machine-learning applications as a driving force. State-of-the-art neural network (NN) and deep learning (DL) algorithms, such as the multilayer perceptron (MLP) and the convolutional neural network (CNN), require a large memory capacity because the size of NNs is increasing dramatically. High-performance acceleration of NNs also requires high memory bandwidth, since the xPUs are hungry for synaptic weights. To address this challenge, many AI accelerator designs adopt either memory-rich processing units or compute-capable memory architectures. For example, both AMD’s and Nvidia’s high-end GPUs for AI applications integrate 16-32GB of HBM, serving as general-purpose GPU architectures for machine learning. Google’s second-generation TPU integrates 64GB of HBM2 to address the memory capacity and bandwidth challenges of training. Intel’s Nervana neural network processor also comes with integrated HBM. Wave Computing, a start-up company, has disclosed that its dataflow processing unit (DPU) will integrate 128GB of HMC memory and deliver a peak compute capacity of 11.6 petaops. While there are many academic efforts on designing AI accelerators with PIM or CIM concepts, new start-up companies have also emerged recently, alongside IBM’s efforts in similar directions.
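A rough estimate shows why weight fetching makes NN acceleration bandwidth-hungry. The sketch below compares the compute time and the weight-streaming time of one fully connected layer. The function `layer_time_bounds` and all numbers in the example (layer size, 100 tera-ops/s peak, 900 GB/s bandwidth, 2-byte weights) are illustrative assumptions, not specifications of any product named above, and the model ignores activations and on-chip reuse.

```python
from typing import Tuple


def layer_time_bounds(rows: int, cols: int, batch: int,
                      bytes_per_weight: int,
                      peak_ops_per_s: float,
                      mem_bytes_per_s: float) -> Tuple[float, float]:
    """Return (compute_seconds, memory_seconds) for one fully connected layer.

    Compute: each weight takes part in one multiply and one add per input
    in the batch, i.e. 2 * rows * cols * batch operations.
    Memory: the weights are streamed from memory once per batch, i.e.
    rows * cols * bytes_per_weight bytes (activations are ignored).
    """
    ops = 2 * rows * cols * batch
    weight_bytes = rows * cols * bytes_per_weight
    return ops / peak_ops_per_s, weight_bytes / mem_bytes_per_s


if __name__ == "__main__":
    PEAK = 100e12      # assumed peak throughput, ops/s
    BANDWIDTH = 900e9  # assumed memory bandwidth, bytes/s
    for batch in (1, 256):
        compute_s, memory_s = layer_time_bounds(4096, 4096, batch, 2,
                                                PEAK, BANDWIDTH)
        bound = "memory" if memory_s > compute_s else "compute"
        print(f"batch={batch:>3}: compute {compute_s:.2e} s, "
              f"memory {memory_s:.2e} s -> {bound}-bound")
```

With these assumed numbers the layer is memory-bound at batch size 1 and becomes compute-bound only once each fetched weight is reused across a large batch, which is the pressure that pushes accelerator designs toward more nearby memory or toward computing inside the memory itself.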

Nevertheless, there are still many challenges for computer architects. For example, tools are needed for design space exploration of massively parallel, hierarchical near-data processing, and for developing modular architectures that allow systems to be designed to meet specific cost, energy, space, and performance targets. System infrastructure and programming frameworks to support such novel PIM architectures should also be developed. As these intelligent memory architectures thrive, a memory-centric computing paradigm is arriving, bringing many challenges and opportunities.

About the Author:  Yuan Xie is a professor at University of California, Santa Barbara. Email: yuanxie@ece.ucsb.edu