A Brief Guide of xPU for AI Accelerators

At the workshop on Inter-Disciplinary Research Challenges in Computer Systems (Grand Challenges), co-located with ASPLOS 2018, Dr. Hillery Hunter from IBM and I co-organized a panel discussion on “Augmenting Human Abilities/AI”. During the discussion, inspired by a recent interesting article, I did a quick survey of the AI hardware accelerators developed in the last several years. Beyond the CPU and GPU that we are familiar with, we have seen many xPUs related to AI hardware acceleration. From A to Z, which letter has not yet been used for your xPU design?

APU: Accelerated Processing Unit is AMD’s Fusion architecture, which integrates both CPU and GPU on the same die.

BPU: Brain Processing Unit is the AI chip design by Horizon Robotics. They unveiled their first two embedded AI chips, fabricated with a TSMC 40nm process, in December 2017: the “Journey 1.0 processor” targets autonomous driving, while the “Sunrise 1.0 processor” targets smart cameras with image recognition.

CPU: Central Processing Unit.

DPU:

Deep Learning Processing Unit (DPU) is the product of Deephi, a Chinese AI chip start-up that announced DPUs optimized for CNN workloads and RNN workloads.
Dataflow Processing Unit (DPU) is the product of Wave Computing, a Silicon Valley company that builds dataflow-based solutions for artificial intelligence and deep learning. They introduced their coarse-grained reconfigurable array (CGRA) architecture for statically scheduled dataflow computing at HOTCHIPS’17, and its software stack of compiler and linker at ICCAD’17.

DLA: Deep Learning Accelerator (DLA, or NVDLA) is an open and standardized architecture by Nvidia to address the computational demands of inference. With its modular architecture, DLA is scalable, highly configurable, and designed to simplify integration and portability.

EPU: Emotion Processing Unit is an MCU microchip designed by Emoshape that, according to the company, enables a true emotional response in AI, robots, and consumer electronic devices as the result of a virtually unlimited cognitive process.

FPU: Floating-Point Unit (FPU).

GPU: The Graphics Processing Unit (GPU), which achieves high data parallelism with its SIMD architecture, plays a major role in the current AI market, from training to inference. For example, Nvidia’s Volta-based Quadro GV100 and DGX-2 were just announced at GTC’18. The new GV100 delivers 7.4 TFLOPS FP-64, 14.8 TFLOPS FP-32, and 118.5 TFLOPS of deep learning performance, and is equipped with 32GB of high-bandwidth memory (HBM). The new DGX-2, which achieves 2 petaFLOPS as a system, combines 16 fully interconnected GPUs for 10x the deep learning performance.
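The throughput figures quoted above are related by simple ratios, which the short sketch below works out. It uses only the numbers from the paragraph; the variable names are illustrative, and the per-GPU figure for the DGX-2 simply divides the system total evenly across 16 GPUs, which is an assumption of this sketch and not a vendor specification.

```python
# Figures quoted in the text for the Quadro GV100, in TFLOPS.
fp64_tflops = 7.4
fp32_tflops = 14.8
deep_learning_tflops = 118.5

# Ratios between the quoted precisions.
fp32_over_fp64 = fp32_tflops / fp64_tflops
dl_over_fp32 = deep_learning_tflops / fp32_tflops

# DGX-2: 2 petaFLOPS across 16 GPUs, assuming an even split (assumption).
dgx2_total_tflops = 2.0 * 1000
dgx2_gpus = 16
per_gpu_tflops = dgx2_total_tflops / dgx2_gpus

print(f"FP-32 / FP-64 throughput:         {fp32_over_fp64:.1f}x")
print(f"Deep learning / FP-32 throughput: {dl_over_fp32:.1f}x")
print(f"DGX-2 per-GPU share:              {per_gpu_tflops:.1f} TFLOPS")
```

The roughly 8x gap between the deep learning figure and the FP-32 figure is what makes reduced-precision matrix units attractive for AI workloads.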

HPU: The Holographic Processing Unit (HPU) is the custom hardware in Microsoft’s HoloLens. HPU1, built on a TSMC 28nm process, was announced at HOTCHIPS’16. It integrates 24 Tensilica DSPs and supports 5 cameras, 1 depth sensor, and 1 motion sensor. Some details of HPU2 were unveiled at CVPR’17. HPU2 integrates a DNN co-processor to handle deep learning workloads on the device.

IPU:

Intelligence Processing Unit (IPU) is Graphcore’s processor for graph-based workloads. Graphcore believes that the graph is a proper representation of a knowledge model. They use the graph as the basic representation for many AI-related algorithms, including neural networks, Bayesian networks, Markov random fields, and some other emerging methods. As they advertise, the CPU is based on a scalar architecture, the GPU on a vector architecture, and the IPU on a graph architecture. The IPU design supports both training and inference, and is memory-centric. More details can be found in their recent talk at the Scaled Machine Learning conference in March 2018.
Intelligence Processing Unit (IPU): Mythic, another start-up company, also names its product IPU. Their prototype introduces processing-in-memory and performs hybrid digital/analog computation inside flash arrays.
Image Processing Unit (IPU) is the core of the Pixel Visual Core, designed by Google and integrated in the Google Pixel 2 released in 2017. With 8 Google-designed custom IPU cores, each with 512 arithmetic logic units (ALUs), the Pixel Visual Core delivers raw performance of more than 3TOPS on a mobile power budget. Compared with the Google Pixel 1, HDR photography is accelerated by 5x and power efficiency is increased by 10x.

NPU: Neural Network Processing Unit (NPU) has become a general name for AI chips rather than the brand name of one company. To the best of my knowledge, even though the term NPU was first mentioned in a MICRO 2012 paper by Hadi Esmaeilzadeh et al., the first commercial product named NPU in industry came from Vimicro in 2016. This NPU chip integrates several DSPs and performs CNN and DNN computation using SIMD instructions. At ISSCC’18, there were many NPU designs. For example, a research group from Japan proposed QUEST, an NPU that is a log-quantized DNN inference engine. KAIST proposed another NPU, which supports fully variable weight precision from 1 bit to 16 bits. The design achieves 50.6TOPS/W in 1-bit mode and 3TOPS/W in 16-bit mode. A group from Stanford presented yet another NPU, a mixed-signal CNN processor that uses analog computing to implement a binary neural network (BNN).
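To make the idea of log quantization concrete, here is a minimal sketch. It is not QUEST’s actual scheme: it only illustrates the general technique of rounding each weight to the nearest power of two in the log domain, so that a multiplication by the weight can be replaced by a bit shift in hardware. The function names `log_quantize` and `dequantize` and the exponent range are assumptions chosen for illustration.

```python
import math

def log_quantize(w, min_exp=-7, max_exp=0):
    """Return (sign, exponent) so that w is approximated by sign * 2**exponent.

    The exponent is rounded in the log2 domain and clamped to
    [min_exp, max_exp]; a zero weight is encoded with sign 0.
    The exponent range is a hypothetical choice, not taken from any paper.
    """
    if w == 0:
        return 0, min_exp
    sign = 1 if w > 0 else -1
    exp = round(math.log2(abs(w)))
    exp = max(min_exp, min(max_exp, exp))
    return sign, exp

def dequantize(sign, exp):
    """Reconstruct the approximate weight from its (sign, exponent) code."""
    return sign * 2.0 ** exp

weights = [0.5, -0.3, 0.07, 0.0, 1.2]
codes = [log_quantize(w) for w in weights]
approx = [dequantize(s, e) for s, e in codes]
for w, code, a in zip(weights, codes, approx):
    print(f"{w:6.2f} -> code {code} -> {a}")
```

With this exponent range, each weight needs only one sign bit plus three exponent bits, which hints at why low-precision encodings pay off in energy efficiency.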

SPU: Stream Processing Unit (SPU) refers to specialized hardware for processing video data streams. A GPU can also be considered a special kind of SPU.

TPU: Tensor Processing Unit (TPU) is Google’s specialized hardware for neural networks. At ISCA’17, Google published its first TPU paper (TPU1), whose highlight is its systolic array structure. TPU1 focuses on inference tasks and has been deployed in the datacenter since 2015. At Google I/O’17, Google announced its cloud TPU (also known as TPU2), which can handle both training and inference.
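For readers unfamiliar with systolic arrays, the sketch below simulates one cycle by cycle in plain Python. It is a textbook output-stationary variant chosen for brevity and is not meant to reproduce the TPU’s actual dataflow: each processing element (PE) keeps one accumulator, values of A march left to right, values of B march top to bottom, and the inputs are skewed so that matching operands meet in the right PE. The function name `systolic_matmul` and the register layout are assumptions made for illustration.

```python
def systolic_matmul(A, B):
    """Multiply A (n x k) by B (k x m) on a simulated n x m systolic array."""
    n, k, m = len(A), len(B), len(B[0])
    acc = [[0] * m for _ in range(n)]    # one accumulator per PE
    a_reg = [[0] * m for _ in range(n)]  # A value currently held by each PE
    b_reg = [[0] * m for _ in range(n)]  # B value currently held by each PE

    for t in range(n + m + k - 2):       # cycles until the last operands meet
        # Shift A values one PE to the right, then inject at the left edge.
        for i in range(n):
            for j in range(m - 1, 0, -1):
                a_reg[i][j] = a_reg[i][j - 1]
            idx = t - i                  # row i is delayed by i cycles
            a_reg[i][0] = A[i][idx] if 0 <= idx < k else 0
        # Shift B values one PE down, then inject at the top edge.
        for j in range(m):
            for i in range(n - 1, 0, -1):
                b_reg[i][j] = b_reg[i - 1][j]
            idx = t - j                  # column j is delayed by j cycles
            b_reg[0][j] = B[idx][j] if 0 <= idx < k else 0
        # Every PE performs one multiply-accumulate per cycle.
        for i in range(n):
            for j in range(m):
                acc[i][j] += a_reg[i][j] * b_reg[i][j]
    return acc

A = [[1, 2, 3], [4, 5, 6]]
B = [[7, 8], [9, 10], [11, 12]]
print(systolic_matmul(A, B))
```

The point of the structure is that each operand is fetched from memory once and then reused as it passes from neighbor to neighbor, so the array performs many multiply-accumulates per memory access.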

VPU: Vision Processing Unit (VPU) is a specialized chip for computer vision workloads. Movidius, which was acquired by Intel in 2016, develops the Myriad series of VPUs, which are hardware-optimized for computer vision tasks. In 2014, Google’s Project Tango deployed Myriad1 to construct 3D indoor maps. In 2016, DJI, a Chinese drone company, integrated Myriad2 into its PHANTOM 4 and Mavic Pro products. In 2017, Intel Movidius announced its latest-generation VPU, named Myriad X. Compared with Myriad2, Myriad X integrates a Neural Compute Engine, which contains 16 128-bit VLIW processors and supports FP-16 and INT-8, and is fabricated with a TSMC 16nm process.

ZPU: The ZPU is a small, portable CPU core from the Norwegian company Zylin AS, designed to run supervisory code in electronic systems that include an FPGA.

There are many other deep learning processors or neural network accelerators that are not named xPU (such as Cambricon). Comprehensive lists are available here and here.

About the author: Yuan Xie is a professor in the ECE department at UCSB. More information can be found at http://www.ece.ucsb.edu/~yuanxie

Disclaimer: These posts are written by individual contributors to share their thoughts on the Computer Architecture Today blog for the benefit of the community. Any views or opinions represented in this blog are personal, belong solely to the blog author and do not represent those of ACM SIGARCH or its parent organization, ACM.
