This is the 1st June 2019 digest of SIGARCH Messages.

In This Issue

Call for Participation: TVM Tutorial
Submitted by Thierry Moreau

TVM Tutorial
in conjunction with ISCA 2019
Phoenix, Arizona, USA
June 22, 2019

Apache TVM is an open-source deep learning compiler stack for CPUs, GPUs, and specialized accelerators. It aims to close the gap between productivity-focused deep learning frameworks and performance- or efficiency-oriented hardware backends.

We will showcase the different ways in which TVM can be used to facilitate research on deep learning systems, compilers, and hardware architectures. We welcome TVM contributors, potential users, collaborators, researchers, and practitioners from the broader community.

The presentations will be structured to provide a high-level overview of the research. We will complement the presentations with hands-on tutorials that can be run on your own laptop during or after the tutorial.

To register follow the link:

Call for Participation: Tutorial on Hardware Accelerators for Deep Neural Networks
Submitted by Vivienne Sze & Joel Emer

Tutorial on Hardware Accelerators for Deep Neural Networks
in conjunction with ISCA 2019
Phoenix, Arizona, USA
June 22, 2019

Deep neural networks (DNNs) are currently widely used for many AI applications, including computer vision, speech recognition, and robotics. While DNNs deliver state-of-the-art accuracy on many AI tasks, this accuracy comes at the cost of high computational complexity. Accordingly, designing efficient hardware architectures for deep neural networks is an important step towards enabling the wide deployment of DNNs in AI systems.

This tutorial provides a brief recap of the basics of deep neural networks and is for those who are interested in understanding how those models are mapped onto hardware architectures. We will provide frameworks for understanding the design space for deep neural network accelerators, including managing data movement, handling sparsity, and the importance of flexibility.

NOTE: This is an intermediate-level tutorial that will go beyond the material in the previous incarnations of this tutorial.

Call for Participation: Tutorial on Methods for Characterization and Analysis of Voltage Margins in Multicore Processors
Submitted by Dimitris Gizopoulos

Tutorial on Methods for Characterization and Analysis of Voltage Margins in Multicore Processors
in conjunction with ISCA 2019
Phoenix, AZ, USA
June 23, 2019

Conservative design margins in modern multicore CPU chips aim to guarantee correct execution of the software layers of a computing system under various operating conditions, such as worst-case voltage noise (Ldi/dt), while accounting for the inherent variability among different cores of the same CPU chip, among different manufactured chips, and among different workloads. However, guard-banding the main operational parameters of CPU chips (voltage, frequency) limits energy efficiency.

In this tutorial, we will present recent methods and studies on design-time voltage-margins characterization and identification in modern multicore CPUs; such methods aim to improve energy efficiency while guaranteeing the correctness of software execution.

– We will discuss key power-delivery-network (PDN) challenges and present dedicated on-chip circuitry for PDN voltage noise characterization. The presentation will include various analyses conducted on real hardware using this circuitry to characterize voltage noise caused by Ldi/dt viruses, system-call-intensive benchmarks, and scan-debug activity.

– We will present the main challenges in characterizing and identifying the different types of variability in modern multicore CPUs (across cores, across chips, and across workloads) and in analyzing system behavior under scaled conditions (what types of malfunctions are observed), along with how these challenges can be addressed. Both single-thread and multi-thread workloads will be discussed.

– We will present a novel non-intrusive, zero-overhead, cross-platform approach for post-silicon dI/dt voltage noise monitoring based on sensing CPU electromagnetic emanations using an antenna and a spectrum analyzer. The approach is based on the observation that high amplitude electromagnetic emanations are correlated with high voltage noise. We leverage this observation to automatically generate voltage noise (dI/dt) stress tests and measure PDN resonance frequency.

– The tutorial analysis is based on real system measurements using client chips as well as on different multicore server CPU chips mainly based on ARMv8 architecture (such as ARM Cortex-A72 and Cortex-A53 CPUs, AppliedMicro’s – now Ampere Computing – multicore X-Gene 2 and X-Gene 3 CPUs). Discussion and comparison among the implementations and also with different architectures (mainly Intel and AMD x86 multicore CPU chips) will also take place.

The purpose of the tutorial is to summarize recent characterization and exploitation findings on multicore CPUs in server machines, to emphasize the potential for energy savings through the identification and exploitation of design margins, and to relate our reports and findings to other machines similarly studied in the past.

Call for Participation: PyMTL Tutorial
Submitted by Christopher Batten

PyMTL Tutorial: A Next-Generation Python-Based Framework for Hardware Generation, Simulation, and Verification
in conjunction with ISCA 2019
Phoenix, Arizona, USA
June 22nd, 2019

The purpose of this tutorial is to introduce the computer architecture research community to the features and capabilities of the new version of PyMTL, a next-generation Python-based hardware generation, simulation, and verification framework.

* Why PyMTL?
Computer architecture researchers are increasingly exploring the hardware and software aspects of accelerator-centric architectures, and this has resulted in a trend towards implementing accelerators at the register-transfer level and even fabricating accelerator-centric test chips. However, the conventional wisdom is that designing, implementing, testing, and evaluating RTL accelerators is a complex, time-consuming, and frustrating process. These challenges in the computer architecture research community mirror the challenges faced by commercial, government, and hobbyist hardware designers. These challenges have motivated some design teams to augment or even replace traditional domain-specific hardware description languages (HDLs) with a mix of different high-level hardware generation, simulation, and verification frameworks. PyMTL is a next-generation Python-based framework that unifies hardware generation, simulation, and verification into a single environment. The Python language provides a flexible dynamic type system, object-oriented programming paradigms, powerful reflection and introspection, lightweight syntax, and rich standard libraries. PyMTL builds upon these productivity features to enable a designer to write more succinct descriptions, to avoid crossing any language boundaries for development, testing, and evaluation, and to use the complete expressive power of the host language for verification, debugging, instrumentation, and profiling. The hope is that PyMTL can reduce time-to-paper (or time-to-solution) by improving the productivity of design, implementation, verification, and evaluation.

* The PyMTL Workflow:
A typical PyMTL workflow proceeds as follows. The designer starts by developing a functional-level (FL) design-under-test (DUT) and test bench completely in Python. Then the DUT is iteratively refined to the cycle level (CL) and register-transfer level (RTL), along with verification and evaluation using Python-based simulation and the same test bench. The designer can then translate a PyMTL RTL model to Verilog and use the same test bench for co-simulation. Note that designers can also co-simulate existing SystemVerilog source code with a PyMTL test bench. The ability to simulate/co-simulate the design in the Python runtime environment drastically reduces the iterative development cycle, eliminates any semantic gap, and makes it feasible to adopt verification methodologies emerging in the open-source software community. Finally, the designer can push the translated DUT through an FPGA/ASIC toolflow and can even reuse the same PyMTL test bench during prototype bringup.
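
The core idea of this workflow, one test bench reused while the design is refined across abstraction levels, can be illustrated with a small plain-Python sketch. It deliberately does not use the PyMTL API: the names gcd_fl, GcdCL, gcd_cl, TEST_VECTORS, and run_test_bench are hypothetical and chosen only for illustration, and the greatest-common-divisor design is an assumed example, not one taken from the tutorial.

```python
# Plain-Python illustration only; none of these names come from PyMTL.

def gcd_fl(a, b):
    """Functional-level model: computes the result with no notion of time."""
    while b:
        a, b = b, a % b
    return a


class GcdCL:
    """Cycle-level model: same function, one swap-or-subtract step per tick."""

    def __init__(self, a, b):
        self.a, self.b = a, b
        self.cycles = 0

    def done(self):
        return self.b == 0

    def tick(self):
        # Each call models one clock cycle of an iterative GCD unit.
        if self.a < self.b:
            self.a, self.b = self.b, self.a
        else:
            self.a -= self.b
        self.cycles += 1


def gcd_cl(a, b):
    """Adapter so the cycle-level model exposes the same call interface."""
    dut = GcdCL(a, b)
    while not dut.done():
        dut.tick()
    return dut.a


# One set of test vectors shared by both abstraction levels.
TEST_VECTORS = [((12, 8), 4), ((7, 13), 1), ((9, 0), 9), ((0, 5), 5)]


def run_test_bench(model):
    """Return True if the model matches every expected result."""
    return all(model(*args) == expected for args, expected in TEST_VECTORS)


if __name__ == "__main__":
    print("FL passes:", run_test_bench(gcd_fl))
    print("CL passes:", run_test_bench(gcd_cl))
```

Because both models sit behind the same call interface, the test bench does not change when the design is refined; only the cycle-level model additionally exposes a cycle count for evaluation.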

* The New Version of PyMTL:
This hands-on tutorial will introduce participants to the new version of PyMTL, which is scheduled for a beta release in June. The new version of PyMTL maintains some of the best features of the current version, including: support for highly parameterized chip generators; a unified framework for functional-, cycle-, and register-transfer-level modeling; pure-Python-based simulation; elegant translation of PyMTL RTL to Verilog RTL; and first-class support for co-simulation of PyMTL and Verilog models through Python/Verilator integration. The new version of PyMTL will additionally include: a completely new execution model based on statically scheduled concurrent sequential update blocks; improved simulation performance; first-class support for method-based interfaces; PyMTL passes for analyzing, instrumenting, and transforming PyMTL models; and improved verification methodologies.

Our objective is to provide attendees with answers to the following questions:
– What kind of research problems can PyMTL help me solve?
– How do I build functional-level, cycle-level, and register-transfer-level models in PyMTL?
– How do I generate Verilog HDL from PyMTL RTL models and push them through an ASIC toolflow?
– How do I create flexible testing harnesses in PyMTL that work across abstraction levels?
– How do I incorporate PyMTL into my existing research flow?
– How do I use existing Verilog IP with PyMTL?

Christopher Batten, Cornell University
Shunning Jiang, Cornell University
Christopher Torng, Cornell University
Yanghui Ou, Cornell University
Peitian Pan, Cornell University

Call for Participation: ML Benchmarking Tutorial
Submitted by Vijay Janapa Reddi

ML Benchmarking Tutorial
in conjunction with ISCA 2019
Phoenix, Arizona, USA
June 22, 2019

The current landscape of Machine Learning (ML) and Deep Learning (DL) is rife with non-uniform models, frameworks, and system stacks. It lacks standard tools and methodologies to evaluate and profile models or systems. Due to the absence of standard tools, the state of the practice for evaluating and comparing the benefits of proposed AI innovations (be it hardware or software) on end-to-end AI pipelines is both arduous and error-prone — stifling the adoption of the innovations in a rapidly moving field.

The goal of the tutorial is to bring experts from the industry and academia together to shed light on the following topics to foster systematic development, reproducible evaluation, and performance analysis of deep learning artifacts. It seeks to address the following questions:
– What are the benchmarks that can effectively capture the scope of the ML/DL domain?
– Are the existing frameworks sufficient for this purpose?
– What are some of the industry-standard evaluation platforms or harnesses?
– What are the metrics for carrying out an effective comparative evaluation?

Tentative Schedule:
08:30 – 10:00 AM Introduction to MLPerf
10:00 – 10:30 AM Coffee break
10:30 – 12:00 PM Challenges and Pitfalls in Benchmarking ML
12:00 – 01:30 PM Lunch break
01:30 – 02:30 PM MLModelScope Deep Dive
02:30 – 03:00 PM MLModelScope for MLPerf
03:00 – 03:30 PM Coffee break
03:30 – 05:00 PM Tools and Methodologies
05:00 – 05:30 PM Open Issues / Challenges
05:30 – 06:00 PM Closing

The tutorial will cover a range of different topics, including but not limited to the following:

* Representative Benchmarks for the ML Domain
Benchmarks are instrumental to the development of the architecture and systems communities. We will cover the various ongoing efforts in ML benchmarking. More specifically, we will present MLPerf, an ongoing industry-wide effort involving more than 30 companies to create a SPEC-like benchmark that standardizes how we measure the “training” and “inference” performance of ML models, software frameworks, ML hardware accelerators, and ML cloud and edge platforms. We will discuss the influence of academic benchmarking efforts such as the Fathom benchmark suite from Harvard, the DAWNBench suite from Stanford University, and the TBD suite from the University of Toronto on the initial design of MLPerf. We will discuss the benchmarks in MLPerf and also elaborate on the subtle nuances of developing an ML benchmark for the industry, such as how models are chosen and prepared for benchmarking.

* Challenges Presented by Existing Frameworks
We will explain the common pitfalls in benchmarking ML systems. For instance, a common benchmarking trap is to assume that any two ML frameworks are naturally mathematically equivalent in their implementation of ML models. But no two frameworks (e.g., PyTorch, TensorFlow, MXNet, Caffe2) are truly alike. There are pros and cons to each framework’s implementation, and understanding these subtleties is critical to correctly benchmarking systems and understanding the performance of various ML/DL models. Apart from ML frameworks, other factors play an important role in benchmarking, such as the pre- and post-processing steps that are essential for running the benchmarks inside a framework, other supporting software libraries and their versions needed to compile the framework, and architecture configurations of the underlying computing hardware. We will show that there are some non-obvious intricacies and subtleties that, if not well understood, can lead to “mysterious” inconsistent comparisons. Hence, as system researchers, we attempt to teach/showcase the importance of avoiding common benchmarking pitfalls of ML models on different frameworks, computing hardware, the supporting software stacks, and the end-to-end benchmarking pipeline on different datasets.

* Tools and Methodologies for Evaluating Platforms
There is a need for evaluation platforms that can enable the evaluation of different hardware platforms, software frameworks, and models, all across cloud and edge systems. We will introduce a set of open-source, hardware/software-agnostic, extensible, and customizable platforms for evaluating and profiling ML models across datasets, frameworks, and hardware, and within different AI application pipelines. We will demonstrate the set of tools and methodologies built under the TBD Benchmark Suite from the University of Toronto and Microsoft Research, with a key focus on analyzing performance, hardware utilization, memory consumption, and the performance aspects (networking and I/O) related to distributed training. We will also cover an open-source evaluation platform called MLModelScope, from the IBM-ILLINOIS Center for Cognitive Computing Systems Research (C3SR), which lowers the cost and effort of model evaluation and profiling, making it easier to reproduce, evaluate, and analyze accuracy, performance, and resilience claims of models, frameworks, and systems. All major frameworks, hundreds of models, data sets, and many major hardware types are available under it.

The industry sees a need to educate students and professionals in the art of ML benchmarking and analysis. Deep learning is a complex space that requires optimization of algorithms, software, and hardware stacks. Our goal is that attendees leave the tutorial with a good sense of the ML landscape, understand the state of affairs in ML model benchmarking for conducting research, appreciate the research value of ML benchmarking, and learn the existing tools to debug and analyze the performance of their ML/DL models. Ideally, the tutorial raises the bar for informed research and sparks new ideas.

Vijay Janapa Reddi (Harvard University)
Jinjun Xiong (IBM)
Wen-mei Hwu (UIUC)
Gennady Pekhimenko (University of Toronto)
Abdul Dakkak (UIUC)
Cheng Li (UIUC)

Vijay Janapa Reddi (Harvard University)
Wen-mei Hwu (UIUC)
Jinjun Xiong (IBM)
Karthik V Swaminathan (IBM)
Cody Coleman (Stanford)
Carole-Jean Wu (ASU and Facebook)

Call for Participation: Tutorial on High Performance Distributed Deep Learning
Submitted by Ammar Ahmad Awan

High Performance Distributed Deep Learning: A Beginner’s Guide
in conjunction with ISCA 2019
Phoenix, Arizona, USA
June 22, 2019

The current wave of advances in Deep Learning (DL) has led to many exciting challenges and opportunities for Computer Science and Artificial Intelligence researchers alike. Modern DL frameworks like Caffe2, TensorFlow, Cognitive Toolkit (CNTK), PyTorch, and several others have emerged that offer ease of use and flexibility to describe, train, and deploy various types of Deep Neural Networks (DNNs). In this tutorial, we will provide an overview of interesting trends in DNN design and how cutting-edge hardware architectures are playing a key role in moving the field forward. We will also present an overview of different DNN architectures and DL frameworks. Most DL frameworks started with a single-node/single-GPU design. However, approaches to parallelize the process of DNN training are also being actively explored. The DL community has moved along different distributed training designs that exploit communication runtimes like gRPC, MPI, and NCCL. In this context, we will highlight new challenges and opportunities for communication runtimes to efficiently support distributed DNN training. We also highlight some of our co-design efforts to utilize CUDA-Aware MPI for large-scale DNN training on modern GPU clusters. Finally, we include hands-on exercises in this tutorial to enable attendees to gain first-hand experience of running distributed DNN training experiments on a modern GPU cluster.

Call for Participation: SYSTOR 2019
Submitted by Eran Gilad

12th ACM International Systems and Storage Conference (SYSTOR)
Haifa, Israel
June 3-5, 2019

Registration is now open:

The organizing committee is delighted to invite you to the 12th ACM International Systems and Storage Conference, SYSTOR 2019. SYSTOR is a single-track conference that serves as an international platform dedicated to the broad area of systems and storage. The technical program features original peer-reviewed research papers, three keynotes delivered by distinguished speakers, highlight papers recently published in related top-tier conferences, and a poster session. SYSTOR 2019 is sponsored by SIGOPS and held in cooperation with USENIX.

James Larus (EPFL) – Caches Are Not Your Friends: Programming Non-Volatile Memory
Bill Bolosky (Microsoft Research) – Biological Data Is Coming to Destroy Your Storage System
James Bottomley (IBM Research) – Is there Virtualization Beyond Containers? And Is It Useful to the Cloud?

Social event:
Guided Tour of Carmel Mountains Area

The full program is available at

Call for Participation: Tutorial on OpenPiton and Ariane: The RISC-V Hardware Research Platform
Submitted by Jonathan Balkind

Tutorial on OpenPiton and Ariane: The RISC-V Hardware Research Platform
in conjunction with ISCA 2019
Phoenix, Arizona, USA
June 23, 2019

OpenPiton+Ariane is a permissively-licensed open-source framework designed to enable scalable architecture research prototypes. With the recent addition of SMP Linux running on FPGA, OpenPiton+Ariane is the first Linux-booting, open-source, RISC-V system that scales from single-core to manycore. Building on the maturity of the OpenPiton platform and the Ariane 64-bit RISC-V processor, OpenPiton+Ariane is the ideal RISC-V hardware research platform.

On Sunday June 23rd at ISCA/FCRC 2019 we will be holding a half-day afternoon tutorial to get interested users acquainted with the platform. This is a hands-on session which will first introduce attendees to our validation infrastructure using open-source simulators. Attendees will also learn how to synthesise multiple cores to FPGA and get direct experience with booting multicore Linux on our provided FPGAs. We will also teach attendees how to configure and extend the OpenPiton architecture to enable architecture research.

Register for the tutorial and enjoy an early registration discount until May 24th. More details on our tutorial are available at

Call for Participation: Tutorial on Demystifying Memory Models Across the Computing Stack
Submitted by Yatin Manerkar

Tutorial on Demystifying Memory Models Across the Computing Stack
in conjunction with ISCA 2019
Phoenix, Arizona, USA
June 22, 2019

Do you find memory consistency models inscrutable? Does your parallel program mysteriously crash or give counterintuitive results? Do you lie awake at night wondering if your processor is buggy? Then this tutorial is for you!

Memory consistency models (MCMs) specify ordering rules which constrain the values that can be read by loads in parallel programs. Correct MCM implementations are critical to parallel system correctness, and each layer of the hardware/software stack is responsible for enforcing some of the orderings required. However, MCMs are notoriously complicated to specify and verify, and often require examination of many counterintuitive corner cases.
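
To make the idea of ordering rules constraining load values concrete, here is a minimal sketch that enumerates the outcomes of the classic store-buffering litmus test under sequential consistency, where every execution is some interleaving of the threads' program-order operations. It is an illustration only and is not part of the Check suite; the names THREAD0, THREAD1, interleavings, run, and sc_outcomes are hypothetical.

```python
# Store-buffering litmus test: each thread stores to one location,
# then loads the other. Both locations start at 0.
THREAD0 = [("store", "x", 1), ("load", "y", "r0")]
THREAD1 = [("store", "y", 1), ("load", "x", "r1")]


def interleavings(a, b):
    """Yield every merge of a and b that preserves each thread's program order."""
    if not a:
        yield list(b)
        return
    if not b:
        yield list(a)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest


def run(schedule):
    """Execute one interleaving against a single shared memory; return (r0, r1)."""
    memory = {"x": 0, "y": 0}
    regs = {}
    for op, loc, arg in schedule:
        if op == "store":
            memory[loc] = arg
        else:
            regs[arg] = memory[loc]
    return (regs["r0"], regs["r1"])


def sc_outcomes():
    """Set of (r0, r1) results reachable under sequential consistency."""
    return {run(s) for s in interleavings(THREAD0, THREAD1)}


if __name__ == "__main__":
    print(sorted(sc_outcomes()))
```

Under sequential consistency the outcome r0 = 0, r1 = 0 never appears, because at least one store must precede both loads. Weaker models such as x86-TSO do permit that outcome unless fences are inserted, which is the kind of counterintuitive corner case an MCM specification has to pin down.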

This tutorial will teach hardware and software researchers how to navigate the world of MCMs through the Check suite, a collection of automated tools for formal MCM verification developed by our group at Princeton over the last five years. With tools for every layer of the stack from high-level languages to Verilog RTL, we have something for virtually all FCRC attendees. Topics we will cover include:

– automated all-program inductive proofs of hardware MCM correctness
– how to balance high-level language MCM requirements against hardware optimizations
– MCM verification of real Verilog RTL!

Along the way, we’ll point out some of the real-world bugs we’ve discovered in the course of our work, including IBM compiler flaws, issues with the RISC-V ISA, and even a problem with the C++ memory model.

So whether you want to verify that your processor correctly implements its MCM, or you’re wondering what fences to include in your shiny new ISA, or you just want to learn more about memory consistency, please do attend our tutorial on the morning of Saturday, June 22nd at FCRC 2019 in Phoenix, AZ. We hope to see you there!

FCRC Registration:

ISCA Tutorial List:

Yatin Manerkar, Caroline Trippel, Prof. Margaret Martonosi (Princeton University)

Call for Workshops/Tutorials: IISWC 2019
Submitted by Clay Hughes

2019 Annual IEEE International Symposium on Workload Characterization (IISWC)
Orlando, Florida, USA
November 3-5, 2019

Proposal submission deadline: July 5, 2019

The IEEE International Symposium on Workload Characterization is dedicated to the understanding and characterization of workloads that run on all types of computing systems. Whether they are embedded systems or massively parallel systems, the design of future computing machines can be significantly improved if we understand the characteristics of the workloads that are expected to run on them.

We are soliciting proposals for workshops and tutorials to be held on the Sunday before the main conference, on November 3, 2019. Please send your proposal to Clay Hughes by Friday, July 5, 2019. In your email, be sure to include the following information:
– Session type (workshop/tutorial)
– Name of session
– List of organizers
– Description of session
– Duration of session (full day or half day)

Please view the SIGARCH website for the latest postings, to submit new posts, and for general SIGARCH information. We also encourage you to visit the SIGARCH Blog.

- Boris Grot
SIGARCH Content Editor