Artifact Evaluation for Reproducible Quantitative Research

We all love good ol’ architecture research! From a germ of an idea, through a thorny path of its implementation and validation, to its publication. With its publication, hopefully comes its adoption. With its adoption, grows our reputation. With our reputation, come to us many good things including fantastic colleagues and lucrative grants! Therefore, it ought to bother us a great deal when good ideas get no adoption. And that’s why we care deeply about understanding and eliminating barriers to successful adoption. In this blog post, we discuss how “Artifact Evaluation” can foster wider adoption of computer architecture ideas.

The software community can point to many examples of wider adoption facilitated by sharing research artifacts (code, data, scripts, etc.). Perhaps one of the best examples is the LLVM compiler infrastructure, which originated from Chris Lattner’s PhD thesis and has since been widely adopted across industry. To stimulate artifact dissemination in the software community, over the past 6 years we have helped introduce and standardize the so-called Artifact Evaluation (AE) process at many top ACM, IEEE and NeurIPS software conferences such as PPoPP, CGO, PACT, Supercomputing and SysML. AE typically works as follows. Once a paper is accepted, the authors can submit the related artifact to a special committee, usually formed from senior graduate students and postdoctoral researchers. The committee performs lightweight validation of submitted artifacts based on the standard ACM procedure we contributed to a few years ago: it checks whether the artifact is permanently archived, follows its installation instructions and verifies functional correctness, partially replicates results from the paper, assigns reproducibility badges, and so on.
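
The validation steps above can be sketched as a small checklist that maps onto badges. This is only an illustration: the `ArtifactReview` fields, the `assign_badges` function and the badge strings are assumptions made for this example (the badge names are loosely modelled on ACM's badging terminology), not the official ACM procedure.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ArtifactReview:
    """Hypothetical checklist filled in by an AE reviewer for one artifact."""
    permanently_archived: bool   # artifact deposited in an archival repository
    installs_cleanly: bool       # installation instructions worked as written
    functionally_correct: bool   # artifact runs and behaves as documented
    results_replicated: bool     # key results from the paper partially replicated


def assign_badges(review: ArtifactReview) -> List[str]:
    """Map checklist outcomes to illustrative badge names (an assumption)."""
    badges = []
    if review.permanently_archived:
        badges.append("Artifacts Available")
    functional = review.installs_cleanly and review.functionally_correct
    if functional:
        badges.append("Artifacts Evaluated - Functional")
    # Replication only counts if the artifact was functional in the first place.
    if functional and review.results_replicated:
        badges.append("Results Replicated")
    return badges


if __name__ == "__main__":
    review = ArtifactReview(True, True, True, False)
    print(assign_badges(review))
```

Keeping each check as a separate boolean mirrors the lightweight, step-by-step nature of the process: an artifact can earn an availability badge even if replication did not succeed.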

The good news is that over the years the share of accepted papers participating in AE has increased dramatically, from ~30% to ~70%. For example, PPoPP’19 set a new record, with 20 of its 29 accepted papers going through AE. Supercomputing’19 even made artifact appendices mandatory for all submitted papers. ACM has also played a major role in promoting reproducible articles: for example, the ACM Digital Library provides advanced article search based on reproducibility badges. Even more importantly, the ACM Taskforce on Reproducibility continues to improve a common evaluation methodology by taking the authors’ feedback into account. Finally, the community has started discussing how to share reusable artifacts and crowdsource experiments via public repositories to help new research build easily upon past research, thus truly enabling open science.

We believe that the popularity and success of AE owe a great deal to its being completely voluntary and not influencing the final paper decision (at least, currently). At the same time, AE allows the authors to receive important feedback and fix problems in a friendly and cooperative way before making their artifacts more widely available.

If having AE is so clearly beneficial, one might wonder why the top architecture conferences (ISCA, MICRO, HPCA, ASPLOS) have yet to adopt it. After discussing this phenomenon with authors who regularly publish at these conferences, we have identified several concerns and misconceptions:

Not all architecture artifacts can be made publicly available (e.g. if they come from a company’s R&D lab).
Architecture artifacts often use different simulators which makes them hard, if not impossible, to compare.
Architecture artifacts are way more complex than software ones, and therefore are simply too hard or time-consuming to package for AE.
Some researchers will boycott AE to avoid detailed evaluation of their artifacts and fair comparison against those of their competitors.

As for the first three concerns, we heard similar worries from conferences that have since embraced AE. Where there’s a will, there’s a way! For example, AE does not require the authors to make their artifacts publicly available: we have found multiple ways to evaluate artifacts without revealing all the IP. Also, the AE committee focuses on validating the authors’ artifact, not artifacts from other papers; ensuring that techniques are comparable is more of a prerogative of the paper authors and the program committee. Furthermore, other conferences have already featured articles on co-designing very complex SW/HW stacks which have successfully passed AE.

The last worry concerns unique quantitative aspects of architecture research. We are all too familiar with “benchmarking wars” where companies accuse each other of using non-representative benchmarks or outdated competitors’ products to make comparisons appear to be in their favour. We will have much more to say about fair benchmarking in our future posts, but we believe that introducing AE for architecture research would go a long way towards addressing this issue. First, by making their artifacts available through AE, the authors will reduce the effort for other researchers to reproduce their results and hence invite more objective comparisons. Second, sharing components such as benchmarks will grow a pool of components available for others to reuse.

To make an even greater impact, quantitative AE would benefit from sharing workflows and components in an open unified format. To test this out, we organized the 1st ACM ReQuEST tournament at ASPLOS’18, aiming to co-design efficient SW/HW stacks for deep learning and reproduce results using the established AE methodology. We invited the community to submit complete implementations (code, data, scripts, etc.) for the popular ImageNet object classification challenge. Our goal was not only to put every submitted paper through rigorous AE, but also to create portable and highly reusable artifacts. We created a reproducible Collective Knowledge workflow for each artifact to unify evaluation of accuracy, latency (seconds per image), throughput (images per second), peak power consumption (Watts), price and other metrics. We published the unified reproducible workflows on GitHub and added snapshots to the ACM Digital Library.
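
To make the unified metrics concrete, here is a minimal sketch of how one evaluated stack might be recorded and how the two timing metrics relate. The `StackResult` record and the `summarize_timings` helper are hypothetical names invented for this example; they are not part of Collective Knowledge or of the ReQuEST workflows. Note that the "seconds per image" computed here is amortised over batches, which is a simplifying assumption.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class StackResult:
    """Hypothetical unified record for one evaluated SW/HW stack."""
    name: str
    accuracy: float         # fraction of correctly classified images (0..1)
    latency_s: float        # seconds per image
    throughput_ips: float   # images per second
    peak_power_w: float     # peak power consumption in Watts
    price_usd: float        # price of the platform


def summarize_timings(batches: List[Tuple[int, float]]) -> Tuple[float, float]:
    """Return (seconds per image, images per second).

    `batches` holds (batch_size, elapsed_seconds) pairs from timed runs.
    """
    total_images = sum(size for size, _ in batches)
    total_seconds = sum(elapsed for _, elapsed in batches)
    if total_images == 0 or total_seconds <= 0:
        raise ValueError("need at least one image and a positive elapsed time")
    return total_seconds / total_images, total_images / total_seconds


if __name__ == "__main__":
    # Invented timings, for illustration only.
    latency, throughput = summarize_timings([(10, 2.0), (10, 3.0)])
    result = StackResult("example-stack", 0.70, latency, throughput, 5.0, 100.0)
    print(result)
```

Recording every stack with the same fields and units is what makes results from very different platforms comparable side by side.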

Rather than declaring a single winner or naming-and-shaming, we plot all the results on a public interactive dashboard. Using the dashboard, anyone can apply their own criteria to explore the solution space and look for optimal solutions lying on Pareto frontiers (e.g. to find the most energy efficient solution achieving the required accuracy). It allows for flexibility in how one wishes to interpret the results through different filters or lenses.
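
The idea of exploring Pareto frontiers can be sketched in a few lines. The example below is an assumption-laden illustration, not the dashboard's actual implementation: the `Point` type, the two chosen metrics (accuracy and energy per image) and all the numbers are invented for this sketch.

```python
from typing import List, NamedTuple, Optional


class Point(NamedTuple):
    name: str
    accuracy: float   # higher is better
    energy_j: float   # Joules per image, lower is better


def dominates(a: Point, b: Point) -> bool:
    """True if a is no worse than b on both metrics and strictly better on one."""
    no_worse = a.accuracy >= b.accuracy and a.energy_j <= b.energy_j
    strictly_better = a.accuracy > b.accuracy or a.energy_j < b.energy_j
    return no_worse and strictly_better


def pareto_frontier(points: List[Point]) -> List[Point]:
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


def most_efficient(points: List[Point], min_accuracy: float) -> Optional[Point]:
    """Lowest-energy point meeting the accuracy requirement, or None if none does."""
    eligible = [p for p in points if p.accuracy >= min_accuracy]
    return min(eligible, key=lambda p: p.energy_j) if eligible else None


if __name__ == "__main__":
    # Invented data points, for illustration only.
    results = [
        Point("A", 0.70, 1.0),
        Point("B", 0.75, 2.0),
        Point("C", 0.72, 3.0),   # dominated by B: less accurate and more energy
        Point("D", 0.80, 5.0),
    ]
    print([p.name for p in pareto_frontier(results)])
    print(most_efficient(results, min_accuracy=0.72))
```

Because no single point wins on every metric, the frontier is a set of trade-offs; applying a filter such as a minimum accuracy is what turns that set into one recommended solution for a particular user.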

We hope that our successful ReQuEST experience can persuade even the most hardcore AE sceptics that AE can be introduced at architecture conferences. Indeed, we managed to reproduce and compare very diverse and complex ML/SW/HW stacks consisting of:

Platforms: Xilinx Pynq-Z1 FPGA, Arm Cortex CPUs and Arm Mali GPGPUs (Linaro HiKey960 and T-Firefly RK3399), a farm of Raspberry Pi devices, NVIDIA Jetson TX1 and TX2, and Intel Xeon servers in the Amazon, Microsoft and Google clouds.
Models: MobileNets, ResNet-18, ResNet-50, Inception-v3, VGG16, AlexNet, SSD.
Data types: 8-bit integer, 16-bit floating-point (half), 32-bit floating-point (float).
ML frameworks and libraries: MXNet, TensorFlow, Caffe, Keras, Arm Compute Library, cuDNN, TVM, NNVM.

Most importantly, all the artifacts and workflows from this tournament, including tools, libraries, models and data sets, are now publicly available in a common format. This means that the community can continue reproducing the artifacts and fixing issues even after the tournament, as described in the organizers’ report and the Amazon presentation at the O’Reilly AI Conference. We envision that eventually the community will crowdsource benchmarking, co-design and optimization of complete SW/HW stacks (e.g. see our Android crowd-benchmarking demo), which in turn will dramatically reduce the effort and time-to-market required for real-world products.

In conclusion, we should remind ourselves that while AE does place a greater burden on the authors, it leads to wider adoption of their research ideas (with all the good things that come with it). Making artifacts available in a common reusable format allows other researchers to easily build upon them, which implies that the burden of AE can be gradually reduced over time even for the complex, quantitatively oriented artifacts typical of architecture research. Therefore, we very much hope that the architecture community will soon start introducing AE at architecture conferences!

About the Authors: Grigori Fursin (PhD, Edinburgh) is president of the non-profit cTuning foundation, and CTO and co-founder of dividiti. He is a long-time champion of collaborative and reproducible research and a founding member of the ACM Taskforce on Reproducibility. Anton Lokhmotov (PhD, Cambridge) is CEO and co-founder of dividiti. In 2010-2015, he was a technical lead and manager for GPU Compute programming technologies for the Arm Mali GPU series, including production and research compilers.

Computer Architecture Today
