It’s a marvelous time in computer systems. For me, working in Deep Learning feels like living through a scientific revolution. Kuhn described this kind of change in his classic book, where a new paradigm takes hold, causing entire fields to change their standard practices. The revolution has already transformed image and speech recognition, translation, search, and how we play Go, with new results arriving continually. I’m even more excited about the possibility that deep learning will transform systems and computer architecture in their turn.
It’s also a marvelous time in computer architecture, as the huge computing demands of machine learning have led to a new burst of creativity in our field. GPUs sparked the revolution, and Google’s TPUs show the possibility of even greater benefits through specialization. Machine learning constitutes an increasing fraction of the papers and sessions of architecture conferences. There are now at least 45 hardware startups with $1.5 billion in investment targeting machine learning. After years of dominance by a handful of computer architectures, we are at the cusp of a Cambrian explosion in architectural diversity (a word of warning, though: very few of the body forms in the Burgess Shale look like animals alive today).
This transfer from research to commercial deployment has been remarkably fast: the breakthrough AlexNet paper, which markedly improved image classification using deep learning, was published just six years ago. Commercialization raises the stakes, creating a strong desire to understand and quantify the benefits of deep learning technologies. In an ideal world, each new user of machine learning would evaluate the costs and benefits of the technology for their particular application. In the real world, that’s too expensive, and the expertise needed to measure well is unevenly distributed. We need good summary measures: it’s time for the machine learning system-building community to build good benchmarks.
Good benchmarks serve a number of different purposes and groups. For newcomers, benchmarks provide a summary that helps them orient in a maze of new terms and data. For sophisticates, benchmarks provide an easily portable and quick-to-collect baseline, where specific measurements will give more relevant data, and disagreement between the benchmark and the specific measurement suggests that more investigation is needed. For users and solution providers, benchmarks provide the possibility of sharing the development costs of measurement infrastructure. Benchmarks encapsulate expert opinions about what’s important and valuable, multiplying the impact of the advice that went into benchmark selection and construction. While not immune to manipulation, benchmarks provide a counterweight to product marketing hype. When the benchmarks are “representative,” they allow engineering effort to be focused on a small but high-value and widely used set of targets. In the best cases, benchmarks initiate a virtuous circle, propelling a cycle of optimization and improved value for all members of a community: manufacturers and users, academics and businesses, consultants and analysts.
Readers of this blog know SPEC well; it and the TPC benchmarks are shining examples of good benchmark suites. Their launch and adoption correspond to Golden Ages of innovation in microprocessor performance and transaction processing, respectively. While it is difficult to distinguish correlation from causation, it seems likely that common, good metrics helped these two large fields to make rapid, measurable progress. Wouldn’t we want a similar Golden Age for machine learning systems?
What might a good ML benchmark look like? Allow me some rhetorical indulgence, in grouping the attributes that I think are important under five “R”s:
- Relevant metrics. When I meet with my neural network colleagues, we often have to clarify what we mean by performance. They mean precision and recall or top-5 accuracy. As a computer architect trained in the quantitative approach championed by Hennessy and Patterson, I mean time-to-solution. For some deep learning systems, the correct metric is likely a combination, such as the DAWNBench time-to-accuracy metric. But this is just a start; for example, it seems tricky to come up with a single metric for some reinforcement learning systems. Many studies today report examples/second, a throughput metric that captures neither accuracy nor time-to-solution for either training or inference.
- Representative workloads. A good benchmark suite is both diverse and representative, where each workload in the suite has unique attributes and the suite collectively covers a large fraction of the application space. There are dangers in overweighting traditional, well-understood tasks (ImageNet) and underweighting emerging ones (GANs and ML for healthcare). Choosing the set of workloads is probably best done with expert guidance. The Fathom benchmark suite emphasized broad representation and analyzed the mix of operation types in their set to better understand the differences among workloads.
- Recent problems. Deep Learning is changing rapidly, so any fixed benchmark suite will quickly become obsolete. SPEC aimed to unveil a new version of its suite every three years; that pace seems far too slow for today’s ML ecosystem. Rapid (annual?) iteration makes for more work but also allows a benchmark suite to remain relevant and respond to new developments.
- Repeatable measurements. ResNet-50 is a de facto neural network benchmark, and it’s common to report ResNet results in examples/second. But “ResNet-50” is a name, not a specification or an executable program. A recent comparison from RiseML that spanned hardware and frameworks pointed out that one system “applies very compute-intensive image pre-processing steps and actually sacrifices raw throughput” to improve time-to-solution, observing, “The improved…convergence is likely due to the better pre-processing and data augmentation, but further experiments are needed to confirm this.” This difficulty in comparison reflects our current state of the art. Cloud services and systems such as Docker make repeatability much easier today. But new hardware takes time to reach the Cloud, so a good benchmark suite should have a way of supporting repeatability regardless of where an experiment is conducted.
- Reasonable cost. One virtue of SPEC was that academics could easily include versions of its results in their papers. At the opposite extreme, some TPC results require large, expensive systems to demonstrate the best cost/transaction/second. While some sophisticated users require scale, a benchmark suite should be affordable.
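The distinction between raw throughput and a DAWNBench-style time-to-accuracy metric can be sketched in a few lines of Python. This is a minimal illustration, not any benchmark's actual harness: `train_step` and `evaluate` are hypothetical stand-ins for whatever training and validation hooks a real system exposes.

```python
import time

def time_to_accuracy(train_step, evaluate, target_accuracy, max_steps=10_000):
    """Run train_step repeatedly, checking accuracy after each step.

    Returns (elapsed_seconds, steps_taken) the first time evaluate()
    reaches target_accuracy, or None if it never does within max_steps.
    Unlike examples/second, this metric only rewards work that actually
    moves the model toward the accuracy target.
    """
    start = time.perf_counter()
    for step in range(1, max_steps + 1):
        train_step()  # one unit of training work (hypothetical hook)
        if evaluate() >= target_accuracy:  # validation accuracy (hypothetical hook)
            return time.perf_counter() - start, step
    return None  # never converged: throughput alone would hide this
```

Note how a system that does expensive pre-processing looks worse on examples/second but can win on this metric by converging in fewer steps, which is exactly the RiseML observation above.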
I couldn’t think of a way to capture “portable” in a word that starts with “R”, but of course portability is fundamental to any benchmark suite, not just ML benchmarks.
There are a number of efforts to build benchmark suites for deep learning. I’ve already mentioned Harvard’s Fathom and Stanford’s DAWNBench. Baidu’s DeepBench is built from microbenchmarks rather than whole applications. TBD, a collaboration between the University of Toronto and Microsoft, was just announced in March. Each deep learning framework has its own page listing models and often reporting accuracy, performance, or both (Caffe, CNTK, MxNet, PyTorch, TensorFlow). NVIDIA has an ML-specific performance page, as does training startup GraphCore; I expect other startups to follow suit as their systems mature. So as a community we are already doing benchmarking work, but each group labors separately and produces results that are difficult to compare.
How can we improve this situation? The history of SPEC and TPC shows us the way: join forces, and collaborate to build a common benchmark suite for Machine Learning. Include multiple viewpoints, spanning academic and commercial, hardware and software, inference and training, researchers and developers. A group of universities and companies, including many of the teams listed above, are banding together to build such a common benchmark suite: MLPerf. We bring our needs and hunches, and our hopes and fears for the future to the table, and we work to build a shared consensus. None of us will get exactly what we think we want; that’s the nature of compromise. But I believe that the whole will be far better than anything that could be built by one group. That sounds like the start of a virtuous cycle to me.
About the Author: Cliff Young is a member of the Google Brain team and works on codesign for deep learning accelerators. He was one of the architects of Google’s TPU and contributes to the MLPerf benchmarking effort.