A Checklist Manifesto for Empirical Evaluation: A Preemptive Strike Against a Replication Crisis in Computer Science

The SIGPLAN Empirical Evaluation Checklist (PDF)

In 2009, Dr. Atul Gawande, a surgeon at Brigham and Women’s Hospital in Boston, published The Checklist Manifesto: How to Get Things Right, describing his experience using checklists to reduce the risk of errors. Gawande observed that a number of serious airplane accidents were due to avoidable human errors. That is, despite the fact that pilots are highly trained professionals, they still routinely made mistakes which led to crashes.

In response, aviation officials instituted pre-takeoff checklists; pilots were initially reluctant to use them, but once they did, certain kinds of accidents dropped drastically. Gawande advocated the same approach for another domain with highly trained professionals, medicine, where the adoption of checklists had a similarly dramatic impact on reducing harm due to error.

While most research in computer science (thankfully) does not run the risk of killing people, there are other serious risks when scientists are not sufficiently careful. A number of sciences are now in the grip of the so-called replication crisis, in which many articles have had to be retracted once their results proved non-reproducible.

Artifact Evaluation

One recent initiative that can help prevent this kind of embarrassment is the Artifact Evaluation Process (AEP), which originated in the Programming Languages (SIGPLAN) and Software Engineering (SIGSOFT) communities and was recently discussed in a Computer Architecture Today post; the first major conference to conduct such a process was ESEC/FSE (2011).

The AEP aims to raise the standards for reproducibility by encouraging authors to submit research artifacts (e.g., systems and data) with which reviewers can confirm the results presented in the paper. Though the AEP occurs after the acceptance decision, authors have volunteered to follow it in ever-growing numbers. Many major CS conferences have now adopted an AEP, and it has received backing from the ACM.

The benefits of an AEP are welcome but limited in scope. An AEP can confirm that results presented in the paper are reproducible; it cannot effectively judge whether those results are sufficient to support the paper’s research claims. For example, for a paper that presents a new program analyzer, an AEP can confirm that, as set out in the paper, the analyzer indeed finds bugs in a particular set of target programs.

But an AEP does not determine whether those results are sufficient to prove a claim of the analyzer’s general utility (e.g., by assessing whether the target programs are sufficiently representative). Leaving that question to the AEP would be too late, as the paper has already been accepted. Rather, the determination should occur during the paper review process.

Our observation back in mid-2017 was that, unfortunately, such judgments are not applied consistently when considering empirical evidence: Papers in top SIGPLAN conferences with significant methodological flaws were being published with regularity.

The Checklist

To study the problem, we formed the SIGPLAN ad hoc committee on Empirical Evaluations. As we learned more, we realized that what we were seeing was not so surprising. Good research requires managing many details, some of which are easily overlooked. Papers are often reviewed under significant time pressure, so reviewers can miss these problems. Moreover, the field is moving fast, and not everyone is aware of the best research methods.

In short, what we were seeing were errors of the same sort Gawande was seeing in failures of commercial air travel and emergency rooms. With his manifesto (and the success of AEP) as a loose guide, we developed an Empirical Evaluation Checklist (PDF) for authors and reviewers of research in programming languages. We hope to inspire other communities across computer science, such as SIGARCH, to adopt our checklist (possibly with changes) or develop their own.

The checklist is short: it fits on one page and consists of just seven items, each with associated example violations for illustration. The items are meant to be comprehensive, applying to the breadth of possible empirical evaluations. The example violations for each item highlight concrete, common areas in which we observed that best practice was frequently not followed.

These are meant to be useful and illustrative, but they are neither comprehensive nor applicable to every evaluation. For less common empirical evaluations, other example violations may be relevant, even if not presented in the checklist explicitly.

The seven checklist items are given below. These are grouped with their example violations in the full checklist.

Clearly Stated Claims. Papers—explicitly or implicitly—include claims. If the claims of a paper are not clear, how can an evaluation evaluate them?
Suitable Comparison. Papers present novel contributions. Novel means different from—and ideally better than—prior work. How can one claim to improve upon prior work if one does not perform a suitable comparative evaluation?
Principled Benchmark Choice. Evaluations often use benchmarks, example tasks, or workloads on which to assess an idea. If such benchmarks are not chosen in a principled way, then the findings of an evaluation might be meaningless.
Adequate Data Analysis. The results of a quantitative evaluation usually are summarized for presentation in a paper. If this data analysis or summarization is inadequate, the presented results can be misleading.
Relevant Metrics. A quantitative evaluation determines the effects of an idea. To be useful, the evaluation needs to consider all the relevant metrics.
Appropriate and Clear Experimental Design. Experimental design is key to empirical evaluation. The soundness of the evaluation depends on the design being appropriate, while the reproducibility of the evaluation depends on the design being clear.
Appropriate Presentation of Results. After performing a data analysis of the empirically collected metric values, authors usually transform the results into a representation to include in the paper. No matter whether that presentation is textual, tabular, or in the form of graphs, it needs to not mislead the reader.
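
As a rough illustration of the Adequate Data Analysis item, the sketch below summarizes repeated measurements with a mean and a percentile bootstrap confidence interval rather than a single number. The run times, the helper name `bootstrap_ci`, and the choice of the bootstrap itself are assumptions made for illustration; none of them is prescribed by the checklist.

```python
import random
import statistics


def bootstrap_ci(samples, stat=statistics.mean, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for stat(samples).

    Hypothetical helper: resamples the data with replacement, recomputes the
    statistic each time, and returns the central `level` share of the estimates.
    """
    rng = random.Random(seed)  # fixed seed so the summary is repeatable
    estimates = sorted(
        stat(rng.choices(samples, k=len(samples))) for _ in range(n_resamples)
    )
    tail = (1 - level) / 2
    lo_idx = round(tail * n_resamples)
    hi_idx = round((1 - tail) * n_resamples) - 1
    return estimates[lo_idx], estimates[hi_idx]


# Made-up run times (seconds) for ten repeated runs of each system.
baseline = [10.2, 10.5, 9.9, 10.8, 10.1, 10.4, 10.0, 10.6, 10.3, 10.2]
improved = [9.6, 10.1, 9.4, 10.3, 9.8, 9.7, 9.5, 10.0, 9.9, 9.6]

summary = {}
for name, runs in (("baseline", baseline), ("improved", improved)):
    low, high = bootstrap_ci(runs)
    summary[name] = (statistics.mean(runs), low, high)
    print(f"{name}: mean={summary[name][0]:.2f}s, 95% CI=[{low:.2f}, {high:.2f}]")
```

Reporting the interval alongside the mean lets a reader judge whether a difference between two systems is larger than the run-to-run variation. Whether the bootstrap is the right tool depends on the experiment, so this should be read as one possible summary, not a recommended default.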

Frequently Asked Questions

While the feedback we have received has been almost entirely positive, several people have expressed concerns – interestingly, quite similar to those reported by Gawande. Here we discuss the top three concerns we have heard.

Doesn’t everyone already know this stuff? Yes, in principle, but as we describe above, we derived the checklist by identifying widespread cases where authors did not check all the boxes. While the pressure to publish a computer science article is certainly less than that involved in brain surgery or flying airplanes, the rush to meet a deadline does seem to result in things getting lost in the shuffle.
While some people see the checklist as too obvious, others worry in the opposite direction: will this raise the bar too high? That is, will the existence and use of a checklist somehow squeeze out empirical research? Perhaps some checklist could, but not this one: we deliberately chose checklist items that represent a consensus view of the minimum bar for a correct empirical evaluation. The checklist is simply there to make sure that no key aspect of evaluation is forgotten, either during writing or reviewing.
Finally, some worry that the checklist will make rejection easier. They envision a world where reviewers use the checklist mechanically and simply reject papers that fail to check all the boxes.
First, we stress that the checklist absolutely cannot replace proper reviewing. Second, we believe that a failure to check a box may well be an appropriate reason to reject the paper, but this requires nuance. Third, we believe that, on the contrary, it may make papers harder to reject. The checklist represents a uniform standard that all empirical evaluations are held to. If a paper checks all the boxes, it makes it much harder for a reviewer to reject based on its evaluation!

Since we introduced the checklist, its reception by the SIGPLAN community has been overwhelmingly positive. The community provided thoughtful feedback on our initial drafts via an online form, which we incorporated into subsequent revisions of the checklist and an accompanying FAQ. Attendees responded positively to both of our SIGPLAN Town Hall meeting presentations in 2018, where we heard anecdotal reports about the checklist being adopted by the community, particularly in the training of graduate students. We met with the PLDI steering committee, and after incorporating further feedback, the checklist was included in the advice for authors in the PLDI 2019 call for papers.

Evolving the Checklist

This checklist was shaped by problems we discovered when reviewing a large number of recent SIGPLAN papers (including ASPLOS). While most of the checklist items will equally apply to other communities, notably SIGARCH, some example violations may not be applicable in other settings, and it may be that other checklist items will be appropriate elsewhere.

We acknowledge these limitations and hope that other communities will adapt our checklist to better match their needs. Furthermore, computer science is a fast-moving discipline, and our empirical evaluations must continuously evolve to meet our needs, so we see this checklist as a living document, expecting and hoping that it will change as the discipline does.

Call to Action

The embrace of the Empirical Evaluation Checklist by the SIGPLAN community is an encouraging sign of our collective will to improve standards of empirical research. We believe the checklist, with minor modifications, could be adopted by SIGARCH as well, and we look forward to feedback from the SIGARCH community.

About the Authors: Emery D. Berger is a professor in the College of Information and Computer Sciences at the University of Massachusetts Amherst and is currently a visiting researcher at Microsoft Research and the University of Washington. Stephen M. Blackburn is a professor in the Research School of Computer Science, Australian National University. Matthias Hauswirth is an associate professor in the Faculty of Informatics at the Università della Svizzera italiana (USI). Michael W. Hicks is a professor in the Department of Computer Science at the University of Maryland, College Park.

Disclaimer: These posts are written by individual contributors to share their thoughts on the Computer Architecture Today blog for the benefit of the community. Any views or opinions represented in this blog are personal, belong solely to the blog author and do not represent those of ACM SIGARCH or its parent organization, ACM.
