A Common Standard to Fix Our Review Process (and oh, I was wrong about one thing)

Here is a summary of the state of our review process (see https://www.sigarch.org/overwhelming-statistical-evidence-that-our-review-process-is-broken/): (1) an average score of weak accept or better has historically meant that the paper will most likely get in; (2) some 25-35% of PC members routinely give weak-accept-or-better (WAB) scores to 50% or more of the papers in their piles, which is statistically unlikely to be legitimate (but the 65% of the PC who do not do this can fix the problem, as we will see later); and (3) if only 5% of the PC did this, their scores would be offset by the others’ scores, but if 30% do this, then for typical PC sizes there is a 10% chance that a lucky paper goes to three over-positive PC members (OPPC) and has a high chance of getting in. This means that 30-40% of the final program may be unduly influenced by whom the papers went to rather than only by what is in them (the historical acceptance rate is 20%, so a 10% chance could account for as much as half the program). Most importantly, these false-accepts may push down truly good but unlucky papers in the ranking and cause false-rejects.
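The arithmetic behind point (3) can be sketched as follows. This is only an illustration, not necessarily how the figures in the linked post were derived: it assumes that each reviewer of a paper is independently over-positive with probability equal to the OPPC share of the committee, and the reviewer counts per paper and the 0.7 stand-in for "a high chance of getting in" are hypothetical values chosen for the example.

```python
from math import comb


def prob_at_least(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Assumption: each reviewer of a paper is independently an OPPC member
# with probability equal to the OPPC share of the committee.
oppc_share = 0.30  # upper end of the 25-35% range quoted in the text

for reviewers_per_paper in (3, 4, 5):  # hypothetical review loads
    p_lucky = prob_at_least(3, reviewers_per_paper, oppc_share)
    print(f"{reviewers_per_paper} reviewers: P(at least 3 OPPC) = {p_lucky:.3f}")

# Share of the final program that such lucky papers could occupy.
acceptance_rate = 0.20     # historical rate quoted in the text
lucky_chance = 0.10        # chance of drawing three OPPC, quoted in the text
accept_given_lucky = 0.70  # hypothetical stand-in for "a high chance"
share_of_program = lucky_chance * accept_given_lucky / acceptance_rate
print(f"share of program: {share_of_program:.2f}")
```

Under these assumptions, four and five reviewers per paper give roughly 8% and 16%, which brackets the 10% figure, and a 70% acceptance chance for lucky papers puts them at about 35% of the program, inside the 30-40% range above.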

If you think this is absurd (it is), I have a comprehensive and simple fix for this problem.

Unfortunately, review scores can’t be normalized easily. The per-PC WAB rate can legitimately span from about 8% to 33%. So normalizing the scores using a fixed acceptance rate would induce false-accepts at the 8% end of the spectrum and false-rejects at the 33% end. Alternatively, seeing how a reviewer diverges from the others on a paper does not help if the paper gets three OPPC. At the same time, our review process offers no common standard, which has led to the current absurd situation, especially given our large and temperamentally diverse PCs.

(1) A review rubric would provide a common standard for all reviewers. You may be surprised that even the most basic of metrics have become meaningless. For example, whether an accepted paper should be “new and better” has become debatable, and novelty has been deemed subjective to the point of being useless (yes, not all subjectivity can be removed, but that does not make novelty unusable in the vast majority of cases). You may think that such confusion would be uncommon, but quite a few of my co-reviewers in many recent conferences have raised these issues. I can’t believe that I have to argue that “new and better” is not controversial, has always been the standard, and is what ensures progress. Clearly, to make progress, the paper has to be better in some important aspect while being equal, or at least not “much” worse, in other important aspects (otherwise, it is one step forward and two steps backward).

So what would this rubric look like? It is simply not true that we cannot ever agree on one. If that were so, why would we spend thousands of reviewer hours on a selection process? And what would ISCA’s 20% acceptance rate even mean?

I have a sample here based on “new and better”. Of course, PC chairs can make reasonable changes.

What is the problem and why is it compelling? A well-known problem, or a new problem with reasonable evidence of importance. While little evidence is not good, demanding extraordinary evidence for an intuitive claim is not fair either.
What is the best previous solution or a reasonable/solid baseline if no previous work?
What is new in this solution? The paper should pinpoint the novelty over previous work. If a reference covers a claimed contribution then the contribution is not novel (the reviewer must give the reference). In that case, the remaining contributions must be substantial enough, one measure of which is how much the paper improves over previous work (point #3). The context or topic being new (e.g., new technology) does not by itself mean novelty unless the context substantially changes the problem so that the solution is novel. On the flip side, if a paper builds on previous work that alone does not make it “incremental” (if it did, ISCA would have one paper every ten years). If the paper makes substantial contributions beyond the previous work then it is novel. Old problem and new solution is novel too. No amount of implementation is a replacement for intellectual novelty.
How much better is the paper than the best previous work? What does the paper show to convince the reader that the improvements come from the claimed contributions and not from something else? The paper should show reasonable evidence that it is better than the best previous work in one important aspect and equal, or at least not much worse, in other important aspects (being better in some and worse in others should be discussed to ensure overall progress). If there is no previous work, then comparing to a reasonable “baseline” should be acceptable. While not comparing to the best previous work is not good, demanding extraordinary evidence, well beyond community standards, is unreasonable. Further, the paper should show evidence that it is improving for the reasons it claims (e.g., a prefetch paper should show lower miss rates).
The “new and better” criterion should be applied irrespective of whether a reviewer loves or hates the topic of the paper (the thoroughness of evaluation can be modulated for unusual topics but the basic “new and better” must hold).

A few points: (a) You may think this sample is obvious, but you will be surprised by the width of the reviewer leniency-strictness spectrum and by the lack of clarity on these basic metrics. (b) None of these points are exhaustive or 100% objective. But a common standard can ensure that the vast majority of papers (90%?) are judged fairly, and the remaining minority can be judged under the calibration set up by the common standard. (c) And no, the rubric would not make our papers look “regimented and alike”. The key ingredients of a paper (the problem, solution, and results, i.e., 99% of a paper) would be vastly different across papers.

But a rubric alone is not enough. We have to ensure that the rubric is followed, especially by both OPPC and over-negative PC (ONPC).

(2) Abolish online accepts. Typically, 30-50% of the program is decided by online accepts. Bad idea. These are decided solely by 3-5 PC members, among whom the OPPC can have undue influence. PC chairs do this to make room for discussing more papers. But how many papers are discussed is not important. Which papers are discussed and how fair the process is matter more. In my previous blog post, I argued that PC-wide votes are unfair (see https://www.sigarch.org/how-pc-chairs-and-pc-members-can-help-improve-our-process/). I was wrong. All papers should be decided by PC-wide votes so that the non-OPPC (65%) can offset the OPPC (35%) scores. This is not controversial – this is how it was for nearly the first three decades of ISCA. One caveat is that the discussion lead must summarize the paper well for the rest of the PC, who have not read it (else the authors will pay, which is the unfairness I was trying to avoid previously). To that end, PC chairs should give 4-5 specific questions (based on the rubric) to be answered in the summary, so that all summaries answer the same questions. Banning online accepts will not lengthen PC meetings, which routinely discuss 90+ papers of which only 30 get accepted (+ 30 online accepts = 60 accepts total). Instead, discussing 90 papers without any online accepts and taking 60 is fine.

(3) Abolish “champions”. This lets OPPC go wild whereas the others remain restrained. If a paper is good, it will need no champions as long as everyone plays by the rules.

(4) Encourage reviewers to act as judges and not as advocates. PC chairs often tell reviewers to be positive, which makes the OPPC go wild whereas the others remain restrained. Reviewers should be neither positive nor negative. Just the facts and all the facts (both positive and negative), please.

(5) Show the rubric to the authors during rebuttal so they can flag reviewers who don’t play by the rules. This flagging is crucial for reviewer accountability and for keeping the ONPC in check.

(6) Show the per-PC WAB counts throughout the review process so everyone knows their co-reviewers’ dispositions. Give the binomial WAB count expectation before reviewing starts and compare the expectation to the actual distribution throughout the process.

(7) Define the scores: A weak accept means just above the bar (instead of “I’d prefer to accept”), an accept means well above the bar (instead of “this paper should be accepted”), and a strong accept means way above the bar (instead of “I’d champion this paper”). These definitions move away from what may be interpreted as the reviewer’s prerogative and toward the merit of the paper.

(8) Adjust the paper discussion order at the PC meeting based on per-PC WAB counts so the OPPC don’t flood the top ranks (e.g., the top 4-6 papers of every PC member before the rest, irrespective of the numerical scores).
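One way the ordering in suggestion (8) could work is sketched below. The function name `discussion_order`, the input layout, and the tie-breaking rules are all assumptions made for illustration; the suggestion itself only says that each PC member’s top few papers come before the rest, irrespective of the numerical scores.

```python
from collections import defaultdict


def discussion_order(scores, top_k=4):
    """Order papers so that every reviewer's personal top_k come first.

    scores maps (reviewer, paper) -> numeric score, higher is better.
    Hypothetical tie-breaks: within a reviewer, equal scores are ordered by
    paper id; within the first tier, papers that appear in more reviewers'
    top_k lists come earlier.
    """
    by_reviewer = defaultdict(list)
    for (reviewer, paper), score in scores.items():
        by_reviewer[reviewer].append((score, paper))

    top_votes = defaultdict(int)  # paper -> number of top_k lists it appears in
    for ranked in by_reviewer.values():
        ranked.sort(key=lambda sp: (-sp[0], sp[1]))  # best score first
        for _, paper in ranked[:top_k]:
            top_votes[paper] += 1

    papers = sorted({paper for _, paper in scores})
    first = sorted((p for p in papers if top_votes[p] > 0),
                   key=lambda p: (-top_votes[p], p))
    rest = [p for p in papers if top_votes[p] == 0]
    return first + rest


# A lenient reviewer scores everything 5; a strict reviewer tops out at 4.
example_scores = {
    ("lenient", "A"): 5, ("lenient", "B"): 5, ("lenient", "C"): 5,
    ("strict", "A"): 2, ("strict", "D"): 4, ("strict", "E"): 1,
}
print(discussion_order(example_scores, top_k=1))
```

With `top_k=1`, the lenient reviewer contributes only one paper to the first tier, so the strict reviewer’s best paper D is discussed ahead of B and C even though their raw scores are higher.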

These suggestions can be accommodated within our current process without an overhaul, which would carry a high risk of even bigger unforeseen problems.
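For suggestion (6), the binomial expectation of per-PC WAB counts could be computed along these lines. This is a minimal sketch: the pile size of 12, the choice of 20% as the underlying WAB rate, and the 5% cutoff are hypothetical, and the helper names are not taken from any conference tool.

```python
from math import comb


def wab_pmf(pile_size, wab_rate):
    """P(a reviewer gives exactly k WAB scores) for k = 0..pile_size."""
    return [comb(pile_size, k) * wab_rate**k * (1 - wab_rate) ** (pile_size - k)
            for k in range(pile_size + 1)]


def unlikely_counts(actual_counts, pile_size, wab_rate, cutoff=0.05):
    """Return reviewers whose WAB count has tail probability below cutoff."""
    pmf = wab_pmf(pile_size, wab_rate)
    flagged = {}
    for reviewer, count in actual_counts.items():
        tail = sum(pmf[count:])  # P(X >= count) under the binomial model
        if tail < cutoff:
            flagged[reviewer] = tail
    return flagged


pile_size = 12   # hypothetical number of papers per reviewer
wab_rate = 0.20  # assumption: WAB-worthy papers are about as common as accepts

pmf = wab_pmf(pile_size, wab_rate)
print("expected WAB count:", pile_size * wab_rate)
print("P(WAB on half the pile or more):", round(sum(pmf[pile_size // 2:]), 4))

# Hypothetical mid-process snapshot of WAB counts per reviewer.
snapshot = {"r1": 2, "r2": 7}
print(unlikely_counts(snapshot, pile_size, wab_rate))
```

Under these assumptions a reviewer would be expected to give two or three WAB scores, and giving six or more out of twelve has a probability of about 2%, so a count that high stands out against the expectation.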

Conclusion: Defining a common standard for reviews, for per-PC WAB counts, and for PC discussion summaries will go a long way in moving us away from our current absurd situation.

About the Author: T. N. Vijaykumar is a Professor of Electrical and Computer Engineering at Purdue University and works on computer architecture.

Disclaimer: These posts are written by individual contributors to share their thoughts on the Computer Architecture Today blog for the benefit of the community. Any views or opinions represented in this blog are personal, belong solely to the blog author and do not represent those of ACM SIGARCH or its parent organization, ACM.
