Overwhelming Statistical Evidence That Our Review Process Is Broken

I have been saying that the high scores of over-positive program committee (PC) members (OPPC) mess up the paper rankings, the coverage of online discussions (lower-score papers are ignored), and the discussion order at the PC meeting (see https://www.sigarch.org/false-accepts-and-false-rejects). While false-accepts are unfair (papers that should be rejected are accepted), the more pernicious effect is false-rejects (papers that should be accepted are rejected), which hurt forward progress, not to mention the human cost to the authors. Even if a false-accept is eventually rejected, it pushes out good but unlucky papers (which do not even get discussed, let alone accepted).

Previously, I analyzed only the pre-rebuttal score distributions, not their impact on the actual outcomes. Now I have statistical evidence of that impact.

CAUSE OF THE PROBLEM: Data from an actual conference

Assumptions: (1) the final acceptance rate is 20% (i.e., 20% of the papers are “good”), (2) an average score of weak accept or better (a weak-accept-or-better score, WABS) for a paper means a high chance of acceptance, and (3) good papers are randomly distributed among reviewers (this assumption is discussed later in “Limitations”).

Under the binomial distribution, a PC member has a 0.7% (0.007) probability of getting 8 or more “good” papers out of 16 (i.e., a personal acceptance rate >= 50%). In one conference, as many as 35.5% of the PC members each gave WABS to 8 or more papers BEFORE the rebuttal, whereas the binomial distribution says that only 0.7% of PC members may have 8+ “good” papers. The null hypothesis (fraction of PC with 8+ good papers is 0.007) is rejected at a significance of 0.00005 (the statistical equivalent of finding the Higgs Boson!). This says that 35.5% of the PC cannot each claim that half or more of the papers assigned to them are “good”. There were 1, 2, and 2 OPPC who had personal acceptance rates of 62.5%, 68.75%, and 75%, respectively (10, 11, and 12 WABS out of 16) – the binomial probabilities of these counts are 0.00022, 0.000029, and 0.0000031!
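
The binomial figures above can be checked with a few lines of Python. This is a minimal sketch under the post's assumptions (16 papers per PC member, 20% of papers “good”, papers assigned independently); the helper names `binom_pmf` and `binom_tail_ge` are illustrative, not part of any tool mentioned here.

```python
from math import comb

def binom_pmf(k: int, n: int, p: float) -> float:
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def binom_tail_ge(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(binom_pmf(i, n, p) for i in range(k, n + 1))

PAPERS_PER_MEMBER = 16  # review load used in the post
ACCEPT_RATE = 0.20      # assumption (1): 20% of papers are "good"

# Chance that one PC member legitimately holds 8+ "good" papers out of 16.
p_half_or_more = binom_tail_ge(8, PAPERS_PER_MEMBER, ACCEPT_RATE)
print(f"P(X >= 8) = {p_half_or_more:.4f}")

# Chance of exactly 10, 11, or 12 good papers (62.5%, 68.75%, 75% of 16).
for k in (10, 11, 12):
    print(f"P(X = {k}) = {binom_pmf(k, PAPERS_PER_MEMBER, ACCEPT_RATE):.2e}")
```

The three small probabilities are point probabilities for exactly 10, 11, and 12 good papers, which is the reading that matches the figures quoted in the text.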

In all the conferences that I have monitored in the last 5 years (more than 8 conferences), 25% or more of the PC have been OPPC, each giving WABS to >= 50% of their papers. Below, I show the OPPC’s impact on the actual outcomes.

IMPACT ON THE ACTUAL OUTCOMES

Even if we assume that the OPPC happened to get disproportionately many good papers, the binomial distribution predicts that the top 35.5% of the PC with the most accepted papers contribute 58.4% of the PC reviews for all the accepted papers (the contribution is more than 35.5% because of legitimately getting the top papers). In the conference, the top 35.5% of the PC contributed 68.7% of the PC reviews for all the accepted papers (this does not say whether the reviews were positive or negative). Now, this top 35.5% of the PC includes some of the OPPC and some non-OPPC who might have legitimately got some good papers. The null hypothesis (fraction of accepted-paper reviews written by the top 35.5% of the PC <= 58.4%) is rejected at a significance of 0.003. This says that the accepted papers have disproportionately many PC reviews from the top 35.5% of the PC, which includes some of the OPPC, and that the papers’ selection is skewed significantly by the OPPC.

If we assume that the OPPC got a random set of papers (instead of disproportionately many good papers), then we can try a different test. For the accepted papers, 50% of the PC reviews came from the OPPC (does not say if the reviews were positive or negative). Because there are 35.5% OPPC, the expected fraction of OPPC reviews for these papers is 35.5%. The null hypothesis (fraction of OPPC reviews for accepted papers <= 35.5%) is rejected at a significance of 0.0002. This says that the accepted papers have disproportionately many PC reviews from the OPPC.

For the accepted papers, only 11.48% did not have any OPPC reviews (the other papers had one or more OPPC reviews). With 3 PC reviews per paper, the probability that all three reviews come from non-OPPC, who are 64.5% of the PC, is 0.645^3 = 26.8% (assumes a random set of papers for the OPPC). The null hypothesis (fraction of accepted papers with only non-OPPC reviews >= 26.8%) is rejected at a significance of 0.005. This says that disproportionately few accepted papers have only non-OPPC reviews.

For the accepted papers, 50.8% had 2 or more OPPC reviews. With 3 PC reviews per paper, the probability of 2 or more OPPC reviews is 28.9% (assumes a random set of papers for the OPPC). The null hypothesis (fraction of accepted papers with 2+ OPPC reviews <= 28.9%) is rejected at a significance of 0.002. This says that disproportionately many accepted papers each have 2+ OPPC reviews.
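
The two baseline probabilities used in these tests (26.8% for no OPPC reviewer and 28.9% for two or more) follow from treating each of the 3 reviewers as an independent draw with a 35.5% chance of being OPPC. The sketch below reproduces them under that assumption; a real assignment samples without replacement from a finite PC, so this is an approximation, and the function name `prob_oppc_reviews` is illustrative.

```python
from math import comb

OPPC_FRACTION = 0.355   # share of the PC that is OPPC, from the post
REVIEWS_PER_PAPER = 3   # PC reviews per paper, from the post

def prob_oppc_reviews(k: int, n: int = REVIEWS_PER_PAPER,
                      p: float = OPPC_FRACTION) -> float:
    """P(exactly k of n reviewers are OPPC), with independent draws."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

p_none = prob_oppc_reviews(0)                              # all three non-OPPC
p_two_plus = prob_oppc_reviews(2) + prob_oppc_reviews(3)   # two or three OPPC

print(f"P(no OPPC review)  = {p_none:.1%}")
print(f"P(2+ OPPC reviews) = {p_two_plus:.1%}")
```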

Out of the reviews written by the OPPC, 26% were for the accepted papers (this does not say whether the reviews were positive or negative). Of the reviews written by non-OPPC, 14% were for the accepted papers. Thus, there is an 86% higher chance for a paper to be accepted if reviewed by the OPPC.
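
The 86% figure is the relative increase of 26% over 14%. A short check, using only the two percentages quoted above (the variable names are illustrative):

```python
oppc_share = 0.26      # fraction of OPPC-written reviews that were for accepted papers
non_oppc_share = 0.14  # same fraction for reviews written by non-OPPC

# Relative increase of the OPPC share over the non-OPPC share.
relative_increase = oppc_share / non_oppc_share - 1
print(f"relative increase: {relative_increase:.0%}")
```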

The papers on the PC meeting discussion list had similar trends.

Worst impact of OPPC

There is a 26.8% chance that a paper does not get any OPPC review. This means more than a quarter of the papers have little hope. But the reality is much worse than a single rejection. If a topic has mostly non-OPPC reviewers (not necessarily negative) then a good but unlucky paper on the topic will repeatedly (really, permanently) get rejected because every conference has more than 25% OPPC reviewers who will repeatedly cause many other false-accepts to be ranked well above the paper. Negative reviewers hurt good papers but OPPC magnify the negative reviewers’ impact.

MITIGATING THIS PROBLEM

NOT HARD! I have been saying this for 5 years. Does NOT need a big change or an overhaul.

I am NOT saying (1) reviewers should be perfect, (2) there can be no difference of opinions among reviewers, or (3) only perfect papers should be accepted (there are no such papers). A reviewer can deviate from what probability predicts but 35% of the PC cannot have 50% good papers (and the same people cannot claim 50% good papers in conference after conference – which is even rarer than finding the Higgs Boson!). Most PC members and chairs do not pay attention to this data readily available on HotCRP. HotCRP shows per-PC average scores which are not meaningful. We should look at the number of WABS per PC or ERC member.

Once reviewing starts, this problem is impossible to fix (any post-facto normalization is nearly impossible when there are 8 indistinguishable weak accepts). Prevention is not only better than cure; cure is impossible. Therefore, BEFORE reviews start, the PC chairs should instruct the PC/ERC on TWO things (THAT’s ALL – NO OTHER CHANGE): (1) that an average score of a weak accept means the paper will likely get accepted – that is, the common standard is that weak accept means above the bar (crucial to avoid false-accepts), accept means well above the bar, and strong accept means way above the bar, and (2) based on the binomial distribution for a 20% acceptance rate and 16 papers, most PC (X% based on PC size) should have 3-4 WABS, some PC (Y% based on PC size) should have 1-2 or 5-6 WABS, and very few PC (literally 1-2) should have 0 or >= 7 WABS. Similar guidelines are applicable to the ERC also (many of the null hypothesis tests work even for the ERC). We claim to be a quantitative community – our review process cannot routinely claim events whose probability is < 0.00005!
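
A PC chair could fill in the X and Y placeholders for guideline (2) directly from the binomial distribution. The sketch below does so for 16 papers and a 20% acceptance rate; `PC_SIZE = 60` is a hypothetical committee size chosen only for illustration, since the post does not state one.

```python
from math import comb

N_PAPERS = 16   # papers per PC member, from the post
P_GOOD = 0.20   # acceptance rate, from the post
PC_SIZE = 60    # hypothetical PC size, NOT from the post

# pmf[k] = probability that a member has exactly k "good" papers.
pmf = [comb(N_PAPERS, k) * P_GOOD**k * (1 - P_GOOD) ** (N_PAPERS - k)
       for k in range(N_PAPERS + 1)]

bands = {
    "3-4 WABS": sum(pmf[3:5]),
    "1-2 or 5-6 WABS": sum(pmf[1:3]) + sum(pmf[5:7]),
    "0 or >= 7 WABS": pmf[0] + sum(pmf[7:]),
}

for label, prob in bands.items():
    print(f"{label:>16}: {prob:5.1%} of the PC, "
          f"about {prob * PC_SIZE:.1f} of {PC_SIZE} members")
```

Under these inputs the 3-4 band and the combined 1-2/5-6 band each cover a bit under half of the PC, and the extreme band covers roughly 5%. The percentages depend only on the review load and acceptance rate; the PC size only scales the member counts.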

Test the above null hypotheses before and after the rebuttal and a week before the final ranking for the PC meeting, and keep reporting the findings to the PC.
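
One way such a recurring check might look is sketched below: count the PC members at or above a 50% personal WABS rate and run a one-sided exact binomial test against the 0.7% baseline. The function `oppc_check`, its parameters, and the sample counts are all hypothetical; the sketch assumes every member has the same review load and that WABS counts have been exported by hand (it does not use any HotCRP API).

```python
from math import comb

def binom_tail_ge(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

def oppc_check(wabs_counts, papers_per_member=16, accept_rate=0.20,
               alpha=0.00005):
    """Test H0: each member has 8+ WABS only with the binomial baseline chance.

    wabs_counts: number of WABS given so far by each PC member.
    Returns (number of OPPC, one-sided p-value, whether H0 is rejected).
    """
    threshold = (papers_per_member + 1) // 2  # 8 of 16: personal rate >= 50%
    p_single = binom_tail_ge(threshold, papers_per_member, accept_rate)
    n_oppc = sum(c >= threshold for c in wabs_counts)
    p_value = binom_tail_ge(n_oppc, len(wabs_counts), p_single)
    return n_oppc, p_value, p_value < alpha

# Hypothetical snapshot: 40 members with 3 WABS each, 22 members with 9 each.
sample_counts = [3] * 40 + [9] * 22
n_oppc, p_value, rejected = oppc_check(sample_counts)
print(f"OPPC: {n_oppc}, p-value: {p_value:.2e}, reject H0: {rejected}")
```

Running the same function on each snapshot (before the rebuttal, after it, and a week before the final ranking) gives comparable numbers to report back to the PC.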

LIMITATIONS OF THIS ANALYSIS

This analysis is limited to OPPC. Over-negative PC (ONPC) exist, but I do not have any data on the undiscussed (and rejected) papers. Therefore, I cannot say much about the ONPC except that there were zero, only one, and only two PC members with 0, 1, and 2 WABS, respectively, before the rebuttal (the binomial distribution predicts as many as 2, 7, and 13 PC members with such WABS). Nevertheless, the same PC members cannot claim across multiple conferences year after year that each time they got only 1 moderately-good paper out of 16. That would surely fail a null hypothesis test too.

The analysis assumes that good papers are randomly distributed among reviewers. Our process violates this assumption by assigning most of the papers on a topic to a subset of reviewers who are experts on the topic. However, some randomization occurs even within this subset, and each reviewer receives papers on multiple topics to avoid only good or only poor papers. If a topic has strong researchers, then the acceptance rate for that topic may be legitimately high. Still, (a) it is unlikely that the topic will have submissions only from such researchers and not others, and (b) less than a handful of topics are likely to fit this category. There is absolutely nothing in our process to justify 35.5% of the PC having >= 50% personal acceptance rates before the rebuttal. Simply impossible. This is a root cause.

CONCLUSION

The OPPC problem is out of control. This must stop now. The fix amounts to the PC chair instructing the PC/ERC on two things before the review starts and ensuring that overall distributions remain sane. That’s all – no other change, no overhaul. I did not say much about the ONPC problem, which is even more pernicious, because I do not have data. I have discussed some ideas for the ONPC in my previous blog “False-Accepts and False-Rejects” (https://www.sigarch.org/false-accepts-and-false-rejects).

About the Author: T. N. Vijaykumar is a Professor of Electrical and Computer Engineering at Purdue University and works on computer architecture.

Disclaimer: These posts are written by individual contributors to share their thoughts on the Computer Architecture Today blog for the benefit of the community. Any views or opinions represented in this blog are personal, belong solely to the blog author and do not represent those of ACM SIGARCH or its parent organization, ACM.
