Having papers in our top conferences is essential for graduate students to get jobs, assistant professors to get tenured, associate professors to get promoted, full professors to become chaired professors, and for academic and industry researchers to be recognized by IEEE/ACM Fellowships and other awards, including the Turing Award. More broadly, the papers establish provenance of ideas to ensure due credit and are fundamental to moving the field forward.
In my title, I single out graduate students and assistant professors because they are affected the most, in a career-starting or career-ending manner, by the presence or absence of conference papers in their resumes.
Given the central importance of conference papers, one would think that the paper review process is sane. Instead, the process resembles the following scenario: Imagine you are taking a college-level computer architecture course. The class has one instructor and ten TAs. The grade is determined by a single exam, and the instructor randomly distributes the exams to the TAs for grading. The instructor does not provide the TAs with an answer key, a rubric, a point distribution, or an expected grade distribution.
The TAs naturally have varied temperaments and expectations. For example, for a question whose correct answer is 4.3, one TA gives full points only if the student has shown all her work and arrived at 4.3, and zero points if even some minute detail is missing, though the intermediate steps are all correct and the final answer is correct. Six other TAs give most of the points for any answer close to 4.3 with correct intermediate steps. The remaining three TAs give most of the points for any answer that is positive! Further, these three TAs score the exams out of 200 points (don’t ask me why) while the rest use 100 points. And finally, the instructor simply sorts the exams by their raw scores without any normalization and assigns A’s to the top 20% and F’s to the rest.
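To make the arbitrariness concrete, here is a minimal simulation of the grading scenario above. All the specific score ranges are my own made-up assumptions, not from the scenario itself; the point is only that raw-score sorting across inconsistent graders determines the grades, not the work:

```python
import random

random.seed(0)

# Hypothetical sketch: 100 exams of comparable quality, distributed
# round-robin across ten TAs with different standards and scales.
NUM_STUDENTS = 100

def ta_score(ta_id):
    """Raw score a TA gives a correct-but-ordinary exam (made-up numbers)."""
    if ta_id == 0:                      # the all-or-nothing TA
        return random.choice([0, 100])
    elif ta_id <= 6:                    # six reasonable TAs, out of 100
        return random.uniform(80, 95)
    else:                               # three lenient TAs, out of 200 (!)
        return random.uniform(120, 200)

scores = [ta_score(i % 10) for i in range(NUM_STUDENTS)]

# The instructor sorts raw scores with no normalization; top 20% get A's.
top_20_percent = sorted(scores, reverse=True)[:NUM_STUDENTS // 5]

# Every A lands on an exam graded out of 200, regardless of the work itself.
print(min(top_20_percent) > 100)   # True
```

Since the three 200-point TAs grade 30 of the 100 exams and even their lowest raw score exceeds the other TAs' maximum, every A goes to their pile: which TA you draw, not what you wrote, decides your grade.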
At this point, if you think this madcap scenario would never occur in our conferences, please see my previous blog post on this topic, Overwhelming statistical evidence that our review process is broken.
I have raised this issue several times. Though many people agree with me that this situation is absurd, the problem persists even though at least some PC chairs are trying quite hard. For example, under the leadership of my good friend Babak Falsafi, ISCA ’18 used per-reviewer paper ranking in addition to scores despite HotCRP’s extremely tedious interface, and Babak’s team sent several hundred personalized messages during the online discussions. Unfortunately, the PC chairs do not address the core issue of defining, a priori, a common standard for all the reviewers and enforcing that standard at all stages of the process. For example, post-facto normalization of scores would not fix the above scenario, because normalizing with respect to a single score ignores the fact that there can be a genuine spread across the reviewers in the fraction of good papers in each reviewer’s pile (8% to 33%). Also, normalization would not differentiate among, say, the eight weak accepts given by a single reviewer, which is exactly the problem. We need a priori guidelines for a common standard and sane enforcement (see A common standard to fix our review process).
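To see why per-reviewer normalization misfires, consider a toy example with made-up score piles (all the numbers below are my own assumptions for illustration): reviewer A's pile genuinely contains more good papers than reviewer B's, and z-score normalization erases exactly that difference.

```python
import statistics

# Hypothetical raw scores (1-10) for two reviewers' piles of 8 papers each.
pile_a = [9, 9, 8, 8, 7, 3, 2, 2]   # 5 of 8 papers are genuinely strong
pile_b = [8, 3, 3, 2, 2, 2, 1, 1]   # only 1 of 8 papers is strong

def z_normalize(scores):
    """Standard per-reviewer normalization: subtract mean, divide by stdev."""
    mu, sigma = statistics.mean(scores), statistics.stdev(scores)
    return [(s - mu) / sigma for s in scores]

# The same raw score of 8 normalizes very differently in the two piles:
# ~0.64 in the strong pile vs. ~2.33 in the weak pile, even though both
# papers may equally deserve acceptance. Normalization assumes every pile
# has the same share of good papers, which the 8%-to-33% spread contradicts.
print(round(z_normalize(pile_a)[2], 2))   # 0.64
print(round(z_normalize(pile_b)[0], 2))   # 2.33
```

Under such normalization, a strong paper unlucky enough to land in a strong pile is penalized relative to an identical paper in a weak pile, which is the opposite of a common standard.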
Babak’s blog post, ISCA18 review process reflections, argues that per-reviewer ranking alleviates this issue. Using ranking to guide the discussion order at the PC meeting is indeed an excellent idea. However, ranking alone does not address the problem of one reviewer hard-rejecting his/her rank #2 (out of 15 papers) while another reviewer weak-accepts his/her rank #9. Any ranking would have to be combined with a common standard, such as requiring that most reviewers’ weak accepts or better fall within, say, ranks #1 through #5 (out of 15 papers). Babak’s blog post also argues that because the reviewers of an accepted paper largely agree on the paper’s scores, the process is sane. This argument is somewhat circular because the reviews are already skewed. Consider an extreme example where some 30% of the reviewers accept every paper in their piles. If all of the reviewers assigned to a paper happen to come from this 30% (a non-trivial probability), then the paper will be accepted. While all the reviewers would agree on the scores of such papers, the outcome is obviously absurd. In reality, our conferences routinely have some 30% of PC reviewers giving weak accepts or better to 50%+ of their piles, which is statistically inconsistent with a final acceptance rate under 20%.
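The "non-trivial probability" in the extreme example can be checked with back-of-the-envelope arithmetic. The reviewer counts per paper and the 300-submission pool below are illustrative assumptions, not figures from any particular conference:

```python
# If 30% of reviewers accept everything and reviewer assignment is
# independent, the chance that ALL of a paper's reviewers come from
# that lenient group is 0.3**r for r reviewers per paper.
LENIENT_FRACTION = 0.3

for r in (3, 4, 5):                        # assumed reviewers per paper
    p = LENIENT_FRACTION ** r
    # Expected papers auto-accepted this way out of a 300-submission pool.
    print(f"{r} reviewers: p = {p:.4f}, ~{300 * p:.1f} of 300 submissions")
```

With three reviewers per paper the probability is about 2.7%, i.e., roughly eight papers in a 300-submission conference could be accepted purely by drawing an all-lenient reviewer set, independent of merit.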
If the above scenario is unacceptable for a college-level course, why is it acceptable for our career-affecting review process? Graduate students and assistant professors (and indeed all of us) should demand a common standard for all the reviewers.
About the author: T. N. Vijaykumar teaches and studies computer architecture in the School of Electrical and Computer Engineering at Purdue University.
Disclaimer: These posts are written by individual contributors to share their thoughts on the Computer Architecture Today blog for the benefit of the community. Any views or opinions represented in this blog are personal, belong solely to the blog author and do not represent those of ACM SIGARCH or its parent organization, ACM.