
@antonisa
Last active December 1, 2024 15:33
Conference Decisions

The Impossible Task of Conference SACs/PCs or How I lost 3 Nights of Sleep

I am writing this post in order to share my thoughts on the processes behind acceptance/rejection decisions in top-tier (NLP) conferences. I'll first discuss the process and then share some thoughts on its shortcomings.

Before we start, a bit about me. I am an assistant professor (aka, rather junior: I have been in this position for less than 4 years, following my PhD studies and a short postdoc) working on NLP, with a focus on multilingualism and low-resource settings. While I have submitted to, published at, and reviewed for *ACL conferences and workshops for many years, it was at EMNLP'23 that I was a Senior Area Chair (SAC) for the first time.

The Conference Paper Pipeline

Let's first briefly outline the process that a paper undergoes, from submission to decision:

  1. Paper is submitted. The authors outline the track.

  2. After some rudimentary checks for potential desk rejection (does the paper conform to the page limit and follow the appropriate format? is it anonymous? etc) and for fit with the track, the paper is assigned an Area Chair (AC) and reviewers.

  3. Reviewers then read the paper and provide their reviews, along with soundness/excitement/reproducibility scores.1 In some conferences, there may be an author response or general author-reviewer discussion.2

Most people are familiar with these steps, precisely because these days almost anyone who submits papers is also very likely to be a reviewer.

  4. The AC then reads the reviews (and the paper, ideally) and provides a meta-review.

The meta-review is meant to summarize the reviews (and discussion), and provide a recommendation for the paper. Of course, the AC will also bring their own perspective to the mix: perhaps they will deem one review as being particularly harsh, and decide to weigh it less for their recommendation; or the opposite. Again, most people are familiar with this, as they end up seeing the meta-review.

The Role of the Senior Area Chairs

The Senior Area Chairs are assigned to a Track, and they are primarily responsible for what happens to the paper further down the pipeline (which is why they will generally be folks more senior than the ACs and the reviewers). In particular, the SACs are tasked with ranking the papers in their track. There are again some differences between conferences (e.g. *ACL ones vs ICLR/NeurIPS, but let's ignore them here).

This ranking (for *ACL conferences) is: Accept-Main, Borderline-Main-Findings, Accept-Findings, Borderline-Findings-Reject, Reject. For the purposes of this discussion only, I'll ignore the distinction between a publication at the Main conference and Findings, and treat it as a 3-level ranking between Accept, Borderline, and Reject.3 There are multiple criteria for this ranking: the reviews, as reflected by both the actual text and the soundness and excitement scores; the rebuttal and/or discussion with authors providing detailed additional experimentation or reviewers raising their scores; the recommendation from the AC; the SAC's own reading of the papers; thematic diversity considerations and the SAC's own views on what constitutes publication-worthy material.
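
As a side note, this collapse can be written down as a simple mapping. The sketch below is only for this post's simplification; in particular, treating "Borderline-Main-Findings" as an acceptance (since only the venue is undecided) is my own reading, not an official rule:

```python
# One plausible collapse of the five *ACL SAC labels into the 3-level scheme
# used in this post (assumption: "Borderline-Main-Findings" is an acceptance
# either way -- only the venue is undecided).
COLLAPSED = {
    "Accept-Main": "Accept",
    "Borderline-Main-Findings": "Accept",
    "Accept-Findings": "Accept",
    "Borderline-Findings-Reject": "Borderline",
    "Reject": "Reject",
}

print(COLLAPSED["Borderline-Findings-Reject"])  # Borderline
```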

Out of all these criteria, only a few (the "scored" soundness/excitement and the AC recommendation, which also comes with a 1-5 score) can be used quantitatively to produce a definitive ranking. Consider, though, that even these scores always need to be taken with a grain of salt, for a few reasons: almost everyone in the community is over-worked and reviewing is a reward-less endeavor -- so many end up reviewing in haste; almost everyone has unconscious biases -- for example, more junior reviewers tend to be harsher than more senior ones.4 Or maybe the AC had a bad day or simply happened to be hangry when they were writing a meta-review.

Nevertheless, there are many cases that are easy for the SACs to make recommendations on:

  • When all (or most) reviewers and the AC agree that the paper is worthy of publication, it is straightforward to place the paper in an Accept category.
  • Conversely, when all (or most) reviewers and the AC agree that the paper is not sound or not ready for publication, the paper ends up in the Reject pile.

The problem is, the largest portion of papers falls in neither of these categories. Even for "good" papers, it is rather easy to come up with issues that push them into borderline status: for example, one can almost always ask for more experiments and results or further analysis.

Additional Constraints: Acceptance Rate

In theory, any sound paper that is sufficiently different from others and thematically relevant to the conference should be accepted. But conferences operate under space/time constraints due to the venue, and there are additional historical reasons related to a conference's (perceived) prestige,5 which lead to an additional constraint: keeping the acceptance rate at some fairly low number. This is, in my opinion, the root of many of the (perceived) injustices we observe.

In broad terms, an acceptance rate of 20% means that the conference will only accept 20 out of every 100 submitted papers (and note that the denominator also includes desk rejects and withdrawn papers). That's true even if, e.g., 40% of the papers were deemed good enough to potentially appear at the conference by the reviewers/ACs.6 Typically, *ACL conferences have had (historical) acceptance rates of 25-35%.7
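
To make the arithmetic concrete, here is a minimal sketch (all numbers hypothetical) of how the quota interacts with the number of papers reviewers actually liked:

```python
# Hypothetical numbers, for illustration only.
submitted = 100        # the denominator includes desk rejects and withdrawals
desk_rejected = 8
withdrawn = 2
target_rate = 0.20     # the conference-wide target acceptance rate

# The quota is computed over ALL submissions, not just the reviewed ones.
quota = int(submitted * target_rate)               # 20 papers

reviewed = submitted - desk_rejected - withdrawn   # 90 papers fully reviewed
deemed_good = int(submitted * 0.40)                # say 40% were deemed good

print(f"quota={quota}, good-but-rejected={deemed_good - quota}")  # 20, 20
```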

Producing the Ranking

Now, how can the SACs go about producing the ranking of the papers?

One approach is to simply treat the ACs and reviewers as the ultimate arbiters of quality, i.e. do nothing to adjust their scores/suggestions. This is perhaps the laziest approach, and I hope you'll agree with me that this is not what the SACs should be doing.

Instead, I believe that the SACs should form an opinion on the papers they're ranking by looking at the meta-reviews, the reviews, the scores, and ultimately the papers themselves. However, doing so for all papers in a track, especially a large one (e.g. with >100 papers), is basically infeasible.

The solution we came up with in my track was to divide the papers (randomly) among the SACs, produce an initial classification for each batch, and then merge them. We then had a couple of several-hours-long meetings where we went through each and every paper, discussing them as needed and assigning a label (Accept, Borderline, Reject) along with a priority score (1-5) for the borderline papers, which effectively produced the final ranking.
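
Conceptually, the label plus the priority score induces a total order over the track. Below is a minimal sketch of that idea (paper IDs, field names, and data are all hypothetical):

```python
from dataclasses import dataclass

@dataclass
class Paper:
    pid: str
    label: str      # "Accept", "Borderline", or "Reject"
    priority: int   # 1-5; only meaningful for Borderline papers

# A hypothetical batch after the SAC meetings.
papers = [
    Paper("P1", "Accept", 0),
    Paper("P2", "Borderline", 2),
    Paper("P3", "Borderline", 5),
    Paper("P4", "Reject", 0),
]

LABEL_ORDER = {"Accept": 0, "Borderline": 1, "Reject": 2}

# Accepts first, then Borderline papers by descending priority, then Rejects.
ranking = sorted(papers, key=lambda p: (LABEL_ORDER[p.label], -p.priority))
print([p.pid for p in ranking])  # ['P1', 'P3', 'P2', 'P4']
```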

A very helpful tool in this process was a sheet with all papers sorted by average soundness, AC recommendation, average excitement, and our own recommendation/priority. This sheet allowed us to identify potential "mistakes" or "outliers": papers that appeared "out of order", e.g. papers with low scores being ranked higher than papers with high scores, sound papers (based on the soundness score) being rejected, etc. (see the sketch after the examples below).

Examples (all real) include:

  • papers with high soundness scores that fell under Reject: we double-checked the reviews, AC rec, and the paper, sometimes agreeing that the paper should be moved "up", sometimes deciding that the initial decision was correct.
  • papers with low scores that the AC had suggested to accept: again, we double-checked everything and decided accordingly.
  • papers with low scores that fell under Accept: again, we double-checked everything, often finding that bad reviews (which the ACs often decided to ignore, as instructed) were the reason for the low scores.
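
A rough sketch of how such an "out of order" check might look, assuming hypothetical data and a hypothetical disagreement threshold (the real sheet was, of course, inspected manually):

```python
# Flag papers whose rank under our recommendation disagrees sharply with
# their rank under average soundness. All data here is made up.
papers = {
    # pid: (avg_soundness, our_rank) -- our_rank: 1 = best
    "P1": (4.5, 1),
    "P2": (2.0, 2),  # low soundness but ranked high -> should be flagged
    "P3": (4.0, 4),  # high soundness but ranked low -> should be flagged
    "P4": (3.0, 3),
}

# Rank papers by descending average soundness (1 = most sound).
by_soundness = sorted(papers, key=lambda pid: -papers[pid][0])
soundness_rank = {pid: i + 1 for i, pid in enumerate(by_soundness)}

THRESHOLD = 2  # how much disagreement warrants a second look
for pid, (_, our_rank) in papers.items():
    gap = abs(soundness_rank[pid] - our_rank)
    if gap >= THRESHOLD:
        print(f"{pid}: soundness rank {soundness_rank[pid]} "
              f"vs our rank {our_rank} -- double-check")
```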

Why Did I Lose Sleep?

All in all, the process was nerve-wracking. We had to produce a ranking adhering to the acceptance rate quotas, and that meant that some sound papers (which, in my view, "deserve" to be published) would have to be ranked so low that in practice they would be rejected, even if we classified them as "Borderline". At the same time, we had to take into account the opinions of the reviewers and the ACs, and balance them with our own opinions about what is sound, what is exciting, and what is generally "good" for the community (i.e. which works will benefit our community if they are published now, in this venue, versus not). And all that while also trying to be as objective and fair as possible, knowing that students' and researchers' careers can be on the line.

Had our track only received a handful of "good" papers, so that we could just accept those and still remain under the desired acceptance rate, all would have been well. But we received so many sound8 papers: by my estimation, even a 50% acceptance rate would have left decent papers out! I strongly suspect this is not track-specific, but rather a general occurrence.

In the end, two things allowed me to sleep at night and to be able to stand by all our decisions. First, the fact that all the SACs got together and discussed all papers, which means that we all agreed on the final ranking (and didn't take any shortcuts in producing it). Second, the fact that we decided to make our recommendations almost ignoring the target acceptance rate -- and pointing out to the PCs (who in the end have the final say) that we had more publication-worthy submissions than the quota, and strongly recommending an increase of our track's quota.9

The whole process, from the SACs' side, took more than 5 complete working days of intense work. Note that this is volunteer work: I did not get paid extra to do any of this, although I guess you could consider it part of the service to the community that is expected of faculty or even industry researchers. The same goes for the ACs/PCs.10

Should we Change the Process?

I have only described the experience that I had as a SAC in one (large) track of a single conference. In our track, I can genuinely say that everyone took the job seriously, and I firmly believe we made as fair decisions as we could with the information we had in hand. While I do not know whether that's the norm across tracks and across conferences, I strongly suspect that it is. And that's why I say that the SACs (and the PCs) are often faced with an impossible task, especially in big-umbrella tracks that receive a large number of submissions.

While I've seen some calls for an instituted appeals process, I do not support them (with the exception of possible clerical errors). There are some good arguments online, but for me the most important one is that the authors have, by necessity, incomplete information. Even if a paper has high scores or generally positive reviews or whatnot, there is no way of knowing where it falls within the ranking over all papers in the track! Authors only see their own papers and their reviews/scores (and maybe any other papers they happen to review). SACs have a broader overview, being able to see all papers within a track. And PCs have the bird's-eye view of the whole conference, allowing them to make balancing decisions that may involve changing the "quotas" (acceptance rates) across tracks, but of course they cannot be held solely responsible for each individual decision, as they are operating at a much more "macro" level.11

The only reasonable call is to let acceptance rate constraints go entirely. Our community only has to lose from arbitrary gatekeeping -- let's just accept more papers, get as many people as possible together to discuss science, and let downstream impact be our measure of success (if and when we need to measure such things). This is what led to the creation of the Findings avenue for publication in 2020.12 Unfortunately, I think further changing things would require deeper institutional change (not just at the ACL, but also at academic departments across the world), which is impossible to attain overnight.

The current situation is really no specific person's fault: it's not like the PCs of a conference get together and decide on arbitrary quotas a priori. I cannot speak for them, but I am confident that everyone who has been a PC at a top conference has approached this work somberly and done the best they could given the current system, venue limits, time constraints, etc.

My only suggestion is to acknowledge that as the NLP community keeps growing, we will have to find more venues to publish our work. A lot of workshops, for example, are already publishing top-level work following the same rigorous reviewing process as top-tier conferences.

[Addendum]: In the days following the decision notification, I was surprised to discover that PCs and SACs have actually been receiving emails from authors, urging them to reconsider their decision. I was surprised because the thought of doing that hadn't even crossed my mind! Perhaps this could motivate the creation of an official appeals process (instead of only dealing with unofficial appeals from people who have the "audacity" (if I may be blunt) to ask for it!).13 I want to stress again that the entire process relies on volunteer work, from the PCs to the (S)ACs and the reviewers, and takes up a lot of time and effort. Compounding this with additional effort for appeals and such responses from the community would discourage people from taking up SAC/PC roles.

Final Thoughts

I tried to give an inside view of the processes that lead to paper acceptances or rejections in conferences.14 I emphasize again that I am only describing my personal experience. It could very well be that processes for other tracks or other conferences are different, or that other SACs have different views and experiences than mine. Also, I do not really know what goes into the PCs' job; I am only making minimally-educated guesses.

The main takeaway is that the process is by definition noisy and that's why we have multiple failsafes along the conference hierarchy: multiple reviewers, author responses/discussion, ACs, SACs, and PCs. But even if everyone involved was 100% fair and unbiased and adept, we would still end up rejecting some papers undeservedly. For authors of rejected papers, I offer the same advice I give my students: don't take it personally, embrace stochasticity, accept the noise, revise/rewrite, and resubmit.

Acknowledgements

Many thanks to everyone who provided feedback on my initial draft: Shruti Rijhwani, Sunayana Sitaram, Graham Neubig, and Juan Pino.

Footnotes

  1. This two-score system is rather new (introduced at ACL'23). We used to only have a single recommendation score, but the 2-score system disentangling soundness from excitement was largely seen as a success, so it will probably stick around.

  2. Different conferences follow different procedures, also for the rebuttal/discussion format: no rebuttal, rebuttal followed by internal AC/reviewer discussion, rebuttal followed by internal + direct reviewer/author discussions, with or without paper PDF updates allowed, etc.

  3. This is just to make the writing of this post cleaner. A lot of people do consider a Main versus Findings acceptance quite differently and in many practical respects they are indeed treated differently. For example, Findings publications do not get a presentation slot (so less visibility), nor are they counted as top-tier publications by various conference/department ranking "authorities".

  4. Not sure if there exists a citation for this, but this seems to be a common perception in the community.

  5. And, ahem, academic committees that pay too much attention to csrankings.

  6. Of course the PCs can adjust this number a bit (e.g. raise it a few percentage points) if the venue allows it but I doubt they could e.g. double it.

  7. I don't have any information or arguments for or against the actual causal relationship between prestige and acceptance rates, and how it all came to be institutionalized. Is it really that some entity decided that a certain threshold is required and then conferences followed suit in order to be considered top-tier? Or was it that the acceptance rate organically evolved due to the actual relative quality of the submissions? I don't want to assume one way or the other,15 but nevertheless to me it sounds like the objective, scientific way to go about this is to let the quality of the submissions determine the decisions (which would then simply allow for the calculation of the acceptance rate), as opposed to the acceptance rate influencing the decisions. All this ignores, of course, additional external factors like venue capacities and such.

  8. Here I also include papers for which the author response made it clear that the authors could easily make changes so that the camera-ready version is above the bar.

  9. Again, see the note above (7) about the causal relationship. We chose to go with (what I think is) the more objective way of "let's let the number of sound papers determine the acceptance rate" and not the other way round.

  10. I have not included reviewers in that list, although one could also count reviewing as volunteer work, and for some it may indeed be. But I have strong opinions on this: if you are a (somewhat) senior author of a submitted paper with (at least some) experience, then my view is that you should be contributing to reviewing for that conference (about 3 times the number of papers you are submitting).

  11. Well, PCs are responsible for recruiting the right people as SACs and setting up checks and balances in the process, but I hope you get my point. I highly recommend taking a look at the PCs' Report from ACL 2023 to understand what goes into the PCs' final decisions, or how different acceptance rates across tracks can be. Check Tables 1, 4, and 6 in that paper, for example, although I really recommend going through the whole paper -- it's a great read!

  12. Along with a bunch of other reasons, see here.

  13. Maybe take a minute to ask yourself what type of person is likely to ask for a re-evaluation or an appeal. Let me give you an example from the academic community in a different setting: in departments with travel budgets or other budget items that are meant to be uniformly distributed to each faculty member, the way to get to go to more stuff is, sometimes, simply to ask the department chair! If you don't ask, you won't get it. Shockingly, it turns out women typically don't think it is reasonable to ask, since they've been told this is their quota, while most men just go and ask for more money when they need it. Footnote footnote: I don't have a citation for this in hand, but I believe it to be true.

  14. There are a lot of things that this post did not even cover. The SACs have additional responsibilities, like chasing (meta)-reviews, giving feedback to meta-reviewers, flagging/considering potential ethical issues, etc. In general the process is also more convoluted. For instance, there are different considerations for short vs long papers. There's also the additional complication of deciding on Main vs Findings, and in general the unequal perception of Findings in the community.

  15. I suspect acceptance rates evolved historically until various conference ranking authorities (e.g. csrankings) started using them as the ultimate criterion to distinguish publication venues and now we are stuck with them.

@Hellisotherpeople

I have successfully gotten a paper un-desk rejected by appealing to the workshop SAC.

I don't know why I hadn't tried to appeal to the SAC after a regular paper rejection. An example where this happened to me: I got great scores on a paper, but it ended up being rejected due to an ethics-related issue with the dataset (one that is easily fixed). It would have been awesome to show that the one critique was fixed within a day and that the paper was worth accepting.

@bonaventuredossou

Thanks Antonis for the time put into this, and for explaining what happens in the background. I don't have solutions, but some suggestions to complement these thoughts:

  • maybe we could think of a sort of reward (I don't know of what nature yet); this could encourage some reviewers to take the reviewing work more seriously
  • acceptance rates associated with "prestige" are also common with schools, where a student attending school A, with a lower acceptance rate and thus a higher ranking, is deemed more worthy of some opportunities (work, etc.) than student B from a mid-ranked university
  • Why not just recognize good works that advance science and help the community? Yes, people have argued that this would put the careers of many on the line, but on the other hand, students' and researchers' careers are also on the line when those unfair situations happen (and even Findings, made to accommodate more "good" submissions, is really constrained IMO if we think of the reason why it was created)
  • we could indeed have more venues where our works could be submitted; some good ones exist already. However, unfair rejections still have consequences. For instance, papers that are desk rejected or rejected at CL venues can't be submitted to TACL within the following 9 months (just as an example), so the consequences of unfair decisions go a very long way

Maybe even with all these we'd still have those cases of unfair rejections and bad reviews that unfortunately can't be changed due to the pile of work it entails. So maybe there is more work to be done in selecting good reviewers, building guidelines that we all need to follow, etc.

(Just some thoughts)

@michaelsaxon

michaelsaxon commented Oct 12, 2023

we could indeed have more venues where our works could be submitted; some good ones exist already. However, unfair rejections still have consequences. For instance, papers that are desk rejected or rejected at CL venues can't be submitted to TACL within the following 9 months (just as an example), so the consequences of unfair decisions go a very long way

I think this point from @bonaventuredossou is particularly cogent. In addition to the TACL 9-month restriction, the problem of overlapping anonymity embargo periods between *ACL venues is another pernicious example of how our many complex and unique systems interact in unintended ways to make paper acceptance decisions even higher stakes than they would otherwise be. At least getting rid of those rules would reduce the institutionally-enforced (by *ACL) pain of missing an acceptance.

Also, ditto to the core point that we need to greatly increase the total number of publication opportunities (incl. both adding more venues and increasing acceptance rates), as our field really is simply growing. Hopefully we can move away from acceptance rate fetishism and enact policies that put science first!

@bonaventuredossou

Nicely put and said

@Imene1

Imene1 commented Oct 15, 2023

Thanks Antonis for this nice article. I very much enjoyed reading it. In my opinion, the root of unfair decisions is the reviewers. They should be more conscious of their responsibilities and more aware of the consequences of their scoring.

Also, the anonymity period is another issue. Some research topics are pressing, and the anonymity period may lead to losing the novelty of the work while waiting months for the notification.

@antonisa
Author

antonisa commented Oct 16, 2023

Thank you all for the above comments! Some thoughts (now that the EACL deadline has passed):

Why not just recognize good works that advance science and help the community?

Related to this, perhaps a straightforward solution is to allow for higher acceptance rates (for sound papers) in the Findings. However, it will ultimately be the authors' choice if they want their paper to appear in the Findings -- keep in mind that Findings is not indexed (I think) and is not considered a top-tier publication by many.

For instance, papers that are desk rejected or rejected at CL venues can't be submitted to TACL within the following 9 months (just as an example)

OK I actually disagree on this. TACL, in my mind (and I believe for many in the community), is even more high-prestige than *ACL conferences. If you think about it, a TACL acceptance allows you to present in any *ACL conference! Whenever I have a "TACL-worthy" paper, I first send it there (also it has faster turnaround than most conferences!) and only if it gets rejected will I send it to a conference.

So a paper rejected from *ACL would very likely be rejected from TACL. There are additional considerations of course that go behind these rules, which for me at least make sense (e.g. the workload of the TACL reviewers, given the fast expected turnaround).

Anonymity Period

I have strong views on this and you're not gonna like them :)

I don't think the anonymity period is a problem. The 1-month thing is nonsensical, of course, and I wouldn't be opposed to removing it, but I believe that making efforts to ensure our review is double-blind (and hence more fair) is important for good science.

To use purposefully exaggerated language (I hope you get my point): if a paper will not be relevant in 4 months, then this is a paper I don't care about reading. If publication is so urgent, then put it on arXiv and skip the conference -- you can always submit it to the next conference.

Don't get me wrong, I'm not trying to claim some moral high ground: I and my students have also struggled with this and have put papers up on arXiv before they were accepted somewhere. However, we almost always do so after at least one round of review (so that we get some outside feedback on the work) -- barring some exceptions. The problem with losing novelty or getting scooped is real, but it is almost orthogonal to "science".

@bonaventuredossou

OK I actually disagree on this. TACL, in my mind (and I believe for many in the community), is even more high-prestige than *ACL conferences. If you think about it, a TACL acceptance allows you to present in any *ACL conference! Whenever I have a "TACL-worthy" paper, I first send it there (also it has faster turnaround than most conferences!) and only if it gets rejected will I send it to a conference.

So a paper rejected from *ACL would very likely be rejected from TACL. There are additional considerations of course that go behind these rules, which for me at least make sense (e.g. the workload of the TACL reviewers, given the fast expected turnaround).

I would agree with this. However, we recently got a paper that was desk-rejected from ACL and then accepted at TACL. I think TACL is definitely more prestigious, and reviewers give real and constructive feedback. You even have conditional acceptance, and time to improve the manuscript while it is under review.

@yanaiela

Thanks for writing this (and for serving as a SAC)!

I've had a similar experience as an AC, where I typically recommended accepting about 50% of the papers in my batch.

Regarding your point about "the lazy approach" of simply accepting the recommendations/scores from the reviewers/ACs:
Since eventually 20-30% of the papers are "borderline", there will be randomness somewhere in the process. Given that we can only afford (in the current system) to accept a few of them, wouldn't it be just as effective to make the decision based solely on the reviewers/ACs?

It seems like you spent so much time effectively changing the random seed that selects the accepted borderline papers, so does it really matter (and is it hence worth your time)?

@antonisa
Author

@yanaiela I see your point, but I don't buy the argument :/

There are different kinds of randomness, and sifting through them is, I think, worth it.
There's the randomness due to bad reviews (which ACs/we should be able to adjust for). Or the randomness due to different ACs setting the bar for acceptance differently (which the SACs should be able to adjust for). And in the end, not all "borderline" papers are the same: there are some that are sound but incremental, others that might be more exciting or game-changing, and others that are position/survey papers that would drive scientific dialog forward, etc. And last, the randomness due to different perceptions around these last (arguably more subjective) things will hopefully be dealt with by having different SACs and PCs across conferences.

Your point does remind me, though, of some studies/threads(?) I read about grant applications. Apparently, so much time and money was being spent on reviewing/ranking/selecting grant applications (at funding agencies), many of which are OK/sound to begin with, that the argument was that agencies should randomly pick grant applications (that are above some minimum threshold) for funding.
