
@antonisa
Last active December 1, 2024 15:33
Conference Decisions

The Impossible Task of Conference SACs/PCs or How I lost 3 Nights of Sleep

I am writing this post in order to share my thoughts on the processes behind acceptance/rejection decisions in top-tier (NLP) conferences. I'll first discuss the process and then share some thoughts on its shortcomings.

Before we start, a bit about me. I am an assistant professor (aka, rather junior: I have been in this position for less than 4 years, following my PhD studies and a short postdoc) working on NLP, with a focus on multilingualism and low-resource settings. While I have submitted to, published at, and reviewed for *ACL conferences and workshops for many years, it was at EMNLP'23 that I was a Senior Area Chair (SAC) for the first time.

The Conference Paper Pipeline

Let's first briefly outline the process that a paper undergoes, from submission to decision:

  1. Paper is submitted. The authors outline the track.

  2. After some rudimentary checks for potential desk rejection (does the paper conform to the page limit and follow the appropriate format? is it anonymous? etc) and for fit with the track, the paper is assigned an Area Chair (AC) and reviewers.

  3. Reviewers then read the paper and provide their reviews, along with soundness/excitement/reproducibility scores.1 In some conferences, there may be an author response or general author-reviewer discussion.2

Most people are familiar with these steps, precisely because these days almost anyone who submits papers is also very likely to be a reviewer.

  4. The AC then reads the reviews (and the paper, ideally) and provides a meta-review.

The meta-review is meant to summarize the reviews (and discussion), and provide a recommendation for the paper. Of course, the AC will also bring their own perspective to the mix: perhaps they will deem one review as being particularly harsh, and decide to weigh it less for their recommendation; or the opposite. Again, most people are familiar with this, as they end up seeing the meta-review.

The Role of the Senior Area Chairs

The Senior Area Chairs are assigned to a Track, and they are primarily responsible for what happens to the paper further down the pipeline (which is why they will generally be folks more senior than the ACs and the reviewers). In particular, the SACs are tasked with ranking the papers in their track. There are again some differences between conferences (e.g. *ACL ones vs ICLR/NeurIPS, but let's ignore them here).

This ranking (for *ACL conferences) is: Accept-Main, Borderline-Main-Findings, Accept-Findings, Borderline-Findings-Reject, Reject. For the purposes of this discussion only, I'll ignore the distinction between a publication at the Main conference and Findings, and treat it as a 3-level ranking between Accept, Borderline, and Reject.3 There are multiple criteria for this ranking: the reviews, as reflected by both the actual text and the soundness and excitement scores; the rebuttal and/or discussion with authors providing detailed additional experimentation or reviewers raising their scores; the recommendation from the AC; the SAC's own reading of the papers; thematic diversity considerations and the SAC's own views on what constitutes publication-worthy material.
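
As a side note, this collapse can be written down as a simple mapping. The sketch below is only for this post's simplification; in particular, treating "Borderline-Main-Findings" as an acceptance (since only the venue is undecided) is my own reading, not an official rule:

```python
# One plausible collapse of the five *ACL SAC labels into the 3-level scheme
# used in this post (assumption: "Borderline-Main-Findings" is an acceptance
# either way -- only the venue is undecided).
COLLAPSED = {
    "Accept-Main": "Accept",
    "Borderline-Main-Findings": "Accept",
    "Accept-Findings": "Accept",
    "Borderline-Findings-Reject": "Borderline",
    "Reject": "Reject",
}

print(COLLAPSED["Borderline-Findings-Reject"])  # Borderline
```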

Out of all these criteria, only a few (the "scored" soundness/excitement and the AC recommendation, which also comes with a 1-5 score) can be used quantitatively to produce a definitive ranking. Consider, though, that even these scores always need to be taken with a grain of salt, for a few reasons: almost everyone in the community is over-worked and reviewing is a reward-less endeavor -- so many end up reviewing in haste; almost everyone has unconscious biases -- for example, more junior reviewers tend to be harsher than more senior ones.4 Or maybe the AC had a bad day or simply happened to be hangry when they were writing a meta-review.

Nevertheless, there are many cases that are easy for the SACs to make recommendations on:

  • When all (or most) reviewers and the AC agree that the paper is worthy of publication, it is straightforward to place the paper in an Accept category.
  • Conversely, when all (or most) reviewers and the AC agree that the paper is not sound or not ready for publication, the paper ends up in the Reject pile.

The problem is, the largest portion of papers falls in neither of these categories. Even for "good" papers, it is rather easy to come up with issues that push them into borderline status: for example, one can almost always ask for more experiments and results or further analysis.

Additional Constraints: Acceptance Rate

In theory, any sound paper that is sufficiently different from others and thematically relevant to the conference should be accepted. But conferences operate under space/time constraints due to the venue, and there are additional historical reasons related to a conference's (perceived) prestige,5 which lead to an additional constraint: keeping the acceptance rate at some fairly low number. This is, in my opinion, the root of many of the (perceived) injustices we observe.

In broad terms, an acceptance rate of 20% means that the conference will only accept 20 out of every 100 submitted papers (and note that the denominator also includes desk rejects and withdrawn papers). That's true even if, e.g., 40% of the papers were deemed good enough to potentially appear at the conference by the reviewers/ACs.6 Typically, *ACL conferences have had (historical) acceptance rates of 25-35%.7
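
To make the arithmetic concrete, here is a minimal sketch (all numbers hypothetical) of how the quota interacts with the number of papers reviewers actually liked:

```python
# Hypothetical numbers, for illustration only.
submitted = 100        # the denominator includes desk rejects and withdrawals
desk_rejected = 8
withdrawn = 2
target_rate = 0.20     # the conference-wide target acceptance rate

# The quota is computed over ALL submissions, not just the reviewed ones.
quota = int(submitted * target_rate)               # 20 papers

reviewed = submitted - desk_rejected - withdrawn   # 90 papers fully reviewed
deemed_good = int(submitted * 0.40)                # say 40% were deemed good

print(f"quota={quota}, good-but-rejected={deemed_good - quota}")  # 20, 20
```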

Producing the Ranking

Now, how can the SACs go about producing the ranking of the papers?

One approach is to simply treat the ACs and reviewers as the ultimate arbiters of quality, i.e. do nothing to adjust their scores/suggestions. This is perhaps the laziest approach, and I hope you'll agree with me that this is not what the SACs should be doing.

Instead, I believe that the SACs should form an opinion on the papers they're ranking by looking at the meta-reviews, the reviews, the scores, and ultimately the papers themselves. However, doing so for all papers in a track, especially a large one (e.g. with >100 papers), is basically infeasible.

The solution we came up with in my track was to divide the papers (randomly) among the SACs, produce an initial classification for each batch, and then merge them. We then had a couple of several-hours-long meetings where we went through each and every paper, discussing them as needed and assigning a label (Accept, Borderline, Reject) along with a priority score (1-5) for the borderline papers, which effectively produced the final ranking.
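
Conceptually, the label plus the priority score induces a total order over the track. Below is a minimal sketch of that idea (paper IDs, field names, and data are all hypothetical):

```python
from dataclasses import dataclass

@dataclass
class Paper:
    pid: str
    label: str      # "Accept", "Borderline", or "Reject"
    priority: int   # 1-5; only meaningful for Borderline papers

# A hypothetical batch after the SAC meetings.
papers = [
    Paper("P1", "Accept", 0),
    Paper("P2", "Borderline", 2),
    Paper("P3", "Borderline", 5),
    Paper("P4", "Reject", 0),
]

LABEL_ORDER = {"Accept": 0, "Borderline": 1, "Reject": 2}

# Accepts first, then Borderline papers by descending priority, then Rejects.
ranking = sorted(papers, key=lambda p: (LABEL_ORDER[p.label], -p.priority))
print([p.pid for p in ranking])  # ['P1', 'P3', 'P2', 'P4']
```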

A very helpful tool in this process was a sheet with all papers sorted by average soundness, AC recommendation, average excitement, and our own recommendation/priority. This sheet allowed us to identify potential "mistakes" or "outliers": papers that appeared "out of order", e.g. papers with low scores being ranked higher than papers with high scores, sound papers (based on the soundness score) being rejected, etc. (see the sketch after the examples below).

Examples (all real) include:

  • papers with high soundness scores that fell under Reject: we double-checked the reviews, AC rec, and the paper, sometimes agreeing that the paper should be moved "up", sometimes deciding that the initial decision was correct.
  • papers with low scores that the AC had suggested to accept: again, we double-checked everything and decided accordingly.
  • papers with low scores that fell under Accept: again, we double-checked everything, often finding that bad reviews (which the ACs often decided to ignore, as instructed) were the reason for the low scores.
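
A rough sketch of how such an "out of order" check might look, assuming hypothetical data and a hypothetical disagreement threshold (the real sheet was, of course, inspected manually):

```python
# Flag papers whose rank under our recommendation disagrees sharply with
# their rank under average soundness. All data here is made up.
papers = {
    # pid: (avg_soundness, our_rank) -- our_rank: 1 = best
    "P1": (4.5, 1),
    "P2": (2.0, 2),  # low soundness but ranked high -> should be flagged
    "P3": (4.0, 4),  # high soundness but ranked low -> should be flagged
    "P4": (3.0, 3),
}

# Rank papers by descending average soundness (1 = most sound).
by_soundness = sorted(papers, key=lambda pid: -papers[pid][0])
soundness_rank = {pid: i + 1 for i, pid in enumerate(by_soundness)}

THRESHOLD = 2  # how much disagreement warrants a second look
for pid, (_, our_rank) in papers.items():
    gap = abs(soundness_rank[pid] - our_rank)
    if gap >= THRESHOLD:
        print(f"{pid}: soundness rank {soundness_rank[pid]} "
              f"vs our rank {our_rank} -- double-check")
```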

Why Did I Lose Sleep?

All in all, the process was nerve-wracking. We had to produce a ranking adhering to the acceptance rate quotas, and that meant that some sound papers (which, in my view, "deserve" to be published) would have to be ranked so low that in practice they would be rejected, even if we classified them as "Borderline". At the same time, we had to take into account the opinions of the reviewers and the ACs, and balance them with our own opinions about what is sound, what is exciting, and what is generally "good" for the community (i.e. which works will benefit our community if they are published now, in this venue, versus not). And all that while also trying to be as objective and fair as possible, knowing that students' and researchers' careers can be on the line.

Had our track only received a handful of "good" papers, so that we could just accept those and still remain under the desired acceptance rate, all would have been well. But we received so many sound8 papers: by my estimation, even a 50% acceptance rate would have left decent papers out! I strongly suspect this is not track-specific, but rather a general occurrence.

In the end, two things allowed me to sleep at night and to be able to stand by all our decisions. First, the fact that all the SACs got together and discussed all papers, which means that we all agreed on the final ranking (and didn't take any shortcuts in producing it). Second, the fact that we decided to make our recommendations almost ignoring the target acceptance rate -- and pointing out to the PCs (who in the end have the final say) that we had more publication-worthy submissions than the quota, and strongly recommending an increase of our track's quota.9

The whole process, from the SACs' side, took more than 5 complete working days of intense work. Note that this is volunteer work: I did not get paid extra to do any of this, although I guess you could consider it part of the service to the community that is expected of faculty or even industry researchers. The same goes for the ACs/PCs.10

Should we Change the Process?

I have only described the experience that I had as a SAC in one (large) track of a single conference. In our track, I can genuinely say that everyone took the job seriously, and I firmly believe we made as fair decisions as we could with the information we had in hand. While I do not know whether that's the norm across tracks and across conferences, I strongly suspect that it is. And that's why I say that the SACs (and the PCs) are often faced with an impossible task, especially in big-umbrella tracks that receive a large number of submissions.

While I've seen some calls for an instituted appeals process, I do not support them (with the exception of possible clerical errors). There are some good arguments online, but for me the most important one is that the authors have, by necessity, incomplete information. Even if a paper has high scores or generally positive reviews or whatnot, there is no way of knowing where it falls within the ranking over all papers in the track! Authors only see their own papers and their reviews/scores (and maybe any other papers they happen to review). SACs have a broader overview, being able to see all papers within a track. And PCs have the bird's-eye view of the whole conference, allowing them to make balancing decisions that may involve changing the "quotas" (acceptance rates) across tracks, but of course they cannot be held solely responsible for each individual decision, as they are operating at a much more "macro" level.11

The only reasonable call is to let acceptance rate constraints go entirely. Our community only has to lose from arbitrary gatekeeping -- let's just accept more papers, get as many people as possible together to discuss science, and let downstream impact be our measure of success (if and when we need to measure such things). This is what led to the creation of the Findings avenue for publication in 2020.12 Unfortunately, I think further changing things would require deeper institutional change (not just at the ACL, but also at academic departments across the world), which is impossible to attain overnight.

The current situation is really no specific person's fault: it's not like the PCs of a conference get together and decide on arbitrary quotas a priori. I cannot speak for them, but I am confident that everyone who has been a PC at a top conference has approached this work somberly and done the best they could given the current system, venue limits, time constraints, etc.

My only suggestion is to acknowledge that as the NLP community keeps growing, we will have to find more venues to publish our work. A lot of workshops, for example, are already publishing top-level work following the same rigorous reviewing process as top-tier conferences.

[Addendum]: In the days following the decision notification, I was surprised to discover that PCs and SACs have actually been receiving emails from authors, urging them to reconsider their decision. I was surprised because the thought of doing that hadn't even crossed my mind! Perhaps this could motivate the creation of an official appeals process (instead of only dealing with unofficial appeals from people who have the "audacity" (if I may be blunt) to ask for it!).13 I want to stress again that the entire process relies on volunteer work, from the PCs to the (S)ACs and the reviewers, and takes up a lot of time and effort. Compounding this with additional effort for appeals and such responses from the community would discourage people from taking up SAC/PC roles.

Final Thoughts

I tried to give an inside view of the processes that lead to paper acceptances or rejections in conferences.14 I emphasize again that I am only describing my personal experience. It could very well be that processes for other tracks or other conferences are different, or that other SACs have different views and experiences than mine. Also, I do not really know what goes into the PCs' job; I am only making minimally-educated guesses.

The main takeaway is that the process is by definition noisy and that's why we have multiple failsafes along the conference hierarchy: multiple reviewers, author responses/discussion, ACs, SACs, and PCs. But even if everyone involved was 100% fair and unbiased and adept, we would still end up rejecting some papers undeservedly. For authors of rejected papers, I offer the same advice I give my students: don't take it personally, embrace stochasticity, accept the noise, revise/rewrite, and resubmit.

Acknowledgements

Many thanks to everyone who provided feedback on my initial draft: Shruti Rijhwani, Sunayana Sitaram, Graham Neubig, and Juan Pino.

Footnotes

  1. This two-score system is rather new (introduced at ACL'23). We used to only have a single recommendation score, but the 2-score system disentangling soundness from excitement was largely seen as a success, so it will probably stick around.

  2. Different conferences follow different procedures, also for the rebuttal/discussion format: no rebuttal, rebuttal followed by internal AC/reviewer discussion, rebuttal followed by internal + direct reviewer/author discussions, with or without paper PDF updates allowed, etc.

  3. This is just to make the writing of this post cleaner. A lot of people do consider a Main versus Findings acceptance quite differently and in many practical respects they are indeed treated differently. For example, Findings publications do not get a presentation slot (so less visibility), nor are they counted as top-tier publications by various conference/department ranking "authorities".

  4. Not sure if there exists a citation for this, but this seems to be a common perception in the community.

  5. And, ahem, academic committees that pay too much attention to csrankings.

  6. Of course the PCs can adjust this number a bit (e.g. raise it a few percentage points) if the venue allows it but I doubt they could e.g. double it.

  7. I don't have any information or arguments for or against the actual causal relationship between prestige and acceptance rates, and how it all came to be institutionalized. Is it really that some entity decided that a certain threshold is required and then conferences followed suit in order to be considered top-tier? Or was it that the acceptance rate organically evolved due to the actual relative quality of the submissions? I don't want to assume one way or the other,15 but nevertheless to me it sounds like the objective, scientific way to go about this is to let the quality of the submissions determine the decisions (which would then simply allow for the calculation of the acceptance rate), as opposed to the acceptance rate influencing the decisions. All this ignores, of course, additional external factors like venue capacities and such.

  8. Here I also include papers for which the author response made it clear that the authors could easily make changes so that the camera-ready version is above the bar.

  9. Again, see the note above (7) about the causal relationship. We chose to go with (what I think is) the more objective way of "let's let the number of sound papers determine the acceptance rate" and not the other way round.

  10. I have not included reviewers in that list, although one could also count reviewing as volunteer work, and for some it may indeed be. But I have strong opinions on this: if you are a (somewhat) senior author of a submitted paper with (at least some) experience, then my view is that you should be contributing to reviewing for that conference (about 3 times the number of papers you are submitting).

  11. Well, PCs are responsible for recruiting the right people as SACs and setting up checks and balances in the process, but I hope you get my point. I highly recommend taking a look at the PCs' Report from ACL 2023 to understand what goes into the PCs' final decisions, or how different acceptance rates across tracks can be. Check Tables 1, 4, and 6 in that paper, for example, although I really recommend going through the whole paper -- it's a great read!

  12. Along with a bunch of other reasons, see here.

  13. Maybe take a minute to ask yourself what type of person is likely to ask for a re-evaluation or an appeal. Let me give you an example from the academic community in a different setting: in departments with travel budgets or other budget items that are meant to be uniformly distributed to each faculty member, the way to get to go to more stuff is, sometimes, simply to ask the department chair! If you don't ask, you won't get it. Shockingly, it turns out women typically don't think it is reasonable to ask, since they've been told this is their quota, while most men just go and ask for more money when they need it. Footnote footnote: I don't have a citation for this in hand, but I believe it to be true.

  14. There are a lot of things that this post did not even cover. The SACs have additional responsibilities, like chasing (meta)-reviews, giving feedback to meta-reviewers, flagging/considering potential ethical issues, etc. In general the process is also more convoluted. For instance, there are different considerations for short vs long papers. There's also the additional complication of deciding on Main vs Findings, and in general the unequal perception of Findings in the community.

  15. I suspect acceptance rates evolved historically until various conference ranking authorities (e.g. csrankings) started using them as the ultimate criterion to distinguish publication venues and now we are stuck with them.

@Hellisotherpeople

I have successfully gotten a paper un-desk rejected by appealing to the workshop SAC.

I don't know why I hadn't tried to appeal to the SAC after a regular paper rejection. An example where this happened to me: I got great scores on a paper, but it ended up being rejected due to an ethics-related issue with the dataset (one that is easily fixed). It would have been awesome to show that the one critique was fixed within a day and that the paper was worth accepting.

@bonaventuredossou

Thanks Antonis for the time put into this, and for explaining what happens in the background. I don't have solutions, but some suggestions to complement these thoughts:

  • maybe we could think of a sort of reward (I don't know of what nature yet); this could encourage some reviewers to take the reviewing work more seriously
  • acceptance rates associated with "prestige" are also common with schools, where a student attending school A, with a lower acceptance rate and thus a higher ranking, is deemed more worthy of some opportunities (work, etc.) than student B from a mid-ranked university
  • Why not just recognize good works that advance science and help the community? Yes, people have argued that this would put the careers of many on the line, but on the other hand, students' and researchers' careers are also on the line when those unfair situations happen (and even Findings, made to accommodate more "good" submissions, is really constrained IMO if we think of the reason why it was created)
  • we could indeed have more venues where our works could be submitted; some good ones exist already. However, unfair rejections still have consequences. For instance, papers that are desk rejected or rejected at CL venues can't be submitted to TACL within the following 9 months (just as an example), so the consequences of unfair decisions go a very long way

Maybe even with all these we'd still have those cases of unfair rejections and bad reviews that unfortunately can't be changed due to the pile of work it entails. So maybe there is more work to be done in selecting good reviewers, building guidelines that we all need to follow, etc.

(Just some thoughts)

@michaelsaxon

michaelsaxon commented Oct 12, 2023

we could indeed have more venues where our works could be submitted; some good ones exist already. However, unfair rejections still have consequences. For instance, papers that are desk rejected or rejected at CL venues can't be submitted to TACL within the following 9 months (just as an example), so the consequences of unfair decisions go a very long way

I think this point from @bonaventuredossou is particularly cogent. In addition to the TACL 9-month restriction, the problem of overlapping anonymity embargo periods between *ACL venues is another pernicious example of how our many complex and unique systems interact in unintended ways to make paper acceptance decisions even higher stakes than they would otherwise be. At least getting rid of those rules would reduce the institutionally-enforced (by *ACL) pain of missing an acceptance.

Also, ditto to the core point that we need to greatly increase the total number of publication opportunities (incl. both adding more venues and increasing acceptance rates), as our field really is simply growing. Hopefully we can move away from acceptance rate fetishism and enact policies that put science first!

@bonaventuredossou

Nicely put and said

@Imene1

Imene1 commented Oct 15, 2023

Thanks Antonis for this nice article. I very much enjoyed reading it. In my opinion, the root of unfair decisions is the reviewers. They should be more conscious of their responsibilities and more aware of the consequences of their scoring.

Also, the anonymity period is another issue. Some research topics are pressing, and the anonymity period may lead to losing the novelty of the work while waiting months for the notification.

@antonisa
Author

antonisa commented Oct 16, 2023

Thank you all for the above comments! Some thoughts (now that the EACL deadline has passed):

Why not just recognize good works that advance science and help the community?

Related to this, perhaps a straightforward solution is to allow for higher acceptance rates (for sound papers) in the Findings. However, it will ultimately be the authors' choice if they want their paper to appear in the Findings -- keep in mind that Findings is not indexed (I think) and is not considered a top-tier publication by many.

For instance, papers that are desk rejected or rejected at CL venues can't be submitted to TACL within the following 9 months (just as an example)

OK I actually disagree on this. TACL, in my mind (and I believe for many in the community), is even more high-prestige than *ACL conferences. If you think about it, a TACL acceptance allows you to present in any *ACL conference! Whenever I have a "TACL-worthy" paper, I first send it there (also it has faster turnaround than most conferences!) and only if it gets rejected will I send it to a conference.

So a paper rejected from *ACL would very likely be rejected from TACL. There are additional considerations of course that go behind these rules, which for me at least make sense (e.g. the workload of the TACL reviewers, given the fast expected turnaround).

Anonymity Period

I have strong views on this and you're not gonna like them :)

I don't think the anonymity period is a problem. The 1-month thing is nonsensical, of course, and I wouldn't be opposed to removing it, but I believe that making efforts to ensure our review is double-blind (and hence more fair) is important for good science.

To use purposefully exaggerated language (I hope you get my point): if a paper will not be relevant in 4 months, then this is a paper I don't care about reading. If publication is so urgent, then put it on arXiv and skip the conference -- you can always submit it to the next conference.

Don't get me wrong, I'm not trying to claim some moral high ground: I and my students have also struggled with this and have put papers up on arXiv before they were accepted somewhere. However, we almost always do so after at least one round of review (so that we get some outside feedback on the work) -- barring some exceptions. The problem with losing novelty or getting scooped is real, but it is almost orthogonal to "science".

@bonaventuredossou

OK I actually disagree on this. TACL, in my mind (and I believe for many in the community), is even more high-prestige than *ACL conferences. If you think about it, a TACL acceptance allows you to present in any *ACL conference! Whenever I have a "TACL-worthy" paper, I first send it there (also it has faster turnaround than most conferences!) and only if it gets rejected will I send it to a conference.

So a paper rejected from *ACL would very likely be rejected from TACL. There are additional considerations of course that go behind these rules, which for me at least make sense (e.g. the workload of the TACL reviewers, given the fast expected turnaround).

I would agree with this. However, we recently got a paper that was desk-rejected from ACL and then accepted at TACL. I think TACL is definitely more prestigious, and reviewers give real and constructive feedback. You even have conditional acceptance, and time to improve the manuscript while it is under review.

@yanaiela

Thanks for writing this (and for serving as a SAC)!

I've had a similar experience as an AC, where I typically recommended accepting about 50% of the papers in my batch.

Regarding your point about "the lazy approach" of simply accepting the recommendations/scores from the reviewers/ACs:
Since eventually 20-30% of the papers are "borderline", there will be randomness somewhere in the process. Given that we can only afford (in the current system) to accept a few of them, wouldn't it be just as effective to make the decision based solely on the reviewers/ACs?

It seems like you spent so much time effectively changing the random seed that selects the accepted borderline papers, so does it really matter (and is it hence worth your time)?

@antonisa
Author

@yanaiela I see your point, but I don't buy the argument :/

There are different kinds of randomness, and sifting through them is, I think, worth it.
There's the randomness due to bad reviews (which ACs/we should be able to adjust for). Or the randomness due to different ACs setting the bar for acceptance differently (which the SACs should be able to adjust for). And in the end, not all "borderline" papers are the same: there are some that are sound but incremental, others that might be more exciting or game-changing, and others that are position/survey papers that would drive scientific dialog forward, etc. And last, the randomness due to different perceptions around these last (arguably more subjective) things will hopefully be dealt with by having different SACs and PCs across conferences.

Your point does remind me, though, of some studies/threads(?) I read about grant applications. Apparently, so much time and money was being spent on reviewing/ranking/selecting grant applications (at funding agencies), many of which are OK/sound to begin with, that the argument was that agencies should randomly pick grant applications (that are above some minimum threshold) for funding.
