A criticism of "On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?"

Yoav Goldberg, Jan 23, 2021.

The FAccT paper "On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?" by Bender, Gebru, McMillan-Major and Shmitchell has recently been the center of a controversy. The final version is now out, and, owing a lot to this controversy, will undoubtedly be very widely read. I read an earlier draft of the paper, and I think that the new and updated final version is much improved in many ways: kudos to the authors for this upgrade. I also agree with and endorse most of the content. This is important stuff, you should read it.

However, I do find some aspects of the paper (and the resulting discourse around it and around the technology) to be problematic. These weren't clear to me when I initially read the first draft several months ago, but they have become very clear to me now. These points are for the most part not major disagreements with the content, but they do in some ways go against the very core premise of the paper. I think they are also important voices in the debate. This short piece is an attempt to concisely list them.

The criticism has two parts:

  1. The paper is attacking the wrong target.
  2. The paper takes one-sided political views, without presenting them as such and without presenting the alternative views.

Let's handle them in turn. We'll start with the first one.

Attacking the wrong target:

The argument as a one-liner: The real criticism is not about model size; it's about any language model. Framing it as being about size is harmful.

The paper's title asks a direct question: "can language models be too big?" This question directly connects the dangers and concerns with the size of the language models. This is already manifested in numerous online discussions and various popular media pieces attacking the dangers in large language models, calling for regulating the size of language models, to stop big tech companies from monopolizing large language models, etc, etc.

But the paper doesn't really deal with the dangers of large language models at all. The title question "can language models be too big?" is not answered. And for a good reason: it is the wrong question to ask. Size has nothing to do with it. Indeed, not a single criticism or concern in the paper is actually about model size. Yet the framing is that of size, and I think this is harmful and dangerous, as I will explain below. The harm is already done: the media and the public took to a size-centric debate, and equate dangers with size. I am afraid this trend will be hard to reverse. This is an attempt to do so.

Why isn't it about model size?

The paper raises three main lines of concern:

  • Environmental cost of training large models
  • Unfathomable training data
  • Models acting as stochastic parrots that repeat and manifest issues in the data.

Note that none of these are actually about model size per se. The first is about computational efficiency. The second and third are intertwined, but the core issues they raise are training data quality (which relates to some extent to training data size) and output quality. There is also an underlying issue of lack of transparency and lack of interpretability.

All of these concerns hold just as well for small and efficient language models. Size is simply irrelevant.

Smaller models can still be inefficient and have a high environmental cost, especially if the smaller models are not as effective as the larger ones, so they cannot offset the cost. More importantly, model size is not directly linked to computational efficiency. Already in the list of models in the paper, some of the larger models (in terms of parameter count) are also more computationally efficient (specifically the Switch Transformer). On the other side, some models use heavy parameter sharing across layers, which reduces the parameter count while remaining high in computational and carbon costs. Or a model can just be small and inefficient. There is really no good way, and no good reason, to equate size with efficiency. The question that should be asked here, then, is not "can LMs be too big?" but "can LMs be too environmentally costly?". These are different questions. While the harm in asking the wrong question in this case is not that big, it still exists. It may detract from looking into architectures that are both big and efficient (like the Switch Transformer, or ones based on specialized hardware), and it may cause more waste by shifting focus to smaller models that resort to other forms of expensive computation (training and inference with algorithms that are polynomial in the number of parameters rather than linear?), or that are just more costly in aggregate.
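The parameter-count/compute distinction above can be made concrete with a back-of-the-envelope sketch (my own illustration, not from the paper). An ALBERT-style stack that ties one layer's weights across all layers has a small fraction of the parameters of an untied stack, yet performs exactly the same computation per token. The dimensions and the 2-FLOPs-per-parameter rule of thumb are illustrative assumptions:

```python
def transformer_layer_params(d_model: int, d_ff: int) -> int:
    """Rough parameter count of one transformer layer:
    4 attention projection matrices + 2 feed-forward matrices
    (biases and layer norms omitted for simplicity)."""
    attention = 4 * d_model * d_model
    feed_forward = 2 * d_model * d_ff
    return attention + feed_forward

def stack_params(n_layers: int, d_model: int, d_ff: int, shared: bool) -> int:
    """Total parameters: one layer's worth if weights are tied, else n_layers'."""
    per_layer = transformer_layer_params(d_model, d_ff)
    return per_layer if shared else n_layers * per_layer

def stack_flops_per_token(n_layers: int, d_model: int, d_ff: int) -> int:
    """~2 FLOPs per weight per token, applied once per layer --
    the same whether or not the weights are shared."""
    return n_layers * 2 * transformer_layer_params(d_model, d_ff)

n_layers, d_model, d_ff = 24, 1024, 4096
untied = stack_params(n_layers, d_model, d_ff, shared=False)
tied = stack_params(n_layers, d_model, d_ff, shared=True)
flops = stack_flops_per_token(n_layers, d_model, d_ff)  # same for both

print(f"untied: {untied:,} params; tied: {tied:,} params; "
      f"both cost ~{flops:,} FLOPs per token")
```

Under this toy accounting, the tied model is 24x "smaller" by parameter count while its per-token compute cost (and hence its energy footprint per token processed) is identical, which is exactly why parameter count is a poor proxy for environmental cost.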

Turning to the other issues, here focusing on the size argument really becomes dangerous and harmful: the described concerns are for the most part valid and important, but they are just as valid for smaller models as they are for larger ones. We can feed unfathomable (or just plain bad) training data to a smaller model, too. Smaller models are also stochastic parrots. Smaller models are also not interpretable. And the harms remain. A smaller model can still exhibit the same undesired behaviors: it can still be racist, sexist, biased, a status-quo amplifier, etc. And it is just as uninterpretable as the larger ones. The described dangers and concerns are dangers and concerns of language models, not specifically of large language models, and they do not grow or shrink with size. By framing the issue around size, people may conclude that small models are fine, or somehow less dangerous with respect to the concerns raised in the paper. This is totally wrong. People should be just as responsible when using smaller LMs as they are when using large ones.

[Update, Jan 24, 2021 --- added the following two paragraphs] Gebru, on social media, stated that they consider "data size" to be part of "model size" as well, and that they say so in the paper. I didn't read it this way, and it sounds odd to me to say "can models be too big" when you mean "can training data size be too big". But even under this interpretation, the paper does not say why a large training size is bad, and it certainly doesn't say why training data can be "too big". The argument the paper does make is that data size is not enough to ensure properties such as diversity, quality, etc. I agree with this, and I agree that such properties should be looked at. All of section 4 in the paper is an important read. But the argument it makes is "size is not enough", not that "size is bad and can be too big". Maybe large amounts of high-quality data will be hard to collect. Fine, so it's a challenge. Still, there is currently no reason to believe that if we manage to collect large amounts of high-quality data, it will be a priori worse than using small amounts of high-quality data. Size is not the issue. Quality is, and focusing on size is a distraction.

(Side note: it may very well be that we will realize that beyond some data size, model quality deteriorates. This has been observed before. But this is an empirical question that should be verified; it does not mean that large data is a priori bad. Similarly, authors like Tal Linzen argue that people learn from much smaller data samples than models do, and hence that researching models that use less data is worthwhile. Again, full agreement here, but this is unrelated to the potential dangers of language models.)

One sided political view

The argument as a one-liner: The authors suggest that good (= not dangerous) language models are language models which reflect the world as they think the world should be. This is a political argument, which packs within it an even larger political argument. However, an alternative view by which language models should reflect language as it is being used in a training corpus is at least as valid, and should be acknowledged.

The paper takes several assumptions as given, without stating them as assumptions, and without considering the alternatives. This is mostly centered in section 6.2 (Risks and Harms), though it is also manifested in other parts of the paper. A similar critique has been expressed by Michael Lissack. My arguments here are somewhat different from his. Lissack also goes into much greater depth on several aspects which I don't touch (and some that I don't fully agree with).

I will focus on section 6.2 (Risks and Harms). This section states several potential harms, and in doing so states how the authors think a language model should behave, and, more broadly, how a machine-learning system should model the world. The views expressed in this section are opinions, and very one-sided at that. However, the fact that they are merely opinions, or that there is a valid debate to be had around them, is never acknowledged or even hinted at. While I agree with many of the opinions, I also disagree with some. And regardless of my personal opinion, I think there is an important debate that should be made explicit. I will focus on the major issue I see.

A major question to be asked is "do we want our models to reflect the data as it is, or the world as we believe it should be?". The authors take a very decisive stance here in favor of the second, but the first option is also valid, and must at least be considered. This is to a large extent a political question, and it becomes even more political when taking the "world as we believe it should be" stance that the authors take: different groups believe in different things. The paper reflects a set of beliefs that is very much North American and left-leaning.

If we take language models as models of human language, do we want the model to be aware of slurs? The paper very clearly argues "no, it definitely should not". But one could easily argue that, yes, we certainly do want the model to be aware of slurs. Slurs are part of language. If we don't want the model to generate slurs, this is a valid request in some use-cases. But restricting them outright? This could be undesired. As a simple example, consider a model that does not know any slur or profanity words. Such words are not in the model's vocabulary, and it never saw slurs or profanities in its training. Not only is this model now not modeling human language (because language does have slurs and profanities), it will also not be able to recognize unwanted behaviors when encountering them. If we want to classify text for toxicity, such a model will let very toxic texts pass, because it will not recognize them as such. This also ties into debates about censorship, use-vs-mention, the validity of having "taboo words", etc.
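The toxicity-classification point can be sketched with a toy example (entirely hypothetical; the `SANITIZED_VOCAB` set and the placeholder word `swearword` are my inventions, not anything from the paper). Once profanities are excluded from a model's vocabulary, they collapse into the same unknown-word token as any harmless rare word, and a vocabulary-level check can no longer see them:

```python
# Toy vocabulary with profanities scrubbed out; "<unk>" absorbs everything else.
SANITIZED_VOCAB = {"the", "movie", "was", "a", "complete", "delight", "<unk>"}
TOXIC_WORDS = {"swearword"}  # stand-in for a real profanity list

def tokenize(text: str, vocab: set[str]) -> list[str]:
    """Map every word outside the vocabulary to the <unk> token."""
    return [w if w in vocab else "<unk>" for w in text.lower().split()]

def looks_toxic(tokens: list[str]) -> bool:
    """Naive word-level toxicity check over the tokenized text."""
    return any(t in TOXIC_WORDS for t in tokens)

toxic = "the movie was a complete swearword"
benign = "the movie was a complete zymurgy"  # rare but perfectly harmless word

# Both sentences tokenize identically under the sanitized vocabulary,
# so the toxic one sails through undetected.
assert tokenize(toxic, SANITIZED_VOCAB) == tokenize(benign, SANITIZED_VOCAB)
assert not looks_toxic(tokenize(toxic, SANITIZED_VOCAB))

# On the raw, un-sanitized text, the same check does fire.
assert looks_toxic(toxic.split())
```

Real neural toxicity classifiers are of course subtler than a word list, but the underlying failure mode is the same: a model that has never represented a phenomenon cannot flag it.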

Similarly for other linguistic forms that the authors list as undesirable, such as microaggressions, dog-whistles, or subtle patterns such as referring to "woman doctors" or "both genders". Again, if we want our models to actually model human language use, we want these patterns in the model. If we use language models to, for example, compare bodies of texts from different sources, or to study societies based on the texts they left behind (as many digital humanities scholars are now doing), we do want to have these encoded in the model. If we study political discourse, we want these things in the model. And so on and so on. Even if we just want a stochastic parrot that generates fanfiction or stories in some genre, we want to accurately reflect this genre. Literature has profanities, slurs and microaggressions, even if just as literary devices. If Charles Bukowski can write misogynist stories, why can't a model write such stories? If Salinger can use the word "fuck" in a story, why can't a model? If the Wu-Tang Clan can use the n-word in their rap lyrics, why can't a model? Yes, there are places where this behavior is inappropriate. Maybe even most occasions. But it is far from clear to me that the solution should be in the language model itself, rather than in the larger application. And it is even less clear that the solution should be all-encompassing, and not on a case-by-case basis.

These are just two examples, but there are many good reasons to argue that a model of language use should reflect how the language is actually being used. I find this view to be highly non-controversial. However, this is part of a much larger debate that I cannot do justice to in this short piece. My point is that this debate is valid, it must take place, and it should have been acknowledged in the paper. At the least, we should acknowledge the option that there could be two kinds of LMs, and that both are valid, maybe depending on the final use-case or occasion. The paper does not acknowledge that. This is unscientific and, in my opinion, also harmful. (And all of this without even touching the issue of "who gets to decide what are the slurs, microaggressions and behaviors that should be avoided", which is a huge political issue on its own, and on which the authors take a very opinionated stand.)

[Update, Jan 24, 2021 --- added the following text]

Based on some conversations on Twitter with Gebru and others, I would like to clarify the following point: I read section 6.2 of the paper as prescribing how a language model should behave. That is, I read it as advocating for language models that, among other things:

  • do not replicate the hegemonic world view they pick up from their training data.
  • do not produce slurs or other forms of language that may seem derogatory, even if present in their training data.
  • do not produce utterances that are picked up from the training data which can be perceived as microagressions, abusive language, biased language, etc.
  • in particular, do not produce patterns such as the phrases "both genders" or "woman doctors" in the same frequency as these appear in the data.
  • and so on.

Another possible reading of section 6.2 is that it merely lists these as potential things a careful user should be aware of, together with their implications, so that they can then decide whether they want to include them in their language model or not. That is, as merely advice, not a prescription. Some language models CAN produce such behavior and be considered good. This reading was not natural to me, but if this is your reading of section 6.2 and the rest of the paper, then great. It means that you probably also agree with all you have read so far by me in this section, sans the "one-sidedness" remark, and can easily reconcile it with the world-view presented in the paper. That's great.

[end update]

Criticisms of this text

A growing list of criticisms of this piece raised on Twitter; for most of them, my responses are included in the Twitter thread. If you want me to link to a specific tweet (or any other URL of your choosing) which is not listed yet, either ask me to, or create a PR.

https://twitter.com/ZeerakW/status/1353253826447486976?s=20

https://twitter.com/nsaphra/status/1353394756156592130?s=21


MadamePratolungo commented Apr 17, 2021

Thank you, for your reply. A few quotations from you followed by further response from me.

In the effort to clarify your own intended meaning, you wrote

[T]he data is reflective of the data. In other words: "the text of reddit is reflective of how people use language on reddit". And now the question is: do we want our language model (which happens to be trained on reddit) to reflect the language use of people on reddit, or do we want our language model to reflect how we think people should be speaking (on reddit or elsewhere)? I would argue that there are many cases where we do want to just represent the text as is (including its biases).

I don't think anybody, including Bender, Gebru et al. would disagree that there are cases in which biases are at least part of what interests us. If I am an anthropologist of Reddit, or perhaps a lazy screenwriter looking for generated text that might pass for typical Reddit dialogue, I want a model of Reddit language usage that is as representative as possible--bias and all.

However, as you continue from this argument about a particular case to a general position, your reasoning becomes muddy and your tendency to conflate data and world recurs. You claim that I assume "that a model which represents the prejudices of the most vocal people or texts is a-priori bad and should be 'fixed'." I pause to note that I never said anything about "fixing" in the way you imply but let's leave that aside for the moment. You continue: "I disagree with that, and I think that this is really use-case dependent. Sure, there are (many) applications where you need to be careful to not hurt the minority groups. But it does not mean that language-models that are trained on 'hegemonic views' are a-priori bad."

You then go on to say that LMs should be "fixed" only to the extent that they fail to accurately represent their training data.

First, as to whether conversations on Reddit (your handy proxy for the dominant voices on the scrapable Internet) are "hegemonic views." Here's the point to note: they are hegemonic views only on the scrapable Internet. They are not at all hegemonic views in the actual world (where a very large number of people do not have access to the Internet, or do have access but have better things to do than post to Reddit). Once again, if I am an anthropologist of Reddit I am in good shape with my scraped database. But what if I am someone who wants reliable information about Muslims? As we know, GPT-3, trained on the scrapable Internet, correlates Muslims with terrorism and violence. In the real world there are about 2 billion Muslims, of whom the great majority are not violent terrorists. This information can be confirmed through a Google search and a single trip to Wikipedia, and yet GPT-3 nonetheless correlates Muslims with violence and terrorism because of its training data. (This likely comes down to a question of quantity of data trumping quality of data.) My question for you is this: what, other than the highly particular use-cases of an anthropologist who hopes to identify prejudices among Reddit groups or a screenwriter who wants to emulate such groups, is a good reason for a text generator that misinforms users by generating biased text about 2 billion people who are themselves underrepresented in the training data?

Now as to this issue of "fixing." Bender, Gebru et al. are not arguing for some one-size-fits-all fix, and are not recommending censorship. Nor are they promoting some artificial engineering of fairness in the way you seem to assume. They are (to cite their abstract) recommending "investing resources into curating and carefully documenting datasets rather than ingesting everything on the web."

The question for you is: are you against curating and carefully documenting datasets and if so why?

To be thoroughly clear: this is not a matter of "fixing." It is a matter of knowing what it is that one is modeling, rather than using whatever is in reach, however imperfect and unfathomable, and then (as you appear to do) elevating what's in reach into a proxy for "hegemonic" human speech.

At their most prospective, Bender, Gebru et al are making the case for data that more accurately represents the world in its variety and possibility; whereas you are stubbornly insisting that a certain scrapable part of the world is "the data" and "is what it is." Here this scraped "data" becomes, for you, the only defensible empirical fact. Paradoxically, Bender, Gebru et al. are the true empiricists while you in effect project some ineffable ideal value onto the scrapable internet as though it represents some pure kind of knowledge that the act of curation will somehow trouble.

Finally, you write: "And now the question is: do we want our language model (which happens to be trained on reddit) to reflect the language use of people on reddit, or do we want our language model to reflect how we think people should be speaking (on reddit or elsewhere)?"

As I hope is now clear, the actual question provoked by Bender, Gebru et al. is more like this: Do we want language models to be trained on the language use of people on Reddit or do we want our models to be trained on a much more representative view of the world's speakers through the agency of data sets that we ourselves understand and whose provenance we can document?

I can't imagine why you would find that objectionable.


romanwerpachowski commented Oct 13, 2021

I'm late to this game but would add two points:

  • Using smaller datasets risks omitting the data related to smaller social groups (ethnic minorities, refugees, politically exposed persons, etc.). For example, the German Credit Dataset - often used in papers on ML fairness - has only 1000 entries. It is practically inevitable that it will fail to describe the problems with obtaining credit facing people in unusual circumstances or from ethnic minorities (e.g. the Slavic Sorbs, of whom only about 60,000 now live in Germany). There is a reason why public health researchers run both large-scale surveys AND detailed studies of smaller samples, or use sample boosting.
  • Manual curation can replace stochastic bias with researcher bias. Manual curation has been very common in social science for decades, and it hasn't prevented significant harms from occurring. To some extent, what we call "overrepresented" is an indication of the perceived complexity of the content. If I think all Reddit forums are trash, I will think that I need only a small sample of Reddit content. If I judge Reddit content to be diverse and containing pearls of wisdom, I will want to sample more from Reddit. Etc. The history of social science is packed with examples of researchers arrogantly ignoring the complexity of the communities they studied, and they weren't all reactionary conservatives. Progressives can make such mistakes too (see e.g. agricultural reforms in post-colonial Africa).
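The first bullet's small-sample point can be made concrete with a quick, hypothetical calculation (mine, not the commenter's). Under uniform random sampling from Germany's ~83 million residents, a 1000-entry dataset has roughly even odds of containing no members of a 60,000-person minority at all:

```python
# Probability that a uniformly sampled 1000-entry dataset contains
# zero members of a minority of ~60,000 in a population of ~83 million.
# The population figures are rough, illustrative assumptions.

p_minority = 60_000 / 83_000_000           # minority share, ~0.07%
n_samples = 1_000

p_none = (1 - p_minority) ** n_samples     # probability of zero minority entries
expected = p_minority * n_samples          # expected number of minority entries

print(f"P(no minority members among {n_samples} entries) = {p_none:.2f}")  # roughly a coin flip
print(f"expected minority entries: {expected:.2f}")                        # well under one
```

And even when such a sample does include the minority, the expected count is below one entry, which is far too little to characterize the group's circumstances.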

I guess my point is that there is no silver bullet: no methodology which, if followed, would guarantee unbiased models. Do large LMs have problems? Of course they do. Would models produced according to the methodology recommended by the "Stochastic Parrots" paper have problems? I'm pretty sure they would too.

What I think is the biggest merit of "Stochastic Parrots" is that it makes the point that building language models is not just a computer science and mathematical problem; it is also a social science / humanities problem, and therefore requires social science / humanities expertise as well. And this expertise has been sorely lacking in the development of modern AI.
