ARTICLE On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?
ABSTRACT
The past three years of work in NLP have been characterized by the development and deployment of ever larger language models, especially for English. BERT, its variants, GPT-2 and GPT-3, and others, most recently Switch-C, have pushed the boundaries of the possible both through architectural innovations and through sheer size. Using these pretrained models and the methodology of fine-tuning them for specific tasks, researchers have extended the state of the art on a wide array of tasks as measured by leaderboards on specific benchmarks for English. In this paper, we take a step back and ask: How big is too big? What are the possible risks associated with this technology and what paths are available for mitigating those risks? We provide recommendations including weighing the environmental and financial costs first, investing resources into curating and carefully documenting datasets rather than ingesting everything on the web, carrying out pre-development exercises evaluating how the planned approach fits into research and development goals and supports stakeholder values, and encouraging research directions beyond ever larger language models.
CCS CONCEPTS
1 INTRODUCTION
One of the biggest trends in natural language processing has been the increasing size of language models as measured by the number of parameters and size of training data. Since 2018 alone, we have seen the emergence of BERT and its variants, GPT-2, T-NLG, GPT-3, and most recently Switch-C, with institutions seemingly competing to produce ever larger language models. While investigating properties of language models and how they change with size holds scientific interest, and large language models have shown improvements on various tasks, we ask whether enough thought has been put into the potential risks associated with developing them and strategies to mitigate these risks.
We first consider environmental risks. Echoing a line of recent work outlining the environmental and financial costs of deep learning systems, we encourage the research community to prioritize these impacts. One way this can be done is by reporting costs and evaluating works based on the amount of resources they consume. As we outline in Section 3, increasing the environmental and financial costs of these models doubly punishes marginalized communities that are least likely to benefit from the progress achieved by large language models and most likely to be harmed by the negative environmental consequences of their resource consumption. At the scale we are discussing, the first consideration should be the environmental cost.
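To make the idea of reporting costs concrete, the following is a minimal sketch of a back-of-the-envelope estimate of the energy, emissions, and financial cost of a training run. It is not a method taken from this paper. It assumes the commonly used approximation that energy is accelerator count times hours times average power draw, scaled by the data center's power usage effectiveness (PUE), and that emissions are energy times the grid's carbon intensity. The class `TrainingRun`, its field names, and every number in the example are hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass
class TrainingRun:
    """Hypothetical description of one training run (all fields are assumptions)."""
    num_accelerators: int        # number of GPUs or TPUs used
    hours: float                 # wall-clock training time, in hours
    avg_power_watts: float       # average power draw per accelerator, in watts
    pue: float                   # data-center power usage effectiveness (>= 1.0)
    carbon_kg_per_kwh: float     # grid carbon intensity, in kg CO2e per kWh
    price_per_accel_hour: float  # price of one accelerator-hour, in USD


def energy_kwh(run: TrainingRun) -> float:
    """Estimated total energy: accelerator energy scaled by facility overhead (PUE)."""
    accelerator_kwh = run.num_accelerators * run.hours * run.avg_power_watts / 1000.0
    return accelerator_kwh * run.pue


def emissions_kg(run: TrainingRun) -> float:
    """Estimated emissions: energy multiplied by the grid's carbon intensity."""
    return energy_kwh(run) * run.carbon_kg_per_kwh


def cost_usd(run: TrainingRun) -> float:
    """Estimated compute cost: accelerator-hours multiplied by the hourly price."""
    return run.num_accelerators * run.hours * run.price_per_accel_hour


if __name__ == "__main__":
    # Placeholder values for illustration only; they describe no real system.
    run = TrainingRun(
        num_accelerators=8,
        hours=100.0,
        avg_power_watts=300.0,
        pue=1.5,
        carbon_kg_per_kwh=0.4,
        price_per_accel_hour=2.0,
    )
    print(f"energy (kWh):        {energy_kwh(run):.1f}")
    print(f"emissions (kg CO2e): {emissions_kg(run):.1f}")
    print(f"cost (USD):          {cost_usd(run):.2f}")
```

An estimate of this kind ignores hyperparameter search, failed runs, and hardware manufacturing, so it is best read as a lower bound to report alongside accuracy figures rather than as a full accounting.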
Just as environmental impact scales with model size, so does the difficulty of understanding what is in the training data. In Section 4, we discuss how large datasets based on texts from the Internet overrepresent hegemonic viewpoints and encode biases potentially damaging to marginalized populations. In collecting ever larger datasets we risk incurring documentation debt. We recommend mitigating these risks by budgeting for curation and documentation at the start of a project and only creating datasets as large as can be sufficiently documented.
As argued by Bender and Koller, it is important to understand the limitations of language models and put their success in context. This not only helps reduce hype, which can mislead the public and researchers themselves regarding the capabilities of these language models, but might also encourage new research directions that do not necessarily depend on having larger language models. As we discuss in Section 5, language models are not performing natural language understanding, and only have success in tasks that can be approached by manipulating linguistic form. Focusing on state-of-the-art results on leaderboards without encouraging deeper understanding of the mechanism by which they are achieved can cause misleading results, as prior work has shown, and can direct resources away from efforts that would facilitate long-term progress towards natural language understanding without using unfathomable training data.
Furthermore, the tendency of human interlocutors to impute meaning where there is none can mislead both NLP researchers and the general public into taking synthetic text as meaningful. Combined with the ability of language models to pick up on both subtle biases and overtly abusive language patterns in training data, this leads to risks of harms, including encountering derogatory language and experiencing discrimination at the hands of others who reproduce racist, sexist, ableist, extremist or other harmful ideologies reinforced through interactions with synthetic language. We explore these potential harms in Section 6 and potential paths forward in Section 7.
We hope that a critical overview of the risks of relying on the ever-increasing size of language models as the primary driver of increased performance of language technology can facilitate a reallocation of efforts towards approaches that avoid some of these risks while still reaping the benefits of improvements to language technology.