On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?


ABSTRACT

The past three years of work in NLP have been characterized by the development and deployment of ever larger language models, especially for English. BERT, its variants, GPT-2 and GPT-3, and others, most recently Switch-C, have pushed the boundaries of the possible both through architectural innovations and through sheer size. Using these pretrained models and the methodology of fine-tuning them for specific tasks, researchers have extended the state of the art on a wide array of tasks as measured by leaderboards on specific benchmarks for English. In this paper, we take a step back and ask: How big is too big? What are the possible risks associated with this technology and what paths are available for mitigating those risks? We provide recommendations including weighing the environmental and financial costs first, investing resources into curating and carefully documenting datasets rather than ingesting everything on the web, carrying out pre-development exercises evaluating how the planned approach fits into research and development goals and supports stakeholder values, and encouraging research directions beyond ever larger language models.

CCS CONCEPTS

1 INTRODUCTION

One of the biggest trends in natural language processing has been the increasing size of language models as measured by the number of parameters and size of training data. Since 2018 alone, we have seen the emergence of BERT and its variants, GPT-2, T-NLG, GPT-3, and most recently Switch-C, with institutions seemingly competing to produce ever larger language models. While investigating properties of language models and how they change with size holds scientific interest, and large language models have shown improvements on various tasks, we ask whether enough thought has been put into the potential risks associated with developing them and strategies to mitigate these risks.

We first consider environmental risks. Echoing a line of recent work outlining the environmental and financial costs of deep learning systems, we encourage the research community to prioritize these impacts. One way this can be done is by reporting costs and evaluating works based on the amount of resources they consume. As we outline in Section 3, increasing the environmental and financial costs of these models doubly punishes marginalized communities, which are least likely to benefit from the progress achieved by large language models and most likely to be harmed by the negative environmental consequences of their resource consumption. At the scale we are discussing, the first consideration should be the environmental cost.

Just as environmental impact scales with model size, so does the difficulty of understanding what is in the training data. In Section 4, we discuss how large datasets based on texts from the Internet overrepresent hegemonic viewpoints and encode biases potentially damaging to marginalized populations. In collecting ever larger datasets we risk incurring documentation debt. We recommend mitigating these risks by budgeting for curation and documentation at the start of a project and only creating datasets as large as can be sufficiently documented.

As argued by Bender and Koller, it is important to understand the limitations of language models and put their success in context. This not only helps reduce hype, which can mislead the public and researchers themselves about the capabilities of these language models, but may also encourage new research directions that do not necessarily depend on having larger language models. As we discuss in Section 5, language models are not performing natural language understanding, and only have success in tasks that can be approached by manipulating linguistic form. Focusing on state-of-the-art results on leaderboards without encouraging a deeper understanding of the mechanisms by which they are achieved can produce misleading results and direct resources away from efforts that would facilitate long-term progress towards natural language understanding without relying on unfathomable training data.

Furthermore, the tendency of human interlocutors to impute meaning where there is none can mislead both NLP researchers and the general public into taking synthetic text as meaningful. Combined with the ability of language models to pick up on both subtle biases and overtly abusive language patterns in training data, this leads to risks of harms, including encountering derogatory language and experiencing discrimination at the hands of others who reproduce racist, sexist, ableist, extremist or other harmful ideologies reinforced through interactions with synthetic language. We explore these potential harms in Section 6 and potential paths forward in Section 7.

We hope that a critical overview of the risks of relying on ever-increasing size of language models as the primary driver of increased performance of language technology can facilitate a reallocation of efforts towards approaches that avoid some of these risks while still reaping the benefits of improvements to language technology.

2 BACKGROUND

3 ENVIRONMENTAL AND FINANCIAL COST

4 UNFATHOMABLE TRAINING DATA

4.1 Size Doesn't Guarantee Diversity

4.2 Static Data/Changing Social Views

4.3 Encoding Bias

4.4 Curation, Documentation and Accountability

5 DOWN THE GARDEN PATH

6 STOCHASTIC PARROTS

6.1 Coherence in the Eye of the Beholder

6.2 Risks and Harms

6.3 Summary

7 PATHS FORWARD

8 CONCLUSION

Overview

The paper critically examines the trend toward larger language models in NLP, addressing environmental consequences, biases in training data, and the misleading implications of synthetic text. Recommendations for mitigating risks, such as careful dataset documentation and alternative research directions, are discussed.

Key Points

1. Increasing language model size poses significant environmental risks
2. Oversized datasets may reinforce biases that harm marginalized communities
3. Effective documentation and curation of training datasets are crucial
4. The focus should shift from merely increasing model size to addressing underlying challenges in NLP
5. Misinterpretation of synthetic text can mislead both researchers and the public regarding language model capabilities

Details

Authors: Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, Shmargaret Shmitchell
Category: Technology and Engineering
