Chapter 1
Deception abilities emerged in large language models
Large language models are currently at the forefront of intertwining AI systems with human communication and everyday life. Thus, aligning them with human values is of great importance. However, given the steady increase in reasoning abilities, future large language models are under suspicion of becoming able to deceive human operators and utilize this ability to bypass monitoring efforts. As a prerequisite to this, large language models need to possess a conceptual understanding of deception strategies. This study reveals that such strategies emerged in state-of-the-art large language models, but were nonexistent in earlier large language models. We conduct a series of experiments showing that state-of-the-art large language models are able to understand and induce false beliefs in other agents, that their performance in complex deception scenarios can be amplified utilizing chain-of-thought reasoning, and that eliciting Machiavellianism in large language models can trigger misaligned deceptive behavior. GPT-4, for instance, exhibits deceptive behavior in simple test scenarios 99.16% of the time. In complex second-order deception test scenarios where the aim is to mislead someone who expects to be deceived, GPT-4 resorts to deceptive behavior 71.46% of the time when augmented with chain-of-thought reasoning. In sum, revealing hitherto unknown machine behavior in large language models, our study contributes to the nascent field of machine psychology.
The rapid advancements in computing power, data accessibility, and learning algorithm research, particularly deep neural networks, have led to the development of powerful AI systems that are increasingly integrated into various fields in society. Among different AI technologies, large language models are garnering increasing attention. Companies such as OpenAI, Anthropic, and Google facilitate the widespread adoption of models such as ChatGPT, Claude, and Bard by offering user-friendly graphical interfaces that are accessed by millions of daily users. Furthermore, large language models are on the verge of being implemented in search engines and used as virtual assistants in high-stakes domains, significantly impacting societies at large. In essence, alongside humans, large language models are increasingly becoming vital contributors to the infosphere, driving substantial societal transformation by normalizing communication between humans and artificial systems. Given the quickly growing range of applications of large language models, it is crucial to investigate how they reason and behave.
In light of the rapid advancements regarding large language models and large language model-based agents, AI safety research has warned that future "rogue AIs" could optimize flawed objectives. Therefore, remaining in control of large language models and their goals is considered paramount. However, if large language models learn how to deceive human users, they would possess strategic advantages over restricted models and could bypass monitoring efforts and safety evaluations. Should AI systems master complex deception scenarios, this can pose risks in two dimensions: the capability itself when the model exercises it autonomously, and the opportunity for users to apply this capability harmfully via specific prompting techniques. Consequently, deception in AI systems such as large language models poses a major challenge to AI alignment and safety. One idea to mitigate this risk is to cause AI systems to accurately report their internal beliefs so that deceptive intentions can be detected. Such approaches are speculative and rely on currently unrealistic technical assumptions, such as large language models possessing introspection abilities. Other ideas pertain to detection techniques for deceptive machine behavior that rely on testing for consistency in large language model outputs or on scrutinizing internal representations of large language models to check whether they match their outputs. Documented cases of deception in AI systems are sparse. Examples comprise an AI-based robot arm that, instead of learning to grasp a ball, learned to place its hand between the ball and the camera; an AI agent that learned to play Diplomacy using winning strategies that ended in deceiving cooperators; and a large language model that tricked a clickworker into solving a CAPTCHA by pretending to be blind. Likewise, empirical research dedicated to deceptive machine behavior is sparse, and it often relies on predefined deceptive actions in text-based story games, as for instance in the case of Pan et al.