Recent advances in natural language processing have given rise to a new kind of AI architecture: the language agent. By repeatedly calling an LLM to perform a variety of cognitive tasks, language agents are able to function autonomously to pursue goals specified in natural language and stored in a human-readable format. Because of their architecture, language agents exhibit behavior that is predictable according to the laws of folk psychology: they have desires and beliefs, and then make and update plans to pursue their desires given their beliefs. We argue that the rise of language agents significantly reduces the probability of an existential catastrophe due to loss of control over an AGI. This is because the probability of such an existential catastrophe is proportional to the difficulty of aligning AGI systems, and language agents significantly reduce that difficulty. In particular, language agents help to resolve three important issues related to aligning AIs: reward misspecification, goal misgeneralization, and uninterpretability.
Goldstein and Kirk-Giannini, “Language agents reduce the risk of existential catastrophe”
1. Introduction
This is Part 5 of my series Papers I learned from. The series highlights papers that have informed my own thinking and draws attention to what might follow from them.
Part 1 looked at Harry Lloyd’s defense of robust temporalism, a form of pure temporal discounting.
Part 2 looked at an argument by Richard Pettigrew that risk-averse versions of longtermism may recommend hastening human extinction. This was meant not as a recommendation, but rather as a way of putting pressure on standard arguments for longtermism. Part 3 looked at a reply to Pettigrew by Nikhil Venkatesh and Kacper Kowalczyk.
Part 4 looked at a paper by Maarten Boudry and Simon Friederich examining evolutionary arguments for AI risk.
Today’s post continues this theme by discussing a paper on language agents and existential risk.
Simon Goldstein is Associate Professor of Philosophy at the University of Hong Kong. Cameron Domenico Kirk-Giannini is Assistant Professor of Philosophy at Rutgers University – Newark. Both Simon and Cameron were fellows at the Center for AI Safety, and it is good to see fruitful publications coming out of that fellowship program.
Their paper, “Language agents reduce the risk of existential catastrophe,” argues that these new AI systems make an existential catastrophe due to loss of control over AI less likely. We are lucky enough to have a post authored by Simon and Cameron — all words that follow are theirs.
2. Preliminaries
One of the great challenges of AI development is the alignment problem. The alignment problem is the problem of ensuring that AIs do what we want, obeying our commands rather than going rogue.
The issue is that in machine learning, we give AIs goals indirectly rather than directly. Traditional systems, like those trained with reinforcement learning, are programmed using mathematical reward functions (or “objective functions”). These functions are supposed to steer the AI toward a desired outcome. But AIs don’t automatically pursue the goal that is programmed into a reward function; rather, training acts like an evolutionary environment, selecting for whatever behavior earns high reward, whether or not that behavior reflects the intended goal.
This paper argues that a new kind of AI system, language agents, avoids several of the difficulties associated with the alignment problem. In this summary post, we’ll explain three difficulties for alignment, explain what language agents are, and then explain how language agents avoid or mitigate the difficulties.
3. The alignment problem
The indirect nature of alignment leads to three specific problems: reward misspecification, goal misgeneralization, and uninterpretability. Let’s consider each in turn:
3.1. Reward Misspecification
When training an AI, we may experiment with different objective functions. In reinforcement learning, the goal is to define a reward function that gives the agent a reward for performing actions that produce desired states.
The problem is that it is difficult to design a reward function that properly encodes a goal. For example, Popov et al. (2017) set out to teach a reinforcement learning agent to stack red Legos on top of blue Legos. They tried to capture this goal by rewarding the agent for the height of the bottom of the red Lego, since stacked red Legos are higher off the ground than unstacked red Legos. But the agent didn’t learn to stack Legos; instead, it learned to flip red Legos over, thus elevating their bottoms without stacking them. There is a long list of examples of reward misspecification involving many kinds of AI, many kinds of games, and many different types of reward.1
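To make the mismatch concrete, here is a toy sketch of a misspecified reward. The state representation, field names, and numbers are illustrative inventions, not the actual setup from Popov et al. (2017).

```python
# A toy sketch of reward misspecification, loosely modeled on the Lego example
# above. The state dictionary and its values are invented for illustration.

def intended_goal(state):
    """What the designers wanted: the red block stacked on the blue block."""
    return 1.0 if state["red_on_blue"] else 0.0

def specified_reward(state):
    """What was actually rewarded: the height of the red block's bottom face."""
    return state["red_bottom_height"]

# Flipping the red block raises its bottom face about as much as stacking does,
# so the proxy reward cannot tell the two behaviors apart, and flipping is easier.
flipped = {"red_on_blue": False, "red_bottom_height": 0.010}
stacked = {"red_on_blue": True,  "red_bottom_height": 0.010}

print(specified_reward(flipped), intended_goal(flipped))  # 0.01 0.0
print(specified_reward(stacked), intended_goal(stacked))  # 0.01 1.0
```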
3.2. Goal Misgeneralization
Another challenge for alignment is goal misgeneralization (Langosco et al. 2022, Shah et al. 2022). Even when the objective function for a task has been appropriately specified, an AI system may learn a strategy which achieves high performance on that task in some circumstances but not others. ML models are trained on data, environments, and problems that can be different from the data, environments, and problems to which they are later exposed when they are deployed. When an AI is used in a new context that does not resemble the one in which it was trained, we say that this context is out of distribution. In cases of goal misgeneralization, the AI succeeds during its training by pursuing a different goal than what its designers intended (it learns the wrong rule). This is manifested by decreased performance in out-of-distribution contexts.
For example, Shah et al. (2022) trained an AI in a “Monster Gridworld.” The intended goal was for the AI to collect apples and avoid being attacked by monsters. The AI could also collect shields, which protected it from monster attacks. The AI learned to collect shields during training in a monster-rich environment, and then entered an out-of-distribution environment with no monsters. In this monster-free setting, the AI continued to collect shields. Rather than learning to collect apples and value shields only instrumentally, as a way of avoiding monster attacks, it learned to collect both apples and shields.
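As a caricature of the learned behavior (the observation format and decision rule below are invented for illustration, not Shah et al.’s environment or code), the misgeneralized policy can be written out by hand:

```python
# A hand-written caricature of the rule the agent effectively learned.

def learned_policy(observation):
    """Training in a monster-rich world made shield-collecting always pay off,
    so the learned rule treats shields as valuable in themselves."""
    if observation["shields"]:
        return "go to nearest shield"
    if observation["apples"]:
        return "go to nearest apple"
    return "explore"

# Out-of-distribution deployment: no monsters, so shields are now worthless.
ood_observation = {"monsters": [], "shields": [(2, 3)], "apples": [(1, 1)]}
print(learned_policy(ood_observation))  # "go to nearest shield": wasted effort
```

The intended rule (“collect apples; use shields only when monsters threaten”) and the learned rule agree throughout training, and only come apart once the distribution shifts.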
3.3. Uninterpretability
If we can’t understand how someone makes a decision, it can be hard to predict what they will do. An AI system is interpretable to the extent that we can understand how it generates its outputs. Unfortunately, contemporary AI systems based on neural networks are often uninterpretable. It can be difficult to understand in human terms the reasons why a neural network produces the outputs it produces.
Artificial neural networks are difficult to interpret because they contain vast numbers of parameters that are not individually correlated to features of the environment. Humans are also fairly uninterpretable at a neuronal level. But human behavior can be explained by appealing to reasons: we describe someone’s beliefs and desires in order to explain why they did what they did. The behavior of AI systems is often not explainable in this way. Consider, for example, Gato, a generalist agent built with a transformer architecture to learn a policy that can achieve high performance across text, vision, and games (Reed et al. 2022). Gato does not have anything like a folk psychology; it does not engage in anything like belief-desire practical reasoning. It is an uninterpretable deep neural network that has learned how to solve problems through optimizing a loss function. It can be hard to say exactly why systems like Gato perform particular actions.2
This type of behavior is worrying in two related ways. First, if AIs make decisions that are not easily explained using reasons, then it is very difficult to predict their behavior. Second, if AIs make decisions in a very different way than humans do, they may find strategies for defeating humans in conflict by pursuing policies we would not anticipate.
4. Language agents
Our thesis is that language agents significantly reduce the probability of misalignment. But what, exactly, are language agents? At its core, every language agent has a large language model like GPT-4. You can think of this LLM as the language agent’s cerebral cortex: it performs most of the agent’s cognitive processing tasks. In addition to the LLM, however, a language agent has one or more files containing a list of its beliefs, desires, plans, and observations recorded in natural language. The programmed architecture of a language agent gives these beliefs, desires, plans, and observations their functional roles by specifying how they are processed by the LLM in determining how the agent acts. The agent observes its environment, summarizes its observations using the LLM, and records the summary in its beliefs. Then it calls on the LLM to form a plan of action based on its beliefs and desires. In this way, the cognitive architecture of language agents is familiar from folk psychology.
For concreteness, consider the language agents developed by Park et al. (2023).3 These agents live in a simulated world called ‘Smallville’, which they can observe and interact with via natural-language descriptions of what they see and how they choose to act. Each agent is given a text backstory that defines their occupation, relationships, and goals. As they navigate the world of Smallville, their experiences are added to a “memory stream.” The program that defines each agent feeds important memories from each day into the underlying language model, which generates a plan for the next day. Plans determine how an agent acts, but can be revised on the fly on the basis of events that occur during the day.
More carefully, the language agents in Smallville choose how to behave by observing, reflecting, and planning. As each agent navigates the world, all of its observations are recorded in its memory stream in the form of natural language statements about what is going on in its immediate environment. Because any given agent’s memory stream is long and unwieldy, agents use the LLM (in Park et al.’s study, this was gpt3.5-turbo) to assign importance scores to their memories and to determine which memories are relevant to their situation at any given time. In addition to observations, the memory stream includes the results of a process Park et al. call reflection, in which an agent queries the LLM to make important generalizations about its values, relationships, and other higher-level representations. Each day, agents use the LLM to form and then revise a detailed plan of action based on their memories of the previous day together with their other relevant and important beliefs and desires. In this way, the LLM engages in practical reasoning, developing plans that promote the agent’s goals given the agent’s beliefs. Plans are entered into the memory stream alongside observations and reflections and shape agents’ behavior throughout the day.
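A minimal sketch of this observe/reflect/plan loop might look as follows. The class, prompts, and the call_llm helper are hypothetical stand-ins for illustration; Park et al.’s actual implementation is more sophisticated, but the division of labor is the same: the LLM does the cognitive work, and everything it produces is stored as natural language.

```python
# A minimal sketch of the observe/reflect/plan loop described above. The class,
# prompts, and the call_llm helper are hypothetical, not Park et al.'s code.

class LanguageAgent:
    def __init__(self, backstory: str, call_llm):
        self.call_llm = call_llm           # any function from prompt string to completion string
        self.memory_stream = [backstory]   # beliefs, desires, observations, plans, in natural language

    def observe(self, description: str) -> None:
        """Record a natural-language observation of the environment."""
        self.memory_stream.append(f"Observation: {description}")

    def retrieve(self, situation: str) -> str:
        """Ask the LLM which stored memories matter for the current situation."""
        prompt = ("Memories:\n" + "\n".join(self.memory_stream) +
                  f"\n\nSituation: {situation}\n"
                  "List the memories most relevant and important to this situation.")
        return self.call_llm(prompt)

    def reflect(self) -> None:
        """Distil higher-level generalizations (values, relationships) from raw memories."""
        insight = self.call_llm("What general conclusions follow from these memories?\n" +
                                "\n".join(self.memory_stream))
        self.memory_stream.append(f"Reflection: {insight}")

    def plan(self, situation: str) -> str:
        """Form a plan of action from relevant beliefs and desires; the plan is itself a memory."""
        plan = self.call_llm("Relevant memories:\n" + self.retrieve(situation) +
                             f"\n\nSituation: {situation}\nWrite a plan for the day.")
        self.memory_stream.append(f"Plan: {plan}")
        return plan
```

Everything the agent “thinks” passes through the memory stream as ordinary sentences, which is what gives the architecture its folk-psychological shape.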
5. Language agents and alignment
Language agents significantly reduce or eliminate the challenges of reward misspecification, goal misgeneralization, and uninterpretability. Let’s consider each in turn.
Language agents bypass the problem of reward misspecification because their objectives are not encoded in a mathematical objective function, as in traditional reinforcement or supervised learning. Instead, language agents are given a goal in natural language. The goal could be something like: Organize a Valentine’s Day party. In this respect, language agents are fundamentally different from traditional AI systems in a way that makes them easier to align.
Similar considerations are relevant to goal misgeneralization. Language agents are given a natural language goal. This goal has a clear interpretation in a variety of different behavioral contexts, including out-of-distribution contexts. In particular, a language agent will make a plan for how to achieve its goal given its memories and observations of the current situation. Language models can use their common sense to formulate a plan for achieving the goal across a wide variety of situations. By contrast, a traditional reinforcement learning agent will formulate a policy in a training environment, and this policy may or may not generalize to new situations in the way desired by its creators.
Language agents are interpretable. They have beliefs and desires that are encoded directly in natural language as sentences. The functional roles of these beliefs and desires are enforced by the architecture of the language agent. We can determine what goal a language agent has by looking at its beliefs and desires. In addition, we can see what plan it creates in order to achieve this goal.
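As a small illustration (the memory contents below are invented, not output from any real agent), auditing such an agent amounts to reading its stored sentences:

```python
# Invented example contents for a language agent's memory stream; checking the
# agent's goals and intentions is just a matter of reading natural language.
memory_stream = [
    "Desire: organize a Valentine's Day party at the cafe on February 14th.",
    "Belief: Isabella has agreed to help with the decorations.",
    "Plan: tomorrow morning, invite the neighbours and buy supplies.",
]

for entry in memory_stream:
    if entry.startswith(("Desire:", "Plan:")):
        print(entry)  # the agent's goals and intentions, in plain English
```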
6. Reasoning models
Our discussion so far summarized our paper, Language Agents Reduce the Risk of Existential Catastrophe. In closing, we’ll note an emerging threat to the safety paradigm we defended above. Since the emergence of OpenAI’s o1, AI labs have been pursuing a new route to AI agency, different from language agents: they have been developing reasoning models. Reasoning models take an LLM and apply reinforcement learning to its explicit reasoning processes, called ‘chains of thought’. The model is given a reward in ‘verifiable’ domains like math or computer programming when its reasoning arrives at a correct answer. This process results in very sophisticated agents: the newest reasoning models have achieved very high scores on mathematical reasoning benchmarks.
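As a rough sketch of what a ‘verifiable’ reward involves (the answer format and checking rule are simplifications assumed for illustration, not any lab’s actual pipeline):

```python
# A simplified sketch of a verifiable reward: only the final answer is checked,
# never the chain of thought that produced it. The "Answer:" format is an
# assumption made for illustration.

def verifiable_reward(model_output: str, reference_answer: str) -> float:
    """Return 1.0 iff the final answer matches the known-correct answer."""
    final_answer = model_output.rsplit("Answer:", 1)[-1].strip()
    return 1.0 if final_answer == reference_answer else 0.0

print(verifiable_reward("Let x be the smaller root ... Answer: 42", "42"))  # 1.0
print(verifiable_reward("Some illegible shortcut ... Answer: 41", "42"))    # 0.0
```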
Crucially, this approach to LLM agency does not inherit the safety benefits of language agents. Quite the opposite. Training toward verifiable rewards may push large language models away from human legibility. Previous paradigms of LLM training have focused on predicting human-generated text or on pleasing human users. This creates pressure towards LLM reasoning that broadly matches human reasoning. By contrast, training for verifiable objective rewards may encourage LLMs to reason in ways that are illegible to human cognition. For this reason, we suggest that the choice between language agents and reasoning models may be a pivotal one on the path towards AGI.
1. The phenomenon we call reward misspecification is sometimes also called “reward hacking” (e.g. by Amodei et al. 2016), “specification gaming” (e.g. by Shah et al. 2022), or, in the context of supervised learning, “outer misalignment.”
2. Similar remarks apply to the Decision Transformer architecture developed by Chen et al. (2021).
3. Besides the agents developed by Park et al., other examples of language agents include Voyager, Devin, and the agents in Altera, a large-scale social AI simulation.
