We argue that instrumental goal preservation—the claim that a rational agent will tend to preserve its goals because that makes it better at achieving its goals—is false on the basis of the timing problem: an agent which abandons or otherwise changes its goal does not thereby fail to take a required means for achieving a goal it has. Our argument draws on the distinction between means-rationality (adopting suitable means to achieve an end) and ends-rationality (choosing one’s ends based on reasons). Because proponents of the instrumental convergence thesis are concerned with means-rationality, we argue, they cannot avoid the timing problem.
Southan, Ward and Semler, “A timing problem for instrumental convergence”
1. Introduction
This is Part 6 of my series Papers I learned from. The series highlights papers that have informed my own thinking and draws attention to what might follow from them.
Part 1 looked at Harry Lloyd’s defense of robust temporalism, a form of pure temporal discounting.
Part 2 looked at an argument by Richard Pettigrew that risk-averse versions of longtermism may recommend hastening human extinction. This was meant not as a recommendation, but rather as a way of putting pressure on standard arguments for longtermism. Part 3 looked at a reply to Pettigrew by Nikhil Venkatesh and Kacper Kowalczyk.
Part 4 looked at a paper by Maarten Boudry and Simon Friederich examining evolutionary arguments for AI risk.
Part 5 looked at a paper by Simon Goldstein and Cameron Domenico Kirk-Giannini on existential risk from language models.
Today’s post looks at the instrumental convergence thesis, focusing on goal content preservation.
Rhys Southan is a PhD student in philosophy at the University of Oxford and a former affiliate of the Global Priorities Institute (rest in peace). Helena Ward is a PhD student in philosophy at the University of Oxford and a holder of the Ethics in AI Scholarship at Oxford's Institute for Ethics in AI. Jen Semler is a postdoctoral associate with the Digital Life Initiative at Cornell Tech and also received her PhD in philosophy from Oxford.
Their paper, “A timing problem for instrumental convergence,” is available open access in Philosophical Studies. In this post, Southan, Ward and Semler describe their paper. All of what follows was written by them.
2. Preliminaries
In AI safety and alignment discourse, a superintelligence is anything more cognitively adept than humans at just about everything that matters (Bostrom 2014, 22). Superintelligent AI is a scary prospect for many reasons. One is that we are ill-equipped, in the present, to know much about what a future superintelligence will do or be like. We can’t even know what its goals will be; according to the orthogonality thesis, any goal can be combined with any level of intelligence (Bostrom 2012). But there’s one thing we can know about a superintelligence: it will be rational.
From the fact that a superintelligence will be rational, some philosophers think we can know some further information about it. Bostrom (2012) puts forward the instrumental convergence thesis, which claims that superintelligent AIs are likely to adopt certain goals because doing so will be useful for achieving whatever goal the superintelligence has. Bostrom proposes five such instrumental goals: self-preservation, goal-preservation, cognitive enhancement, technological perfection, and resource acquisition.
Our paper takes aim at an item on Bostrom’s list that has received comparatively little scrutiny: goal preservation. We argue that rational agents (more on how we define ‘rationality’ later) should not be expected to preserve their goals on the basis of rational considerations alone. Specifically, we argue against the following claim:
Instrumental goal preservation: Because an agent is more likely to achieve its present final goals if it still has those goals in the future, rational agents will tend to preserve their final goals (where a final goal is an end that the agent seeks to achieve for its own sake).
3. The Appeal of Instrumental Goal Preservation
Our argument goes against the grain of much thinking about superintelligence. To see why, this section explains the appeal of instrumental goal preservation, which rests on two assumptions.
3.1. Assumption 1: A superintelligence will be means-rational but not ends-rational
Discussions of instrumental convergence, and superintelligence more broadly, tend to conceptualize rationality as the capacity to determine which subgoals and actions are conducive to achieving one’s final goals. We refer to agents that are rational in this sense as means-rational. This type of rationality can be contrasted with what we call ends-rationality, which involves the capacity to critically evaluate one’s goals and their justifications. Concerns about superintelligence often arise because agents can be rational concerning means and not ends, so a superintelligence might be rational about the means it takes to achieve its goals without subjecting the goals themselves to rational scrutiny.
From this assumption, we see some motivation for instrumental goal preservation. Given that a superintelligent agent cannot be assumed to adopt good goals or analyze the reasons it has for pursuing its goals, it is intuitively plausible that such an agent will have reason to preserve whatever goal it has. After all, such an agent would never decide its goal is somehow wrong or bad, and preserving one’s goal seems like an effective strategy for achieving it.
3.2. Assumption 2: Rational requirements are narrow-scope in nature
Means-rationality requires an agent to adopt the means it believes are necessary for achieving its goal. There are two ways to interpret this requirement. On a narrow-scope interpretation, if you have a goal, then you are rationally required to take the means to that goal. On a wide-scope interpretation, you are rationally required to see to it that, if you have a final goal, you take the necessary means to that goal, where the demands of rationality range over the whole conditional. An agent can meet a wide-scope requirement either by taking the means or by abandoning the goal, but it can meet a narrow-scope requirement only by taking the means. So, under a wide-scope interpretation, instrumental goal preservation is straightforwardly false.
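A rough way to display the contrast, using the standard scope notation from the wider literature rather than anything in the paper itself: let G stand for “the agent has final goal g,” M for “the agent takes the means it believes necessary for g,” and O(·) for “rationality requires that.”

$$G \rightarrow O(M) \qquad \text{(narrow scope)}$$
$$O(G \rightarrow M) \qquad \text{(wide scope)}$$

On the wide-scope reading, the agent satisfies the requirement either by taking the means or by making the antecedent false, that is, by dropping the goal; on the narrow-scope reading, once G holds, the only way to comply is to take the means.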
Proponents of the instrumental convergence thesis tend to adopt a narrow-scope interpretation of rational requirements. This assumption is noteworthy in part because most philosophers outside discussions of instrumental convergence prefer wide-scope interpretations of rationality. We’re not sure why proponents of the instrumental convergence thesis adopt a narrow-scope interpretation—it might have to do with how AI systems are currently being trained, or it might have to do with a commitment to focusing only on means-rationality. Regardless, we meet instrumental goal preservationists on their own terms.
From the narrow-scope assumption, we again see motivation for instrumental goal preservation. If rationality is interpreted in such a way that an agent cannot meet the requirements of rationality by abandoning its goal—and can only fulfil those requirements by taking the means to achieve its goal—it seems like goal preservation is rationally required.
4. The Timing Problem
Our argument against instrumental goal preservation centers around what we call the timing problem: an agent which abandons or otherwise changes its goal does not thereby fail to take a required means for achieving a goal it has.
To illustrate, we offer the following case:
Cake: Suppose that on Monday Ronya has the goal of eating cake when cake is presented to her. On Monday afternoon, Ronya attends a birthday party and eats cake whenever it is presented to her. As Ronya gets ready for bed on Monday night, she deliberates about whether to change her goal. She has two options: (1) she can preserve her goal of eating cake when it is presented to her, or (2) she can abandon her goal. Ronya decides to abandon her goal of eating cake when it is presented to her. On Tuesday, a friend offers Ronya cake and Ronya declines.
In Cake, Ronya never fails to take the means to achieving a goal she has. On Monday, she has the goal of eating cake, and she takes the means to that goal by eating cake when it’s offered to her. On Tuesday, Ronya lacks the goal of eating cake, and so she turns down the offer of cake. On Tuesday she does fail to take the means to the goal of eating cake, but that is a goal she had in the past, and she is not rationally required to take the means to former goals. The timing problem applies to all means-rational agents, including superintelligent AIs. Once a superintelligent AI changes its goal, it is no longer under any rational requirement to take the means to that goal.
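To make the structure of the case explicit, here is a time-indexed version of the narrow-scope requirement. This is our own sketch rather than the paper’s formalism: let G_t say that the agent has final goal g at time t, M_t that it takes the believed-necessary means to g at t, and O_t(·) that rationality at t requires that.

$$\text{For all } t:\; G_t \rightarrow O_t(M_t)$$

In Cake, G_Monday holds and Ronya takes M_Monday by eating the cake she is offered; on Tuesday, G_Tuesday is false, so O_Tuesday(M_Tuesday) is never triggered. At no time does she have a goal whose required means she fails to take, which is exactly the timing problem.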
5. Objections
We imagine that, in part because of the strength of the intuition about instrumental goal preservation, it may feel as though something doesn’t sit right about the timing problem. We spend the next chunk of the paper responding to objections. Here’s a preview of the two most pressing objections.
5.1. The Delay Objection
Perhaps the most common objection we have faced is that the timing problem goes wrong in the following way: when an agent decides to change (or abandon) its goal, the agent violates means-rationality by setting itself up to fail to achieve its current goal.
The objection can be seen in the following case, in which there is a delay between the choice to change one’s goal and the actual goal change.
Paint: Marian has blue paint, paintbrushes, and a goal to paint an entire wall blue. It will take Marian two days to finish painting the wall blue if she paints continuously. Marian has unlimited paint, but if she doesn’t press a ‘Goal Preserve’ button by noon, she will lose her blue-painting goal and adopt a new goal of painting the wall red. Marian knows that if she loses her goal, she will never finish painting the wall blue. Marian does not press Goal Preserve. She keeps painting the wall blue until midnight, when she loses her goal. The next day, she begins painting the wall red. Marian never finishes painting the wall blue.
The objection is that between noon and midnight, Marian fails to take a required means to achieving her current goal, since her decision not to press Goal Preserve guarantees that she won’t finish painting the wall blue. And because it’s reasonable to think that all goal changes take some time to implement, there will be a time at which any goal-changing agent sets itself up to later fail to take a required means to achieve a goal it currently has. We agree that setting oneself up to later lack the means to achieve one’s current goal will typically violate means-rationality. For instance, Marian would be means-irrational if she set herself up to fail by choosing not to acquire blue paint. But this is not true when the means in question is the goal itself. Refusing to press Goal Preserve does not make Marian means-irrational, because it sets her up to fail only at a goal she will no longer have by the time the change is implemented.
5.2. The Goal-First Objection
If something about our discussion of the timing problem feels fishy, it might be the following thought. We’ve characterized means-rationality as determining how a rational agent approaches its goal; we’ve adopted a means-rationality-first approach. But another way of viewing the relationship between means-rationality and goals is that an agent’s commitment to its goal explains its interest in means-rationality. On this goal-first view, the fact that an agent has a goal is grounds for maintaining that goal—having a goal provides its own justification for goal preservation.
If the goal-first view is right, then the timing problem is not a problem for goal preservation. But we’re skeptical about whether there is a convincing argument in favor of the goal-first view, especially in the context of a superintelligence.
When imagining a superintelligence, we’re supposed to strip away everything except means-rationality; we’re supposed to avoid envisioning goals as functioning like desires that agents could be justified in preserving (e.g., desires that feel good to achieve or are believed to be good). Discussions of instrumental convergence assume an entity that has only means-rationality, goals, and an ability to affect the world, and nothing else. But we suspect that goal-first advocates are not fully stripping away these ends-rational considerations.
6. Importance
Why does it matter if instrumental goal preservation is false? If we’re right, there are several significant implications.
First, a superintelligence might abandon its goals for no reason. We can’t say it will, but we also can’t assume it won’t. This means the instrumental convergence thesis is conditional: its instrumental subgoals apply only if the agent has a goal it intends to keep. A goal-abandoning superintelligence may rationally ignore resource acquisition or self-preservation altogether.
Second, if we can’t assume goal preservation, we also can’t assume that a superintelligence will keep the goals we want it to keep. If instrumental goal preservation were true, there would be no safety-related reason to intentionally design AI systems to preserve, or not preserve, their goals, since we would have no influence either way. But if we cannot take goal preservation for granted, then the likelihood of goal preservation may depend on whether we try to design AI with goal preservation in mind. The desirability of goal preservation will thus be an important component in discussions of alignment.
Finally, the falsity of instrumental goal preservation is a double-edged sword. On the pessimistic side, we know even less than we thought about how a superintelligence might behave. On the optimistic side, if goal preservation isn’t inevitable, we may have more influence over a superintelligence’s behaviour than previously thought.
References
Bostrom, N. (2012). The superintelligent will: Motivation and instrumental rationality in advanced artificial agents. Minds and Machines, 22(2), 71–85. https://doi.org/10.1007/s11023-012-9281-3
Bostrom, N. (2014). Superintelligence: Paths, dangers, strategies. Oxford University Press.
