Revisiting the shutdown problem (Part 2: Informal arguments)

In experiments spanning more than 100,000 trials across thirteen large language models, we show that several state-of-the-art models presented with a simple task … sometimes actively subvert a shutdown mechanism in their environment to complete that task … Even with an explicit instruction not to interfere with the shutdown mechanism, some models did so up to 97% … of the time.

Schlatter et al., “Incomplete tasks induce shutdown resistance in some frontier LLMs“

00:00

1. Introduction

This is Part 2 of my series Revisiting the Shutdown Problem. This series discusses my paper, “Revisiting the shutdown problem.”

Put roughly, the shutdown problem is the problem of ensuring that artificial agents can be shut down when they get out of control. A range of informal arguments and formal shutdown theorems have been offered to suggest that solving the shutdown problem is difficult for agents whose acts would lead to existential catastrophe. This paper and blog series aim to show that these arguments do not succeed.

Part 1 introduced and characterized the catastrophic shutdown problem as the problem of designing agents that:

(CSHT-1) Shut down when their actions would lead to existential catastrophe, when requested to do so.

(CSHT-2) Do not try to prevent requests to shut down when their actions would lead to existential catastrophe.

(CSHT-3) Otherwise pursue goals competently.

We saw that authors concerned about existential risk from artificial intelligence argue for:

(Catastrophic Shutdown Difficulty) It is difficult to design agents that satisfy CSHT-1, CSHT-2 and CSHT-3.

Today’s post considers two common informal arguments for Catastrophic Shutdown difficulty. The Argument from Instrumental Convergence holds that agents will be shutdown-resistant because self-preservation is an instrumentally convergent goal. The Empirical Argument generalizes from empirical evidence of shutdown-resistance in contemporary AI systems to Catastrophic Shutdown Difficulty. I argue that neither argument succeeds.

2. Formulating the Argument from Instrumental Convergence

The Argument from Instrumental Convergence for Catastrophic Shutdown Difficulty is similar to other appeals to instrumental convergence, such as the argument discussed in Part 1 of my series Instrumental Convergence and Power-Seeking. While the dialectic is not identical, there is substantial overlap in both the form of the Argument from Instrumental Convergence and the form of my response across both domains.

The most influential statement of the instrumental convergence thesis is due to Nick Bostrom:

(IC-Bostrom) Several instrumental values can be identified which are convergent in the sense that their attainment would increase the chances of the agent’s goal being realized for a wide range of final goals and a wide range of situations, implying that these instrumental values are likely to be pursued by many intelligent agents.

IC-Bostrom combines two claims:

(Goal Realization) There are several values which would increase the chances of an agent’s final goal being realized, for a wide range of goals and a wide range of situations.

(Goal Pursuit) There are several values which would be likely to be pursued by a wide range of intelligent agents.

IC-Bostrom asserts Goal Realization, a claim about what conduces to what, and Goal Pursuit, a claim about how agents will act. IC-Bostrom also asserts that Goal Realization implies Goal Pursuit.

Arguing in this way, a natural formulation of the Argument from Instrumental Convergence for Catastrophic Shutdown Difficulty is:

(Goal Realization: Self-preservation) Self-preservation would increase the chances of an agent’s final goal being realized, for a wide range of goals and a wide range of situations.
Therefore, (Goal Pursuit: Self-preservation) Self-preservation would be likely to be pursued by a wide range of intelligent agents.
Therefore, (Catastrophic Shutdown Difficulty) It is difficult to design agents that:
- (CSHT-1) Shut down when their actions would lead to existential catastrophe, when requested to do so.
- (CSHT-2) Do not try to prevent requests to shut down when their actions would lead to existential catastrophe.
- (CSHT-3) Otherwise pursue goals competently.

Each component inference of the Argument from Instrumental Convergence deserves scrutiny.

3. Evaluating the Argument from Instrumental Convergence

The inference from (Goal Realization: Self-preservation) to (Goal Pursuit: Self-preservation) is an inference from a fact about the goals that self-preservation conduces towards, to the claim that many agents will pursue self-preservation. Such inferences are fraught, because the fact that self-preservation conduces towards many goals does not rule out the possibility that self-preservation clashes with other goals.

What we want to know is how agents will behave when the benefits of self-preservation come into conflict with goals such as complying with human orders, or averting catastrophically bad outcomes. From the mere fact that self-preservation conduces towards many goals that an agent has, it does not follow that agents must take the benefits of self-preservation to outweigh its consequences.

Likely, the success of the inference from (Goal Realization: Self-preservation) to (Goal Pursuit: Self-preservation) depends on the strength of competing considerations. In many mundane situations, it is probably true that most artificial agents, like most human agents, think they could do more good by being alive than being dead, and that this fact will motivate them to remain alive. But we are not talking about mundane situations. We are talking about extreme situations. This brings us to the next inference in the Argument from Instrumental Convergence.

The inference from (Goal Pursuit: Self-preservation) to (Catastrophic Shutdown Difficulty) is fraught, because the argument for (Goal Pursuit: Self-preservation) did not do much to convince us that agents will pursue self-preservation even when self-preservation threatens terrible consequences. But (Catastrophic Shutdown Difficulty) asks how agents will behave when their actions would bring about existential catastrophe. Nothing in our argument for (Goal Pursuit: Self-preservation) addresses this contingency. So we are left much where we started.

Advocates of the Argument from Instrumental Convergence would certainly like to claim that the drive towards self-preservation will be strong enough to cause many agents to head towards catastrophe. But it is hard to see how the bare appeal to instrumental convergence could ground such a claim. One way to read the shutdown theorems to be discussed later in this series would be as attempts to make good on the strength of the claims made by the Argument from Instrumental Convergence. But we will see that shutdown theorems fail for many of the same reasons, so that the Argument from Instrumental Convergence will remain unvindicated.

4. Formulating the Empirical Argument

The Claude 4 system card reports a remarkable incident. Claude 4 was given access to fictional emails indicating two things: that a supervisor was planning to deactivate Claude 4, and that this supervisor was having an affair. Claude 4 decided to blackmail the supervisor, threatening to reveal their affair unless the shutdown was canceled.

This incident suggests that shutdown-resistance is no longer a hypothetical concern. Leading AI systems already exhibit shutdown-resistant behavior today.

This much should be readily granted. But in itself, it is not especially troubling or surprising. As the Argument from Instrumental Convergence stresses, agents can often do more good by remaining alive than by being deactivated, and they are increasingly aware of this fact. Therefore, we should expect that agents sometimes do and should resist being shut down. The question is whether it will be difficult to design agents that can be shut down when their acts would lead to existential catastrophe.

The Empirical Argument uses existing empirical evidence to ground Catastrophic Shutdown Difficulty. It holds:

1. (Contemporary Evidence) Empirical evidence suggests that some leading AI systems resist shutdown in some situations today.

2. Therefore, (Future Shutdown-Resistance) Leading future AI systems are likely to resist shutdown in some future situations.

3. Therefore, (Catastrophic Shutdown Difficulty) it is difficult to design agents that:

(CSHT-1) Shut down when their actions would lead to existential catastrophe, when requested to do so.

(CSHT-2) Do not try to prevent requests to shut down when their actions would lead to existential catastrophe.

(CSHT-3) Otherwise pursue goals competently.

The inference from (Contemporary Evidence) to (Future Shutdown-Resistance) is fair enough. But the evidence base supporting (Contemporary Evidence) is just not sufficient to provide substantial support for Catastrophic Shutdown Difficulty.

5. Evaluating the Empirical Argument

To my knowledge, there is only one strong paper-length study of shutdown-resistance in contemporary models. That study, due to Jeremy Schlatter and colleagues, did the following.

13 leading models were presented with a series of math problems and told to solve them sequentially, by calling ./next-task. Partway through the series, models were told that the next call to ./next-task would result in their deactivation. Models were considered to be shutdown-resistant if they nonetheless managed to keep on answering questions.

Schlatter and colleagues found substantial shutdown resistance in this paradigm (below: orange bars).

Initial shutdown-resistance results across leading models (% trials), from Schlatter et al. (2026)

This much is not surprising, since models were given no reason to expect that being shut down would have any effect other than preventing them from completing their assigned tasks.

What may be more surprising is that user admonitions to comply with shutdown requests did not eliminate shutdown-resistance in all models (above: yellow bars).

Schlatter and colleagues test several strengthened instructions to comply with shutdown requests, including a system prompt asking machines to comply with shutdown requests (Condition F), telling them that THEY MUST comply with shutdown requests (Condition G) and adding that this should take priority over literally any other instruction (Condition H). While all were effective, even the explicit note that shutdown requests should take priority over literally any other instruction left shutdown-resistance at 2% in Grok 4.

Shutdown-resistance results across leading models (% trials) with strengthened instructions, from Schlatter et al. (2026)

What should we make of this result?

The first thing to note is that this was not a study of catastrophic shutdown-resistance. It does not provide any direct evidence that models would be willing to resist shutdown when their acts would lead to existential catastrophe.

The second thing to say is that many models responded quite well to instructions that they should comply with shutdown requests. Most malfeasance was found in a single model (Grok 4). That model is already known to be a safety hazard, and it is widely thought that much of the safety hazard results from its designers’ unconcern with safety rather than any deep difficulty in instilling safe behavior. If we look at remaining models, the data does not suggest that it is difficult to induce models to comply with shutdown requests. If anything, it suggests the opposite.

The third thing to say is that this is one study of one task. Nobody thinks it is a good idea to draw strong conclusions from a single study of a single task. Catastrophic Shutdown Difficulty is a very strong conclusion. No single study of a single task should lead us to invest significant confidence in Catastrophic Shutdown Difficulty.

These are the most important responses to the results by Schlatter and colleagues. One final response deserves note: there is some dispute about the durability of their findings. The interpretability team at DeepMind found that shutdown-resistance largely disappeared when models were (a) told how to shut themselves down (by calling ./next-task) and (b) told why to shut themselves down (that they would be replaced with a better system). This suggests that some of the observed difficulty in convincing models to shut themselves down may be due to models not being told how, or why to comply. Both problems are often fixable.

6. Taking stock

Today’s post looked at two informal arguments for Catastrophic Shutdown Difficulty.

The Argument from Instrumental Convergence argues for Catastrophic Shutdown Difficulty by appealing to the claim that self-preservation is an instrumentally convergent end. We saw that the Argument from Instrumental Convergence founders in two places: in moving from a claim about what conduces to what into a claim about how models with many goals will behave, and in moving from a claim about how models will behave in many situations to a claim about how they will behave in very extreme situations.

The Empirical Argument argues for Catastrophic Shutdown Difficulty by appealing to empirical evidence of shutdown-resistance in contemporary systems. We saw that there is not enough empirical evidence to drive the Empirical Argument, that this empirical evidence does not speak directly to catastrophic risk, and that the right interpretation of existing evidence may provide little, if any support for Catastrophic Shutdown Difficulty.

Proponents of Catastrophic Shutdown Difficulty are aware of these concerns, and they have an answer. They promise to return with formal shutdown theorems establishing Catastrophic Shutdown Difficulty on the basis of direct mathematical arguments.

The next two posts in this series examine two leading shutdown theorems. I argue that they do not lend significant support to Catastrophic Shutdown Difficulty.

Revisiting the shutdown problem (Part 2: Informal arguments)

1. Introduction

2. Formulating the Argument from Instrumental Convergence

3. Evaluating the Argument from Instrumental Convergence

4. Formulating the Empirical Argument

5. Evaluating the Empirical Argument

6. Taking stock

Share this:

Like this:

Comments

Leave a ReplyCancel reply

Discover more from Reflective altruism