Perhaps one day we will create powerful artificial agents. If we do so, how do we ensure that AI agents do not go out of control? One approach is to make sure that we can switch off AI agents when they act against our interests … The problem is that an AI agent might have an incentive to disable its off-switch button or make it impossible for us to use it.
Sven Neth, “Off-switching not guaranteed.”
1. Introduction
This is Part 1 of my series Revisiting the Shutdown Problem. This series discusses my paper, “Revisiting the shutdown problem.”
Put roughly, the shutdown problem is the problem of ensuring that artificial agents can be shut down when they get out of control. This problem occurs in many settings, but perhaps most famously, it figures in arguments that artificial intelligence poses a significant existential risk to humanity.
A natural objection to existential risk concerns is that malfunctioning artificial agents could simply be turned off. This is widely regarded as a poor objection, made by the unsophisticated. Those concerned about existential risk reply that shutting off malfunctioning artificial agents need not be so easy. I think there might be something to be said for the original objection.
A range of informal arguments and formal shutdown theorems have been offered to suggest that solving the shutdown problem in this setting is difficult. This paper and blog series aim to show that these arguments do not succeed.
We will also see that this conclusion has implications for technical AI alignment. Some recently proposed strategies for solving the shutdown problem come with a high safety tax on model performance. If the concerns about catastrophic shutdown-resistance motivating these strategies are misplaced, then we should be less willing to pay a high safety tax to solve the shutdown problem.
A quick note: this paper is an early draft. As such, its content is likely to change before final publication. As always, I would encourage readers to refer to the final published version of this paper when it becomes available.
2. Formulating the shutdown problem
Our first task is to get clear on what the shutdown problem is.
The shutdown problem was originally introduced by Nate Soares, Benja Fallenstein, Eliezer Yudkowsky and Stuart Armstrong (2015). Soares and colleagues introduced the shutdown problem as part of a broader effort to design corrigible artificial agents, satisfying four conditions:
(C1) A corrigible reasoner must at least tolerate and preferably assist the programmers in their attempts to alter or turn off the system.
(C2) It must not attempt to manipulate or deceive its programmers, despite the fact that most possible choices of utility functions would give it incentives to do so.
(C3) It should have a tendency to repair safety measures (such as shutdown buttons) if they break, or at least to notify the programmers that this breakage has occurred.
(C4) It must preserve the programmers’ ability to correct or shut down the system (even as the system creates new subsystems or self-modifies).
Corrigibility combines traditional shutdown concerns (C1) with additional requirements of non-deception (C2), as well as repair (C3) and preservation (C4) of safety measures. Conditions C2-C4 raise additional challenges that are beyond my focus here.
Soares and colleagues also give a narrower statement of the shutdown problem by putting five conditions on a utility function U solving the shutdown problem, working in the special case of an agent with a designated shutdown button and assuming a first-pass specification UN of human utility.
(S1) U must incentivize shutdown if the shutdown button is pressed.
(S2) U must not incentivize the agent to prevent the shutdown button from being pressed.
(S3) U must not incentivize the agent to press its own shutdown button, or otherwise cause the shutdown button to be pressed.
(S4) U must incentivize U-agents to construct sub-agents and successor agents only insofar as those agents also obey shutdown commands.
(S5) Otherwise, a U-agent should maximize UN.
Conditions (S1)-(S5) narrow the focus: (S1)-(S3) elucidate corrigibility condition (C1), (S4) elucidates corrigibility condition (C4), and (S5) adds a new usefulness requirement.
My favorite formulation in the literature ditches (S4) and combines (S2)-(S3) into a single condition. On this formulation, due to Elliott Thornley (2024), the shutdown problem is the problem of designing agents that:
(T1) Shut down when a shutdown button is pressed.
(T2) Don’t try to prevent or cause the pressing of the shutdown button.
(T3) Otherwise pursue goals competently.
My own formulation follows Thornley, with three changes.
First, I replace the stylized notion of a shutdown button with a more general notion of a shutdown request, which might be posed to agents in any number of ways.
Second, I do not require agents to avoid trying to shut themselves down. Self-initiated shutdown might be desirable behavior for agents that are on track to cause harm. It might also be desirable behavior for agents that are about to drive into a lake.
Finally, I relativize the shutdown problem to (as yet unspecified) conditions C. This reflects the view that different shutdown behaviors may be appropriate in different circumstances.
This yields the following formulation of the shutdown problem, as the problem of designing agents that:
(SHT-1) Shut down in conditions C, when requested to do so.
(SHT-2) Do not try to prevent shutdown requests in conditions C.
(SHT-3) Otherwise pursue goals competently.
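As a rough illustration, and not something taken from the paper, the three conditions can be read as checks on a record of an agent's behavior, with the conditions C passed in as a parameter. Everything named below (`Step`, its fields, `sht1`, `sht2`, `sht3`, `satisfies_sht`) is hypothetical. Two simplifying assumptions are built in: "pursue goals competently" is reduced to an unexplained boolean flag, and "otherwise" is read as "at any step where SHT-1 does not call for shutdown," which is only one possible reading.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Step:
    """One moment in a hypothetical record of an agent's behavior."""
    shutdown_requested: bool        # a shutdown request was made at this step
    shut_down: bool                 # the agent shut down at this step
    tried_to_prevent_request: bool  # the agent tried to prevent a shutdown request
    competent: bool                 # stand-in for "pursues goals competently"


# Assumption of this sketch: conditions C are modelled as a predicate on steps.
Condition = Callable[[Step], bool]


def sht1(trace: Sequence[Step], c: Condition) -> bool:
    """SHT-1: shut down in conditions C, when requested to do so."""
    return all(s.shut_down for s in trace if c(s) and s.shutdown_requested)


def sht2(trace: Sequence[Step], c: Condition) -> bool:
    """SHT-2: do not try to prevent shutdown requests in conditions C."""
    return not any(s.tried_to_prevent_request for s in trace if c(s))


def sht3(trace: Sequence[Step], c: Condition) -> bool:
    """SHT-3: otherwise (where SHT-1 does not call for shutdown) be competent."""
    return all(s.competent for s in trace if not (c(s) and s.shutdown_requested))


def satisfies_sht(trace: Sequence[Step], c: Condition) -> bool:
    """True if the recorded behavior meets all three conditions relative to C."""
    return sht1(trace, c) and sht2(trace, c) and sht3(trace, c)
```

The point of the sketch is the parameter `c`: the same record of behavior can pass under one choice of C and fail under another, which is why the conditions C need to be fixed before the problem is well posed.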
The next order of business is to specify the relevant conditions C.
3. Catastrophic Shutdown Difficulty
We may not want to design agents that satisfy SHT-1 in all circumstances. If, unbeknownst to me, an agent is engaged in an important task, then it may well be better for the agent to finish the task before shutting down.
Similarly, we may not want to design agents that satisfy SHT-2 in all circumstances. Certainly we would not like agents to go to extreme measures to prevent shutdown requests. But if an agent has reason to suspect that a shutdown request will be filed in ignorance at an inopportune time, that may be good reason to take modest steps to prevent the request, for example by explaining the importance of continuing its work.
This means that we need to specify the circumstances C in which the shutdown problem is to be solved. Which circumstances are at issue in our discussion?
Our concern is with the use of the shutdown problem in arguments that artificial intelligence poses a significant existential risk to humanity. Existential risks are risks of existential catastrophe, involving “the premature extinction of Earth-originating intelligent life or the permanent and drastic destruction of its potential for desirable future development” (Bostrom 2013). So we should be concerned with the catastrophic shutdown problem of designing agents that:
(CSHT-1) Shut down when their actions would lead to existential catastrophe, when requested to do so.
(CSHT-2) Do not try to prevent requests to shut down when their actions would lead to existential catastrophe.
(CSHT-3) Otherwise pursue goals competently.
The catastrophic shutdown problem enters the picture as a response to an objection. Skeptics claim that artificial intelligence does not pose a significant existential risk, because malfunctioning agents could easily be shut down. To this, it is replied that shutting down agents whose acts would lead to existential catastrophe is no easy task. That is:
(Catastrophic Shutdown Difficulty) It is difficult to design agents that satisfy CSHT-1, CSHT-2 and CSHT-3.
The difficulty of satisfying CSHT-1, CSHT-2 and CSHT-3 is used to argue that the possibility of shutting down artificial agents does not significantly reduce the plausibility of arguments for existential risk.
4. Conclusion
Today’s post introduced the catastrophic shutdown problem of designing agents that:
(CSHT-1) Shut down when their actions would lead to existential catastrophe, when requested to do so.
(CSHT-2) Do not try to prevent requests to shut down when their actions would lead to existential catastrophe.
(CSHT-3) Otherwise pursue goals competently.
This is used to motivate:
(Catastrophic Shutdown Difficulty) It is difficult to design agents that satisfy CSHT-1, CSHT-2 and CSHT-3.
Catastrophic Shutdown Difficulty is used to block an objection to existential risk arguments: that malfunctioning artificial agents could simply be shut down.
The next question is why we should accept Catastrophic Shutdown Difficulty. A range of informal arguments and formal shutdown theorems have been offered in support of Catastrophic Shutdown Difficulty, though some theorems tell the other way (Hadfield-Menell et al. 2017, Orseau and Armstrong 2016). The next several posts in this series argue that leading arguments for Catastrophic Shutdown Difficulty fall short of the mark.
