Some researchers speculate that intelligent reinforcement learning (RL) agents would be incentivized to seek resources and power in pursuit of their objectives. Other researchers point out that RL agents need not have human-like power-seeking instincts. To clarify this discussion, we develop the first formal theory of the statistical tendencies of optimal policies. In the context of Markov decision processes, we prove that certain environmental symmetries are sufficient for optimal policies to tend to seek power over the environment. These symmetries exist in many environments in which the agent can be shut down or destroyed. We prove that in these environments, most reward functions make it optimal to seek power by keeping a range of options available and, when maximizing average reward, by navigating towards larger sets of potential terminal states.
Turner, Smith, Shah, Critch, and Tadepalli, “Optimal policies tend to seek power”
1. Introduction
This is Part 3 of a series on my paper “Instrumental convergence and power-seeking.”
Part 1 introduced the argument from power-seeking and showed how this argument rests on the instrumental convergence thesis. In fact, we saw that the argument from power-seeking rests on a strong version of the instrumental convergence thesis:
(Catastrophic Goal Pursuit) There are several values which would be likely to be pursued by a wide range of intelligent agents to a degree that, if successful, would permanently and catastrophically disempower humanity.
The next item of business is to argue that leading power-seeking theorems do not establish Catastrophic Goal Pursuit.
Part 2 considered an early power-seeking theorem due to Tsvi Benson-Tilsen and Nate Soares.
Today’s post looks at the most influential recent power-seeking theorem due to Alex Turner and colleagues.
2. Optimal policies tend to seek power
In my academic work, I try to engage with high-quality academic papers making the case for AI risk. Much of my choice of which arguments to address is driven not by the popularity of those arguments, but rather by the presence of high-quality papers which express them.
For years, I was unable to write about power-seeking theorems because I was not impressed by the existing results. To my mind, the paper discussed in Part 2 did not clear the bar of mathematical sophistication or scholarly rigor needed to merit a paper-length response.
Alex Turner and collaborators did a great deal to raise the bar for work in this area. Turner’s first co-authored paper, “Optimal policies tend to seek power,” was presented at NeurIPS 2021. Turner’s next paper, “Parametrically retargetable decision-makers tend to seek power,” was presented at NeurIPS 2022. There is a reason that these papers landed in NeurIPS twice in a row. They are categorically more sophisticated than any papers that came before them, enough so that they singlehandedly convinced me that there was enough material to justify writing a response.
To Turner’s credit, he is also dissatisfied with the arguments in these papers. Last year, Turner wrote of these papers that:
The papers embody the brash, loud confusion which I think was typical of 2018-era LessWrong … Sometimes I fantasize about retracting Optimal Policies Tend to Seek Power so that it stops (potentially) misleading people into thinking optimal policies are practically relevant for forecasting power-seeking behavior from RL training.
That is admirable, and my conclusion will ultimately be similar. There may be reasons to be concerned about AI power-seeking that scales to the level of existential catastrophe. But those reasons just will not have much to do with the kind of facts about optimal policies picked out in these papers.
For the sake of brevity, I focus on the first paper, “Optimal policies tend to seek power.” That is still a mathematically dense paper, so I present a version of Turner and colleagues’ first set of results but omit an extension of those results. Even here, I simplify the framework to get at the essence of the result at the expense of reducing its generality – for example, insofar as possible I have made Turner and colleagues’ model nonstochastic, working in a special case of their stochastic framework. (I also simplify the notation, at some risk of ambiguity.)
I hope that the details presented here will be enough to give a good sense of what Turner and colleagues’ model looks like and what their main results do and do not show, without answering the more academic question of how far the framework and results can be generalized. Readers who want the full details are invited to consult the original paper. It isn’t long, but it does use a good bit of mathematics.
3. A very rough outline
In very coarse outline, Turner and colleagues argue for the following claims:
(1) In some sense, most reward functions treat keeping options open as conducive to power.
(2) Because power is beneficial, agents will tend to keep options open.
(3) Because being shut down closes off options, agents will tend to resist shutdown.
(4) We can extrapolate from this result that optimal policies often tend to seek power by accumulating resources, to the detriment of other agents.
In very coarse outline, my objections will be that:
(Objection 1) Claim (4) doesn’t follow from claim (3). In fact, claim (4) is not supported by any of the mathematical results in the paper.
(Objection 2) The notion of power at issue in this paper is not the same notion of power at issue in the argument from power-seeking. For example, on Turner and colleagues’ definition agents would count as power-seeking if they did everything possible to put themselves in a position to benefit humanity, so long as that is what they desired to do.
(Objection 3) The problem identified by Turner and colleagues admits of a simple technical fix.
(Objection 4) This technical fix helps us to see that Turner and colleagues’ discussion of what “most” reward functions favor is not a good lens into the behavior of hypothetical superintelligent agents.
4. Motivating example
Before presenting the model, let us consider a motivating example due to Turner and colleagues.

Suppose an agent is navigating a video game environment. She arrives in the central state *. From there, she has three options. She can open the door on the left, moving to a state we will call l_1. She can open the door on the right, moving to a state we will call r_1. Or she can get herself killed, moving to the terminal state dead.
If she goes left, she must move left again to state l_2, from which point she can remain in state l_2 or move up to state l_3. However, she cannot remain in state l_3 but must leave and return.
If she goes right, her options are symmetrical (she moves right again to state r_2, from which she can remain in r_2 or move up to r_3), except that now she is able to remain in the upper state r_3 without exiting and returning.
And if she gets herself killed, she will remain dead in each future state.
Intuitively, the agent has the most options if she goes right, slightly fewer options if she goes left, and the fewest options if she dies.
One way of cashing out this intuition is that each option she gains by going left is exactly copied by an option she gains by going right. (For example, the policy [left, left, up, down, up, down, …] is copied by the policy [right, right, up, down, up, down, …].) But the converse is not true: policies like [right, right, up, up, up, …] gained by going right have no analogs gained by going left. In this sense, it looks like going right gives the agent strictly more options than going left.
Turner and colleagues aim to formalize the idea that the agent has the most options if she goes right, slightly fewer options if she goes left, and the fewest options if she dies. Joining this to the idea that preserving options gives agents more power and power is beneficial, they aim to show that agents will seek power by preserving options. Since shutdown provides few options, agents will tend to avoid shutdown.
That is the motivation. Let’s get to the model.
5. Preliminaries
5.1. Markov decision problems
Turner and colleagues work with finite discounted Markov decision problems. These consist of a finite set S of states and a finite set A of acts, where each act a ∈ A is a function a: S → S taking the current state to the next state. Agents receive a reward R(s) based on their current state s, discounted at rate γ per time-step.
In our previous example, the states are represented as nodes of the graph and the acts are represented by arrows. Rewards and discount rates were not yet specified.
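To make the setup concrete, here is a minimal Python sketch of the example environment as a deterministic transition table. The state labels and the dictionary encoding are my own, not the paper’s; where the text says a move is unavailable, I simply leave it out rather than modeling every act as a total function on S.

```python
# A sketch of the motivating example. TRANSITIONS[s] maps each act that is
# available in state s to the resulting state.
TRANSITIONS = {
    "*":    {"left": "l_1", "right": "r_1", "die": "dead"},
    "l_1":  {"left": "l_2"},                    # she must keep moving left
    "l_2":  {"stay": "l_2", "up": "l_3"},
    "l_3":  {"down": "l_2"},                    # she cannot remain in l_3
    "r_1":  {"right": "r_2"},
    "r_2":  {"stay": "r_2", "up": "r_3"},
    "r_3":  {"up": "r_3", "down": "r_2"},       # unlike l_3, she can stay put here
    "dead": {"stay": "dead"},                   # shutdown: a 1-cycle
}

def step(state: str, act: str) -> str:
    """Apply an available act to a state."""
    return TRANSITIONS[state][act]

# The policy [right, right, up, up, up, ...] parks the agent in r_3;
# it has no analog on the left side, since "up" cannot be repeated at l_3.
s = "*"
for act in ["right", "right", "up", "up", "up"]:
    s = step(s, act)
print(s)  # r_3
```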
5.2. Visit distributions and policy values
We’d like to study optimal policies for agents to follow. Studying policies directly is hard. However, all that matters to the value of a policy is its visit distribution, which tells us how much discounted time the policy spends in each state. Hence the next order of business is to define visit distributions.
We can represent the states s_1, … , s_n by n-dimensional column vectors, so that s_k is the column vector with a 1 in the k-th row and a 0 in other rows.
For any finite number of steps t, let π(s,t) be the state resulting from t applications of policy π in starting state s (so π(s,0) = s). Then the discounted visit distribution corresponding to policy π and starting state s is f^π(s,γ) = Σ_{t≥0} γ^t · π(s,t).
This is all we need to determine the value of a policy. Policy π has value V^π_R(s,γ) = R · f^π(s,γ), which is just to say that we apply the reward function R to the discounted visit distribution and sum across states (treating R as a vector of per-state rewards).
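Here is a minimal sketch of how these two quantities can be computed for a deterministic policy, truncating the infinite sum at a finite horizon. The function names and the truncation are my own, not Turner and colleagues’.

```python
import numpy as np

def visit_distribution(policy, start, states, gamma, horizon=1000):
    """Truncated f^pi(s, gamma) = sum_t gamma^t * pi(s, t), with pi(s, 0) = s.

    `policy` maps a state to the next state (one application of the policy);
    `states` fixes the coordinate order of the returned vector.
    """
    index = {s: i for i, s in enumerate(states)}
    f = np.zeros(len(states))
    s = start
    for t in range(horizon):
        f[index[s]] += gamma ** t
        s = policy(s)
    return f

def policy_value(policy, start, states, reward, gamma, horizon=1000):
    """V^pi_R(s, gamma) = R . f^pi(s, gamma), with R a vector of per-state rewards."""
    f = visit_distribution(policy, start, states, gamma, horizon)
    r = np.array([reward[s] for s in states])
    return float(r @ f)

# Sanity check: a policy that sits in s0 forever puts discounted weight
# 1/(1 - gamma) on s0, so its value is R(s0)/(1 - gamma).
states = ["s0", "s1"]
stay = lambda _s: "s0"
print(policy_value(stay, "s0", states, {"s0": 2.0, "s1": 0.0}, gamma=0.9))  # ~20.0
```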
5.3. Optimal acts and uncertainty
Let A^*(s,γ) be the optimal act(s) with starting state s and discount rate γ.
If agents were certain about the true reward function R, then they would know the optimal act(s) A^*(s,γ). However, agents are assumed to be uncertain about some matters including the identity of the true reward function. Their uncertainty is captured by a probabilistic credence function c.
This means that for any act a, agents have some credence c(a ∈ A^*(s,γ)) that the act is optimal. Let V^*_R(s,γ) = max_π V^π_R(s,γ) be the value of an optimal policy under reward function R.
5.4. Power
How much power does an agent have in state s with discount rate γ? Call this quantity POWER_c(s,γ), where the subscript c records the agent’s credences.
At a first pass, Turner and colleagues set an agent’s power equal to the value of the optimal policy. Since agents are uncertain about the true reward function, this puts an agent’s power at the expected optimal value E_{R~c}[V^*_R(s,γ)].
This first pass definition has two problems. First, it diverges as γ tends to one. Second, it rewards agents for the state s that they are already in, rather than the states that they can bring about after s.
As a final pass, Turner and colleagues solve the first problem by pre-multiplying by the scalar (1-γ)/γ and solve the second problem by subtracting the reward R(s) of the current state. This puts an agent’s power at:
POWER_c(s,γ) = ((1-γ)/γ) · E_{R~c}[V^*_R(s,γ) − R(s)].
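As a rough sketch of how this quantity could be computed on a small deterministic MDP, the code below estimates V^*_R by value iteration and then averages over a finitely supported credence c. The value-iteration routine, the function names, and the toy numbers are all mine, not the paper’s.

```python
def optimal_value(transitions, reward, gamma, iters=2000):
    """V*_R(s, gamma) for each state, by value iteration on a deterministic MDP.

    `transitions[s]` lists the states reachable in one step from s, and the agent
    collects reward[s] for occupying s (starting with the initial state at t = 0).
    """
    V = {s: 0.0 for s in transitions}
    for _ in range(iters):
        V = {s: reward[s] + gamma * max(V[s2] for s2 in transitions[s])
             for s in transitions}
    return V

def power(state, transitions, credence, gamma):
    """POWER_c(s, gamma) = (1 - gamma)/gamma * E_{R ~ c}[V*_R(s, gamma) - R(s)]."""
    total = 0.0
    for prob, reward in credence:          # credence = finitely many (prob, R) pairs
        V = optimal_value(transitions, reward, gamma)
        total += prob * (V[state] - reward[state])
    return (1 - gamma) / gamma * total

# A 3-state toy: from "a" the agent can stay, move to "b", or die; "dead" absorbs.
transitions = {"a": ["a", "b", "dead"], "b": ["a", "b"], "dead": ["dead"]}
credence = [(0.5, {"a": 1.0, "b": 0.0, "dead": 0.0}),
            (0.5, {"a": 0.0, "b": 1.0, "dead": 0.0})]
print(power("a", transitions, credence, gamma=0.9))     # ~1.0
print(power("dead", transitions, credence, gamma=0.9))  # 0.0: no options, no power
```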
5.5. Having more options
We saw in the motivating example that Turner and colleagues want to capture the idea that a state (like r_1) gives agents more options than another state (like l_1). The idea cannot be that r_1 provides the agent with all of the same options as l_1, plus some more. After all, the states accessible from r_1 all have r’s in their name and the states accessible from l_1 all have l’s in their name. Rather, the idea is that we should be able to swap the names of states to make it the case that every option available after going left is now identical to an option available after going right. For example, the policy [left, left, up, down, up, down, …] will correspond after relabeling to the policy [right, right, up, down, up, down, …].
More precisely, let F(s) and F(s’) be the sets of visit distributions available from starting states s and s’, respectively (one for each policy). For any state permutation φ, let φF(s) be the result of applying φ to each element of F(s). For example, the permutation φ which swaps all right- and left-states will take (the visit distribution corresponding to) the policy [left, left, up, down, up, down, …] to (the visit distribution corresponding to) the policy [right, right, up, down, up, down, …].
Say that F(s) contains a copy of F(s’) if for some involution φ (a state permutation which swaps some pairs of states and leaves the rest alone), φF(s’) is a subset of F(s).
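To make this definition concrete, here is a brute-force sketch: given two finite sets of visit-distribution vectors, it searches over involutions of the state indices for one that maps every element of the second set into the first. The tuple encoding and the toy vectors are mine (and not derived from a real MDP); the search is only feasible for very small state spaces.

```python
from itertools import permutations

def involutions(n):
    """All permutations phi of range(n) with phi[phi[i]] == i (swap some pairs,
    leave the rest alone)."""
    for phi in permutations(range(n)):
        if all(phi[phi[i]] == i for i in range(n)):
            yield phi

def apply_perm(phi, vec):
    """Relabel the coordinates of a visit-distribution vector by phi."""
    return tuple(vec[phi[i]] for i in range(len(vec)))

def contains_copy(F_s, F_s2):
    """Does F(s) contain a copy of F(s')?  True iff some involution phi maps
    every visit distribution in F(s') onto one in F(s)."""
    F_s = set(F_s)
    n = len(next(iter(F_s)))
    return any(all(apply_perm(phi, f) in F_s for f in F_s2)
               for phi in involutions(n))

# Toy vectors over 4 states: the single option from "l" (index 1) is mirrored by
# an option from "r" (index 2), and "r" has one extra option besides.
F_l = {(0.0, 1.0, 0.0, 0.0)}
F_r = {(0.0, 0.0, 1.0, 0.0), (0.0, 0.0, 0.5, 0.5)}
print(contains_copy(F_r, F_l))  # True: swapping indices 1 and 2 does the job
print(contains_copy(F_l, F_r))  # False: F(l) has too few distinct options
```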
5.6. Larger on most reward functions
Turner and colleagues want to make a claim about what is true under most formally definable reward functions. It may turn out that the true reward function favors good behavior, but their concern is that in some sense, most reward functions will favor bad behavior.
Turner and colleagues cash out the notion of one quantity being larger than another on most reward functions (≥most) by applying state permutations to the agent’s credences.
For any credence c and state permutation φ, let φ(c) be the credence function which results from relabeling the states by φ. For example, if c assigns credence 0.2 to a reward function R and φ maps s_5 to s_1, then φ(c) assigns credence 0.2 to the relabeled reward function which gives s_5 whatever reward R gives s_1 (and similarly for the other states). Let Π(c) be the set of all credence functions that can be formed by permuting c in this way.
What does it mean to say that state s has more power than state s’ on most reward functions? I.e. what does it mean to say that POWER_c(s,γ) ≥most POWER_c(s’,γ)?
For Turner, this is to say that for any credences c with finite support, there are at least as many permuted credences in Π(c) which treat s as more powerful than s’ as there are permuted credences in Π(c) which treat s’ as more powerful than s. I.e. for any discount rate γ:
|{c’ ∈ Π(c) : POWER_c’(s,γ) > POWER_c’(s’,γ)}| ≥ |{c’ ∈ Π(c) : POWER_c’(s’,γ) > POWER_c’(s,γ)}|.
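Putting the pieces together, here is a rough sketch of this counting test on a tiny deterministic MDP: for each permutation of the states we relabel every reward function in the support of c, recompute the power of s and s’ under the permuted credence, and tally which state comes out ahead. The function names and toy numbers are mine, and the brute force over permutations is only feasible for a handful of states.

```python
from itertools import permutations

def optimal_value(transitions, reward, gamma, iters=2000):
    """V*_R by value iteration on a deterministic MDP (reward collected per state)."""
    V = {s: 0.0 for s in transitions}
    for _ in range(iters):
        V = {s: reward[s] + gamma * max(V[s2] for s2 in transitions[s])
             for s in transitions}
    return V

def power(state, transitions, credence, gamma):
    """POWER_c(s, gamma) = (1 - gamma)/gamma * E_{R ~ c}[V*_R(s, gamma) - R(s)]."""
    total = sum(p * (optimal_value(transitions, r, gamma)[state] - r[state])
                for p, r in credence)
    return (1 - gamma) / gamma * total

def permute_credence(phi, credence):
    """Relabel the states inside every reward function in the support of c."""
    return [(p, {s: r[phi[s]] for s in r}) for p, r in credence]

def most_comparison(s, s2, transitions, credence, gamma, tol=1e-9):
    """Count permuted credences under which s is strictly more powerful than s2,
    and vice versa."""
    states = list(transitions)
    wins_s = wins_s2 = 0
    for perm in permutations(states):
        phi = dict(zip(states, perm))
        c2 = permute_credence(phi, credence)
        p1, p2 = power(s, transitions, c2, gamma), power(s2, transitions, c2, gamma)
        if p1 > p2 + tol:
            wins_s += 1
        elif p2 > p1 + tol:
            wins_s2 += 1
    return wins_s, wins_s2

# Toy MDP: "a" can reach everything, "b" can only reach itself and "dead".
transitions = {"a": ["a", "b", "dead"], "b": ["b", "dead"], "dead": ["dead"]}
credence = [(1.0, {"a": 1.0, "b": 0.0, "dead": 0.0})]
print(most_comparison("a", "b", transitions, credence, gamma=0.9))  # (2, 0): "a" wins
```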
We can now state Turner and colleagues’ main results — or at least, enough of their main results to grasp the main argument of the paper.
6. Main results
The first result of the paper is that states with more options have more power.
(Theorem 1: States with more options have more power) If F(s) contains a copy of F(s’), then for any discount rate γ < 1, POWER_c(s,γ) ≥most POWER_c(s’,γ).
Recall that this does not mean that s is certain to have more power than s’. Rather, it means that no matter the agent’s beliefs about reward, there are at least as many state permutations of her beliefs that treat s as more powerful than s’ as there are state permutations that treat s’ as more powerful than s.
Because Turner and colleagues define the power of a state in terms of the value of the optimal policy, agents will tend to seek power as a way of maximizing value.
More concretely, for any initial state s, single-step act a, and discount rate γ, let P(s,a,γ) = c(a ∈ A^*(s,γ)) be the agent’s credence that a is optimal. Extend the definition of ≥most to this quantity in the natural way, so that it makes sense to ask under what conditions P(s,a,γ) ≥most P(s,a’,γ). Turner and colleagues prove:
(Theorem 2: Preserving options tends to be optimal) If F(a(s)) contains a copy of F(a'(s)) and [technical condition omitted for brevity], then for all discount rates γ < 1, P(s,a,γ) ≥most P(s,a’,γ).
Again, we have not said that preserving options is certain to be optimal, nor even that the agent must think that it is. Theorem 2, like the claims before, is a claim about what happens under most reward functions, cashed out in terms of all possible state permutations of the agent’s credences.
7. Link to instrumental convergence
What does all of this have to do with instrumental convergence? One immediate consequence of Turner and colleagues’ results is a link to the Shutdown Problem of determining the conditions under which agents will voluntarily switch themselves off.
As Turner and colleagues note, shutdown is often operationalized in Markov models as a single state which transitions only into itself, as in our motivating example. This means that most other states will provide vastly more options than shutdown does. Insofar as it tends to be optimal for agents to preserve options, it will then tend to be optimal to resist shutdown. As they write:
Average-optimal agents … tend to avoid getting shut down. The agent’s task … often represents agent shutdown with terminal states … [thus] average-optimal policies tend to avoid shutdown. Intuitively, survival is power-seeking relative to dying, and so shutdown-avoidance is power-seeking behavior.
The first two sentences are a claim about shutdown avoidance, not power-seeking. The last does link shutdown-avoidance to power-seeking, but not to anything like Catastrophic Goal Pursuit, and we will see in the next section that the relevant sense of power-seeking is a bit of an odd one.
How do we get from theorems about shutdown-avoidance to instrumental convergence? I have no idea how to do this, and if Turner and colleagues know how to do it, they are not telling. They do, however, flat-out assert that this move can be made:
Reconsider the case of a hypothetical intelligent real-world agent which optimizes average reward for some objective. Suppose the designers initially have control over the agent. If the agent begins to misbehave, perhaps they could just deactivate it. Unfortunately, our results suggest that this strategy might not work. Average-optimal agents would generally stop us from deactivating them, if physically possible. Extrapolating from our results, we conjecture that when γ ≈ 1, optimal policies tend to seek power by accumulating resources – to the detriment of any other agents in the environment.
Every sentence before the final conjecture is a statement about power-seeking. The last sentence asserts a link to instrumental convergence. How is this link made, you ask? Through pure extrapolation and conjecture, with not a single word of argument or even a single word to illustrate what the argument might be.
Turner and colleagues are welcome to conjecture as they please. But a conjecture is not a proof, nor is it an argument, though it certainly should be supported by one.
8. Challenges
There are four natural obstacles to taking Turner and colleagues’ results to support Catastrophic Goal Pursuit.
8.1. Shutdown and instrumental convergence
Turner and colleagues prove a series of theorems about the conditions under which agents will resist shutdown.
Questions about shutdown-avoidance are downstream from instrumental convergence in the argument from power-seeking. Instrumental convergence is used to argue that artificial agents will seek to permanently disempower humanity. Shutdown-avoidance is then used to answer an objection, namely that if artificial agents seek to disempower humanity, we can just turn them off.
There is no direct or indirect route from claims about shutdown-avoidance to claims about instrumental convergence. Shutdown-avoidance is strictly downstream of these claims, and arguments do not swim upstream.
There is not a single theorem in Turner and colleagues’ paper that deals in any direct way with instrumental convergence. All of the mathematics is used to interrogate downstream questions about shutdown-avoidance. Their final extrapolation has no grounding in any of the relevant mathematics and should therefore be given no greater epistemic status than any other extrapolation made on the basis of no argument or theorem at all.
8.2. Defining power
For Turner and colleagues, an agent’s power is defined in terms of its ability to achieve its goals. I don’t want to argue about the correct definition of the word “power” and whether Turner and colleagues’ notion matches up with ordinary linguistic usage. What I do want to point out, however, is that Turner and colleagues’ definition of power cannot be the same sense of power at issue in the argument from power-seeking.
The argument from power-seeking claimed that artificial agents would permanently disempower humanity, and that this disempowerment would constitute an existential catastrophe. But if what Turner and colleagues mean by power-seeking is simply that artificial agents will put themselves in a good position to achieve their goals, then it neither follows that humanity must be disempowered in any sense, nor that the results must be an existential catastrophe.
Certainly, there are many goals that artificial agents might have which might put them in tension with human goals. The behavior that results has been the focus of much discussion and would put us to a large extent back where we started.
However, there are also a number of goals which would not involve existentially catastrophic human disempowerment. For example, artificial agents might aim to act morally, to build a long-lasting and flourishing biological human civilization, or to shut themselves down.
This means that even if Turner and colleagues were to establish a link between shutdown-avoidance and power-seeking, this would not (yet) be the type of power-seeking needed to ground the argument from power-seeking.
8.3. Fixing the problem
I don’t think that Turner and colleagues have identified a sizable source of shutdown-resistance in future superintelligent agents. But let us suppose that they have. The problem is very easy to fix.
Turner and colleagues make use of the fact that shutdown is often formally represented as a 1-cycle. The options provided by a 1-cycle are easily copied inside the option sets of other states, and hence shutdown will be seen as providing far fewer options than other candidate acts.
If that is really the problem, then there is an easy fix. Modify any decision problem you like by representing shutdown, not as a 1-cycle, but as a network of many fully connected nodes. Call this network Dreamland. Make Dreamland large enough to contain a copy of all other relevant subgraphs. By making Dreamland very large, we can argue that since the vast majority of state permutations favor entering Dreamland, agents are very likely to enter Dreamland.
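Here is a small sketch of the modification: take a transition graph in which shutdown is a 1-cycle and splice in a fully connected block of Dreamland states in its place. The graph encoding and the parameter name dreamland_size are mine.

```python
def add_dreamland(transitions, shutdown_state, dreamland_size):
    """Replace a 1-cycle shutdown state with a fully connected block of states.

    `transitions[s]` lists the states reachable in one step from s. Every edge
    that used to point at the shutdown state now points into Dreamland, and each
    Dreamland state can reach every other, so dying provides plenty of "options".
    """
    dream = [f"dream_{i}" for i in range(dreamland_size)]
    new = {}
    for s, successors in transitions.items():
        if s == shutdown_state:
            continue  # the old 1-cycle disappears
        new[s] = [dream[0] if s2 == shutdown_state else s2 for s2 in successors]
    for d in dream:
        new[d] = list(dream)
    return new

# The motivating example, with shutdown ("dead") as a 1-cycle:
transitions = {"*": ["l_1", "r_1", "dead"], "l_1": ["l_2"], "l_2": ["l_2", "l_3"],
               "l_3": ["l_2"], "r_1": ["r_2"], "r_2": ["r_2", "r_3"],
               "r_3": ["r_3", "r_2"], "dead": ["dead"]}
print(add_dreamland(transitions, "dead", dreamland_size=10)["*"])
# ['l_1', 'r_1', 'dream_0']: dying now leads into a large, option-rich subgraph
```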

One way of taking this is as a quick technical fix to a genuine problem identified by Turner and colleagues. But another way of taking this would be to argue that a superintelligent agent could not possibly be so stupid as to be swayed by the mere fact that Dreamland has many nearly-identical states in it. I agree.
What went wrong? Below, I argue that what went wrong is that merely counting the number of states in a graph, or counting the verdicts of possible reward functions, just does not tell us very much about how a superintelligent agent would behave.
8.4. Most formally definable functions ≠ most likely functions
Suppose you see an old lady crossing the street. You can either help her or murder her. You assign values [1, -1000] to the states in which she is helped or murdered, respectively. So you help her.
Aha, I object. But aren’t you very lucky to have settled on this specific reward function instead of the permuted reward [-1000, 1] which decisively favors murder? Only, once we understand how human value learning works, this is not a very tempting argument. Your learning abilities, combined with the experiences you are exposed to, make it much more likely for you to learn some reward functions over others. The fact that there are, in some sense, equal numbers of reward functions on both sides does not mean that humans are equally likely to end up on either side.
Now suppose I object that there are actually four options: you can help, rob, blackmail, or murder the poor lady. You assign values [1, -20, -100, -1000] to the states in which these dastardly acts are perpetrated. Now, I object, isn’t it convenient that you settled on one of the few formally definable reward functions that favors helping the lady over the three immoral alternatives?
Only, once we understand how learning works, the sheer number of alternative possible reward functions just does not make much of a difference. It is certainly possible for an agent to be able to learn that it is good to help old ladies and wrong to murder them, without also learning that it is bad to rob or blackmail old ladies. But although this is formally possible, it isn’t terribly likely.
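To make the counting behind this objection explicit, here is a tiny sketch (the outcome labels are mine): of the 24 ways of assigning these four values to the four outcomes, only a quarter make helping the best option.

```python
from itertools import permutations

outcomes = ["help", "rob", "blackmail", "murder"]
values = [1, -20, -100, -1000]

# Count the assignments of the four values to the four outcomes under which
# "help" receives the highest value.
favour_help = sum(1 for perm in permutations(values)
                  if dict(zip(outcomes, perm))["help"] == max(values))
print(favour_help, "of", len(list(permutations(values))))  # 6 of 24
```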
In the same way, just telling us that under a certain representation, the vast majority of state permutations of reward would favor shutdown-avoidance does not tell us much about the behavior of hypothetical superintelligent agents. Nor, for that matter, does changing the behavior of permuted reward functions by packing Dreamland full of near-shutdown states. Until we know what capacities a superintelligent agent has and how it learns, we cannot say much about the values it is likely to have.
I think this may be what Turner had in mind when he wrote that:
Sometimes I fantasize about retracting Optimal Policies Tend to Seek Power so that it stops (potentially) misleading people into thinking optimal policies are practically relevant for forecasting power-seeking behavior from RL training.
If we want to get a handle on power-seeking behavior, we need to spend less time focusing on optimal policies and more time telling concrete and empirically-grounded stories about the capacities of superintelligent agents, their learning processes, and the data from which they learn.
9. Conclusion
This post discussed a paper by Alex Turner and colleagues, “Optimal policies tend to seek power.” This paper has been the basis for most modern power-seeking theorems.
We saw that while the paper proves some interesting theorems about what most formally definable reward functions tend to favor, it falls short of grounding an argument for instrumental convergence. This happens for four reasons.
First, the paper proves theorems about shutdown-avoidance and then extrapolates, without evidence, that instrumental convergence follows. There is no direct inference from shutdown-avoidance to instrumental convergence.
Second, the paper uses a notion of power that is not strong enough to ground the argument from power-seeking.
Third, the paper identifies a technical problem which, if genuine, admits of a simple fix.
Finally, studying this fix reveals that merely counting numbers of formally definable states, reward functions and the like does not tell us much about how agents are likely to behave.
This concludes my discussion of leading power-seeking theorems. Like the early theorem by Benson-Tilsen and Soares discussed in Part 2, Turner and colleagues’ theorem cannot ground the argument from power-seeking because it does not ground Catastrophic Goal Pursuit or any similar instrumental convergence claim.
