Instrumental convergence and power-seeking (Part 1: Introduction)

Under plausible assumptions, some AI systems will seek power, successfully obtain it and go awry, potentially bringing about catastrophe … This is the Problem of Power-Seeking.

Bales, D’Alessandro, and Kirk-Giannini, “Artificial intelligence: Arguments for catastrophic risk”

Author’s note: My work on this paper and blog series was supported by a grant from the Survival and Flourishing Fund. All views are my own.


1. Introduction

If recent reviews are to be believed, there are two leading arguments for taking developments in artificial intelligence to pose an existential threat to humanity.

The first is the singularity hypothesis. On this view, self-improving artificial agents will experience a period of rapid growth in their general intelligence, after which they will be orders of magnitude more intelligent than the average human. Beyond this, we had better hope that the resulting agents wish us well, for in the words of Eliezer Yudkowsky, one does not bargain with ants.

I discuss the singularity hypothesis in my paper and blog series “Against the singularity hypothesis.”

The second argument is the argument from power-seeking. On this view, power is an instrumentally convergent end. That is to say, power is conducive to many things that artificial agents might value, so we should expect artificial agents to pursue power in order to achieve their ends. If they pursue enough power, they will permanently disempower humanity. That will be an existential catastrophe in its own right, even if it does not involve eliminating humanity to remove a threat to artificial agents’ power.

My paper and blog series, “Instrumental convergence and power-seeking,” addresses the argument from power-seeking. This is the first post in that series.

Let us begin with some clarifications.

2. Power-seeking theorems

Two types of argument are offered in support of the argument from power-seeking. The first type consists of informal arguments, which appeal to factors such as the potential for reward functions to be misspecified, for goals to misgeneralize in novel environments, and for increasingly sophisticated agents to deceive us about their true intentions.

These informal arguments have proven quite polarizing: many of those most concerned with existential risk take them to be among the strongest arguments on offer, while many opponents remain substantially unconvinced.

The second type consists of formal arguments, which aim to break this impasse by offering power-seeking theorems: results intended to show that a wide range of agents will have problematic power-seeking tendencies. Unless we can be confident that the systems we design will not be of this type, it is urged, we should take seriously the possibility that they will be problematically power-seeking.

This blog series will focus primarily on the second class of arguments. I will argue that leading power-seeking theorems do not substantially improve the credibility of the argument from power-seeking. This was the original project of the paper accompanying this blog series, entitled “What power-seeking theorems do not show.”

As often happens with papers responding to longtermist arguments, the topic of this paper has proven a bit too narrow for academic readers, many of whom did not take power-seeking theorems particularly seriously to begin with. As a result, the version of the paper that is likely to be published bears the more general title, “Instrumental convergence and power-seeking,” and also explores a range of informal avenues of support for the argument from power-seeking.

While I have kept the new title to stave off confusion, my purpose in writing this paper was to discuss power-seeking theorems, and so the blog series will focus entirely on power-seeking theorems. I will try to cleave closely to the original paper draft, except that I will focus for the most part on particular responses to leading theorems rather than generalities.

3. The argument from power-seeking

To get a handle on the argument from power-seeking, it will help to have a concrete formulation on the table. Let’s work with a formulation due to Joe Carlsmith. The following is an exact quotation, with three exceptions: descriptive labels have been added to the premises, the premises have been stated as unconditional rather than conditional claims, and the relativization to Carlsmith’s chosen deadline (2070) has been removed.

(Possibility) It will become possible and financially feasible to build AI systems with the following properties:

  • Advanced capability: they outperform the best humans on some set of tasks which when performed at advanced levels grant significant power in today’s world (tasks like scientific research, business/military/political strategy, engineering, and persuasion/manipulation).
  • Agentic planning: they make and execute plans, in pursuit of objectives, on the basis of models of the world.
  • Strategic awareness: the models they use in making plans represent with reasonable accuracy the causal upshot of gaining and maintaining power over humans and the real-world environment.

(Call these “APS” – Advanced, Planning, Strategically aware – systems.)

(Incentives) There will be strong incentives to build and deploy APS systems.

(Alignment Difficulty) It will be much harder to build APS systems that would not seek to gain and maintain power in unintended ways (because of problems with their objectives) on any of the inputs they’d encounter if deployed, than to build APS systems that would do this, but which are at least superficially attractive to deploy anyway.

(Power-Seeking) Some deployed APS systems will be exposed to inputs where they seek power in unintended and high-impact ways (say, collectively causing >$1 trillion dollars of damage), because of problems with their objectives.

(Disempowerment) Some of this power-seeking will scale (in aggregate) to the point of permanently disempowering ~all of humanity.

(Catastrophe) This disempowerment will constitute an existential catastrophe.

Instrumental convergence figures into the argument from power-seeking in at least one place, and probably in three.

Most obviously, instrumental convergence is offered as an argument for Power-Seeking. The reason why we are meant to expect that deployed APS systems will seek power is that power will be conducive to the achievement of their goals.

Quite naturally, instrumental convergence may figure in Disempowerment. The reason why we should expect some APS systems not only to seek power, but in fact to seek enough power to permanently disempower most of humanity is that more power will continue to be useful for many goals even as a system accumulates large quantities of power.

Instrumental convergence may also underlie the appeal of Alignment Difficulty. On this reading, one of the chief reasons why it is hard to align APS systems is that the vast majority of goals that APS systems might have will tend to favor power-seeking.

All of this suggests that to get clear on the argument from power-seeking, we should get clear on instrumental convergence. Precisely what kind of an instrumental convergence claim must we make for the argument from power-seeking to get off the ground? It turns out that the needed claim is stronger than it may appear.

4. Instrumental convergence

What precisely does instrumental convergence claim? The best-known formulation of instrumental convergence is due to Nick Bostrom:

(IC-Bostrom) Several instrumental values can be identified which are convergent in the sense that their attainment would increase the chances of the agent’s goal being realized for a wide range of final goals and a wide range of situations, implying that these instrumental values are likely to be pursued by many intelligent agents.

This formulation of instrumental convergence combines two claims, one in each clause:

(Goal Realization) There are several values which would increase the chances of an agent’s final goal being realized, for a wide range of goals and a wide range of situations.

(Goal Pursuit) There are several values which would be likely to be pursued by a wide range of intelligent agents.

IC-Bostrom holds that Goal Realization and Goal Pursuit are true, and that Goal Realization implies Goal Pursuit.
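
To make the gap between these two clauses explicit, here is a schematic formalization; the notation is mine and purely illustrative, not Bostrom’s.

```latex
% Schematic formalization (my notation, for illustration only; not Bostrom's).
% V: a set of candidate instrumental values; G: a final goal; s: a situation; a: an agent.

% Goal Realization: attaining v raises the chance that the final goal is realized,
% for a wide range of goals and situations.
\[
\exists V \;\forall v \in V:\quad
\Pr(G \text{ realized} \mid v \text{ attained},\, s) \;>\; \Pr(G \text{ realized} \mid s)
\quad\text{for a wide range of } (G, s).
\]

% Goal Pursuit: a wide range of intelligent agents are likely to pursue v.
\[
\exists V \;\forall v \in V:\quad
\Pr(a \text{ pursues } v) \text{ is high}
\quad\text{for a wide range of agents } a.
\]

% Getting from the first schema to the second requires a bridge premise, roughly:
% agents pursue whatever raises the chances of realizing their final goals, whatever
% their other goals and values may be. The discussion below questions that bridge premise.
```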

Many other theorists follow Bostrom in stating versions of instrumental convergence that are closely tied to goal realization. For example:

[The] Instrumental Convergence Thesis [is] the claim that certain resource-acquiring, self-improving and shutdown-resisting subgoals are useful for achieving a wide variety of final goals. (Bales et al. 2024)

The instrumental convergence thesis states that there are certain goals which are instrumentally useful for a wide range of final goals and a wide range of situations. (Dung 2024)

By contrast, I think that we have good reason to separate Goal Realization from Goal Pursuit.

Goal Realization makes a claim about what would increase the chance of an agent realizing their goals. By contrast, Goal Pursuit makes a claim about how agents will behave. The inference from claims about what conduces to goal fulfillment to claims about behavior is complex, because agents have competing goals which they may not always be willing to relax in the pursuit of power.

For example, money is conducive to many of my goals. If I had more money, I could work less, eat better, and spend more time blogging. From this, it does not follow that I would rob a bank if I could get away with it. This is not because the money would not be useful — indeed, it would be life-changing. It is rather because I value other things, such as justice, human life, and the rule of law. I may be willing to bend these values around the edges, but I am not willing to set them completely aside in order to achieve my other goals.
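
To put the same point in decision-theoretic terms, here is a toy calculation with made-up numbers; the options, payoffs, and weights below are purely illustrative and are not drawn from any formal model in the literature.

```python
# Toy decision model (illustrative numbers only; not from any model in the
# power-seeking literature). The point: an option can be highly conducive to
# one goal and still lose overall once the agent's other values are counted.

from dataclasses import dataclass


@dataclass
class Option:
    name: str
    money_payoff: float       # how much the option advances the money-related goals
    other_values_cost: float  # how badly it violates other values (justice, rule of law, ...)


def overall_utility(option: Option, weight_on_other_values: float = 1.0) -> float:
    """Simple additive trade-off between the instrumental payoff and other values."""
    return option.money_payoff - weight_on_other_values * option.other_values_cost


options = [
    Option("keep blogging and cash my paycheck", money_payoff=1.0, other_values_cost=0.0),
    Option("rob a bank (and get away with it)", money_payoff=50.0, other_values_cost=1000.0),
]

best = max(options, key=overall_utility)
print(best.name)  # -> "keep blogging and cash my paycheck"
```

On this toy picture, the instrumental usefulness of money (Goal Realization) is real, but the agent’s all-things-considered choice (Goal Pursuit) depends on how heavily its other values weigh in.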

Moreover, Goal Pursuit is too weak to drive the argument from power-seeking. Goal Pursuit says only that instrumentally convergent goals such as power are likely to be pursued, but it does not pronounce on the extent to which they will be pursued. It is highly plausible that most artificial agents, like most humans, will sometimes pursue power and resources. We do this through simple acts such as cashing a paycheck. But that is not what the argument from power-seeking needs.

The argument from power-seeking needs to claim that artificial agents will pursue so much power that humanity will be permanently disempowered.

(Catastrophic Goal Pursuit) There are several values which would be likely to be pursued by a wide range of intelligent agents to a degree that, if successful, would permanently and catastrophically disempower humanity.

Without Catastrophic Goal Pursuit, the argument from power-seeking would be unable to establish its final two premises: Disempowerment and Catastrophe.

However, Catastrophic Goal Pursuit is a very strong claim. Some of us would, regrettably, be willing to rob a bank if we could get away with it. But fewer would be willing to rob every bank in the world. This is not (just) because of the diminishing marginal value of money or the effects on the global monetary system, but rather because many of us find the prospect of draining the world’s coffers and leaving our neighbors to starve profoundly unappealing.

To establish a claim like Catastrophic Goal Pursuit, it is not enough to show that power conduces to many goals that artificial agents might have, nor even that the instrumental value which agents assign to power grows unboundedly or has a very large upper bound. Instead, we need to show that agents will be so morally confused as to view the permanent disempowerment of humanity as worth the price, and will be tempted to translate this view into action.
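
To see why even an unbounded instrumental value for power is not enough, here is one more schematic illustration in my own notation; nothing in it is drawn from the formal literature.

```latex
% Schematic illustration (my notation). Let p measure how much power an agent seizes,
% B(p) the instrumental benefit that much power brings to its final goals, and C(p)
% the weight the agent places on everything else it values that seizing p would damage.
\[
U(p) \;=\; B(p) \;-\; C(p).
\]
% Even if B(p) grows without bound as p grows, the agent chooses a humanity-disempowering
% level of power only if U at that level exceeds U at every more modest level. That
% comparative claim is what Catastrophic Goal Pursuit asserts, and it is not settled by
% pointing to the growth of B alone.
```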

The rest of this series will argue that leading power-seeking theorems fail to establish anything like Catastrophic Goal Pursuit. As a result, they do not ground current versions of the argument from power-seeking.

5. Conclusion

This post introduced my paper and blog series “Instrumental convergence and power-seeking.”

We saw that the argument from power-seeking is one of the leading arguments for expecting artificial intelligence to pose an existential risk to humanity. We looked at a leading formulation of the argument from power-seeking due to Joe Carlsmith and saw how the argument from power-seeking depends on instrumental convergence.

We then asked what instrumental convergence would have to claim in order to drive the argument from power-seeking. We saw that the needed claim, Catastrophic Goal Pursuit, is quite strong.

The next post in this series will look at a first attempt to establish instrumental convergence due to Tsvi Benson-Tilsen and Nate Soares.

Comments

  1. Bob Jacobs

    Why is Catastrophic Goal Pursuit assumed for time-constrained goals? I can understand it for indefinite goals like “make sure there’s always a paperclip on my desk”. Here, I could see a robot killing everyone just so nobody burns or bombs the desk. But for “make sure there’s a paperclip on my desk in the next five minutes” there’s simply no time to do any of that, even if the robot would want to.
    If we give the robot some leeway on *when* it has to be on the desk, e.g. “within the next five minutes” and not “at the five minute mark”, it can “relax” more. We can go further, giving it ranges in time, and/or in space, and/or in energy consumption, and/or almost anything. Anytime we give a range it has to satisfy, instead of a target it has to maximize, it can “relax” more ( https://bobjacobs.substack.com/p/aspiration-based-non-maximizing-ai ).

    1. David Thorstad

      Thanks Bob!

      You’re quite right that Catastrophic Goal Pursuit is more plausible for some final goals than others. For example, as you mention, a time- or resource-limited goal may make the pursuit of human disempowerment less feasible.

      I’m not aware of any consideration of time- or resource-limited goals in the recent literature on power-seeking theorems. It might be interesting to study these theorems under time- or resource-limits.

      1. titotal23

        I actually wrote up a bit on constrained goals in one of my earliest blog posts here: https://titotal.substack.com/p/chaining-the-evil-genie-why-outer

        It’s a little outdated (modern-day AIs do not have set utility functions), but some of the ideas might transfer.

        1. David Thorstad

          Thanks titotal!

          Sorry for the very late reply. I’m just heading back from a conference in St Louis. Not a great time for the city.

          I’m looking forward to reading this soon.
