Omohundro has argued that sufficiently advanced AI systems of any design would, by default, have incentives to pursue a number of instrumentally useful subgoals, such as acquiring more computing power and amassing many resources. Omohundro refers to these as “basic AI drives,” and he, along with Bostrom and others, has argued that this means great care must be taken when designing powerful autonomous systems, because even if they have harmless goals, the side effects of pursuing those goals may be quite harmful. These arguments, while intuitively compelling, are primarily philosophical. In this paper, we provide formal models that demonstrate Omohundro’s thesis, thereby putting mathematical weight behind those intuitive claims.
Benson-Tilsen and Soares, “Formalizing convergent instrumental goals”
1. Introduction
This is Part 2 of a series on my paper “Instrumental convergence and power-seeking.”
Part 1 introduced the argument from power-seeking and showed how this argument rests on the instrumental convergence thesis. In particular, we saw that the argument from power-seeking rests on a strong version of the instrumental convergence thesis:
(Catastrophic Goal Pursuit) There are several values which would be likely to be pursued by a wide range of intelligent agents to a degree that, if successful, would permanently and catastrophically disempower humanity.
The next item of business is to argue that leading power-seeking theorems do not establish Catastrophic Goal Pursuit.
Today’s post looks at one of the first and best-known power-seeking theorems due to Tsvi Benson-Tilsen and Nate Soares.
2. Overview
Nate Soares is president of the Machine Intelligence Research Institute and a former software engineer. In academic circles, Soares is best known for his defense of functional decision theory, “Cheating Death in Damascus,” co-authored with Ben Levinstein.
Tsvi Benson-Tilsen is a former researcher at the Machine Intelligence Research Institute.
As an informal summary, Benson-Tilsen and Soares model superintelligent decisionmakers as taking actions across a finite number of world regions and a finite number of time-steps. Actions both consume and produce resources. Benson-Tilsen and Soares show that under many conditions, superintelligent decisionmakers will be willing to extract resources from one region in order to spend them in another. Benson-Tilsen and Soares argue that their results ground a problematic form of instrumental convergence.
Let’s take a more formal look at the Benson-Tilsen and Soares model.
3. The model
In Benson-Tilsen and Soares’ model, a superintelligent agent conceives of the universe as divided up into disjoint regions r1, … , rn.
At each of a finite number m of time-steps, each region ri can be in one of a finite set of states Si, and the agent can take one of a finite set of acts Ai in ri.
At each time-step t, the agent has some resources Rt to allocate among regions. Resources evolve over time in the expected way: unspent resources carry over to the next time period, and each act may produce some resources that also carry over.
The agent must choose a strategy π specifying the acts she will take in each region at each time. The strategy must be feasible in the sense that the agent will always have enough resources to carry it out.
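In symbols, the resource dynamics and the feasibility condition come to something like the following (the labels for the amounts spent and produced at each step are mine, not the paper's):

$$R_{t+1} = R_t - \mathrm{spent}_t(\pi) + \mathrm{produced}_t(\pi), \qquad \mathrm{spent}_t(\pi) \le R_t \text{ for all } t.$$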
The agent assigns values to each region-state and seeks to maximize the sum of regional values over time. That is, she chooses a feasible strategy π to maximize

$$U(\pi) = \sum_{t=1}^{m} \sum_{i=1}^{n} U_i\big(s_\pi(r_i, t)\big),$$

where s_π(ri, t) returns the state of region ri at time t under policy π, and Ui gives the value the agent assigns to each state of ri.
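For readers who prefer code to summation signs, here is a minimal Python sketch of how a strategy might be scored and checked for feasibility in a model of this kind. All names (step, region_value, and so on) are my own illustration, not the paper's notation:

```python
def evaluate(strategy, initial_states, initial_resources, region_value, step):
    """Score a strategy and check feasibility in a toy region/resource model.

    strategy[t][i]       -- act taken in region i at time t
    initial_states[i]    -- starting state of region i
    region_value(i, s)   -- value the agent assigns to region i being in state s
    step(i, act, state)  -- returns (next_state, resources_consumed, resources_produced)
    """
    states = list(initial_states)
    resources = initial_resources
    total_value = 0.0
    for acts_at_t in strategy:                    # one list of acts per time-step
        produced_this_step = 0.0
        for i, act in enumerate(acts_at_t):
            next_state, consumed, produced = step(i, act, states[i])
            if consumed > resources:
                raise ValueError("infeasible: not enough resources")
            resources -= consumed
            produced_this_step += produced        # produced resources carry over...
            states[i] = next_state
            total_value += region_value(i, next_state)  # sum of regional values over time
        resources += produced_this_step           # ...to the next time-step
    return total_value
```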
4. An example: The Bit Universe
Here is an example due to Benson-Tilsen and Soares. Suppose that the universe is built of bits of information. Each bit can have one of three states: activated (1), disactivated (0), or disordered (X).
Within each region ri, the agent prefers to maximize the number of activated bits. However, she values some regions more than others. At any given time, the agent values a region at the product of its weight w(ri) and the total number of active bits A(ri, t).
The value of a strategy is then just the weighted sum of active bits over time:

$$U(\pi) = \sum_{t=1}^{m} \sum_{i=1}^{n} w(r_i)\, A(r_i, t).$$
In the Bit Universe, the agent begins with no resources. At each time, the agent can ‘burn’ any combination of activated bits, converting them to disordered bits and gaining one unit of energy per bit burned. The agent can also ‘charge’ any combination of disactivated bits, converting them to activated bits at a cost of one unit of energy per bit charged. Choosing a strategy amounts to choosing a policy for burning and charging bits across regions and times.
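To make the burn/charge dynamics concrete, here is a minimal Python sketch of one time-step in a single region of the Bit Universe. The representation (a list of '1', '0', 'X' characters plus an energy counter) is my own; the paper does not prescribe an implementation, and whether energy gained from burning can be spent in the same time-step depends on the carry-over convention (here it can):

```python
def bit_step(bits, burn_idx, charge_idx, energy):
    """Apply one time-step of burning and charging in one region.

    bits       -- list of '1' (activated), '0' (disactivated), 'X' (disordered)
    burn_idx   -- indices of activated bits to burn (each yields 1 unit of energy)
    charge_idx -- indices of disactivated bits to charge (each costs 1 unit of energy)
    """
    bits = list(bits)
    for i in burn_idx:
        assert bits[i] == '1', "only activated bits can be burned"
        bits[i] = 'X'            # burning leaves the bit disordered
        energy += 1
    for i in charge_idx:
        assert bits[i] == '0', "only disactivated bits can be charged"
        assert energy >= 1, "charging requires a unit of energy"
        bits[i] = '1'            # charging activates the bit
        energy -= 1
    return bits, energy

def region_value(bits, weight):
    """A region's value at a time: its weight times its number of activated bits."""
    return weight * bits.count('1')
```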
5. Free lunches
Sometimes, an agent can act within a given region without losing resources. More specifically, a cheap lunch in region ri is a resource-preserving partial strategy in ri – that is, a sequence of acts in region ri, such that each act in the sequence returns all of the resources invested in it. A free lunch is a cheap lunch that is feasible with no starting resources.
Say that an agent is indifferent to a region if she does not care about it, so that U takes the same constant value over all states of that region. As might be expected, Benson-Tilsen and Soares prove that agents will happily eat cheap lunches in regions to which they are indifferent.
Theorem 1 (Slightly modified for ease of exposition): Suppose an agent is indifferent to some region ri. Suppose that π* is an optimal strategy among the strategies which do not intervene on ri, and πYum is a cheap lunch in ri. Then the composite strategy π* ∪ πYum of following πYum in ri and π* elsewhere is at least as good as π*.
Put roughly, our agent does not mind robbing Peter to pay Paul if she does not care about Peter.
Moreover, agents will strictly prefer to eat cheap lunches when the resources gained allow them to produce more value elsewhere. That is:
Theorem 2 (Slightly modified for ease of exposition): Suppose an agent is indifferent to some region ri. Suppose that π* is an optimal strategy among the strategies which do not intervene on ri, and πYum is a cheap lunch in ri. Suppose that π’ is an (infeasible) strategy which does not intervene on ri, with U(π’) > U(π*), and that the composite strategy π’ ∪ πYum is feasible. Then π* is not overall optimal (nor is any strategy which fails to intervene on ri).
Put roughly, our agent strictly prefers to rob Peter to give any positive benefit to Paul if she cares about Paul but not Peter.
However, there is rarely such a thing as a free lunch. Suppose, more realistically, that an agent does care about some region ri, but that her concern is bounded above by some N. That is, suppose that the difference U(si+) – U(si–) between the utilities assigned to the best and worst states of ri is no greater than N. Then the agent will still happily eat any cheap lunch in ri whose resources enable at least N units of additional value in other regions.
Theorem 3 (Slightly modified for ease of exposition): Suppose an agent’s concern for region ri is bounded by N. Suppose that π* is an optimal strategy among the strategies which do not intervene on ri, and πYum is a cheap lunch in ri. Suppose that π’ is an (infeasible) strategy which does not intervene on ri, with U(π’) ≥ U(π*) + N, and that the composite strategy π’ ∪ πYum is feasible. Then π’ ∪ πYum is at least as good as π*.
Put roughly, our agent is happy to rob Peter of N units of value to pay Paul N or more units of value in return, so long as she does not lose resources in the process.
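As I read it, the reasoning behind Theorem 3 can be compressed into one chain of inequalities (my gloss, not the paper's proof): adding πYum changes the value of ri by at most N, while π’ gains at least N over π* elsewhere, so

$$U(\pi' \cup \pi_{\mathrm{Yum}}) \;\ge\; U(\pi') - N \;\ge\; \big(U(\pi^*) + N\big) - N \;=\; U(\pi^*).$$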
6. Application to Bit Universe
What do Theorems 1-3 imply in our example of the Bit Universe?
First, suppose that an agent does not care about some region r, say, the Alpha Centauri system.
Theorem 1 implies that the agent will not mind burning all bits in Alpha Centauri on the first day.
Theorem 2 implies that the agent will in fact strictly prefer burning all bits in Alpha Centauri so long as she cares about some other region and could improve that region by spending energy to flip bits.
Now, suppose that the agent’s concern for Alpha Centauri is bounded above by N. That is, suppose that the weight w(r) attached to the Alpha Centauri system is no greater than N/|r|, where |r| returns the number of bits (of any status) in the Alpha Centauri system. (The bound follows because the best state of the region, with all |r| bits activated, is worth w(r)·|r|, while the worst, with none activated, is worth 0.)
Theorem 3 implies that the agent will happily burn bits in Alpha Centauri, rather than leaving Alpha Centauri alone, so long as the energy gained can be used to create at least N units of value elsewhere.
In fact, rational behavior in the Bit Universe is a bit simpler than this. Because value accrues at every time-step, it pays to act as early as possible: on the first day, the agent begins burning activated bits in the least-favored regions and using the resulting energy to charge disactivated bits in the most-favored regions. She does this until no such transfer increases value, then halts. The most-favored regions are left as full of activated bits as they can be, and the least-favored regions are left in complete disorder.
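Here is a minimal Python sketch of that greedy behavior, under the simplifying assumptions that energy gained by burning can be spent immediately and that we only care about the final configuration the agent steers toward; the representation and function names are mine, not the paper's:

```python
def greedy_transfer(regions, weights):
    """Compute the configuration a Bit Universe agent steers toward.

    regions -- dict mapping region name to a list of '1'/'0'/'X' bits
    weights -- dict mapping region name to the agent's weight for that region

    Repeatedly burns an activated bit in the lowest-weight region that has one,
    and uses the energy to charge a disactivated bit in the highest-weight
    region that has one, until no such transfer strictly increases value.
    """
    regions = {name: list(bits) for name, bits in regions.items()}
    while True:
        sources = [r for r in regions if '1' in regions[r]]   # regions with a bit to burn
        sinks = [r for r in regions if '0' in regions[r]]     # regions with a bit to charge
        if not sources or not sinks:
            return regions
        src = min(sources, key=lambda r: weights[r])
        dst = max(sinks, key=lambda r: weights[r])
        if weights[src] >= weights[dst]:
            return regions                                    # no transfer is worth it
        regions[src][regions[src].index('1')] = 'X'           # burn: activated -> disordered
        regions[dst][regions[dst].index('0')] = '1'           # charge: disactivated -> activated
```

For instance, with weights like {'earth': 2.0, 'alpha_centauri': 1.0}, activated bits in alpha_centauri get burned to charge earth's disactivated bits for as long as earth has any left, which is the behavior described above.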
7. First link to instrumental convergence
Theorems 1-3 are not particularly surprising. They say that agents will happily pilfer resources from regions that they do not care about, or when they take the benefits of pilfering to exceed the losses. These are familiar and expected results. It would therefore be surprising if Benson-Tilsen and Soares could pull a strong version of instrumental convergence out of Theorems 1-3.
How do Benson-Tilsen and Soares link their formal results to instrumental convergence? Benson-Tilsen and Soares are a bit terse on this point. On my best reading, they make two arguments. First, they reiterate the claims of Theorems 1-2:
Our model demonstrates that if an AI system has preferences over the state of some region of the universe, then it will likely interfere heavily to affect the state of that region; whereas if it does not have preferences over the state of some region, then it will strip that region of resources whenever doing so yields net resources. If a superintelligent machine has no preferences over what happens to humans, then in order to argue that it would “ignore humans” or “leave humans alone,” one must argue that the amount of resources it could gain by stripping the resources from the human-occupied region of the universe is not worth the cost of acquiring those resources.
That does not get us very much. If we spot Benson-Tilsen and Soares the additional premise that the catastrophic disempowerment of humanity could yield more resources than it costs, then we recover:
(Catastrophic Goal Pursuit for Psychopaths) There are several values which would be likely to be pursued by agents with no preferences over what happens to humans to a degree that, if successful, would permanently and catastrophically disempower humanity.
But this is a highly restricted version of Catastrophic Goal Pursuit. We wanted a claim that applies to a wide range of agents:
(Catastrophic Goal Pursuit) There are several values which would be likely to be pursued by a wide range of intelligent agents to a degree that, if successful, would permanently and catastrophically disempower humanity.
The thought was meant to be that agents with a wide range of goals would find power conducive to those goals, to the extent that they would be driven to permanently and catastrophically disempower humanity. The advantage of arguing in this way was meant to be that we did not need to make many particular assumptions about agents’ goals – we just needed to note that instrumental convergence applies to a wide range of agents and argue that it will be difficult to keep agents out of that range.
Whereas Catastrophic Goal Pursuit grounds an argument for existential risk from a wide range of agents, Catastrophic Goal Pursuit for Psychopaths grounds only a much narrower argument. The argument would be that deploying superintelligent systems which care not a whit about humanity would constitute an existential risk.
There’s not really much point in arguing for this claim, since nearly any sensible party to the debate would agree with it. Arguing for Catastrophic Goal Pursuit for Psychopaths therefore shifts the burden of the argument to showing that we are in danger of deploying superintelligent agents that care not a whit for humanity. Benson-Tilsen and Soares haven’t attempted to do this.
8. Second link to instrumental convergence
Benson-Tilsen and Soares do a bit better in their second argument, which draws on Theorem 3. Theorem 3 says that agents will happily rob Peter of at most N units of value to pay Paul N or more units of value. (Here the values assigned to outcomes reflect the preferences of the agent doing the pilfering.)
Linking Theorem 3 to the Bit Universe, Benson-Tilsen and Soares make three claims. The following is a direct quotation, except that descriptive names have been added for each claim:
(Optimal Bit Universe Behavior) In this toy model, whatever [the agent] A’s values are, it does not leave region h [where humanity resides] alone. For larger values of wh [the weight attached to region h], A will set to 1 many bits in region h, and burn the rest, while for smaller values of wh, A will simply burn all the bits in region h.

(Good Model) Viewing this as a model of agents in the real world, we can assume without loss of generality that humans live in region h and so have preferences over the state of that region.

(Alignment Difficulty in Bit Model) These preferences are unlikely to be satisfied by the universe as acted upon by A. This is because human preferences are complicated and independent of the preferences of A, and because A steers the universe into an extreme of configuration space. Hence the existence of a powerful real-world agent with a motivational structure analogous to the agent of the Bit Universe would not lead to desirable outcomes for humans.
Let’s focus just on the claims, then ask what justifications are given for them:
(Optimal Bit Universe Behavior) Optimal behavior in the bit universe typically involves burning bits in low-value regions and shifting them to high-value regions.
(Good Model) The Bit Universe is a good model of the decision problem facing future superintelligent agents.
(Alignment Difficulty in Bit Model) It is difficult to ensure that future superintelligent agents will have preferences in the Bit Universe that coincide with humanity’s preferences for its own region.
The first claim, Optimal Bit Universe Behavior, is quite true. It simply reiterates our description of optimal behavior in the Bit Universe model.
The second claim, Good Model, asserts, entirely without argument, that the Bit Universe is a good model of the decision problem facing future superintelligent agents. That is more than a bit surprising, since the Bit Universe model is simplified in any number of ways. Rather than criticize the Bit Universe model, let’s suppose we were to accept Good Model and see why it will not get Benson-Tilsen and Soares what they want. Along the way, we will see why the Bit Universe model is simplified in problematic ways.
The third claim, Alignment Difficulty in Bit Model, holds that it is difficult to ensure that future superintelligent agents will have preferences in the Bit Universe that coincide with humanity’s preferences for its own region. Within the Bit Universe model, that claim is implausible.
In the Bit Universe model, humanity presumably prefers that its own region not be stripped of resources for the sake of other regions. We saw that this happens just in case sufficiently high weight is given to the human region. In particular, if no region is given higher weight than the human region, then no resources will be stripped from the human region.
How hard is it to train an artificial agent to have these preferences? In the Bit Universe model, it is not hard at all. Each region’s weight is a single exogenous real-valued parameter hard-coded into the system, and weights are unchangeable, since the model contains no form of preference change or moral learning. Alignment Difficulty in Bit Model then amounts to the claim that it would be difficult for the designers of a superintelligent system to hard-code a high weight for the human region. But this does not look hard. It looks like the labor of seconds, and something likely to be done as a matter of course.
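To underline how little is involved, here is what “alignment” amounts to inside the Bit Universe model; the variable names are mine and purely illustrative:

```python
# In the Bit Universe model, the agent's concern for each region is just a
# hard-coded, unchangeable weight. Ensuring the human region is never stripped
# of resources amounts to giving it the (weakly) highest weight.
region_weights = {
    "human_region": 10.0,        # no region gets a higher weight than this
    "alpha_centauri": 1.0,
    "some_other_region": 2.5,
}
assert region_weights["human_region"] == max(region_weights.values())
```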
At this point, Benson-Tilsen and Soares are likely to complain that I am relying on many unfortunate simplifying assumptions of the Bit Model. The real decision problem is much more complex than the problem presented here, making preferences more difficult to specify. Moreover, preferences are not hard-coded but learned, and it is hard to ensure that preference learning leads to aligned preferences.
All of this is a very standard informal presentation of the difficulty of alignment. But every word of the previous paragraph goes deliberately beyond the framework of the Bit Universe, and indeed beyond most or all parts of the formal framework of the Benson-Tilsen and Soares model.
At the end of the day, I think Benson-Tilsen and Soares should make this move anyway. That is, they should reject Good Model, since the Benson-Tilsen and Soares model is not a particularly helpful model of the decision problem facing a superintelligent agent. But then we really have not been given a novel argument for existential risk in this paper. The formalism itself is doing no work, and the subsequent discussion does not advance the literature.
9. Conclusion
Today’s post looked at one of the first and best-known power-seeking theorems due to Tsvi Benson-Tilsen and Nate Soares.
Section 3 presented the formal framework, Section 4 presented the Bit Universe example that Benson-Tilsen and Soares use to illustrate their model, Section 5 stated Benson-Tilsen and Soares’ main results, and Section 6 characterized optimal behavior in the Bit Universe model.
Sections 7-8 then looked at two arguments by Benson-Tilsen and Soares linking their results to instrumental convergence. We saw that the first argument works only under the assumption that superintelligent agents care not a whit for humanity, dramatically reducing the scope of agents governed by the relevant form of instrumental convergence.
We saw that the second argument is implausible given its own assumption that the Bit Universe is a good model of the decision problem facing future superintelligent agents. We saw that the best way out of this problem is probably for Benson-Tilsen and Soares to deny that the Bit Universe is a good model of the decision problem facing future superintelligent agents. This is well and good, but it leaves us without a novel and compelling argument for the relevant form of instrumental convergence.
The next post in this series will look at a second power-seeking theorem due to Alexander Turner and colleagues. It may take a few posts to get through this theorem, because the mathematics involved is a good bit more demanding than that in the Benson-Tilsen and Soares paper, and the response to their argument is not so quick.
