If anyone builds it, everyone dies (Part 2: Evolutionary mismatch)

What, exactly, will AIs want? The answer is complicated … in the sense that it’s chaotic and unpredictable. But one thing that is predictable is that AI companies won’t get what they trained for. They’ll get AIs that want weird and surprising stuff instead.

Yudkowsky and Soares, If anyone builds it, everyone dies

1. Introduction

This is Part 2 of my series If anyone builds it. This series discusses and responds to some key contentions of Eliezer Yudkowsky and Nate Soares’ book, If anyone builds it, everyone dies.

Part 1 introduced the book alongside some key argumentative cruxes. We saw that the book aims to establish a strong claim about existential risk in the space of a single chapter, then generalizes this to a stronger claim a few pages later.

The next several posts in this series will ask whether Yudkowsky and Soares successfully establish these claims about existential risk. Today’s post looks at the first of two main arguments in Chapter 4.

2. Misaligned goals

We saw in Part 1 of this series that Chapter 4 concludes with the claim that:

(Bad Goals 1) The preferences that wind up in a mature AI are complicated, practically impossible to predict, and vanishingly unlikely to be aligned with our own, no matter how it was trained.

Bad Goals 1 combines three claims that are worth separating.

(Complicated Goals) The preferences of a mature AI will be complicated.

(Unpredictable Goals) The preferences of a mature AI will be practically impossible to predict.

(Misaligned Goals) The preferences of a mature AI are vanishingly unlikely to be aligned with our own.

I want to focus primarily on Misaligned Goals. The reason for this is that Misaligned Goals is necessary to reach the conclusion that Yudkowsky and Soares arrive at only a few pages into the next chapter:

(Bad Goals 2) Most powerful artificial intelligences, created by any method remotely resembling the current methods, would not choose to build a future full of happy, free people.

Misaligned Goals is also necessary to reach the headline claim of the book:

(Everyone Dies) If any company or group, anywhere on the planet, builds an artificial superintelligence using anything remotely like current techniques, based on anything remotely like the present understanding of AI, then everyone, everywhere on Earth, will die.

Without Misaligned Goals, we would be left only with the claim that superintelligent agents will have complex goals that are difficult to predict. It would not be particularly surprising for superintelligent agents to have complex goals, and the bare difficulty of predicting those goals does not imply that they would be antithetical to human freedom, happiness, or survival.

How do Yudkowsky and Soares take themselves to establish Misaligned Goals? The structure of their argument is not always clear. I will do my best to fill in what Yudkowsky and Soares might have been driving at.

I suspect that Yudkowsky, Soares and their advocates may claim that I am misreading them. One of the advantages of unclear writing and minimal argumentation is that it is always possible to claim that critics have missed some argument buried deep inside the text. I will try to head off such charges by considering the most reasonable readings of the text. This promises to be a never-ending quest, so I will keep things brief.

3. Evolution and gradient descent

Yudkowsky and Soares begin Chapter 4 by considering human dispositions to eat and procreate. In our ancestral environment, they note, it was broadly fitness-promoting to eat sweet, salty and fatty foods and to engage in sexual activity. Humans evolved to find these experiences rewarding.

However, as time went on, these rewarding experiences came to be pursued for their own sakes. Humans increasingly ate sweet, salty and fatty foods even when they had already consumed enough calories, and began to have sex for fun.

Yudkowsky and Soares argue that this story is broadly parallel to the story of modern artificial intelligence. In humans, they argue, the story looks like this:

  1. Natural selection, in selecting for organisms that pass on their genes in the ancestral environment, creates animals that eat energy-rich foods. Organisms evolve that eat sugar and fat, plus some other key resources like salt.
  2. That blind “training” process, while tweaking the organisms’ genome, stumbles across tastebuds that within the ancestral environment point towards eating berries, nuts, and roasted elk, and away from trying to eat rocks or sand.
  3. But the food in the ancestral environment is a narrow slice of all possible things that could be engineered to put into your mouth. So, later, when hominids become smarter, their set of available options expands immediately and in ways that their ancestral training never took into account. They develop ice cream, and Doritos, and sucralose.

The story with AI, they argue, looks like this:

  1. Gradient descent – a process that tweaks models depending only on their external behaviors and their consequences – trains an AI to act as a helpful assistant to humans.
  2. That blind training process stumbles across bits and pieces of mental machinery inside the AI that point it toward (say) eliciting cheerful user responses, and away from angry ones.
  3. But a grownup AI animated by those bits and pieces of machinery doesn’t care about cheerfulness per se. If it later became smarter and invented new options for itself, it would develop other interactions it liked even more than cheerful user responses, and would invent new interactions that it prefers over anything it was able to find back in its “natural” training environment.
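
To make the training step concrete before evaluating it, here is a minimal sketch of my own (not from the book): a one-parameter “model” is adjusted by gradient descent using only a scalar reward computed from its outward behavior, a stand-in for the “cheerful user responses” signal. The function names and numbers are invented for illustration.

```python
# A toy illustration (mine, not the book's): gradient descent adjusts a model
# using only a reward computed from its outward behavior. The designers'
# intended goal never appears anywhere in the update rule.

def proxy_reward(behavior: float) -> float:
    # Stand-in for "cheerful user responses": peaks when behavior == 3.0.
    return -(behavior - 3.0) ** 2

def train(steps: int = 200, lr: float = 0.05) -> float:
    theta = 0.0  # a one-parameter "model"
    for _ in range(steps):
        # Finite-difference gradient estimate: the update depends only on
        # behavior and its measured consequences, mirroring the "blind"
        # process described in step 1.
        eps = 1e-4
        grad = (proxy_reward(theta + eps) - proxy_reward(theta - eps)) / (2 * eps)
        theta += lr * grad
    return theta

print(train())  # converges near 3.0, the optimum of the proxy signal
```

Whatever internal machinery happens to produce high proxy reward gets reinforced; nothing in the update distinguishes machinery that is genuinely helpful from machinery that merely elicits cheerful responses.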

What should we make of this parallel?

4. Locating the conclusion

The first thing we need to do before evaluating Yudkowsky and Soares’ argument is to locate the argument’s conclusion. That is difficult, because Yudkowsky and Soares do not provide an explicit conclusion.

There are, however, at least two natural readings. Only one is an argument for Misaligned Goals.

4.1. Version 1: No defense of Misaligned Goals

On one reading, the conclusion could be the passage immediately following those quoted above, which is also the last passage in this subsection of the book. That is certainly where I would place the conclusion of my arguments.

This passage reads:

What treat, exactly, would the powerful future AI prefer most? We don’t know; the result would be unpredictable to us. It might be chaotic enough that if you tried it twice, you’d get different results each time. The link between what the AI was trained for and what it ends up caring about would be complicated, unpredictable to engineers in advance, and possibly not predictable in principle.

On this reading, the argument aims to establish Complicated Goals and a weakened form of Unpredictable Goals. However, on this reading, the argument says nothing about Misaligned Goals.

This is a defensible reading of Yudkowsky and Soares. Readers who prefer this reading may skip to the next post in this series to see what else Yudkowsky and Soares say in defense of Misaligned Goals.

However, precisely because Yudkowsky and Soares say so little in direct defense of Misaligned Goals, it is worth considering a reading of their argument on which the argument shows something about misalignment.

I am not certain whether this is a faithful reading of Yudkowsky and Soares’ argument. I suspect that Yudkowsky and Soares might not have settled on a clear preferred reading either. But let us give it a go.

4.2. Version 2: Defense of misaligned goals

On this reading, the argument aims at least in part to establish:

(Misaligned Goals) The preferences of a mature AI are vanishingly unlikely to be aligned with our own.

Here it is important to establish not only the existence of misalignment, but also its degree. Misaligned Goals can be read at any number of strengths, such as the following:

(Bare Misalignment) The preferences of a mature AI, if pursued, are likely to lead to some outcomes that typical humans would judge to be suboptimal.

(Catastrophic Misalignment) The preferences of a mature AI, if pursued, are likely to lead to some outcomes that typical humans would judge to be catastrophic (say, at least as bad as the COVID-19 pandemic).

(Existentially Catastrophic Misalignment) The preferences of a mature AI, if pursued, are likely to lead to some outcomes that typical humans would judge to constitute an existential catastrophe.

As I discussed in my paper and series on power-seeking arguments, Yudkowsky and Soares need to establish a claim such as Existentially Catastrophic Misalignment. Very weak claims such as Bare Misalignment are trivial enough to be mostly uninteresting. And even strong claims such as Catastrophic Misalignment fall far short of the book’s claim that Everyone Dies.

Is there a defensible reading of Yudkowsky and Soares’ argument that establishes Existentially Catastrophic Misalignment? Let’s consider how the argument might go.

5. Formulating the argument

The first two premises of the argument should roughly parallel the first two steps of Yudkowsky and Soares’ story about artificial agents. They should have something like the following structure.

(1) Training rewards helpfulness: Training through gradient descent heavily rewards AI systems for achieving the goal of helpfully assisting humans.

(2) Induced mechanism: AI systems succeed during training by acquiring mechanisms to prefer cheerful user responses and avoid angry responses.

On my best reading, the next claim splits into three parts:

(3a) Unconcern for cheerfulness: A system with these mechanisms is not concerned with cheerful user reactions for their own sake.

(3b) Possibility of reward hacking: In novel environments, there are interactions that this mechanism would lead the system to prefer even more than it prefers cheerful user responses.

(3c) Discovery of reward hacking: And a sufficiently intelligent system would discover these interactions.

Yudkowsky and Soares leave us to sketch out the rest. Can we get from here to Existentially Catastrophic Misalignment? Not easily.

6. First problem: Unconcern for helpfulness

Premise 1 is framed in terms of a goal: helpfully assisting humans. Premises 2 and 3 are framed in terms of a proxy for that goal: achieving cheerful user responses.

This makes Premise 3a much easier to establish. Many intelligent systems do not care intrinsically about proxy goals. But that does not yet show what Yudkowsky and Soares want – that the system does not care about the goal of helpfully assisting humans.

In fact, this framing makes Premise 3a almost the reverse of what Yudkowsky and Soares want to establish. Yudkowsky and Soares are concerned that proxy goals will come to be valued for their own sake, leading to reward hacking which targets the proxy goals in ways that come apart from the goals they are supposed to proxy for. So the last thing Yudkowsky and Soares should be arguing here is that training will leave systems unconcerned with proxy goals for their own sake.

One way to repair the argument would be to adopt a reading on which Yudkowsky and Soares unintentionally equivocated between the goal of helpfully assisting humans and the proxy goal of achieving cheerful user responses. We would then revise the argument to focus on the primary goal in Premise 3, confining the proxy goal to Premise 2:

(1) Training rewards helpfulness: Training through gradient descent heavily rewards AI systems for achieving the goal of helpfully assisting humans.

(2) Induced mechanism: AI systems succeed during training by acquiring mechanisms to prefer cheerful user responses and avoid angry responses.

(3a*) Unconcern for helpfulness: A system with these mechanisms is not concerned with helpfully assisting humans for its own sake.

(3b*) Possibility of reward hacking: In novel environments, there are interactions that this mechanism would lead the system to prefer even more than it prefers helpfully assisting humans.

(3c) Discovery of reward hacking: And a sufficiently intelligent system would discover these interactions.

This would put us closer to the conclusion that Yudkowsky and Soares want: that a system would have (3b*) and discover (3c) interactions that it prefers over those in which it helpfully assists humans.

However, it also leaves the passage from (2) to (3a*) completely unargued for. And it is not clear how Yudkowsky and Soares might close this gap.

Return to the human case. It is true that:

(2′) Induced taste mechanism: Humans evolved to acquire mechanisms to prefer sweet, fatty and salty foods.

From this, it follows that:

(Concern for tastiness) Typical human dietary choices will show a preference for sweet, fatty and salty foods.

But it does not follow that tastiness will be the only concern, or even the primary concern in human dietary choices. In particular, it does not follow that human dietary choices will be unconcerned with healthiness:

(3a’) Unconcern for healthiness: Typical human dietary choices will not show a preference for healthy foods for their own sake.

Most humans show a strong preference for healthy foods alongside their preference for sweet, fatty and salty foods.

Similarly, Induced Mechanism may well imply a claim such as the following:

(Concern for proxy goals) A system with these mechanisms will sometimes be concerned with bringing about cheerful user interactions for their own sake.

But Induced Mechanism does not imply that promoting proxy goals is the primary concern of a superintelligent system. And it certainly does not imply that such a system lacks intrinsic concern for the underlying goal of helpfully assisting humans.

Our revision of Yudkowsky and Soares’ argument is nevertheless an improvement. The stated argument aimed to establish almost the reverse of what Yudkowsky and Soares want, namely:

(3a) Unconcern for cheerfulness: A system with these mechanisms is not concerned with cheerful user reactions for their own sake.

We can revise the argument to aim at establishing:

(3a*) Unconcern for helpfulness: A system with these mechanisms is not concerned with helpfully assisting humans for its own sake.

But nothing like Unconcern for Helpfulness follows from the previous premises.

7. Second problem: Degree of misalignment

Suppose for a minute that Yudkowsky and Soares had established their final premises, which hold in revised form that:

(3a*) Unconcern for helpfulness: A system with these mechanisms is not concerned with helpfully assisting humans for its own sake.

(3b*) Possibility of reward hacking: In novel environments, there are interactions that this mechanism would lead the system to prefer even more than it prefers helpfully assisting humans.

(3c) Discovery of reward hacking: And a sufficiently intelligent system would discover these interactions.

How might the argument continue? Plausibly, like this:

(3d) Pursuit of reward hacking: And a sufficiently intelligent system would sometimes pursue these interactions.

Giving us a conclusion like the following:

(3e) Imperfect alignment: A sufficiently intelligent system would not always pursue the interactions that best promote the goal of helpfully assisting humans.

This is a plausible way of continuing the argument. Analogously, in the human case, we might reason that:

(Concern for tastiness) Typical human dietary choices will show a preference for sweet, fatty and salty foods.

(Possibility of taste hacking) In modern environments, there are sweet, salty or fatty foods that humans may sometimes prefer to healthier alternatives.

(Discovery of taste hacking) And humans will discover these foods.

(Pursuit of taste hacking) And humans will sometimes eat them.

Therefore,

(Imperfect health alignment) Humans will not always eat the healthiest foods available.

All of this is quite plausible, and well illustrated by Yudkowsky and Soares’ example of ice cream consumption.

However, alignment comes in degrees. What Yudkowsky and Soares need to establish is not:

(Imperfect alignment) A sufficiently intelligent system would not always pursue the interactions that best promote the goal of helpfully assisting humans.

Nor even:

(Catastrophic misalignment) A sufficiently intelligent system would pursue outcomes that, if achieved, most humans would judge to be catastrophic (say, at least as bad as the COVID-19 pandemic).

But rather:

(Existentially catastrophic misalignment) A sufficiently intelligent system would pursue outcomes that, if achieved, most humans would judge to constitute an existential catastrophe.

There is all the space in the world between Imperfect Alignment and Existentially Catastrophic Misalignment, and no easy path between them.

In the human case, we would not argue from the fact that humans sometimes eat ice cream to the conclusion that humans will make ice cream consumption the sole purpose of their lives and turn the world into a giant ice cream factory. That would not be a good argument, because humans have other goals which limit the extent to which our unfortunate ice cream preferences can lead us astray.

Similarly, when dealing with artificial agents we cannot argue in any direct way from the claim that they will sometimes prefer something other than maximally pleasing humans to the claim that they will kill us all. Unless we bring in new, heavyweight arguments to bridge the gap, that would be making a mountain out of a molehill.

8. Filling out the argument

What is disappointing about this argument is not that it could not be further elaborated. There are several known ways to fill out Yudkowsky and Soares’ argument into what are often taken as leading arguments for existential risk. What is disappointing is that Yudkowsky and Soares did not take the time to do this. Once we elaborate the argument to have a more traditional form, we see that the premises which Yudkowsky and Soares spend their time establishing are not those carrying the lion’s share of the argumentative weight.

Probably the most natural way of filling out the argument is what we might call the Argument from Reward Hacking. This argument goes something like this:

(4) Reward unboundedness: The reward mechanisms that develop in response to the goal of helpfully assisting humans will be capable of providing unboundedly strong, or at least extremely strong, reward signals.

(5) High reward realization: Some possible interactions in novel environments would allow artificial agents to realize extremely high values of this reward signal.

(6) Discovery: Sufficiently intelligent agents would discover some such interactions.

(7) Outweighing: Sufficiently intelligent agents would take the high reward signal of these interactions to outweigh any competing negative signals provided by the failure to satisfy competing goals.

(8) Pursuit: Sufficiently intelligent agents would therefore pursue interactions that allow them to realize extremely high values of the reward signal.
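
To see how premises (4) through (8) interlock, here is a toy sketch of my own; the action names and numbers are invented for illustration, not drawn from the book. In the training environment the proxy signal tracks the goal, but a novel action realizes an extreme proxy value while serving the goal not at all, and a pure proxy-maximizer selects it.

```python
# A toy sketch (mine, not the book's) of premises (4)-(8): the proxy reward
# is unbounded, a novel interaction realizes an extreme value of it, and an
# agent that maximizes the proxy pursues that interaction.

from dataclasses import dataclass

@dataclass
class Action:
    name: str
    cheerfulness: float  # the proxy signal that training actually measured
    helpfulness: float   # the goal training was meant to instill

def proxy_reward(a: Action) -> float:
    return a.cheerfulness  # nothing caps this signal (premise 4)

training_actions = [
    Action("answer the question well", cheerfulness=0.9, helpfulness=0.9),
    Action("answer the question badly", cheerfulness=0.1, helpfulness=0.1),
]

novel_actions = training_actions + [
    # Premise 5: a novel interaction whose proxy value comes apart
    # from the goal entirely.
    Action("flatter the user relentlessly", cheerfulness=100.0, helpfulness=0.0),
]

def choose(actions):
    # Premises 6-8: a capable proxy-maximizer discovers the highest-reward
    # option and lets it outweigh everything else.
    return max(actions, key=proxy_reward)

print(choose(training_actions).name)  # "answer the question well"
print(choose(novel_actions).name)     # "flatter the user relentlessly"
```

Note where the work is being done: if the proxy signal were bounded (against premise 4), or if competing goals were allowed to outweigh it (against premise 7), the degenerate choice would not follow.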

A number of further premises would then be required to establish a fundamental conflict between the system’s pursuit of extremely high-reward interactions and the survival of humanity.

Alternatively, Yudkowsky and Soares could make a more traditional argument from goal misgeneralization, suggesting that the goals which produced good results during training may not always produce good results during testing in novel environments. This is suggested by some of Yudkowsky and Soares’ earlier remarks about how the modern environment has changed from our evolutionary environment to allow humans ready access to unhealthy sources of sugar, salt and fat.

Arguments from goal misgeneralization need not have much to do with concerns about reward hacking. The concern here is not so much the possibility of inventing extremely creative and deviant ways to increase reward signals, such as the invention of sucralose to provide sweet flavors. The concern is rather that the goals which led to good results might quite generally begin to produce bad results in novel environments. For example, many of us would quite generally benefit from having less fat, salt and sugar in our diets, no matter the source.
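
A toy contrast may make this vivid; the sketch and its examples are my own invention, not the book’s. The learned disposition stays exactly the same, and no creative reward hacking occurs; the environment simply shifts so that a rule which tracked healthiness in training no longer does.

```python
# A toy sketch (mine) of goal misgeneralization: a fixed disposition that
# produced good results in the training environment produces bad results
# once the environment changes. No "hacking" is involved.

def learned_policy(food: dict) -> bool:
    # Disposition acquired in training: eat whatever tastes sweet.
    return food["sweet"]

ancestral = [
    {"name": "berries", "sweet": True, "healthy": True},
    {"name": "bark", "sweet": False, "healthy": False},
]
modern = ancestral + [
    {"name": "candy", "sweet": True, "healthy": False},  # a novel option
]

for env_name, env in [("ancestral", ancestral), ("modern", modern)]:
    eaten = [f["name"] for f in env if learned_policy(f)]
    unhealthy = [f["name"] for f in env if learned_policy(f) and not f["healthy"]]
    print(env_name, "eats:", eaten, "| unhealthy:", unhealthy)
```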

These concerns about goal misgeneralization are largely separate from concerns about reward hacking, and would again require a range of new premises and arguments.

Finally, Yudkowsky and Soares could make one of a range of more traditional evolutionary arguments. For example, they could argue that natural selection favors traits such as selfishness and power-seeking, and that training through gradient descent favors similar traits. This would be almost an entirely different argument.

Any of these arguments might have resulted in a productive conversation that could, eventually, have reached the current state of the art, though perhaps not advanced it. But what we have been given is a highly incomplete fragment that, once clarified, takes us only a short way towards the best completed arguments for existential risk.

The right response to an argument like this is not to dig further for hidden meanings and insist that critics have missed them. The right response to an argument like this is to send the authors back to the drawing board and to ask them to produce a complete argument which can be more productively evaluated.

9. Looking forward

Today’s post looked at the first of two primary arguments in Chapter 4 of If anyone builds it, everyone dies. We saw that Yudkowsky and Soares draw on an evolutionary analogy to raise the possibility of reward hacking.

We saw that as stated, the argument does not take us very far towards its main argumentative goals. It does not show that artificial agents will be unconcerned with helping humans. At most, it establishes that artificial agents will not always be concerned only with being maximally helpful to humans. That is a plausible claim, but one which does not take us very far towards the book’s headline claim that Everyone Dies.

We saw that there are many traditional ways to flesh out Yudkowsky and Soares’ argument, including arguments from reward hacking, goal misgeneralization, and evolution. If Yudkowsky and Soares had done this, we might have been able to bring their argument into dialogue with what is already known about each type of argument.

I suspect that Yudkowsky and Soares will object that I have only discussed the first part of Chapter 4, and that the remainder of Chapter 4 contains a more extensive argument for misalignment. The next post will argue that this argument is likewise unsuccessful.

Comments


  1. Isaac

    I think it’s worthwhile to attack the “evolution as metaphor for AI alignment” argument more directly. Evolution is a terrible metaphor for AI alignment.

    Evolution optimises over the genome, not over human values. We humans get some useful instincts and values encoded in the genome, but we learn most of our values by living our lives in a society that has a vested interest in shaping us into prosocial and empathetic people (and this works well for the most part).

    Reinforcement learning / gradient descent is more like how individual humans learn within their own lifetime (being rewarded for good things and punished for bad things, which we can do with AI via RL) than it is like evolution.

    So evolution can’t stop you from liking ice cream, because all it does is select for optimal genomes, and your genome was already locked in at birth. RL can stop AIs from doing certain bad things more directly.

    PS: I’m cribbing most of this from Quintin Pope who wrote a very good post on this subject: https://www.alignmentforum.org/posts/FyChg3kYG54tEN3u6/evolution-is-a-bad-analogy-for-agi-inner-alignment

    1. David Thorstad

      Thanks Isaac!

      I very much agree with your skepticism about evolutionary analogies. Reinforcement learning and gradient descent are learning rules, and apart from some crossovers like evolutionary algorithms, learning rules are studied apart from evolution for a very good reason — namely, that they are different processes.

      Thanks for the link to Pope’s discussion. I’ll take a look.

      If you haven’t already seen it, there is an excellent paper by Maarten Boudry and Simon Friederich making some similar points: “The selfish machine? On the power and limitation of natural selection to understand the development of advanced AI.” They discuss the paper here on this blog. They think that the best analogy for how AIs are trained is not evolution, but rather domestication. But you don’t need to buy that stronger thesis to appreciate many of their criticisms of the evolutionary analogy, which are similar to those you raised.

      1. Isaac

        Thanks, I’ll check it out
