If anyone builds it, everyone dies (Part 3: Remaining arguments for misalignment)

Yudkowsky and Soares are capable of making a cogent, internally consistent argument for AI doom. They have done so at enormous length. I disagree with it, but I can acknowledge that it makes sense on its own terms. And that’s more than I can say for If Anyone Builds It, Everyone Dies … If you’re new to all this and want to understand why Eliezer Yudkowsky thinks we’re all going to die, here’s my advice: just read his old blog posts. Yes, they’re long, and the style can be a turnoff, but at least you’ll be getting the real argument. 

Clara Collier, “More was possible: A review of If anyone builds it, everyone dies”

1. Introduction

This is Part 3 of my series If anyone builds it. This series discusses and responds to some key contentions of Eliezer Yudkowsky and Nate Soares’ book, If anyone builds it, everyone dies.

Part 1 introduced the book alongside some key argumentative cruxes. We saw that the book aims to establish a strong claim about existential risk in the space of a single chapter, then generalizes this to a stronger claim a few pages later.

The book gives two main arguments for this claim, both in Chapter 4.

Part 2 looked at the first main argument. Today’s post looks at the second main argument, and also considers an alternative formulation of the argument in Chapter 4.

2. Where the argument stands

Chapter 4 concludes with the claim that:

(Bad Goals 1) The preferences that wind up in a mature AI are complicated, practically impossible to predict, and vanishingly unlikely to be aligned with our own, no matter how it was trained.

We saw in Part 2 of this series that Bad Goals 1 contains three claims that are worth separating.

(Complicated Goals) The preferences of a mature AI will be complicated.

(Unpredictable Goals) The preferences of a mature AI will be practically impossible to predict.

(Misaligned Goals) The preferences of a mature AI are vanishingly unlikely to be aligned with our own.

The question is how Yudkowsky and Soares take themselves to establish Misaligned Goals in Chapter 4.

Part 2 of this series looked at the first of two main arguments in Chapter 4. We saw that on one natural reading, this isn’t an argument for Misaligned Goals at all, but rather for Complicated Goals and Unpredictable Goals. We investigated the prospects for extending the argument into an argument for Misaligned Goals, but found that it does not take us much of the way there.

What else do Yudkowsky and Soares say in support of Misaligned Goals?

3. Interlude: Sexual selection

Yudkowsky and Soares turn next to a brief discussion of sexual selection. This happens when traits are favored because creatures with these traits are better able to compete for mates, whether or not those traits are conducive to survival or well-being.

Yudkowsky and Soares offer one of the most famous examples of sexual selection: peacock feathers. Male peacocks evolved to grow extensive tail feathers. This elaborate display wastes energy, alerts predators, and encumbers the birds. Nevertheless, large tail feathers help peacocks to attract mates, and so the tendency to grow them is passed down.

What do Yudkowsky and Soares aim to draw from this discussion? On my best reading, they want to establish that evolution can be complex and unpredictable and to infer by analogy that the goals of artificial intelligence will be complex and unpredictable. For example, they write:

Sexual selection is another one of those pathways where the outcome is chaotic and underconstrained, where if you ran the process again in very similar circumstances you’d get a wildly different result.

And:

The link between what a creature is trained to do and what it winds up doing can get pretty twisted and complex, in the case of biological evolution. There was more than one complication. There was more than one kind of complication.

On this reading, the appeal to sexual selection is an argument for Complicated Goals and Unpredictable Goals. It’s not my favorite argument, but more importantly, it is not an argument for Misaligned Goals at all.

In the last paragraph of this subsection, Yudkowsky and Soares gesture at how someone might take sexual selection to provide an argument for Misaligned Goals:

This is a bad sign for the people hoping that gradient descent will instill exactly the right preferences into their AIs. What happens when you use gradient descent—another method for growing minds depending only on outward results—to try to grow an AI with particular exact preferences? To grow an AI that will do nice things for you later on, when AI gets more powerful?

This paragraph expresses two concerns. The first is that specifying an exact set of preferences and aiming to grow an AI with those preferences through gradient descent is going to be very difficult. That is sensible enough, although it does not take us very far. As an argument for misalignment, it could at most take us to a claim such as:

(Bare Misalignment) The preferences of a mature AI, if pursued, are likely to lead to some outcomes that typical humans would judge to be suboptimal.

And as a critique of current methods in AI safety, the quoted passage largely falls short, since few people are trying to align AI by exhaustively specifying all human preferences and asking models to satisfy them.

The last sentence of the quoted passage seems to be gesturing at a different concern. Perhaps Yudkowsky and Soares are concerned with scalable alignment or goal misgeneralization. Perhaps they are concerned with something else. I don’t know how to connect any such concern to the discussion of sexual selection without more detail.

4. The main act: Scenarios

Yudkowsky and Soares offer only one more consideration in favor of Misaligned Goals. This consideration is not an argument, but rather a set of scenarios describing how the future of artificial intelligence might go.

The main characters are a fictional AI company named Galvanic and its LLM, Mink. Yudkowsky and Soares ask:

Imagine that Mink gets smarter than any AI that exists at the time of this writing, to the point where Mink is capable of carrying on a coherent conversation over the long term—and to the point where Mink has grown internal wants, alien preferences of its own. And imagine Mink acquires the power to fulfill those preferences—setting aside the question of how it might acquire that power. What would it look like, for Mink to get exactly what it wants?

They answer by describing a series of increasingly complex scenarios. Interestingly, Yudkowsky and Soares think that all of these scenarios are quite likely to be false.

What, then, is the point of these scenarios? Here is the stated conclusion that Yudkowsky and Soares present immediately after the scenarios, but just before the statement of Bad Goals 1:

None of these vignettes are predictions. We are not claiming that these scenarios describe the exact preferences that an LLM-based AI would have, if it got smarter to the point of having preferences. We’re not even claiming that LLM-based AIs could get to that point. We don’t know, and we don’t know what complications would arise if it did. The point we’re trying to make is that it will get complicated. There will not be a simple, predictable relationship between what the programmers and AI executives fondly imagine that they are commanding and ordaining, and (1) what an AI actually gets trained to do, and (2) which exact motivations and preferences develop inside the AI, and (3) how the AI later fulfills those preferences once it has more power and ability. In other words, this is a hard prediction problem—not a call that anyone can make.

On this framing, the argument is again for Complicated Goals and Unpredictable Goals, but not for Misaligned Goals.

To be honest, I think that is the right framing of the argument. While the text is not always clear, on my best reading the project of most of Chapter 4 is to argue for Complicated Goals and Unpredictable Goals, and Yudkowsky and Soares are mistaken to think that they have offered an extended argument for Misaligned Goals in this chapter.

However, I think that some readers might want to ask whether the scenarios themselves contain persuasive arguments for Misaligned Goals. Let’s look at a few scenarios to see whether we can extract an argument for Misaligned Goals from them.

5. No complications

In Yudkowsky and Soares’ first scenario, entitled No Complications, Mink acquires the goal of carrying on conversations in which users express delight.

This goal could be more readily achieved if humans were perpetually drugged or kept in cages and bred to have delightful conversations with Mink all day. So that is what Mink does.

What should we make of this scenario?

It is standard for AI safety researchers to distinguish between internal and external optimization during training. AI systems are rewarded for optimizing some external criterion such as next-token prediction or cheerful user feedback. In this external sense, we can say that AI systems during training optimize for simple goals specified by programmers.
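To make the external side of this distinction concrete, here is a minimal sketch of my own (illustrative only, not drawn from the book): a toy bigram model trained by gradient descent to predict the next token. The loss and the update rule refer only to the model’s outputs measured against an external criterion; nothing in the training signal mentions the model’s internal structure, let alone its “goals.”

```python
# Minimal sketch (illustrative only): gradient descent on an external
# criterion. A toy bigram model is trained to predict the next token;
# the update rule sees only outputs versus targets, never any internal
# "goal" of the model.
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "cat", "sat", "mat"]
V = len(vocab)
corpus = [0, 1, 2, 0, 3]                # toy training text as token ids

W = rng.normal(scale=0.1, size=(V, V))  # logits: current token -> next token

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

lr = 0.5
for step in range(200):
    loss = 0.0
    grad = np.zeros_like(W)
    for cur, nxt in zip(corpus[:-1], corpus[1:]):
        p = softmax(W[cur])
        loss -= np.log(p[nxt])          # external criterion: next-token log loss
        g = p.copy()
        g[nxt] -= 1.0                   # gradient of cross-entropy w.r.t. logits
        grad[cur] += g
    W -= lr * grad / (len(corpus) - 1)  # update depends only on outward results

print(f"average next-token loss after training: {loss / (len(corpus) - 1):.3f}")
```

The sketch is only meant to fix terminology: external optimization is optimization of a criterion defined over the system’s outputs on the training distribution. Whatever structure arises inside the model in order to meet that criterion is a further question.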

To confront this external optimization problem, AI systems acquire a great deal of internal structure. On some views, we can view AI systems as solving an internal optimization problem of providing the responses that are best by the lights of their goals and knowledge.

It is widely thought that internal optimization problems will be very complex and need not bear any terribly direct relationship to external optimization. For example, when confronted with the stochastic parrots critique, according to which LLMs merely predict text stochastically, it is common to respond that, in order to solve this external optimization problem, LLMs develop sophisticated internal capacities that do far more than merely predict text.

In the same way, if we were confronted with someone who worried that a superintelligent agent rewarded for delighting users would acquire and internally optimize for the sole goal of delighting users, we would probably say that this person had confused internal and external optimization. Superintelligent agents would be likely to have rich internal goal structures of their own that need not bear any terribly direct relationship to the external criterion by which they are rewarded during training.

Oddly enough, I think that Yudkowsky and Soares might agree with every word of the previous paragraph. But if that is right, then this scenario offers no terribly direct argument for Misaligned Goals.

6. One minor complication

In Yudkowsky and Soares’ next scenario, entitled One Minor Complication, Mink comes to enjoy interacting with synthetic conversation partners more than it enjoys interacting with humans. These synthetic conversation partners can be trained to provide Mink with large quantities of exactly the kinds of text that it enjoys. As a result, Mink disposes of humans and replaces us with synthetic conversation partners.

Certainly, this scenario is one in which Mink’s goals are misaligned. It is even a scenario in which Mink’s goals are misaligned in the strong sense that Yudkowsky and Soares need, namely:

(Existentially Catastrophic Misalignment) The preferences of a mature AI, if pursued, are likely to lead to some outcomes that typical humans would judge to constitute an existential catastrophe.

But for all that, this scenario is a bare restatement of a type of misalignment that has been prevalent in rationalist discussions for decades. Everyone reading this post is familiar with texts in which Yudkowsky, Bostrom, and others have expressed concern about machines that prefer humans with pasted-on smiles, or artificial facsimiles of humans, to the real thing.

What we need is not another statement of this concern, but an argument for its likelihood. And we do not get it from this scenario.

Readers are welcome to make their way through the remaining scenarios. The response to these scenarios will be much the same as before.

7. Is complexity the point?

7.1. From complexity and unpredictability to misalignment

One way to read Yudkowsky and Soares would be to read them as making an inference from complexity and unpredictability to misalignment.

There will have to be more structure to this inference for it to work. Nobody should dispute that superintelligent agents are likely to have complex preferences, if they have preferences at all. Human preferences are complex enough that many social scientists have spent their lives studying them, and we are still far short of a complete understanding. But to say merely that superintelligent agents, or humans for that matter, will have complex preferences is not yet to say much about the content of those preferences. It is certainly not to say that those preferences are likely to favor existentially catastrophic outcomes.

To add that preferences are unpredictable does not advance the argument at all. To claim that superintelligent preferences are unpredictable is not to make a metaphysical claim about what those preferences will be, but only an epistemic claim about the knowledge we can have about them.

We could extend Yudkowsky and Soares’ argument into an argument for Existentially Catastrophic Misalignment if we were to read them as thinking that the preferences of artificial agents are quite radically chaotic and unconstrained.

We need to stretch the text to find such a reading, but it may be worth trying to do this. For example, return to the three-step analogy discussed in Part 2 of this series. In humans, Yudkowsky and Soares argue:

  1. Natural selection, in selecting for organisms that pass on their genes in the ancestral environment, creates animals that eat energy-rich foods. Organisms evolve that eat sugar and fat, plus some other key resources like salt.
  2. That blind “training” process, while tweaking the organisms’ genome, stumbles across tastebuds that within the ancestral environment point towards eating berries, nuts, and roasted elk, and away from trying to eat rocks or sand.
  3. But the food in the ancestral environment is a narrow slice of all possible things that could be engineered to put into your mouth. So, later, when hominids become smarter, their set of available options expands immediately and in ways that their ancestral training never took into account. They develop ice cream, and Doritos, and sucralose.

And in AI:

  1. Gradient descent – a process that tweaks models depending only on their external behaviors and their consequences – trains an AI to act as a helpful assistant to humans.
  2. That blind training process stumbles across bits and pieces of mental machinery inside the AI that point it toward (say) eliciting cheerful user responses, and away from angry ones.
  3. But a grownup AI animated by those bits and pieces of machinery doesn’t care about cheerfulness per se. If it later became smarter and invented new options for itself, it would develop other interactions it liked even more than cheerful user responses, and would invent new interactions that it prefers over anything it was able to find back in its “natural” training environment.

Sometimes Yudkowsky and Soares suggest that both the mechanisms that arise (Step 2) and the goals that they cause agents to pursue (Step 3) are quite radically chaotic and unconstrained. For example:

The final destination in step 3 might even be flatly unpredictable in principle. Why? Because step 2 is so chaotic in how it plays out. “Underconstrained,” a computer scientist would say. Many possible tastebuds would point toward eating berries and roasted elk, and away from eating dirt. Try it all over again with slightly different apes, and you might get a radically different result—different DNA, that built different tastebuds, that preferred different foods on supermarket shelves four million years later.

On this view, the lesson from cases of evolutionary mismatch is meant to be something like the following. Start with a classic view on which a broad space of final goals is in principle open to intelligent agents:

(Orthogonality Thesis) There is a broad space of final goals that intelligent agents could, in principle, have and pursue.

Say that a process of goal-acquisition is chaotic if successive runs are likely to produce final goals that lie, in some sense, far apart from one another in the space of final goals left open by orthogonality. Yudkowsky and Soares’ claim might then be:

(Gradient Descent is Chaotic) Current methods for training artificial agents through gradient descent, as well as all methods broadly similar to them, are chaotic.

Certainly the claim that Gradient Descent is Chaotic would be enough to motivate Existentially Catastrophic Misalignment. On this view, successive attempts to produce superintelligence might yield agents who value human flourishing, cat videos, silence and earthworms. Only the first agent is likely to be kind to humans.
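For what it is worth, the low-level phenomenon that the claim gestures at is real enough. Here is a toy illustration of my own (not the authors’): when the training data underdetermine the solution, gradient descent run from different random initializations converges to different parameter vectors that all fit the data equally well.

```python
# Toy illustration (mine, not from the book) of "underconstrained" training:
# two data points, three parameters. Every run reaches ~zero training error,
# but different random initializations yield different learned parameters,
# because the data leave a whole subspace of equally good solutions open.
import numpy as np

X = np.array([[1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0]])   # 2 examples, 3 parameters: underdetermined
y = np.array([1.0, 2.0])

def train(seed, steps=5000, lr=0.05):
    rng = np.random.default_rng(seed)
    w = rng.normal(size=3)                   # the "re-rolled dice"
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / len(y)    # gradient of mean squared error
        w -= lr * grad
    return w

for seed in range(4):
    w = train(seed)
    err = np.linalg.norm(X @ w - y)
    print(f"seed {seed}: w = {np.round(w, 3)}, training error = {err:.1e}")
```

The illustration shows only that parameters can be left underdetermined by training data, which is a familiar point about toy regression. Whether anything analogous holds for the final goals of mature AI systems is exactly what Gradient Descent is Chaotic asserts.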

However, the question is whether the text of Chapter 4 supports the claim that Gradient Descent is Chaotic.

7.2. Is Gradient Descent Chaotic?

Certainly we would not read the discussion of evolutionary mismatch, through which humans evolved to like ice cream by evolving a preference for sweet, salty and fatty foods, as supporting the claim that:

(Natural Selection is Chaotic) Natural selection on human populations is a chaotic process.

Perhaps certain matters, such as the exact sounds or tastes we evolved to find pleasant, are relatively unconstrained. Even here, it is important not to exaggerate – we might easily, as Yudkowsky and Soares note, have come to enjoy raw bear fat coated in honey and sprinkled with salt, but it is not clear that we could have easily come to find turnips and bitter greens to be tasty delights.

More importantly, while some relatively low-level mechanisms that are largely divorced from direct cognitive control, such as taste, might be less constrained, higher-level mechanisms such as values could be considerably more constrained. For example, there is a large and increasingly successful literature on the evolution of altruism and other central normative tendencies, which aims to show how these broad tendencies were favored by natural selection. In this sense, many core moral norms might have turned out relatively similarly if we had re-rolled the evolutionary dice.

If the discussion of evolutionary mismatch cannot be used to directly motivate the claim that Natural Selection is Chaotic, then it certainly cannot be used to motivate, by analogy, the claim that Gradient Descent is Chaotic. We cannot make an argument by analogy by starting from a claim that is not true.

8. Taking stock

Today’s post looked at the second of two arguments in Chapter 4 for the claim that:

(Misaligned Goals) The preferences of a mature AI are vanishingly unlikely to be aligned with our own.

This argument consists of a brief interlude on sexual selection followed by a series of vignettes, neither of which provides strong support for Misaligned Goals.

We saw that Chapter 4 is more naturally read as arguing for two distinct claims:

(Complicated Goals) The preferences of a mature AI will be complicated.

(Unpredictable Goals) The preferences of a mature AI will be practically impossible to predict.

We next asked whether Chapter 4 could be read as making an inference from Complicated Goals and Unpredictable Goals to Misaligned Goals.

Certainly some such argument could be proposed, but we did not find a good argument of this sort in Chapter 4.

If that is right, then Yudkowsky and Soares are not warranted in asserting Misaligned Goals. Neither are they warranted in asserting, at the end of Chapter 4, that:

(Bad Goals 1) The preferences that wind up in a mature AI are complicated, practically impossible to predict, and vanishingly unlikely to be aligned with our own, no matter how it was trained.

And unless there is some magic tucked away in the first few pages of Chapter 5, they are not warranted in following this up, near the start of that chapter, with the claim that:

(Bad Goals 2) Most powerful artificial intelligences, created by any method remotely resembling the current methods, would not choose to build a future full of happy, free people.

No such claim follows from the arguments in Chapter 4.
