I thought that “If anyone builds it, everyone dies”, by Eliezer Yudk[ow]sky and Nate Soares, was disappointing, relying on weak arguments around the evolution analogy, an implicit assumption of a future discontinuity in AI progress, conflation of “misalignment” with “catastrophic misalignment”. I think that their positive proposal is not good. I had hoped to read a Yudkowsky-Soares worldview that has had meaningful updates in light of the latest developments in ML and AI safety, and that has meaningfully engaged with the scrutiny their older arguments received. I did not get that.
Will MacAskill, “A short review of If anyone builds it, everyone dies”
1. Introduction
Eliezer Yudkowsky and Nate Soares’ recent book, If anyone builds it, everyone dies: Why superhuman AI would kill us all, gets directly to the point. If it were not already obvious from the title, their point is conveniently set out and bolded in the book’s introduction.
If any company or group, anywhere on the planet, builds an artificial superintelligence using anything remotely like current techniques, based on anything remotely like the present understanding of AI, then everyone, everywhere on Earth, will die.
Yudkowsky and Soares want to convince key decisionmakers not to build artificial superintelligence. They even provide an illustrative draft of an international treaty prohibiting the construction of artificial superintelligence.
This series discusses and responds to some key contentions of If anyone builds it, everyone dies.
Today’s post introduces the book and identifies two primary argumentative cruxes in the first part of the book, which will be considered in subsequent posts.
2. About the authors
Here are the preferred biographies of Yudkowsky and Soares from the book’s website:
Eliezer Yudkowsky is a founding researcher of the field of AI alignment and the co-founder of the Machine Intelligence Research Institute. With influential work spanning more than twenty years, Yudkowsky has played a major role in shaping the public conversation about smarter-than-human AI. He appeared on Time magazine’s 2023 list of the 100 Most Influential People In AI, and has been discussed or interviewed in The New Yorker, Newsweek, Forbes, Wired, Bloomberg, The Atlantic, The Economist, the Washington Post, and elsewhere.
Nate Soares is the President of the Machine Intelligence Research Institute. He has been working in the field for over a decade, after previous experience at Microsoft and Google. Soares is the author of a large body of technical and semi-technical writing on AI alignment, including foundational work on value learning, decision theory, and power-seeking incentives in smarter-than-human AIs.
3. Outline of the book
The book is broken into three parts. Part 1, Nonhuman minds, begins with an exploration of the nature of human and nonhuman minds and concludes with two claims. First, artificial agents are likely to want bad things:
Most powerful artificial intelligences, created by any method remotely resembling the current methods, would not choose to build a future full of happy, free people.
Second, if we tried to fight such an artificial intelligence, we would probably lose:
We’re pretty sure, actually very very sure, that a machine superintelligence can beat humanity in a fight, even if it’s starting with fairly limited resources.
Part 2, One extinction scenario, lays out an extended scenario in which a hypothetical artificial intelligence named Sable, created by a fictional company named Galvanic, takes power.
Part 3, Facing the challenge, discusses how humanity can confront the existential threat posed by artificial intelligence.
Let’s start with Part 1.
4. Locating the argument
One might imagine that Yudkowsky and Soares would have devoted the majority of their book to establishing their headline claim:
If any company or group, anywhere on the planet, builds an artificial superintelligence using anything remotely like current techniques, based on anything remotely like the present understanding of AI, then everyone, everywhere on Earth, will die.
In fact, that is not the case. The main argument for this claim is contained in the first of three parts, entitled Nonhuman minds.
The case for existential risk itself is remarkably compressed: Yudkowsky and Soares take approximately one short chapter to argue that artificial superintelligence will want to do us in, and a second short chapter to argue that it can do us in.
The first three chapters are mostly setup.
Chapter 1, “Humanity’s special power,” argues that humanity’s special power is the generality of our thinking, and defines intelligence and superintelligence.
Chapter 2, “Grown, not crafted,” provides a brief nontechnical introduction to gradient descent and suggests (correctly) that this and other contemporary machine learning methods look more like growing than crafting a mind.
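To make the “grown, not crafted” point concrete, here is a minimal toy sketch of gradient descent. It is my illustration, not the book’s, and the data and numbers are invented. The point it shows is that the programmer writes only a loss and an update rule; the final value of the weight is something the loop discovers from the data, not something anyone wrote down.

```python
# A toy illustration (not from the book): gradient descent on a single weight.
# The hand-written part is the *process*; the resulting weight is "grown".

data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]  # (x, y) pairs, roughly y = 2x

w = 0.0    # the "mind" here is one weight; we never set its final value by hand
lr = 0.01  # learning rate

for step in range(1000):
    # gradient of the mean squared error of (w * x) against y, with respect to w
    grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
    w -= lr * grad  # nudge w downhill; this is all the "crafting" we do

print(w)  # ends up near 2.0 -- a value the loop found, not one we specified
```

In a frontier model the same dynamic plays out over billions of weights, which is why the book’s “growing” metaphor is apt: the training procedure is crafted, but the resulting mind is not.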
Chapter 3, “Learning to want,” argues that machines will behave as if they want things.
So far, we haven’t advanced much towards the headline conclusion that AI will probably want to kill us all. There are certainly things to quibble with, such as the definition of intelligence in Chapter 1 or the treatment of wanting in Chapter 3. But let’s not get bogged down here.
Just one chapter later, we arrive at the following conclusion, which is a direct quotation except for the addition of a descriptive label:
(Bad Goals 1) The preferences that wind up in a mature AI are complicated, practically impossible to predict, and vanishingly unlikely to be aligned with our own, no matter how it was trained.
A few short pages into Chapter 5, we arrive at a similar conclusion:
(Bad Goals 2) Most powerful artificial intelligences, created by any method remotely resembling the current methods, would not choose to build a future full of happy, free people.
Instead, Yudkowsky and Soares argue, they’d probably choose to kill us.
The rest of Chapter 5 is devoted to answering objections to Bad Goals 2. Chapter 6, “We’d lose,” rounds out Part 1 by arguing that we’d lose a conflict against an artificial intelligence that wanted to do us in.
There is a lot to unpack here. Let’s focus for now on the most obvious cruxes.
First, how do Yudkowsky and Soares establish Bad Goals 1 in a single short chapter?
Second, how do they establish Bad Goals 2 in a few pages of the next chapter?
5. The structure of Yudkowsky and Soares’ argument
The structure of Yudkowsky and Soares’ argument is not always transparent. We will examine their argument in more detail in future posts.
Speaking broadly, Chapter 4 does the following:
- Tells a story: About a dialogue between two gods, meant to illustrate that the pleasures that humans feel at inputs such as sex and tasty food might come to be pursued for their own sake, divorced from any direct connection to reproduction or nourishment.
- Argues that: “AI companies won’t get what they trained for. They’ll get AIs that want weird and surprising stuff instead” based on a comparison between misfires of gradient descent and the types of evolutionary misfires described in the story.
- Considers an example: Of an AI “trained to delight and retain users” and suggests several ways in which this AI might acquire strange goals.
Based on these arguments, Chapter 4 concludes:
(Bad Goals 1) The preferences that wind up in a mature AI are complicated, practically impossible to predict, and vanishingly unlikely to be aligned with our own, no matter how it was trained.
Chapter 5 makes a few short remarks that will require considerable reconstruction later. Immediately thereafter, Yudkowsky and Soares conclude:
(Bad Goals 2) Most powerful artificial intelligences, created by any method remotely resembling the current methods, would not choose to build a future full of happy, free people.
The rest of Chapter 5 is devoted to answering objections to Bad Goals 2.
Bad Goals 1 is a strong claim. It would be prima facie surprising if the story, brief argument, and example of Chapter 4, coupled with a few remarks at the beginning of Chapter 5, could take us all the way to Bad Goals 1.
Bad Goals 2 is a distinct claim from Bad Goals 1. While Bad Goals 2 is perhaps less surprising once Bad Goals 1 has already been established, it would be prima facie surprising if Yudkowsky and Soares were able to motivate Bad Goals 2 on the basis of Bad Goals 1 in a few short pages at the beginning of Chapter 5.
Yudkowsky and Soares think they have met these challenges. The next project in this series is to argue that they have not.