The instrumental convergence thesis contains some grains of truth … Nonetheless, the thesis is mostly false … Like most of life’s dangers, the dangers posed by artificial intelligence are not easily identified from the armchair. If we want to understand the dangers posed by artificial superintelligence, we will have to do more careful empirical work investigating what kinds of desires future AI systems are likely to have.
Gallow, “Instrumental divergence”
1. Introduction
This is Part 4 of a series on my paper “Instrumental convergence and power-seeking.”
Part 1 introduced the argument from power-seeking and showed how this argument rests on a strong version of the instrumental convergence thesis:
(Catastrophic Goal Pursuit) There are several values which would be likely to be pursued by a wide range of intelligent agents to a degree that, if successful, would permanently and catastrophically disempower humanity.
The next item of business was to argue that leading power-seeking theorems do not establish Catastrophic Goal Pursuit.
Part 2 considered an early power-seeking theorem due to Tsvi Benson-Tilsen and Nate Soares. Part 3 looked at the most influential recent power-seeking theorem due to Alex Turner and colleagues. We saw that both theorems fall substantially short of establishing Catastrophic Goal Pursuit.
Today’s post concludes by drawing lessons from this discussion.
2. Empiricism and AI safety
It is very hard to study behavior from the armchair. There are, perhaps, a few things to be learned about agents from general facts about the kinds of agents that they are. But in the practice of science, as well as in our daily lives, it is often more helpful to look and see how agents behave if we want to understand how they are likely to behave in the future.
Early work in AI safety was often conducted from the armchair. This paper looked at some of the boldest armchair attempts, which aimed to conclude from a few facts about what most agents will want that artificial superintelligence is likely to aim at the permanent disempowerment of humanity. That is quite a strong claim, and it should not be surprising for us to fail to establish a claim of this sort from the armchair.
There is, within the field of AI safety, a growing empirical movement aimed at connecting safety concerns to detailed knowledge and observation of how leading AI systems behave. It is hard work. It is also, to the disappointment of some practitioners, unlikely to discover within a paper or two that humanity is doomed. But for all that, it is honest, empirically grounded work that stands a good chance of advancing our understanding of the risks posed by AI systems and the strategies that might mitigate them.
I hope that the discussion in this paper goes some way towards reinforcing the trend towards growing empiricism within the field of AI safety. We can have fewer and better disagreements if we can agree to leave our armchairs behind and study the systems that are being developed and built today.
3. Clarifying terms
One of the most important moves in this paper was the move towards clarifying terms. In particular, we saw in Part 1 that many versions of Instrumental Convergence are substantially weaker than what would be needed to ground the argument from power-seeking and similar existential risk concerns.
Specifically, we saw that it is necessary to distinguish: (a) instrumental convergence as a claim about behavior from instrumental convergence as a claim about what conduces to what, and (b) instrumental convergence as a claim about how much power systems will seek from the bare claim that systems will seek some amount of power. At each point, we saw that the stronger version of instrumental convergence is substantially more difficult to establish.
It is important to make sure that when we are having arguments about AI safety, we are clear about the meanings of key terms. It is also important to make sure that we pick meanings for key terms that allow them to play the argumentative roles we want them to play.
Getting clear on key terms at the outset can help us to avoid talking past one another. It might also reveal that some arguments have a ways to go before they can meet their goals.
4. Two leading arguments
A recent survey suggests that there are two leading arguments for existential risk from artificial intelligence. The first is the singularity hypothesis. The second is the argument from power-seeking.
I addressed the singularity hypothesis in my paper and blog series “Against the singularity hypothesis.” There, I argued that there are good reasons to doubt the singularity hypothesis and that leading arguments fall a good deal short of overcoming reasons for doubt.
I addressed the argument from power-seeking in this paper. We saw that the argument relies on a strong version of the instrumental convergence thesis, and that leading power-seeking theorems fall a good deal short of establishing this version of the instrumental convergence thesis.
If that is right, then it should give us some reason for skepticism about the state of leading arguments for existential risk from artificial intelligence.
There are, of course, other arguments. I will address them as they are rigorized and advanced to a comparable standing. But if I am right that both of the leading arguments for existential risk from artificial intelligence currently fall a good deal short of the mark, then that is some reason to be skeptical about the case for existential risk from artificial intelligence.
To be very clear, to advocate skepticism about existential risk from artificial intelligence is not to advocate skepticism about AI safety or to suggest that we should be complacent about risks posed by AI systems. Everyone in their right mind is at once concerned and hopeful about the many changes that artificial intelligence might bring. But we may wish to place a bit less focus on the very most extreme outcomes that might result, and spend a bit more time discussing the many good and bad outcomes short of existential risk that can be encouraged or prevented.
5. The academic paper
This blog series is based on my paper “Instrumental convergence and power-seeking.” The current draft of the paper differs from the blog series in that it places more emphasis on informal versions of the argument from power-seeking and less emphasis on power-seeking theorems, though the results by Turner and colleagues are prominently discussed.
I hope that this paper will be published before too long. In the meantime, you can read the latest draft here, and an earlier version closer in spirit to the blog series here.
