In our timelines forecast, we forecast the time between present day and a superhuman coder (SC): an AI system that can do any coding tasks that the best AGI company engineer does, while being much faster and cheaper. In this forecast we condition on SC being achieved in March 2027, which is the date it’s achieved in our scenario. Now, we will forecast takeoff: the time between a superhuman coder and wildly superhuman capabilities. The superhuman coders and beyond will automate a large fraction of the AI R&D needed to traverse this gap.
Daniel Kokotajlo and Eli Lifland, “Takeoff forecast”
1. Introduction
This is Part 21 of my series Exaggerating the risks. In this series, I look at some places where leading estimates of existential risk look to have been exaggerated.
Part 1 introduced the series. Parts 2-5 (sub-series: “Climate risk”) looked at climate risk. Parts 6-8 (sub-series: “AI risk”) looked at the Carlsmith report on power-seeking AI. Parts 9-17 (sub-series: “Biorisk”) looked at biorisk.
Part 18 continued my sub-series on AI risk by introducing the AI 2027 report. Part 19 and Part 20 looked at the AI 2027 team’s timelines forecast, which projects the date when superhuman coders will be developed. Today’s post examines their takeoff forecast.
2. Takeoff forecast overview
The takeoff forecast conditions on the milestone projected by the timelines forecast, assuming that superhuman coders are developed in March 2027, the date at which they arrive in the AI 2027 scenario. The takeoff forecast then asks when we should expect the emergence of full artificial superintelligence.
They predict the emergence of artificial superintelligence approximately one year later, though with wide error bars.

The takeoff report models a software-only singularity, in part because the authors share my concerns about the difficulty of achieving a very fast takeoff to superintelligence on the basis of hardware growth, which has never been more than exponential. This does, however, mean that the timelines in the takeoff report would become more aggressive if hardware growth were included, though the speedup might be modest.
The methodology of the takeoff forecast is substantially similar to the methodology of the benchmarks and gaps model discussed in Part 20. The authors divide the space between superhuman coders (SC) and artificial superintelligence (ASI) into intermediate milestones. They estimate the number of years that it would take humans to cross each milestone, then estimate the research speedup that AI can provide after each milestone is crossed and interpolate this speedup to intermediate times between milestones. This results in a revised estimate of the time needed to cross each milestone, and combining the estimates leads to an estimate of the arrival date of superintelligence.
The probability distributions used in modeling, as well as the sampling methodology, are substantially similar to the distributions and sampling methods used in the timelines report – as they should be, for consistency – so I will not reintroduce them here.
3. Defining milestones
The report defines four milestones as follows:
| Milestone | Definition |
| Superhuman coder (SC) | An AI system for which the company could run with 5% of their compute budget 30x as many agents as they have human researchers, each of which is on average accomplishing coding tasks involved in AI research (e.g. experiment implementation but not ideation/prioritization) at 30x the speed (i.e. the tasks take them 30x less time, not necessarily that they write or “think” at 30x the speed of humans) of the company’s top coder. It must have enough diversity of expertise to on average do the same for other top coders with complementary skills. Since SC is a subset of AI research, it cannot come after a fully superhuman AI researcher (SAR below). That said, it will also have some level of “research taste” and other AI research skills when it is first achieved, and may even be a full SAR if coding is the last skill needed. |
| Superhuman AI researcher (SAR) | An AI system that can do the job of the best human AI researcher but 30x faster and with 30x more agents, as defined above in the superhuman coder milestone. It must have enough diversity of expertise to on average do the same for other top researchers with complementary skills. |
| Superintelligent AI researcher (SIAR) | An AI system that is vastly better than the best human researchers: the gap between SAR and SIAR is 2x the gap between an automated median AGI company researcher and a SAR. By 2x, we mean that if we consider the skill distribution of the AGI company’s researchers (as measured by skill on applicable cognitive tasks, not overall productivity including compute bottlenecks; so basically value of labor), the AI is 2 doublings more above the top researcher than the top researcher is above the median (i.e. the difference is 2x greater in log space). It also has the 30x task accomplishing speed and 30x copies requirements. |
| Artificial superintelligence (ASI) | Roughly, a SIAR but for every cognitive task. An AI system that is 2x better at every cognitive task relative to the best human professional, than the best human professional is relative to the median human professional (across the whole field, not a single company as in SIAR). |
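To make the “gap in log space” definitions concrete, here is a small worked example (the numbers are hypothetical and chosen only for illustration). Writing $V$ for value of labor, the SIAR condition is

$$\log_2\frac{V_{\mathrm{SIAR}}}{V_{\mathrm{SAR}}} \;=\; 2\,\log_2\frac{V_{\mathrm{SAR}}}{V_{\mathrm{median}}}.$$

So if, say, the best AGI-company researcher (and hence a SAR) has 4x the value of labor of the median researcher (two doublings), a SIAR must have $2^{4} = 16$x the value of labor of a SAR (four doublings). The ASI definition applies the same construction to every cognitive task, with the whole profession rather than a single company as the comparison class.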
I don’t want to dwell on these milestones except to note that the definition of artificial superintelligence used here is much weaker than that found in typical existential risk narratives. As such, the projected arrival date of artificial superintelligence in the takeoff report cannot be the end of the story.
Indeed, in their main scenario the authors project substantial continued growth in the capacities of artificial agents beyond the development of artificial superintelligence, as defined in the takeoff forecast. This projection lies outside of the model and the stated justifications provided by the takeoff forecast, and should be grounded by an additional argument.
4. Human-only timelines
The authors first ask how long it would take for humans to cross each milestone without the help of AI.
For the first crossing, from superhuman coders (SC) to superhuman AI researchers (SAR), the authors estimate a 15% chance that an SC is already a SAR. The remaining 85% of cases are analyzed in subcases, leading to a forecasted length distributed as a lognormal extrapolated from the 80% confidence interval [1.5 years, 10 years].
For the second crossing, from superhuman AI researchers (SAR) to superintelligent AI researchers (SIAR), they forecast a lognormal extrapolated from an 80% confidence interval of [2.3 years, 380 years].
For the final crossing, from superintelligent AI researchers (SIAR) to artificial superintelligence (ASI), they forecast the time needed to cross two intermediate gaps and construct a Davidson-style model to arrive at an 80% confidence interval of 2.4 to 1,000,000 years, extrapolating a lognormal distribution around this.
| Milestone | Time to next milestone |
| Superhuman coder (SC) | 15%: 0 years; otherwise 4 years (80% CI: 1.5 to 10; lognormal) |
| Superhuman AI researcher (SAR) | 19 years (80% CI: 2.3 to 380); FutureSearch aggregate, conditional on SAR in 2027 (n=3): 11.5 years (1.75 to 27) |
| Superintelligent AI researcher (SIAR) | 95 years (80% CI: 2.4 to 1,000,000) |
| Artificial superintelligence (ASI) | n/a |
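To make this construction concrete, here is a minimal sketch (in Python, and mine rather than the authors’ code) of one standard way to extrapolate a lognormal from an 80% confidence interval, combined with the 15% point mass at zero for the SC-to-SAR crossing. The function name is hypothetical, and the report’s own fitting procedure may differ in detail.

```python
import numpy as np
from scipy import stats

def lognormal_from_80ci(lo, hi):
    """Lognormal whose 10th and 90th percentiles match a stated 80% CI."""
    z90 = stats.norm.ppf(0.9)                     # ~1.2816
    mu = (np.log(lo) + np.log(hi)) / 2            # midpoint in log space
    sigma = (np.log(hi) - np.log(lo)) / (2 * z90)
    return stats.lognorm(s=sigma, scale=np.exp(mu))

rng = np.random.default_rng(0)

# SC -> SAR: 15% chance the gap is already crossed, otherwise lognormal over [1.5, 10] years.
sc_to_sar = lognormal_from_80ci(1.5, 10)
samples = np.where(rng.random(100_000) < 0.15,
                   0.0,
                   sc_to_sar.rvs(size=100_000, random_state=rng))

# The later crossings use the same recipe with the report's stated intervals.
sar_to_siar = lognormal_from_80ci(2.3, 380)
siar_to_asi = lognormal_from_80ci(2.4, 1_000_000)

print(np.percentile(samples, [10, 50, 90]))       # human-only SC -> SAR years
```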
While we will examine some of these forecasts in more detail later, the most important thing to note at this point is the very wide range spanned by many of the provided confidence intervals. This will form the basis of one challenge to the takeoff report, which takes the wide confidence intervals as a reason to think that the underlying phenomenon is highly uncertain, and hence a reason to be skeptical of updating heavily on any single model.
5. AI speedups
Next, the authors estimate the amount by which research progress is sped up at each milestone by the ability of human researchers to enlist AI in their research.
The progress multiplier from superhuman coders (SC) is estimated by combining estimates of the speedups provided by five features of superhuman coders: flexible prioritization between projects; smaller marginal experiments; less wasted compute; fancier experiments; and a penalty for the lack of diversity among superhuman coders.
The progress multiplier from superhuman AI researchers (SAR) is estimated in three ways: through a (mostly disregarded) decomposition similar to the above, and through two surveys of AI researchers.
The progress multiplier from superintelligent AI researchers (SIAR) is estimated by projecting forward results of the surveys used to estimate the multiplier from SAR, with some adjustments.
The progress multiplier from artificial superintelligence (ASI) is estimated in the same way.
| Milestone | Progress multiplier when milestone reached |
| Superhuman coder (SC) | 5 |
| Superhuman AI researcher (SAR) | 25 |
| Superintelligent AI researcher (SIAR) | 250 |
| Artificial superintelligence (ASI) | 2,000 |
Again, we will talk about some of these estimates later. But for now, one important thing to note is that the authors do not model uncertainty over progress multipliers, even when their forecasting methodology provides uncertainty estimates. While this is understandable given the need to simplify modeling, it does have the effect of dramatically reducing reported uncertainty over the arrival date of superintelligent AI. If the large uncertainty about human-only timelines were combined with comparably large uncertainty about the research speedups that AI can provide, the confidence intervals shown in the AI Takeoff Forecast Summary would become substantially wider.
6. Plain English model summary
At this point, we have everything we need to understand how the takeoff forecast model works.
Let’s ignore uncertainty for the moment so that we don’t have to deal with the messy details of sampling. Ignoring uncertainty, the model is very simple.
To go from some milestone M to another milestone M’, it takes T years of human-only work. Since we’re ignoring uncertainty, T is a given real number.
At an arbitrary time t, research proceeds at the equivalent of s(t) years of human-only research per calendar year. The function s(t) models the speedup in research due to AI, so that s(t) = 1 when AI is not meaningfully contributing to research, s(t) = 1,000 when AI speeds up research by 1,000x, and so on.
The speedup s(t) is determined by interpolating between the speedup $s_M$ at milestone M and the speedup $s_{M'}$ at milestone M’, scaling according to the progress that has been made. If, for example, we are 40% of the way from M to M’, then the current speedup is $s(t) = s_M + 0.4\,(s_{M'} - s_M)$.
At an arbitrary time t’, we’ve therefore made the equivalent of $\int_{t_0}^{t'} s(t)\,dt$ years of human-only progress, where $t_0$ is the time at which milestone M was reached. We need to make T years of human-only progress, so the arrival time of milestone M’ is the value of t’ that solves the equation $\int_{t_0}^{t'} s(t)\,dt = T$.
That’s all there is to it. There is, commendably, no funny business or baked-in hyperbola (although we’ll see that there is a hyperbola baked into the Davidson-style reasoning used to justify key forecasts). And the details of sampling methodology, while important, are not significant enough to fight about.
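For concreteness, here is a minimal sketch of that calculation in Python (mine, not the authors’ code). The function names are hypothetical, and the point estimates plugged in at the end are the central values from the tables above; the actual forecast samples full distributions rather than point estimates, so this toy version will not reproduce the report’s headline numbers.

```python
import math

def crossing_time(T, s_start, s_end, steps=100_000):
    """Calendar years needed to accumulate T human-only research-years when the
    speedup is interpolated linearly in the fraction of progress made."""
    # dt = T * dp / s(p), with s(p) = s_start + p * (s_end - s_start); midpoint rule.
    dp = 1.0 / steps
    return sum(T * dp / (s_start + (i + 0.5) * dp * (s_end - s_start))
               for i in range(steps))

def crossing_time_exact(T, s_start, s_end):
    """Closed form of the same integral."""
    if s_start == s_end:
        return T / s_start
    return T * math.log(s_end / s_start) / (s_end - s_start)

# Central values from the tables above (illustrative only).
milestones = [
    ("SC -> SAR", 4.0, 5, 25),
    ("SAR -> SIAR", 19.0, 25, 250),
    ("SIAR -> ASI", 95.0, 250, 2000),
]

total = 0.0
for name, T, s0, s1 in milestones:
    t = crossing_time_exact(T, s0, s1)
    total += t
    print(f"{name}: {t:.2f} calendar years")
print(f"Total SC -> ASI: {total:.2f} calendar years")
```

With linear interpolation the integral has the closed form $T\,\ln(s_{M'}/s_M)/(s_{M'} - s_M)$, which is what `crossing_time_exact` uses. Plugging in the central values gives a total somewhat over half a year; the report’s headline figure of roughly a year reflects sampling over the full distributions rather than point estimates.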
Model behavior depends on (1) the choice of milestones, (2) estimated human-only research time to cross each milestone, and (3) estimated progress speedups at each milestone. Challenges to the model need to concentrate on (1)-(3). I’ll focus on (2)-(3).
7. Three challenges
The challenges that I want to raise for this model are to a large extent similar to the challenges facing the benchmarks-and-gaps model discussed in Part 20.
Challenge 1: Weak data: Almost no data is available to ground reliable forecasts. Data is seldom used, and when data is used, it is not used in ways that should lend substantial credibility to the resulting forecasts.
Challenge 2: Under-justified forecasts: Despite a commendable increase in detail in the justifications of some forecasts, as compared to the justifications given in the benchmarks-and-gaps model, the arguments given for forecasts just aren’t strong enough to ground them.
Challenge 3: Wide uncertainty: The uncertainty in the forecasts is very wide. This is standardly taken to be a sign that the underlying phenomenon may be difficult to forecast, and that we should be cautious about updating strongly on the basis of the provided forecasts.
As in our discussion of the benchmarks-and-gaps model, these problems do not arise because the AI 2027 team is incompetent — they are not. They arise because there just is not enough data or evidence available to ground reliable forecasts in this domain.
There is really no better way to appreciate these challenges than for readers to read the takeoff model and think through the forecasts on their own. But let me illustrate Challenges 1-2 with examples from the takeoff forecast. (I don’t think there is any need to illustrate Challenge 3).
8. Weak data
The first challenge was the challenge of weak data. This challenge splits into two parts.
8.1. Absent data
First, many of the AI 2027 team’s forecasts are made on the basis of no data at all. This is the case for all of their forecasted human-only timelines. It is also the case for their forecast of the progress multiplier provided by superhuman coders.
8.2. Under-powered and misused data
The primary use of data comes in the estimated progress multiplier at the time when superhuman AI researchers (SAR) are introduced. This same methodology is projected forwards to determine the progress multipliers at remaining milestones, so that much of what is said here should apply to these latter estimates.
To determine the progress multiplier from SAR, the authors use two sets of surveys. Let’s just focus on the first.
8.2.1. Informal survey
The authors asked machine learning researchers the following questions:
- How much faster would your overall research progress be if you had access to 10x as much compute as you do now?
- How much slower would your overall research progress be if you had access to 10% as much compute as you do now?
An informal Twitter poll provided the following results (excluding non-ML researchers):
| | <20% faster | 20-100% faster | >100% faster |
| Q1 | 25% of respondents | 38% of respondents | 37% of respondents |
| | <20% slower | 20-50% slower | >50% slower |
| Q2 | 26% of respondents | 46% of respondents | 28% of respondents |
An undisclosed Slack poll of 6 AI safety researchers gave results reported as:
| | Median | Range |
| Q1 | 1.18x | 1.01 to >1.5 |
| Q2 | 0.6x | 0.2 to 1 |
The authors take the median Q2 response from the Slack poll (research proceeding at 0.6x speed), then reason that because the six researchers surveyed were “people whose work seemed less compute-intensive than the Twitter poll,” a more appropriate figure for AI researchers would be 0.4x.
Assuming that this rate of decrease continues to hold, reducing compute budgets by 30x would leave researchers operating at 0.22x speed.
Suppose next that the proportional effects of compute changes are the same for further increases and decreases in compute, as well as the same when applied to human or AI coders.
In this case, superhuman AI researchers (SAR) would generate a 30x increase in the number of coders. Compute would be split 30 ways among them, so each would operate at 22% speed, for an effective progress speedup of 0.22 × 30 ≈ 6.6, or approximately 7x.
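To make the arithmetic explicit, here is a minimal sketch (mine, not the authors’) of the proportionality-style extrapolation at work. The constant-elasticity version below gives roughly 0.26x per copy rather than the report’s 0.22x, so the report’s own adjustment evidently differs somewhat, but either figure yields roughly the ~7x cited once multiplied across 30 copies.

```python
import math

# Per the reasoning summarized above: a 10x compute cut slows a researcher to
# 0.4x speed, and the same proportional penalty is assumed to apply to any
# further cut (the proportionality assumption discussed below).
speed_at_10x_cut = 0.4
cut_factor = 30      # each automated-researcher copy gets 1/30 of the compute
n_copies = 30        # 30x as many agents as human researchers

# Constant-elasticity extrapolation: apply the per-10x penalty log10(30) times.
speed_per_copy = speed_at_10x_cut ** math.log10(cut_factor)
print(f"speed per copy at a {cut_factor}x cut: {speed_per_copy:.2f}x")  # ~0.26x

print(f"aggregate speedup: {speed_per_copy * n_copies:.1f}x")           # ~7.7x
print(f"with the report's 0.22x figure: {0.22 * n_copies:.1f}x")        # ~6.6x
```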
The authors then argue for some factors that push this 7x speedup up to a 25x speedup. I must say that I am quite skeptical of this discussion, but there is already more than enough to complain about in the above reasoning.
8.2.2. Pushing back against the informal survey
Why might someone be worried about the use of data here?
First, researchers are not always good judges of their own productivity, and they certainly need not be good judges of how their productivity would change if available resources were scaled up or down by an order of magnitude.
Second, the surveys in question are highly unscientific. They were conducted on social media and in a Slack channel. The response format is constrained to a small number of choices, which is likely to have a significant effect on the answers given. Respondents were not screened in any way for qualifications or thoughtfulness, nor are these venues in which qualifications and thoughtfulness are found in abundance. There is also a large gap between the number of people who answered the first and second questions on the Twitter survey, and no guarantee that these were the same people.
Third, the takeoff forecast set aside the more reliable of the two surveys in favor of the less reliable one. The Twitter survey, for all its faults, was a public poll with 184 responses to the first question and 102 responses to the second. The Slack poll had a whopping six respondents, and we are given no details of any kind about how the questions were asked. Nevertheless, the takeoff forecast throws away the Twitter survey and extrapolates on the basis of the much more dubious six-person Slack poll.
Fourth, the quantity forecasted was how much current AI researchers would find their research slowed by a 10x reduction in compute. But this quantity is treated in the calculation as an estimate of how much automated AI researchers would find their research slowed by a 10x reduction in compute. This is a very different quantity, for two reasons.
The first, and merely very important, point is that humans are not artificial systems. To say how a reduction in compute would affect a human coder is not to say anything terribly direct about how the same reduction in compute would affect an artificial coder.
The second and most important point is that this leap bakes in the very kind of Davidson/Chalmers-style proportionality thesis known to drive intelligence-explosion-type results. The assumption is that at any level of compute and any level of previous progress made, the proportional effect of a 10x reduction in compute on productivity is constant (here: a drop to 40% of baseline productivity). This assumption is being used here to relate the effects of a 10x compute reduction given current progress and current compute to the effects of a 10x compute reduction given automated researchers and future compute. And in later forecasts, the same assumption will be made, with small modifications, all the way until superintelligence is deployed.
Because this assumption takes us a long way towards singularity-style dynamics on its own, it is really not appropriate to be baking this assumption into the justification of forecasts that are fed into a model. Otherwise, although the model won’t have a baked-in hyperbola, the forecasts fed into it will.
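To see why this kind of proportionality assumption takes us most of the way to singularity-style dynamics on its own, consider a deliberately stripped-down toy derivation (mine, not the report’s, and ignoring bottlenecks of every kind). Suppose each doubling of cumulative research effort $E$ multiplies research productivity $P$ by a constant factor, so that $P \propto E^{r}$ for some fixed $r$, and suppose that once research is automated, effort accumulates at a rate proportional to productivity, so $dE/dt = c\,E^{r}$. Then for $r > 1$:

$$E(t) \;=\; \Bigl[E_0^{\,1-r} \;-\; c\,(r-1)\,t\Bigr]^{\frac{1}{1-r}},$$

which diverges at the finite time $t^{*} = E_0^{\,1-r}/\bigl(c\,(r-1)\bigr)$. Constant proportional returns plus automated research are enough, by themselves, to produce a hyperbola, which is why building proportionality into the forecasts fed into the model matters so much.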
It is worth noting that this assumption is also not reflected in the data collected. If it were, then we should have expected respondents to report that a 10x decrease in compute would slow them down by the same factor as a 10x increase in compute would speed them up. That isn’t what they said: the researchers in the Slack poll used to ground the report’s estimates predicted a median 40% decrease (to 0.6x speed) from a 10x reduction in compute, but only a median 18% increase (1.18x) from a 10x increase in compute, where proportional reasoning would predict a speedup of roughly 1/0.6 ≈ 1.67x. That is exactly the kind of diminishing-returns regime that cuts against the proportionality reasoning baked into the authors’ forecast.
Fifth, at the end of the day the calculation yields a 7x speedup. That isn’t what the authors want: it is barely larger than the 5x speedup they project from superhuman coders. There is a very live risk that speedups will not increase in the way needed to drive an intelligence explosion. There is nothing contradictory about the view that superhuman coders could speed up early-stage research by 5x while superhuman AI researchers speed up later-stage research by a smaller factor, not because superhuman AI researchers would not be better than superhuman coders, but because superhuman AI researchers would have a harder task ahead of them.
So what do the authors do to boost the 7x speedup? They put the data to one side and argue, entirely without data, from a 7x speedup to a 25x speedup. At this point, most of the increase over the previous 5x multiplier lies entirely outside of the calculation that seemed, at least in some tangential way, to be grounded in data.
Again, there is more to say here. But the basic point is that the data being used is not adequate to the task, and it is being used in ways that say a good deal more about the modeling choices of the researchers than about how the data forced their hands.
9. Under-justified forecasts
I think that the authors may agree with me that the provided data is relatively weak. Perhaps for this reason, the authors make some of their forecasts without any data at all.
Let’s look at their forecast of the time it would take human researchers to move from SAR to SIAR.
The forecast begins with a simplified model in which:
- The distribution of AI R&D capabilities within OpenBrain is a lognormal distribution in terms of value of labor as cashed out in overall differences in research progress.
- Each doubling of cumulative human labor spent improving AI algorithms multiplies the AIs’ value of labor by a fixed amount (this is very similar to the assumption made in the Davidson report). In particular, for each doubling of cumulative labor, there are r doublings of the value of labor.
- Since the distribution in (1) is lognormal, increasing labor productivity by a fixed multiplier is equivalent to increasing by a fixed amount of SDs within the OpenBrain human range.
- Since SAR->SIAR is the same in terms of labor multiples as 2*(automated median OpenBrain researcher->SAR), the amount of cumulative effort doublings to go from SAR->SIAR is twice the amount required to go from automated median OpenBrain researcher->SAR.
Now the main thing to complain about here is that Davidsonian proportionality reasoning is again explicitly baked into the forecast at Step 2. But let’s set that aside and look at how the forecast continues.
First, the authors estimate how long it would have taken to progress from an automated coder that is as good as the median OpenBrain researcher to an SAR. They reason that this would take approximately ten years.
Next, the authors estimate how much of this time was spent going from an automated median coder to a SC, as opposed to from a SC to an SAR. Without any explicit reasoning, they model this quantity as a lognormal extrapolated around an 80% confidence interval from 1 to 25 years, correlated strongly to the number of years needed to go from SC to SAR.
Finally, the authors plug these numbers into a Davidson-style model of capacity growth, given a constant relationship between doublings in accumulated research effort and doublings of the progress multiplier, to ground an estimate for the human-only years needed to go from SAR to SIAR.
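For readers who want a sense of what such a calculation involves, here is a stylized version (mine, not the authors’ actual computation; the symbols and the constant-workforce simplification are assumptions made only for illustration). Let $r$ be the number of doublings of the value of labor produced by each doubling of cumulative research effort, and let the gap from the automated median researcher to SAR correspond to $g$ doublings of value of labor. The fourth assumption above puts the SAR-to-SIAR gap at $2g$ doublings of value of labor, which under the second assumption requires

$$k \;=\; \frac{2g}{r}$$

further doublings of cumulative research effort. If cumulative effort stands at $C$ researcher-years when SAR-level capability is reached, and a roughly constant human workforce supplies $n$ researcher-years per year, then adding $k$ doublings takes

$$t \;=\; \frac{(2^{k}-1)\,C}{n} \text{ years}.$$

The answer is exponentially sensitive to $r$, $g$, and $C/n$, which is one way to see how the model ends up with an 80% confidence interval running from 2.3 to 380 years.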
There is, to be fair, some effort at reasoning here. The authors’ estimate of the time from an automated median researcher to an SAR is grounded in several paragraphs of reasoning, whose strength readers are free to evaluate. The quantity to which this is compared, the time needed to go from SC to SAR, is grounded in no reasoning of any kind. These two quantities are then plugged into exactly the same kind of Davidson-style model that is under dispute, without any justification for that model, and the result is extracted as the final forecast.
There is, as I said, some reasoning here. But it is not the kind of reasoning that I would urge readers to put much credence in, and it is certainly not the kind of reasoning that would convince anyone who did not already have strong sympathies for the result.
10. Taking stock
Today’s post looked at the AI 2027 takeoff forecast. We saw that the stated model avoids a baked-in hyperbola. However, we saw that the model faces three challenges.
Challenge 1: Weak data: Almost no data is available to ground reliable forecasts. Data is seldom used, and when data is used, it is not used in ways that should lend substantial credibility to the resulting forecasts.
Challenge 2: Under-justified forecasts: Despite a commendable increase in detail in the justifications of some forecasts, as compared to the justifications given in the benchmarks-and-gaps model, the arguments given for forecasts just aren’t strong enough to ground them.
Challenge 3: Wide uncertainty: The uncertainty in the forecasts is very wide. This is standardly taken to be a sign that the underlying phenomenon may be difficult to forecast, and that we should be cautious about updating strongly on the basis of the provided forecasts.
In examining Challenges 1-2, we also saw where the missing hyperbola went: Davidson-style modeling assumptions are baked not into the model itself, but rather into the justifications for forecasts fed into the model.
The authors are, all told, on a bit stronger ground here than they were with the timelines forecast. Unlike the first part of the timelines forecast, the model here does not explicitly bake in a hyperbola. And as compared to the second part of the timelines forecast, the reasoning here is more detailed and explicit.
But we are a long way from where we need to be in order to support the kinds of conclusions that the AI 2027 authors want to draw.
