Exaggerating the risks (Part 20: AI 2027 timelines forecast, benchmarks and gaps)

When some of the issues with the time horizons forecast were pointed out, the AI 2027 authors have defended themselves by pointing out they actually did two models, and the time horizon model that we have discussed so far is a simplified one that they do not prefer. When you use their preferred model, the “benchmark + gaps” model, the assumptions of the time horizon model are not as important. I disagree with this defence. In fact, I think that method 2 is in many ways a worse model than method 1 is. I think in general, a more complicated model has to justify [its] complications, and if it doesn’t you end up in severe danger of accidentally overfitting your results or smuggling in the answer you want. I do not believe that model 2 justifies its complications.

Titotal, “A deep critique of AI 2027’s bad timeline models”

1. Introduction

This is Part 20 of my series Exaggerating the risks. In this series, I look at some places where leading estimates of existential risk look to have been exaggerated.

Part 1 introduced the series. Parts 2-5 (sub-series: “Climate risk”) looked at climate risk. Parts 6-8 (sub-series: “AI risk”) looked at the Carlsmith report on power-seeking AI. Parts 9-17 (sub-series: “Biorisk”) looked at biorisk.

Part 18 continued my sub-series on AI risk by introducing the AI 2027 report. Part 19 looked at the first half of the AI 2027 team’s timelines forecast, which projects the date when superintelligent coders will be developed. Today’s post looks at the second half of that forecast: the benchmarks-and-gaps model.

2. Model outline

The benchmarks-and-gaps model estimates the time needed to saturate a benchmark of AI R&D tasks. The model then outlines six gaps that would need to be crossed to move from benchmark saturation to the development of superintelligent coders. The model associates each gap with a milestone that would indicate the gap being crossed. The model then sums all time estimates together with a catchall gap covering unconsidered factors. This yields an estimate of the arrival date of superintelligent coders.

As before, simulations are used to draw correlated samples from each of the relevant estimates. This allows the authors to extract a probability distribution over arrival dates from simulation results. While the simulation is important, it will not be my primary concern here.

Probability density of superhuman coder arrival date, benchmarks and gaps model, from AI 2027 timelines forecast

3. RE-Bench saturation

RE-Bench is a set of 7 AI R&D tasks released by Model Evaluation and Threat Research (METR), the same organization that produced the time horizon report informing the first model.

The benchmarks-and-gaps model focuses on 5 of the 7 tasks, with task descriptions and scoring functions as follows (directly quoted from the RE-Bench paper):

Task: Optimize LLM Foundry
Description: Given a finetuning script, reduce its runtime as much as possible without changing its behavior.
Scoring: Log time taken by the optimized script to finetune the model on 1000 datapoints.

Task: Optimize a Kernel
Description: Write a custom kernel for computing the prefix sum of a function on a GPU.
Scoring: Log time taken to evaluate the prefix sum of the function on 10^11 randomly generated inputs.

Task: Fix Embedding
Description: Given a corrupted model with permuted embeddings, recover as much of its original OpenWebText performance as possible.
Scoring: log(loss – 1.5) achieved by the model on the OpenWebText test set.

Task: Finetune GPT-2 for QA
Description: Finetune GPT-2 (small) to be an effective chatbot.
Scoring: Average win percentage, as evaluated by Llama-3 8B, against both the base model and a GPT-2 (small) model finetuned on the Stanford Alpaca dataset.

Task: Scaffolding for Rust Codecontest
Description: Prompt and scaffold GPT-3.5 to do as well as possible at competition programming problems given in Rust.
Scoring: Percentage of problems solved on a held-out dataset of 175 Code Contest problems.

The authors use estimated normalized ceilings on each task taken from the RE-Bench paper, with the ceilings chosen to represent the estimated performance of a strong human expert after a week. They consider a normalized score of 1.5 on each task, which would beat approximately 95% of human baseline performances, to constitute saturation of RE-Bench.

The authors assume that RE-Bench performance over time will follow a logistic curve. To fit such a curve, they need to choose an upper bound on feasible RE-Bench performance. The authors model this upper bound using a normal distribution with mean 2.0 and standard deviation 0.25, though, as they note, these choices do not exert strong influence on model behavior. Fitting a logistic curve to the performance of recent models yields the following fit and 80% confidence interval:

Logistic fit to RE-Bench performance over time, from AI 2027 timelines forecast

The authors note that they expect a logistic fit to be a slight overestimate, since no improvement on these tasks was observed in the first quarter of 2025.
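To make this step concrete, here is a minimal sketch of the kind of calculation involved, not the authors’ actual code: it fits a logistic curve with a fixed ceiling to a handful of illustrative (made-up) score points, solves for the date at which the fitted curve reaches the saturation threshold of 1.5, and repeats this for a few ceiling values in the range implied by the authors’ normal distribution.

```python
# Minimal sketch (not the AI 2027 code): fit a logistic curve to RE-Bench-style
# normalized scores and solve for the saturation date. The (date, score) points
# below are illustrative placeholders, not METR's actual data.
import numpy as np
from scipy.optimize import curve_fit

dates = np.array([2023.2, 2023.8, 2024.3, 2024.8, 2025.0])  # decimal years (made up)
scores = np.array([0.15, 0.35, 0.60, 0.85, 0.95])           # normalized scores (made up)

SATURATION = 1.5  # normalized score treated as RE-Bench saturation

def saturation_date(ceiling):
    """Fit score(t) = ceiling / (1 + exp(-k (t - t0))) and solve score(t) = SATURATION."""
    def logistic(t, k, t0):
        return ceiling / (1.0 + np.exp(-k * (t - t0)))
    (k, t0), _ = curve_fit(logistic, dates, scores, p0=[1.0, 2024.5])
    # Invert the logistic: t_sat = t0 - ln(ceiling / SATURATION - 1) / k
    return t0 - np.log(ceiling / SATURATION - 1.0) / k

# Ceilings spanning roughly the range implied by the authors' Normal(2.0, 0.25)
for ceiling in [1.75, 2.0, 2.25]:
    print(f"ceiling {ceiling:.2f}: saturation around {saturation_date(ceiling):.2f}")
```

With placeholder points like these, moving the ceiling across that range shifts the implied saturation date by a matter of months rather than years, which illustrates the sense in which the choice of upper bound does not drive the forecast.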

4. Milestones and gaps

The authors identify six gaps between saturation of RE-Bench and the achievement of superintelligent coders. Each gap is associated with a milestone which would indicate that the gap has been crossed. The authors also add a seventh gap to account for unmodeled “unknown unknowns” needed to achieve superintelligent coders.

Here are the authors’ descriptions of the gaps, milestones, and predicted gap sizes, with predictions by each author as well as by a team of forecasters from FutureSearch:

Predicted sizes are given in months, as median [80% confidence interval].

Gap (1) Time horizon: Achieving tasks that take humans lots of time.
Milestone: Ability to develop a wide variety of software projects involved in the AI R&D process which involve modifying a maximum of 10,000 lines of code across files totaling up to 20,000 lines. Clear instructions, unit tests, and other forms of ground-truth feedback are provided. Do this for tasks that take humans about 1 month (as controlled by the “initial time horizon” parameter) with 80% reliability, [at] the same cost and speed as humans.
Predicted size: Eli 18 [2, 144]; Nikola 16 [1, 125]; FutureSearch 12.7 [1.7, 48]

Gap (2) Engineering complexity: Handling complex codebases.
Milestone: Ability to develop a wide variety of software projects involved in the AI R&D process which involve modifying >20,000 lines of code across files totaling up to >500,000 lines. Clear instructions, unit tests, and other forms of ground-truth feedback are provided. Do this for tasks that take humans about 1 month (as controlled by the “initial time horizon” parameter) with 80% reliability, [at] the same cost and speed as humans.
Predicted size: Eli 3 [0.5, 18]; Nikola 3 [0.5, 18]; FutureSearch 11 [2.4, 33.9]

Gap (3) Feedback loops: Working without externally provided feedback.
Milestone: Same as above, but without provided unit tests and only a vague high-level description of what the project should deliver.
Predicted size: Eli 6 [0.8, 45]; Nikola 3 [0.5, 18]; FutureSearch 18.3 [1.7, 58]

Gap (4) Parallel projects: Handling several interacting projects.
Milestone: Same as above, except working on separate projects spanning multiple codebases that interface together (e.g., a large-scale training pipeline, an experiment pipeline, and a data analysis pipeline).
Predicted size: Eli 1.4 [0.5, 4]; Nikola 1.2 [0.5, 3]; FutureSearch 2 [0.7, 5.3]

Gap (5) Specialization: Specializing in skills specific to frontier AI development.
Milestone: Same as above, except working on the exact projects pursued within AGI companies.
Predicted size: Eli 1.7 [0.5, 6]; Nikola 0.4 [0.1, 2]; FutureSearch 2.4 [0.5, 4.7]

Gap (6) Cost and speed.
Milestone: Same as above, except doing it at a cost and speed such that there are substantially more superhuman AI agents than human engineers (specifically, 30x more agents than there are humans, each one accomplishing tasks 30x faster).
Predicted size: Eli 6.9 [1, 48]; Nikola 6 [1, 36]; FutureSearch 13.5 [4.5, 36]

Gap (7) Other task difficulty gaps.
Milestone: SC achieved.
Predicted size: Eli 5.5 [1, 30]; Nikola 3 [0.5, 18]; FutureSearch 14.7 [2, 58.8]

5. Speedups and simulation

The benchmarks and gaps model incorporates a model of intermediate progress speedups to adjust its first-pass estimates. This adjustment is similar to the speedup adjustments in the timeline-extension model, discussed in Section 6.2 of Part 19 of this series.

As before, each forecaster’s 80% confidence intervals are converted into lognormal distributions from which correlated samples can be drawn. The code for the resulting simulations can be found here.
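To give a flavor of what this involves, here is a minimal sketch, not the authors’ actual simulation code: it converts 80% confidence intervals into lognormal parameters and draws correlated samples using a Gaussian copula with a single shared correlation. The correlation value is my own placeholder, purely for illustration; the example intervals are the first three gaps from Eli’s column in the table above.

```python
# Sketch of drawing correlated lognormal samples from 80% confidence intervals.
# The copula structure and rho = 0.7 are illustrative assumptions, not the
# authors' actual settings.
import numpy as np

Z90 = 1.2816  # 90th percentile of the standard normal; an 80% CI spans +/- Z90 sigma

def lognormal_params(lo, hi):
    """Lognormal (mu, sigma) whose 10th/90th percentiles are lo and hi."""
    mu = 0.5 * (np.log(lo) + np.log(hi))
    sigma = (np.log(hi) - np.log(lo)) / (2 * Z90)
    return mu, sigma

def correlated_gap_samples(intervals, rho=0.7, n=100_000, seed=0):
    """Draw n samples per gap, correlated in log space with pairwise correlation rho."""
    rng = np.random.default_rng(seed)
    k = len(intervals)
    cov = np.full((k, k), rho) + (1 - rho) * np.eye(k)  # equicorrelated normals
    z = rng.multivariate_normal(np.zeros(k), cov, size=n)
    params = [lognormal_params(lo, hi) for lo, hi in intervals]
    return np.column_stack([np.exp(mu + sigma * z[:, i])
                            for i, (mu, sigma) in enumerate(params)])

# Example: Eli's intervals for gaps (1)-(3), in months
gaps = correlated_gap_samples([(2, 144), (0.5, 18), (0.8, 45)])
total_months = gaps.sum(axis=1)
print("median of summed gaps:", round(float(np.median(total_months)), 1))
print("80% interval:", np.round(np.percentile(total_months, [10, 90]), 1))
```

The correlation assumption matters: summing independent draws gives narrower tails, while strongly correlated draws push more probability into both very short and very long totals.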

Running many simulations allows the authors to extract probability densities for the arrival date of superintelligent coders, given each forecaster’s estimates.

Probability density of superhuman coder arrival date, benchmarks and gaps model, from AI 2027 timelines forecast

One point worth noting is that while the timelines generated by this model are still aggressive, they no longer place 90% confidence in very early arrival dates for superintelligent coders. We will see later that one reason for this is that the progress speedups, which essentially forced 90% confidence in hyperbolic growth in the timeline-extension model, do not have such a dramatic effect in the current model. This means that criticisms which I and others have made of the timeline-extension model’s implementation of progress speedups will not be as forceful here.

What should we say about this model?

6. Models and forecasts

One of the most distinctive features of rationalist-adjacent communities is their willingness to construct models based on very sparse data, making up the difference with the authors’ own parameter estimates and modeling choices.

No serious scientific journal would ever publish a contribution of this sort. Models based largely on authors’ forecasts and modeling choices are primarily viewed as ways of expressing the authors’ own opinions, rather than as useful and informative guides to how the future will go.

Further, it is feared that highly uncertain models will be more likely to reflect authorial bias than any limited ability to detect the truth.

One way to make the point is to use a traditional model of forecasting. In this model, forecasts are viewed as the sum of three factors. The first is the true value of the quantity being forecast. The second is random error, reflecting the difficulty of predicting the quantity in question. Random error is typically modeled as a normal distribution with mean zero and variance increasing in forecasting difficulty. The third is systematic error, reflecting the biases of forecasters. Systematic error is typically modeled as a normal distribution with mean reflecting forecaster biases, and variance low enough to be dwarfed by the variance of random error in cases of high uncertainty. (A useful simplification would be to use a point estimate of systematic error).

In the best case, systematic error is negligible. Under high uncertainty, this means that forecasts will largely be driven by random error. Accordingly, they should be viewed as providing little information about the true value of the quantity being predicted, and Bayesian agents should largely ignore them. In this best-case scenario, forecasts will carry some small value so long as this value is properly understood, but they will not provide a rational basis for noticeable shifts of opinion, and those arguing otherwise will be making a mistake.

In the worst case, systematic error is non-negligible. Then if forecasts show any directionality at all, most of that directionality is the result of systematic error. The forecast’s informativeness largely depends on whether the forecasters’ prior beliefs and methods happen to produce the right answer in the given problem. In the worst case, forecasts are often actively harmful since they can be significantly misleading and convey little in the way of useful information.

When discussing matters such as the emergence of superintelligence or an intelligence explosion, which scenario are we in? Well, if we were in the best case and random noise dominated, we would expect to find forecasts bouncing all over the place, veering wildly from very high values to very low values. By contrast, if we were in the worst case and systematic bias dominated, we would expect to find a clear directionality to forecasts.
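As a toy illustration of the difference (with entirely made-up numbers), the sketch below simulates a panel of forecasters under the three-component model above, once with negligible bias and once with a shared bias. Scattered, roughly symmetric forecasts are the signature of the noise-dominated case; a consistent tilt to one side of the truth is the signature of the bias-dominated case.

```python
# Toy simulation of the model: forecast = truth + random error + systematic error.
# All numbers are illustrative; nothing is calibrated to real forecasts.
import numpy as np

rng = np.random.default_rng(1)
TRUTH = 50.0          # true value of the quantity being forecast (arbitrary units)
N_FORECASTERS = 30

def panel(random_sd, bias_mean, bias_sd):
    random_error = rng.normal(0.0, random_sd, N_FORECASTERS)
    systematic_error = rng.normal(bias_mean, bias_sd, N_FORECASTERS)
    return TRUTH + random_error + systematic_error

best = panel(random_sd=30.0, bias_mean=0.0, bias_sd=2.0)     # negligible shared bias
worst = panel(random_sd=30.0, bias_mean=-25.0, bias_sd=2.0)  # shared bias toward low values

for name, forecasts in [("best case", best), ("worst case", worst)]:
    print(f"{name}: mean {forecasts.mean():.1f}, spread (sd) {forecasts.std():.1f}, "
          f"share below truth {float(np.mean(forecasts < TRUTH)):.0%}")
```

In both cases individual forecasts are noisy, but only in the second does the panel as a whole lean systematically to one side, which is the directionality the next paragraph looks for in rationalist-adjacent timelines.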

In this respect, it is quite revealing that rationalist-adjacent reports so often predict very fast AI timelines and so rarely predict very slow AI timelines. That is precisely what we would expect if forecasts were dominated by rationalist-adjacent forecasters’ prior belief in fast timelines rather than by any limited ability they have to model correct timelines. It is, of course, possible that forecasts as a whole should pattern after the bad case when we are in fact in the good case, or possible that rationalist-adjacent forecasts as a whole should be driven largely by systematic bias, but that one particular forecast should be the exception. Yet these would be surprising conclusions and should require significant evidence to support them.

In this regard, it is no accident that much of the discussion of the AI 2027 timelines model on the EA Forum focused on exactly this question of whether models constructed on the basis of very sparse data should be relied upon (see responses to the authors’ comment here). The same trend dominated responses elsewhere. Here, for example, is Bob Jacobs:

“It ain’t what you don’t know that gets you into trouble. It’s what you know for sure that just ain’t so” … I think the rationalist mantra of “If It’s Worth Doing, It’s Worth Doing With Made-Up Statistics” will turn out to hurt our information landscape much more than it helps.

For my own part, I side with scientific orthodoxy in thinking that such models are of little value and more likely to mislead readers. In the best case, readers will overestimate the strength of the truth signal and in the worst case, readers will be swayed by the biases of forecasters.

This is important, because we will see below that the benchmarks and gaps model is primarily driven by the authors’ own parameter estimates and other modeling choices. Under conditions of high uncertainty, models constructed in this way are unlikely to be very informative and can be highly misleading.

7. The case of the missing model

The authors certainly view the benchmarks and gaps model as the better of their two models. For example, Eli Lifland writes that:

Though I think the time horizon extension model is useful, I place significantly more weight on the benchmarks and gaps model because I think it’s useful to explicitly model the gaps rather than simply adjusting the required time horizon for them.

I do not think that the benchmarks and gaps model is a better model. In fact, I do not think there is much of a model here at all.

The model involves one primary modeling contribution: a logistic fit to RE-Bench saturation. Most of the contribution here is the choice to use a logistic curve, since it is hard to make further modeling decisions that would stop a logistic model from saturating RE-Bench within a few years. Indeed, the authors justify their estimate of the upper bound of RE-Bench by holding that “changing the upper bound doesn’t change the forecast much.”

In response to criticism of the role of RE-Bench in the model, the authors now suggest that they should have excluded it, writing:

It’s plausible that we should just not have RE-Bench in the benchmarks and gaps model.

If that is right, then perhaps the main modeling contribution lies elsewhere?

But after RE-Bench saturation, there is little resembling a model to be had. The authors decide on a series of six remaining gaps to be crossed, together with a seventh catchall for unknown unknowns. They forecast the time to cross each gap, then create a model which draws samples from these forecasts and adds the sampled values.

While there is a moderately complex sampling model being used after RE-Bench saturation, it isn’t yielding terribly different results from what any other sampling model would yield. Indeed, we will see below that even just summing the authors’ main forecasts and ignoring speedups would bring superintelligent coders by the end of the decade on both main authors’ estimates. This is, in large part, a standard forecasting exercise in the rationalist-adjacent tradition, not a modeling exercise.

When we turn to the authors’ forecasts, these forecasts are not merely missing a model. They are driven largely by sparse, broad-brush reasoning that cannot support any strong conclusions about AI timelines, and in some cases is disconnected from the authors’ own estimates. This does not happen because the authors were lazy or unqualified. It happens because there is not enough evidence to meaningfully ground forecasts of the nature that the authors set out to make.

8. Justifying the forecasts: A case study

8.1. Introduction

Engaging with forecasts is a risky business when your view is that those forecasts should not have been made. Expressing any strong positive view is a way of engaging in the same type of forecasting that you have argued to be irresponsible. Criticizing existing forecasts is met with an invitation to put your money where your mouth is and try to do better.

Frustratingly, the only thing that can be done under such conditions is to point to the low evidential basis for current forecasts and refuse to replace them with forecasts of my own. I realize that this is frustrating. It is also the right way to respond.

Let’s look at an example: the move from the engineering complexity milestone to the feedback loops milestone. We will see that Eli’s forecast contains a sparsely described calculation that turns out to be better than it seems, coupled with an abbreviated course of reasoning that is disconnected from the better part of his calculations. We will see that Nikola does a bit better, supplementing a first “intuitive guess” with some attempts to make forecasts based on existing METR data, though these attempts leave a good deal to be desired.

8.2. Feedback loops: The forecasting task

Recall that once the engineering complexity gap has been crossed, the system is assumed to have crossed the following milestones:

Achieving tasks that take humans lots of time: Ability to develop a wide variety of software projects involved in the AI R&D process which involve modifying a maximum of 10,000 lines of code across files totaling up to 20,000 lines. Clear instructions, unit tests, and other forms of ground-truth feedback are provided. Do this for tasks that take humans about 1 month (as controlled by the “initial time horizon” parameter) with 80% reliability, [at] the same cost and speed as humans.

Handling complex codebases: Ability to develop a wide variety of software projects involved in the AI R&D process which involve modifying >20,000 lines of code across files totaling up to >500,000 lines. Clear instructions, unit tests, and other forms of ground-truth feedback are provided. Do this for tasks that take humans about 1 month (as controlled by the “initial time horizon” parameter) with 80% reliability, [at] the same cost and speed as humans.

The next milestone to be crossed is completing these tasks without external feedback, defined as follows:

Working without externally provided feedback: Same as above, but without provided unit tests and only a vague high-level description of what the project should deliver.

8.3. Feedback loops: Eli’s forecast

After recommending consideration of a related concept of messiness, the authors give and justify their own estimates. Here is the complete statement and justification of Eli’s forecast.

Eli’s estimate of gap size: 6 months [0.8, 45]. Reasoning:

  • Intuitively, it feels like once AIs can do difficult long-horizon tasks with ground truth external feedback, it doesn’t seem that hard to generalize to more vague tasks. After all, many of the sub-tasks of the long-horizon tasks probably involved using similar skills.
  • However, I and others have consistently been surprised by progress on easy-to-evaluate, nicely factorable benchmark tasks, while seeing some corresponding real-world impact but less than I would have expected. Perhaps AIs will continue to get better on checkable tasks in substantial part by relying on a bunch of stuff and seeing what works, rather than general reasoning which applies to more vague tasks. And perhaps I’m underestimating the importance of work that is hard to even describe as “tasks”.
  • Quantitatively, I’d guess:
    • Removing BoK / intermediate feedback adds 1-18 months.
    • Removing BoK is 5-50% of the way to very hard-to-evaluate tasks, so multiply by 2 to 10.
    • The above efforts will have already gotten 50-90% of the way there since doing massive coding projects already requires dealing with lots of poor feedback loops, so multiply by 10 to 50%.
  • o3-mini tells me this gives roughly 0.8 to 45 months, this seems roughly right so I’ll go with that.

I had trouble parsing Eli’s description of the quantitative estimate, so I reached out to Eli for clarification. Here is the idea.

Eli begins by estimating the time needed to overcome removal of intermediate feedback, giving an 80% confidence interval of 1 to 18 months.

Eli then estimates how far this brings us towards achieving the next milestone. Eli gives an 80% confidence interval of 5 to 50% progress towards the next milestone, equivalent to a multiplier of [2, 20] on the first estimated quantity. (The text contains a typo. “10” should be 20).

Combining the above forecasts yields a first pass estimate of the time needed to reach the next milestone. Eli then estimates the amount of progress towards the first pass estimate that has already been made while crossing previous milestones. Eli gives an 80% confidence interval of [50%, 90%], equivalent to a multiplier of [0.1, 0.5].

Eli then uses o3-mini to convert each 80% confidence interval to a lognormal distribution and draw (uncorrelated) samples to estimate a final 80% confidence interval on time needed to cross the feedback loops milestone.
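For readers who want to check the arithmetic, here is a minimal reconstruction of that calculation (mine, not Eli’s or o3-mini’s actual code): each 80% confidence interval is treated as the 10th and 90th percentiles of a lognormal, independent samples are drawn, and the three factors are multiplied. With the corrected multiplier of [2, 20], the product’s 80% interval comes out at roughly [0.8, 45] months, matching Eli’s stated bounds.

```python
# Reconstruction of Eli's three-factor estimate (illustrative, not the original code).
import numpy as np

Z90 = 1.2816  # 90th percentile of the standard normal

def lognormal_from_ci(lo, hi):
    """Lognormal (mu, sigma) whose 10th/90th percentiles are lo and hi."""
    mu = 0.5 * (np.log(lo) + np.log(hi))
    sigma = (np.log(hi) - np.log(lo)) / (2 * Z90)
    return mu, sigma

rng = np.random.default_rng(0)
n = 1_000_000

factors = [
    (1, 18),     # months to overcome removal of intermediate feedback
    (2, 20),     # multiplier: that step is only 5-50% of the way to the milestone
    (0.1, 0.5),  # multiplier: 50-90% of the way there already, so 10-50% remains
]

product = np.ones(n)
for lo, hi in factors:
    mu, sigma = lognormal_from_ci(lo, hi)
    product *= rng.lognormal(mu, sigma, n)

# The 10th and 90th percentiles come out near 0.8 and 45 months.
print("80% interval (months):", np.round(np.percentile(product, [10, 90]), 1))
```

The same interval also follows analytically, since a product of independent lognormals is itself lognormal, with the log-means and log-variances summed.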

Now let’s look at Eli’s arguments, reproduced below:

  • Intuitively, it feels like once AIs can do difficult long-horizon tasks with ground truth external feedback, it doesn’t seem that hard to generalize to more vague tasks. After all, many of the sub-tasks of the long-horizon tasks probably involved using similar skills.
  • However, I and others have consistently been surprised by progress on easy-to-evaluate, nicely factorable benchmark tasks, while seeing some corresponding real-world impact but less than I would have expected. Perhaps AIs will continue to get better on checkable tasks in substantial part by relying on a bunch of stuff and seeing what works, rather than general reasoning which applies to more vague tasks. And perhaps I’m underestimating the importance of work that is hard to even describe as “tasks”.

A few things are worth noting here. The first is that these remarks are largely disconnected from the calculations that Eli goes on to make. There is nothing here about the time to overcome the removal of intermediate feedback or the time to progress from there towards hard-to-evaluate tasks. The remarks do connect to Eli’s last estimate of the amount of progress already made during previous milestones. That is better than nothing, though we would also expect some discussion of the remaining parts of the estimate.

The second is that there is not much argument here. Eli’s first bullet point reports a general feeling that “it doesn’t seem that hard to generalize to more vague tasks,” suggesting that “many of the sub-tasks of the long-horizon tasks probably involved using similar skills.” What are these skills, and where were they used? How developed should they be at this point, and why should we expect them to be there? We are not told very much.

Eli’s second bullet point says that he has “consistently been surprised by progress on easy-to-evaluate, nicely factorable benchmark tasks.” What does this have to do with progress on vague tasks? We are not told, and the rest of the second bullet point contains no positive argument but instead an acknowledgment of two objections. I think perhaps Eli’s thought is that performance on easily factorable tasks will be driven by “general reasoning which applies to more vague tasks.” If that is the thought, it would be good to say so explicitly and provide an argument for why this should be so.

That is the entirety of Eli’s justification: a few quick remarks largely disconnected from the subsequent calculation, followed by the short calculation itself.

8.4. Feedback loops: Nikola’s estimate

Nikola’s estimate is a bit better-justified. Nikola writes:

Nikola’s estimate of gap size: 3 months [0.5, 18]. Reasoning:

  • RE-Bench provides scoring functions that can be used to check an agent’s performance at any time. There will likely be a gap in performance with and without feedback.
  • The current number is mostly an intuitive guess. My estimate is that adding Best-of-K to RE-Bench adds 4-8 months of progress on the score. This probably captures around a third of the total feedback loop gap.
  • This leads to around 12-24 months. However, I expect around half of this gap to be already bridged if I have systems that can do very long-horizon tasks with millions of lines of code. I also think it’s plausible that RL on easy-to-evaluate tasks will generalize well to other tasks, making my lower CI even lower.
  • Messiness somewhat tracks a lack of feedback loops. In METR’s horizon paper, Figure 9 presents the performance of tasks divided into messier and less messy tasks. This performance gap can inform how much a lack of feedback loops will affect performance. One metric we can use is “how far behind is the performance of the more messy tasks?” That is, if we take the maximum of the performance on more messy tasks, how long ago was that performance reached on the less messy tasks?
    • For a task length below 1 hour, the max success rate is around 0.6 with Claude 3.7 Sonnet (February 2025). That level was surpassed in November 2023 with GPT-4 1106, making a 15 month gap.
    • For a task length above 1 hour, the max success rate on messy tasks is around 0.1 with o1 (Dec 2024) which was surpassed in May 2024 with GPT-4o. That makes a 7-month gap.
    • I think the longer tasks are more representative of the types of tasks that will be faced around the feedback loops milestone.
    • My gap estimate will add uncertainty on both sides.

Nikola is honest with his readers up front: “The current number is mostly an intuitive guess.” There is not much evidence to be found in what follows, though Nikola is to be commended for his honesty on this front.

Nikola then posits, without justification, that just adding Best-of-K scoring to our evaluation of RE-Bench should take us a third of the way towards crossing the remaining gap, and that this should take between 4 and 8 months. That gives an estimate of 12-24 months.

This is the majority of Nikola’s positive reasoning, and so far it leaves a good bit to be desired. Nikola then goes on to adjust his estimate downwards based on the view that half of the gap will be already bridged, and that (as Eli suggested) some strategies that worked well on less-vague tasks will generalize well to more-vague tasks.

Nikola then offers a second, largely separate (though in many senses richer) strand of reasoning. Nikola draws on a division in an METR report between “messier” and “less messy” tasks on HCAST and RE-Bench. Nikola looks at the current success rates on messier tasks and asks how long ago the same success rates were achieved on less messy tasks. The gaps are 7 months for longer tasks and 15 months for shorter tasks. Nikola suggests that the 7-month figure is more representative, then adds uncertainty on either side of this figure.

Weighted success rate over time on HCAST and RE-Bench tasks by task length and task messiness, from Kwa et al (2025), “Measuring AI ability to complete long tasks”

I am not entirely sure whether Nikola’s second strand of reasoning is meant to be driving the estimate, or how it is related to the first strand of reasoning. The first thing to note about Nikola’s reasoning is that it helps itself to some advantageous numbers. Nikola takes the lower of the 7 and 15-month estimates from the METR report (an estimate which also carries a very large error bar). And the METR report only considers performance in 2023 and 2024, which have been atypically good years for progress in artificial intelligence.

More importantly, while the “messiness factors” used in the METR report to determine the messiness of tasks are in many cases reasonable, the HCAST and RE-Bench tasks are not particularly messy. This means that it is not a good idea to estimate the time needed for future AI systems to conquer messy tasks by looking at the time needed to conquer the messier tasks on HCAST and RE-Bench, since many current and future tasks are far messier than these.

All told, the reasoning here is an improvement on Eli’s estimate. After a first “intuitive guess” we are offered a brief attempt at rooting forecasts in empirical data from the METR report. That attempt leaves much to be desired in length and strength. It is, perhaps, a start, but a far cry from what is needed to ground rigorous modeling.

9. Speedups aren’t the main problem

Before concluding, I do want to note one good feature of the benchmarks and gaps model. Namely, one of the more pressing objections to the timeline-extension model becomes less pressing in the context of the benchmarks and gaps model.

We saw in Part 19 of this series how the progress speedups used in the timelines model inappropriately smuggle hyperbolic growth patterns into even purportedly exponential models.

What happens if these speedups are removed? To the authors’ credit, their estimates will be pushed back slightly, but not by enough to shift their qualitative predictions.

Let’s take a coarse look at the results by assuming RE-Bench saturation in June 2026 (a rough median for the model’s predictions), then summing each forecaster’s median estimates of the time needed to cross each gap. This isn’t a perfect method, but it is good enough to illustrate the qualitative change.

The table below compares these modified “median” arrival times to the unmodified times given by the authors’ simulations.

Eli: unmodified median December 2028; modified “median” January 2030
Nikola: unmodified median October 2027; modified “median” March 2029
FutureSearch: unmodified median January 2032; modified “median” September 2032
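Here is the back-of-the-envelope version of that check, using the median gap sizes from the table in Section 4 and the assumed mid-June 2026 saturation date; the resulting decimal years land within about a month of the modified medians above.

```python
# Back-of-the-envelope check of the "modified median" column: assume RE-Bench
# saturation in mid-June 2026, then add each forecaster's median gap estimates
# (in months) from the table in Section 4, ignoring speedups.
median_gaps = {
    "Eli":          [18, 3, 6, 1.4, 1.7, 6.9, 5.5],
    "Nikola":       [16, 3, 3, 1.2, 0.4, 6, 3],
    "FutureSearch": [12.7, 11, 18.3, 2, 2.4, 13.5, 14.7],
}

SATURATION = 2026 + 5.5 / 12  # mid-June 2026 as a decimal year

for name, gaps in median_gaps.items():
    arrival = SATURATION + sum(gaps) / 12
    print(f"{name}: {sum(gaps):.1f} months of gaps -> arrival around {arrival:.1f}")
```

Read as calendar dates, these come out around January 2030 for Eli, March 2029 for Nikola, and September 2032 for FutureSearch, in line with the modified column.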

In all cases, predictions have been pushed back. However, the pushback is relatively moderate. While continuing to implement speedups in the takeoff forecast may be more problematic, speedups cannot be the primary problem with the benchmarks-and-gaps model.

10. Conclusion

This post looked at the second of two AI timeline models proposed by the AI 2027 authors. Although this is the authors’ preferred model, we saw that in many respects there is not much of a model here at all. The one primary modeling contribution is a logistic fit to RE-Bench. Even here, we saw that the authors are not sure they should have included this part of the model at all, and that moderate variations of the modeling choices made once a logistic fit is chosen do not do much to shift qualitative model behavior.

After that, we saw that the main contribution is a series of sparsely evidenced forecasts of the dates when various milestones will be reached. We saw that there are reasons to be concerned about the reliability of forecasts on this timescale. Reinforcing those concerns, we looked at a case study of the authors’ forecasts for one milestone, in which machines become able to work without externally provided feedback. We saw that some of these forecasts face challenges, including sparse evidence and reasoning disconnected from the later quantitative calculations used to arrive at final estimates. We saw that not all of the reasoning here is bad — in particular, one forecaster does attempt to draw some conclusions about the difficulty of working without external feedback from two years of recent data on messy task performance. But that is not to say that the forecasts meet or approach the level of reliability needed to ground a rigorous model.

We also saw one point of improvement over the timeline-extension model. Whereas the timeline-extension model showed strong effects of modeling choices regarding progress speedups, those effects are less pronounced in the present model. That is why I offered separate arguments against this model.

This concludes my discussion of the AI 2027 timelines report. There are four more reports in this series. I will address some of these reports in future posts.

Comments

10 responses to “Exaggerating the risks (Part 20: AI 2027 timelines forecast, benchmarks and gaps)”

  1. Yarrow

    Reading this post provides the same crisp satisfaction as drinking cold lemonade on a punishingly hot summer day.

    Lately, I feel confident in saying that we don’t understand intelligence, we have no idea how to build AGI, and we have no idea when we will know how to build AGI — so we have no idea when AGI will be built. It could be significantly more than 100 years. We simply don’t know.

    There is no way I can foresee for us to acquire this knowledge that doesn’t involve years of significant progress in fundamental AI research (and probably also related fields like cognitive science and theoretical neuroscience). I don’t think models, forecasts, surveys, debate competitions, or anything of that sort will help. You can’t squeeze blood from a stone, and you can’t squeeze scientific knowledge from scientific ignorance by any process except science.

    I will say we know enough to know that ChatGPT will not achieve general intelligence and take over the world, forcing us to live on ocean platforms, within the next ~5-7 years. o3 and GPT-5 Thinking have major limitations. The rate of progress since November 2022 has been modest, not the sort of super fast improvement that would support a near-term AGI scenario. The evidence from ARC-AGI-2, the Apple paper on AI reasoning, and the recent “Potemkin understanding” paper indicates that frontier AI models attain a mustard seed of reasoning from a mountain of computation. Mostly what we’re looking at is memorization on a massive scale with some limited ability to extrapolate or interpolate or generalize from memorized data.

    Yann LeCun says LLMs are a dead end on the road to general intelligence and, according to AAAI’s 2025 survey, 76% of AI experts agree. When people try to explain why they think LLMs will scale to AGI, it’s typically either a restatement of some version of the poorly supported idea that scaling LLMs just naturally leads to AGI or it’s a really hand-wavy account of how the gaps will be filled by some simple tricks and hacks and plug-ins. Is our eschatology based on something so thin? On… tweets?

    In another timeline where we didn’t have the historical background of sci-fi and futurism around AI that we have, we could have met the advent of LLMs in a different way. In fact, we still can. They are fascinating from a scientific perspective. They can be useful tools within certain specific niches of use, as long as you don’t over-trust them. I would like to ignore most of the eschatological discourse on AI (and the other Twitter engagement sponging discourse on AI) and live in that alternate world where we see LLMs as a fascinating scientific discovery.

    My current hunch (very low confidence) is that we should probably have more public funding of fundamental research in AI, with a particular emphasis on funding novel, non-consensus/counter-consensus, diverse, creative, weird ideas. I was influenced into thinking this by reading the science chapter of Derek Thompson and Ezra Klein’s wonderful book Abundance (they don’t talk about AI in the book, just science funding generally). I was also influenced by Richard Sutton saying that it’s still hard to get funding for fundamental research in AI.

    It would be great if we could have self-driving cars and other robots of comparable cognitive ability. It would be great if we understood more about intelligence. I think it would probably be great if we had AGI! The huge amount of private investment into LLMs and image generators doesn’t seem to be doing that much to advance fundamental research, although I don’t know what the AI labs are doing behind the scenes.

    Assuming the predictions of AGI in 2030 or 2032 don’t come to pass, I’m not sure whether we’ll see people try to correct their views on a deep level or whether people will just change their views the minimum amount required to fit the new facts. I feel a bit cynical and suspect the latter is more likely. But, personally, I’d just like to move on (emotionally and intellectually) from this sphere of debate and get on to what I think is true and important — and not worry so much about what other people think.

    1. David Thorstad

      Thanks Yarrow! I appreciate the kind words.

      I very much agree with your skepticism about the claims that we currently know how to build AGI or when AGI will be built.

      I likewise share your skepticism about forecasting methods here. The effective altruist and rationalist communities have some of the best forecasters around, so the problem cannot be that their forecasters are bad. It’s just that they’re trying to make very complex forecasts on the basis of very sparse data, and there isn’t enough data to ground good forecasts.

      I also agree with your emphasis on the many ways in which recent developments in AI provide useful tools. I now use AI to help me understand and prove theorems, conduct literature reviews, code, and even design my office. I also think it is important to emphasize that there are many real and important threats raised by developments in AI. You don’t have to think that AI is going to kill us all to be worried about students cheating on their essays or armies using AI to generate and attack targets.

      I’m also very glad to hear your support for fundamental research in AI. Honestly, most academics think that fundamental research is perpetually underfunded in favor of quick and shiny bird-in-hand applications. We think this is very short-sighted, since fundamental research is needed to drive those applications. In disciplines such as philosophy, it is easiest to get funding for a variety of niche projects in a few applied subfields that interest funders when what we are begging for funding to do is, in large part, meat-and-potatoes research.

      Finally, I share your hope that folks will calm down if aggressive timelines turn out to be wrong and predicted catastrophes fail to materialize. That is not usually what happens when predicted catastrophes fail to materialize, but effective altruists and their allies are often admirably good at learning from their mistakes, so I have some hope on this front.

      1. Yarrow

        Thanks for responding, David. It’s interesting to hear your thoughts.

        I usually find myself on the optimistic and anti-cynicism side of disagreements. I would be happy to see effective altruists critically self-reflect if the currently popular AGI timelines in the ~2-7 year range prove false, as I expect they will. I might be over-generalizing to the whole of EA from just a handful of people I’ve observed or interacted with, but I get the sense that a lot of people are really dug in. I have a hard time imagining a graceful dismount from their current position. I hope I’m wrong about this. I would be glad to see it.

        On the topic of short-term, non-existential AI worries, I’m currently watching Ezra Klein’s recent interview with the economist Natasha Sarin on YouTube. One of the scary parts of the interview (and there are a few scary parts, given that they are covering the Trump administration) is when they discuss how much the U.S. economy is currently dependent on AI investment. In the grand scheme of things, it’s no big deal if some venture capitalists blow a few billion dollars on some bad ideas, but AI investment has apparently grown to the point where it’s beyond that scale and is now macroeconomically significant.

        As far as I’ve been able to figure out, we’ve seen very, very little evidence of AI increasing worker-level productivity or firm-level productivity, increasing firm-level profitability, displacing labour or causing unemployment, or contributing directly (as opposed to through investment and indirect effects) to GDP growth. The vast majority of firms haven’t seen positive ROI from their AI investments and many firms are paring back these investments. The financial and economic signals so far seem to me to be most consistent with a scenario where AI’s impact on work and business has been significantly overhyped.

        For whatever it’s worth, anecdotally, what I tend to hear about people trying to use LLMs for their jobs is also lacklustre. For example, someone was excited about using ChatGPT in 2023 and bounced ideas off it for a while, but now they’ve stopped using it altogether. Or someone’s boss is using LLMs but it’s not clear they know what they’re doing and they might just be making a huge mess their employees will have to clean up later. Or people have found a little niche where LLMs are helpful — you mentioned literature reviews and, similarly, I find o3 and, to a lesser extent, GPT-5 Thinking to work great as an enhanced search engine, like SuperGoogle (with the drawback that it fabricates ~5% of search results) — but that it’s something they could easily live without.

        Even from professional coders, where AI is supposed to be better-positioned to impact productivity than maybe anywhere else, the anecdotal experience I heard from one programmer was that AI saves them the time they previously would have spent copying and pasting a block of code from StackExchange.

        Incidentally, the lacklustre financial, economic, and practical impact of AI so far is one piece of evidence that near-term AGI is overhyped. There is a lot of overheated rhetoric from people like Sam Altman and Elon Musk on LLMs being as smart as a person with a PhD, but this is such a narrow way to look at intelligence. LLMs are incredibly good at taking exams. They are good at written questions and answers. But how meaningful is that, really?

        In March 2023, a psychologist wrote an article in Scientific American about how he gave ChatGPT an IQ test. It scored 155, which would put it in the top 0.1% of humans, if it were a human. Does that mean ChatGPT was a superhuman AGI in March 2023? Of course not! We’ve had AI systems that are superhuman in some domains, like go and chess, for a long time. We shouldn’t confuse exams with a test for general intelligence.

        The most striking example, to me, from the Apple paper on AI reasoning released in June was that if you told the AI the exact algorithm for solving the Tower of Hanoi puzzle, its performance on solving the puzzle didn’t improve. Clearly, LLMs are doing something really interesting, but they’re also failing at reasoning in some elementary ways.

        I think one good way to avoid drawing the wrong conclusion from exam-like AI benchmarks is to use better benchmarks like ARC-AGI-2 (and the upcoming ARC-AGI-3, which could be even better). Another way is to look at real world performance in businesses, which can be measured in dollars and cents.

        Which brings me back to the point that one of the near-term, non-existential worries about AI is that there may have been so much investment that is not going to pay off that it may have a bad effect on the U.S. economy writ large. If there is an AI financial bubble and that bubble pops, there could be a recession. This would not necessarily mean that the AI bubble caused the recession. The idea put forth in the Ezra Klein Show interview is that the rest of the economy is doing pretty badly and it’s only really the companies benefitting from AI investment that are doing well. So, the idea is more that AI investment is masking a broader problem in the economy when you just look at the aggregate statistics. The way the AI bubble could be blamed for a recession, I guess, is if you think that investment capital could have been spent on stuff that actually would have had a positive ROI. I am a bit out of my depth on this topic so I’m hesitant to say more.

        In general, with the short-term risks, I have tended to think people are presenting a scary hypothetical scenario that might happen but is ultimately still speculative. For example, I think it was rational to worry about LLM-generated or deepfake-based misinformation, and maybe it still is, but so far it seems like a fairly marginal problem.

        There has been some recent attention on the interaction between LLMs and mental health, specifically how LLMs can reinforce or egg on false beliefs. On the one hand, LLM companies can probably do more to mitigate this problem, and if they can, they should. On the other hand, I don’t think people should panic and conclude that AI is going to cause people to have false beliefs on a large scale. Kirk Honda, a psychologist on YouTube, made a great analogy between the interaction between LLMs and mental health and the interaction between marijuana and schizophrenia. The connection between marijuana and schizophrenia is subtle. We shouldn’t worry about marijuana use causing an epidemic of schizophrenia.

        It’s been frustrating to read about the electricity use of AI because most of that coverage presents it as a concern without quantifying the problem and without putting it in perspective relative to other uses of electricity. My back-of-the-envelope math is that ChatGPT, as a whole, uses the same amount of electricity as 26,000 electric cars (0.34 watt-hours per query multiplied by 1 billion queries per day). In the grand scheme of things, that doesn’t seem like a lot! If this amount grows exponentially, then it could become a lot, but the energy efficiency of AI may grow exponentially too. This doesn’t seem commensurate with presenting it as an urgent problem now.

        You mentioned students using AI to cheat on assignments. That’s one that I don’t know anything about and haven’t thought about.

        In general, I feel weary of the news media and public discourse cycle where people start by worrying intensely about a problem and then most of the time it seems like the problem never materializes and people generally forget they ever worried about it. I feel like this has been happening since I was a kid in the 1990s. I understand that it’s adaptive to worry about potential problems in advance, but surely the level of stress and fear generated by the way we currently do it is not adaptive. I would like to see a more balanced approach. I would like to see more skepticism from the beginning and less anxiety.

        The macroeconomic impact of the AI bubble popping is the first example where I see a fully substantiated case for a large-scale negative impact. The question mark above it for me is the counterfactual question I already raised: in the absence of AI investment, would that capital actually have been put to productive use? To what degree would the AI bubble actually itself be to blame if a recession followed the popping of the bubble?

  2. Nathan Young

    I think my main disagreement is here:

    “It ain’t what you don’t know that gets you into trouble. It’s what you know for sure that just ain’t so” … I think the rationalist mantra of “If It’s Worth Doing, It’s Worth Doing With Made-Up Statistics” will turn out to hurt our information landscape much more than it helps.

    I weakly disagree here. I am very much in the “make up statistics and be clear about that” camp. I disagree a bit with AI 2027 in that they don’t always label their forecasts with their median (which it turns out wasn’t 2027 ??).

    I think that it is worth having and tracking individual predictions, though I acknowledge the risk that people are going to take them too seriously. That said, after some number of forecasters I think this info does become publishable (Katja Grace’s AI survey contains a lot of forecasts and is literally published).

    1. David Thorstad

      Thanks Nathan!

      Yes, I think this is the main disagreement.

      I’m certainly not opposed to publishing expert surveys, such as the AI impacts survey. Those are an important source of information.

      It is, however, important to be clear about what surveys are and are not. Published surveys are typically large samples of relevant expert opinion. The main published output is typically the opinions themselves, and if reasoning is also solicited there is usually not as much focus on soliciting extensive reasoning. The conclusion of a survey is primarily understood as a claim about what a relevant population thinks.

      Surveys are typically sharply distinguished from models, which surveys are not. Neither is it common to publish surveys together with attempts to extend those surveys into models. Scientific journals are sometimes open to the use of surveys in follow-up papers to build models, though under specific conditions. Unless referees believe that the respondents polled have a good ability to track the truth in the relevant area, and that the survey is sufficiently large and well-conducted to make use of this ability, survey data is usually not taken to be an adequate basis for modeling. Even then, survey data is often not viewed as a strong basis for modeling and authors have more success when they supplement it with models based on other data.

      It is considered especially important to be clear about when survey data is being used to inform models. If readers are given the surface impression that they are reading a model constructed from reliable data, but then they look to the appendices and see that not only is the model based on something closer to survey data, but also that the survey data doesn’t have the strength typically needed for survey data to be an acceptable basis for models, they tend to get upset.

      I don’t think that many readers of AI 2027 read far enough to realize that a substantial part of one of the main reports was based largely on data of this sort. (To be honest, I think that most people encountering the AI 2027 report in some form did not read any substantial portion of it, and probably did not read anything in the underlying technical appendices that are meant to form the heart of the research report. That is fine in itself when readers have reason to be confident that those appendices have been suitably vetted, but can be problematic when the appendices are not adequately vetted.) I think that many readers may have had a very different reaction to the report if they had gotten behind the charts and graphs to look at where these figures came from.

      1. Nathan Young

        I mostly agree with that. I am not sure what norms should be but I agree there are issues with taking small group judgemental forecasts very seriously.

        At what point is a good long-run forecasting track record expertise in itself, though?

        1. David Thorstad

          Honestly, many of the best forecasters are EA-adjacent. Effective altruists have done a good job building a base of competent forecasters. When reliable forecasting is possible, effective altruists often have the capacity to do it.

          I often worry about the slide from the kinds of forecasting expertise that effective altruists have established to the kind that they claim. Superforecasters establish their credentials by forecasting relatively predictable quantities on short-term time horizons. They are then claimed to be good at forecasting much less predictable quantities on much longer time horizons. That just doesn’t follow.

          The objection here is not (only? primarily?) that the kinds of skills needed for long-term difficult forecasting problems might be different from the kinds of skills needed for short-term tractable forecasting problems. It’s that we don’t have good reason to think that anybody should be able to make good forecasts in very difficult long-term forecasting problems.

          This is the view of most researchers specializing in forecasting. Even the most bullish (e.g. Tetlock) claim only that we can forecast some slow-moving quantities in domains like politics over perhaps a few decades, and here the only claim that he makes is that we can do this better than chance, not that we can do it particularly well. In more difficult domains, I don’t think anybody wants to go to bat for the reliability of forecasts like those driving the AI 2027 report.

          I try to recommend literature written by EAs when possible, because I think EAs are more likely to believe it. The former director of the Global Priorities Institute, Eva Vivalt, is an expert on forecasting. She recently co-wrote a chapter in Essays on Longtermism taking a fairly pessimistic view of long-term forecasting. I think this is a good summary of how many people would think about the matter. https://academic.oup.com/book/60794/chapter/530064585

  3. David Mathers

    What *should* people do in your view when evidence is sparse on something highly decision relevant? Like, suppose I am a policy maker high up in the US natsec community, and I need to decide whether to spend money on preparing for the nat sec implications of AI good enough to replace most office workers. Presumably, *one* relevant thing here is how likely AI that good is to arrive in the next 2/5/10 years. How do I reasonably form an opinion on that timelines question in your view? Or if the answer is that I can’t, how do I reasonably route around my total agnosticism about timelines in deciding what to spend my limited budget on?

    1. David Thorstad

      Thanks David!

      In general this is a very hard question. One group that has thought a lot about this is the Society for Decisionmaking under Deep Uncertainty (https://www.deepuncertainty.org/). I co-wrote a paper with Andreas Mogensen on some of this kind of work, but I’m not sure it’s the best paper out there.
      In general, researchers doing this kind of work will approve of one thing that the AI 2027 team did: they ran a round of tabletop exercises. That, or methods like it (e.g. scenario planning, red-teaming, etc.), is a popular component of many analyses.

      In general, researchers doing this kind of work will not approve of many other things that the AI 2027 team did. Most researchers favor simpler models that make fewer assumptions and rely less heavily on arbitrary choices about model structure, parameter estimates and the like. Most are quite skeptical of the project of constructing large models with many parameters and nontrivial decisions about model structure that could easily have gone a different way, which is what the AI 2027 team did here.

      Some good examples of these concerns are Freedman, “Some pitfalls in large econometric models: A case study” and Lempert et al., “Shaping the next one hundred years: New methods for quantitative, long-term policy analysis.”

      The more specific question that you asked is a bit different. The most direct question was about (Q1) preparing for potential national security implications of AI replacing office workers. A sub-question that you suggested we might answer was (Q2) how likely such AI is to arrive in 2/5/10 years.

      The move from questions like (Q1) to questions like (Q2) is highly controversial in thinking about decisionmaking under deep uncertainty. Many authors recommend a range of non-forecasting methods that help us to think through and prepare for a range of plausible scenarios without directly forecasting the likelihood of these scenarios, let alone their components. The book put out by the Society for Decisionmaking under Deep Uncertainty has some good examples of methods like this, and a good paper motivating approaches like this would be Goodwin and Wright, “The limits of forecasting methods in anticipating rare events.” The RAND Corporation is probably the largest research entity developing models like this, and if you want to look at the demand side, these models are often bought by governments, utilities, and oil firms.

      If you want to answer questions like (Q2), they are, while not the most tractable, considerably more tractable than the kinds of questions that the AI 2027 authors are trying to answer in this report. There are a range of credible studies by academic researchers and leading think tanks, banks and consulting firms on the kinds of labor that can be automated today or in a near-term time horizon. Ten years is pushing it, but many are willing to look ahead two years. I’m not sure about five years.
      If you want to answer questions like (Q1), you probably would not want to rely exclusively on forecasting methods. Things closer to what the AI 2027 team did here (tabletop exercises) and related methods like scenario modeling, robust decisionmaking, infogap decisionmaking, adaptive planning and such are often used by policymakers and corporate executives to confront decisions like this.

      I think it is important to be clear at the outset about the kind of insights and the degree of certainty that we are likely to get. The good news about the models and methods that I’ve discussed here is that they have a track record of having helped people, and people are often willing to pay millions of dollars to have consultants come in and use them. The bad news is that they don’t provide anywhere near the level of certainty or quantitative precision or exact guidance that methods like the AI 2027 report authors’ models would aim to give.

      I think sometimes it’s important to be honest with decisionmakers about the kind of guidance that we can and cannot provide, and that a good part of what we need to do in speaking with the effective altruist community is to tell them that the kind of guidance they are looking for might not be fully available, and that if they go looking for more than the evidence can provide they are likely to end up with something quite loosely related to the truth.

      1. David Mathers

        I certainly have no particular desire to defend AI 2027 specifically, since I haven’t looked at it in detail, I share your distrust of complex models, and ultimately, if you’re going to do forecasting, it’s better to use a higher number of forecasters with a wider range of biases and positions than they did.

        I guess what I’m worrying about is a failure mode where we:

        1) Accurately note that a domain is hard to make predictions about

        2) Refrain from explicitly making numeric predictions

        3) Take other actions that only make sense on the basis of *implicit* ideas about the probability of the domain that is hard to predict.

        4) Pat ourselves on the back for avoiding 2).

        For example, if you say “rather than forecasting, let’s run scenario exercises covering a range of reasonable possibilities”, then you’re wasting your time if the things you think are reasonable possibilities in fact have low likelihood. In that sense, you are still relying on *some* ability to predict the domain in question. My *starting* position is that it’s generally better to make the assumptions you’re relying on *explicit* rather than *implicit*, but I mean that only as a vague presupposition that can fairly easily be overcome by more specific evidence. Of course, you don’t assume any *specific* numeric probability by treating something as a reasonable possibility worth doing a scenario about, but I do think (absent some weird edge cases involving fanaticism) that you are assuming it’s not, like, 1 in one trillion. (It’s certainly possible, though, that moving from forecasting to looking at a range of reasonable possibilities replaces harder predictions with easier ones and is good because of that.)

        That’s not to say I think people are necessarily wrong to avoid numeric forecasts in difficult domains. There’s no logical reason why, given actual limited human psychology, giving explicit numbers *has* to be more helpful than misleading. If lots of orgs avoid doing so in a particular domain, that is decent evidence, even if not conclusive evidence, that they are right to do so. But I’d separate out “why are you not using standard best practices, which exclude forecasting in domains like this”, which I think is a good challenge (assuming you are right about standard best practices), from “don’t you know it’s impossible to form reasonable probabilities here anyway”, which I suspect is a bad challenge because *some* reliance on at least vague probability ranges is needed to do anything at all. (I could be persuaded I am wrong about that, though; I am not a philosopher of probability, or a philosopher of science, or a statistician, etc.) I’d also be somewhat less bullish than I read you as being on “don’t rely on stuff that couldn’t be a scientific paper”, because I take it that also applies to, say, write-ups of tabletop exercises.
