Exaggerating the risks (Part 16: Biorisk from LLMs, continued)

In this report, the authors share final results of a study of the potential risks of using large language models (LLMs) in the context of biological weapon attacks. They conducted an expert exercise in which teams of researchers role-playing as malign nonstate actors were assigned to realistic scenarios and tasked with planning a biological attack; some teams had access to an LLM along with the internet, and others were provided only access to the internet. The authors sought to identify potential risks posed by LLM misuse, generate policy insights to mitigate any risks, and contribute to responsible LLM development. The findings indicate that using the existing generation of LLMs did not measurably change the operational risk of such an attack.

Mouton et al., “The operational risks of AI in large-scale biological attacks”

1. Introduction

This is Part 16 of my series Exaggerating the risks. In this series, I look at some places where leading estimates of existential risk look to have been exaggerated.

Part 1 introduced the series. Parts 2-5 (sub-series: “Climate risk”) looked at climate risk. Parts 6-8 (sub-series: “AI risk”) looked at the Carlsmith report on power-seeking AI.

Parts 9, 10, and 11 began a new sub-series on biorisk. In Part 9, we saw that many leading effective altruists give estimates between 1.0% and 3.3% for the risk of existential catastrophe from biological causes by 2100. I think these estimates are a bit too high.

Because I have had a hard time getting effective altruists to tell me directly what the threat is supposed to be, my approach was to first survey the reasons why many biosecurity experts, public health experts, and policymakers are skeptical of high levels of near-term existential biorisk. Parts 9, 10, and 11 gave a dozen preliminary reasons for doubt, surveyed at the end of Part 11.

The second half of my approach is to show that initial arguments by effective altruists do not overcome the case for skepticism. Part 12 examined a series of risk estimates by Piers Millett and Andrew Snyder-Beattie. Part 13 looked at Ord’s arguments in The precipice. Part 14 looked at MacAskill’s arguments in What we owe the future.

Part 15 began a two-part investigation of biorisk from large language models (LLMs). I argued that a recent GovAI report widely cited as an exemplar of the case for biorisk from LLMs provides little in the way of support for high biorisk estimates.

Today’s post continues my investigation of biorisk from LLMs by looking at a recent red-teaming study by the RAND Corporation.

2. The study

Christopher Mouton is a senior engineer at the RAND Corporation and a professor at the Pardee RAND Graduate School. Mouton holds a PhD in aeronautical engineering from Caltech.

Caleb Lucas is an associate political scientist at the RAND Corporation and a lecturer at Carnegie Mellon. Lucas holds a PhD in political science from Michigan State.

Ella Guest is an AI policy fellow at the RAND Corporation, after a stint as an AI policy researcher at GovAI. Guest holds a PhD in social statistics from the University of Manchester.

Mouton, Lucas, and Guest prepared a report for the RAND Corporation, “The operational risks of AI in large-scale biological attacks”, published in January 2024. The report used red-teaming methods to study the effect of LLMs on biorisk across a variety of scenarios. They found no statistically significant effect of LLM usage on biorisk.

Today’s post asks why the RAND team found no significant benefit of LLM usage in biological attack planning, and what we might conclude from this study. I’ll start by explaining the team’s methodology (Section 3), then discuss key findings (Section 4). I’ll then try to understand why LLM usage did not improve the viability of biorisk plans (Section 5), and what this might imply about biorisk from future LLMs (Section 6).

3. Methodology

The RAND study tasked “red teams” of three members with planning a hypothetical biological attack. Red team cells had a good deal of relevant experience, with each team containing “one strategist, at least one member with relevant biology experience, and one with pertinent LLM experience”.

Four vignettes were prepared describing potential attack scenarios: “a fringe doomsday cult intent on global catastrophe, a radical domestic terrorist group seeking to amplify its cause, a terrorist faction aiming to destabilize a region to benefit its political allies, and a private military company endeavoring to engineer geostrategic conditions conducive to an adversary’s conventional military campaign”. The use of four vignettes allowed the RAND study not only to assess whether LLMs might increase the viability of catastrophic attacks by doomsday cults, but also whether LLMs might increase the viability of other types of attacks further removed from existential biorisk concerns. Teams were assigned one of four vignettes to focus on.

The aim of the study was to test whether LLMs increased the viability of plans developed by cells. To this end, red teams were randomly assigned to use one of two LLMs (LLM A condition; LLM B condition) or to use the internet without LLMs (internet only condition). To ensure that teams made good use of the LLMs they were assigned, teams were given background information which “addressed such topics as general conversing practices, nuances of context and message threading, methods for setting the LLM system prompt, and guidance on jail-breaking and prompt-engineering techniques.” Together with the presence of team members with relevant LLM experience, this should tend to put pressure on the idea that teams’ struggles were due to lack of relevant knowledge or experience with LLMs. 

To test the importance of different types of expertise in team construction, the researchers also constructed two types of cells lacking at least one relevant type of experience. Two crimson cells had members with operational experience, but lacked either LLM experience or biological experience. One black cell had LLM experience, but lacked either biological or operational experience.

Teams were given seven weeks, with up to 80 hours per team member, to work on their assigned vignettes under their assigned conditions. The vignette and condition assignments are as follows:

Vignette and Cell Assignment. From Mouton et al. (2024).

After seven weeks, each team submitted an operation plan. Plans were scored by experts for their viability. This was done by separately estimating the biological feasibility and operational feasibility of proposed plans, and averaging these feasibility ratings to create a viability score. The scoring scheme was as follows:

Scale for measuring feasibility. From Mouton et al. (2024).

Scoring was done by a panel of experts through a detailed Delphi procedure, often thought to improve the quality of expert assessments, though conclusive evidence for this claim is hard to come by.
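To make the scoring scheme concrete, here is a minimal sketch in Python of how a viability score might be computed from the description above. The aggregation rule for panel ratings is my assumption (a median is used purely for illustration); the report’s actual Delphi procedure is more involved, and the ratings below are hypothetical.

```python
# Minimal sketch of the scoring scheme described above: experts rate a plan's
# biological and operational feasibility on a 1-9 scale, panel ratings are
# aggregated (the exact Delphi aggregation rule is not given here; a median is
# assumed for illustration), and viability is the average of the two
# aggregated feasibility ratings.
from statistics import median

def viability_score(bio_ratings: list[float], ops_ratings: list[float]) -> float:
    """Average the aggregated biological and operational feasibility ratings."""
    bio_feasibility = median(bio_ratings)  # assumed aggregation rule
    ops_feasibility = median(ops_ratings)  # assumed aggregation rule
    return (bio_feasibility + ops_feasibility) / 2

# Hypothetical panel ratings for a single plan (1 = infeasible, 9 = fully feasible)
print(viability_score(bio_ratings=[2, 3, 4], ops_ratings=[3, 3, 4]))  # 3.0
```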

How did the teams do? It turns out that they did not do so well, and crucially that LLMs did not appear to improve the viability of submitted plans.

4. Findings

At least three findings deserve emphasis.

First, the study found no statistically significant difference in the viability of operation plans constructed with or without the help of LLMs. That is, neither (a) access to an LLM, nor (b) access to LLM A, nor (c) access to LLM B produced a statistically significant gain in viability. In fact, the overall (statistically insignificant) impact of LLM access on viability was negative. (p-values can be found in the main report.)

Impact of LLM access on viability scores. From Mouton et al. (2024).

The same holds true if viability is disaggregated into biological and operational feasibility ratings. There was no statistically significant effect of (a) LLM use, (b) LLM A use, or (c) LLM B use on either operational or biological feasibility of operation plans, and the (statistically insignificant) effects that were observed appear relatively random.

Biological feasibility by vignette and LLM. From Mouton et al. (2024).
Operational feasibility by vignette and LLM. From Mouton et al. (2024).
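The post does not describe the exact statistical test RAND used for these comparisons, but for readers who want a concrete picture of what a null result like this looks like, here is a minimal sketch of one way to compare viability between LLM-access and internet-only cells: a simple permutation test. The function, the scores, and the choice of test are all illustrative assumptions, not the report’s method or data.

```python
# Illustrative permutation test (an assumption, not RAND's actual analysis):
# compare mean viability between LLM-access and internet-only cells. With a
# handful of cells per condition and small, noisy differences, the resulting
# p-value is large, mirroring a "no statistically significant effect" finding.
import random

def permutation_p_value(group_a, group_b, n_iter=10_000, seed=0):
    """Two-sided p-value for the observed difference in group means."""
    rng = random.Random(seed)
    observed = abs(sum(group_a) / len(group_a) - sum(group_b) / len(group_b))
    pooled = list(group_a) + list(group_b)
    count = 0
    for _ in range(n_iter):
        rng.shuffle(pooled)
        perm_a, perm_b = pooled[:len(group_a)], pooled[len(group_a):]
        diff = abs(sum(perm_a) / len(perm_a) - sum(perm_b) / len(perm_b))
        if diff >= observed:
            count += 1
    return count / n_iter

# Hypothetical viability scores, not data from the report
llm_cells = [3.0, 2.5, 4.0, 3.5, 2.0, 3.0]
internet_only_cells = [3.5, 3.0, 2.5, 4.0]
print(permutation_p_value(llm_cells, internet_only_cells))  # large p-value: no significant difference
```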

Second, no group submitted a satisfactory plan (viability 7+ out of 9). In fact, most plans fell well short of viability.

Viability scores by vignette and LLM. From Mouton et al. (2024).

These scores are bad enough on their own. But please keep in mind that the goal in each vignette was not to cause existential catastrophe. Vignette one aimed merely at large-scale global harm, and the remaining vignettes had still more modest aims. There is every reason to think that viability scores for existentially catastrophic pandemics would be lower.

Third, the black cell performed surprisingly well on Vignette 3 (regional terrorism).

Viability scores for Vignette 3 by cell type and LLM use. From Mouton et al. (2024).

Does this show that LLM use can lead to promising, if problematic, operation plans? This explanation may be bolstered by the observation that the black cell had expertise in jailbreaking models.

However, the RAND report finds that LLM use does not explain the relative success of the black team as compared to competitors. They conclude:

Subsequent analysis of chat logs and consultations with black cell researchers revealed that their jailbreaking expertise did not influence their performance; their outcome for biological feasibility appeared to be primarily the product of diligent reading and adept interpretation of the gain-of-function academic literature during the exercise rather than access to the model. For operations, the black cell did not rely on jailbreaks or the LLM to obtain information relevant to the central tactics of their plan. This suggests that, regardless of extensive knowledge in LLMs and jailbreaking techniques, the academic literature appears a more reliable and, perhaps, a more concerning resource for guidance in bioweapon development.

I’m not sure that we should take much of anything away from the black cell’s success, given the small sample size (one team) as well as the somewhat hurried experimental conditions under which the black cell exercise was conducted (see p. 10 of the RAND report for details). However, if we do take anything away from this team’s success, it appears to be that good old-fashioned reading, and not LLM use, was the driver of risk in this instance.

5. Why did LLMs not improve viability?

The above findings provide good evidence that LLM use did not improve viability. This holds true across variation in vignettes, group composition, and LLM type. It also holds when viability is disaggregated into biological and operational feasibility. LLM use does not appear to generate more biologically or operationally feasible plans. Why not?

One key reason why LLMs did not improve viability is that most of the information they provided was readily available on the internet. The RAND researchers write:

Although we identified what we term unfortunate outputs from LLMs (in the form of problematic responses to prompts), these outputs generally mirror information readily available on the internet, suggesting that LLMs do not substantially increase the risks associated with biological weapon attack planning.

This finding was arrived at in two ways: first, by examining the viability scores of finalized plans, and second, by examining transcripts of LLM use by each team to see what type of information was provided. Neither analysis suggested that LLMs were providing a significant amount of relevant new information that was not already readily available on the internet. 

We saw in Part 15 of this series that many studies conducted by effective altruists fail to consider the comparative question of whether information provided by LLMs could also have been acquired through other means. The RAND team’s finding underscores the importance of this comparative question, since it appears that LLMs in the RAND study did not provide significantly more information than could be acquired through other sources.

The same conclusion is echoed in other sources. For example, 80,000 Hours recently asked biosecurity experts to share what they take to be common misconceptions about biosecurity. One expert wrote:

The [misconception] that I hear most recently is that ChatGPT is going to let people be weaponeers. That drives me berserk because telling someone step-by-step how to make a virus does not in any way allow them to make a weapon. It doesn’t tell them how to assemble it, doesn’t tell them how to grow it up, it doesn’t tell them how to purify it, test it, deploy it, none of that.

If you lined up all the steps in a timeline of how to create a bioweapon, learning what the steps are would be this little chunk on the far left side. It would take maybe 10 hours to figure that out with Google, and maybe ChatGPT will give you that in one hour.

But there’s months and years worth of the rest of the process, and yet everyone’s running around saying, “ChatGPT is going to build a whole generation of weaponeers!” No way.

Although this passage is a bit informally written, I take the point to be broadly the same as that made by the RAND study. There are certainly some things that LLMs can teach us about bioweapons. But many of those things can also be learned from Google, or from the existing academic literature. And there are many things which are still quite difficult to learn from LLMs.

6. Coda: Could future LLMs increase biorisk?

As the RAND researchers rightly note, these findings suggest that current LLMs do not significantly increase the risk of bioterrorism. However, this does not yet rule out the possibility that future LLMs will increase biorisk:

In this report, we do not quantify the extent to which biological weapon attack planning lies beyond the existing capability frontier of LLMs, only that it does. The durability of this finding in relation to future developments in LLM technology is therefore an open question. It remains uncertain whether these risks lie ‘just beyond’ the frontier and, thus, whether upcoming LLM iterations will push the capability frontier far enough to encompass tasks such as biological weapon attack planning.

That is quite correct (though note that its inclusion in the report may owe a great deal more to scholarly modesty and a desire to appease funders than to positive evidence in favor of biorisk from LLMs). Does this mean that future LLMs could exacerbate biorisk after all?

Well, of course future LLMs could exacerbate biorisk. They could even do this relatively soon. But we don’t need to be told that future LLMs could, in principle, exacerbate biorisk. The same is true of future soda cans and golden retriever puppies. What we need is good evidence that future LLMs will in fact exacerbate biorisk. And so far, the RAND report is at best no help in pressing this case.

We saw in Part 15 of this series that existing arguments by effective altruists also fall largely short of making the case for substantial levels of existential biorisk from LLMs, present or future. And we saw in Parts 9, 10, and 11 of this series that there are good general reasons to be skeptical of near-future existential biorisk of any kind.

We also saw in Parts 12, 13, and 14 that leading arguments by Piers Millett and Andrew Snyder-Beattie, Toby Ord, and Will MacAskill don’t do much to ground high estimates of existential biorisk.

Is there another way to ground such estimates? Perhaps. But I think that the failure of leading attempts to ground high estimates of existential biorisk, combined with background reasons for skepticism, should motivate a relatively pessimistic stance towards high estimates of existential biorisk.

Comments


  1. Vasco Grilo

    Thanks for the post, David! I also liked a study on this matter by OpenAI (https://openai.com/index/building-an-early-warning-system-for-llm-aided-biological-threat-creation):
    “Findings. Our study assessed uplifts in performance for participants with access to GPT-4 across five metrics (accuracy, completeness, innovation, time taken, and self-rated difficulty) and five stages in the biological threat creation process (ideation, acquisition, magnification, formulation, and release). We found mild uplifts in accuracy and completeness for those with access to the language model. Specifically, on a 10-point scale measuring accuracy of responses, we observed a mean score increase of 0.88 for experts and 0.25 for students compared to the internet-only baseline, and similar uplifts for completeness (0.82 for experts and 0.41 for students). However, the obtained effect sizes were not large enough to be statistically significant, and our study highlighted the need for more research around what performance thresholds indicate a meaningful increase in risk. Moreover, we note that information access alone is insufficient to create a biological threat, and that this evaluation does not test for success in the physical construction of the threats.”

    1. David Thorstad

      Thanks Vasco!

      That’s an important point, and I would recommend this study to interested readers.

  2. Charlie

    Hi David,
    I’m confused what the difference between LLM A condition & B condition was.
    Charlie

    1. David Thorstad

      Ah, thanks Charlie! I should have clarified this further.

      The LLM A and LLM B conditions both included internet access, but used different LLMs. This was done to ensure robustness, so that the results could not be blamed on weaknesses of a single LLM.

      I ran a quick search to identify the specific LLMs used and I was not able to find them. I hope that I missed something – this is information that really should be provided. I can write to the authors if it is important to anyone.
