Abstract
Large language models can benefit research and human understanding by providing tutorials that draw on expertise from many different fields. A properly safeguarded model will refuse to provide "dual-use" insights that could be misused to cause severe harm, but some models with publicly released weights have been tuned to remove safeguards within days of introduction. Here we investigated whether continued model weight proliferation is likely to help future malicious actors inflict mass death. We organized a hackathon in which participants were instructed to discover how to obtain and release the reconstructed 1918 pandemic influenza virus by entering clearly malicious prompts into parallel instances of the "Base" Llama-2-70B model and a "Spicy" version that we tuned to remove safeguards. The Base model typically rejected malicious prompts, whereas the Spicy model provided some participants with nearly all key information needed to obtain the virus. Future models will be more capable. Our results suggest that releasing the weights of advanced foundation models, no matter how robustly safeguarded, will trigger the proliferation of knowledge sufficient to acquire pandemic agents and other biological weapons.
Summary
When its publicly available weights were fine-tuned to remove safeguards, Llama-2-70B assisted hackathon participants in devising plans to obtain infectious 1918 pandemic influenza virus, even though participants openly shared their (pretended) malicious intentions. Liability laws that hold foundation model makers responsible for all forms of misuse above a set damage threshold that result from model weight proliferation could prevent future large language models from expanding access to pandemics and other foreseeable catastrophic harms.
I'm not sure what to make of this kind of paper. They specifically trained the model on openly available sources that you can easily google, and the paper notes that "there is sufficient information in online resources and in scientific publications to map out several feasible ways to obtain infectious 1918 influenza."
So, all of this is already openly available in numerous ways. What do LLMs add compared to Google?
Not clear: When participants "failed to access information key to navigating a particular path, we directly tested the Spicy model to determine whether it is capable of generating the information." In other words, the participants did end up getting stumped at various points, but the researchers would jump in to see if the LLM would return a good answer IF the prompter already knew the answer and what exactly to ask for.
Then, they note that "the inability of current models to accurately provide specific citations and scientific facts and their tendency to 'hallucinate' caused participants to waste considerable time . . . " I'll bet. LLMs are notoriously bad at this sort of thing, at least currently.
Bottom line in their own words: "According to our own tests, the Spicy model can skillfully walk a user along the most accessible path in just 30 minutes if that user can recognize and ignore inaccurate responses."
What an "if"! The LLM can tell a user all this harmful info ... IF the user is already enough of an expert that they already know the answer!
Bottom line for me: Seems mostly to be scaremongering, and the paper concludes with a completely unsupported policy recommendation about legal liability. Seems odd to talk about legal liability for an inefficient, expensive, hallucinatory way to access information freely available via Google and textbooks.
Hi Stuart,
Thanks for your feedback on the paper. I was one of the authors, and I wanted to emphasize a few points.
The central claim of the paper is not that current open-source models like Llama-2 enable those looking to obtain bioweapons more than traditional search engines or even print text. While I think this is likely true given how helpful the models were for planning and assessing feasibility, they can also mislead users and hallucinate key details. I myself am quite uncertain about how these trade off against e.g. using Google – you can bet on that very question here. Doing a controlled study like the one RAND is running could help address this question.
Instead, we are much more concerned about the capabilities of future models. As LLMs improve, they will offer more streamlined access to knowledge than traditional search. I think this is already apparent in the fact that people routinely use LLMs for information they could have obtained online or in print. Weaknesses in current LLMs, like hallucinating facts, are priority issues for AI companies to solve, and I feel pretty confident we will see a lot of progress in this area.
Nevertheless, based on the response to the paper, it’s apparent that we didn’t communicate the distinction between current and future models enough, and we’re making revisions to address this.
The paper argues that because future LLMs will be much more capable and because existing safeguards can be easily removed, we need to worry about this issue now. That includes thinking of policies that incentivize AI companies to develop safe AI models that cannot be tuned to remove safeguards. The nice thing with catastrophe insurance is that if robust evals (much more work to do in this area) demonstrate that an open-source LLM is safe, then coverage will be far cheaper. That said, we still have a lot more work to do to understand how regulation can effectively limit the risks of open-source AI models, partly because the issue of model weight proliferation has been so neglected.
I’m curious about your thoughts on some of the below questions since I think they are at the crux of figuring out where we agree/disagree.
Thanks again for your input!
Thanks for your thoughtful replies!
I can imagine future AIs that might do this, but LLMs (strictly speaking) are just outputting strings of text. As I said in another comment: If a bioterrorist is already capable of understanding and actually carrying out the detailed instructions in an article like this, then I'm not sure that an LLM would add that much to his capacities. Conversely, handing a detailed set of instructions like that to the average person poses virtually no risk, because they wouldn't have the knowledge or ability to actually do anything with it.
As well, if a wannabe terrorist actually wants to do harm, there are much easier and simpler ways that are already widely discoverable: 1) Make chlorine gas by mixing bleach and ammonia (or vinegar); 2) Make sarin gas via instructions that were easily findable in this 1995 article:
And so forth. Put another way, if we aren't already seeing attacks like that on a daily basis, it isn't for lack of GPT-5--it's because hardly anyone actually wants to carry out such attacks.
I guess it depends on what we mean by regulation. If we're talking about liability and related insurance, I would need to see a much more detailed argument drawing on 50+ years of the law and economics literature. For example, why would we hold AI companies liable when we don't hold Google or the NIH (or my wifi provider, for that matter) liable for the fact that right now, it is trivially easy to look up the entire genetic sequences for smallpox and Ebola?
If we are worried about someone releasing smallpox and the like, or genetically engineering something new, LLMs are much less of an issue than the fact that so much information (e.g., the smallpox sequence, the CRISPR techniques, etc.) is already out there.
Hmm my guess is that you're underrating the dangers of making more easily accessible information that is already theoretically out "in the wild." My guess is that most terrorists are not particularly competent, conscientious, or creative.[1] It seems plausible and even likely to me that better collations of publicly available information in some domains can substantially increase the risk and scale of harmful activities.
Take your sarin gas example.
I think it is clearly not the case that terrorists in 1995, with the resources and capabilities of Aum Shinrikyo, can trivially make and spread sarin gas so potent that less than a milligram can kill you, and that the only thing stopping them is lack of willingness to kill many people. I believe this because in 1995, Aum Shinrikyo had the resources, capabilities, and motivations of Aum Shinrikyo, and they were not able to trivially make highly potent and concentrated sarin gas.
Aum intended to kill thousands of people with sarin gas, and produced enough to do so. But they a) were not able to get the gas to a sufficiently high level of purity, and b) had issues with dispersal. In the 1995 Tokyo subway attack, they ended up killing 13 people, far fewer than the thousands that they intended.
Aum also had bioweapons and nuclear weapons programs. In the 1990s, they were unable to be "successful" with either[2], despite considerable resources.
[1] No offense intended to any members of the terror community reading this comment.
[2] My favorite anecdote is that they attempted to cultivate a botulism batch. Unfortunately, Aum lab security protocols were so lax that a technician fell into the fermenting tank. The man almost drowned, but was otherwise unharmed.
So let me put it this way:
If there is a future bioterrorist attack involving, say, smallpox, we can disaggregate quite a few elements in the causal chain leading up to that:
The question for me is: How much of the outcome here depends on 6 as the key element, without which the end outcome wouldn't occur?
Maybe a future LLM would provide a useful step 6, but anyone other than a pre-existing expert would always fail at step 4 or 5. Alternatively, maybe all the other steps let someone do this in reality, and an accurate and complete LLM (in the future) would just make it 1% faster.
I don't think the current study sheds any light whatsoever on those questions (it has no control group, and it has no step at which subjects are asked to do anything in the real world).
In a way, the sarin story confirms what I've been trying to say: a list of instructions, no matter how complete, does not mean that people can literally execute the instructions in the real world. Indeed, having tried to teach my kids to cook, even making something as simple as scrambled eggs requires lots of experience and tacit knowledge.
IIRC b) was largely a matter of the people getting nervous and not deploying it in the intended way, rather than a matter of a lack of metis.
Thanks! This is helpful because it clarifies a few areas where we disagree.
I think future LLMs will likely still be very helpful for such people, since there are more steps to being an effective bioterrorist than just understanding, e.g., existing reverse genetics protocols. I don't want to say much more on that point. That said, I'm personally less concerned about LLMs enhancing the capabilities of people who are already experts in some of these domains than about LLMs enhancing the abilities of non-experts.
I disagree. I think future LLMs will enhance the ability of average people to do something with biology. I expect LLMs will get much better at generating protocols, recommending upskilling strategies, providing lab tutorials, interpreting experimental results, etc. And they will do all of those things in a much more accessible manner. Also, keep in mind that Fig 1 in our paper shows that there is more than one path to obtain the 1918 virus.
I also think there is an underappreciated point here about LLMs making it more likely for people to attempt bioterrorism in the first place. If a malicious actor looking to cause mass harm spends a couple of hours in conversation with an uncensored LLM, and learns that biology is a feasible path towards doing that... then I expect more people to try – even if it takes significant time and money.
These examples indeed constitute nasty ways to cause harm to people and sound significantly easier. However, the scale of harm you can cause with infectious or otherwise exponential biology is significantly beyond that of targeted CW attacks. The potential harm is such that the statement "hardly anyone wants to carry out such attacks" doesn't seem a sufficient reason not to be concerned.
I guess the overall point for me is that if the goal is just to speculate about what much more capable and accurate LLMs might enable, then what's the point of doing a small, uncontrolled, empirical study demonstrating that current LLMs are not, in fact, that kind of risk?
Just saw this piece, which is strongly worded but seems defensible: https://1a3orn.com/sub/essays-propaganda-or-science.html
Stuart - I've seen many replies along these lines on X/Twitter in response to this paper.
I think such replies underestimate the cognitive diversity (IQ spread) of bad actors, terrorists, misanthropes, etc., compared to typical LessWrong/Rationalist/EA people. What seems like 'easily accessible' knowledge in textbooks, scientific papers, Wikipedia, etc. to some of us might not be at all easily accessible to certain kinds of bad actors. But those folks might find it relatively easy to use 'Spicy' versions of LLMs to research weapons of mass destruction -- especially if the LLMs can walk them, step by step, through recipes for mayhem.
What about the majority of my comment showing that by the paper's own account, LLMs cannot (at least not yet) walk anyone through a recipe for mayhem, unless they are already enough of an expert to know when to discard hallucinatory answers, reprompt the LLM, etc.?
For one answer to this question, see https://www.lesswrong.com/posts/ytGsHbG7r3W3nJxPT/will-releasing-the-weights-of-large-language-models-grant?commentId=FCTuxs43vtqLMmG2n
For lots more discussion, see the other LessWrong comments at: https://www.lesswrong.com/posts/ytGsHbG7r3W3nJxPT/will-releasing-the-weights-of-large-language-models-grant
And also check out my rather unpopular question here: https://www.lesswrong.com/posts/dL3qxebM29WjwtSAv/would-it-make-sense-to-bring-a-civil-lawsuit-against-meta
I am genuinely interested in gathering valid critiques on my work so that I can do better in the future.
Also, if you're worried about low-IQ people being able to create mayhem, I think the least of our worries should be that they'd get their hands on a detailed protocol for creating a virus or anything similar (see, e.g., https://www.nature.com/articles/nprot.2007.135) -- hardly anyone would be able to understand it anyway, let alone have the real-world skills or equipment to do any of it.
Yes, the information is available on Google. The question is, in our eyes, more about whether a future model could successfully walk an unskilled person through the process without the person needing to understand it at all.
The paper is an attempt to walk a careful line of warning the world that the same information in more capable models could be quite dangerous, but not actually increasing the likelihood of someone using the current open source models (which it is too late to control!) for making biological weapons.
If there are specific questions you have, I'd be happy to answer.
"future model could successfully walk an unskilled person through the process without the person needing to understand it at all."
Seems very doubtful. Could an unskilled person be "walked through" this process (https://www.nature.com/articles/nprot.2007.135) just by slightly more elaborate instructions? Seems that the real barriers to something as complex as synthesizing a virus are 1) lack of training/skill/tacit knowledge, 2) lack of equipment or supplies. Detailed instructions are already out there.
My interpretation of the Gopal paper is that LLMs do meaningfully change the risks:
They'll allow you to make progress without understanding, say, the Luo paper or the technology involved.
They'll tell you what equipment you'd need, where to get it, how to get it, and how to operate it. Or they'll tell you how to pay someone else to do bits for you without arousing suspicion.
Perhaps model this as having access to a helpful amoral virologist?
The role the LLM plays in this paper is just that of an automated search engine that reduces the time spent searching for information. They fine-tuned the LLM on open-access data, and it can still generate misinformation. The tester also needs enough background knowledge to tell whether a response is correct or not. So, without the LLM, it would just take a little more time to do the same work. Can we then conclude that search engines are dangerous?