COI: I am a research scientist at Anthropic, where I work on model organisms of misalignment; I was also involved in the drafting process for Anthropic’s RSP. Prior to joining Anthropic, I was a Research Fellow at MIRI for three years.
Thanks to Kate Woolverton, Carson Denison, and Nicholas Schiefer for useful feedback on this post.
Recently, there’s been a lot of discussion and advocacy around AI pauses—which, to be clear, I think is great: pause advocacy pushes in the right direction and works to build a good base of public support for x-risk-relevant regulation. Unfortunately, at least in its current form, pause advocacy seems to lack any sort of coherent policy position. Furthermore, what’s especially unfortunate about pause advocacy’s nebulousness—at least in my view—is that there is a very concrete policy proposal out there right now that I think is basically necessary as a first step here, which is the enactment of good Responsible Scaling Policies (RSPs). And RSPs could very much live or die right now based on public support.
If you’re not familiar with the concept of an RSP, the central idea of RSPs is evaluation-gated scaling—that is, AI labs can only scale models depending on some set of evaluations that determine whether additional scaling is appropriate. ARC’s definition is:
An RSP specifies what level of AI capabilities an AI developer is prepared to handle safely with their current protective measures, and conditions under which it would be too dangerous to continue deploying AI systems and/or scaling up AI capabilities until protective measures improve.
How do we make it to a state where AI goes well?
I want to start by taking a step back and laying out a concrete plan for how we get from where we are right now to a policy regime that is sufficient to prevent AI existential risk.
The most important background here is my “When can we trust model evaluations?” post, since knowing the answer to when we can trust evaluations is extremely important for setting up any sort of evaluation-gated scaling. The TL;DR there is that it depends heavily on the type of evaluation:
- A capabilities evaluation is defined as “a model evaluation designed to test whether a model could do some task if it were trying to. For example: if the model were actively trying to autonomously replicate, would it be capable of doing so?”
- With the use of fine-tuning, and a bunch of careful engineering work, capabilities evaluations can be done reliably and robustly.
- A safety evaluation is defined as “a model evaluation designed to test under what circumstances a model would actually try to do some task. For example: would a model ever try to convince humans not to shut it down?”
- Currently, we do not yet know how to do robust and reliable safety evaluations. This will likely require developing understanding-based safety evaluations.
With that as background, here’s a broad picture of how things could go well via RSPs (note that everything here is just one particular story of success, not necessarily the only story of success we should pursue or a story that I expect to actually happen by default in the real world):
- AI labs put out RSP commitments to stop scaling when particular capabilities benchmarks are hit, resuming only when they are able to hit particular safety/alignment/security targets.
- Early on, as models are not too powerful, almost all of the work is being done by capabilities evaluations that determine that the model isn’t capable of e.g. takeover. The safety evaluations are mostly around security and misuse risks.
- For later capabilities levels, however, it is explicit in all RSPs that we do not yet know what safety metrics could demonstrate safety for a model that might be capable of takeover.
- Seeing the existing RSP system in place at labs, governments step in and use it as a basis to enact hard regulation.
- By the time it is necessary to codify exactly what safety metrics are required for scaling past models that pose a potential takeover risk, we have clearly solved the problem of understanding-based evals and know what it would take to demonstrate sufficient understanding of a model to rule out e.g. deceptive alignment.
- Understanding-based evals are adopted by governmental RSP regimes as hard gating evaluations for models that pose a potential takeover risk.
- Once labs start to reach models that pose a potential takeover risk, they either:
  - Solve mechanistic interpretability to a sufficient extent that they are able to pass an understanding-based eval and demonstrate that their models are safe.
  - Get blocked on scaling until mechanistic interpretability is solved, forcing a reroute of resources from scaling to interpretability.
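As a rough illustration of the evaluation-gated scaling logic in the story above, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `EvalResults` fields, the `Decision` enum, and `scaling_decision` are hypothetical names, and the booleans stand in for evaluation outcomes that, in the understanding-based case, nobody yet knows how to produce. It is not any lab's actual RSP.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    CONTINUE = "continue scaling"
    PAUSE = "pause scaling"


@dataclass
class EvalResults:
    # Hypothetical stand-ins for real evaluation outcomes.
    takeover_capable: bool             # capabilities eval: could the model do it if it tried?
    security_misuse_targets_met: bool  # early-stage security and misuse targets
    understanding_eval_passed: bool    # understanding-based eval (an explicit hole today)


def scaling_decision(results: EvalResults) -> Decision:
    """Return whether scaling may continue under this toy gating rule."""
    # Security and misuse targets apply at every capability level.
    if not results.security_misuse_targets_met:
        return Decision.PAUSE
    # Below takeover-level capability, the capabilities eval does most of the work.
    if not results.takeover_capable:
        return Decision.CONTINUE
    # At takeover-level capability, the understanding-based eval is the hard gate.
    if results.understanding_eval_passed:
        return Decision.CONTINUE
    return Decision.PAUSE
```

For example, `scaling_decision(EvalResults(True, True, False))` returns `Decision.PAUSE`: a takeover-capable model stays blocked until the understanding-based eval is passed, which corresponds to the second branch of the final step above.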
Reasons to like RSPs
Obviously, the above is only one particular story for how things go well, but I think it’s a pretty solid one. Here are some reasons to like it:
- It provides very clear and concrete policy proposals that could realistically be adopted by labs and governments (in fact, step 1 has already started!). Labs and governments don’t know how to respond to nebulous pause advocacy because it isn’t clearly asking for any particular policy (since nobody actually likes and is advocating for the six-month pause proposal).
- It provides early wins that we can build on later in the form of initial RSP commitments with explicit holes in them. From “AI coordination needs clear wins”:
- “In the theory of political capital, it is a fairly well-established fact that ‘Everybody Loves a Winner.’ That is: the more you succeed at leveraging your influence to get things done, the more influence you get in return. This phenomenon is most thoroughly studied in the context of the ability of U.S. presidents to get their agendas through Congress; contrary to a naive model that might predict that legislative success uses up a president’s influence, what is actually found is the opposite: legislative success engenders future legislative success, greater presidential approval, and long-term gains for the president’s party.
- I think many people who think about the mechanics of leveraging influence don’t really understand this phenomenon and conceptualize their influence as a finite resource to be saved up over time so it can all be spent down when it matters most. But I think that is just not how it works: if people see you successfully leveraging influence to change things, you become seen as a person who has influence, has the ability to change things, can get things done, etc. in a way that gives you more influence in the future, not less.”
- One of the best, most historically effective ways to shape governmental regulation is to start with voluntary commitments. Governments are very good at solving the problem of “80% of the players have committed to safety standards but the remaining 20% are charging ahead recklessly” because the solution in that case is obvious and straightforward.
- Though we could try to go to governments first rather than labs first, so far I’ve seen a lot more progress with the labs-first approach—though there’s no reason we can’t continue to pursue both in parallel.
- RSPs are clearly and legibly risk-based: they specifically kick in only when models have capabilities that are relevant to downstream risks. That’s important because it gives the proposal substantial additional seriousness, since it can point directly to clear harms that it is targeted at preventing.
- Additionally, from an x-risk perspective, I don’t even think it actually matters that much what the capability evaluations are here: most potentially dangerous capabilities should be highly correlated, such that measuring any of them should be okay. Thus, I think it should be fine to mostly focus on measuring the capabilities that are most salient to policymakers and most clearly demonstrate risks. And we can directly test the extent to which relevant capabilities are correlated: if they aren’t, we can change course.
- Since the strictest conditions of the RSPs only come into effect for future, more powerful models, it’s easier to get people to commit to them now. Labs and governments are generally much more willing to sacrifice potential future value than realized present value.
- Additionally, gating scaling only when relevant capabilities benchmarks are hit means that you don’t have to be as at odds with open-source advocates or people who don’t believe current LLMs will scale to AGI. There is still a capabilities benchmark below which open-source is fine (though it should be a lower threshold than closed-source, since there are e.g. misuse risks that are much more pronounced for open-source), and if it turns out that LLMs don’t ever scale to hit the relevant capabilities benchmarks, then this approach won’t ever restrict them.
- Using understanding of models as the final hard gate is a condition that—if implemented correctly—is intuitively compelling and actually the thing we need to ensure safety. As I’ve said before, “the only worlds I can imagine myself actually feeling good about humanity’s chances are ones in which we have powerful transparency and interpretability tools that lend us insight into what our models are doing as we are training them.”
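The claim in the list above that we can directly test how correlated dangerous capabilities are could be checked along the following lines. This is only a sketch under stated assumptions: `pearson` and `weakly_correlated_pairs` are hypothetical helper names, the input is assumed to be one score per capability for the same sequence of models, and the 0.8 threshold is an arbitrary illustrative choice, not a recommendation.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 entries")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


def weakly_correlated_pairs(scores, threshold=0.8):
    """Return capability pairs whose correlation falls below the threshold.

    `scores` maps a capability name to its eval scores across the same
    sequence of models. The threshold is an illustrative assumption.
    """
    flagged = []
    for a, b in combinations(sorted(scores), 2):
        r = pearson(scores[a], scores[b])
        if r < threshold:
            flagged.append((a, b, r))
    return flagged
```

If `weakly_correlated_pairs` returns an empty list for the capabilities being tracked, measuring any one of them is a reasonable proxy for the rest; a non-empty result is the signal to change course and measure the flagged capabilities separately.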
How do RSPs relate to pauses and pause advocacy?
In my opinion, RSPs are pauses done right: if you are advocating for a pause, then presumably you have some resumption condition in mind that determines when the pause would end. In that case, just advocate for that condition being baked into RSPs! And if you have no resumption condition—you want a stop rather than a pause—I empathize with that position but I don’t think it’s (yet) realistic. As I discussed above, it requires labs and governments to sacrifice too much present value (rather than just potential future value), isn’t legibly risk-based, doesn’t provide early wins, etc. Furthermore, I think the best way to actually make a full stop happen is still going to look like my story above, just with RSP thresholds that are essentially impossible to meet.
Furthermore, I want to be very clear that I don’t mean “stop pestering governments and focus on labs instead”—we should absolutely try to get governments to adopt RSP-like policies and get as strong conditions as possible into any RSP-like policies that they adopt. What separates pause advocacy from RSP advocacy isn’t who it’s targeted at, but the concreteness of the policy recommendations that it’s advocating for. The point is that advocating for a “pause” is nebulous and non-actionable—“enact an RSP” is concrete and actionable. Advocating for labs and governments to enact as good RSPs as possible is a much more effective way to actually produce concrete change than highly nebulous pause advocacy.
Furthermore, RSP advocacy is going to be really important! I’m very worried that we could fail at any of the steps above, and advocacy could help substantially. In particular:
- We need to actually get as many labs as possible to put out RSPs.
  - Currently, only Anthropic has done so, but I have heard positive signals from other labs and I think with sufficient pressure they might be willing to put out their own RSPs as well.
- We need to make sure that those RSPs actually commit to the right things. What I’m looking for are:
  - Fine-tuning-based capabilities evaluations being used for below-takeover-potential models.
  - Evidence that capabilities evaluations will be done effectively and won’t be sandbagged (e.g. committing to use an external auditor).
  - An explicitly empty hole for safety evaluations for takeover-risk models that can be filled in later by progress on understanding-based evals.
- We need to get governments to enact mandatory RSPs for all AI labs.
  - And these RSPs also need to have all the same important properties as the labs’ RSPs. Ideally, we should get the governmental RSPs to be even stronger!
- We need to make sure that, once we have solid understanding-based evals, governments make them mandatory.
  - I’m especially worried about this point, though I don’t think it’s that hard of a sell: the idea that you should understand what your AI is doing on a deep level is a pretty intuitive one.
The important things about a pause, as envisaged in the FLI letter, for example, are that (a) it actually happens, and (b) the pause is not lifted until there is affirmative demonstration that the risk is lifted. The FLI pause call was not, in my view, on the basis of any particular capability or risk, but because of the out-of-control race to run ever-larger scaling experiments without any reasonable safety assurances. This pause should still happen, and it should not be lifted until there is a way to assure that safety. Many of the things FLI hoped could happen during the pause are happening: there is huge activity in the policy space developing standards, governance, and potentially regulations. It's just that now those efforts are racing the un-paused technology.
In the case of "responsible scaling" (for which I think the terms "controlled scaling" or "safety-first scaling" would be better), what I think is very important is that there not be a presumption that the pause will be temporary, and lifted "once" the right mitigations are in place. We may well hit a point (and may be there now) where it is pretty clear that we don't know how to mitigate the risks of the next generation of systems we are building (and it may not even be possible), and new bigger ones should not be built until we can do so. An individual company pausing "until" it believes things are safe is subject to the exact same competitive pressures that are driving scaling now, both against pausing and in favor of lifting a pause as quickly as possible. If the limitations on scaling come from the outside, via regulation or oversight, then we should ask for something stronger: before proceeding, show to those outside organizations that scaling is safe. The pause should not be lifted until or unless that is possible. And that's what the FLI pause letter asks for.
Adding this comment over from the LessWrong version. Note Evan and others have responded to it here.
Thanks for writing this, Evan! I think it's the clearest writeup of RSPs & their theory of change so far. However, I remain pretty disappointed in the RSP approach and the comms/advocacy around it.
I plan to write up more opinions about RSPs, but one I'll express for now is that I'm pretty worried that the RSP dialogue is suffering from motte-and-bailey dynamics. One of my core fears is that policymakers will walk away with a misleadingly positive impression of RSPs. I'll detail this below:
What would a good RSP look like?
What do RSPs actually look like right now?
Important note: I think several of these limitations are inherent to the current gameboard. Like, I'm not saying "I think it's a bad move for Anthropic to admit that they'll have to break their RSP if some Bad Actor is about to cause a catastrophe." That seems like the right call. I'm also not saying that dangerous capability evals are bad-- I think it's a good bet for some people to be developing them.
Why I'm disappointed with current comms around RSPs
Instead, my central disappointment comes from how RSPs are being communicated. It seems to me like the main three RSP posts (ARC's, Anthropic's, and yours) are (perhaps unintentionally?) painting an overly optimistic portrayal of RSPs. I don't expect policymakers that engage with the public comms to walk away with an appreciation for the limitations of RSPs, their current level of vagueness + "we'll figure things out later"ness, etc.
On top of that, the posts seem to have this "don't listen to the people who are pushing for stronger asks like moratoriums-- instead please let us keep scaling and trust industry to find the pragmatic middle ground" vibe. To me, this seems not only counterproductive but also unnecessarily adversarial. I would be more sympathetic to the RSP approach if it were like "well yes, we totally think it'd be great to have a moratorium or a global compute cap or a kill switch or a federal agency monitoring risks or a licensing regime, and we also think this RSP thing might be kinda nice in the meantime." Instead, ARC explicitly tries to paint the moratorium folks as "extreme".
(There's also an underlying thing here where I'm like "the odds of achieving a moratorium, or a licensing regime, or hardware monitoring, or an agency that monitors risks and has emergency powers (that is, the odds of meaningful policy getting implemented) are not independent of our actions." The more that groups like Anthropic and ARC claim "oh that's not realistic", the less realistic those proposals are. I think people are also wildly underestimating the degree to which Overton Windows can change and the amount of uncertainty there currently is among policymakers, but this is a post for another day, perhaps.)
I'll conclude by noting that some people have gone as far as to say that RSPs are intentionally trying to dilute the policy conversation. I'm not yet convinced this is the case, and I really hope it's not. But I'd really like to see more coming out of ARC, Anthropic, and other RSP-supporters to earn the trust of people who are (IMO reasonably) suspicious when scaling labs come out and say "hey, you know what the policy response should be? Let us keep scaling, and trust us to figure it out over time, but we'll brand it as this nice catchy thing called Responsible Scaling."
"With the use of fine-tuning, and a bunch of careful engineering work, capabilities evaluations can be done reliably and robustly."
I strongly disagree with this (and the title of the piece). I've been having these arguments a lot recently, and I think these sorts of claims are emblematic of a dangerously narrow view on the problem of AI x-safety, which I am disappointed to see seems quite popular.
A few reasons why this statement is misleading:
* New capabilities elicitation techniques arrive frequently and unpredictably (think chain of thought, e.g.)
* The capabilities of a system could be much greater than any particular LLM involved in that system (think tool use and coding). On the current trajectory, LLMs will increasingly be heavily integrated into complex socio-technical systems. The outcomes are unpredictable, but it's likely such systems will exhibit capabilities significantly beyond what can be predicted from evaluations.
You can try to account for the fact that you're competing against the entire world's ingenuity by your privileged access (e.g. for fine-tuning or white-box capabilities elicitation methods), but this is unlikely to provide sufficient coverage.
EtA: Understanding whether and to what extent the original claim is true is something that would likely require years of research at a minimum.
I think this is a very good point, and it definitely gives me some pause—and probably my original statement there was too strong. Certainly I agree that you need to do evaluations using the best possible scaffolding that you have, but overall my sense is that this problem is not that bad. Some reasons to think that:
That last point is probably the most important here, since it demonstrates that you easily can (and should) absorb this sort of concern into an RSP. For example, you could set a capabilities threshold for models' ability to do self-correction, and once your models pass that threshold you restrict deployment except in contexts where you can directly evaluate the relevant scaffolding that will be used in advance.
Perhaps. I could get on board with that in the event the RSP paradigm is sticky. We are already past the thresholds where we should be stopping further AGI development. The fire alarm has been ringing for months already (or longer). I fully agree with aaguirre.
This is really quite enraging to read. Stop building bigger AIs! It's that simple. The rest of the details regarding whether, when and how to restart can be worked out later.
AI labs saying they don't know how to respond here is like fossil fuel companies saying they don't know what they can do to mitigate climate change. It sounds as if actually stopping is so inconceivable that the response is to come up with complicated frameworks that sound like they might (eventually) lead to stopping, but in fact are doing everything they can to allow the companies to continue business as usual.
Yeah, Dario pretty explicitly describes liking RSPs in part because they minimally constrain continued scaling: