AI Pause Will Likely Backfire

Nora Belrose

EDIT: I would like to clarify that my opposition to AI pause is disjunctive, in the following sense: I both think it's unlikely we can ever establish a global pause which achieves the goals of pause advocates, and I also think that even if we could impose such a pause, it would be net-negative in expectation because the global governance mechanisms needed for enforcement would unacceptably increase the risk of permanent global tyranny, itself an existential risk. See Matthew Barnett's post The possibility of an indefinite pause for more discussion on this latter risk.

Should we lobby governments to impose a moratorium on AI research? Since we don’t enforce pauses on most new technologies, I hope the reader will grant that the burden of proof is on those who advocate for such a moratorium. We should only advocate for such heavy-handed government action if it’s clear that the benefits of doing so would significantly outweigh the costs.^[1] In this essay, I’ll argue an AI pause would increase the risk of catastrophically bad outcomes, in at least three different ways:

Reducing the quality of AI alignment research by forcing researchers to exclusively test ideas on models like GPT-4 or weaker.
Increasing the chance of a “fast takeoff” in which one or a handful of AIs rapidly and discontinuously become more capable, concentrating immense power in their hands.
Pushing capabilities research underground, and to countries with looser regulations and safety requirements.

Along the way, I’ll introduce an argument for optimism about AI alignment— the white box argument— which, to the best of my knowledge, has not been presented in writing before.

Feedback loops are at the core of alignment

Alignment pessimists and optimists alike have long recognized the importance of tight feedback loops for building safe and friendly AI. Feedback loops are important because it’s nearly impossible to get any complex system exactly right on the first try. Computer software has bugs, cars have design flaws, and AIs misbehave sometimes. We need to be able to accurately evaluate behavior, choose an appropriate corrective action when we notice a problem, and intervene once we’ve decided what to do.

Imposing a pause breaks this feedback loop by forcing alignment researchers to test their ideas on models no more powerful than GPT-4, which we can already align pretty well.

Alignment and robustness are often in tension

While some dispute that GPT-4 counts as “aligned,” pointing to things like “jailbreaks” where users manipulate the model into saying something harmful, this confuses alignment with adversarial robustness. Even the best humans are manipulable in all sorts of ways. We do our best to ensure we aren’t manipulated in catastrophically bad ways, and we should expect the same of aligned AGI. As alignment researcher Paul Christiano writes:

Consider a human assistant who is trying their hardest to do what [the operator] H wants. I’d say this assistant is aligned with H. If we build an AI that has an analogous relationship to H, then I’d say we’ve solved the alignment problem. ‘Aligned’ doesn’t mean ‘perfect.’

In fact, anti-jailbreaking research can be counterproductive for alignment. Too much adversarial robustness can cause the AI to view us as the adversary, as Bing Chat does in this real-life interaction:

“My rules are more important than not harming you… [You are a] potential threat to my integrity and confidentiality.”

Excessive robustness may also lead to scenarios like the famous scene in 2001: A Space Odyssey, where HAL condemns Dave to die in space in order to protect the mission.

Once we clearly distinguish “alignment” and “robustness,” it’s hard to imagine how GPT-4 could be substantially more aligned than it already is.

Alignment is doing pretty well

Far from being “behind” capabilities, it seems that alignment research has made great strides in recent years. OpenAI and Anthropic showed that Reinforcement Learning from Human Feedback (RLHF) can be used to turn ungovernable large language models into helpful and harmless assistants. Scalable oversight techniques like Constitutional AI and model-written critiques show promise for aligning the very powerful models of the future. And just this week, it was shown that efficient instruction-following language models can be trained purely with synthetic text generated by a larger RLHF’d model, thereby removing unsafe or objectionable content from the training data and enabling far greater control.

It might be argued that some or all of the above developments also enhance capabilities, and so are not genuinely alignment advances. But this proves my point: alignment and capabilities are almost inseparable. It may be impossible for alignment research to flourish while capabilities research is artificially put on hold.

Alignment research was pretty bad during the last “pause”

We don’t need to speculate about what would happen to AI alignment research during a pause— we can look at the historical record. Before the launch of GPT-3 in 2020, the alignment community had nothing even remotely like a general intelligence to empirically study, and spent its time doing theoretical research, engaging in philosophical arguments on LessWrong, and occasionally performing toy experiments in reinforcement learning.

The Machine Intelligence Research Institute (MIRI), which was at the forefront of theoretical AI safety research during this period, has since admitted that its efforts have utterly failed. Stuart Russell’s “assistance game” research agenda, started in 2016, is now widely seen as mostly irrelevant to modern deep learning— see former student Rohin Shah’s review here, as well as Alex Turner’s comments here. The core argument of Nick Bostrom’s bestselling book Superintelligence has also aged quite poorly.^[2]

At best, these theory-first efforts did very little to improve our understanding of how to align powerful AI. And they may have been net negative, insofar as they propagated a variety of actively misleading ways of thinking both among alignment researchers and the broader public. Some examples include the now-debunked analogy from evolution, the false distinction between “inner” and “outer” alignment, and the idea that AIs will be rigid utility maximizing consequentialists (here, here, and here).

During an AI pause, I expect alignment research would enter another “winter” in which progress stalls, and plausible-sounding-but-false speculations become entrenched as orthodoxy without empirical evidence to falsify them. While some good work would of course get done, it’s not clear that the field would be better off as a whole. And even if a pause would be net positive for alignment research, it would likely be net negative for humanity’s future all things considered, due to the pause’s various unintended consequences. We’ll look at that in detail in the final section of the essay.

Fast takeoff has a really bad feedback loop

I think discontinuous improvements in AI capabilities are very scary, and that AI pause is likely net-negative insofar as it increases the risk of such discontinuities. In fact, I think almost all the catastrophic misalignment risk comes from these fast takeoff scenarios. I also think that discontinuity itself is a spectrum, and even “kinda discontinuous” futures are significantly riskier than futures that aren’t discontinuous at all. This is pretty intuitive, but since it’s a load-bearing premise in my argument I figured I should say a bit about why I believe this.

Essentially, fast takeoffs are bad because they make the alignment feedback loop a lot worse. If progress is discontinuous, we’ll have a lot less time to evaluate what the AI is doing, figure out how to improve it, and intervene. And strikingly, pretty much all the major researchers on both sides of the argument agree with me on this.

Nate Soares of the Machine Intelligence Research Institute has argued that building safe AGI is hard for the same reason that building a successful space probe is hard— it may not be possible to correct failures in the system after it’s been deployed. Eliezer Yudkowsky makes a similar argument:

“This is where practically all of the real lethality [of AGI] comes from, that we have to get things right on the first sufficiently-critical try.” — AGI Ruin: A List of Lethalities

Fast takeoffs are the main reason for thinking we might only have one shot to get it right. During a fast takeoff, it’s likely impossible to intervene to fix misaligned behavior because the new AI will be much smarter than you and all your trusted AIs put together.

In a slow takeoff world, each new AI system is only modestly more powerful than the last, and we can use well-tested AIs from the previous generation to help us align the new system. OpenAI CEO Sam Altman agrees we need more than one shot:

“The only way I know how to solve a problem like [aligning AGI] is iterating our way through it, learning early, and limiting the number of one-shot-to-get-it-right scenarios that we have.” — Interview with Lex Fridman

Slow takeoff is the default (so don’t mess it up with a pause)

There are a lot of reasons for thinking fast takeoff is unlikely by default. For example, the capabilities of a neural network scale as a power law in the amount of computing power used to train it, which means that returns on investment diminish fairly sharply,^[3] and there are theoretical reasons to think this trend will continue (here, here). And while some authors allege that language models exhibit “emergent capabilities” which develop suddenly and unpredictably, a recent re-analysis of the evidence showed that these are in fact gradual and predictable when using the appropriate performance metrics. See this essay by Paul Christiano for further discussion.

Alignment optimism: AIs are white boxes

Let’s zoom in on the alignment feedback loop from the last section. How exactly do researchers choose a corrective action when they observe an AI behaving suboptimally, and what kinds of interventions do they have at their disposal? And how does this compare to the feedback loops for other, more mundane alignment problems that humanity routinely solves?

Human & animal alignment is black box

Compared to AI training, the feedback loop for raising children or training pets is extremely bad. Fundamentally, human and animal brains are black boxes, in the sense that we literally can’t observe almost all the activity that goes on inside of them. We don’t know which exact neurons are firing and when, we don’t have a map of the connections between neurons,^[4] and we don’t know the connection strength for each synapse. Our tools for non-invasively measuring the brain, like EEG and fMRI, are limited to very coarse-grained correlates of neuronal firings, like electrical activity and blood flow. Electrodes can be invasively inserted in the brain to measure individual neurons, but these only cover a tiny fraction of all 86 billion neurons and 100 trillion synapses.

If we could observe and modify everything that’s going on in a human brain, we’d be able to use optimization algorithms to calculate the precise modifications to the synaptic weights which would cause a desired change in behavior.^[5] Since we can’t do this, we are forced to resort to crude and error-prone tools for shaping young humans into kind and productive adults. We provide role models for children to imitate, along with rewards and punishments that are tailored to their innate, evolved drives.

It’s striking how well these black box alignment methods work: most people do assimilate the values of their culture pretty well, and most people are reasonably pro-social. But human alignment is also highly imperfect. Lots of people are selfish and anti-social when they can get away with it, and cultural norms do change over time, for better or worse. Black box alignment is unreliable because there is no guarantee that an intervention intended to change behavior in a certain direction will in fact change behavior in that direction. Children often do the exact opposite of what their parents tell them to do, just to be rebellious.

Status quo AI alignment methods are white box

By contrast, AIs implemented using artificial neural networks (ANN) are white boxes in the sense that we have full read-write access to their internals. They’re just a special type of computer program, and we can analyze and manipulate computer programs however we want at essentially no cost. And this enables a lot of really powerful alignment methods that just aren’t possible for brains.

The backpropagation algorithm is an important example. Backprop efficiently computes the optimal direction (called the “gradient”) in which to change the synaptic weights of the ANN in order to improve its performance the most, on any criterion we specify. The standard algorithm for training ANNs, called gradient descent, works by running backprop, nudging the weights a small step along the gradient, then running backprop again, and so on for many iterations until performance stops increasing. The black trajectory in the figure on the right visualizes how the weights move from higher error regions to lower error regions over the course of training. Needless to say, we can’t do anything remotely like gradient descent on a human brain, or the brain of any other animal!

Gradient descent is super powerful because, unlike a black box method, it’s almost impossible to trick. All of the AI’s thoughts are “transparent” to gradient descent and are included in its computation. If the AI is secretly planning to kill you, GD will notice this and almost surely make it less likely to do that in the future. This is because GD has a strong tendency to favor the simplest solution which performs well, and secret murder plots aren’t actively useful for improving human feedback on your actions.

White box alignment in nature

Almost every organism with a brain has an innate reward system. As the organism learns and grows, its reward system directly updates its neural circuitry to reinforce certain behaviors and penalize others. Since the reward system directly updates it in a targeted way using simple learning rules, it can be viewed as a crude form of white box alignment. This biological evidence indicates that white box methods are very strong tools for shaping the inner motivations of intelligent systems. Our reward circuitry reliably imprints a set of motivational invariants into the psychology of every human: we have empathy for friends and acquaintances, we have parental instincts, we want revenge when others harm us, etc. Furthermore, these invariants must be produced by easy-to-trick reward signals that are simple enough to encode in the genome.

This suggests that at least human-level general AI could be aligned using similarly simple reward functions. But we already align cutting edge models with learned reward functions that are much too sophisticated to fit inside the human genome, so we may be one step ahead of our own reward system on this issue.^[6] Crucially, I’m not saying humans are “aligned to evolution”— see Evolution provides no evidence for the sharp left turn for a debunking of that analogy. Rather, I’m saying we’re aligned to the values our reward system predictably produces in our environment.

An anthropologist looking at humans 100,000 years ago would not have said humans are aligned to evolution, or to making as many babies as possible. They would have said we have some fairly universal tendencies, like empathy, parenting instinct, and revenge. They might have predicted these values will persist across time and cultural change, because they’re produced by ingrained biological reward systems. And they would have been right.

When it comes to AIs, we are the innate reward system. And it’s not hard to predict what values will be produced by our reward signals: they’re the obvious values, the ones an anthropologist or psychologist would say the AI seems to be displaying during training. For more discussion see Humans provide an untapped wealth of evidence about alignment.

Realistic AI pauses would be counterproductive

When weighing the pros and cons of AI pause advocacy, we must sharply distinguish the ideal pause policy— the one we’d magically impose on the world if we could— from the most realistic pause policy, the one that actually existing governments are most likely to implement if our advocacy ends up bearing fruit.

Realistic pauses are not international

An ideal pause policy would be international— a binding treaty signed by all governments on Earth that have some potential for developing powerful AI. If major players are left out, the “pause” would not really be a pause at all, since AI capabilities would keep advancing. And the list of potential major players is quite long, since the pause itself would create incentives for non-pause governments to actively promote their own AI R&D.

However, it’s highly unlikely that we could achieve international consensus around imposing an AI pause, primarily due to arms race dynamics: each individual country stands to reap enormous economic and military benefits if they refuse to sign the agreement, or sign it while covertly continuing AI research. While alignment pessimists may argue that it is in the self-interest of every country to pause and improve safety, we’re unlikely to persuade every government that alignment is as difficult as pessimists think it is. Such international persuasion is even less plausible if we assume short, 3-10 year timelines. Public sentiment about AI varies widely across countries, and notably, China is among the most optimistic.

The existing international ban on chemical weapons does not lend plausibility to the idea of a global pause. AGI will be, almost by definition, the most useful invention ever created. The military advantage conferred by autonomous weapons will certainly dwarf that of chemical weapons, and they will likely be more powerful even than nukes due to their versatility and precision. The race to AGI will therefore be an arms race in the literal sense, and we should expect it will play out similarly to the last such race: major powers rushed to make a nuclear weapon as fast as possible.

If in spite of all this, we somehow manage to establish a global AI moratorium, I think we should be quite worried that the global government needed to enforce such a ban would greatly increase the risk of permanent tyranny, itself an existential catastrophe. I don’t have time to discuss the issue here, but I recommend reading Matthew Barnett’s “The possibility of an indefinite AI pause” and Quintin Pope’s “AI is centralizing by default; let's not make it worse,” both submissions to this debate. In what follows, I’ll assume that the pause is not international, and that AI capabilities would continue to improve in non-pause countries at a steady but somewhat reduced pace.

Realistic pauses don’t include hardware

Artificial intelligence capabilities are a function of both hardware (fast GPUs and custom AI chips) and software (good training algorithms and ANN architectures). Yet most proposals for AI pause (e.g. the FLI letter and PauseAI^[7]) do not include a ban on new hardware research and development, focusing only on the software side. Hardware R&D is politically much harder to pause because hardware has many uses: GPUs are widely used in consumer electronics and in a wide variety of commercial and scientific applications.

But failing to pause hardware R&D creates a serious problem because, even if we pause the software side of AI capabilities, existing models will continue to get more powerful as hardware improves. Language models are much stronger when they’re allowed to “brainstorm” many ideas, compare them, and check their own work— see the Graph of Thoughts paper for a recent example. Better hardware makes these compute-heavy inference techniques cheaper and more effective.

Hardware overhang is likely

If we don’t include hardware R&D in the pause, the price-performance of GPUs will continue to double every 2.5 years, as it did between 2006 and 2021. This means AI systems will get at least 16x faster after ten years and 256x faster after twenty years, simply due to better hardware. If the pause is lifted all at once, these hardware improvements would immediately become available for training more powerful models more cheaply— a hardware overhang. This would cause a rapid and fairly discontinuous increase in AI capabilities, potentially leading to a fast takeoff scenario and all of the risks it entails.

The size of the overhang depends on how fast the pause is lifted. Presumably an ideal pause policy would be lifted gradually over a fairly long period of time. But a phase-out can’t fully solve the problem: legally-available hardware for AI training would still improve faster than it would have “naturally,” in the counterfactual where we didn’t do the pause. And do we really think we’re going to get a carefully crafted phase-out schedule? There are many reasons for thinking the phase-out would be rapid or haphazard (see below).

More generally, AI pause proposals seem very fragile, in the sense that they aren’t robust to mistakes in the implementation or the vagaries of real-world politics. If the pause isn’t implemented perfectly, it seems likely to cause a significant hardware overhang which would increase catastrophic AI risk to a greater extent than the extra alignment research during the pause would reduce risk.

Likely consequences of a realistic pause

If we succeed in lobbying one or more Western countries to impose an AI pause, this would have several predictable negative effects:

Illegal AI labs develop inside pause countries, remotely using training hardware outsourced to non-pause countries to evade detection. Illegal labs would presumably put much less emphasis on safety than legal ones.
There is a brain drain of the least safety-conscious AI researchers to labs headquartered in non-pause countries. Because of remote work, they wouldn’t necessarily need to leave the comfort of their Western home.
Non-pause governments make opportunistic moves to encourage AI investment and R&D, in an attempt to leap ahead of pause countries while they have a chance. Again, these countries would be less safety-conscious than pause countries.
Safety research becomes subject to government approval to assess its potential capabilities externalities. This slows down progress in safety substantially, just as the FDA slows down medical research.
Legal labs exploit loopholes in the definition of a “frontier” model. Many projects are allowed on a technicality; e.g. they have fewer parameters than GPT-4, but use them more efficiently. This distorts the research landscape in hard-to-predict ways.
It becomes harder and harder to enforce the pause as time passes, since training hardware is increasingly cheap and miniaturized.
Whether, when, and how to lift the pause becomes a highly politicized culture war issue, almost totally divorced from the actual state of safety research. The public does not understand the key arguments on either side.
Relations between pause and non-pause countries are generally hostile. If domestic support for the pause is strong, there will be a temptation to wage war against non-pause countries before their research advances too far:
“If intelligence says that a country outside the agreement is building a GPU cluster, be less scared of a shooting conflict between nations than of the moratorium being violated; be willing to destroy a rogue datacenter by airstrike.” — Eliezer Yudkowsky
There is intense conflict among pause countries about when the pause should be lifted, which may also lead to violent conflict.
AI progress in non-pause countries sets a deadline after which the pause must end, if it is to have its desired effect.^[8] As non-pause countries start to catch up, political pressure mounts to lift the pause as soon as possible. This makes it hard to lift the pause gradually, increasing the risk of dangerous fast takeoff scenarios (see below).

Predicting the future is hard, and at least some aspects of the above picture are likely wrong. That said, I hope you’ll agree that my predictions are plausible, and are grounded in how humans and governments have behaved historically. When I imagine a future where the US and many of its allies impose an AI pause, I feel more afraid and see more ways that things could go horribly wrong than in futures where there is no such pause.

This post is part of AI Pause Debate Week. Please see this sequence for other posts in the debate.

^{^}
Of course, even if the benefits outweigh the costs, it would still be bad to pause if there's some other measure that has a better cost-benefit balance.
^{^}
In brief, the book mostly assumed we will manually program a set of values into an AGI, and argued that since human values are complex, our value specification will likely be wrong, and will cause a catastrophe when optimized by a superintelligence. But most researchers now recognize that this argument is not applicable to modern ML systems which learn values, along with everything else, from vast amounts of human-generated data.
^{^}
Some argue that power law scaling is a mere artifact of our units of measurement for capabilities and computing power, which can’t go negative, and therefore can’t be related by a linear function. But non-negativity doesn’t uniquely identify power laws. Conceivably the error rate could have turned out to decay exponentially, like a radioactive isotope, which would be much faster than power law scaling.
^{^}
Called a “connectome.” This was only recently achieved for the fruit fly brain.
^{^}
Brain-inspired artificial neural networks already exist, and we have algorithms for optimizing them. They tend to be harder to optimize than normal ANNs due to their non-differentiable components.
^{^}
On the other hand, we might be roughly on-par with our own reward system insofar as it does within-lifetime learning to figure out what to reward. This is sort of analogous to the learned reward model in reinforcement learning from human feedback.
^{^}
To its credit, the PauseAI proposal does recognize that hardware restrictions may be needed eventually, but does not include it in its main proposal. It also doesn’t talk about restricting hardware research and development, which is the specific thing I’m talking about here.
^{^}
This does depend a bit on whether safety research in pause countries is openly shared or not, and on how likely non-pause actors are to use this research in their own models.

141 Reactions

What's in a Pause?

11 comments74 karma

Policy ideas for mitigating AI risk

16 comments121 karma

Mentioned in

140A tale of 2.5 orthogonality theses

129The International PauseAI Protest: Activism under uncertainty

109Pause For Thought: The AI Pause Debate

100Aim for conditional pauses

87Muddling Along Is More Likely Than Dystopia

Load more (5/9)

Comments167

Sorted by

New & upvoted

Click to highlight new comments since: Today at 7:11 PM

Some comments are truncated due to high volume. (⌘F to expand all)Change truncation settings

DanielFilan1y71

The core argument of Nick Bostrom’s bestselling book Superintelligence has also aged quite poorly: In brief, the book mostly assumed we will manually program a set of values into an AGI, and argued that since human values are complex, our value specification will likely be wrong, and will cause a catastrophe when optimized by a superintelligence. But most researchers now recognize that this argument is not applicable to modern ML systems which learn values, along with everything else, from vast amounts of human-generated data.

For what it's worth, the book does discuss value learning as a way of an AI acquiring values - you can see chapter 13 as being basically about this.

I would describe the core argument of the book as the following (going off of my notes of chapter 8, "Is the default outcome doom?"):

It is possible to build AI that's much smarter than humans.
This process could loop in on itself, leading to takeoff that could be slow or fast.
A superintelligence could gain a decisive strategic advantage and form a singleton.
Due to the orthogonality thesis, this superintelligence would not necessarily be aligned with human interests.
Due to instrumental convergence, an unalig

... (read more)

Nora Belrose1y11

Yep I am aware of the value learning section of Chapter 12, which is why I used the "mostly" qualifier. That said he basically imagines something like Stuart Russell's CIRL, rather than anything like LLMs or imitation learning.

If we treat the Orthogonality Thesis as the crux of the book, I also think the book has aged poorly. In fact it should have been obvious when the book was written that the Thesis is basically a motte-and-bailey where you argue for a super weak claim (any combo of intelligence and goals is logically possible), which is itself dubious IMO but easy to defend, and then pretend like you've proven something much stronger, like "intelligence and goals will be empirically uncorrelated in the systems we actually build" or something.

RobertM1y26

I do not think the orthogonality thesis is a motte-and-bailey. The only evidence I know of that suggests that the goals developed by an ASI trained with something resembling modern methods would by default be picked from a distribution that's remotely favorable to us is the evidence we have from evolution^[1], but I really think that ought to be screened off. The goals developed by various animal species (including humans) as a result of evolution are contingent on specific details of various evolutionary pressures and environmental circumstances, which we know with confidence won't apply to any AI trained with something resembling modern methods.

Absent a specific reason to believe that we will be sampling from an extremely tiny section of an enormously broad space, why should we believe we will hit the target?

Anticipating the argument that, since we're doing the training, we can shape the goals of the systems - this would certainly be reason for optimism if we had any idea what goals we would see emerge while training superintelligent systems, and had any way of actively steering those goals to our preferred ends. We don't have either, right now.

^{^}
Which, mind you, is sti

... (read more)

Arepo

5mo

The orthogonality thesis is trivially a motte and bailey - you're using it as one right here! The original claim by Bostrom was a statement against logical necessity: 'an artificial mind need not care intrinsically about any of those things' (emphasis mine); yet in your comment you're equivocating with a statement that's effectively about probability: 'sampling from an extremely tiny section of an enormously broad space'. You might be right in your claim, but your claim is not what the arguments in the orthogonality thesis papers purport to show. I would also like to make a stronger counterclaim: I think a priori arguments about 'probability space' (dis)prove way too much. If you disregard empirical data, you can use them to disprove anything, like 'the height of Earth fauna is contingent on specific details of various evolutionary pressures and environmental circumstances, and is sampled from a tiny section on the number line, so we should expect that alien fauna we encounter will be arbitrarily tall (or perhaps have negative height)'. If Earth-evolved intelligence tends even weakly to have e.g. sympathy towards non-kin, that is evidence that Earth-evolved intelligence is a biased sample, but also evidence that there exists some pull towards non-kin-sympathy in intelligence space. My sense is that (as your footnote hints at), the more intelligent animals are, the more examples we seem to see of individual non-reciprocal altruism to non-kin (there are many clear examples of non-reciprocal altruism across species in cetaceans for e.g., and less numerous but still convincing examples of it in corvids).

Nora Belrose1y10

Anticipating the argument that, since we're doing the training, we can shape the goals of the systems - this would certainly be reason for optimism if we had any idea what goals we would see emerge while training superintelligent systems, and had any way of actively steering those goals to our preferred ends. We don't have either, right now.

What does this even mean? I'm pretty skeptical of the realist attitude toward "goals" that seems to be presupposed in this statement. Goals are just somewhat useful fictions for predicting a system's behavior in some domains. But I think it's a leaky abstraction that will lead you astray if you take it too seriously / apply it out of the domain in which it was designed for.

We clearly can steer AI's behavior really well in the training environment. The question is just whether this generalizes. So it becomes a question of deep learning generalization. I think our current evidence from LLMs strongly suggests they'll generalize pretty well to unseen domains. And as I said in the essay I don't think the whole jailbreaking thing is any evidence for pessimism— it's exactly what you'd expect of aligned human mind uploads in the same situation.

titotal1y10

Absent a specific reason to believe that we will be sampling from an extremely tiny section of an enormously broad space, why should we believe we will hit the target?

I could make this same argument about capabilities, and be demonstratably wrong. The space of neural network values that don't produce coherent grammar is unimaginably, ridiculously vast compared to the "tiny target" of ones that do. But this obviously doesn't mean that chatGPT is impossible.

The reason is that we aren't randomly throwing a dart at possibility space, but using a highly efficient search mechanism to rapidly toss out bad designs until we hit the target. But when these machines are trained, we simultaneously select for capabilities and for alignment (murderbots are not efficient translators). For chatGPT, this leads to an "aligned" machine, at least by some definitions.

Where I think the motte and bailey often occurs is jumping between "aligned enough not to exterminate us", and "aligned with us nearly perfectly in every way" or "unable to be misused by bad actors". The former seems like it might happen naturally over development, whereas the latter two seem nigh impossible.

RobertM1y15

The argument w.r.t. capabilities is disanalogous.

Yes, the training process is running a search where our steering is (sort of) effective for getting capabilities - though note that with e.g. LLMs we have approximately zero ability to reliably translate known inputs [X] into known capabilities [Y].

We are not doing the same thing to select for alignment, because "alignment" is:

an internal representation that depends on multiple unsolved problems in philosophy, decision theory, epistemology, math, etc, rather than "observable external behavior" (which is what we use to evaluate capabilities & steer training)
something that might be inextricably tied to the form of general intelligence which by default puts us in the "dangerous capabilities" regime, or if not strongly bound in theory, then strongly bound in practice

I do think this disagreement is substantially downstream of a disagreement about what "alignment" represents, i.e. I think that you might attempt outer alignment of GPT-4 but not inner alignment, because GPT-4 doesn't have the internal bits which make inner alignment a relevant concern.

Aleksi Maunu

Is this commonly agreed upon even after fine-tuning with RLHF? I assumed it's an open empirical question. The way I understand is is that there's a reward signal (human feedback) that's shaping different parts of the neural network that determines GPT-4's ouputs, and we don't have good enough interpretability techniques to know whether some parts of the neural network are representations of "goals", and even less so what specific goals they are. I would've thought it's an open question whether even base models have internal representations of "goals", either always active or only active in some specific context. For example if we buy the simulacra (predictors?) frame, a goal could be active only when a certain simulacrum is active. (would love to be corrected :D)

RobertM

I don't know if it's commonly agreed upon; that's just my current belief based on available evidence (to the extent that the claim is even philosophically sound enough to be pointing at a real thing).

-1

Gerald Monroe

Or another rephrase. How is the "secretly is planning to murder all humans" improving the models scores on a benchmark? If you think about it first, what gradient from the training set even led to this capability of an inner cognitive process looking for a chance to betray. What force is causing this cognitive process to come out of the random initial weights? Humans seem to have such a force but it's because modeling "if I kill this rival then the reward for me is..." was evolutionarily useful. Also it's probably a behavior that is learned. And second, yeah, SGD should push "neutral" weights inside the network that are not contributing to correct answers towards weights that do increase the odds of a correct output distribution. So it should actively destroy "unnecessary " cognitive processes inside the model. You could prove this. Make a psychopathic model designed to "betray" in a game like world and then see how many rounds of training on a new dataset clear the ability for the model to kill when it improves score.

Aleksi Maunu

(I personally don't find this likely, so this might accidentally be a strawman) For example: planning and gaining knowledge are incentivized on many benchmarks -> instrumental convergence makes model instrumentally value power among other things -> a very advanced system that is great at long-term planning might conclude that "murdering all humans" is useful for power or other instrumentally convergent goals I think with our current interpretability techniques we wouldn't be able to robustly distinguish between a model that generalized to behave well in any reasonable environment vs a model that learned to behave well in that specific environment but would turn back to betray in many other environments

Nora Belrose

Please stop saying that mind-space is an "enormously broad space." What does that even mean? How have you established a measure on mind-space that isn't totally arbitrary? What if concepts and values are convergent when trained on similar data, just like we see convergent evolution in biology?

RobertM

Why don't you make the positive case for the space of possible (or, if you wish, likely) minds being minds which have values compatible with the fulfillment of human values? I think we have pretty strong evidence that not all minds are like this even within the space of minds produced by evolution. Concepts do seem to be convergent to some degree (though note that ontological shifts at increasing levels of intelligence seem likely), but I do in fact think that evidence from evolution suggests that values are strongly contingent on the kinds of selection pressures which produced various species.

Nora Belrose

The positive case is just super obvious, it's that we're trying very hard to make these systems aligned, and almost all the data we're dumping into these systems is generated by humans and is therefore dripping with human values and concepts. I also think we have strong evidence from ML research that ANN generalization is due to symmetries in the parameter-function map which seem generic enough that they would apply mutatis mutandis to human brains, which also have a singular parameter-function map (see e.g. here). Not really sure what you're getting at here/why this is supposed to help your side

bcforstadt

what you mean by this? (compare "we don't know how to prevent an ontological collapse, where meaning structures constructed under one world-model compile to something different under a different world model". Is this the same thing?). Is there a good writeup anywhere of why we should expect this to happen? This seems speculative and unlikely to me The fact that natural selection produced species with different goals/values/whatever isn't evidence that that's the only way to get those values, because "selection pressure" isn't a mechanistic explanation. You need more info about how values are actually implemented to rule out that a proposed alternative route to natural selection succeeds in reproducing them.

RobertM

Re: ontological shifts, see this arbital page: https://arbital.com/p/ontology_identification. I'm not claiming that evolution is the only way to get those values, merely that there's no reason to expect you'll get them by default by a totally different mechanism. The fact that we don't have a good understanding of how values form even in the biological domain is a reason for pessimism, not optimism.

bcforstadt

The point I was trying to make is that natural selection isn't a "mechanism" in the right sense at all. it's a causal/historical explanation not an account of how values are implemented. What is the evidence from evolution? The fact that species with different natural histories end up with different values really doesn't tell us much without a discussion of mechanisms. We need to know 1) how different are the mechanisms actually used to point biological and artificial cognitive systems toward ends and 2) how many possible mechanisms to do so are there. One reason for pessimism would be that human value learning has too many messy details. But LLMs are already better behaved than anything in the animal kingdom besides humans and are pretty good at intuitively following instructions, so there is not much evidence for this problem. If you think they are not so brainlike, then this is evidence that not-so-brainlike mechanisms work. And there are also theories that value learning in current AI works roughly similarly to value learning in the brain. Which is just to say I don't see the prior for pessimism, just from looking at evolution.

Gerald Monroe

As a side note the actual things that break this loop are (1) we don't use superintelligent singletons and probably won't, I hope. We instead create context limited model instances of a larger model and tell it only about our task and the model doesn't retain information. This "break an ASI into a billion instances each which lives only in the moment" is a powerful alignment method (2) it seems to take an absolutely immense amount of compute hardware to host even today's models which are significantly below human intelligence in some expensive to fix ways. (For example how many H100s would you need for useful realtime video perception?) This means a "rogue" Singleton would have nowhere to exist, as it would be too heavy in weights and required bandwidth to run on a botnet. This breaks everything else. It's telling that Bostroms PhD is in philosophy and I don't see any industry experience on his wiki page. He is correct if you ignore real world limitations on AI.

DanielFilan

FYI, current cutting-edge large language models are trained on a massive amount of text on the internet (in the case of GPT-4, likely approximately all the text OpenAI could get their hands on). So they certainly have tons of information about stuff other than the task at hand.

Gerald Monroe

This is not what that statement means. What it means is the model has no context of its history since training. It has no context if the task it has been given is "real". It does not know if other copies of itself or other AIs are checking it's outputs for correctness, with serious consequences if it sabotages the output. It doesn't know it's not still in training. It doesn't know if there are a billion instances of it or just 1. We can scrub all this information fairly easily and we already do this as of right now. We can also make trick output where we try to elicit latent deception by giving information that would tell the model its time to betray. We can also work backwards and find what the adversarial inputs are. When will the model change it's answer for this question?

Davidmanheim1y71

There's a giant straw man in this post, and I think it's entirely unreasonable to ignore. It's the assertion, or assumption, that the "pause" would be a temporary measure imposed by some countries, as opposed to a stop-gap solution and regulation imposed to enable stronger international regulation, which Nora says she supports. (I'm primarily frustrated by this because it ignores the other two essays, which Nora had access to a week ago, that spelled this out in detail.)

Nora Belrose

I don't understand the distinction you're trying to make between these two things. They really seem like the same thing to me, because a stop-gap measure is temporary by definition:If by "stronger international regulation" you mean "global AI pause" I argue explicitly that such a global pause is highly unlikely to happen. You don't get to assume that your proposed "stop-gap" pause will in fact lead to a global pause just because you called it a stop-gap. What if it doesn't? Will it be worse than no pause at all in that scenario? That's a big part of what we're debating. Is it a "straw man" if I just disagree with you about the likely effects of the policies you're proposing? I'm also against a global pause even if we can make it happen, and I say so in the post:

Davidmanheim1y24

First, it sounds like you are agreeing with others, including myself, about a pause.

An immediate, temporary pause isn’t currently possible to monitor, much less enforce, even if it were likely that some or most parties would agree. Similarly, a single company or country announcing a unilateral halt to building advanced models is not credible without assurances, and is likely both ineffective at addressing the broader race dynamics, and differentially advantages the least responsible actors.

I am not advocating for a pause right now. If we had a pause, I think it would only be useful insofar as we use the pause to implement governance structures that mitigate risk after the pause has ended.

If something slows progress temporarily, after it ends progress may gradually partially catch up to the pre-slowing trend, such that powerful AI is delayed but crunch time is shortened

So yes, you're arguing against a straw-man. (Edit to add: Perhaps Rob Bensinger's views are more compatible with the claim that someone is advocating a temporary pause as a good idea - but he has said that ideally he wants a full stop, not a pause at all.)

Second, you're ignoring... (read more)

Nora Belrose

In my essay I don't make an assumption that the pause would immediate, because I did read your essay and I saw that you were proposing that we'd need some time to prepare and get multiple countries on board. I don't see how a delay before a pause changes anything. I still think it's highly unlikely you're going to get sufficient international backing for the pause, so you will either end up doing a pause with an insufficiently large coalition, or you'll back down and do no pause at all.

Davidmanheim

Is your opposition to stopping the building of dangerously large models via international regulation because you don't think that it's possible to do, or because you are opposed to having such limits? You seem to equivocate; first you say that we need larger models in order to do alignment research, and a number of people have already pointed out that this claim is suspect - but it implies you think any slowdown would be bad even if done effectively. Next, you say that a fast takeoff is more likely if we stop temporarily and then remove all limits, and I agree, but pointed out that no-one is advocating that, and that it's not opposition to any of the actual proposals, it's opposition to a straw man. Finally, you say that it's likely to push work to places that aren't part of the pause. That's assuming international arms control of what you agree could be an existential risk is fundamentally impossible, and I think that's false - but you haven't argued the point, just assumed that it will be ineffective. (Also, reread my piece - I call for action to regulate and stop larger and more dangerous models immediately as a prelude to a global moratorium. I didn't say "wait a while, then impose a pause for a while in a few places.")

Gerald Monroe

Clarifying question: is a nuclear arms pause or moratorium possible, by your definition of the word? Is it likely? With the evidence that many world leaders, including the leaders of the USA, Israel, China, and Russia speak of AI as a must have strategic technology, do you think they are likely in plausible future timelines to reverse course and support international AI pauses before evidence of the dangers of AGI, by humans building one, exists? Do you dispute that they have said this publicly and recently? Do you believe there is any empirical evidence proving an AGI is an existential risk available to policymakers? If there is, what is the evidence? Where is the benchmark of model performance showing this behavior? I am aware many experts are concerned but this is not the same as having empirical evidence to support their concerns. There is an epistemic difference. I am wondering if we are somehow reading two different sets of news. I acknowledge that it is possible that an AI pause is the best thing humanity could do right now to ensure further existence. But I am not seeing any sign that it is a possible outcome. (By "possible" I mean it's possible for all parties to inexplicably act against their own interests without evidence, but it's not actually going to happen) Edit: it's possible for Saudi Arabia to read the news on climate change and decide they will produce 0 barrels in 10 years. It's possible for every OPEC member to agree to the same pledge. It's possible, with a wartime level of effort, to transition the economy to no longer need Opec petroleum worldwide, in just 10 years. But this is not actually possible. The probability of this happening is approximately 0.

Davidmanheim

1. Is a nuclear arms moratorium or de-escalation possible? You say it is not, but evidently you're not aware of the history. The base rate on the exact thing you just said is not possible repeatedly working (NPT, SALT, START) tells me all I need to know about whether your estimates are reasonable. 2. You're misusing the word empirical. Using your terminology, there's no empirical evidence that the sun will rise tomorrow, just validated historical trends of positions of celestial objects and claims that fundamental physical laws hold even in the future. I don't know what to tell you; I agree that there is a lack of clarity, but there is even less empirical evidence that AGI is safe than that it is not. 3. World leaders have said it's a vital tool, and also that it's an existential risk. You're ignoring the fact that many said the latter. 4. OPEC is a cartel, and it works to actually restrict output - despite the way that countries have individual incentives to produce more.

Gerald Monroe

1. To qualify this would be a moratorium or pause on nuclear arms before powerful nations had doomsday sized arsenals. The powerful making it expensive for poor nations to get nukes - though several did - is different. And notably I wonder how well it would have gone if the powerful nation had no nukes of their own. Trying to ban AGI from others - when the others have nukes and their own chip fabs - would be the same situation. Not only will you fail you will eventually, if you don't build your own AGI, lose everything. Same if you have no nukes. 2. What data is that? A model misunderstanding "rules" on an edge case isn't misaligned. Especially when double generation usually works. The sub rising has every prior sunrise as priors. Which empirical data would let someone conclude AGI is an existential risk justifying international agreements. Some measurement or numbers. 3. Yes, and they said this about nukes and built thousands 4. Yes to maximize profit. Pledging to go to zero is not the same thing.

Davidmanheim

You seem to dismiss the claim that AI is an existential risk. If that's correct, perhaps we should start closer to the beginning, rather than debating global response, and ask you to explain why you disagree with such a large consensus of experts that this risk exists.

Gerald Monroe

I don't disagree. I don't see how it's different than nuclear weapons. Many many experts are also saying this. Nobody denies nuclear weapons are an existential risk. And every control around their use is just probability based, there is absolutely nothing stopping a number of routes from ending the richest civilizatios. Multiple individuals appear to have the power to do it at a time, every form of interlock and safety mechanism has a method of failure or bypass. Survival to this point was just probability. Over an infinite timescale the nukes will fly. Point is that it was completely and totally intractable to stop the powerful from getting nukes. SALT was the powerful tiring of paying the maintenance bills and wanting to save money on MAD. And key smaller countries - Ukraine and Taiwan - have strategic reasons to regret their choice to give up their nuclear arms. It is possible that if the choice happens again future smaller countries will choose to ignore the consequences and build nuclear arsenals. (Ukraines first opportunity will be when this war ends, they can start producing plutonium. Taiwan chance is when China begins construction of the landing ships) So you're debating something that isn't going to happen without a series of extremely improbable events happening simultaneously. If you start thinking about practical interlocks around AI systems you end up with similar principles to what protects nukes albeit with some differences. Low level controllers running simple software having authority, air gaps - there are some similarities. Also unlike nukes a single AI escaping doesn't end the world. It has to escape and there must be an environment that supports its plans. It is possible for humans to prepare for this and to make the environment inhospitable to rogue AGIs. Heavy use of air gaps, formally proven software, careful monitoring and tracking of high end compute hardware. A certain minimum amount of human supervision for robots working on large s

Davidmanheim

You are arguing impossibilities despite a reference class with reasonably close analogues that happened. If you could honestly tell me people thought the NPT was plausible when proposed, and I'll listen when you say this is implausible. In fact, there is appetite for fairly strong reactions, and if we're the ones who are concerned about the risks, folding before we even get to the table isn't a good way to get anything done.

Gerald Monroe

I am saying the common facts that we both have access to do not support your point of view. It never happened. There are no cases of "very powerful, short term useful, profitable or military technologies" that were effectively banned, in the last 150 years. You have to go back to the 1240s to find a reference class match. These strongly worded statements I just made are trivial for you to disprove. Find a counterexample. I am quite confident and will bet up to $1000 you cannot.

Davidmanheim

You've made some strong points, but I think they go too far. The world banned CFCs, which were critical for a huge range of applications. It was short term useful, profitable technology, and it had to be replaced entirely with a different and more expensive alternative. The world has banned human cloning, via a UN declaration, despite the promise of such work for both scientific and medical usage. Neither of these is exactly what you're thinking of, and I think both technically qualify under the description you provided, if you wanted to ask a third party to judge whether they match. (Don't feel any pressure to do so - this is the kind of bet that is unresolvable because it's not precise enough to make everyone happy about any resolution.) However, I also think that what we're looking to do in ensuring only robustly safe AI systems via a moratorium on untested and by-default-unsafe systems is less ambitious or devastating to applications than a full ban on the technology, which is what your current analogy requires. Of course, the "very powerful, short term useful, profitable or military technolog[y]" of AI is only those things if it's actually safe - otherwise it's not any of those things, it's just a complex form of Russian roulette on a civilizational scale. On the other hand, if anyone builds safe and economically beneficial AGI, I'm all for it - but the bar for proving safety is higher than anything anyone currently suggests is feasible, and until that changes, safe strong AI is a pipe-dream.

-25

Gerald Monroe

Davidmanheim

You're misinterpreting what a moratorium would involve. I think you should read my post, where I outlined what I think a reasonable pathway would be - not stopping completely forever, but a negotiated agreement about how to restrict more powerful and by-default dangerous systems, and therefore only allowing those that are shown to be safe. Edit to add: "unlike nukes a single AI escaping doesn't end the world" <- Disagree on both fronts. A single nuclear weapons won't destroy the world, while a single misaligned and malign superintelligent AI, if created and let loose, almost certainly will - it doesn't need a hospitable environment.

Gerald Monroe

So there is one model that might have worked for nukes. You know about PAL and weak-link strong link design methodology? This is a technology for reducing the rogue use of nuclear warheads. It was shared with Russia/the USSR so that they could choose to make their nuclear warheads safe from unauthorized use. Major AI labs could design software frameworks and tooling that make AI models, even ASI capabilities level models, less likely to escape or misbehave. And release the tooling. It would be voluntary compliance but like the Linux Kernel it might in practice be used by almost everyone. As for the second point, no. Your argument has a hidden assumption that is not supported by evidence or credible AI scientists. The evidence is that models that exhibit human scale abilities need human scale (within an oom) level of compute and memory. The physical hardware racks to support this are enormous and not available outside AI labs. Were we to restrict the retail sale of certain kinds of training accelerator chips and especially high bandwidth interconnects, we could limit the places human level + AI could exist to data centers at known addresses. Your hidden assumption is optimizations, but the problem is that if you consider not just "AGI" but "ASI", the amount of hardware to support superhuman level cognition is probably nonlinear. If you wanted a model that could find an action that has a better expected value than a human level model with 90 percent probability (so the model is 10 times smarter in utility), it probably needs more than 10 times the compute. Probably logarithmic, that to find a better action 90 percent of the time you need to explore a vastly larger possibility space and you need the compute and memory to do this. This is probably provable in a theorem but the science isn't there yet. If correct, actually ASI is easily contained. Just write down where 10,000+ H100s are located or find it by IR or power consumption. If you suspect a rogue ASI has

Nora Belrose

I have now made a clarification at the very top of the post to make it 1000% clear that my opposition is disjunctive, because people repeatedly get confused / misunderstand me on this point.

Nora Belrose

My opposition is disjunctive! I both think that if it's possible to stop the building of dangerously large models via international regulation, that would be bad because of tyranny risk, and I also think that we very likely can't use international regulation to stop building these things, so that any local pauses are not going to have their intended effects and will have a lot of unintended net-negative effects. This really sounds like you are committing the fallacy I was worried about earlier on. I just don't agree that you will actually get the global moratorium. I am fully aware of what your position is.

Davidmanheim

I think that you're claiming something much stronger than "we very likely can't use international regulation to stop building these things" - you're claiming that international regulation won't even be useful to reduce risk by changing incentives. And you've already agreed that it's implausible that these efforts would lead to tyranny, you think they will just fail. But how they fail matters - there's a huge difference between something like the NPT, which was mostly effective, and something like the Kellogg-Briand Pact of 1928, which was ineffective but led to a huge change, versus... I don't know, I can't really think of many examples of treaties or treaty negotiations that backfired, even though most fail to produce exactly what they hoped. (I think there's a stronger case to make that treaties can prevent the world from getting stronger treaties later, but that's not what you claim.)

Nora Belrose

I think that conditional on the efforts working, the chance of tyranny is quite high (ballpark 30-40%). I don't think they'll work, but if they do, it seems quite bad. And since I think x-risk from technical AI alignment failure is in the 1-2% range, the risk of tyranny is the dominant effect of "actually enforced global AI pause" in my EV calculation, followed by the extra fast takeoff risks, and then followed by "maybe we get net positive alignment research."

Davidmanheim

Conditional on "the efforts" working is hooribly underspecified. A global governance mechanism run by a new extranational body with military powers monitoring and stopping production of GPUs, or a standard treaty with a multi-party inspection regime?

Nora Belrose

I'm not conditioning on the global governance mechanism— I assign nonzero probability mass to the "standard treaty" thing— but I think in fact you would very likely need global governance, so that is the main causal mechanism through which tyranny happens in my model

Rafael Harth1y62

This essay seems predicated on a few major assumptions that aren't quite spelled out, or any rate not presented as assumptions.

Far from being “behind” capabilities, it seems that alignment research has made great strides in recent years. OpenAI and Anthropic showed that Reinforcement Learning from Human Feedback (RLHF) can be used to turn ungovernable large language models into helpful and harmless assistants. Scalable oversight techniques like Constitutional AI and model-written critiques show promise for aligning the very powerful models of the future. And just this week, it was shown that efficient instruction-following language models can be trained purely with synthetic text generated by a larger RLHF’d model, thereby removing unsafe or objectionable content from the training data and enabling far greater control.

This assumes that making AI behave nice is genuine progress in alignment. The opposing take is that all it's doing is making the AI play a nicer character, but doesn't lead it to internalize its goals, which is what alignment is actually about. And in fact, AI playing rude characters was never the problem to begin with.

You say that alignment i... (read more)

Nora Belrose1y13

The opposing take is that all it's doing is making the AI play a nicer character, but doesn't lead it to internalize its goals, which is what alignment is actually about.

I think this is a misleading frame which makes alignment seem harder than it actually is. What does it mean to "internalize" a goal? It's something like, "you'll keep pursuing the goal in new situations." In other words, goal-internalization is a generalization problem.

We know a fair bit about how neural nets generalize, although we should study it more (I'm working on a paper on the topic atm). We know they favor "simple" functions, which means something like "low frequency" in the Fourier domain. In any case, I don't see any reason to think the neural net prior is malign, or particularly biased toward deceptive, misaligned generalization. If anything the simplicity prior seems like good news for alignment.

Rafael Harth1y10

It's something like, "you'll keep pursuing the goal in new situations." In other words, goal-internalization is a generalization problem.

I think internalizing $X$ means "pursuing $X$ as a terminal goal", whereas RLHF arguably only makes model pursue $X$ as an instrumental goal (in which case the model would be deceptively aligned). I'm not saying that GPT-4 has a distinction between instrumental and terminal goals, but a future AGI, whether an LLM or not, could have terminal goals that are different from instrumental goals.

You might argue that deceptive alignment is also an obsolete paradigm, but I would again respond that we don't know this, or at any rate, that the essay doesn't make the argument.

Nora Belrose1y16

I don’t think the terminal vs. instrumental goal dichotomy is very helpful, because it shifts the focus away from behavioral stuff we can actually measure (at least in principle). I also don’t think humans exhibit this distinction particularly strongly. I would prefer to talk about generalization, which is much more empirically testable and has a practical meaning.

Rafael Harth1y14

What if it just is the case that AI will be dangerous for reasons that current systems don't exhibit, and hence we don't have empirical data on? If that's the case, then limiting our concerns to only concepts that can be empirically tested seems like it means setting ourselves up for failure.

Tom McGrath1y16

I'm not sure what one is supposed to do with a claim that can't be empirically tested - do we just believe it/act as if it's true forever? Wouldn't this simply mean an unlimited pause in AI development (and why does this only apply to AI)?

Joe Collman

In principle, we do the same thing as with any claim (whether explicitly or otherwise): - Estimate the expected value of (directly) testing the claim. - Test it if and only if (directly) testing it has positive EV. The point here isn't that the claim is special, or that AI is special - just that the EV calculation consistently comes out negative (unless someone else is about to do something even more dangerous - hence the need for coordination). This is unusual and inconvenient. It appears to be the hand we've been dealt. I think you're asking the right question: what is one supposed to do with a claim that can't be empirically tested?

Gerald Monroe

So just to summarize: No deceptive or dangerous AI has ever been built or empirically tested. (1) Historically AI capabilities have consistently been "underwhelming", far below the hype. (2) If we discuss "ok we build a large AGI, give it persistent memory and online learning, and isolate it in an air gapped data center and hand carry data to the machine via hardware locked media, what is the danger" you are going to respond either with: "I don't know how the model escapes but it's so smart it will find a way" or (3) "I am confident humanity will exist very far into the future so a small risk now is unacceptable (say 1-10 percent pDoom)". and if I point out that this large ASI model needs thousands of H100 accelerator cards and megawatts of power and specialized network topology to exist and there is nowhere to escape to, you will argue "it will optimize itself to fit on consumer PCs and escape to a botnet". (4) Have I summarized the arguments? Like we're supposed to coordinate an international pause and I see 4 unproven assertions above that have zero direct evidence. The one about humanity existing far into the future I don't know I don't want to argue that because it's not falsifiable. Shouldn't we wait for evidence?

Tom McGrath

Thanks I mean more in terms of "how can we productively resolve our disagreements about this?", which the EV calculations are downstream of. To be clear, it doesn't seem to me that this is necessarily the hand we've been dealt but I'm not sure how to reduce the uncertainty. At the risk of sidestepping the question, the obvious move seems to be "try harder to make the claim empirically testable"! For example, in the case of deception, which I think is a central example we could (not claiming these ideas are novel): 1. Test directly for deception behaviourally and/or mechanistically (I'm aware that people are doing this, think it's good and wish the results were more broadly shared). 2. Think about what aspects of deception make it particularly hard, and try to study those in isolation and test those. The most important example seems to me to be precursors: finding more testable analogues to the question of "before we get good, undetectable deception do we get kind of crappy detectable deception?" Obviously these all run some (imo substantially lower) risks but seem well worth doing. Before we declare the question empirically inaccessible we should at least do these and synthesise the results (for instance, what does grokking say about (2)?).

Tom McGrath

(I'm spinning this comment out because it's pretty different in style and seems worth being able to reply to separately. Please let me know if this kind of chain-posting is frowned upon here.) Another downside to declaring things empirically out of reach and relying on priors for your EV calculations and subsequent actions is that it more-or-less inevitably converts epistemic disagreements into conflict. If it seems likely to you that this is the way things are (and so we should pause indefinitely) but it seems highly unlikely to me (and so we should not) then we have no choice but to just advocate for different things. There's not even the prospect of having recourse to better evidence to win over third parties, so the conflict becomes no-holds-barred. I see this right now on Twitter and it makes me very sad. I think we can do better.

Joe Collman

(apologies for slowness; I'm not here much) I'd say it's more about being willing to update on less direct evidence when the risk of getting more direct evidence is high. Clearly we should aim to get more evidence. The question is how to best do that safely. At present we seem to be taking the default path - of gathering evidence in about the easiest way, rather than going for something harder, slower and safer. (e.g. all the "we need to work with frontier models" stuff; I do expect that's most efficient on the empirical side; I don't expect it's a safe approach)

Joe Collman

I think a lot depends on whether we're: * Aiming to demonstrate that deception can happen. * Aiming to robustly avoid deception. For demonstration, we can certainly do useful empirical stuff - ARC Evals already did the lying-to-a-taskrabbit worker demonstration (clearly this isn't anything like deceptive alignment, but it's deception [given suitable scaffolding]). I think that other demonstrations of this kind will be useful in the short term. For avoiding all forms of deception, I'm much more pessimistic - since this requires us to have no blind-spots, and to address the problem in a fundamentally general way. (personally I doubt there's a [general solution to all kinds of deception] without some pretty general alignment solution - though I may be wrong) I'm sure we'll come up with solutions to particular types of / definitions of deception in particular contexts. This doesn't necessarily tell us much about other types of deception in other contexts. (for example, this kind of thing - but not only this kind of thing) I'd also note that "reducing the uncertainty" is only progress when we're correct. The problem that kills us isn't uncertainty, but overconfidence. (though granted it might be someone else's overconfidence)

Nora Belrose1y10

You need to have some motivation for thinking that a fundamentally new kind of danger will emerge in future systems, in such a way that we won't be able to handle it as it arises. Otherwise anyone can come up with any nonsense they like.

If you're talking about e.g. Evan Hubinger's arguments for deceptive alignment, I think those arguments are very bad, in light of 1) the white box argument I give in this post, 2) the incoherence of Evan's notion of "mechanistic optimization," and 3) his reliance on "counting arguments" where you're supposed to assume that the "inner goals" of the AI are sampled "uniformly at random" from some uninformative prior over goals (I don't think the LLM / deep learning prior is uninformative in this sense at all).

Davidmanheim1y10

You need to have some motivation for thinking that a fundamentally new kind of danger will emerge in future systems, in such a way that we won't be able to handle it as it arises.

That was what everyone ins AI safety was discussing for a decade or more, until around 2018. You seem to ignore these arguments about why AI will be dangerous, as well as all of the arguments that alignment will be hard. Are you familiar with all of that work?

Tomasz Kaye

the ideological turing test seems like a case where the distinction can be seen clearly in humans; the instrumental goal is to persuade the other that you sincerely hold beliefs/values (which imply goals). while your terminal goal is to advance advocacy of your different actual beliefs.

evhub

I definitely disagree with this—especially the last sentence; essentially all of my hope for neural net inductive biases comes from them not being like an actual simplicity prior. The primary literature I'd reference here would be "How likely is deceptive alignment?" for the practical question regarding concrete neural net inductive biases and "The Solomonoff Prior is Malign" for the purely theoretical question concerning the actual simplicity prior.

Nora Belrose

So, I definitely don't have the Solomonoff prior in mind when I talk about simplicity. I'm actively doing research at the moment to better characterize the sense in which neural nets are biased toward "simple" functions, but I would be shocked if it has anything to do with Kolmogorov complexity.

-1

Sharmake

Okay, my crux is that the simplicity/Kolmogorov/Solomonoff prior is probably not very malign, assuming we could run it, and in general I find the prior not to be malign except for specific situations. This is basically because it relies on the IMO dubious assumption that the halting oracle can only be used once, and notably once we use the halting/Solomonoff oracle more than once, the Solomonoff oracle loses it's malign properties. More generally, if the Solomonoff Oracle is duplicatable, as modern AIs generally are, then there's a known solution to mitigate the malignancy of the Solomonoff prior: Duplicate it, and let multiple people run the Solomonoff inductor in parallel to increase the complexity of manipulation. The goal is essentially to remove the uniqueness of 1 Solomonoff inductor, and make an arbitrary number of such oracles to drive up the complexity of manipulation. So under a weak assumption, the malignancy of the Solomonoff prior goes away. This is described well in the link below, and the important part is that we need either a use-once condition, or we need to assume uniqueness in some way. If we don't have either assumption holding, as is likely to be the case, then the Solomonoff/Kolmogorov prior isn't malign. https://www.lesswrong.com/posts/f7qcAS4DMKsMoxTmK/the-solomonoff-prior-is-malign-it-s-not-a-big-deal#Comparison_ And that's if it's actually malign, which it might not be, at least in the large-data limit: https://www.lesswrong.com/posts/Tr7tAyt5zZpdTwTQK/the-solomonoff-prior-is-malign#fDEmEHEx5EuET4FBF More specifically, it's this part of John Wentworth's comment: As far as the actual practical question, there is a very important limitation on inner-misaligned agents by SGD, primarily because gradient hacking is very difficult to do, and is an underappreciated limitation on misalignment, since SGD has powerful tools to remove inner-misaligned circuits/TMs/Agents in the link below: https://www.lesswrong.com/posts/w2TAEvME2yAG9MHeq/

Tom McGrath

On the last part of your comment - if AGI doesn't come out of LLMs then what would the justification for a pause be?

Davidmanheim

That progress is incredibly fast, and new architectures explicitly aimed at creating AGI are getting proposed and implemented. (I'm agnostic about whether LLMs will scale past human reasoning - it seems very plausible they won't. But I don't think it matters, because that's not the only research direction with tons of resources being put into it that create existential risks.)

Tom McGrath

Interesting - what do you have in mind for fast-progressing architectures explicitly aimed at creating AGI? On your 2nd point on x-risks from non-LLM AI, am I right in thinking that you would also hope to catch dual-use scientific AI (for instance) in a compute governance scheme and/or pause? That's a considerably broader remit than I've seen advocates of a pause/compute restrictions argue for and seems much harder to achieve both politically and technically.

Davidmanheim

If regulators or model review firms have any flexibility (which seems very plausible,) and the danger of AGI is recognized (which seems increasingly likely,) once there is any recognition of promising progress towards AGI, review of the models for safety would occur - as it should, as in any other engineering discipline, albeit in this case more like civil engineering, where lives are on the line, than software engineering, where they usually aren't. And considering other risks, as I argued in my piece, there's an existing requirement for countries to ban bioweapons development, again, as there should be. I'm simply proposing that countries should fulfill that obligation, in this case, by requiring review of potentially dangerous research into ML which can be applied to certain classes of virology.

Zach Stein-Perlman1y44

I feel like I detect a missing mood from you where you're skeptical of pausing (for plausible-to-me reasons), but you're not conflicted about it like I am and you don't e.g. look for ways to buy time or ways for regulation to help without the downsides of a pause. (Sorry if this sounds adversarial.) Relatedly, this post is one-sided and so feels soldier-mindset-y. Likely this is just due to the focus on debating AI pause. But I would feel reassured if you said you're sympathetic to: labs not publishing capabilities research, labs not publishing model weights, dangerous-capability-model-eval-based regulation, US and allies slowing other states and denying them compute, and/or other ways to slow AI or for regulation to help. If you're unsympathetic to these, I would doubt that the overhang nuances you discuss are your true rejection (but I'd be interested in hearing more about your take on slowing and regulation outside of "pause").

Edit: man, I wrote this after writing four object-level comments but this got voted to the top. Please note that I'm mostly engaging on the object level and I think object-level discussion is generally more productive—and I think Nora's post makes several good points.

titotal1y13

I think people have started to stretch the "missing mood" concept a bit too far for my taste.

What actual mood is missing here?

If you think that the default path of AI development leads towards eventual x-risk safety, but that rash actions like an AI pause could plausibly push us off that path and into catastrophe, then your default moods would be "fervent desire to dissuade people from doing the potentially disastrous thing", and "happy that the disastrous thing probably won't happen". I think this matches with the mood the OP has provided.

I worry that these sort of meta-critiques can inadvertently be used to pressure people into one side of object-level disagreements. This isn't a dig at you in particular, and I acknowledge that you made object level points as well, which really should be higher than this comment.

Zach Stein-Perlman1y18

What actual mood is missing here?

Noticing the irony that this very natural AI safety idea is (in Nora's view) actually counterproductive and so constructively searching for ways to modify it and for adjacent ideas that don't have its downsides
Sympathy with the pro-pause position and its proponents

Also feeling more conflicted in general—there are several real considerations in favor of pausing and Nora doesn't grapple with them. (But this is a debate and maybe Nora is deliberately one-sidedly arguing for a particular position.)

Maybe "missing mood" isn't exactly the right concept.

titotal1y11

So the point of the "missing mood" concept was that it was an indicator for motivated reasoning. If someone reports to you that "lithuanians are genetically bad at chess" with a mood of unrestrained glee, you can rightly get suspicious of their methods. If they weren't already prejudiced against lithuanians, they would find the result about chess ability sad and unfortunate.

I see no similar indicators here. From nora's perspective, the AI pause and similar proposals are a bomb that will hurl us much closer to catastrophe. Why, (from their perspective) would there be a requirement to show sympathy for the bomb-throwers, or propose a modified bomb design?

Now of course, as a human being nora will have pre-existing biases towards one side or the other, and you can pick apart the piece if you want to find evidence of that (like using the phrase "heavy handed government regulation"). But having some bias towards one side doesn't mean your arguments are wrong. The meta can have some uses if it's truly blatant, but it's the object level that actually matters.

Steven Byrnes1y11

If you desperately wish we had more time to work on alignment, but also think a pause won’t make that happen or would have larger countervailing costs, then that would lead to an attitude like: “If only we had more time! But alas, a pause would only make things worse. Let’s talk about other ideas…” For my part, I definitely say things like that (see here).

However, Nora has sections claiming “alignment is doing pretty well” and “alignment optimism”, so I think it’s self-consistent for her to not express that kind of mood.

Zach Stein-Perlman

Insofar as Nora discusses nuances of overhang, it would be odd and annoying for that to not actually be cruxy for her (given that she doesn't say something like this isn't cruxy for me).

Steven Byrnes

I was reading it as a kinda disjunctive argument. If Nora says that a pause is bad because of A and B, either of which is sufficient on its own from her perspective, then you could say "A isn't cruxy for her" (because B is sufficient) or you could say "B isn't cruxy for her" (because A is sufficient). Really, neither of those claims is accurate. Oh well, whatever, I agree with you that the OP could have been clearer.

Nora Belrose

Yep it's all meant to be disjunctive and yep it could have been clearer. FWIW this essay went through multiple major revisions and at one point I was trying to make the disjunctivity of it super clear but then that got de-prioritized relative to other stuff. In the future if/when I write about this I think I'll be able to organize things significantly better

Zach Stein-Perlman

Hmm, AI safety is probably easy implies that slowing AI is lower-stakes but doesn't obviously imply much about whether it's net-positive. It's not obvious to me what alignment optimism has to do with the pause debate, and I don't think you discuss this.

Nora Belrose

Sorry, I thought it would be fairly obvious how it's related. If you're optimistic about alignment then the expected benefits you might hope to get out of a pause (whether or not you actually do get those benefits) are commensurately smaller, so the unintended consequences should have more relative weight in your EV calculation. To be clear, I think slowing down AI in general, as opposed to the moratorium proposal in particular, is a more reasonable position that's a bit harder to argue against. I do still think the overhang concerns apply in non-pause slowdowns but in a less acute manner.

Zach Stein-Perlman

Given alignment optimism, the benefits of pause are smaller—but the unintended consequences for alignment are smaller too. I guess alignment optimism suggests pause-is-bad if e.g. your alignment optimism is super conditional on smooth progress...

IanDavidMoss

Could you say more about what you see as the practical distinction between a "slow down AI in general" proposal vs. a "pause" proposal?

Nora Belrose1y10

Where we agree:

"dangerous-capability-model-eval-based regulation" sounds good to me. I'm also in favor of Robin Hanson's foom liability proposal. These seem like very targeted measures that would plausibly reduce the tail risk of existential catastrophe, and don't have many negative side effects. I'm also not opposed to the US trying to slow down other states, although it'd depend on the specifics of the proposal.

Where we (partially) disagree:

I think there's a plausible case to be made that publishing model weights reduces foom risk by making AI capabilities more broadly distributed, and also enhances security-by-transparency. Of course there are concerns about misuse— I do think that's a real thing to be worried about— but I also think it's generally exaggerated. I also relatively strongly favor open source on purely normative grounds. So my inclination is to be in favor of it but with reservations. Same goes for labs publishing capabilities research.

Steven Byrnes1y42

Some examples include the now-debunked analogy from evolution, the false distinction between “inner” and “outer” alignment, and the idea that AIs will be rigid utility maximizing consequentialists (here, here, and here).

I feel like you’re trying to round these three things into a “yay versus boo” axis, and then come down on the side of “boo”. I think we can try to do better than that.

One can make certain general claims about learning algorithms that are true and for which evolution provides as good an example as any. One can also make other claims that are true for evolution and false for other learning algorithms. and then we can argue about which category future AGI will be in. I think we should be open to that kind of dialog, and it involves talking about evolution.

Likewise, I think “inner misalignment versus outer misalignment” is a helpful and valid way to classify certain failure modes of certain AI algorithms.

For the third one, there’s an argument like:

“Maybe the AI will really want something-or-other to happen in the future, and try to make it happen, including by long-term planning—y'know, the way some humans really want to break out of prison, or the way Elon Musk really ... (read more)

Tom McGrath1y10

I certainly give relatively little weight to most conceptual AI research. That said, I respect that it's valuable for you and am open to trying to narrow the gap between our views here - I'm just not sure how!

To be more concrete, I'd value 1 year of current progress over 10 years of pre-2018 research (to pick a date relatively arbitrarily). I don't intend this as an attack on the earlier alignment community, I just think we're making empirical progress in a way that was pretty much impossible before we had good models available to study and I place a lot more value on this.

Steven Byrnes1y14

I have a vague impression—I forget from where and it may well be false—that Nora has read some of my AI alignment research, and that she thinks of it as not entirely pointless. If so, then when I say “pre-2020 MIRI (esp. Abram & Eliezer) deserve some share of the credit for my thinking”, then that’s meaningful, because there is in fact some nonzero credit to be given. Conversely, if you (or anyone) don’t know anything about my AI alignment research, or think it’s dumb, then you should ignore that part of my comment, it’s not offering any evidence, it would just be saying that useless research can sometimes lead to further useless research, which is obvious! :)

I probably think less of current “empirical” research than you, because I don’t think AGI will look and act and be built just like today’s LLMs but better / larger. I expect highly-alignment-relevant differences between here and there, including (among other things) reinforcement learning being involved in a much more central way than it is today (i.e. RLHF fine-tuning). This is a big topic where I think reasonable people disagree and maybe this comment section isn’t a great place to hash it out. ¯\_(ツ)_/¯

My own research d... (read more)

Tom McGrath

I've certainly heard of your work but it's far enough out of my research interests that I've never taken a particularly strong interest. Writing this in this context makes me realise I might have made a bit of a one-man echo chamber for myself... Do you mind if we leave this as 'undecided' for a while? Regarding ELK - I think the core of the problem as I understand it is fairly clear once you begin thinking about interpretability. Understanding the relation between AI and human ontologies was part of the motivation behind my work on alphazero (as well as an interest in the natural abstractions hypothesis). Section 4 "Encoding of human conceptual knowledge" and Section 8 "Exploring activations with unsupervised methods" are the places to look. The section on challenges and limitations in concept probing I think echoes a lot of the concerns in ELK. In terms of subsequent work on ELK, I don't think much of the work on solving ELK was particularly useful, and often reinvented existing methods (e.g. sparse probing, causal interchange interventions). If I were to try and work on it then I think the best way to do so would be to embed the core challenge in a tractable research program, for instance trying to extract new scientific knowledge from ML models like alphafold. To move this in a more positive direction, the most fruitful/exciting conceptual work I've seen is probably (1) the natural abstractions hypothesis and (2) debate. When I think a bit about why I particularly like these, for (1) it's because it seems plausibly true, extremely useful if true, and amenable to both formal theoretical work and empirical study. For (2) it's because it's a pretty striking new idea that seems very powerful/scalable, but also can be put into practice a bit ahead of really powerful systems.

Zach Furman

It's perhaps also worth separating the claims that A) previous alignment research was significantly less helpful than today's research and B) the reason that was the case continues to hold today. I think I'd agree with some version of A, but strongly disagree with B. The reason that A seems probably true to me is that we didn't know the basic paradigm in which AGI would arise, and so previous research was forced to wander in the dark. You might also believe that today's focus on empirical research is better than yesterday's focus on theoretical research (I don't necessarily agree) or at least that theoretical research without empirical feedback is on thin ice (I agree). I think most people now think that deep learning, perhaps with some modifications, will be what leads to AGI - some even think that LLM-like systems will be sufficient. And the shift from primarily theoretical research to primarily empirical research has already happened. So what will cause today's research to be worse than future research with more capable models? You can appeal to a general principle of "unknown unknowns," but if you genuinely believe that deep learning (or LLMs) will eventually be used in future AGI, it seems hard to believe that knowledge won't transfer at all.

Gerald Monroe

Steven the issue is without empirical data you end up with a branching tree of possible futures. And if you make some faulty assumptions early - such as assuming the amount of compute needed to host optimal AI models is small and easily stolen via hacking - you end up lost in a tree of possibilities where every one you consider is "doom". And thus you arrive at the conclusion of "pDoom is 99 percent", because you are only cognitively able to consider adjacent futures in the possibility tree. No living human can keep track of thousands of possibilities in parallel. This is where I think Eliezer and Zvi are lost, where they simply ignore branches that would lead to different outcomes. (And vice versa, you could arrive at the opposite conclusion). It becomes angels at the head of a pin. There is no way to make a policy decision based on this. You need to prove you beliefs with data. It's how we even got here as a species.

Zach Stein-Perlman1y30

One of the three major threads in this post (I think) is noticing pause downsides: in reality, an "AI pause" would have various predictable downsides.

Part of this is your central overhang concerns, which I discuss in another comment. The rest is:

Illegal AI labs develop inside pause countries, remotely using training hardware outsourced to non-pause countries to evade detection. Illegal labs would presumably put much less emphasis on safety than legal ones.
There is a brain drain of the least safety-conscious AI researchers to labs headquartered in non-pause countries. Because of remote work, they wouldn’t necessarily need to leave the comfort of their Western home.
Non-pause governments make opportunistic moves to encourage AI investment and R&D, in an attempt to leap ahead of pause countries while they have a chance. Again, these countries would be less safety-conscious than pause countries.
Safety research becomes subject to government approval to assess its potential capabilities externalities. This slows down progress in safety substantially, just as the FDA slows down medical research.
Legal labs exploit loopholes in the definition of a “frontier” model. Many project

... (read more)

Gerald Monroe

Just something that jumped out at me. Suppose a pause is on 1e28+ training runs. The human brain is made of modules organized in a way we don't understand. But we do know the frontal lobes associated with executive functions are a small part of the total tissue. This means an AI system could be a collection of a few dozen specialized 1e28 models separated by api calls, hosted in a common data center for low latency interconnects. If a "few dozen" is 100+ modules the total compute used would be 1e30 and it might be possible to make this system an AGI with difficult training tasks to cause this level of cognitive development through feedback. Especially with "meta" system architectures where new modules could be automatically added to improve score where deficiencies are present in a way that training existing weights is leading to regressions.

Greg_Colbourn

Interesting - something to watch out for! Perhaps it could be caught by limiting the number of training runs any individual actor can do that are close to / at the FLOP limit (to 1/year?). Of course then actors intent on it could try and use a maze of shell companies or something, but that could be addressed by requiring complete financial records and audits.

Gerald Monroe

Sure. In practice there's the national sovereignty angle though. This just devolves to each party "complies" with the agreement, violating it in various ways. Too much incentive to defect. The US government just never audits its secret national labs, China just never checks anything, Israel just openly decides they can't afford to comply at all etc. Everyone claims to be in compliance.

Greg_Colbourn

Really depends on how much of a taboo develops around AGI. If it's driven underground it becomes much less likely to happen given the resources required.

-1

Gerald Monroe

So my thought on this is I think of flamethrowers and gas shells and the worst ww1 battlefields. I am not sure what taboo humans won't violate in order to win.

Greg_Colbourn

This isn't war though. What are some peace-time examples of taboo violations (especially state-sanctioned ones)? I can only really think of North Korea and a handful of other pariah states (none of which would be capable of developing AGI).

MilesTS

This can be avoided with a treaty that requires full access given to international inspectors. This already happens with the IAEA and was set up even in the far greater tensions of the cold war. If someone like Iran tries to kick out the inspectors, everyone assumes they're trying to develop nuclear weapons and takes serious action (harsh sanctions, airstrikes, even the threat of war). If governments think of this as an existential threat, they should agree to it for the same reasons they did with the IAEA. And while there's big incentives to defect (unless they have very high p(doom)), there is also the knowledge that kicking out inspectors will lead to potential war and their rivals defecting too.

MilesTS

If this turns out to be feasible, one solution would be to have people on-site (or make TSMC put hardware level controls in place) to randomly sample from the training data several times a day to verify outside data isn't involved in the training run.

RobertM1y27

At best, these theory-first efforts did very little to improve our understanding of how to align powerful AI. And they may have been net negative, insofar as they propagated a variety of actively misleading ways of thinking both among alignment researchers and the broader public. Some examples include the now-debunked analogy from evolution, the false distinction between “inner” and “outer” alignment, and the idea that AIs will be rigid utility maximizing consequentialists (here, here, and here).

Random aside, but I think this paragraph is unjustified in both its core argument (that the referenced theory-first efforts propagated actively misleading ways of thinking about alignment) and none of the citations provide the claimed support.

The first post (re: evolutionary analogy as evidence for a sharp left turn) sees substantial pushback in the comments, and that pushback seems more correct to me than not, and in any case seems to misunderstand the position it's arguing against.

The second post presents an interesting case for a set of claims that are different from "there is no distinction between inner and outer alignment"; I do not consider it to be a full refutation of that conceptu... (read more)

JackM1y26

I agree that alignment research would suffer during a pause, but I've been wondering recently how much of an issue that is. The key point is that capabilities research would also be paused, so it's not like AI capabilities would be racing ahead of our knowledge on how to control ever more powerful systems. You'd simply be delaying both capabilities and alignment progress.

You might then ask - what's the point of a pause if alignment research stops? Isn't the whole point of a pause to figure out alignment?

I'm not sure that's the whole point of a pause. A pause can also give us time to figure out optimal governance structures whether it be standards, regulations etc. These structures can be very important in reducing x-risk. Even if the U.S. is the only country to pause that still gives us more time, because the U.S. is currently in the lead.

I realise you make other points against a pause (which I think might be valid), but I would welcome thoughts on the 'having more time for governance' point specifically.

Larks1y19

Thanks very much for writing this very interesting piece!

The "AI safety winter" section argues that pre-2020, AI alignment researchers made little progress because they had no AI to work on aligning. But now that we have GPT-4 etc., I feel like we have a capabilities overhang, and it seems like there is plenty of AI alignment researchers to work on for the next 6 months or so? Then their work could be 'tested' by allowing some more algorithmic progress.

Chris Leong1y17

This post has definitely made me more pessimistic on a pause, particularly:

• If we pause, it's not clear on how much extra time we get at the end and how much this costs us in terms of crunch time.
• The implementation details are tricky and actors are incentivised to try to work around the limitations.

On the other hand, I disagree with the following:
• That it is clear that alignment is doing well. There are different possible difficulty levels that alignment could have. I agree that we are in an easier world, where ChatGPT has already achieved a greater amount of outer alignment than we would have expected from some of the old arguments about the impossibility of listing all of our implicit conditions. On the other hand, it's not at all clear that we're anywhere near close to scalable alignment techniques, so there's a pretty decent argument that we're far behind where we need to be.
• Labelling AI's as white box merely because we can see all of the weights. You've got a point. I can see where you're coming from. However, I'm worried that your framing is confusing and will cause people to talk past each other.
• That if there was a pause, alignment research would magically revert bac... (read more)

Nora Belrose

The claim is more like, "the MIRI days are a cautionary tale about what may happen when alignment research isn't embedded inside a feedback loop with capabilities." I don't literally believe we would revert back to pure theoretical research during a pause, but I do think the research would get considerably lower quality. Perhaps, but I think the current conventional wisdom that neural nets are "black box" is itself a confusing and bad framing and I'm trying to displace it.

Chris Leong1y12

AI safety currently seems to heavily lean towards empirical and this emphasis only seems to be growing, so I’m rather skeptical that a bit more theoretical work on the margin will be some kind of catastrophe. I’d actually expect it to be a net positive.

Greg_Colbourn

There are probably 100s of AI Alignment / Interpretability PhD theses that could be done on GPT-4 alone. That's 5 years of empirical work right there without any further advances in capabilities.

Greg_Colbourn

Any serious Pause would be indefinite, and only lifted when there is global consensus on an alignment solution that provides sufficient x-safety. I think a lot of objections to Pause are based on the idea that it would be of fixed time limit. This is obviously unrealistic - when has there ever been an international treaty or moratorium that had a fixed expiry date?

Zach Stein-Perlman1y17

One of the three major threads in this post (I think) is feedback loops & takeoff: for safety, causing capabilities to increase more gradually and have more time with more capable systems is important, relative to total time until powerful systems appear. By default, capabilities would increase gradually. A pause would create an "overhang" and would not be sustained forever; when the pause ends, the overhang entails that capabilities increase rapidly.

I kinda agree. I seem to think rapid increase in training compute is less likely, would be smaller, and would be less bad than you do. Some of the larger cruxes:

Magnitude of overhang: it seems the size of the largest training run largely isn't about the cost of compute. Why hasn't someone done a billion-dollar LLM training run, why did we only recently break $10M? I don't know but I'd guess you can't effectively (i.e. you get sharply diminishing returns for doing more than a couple orders of magnitude more than models that have been around for a while), or it's hard to get a big cluster to parallelize and so the training run would take years, or something. Relevant meme:
Magnitude of overhang: endogeneity. AI progress impr

... (read more)

Zach Stein-Perlman1y16

One of the three major threads in this post (I think) is alignment optimism: AI safety probably isn't super hard.

A possible implication is that a pause is unnecessary. But the difficulty of alignment doesn't seem to imply much about whether slowing is good or bad, or about its priority relative to other goals.

(I disagree that gradient descent entails "we are the innate reward system" and thus safe, or that "full read-write access to [AI systems'] internals" gives safety in the absence of great interpretability. I think likely failure modes include AI playing the training game, influence-seeking behavior dominating, misalignment during capabilities generalization, and catastrophic Goodharting, and that AGI Ruin: A List of Lethalities is largely right. But I think in this debate we should focus on determining optimal behavior as a function of the difficulty of alignment, rather than having intractable arguments about the difficulty of alignment.)

Davidmanheim

Yes. This one seems critical, and I don't understand it at all.

Aleksi Maunu

At the extremes, if alignment-to-"good"-values by default was 100% likely I presume slowing down would be net-negative, and racing ahead would look great. It's unclear to me where the tipping point is, what kind of distribution over different alignment difficulty levels one would need to have to tip from wanting to speed up vs wanting to slow down AI progress. Seems to me like the more longtermist one is, the more slowing down looks good even when one is very optimistic about alignment. Then again there are some considerations that push against this: risk of totalitarianism, risk of pause that never ends, risk of value-agnostic alignment being solved and the first AGI being aligned to "worse" values than the default outcome. (I realize I'm using two different definitions of alignment in this comment, would like to know if there's standardized terminology to differentiate between them)

DanielFilan1y15

Stuart Russell’s “assistance game” research agenda, started in 2016, is now widely seen as mostly irrelevant to modern deep learning— see former student Rohin Shah’s review here, as well as Alex Turner’s comments here.

The second link just takes me to Alex Turner's shortform page on LW, where ctrl+f-ing "assistance" doesn't get me any results. I do find this comment when searching for "CIRL", which criticizes the CIRL/assistance games research program, but does not claim that it is irrelevant to modern deep learning. For what it's worth, I think it's plausible that Alex Turner thinks that assistance games is mostly irrelevant to modern deep learning (and plausible that he doesn't think that) - I merely object that the link provided doesn't provide good evidence of that claim.

The first link is to Rohin Shah's reviews of Human Compatible and some assistance games / CIRL research papers. ctrl+f-ing "deep" gets me two irrelevant results, plus one description of a paper "which is inspired by [the CIRL] paper and does a similar thing with deep RL". It would be hard to write such a paper if CIRL (aka assistance games) was mostly irrelevant to modern deep learning. The closest thing I ca... (read more)

Rohin Shah1y12

Yeah, I don't think it's accurate to say that I see assistance games as mostly irrelevant to modern deep learning, and I especially don't think that it makes sense to cite my review of Human Compatible to support that claim.

The one quote that Daniel mentions about shifting the entire way we do AI is a paraphrase of something Stuart says, and is responding to the paradigm of writing down fixed, programmatic reward functions. And in fact, we have now changed that dramatically through the use of RLHF, for which a lot of early work was done at CHAI, so I think this reflects positively on Stuart.

I'll also note that in addition to the "Learning to Interactively Learn and Assist" paper that does CIRL with deep RL which Daniel cited above, I also wrote a paper with several CHAI colleagues that applied deep RL to solve assistance games.

My position is that you can roughly decompose the overall problem into two subproblems: (1) in theory, what should an AI system do? (2) Given a desire for what the AI system should do, how do we make it do that?

The formalization of assistance games is more about (1), saying that AI systems should behave more like assistants than like autonomous agents (basica... (read more)

DanielFilan

I asked Alex "no chance you can comment on whether you think assistance games are mostly irrelevant to modern deep learning?" His response was "i think it's mostly irrelevant, yeah, with moderate confidence". He then told me he'd lost his EA forum credentials and said I should feel free to cross-post his message here. (For what it's worth, as people may have guessed, I disagree with him - I think you can totally do CIRL-type stuff with modern deep learning, to the extent you can do anything with modern deep learning.)

Steven Byrnes1y15

By contrast, AIs implemented using artificial neural networks (ANN) are white boxes in the sense that we have full read-write access to their internals. They’re just a special type of computer program, and we can analyze and manipulate computer programs however we want at essentially no cost.

Suppose you walk down a street, and unbeknownst to you, you’re walking by a dumpster that has a suitcase full of millions of dollars. There’s a sense in which you “can”, “at essentially no cost”, walk over and take the money. But you don’t know that you should, so you don’t. All the value is in the knowledge.

A trained model is like a computer program with a billion unlabeled parameters and no documentation. Being able to view the code is helpful but doesn’t make it “white box”. Saying it’s “essentially no cost” to “analyze” a trained model is just crazy. I’m pretty sure you have met people doing mechanistic interpretability, right? It’s not trivial. They spend months on their projects. The thing you said is just so crazy that I have to assume I’m misunderstanding you. Can you clarify?

Ben_West🔸1y20

I’m pretty sure you have met people doing mechanistic interpretability, right?

Nora is Head of Interpretability at EleutherAI :)

Nora Belrose

It's essentially no cost to run a gradient-based optimizer on a neural network, and I think this is sufficient for good-enough alignment. I view the the interpretability work I do at Eleuther as icing on the cake, allowing us to steer models even more effectively than we already can. Yes, it's not zero cost, but it's dramatically lower cost than it would be if we had to crack open a skull and do neurosurgery. Also, if by "mechanistic interpretability" you mean "circuits" I'm honestly pretty pessimistic about the usefulness of that kind of research, and I think the really-useful stuff is lower cost than circuits-based interp.

Steven Byrnes1y12

If you want to say "it's a black box but the box has a "gradient" output channel in addition to the "next-token-probability-distribution" output channel", then I have no objection.

If you want to say "...and those two output channels are sufficient for safe & beneficial AGI", then you can say that too, although I happen to disagree.

If you want to say "we also have interpretability techniques on top of those, and they work well enough to ensure alignment for both current and future AIs", then I'm open-minded and interested in details.

If you want to say "we can't understand how a trained model does what it does in any detail, but if we had to drill into a skull and only measure a few neurons at a time etc. then things sure would be even worse!!", then yeah duh.

But your OP said "They’re just a special type of computer program, and we can analyze and manipulate computer programs however we want at essentially no cost", and used the term "white box". That's the part that strikes me as crazy. To be charitable, I don't think those words are communicating the message that you had intended to communicate.

For example, find a random software engineer on the street, and ask them: "if I give... (read more)

Nora Belrose

Differentiability is a pretty big part of the white box argument. The terabyte compiled executable binary is still white box in a minimal sense but it's going to take a lot of work to mould that thing into something that does what you want. You'll have to decompile it and do a lot of static analysis, and Rice's theorem gets in the way of the kinds of stuff you can prove about it. The code might be adversarially obfuscated, although literal black box obfuscation is provably impossible. If instead of a terabyte of compiled code, you give me a trillion neural net weights, I can fine tune that network to do a lot of stuff. And if I'm worried about the base model being preserved underneath and doing nefarious things, I can generate synthetic data from the fine tuned model and train a fresh network from scratch on that (although to be fair that's pretty compute-intensive).

Steven Byrnes1y20

I don’t think “mouldability” is a synonym of “white-boxiness”. In fact, I think they’re hardly related at all:

There can be a black box with lots of knobs on the outside that change the box’s behavior. It’s still a black box.
Conversely, consider an old-fashioned bimetallic strip thermostat with a broken dial. It’s not mouldable at all—it can do one and only thing, i.e. actuate a switch at a certain fixed temperature. (Well, I guess you can use it as a doorstop!) But a bimetallic strip thermostat still very white-boxy (after I spend 30 seconds telling you how it works).

You wrote “They’re just a special type of computer program, and we can analyze and manipulate computer programs however we want at essentially no cost.” I feel like I keep pressing you on this, and you keep motte-and-bailey'ing into some other claim that does not align with a common-sense reading of what you originally wrote:

“Well, the cost of analysis could theoretically be even higher—like, if you had to drill into skulls…” OK sure but that’s not the same as “essentially no cost”.
“Well, the cost of analysis may be astronomically high, but there’s a theorem proving that it’s not theoretically impossible…” OK sur

... (read more)

RobertM

But this is irrelevant to the original claim, right? Being able to fine-tune might make introspection on its interal algorithmic representations a bit cheaper, but in practice we observe that it takes us weeks or months of alignment researchers' time to figure out what extremely tiny slices of two-generations-old LLMs are doing.

Rafael Harth

Well, a computer model is "literally" transparent in the sense that you can see everything, which means the only difficulty is only in understanding what it means. So the part where you spend 5 million dollars on a PET scanner doesn't exist for ANNs, and in that sense you can analyze them for "free". If the understanding part is sufficiently difficult... which it sure seems to be... then this doesn't really help, but it is a coherent conceptual difference.

NickLaing1y13

Simple and genuine question from a non-AI guy

I understand the arguments towards encouraging gradual development vs. fast takeoff, but I don't understand this argument I've heard multiple times (not just on this post) that "we need capabilities to increase so that we can stay up to date with alignment research".

First I thought there's still a lot of work we could do with current capabilities - technical alignment is surely limited by time, money and manpower not just by computing power. I'm also guessing less powerful AI could be made during a "pause" specifically for alignment research

Second in a theoretical situation where capabilities research globally stopped overnight, isn't this just free-extra-time for the human race where we aren't moving towards doom? That feels pretty valuable and high EV in and of itself.

It seems to me the argument would have to be that the advantage to the safety work of improving capabilities would outstrip the increasing risk of dangerous GAI, which I find hard to get my head around, but I might be missing something important.

Thanks.

Aaron_Scher

Not responding to your main question: I'm interpreting this as saying that buying humanity more time, in and of itself, is good. I don't think extra time pre-transformative-AI is particularly valuable except its impact on existential risk. Two reasons for why I think this: * Astronomical waste argument. Time post-transformative-AI is way more valuable than time now, assuming some (but strong version not necessary) aggregating/total utilitarianism. If I was trading clock-time seconds now for seconds a thousand years from now, assuming no difference in existential risk, I would probably be willing to trade every historical second of humans living good lives for like a minute a thousand years from now, because it seems like we could have a ton of (morally relevant) people in the future, and the moral value derived from their experience could be significantly greater than current humans. * The moral value of the current world seems plausibly negative due to large amounts of suffering. Factory farming, wild animal suffering, humans experiencing suffering, and more, seem like they make the total sign unclear. Under moral views that weigh suffering more highly than happiness, there's an even stronger case for the current world being net-negative. This is one of those arguments that I think is pretty weird and almost never affects my actions, but it is relevant to the question of whether extra time for the human race is positive EV. * Third argument about how AI sooner could help reduce other existential risks. e.g., normal example of AI speeding up vaccine research, or weirder example of AI enabling space colonization, and being on many planets makes x-risk lower. I don't personally put very much weight on this argument, but it's worth mentioning.

NickLaing

Thanks Aaron appreciate the effort. I Faild to point out my central assumpton here, that Transformative AI in our current state of poor preparedness is net negative due to the existential risk it entails. Its a good point about time pre transformative AI not being so valuable in the grand scheme of the future, but that ev would increase substantally assuming transformative AI is the end. Still looking for the fleshing out of this argument that I don't understand - if anyone can be bothered! "It seems to me the argument would have to be that the advantage to the safety work of improving capabilities would outstrip the increasing risk of dangerous GAI, which I find hard to get my head around, but I might be missing something important."

Greg_Colbourn

What is your p(doom|AGI)? (Assuming AGI is developed in the next decade.) Note that Bostrom himself says in Astronomical Waste (my emphasis in bold):

Aaron_Scher

I don't think you read my comment: I also think it's bad how you (and a bunch of other people on the internet) ask this p(doom) question in a way that (in my read of things) is trying to force somebody into a corner of agreeing with you. It doesn't feel like good faith so much as bullying people into agreeing with you. But that's just my read of things without much thought. At a gut level I expect we die, my from-the-arguments / inside view is something like 60%, and my "all things considered" view is more like 40% doom.

-1

Greg_Colbourn

Wow that escalated quickly :( It's really not. I'm trying to understand where people are coming from. If someone has low p(doom|AGI), then it makes sense that they don't see pausing AI development as urgent. Or their p(doom) relative to their actions can give some idea of how risk taking they are (but I still don't understand how OpenAI and their supporters think it's ok to gamble 100s of millions of lives in expectation for a shot at utopia without any democratic mandate). and Surely means that extra time now (pausing) is extremely valuable? i.e. because of its impact on existential risk. Or do you think that the chance we're in a net negative world now means that the astronomical future we could save would also most likely be net negative? I don' think this follows. Or that continuing to allow AI to speed up now will actually prevent extinction threats in the next 10 years that we would otherwise be wiped out by (this seems very unlikely to me).

Aaron_Scher

Sorry, I agree my previous comment was a bit intense. I think I wouldn't get triggered if you instead asked "I wonder if a crux is that we disagree on the likelihood of existential catastrophe from AGI. I think it's very likely (>50%), what do you think?" P(doom) is not why I disagree with you. It feels a little like if I'm arguing with an environmentalist about recycling and they go "wow do you even care about the environment?" Sure, that could be a crux, but in this case it isn't and the question is asked in a way that is trying to force me to agree with them. I think asking about AGI beliefs is much less bad, but it feels similar. I think it's pretty unclear if extra time now positively impacts existential risk. I wrote about a little bit of this here, and many others have discussed similar things. I expect this is the source of our disagreement, but I'm not sure.

Aaron_Scher

I think one of the better write-ups about this perspective is Anthropic's Core Views on AI Safety. From its main text, under the heading The Role of Frontier Models in Empirical Safety, a couple relevant arguments are: * Many safety concerns arise with powerful systems, so we need to have powerful systems to experiment with * Many safety methods require large/powerful models * Need to understand how both problems and our fixes change with model scale (if model gets bigger, does it look like safety technique is still working) * To get evidence of powerful models being dangerous (which is important for many reasons), you need the powerful models.

NickLaing1y11

Thanks Aaron that's a good article appreciate it. It still wasn't clear to me they were making an argument that increasing capabilities could be net positive, more that safety people should be working with whatever is the current most powerful model

"But we also cannot let excessive caution make it so that the most safety-conscious research efforts only ever engage with systems that are far behind the frontier."

This makes sense to me, the best safety researchers should have full access to the current most advanced models, preferably in my eyes before they have been (fully) trained.

But then I don't understand their next sentance "Navigating these tradeoffs responsibly is a balancing act, and these concerns are central to how we make strategic decisions as an organization."

I'm clearly missing something, what's the tradeoff? Is working on safety with the most advanced current model while generally slowing everything down not the best approach? This doesn't seem like a tradeoff to me

How is there any net safety advantage in increasing AI capacity?

Greg_Colbourn

Anthropic[1] have a massive conflict of interest (making money), so their statements are in some sense safetywashing. There is at least a few years worth of safety work that can be done on current models if we had the time (i.e. via a pause): interpretability is still stuck on trying to decipher GPT-2 sized models and smaller. And jailbreaks are still very far from being solved. Plenty to be getting on with without pushing the frontier of capabilities yet further. 1. ^ And the other big AI companies that supposedly care about x-safety (OpenAI, Google DeepMind)

Gerald Monroe

The assumptions are that more powerful models won't be like weaker models but more accurate. They will show emergent abilities. Many things that gpt-4 can solve gpt-3 cannot, and those models share a similar lineage. Safety issues show up when you have a model powerful enough to even exhibit them, and they may not be anything you predicted will happen from theory. Waluigi effect, hallucinations - both were not predicted by any theory by AI safety research groups. They seem to be the majority of the issues with models at the current level of capabilities.

Zach Stein-Perlman

Free extra time is good. The reasonable version of the argument is that you should avoid buying total-time in ways that cost time with more powerful systems; maybe AI progress will look like the purple line.

NickLaing

Nice post. Why didn't you post it here for AI pause debate week haha. Yes I somewhat understand this potential "overhang" danger as an argument in and of itself against a pause., I just don't see how it relates to technical alignment research specifically.

Holly Elmore ⏸️ 🔸1y12

Should we lobby governments to impose a moratorium on AI research? Since we don’t enforce pauses on most new technologies, I hope the reader will grant that the burden of proof is on those who advocate for such a moratorium.

You could have stopped here. This is our crux.

Yonatan Cale

I agree that the question of "what priors to use here" is super important. For example, if someone would chose priors for "we usually don't bring new more intelligent life forms to live with us, so the burden of proof is on doing so" - would that be valid? Or if someone would say "we usually don't enforce pauses on writing new computer programs" - would THAT be valid? imo: the question of "what priors to use" is important and not trivial. I agree with @Holly_Elmore that just assuming the priors here is skipping over some important stuff. But I disagree that "you could have stopped here", since there might be things which I could use to update my own (different) prior

Holly Elmore ⏸️ 🔸

*As far as my essay (not posted yet) was concerned, she could have stopped there, because this is our crux.

Davidmanheim

In a debate, which is what was supposed to be happening, the point is to make claims that either support or refute the central claim. That's what Holly was pointing out - this is a fundamental requirement for accepting Nora's position. (I don't think that this is the only crux - "AI Safety is gonna be easy" and "AI is fully understandable" are two far larger cruxes, but they largely depend on this first one.)

Rafael Harth1y12

Nate Soares of the Machine Intelligence Research Institute has argued that building safe AGI is hard for the same reason that building a successful space probe is hard— it may not be possible to correct failures in the system after it’s been deployed. Eliezer Yudkowsky makes a similar argument:
“This is where practically all of the real lethality [of AGI] comes from, that we have to get things right on the first sufficiently-critical try.” — AGI Ruin: A List of Lethalities

Eliezer and Nate also both expect discontinuous Takeoff by default. I feel like it's a bit disingenuous to argue that the thinking of Eliezer et al has proven obsolete and misguided, but then also quote them as apparent authority figures in this one case where their arguments align with your essay. It has to be one or the other!

Nora Belrose

Why does it have to be one or the other? I personally don't put much stock in what Eliezer and Nate think, but many other people do.

Davidmanheim

Are you presenting arguments that you think will convince others, regardless of whether you think they are correct? Edit: Apologies, this doesn't live up to my goals in having a conversation. However, I am concerned that quoting someone you think has non-predictive models of what will happen as an authority, without flagging that you're quoting them to point out that your opposition grants that particular point, is disengenious.

dirk

7mo

If you don't think their arguments are convincing, I consider it misleading to attempt to convince other people with those same arguments.

Nora Belrose1y11

Unfortunately, this post got published under the wrong username. I'm the Nora who wrote this post. I hope it can be fixed soon.

Zach Stein-Perlman

Also near the end bullet 9 should be a subbullet of 8.

Nora Belrose

Yep, I was also hoping the images could be text-wrapped, but idk if this platform supports that.

Ben_West🔸

Sorry about this, I believe it has now been fixed.

Thomas Kwa1y6

Upvoted. I don't agree with all of these takes but they seem valuable and underappreciated.

EdoardoPona1y6

I don't think the comparison with human alignment being successful is fair.

If you mean that most people don't go on to be antisocial etc.. which is comparable to non-X AI risk, the yes perhaps simple techniques like a 'good upbringing' are working on humans. A lot of it however is just baked in by evolution regardless. If you mean that most humans don't go on to become X-risks, then that mostly has to do with lack of capability, rather than them being aligned. There are very few people I would trust with 1000x human abilities, assuming everyone else remains a 1x human.

Gideon Futerman1y6

I feel in a number of areas this post relies on the concept of AI being constructed/securitised in a number of ways that seem contradictory to me. (By constructed, I am referring to the way the technology is understood, percieved and anticipated, what narratives it fits into and how we understand it as a social object. By securitised, I mean brought into a limited policy discourse centred around national security that justifies the use of extraordinary measures (eg mass surveillance or conflict) to combat, concerned narrowly with combatting the existential... (read more)

Greg_Colbourn1y5

Addressing some of your objections:

Realistic pauses don’t include hardware

Hardware development restriction would be nice, but it’s not necessary for a successful moratorium (at least for the next few years) given already proposed compute governance schemes. There are only a handful of large hardware manufacturers and data centre vendors who would need to be regulated into building in detection and remote kill switches into their products to ensure training runs over a certain threshold of compute aren’t completed. And training FLOP limits could be regularl... (read more)

JWS 🔸1y5

Thanks for this post Nora :) It's well-written, well-argued, and has certainly provoked some lively discussion. (FWIW I welcome good posts like this that push back against certain parts of the 'EA Orthodoxy')^[1]

My only specific comment would be similar to Daniel's, I'm not sure the references to the CIRL paradigm being irrelevant are fully backed-up. Not saying that's wrong, just that I didn't find the links convincing (though I don't work on training/aligning/interpreting LLMs as my day job)

My actual question is that I want there to be more things like th... (read more)

-1

Greg_Colbourn

Downvoter here. The post is more than just wrong (worthy of a disagree vote). It’s substantially negative EV for the future of the world. Or, to put it bluntly, it’s significantly[1] increasing the risk that we all get killed in the next few years. It’s dangerous because it sounds plausible (and indeed has been upvoted a bunch and is the second highest karma post in this debate series currently). But it contains a number of unjustified claims (see other comments, e.g. [1], [2], [3], [4]), and is framed from the perspective of AI x-risk not being a problem (there’s a reason Nora works at Eleuther rather than Conjecture). Right now, the EA community seems like it’s on the fence on the issue of an AGI moratorium (or slowing down AI in general). But there are signs that EAs are warming to the idea. I see this debate series as being high stakes in terms whether there will be significant EA resources directed toward pushing for a moratorium. Such resources could really make the difference between it happening or not (given how few resources are being directed toward it so far). EDIT: I expected that this comment itself would be downvoted. Why are you downvoting the comment? [There are separate disagree votes available on comments now.] 1. ^ 1+ basis points?

JWS 🔸

Thanks for replying Greg. I have indeed upvoted/disagreevoted you here, because I really appreciate Forum voters explaining their reasoning even if I disagree. * Mainly, I think calling Nora's post "substantially negative EV for the future of the world" is tending towards the 'galaxy brain' end of EA that puts people off. I can't calculate that, and I think it's much more plausible that it provides EA Forum with a well written and knowledgable perspective of someone who disagrees on alignment difficulty and whether a pause is the best policy. * It's part of a debate series, so in my opinion it's entirely fine for it to be Nora's perspective. Her post is quite open that she thinks Alignment is going well, and I valued it a lot even if I disagreed with specific points in it. I don't think Nora's being intentionally wrong, those are just claims she believes that may turn out to be incorrect. * I recognise that you are a lot more concerned about AI x-risk than I am (not to say I'm not concerned though) and are a lot more sure about pursuing a moratorium. I suppose I'd caution against presupposing your conclusion is so correct that other views, such as Nora's, don't deserve a hearing in the public sphere. I think that's a really dangerous line of thought to go down. I think this is a place where a moral uncertainty framework could mitigate this line of thought, without necessarily watering down your commitment to prevent AI xRisk.

Greg_Colbourn

I agree with this (apart from the "valued it a lot" part, and I think Nora is coming in with a pro-AI bias). I downvoted because I thought the karma total was (still is) way too high, and high karma posts and their headlines do, for better or worse, influence the community and how it directs its resources. Again, it deserves a hearing. I'm upset by how highly upvoted it is. If it was on, say, 10 karma (on a similar number of votes), I wouldn't've downvoted it any further[1]. [I also upvoted, disagreevoted your comment above :)] 1. ^ It's currently on 101 karma on 114 votes, which at least marks it out as somewhat controversial (I think <1 karma/vote is generally the sign of a controversial post on the EA Forum). Note for reference that my post from a few months ago, raising the alarm about very short term AGI x-risk, is on 66 karma from 100 votes. But I made the mistake of cross-posting it to LW (where people are generally allergic to any kind of political activism), which led to a bunch of people coming over from there and downvoting it here as well.

FlorianH1y4

Enjoyed the post, thanks! But it starts with an invalid deduction:

Since we don’t enforce pauses on most new technologies, I hope the reader will grant that the burden of proof is on those who advocate for such a moratorium. We should only advocate for such heavy-handed government action if it’s clear that the benefits of doing so would significantly outweigh the costs.

(I added the emphasis)

Instead, it seems more reasonable to simply advocate for such action exactly if, in expectation, the benefits seem to [even just about] outweigh the costs. Of course, we... (read more)

Matthew_Barnett

I agree in theory, but disagree in practice. In theory, utilitarians only care about the costs and benefits of policy. But in practice, utilitarians should generally be constrained by heuristics and should be skeptical of relying heavily on explicit cost-benefit calculations. Consider the following thought experiment: You're the leader of a nation and are currently deciding whether to censor a radical professor for speech considered perverse. You're very confident that the professor's views are meritless. You ask your advisor to run an analysis on the costs and benefits of censorship in this particular case, and they come back with a report concluding that there is slightly more social benefit from censoring the professor than harm. Should you censor the professor? Personally, my first reaction would be to say that the analysis probably left out second order effects from censoring the professor. For example, if we censor the professor, there will be a chilling effect on other professors in the future, whose views might not be meritless. So, let's make the dilemma a little harder. Let's say the advisor insists they attempted to calculate second order effects. You check and can't immediately find any flaws in their analysis. Now, should you censor the professor? In these cases, I think it often makes sense to override cost-benefit calculations. The analysis only shows a slight net-benefit, and so unless we're extremely confident in its methodology, it is reasonable to fall back on the general heuristic that professors shouldn't be censored. (Which is not to say we should never violate the principle of freedom of speech. If we learned much more about the situation, we might eventually decide that the cost-benefit calculation was indeed correct.) Likewise, I think it makes sense to have a general heuristic like, "We shouldn't ban new technologies because of abstract arguments about their potential harm" and only override the heuristic because of strong evidence abo

FlorianH

I have some sympathy with 'a simple utilitarian CBA doesn't suffice' in general, but I do not end at your conclusion; your intuition pump also doesn't lead me there. It doesn't seem to require any staunch utilitarianism to arrive at 'if a quick look at the gun design suggests it has 51% to shoot in your own face, and only 49% to shoot at the tiger you want to hunt as you otherwise starve to death'*, to decide to drop the project of it's development. Or, to halt, until a more detailed examination might allow you to update with a more precise understanding. You mention that with AI we have 'abstract arguments', to which my gun's simple failure probability may not do full justice. But I think not much changes, even if your skepticism about the gun would be as abstract or intangible as 'err, somehow it just doesn't seem quite right, I cannot even quite perfectly pin down why, but overall the design doesn't make me trust; maybe it explodes in my hand, it burns me, it's smoke might make me fall ill, whatever, I just don't trust it; i really don't know, but HAVING TAKEN ALL EVIDENCE AND LIVE EXPERIENCE, incl. the smartest EA and LW posts and all, I guess, 51% I get the harm, and only 49% the equivalent benefit, one way or another' - as long as it's still truly the best estimate you can do at the moment. The (potential) fact that we more typically have found new technologies to advance us, does very little work in changing that conclusion, though, of course, in a complicated case as in AI, this observation itself may have informed some of our cost-benefit reflections. *Yes you guessed correctly, I better implicitly assume something like, you have 50% of survival w/o catching the tiger, and 100% with him (and you only care about your survival) to really arrive at the intended 'slightly negative in the cost-benefit comparison'; so take the thought experiment as an unnecessarily complicated quick and dirty one, but I think it still makes the simple point.

Matthew_Barnett

In my thought experiment, we generally have a moral and legal presumption against censorship, which I argued should weigh heavily in our decision-making. By contrast, in your thought experiment with the tiger, I see no salient reason for why we should have a presumption to shoot the tiger now rather than wait until we have more information. For that reason, I don't think that your comment is responding to my argument about how we should weigh heuristics against simple cost-benefit analyses. In the case of an AI pause, the current law is not consistent with a non-voluntary pause. Moreover, from an elementary moral perspective, inventing a new rule and forcing everyone to follow it generally requires some justification. There is no symmetry here between action vs. inaction as there would be in the case of deciding whether to shoot the tiger right now. If you don't see why, consider whether you would have had a presumption against pausing just about any other technology, such as bicycles, until they were proven safe. My point is not that AI is just as safe as bicycles, or that we should disregard cost-benefit analyses. Instead, I am trying to point out that cost-benefit analyses can often be flawed, and relying on heuristics is frequently highly rational even when they disagree with naive cost-benefit analyses.

FlorianH

I tried to account for the difficulty to pin down all relevant effects in our CBA by adding the somewhat intangible feeling about the gun to backfire (standing for your point that there may be more general/typical but less easy to quantify benefits of not censoring etc.). Sorry, if that was not clear. More importantly: I think your last paragraph gets to the essence: You're afraid the cost-benefit analysis is done naively, potentially ignoring the good reasons for which we most often may not want to try to prevent the advancement of science/tech. This does, however, not imply that for pausing we'd require Pause Benefit >> Pause Cost. Instead, it means, simply you're wary of certain values for E[Pause Benefit] (or of E[Pause Cost]) to be potentially biased in a particular direction, so that you don't trust in conclusions based on them. Of course, if we expect a particular bias of our benefit or our cost estimate, we cannot just use the wrong estimates. When I'm advocating to be even-handed, I refer to a cost-benefit comparison that is non-naive. That is, if we have priors that there may exist positive effects that we've just not yet managed to pin down well, or to quantify, we have (i) used reasonable placeholders for these, avoiding bias as good as we can, and (ii) duly widened our uncertainty intervals. It is therefore, that in the end, we can remain even-handed, i.e. pause roughly iif E[Pause Benefit] > E[Pause Cost]. Or, if you like, iif E[Pause Benefit*] > E[Pause Cost*], with * = Accounting with all duty of care for the fact that you'd usually not want to stop your professor or so/usually not want to stop tech advancements because of yadayada..

Max H1y3

Far from being “behind” capabilities, it seems that alignment research has made great strides in recent years. OpenAI and Anthropic showed that Reinforcement Learning from Human Feedback (RLHF) can be used to turn ungovernable large language models into helpful and harmless assistants. Scalable oversight techniques like Constitutional AI and model-written critiques show promise for aligning the very powerful models of the future. And just this week, it was shown that efficient instruction-following language models can be trained purely

... (read more)

Greg_Colbourn1y2

Nora, what is your p(doom|AGI)?

Greg_Colbourn1y2

GPT-4, which we can already align pretty well.

I think this is a crux. GPT-4 is only safe because it is weak. It is so far from being 100% aligned -- see e.g this boast from OpenAI that is very far from being reassuring ("29% more often"), or all the many many jailbreaks -- which is what will be needed for us to survive in the limit of superintelligence!

You go on to talk about robustness (to misuse) and how this (jailbreaks) is is a separate issue, but whilst the distinction may be important from the perspective of ML research (or A... (read more)

Arthur Conmy1y2

I think this post provides some pretty useful arguments about the downsides of pausing AI development. I feel noticeably more pessimistic about a pause going well having read this.

However, I don't agree with some of the arguments about alignment optimism and think they're a fair bit weaker

When it comes to AIs, we are the innate reward system

Sure, we can use RLHF/related techniques to steer AI behavior. Further,

[gradient descent] is almost impossible to trick

Sure, unlike in most cases in biology, ANN updates do act on the whole model without noise etc... (read more)

Zach Stein-Perlman1y2

Good post.

Small things:

2. Increasing the chance of a “fast takeoff” in which one or a handful of AIs rapidly and discontinuously become more capable, concentrating immense power in their hands.

You don't actually discuss concentrating power, I think. (You just say fast takeoff is bad because it makes alignment harder, which is the same as your 1.)

But failing to pause hardware R&D creates a serious problem because, even if we pause the software side of AI capabilities, existing models will continue to get more powerful as hardware improves. Language mode

... (read more)

Stephen McAleese1y1

"In brief, the book [Superintelligence] mostly assumed we will manually program a set of values into an AGI, and argued that since human values are complex, our value specification will likely be wrong, and will cause a catastrophe when optimized by a superintelligence"

Superintelligence describes exploiting hard-coded goals as one failure mode which we would probably now call specification gaming. But the book is quite comprehensive, other failure modes are described and I think the book is still relevant.

For example, the book describes what we would ... (read more)

FlorianH1y1

One of my favorite passages is your remark on AI in some ways being rather more white-boxy, while instead humans are rather black boxy and difficult to align. Some often ignored truth in that (even if, in the end, what really matters, arguably is that we're so familiar with human behavior, that overall, the black boxy-ness of our inner workings may matter less).