
Ryan Greenblatt

Member of Technical Staff @ Redwood Research
1101 karma

Bio

This other Ryan Greenblatt is my old account[1]. Here is my LW account.

  1. ^

    Account lost to the mists of time and expired university email addresses.

Comments

I think reducing the risk of misaligned AI takeover looks like a pretty good use of people on the margin. My guess is that misaligned AI takeover typically doesn't result in extinction under the normal definition of the term (killing basically all humans within 100 years). (Maybe I think the chance of extinction-defined-normally given AI takeover is 1/3.)

Thus, for me, the bottom line of the debate statement comes down to whether misaligned AI takeover which doesn't result in extinction-defined-normally actually counts as extinction in the definition used in the post.

I don't feel like I understand how the definition you give of "a future with 0 value" handles cases like:

"Misaligned AIs takeover and have preferences that on their own have ~0 value from our perspective. However, these AIs keep most humans alive out of a small amount of kindness and due to acausal trade. Additionally, lots of stuff happens in our lightcone which is good due to acausal trade (but this was paid for by some entity that shared our preferences). Despite this, misaligned AI takeover is actually somewhat worse (from a pure longtermist perspective) than life on earth being wiped about prior to this point, because aliens were about 50% likely to be able to colonize most of our lightcone (or misaligned AIs they create would do this colonization) and they share our preferences substantially more than the AIs do."

More generally, my current overall guess at a preference ordering is something like: control by a relatively enlightened human society that shares my moral perspectives (and has relatively distributed power) > human control where power is roughly as democratic as now > human dictator > humans are driven extinct but primates aren't (so probably other primates develop an intelligent civilization in like 10-100 million years) > earth is wiped out totally (no AIs and no chance for intelligent civilization to re-evolve) > misaligned AI takeover > earth is wiped out and there aren't aliens, so nothing ever happens with resources in our lightcone > various s-risk scenarios.

What line here counts as "extinction"? Does moving from misaligned AI takeover to "human control where power is roughly as democratic as now" count as an anti-extinction scenario?

I think work of the sort you're discussing isn't typically called digital minds work. I would just describe this as "trying to ensure better futures (from a scope-sensitive longtermist perspective) other than via avoiding AI takeover, human power grabs, or extinction (from some other source)".

This just incidentally ends up being about digital entities/beings/value because that's where the vast majority of the value probably lives.


The way you phrase (1) seems to imply that you think large fractions of expected moral value (in the long run) will be in the minds of laborers (AIs we created to be useful) rather than things intentionally created to provide value/disvalue. I'm skeptical.

A large reason to focus on opaque components of larger systems is that difficult-to-handle and existentially risky misalignment concerns are most likely to occur within opaque components rather than emerge from human-built software.

I don't see any plausible x-risk threat models that emerge directly from AI software written by humans? (I can see some threat models due to AIs building other AIs by hand such that the resulting system is extremely opaque and might take over.)

In the comment you say "LLMs", but I'd note that a substantial fraction of this research probably generalizes fine to arbitrary DNNs trained with something like SGD. More generally, various approaches that work for DNNs trained with SGD plausibly generalize to other machine learning approaches.

I think the AI Notkilleveryoneism Memes ⏸️ (@AISafetyMemes) twitter account reasonably often says things that feel at least close to crying wolf. (E.g., in response to our recent paper "Alignment Faking in Large Language Models", they posted a tweet which implied that we caught the model trying to escape in the wild. I tried to correct possible misunderstandings here.)

I wish they would stop doing this.

They are on the fringe IMO and often get called out for this.

The Long Term Future Fund (LTFF) also looks pretty good IMO, especially if you're less optimistic about policy.

I don't think non-myopia is required to prevent jailbreaks. A model can in principle not care about the effects of training on it and not care about longer-term outcomes while still implementing a policy that refuses harmful queries.

I think we should want models to be quite deontological about corrigibility.

This isn't responding to the overall point, and I agree that by default there is some tradeoff (in current personas) unless you go out of your way to avoid this.

(And, I don't think training your model to seem myopic and corrigible necessarily suffices as it could just be faked!)

This is an old thread, but I'd like to confirm that a high fraction of my motivation for being vegan[1] is signaling to others and myself. (So, n=1 for this claim.) (A reasonable fraction of my motivation is more deontological.)

  1. ^

    I eat fish on rare occasions, as I was convinced that the case for this improving productivity is sufficiently strong.

I suppose the complement to the naive thing I said before is "80k needs a compelling reason to recruit people to EA, and needs EA to be compelling to the people it recruits; by doing an excellent job at some object-level work, you can grow the value of 80k's recruiting, both by making it easier to do and by making the outcome more valuable. Perhaps this might be even better for recruiting than doing recruiting directly."

I think there are a bunch of meta effects from working in an object level job:

  • The object level work makes people more likely to enter the field, as you note. (Though this doesn't just route through 80k; it goes through a bunch of mechanisms.)
  • You'll probably have some conversations with people considering entering the field from a slightly more credible position at least if the object level stuff goes well.
  • Part of the work will likely involve fleshing stuff out so people with less context can more easily join/contribute. (True for most / many jobs.)

I think people wouldn't normally consider it Pascalian to enter a positive-total-returns lottery with a 1/20,000 (50 in a million) chance of winning?

And people don't consider it Pascalian to vote, to fight in a war, or to advocate for difficult-to-pass policy that might reduce the chance of nuclear war?

Maybe you have a different-than-typical perspective on what it means for something to be Pascalian?
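(To spell out the lottery arithmetic behind the first point above: a 1/20,000 chance is "positive total returns" whenever the prize is worth more than 20,000 tickets. A minimal expected-value sketch, with a hypothetical $1 ticket and $30,000 prize that are purely illustrative; only the 1/20,000 win probability comes from the comment.)

% Hypothetical numbers: ticket cost c = $1, prize V = $30,000, win probability p = 1/20,000.
\[
  \mathbb{E}[\text{net payoff}] = p \cdot V - c
  = \frac{30{,}000}{20{,}000} - 1
  = 0.5 > 0
\]
% The bet has positive expected value despite the small win probability,
% which is why taking it isn't usually described as Pascalian.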
