
Ryan Greenblatt

Member of Technical Staff @ Redwood Research
1101 karma

Bio

This other Ryan Greenblatt is my old account[1]. Here is my LW account.

  1. ^

    Account lost to the mists of time and expired university email addresses.

Comments

I think reducing the risk of misaligned AI takeover looks like a pretty good use of people on the margin. My guess is that misaligned AI takeover typically doesn't result in extinction under the normal definition of the term (killing basically all humans within 100 years). (Maybe I think the chance of extinction-defined-normally given AI takeover is 1/3.)

Thus, for me, the bottom line of the debate statement comes down to whether misaligned AI takeover which doesn't result in extinction-defined-normally actually counts as extinction in the definition used in the post.

I don't feel like I understand how the definition you give of "a future with 0 value" handles cases like:

"Misaligned AIs takeover and have preferences that on their own have ~0 value from our perspective. However, these AIs keep most humans alive out of a small amount of kindness and due to acausal trade. Additionally, lots of stuff happens in our lightcone which is good due to acausal trade (but this was paid for by some entity that shared our preferences). Despite this, misaligned AI takeover is actually somewhat worse (from a pure longtermist perspective) than life on earth being wiped about prior to this point, because aliens were about 50% likely to be able to colonize most of our lightcone (or misaligned AIs they create would do this colonization) and they share our preferences substantially more than the AIs do."

More generally, my current overall guess at a preference ordering is something like: control by a relatively enlightened human society that shares my moral perspectives (and has relatively distributed power) > human control where power is roughly as democratic as now > human dictator > humans are driven extinct but primates aren't (so probably other primates develop an intelligent civilization in like 10-100 million years) > earth is wiped out totally (no AIs and no chance for intelligent civilization to re-evolve) > misaligned AI takeover > earth is wiped out and there aren't aliens, so nothing ever happens with resources in our lightcone > various s-risk scenarios.

What line here counts as "extinction"? Does moving from misaligned AI takeover to "human control where power is roughly as democratic as now" count as an anti-extinction scenario?

I think work of the sort you're discussing isn't typically called digital minds work. I would just describe this as "trying to ensure better futures (from a scope-sensitive longtermist perspective) other than via avoiding AI takeover, human power grabs, or extinction (from some other source)".

This just incidentally ends up being about digital entities/beings/value because that's where the vast majority of the value probably lives.


The way you phrase (1) seems to imply that you think large fractions of expected moral value (in the long run) will be in the minds of laborers (AIs we created to be useful) rather than things intentionally created to provide value/disvalue. I'm skeptical.

A large reason to focus on opaque components of larger systems is that difficult-to-handle and existentially risky misalignment concerns are most likely to occur within opaque components rather than emerge from human-built software.

I don't see any plausible x-risk threat models that emerge directly from AI software written by humans? (I can see some threat models due to AIs building other AIs by hand such that the resulting system is extremely opaque and might take over.)

In the comment you say "LLMs", but I'd note that a substantial fraction of this research probably generalizes fine to arbitrary DNNs trained with something like SGD. More generally, various approaches that work for DNNs trained with SGD plausibly generalize to other machine learning approaches.

I think the AI Notkilleveryoneism Memes ⏸️ (@AISafetyMemes) twitter account reasonably often says things that feel at least close to crying wolf. (E.g., in response to our recent paper "Alignment Faking in Large Language Models", they posted a tweet which implied that we caught the model trying to escape in the wild. I tried to correct possible misunderstandings here.)

I wish they would stop doing this.

They are on the fringe IMO and often get called out for this.

The Long Term Future Fund (LTFF) also looks pretty good IMO, especially if you're less optimistic about policy.

I don't think non-myopia is required to prevent jailbreaks. A model can in principle not care about the effects of training on it and not care about longer-term outcomes while still implementing a policy that refuses harmful queries.

I think we should want models to be quite deontological about corrigibility.

This isn't responding to the overall point, and I agree that by default there is some tradeoff (in current personas) unless you go out of your way to avoid this.

(And, I don't think training your model to seem myopic and corrigible necessarily suffices as it could just be faked!)

This is an old thread, but I'd like to confirm that a high fraction of my motivation for being vegan[1] is signaling to others and myself. (So, n=1 for this claim.) (A reasonable fraction of my motivation is more deontological.)

  1. ^

    I eat fish on rare occasions, as I was convinced that the case for this improving productivity is sufficiently strong.

I suppose the complement to the naive thing I said before is "80k needs a compelling reason to recruit people to EA, and needs EA to be compelling to the people it recruits; by doing an excellent job at some object-level work, you can grow the value of 80k's recruiting, both by making it easier to do and by making the outcome more valuable. Perhaps this might be even better for recruiting than doing recruiting directly."

I think there are a bunch of meta effects from working in an object level job:

  • The object level work makes people more likely to enter the field, as you note. (Though this doesn't just route through 80k; it goes through a bunch of mechanisms.)
  • You'll probably have some conversations with people considering entering the field from a slightly more credible position at least if the object level stuff goes well.
  • Part of the work will likely involve fleshing stuff out so people with less context can more easily join/contribute. (True for most / many jobs.)

I think people wouldn't normally consider it Pascalian to enter a positive-total-returns lottery with a 1/20,000 (50 in a million) chance of winning?

And people don't consider it Pascalian to vote, to fight in a war, or to advocate for difficult-to-pass policy that might reduce the chance of nuclear war?

Maybe you have a different-than-typical perspective on what it means for something to be Pascalian?
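(To spell out the lottery arithmetic behind the first point above: a 1/20,000 chance is "positive total returns" whenever the prize is worth more than 20,000 tickets. A minimal expected-value sketch, with a hypothetical $1 ticket and $30,000 prize that are purely illustrative; only the 1/20,000 win probability comes from the comment.)

% Hypothetical numbers: ticket cost c = $1, prize V = $30,000, win probability p = 1/20,000.
\[
  \mathbb{E}[\text{net payoff}] = p \cdot V - c
  = \frac{30{,}000}{20{,}000} - 1
  = 0.5 > 0
\]
% The bet has positive expected value despite the small win probability,
% which is why taking it isn't usually described as Pascalian.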
