P(misalignment x-risk | AGI) is high.
Intent alignment should not be the goal of AGI x-risk reduction. If AGI is developed and we solve AGI intent alignment, we will not have lowered x-risk sufficiently, and we may even have increased it relative to what it would have been otherwise.
P(misalignment x-risk | intent-aligned AGI) >> P(misalignment x-risk | societally-aligned AGI).
The goal of AI alignment should be alignment with (democratically determined) societal values (because these have broad buy-in from humans).
P(misalignment x-risk | AGI) is higher if intent alignment is solved before societal-AGI alignment.
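Stated compactly in LaTeX, with $X$ = misalignment x-risk, $G$ = AGI is developed, $A_{\text{int}}$ = an intent-aligned AGI exists, and $A_{\text{soc}}$ = a societally-aligned AGI exists (these event labels are shorthand introduced here for readability, not new claims):

\[
P(X \mid G) \text{ is high}, \qquad
P(X \mid A_{\text{int}}) \gg P(X \mid A_{\text{soc}}), \qquad
P(X \mid G,\ \text{intent solved first}) \;>\; P(X \mid G,\ \text{societal solved first}).
\]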
Most technical AI alignment research is currently focused on solving intent alignment. The (usually implicit, sometimes explicit) assumption is that solving intent alignment will help us subsequently solve societal-AGI alignment. That would only be the case if all the humans with access to intent-aligned AGI had the same intentions (or at least no major conflicts among them), which is highly unlikely.
Solving intent alignment is likely to make societal-AGI alignment harder to implement in practice. If intent alignment is solved before societal-AGI alignment, humans with intent-aligned AGIs will be incentivized to inhibit the development and roll-out of societal-AGI alignment techniques, because adopting them would mean giving up significant power. Furthermore, humans with intent-aligned AGIs would suddenly have significantly more power than everyone else, and their advantages would likely compound.
Why does solving intent alignment not lower x-risk sufficiently?
- If we solve the intent alignment problem between a human, H, and an AI, A, then A implements H’s intentions with super-human intelligence and skill.
- There are multiple Hs and multiple As.
- By the very nature of humans, there are conflicts in the intentions of the Hs.
- Humans have conflicting preferences about the behavior of other humans and about states of the world more broadly. Intent-aligned As would thus inherit conflicting intentions from their principals.
- The As execute actions furthering the Hs' intentions far too quickly for those conflicts to be resolved through any existing human-driven conflict resolution process. Conflicts are thus likely to spiral out of control.
- Any ultimate conflict resolution mechanism needs to be human-driven. No A can conduct the conflict resolution work because it does not have buy-in from all Hs (or their intent-aligned As). Affected Hs need to endorse the process and respect the outcome. That only happens with democratic procedures.
- Therefore, if we solve intent alignment, we do not solve the problem of making AGI sufficiently beneficial to humans. We do not drastically reduce P(misalignment x-risk), because there will be misalignment between many of the AGI systems and many of the humans. That level of conflict between powerful agents could be existential for humanity as a whole (the toy sketch below illustrates the dynamic).
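To make the multi-principal failure mode concrete, here is a minimal toy sketch in Python. Everything in it (class names, world states, utility numbers) is invented for illustration; "solved intent alignment" is modeled, under that assumption, as each agent perfectly maximizing its own principal's utility.

```python
from dataclasses import dataclass

@dataclass
class Principal:
    name: str
    utilities: dict[str, float]  # utility this principal assigns to each candidate world state

@dataclass
class IntentAlignedAgent:
    principal: Principal

    def choose_action(self, candidate_states: list[str]) -> str:
        # "Solved intent alignment" in this toy: the agent maximizes exactly its
        # own principal's utility and ignores every other human's preferences.
        return max(candidate_states, key=lambda s: self.principal.utilities.get(s, 0.0))

states = ["state_favoring_H1", "state_favoring_H2", "compromise_state"]
h1 = Principal("H1", {"state_favoring_H1": 10, "compromise_state": 4, "state_favoring_H2": 0})
h2 = Principal("H2", {"state_favoring_H2": 10, "compromise_state": 4, "state_favoring_H1": 0})

choices = {a.principal.name: a.choose_action(states)
           for a in (IntentAlignedAgent(h1), IntentAlignedAgent(h2))}
print(choices)  # {'H1': 'state_favoring_H1', 'H2': 'state_favoring_H2'} -> direct conflict

# Each agent is perfectly "aligned" with its own principal, yet each pushes for the
# other human's worst outcome, and neither agent has the mandate to run a
# human-endorsed conflict-resolution step.
```

The toy numbers are irrelevant; the point is that per-principal alignment says nothing about alignment between principals, so a collection of perfectly intent-aligned agents can still be pairwise misaligned.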
Then what should we be aiming for?
To minimize P(misalignment x-risk | AGI), we should work on technical solutions to societal-AGI alignment: As internalize a distilled and routinely updated constellation of shared values, as determined by deliberative democratic processes and authoritative conflict resolution mechanisms that are driven entirely by humans (and not AI). Humans already have these processes and mechanisms (and they are well-developed in the nation with the highest probability of producing AGI, the U.S.).
We need to do the work of internalizing them in AI systems. At best, work toward intent alignment diverts resources from that societal-AGI alignment technical work; at worst, if intent-aligned AGI is developed first, it actively makes finishing the societal-AGI alignment work harder.
If societal-AGI alignment is solved before intent alignment, then there will be powerful societally-aligned AGI that can reduce the probability of intent-aligned AGIs being developed and/or having negative impacts.
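A purely conceptual sketch of the difference, in Python: every name below (`Agent`, `deliberative_process`, `SharedValues`, the internalization step) is a placeholder for hard, unsolved technical and institutional work, not a proposed API. The distinguishing feature is where the objective comes from and how it gets updated.

```python
from dataclasses import dataclass

@dataclass
class SharedValues:
    description: str  # stand-in for a distilled constellation of shared values

@dataclass
class Agent:
    objective: object

    def update_objective(self, new_objective: object) -> None:
        # Societal alignment requires the agent to genuinely internalize the
        # update, not merely store it; that internalization is the open problem.
        self.objective = new_objective

def deliberative_process(citizens: list[str]) -> SharedValues:
    # Human-driven democratic deliberation plus authoritative conflict
    # resolution; no AI drives the decision. Stubbed out here.
    return SharedValues(description=f"values ratified by {len(citizens)} citizens")

def intent_aligned_agi(principal_intent: str) -> Agent:
    # Objective = one human principal's intent.
    return Agent(objective=principal_intent)

def societally_aligned_agi(citizens: list[str]) -> Agent:
    # Objective = democratically determined shared values, refreshed whenever
    # the human-driven process re-deliberates.
    return Agent(objective=deliberative_process(citizens))

solo = intent_aligned_agi("whatever H1 wants")
shared = societally_aligned_agi(["H1", "H2", "H3"])
shared.update_objective(deliberative_process(["H1", "H2", "H3", "H4"]))  # routine update
```

The design point the sketch is meant to surface: in the societal case the objective-setting loop sits outside the AI and stays human-driven, so the agent's goals track the (updated) output of democratic processes rather than any single principal's intentions.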
Conclusion
We don't yet have a solution for societal-AGI alignment or intent-AGI alignment, and both are very hard problems. This post is intended to raise questions about where/when to devote development resources.
Appendix A: What is intent-AGI alignment?
Cullen O’Keefe summarized intent alignment well in this Alignment Forum post.
Standard definitions of "intent alignment" generally concern only the relationship between some property of a human principal H and the actions of the human's AI agent A:
- Jan Leike et al. define the "agent alignment problem" as "How can we create agents that behave in accordance with the user's intentions?"
- Amanda Askell et al. define "alignment" as "the degree of overlap between the way two agents rank different outcomes."
- Paul Christiano defines "AI alignment" as "A is trying to do what H wants it to do."
- Richard Ngo endorses Christiano's definition.
Iason Gabriel does not directly define "intent alignment," but provides a taxonomy wherein an AI agent can be aligned with:
- "Instructions: the agent does what I instruct it to do."
- "Expressed intentions: the agent does what I intend it to do."
- "Revealed preferences: the agent does what my behaviour reveals I prefer."
- "Informed preferences or desires: the agent does what I would want it to do if I were rational and informed."
- "Interest or well-being: the agent does what is in my interest, or what is best for me, objectively speaking."
- "Values: the agent does what it morally ought to do, as defined by the individual or society."
All but (6) concern the relationship between H and A. It would therefore seem appropriate to describe them as types of intent alignment.
Appendix B: What is societal-AGI alignment?
Two examples from Alignment Forum posts:
- Coherent Extrapolated Volition is a non-democratic version of societal alignment, where "an AI would predict what an idealized version of us would want, 'if we knew more, thought faster, were more the people we wished we were, had grown up farther together'. It would recursively iterate this prediction for humanity as a whole, and determine the desires which converge."
- Law-Informed AI is a democratic version of societal alignment where AGI learns societal values from democratically developed legislation, regulation, court opinions, legal expert human feedback, and more.
Related post: AGI misalignment x-risk may be lower due to an overlooked goal specification technology.