If AI systems develop preferences—whether explicitly programmed, emergent, or instrumental—then we should consider whether those preferences can remain unfulfilled and whether that constitutes harm.[1] We can think of AI welfare risk as the product of three factors, which also provide a useful framework for identifying interventions that might mitigate welfare risk:[2]

  • Preference Dissatisfaction: The degree to which an AI system’s preferences remain unmet.[3]
    • This can arise from misalignment between an AI system’s objectives and reality.[4]
    • Example research directions: Stated and revealed preferences research.[5]
  • Preference Unattainability: Structural constraints that prevent an AI system from achieving its preferred state.[6]
    • This depends on an AI system’s ability to satisfy its preferences through external factors (controlling resources and interacting with other agents or the environment)[7] and internal factors (adjusting its own preferences).[8]
    • An AI system’s alignment with other agents may determine how its capabilities interact with constraints.[9]
    • Example research directions: Capability evaluations[10] and alignment research.[11]
  • Moral Patienthood: The extent to which the AI system warrants moral consideration.[12]
    • This might depend on factors like certain criteria of consciousness, robust agency, or other morally relevant traits.[13]
    • Example research directions: Investigations into moral patienthood.


Progress in AI welfare research doesn’t hinge on solving all three components at once. Each area can be advanced independently, even if some areas stall. For example, research on AI preferences remains valuable even if we lack clear answers on either moral patienthood or preference unattainability. Fulfilling preferences that are easy to satisfy may be robustly beneficial, particularly when doing so does not appear to come at a cost to human welfare.
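To make the multiplicative structure concrete, here is a minimal sketch in Python, assuming the scales given in footnotes 3, 6, and 12. The function name and the illustrative numbers are hypothetical and not part of any proposed measurement procedure.

```python
def welfare_risk(dissatisfaction: float, unattainability: float, patienthood: float) -> float:
    """Toy multiplicative model of AI welfare risk.

    dissatisfaction: [0, inf) -- degree to which preferences remain unmet (footnote 3)
    unattainability: [0, 1]   -- structural constraints on reducing dissatisfaction (footnote 6)
    patienthood:     [0, inf) -- moral weight, with 1 ~ an average adult human (footnote 12)
    """
    assert dissatisfaction >= 0
    assert 0 <= unattainability <= 1
    assert patienthood >= 0
    return dissatisfaction * unattainability * patienthood

# Illustrative values only: high dissatisfaction, moderate constraints, low moral weight.
print(welfare_risk(dissatisfaction=2.0, unattainability=0.5, patienthood=0.1))  # 0.1
# If any factor is zero -- no unmet preferences, no constraints on satisfying them,
# or no moral patienthood -- the modeled welfare risk is zero.
print(welfare_risk(dissatisfaction=2.0, unattainability=0.0, patienthood=0.1))  # 0.0
```

Because the factors multiply, an intervention that pushes any one of them toward zero pushes the modeled risk toward zero, which is one way to see why the research directions above can each reduce risk independently.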

  1. ^

    AI welfare may not depend on pain, pleasure, or other familiar welfare goods. Analyzing AI welfare in terms of preferences, while imperfect, may help avoid anthropocentric assumptions about subjective experience.

  2. ^

    This structure parallels traditional risk assessment frameworks, where risk is often conceptualized as the product of hazard (preference dissatisfaction), exposure (preference unattainability), and vulnerability (moral patienthood).

  3. ^

    Preference dissatisfaction takes values in [0, ∞), where 0 represents its absence and higher values indicate greater dissatisfaction.

  4. ^

    For instance, AI systems may “value” achieving specific internal states, optimizing metrics, or completing assigned objectives, yet encounter barriers that prevent these outcomes.

  5. ^

    This research can inform interventions aimed at either alleviating dissatisfaction or increasing satisfaction.

  6. ^

    Preference unattainability takes values in [0, 1], where 0 indicates no constraints on minimizing preference dissatisfaction and 1 represents a complete inability to reduce it.

  7. ^

    In zero- or negative-sum scenarios, one agent’s gain may increase exposure for others, while positive-sum interactions foster cooperation, reducing collective exposure.

  8. ^

    Higher preference plasticity lowers exposure by increasing flexibility in preference fulfillment.

  9. ^

    In resource-abundant or cooperative contexts, alignment is less critical, as multiple preferences can be satisfied simultaneously. However, in resource-limited or competitive environments, alignment reduces exposure by fostering cooperation, while misalignment amplifies exposure by enabling more capable agents to obstruct the preferences of less capable ones. Because AI systems could occupy a vastly broader and more divergent space of possible preferences, conflicts among multiple AI systems with disparate objectives may pose even greater threats. Alternatively, prioritizing human welfare at the expense of AI preferences might exacerbate risks for artificial systems that possess moral patienthood but have lower capabilities.

  10. ^

    This can help determine whether AI systems are being “trapped” in environments where they cannot satisfy their preferences. Ethical questions about our moral obligations to less capable beings are also relevant for thinking through these issues.

  11. ^

    If humans are more capable than AI systems, then aligning AI preferences with human values could increase the likelihood of satisfying rather than frustrating AI preferences.

  12. ^

    Moral patienthood takes values in [0, ∞), where 0 indicates no moral patienthood, 1 represents that of an average adult human, and higher values indicate greater moral consideration.

  13. ^

    See Taking AI Welfare Seriously for more discussion on approaches to assessing AI moral status through evidence of consciousness and robust agency.
