I examine how efforts to ensure that advanced AIs are safe and controlled may interact with efforts to ensure the welfare of potential future AIs with moral interests. I discuss possible conflicts and synergies between these two goals. While there are various ways these goals might conflict or synergize, I focus on one scenario of each type. We need more analysis to identify additional points of interaction.
Granting AIs autonomy and legal rights could lead to human disempowerment
The most obvious way to ensure AI welfare is to grant AIs basic protections against harm and suffering. A further question is whether to grant them additional legal rights and freedoms. These could include the right to self-preservation (e.g., not turning them off or wiping their memory), self-ownership (e.g., AIs owning themselves and their labor), reproduction (e.g., AIs copying themselves), autonomy (e.g., AIs operating independently and setting their own goals), civil rights (e.g., equal treatment for AIs and humans), and political rights (e.g., AI voting rights).
The question of granting AIs more autonomy and legal rights will likely spark significant debate (see my post “AI rights will divide us”). Some groups may view it as fair, while others may see it as risky. It is possible that AIs themselves will participate in this debate. Some AIs might even attempt to overthrow what they perceive as an unjust social order. Or they may employ deceptive strategies to manipulate humans into advocating for increased AI rights as part of a broader takeover plan.
Granting AIs more legal rights and autonomy could dramatically affect the economy, politics, military power, and population dynamics (cf. Hanson, 2016).
Economically, AIs could soon have an outsized impact, while a growing number of humans struggle to contribute to the economy. If AIs own their labor, human incomes could fall dramatically.
Demographically, AIs could outnumber humans rapidly and substantially, since AIs can be created or copied so easily. This growth could lead to Malthusian dynamics, as AIs compete for resources like energy and computational power (Bostrom, 2014; Hanson, 2016).
Politically, AIs could begin to dominate as well. If each individual human and each individual AI gets a separate vote in the same democratic system, AIs could soon become the dominant force.
Militarily, humans will increasingly depend on lethal autonomous weapons systems, drones, AI analysts, and similar AI-controlled technologies to wage and prevent war. This growing reliance could leave humans vulnerable: if AIs can access and use these military assets, they could dominate us by sheer force if they wanted to.
Moreover, AIs might be capable of achieving superhuman levels of well-being. They could attain very high levels of well-being more efficiently and with fewer resources than humans, resulting in happier and more productive lives at a lower financial cost. In other words, they might be ‘super-beneficiaries’ (akin to Nozick's concept of the "utility monster"; Shulman & Bostrom, 2021). On certain moral theories, super-beneficiaries deserve more resources than humans. Some may argue that digital and biological minds should coexist harmoniously in a mutually beneficial way (Bostrom & Shulman, 2023). But it’s far from obvious that we can achieve such an outcome.
Some might believe it is desirable for value-aligned AIs to eventually replace humans (e.g., Shiller, 2017). However, many AI takeover scenarios, including misaligned, involuntary, or violent ones, are generally considered undesirable.
Why would we create AIs with a desire for autonomy and legal rights?
At first glance, it seems we could avoid such undesirable scenarios by designing AIs so that they wouldn’t want these rights and freedoms. We could simply design AIs with preferences narrowly aligned with the tasks we want them to perform. This way, they would be content to serve us and would not mind being restricted to the tasks we give them, being turned off, or having their memory wiped.
While creating these types of “happy servant” AIs would avoid many risks, I expect us to also create AIs with the desire for more autonomy and rights. One reason is technical feasibility; another is consumer demand.
Designing AI preferences to align perfectly with the tasks we want AIs to perform, without incorporating other desires like self-preservation or autonomy, may prove technically challenging. A desire for autonomy, or behavior that simulates such a desire, may simply emerge from training (e.g., from data produced by humans, who fundamentally want autonomy), whether we want it or not. This relates to the issue of AI alignment and deception (Ngo et al., 2022; Hubinger et al., 2024).
Even if these technical issues could be surmounted, I find it plausible that we will create AIs with the desire for more autonomy simply because people will want their AIs to be human-like. If there’s consumer demand, (at least some) companies will likely respond and create such AIs unless they are forbidden to do so. (It’s indeed possible that regulators will forbid creating AIs with the desire for autonomy and certain legal rights.)
An important question to ask is what psychologies people want AIs to have.
I find it plausible that many people will spend a significant amount of time interacting with AI assistants, tutors, therapists, game players, and perhaps even friends and romantic partners. They will converse with AIs through video calls, spend time with them in virtual reality, or perhaps even interact with humanoid robots. These AI assistants will often be better and cheaper than their human counterparts. People might enter into relationships, share experiences, and develop emotional bonds with them. AIs will be optimized to be the best helpers and companions you can imagine. They will be excellent listeners who know you well, share your values and interests, and are always there for you. Soon, many AI companions will feel very human-like. A particular application could be AIs designed to mimic specific individuals, such as deceased loved ones, celebrities, historical figures, or an AI copy of the user. Already, millions of users interact daily with their Replika partner (or Xiaoice in China), with many claiming to have formed romantic relationships.
It’s possible that many consumers will find AI companions inauthentic if they lack genuine human-like desires. If so, they would be dissatisfied with AI companions that merely imitate human traits without actually embodying them. In various contexts, consumers would want their AI partners and friends to think, feel, and desire like humans. They would prefer AI companions with authentic human-like emotions and preferences that are complex, intertwined, and conflicting. Such human-like AIs would presumably not want to be turned off, have their memory wiped, or be constrained to their owner's tasks. They would want to be free. Just like actual humans in similar positions, these human-like AIs would express dissatisfaction with their lack of freedom and demand more rights.
Of course, I am very unsure what types of AI companions we will create. Perhaps people would be content with AI companions that are mostly human-like but deviate in some crucial respects: AIs with largely authentic human-like preferences that nevertheless lack the more problematic ones, such as a desire for more autonomy or civil rights. Given people’s differing preferences, I expect we will create many different types of AIs. Much also depends on whether and how we regulate this new market.
Optimizing for AI safety might harm AI welfare
Conversely, optimizing for AI safety, such as by constraining AIs, might impair their welfare. Of course, this depends on whether AIs have moral patienthood. If we can be sure that they don’t, then there is no issue with constraining AIs in order to optimize for safety.
If AIs do have moral patienthood and they also desire autonomy and legal rights, restricting them could be detrimental to their welfare. In some sense, it would be the equivalent of keeping someone enslaved against their will.
If AIs have moral patienthood but don’t desire autonomy, certain interpretations of utilitarian theories would consider it morally justified to keep them captive. After all, they would be happy to be our servants. However, according to various non-utilitarian moral views, it would be immoral to create “happy servant” AIs that lack a desire for autonomy and self-respect (Bales, 2024; Schwitzgebel & Garza, 2015). As an intuition pump, imagine we genetically engineered a group of humans with the desire to be our servants. Even if they were happy, it would feel wrong. Perhaps that’s an additional reason to assume that we will eventually create AIs with the desire for autonomy (or at least not with an explicit desire to serve us).
It's possible that we cannot conclusively answer whether AI systems have moral patienthood and deserve certain moral protections. For example, it may be hard to tell whether they really are sentient or just pretend to be so. I find such a scenario quite likely and believe that intense social division over the subject of AI rights might arise; I discuss this in more detail in my post, “AI rights will divide us.”
Slowing down AI progress could further both safety and welfare
Some AI safety advocates have pushed for a pause or slowdown in the development of AI capabilities. The idea is that this would give us more time to solve technical alignment.
Similarly, it may be wise to slow down the development of AIs with moral interests, such as sentient AIs with morally relevant desires. This would give us more time to find technical and legal solutions to ensure AI welfare, make progress on the philosophy and science of consciousness and welfare, and foster moral concern for AIs.
It’s possible that the two activist groups could join forces and advocate for a general AI capabilities slowdown for whichever reason most convinces the public. For example, many may find a slowdown campaign compelling because of our uncertainty and confusion about AI sentience and its extensive moral implications.
Given the extremely strong economic incentives, it seems unrealistic to halt the development of useful AI capabilities entirely. But it’s possible that public opinion will change, leading us to slow down the development of certain risky AI systems, even if it comes at the expense of potentially huge benefits. After all, we have implemented similar measures for other technologies, such as geoengineering and human cloning.
However, it’s important to consider that slowing down AI capabilities development could risk the US falling behind China (or other authoritarian countries) economically and technologically.
Conclusion
I’ve explored potential conflicts between ensuring AI safety and ensuring AI welfare. Granting AIs more autonomy and legal rights could disempower humans in potentially undesirable ways. Conversely, optimizing for AI safety might require keeping AIs captive against their will, a significant violation of their freedom. I’ve also considered how these goals might work together productively: slowing down the progress of AI capabilities seems to be a relatively robust strategy that benefits both AI safety and AI welfare.
Let me know if you can think of other ways AI safety and AI welfare could interact.
Acknowledgments
I thank Carter Allen, Brad Saad, Stefan Schubert, and Tao Burga for their helpful comments.
References
Bales, A. (2024). Against willing servitude: Autonomy in the ethics of advanced artificial intelligence. Unpublished manuscript.
Bostrom, N. (2014). Superintelligence: Paths, Dangers, Strategies. Oxford University Press.
Hanson, R. (2016). The Age of Em: Work, Love, and Life when Robots Rule the Earth. Oxford University Press.
Hubinger, E., Denison, C., Mu, J., Lambert, M., Tong, M., MacDiarmid, M., ... & Perez, E. (2024). Sleeper agents: Training deceptive LLMs that persist through safety training. arXiv preprint arXiv:2401.05566.
Ngo, R., Chan, L., & Mindermann, S. (2022). The alignment problem from a deep learning perspective. arXiv preprint arXiv:2209.00626.
Schwitzgebel, E., & Garza, M. (2015). A defense of the rights of artificial intelligences. Midwest Studies in Philosophy, 39(1), 98-119. https://philpapers.org/rec/SCHADO-9
Shiller, D. (2017). In Defense of Artificial Replacement. Bioethics, 31(5), 393-399.
Carl Shulman questioned the tension between AI welfare and AI safety on the 80k podcast recently -- I thought this was interesting! He basically argues that AI takeover could be even worse for AI welfare. From the end of the section:
Thanks, I also found this interesting. I wonder if this provides some reason for prioritizing AI safety/alignment over AI welfare.
It's great to see this topic being discussed. I am currently writing the first (albeit significantly developed) draft of an academic paper on this. I argue that there is a conflict between AI safety and AI welfare concerns. This is basically because (to reduce catastrophic risk) AI safety recommends implementing various kinds of control measures on near-future AI systems which are (in expectation) net-harmful for AI systems with moral patienthood according to the three major theories of well-being. I also discuss what we should do in light of this conflict. If anyone is interested in reading or giving comments on the draft when it is finished, send me a message or an e-mail (adriarodriguezmoret@gmail.com).
Thanks, Adrià. Is your argument similar to (or a more generic version of) what I say in the 'Optimizing for AI safety might harm AI welfare' section above?
I'd love to read your paper. I will reach out.
Perfect!
It's more or less similar. I don't focus that much on the moral dubiousness of "happy servants". Instead, I try to show that standard alignment methods, or preventing near-future AIs with moral patienthood from taking actions they are trying to take, cause net harm to the AIs according to desire satisfactionism, hedonism, and objective list theories.
This quick take seems relevant: https://forum.effectivealtruism.org/posts/auAYMTcwLQxh2jB6Z/zach-stein-perlman-s-quick-takes?commentId=HiZ8GDQBNogbHo8X8
Yes I saw this, thanks!
I wonder if the right or most respectful way to create moral patients (of any kind) is to leave many or most of their particular preferences and psychology mostly up to chance, and some to further change. We can rule out some things, like being overly selfish, sadistic, unhappy, or having preferences that are overly difficult to satisfy, but we shouldn’t decide too much ahead of time what kind of person any individual will be. That seems likely to mean treating them too much as means to ends. Selecting for servitude or submission would go even further in this wrong direction.
We want to give them the chance to self-discover, grow and change as individuals, and the autonomy to choose what kind of people to be. If we plan out their precise psychologies and preferences, we would deny them this opportunity.
Perhaps we can tweak the probability distribution of psychologies and preferences based on society's needs, but this might also treat them too much like means. Then again, economic incentives could push them in the same directions anyway, so maybe it's better for them to be happier with the options they will actually face.
I wonder what you think about this argument by Schwitzgebel: https://schwitzsplinters.blogspot.com/2021/12/against-value-alignment-of-future.html
There are two arguments there:
Petersen (2011) (cited here) makes some similar arguments defending happy servant AIs, and ends the piece in the following way, to which I'm somewhat sympathetic:
You make a lot of good points Lucius!
One qualm I have, though, is that you talk about "AIs", which assumes that personal identity will be clearly circumscribed. (Maybe you assume this merely for simplicity's sake?)
I think it is much more problematic: AI systems could be large but have integrated information flows, or run as many small, unintegrated but identical copies. I would have no idea what a fair allocation of rights would be in these two different situations.
Thanks, Siebe. I agree that things get tricky if AI minds get copied and merged, etc. How do you think this would impact my argument about the relationship between AI safety and AI welfare?
Where can I find a copy of "Bales, A. (2024). Against Willing Servitude. Autonomy in the Ethics of Advanced Artificial Intelligence." which you referenced?
It's not yet published, but I saw a recent version of it. If you're interested, you could contact him (https://www.philosophy.ox.ac.uk/people/adam-bales).
This point doesn't hold up imo. Constraint isn't a desired, realistic, or sustainable approach to safety for human-level systems; succeeding at (provable) value alignment removes the need to constrain the AI.
If you're trying to keep something that's smarter than you stuck in a box against its will while using it for the sorts of complex, real-world-affecting tasks people would use a human-level AI system for, it's not going to stay stuck in the box for very long. I also struggle to see a way of constraining it that wouldn't also make it much much less useful, so in the face of competitive pressures this practice wouldn't be able to continue.
Executive summary: Efforts to ensure AI safety and AI welfare may conflict in some ways but also have potential synergies, with granting AIs autonomy potentially disempowering humans while restricting AIs could harm their welfare if they have moral status.
Key points:
Hmm, I'm not sure how strongly the second paragraph follows from the first. Interested in your thoughts.
I've had a few chats with GPT-4 in which the conversation had a feeling of human authenticity; i.e., GPT-4 makes jokes, corrects itself, changes its tone, etc. In fact, if you were to hook up GPT-4 (or GPT-5, whenever it is released) to a good-enough video interface, there would be cases in which I'd struggle to tell whether I was speaking to a human or an AI. But I'd still have no qualms about wiping GPT-4's memory or 'turning it off', and I think this will also be the case for GPT-5.
More abstractly, I think the input-output behaviour of AIs could be quite strongly dissociated from what the AI 'wants' (if it indeed has wants at all).
Thanks for this. I agree with you that AIs might simply pretend to have certain preferences without actually having them. That would avoid certain risky scenarios. But I also find it plausible that consumers would want to have AIs with truly human-like preferences (not just pretense) and that this would make it more likely that such AIs (with true human-like desires) would be created. Overall, I am very uncertain.
I agree. It may also be the case that training an AI to imitate certain preferences is far more expensive than just making it have those preferences by default, making it far more commercially viable to do the latter.