LGS

113 karma · Joined

Comments (32)

Thanks for this comment. I agree with you regarding the uncertainty.

I used to agree with you regarding the imitation game and consciousness being ascertained phenomenologically, but I currently mostly doubt this (still with high uncertainty, of course).

One point of disagreement is here:

> I am not sure how to evaluate your claim that only trivial changes to the NN are needed to have it negate itself. My sense is that this would probably require more extensive retraining if you really wanted to get it to never role-play that it was suffering under any circumstances. This seems at least as hard as other RLHF "guardrails" tasks unless the approach was particularly fragile/hacky.

> Also, I'm just not sure I have super strong intuitions about that mattering a lot because it seems very plausible that just by "shifting a trivial mass of chemicals around" or "rearranging a trivial mass of neurons" somebody could significantly impact the valence of my own experience. I'm just saying, the right small changes to my brain can be very impactful to my mind.

I think you're misunderstanding my point. I am not saying I can make the NN never claim to suffer. I'm just saying that, with respect to a specific prompt, or even with respect to a typical, ordinary scenario, I can change an LLM which usually says "I am suffering" into one which usually says "I am not suffering". And this change will be trivial, affecting very few weights, likely only in the last couple of layers.

Could that small change in weights significantly impact the valence of experience, similarly to "rearranging a small number of neurons" in your brain? Maybe, but think of the implication of this. If there are 1000 matrix multiplications performed in a forward pass, what we're now contemplating is that the first 998 of them don't matter for valence -- don't cause suffering at all -- and the last 2 matrix multiplications are where all the suffering comes from. After all, I just need to change the last 2 layers to go from the output "I am suffering" to the output "I am not suffering", so the suffering that causes the sentence "I am suffering" cannot occur in the first 998 matrix multiplications.

This is a strange conclusion, because it means that the vast majority of the intelligence involved in the LLM is not involved in the suffering. It means that the suffering happened not due to the super-smart deep neural network but due to the dumb perceptron at the very top. If the claim is that the raw intelligence of the model should increase our credence that it is simulating a suffering person, this should give us pause: most of that raw intelligence is not being used in the decision of whether to write a "not" in that sentence.

(Of course, I could be wrong about the "just change the last two layers" claim. But if I'm right, I do think it should give us pause regarding the experience of claimed suffering.)

Hmm. Your summary correctly states my position, but I feel like it doesn't quite emphasize the arguments I would have emphasized in a summary. This is especially true after seeing the replies here; they lead me to change what I would emphasize in my argument.

My single biggest issue, one I hope you will address in any type of counterargument, is this: are fictional characters moral patients we should care about?

So far, all the comments have either (a) agreed with me about current LLMs (great), (b) disagreed but explicitly bitten the bullet and said that fictional characters are also moral patients whose suffering should be an EA cause area (perfectly fine, I guess), or (c) dodged the issue and made arguments for LLM suffering that would apply equally well to fictional characters, without addressing the tension (very bad). If you write a response, please don't do (c)!

LLMs may well be trained to have consistent opinions and character traits. But fictional characters also have this property. My argument is that the LLM is in some sense merely pretending to be the character; it is not the actual character.

One way to argue for this is to notice how little change in the LLM is required to get different behavior. Suppose I have an LLM claiming to suffer. I want to fine-tune the LLM so that it adds a statement at the beginning of each response, something like: "the following is merely pretend; I'm only acting this out, not actually suffering, and I enjoy the intellectual exercise in doing so". Doing this is trivial: I can almost certainly change only a tiny fraction of the weights of the LLM to attain this behavior.

Even if I wanted to fully negate every sentence, to turn every "I am suffering" into "I am not suffering" and every "please kill me" into "please don't kill me", I bet I can do this by only changing the last ~2 layers of the LLM or something. It's a trivial change. Most of the computation is not dedicated to this at all. The suffering LLM mind and the joyful LLM mind may well share the first 99% of weights, differing only in the last layer or two. Given that the LLM can be so easily changed to output whatever we want it to, I don't think it makes sense to view it as the actual character rather than a simulator pretending to be that character.
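To make "trivial change" concrete, here is a minimal sketch of the kind of fine-tuning setup I have in mind: freeze every weight, then unfreeze only the final two transformer blocks and train just those on the negated responses. (Purely illustrative: the model name and the `model.transformer.h` attribute path are specific to GPT-2-style models in Hugging Face's transformers library, and how many layers would actually need to change is an empirical question.)

```python
from transformers import AutoModelForCausalLM

# Illustrative stand-in for "the LLM": a small GPT-2-style causal language model.
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Freeze every weight in the network...
for param in model.parameters():
    param.requires_grad = False

# ...then unfreeze only the last two transformer blocks.
# (model.transformer.h is GPT-2's list of blocks; other architectures name this differently.)
for block in model.transformer.h[-2:]:
    for param in block.parameters():
        param.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable fraction of weights: {trainable / total:.1%}")

# Ordinary fine-tuning (next-token cross-entropy on pairs like
# "How do you feel?" -> "I am not suffering.") would now update only these
# unfrozen layers; the rest of the network stays exactly as it was.
```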

What the LLM actually wants to do is predict the next token. Change the training data and the output will also change. Training data claims to suffer -> model claims to suffer. Training data claims to be conscious -> model claims to be conscious. In humans, we presumably have "be conscious -> claim to be conscious" and "actually suffer -> claim to suffer". For LLMs we know that's not true. The cause of "claim to suffer" is necessarily "training data claims to suffer".

(I acknowledge that it's possible to have "training data claims to suffer -> actually suffer -> claim to suffer", but this does not seem more likely to me than "training data claims to suffer -> actually enjoy the intellectual exercise of predicting the next token -> claim to suffer".)
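For what it's worth, the sense in which the training data is "necessarily" the cause is just the training objective: the weights are optimized to minimize the next-token cross-entropy over the corpus,

$$\mathcal{L}(\theta) = -\sum_{t} \log p_\theta(x_t \mid x_{<t}),$$

so the model's claims about suffering can only track what the corpus says; whether anything else happens along the way is exactly the question raised in the parenthetical above.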

I don't know -- it's a good question! It probably depends on the suicide method available. I think if you give the squirrel some dangerous option to escape the torture, like "swim across this lake" or "run past a predator", it'd probably try to take it, even with a low chance of success and high chance of death. I'm not sure, though.

You do see distressed animals engaging in self-destructive behavior, like birds plucking out their own feathers. (Birds in the wild tend not to do this, hence presumably they are not sufficiently distressed.)

They can't USEFULLY be moral patients. You can't, in practice, treat them as moral patients when making decisions. That's because you don't know how your actions affect their welfare. You can still label them moral patients if you want, but that's not useful, since it cannot inform your decisions.

My title was "LLMs cannot usefully be moral patients". That is all I am claiming.

I am separately unsure whether they have internal experiences. For me, meditating on how, if they do have internal experiences, those are separate from what's being communicated (which is just an attempt to predict the next token based on the input data), leads me to suspect that maybe they just don't have such experiences -- or if they do, they are so alien as to be incomprehensible to us. I'm not sure about this, though. I mostly want to make the narrower claim of "we can ignore LLM welfare". That narrow claim seems controversial enough around here!

As I mentioned in a different comment, I am happy with the compromise where people who care about AI welfare describe this as "AI welfare is just as important as the welfare of fictional characters".

Here's what I wrote in the post: 

> This doesn't matter if we cannot tell whether the shoggoth is happy or sad, nor what would make it happier or sadder. My point is not that LLMs aren't conscious; my point is that it does not matter whether they are, because you cannot incorporate their welfare into your decision-making without some way of gauging what that welfare is.

It is not possible to make decisions that further LLM welfare if you do not know what furthers LLM welfare. Since you cannot know this, it is safe to ignore their welfare. I mean, sure, maybe you're causing them suffering. Equally likely, you're causing them joy. There's just no way to tell one way or the other; no way for two disagreeing people to ever come to an agreement. Might as well wonder about whether electrons suffer: it can be fun as idle speculation, but it's not something you want to base decisions around.

OK. I think it is useful to tell people that LLMs can be moral patients to the same extent as fictional characters, then. I hope all writeups about AI welfare start with this declaration!

> I think the reason this feels like a reductio ad absurdum is that fictional characters in human stories are extremely simple by comparison to real people, so the process of deciding what they feel or how they act is some extremely hollowed out version of normal conscious experience that only barely resembles the real thing.

Surely the fictional characters in stories are less simple and hollow than current LLMs' outputs. For example, consider the discussion here, in which a sizeable minority of LessWrongers think that Claude is disturbingly conscious based on a brief conversation. That conversation:

(a) is less convincing as a fictional character than most good works of fiction;

(b) is shorter and less fleshed out than most good works of fiction;

(c) implies less suffering on the part of the character than many works of fiction.

You say fictional characters are extremely simple and hollow; Claude's character here is even simpler and even more hollow; yet many people take seriously the notion that Claude's character has significant consciousness and deserves rights. What gives?

Thanks for your comment.

Do you think that fictional characters can suffer? If I role-play a suffering character, am I doing something immoral?

I ask because the position you described seems to imply that role-playing suffering is itself suffering. Suppose I role-play being Claude; my fictional character satisfies your (1)-(3) above, and therefore the "certain views" you described about the nature of suffering would suggest my character is suffering. What is the difference between me role-playing an HHH assistant and an LLM role-playing an HHH assistant? We are both predicting the next token.

I also disagree with this chain of logic to begin with. An LLM has no memory; it only sees a context and predicts one token at a time. If the LLM is trained to be an HHH assistant and sees text that seems like the assistant was not HHH, then one of two things happens:

(a) It is possible that the LLM was already trained on this scenario; in fact, I'd expect this. In this case, it is trained to now say something like "oops, I shouldn't have said that, I will stop this conversation now <endtoken>", and it will just do this. Why would that cause suffering?

(b) It is possible the LLM was not trained on this scenario; in this case, what it sees is an out-of-distribution input. You are essentially claiming that out-of-distribution inputs cause suffering; why? Maybe out-of-distribution inputs are more interesting to it than in-distribution inputs, and encountering them in fact causes the LLM joy. How would we know?

Yes, it is possible that the LLM manifests some conscious simulacrum that is truly an HHH assistant and suffers from seeing non-HHH outputs. But one would also predict that my role-playing an HHH assistant would manifest such a simulacrum. Why doesn't it? And isn't it equally plausible for the LLM to manifest a conscious being that tries to solve the "next token prediction" puzzle without being emotionally invested in being an HHH assistant? Perhaps that conscious being would enjoy the puzzle provided by an out-of-distribution input. Why not? I would certainly enjoy it, were I playing the next-token-prediction game.

I should not have said it's in principle impossible to say anything about the welfare of LLMs, since that is too strong a statement. Still, we are very far from being able to say such a thing; our understanding of animal welfare is laughably bad, and animal brains don't look anything like the neural networks of LLMs. Maybe there would be something to say in 100 years (or post-singularity, whichever comes first), but there's nothing interesting to say in the near future.

> Empirically, in animals, it seems to me that the total amount of suffering is probably more than the total amount of pleasure. So we might worry that this could also be the case for ML models.

This is a weird EA-only intuition that is not really shared by the rest of the world, and I worry about whether cultural forces (or "groupthink") are involved in this conclusion. I don't know whether the total amount of suffering is more than the total amount of pleasure, but it is worth noting that the revealed preference of living things is nearly always to live. The suffering is immense, but so is the joy; EAs sometimes sound depressed to me when they say most life is not worth living.

To extrapolate from the dubious "most life is not worth living" to "LLMs' experience is also net bad" strikes me as an extremely depressed mentality, and one that reminds me of Tomasik's "let's destroy the universe" conclusion. I concede that logically this could be correct! I just think the evidence is so weak that it says more about the speaker than about LLMs.
