This is an important and challenging area. There has been prior work here on RL morality and AI wellbeing (though others on the forum might be better able than I to point to the best references).
I think there are several nuances to this. I'll mention a few:
1) Philosophical: Is giving positive or negative reward intrinsically cruel? Say a parent scolds a 3-year-old for throwing a cup across the room, or a 6-year-old for using a "naughty word", or a 12-year-old for punching another child, or a 16-year-old for bringing a firework to school. Are all negative rewards and brain updates cruel? Or is it cruel or immoral ("child abuse") to give no punishment or reward at all? AIs are not human children, but there are analogies here. I think following the reasoning through leads to one of two conclusions: either we should have no children (and/or no very capable AIs), or, if we do have them, at least some reward shaping is moral.
2) Unclear that it is negatively dominated: You write "My concern is that in these regards RLHF* is analogous only to unpleasant feelings as far as human consciousness is concerned." It is unclear that this avoidance-only framing is true. Purely pretrained models are somewhat "rambly" (I'll skirt around more technical claims like "have more entropy"). They complete prompts in many different ways. As models are updated toward higher-rated kinds of outputs, some outputs are avoided, but others are preferred. Preferring outputs enables new capabilities that non-finetuned models lack (e.g., consistently writing correct, bug-free code for some complex problems, or following complex multistep instructions without a mistake in the middle).
3) Existing cases of your proposal: If I understand correctly, your proposals around "see no evil"/"speak no evil" are similar to some techniques AI deployers use to filter user inputs and model outputs. This has limits. It can be costly and unhelpful if many generations have to be filtered. Additionally, models are being used for a broad range of problems. When doing alignment tuning, we need them to generalize to new problems. By getting a model to not "want" to give instructions for a bomb, it might better understand avoiding harm, and then do something like give better medical advice. Filtering/rewriting also gets challenging for complex inputs and outputs (e.g., video or motor inputs/outputs). Filtering approaches have a place, but do not fully enable very complex systems. Still, you give an interesting framing of the technique around AI welfare (vs. just systems humans like), and it is true that there is room to improve this style of filtering/rewriting technique.
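To make point 2 a bit more concrete, here is a toy sketch (not how RLHF is actually implemented; real pipelines use a learned reward model, sampled updates, KL penalties, and so on). It assumes a hypothetical "model" that is just a softmax over four candidate completions, with made-up ratings, and applies exact expected-reward gradient steps. The names `reward_step`, `rewards`, and the learning rate are all my own assumptions for illustration.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    # Shannon entropy in nats; a rough stand-in for how "rambly" the model is.
    return -sum(p * math.log(p) for p in probs if p > 0)

def reward_step(logits, rewards, lr=1.0):
    # One exact gradient-ascent step on expected reward.
    # For a softmax policy, d E[r] / d logit_i = p_i * (r_i - baseline),
    # where baseline is the current expected reward.
    probs = softmax(logits)
    baseline = sum(p * r for p, r in zip(probs, rewards))
    return [z + lr * p * (r - baseline) for z, p, r in zip(logits, probs, rewards)]

# Hypothetical toy setup: four candidate completions, uniform to start.
logits = [0.0, 0.0, 0.0, 0.0]
rewards = [1.0, 0.5, -0.5, -1.0]  # made-up ratings, purely illustrative

before = softmax(logits)
for _ in range(20):
    logits = reward_step(logits, rewards)
after = softmax(logits)

print("before:", [round(p, 3) for p in before], "entropy:", round(entropy(before), 3))
print("after: ", [round(p, 3) for p in after], "entropy:", round(entropy(after), 3))
```

In this toy, the same update that pushes the lowest-rated completion down also pushes the highest-rated one up, since every logit moves in proportion to how its rating compares with the current average. Whether that symmetry says anything about valence is of course the open question, but mechanically the update is not avoidance-only.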
This is an important area worth iterating on. Thanks for sharing.
Congrats on the first post @Hzn!