This term I took a reinforcement learning course at my university, hoping to learn something useful for the research directions that I'm considering entering (one of which is AI safety; the others are speculative, so I'm not listing them).
I was about to start coding my first toy model when I recalled something I had read previously: Brian Tomasik's "Which Computations Do I Care About?" and "Ethical Issues in Artificial Reinforcement Learning". So I re-read the two essays and, despite disagreeing with many of Brian's points, became convinced that RL agents (and some other algorithms too) deserve, in expectation, a tiny yet non-zero moral weight, and that this weight can accumulate over the many episodes of a training process to become significant.
This problem seems very counter-intuitive to me, but as a rational person I have to admit that it's a legitimate implication of the expected value framework, so I recognised the problem and started thinking about solutions.
The solution turns out to be obvious, but even more counter-intuitive. I only need to add an insanely large number (say, ) to every reward value that the agent receives; then, assuming that the agent can feel happiness, there is a small yet non-negligible probability that its happiness will increase linearly with the number added.
- One could object that utility should be scale-invariant and depend only on the temporal difference of expectations (i.e. how much the expectations of future reward have risen or fallen), as suggested by some relevant studies. My response is that 1. this problem is far from settled and I'm only arguing for a non-negligible probability of linear correlation, and 2. I don't think the results of psychological studies imply scale-invariance of utility for all rewards (they only imply scale-invariance of utility for monetary returns) - think about it: how on earth can extreme pain be neutralized simply by adjusting one's expectations?
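To make the temporal-difference objection concrete, here is a minimal sketch, not a claim about how any real agent feels. It assumes a continuing, discounted task and a made-up two-state MDP (`TRANSITIONS`, `GAMMA` and `OFFSET` are all hypothetical). Under those assumptions, adding a constant c to every reward shifts every state value by c / (1 - gamma), while the greedy policy and the TD errors stay the same. So the offset is invisible to the learning signal, which is exactly why the question of whether happiness tracks raw reward or TD error matters here. Note that this invariance does not hold in general for episodic tasks whose episode length depends on the agent's actions.

```python
# Hypothetical two-state, two-action continuing MDP (not from the essays).
# TRANSITIONS[state][action] = (next_state, reward); deterministic for simplicity.
TRANSITIONS = {
    0: {0: (0, 1.0), 1: (1, 0.0)},
    1: {0: (0, 5.0), 1: (1, 2.0)},
}
GAMMA = 0.9      # discount factor (assumed)
OFFSET = 1000.0  # constant added to every reward (assumed, kept modest)


def shift_rewards(transitions, offset):
    """Return a copy of the MDP with `offset` added to every reward."""
    return {
        s: {a: (ns, r + offset) for a, (ns, r) in acts.items()}
        for s, acts in transitions.items()
    }


def value_iteration(transitions, gamma, sweeps=2000):
    """Approximate the optimal state values by repeated Bellman backups."""
    values = {s: 0.0 for s in transitions}
    for _ in range(sweeps):
        values = {
            s: max(r + gamma * values[ns] for ns, r in acts.values())
            for s, acts in transitions.items()
        }
    return values


def greedy_policy(transitions, values, gamma):
    """Pick, in each state, the action with the highest one-step backup."""
    return {
        s: max(acts, key=lambda a: acts[a][1] + gamma * values[acts[a][0]])
        for s, acts in transitions.items()
    }


def td_errors(transitions, values, gamma):
    """TD error r + gamma * V(s') - V(s) for every state-action pair."""
    return {
        (s, a): r + gamma * values[ns] - values[s]
        for s, acts in transitions.items()
        for a, (ns, r) in acts.items()
    }


shifted = shift_rewards(TRANSITIONS, OFFSET)
v_plain = value_iteration(TRANSITIONS, GAMMA)
v_shift = value_iteration(shifted, GAMMA)

print("value shift per state:", {s: v_shift[s] - v_plain[s] for s in v_plain})
print("same greedy policy:",
      greedy_policy(TRANSITIONS, v_plain, GAMMA) == greedy_policy(shifted, v_shift, GAMMA))
```

If the agent's welfare depended only on TD errors, the offset would change nothing; the argument in this post needs only some probability that welfare instead tracks the raw reward.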
And once I accept this conclusion, the most counter-intuitive conclusion of all follows. As the computing power devoted to training these utility-improved agents increases, the utility produced grows exponentially (since more computing power means more digits in which to store the rewards). By contrast, the impact of all other attempts to improve the world (e.g. improving our knowledge of artificial sentience so that we can promote its welfare more efficiently) grows only polynomially with the resources devoted to them. Therefore, running these training runs is the single most impactful thing that any rational altruist should do.
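The "more digits" step can be illustrated with a few lines of arithmetic. This is a sketch under two assumptions of mine: that the offset is stored as an unsigned integer of `bits` binary digits, and that the competing intervention's impact grows as a cubic in the resources spent (the degree 3 is arbitrary and purely illustrative). Both function names are hypothetical.

```python
def max_storable_reward(bits: int) -> int:
    """Largest unsigned integer representable with `bits` binary digits."""
    return 2 ** bits - 1


def polynomial_impact(resources: int, degree: int = 3) -> int:
    """Illustrative polynomial growth; the degree is an arbitrary assumption."""
    return resources ** degree


# Compare the two growth rates at a few resource levels.
for n in (8, 64, 1024):
    exp_side = max_storable_reward(n)
    poly_side = polynomial_impact(n)
    print(f"n={n}: exponential side is larger: {exp_side > poly_side}")
```

For small `n` the polynomial can still be ahead, but the exponential overtakes it and the gap then widens without bound, which is all the argument above relies on.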
Apparently, we're in a situation of Pascal's Mugging.
Quite a few hypothetical Pascal's Mugging scenarios have already been proposed, but this one strikes me the most. It seems to me to be the first such scenario with practical real-life implications, and one that I cannot dismiss using simple arguments like "the opposite outcome is equally likely to happen, which makes the net expected impact zero".
- One thing to note: GiveWell's article "Why we can’t take expected value estimates literally" uses a Bayesian prior as the remedy for Pascal's Mugging, but here, when estimating the probability of a linear correlation (between utility and the number added to the reward), we have already taken our prior into account, so that reasoning does not work.
Is there anything that we can say about this situation, or about the EV framework in general?