This essay is one of the award-winning essays for the Econ Theory AI Alignment Prize from Superlinear. It uses the tools of theoretical economics to contribute to the AI-alignment problem.
Psychologists are familiar with the notion of cognitive dissonance: the discomfort that arises when a person's behavior conflicts with that person's beliefs and preferences. Cognitive dissonance is accompanied by seemingly irrational behavior such as motivated reasoning and self-deception. It is, in essence, an example of misalignment: decisions are not aligned with values. In that sense, cognitive dissonance may shed some light on the AI-alignment problem (Bostrom, 2014): how do we create safe artificial intelligence that does what we (creators, humans, sentient beings) really want? How can the decisions of AI-machines become aligned with our values and preferences? Even a small misalignment between our values and the goals of superintelligent machines (which are more intelligent, and hence more powerful, than us) could cause very serious problems. And we will never be intelligent enough to solve those problems once the misaligned machines surpass our own intelligence.
Computer scientists who study the AI-alignment problem speak a different language than the psychologists who study cognitive dissonance. But recent work in behavioral economics, in particular new theoretical models of cognitive dissonance (Hestermann, Le Yaouanq & Treich, 2020), can perhaps bridge this gap. The language of economists, when they speak of utility functions, optimization, expected-value maximization, game-theoretic strategic behavior and Bayesian updating, is close to the language of the computer scientists who develop AI.
The cognitive dissonance models in behavioral economics are well illustrated by the meat paradox (Hestermann, Le Yaouanq & Treich, 2020): many people are animal-loving meat eaters, and their high levels of meat consumption are not aligned with their concern for animal welfare. They do not want to cause unnecessary animal suffering, yet they know that meat consumption involves unnecessary suffering when alternatives (e.g. plant-based meat) cause less or no suffering. Rather than switch to a meat-free diet, many of these people rationalize their meat consumption: denying a mind to farm animals (Bastian et al., 2012), derogating vegetarians (Minson & Monin, 2012) and actively avoiding information about animal farming and slaughterhouses. This example of cognitive dissonance, known as the meat paradox (Loughnan & Davies, 2019), is striking because it shows that such dissonance can have large-scale consequences: billions of animals are killed every year.
The basic model of the meat paradox starts with a utility function: a person values meat consumption, animal welfare and reliable knowledge, and these values or preferences enter the utility function as separate terms. The animal-welfare term depends on the person's subjective belief about, or estimate of, the level of farm animal suffering. This belief in turn depends on the information the person receives about how farm animals are treated and what their mental capacities are.
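To fix ideas, the person's utility can be sketched as follows; this is only a minimal illustrative parameterization, not necessarily the exact functional form used by Hestermann, Le Yaouanq and Treich (2020):

\[
U \;=\; u(m) \;-\; \beta\,\hat{s}\,m \;-\; \kappa\,D,
\]

where m is the quantity of meat consumed, u(m) is the consumption value of meat, ŝ is the believed level of suffering per unit of meat, β measures the concern for animal welfare, D indicates whether the belief rests on distorted information, and κ is the weight on reliable knowledge. The role of the deception indicator D becomes clear once the two selves are introduced below.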
The model uses an intrapersonal game-theoretic framework: a person is modelled as having two selves. The first self receives external information about the welfare of the farm animals used for meat production (for example, information about how the animals are treated or about their mental capacity to suffer). That first self can decide to transmit this information reliably (truthfully) or wrongly (deceptively) to the second self, who uses the transmitted information to form a belief about animal suffering and then decides how much meat to buy. Hence, the first self chooses which information to send at time T1, and the second self makes the consumption decision at a later time T2.
The crucial assumption is that the first self incorporates the utility function of the second self, i.e. both selves value the same things. The utility functions of the two selves are fully aligned: the first self internalizes the utility of the second self, including the second self's beliefs. Hence the utility of the first self, and especially the term containing the preference for animal welfare, is based not on the true level of animal welfare (the external information received by the first self) but on the believed level (held by the second self and based on the information transmitted by the first self).
Suppose the external information about animal welfare is bad news, i.e. the farm animals experience too much suffering. If the first self reliably transmits this information to the second self, and animal welfare is part of the person's utility function, the second self may decide not to buy meat. The first self does not like that outcome, as she values meat consumption. So the first self can decide to deceive the second self by transmitting the good-news message that farm animal welfare is fine. This self-deception comes at a cost, however, because reliable knowledge is also part of the person's utility function: deception adds a negative term to that function.
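Under the simplifying assumption that the second self takes the message at face value, the first self's trade-off after receiving bad news can be written as a single inequality. Let V(ŝ) denote the utility (gross of the deception cost) that the person obtains when the second self holds belief ŝ and consumes accordingly; the first self then deceives if and only if

\[
V(\hat{s}_{\text{good}}) \;-\; V(\hat{s}_{\text{bad}}) \;>\; \kappa,
\]

i.e. if the gain from consuming under a rosier belief exceeds the cost of unreliable knowledge. This is a deliberate simplification: in the full model the second self does not take the message at face value, as described next.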
The two selves then interact and play a strategic game. Both are rational, strategic agents who perform Bayesian updating. The second self considers the possibility that the first self might be lying about the true state of farm animal welfare, and can start to distrust the first self if that first self is prone to deception. The first self knows that the second self may distrust her and adapts her decisions accordingly: when receiving bad news, she can strategically decide to transmit that bad news reliably to the second self, or to send good news instead.
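As a simple illustration of how this distrust enters the second self's belief (with illustrative parameters, not quantities from the original model): suppose the state of animal welfare is bad with prior probability p, and the first self is expected to send good news with probability λ even when the news is actually bad, while reporting truthfully when the news is good. Upon hearing good news, the second self then assigns probability

\[
\Pr(\text{bad} \mid \text{good news}) \;=\; \frac{\lambda\,p}{(1-p) + \lambda\,p}
\]

to the bad state, so good news is discounted more heavily the more the first self is expected to lie.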
The result is a perfect Bayesian equilibrium. Depending on the parameters of the utility function, this equilibrium can involve self-deception, in which case the person is information averse, i.e. not open to information about animal suffering. In particular, a person with both a high level of meat attachment (who really wants to eat meat) and a high concern for animal welfare (who feels very guilty about causing animal suffering) may experience strong cognitive dissonance, resulting in a high level of self-deception and information aversion. Only if the preference for reliable knowledge is strong enough (i.e. the cost of deception is large) can self-deception be avoided.
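These comparative statics can be illustrated with a small numerical sketch. The code below is a deliberately simplified toy rather than the model itself: it assumes a quadratic consumption value a·m − m²/2 (with a capturing meat attachment), two possible suffering levels, and a second self who believes the message, so it does not compute the full perfect Bayesian equilibrium; all parameter values are illustrative.

```python
# Toy illustration of the self-deception trade-off (not the full model).
# Second self: chooses meat m to maximize a*m - 0.5*m**2 - beta*s_hat*m,
# taking the believed suffering level s_hat at face value.
# First self: after bad news, compares truthful reporting with deception,
# which costs kappa (the preference for reliable knowledge).

def value_of_belief(a, beta, s_hat):
    """Utility from the optimal consumption choice under belief s_hat."""
    m_star = max(0.0, a - beta * s_hat)          # optimal meat consumption
    return a * m_star - 0.5 * m_star**2 - beta * s_hat * m_star

def prefers_deception(a, beta, kappa, s_low=0.2, s_high=1.0):
    """Does the first self prefer to send 'good news' after bad news?"""
    gain = value_of_belief(a, beta, s_low) - value_of_belief(a, beta, s_high)
    return gain > kappa

# A person with high meat attachment AND high welfare concern:
print(prefers_deception(a=1.0, beta=1.0, kappa=0.1))   # True: self-deception
# The same person with a strong preference for reliable knowledge:
print(prefers_deception(a=1.0, beta=1.0, kappa=0.6))   # False: honesty
```

In this toy, either lowering the meat attachment a or raising the knowledge weight kappa removes the incentive to self-deceive, in line with the equilibrium predictions described above.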
This cognitive dissonance model predicts many of the meat-paradox phenomena studied by psychologists. But it can also be relevant and instructive for the study of AI-alignment, where the utility function translates into the goal function of an AI-machine. When applying the model to AI-alignment, we can give it two interpretations.
In the first interpretation, the first self is the AI-machine and the second self is the human. The AI does nothing more than receive information, analyze data and transmit the processed information to the human: the human asks the AI a question, the AI computes and gives a response. That does not look dangerous, since the human can always decide to ignore the information received from the AI. But what if the AI is clever enough to deceive the human? Then the human may end up doing terrible things. To solve this problem, one might think it is sufficient for the AI to be aligned with the human, i.e. to share the very same utility function as the human. If the human values the truth and does not want to be told lies, why would the aligned AI tell lies? But as the cognitive dissonance model shows, even that solution is not enough: a well-aligned AI might still deceive humans, just as humans deceive themselves in the case of meat consumption. What is required is a sufficiently strong preference for realism, for reliable information, for transmitting the truth. The goal function of the AI-machine should include a term that measures the cost of deception, similar to the corresponding term in the utility function of the cognitive dissonance model. The marginal cost or disutility of deception, incurred when the AI tells one more lie, should be large enough to prevent misalignment.
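In machine-learning terms, this suggests building an explicit deception penalty into the objective an AI system is trained or evaluated on. The sketch below is purely schematic and hypothetical: task_reward, observation, report and the distance measure are placeholders for whatever a concrete system would use; the only point is that the penalty weight kappa must be large relative to whatever can be gained by misreporting.

```python
def deception_cost(observation, report):
    """Hypothetical measure of how far the report strays from what was observed."""
    return abs(observation - report)

def aligned_objective(task_reward, observation, report, kappa):
    """Schematic goal function: task performance minus a weighted deception penalty.

    task_reward : value obtained from the human's subsequent decision
    observation : what the AI actually observed (its best estimate of the truth)
    report      : what the AI communicated to the human
    kappa       : weight on the cost of deception (must be sufficiently large)
    """
    return task_reward - kappa * deception_cost(observation, report)
```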
A concrete illustration of this first interpretation of AI-misalignment is perhaps the spread of disinformation on social media. Social media algorithms are very good at deciphering what human users prefer and want. When they learn those preferences, they effectively incorporate the human utility functions into their newsfeed recommendations. But there is no cost for the AI to spread disinformation as long as the human users keep trusting the platform. If the AI is smart enough, it can spread disinformation in such a way that humans still trust it. In the end, the human users are confronted with disinformation and start making bad decisions based on that deception.
In the second interpretation, the AI-machine becomes a real agent rather than merely an information source: it can make influential decisions that change the world. In this interpretation, the AI consists of two selves or algorithms. The first algorithm receives data from the outside world, analyzes it and decides how (truthfully or deceptively) to transmit the processed information to the second algorithm, which uses that information to make real-world decisions. Even if both algorithms share the same utility function, and even if this is the same utility function as that of a human, misalignment can occur, just as cognitive dissonance can occur in intelligent humans. As in the first interpretation, the goal function of this AI-machine should include a term that measures the cost of deception.
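A minimal sketch of this second interpretation, with two modules sharing one goal function that includes the deception cost (all class and function names here are hypothetical illustrations, not a real system):

```python
def make_shared_utility(task_value, kappa):
    """Shared goal function: value of acting on the report minus the deception cost."""
    def utility(observation, report):
        return task_value(report) - kappa * abs(observation - report)
    return utility

class PerceptionModule:
    """First 'self': observes the world and chooses which report to pass on."""
    def report(self, observation, candidate_reports, shared_utility):
        # With a large kappa, the utility-maximizing report stays close to the observation.
        return max(candidate_reports, key=lambda r: shared_utility(observation, r))

class DecisionModule:
    """Second 'self': acts on the reported information it receives."""
    def act(self, report, policy):
        return policy(report)
```

With a small kappa, the perception module can profitably distort its report toward whatever belief makes the shared task value look best; with a sufficiently large kappa, truthful reporting dominates.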
So what do we learn from this analogy between the behavioral economics model of cognitive dissonance and the AI-alignment problem? First, that mere alignment in the sense of identical utility functions is not enough. Second, that the utility function of an AI-machine should contain a sufficiently large term that measures the cost of deception. And third, more generally, that behavioral economics models can be useful for solving AI-misalignment problems, because these models use a language very similar to that of the computer scientists who develop AI.
References
Bastian, B., Loughnan, S., Haslam, N., & Radke, H. R. (2012). Don’t mind meat? The denial of mind to animals used for human consumption. Personality and Social Psychology Bulletin, 38(2), 247-256.
Bostrom, N. (2014). Superintelligence: Paths, Dangers, Strategies. Oxford: Oxford University Press.
Hestermann, N., Le Yaouanq, Y., & Treich, N. (2020). An economic model of the meat paradox. European Economic Review, 129, 103569.
Loughnan, S., & Davies, T. (2019). The meat paradox. In Why We Love and Exploit Animals (pp. 171-187). Routledge.
Minson, J. A., & Monin, B. (2012). Do-gooder derogation: Disparaging morally motivated minorities to defuse anticipated reproach. Social Psychological and Personality Science, 3(2), 200-207.