I’ve just spent the last three days reading Stuart Russell’s new book on AI safety, ‘Human Compatible’. To be fair, I didn’t read continuously for three days: the book rewards thoughtful pauses to walk or drink coffee, because it nurtures reflection about what really matters.
You see, Russell has written a book about AI for social scientists that is also a book about social science for AI engineers, while at the same time providing the conceptual framework to bring us all ‘provably beneficial AI’.
‘Human Compatible’ is necessarily a whistle-stop tour of very diverse but interdependent thinking across computer science, philosophy, and the social sciences, and I recommend that all AI practitioners, technology policymakers, and social scientists read it.
The problem
The key elements of the book are as follows:
- No matter how defensive some AI practitioners get, we all need to agree that there are risks inherent in the development of systems that will outperform us
- Chief among these risks is the concern that AI systems will achieve exactly the goals that we set them, even in cases where we would prefer that they hadn’t
- Human preferences are complex, contextual, and change over time
- Given the foregoing, we must avoid putting goals ‘in the machine’, but rather build systems that consult us appropriately about our preferences.
Russell argues the case for all these points. The argument is informed by an impressive and important array of findings from philosophy, psychology, behavioural economics, and game theory, among other disciplines.
A key problem, as Russell sees it, is that most present-day technology optimizes a ‘fixed externally supplied objective’. This raises safety issues if the objective is not fully specified (which it never can be) and if the system is not easily reset (which is plausible for a range of AI systems).
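To make the contrast concrete, here is a minimal Python sketch of the difference between optimizing a fixed proxy objective and keeping uncertainty about what the objective really is. It is my own toy illustration, not code from the book, and every action name and utility number in it is hypothetical.

```python
# Toy illustration (my own, not from the book) of a fixed, externally supplied
# objective versus an objective the agent is uncertain about. All action names
# and utility numbers are hypothetical.

ACTIONS = ["mine_more_cobalt", "ask_operator", "do_nothing"]

def fixed_objective(action):
    # A fully specified proxy objective: maximize cobalt mined.
    # Side effects score zero simply because nobody wrote them in.
    return {"mine_more_cobalt": 10, "ask_operator": 0, "do_nothing": 0}[action]

def expected_value(action, hypotheses):
    # The agent holds a distribution over what the humans actually want and
    # scores each action by its expected value under that uncertainty.
    return sum(p * utility(action) for p, utility in hypotheses)

# Two hypotheses the agent cannot yet tell apart: humans want the cobalt,
# or humans care far more about the lake the mine would drain.
hypotheses = [
    (0.6, lambda a: {"mine_more_cobalt": 10, "ask_operator": 1, "do_nothing": 0}[a]),
    (0.4, lambda a: {"mine_more_cobalt": -50, "ask_operator": 1, "do_nothing": 0}[a]),
]

print(max(ACTIONS, key=fixed_objective))                          # mine_more_cobalt, always
print(max(ACTIONS, key=lambda a: expected_value(a, hypotheses)))  # ask_operator
```

The fixed-objective agent mines no matter what; the uncertain agent finds that checking with the operator is worth more in expectation than gambling on a possibly mis-specified goal.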
The solution
Russell’s solution is that ‘provably beneficial AI’ will be engineered according to three guidelines:
- The machine’s only objective is to maximize the realization of human preferences
- The machine is initially uncertain about what those preferences are
- The ultimate source of information about human preferences is human behaviour
There are some mechanics that can be deployed to achieve such a design. These include game theory, utilitarian ethics, and an understanding of human psychology. Machines must defer to humans regularly and ask permission, and their programming will explicitly allow for the machines to be wrong, and therefore open to being switched off.
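Russell develops this with formal models (for example, the off-switch game); the toy Python sketch below is my own simplified illustration of the basic intuition, not Russell’s formalism: an agent that is genuinely uncertain about the human’s utility never does worse, in expectation, by letting the human veto its plan.

```python
# Toy sketch (my own, inspired by the off-switch idea Russell discusses,
# not his formal model). U is the unknown human utility of the robot's
# proposed action; the robot only has a noisy belief about it.

import statistics

def value_of_acting(belief_samples):
    # Act immediately: the robot receives whatever U really is,
    # so in expectation it gets the mean of its belief.
    return statistics.mean(belief_samples)

def value_of_deferring(belief_samples):
    # Ask the human first: the human permits the action only when U > 0,
    # otherwise switches the robot off (payoff 0). The robot therefore
    # expects E[max(U, 0)], which is never less than E[U].
    return statistics.mean(max(u, 0.0) for u in belief_samples)

# Hypothetical belief: the action is probably mildly good,
# but there is a real chance it is very bad.
belief = [8.0, 6.0, 5.0, -20.0]

print(value_of_acting(belief))     # -0.25: acting unilaterally looks bad in expectation
print(value_of_deferring(belief))  #  4.75: letting the human veto is better
```

The gap between the two numbers is exactly why a machine that allows for being wrong has an incentive to keep the off switch available rather than disable it.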
Agree with Russell or disagree, he has provided a framework to which disparate parties can now refer, a common language and usable concepts accessible to those from all disciplines to progress the AI safety dialogue.
If you think that goals should be hard-coded, then you must point out why Russell’s warnings about fixed goals are mistaken. If you think that human preferences can always be predicted, then you must explain why centuries of social science research is flawed. And be aware that Russell preempts many of the inadequate slogan-like responses to these concerns.
I found an interesting passage late in the book where the argument is briefly extended from machines to political systems. We vote every few years on a government (expressing our preferences). Yet the government then acts unilaterally (according to its goals) until the next election. Russell is disparaging of this process whereby ‘one byte of information’ is contributed by each person every few years. One can infer that he may also disapprove of the algorithms of large corporate entities with perhaps 2 billion users acting autonomously on the basis of ‘one byte’ of agreement with blanket terms and conditions.
Truly ‘human compatible’ AI will ask us regularly what we want, and then provide that to us, checking to make sure it has it right. It will not dish up solutions to satisfy a ‘goal in the machine’ which may not align with current human interests.
What do we want to want?
The book makes me think that we need to be aware that machines will be capable of changing our preferences (we already experience this with advertising), and indeed machines may do so in order to more easily satisfy the ‘goals in the machine’ (think of online engagement and recommendation engines). Thanks to machines, we are now capable of shaping our environment (digital or otherwise) in ways that shape other people’s preferences. Ought this to be allowed?
We must be aware of this risk. If you prefer A to B, and are made to prefer B, on what grounds is that permitted? As Russell notes, would it ever make sense for someone to choose to switch from preferring A to preferring B, given that they currently prefer A?
This point actually runs very deep and a lot more philosophical thought needs to be deployed here. If we can build machines that can get us what we want, but we can also build machines that can change what we want, then we need to figure out an answer to the following deeply thought-provoking question, posed by Yuval Noah Harari at the end of his book ‘Sapiens’: ‘What do we want to want?’ There is no dismissive slogan answer to this problem.
What ought intelligence be for?
In the present context we are using ‘intelligence’ to refer to the operation of machines, but in a mid-2018 blog I posed the question: what ought intelligence be used for? The point is that we are now debating how we ought to deploy AI, but what uses of other kinds of intelligence are permissible?
The process of developing and confronting an intelligence other than our own is cause for some self-reflexive thought. If there are certain features and uses of an artificial intelligence that we wouldn’t permit, then how are we justified in permitting similar goals and methods in humans? If Russell’s claim that we should want altruistic AI has any force, then why do we permit non-altruistic human behaviour?
Are humans ‘human compatible’?
I put down this book agreeing that we need to control AI (and indeed we can, according to Russell, with good engineering). But if intelligence is intelligence is intelligence, then must we not also turn to humans and constrain them in the same way, so that they don’t pursue ‘goals inside the human’ that are significantly at odds with ‘our’ preferences?
The key here is defining ‘our’. Whose preferences matter? There is a deep and complex history of moral and political philosophy addressing this question, and AI developers would do well to familiarise themselves with key aspects of it. As would corporations, as would policymakers. Intelligence has for too long been used poorly.
Russell notes that many AI practitioners strongly resist regulation and may feel threatened when non-technical influences encroach on ‘their’ domain. But the deep questions above, coupled with the risks inherent due to ‘goals in the machine’, require an informed and collaborative approach to beneficial AI development. Russell is an accomplished AI practitioner speaking on behalf of philosophers to AI scientists, but hopefully this book will speak to everyone.
Elsewhere we sometimes call this the "human alignment problem" and use it as a test case: if we can't design a mechanism at least robust enough to solve human alignment, we probably can't use it to solve AI alignment, because AIs (especially superhuman AIs) are much better optimizers than humans. Some might argue against this, pointing out that humans are fallible in ways that machines are not, but the point is that if you can't make safe something as bad at optimizing as humans, who for a wide variety of reasons often look like they are just taking random walks, you can't possibly hope to make safe something that is reliably good at achieving its goals.
But we can decide what goes inside the machine, whereas with people we can only control outside circumstances. It seems to me that such a mechanism would very likely be an internal mechanism, so it wouldn't be applicable to people.
We're in an analogous situation with AI. AI is too complex for us to fully understand what it does (by design), and this is also true of mundane, human-programmed software (ask any software engineer who has worked on something more than 1k lines long whether their program ever did anything unexpected, and I can promise you the answer is "yes"). Thus, although we in theory have control over what goes on inside AI, that's much less the case than it seems at first, so much so that we often have better models of how humans decide to do things than we do for AI.
Great additional detail, thanks!
Russell's assumption that "The machine’s only objective is to maximize the realization of human preferences" seems to assume some controversial and (in my judgement) highly implausible moral views. In particular, it is speciesist, for why should only human preferences be maximized? Why not animal or machine preferences?
One might respond that Russell is giving advice to humans and humans should maximize human preferences, since we should all maximize our own preferences. Thus, he isn't assuming that there is anything morally special about humans, and his position is therefore not speciesist. I respond that maximizing my own preferences and maximizing human preferences are very different objectives, since there are many humans other than myself. This defence therefore rests on a mischaracterization of Russell's assumption (at least as you outlined it). Furthermore, the assumption that we should maximize our own preferences seems arbitrary and unsupported anyway.
You write that "There are some mechanics that can be deployed to achieve [an AI following the guidelines]. These include game theory, utilitarian ethics, and an understanding of human psychology."
I doubt that a utilitarian ethic is useful for maximizing human preferences, since utilitarianism is impartial in the sense that it takes everyone's wellbeing into account, human or otherwise. I also doubt that it supports the maximization of the agent's own preferences, where "the agent" is assumed to be an individual human, since human preferences have non-utilitarian features. The precise nature of these features depends on what exactly you mean by "preference," so let me illustrate the point with some sensible-sounding definitions of "preference".
(A) An agent is said to prefer x over y, iff he would choose the certain outcome x over the certain outcome y, when given the option.
This makes it tautological that agents maximize their preferences when the necessary factual information is available. However, people often behave in non-utilitarian ways even if they possess all the relevant factual information. They may e.g. spend their money on luxuries instead of donations, or they may support factory farming by buying its products.
(B) An agent is said to prefer x over y, iff he has an urge/craving towards doing x instead of doing y. In other words, the agent would have to muster some strength of will if he is to avoid doing x and do y instead.
People's cravings/urges can often lead them in non-utilitarian directions (think e.g. of a drug addict who would be better off if he could muster the will to quit the drugs).
(C) An agent is said to prefer x over y, iff the feelings/emotions/passions that motivate him towards x are more intense, than those which motivate him towards y. The intensity is here assumed to be some consciously felt feature of the feelings.
Warm glow giving is, by definition, motivated by our feelings/emotions. However, it usually has fairly little impact upon aggregate happiness, so utilitarianism doesn't recommend it.
(D) An agent is said to prefer x over y, iff he values x more than y.
This definition prompts the question "what does 'valuing' refer to?". One possible answer is to define "valuing" like (C), but (C) has already been dealt with. Another option is the following.
(E) An agent values x more than y, iff he believes it to be more valuable.
This would make preference-maximization compatible with utilitarianism, insofar as the agent believes in utilitarianism and lacks beliefs that contradict utilitarianism. However, it would also be compatible with any other moral theory whatsoever, so long as we make the analogous assumptions on behalf of that theory.
It seems worth adding two more comments about (E). First, unlike (A), (B) and (C), it introduces a rationale for maximizing one's preferences. We cannot act on an unknown truth, but only on what we believe to be true. Thus, we must act on our moral beliefs, rather than some unknown moral truth.
Second, (E) seems like a bad analysis of "preference," for although moral views have some preference-like features (specifically, they can motivate behavior), they also have some features that are more belief-like than preference-like. They can e.g. serve as premises or conclusions in arguments, one can have credences in them, and they can be the subject matter of questions.
The view I would advocate is that something like utilitarianism (i.e., some form of impartial, species-indifferent welfare maximization) is a core part of human values. What I mean by 'human values' here isn't on your list; it's closer to an idealized version of our preferences: what we would prefer if we were smarter, more knowledgeable, and had greater self-control.
The language of "human-compatible" is very speciesist, since ethically we should want AGI to be "compatible" with all moral patients, human or not.
On the other hand, the idea of using human brains as a "starting point" for identifying what's moral makes sense. "Which ethical system is correct?" isn't written in the stars or in Plato's heaven; it seems like if the answer is encoded anywhere in the universe, it must be encoded in our brains (or in logical constructs out of brains).
The same is true for identifying the right notion of "impartial", "fair", "compassionate", "taking other species' welfare into account", etc.; to figure out the correct moral account of those important values, you would primarily need to learn facts about human brains. You'd then need to learn facts about non-humans' brains in order to implement the resultant impartiality procedure (because the relevant criterion, "impartiality", says that whether you have human DNA is utterly irrelevant to moral conduct).
The need to bootstrap from values encoded in our brains doesn't and shouldn't mean that humans are the only moral patients (or even that we're particularly important moral patients; insects could turn out to be utility monsters, for all we know today). Hence "human-compatible" is an unfortunate phrase here.
But it does mean that if, e.g., it turns out that cats' ultimate true preferences are to torture all species forever, we shouldn't give that particular preference equal decision weight. Speaking very loosely, the goal is more like 'ensuring all beings get to have a good life', not like 'ensuring all species (however benevolent or sadistic they turn out to be) get an equal say in what kind of life all beings get to live'.
If there's a more benevolent species than humans, I'd hope that sufficiently advanced science could identify that species, and pass the buck to them. (In an odd sense, we're already building an alien species to defer to if we're constructing 'an idealized version of human preferences', since I would expect sufficiently idealized preferences to turn out to be pretty alien compared to the views human beings espouse today.)
I think it's reasonable to worry that given humans' flaws, humans might not in fact build AGI that 'ensures all beings get to have a good life'. But I do think that something like the latter is the goal; and when you ask me what physical facts in the world make that 'the goal', and what we would need to investigate in order to work out all the wrinkles and implementation details, I'm forced to initially point to facts about humans (if only to identify the right notions of 'what a moral patient is' and 'how one ought to impartially take into account all moral patients' welfare').
If "human-compatible" means anything non-speciesistic, then I agree that it is an unfortunate phrase, since it is misleading. I also think it is misleading to call idealized preferences for "human values," since humans don't actually hold those preferences, as you correctly point out.
You write that
Let X be the claims which you deny in this quote. If X is taken literally, then it is a straw man, since no one believes it. If X is metaphorical, then it is very unclear what it's supposed to mean, or whether it means anything. The claim that "ethics is encoded somewhere in the universe" is also unclear. My best attempt to ascribe meaning to it is as follows: "there is some entity in the universe which constitutes all of ethics," but that claim seems false. The most basic ethical principles are, I believe, in some ways like logical principles. The validity of the argument "p and q, therefore p" is not constituted by any feature of the universe. To see this, imagine an alternative universe which differs from the real one in basically any way you like. It's governed by different laws of nature, contains different lifeforms (or perhaps no life at all), has a different cosmological history, etc. If this universe had been real, then "p and q, therefore p" would still be valid. Basic ethical principles, like the claim that suffering is bad, seem just like this. If human preferences (or other features of the universe) were to be different, then suffering would still be bad.
I agree that suffering is bad in all universes, for the reasons described in https://www.lesswrong.com/posts/zqwWicCLNBSA5Ssmn/by-which-it-may-be-judged. I'd say that "ethics... is not constituted by any feature of the universe" in the sense you note, but I'd point to our human brains if we were asking any question like: