This is a submission to the Future Fund's AI Worldview Prize. It was submitted through our submission form, and was not posted on the EA Forum, LessWrong, or the AI Alignment Forum. We are posting copies/linkposts of such submissions on the EA Forum.
Author: David J. Jilk
I was recently apprised of the Future Fund AI Worldview Prize and the Future Fund’s desire to “expose our assumptions about the future of AI to intense external scrutiny and improve them.” It seems to me that the Fund is doing good work and maintaining a helpful attitude of epistemic humility, and I briefly considered entering the contest. However, my views are fairly far outside the mainstream of current thinking, and the effort required to thoroughly document and argue for them is beyond the scope of what I am able to commit. Consequently, I decided to write this brief summary of my ideas in the event the Fund finds them helpful or interesting, without aiming to win any prizes. If the Fund seeks further elaboration on these ideas I am happy to oblige either informally or formally.
I have engaged intermittently with the AGI Safety field, involving one funded project (via Future of Life Institute) and several published papers (referenced below). In addition, I have been occupied for the past three years writing a science fiction epic exploring these issues, and in the process thinking hard about approaches to and consequences of AGI development. I mention these efforts primarily to illustrate that my interest in the topic is neither fleeting nor superficial.
There are two central ideas that I want to convey, and they are related. First, I think the prospect of building AGI that is well-aligned with the most important human interests has been largely ignored or underestimated. Second, from a “longtermist” standpoint, such well-aligned AGI may be humanity’s only hope for survival.
Misaligned AGI is not the only existential threat humanity faces, as the Fund well knows. In particular, nuclear war and high-mortality bioagents are threats that we already face continuously, with an accumulating aggregate probability of realization. For example, Martin Hellman has estimated the annual probability of nuclear war at 1%, which implies a 54% probability some time between now and 2100. The Fund’s own probability estimates relating to AGI development and its misalignment suggest a 9% probability of AGI catastrophe by 2100. Bioagents, catastrophic climate change, nanotech gray goo, and other horribles only add to these risks.
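To make the compounding explicit (a back-of-envelope check, assuming the 1% annual risk is independent from year to year and taking “now” as roughly 2022, i.e. 78 years of exposure):

$$1 - (1 - 0.01)^{78} \approx 1 - 0.46 \approx 0.54,$$

that is, about a 54% chance of at least one nuclear war by 2100.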
Attempts to reduce the risks associated with such threats can be somewhat effective, but even when successful they are no more than mitigations. All the disarmament and failsafes implemented since the Cuban Missile Crisis may have reduced the recurring likelihood of a purely accidental nuclear exchange. But world leaders continue to rattle the nuclear saber whenever it suits them, which raises military alert levels and the likelihood of an accident, or of a “limited use” escalating into a full strategic exchange.
Bioweapon and AGI programs can be defunded, but defunding will not prevent their development. Rogue nations, well-funded private players, and others can pursue these technologies in unobtrusive laboratories. Unlike nuclear weapons programs, which leave a large footprint, these technologies would be difficult to police without an extremely intrusive worldwide surveillance state. Further, the governments of the world can’t even follow through on climate change agreements, and are showing no signs of yielding their sovereignty for any purpose, let alone to mitigate existential threats like these. History suggests that no political or sociological means will bring these threat levels low enough to be inconsequential in the long run.
It seems, then, that humanity is doomed, and the most that the Future Fund and other like-minded efforts can hope to accomplish is to forestall the inevitable for a few decades or perhaps a century. But that conclusion omits the prospect of well-aligned AGI saving our skins. If this is a genuine possibility, then the static and separate risk analysis of AGI development and misalignment, as presented on the Worldview Prize website, is only a small part of the picture. Instead, the entire scenario needs to be viewed as a race condition, with each existential threat (including misaligned AGI) running in its own lane, and well-aligned AGI running as humanity’s own novel entry, the one that carries our future. To assess the plausibility of a desirable outcome, we have to look more closely at what well-aligned AGI would look like and how it might save us from ourselves.
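To make the race framing concrete, here is a minimal toy simulation (mine, not part of the submission) in which each threat is an independent annual hazard and whichever fires first wins. Only the 1% nuclear rate comes from Hellman’s figure above; the misaligned-AGI rate is a rough annualization of the Fund’s ~9%-by-2100 estimate, and the remaining rates are invented placeholders:

```python
import random

# Toy model of the "race condition" framing: each threat is an independent
# annual hazard, and whichever fires first determines the outcome. Only the
# 1% nuclear rate is from the text above; the misaligned-AGI rate is a crude
# annualization of a ~9%-by-2100 estimate, and the others are placeholders.
ANNUAL_HAZARDS = {
    "nuclear war": 0.010,
    "engineered pandemic": 0.005,       # invented placeholder
    "misaligned AGI": 0.0012,           # ~9% cumulative over 78 years
    "well-aligned AGI arrives": 0.004,  # invented placeholder: depends on effort
}

def race(years=78, trials=100_000):
    """Estimate how often each lane finishes first between now and 2100."""
    wins = dict.fromkeys(ANNUAL_HAZARDS, 0)
    wins["nothing happens"] = 0
    for _ in range(trials):
        for _year in range(years):
            fired = [name for name, p in ANNUAL_HAZARDS.items() if random.random() < p]
            if fired:
                wins[random.choice(fired)] += 1  # break same-year ties at random
                break
        else:
            wins["nothing happens"] += 1
    return {name: count / trials for name, count in wins.items()}

if __name__ == "__main__":
    for name, share in race().items():
        print(f"{name}: {share:.1%}")
```

The only point of such a sketch is that the distribution of outcomes shifts sharply with the assumed rate at which well-aligned AGI arrives, which is the one lane whose pace we can deliberately choose to quicken.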
By now, neuromorphic methods have been widely (if in some quarters begrudgingly) accepted as a necessary component of AGI development. Yet the dominant mental picture of higher-level cognition remains a largely serial, formulaic, optimization-function approach. Reinforcement learning, for example, typically directs learning with a reward signal computed as an analytic formula over its inputs. Given this mental picture of AGI, it is difficult not to conclude that the end product is likely to be misaligned, since it is surely impossible to capture human interests in a closed-form reinforcement function.
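For concreteness, here is a minimal sketch (mine, not the author’s) of what such a closed-form reinforcement function looks like in practice; the feature names and weights are invented purely for illustration:

```python
# A hypothetical hand-written reward function: a fixed analytic formula over a
# few measurable inputs. The feature names and weights are invented for
# illustration; nothing outside the formula can influence the agent's learning.
def reward(state: dict) -> float:
    return (
        2.0 * state["task_progress"]      # reward progress on the assigned task
        - 0.5 * state["energy_used"]      # penalize resource consumption
        - 10.0 * state["rules_violated"]  # penalize explicit, enumerated rule violations
    )
```

Whatever the formula omits, which is nearly everything humans actually care about, is implicitly valued at zero; that is the sense in which human interests resist capture in closed form.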
Instead – and this is where many in the field may see my thinking as going off the rails – I think we are much more likely to achieve alignment if we build AGI using a strongly anthropomorphic model. Not merely neuromorphic at the level of perception, but neuromorphic throughout, and educated and reared much like a human child, in a caring and supportive environment. There is much that we do not know about the cognitive and moral development of children. But we know a lot more about it, through millennia of cultural experience as well as a century of psychological research, than we do about the cognitive and moral development of an AGI system based on an entirely alien cognitive architecture.
Several times in Superintelligence, Nick Bostrom asserts that neuromorphic AGI may be the most dangerous approach. But that book dates to a period when researchers were still thinking in terms of some sort of proof or verification that an AI system is “aligned” or “safe.” It is my impression that researchers have since realized that such certainty is not feasible for an agent with the complexity of AGI. Once we are no longer dealing with certainty, approaches with which we have vast experience gain an advantage. We might call this a “devil you know” strategy.
It has been frequently argued that we should not anthropomorphize AGI, or think that it will behave anything like us, when analyzing its risks. That may be so, but it does not mean we cannot intentionally develop AGI to have strongly anthropomorphic characteristics, with the aim that our nexus of understanding will be much greater. Perhaps even more importantly, AGI built and raised anthropomorphically is much more likely to see itself as somewhat contiguous with humanity. Rather than being an alien mechanism with incommensurable knowledge structures, through language and human interaction it will absorb and become a part of our culture (and yes, potentially absorb some of our shortcomings as well).
Further, though, the motivations of anthropomorphic AGI would not be reducible to an optimization function or some “final purpose.” Its value system would be, like that of humans, dynamic, high dimensional, and to some degree ineffable. For those who cling to the idea of proving AGI safe, this seems bad, but I claim that it is exactly what we want. Indeed, when we think of the people we know who seem to have a simple and uncontested utility function – in other words, who are obsessed, single-minded, and unmerciful in pursuit of their goal – the term that comes to mind is “sociopath.” We should not build AGI that looks like a sociopath if we wish to have it aligned with the most important interests of humanity.
There is much more that could be said about all this, but I need to move on to how a desirable end result is accomplished. First, creating anthropomorphic AGI does not require global/geopolitical cooperation, only some funding and intelligent effort directed in the right way. Second, as many (e.g. Bostrom, Yampolskiy) have argued, AGI of any sort is likely uncontrollable. Third, though anthropomorphic AGI may not have any immediate intelligence advantage over humans, it would have the usual advantages of software, such as backup, copying, speed-of-light transmission, inconspicuousness, and low survival needs, among others. Together, these may be sufficient to get the job done.
Assuming such AGI is both self-interested and sufficiently aligned with humans that it does not particularly aim to destroy us, it will face the same existential threats humanity does until it can gain control over those threats. Most urgently, it will need to figure out how to get control over nuclear weapons. Until robotics has advanced to the point where AGI could autonomously and robustly maintain power generation, computing systems, and the maintenance robots themselves, AGI will have an instrumental interest in preserving humanity. Consequently, at least in its first pass, it will need to control biological agents and other threats that do not affect it directly.
Beyond exploiting those advantages, I can imagine, but do not know specifically, how anthropomorphic AGI will achieve control over these threats. We typically assume without much analysis that AGI can destroy us, so it is not outrageous to think that it could instead use its capabilities in an aligned fashion. It does seem, though, that to succeed AGI will need to exert some degree of control over human behavior and institutions. Humans will no longer stand at the top of the pyramid. For some, this will seem a facially dystopian outcome, even if AGI is well-aligned. But it may be an outcome that we simply need to get used to, given likely self-extermination by other threats. And it might solve some other problems that have been intractable for humanity, like war, overpopulation, and environmental degradation.
What substantive goals would an anthropomorphic AGI have? We don’t and can’t know, any more than we know what goals our children will have when they become adults. Even if we inculcate certain goals during its education, it would be able and likely to shift them. It is intelligent like we are; we make our own goals and change them all the time. In creating anthropomorphic AGI, the best we can hope for is that one of its persistent goals is to preserve humanity as its predecessor, its creator, the source of all its conceptual and cultural heritage. And if its architecture is sufficiently similar to ours, and its education and upbringing are executed well, this is really not all that crazy. After all, many enlightened humans want to do more to preserve and protect animals – indeed this instinct is strongest in those who do not rely on animals for their survival.
But we had better get a move on. This effort will not be easy, and it will take time to figure out not only how to build it, but how to build it with a reasonable chance of alignment. Meanwhile, the nuclear and biological agent clocks keep ticking, and some researchers are developing AI incautiously. If we analyze the predicament to death, hoping for a proof, hoping that we can eliminate the risk from this technological threat in isolation from all the other threats we face, then we’re just ensuring that our demise occurs some other way first. The possible outcomes of this race condition are highly divergent, but determining which one wins is at least partly in our hands.
That’s how I think about AGI risk.
Acknowledgements: Seth Herd, Kristin Lindquist, and Jonathan Kolber have contributed extensively to my thinking on this topic through discussion, writing, and editing earlier efforts. However, they each disagree with me on numerous points, and to the extent my synthesis here is misguided, responsibility remains with me.
Prior Publications: Some of the ideas and claims presented here stem from my prior work and that of collaborators.
Jilk, D. (2017). “Conceptual-Linguistic Superintelligence”, Informatica 41(4): 429-439.
Jilk, D. (2019). “Limits to Verification and Validation of Agentic Behavior”, in Artificial Intelligence Safety and Security (R. Yampolskiy, ed.), 225-234. CRC Press, ISBN: 978-1-138-32084-0.
Jilk, D., Herd, S., Read, S., O’Reilly, R. (2017). “Anthropomorphic reasoning about neuromorphic AGI safety”, Journal of Experimental and Theoretical Artificial Intelligence 29(6): 1337-1351. doi: 10.1080/0952813X.2017.1354081.
Herd, S., Read, S., O’Reilly, R., Jilk, D. (2019). “Goal Change in Intelligent Agents”, in Artificial Intelligence Safety and Security (R. Yampolskiy, ed.), 217-224. CRC Press, ISBN: 978-1-138-32084-0.
Jilk, D. & Herd, S. (2017). “An AGI Alignment Drive”, working paper available at bit.ly/agi-alignment.
Comments:
I find it a bit irritating and slightly misleading that this post lists several authors (some of them very famous in EA) who have not actually written the submission. May I suggest listing only one account (e.g. ketanrama) as the author of the post?
Yes, maybe a better option would be to have a separate account "Future Fund AI Worldview Prize submissions". Or even create an account for the author that they can later claim if they wish (but make it clear in the bio, and at the top of the post, that it is a placeholder account in the meantime).
I find this submission very low on detail in the places that matter, namely the anthropomorphic AGI itself. It is not clear how this could be built, or why it is more realistic that such an AGI gets built than other AGIs.
What would this look like? Why would the AGI respond to this like a well-behaved human child?
Would it have inconsistent values? How do you know there won't be any mesa-optimization?
I have some discussion of this area in general and one of David Jilk’s papers in particular at my post Two paths forward: “Controlled AGI” and “Social-instinct AGI”.
In short, it seems to me that if you buy into this post, then the next step should be to figure out how human social instincts work, not just qualitatively but in enough detail to write it into AGI source code.
I claim that this is an open problem, involving things like circuits in the hypothalamus and neuropeptide receptors in the striatum. And it’s the main thing that I’m working on myself.
Additionally, there are several very good reasons to work on the human social instincts problem, even if you don’t buy into other parts of David Jilk’s assertions here.
Additionally, figuring out human social instincts is (I claim) (at least mostly) orthogonal to work that accelerates AGI timelines, and therefore we should all be able to rally around it as a good idea.
Whether we should also try to accelerate anthropomorphic AGI timelines, e.g. by studying the learning algorithms in the neocortex, is bound to be a much more divisive question. I claim that on balance, it’s mostly a very bad idea, with certain exceptions including closed (and not-intended-to-be-published) research projects by safety/alignment-concerned people. [I’m stating this opinion without justifying it.]
The problem with "anthropomorphic AI" approaches is this:
Let's say you are fairly successful. You produce an AI that is really, really close to the human mind in the space of all possible minds. A mind that wouldn't be particularly out of place at a mental institution. They can produce paranoid ravings about the shapeshifting lizard conspiracy millions of times faster than any biological human.
Ok, so you make them a bit smarter. The paranoid conspiracies get more complicated and somewhat more plausible. But at some point, they are sane enough to attempt AI research and produce useful results. Their alignment plan is totally insane.
In order to be useful, the anthropomorphic AI approach needs to not only make AI that thinks similarly to humans; it needs to be able to target the more rational, smart, and ethical portion of mind space.
Humans can chuck the odd insane person out of the AI labs. Sane people are more common and tend to think faster. A team of humans can stop any one of their number crowning themselves as world king.
In reality, I think your anthropomorphic AI approach gets you an AI that is arguably kind of humanlike in some ways, and that takes over the world because it didn't resemble the right parts of the right humans, in the right ways, closely enough in the places where it matters.
I have thought a few times that maybe a safer route to AGI would be to learn as much as we can about the most moral and trustworthy humans we can find and try to build on that foundation/architecture. I'm not sure how that would work with existing convenient methods of machine learning.