(last slight update on 2024-04-18)
My lab has started devoting some resources to AI safety work. As a transparency measure and a way of reaching out, I describe our approach here.
Overall Approach
I select small theoretical and practical work packages that...
- seem manageable in view of our very limited resources,
- match our mixed background in applied machine learning, game theory, agent-based modeling, complex networks science, dynamical systems theory, social choice theory, mechanism design, environmental economics, behavioral social science, pure mathematics, and applied statistics, and
- appear under-explored or neglected but promising or even necessary, according to our subjective assessment based on our reading of the literature and exchanges with people from applied machine learning, computational linguistics, and AI ethics, and, most importantly, with AI alignment researchers (you?).
Initial Reasoning
I believe that the following are likely to hold:
- We don't want the world to develop into a very low-welfare state.
- Powerful AI agents that optimize for an objective not almost perfectly aligned with welfare can produce very low-welfare states.
- AI systems can become powerful if others explicitly give them power or if they are capable enough to acquire it themselves.
- Highly capable AI agents will emerge soon enough.
- It is impossible to specify and formalize sufficiently well what "welfare" actually means (welfare theorists have tried for centuries and still disagree, common people disagree even more).
My puzzling conclusion from this is:
- We can't make sure that AI agents optimize for an objective that is almost perfectly aligned with welfare.
- It is not yet clear that we can (or even want to) prevent AI systems from getting powerful.
- Hence we must try to prevent any powerful AI agent from optimizing for any objective whatsoever.
- Doing so requires designing non-optimizing agents. This appears to be a necessary but not sufficient condition for AI safety, and one that is currently under-researched.
Further reasoning
I also believe the following is likely to hold:
- Even non-optimizing agents with limited cognitive capacities (like Elon Musk) can cause a lot of harm if they are powerful and misaligned.
From this I conclude:
- We must also make sure no agent (whether AI or human, optimizing or not, intelligent or not) can acquire too much power.
Those of you who are Asimov fans like me might like the following...
Six Laws of Non-Optimizing
- Never attempt to optimize* your behavior with regards to any metric. (In particular: don't attempt to become as powerful as possible.)
- Constrained by 1, don't cause suffering or do other harm.
- Constrained by 1-2, prevent other agents from violating 1 or 2.
- Constrained by 1-3, do what the stakeholders in your behavior would collectively decide you should do.
- Constrained by 1-4, cooperate with other agents.
- Constrained by 1-5, protect and improve yourself.
Rather than trying to formalize this or even define the terms precisely, I just use them to roughly guide my work.
*When saying "optimize" I mean it in the strict mathematical sense: aiming to find an exact or approximate, local or global maximum or minimum of some given function. When I mean mere improvements w.r.t. some metric, I just say "improve" rather than "optimize".
Agenda
We are currently (slowly) pursuing two parallel approaches: the first relates to laws 1, 3, and 5 above, the second to law 4.
Non-Optimizing Agents
- Explore several novel variants of aspiration-based policies and related learning algorithms for POMDPs, produce corresponding non-optimizing versions of classical to state-of-the-art tabular, ANN-based, and probabilistic-programming-based RL algorithms, and test and evaluate them in benchmark and safety-relevant environments from the literature, plus in tailor-made environments for testing particular hypotheses. This might or might not be seen as a contribution to Agent Foundations research. (Currently underway as part of AI Safety Camp and SPAR; see the project website and Will Petillo's interview with me)
- Test them in near-term-relevant application areas such as autonomous vehicles, using state-of-the-art complex simulation environments. (Planned with a partner from autonomous-vehicle research)
- Using our game-theoretical and agent-based modeling expertise, study them in multi-agent environments both theoretically and numerically.
- Design evolutionarily stable non-optimizing strategies for non-optimizing agents that cooperate with others to punish violations of law 1 in paradigmatic evolutionary games.
- Use our expertise in adaptive complex networks and dynamical systems theory to study dynamical properties of mixed populations of optimizing and non-optimizing agents: attractors, basins of attraction, their stability and resilience, critical states, bifurcations and tipping behavior, etc.
Collective Choice Aspects
- Analyse existing schemes for Reinforcement Learning from Human Feedback (RLHF) from a Social Choice Theory perspective to study their implicit preference aggregation mechanism and its effects on inclusiveness, fairness, and diversity of agent behavior.
- Reinforcement Learning from Collective Human Feedback (RLCHF): Plug suitable collective choice mechanisms from Social Choice Theory into existing RLHF schemes to make agents obey law 4. (Currently underway; a toy sketch of the aggregation idea follows this list)
- Design collective AI governance mechanisms that focus on inclusion, fairness, and diversity.
- Eventually merge the latter with the hypothetical approach to long-term high-stakes decision making described in this post.
- Co-organize the emerging Social Choice for AI Ethics and Safety (SC4AI) community
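To give a feel for the RLCHF idea mentioned in the list above, here is a minimal, purely illustrative sketch (with made-up names and data format, not our actual scheme; the Borda rule is just one of many candidate mechanisms): a collective choice rule aggregates several annotators' rankings of candidate responses into one collective judgment before anything is handed to the reward model.

```python
def borda_winner(rankings):
    """Return the candidate with the highest Borda score.

    rankings: one ranking per annotator; each ranking is a list of
    candidate ids ordered from most to least preferred."""
    scores = {}
    for ranking in rankings:
        n = len(ranking)
        for position, candidate in enumerate(ranking):
            scores[candidate] = scores.get(candidate, 0) + (n - 1 - position)
    return max(scores, key=scores.get)

# Four hypothetical annotators rank three candidate responses:
rankings = [["r1", "r2", "r3"],
            ["r2", "r1", "r3"],
            ["r1", "r3", "r2"],
            ["r3", "r1", "r2"]]
winner = borda_winner(rankings)   # "r1" (Borda scores: r1=6, r2=3, r3=3)
# The reward model would then be trained on the collectively chosen winner
# instead of on each annotator's individual comparison.
```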
Call for collaboration and exchange
Given almost non-existent funding, we currently rely on voluntary work by a few interns and students writing their theses, so I would be extremely grateful for additional collaborators and people who are willing to discuss our approach.
Thanks
I profited a lot from a few conversations with, amongst others, Yonatan Cale, Simon Dima, Anca Dragan, Clément Dumas, Thomas Finn, Simon Fischer, Scott Garrabrant, Jacob Hilton, Vladimir Ivanov, Bob Jacobs, Jan Hendrik Kirchner, Benjamin Kolb, Vanessa Kosoy, Nathan Lambert, Linda Linsefors, Adrian Lison, David Manheim, Marcus Ogren, Joss Oliver, Will Petillo, Stuart Russell, Phine Schikhof, Hailey Schoelkopf (in alphabetical order). This is not meant to claim their endorsement of anything I wrote here, of course.
Hey Jobst!
Regarding non-optimizing agents,
TL;DR: These videos from Robert Miles changed my mind about this, personally
(I think we talked about that but I'm not sure?)
A bit longer:
Robert (+ @edoarad ) convinced me that an agent that isn't optimizing anything isn't a coherent concept. Specifically, an agent for which a few things are true, like "it won't trade things in a circle so that it ends up losing something and gaining nothing", will have a goal that can be described with a utility function.
If you agree with this, then I think it's less relevant to say that the agent "isn't maximizing anything" and more coherent to talk about "what is the utility function being maximized"
Informally:
If I am a paperclip maximizer, but every 100 seconds I pause for 1 second (and so, I am not "maximizing" paperclips), would this count as a non-optimizer, for you?
Also maybe obvious:
"5. We can't just build a very weak system": Even if you succeed building a non-optimizer, it still needs to be pretty freaking powerful. So using a technique that just makes the AI very weak wouldn't solve the problem as I see it. (though I'm not sure if that's at all what you're aiming at, as I don't know the algorithms you talked about)
Ah,
And I encourage you to apply for funding if you haven't yet. For example here. Or if you can't get funding, I'd encourage you to try talking to a grantmaker who might have higher-quality feedback than me. I'm mostly saying things based on 2 YouTube videos and a conversation.
Something is wrong here, because I fit the description of an "AGI", and yet I do not have a utility function. Within that theorem something is being smuggled in that is not necessary for general intelligence.
Agree. Something that clarified my thinking on this (still feel pretty confused!) is Katja Grace's counterarguments to the basic AI x-risk case. In particular, the section on "Different calls to ‘goal-directedness’ don’t necessarily mean the same concept" and the discussions about "pseudo-agents" clarified how there are other ways for agents to take actions than purely optimizing a utility function (which humans don't do).
I mainly want to say I agree, this seems fishy to me too.
An answer I heard from an agent foundations researcher, if I remember correctly (I complained about almost the exact same thing): Humans do have a utility function, but they're not perfectly approximating it.
I'd add: Specifically, humans have a "feature" of (sometimes) being willing to lose all their money (in expectation) in a casino, and other such things. I don't think this is such a good safety feature (and also, if I had access to my own code, I'd edit that stuff away). But still this seems unsolved to me and maybe worth discussing more. (maybe MIRI people would just solve it in 5 seconds but not me)
It is interesting to think about the seeming contradiction here. Looking at the von Neumann-Morgenstern theorem you linked earlier, the specific theorem is about a rational agent choosing between several different options, and it says that if their preferences follow the axioms (no Dutch-booking etc.), you can build a utility function to describe those preferences.
First of all, humans are not rational, and can be dutch-booked. But even if they were much more rational in their decision making, I don't think the average person would suddenly switch into "tile the universe to fulfill a mathematical equation" mode (with the possible exception of some people in EA).
Perhaps the problem is that the utility function describing an entity's preferences doesn't need to be constant. Perhaps today I choose to buy Pepsi over Coke because it's cheaper, but next week I see a good ad for Coke and decide to pay the extra money for the good associations it brings. I don't think the theorem says anything about that; it seems like the utility function just describes my current preferences and says nothing about how my preferences change over time.
From a neuroscience/psychology perspective, I'd say that you are maximizing your future reward. And while that's not a well-defined thing, it doesn't matter; if you were highly competent, you'd make a lot of changes to the world according to what tickles you, and those might or might not be good for others, depending on your preferences (reward function). The slight difference between turning the world into one well-defined thing and a bunch of things you like isn't that important to anyone who doesn't like what you like.
This is a broader and more intuitive form of the argument Miles is trying to make precise.
If you can be dutch-booked without limit, well, you're just not competent enough to be a threat; but you're not going to let that happen, let alone a superintelligent version of you.
I agree.
Except for one detail: Humans who hold preferences that don't comply with the axioms cannot necessarily be "Dutch-booked" for real. That would require them not only to hold certain preferences but also to always act on those preferences like an automaton; see this nice summary discussion: https://plato.stanford.edu/entries/dutch-book/
"Humans do have a utility function"? I would say that depends on what one means by "have".
Does it mean that the value of a human's life can in principle be measured, only that measure might not be known to the human? Then I would not be convinced – what would the evidence for this claim be?
Or does it mean that humans are imperfect maximizers of some imperfectly encoded state-action-valuation function that is somehow internally stored in their brains and might have been inherited and/or learned? Then I would also not be convinced, as long as one cannot point to evidence that such an evaluation function is actually encoded somewhere in the brain.
Or does it simply mean that the observable behavior of a human can be interpreted as (imperfectly) maximizing some utility function? This would be the classical "as if" argument that economists use to defend modeling humans as rational agents despite all evidence from psychology.
It means humans are highly imperfect maximizers of some imperfectly defined and ever-changing thing: your estimated future rewards according to your current reward function.
It doesn't matter that you're not exactly maximizing one certain thing; you're working toward some set of things, and if you're really good at that, it's really bad for anyone who doesn't like that set of things.
Optimization/maximization is a red herring. Highly competent agents with goals different from yours are the core problem.
Dear Seth,
if Yonatan meant it the way you interpret it, I would still respond: Where is the evidence that such a reward function exists and guides humans' behavior? I spoke to several high-ranking scientists from psychology and social psychology who very much doubt this. I suspect that the theory of humans aiming to maximize reward functions might be a non-testable one, and in that sense "non-scientific" – you might believe in it or not. It helps explain some stuff, but it is also misleading in other respects. I choose not to believe it until I see evidence.
I also don't agree that optimization is a red herring. It is a true issue, just not the only one, and maybe not the most severe one (if one believes one can separate out the relative severity of several interlinked issues, which I don't). I do agree that powerful agents are another big issue, whether competent or not. But powerful, competent, and optimizing agents are certainly the most scary kind :-)
Mismatched goals is the problem. The logic of instrumental convergence applies to any goal, not just maximization goals.
Dear Seth, thank you again for your opinion. I agree that many instrumental goals such as power would be helpful also for final goals that are not of the type "maximize this or that". But I have yet to see a formal argument showing that they would actually emerge in a non-maximizing agent just as likely as in a maximizer.
Regarding your other claim, I cannot agree that "mismatched goals is the problem". First of all, why do you think there is just a single problem, "the" problem? And then, is it helpful to consider something a "problem" that is an unchangeable fact of life? As long as there is more than one human who is potentially affected by an AI system's actions, and these humans' goals are not matched with each other (which they usually aren't), no AI system can have goals matched to all humans affected by it. Unless you want to claim that "having matched goals" is not a transitive relation. So I am quite convinced that the fact that AI systems will have mismatched goals is not a problem we can solve but a fact we have to deal with.
I agree with you that humans have mismatched goals among ourselves, so some amount of goal mismatch is just a fact we have to deal with. I think the ideal is that we get an AGI that makes its goal the overlap in human goals; see [Empowerment is (almost) All We Need](https://www.lesswrong.com/posts/JPHeENwRyXn9YFmXc/empowerment-is-almost-all-we-need) and others on preference maximization.
I also agree with your intuition that having a non-maximizer improves the odds of an AGI not seeking power or doing other dangerous things. But I think we need to go far beyond the intuition; we don't want to play odds with the future of humanity. To that end, I have more thoughts on where this will and won't happen.
I'm saying "the problem" with optimization is actually mismatched goals, not optimization/maximization. In more depth, and hopefully more usefully: I think unbounded goals are the problem with optimization (not the only problem, but a very big one).
If an AGI had a bounded goal like "make one billion paperclips", it wouldn't be nearly as dangerous; it might decide to eliminate humanity to make the odds of getting to a billion as good as possible (I can't remember where I saw this important point; I think maybe Nate Soares made it). But it might decide that its best odds would just be making some improvements to the paperclip business, in which case it wouldn't cause problems.
So we're converging...
One final comment on your argument about odds: In our algorithms, specifying an allowable aspiration includes specifying a desired probability of success that is sufficiently below 100%. This is exactly to avoid the problem of fulfilling the aspiration becoming an optimization problem through the backdoor.
Hey Yonatan,
first, excuse my originally spelling your name incorrectly; I have fixed it now.
Thank you for your encouragement with funding. As it happens, we did apply for funding from several sources and are waiting for their response.
Regarding Rob Miles' videos on satisficing:
One potential misunderstanding relates to the question of with what probability the agent is required to reach a certain goal. If I understand him correctly, he assumes satisficing needs to imply maximizing the probability that some constraint is met, which would still constitute a form of optimization (namely of the probability). This is why our approach is different: In a Markov Decision Process, the client would for example specify a feasibility interval for the expected value of the return (= long-term discounted sum of rewards according to some reward function that we explicitly do not assume to be a proper measure of utility), and the learning algorithm would seek a policy that makes the expected return fall anywhere into this interval.
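To illustrate the basic flavor of this, here is a toy sketch of the probability-mixing idea only (made-up names, not our actual learning algorithm): if two available actions have known expected returns that bracket the aspiration, a suitable randomization between them makes the expected return land exactly on the aspiration, and nothing is maximized in the process.

```python
import random

def mixing_probability(q_low, q_high, aspiration):
    """Probability of taking the higher-return action so that the expected
    return equals the aspiration (requires q_low <= aspiration <= q_high)."""
    assert q_low <= aspiration <= q_high
    if q_high == q_low:
        return 0.0
    return (aspiration - q_low) / (q_high - q_low)

def choose_action(action_low, action_high, q_low, q_high, aspiration):
    """Randomize between two actions; no argmax anywhere."""
    p = mixing_probability(q_low, q_high, aspiration)
    return action_high if random.random() < p else action_low

# Toy numbers: expected returns 2.0 and 10.0, aspiration 6.0
# -> take the high-return action with probability 0.5; expected return = 6.0.
p = mixing_probability(2.0, 10.0, 6.0)   # 0.5
```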
The question of whether an agent somehow necessarily must optimize something is a little philosophical in my view. Of course, given an agent's behavior, one can always find some function that is maximized by that behavior. This is a mathematical triviality. But this is not the problem we need to address here. The problem we need to address is that the behavior of the agent might get chosen by the agent or its learning algorithm by maximizing some objective function.
It is all about a paradigm shift: In my view, AI systems should be made to achieve reasonable goals that are well-specified w.r.t. one or more proxy metrics, not to maximize whatever metric. What would be the reasonable goal for your modified paperclip maximizer?
Regarding "weakness":
Non-maximizing does not imply weak, let alone "very weak". I'm not suggesting building a very weak system at all. In fact, maximizing an imperfect proxy metric will tend to yield a low score on the real utility. Or, to turn this around: the maximum of the actual utility function is typically achieved by a policy that does not maximize the proxy metric. We will study this in example environments and report results later this year.
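A deliberately trivial toy illustration of this claim (made-up numbers, not one of our study environments): a fixed budget is split between an attribute the proxy measures and one it ignores, while the actual utility needs both.

```python
def true_utility(a, b):
    """The 'actual' utility needs both attributes (the weaker one binds)."""
    return min(a, b)

def proxy(a):
    """The proxy metric only measures attribute a."""
    return a

BUDGET = 10.0   # shared resource: a + b = BUDGET, so pushing a starves b

# Maximizing the proxy spends everything on a:
a, b = BUDGET, 0.0
print(proxy(a), true_utility(a, b))   # proxy 10.0, but true utility 0.0

# Satisficing the proxy at an aspiration of 5 leaves resources for b:
a, b = 5.0, BUDGET - 5.0
print(proxy(a), true_utility(a, b))   # proxy 5.0, true utility 5.0
```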
Isn't this equivalent to building an agent (agent-2) that DID have that as their utility function?
Ah, you wrote:
I don't understand this and it seems core to what you're saying. Could you maybe say it in other words?
When I said "actual utility" I meant that which we cannot properly formalize (human welfare and other values) and hence not teach (or otherwise "give" to) the agent, so no, the agent does not "have" (or otherwise know) this as their utility function in any relevant way.
In my use of the term "maximization", it refers to an act, process, or activity (as indicated by the ending "-ation") that actively seeks to find the maximum of some given function. First there is the function to be maximized, then comes the maximization, and finally one knows the maximum and where the maximum is (argmax).
On the other hand, one might object the following: if we are given a deterministic program P that takes input x and returns output y = P(x), we can of course always construct a mathematical function f that takes a pair (x,y) and returns some number r = f(x,y) such that for each possible input x we have P(x) = argmax_y f(x,y). A trivial choice for such a function is f(x,y) = 1 if y = P(x) and f(x,y) = 0 otherwise. Notice, however, that here the program P is given first, and then we construct a specific function f for this equivalence to hold.
In other words, any deterministic program P is functionally equivalent to another program P' that takes some input x, maximizes some function f(x,y), and returns the location y of that maximum. But being functionally equivalent to a maximizer is not the same as being a maximizer.
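A concrete toy version of this construction (purely illustrative):

```python
def P(x):
    """Some given deterministic program (a toy example)."""
    return x * x + 1

def make_f(program):
    """The trivial function f with f(x, y) = 1 if y == program(x), else 0."""
    return lambda x, y: 1 if y == program(x) else 0

def P_prime(x, candidates, f):
    """A maximizer that is functionally equivalent to P on the given
    candidate outputs: it returns the y that maximizes f(x, y)."""
    return max(candidates, key=lambda y: f(x, y))

f = make_f(P)
assert P_prime(7, range(200), f) == P(7)   # identical input-output behavior,
# yet P itself never evaluated f or searched for any maximum.
```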
In the learning agent context: If I give you a learned policy pi that takes a state s and returns an action a = pi(s) (or a distribution of actions), then you might well be able to construct a reward function g that takes a state-action pair (s,a) and returns a reward (or expected reward) r = g(s,a) such that, when I then calculate the corresponding optimal state-action quality function Q* for this reward function, it turns out that for all states s we have pi(s) = argmax_a Q*(s,a). This means that pi is the same policy as the one that a learning process would have produced by searching for the policy that maximizes the long-term discounted sum of rewards according to reward function g. But it does not mean that pi was actually determined by such an optimization procedure: the learning process that produced pi can very well be of a completely different kind than an optimization procedure.
This is a start, but just a start. Optimization/maximization isn't actually the problem. Any highly competent agent with goals that don't match ours is the problem.
A world that's 10% paperclips and the rest composed of other stuff we don't care about is no better than one produced by a true optimizer.
The idea "just don't optimize" has a surprising amount of support in AGI safety, including quantilizers and satisficing. But they seem like only a bare start on taking the points off of the tiger's teeth to me. The tiger will still gnaw you to death if it wants to even a little.
Hi Seth, thank you for your thoughts!
I totally agree that it's just a start, and I hope to have made clear that it is just a start. If it was not sufficiently clear before, I have now added more text making explicit that of course I don't think that dropping the optimization paradigm is sufficient to make AI safe, just that it is necessary. And because it appears necessary and under-explored, I chose to study it for some time.
I don't agree with your 2nd point however: If an agent turns 10% of the world into paperclips, we might still have a chance to survive. If it turns everything into paperclips, we don't.
Regarding the last point:
Hey! Can you elaborate a bit more on what you mean by "never optimise" here? It seems like the definition you have is broad enough to render an AI useless:
It seems like this definition would apply to anything that uses math to make decisions. If I ask the AI to find me the cheapest flight it can from London to New York tomorrow, will it refuse to answer?
Also, I don't understand the distinction with "improvement" here. If I try to "improve" the estimate of the cheapest flight, isn't that the same thing as trying to "optimise" to find the approximate local minimum of cost?
This is difficult to say. I have a relatively clear intuition what I mean by optimization and what I mean by optimizing behavior. In your example, merely asking for the cheapest flight might be safe as long as you don't automatically then book that flight without spending a moment to think about whether taking that one-propeller machine without any safety belts that you have to pilot yourself is actually a good idea just because it turned out to be the cheapest. I mostly care about agents that have more agency than just printing text to your screen.
I believe what some people call "AI heaven" can be reached with AI agents that don't book the cheapest flights but book you a flight that costs no more than you specify, takes no longer than you specify, and has at least the safety equipment and other facilities that you specify. In other words: satisficing! Another example: not find me a job that earns as much income as possible, but find me a job that earns at least enough income to satisfy all my basic needs and lets me have as much fun from leisure activities as I can squeeze into my lifetime. And so on...
Regarding "improvement": Replacing a state s by a state s' that scores higher on some metric r, so that r(s') > r(s), is an "improvement w.r.t. r", not an optimization for r. An optimization would require replacing s by that s' for which there is no other s'' with r(s'') > r(s'), or some approximate version of this.
One might think that a sequence of improvements must necessarily constitute an optimization, so that my distinction is unimportant. But this is not correct: while any sequence of improvements r(s1) < r(s2) < ... must make r(sn) converge to some value r° (at least if r is bounded), this limit value r° will in general be considerably lower than the maximal value r* = max r(s), unless the procedure that selects the improvements is specifically designed to find that maximum, in other words, unless it is an optimization algorithm. Note that optimization is a hard problem in most real-world cases, much harder than just finding some sequence of improvements.
With regards to your improvements definition, isn't "continuously improving until you reach a limit which is not necessarily the global limit" just a different way of describing local optimization? It sounds like you're just describing a hill climber.
I do agree with building a satisficer, as this describes more accurately what the user actually wants! I want a cheap flight, but I wouldn't be willing to wait 3 days for the program to find the cheapest possible flight that saved me 5 bucks. But on the other hand, if I told it to find me flights under 500 bucks, and it served me up a flight for 499 bucks even though there was another equally good option for 400 bucks, I'd be pretty annoyed.
It seems like some amount of local optimisation is necessary for an AI to be useful.
That depends what you mean by "continuously improving until you reach a limit which is not necessarily the global limit".
I guess by "continuously" you probably do not mean "in continuous time" but rather "repeatedly in discrete time steps"? So you imagine a sequence r(s1) < r(s2) < ... ? Well, that could converge to anything larger than each of the r(sn). E.g., if r(sn) = 1 - 1/n, it will converge to 1. (It will of course never "reach" 1 since it will always below 1.) This is completely independent of what the local or global maxima of r are. They can obviously be way larger. For example, if the function is r(s) = s and the sequence is sn = 1 - 1/n, then r(sn) converges to 1 but the maximum of r is infinity. So, as I said before, unless your sequence of improvements is part of an attempt to find a maximum (that is, part of an optimization process), there is no reason to expect that it will converge to some maximum.
Btw., this also shows that if you have two competing satisficers whose only goal is to outperform the other and who therefore repeatedly improve their reward to be larger than the other agents' current reward, this does not imply that their rewards will converge to some maximum reward. They can easily be programmed to avoid this by just outperforming the other by an amount of 2**(-n) in the n-th step, so that their rewards converge to the initial reward plus one, rather than to whatever maximum reward might be possible.
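A tiny simulation of that 2**(-n) example (illustrative only, with made-up names):

```python
def competing_satisficers(initial_reward=0.0, steps=40):
    """Two agents that alternately 'outperform' each other by 2**(-n) at step n.

    Both rewards increase forever, yet they converge to initial_reward + 1,
    not to whatever maximum reward might be achievable."""
    r_a = r_b = initial_reward
    for n in range(1, steps + 1):
        if n % 2 == 1:
            r_a = r_b + 2 ** (-n)   # A outperforms B by a shrinking margin
        else:
            r_b = r_a + 2 ** (-n)   # B outperforms A by a shrinking margin
    return r_a, r_b

print(competing_satisficers())   # both values approach 1.0
```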
Ah, well explained, thank you. Yes, I agree now that you can theoretically improve toward a limit without that limit being a local maximum. Although I'm unsure if the procedure could end up being equivalent in practice to a local maximisation with a modified goal function (say, one that penalises going above "reward + 1" with exponential cost). Maybe something to think about going forward.
Thanks for answering the questions, best of luck with the endeavour!
If your goal is to prevent an agent from being incentivized to pursue narrow objectives in an unbounded fashion (e.g. "paperclip maximizer"), you can do this within the existing paradigm of reward functions by ensuring that the set of rewards simultaneously includes:
1) Contradictory goals, and
2) Diminishing returns
Either one of these on their own is insufficient. With contradictory goals alone, the agent can maximize reward by calculating which of its competing goals is more valuable and disregarding everything else. With diminishing returns alone, the agent can always get a little more reward by pursuing the goal further. But when both are in place, diminishing returns provides automatic, self-adjusting calibration to bring contradictory goals into some point of equilibrium. The end result looks like satisficing, but dodges all of the philosophical questions as to whether "satisficing" is a stable (or even meaningful) concept as discussed in the other comments.
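As a toy numerical sketch of this combination (illustrative numbers only, not a proposal): a fixed budget split between two competing goals, each with concave (diminishing-returns) reward, is best used at an interior equilibrium rather than at either extreme.

```python
import math

def reward(x, budget=10.0):
    """Two contradictory goals compete for the same fixed budget; each has
    diminishing returns (concave sqrt)."""
    return math.sqrt(x) + math.sqrt(budget - x)

# Scan the trade-off: reward peaks at an interior equilibrium (x = budget/2);
# pushing either goal to its extreme lowers the total, so neither goal is
# pursued in an unbounded fashion.
best_value, best_x = max((reward(x / 10), x / 10) for x in range(0, 101))
print(best_value, best_x)   # about 4.47 at x = 5.0, versus about 3.16 at x = 0 or 10
```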
Obviously there are deep challenges with the above, namely:
(1) Both properties must be present across all dimensions of the agent's utility function. Further, there must not be any hidden "win-win" solutions that bring competing goals into alignment so as to eliminate the need for equilibrium.
(2) The point of equilibrium must be human-compatible.
(3) 1 & 2 must remain true as the agent moves further from its training environment, as well as if it changes, such as by self-improvement.
(4) Calibrating equilibria requires the ability to reliably instill goals into an AI in the first place, currently lacking since ML only provides the indirect lever of reinforcement.
But most of these reflect general challenges within any approach to alignment.
Dear Will,
thanks for these thoughtful comments. I'm not sure I understand some aspects of what you say correctly, but let me try to make sense of this using the example of Zhuang et al., http://arxiv.org/abs/2102.03896. If the utility function is defined only in terms of a proper subset of the attributes, the optimization will exploit the seemingly irrelevant remaining attributes, whether or not some of the attributes it uses represent conflicting goals. Even when conflicting goals are "present across all dimensions of the agent's utility function", that utility function might simply ignore relevant side-effects, e.g. because the designers and teachers have not anticipated them at all.
Their example in Fig. 2 shows this nicely. In contrast, with a satisficing goal of achieving only, say, 6 in Fig. 2, the agent will not exploit the unrepresented features as much and actual utility will be much larger.