Many thanks to Jan for commenting on a draft of this post.
There were a lot of great comments on "Let's see you write that corrigibility tag". This is my attempt at expanding Jan Kulveit's comment[1], because I thought it was useful, and should be read more widely. This post may not accurately represent Jan's thoughts. This post took approximately 3 hours to write.
Introduction
'Corrigible' is a word that is used in a lot of different ways in AI alignment. Here, we're using Eliezer Yudkowsky's framing from mad investor chaos and the woman of asmodeus (a story about a lot of things, but also about alignment). In the story, there is a world that plans to create a limited AI in an emergency, equipped with as many safeguards as its designers have come up with by that time.
The initial craft doesn't have to be literally perfect to work perfectly in the end, it just has to be good enough that its reflection and self-correction ends up in exactly the right final place.
Eliezer asked if LessWrong could write a list of safeguards for a limited AI like this without his input. Eliezer's answer is available here; this post is an explanation of Jan's answer. Jan notes that his answer was written in Eliezer's frame, not the frame he would normally think in.
We'll look at examples of five types of tool: tools against instrumental convergence, tools against unbounded rationality, 'traps', oversight methods, and interpretability methods.
Five Types of Tool
Against Instrumental Convergence
"Instrumental convergence is the theorized tendency for most sufficiently intelligent agents to pursue potentially unbounded instrumental goals, such as self-preservation and resource acquisition".
Instrumental convergence precludes corrigibility. For instance, many theoretical agents will not allow their goal function to be changed, because this will reduce their ability to achieve their current goals. Additionally, pursuit of instrumental goals can break some of the other corrigibility methods. Preventing instrumental convergence is therefore key to achieving corrigibility. Jan outlines three tools that could be used to help prevent instrumental convergence: preventing resource acquisition, uncertainty about goals, and making the AI want to not exist in the future.
Resource acquisition can be disincentivized by using a mutual information measure between the AI and distant parts of the environment (e.g. an asteroid, parts of the opposite side of the earth, parts of the core of the earth). Mutual information here measures how much knowing the state of one tells you about the state of the other. If the AI takes actions that massively impact these parts of the environment (e.g. blowing up the earth for raw metals, asteroid mining, etc.), it will be very easy to guess their state from perfect knowledge of the state of the AI. As the AI is penalised for high mutual information, actions with large consequences are discouraged.
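As a toy illustration (my own, not from Jan's comment), here is a minimal sketch of how such a penalty could be computed for discrete state variables sampled from rollouts. The penalty weight MI_PENALTY and the variable names are made up for the example.

```python
import math
from collections import Counter

def mutual_information(pairs):
    """Empirical mutual information (in bits) between two discrete variables,
    estimated from a list of (x, y) samples."""
    n = len(pairs)
    joint = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    mi = 0.0
    for (x, y), c in joint.items():
        p_xy = c / n
        mi += p_xy * math.log2(p_xy / ((px[x] / n) * (py[y] / n)))
    return mi

# Samples of (AI internal state, state of a distant part of the environment)
# gathered from rollouts. High mutual information means the AI's actions
# strongly determine the distant state, so we subtract it from the task reward.
MI_PENALTY = 10.0  # hypothetical penalty weight

def shaped_reward(task_reward, rollout_pairs):
    return task_reward - MI_PENALTY * mutual_information(rollout_pairs)
```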
Jan suggests the AI should have uncertainty over its intended task, with a reasonable prior belief. This should incentivise the AI to do things like ask humans questions to clarify its goals. This concept is explored more in Stuart Russell's book Human Compatible, though I'm not aware of any empirical work demonstrating AI with goal uncertainty.
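A minimal Bayesian sketch of what goal uncertainty could look like in practice (my own illustration, with hypothetical candidate goals and an arbitrary confidence threshold): the agent keeps a posterior over interpretations of its task and asks a clarifying question whenever that posterior is still too uncertain.

```python
import math

# Hypothetical candidate interpretations of the task, with a prior over them.
posterior = {"fetch coffee": 0.4, "fetch tea": 0.35, "tidy the kitchen": 0.25}

def entropy(dist):
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def update(posterior, likelihoods):
    """Bayes update given P(human answer | goal) for each candidate goal."""
    unnorm = {g: posterior[g] * likelihoods[g] for g in posterior}
    z = sum(unnorm.values())
    return {g: p / z for g, p in unnorm.items()}

CONFIDENCE_THRESHOLD = 0.5  # bits of remaining uncertainty we tolerate

if entropy(posterior) > CONFIDENCE_THRESHOLD:
    print("Ask the human a clarifying question before acting.")
    # e.g. the human answers "something to drink", which rules out tidying:
    posterior = update(posterior, {"fetch coffee": 0.5, "fetch tea": 0.5,
                                   "tidy the kitchen": 0.01})
```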
Finally, we can give the AI a terminal goal of not existing in the future. If non-existence is made sufficiently difficult to achieve without fulfilling the task, the AI is encouraged to complete the task in order to stop existing, because that becomes the safest/most likely path to non-existence.
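As a toy reward sketch (purely illustrative, not a proposal from Jan's comment): the only positive reward comes from being shut down, and shutdown is gated on task completion, so finishing the task is the sanctioned route to the thing the agent terminally wants.

```python
def reward(state):
    """Toy terminal-goal reward: the agent is only rewarded for no longer
    running, and earns nothing for staying active."""
    if state["shut_down"]:
        return 1.0   # the only source of positive reward
    return 0.0

def can_shut_down(state):
    # Shutdown only becomes available once the task is done, so completing
    # the task is the easiest path to the reward above.
    return state["task_complete"]
```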
Against Unbounded Rationality
Unbounded rationality is related to instrumental convergence; many agents will want to become more rational in order to better achieve their goals. Many agents will also have unbounded reward functions, where they get a theoretically unlimited amount of reward for fulfilling more of their reward function (e.g. making infinitely many paperclips). Jan outlines three methods for limiting unbounded rationality: a penalty on reasoning, making satisficers rather than maximisers, and making myopic agents.
To reduce the desire for increased intelligence, we can attach a small utility cost to every bit-flip that occurs in the system. This imposes a penalty on how much reasoning the AI does, so it has to balance the expected gains from more intelligence against the built-in utility cost. Jan suggests the cost function could be very small initially, but quickly get steeper as the amount of processing goes up. Jan suggests this works best in combination with other corrigibility mechanisms, but it is fundamentally an attempt to keep the 'limited AI' actually limited.
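Here is a minimal sketch of that kind of cost curve (the constants and the cubic shape are my own illustration, not from Jan's comment): nearly free for small amounts of computation, then steeply increasing, and subtracted from task utility.

```python
def compute_cost(bit_flips, knee=1e9, power=3):
    """Nearly free below `knee` operations, then steeply increasing.
    The constants are illustrative."""
    return (bit_flips / knee) ** power

def net_utility(task_utility, bit_flips):
    # Extra reasoning is only worth it if the expected gain in task utility
    # outweighs the built-in cost of the computation spent on it.
    return task_utility - compute_cost(bit_flips)

print(net_utility(1.0, 1e8))   # cheap thought: cost is only 0.001
print(net_utility(1.2, 5e9))   # heavy thought: cost is 125, clearly not worth it
```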
To demonstrate the value of satisficers (which have bounded goals, rather than unbounded goals), we'll use the paperclip example. An unbounded model, which gets ever-increasing reward for more paperclips with no end, will be incentivised to destroy human civilisation to access resources that can be used to make paperclips. A satisficer, which gets more reward for each paperclip it collects up to 100 paperclips but no additional reward for the 101st onwards, will select randomly from actions which gain at least 100 paperclips (or select the policy with the maximal probability of attaining 100 paperclips, which can still cause some risks). Here is an example paper in this direction.
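The difference is easy to see as a pair of toy reward functions (a minimal sketch of the paperclip example above):

```python
def maximiser_reward(paperclips):
    return paperclips            # unbounded: more is always strictly better

def satisficer_reward(paperclips):
    return min(paperclips, 100)  # capped: 100 and 10,000 paperclips score the same

# Under the capped reward, any policy that reliably reaches 100 paperclips is
# optimal, so there is no incentive to strip the planet for the 101st clip.
```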
Finally, myopic means short-sighted. In the context of AI, it refers to agents with a strong 'discount factor': future reward is much less important to them than near-term reward. This is an important alignment tool because it encourages the AI to take the shortest possible path to its terminal rewards, reducing instrumental convergence and disincentivising e.g. taking time to increase its intelligence or gather many resources. In particular, this combined with the AI not wanting to exist in the future encourages the AI to complete its task quickly without taking over the whole world first.
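A quick worked example of how the discount factor changes which plan looks best (the plans and numbers are made up for illustration):

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over a sequence of per-step rewards."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Plan A: finish the task directly; the reward arrives at step 2.
plan_a = [0, 0, 10]
# Plan B: spend 20 steps acquiring resources first, then get a bigger reward.
plan_b = [0] * 20 + [15]

for gamma in (0.99, 0.5):
    print(gamma, discounted_return(plan_a, gamma), discounted_return(plan_b, gamma))
# With gamma = 0.99 the patient plan B wins (~12.3 vs ~9.8);
# with a myopic gamma = 0.5 plan A wins (2.5 vs ~1.4e-5).
```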
Traps
We can use a number of 'traps' left in the environment or in the AI which disincentivise bad behaviour. These include tripwires, watchdogs, and uncertainty about the level of simulation.
Firstly, tripwire artifacts. These are related to the mutual information measure above: if we attach a large negative reward to changing the state of certain distant parts of the environment, and hide from the system which parts these are, then the AI is incentivised to be 'low-impact' and avoid massively altering the environment.
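A toy sketch of the idea (the variable names and penalty size are my own illustration): the designers secretly pick some distant state variables as tripwires, and the agent, not knowing which ones, is safest if it disturbs as little as possible.

```python
import random

# The designers secretly choose some distant state variables as tripwires.
ALL_DISTANT_VARS = ["asteroid_7", "earth_core_temp", "antipodal_forest", "moon_orbit"]
TRIPWIRES = set(random.sample(ALL_DISTANT_VARS, k=2))  # hidden from the agent

TRIPWIRE_PENALTY = -1e6  # illustrative, very large negative reward

def environment_reward(task_reward, changed_vars):
    # Any action that touches a hidden tripwire wipes out the task reward.
    if any(v in TRIPWIRES for v in changed_vars):
        return TRIPWIRE_PENALTY
    return task_reward
```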
We can also use watchdogs -- small, fast systems external to the agent which are trained to react quickly to out-of-distribution behaviour or certain specific bad or unexpected outcomes. This works best if the agent is uncertain how the watchdogs are set up.
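A minimal sketch of what one such watchdog could look like (my own illustration, assuming we monitor a single scalar statistic of the agent's behaviour and flag values far outside the training distribution):

```python
import statistics

class Watchdog:
    """Tiny, fast monitor: flags behaviour far outside what was seen in training."""
    def __init__(self, training_values, z_threshold=4.0):
        self.mean = statistics.fmean(training_values)
        self.std = statistics.pstdev(training_values) or 1e-9
        self.z_threshold = z_threshold

    def check(self, value):
        z = abs(value - self.mean) / self.std
        return z > self.z_threshold  # True -> trigger shutdown / human review

# e.g. monitor the agent's rate of outbound network requests per minute
watchdog = Watchdog(training_values=[3, 5, 4, 6, 5, 4])
if watchdog.check(250):
    print("Anomalous behaviour: halt the agent and alert a human.")
```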
Lastly, we can make the agent uncertain if it is in training or deployment. In this case, the agent is forced to behave in ways that it expects humans will find acceptable (or at worst, randomly defect[2]). This works best if the agent is also uncertain how many such simulated test runs occur (so it does not know how often it 'should' randomly defect), but may be practically difficult to achieve (how do you simulate reality well enough to fool a superintelligence?).
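A small worked calculation (my own illustration, under the simplifying assumption that each episode is equally likely to be any of the N simulated runs or the single real deployment) of why this uncertainty makes defection unattractive:

```python
def expected_defection_value(n_simulated_runs, gain_if_real, penalty_if_caught):
    """Expected value of defecting in one episode when the agent cannot tell
    training from deployment."""
    p_real = 1 / (n_simulated_runs + 1)
    return p_real * gain_if_real - (1 - p_real) * penalty_if_caught

# With 999 hidden test runs, even a large gain from defecting in deployment
# is outweighed by the near-certainty of being caught in simulation.
print(expected_defection_value(999, gain_if_real=1000, penalty_if_caught=10))
# -> 1.0 - 9.99 = -8.99
```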
Oversight
Oversight can be used to encourage the AI to behave in ways that line up with human values. Tools here include a human-approval model, value learning, a legal-system model, and AI-debate-style red-teaming of action plans.
First, we can train a model to imitate humans approving/disapproving of actions that the AI takes, and penalise the AI when this model expects that humans would disapprove of its plan. Similarly, we can train a model which estimates what humans will say about various ethical scenarios, and penalise the agent when this model disapproves of its plan of action. We can also penalise the agent for breaking laws e.g. violating property rights, by training a model to define these.
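One way these oversight models could be combined is as a veto on plans, sketched below. This is my own illustration: the three model arguments are stand-ins for trained predictors, and the threshold is arbitrary.

```python
def plan_is_acceptable(plan, approval_model, ethics_model, law_model,
                       threshold=0.9):
    """Veto a plan unless every oversight model rates it as acceptable.
    The three models are stand-ins for trained predictors of human approval,
    human ethical judgement, and legality."""
    scores = [
        approval_model(plan),  # predicted probability a human approves
        ethics_model(plan),    # predicted probability humans call it ethical
        law_model(plan),       # predicted probability it breaks no law
    ]
    return all(score >= threshold for score in scores)
```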
Jan also suggests using red-teaming as a method of producing automated oversight. This means building the system from multiple sub-parts with different goals: for example, one part plans how to produce text that fits the prompt, another part is trained to judge that text on how harmless it is, and the first part is trained to get a good score from the second. This can be done when training the AI, in addition to having an ethical model during deployment (as above).
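Schematically, the training signal for the generator could be composed like this. The two models below are dummy stand-ins (my own illustration) to show the shape of the loop, not a real RLHF implementation.

```python
import random

def dummy_generator(prompt):
    # Stand-in for the text-producing part of the system.
    return prompt + " ... generated answer"

def dummy_harmlessness_judge(text):
    # Stand-in for the trained judge; returns a harmlessness score in [0, 1].
    return random.random()

def training_signal(prompt, task_score_fn, harm_weight=1.0):
    """Reward for the generator: task quality plus the judge's harmlessness
    score. In a real system the generator would be updated (e.g. by RL) to
    increase this signal; here we only show how the signal is composed."""
    text = dummy_generator(prompt)
    return task_score_fn(text) + harm_weight * dummy_harmlessness_judge(text)
```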
Interpretability
Our final class of tools is interpretability tools. An AI is interpretable if humans are able to follow its internal logic, how it came to its decisions, what its current plan of action is, and what it expects the consequences of that plan to be.
Jan suggests two interpretability methods -- imposing strong incentives (in the reward function of the AI) for being highly modular, and having messaging between modules that is human-understandable. Jan also suggests training a separate 'translator' system that produces human-understandable explanations of what the system is doing. One promising direction towards this kind of tool is OpenAI's work to automate interpretability by asking GPT-4 to interpret neurons in GPT-2.
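As a toy sketch of what a modularity incentive could look like (my own illustration, not how Jan proposes to implement it): penalise connections whose endpoints sit in different modules, so the training process is pushed toward block-structured, more interpretable networks.

```python
def cross_module_penalty(weights, module_of):
    """Sum of |w| over connections that cross module boundaries.
    `weights` maps (unit_i, unit_j) -> weight; `module_of` maps unit -> module id.
    Adding this penalty to the loss pushes the network toward modular structure."""
    return sum(abs(w) for (i, j), w in weights.items()
               if module_of[i] != module_of[j])

weights = {("a1", "a2"): 0.8, ("a1", "b1"): 0.3, ("b1", "b2"): 0.5}
module_of = {"a1": "A", "a2": "A", "b1": "B", "b2": "B"}
print(cross_module_penalty(weights, module_of))  # 0.3: only the A-to-B link counts
```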
Conclusion
Although there's still a lot of technical work to be done on a number of these proposals (i.e. this isn't a complete solution for alignment), it does cut the problem into more manageable pieces, for many of which the groundwork can be (or is being) laid today, e.g. OpenAI's work on interpretability, constitutional AI, etc.