This is my second distillation of Risks from Learned of Optimization in Advanced Machine Learning Systems focusing on deceptive alignment.

As the old saying goes, alignment may go wrong in many different ways, but right only in one (and hopefully, we find that one!). To get an idea of how a mesa-optimizer can be playing the game of deceptive alignment that I explained in a previous post, we'll look at three possible scenarios of pseudo-alignment. Essentially, the question I'm trying to answer here is what is it that makes the mesa-optimizer pursue a new objective, i.e., the mesa-objective? Each of the following scenarios gives an answer to this question.

Recall that when a mesa-optimizer is deceptively aligned, it is optimizing for an objective other than the base objective while giving off the impression that it's aligned, i.e., that it's optimizing for the base objective.

Scenario 1: Proxy alignment

The mesa-optimizer starts searching for ways to optimize for the base objective. I call it "ways" but the technically accurate term is "policies" or "models"( although "models" is used for many things and it can be confusing). During this search, it stumbles upon a proxy of the base objective and starts optimizing for the proxy instead. But what does a proxy do? Proxies tend to be instrumentally valuable steps on the way towards achieving a goal, i.e., things an optimizer has to do to complete a task successfully.

To prevent the misalignment from happening, we must be in control of the search over models.

There are two cases of proxy alignment to keep in mind:

Side-effect alignment

Imagine that we are training a robot to clean the kitchen table. The robot optimizes the number of times it has cleaned the table. Wiping down the table causes the table to be clean. By doing this, the robot would score high if judged according to the standards of the base objective. Now we deploy the robot in an environment where it has a chance to spill coffee on the table right after wiping it down. There's no reason why the robot won't take that chance. It'll start spilling coffee and then continue cleaning it.

In this case, the mesa-optimizer optimizes for the mesa-objective, but this directly increases the base objective:

Instrumental alignment

Imagine we have another robot and here we're training it to clean crumbles off t...

Alignment Theory Series