It's quite possible someone has already argued this, but I thought I should share just in case not.
Goal-Optimisers and Planner-Simulators
When people in the past discussed worries about AI development, this was often about AI agents - AIs that had goals they were attempting to achieve, objective functions they were trying to maximise. At the beginning we would make fairly low-intelligence agents, which were not very good at achieving things, and then over time we would make them more and more intelligent. At some point around human-level they would start to take-off, because humans are approximately intelligent enough to self-improve, and this would be much easier in silicon.
This does not seem to be exactly how things have turned out. We have AIs that are much better than humans at many things, such that if a human had these skills we would think they were extremely capable. And in particular LLMs are getting better at planning and forecasting, now beating many but not all people. But they remain worse than humans at other things, and most importantly the leading AIs do not seem to be particularly agentic - they do not have goals they are attempting to maximise, rather they are just trying to simulate what a helpful redditor would say.
What is the significance for existential risk?
Some people seem to think this contradicts AI risk worries. After all, ignoring anthropics, shouldn’t the presence of human-competitive AIs without problems be evidence against the risk of human-competitive AI?
I think this is not really the case, because you can take a lot of the traditional arguments and just substitute ‘agentic goal-maximising AIs, not just simulator-agents’ in wherever people said ‘AI’ and the argument still works. It seems like eventually people are going to make competent goal-directed agents, and at that point we will indeed have the problems of their exerting more optimisation power than humanity.
In fact it seems like these non-agentic AIs might make things worse, because the goal-maximisation agents will be able to use the non-agentic AIs.
Previously we might have hoped to have a period where we had goal-seeking agents that exerted influence on the world similar to a not-very-influential person, who was not very good at planning or understanding the world. But if they can query the forecasting-LLMs and planning-LLMs, as soon as the AI ‘wants’ something in the real world it seems like it will be much more able to get it.
So it seems like these planning/forecasting non-agentic AIs might represent a sort of planning-overhang, analogous to a Hardware Overhang. They don’t directly give us existentially-threatening AIs, but they provide an accelerant for when agentic-AIs do arrive.
How could we react to this?
One response would be to say that since agents are the dangerous thing, we should regulate/restrict/ban agentic AI development. In contrast, tool LLMs seem very useful and largely harmless, so we should promote them a lot and get a lot of value from them.
Unfortunately it seems like people are going to make AI agents anyway, because ML researchers love making things. So an alternative possible conclusion would be that we should actually try to accelerate agentic AI research as much as possible, and decelerate tool LLM planners, because eventually we are going to have influential AI maximisers, and we want them to occur before the forecasting/planning overhang (and the hardware overhang) get too large.
I think this also makes some contemporary safety/alignment work look less useful. If you are making our tools work better, perhaps by understanding their internal working better, you are also making them work better for the future AI maximisers who will be using them. Only if the safety/alignment work applies directly to the future maximiser AIs (for example, by allowing us to understand them) does it seem very advantageous to me.
2024-07-15 edit: added clarification about differential progress.
Suppose we have some LLM interpritability technology that helps us take LLMs from a bit worse than humans at planning to a bit better (say because it reduces the risk of hallucinations), and these LLMs will ultimately be used by both humans and future agentic AIs. The improvement from human-level planning to better-than-human level benefits both humans and optimiser AIs. But the improvement up to human level is a much bigger boost to the agentic AI, who would otherwise not have access to such planning capabilities, than to humans, who already had human-level abilities. So this interpritability technology actually ends up making crunch time worse.
It's different if this interpritability (or other form of safety/alignment work) also applied to future agentic AIs, because we could use it to directly reduce the risk from them.
It seems I get the knack of it now...
So your argument here is that if we are going to go this route, then interpretability technology should be used as a measure in the future towards ensuring the safety of this agentic AI as much as they are using currently to improve their "planning capabilities"