AI Notkilleveryone researcher at Apollo, working on interpretability. Physics PhD.
Crossposted from LessWrong.
MATS has steadily increased in quality over the past two years, and is now more prestigious than AISC. We also have Astra, and people who go directly to residencies at OpenAI, Anthropic, etc. One should expect that AISC doesn't attract the best talent.
- If so, AISC might not make efficient use of mentor / PI time, which is a key goal of MATS and one of the reasons it's been successful.
AISC isn't trying to do what MATS does. Anecdotal, but for me, MATS could not have replaced AISC (spring 2022 iteration). It's also, as I understand it, trying to have a structure that works without established mentors, since that's one of the large bottlenecks constraining the training pipeline.
Also, did most of the past camps ever have lots of established mentors? I thought it was just the one in 2022 that had a lot? So whatever factors made all the past AISCs work and have participants sing their praises could just still be there.
Why does the founder, Remmelt Ellen, keep posting things described as "content-free stream of consciousness", "the entire scientific community would probably consider this writing to be crankery", or so obviously flawed it gets -46 karma? This seems like a concern especially given the philosophical/conceptual focus of AISC projects, and the historical difficulty in choosing useful AI alignment directions without empirical grounding.
He was posting cranky technical stuff during my camp iteration too. The program was still fantastic. So whatever they are doing to make this work seems able to function despite his crankery. With a five year track record, I'm not too worried about this factor.
All but 2 of the papers listed on Manifund as coming from AISC projects are from 2021 or earlier.
In the first link at least, there are only eight papers listed in total though. With the first camp being in 2018, it doesn't really seem like the rate dropped much? So to the extent you believe your colleagues that the camp used to be good, I don't think the publication record is much evidence that it isn't anymore. Paper production apparently just does not track the effectiveness of the program much. Which doesn't surprise me; I don't think the rate of paper production tracks the quality of AIS research orgs much either.
The impact assessment was commissioned by AISC, not independent. They also use the number of AI alignment researchers created as an important metric. But impact is heavy-tailed, so the better metric is value of total research produced. Because there seems to be little direct research, to estimate the impact we should count the research that AISC alums from the last two years go on to produce. Unfortunately I don't have time to do this.
Agreed on the metric being not great, and that an independently commissioned report would be better evidence (though who would have commissioned it?). But ultimately, most of what this report is apparently doing is just asking a bunch of AISC alumni what they thought of the camp and what they were up to these days, and then noticing that these alumni often really liked it and have apparently gone on to form a significant fraction of the ecosystem. And I don't think they even caught everyone. IIRC our AISC follow-up LTFF grant wasn't part of the spreadsheets until I wrote Remmelt that it wasn't there.
I am not surprised by this. Like you, my experience is that most of my current colleagues who were part of AISC tell me it was really good. The survey is just asking around and noticing the same.
I was the private donor who gave €5K. My reaction to hearing that AISC was not getting funding was that this seemed insane. The iteration I was in two years ago was fantastic for me, and the research project I got started on there is basically still continuing at Apollo now. Without AISC, I think there's a good chance I would never have become an AI notkilleveryoneism researcher.
It feels like a very large number of people I meet in AIS today got their start in one AISC iteration or another, and many of them seem to sing its praises. I think 4/6 people currently on our interp team were part of one of the camps. I am not aware of any other current training program that seems to me like it would realistically replace AISC's role, though I admittedly haven't looked into all of them. I haven't paid much attention to the iteration that happened in 2023, but I happen to know a bunch of people who are in the current iteration and think trying to run a training program for them is an obviously good idea.
I think MATS and co. are still way too tiny to serve all the ecosystem's needs, and under those circumstances, shutting down a training program with an excellent five year track record seems like an even more terrible idea than usual. On top of that, the research lead structure they've been trying out for this camp and the last one seems to me like it might have some chance of being actually scalable. I haven't spent much time looking at the projects for the current iteration yet, but from very brief surface exposure they didn't seem any worse on average than the ones in my iteration. Which impressed and surprised me, because these projects were not proposed by established mentors like the ones in my iteration were. A far larger AISC wouldn't be able to replace what a program like MATS does, but it might be able to do what AISC6 did for me, and do it for far more people than anything structured like MATS realistically ever could.
On a more meta point, I have honestly not been all that impressed with the average competence of the AIS funding ecosystem. I don't think the ecosystem declining to fund a project is particularly strong evidence that the project is a bad idea.
AISC 6 was what got me into the field, the research I worked on there is still an influence on what we're doing at Apollo Research now, and several other people currently at Apollo are alumni of the camp as well.
I'm also currently not seeing the LTFF grant(s) for the project we started at AISC listed in that table, so I suspect others might be missing as well.
I think the counterfactual impact for me was probably high here. Certainly, no other formal program I am aware of, active then or now, seems like it could have replaced AISC in onboarding me into AI Safety.
all behaviour can be interpreted as maximising a utility function.
Yes, it indeed can be. However, the less coherently the agent acts, the more cumbersome it becomes to describe it as an expected utility maximiser. Once your utility function has to specify entire histories of the universe, its description length goes through the roof. If describing a system as a decision-theoretic agent is that cumbersome, it's probably better to look for some other model to predict its behaviour. A rock, for example, is not well described as a decision-theoretic agent. You can technically specify a utility function that does the job, but it's a ludicrously large one.
The less coherent and smart a system's behaviour is, the longer the utility function you need to model it as a decision-theoretic agent becomes. In this sense, expected-utility-maximisation does rule things out, even though the boundary is not a sharp binary. It's telling you what kind of systems you can usefully model as "making decisions" if you want to predict their actions.
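To make the "you can always do it, just not compactly" point concrete, here is a toy construction of my own (the notation is mine, not anything from the quoted post): any fixed behaviour can be rationalised by a utility function over entire action-observation histories, but that utility function is basically the behaviour written out again as a lookup table.

```latex
% Toy construction (my own illustration, hypothetical notation): take any
% deterministic policy \pi mapping observation histories to actions, and
% define a utility function over whole action-observation histories
% h = (o_1, a_1, \dots, o_T, a_T):
\[
U_\pi(h) =
\begin{cases}
  1 & \text{if } a_t = \pi(o_1, a_1, \dots, o_t) \text{ for every } t, \\
  0 & \text{otherwise.}
\end{cases}
\]
% By construction, the system's behaviour maximises expected U_\pi. But U_\pi
% is just the behaviour tabulated over histories, so its description length is
% roughly that of the policy itself. For a coherent agent, a short utility
% function over outcomes compresses the behaviour; for a rock or an incoherent
% system, this trivial history-indexed construction is about the best you can
% do, and it buys you no predictive power.
```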
If you would prefer math that talks about the actual internal structures agents themselves consist of, decision theory is not the right field to look at. It just does not address questions like this at all. Nowhere in the theorems will you find a requirement that an agent's preferences be somehow explicitly represented in the algorithms it "actually uses" to make decisions, whatever that would mean. It doesn't know what these algorithms are, and doesn't even have the vocabulary to formulate questions about them. It's like saying we can't use theorems for natural numbers to make statements about counting sheep, because sheep are really made of fibre bundles over the complex numbers, rather than natural numbers. The natural numbers are talking about our count of the sheep, not the physics of the sheep themselves, nor the physics of how we move our eyes to find the sheep. And decision theory is talking about our model of systems as agents that make decisions, not the physics of the systems themselves and how some parts of them may or may not correspond to processes that meet some yet unknown embedded-in-physics definition of "making a decision".
I do not find the argument against the applicability of the Complete Class theorem in that post convincing. See Charlie Steiner's reply in the comments.
You just have to separate "how the agent internally represents its preferences" from "what it looks like the agent is doing." You describe an agent that dodges the money-pump by simply acting consistently with its past choices. Internally, this agent has an incomplete representation of preferences, plus a memory. But externally, it looks like the agent assigns equal value to whichever incomparable options it happened to be choosing between first.
Decision theory is concerned with external behaviour, not internal representations. All of these theorems are talking about whether the agent's actions can be consistently described as maximising a utility function. They are not concerned whatsoever with how the agent actually mechanically represents and thinks about its preferences and actions on the inside. To decision theory, agents are black boxes. Information goes in, decision comes out. Whatever processes may go on in between are beyond the scope of what the theorems are trying to talk about.
So
Money-pump arguments for Completeness (understood as the claim that sufficiently-advanced artificial agents will have complete preferences) assume that such agents will not act in accordance with policies like ‘if I previously turned down some option X, I will not choose any option that I strictly disprefer to X.’ But that assumption is doubtful. Agents with incomplete preferences have good reasons to act in accordance with this kind of policy: (1) it never requires them to change or act against their preferences, and (2) it makes them immune to all possible money-pumps for Completeness.
As far as decision theory is concerned, this is a complete set of preferences. Whether the agent makes up its mind as it goes along or has everything it wants written up in a database ahead of time matters not a peep to decision theory. The only thing that matters is whether the agent's resulting behaviour can be coherently described as maximising a utility function. If it quacks like a duck, it's a duck.
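To make that concrete, here is a minimal toy sketch of my own (the option names A, B, A_minus, the preference relation, and the utility numbers are all made up for illustration; none of this is from the post): an agent with incomplete preferences plus a memory of what it has turned down shrugs off the two-step money-pump for Completeness, and its external choices match those of an ordinary utility maximiser that happens to value the incomparable options equally.

```python
# Toy sketch with hypothetical options and preferences (my own illustration).
# A and B are incomparable; A_minus is A with a small fee, strictly worse than A.
STRICT = {("A", "A_minus")}  # (x, y) means x is strictly preferred to y


def strictly_prefers(x, y):
    return (x, y) in STRICT


class CautiousAgent:
    """Incomplete preferences plus a memory of options previously turned down."""

    def __init__(self, endowment):
        self.holding = endowment
        self.turned_down = set()

    def offer_swap(self, new_option):
        """Accept a swap of the current holding for new_option if permissible."""
        # Never accept something strictly worse than what is currently held.
        if strictly_prefers(self.holding, new_option):
            return False
        # The quoted policy: never choose anything strictly dispreferred to an
        # option that was previously turned down.
        if any(strictly_prefers(r, new_option) for r in self.turned_down):
            return False
        # Otherwise the swap is permissible; suppose the agent takes it.
        self.turned_down.add(self.holding)
        self.holding = new_option
        return True


# The two-step money-pump for Completeness:
agent = CautiousAgent(endowment="A")
print(agent.offer_swap("B"))        # True: A -> B is permissible (incomparable)
print(agent.offer_swap("A_minus"))  # False: A was turned down earlier and is
                                    # strictly preferred to A_minus, so refuse
print(agent.holding)                # "B": the agent never ends up strictly worse off

# Externally, the same choices fall out of maximising a complete utility
# function that values the incomparable options equally, e.g.
# U = {"A": 1.0, "B": 1.0, "A_minus": 0.9}: swapping A for B is a matter of
# indifference, and swapping B for A_minus is refused (lower utility).
```

Behaviourally, the "caution plus memory" agent and the indifferent utility maximiser are indistinguishable here, which is the sense in which decision theory counts this as a complete set of preferences.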
the (main) training process for LLMs is exactly to predict human text, which seems like it could reasonably be described as being trained to impersonate humans
"Could reasonably be described" is the problem here. You likely need very high precision to get this right. Relatively small divergences from human goals in terms of bits altered suffice to make a thing that is functionally utterly inhuman in its desires. This is a kind of precision that current AI builders absolutely do not have.
Worse than that, if you train an AI to do a thing, in the sense of setting a loss function where doing that thing gets a good score on the function, and not doing that thing gets a bad score, you do not, in general, get out an AI that wants to do that thing. One of the strongest loss signals that trains your human brain is probably "successfully predict the next sensory stimulus". Yet humans don't generally go around thinking "Oh boy, I sure love successfully predicting visual and auditory data, it's so great." Our goals have some connection to that loss signal, e.g. I suspect it might be a big part of what makes us like art. But the connection is weird and indirect and strange.
If you were an alien engineer sitting down to write that loss function for humans, you probably wouldn't predict that they'd end up wanting to make and listen to audio data that sounds like Beethoven's music, or image data that looks like van Gogh's paintings. Unless you knew some math that tells you what kind of AI with what kind of goals you get if you train on a loss function over a dataset.
The problem is that we do not have that math. Our understanding of what sort of thinky-thing with what goals comes out at the end of training is close to zero. We know it can score high on the loss function in training, and that's basically it. We don't know how it scores high. We don't know why it "wants" to score high, if it's even the kind of AI that can be usefully said to "want" anything. And we can't tell whether it is that kind of AI either.
With the bluntness of the tools we currently possess, the goals of any AGI we make right now would effectively be a random draw from the space of all possible goals. There are some restrictions on where in this gigantic abstract goal space we would sample from: for example, the AI can't want trivial things that lead to it just sitting there forever doing nothing, because then it would be functionally equivalent to a brick, have no reason to try to score high on the loss function in training, and so be selected against. But it's still an incredibly vast possibility space.
Unfortunately, humans and human values are very specific things, and most goals in goal space make no mention of them. If a reference to human goals does get into the AGI's goals, there's no reason to expect that it will get in there in the very specific configuration of the AGI wanting the humans to get what they want.
So the AGI gets some random goal that involves more than sitting around doing nothing, but probably isn't very directly related to humans, any more than humans' goals are related to correctly predicting the smells that enter their noses. The AGI will then probably gather resources to achieve this goal, and not care what happens to humans as a consequence. Concretely, that may look like Earth and the solar system getting converted into AGI infrastructure, with no particular attention paid to keeping things like an oxygen-rich atmosphere around. The AGI knows that we would object to this, so it will make sure that we can't stop it. For example, by killing us all.
If you offered it passage off Earth in exchange for leaving humanity alone, it would have little reason to take that deal. That's leaving valuable time and a planet's worth of resources on the table. Humanity might also make another AGI someday, and that could be a serious rival. On the other hand, just killing all the humans is really easy, because they are not smart enough to defend themselves. Victory is nigh guaranteed. So it probably just does that.
Flagging that I have also heard about this case.