This is an interesting analysis!
I agree with MaxRa's point. When I skim-read "Metaculus pro forecasters were better than the bot team, but not with statistical significance", I immediately internalised the message as "bots are getting almost as good as pros" (a message I probably already got from the post title!). It was only when I forced myself to slow down and read it more carefully that I realised this is not what the result means: you could have run this study on a single question, and the stated result could still have been true, while telling you very little either way about their relative performance. Only then did I notice that both main results were null results. I'm now not sure whether this actually supports the 'Bots are closing the gap' claim or not..?
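To illustrate the single-question point: on a small question set, a "no statistically significant difference" result is close to guaranteed whatever the true gap. Here's a minimal sketch with made-up Brier scores (nothing from the actual study):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical per-question Brier scores (lower is better) for a tiny
# question set; these are made-up numbers, not data from the study.
pro_brier = rng.normal(loc=0.15, scale=0.10, size=5)
bot_brier = rng.normal(loc=0.20, scale=0.10, size=5)

# Paired t-test on the per-question differences in Brier score.
t_stat, p_value = stats.ttest_rel(bot_brier, pro_brier)
print(f"mean difference = {np.mean(bot_brier - pro_brier):.3f}, p = {p_value:.2f}")

# With only 5 questions the p-value will usually sit well above 0.05, so
# "pros were better, but not with statistical significance" could be reported
# here even though the simulated pros genuinely are better.
```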
The histogram plot is really useful, and the points of reference are helpful too. I'd be interested to know what the histogram would look like if you compared pro human forecasters to average human forecasters on a similar set of questions. How big an effect do we see there? Or maybe, to get more directly at what I'm wondering: how do bots compare to average human forecasters? Are they better with statistical significance, or not? Has this study already been done?
Thanks for the link, I've just given your previous post a read. It is great! Extremely well written! Thanks for sharing!
I have a few thoughts on it that I thought I'd share. I'd be interested to read a reply, but don't worry if it would be too time-consuming.
I like the 'replace one neuron at a time' thought experiment, but accept it has flaws. For me, it's the fact that we could in principle simulate a brain on a digital computer and have it behave identically that convinces me of functionalism. I can't grok how some system could behave identically but its thoughts not 'exist'.
Thanks for the reply, this definitely helps!
"The brain operating according to the known laws of physics doesn't imply we can simulate it on a modern computer (assuming you mean a digital computer). A trivial example is certain quantum phenomena. Digital hardware doesn't cut it."
Could you explain what you mean by this..? I wasn't aware that there were any quantum phenomena that could not be simulated on a digital computer? Where do the non-computable functions appear in quantum theory? (My background: I have a PhD in theoretical physics, which certainly doesn't make me an expert on this question, but I'd be very surprised if this was true and I'd never heard about it! And I'd be a bit embarrassed if it was a fact considered 'trivial' and I was unaware of it!)
There are quantum processes that can't be simulated efficiently on a digital computer, but that is a different question.
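For what it's worth, the textbook way to simulate a quantum system on a digital computer is simply to track its state vector; the obstacle is efficiency (memory and time grow exponentially with system size), not possibility. A toy sketch, purely illustrative:

```python
import numpy as np

n_qubits = 3  # the state vector has 2**n complex amplitudes: exponential in n, but finite

# Start in the all-zeros basis state |000>
state = np.zeros(2**n_qubits, dtype=complex)
state[0] = 1.0

H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)  # Hadamard gate
I = np.eye(2)

def single_qubit_op(gate, target, n):
    """Embed a 2x2 gate acting on qubit `target` into the full n-qubit space."""
    op = np.array([[1.0]])
    for q in range(n):
        op = np.kron(op, gate if q == target else I)
    return op

# Apply a Hadamard to qubit 0, then read off measurement probabilities (Born rule).
state = single_qubit_op(H, 0, n_qubits) @ state
probs = np.abs(state) ** 2
print(probs)  # 0.5 for |000> and |100>, 0 elsewhere
```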
I don't think I fully understand exactly what you are arguing for here, but would be interested in asking a few questions to help me understand it better, if you're happy to answer?
Ah, that's a really interesting way of looking at it, that you can trade training-compute for inference-compute to only bring forward capabilities that would have come about anyway via simply training larger models. I hadn't quite got this message from your post.
My understanding of Francois Chollet's position (he's where I first heard the comparison of logarithmic inference-time scaling to brute force search - before I saw Toby's thread) is that RL on chain of thought has unlocked genuinely new capabilities that would have been impossible simply by scaling traditional LLMs (or maybe it has to be chain of thought combined with tree-search - but whatever the magic ingredient is he has acknowledged that o3 has it).
Of course this could just be his way of explaining why the o3 ARC results don't prove his earlier positions wrong. People don't like to admit when they're wrong! But this view still seems plausible to me, it contradicts the 'trading off' narrative, and I'd be extremely interested to know which picture is correct. I'll have to read that paper!
But I guess maybe it doesn't matter a lot in practice, in terms of the impact that reasoning models are capable of having.
This was a thought-provoking and quite scary summary of what reasoning models might mean.
I think this sentence may have a mistake though:
"you can have GPT-o1 think 100-times longer than normal, and get linear increases in accuracy on coding problems."
Doesn't the graph show that the accuracy gains are only logarithmic? The x-axis is a log scale.
This logarithmic relationship between performance and test-time compute is characteristic of brute-force search, and maybe is the one part of this story that means the consequences won't be quite so explosive? Or have I misunderstood?
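To make the log-scale point concrete, here's a toy illustration (the coefficients are made up, not read off the o1 plot): if accuracy is linear in log(compute), every 10x increase in compute buys the same fixed additive gain, which is exactly what a straight line on a log x-axis means.

```python
import numpy as np

# Toy model: accuracy = a + b * log10(compute), with made-up coefficients.
a, b = 20.0, 10.0  # hypothetical: 20% accuracy at 1 unit of compute, +10 points per 10x

for compute in [1, 10, 100, 1_000, 10_000]:
    accuracy = a + b * np.log10(compute)
    print(f"compute = {compute:>6}: accuracy ~ {accuracy:.0f}%")

# Each 10x jump in compute adds the same 10 points, so the curve only looks
# linear because the x-axis is logarithmic: getting from 50% to 60% costs
# 10x as much extra compute as getting from 40% to 50% did.
```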
It might be fair to say that the o3 improvements are something fundamentally different to simple scaling, and that Chollet is still correct in his 'LLMs will not simply scale to AGI' prediction. I didn't mean in my comment to suggest he was wrong about that.
I could imagine someone criticizing him for exaggerating how far away we were from coming up with the necessary new ideas, given the o3 results, but I'm not so interested in the debate about exactly how right or wrong the predictions of this one person were.
The interesting thing for me is: whether he was wrong, or whether he was right but o3 does represent a fundamentally different kind of model, the upshot for how seriously we should take o3 seems the same! It feels like a pretty big deal!
He could have reacted to this news by criticizing the way that o3 achieved its results. He already said in the Dwarkesh Patel interview that someone beating ARC wouldn't necessarily imply progress towards general intelligence if the way they achieved it went against the spirit of the task. When I clicked the link in this post, I thought it likely I was about to read an argument along those lines. But that's not what I got. Instead he was acknowledging that this was important progress.
I'm by no means an expert, but timelines in the 2030s still seems pretty close to me! I'd have thought, based on arguments from people like Chollet, that we might be a bit further off than that (although only with the low confidence of a layperson trying to interpret the competing predictions of experts who seem to radically disagree with each other).
Given all the problems you mention, and the high costs still involved in running this on simple tasks, I agree it still seems many years away. But previously I'd have put a fairly significant probability on AGI not being possible this century (as well as assigning a significant probability to it happening very soon, basically ending up highly uncertain). But it feels like these results make the idea that AGI is still 100 years away seem much less plausible than it was before.
The ARC performance is a huge update for me.
I've previously found Francois Chollet's arguments that LLMs are unlikely to scale to AGI pretty convincing. Mainly because he had created an until-now unbeaten benchmark to back those arguments up.
But reading his linked write-up, he describes this as "not merely an incremental improvement, but a genuine breakthrough". He does not admit he was wrong, but instead paints o3 as something fundamentally different to previous LLM-based AIs, which, for the purpose of assessing the significance of o3, amounts to the same thing!
I think the presentation of this argument here misses some important considerations:
The way that you want us to act with respect to OP is already the way that OP is trying to act with respect to the rest of the world
EAs don't pick which causes to fund based purely on importance or scale (otherwise tonnes of things EAs ignore would score highly, e.g. vaccination programs in rich countries). A core part of EA is looking for causes which are neglected. We look for the areas that are receiving the least funding relative to what they would receive in our ideal world, because these are likely to be the areas where our donations will have the highest marginal impact.
This is the reply to people who argue "oh you want local charities to disappear and to send all the money to malaria nets". The reply is: "No! In my ideal world, malaria nets would quickly attract all the funding they need. Then there would still be plenty of money left over for other things. But I think I should look at the world I actually live in, recognize that malaria nets are outrageously underfunded, and give all my resources there."
So in a sense, the argument you are making here isn't anything new. You are just saying we should try to act towards other EAs in a similar way to how EAs as a group act towards the rest of the world. And I don't disagree with this. But I think we should go all the way. I think we should treat other EAs in the same way that we treat the rest of the world. If I understand your argument correctly, you are trying to draw a distinction between the EA community and everyone else.
The same considerations that lead OP to choose not to allocate all their funds to the highest expected value cause should also be relevant for individual donors
OP do not allocate all of their funding to the 'best' cause. Even if OP were a pure EV maximizer, they might have valid reasons not to do this, because they have such a big budget. It may be that diminishing marginal returns mean that the 'best' cause stops being the best once OP have given a certain level of funds to it, at which point they should switch to funding another cause instead.
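As a minimal sketch of that switching point (with log utility and entirely made-up figures): a pure EV maximizer with a large budget would send each marginal dollar to whichever cause currently has the highest marginal value, and the initially 'best' cause stops winning once its funding has grown enough.

```python
# Toy greedy allocation under diminishing returns (log utility, so the
# marginal value of a dollar = weight / current total funding).
# All weights and figures are hypothetical.
causes = {"cause_A": {"weight": 1.0, "funding": 10_000_000},
          "cause_B": {"weight": 0.6, "funding": 20_000_000}}

budget = 100_000_000
step = 1_000_000  # allocate in $1M chunks

for _ in range(budget // step):
    best = max(causes, key=lambda c: causes[c]["weight"] / causes[c]["funding"])
    causes[best]["funding"] += step

for name, c in causes.items():
    print(f"{name} ends with ${c['funding']:,} of total funding")

# cause_A has the higher marginal value to begin with, so the first chunks all
# go there; once its funding passes roughly $33M its marginal value drops below
# cause_B's, and from then on the chunks are split between the two causes.
```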
But my impression is that this is not OP's reason for donating to multiple causes (or at least not their only reason). They are not purely trying to maximize expected value, or at least not in a naive, first-order way. One reason to diversify might be donor risk aversion, as you mention (e.g. you want to maximize EV while bounding the risk that you have no positive impact at all), and there are plenty of other considerations that might come into it too: a sense of duty to a certain cause, reputation, belief in unquantifiable uncertainty and the impossibility of making certain cause comparisons, etc.
But if these considerations are valid for OP then they should also be relevant for individual donors. For example, if an individual donor wants to bound the risk that they have no impact, then that might well mean not donating everything to the cause they think is most underfunded by OP. It would only make sense to do this if they had a weird type of risk aversion where they want to bound the risk that the EA community as a whole has no positive impact, but are unconcerned about their own donations' risk. This seems very arbitrary! Either they should care about the risk for their own donations, and should diversify, or they should be concerned with all of humanity's donations, in which case OP should not be diversifying either!
Pure EV maximizers don't care about percentages anyway
You could bite the bullet and say that neither OP nor individual donors should be diversifying their donations (except when faced with diminishing marginal utility). These individual donors should be donating everything to one cause (and probably one charity, unless they have a lot to give!). But even for these donors, it's not which causes OP underfund that really matters; it's which causes all of humanity underfunds. So it is not the percentages of OP's funding allocation that matter, it's the absolute value.
If OP are a relatively small player in a cause area (global health..?) then their donation decisions are unlikely to be especially relevant to the individual donor. If they thought global health was the top cause before OP's donations were taken into account, it probably still will be afterwards. But if OP are a relatively big player (animal welfare..?) then their donations are more relevant, due to diminishing marginal utility. But it is the absolute amount of funding they are moving, not the percentages, that will determine this.
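Here's a toy illustration of that last point (all numbers invented): with diminishing returns, the marginal value of an individual's extra dollar depends on the total funding a cause already receives from everyone, so what matters is how large OP's contribution is as a share of that total, not what percentage of OP's own budget it represents.

```python
# Toy comparison: same absolute OP grant to two causes, very different relevance.
# Log utility again, so marginal value of a dollar = weight / total funding.
# Every figure here is hypothetical.
causes = {
    # name: (importance weight, funding from everyone else, funding from OP)
    "global_health":  (1.0, 50_000_000_000, 100_000_000),  # OP is a tiny share
    "animal_welfare": (0.8,    200_000_000, 100_000_000),  # OP is a large share
}

for name, (weight, others, op) in causes.items():
    total = others + op
    marginal_value = weight / total       # value of an individual donor's next dollar
    op_share = op / total
    print(f"{name}: OP share of total funding = {op_share:.1%}, "
          f"marginal value per $ = {marginal_value:.2e}")

# OP's percentage allocation between the two causes is 50/50, yet its grant
# barely shifts the funding picture in global health while substantially
# shifting it in animal welfare, because what matters is OP's absolute size
# relative to everyone else's funding of that cause.
```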
This take seems to contradict Francois Chollet's own write-up of the o3 ARC results, where he describes the results as:
(taken from your reference 52, emphasis mine)
You could write this off as him wanting to talk up the significance of his own benchmark, but I'm not sure that would be right. He has been very publicly sceptical of the ability of LLMs to scale to general intelligence, so this is a kind of concession from him. And he had already laid the groundwork in his Dwarkesh Patel interview to explain away high ARC performance as cheating if it tackled the problem in the wrong way, cracking it through memorization via an alternative route (e.g. auto-generating millions of ARC-like problems and training on those). He could easily have dismissed the o3 results on those grounds, but chose not to, which made an impression on me (a non-expert trying to decide how to weigh up the opinions of different experts). Presumably he is aware that o3 trained on the public dataset, and doesn't view that as cheating. The public dataset is small, and the problems are explicitly designed to resist memorization, requiring general intelligence. Being told the solution to earlier problems is not supposed to help you solve later problems.
What's your take on this? Do you disagree with the write-up in [52]? Or do you think I'm mischaracterizing his position (there are plenty of caveats outside the bit I selectively quoted as well, so maybe I am)?
The fact that human-level ARC performance could only be achieved at extremely high inference-time compute cost seems significant too. Why would we get inference-time scaling if chain-of-thought consisted of little more than post-hoc rationalization, rather than real reasoning?
For context, I used to be pretty sympathetic to the "LLMs do most of the impressive stuff by memorization and are pretty terrible at novel tasks" position, and still think this is a good model for the non-reasoning LLMs, but my views have changed a lot since the reasoning models, particularly because of the ARC results.