Bio

Pro-pluralist, pro-bednet, anti-Bay EA

Posts
8


Sequences
3

Against the overwhelming importance of AI Safety
EA EDA
Criticism of EA Criticism

Comments
299

Final final edit: Congrats on the ARC-AGI-PUB results, really impressive :)

This will be my final response on this thread, because life is very time-consuming and I'm rapidly reaching the point where I need to dive back into the technical literature and stress-test my beliefs and intuitions again. I hope Ryan and any readers have found this exchange useful/enlightening as an example of two different perspectives having (hopefully) productive disagreement.

If you found my presentation of the scaling-skeptical position highly unconvincing, I'd recommend following the work and thoughts of Tan Zhi Xuan (find her on X here). One of my biggest updates was finding her work after she pushed back on Jacob Steinhardt here, and recently she gave a talk about her approach to Alignment. I urge readers to consider spending much more of their time listening to her than to me about AI.


I feel like this is a pretty strange way to draw the line about what counts as an "LLM solution".

I don't think so? Again, I wouldn't call CICERO an "LLM solution". Surely there'll be some amount of scaffolding which tips over into the scaffolding being the main thing and the LLM just being a component part? The lines are probably all blurry for sure, but I think it's important to separate 'LLM-only systems' from 'systems that include LLMs', because it's very easy to conceptually scale up the former but harder to do so for the latter.

Human skeptic: That wasn't humans sending someone to the moon that was Humans + Culture + Organizations + Science sending someone to the moon! You see, humans don't exhibit real intelligence!

I mean, you use this as a reductio, but that's basically the theory of Distributed Cognition, and also linked to the ideas of 'collective intelligence', though that's definitely not an area I'm an expert in by any means. It also reminds me a lot of Clark and Chalmers' thesis of the Extended Mind.[1]

Of course, I think actual LLM skeptics often don't answer "No" to the last question. They often do have something that they think is unlikely to occur with a relatively straightforward scaffold on top of an LLM (a model descended from the current LLM paradigm, perhaps trained with semi-supervised learning and RLHF).

So I can't speak for Chollet and other LLM skeptics, and I think again LLMs+extra (or extras+LLMs) are a different beast from LLMs on their own and possibly an important crux. Here are some things I don't think will happen in the near-ish future (on the current paradigm):

  • I believe an adversarial Imitation Game, where the interrogator is aware of both the AI system's LLM-based nature and its failure modes, is unlikely to be consistently beaten in the near future.[2]
  • Primarily-LLM models, in my view, are highly unlikely to exhibit autopoietic behaviour or develop agentic designs independently (i.e. without prompting/direction by a human controller).
  • I don't anticipate these models exponentially increasing the rate of scientific research or AI development.[3] They'll more likely serve as tools used by scientists and researchers to frame problems, but new and novel problems will still remain difficult and be bottlenecked by the real world + Hofstadter's law.
  • I don't anticipate Primarily-LLM models to become good at controlling and manoeuvring robotic bodies in the 3D world. This is especially true in a novel-test-case scenario (if someone could make a physical equivalent of ARC to test this, that'd be great)
  • This would be even less likely if the scaffolding remained minimal. For instance, if there's no initial sorting code explicitly stating [IF challenge == turing_test GO TO turing_test_game_module].
  • Finally, as an anti-RSI operationalisation, the idea of LLM-based models assisting in designing and constructing a Dyson Sphere within 15 years seems... particularly far-fetched to me.

I'm not sure if this reply was my best, it felt a little all-over-the-place, but we are touching on some deep and complex topics! So I'll respectfully bow out now, and thanks again for the discussion and for giving me so much to think about. I really appreciate it Ryan :)

  1. ^

    Then you get into ideas like embodiment/enactivism etc

     

  2. ^

    I can think of a bunch of strategies to win here, but I'm not gonna say them, so they don't end up in GPT-5 or 6's training data!

  3. ^

    Of course, with a new breakthrough, all bets could be off, but it's also definitionally impossible to predict those, and it's not robust to draw straight lines on graphs to predict the future if you think breakthroughs will be needed. (Not saying you do this, but some other AIXR people definitely seem to.)

(folding in replies to different sub-comments here)

Sure you can have a very smart quadriplegic who is very knowledgable. But they won't do anything until you let them control some actuator. 

I think our misunderstanding here is caused by the word do. Sure, Stephen Hawking couldn't control his limbs, but nevertheless his mind was always working. He kept writing books and papers throughout his life, and his brain was 'always on'. A transformer model is a set of frozen weights that are only 'on' when a prompt is entered. That's what I mean by 'it won't do anything'.

As far as this project, seems extremely implausible to me that the hard part of this project is the scaffolding work I did.

Hmm, maybe we're differing on what 'hard work' means here! Could be a difference between what's expensive, time-consuming, etc. I'm not sure this holds for any reasonable scheme, and I definitely think that you deserve a lot of credit for the work you've done, much more than GPT4o.

I think my results are probably SOTA based on more recent updates.

Congrats! I saw that result and am impressed! It's definitely clearly SOTA on the ARC-AGI-PUB leaderboard, but the original '34%->50% in 6 days ARC-AGI breakthrough' claim is still incorrect.

I'll have to dive into the technical details here I think, but the mystery of in-context learning has certainly shot up my reading list, and I really appreciate that link btw! It seems Blaine has some of the same a priori scepticism that I do towards it, but the right way for me to proceed is to dive into the empirical side and see if my ideas hold water there.

From the summary page on Open Phil:

In this framework, AGI is developed by improving and scaling up approaches within the current ML paradigm, not by discovering new algorithmic paradigms.

From this presentation about it to GovAI (from April 2023) at 05:10:

So the kinda zoomed-out idea behind the Compute-Centric Framework is that I'm assuming something like the current paradigm is going to lead to human-level AI and further, and I'm assuming that we get there by scaling up and improving the current algorithmic approaches. So it's going to look like better versions of transformers that are more efficient and that allow for larger context windows..."

Both of these seem to be pretty scaling-maximalist to me, so I don't think the quote seems wrong, at least to me? It'd be pretty hard to make a model which includes the possibility of the paradigm not getting us to AGI and then needing a period of exploration across the field to find the other breakthroughs needed.

The solution would be much worse without careful optimization and wouldn't work at all without gpt4o (or another llm with similar performance).

I can buy that GPT4o would be best, but perhaps other LLMs might reach 'ok' scores on ARC-AGI if directly swapped in? I'm not sure what you're referring to by 'careful optimization' here though.

There are different analogies here which might be illuminating:

  • Suppose that you strand a child out in the woods and never teach them anything. I expect they would be much worse at programming. So, some credit for there abilities goes to society and some to their brain.
  • If you remove my ability to see (on conversely, use fancy tools to make it easier for a blind person to see) this would greatly affect my ability to do ARC-AGI puzzles.
  • You can build systems around people which remove most of the interesting intelligence from various tasks.

I think what is going on here is analogous to all of these.

On these analogies:

  1. This is an interesting point actually. I suppose credit-assignment for learning is a very difficult problem. In this case though, the stranded child would (hopefully!) survive, make a life for themselves, and learn the skills they need to survive. They're an active agent using their innate general intelligence to solve novel problems (per Chollet). If I put a hard drive with GPT4o's weights in the forest, it'll just rust. And that'll happen no matter how big we make that model/hard drive imo.[1]
  2. Agreed here, it will be very interesting to see how improved multimodality affects ARC-AGI scores. I think we have interesting cases of humans being able to perform these tasks in their heads, presumably without sight? e.g. blind chess players with high ratings, or mathematicians who can reason without sight. I think Chollet's point in the interview is that the models seem to be able to parse the JSON inputs fine in various cases, but still can't perform the generalisation.
  3. Yep, I think this is true, and it's perhaps my greatest fear about delegating power to complex AI systems. This is an empirical question we'll have to find out: can we simply automate away everything humans do/are needed for through a combination of systems, even if each individual part/model used in said system is not intelligent?

Separately, this tweet is relevant: https://x.com/MaxNadeau_/status/1802774696192246133

Yep saw Max's comments and think he did a great job on X bringing some clarifications. I still think the hard part is the scaffolding. Money is easy for SanFran VCs to provide, and we know they're all fine to scrape-data-first-ask-legal-forgiveness later.

I think there's a separate point where enough scaffolding + LLM means the resulting AI system is not well described as an LLM anymore. Take the case of CICERO by Meta. Is that a 'scaffolded LLM'? I'd rather describe it as a system which incorporates an LLM as a particular part. It's harder to naturally scale such a system in the way you can scale the transformer architecture, by stacking more layers or pre-training for longer on more data.

My intuition here is that scaffolding to make a system work well on ARC-AGI would make it less useable on other tasks, so sacrificing generality for specific performance. Perhaps in this case ARC-AGI is best used as a suite of benchmarks, where the same model and scaffolding should be used for each? (Just thinking out loud here)


Final point: I've really appreciated your original work and your comments on Substack/X/here. I do apologise if I didn't make clear which parts were my personal reflections/vibes as opposed to more technical disagreements on interpretation - these are very complex topics (at least for me) and I'm trying my best to form a good explanation of the various evidence and data we have on this. Regardless of our disagreements on this topic, I've learned a lot :)

  1. ^

    Similarly, you can pre-train a model to a humongous size and end up with a set of weights. But it won't do anything until you ask it to generate a token. At least, that's my intuition. I'm quite sceptical that pre-training a transformer is going to lead to creating a mesa-optimiser.

Oh yeah this wasn't against you at all! I think you're a great researcher, and an excellent interlocutor, and I learn a lot (and am learning a lot) from both your work and your reactions to my reaction.[1] Point five was very much a reaction against a 'vibe' I saw in the wake of your results being published. 

Like, let's take Buck's tweet for example. We know now that a) your results aren't technically SOTA and b) it's not an LLM solution, it's an LLM + your scaffolding + program search, and I think that's importantly not the same thing.

  1. ^

    I sincerely hope my post + comments have been somewhat more stimulating than frustrating for you

At the moment I think ARC-AGI does a good job of showing the limitations of transformer models on simple tasks that they don't come across in their training set. I think if the score was claimed, we'd want to see how that came about. It might be through frontier models demonstrating true understanding, but it might be through shortcut learning/data leakage/an impressive but overly specific and intuitively unsatisfying solution.

If ARC-AGI were to be broken (within the constraints Chollet and Knoop place on it) I'd definitely change my opinions, but what they'd change to depends on the matter of how ARC-AGI was solved. That's all I'm trying to say in that section (perhaps poorly)

As in, your crux is that the probability of AGI within the next 50 years is less than 10%?

I'm essentially deeply uncertain about how to answer this question, in a true 'Knightian Uncertainty' sense, and I don't know how much it makes sense to use subjective probability calculus here. It also depends a lot on what we mean by AGI. I find many of the arguments I've seen to be a) deference to the subjective probabilities of others or b) extrapolation of straight lines on graphs - neither of which I find highly convincing. (I think your arguments seem stronger and more grounded fwiw)

I think from an x-risk perspective it is quite hard to beat AI risk even on pretty long timelines.

I think this can hold, but it holds not just in light of particular facts about AI progress now but in light of various strong philosophical beliefs about value, what future AI would be like, and what the future would be like post the invention of said AI. You may have strong arguments for these, but I find many arguments for the overwhelming importance of AI Safety do a very poor job of grounding them, especially in light of the compelling interventions to do good that exist in the world right now.

 You have to reject one of the three. So, if you reject the third (as I do), then you think LLMs do learn at runtime.

Ah sorry, I misread the trilemma, my bad! I think I'd still hold the 3rd to be true (Current LLMs never "learn" at runtime), though I'm open to changing my mind on that after looking at further research. I guess I could see ways to reject 1 (e.g. if I copied the answers and just used a lookup table I'd get 100%, but I don't think there's any learning there, so it's certainly feasible for this to be false - agreed it doesn't feel satisfying), or 2 (maybe Chollet would say selection-from-memorised-templates doesn't count as learning - also agreed unsatisfying). It's a good challenge!
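(To make that lookup-table point concrete, here's a trivial sketch - entirely my own toy example, with made-up task IDs and answers - of how a system can score perfectly on memorised tasks while clearly doing no learning at runtime:)

```python
# Toy illustration of "perfect score without learning": a pure lookup table.
# The task IDs and answer grids are invented for the example.
memorised_answers = {
    "task_001": [[1, 0], [0, 1]],
    "task_002": [[2, 2], [2, 2]],
}

def lookup_solver(task_id):
    # 100% accurate on anything already in the table, useless on anything
    # novel, and nothing about it changes ("learns") at runtime.
    return memorised_answers[task_id]

print(lookup_solver("task_001"))   # "solved", but no learning happened
```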

In RLHF and training, no aspect of the GPU hardware is being updated at all, its all frozen. So why does that count as learning?

I'm not really referring to hardware here. In pre-training and RLHF the model weights are being changed and updated, and that's where the 'learning' (if we want to call it that) comes in - the model is 'learning' to store/generate information through some combination of accurately predicting the next token in its training data and satisfying the reward model created from human preference labelling. Which is my issue with calling ICL 'learning': since the model weights are fixed, the model isn't learning anything. Similarly, all the activation functions between the layers do not change either. It also doesn't make intuitive sense to me to call the outputs of layers 'learning' - the activations are 'just matmul', which I know is reductionist, but they aren't a thing that acquires a new state in my mind.
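(To make the distinction concrete, here's a toy PyTorch sketch - my own illustration of the general point, nothing to do with GPT-4o's actual internals - showing that a training step changes the stored parameters, while a forward pass on a new prompt, however long, leaves them untouched:)

```python
# Minimal sketch of the distinction I'm drawing: training updates weights,
# inference / "in-context learning" does not.
import torch

model = torch.nn.Linear(4, 4)          # stand-in for a transformer's weights
before = model.weight.clone()

# "Training": a gradient step actually changes the stored parameters.
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss = model(torch.randn(2, 4)).pow(2).mean()
loss.backward()
opt.step()
print(torch.equal(before, model.weight))   # False - the weights moved

# "In-context learning": inference on a new/longer prompt. The activations
# differ per input, but the parameters are frozen, so nothing persists.
model.eval()
before = model.weight.clone()
with torch.no_grad():
    _ = model(torch.randn(8, 4))           # a "bigger context"
print(torch.equal(before, model.weight))   # True - nothing was stored/learned
```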

But again, this is something I want to do a deep dive into myself, so I accept that my thoughts on ICL might not be very clear

JWS

Thanks for sharing this Phil, it's very unfortunate it came out just as I went on holiday! To all readers, this will probably be the major substantive response I make in these comments, and to get the most out of it you'll probably need some technical/background understanding of how AI systems work. I'll tag @Ryan Greenblatt directly so he can see my points, but only the first is really directed at him; the rest are responses to the ideas and interpretations.


First, to Ryan directly, this is really great work! Like, awesome job 👏👏 My only sadness here is that I get the impression you think this work is kind of a dead-end? On the contrary, I think this is the kind of research programme that could actually lead to updates (either way) across the different factions on AI progress and AI risk. You get mentioned positively on the xrisk-hostile Machine Learning Street Talk about this! Melanie Mitchell is paying attention (and even appeared in your substack comments)! I feel like the iron is hot here and it's a promising and exciting vein of research![1] 

Second, as others have pointed out, the claimed numbers are not SOTA, but that is because the scores are on different sets, and I think the ARC-AGI team should be clearer about that. But to be clear for all readers, this is what's happened:

  • Ryan got a model to achieve 50% accuracy on the public evaluation set provided by Chollet in the original repo. Ryan has not got a score on the private set, because those answers are kept private on Kaggle to prevent data leakage. Note that Ryan's original claims were based on the different sets being IID and of the same difficulty, which is not true. We should expect performance to be lower on the private set.
  • The current SOTA on the private test set is held by Cole, Osman, and Hodel at 34%, though apparently they have now reached 39% on the private set. Ryan has noted this, so I assume we'll have clarifications/corrections to that bit of his piece soon.
  • Therefore Ryan has not achieved SOTA performance on ARC. That doesn't mean his work isn't impressive, but it is not true that GPT4o improved the ARC SOTA by 16 percentage points in 6 days.
  • Also note from the comments on Substack that, when limited to ~128 sample programmes per case, the results were 26% on the held-out test of the training set. That's good, but not state of the art, and one wonders whether the juice is worth the squeeze there, especially if Jianghong Ying's calculations of the tokens-per-case are accurate. We seem to need exponentially more samples to keep improving results.

Currently, as Ryan notes, his solution is ineligible for the ARC prize as it doesn't meet the various restrictions on runtime/compute/internet connection to enter. While the organisers say that this is meant to encourage efficiency,[2] I suspect it may be more of a security-conscious decision to limit people's access to the private test set. It is worth noting that, since the public training and eval sets are on GitHub (as are most blog pieces about them, and eventually Ryan's own piece as well as my own), dataset contamination remains an issue to be concerned about.[3]

Third, and most importantly, I think Ryan's solution shows that the intelligence is coming from him, and not from GPT4o. skybrian makes this point in the Substack comments. For example:

  • Ryan came up with the idea and implementation of using an ASCII encoding, since the vision capabilities of GPT4o were so unreliable, and he did some feature extraction on the ARC problems.
  • Ryan wrote the prompts and did the prompt engineering in lieu of fine-tuning being available. He also provides the step-by-step reasoning in his prompts. Those long, carefully crafted prompts seem quite domain/problem-specific, and would probably point more toward ARC's insufficiency as a test for generality than toward an example of general ability in LLMs.
  • Ryan notes that the additional approaches and tweaks are critical for the performance gains above just drawing more samples. I think that meme was a bit unkind, not to mention inaccurate, and I kinda wish it was removed from the piece tbh.

If you check the repo (linked above), it's full of some really cool code to make this solution work, and that code is the secret sauce. To my eyes, the hard part here was the scaffolding done by Ryan rather than the pre-training[4] of the LLM (this is another cruxy point I highlighted in my article). I think it's much less conceptually hard to scrape the entire internet and shove it through a transformer architecture. A lot of legwork and cost, sure, but the hard part is the ideas bit, and that's still basically all Ryan, not GPT.
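For readers wondering what this kind of scaffolding actually looks like, here's a deliberately toy sketch of the sample-programs-then-select pattern. To be clear, this is my own illustration under my own assumptions, not Ryan's code: grid_to_text and the hand-written candidate list stand in for the careful prompt engineering and the LLM-sampled Python programs.

```python
# Toy sketch of the "generate candidate programs, keep the one that fits the
# training pairs" scaffolding pattern. Not Ryan's pipeline: grid_to_text and
# the hard-coded candidates stand in for the prompting and the LLM sampling.
from typing import Callable, List, Tuple

Grid = List[List[int]]

def grid_to_text(grid: Grid) -> str:
    """Render an ARC-style grid as plain text rows (one digit per cell),
    the kind of representation you'd put in a prompt instead of an image."""
    return "\n".join("".join(str(cell) for cell in row) for row in grid)

def solve_task(train_pairs: List[Tuple[Grid, Grid]], test_input: Grid,
               candidates: List[Callable[[Grid], Grid]]) -> Grid:
    """Score every candidate program on the training pairs and apply the
    best-scoring one to the test input."""
    def score(program: Callable[[Grid], Grid]) -> int:
        return sum(program(x) == y for x, y in train_pairs)
    best = max(candidates, key=score)
    return best(test_input)

# Toy usage: the hidden rule is "transpose the grid".
train = [([[1, 0], [0, 0]], [[1, 0], [0, 0]]),
         ([[0, 2], [0, 0]], [[0, 0], [2, 0]])]
candidates = [
    lambda g: g,                              # identity
    lambda g: [list(r) for r in zip(*g)],     # transpose
    lambda g: [row[::-1] for row in g],       # mirror each row
]
print(grid_to_text([[1, 2], [3, 4]]))                    # "12" / "34"
print(solve_task(train, [[1, 2], [3, 4]], candidates))   # -> [[1, 3], [2, 4]]
```

(In the real system the candidates are Python programs sampled from GPT4o and there are far more of them per task, but the shape of the loop is the same - and the loop is the scaffolding, not the LLM.)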

Fourth, I got massively nerdsniped by what 'in-context learning' actually is. A lot of the talk about it from a quick search seemed vague, wishy-washy, and highly anthropomorphising. I'm quite confused, given that all of the weights in the transformer are frozen after training and RLHF, why it's called learning at all. The model certainly isn't learning anything. After you ask GPT4o a query you can boot up a new instance and it'll be as clueless as when you started the first session, or you could just flood the context window with enough useless tokens that the original task gets cut off. So, if I accept Ryan's framing of the inconsistent triad, I'd reject the 3rd one, and say that "Current LLMs never "learn" at runtime (e.g. the in-context learning they can do isn't real learning)". I'm going to continue following the 'in-context learning' nerdsnipe, but yeah, since we know that the weights are completely fixed and the model isn't learning, what is doing it? And can we think of a better name for it than 'in-context learning'?

Fifth and finally, I'm slightly disappointed in Buck and Dwarkesh for kinda posing this as a 'mic drop' against ARC.[5] Similarly, Zvi seems to dismiss it, though he praises Chollet for making a stand with a benchmark. In contrast, I think that the ability (or not) of models to reason robustly, out-of-distribution, without having the ability to learn from trillions of pre-labelled samples is a pretty big crux for AI Safety's importance. Sure, maybe in a few months we'll see the top score on the ARC Challenge above 85%, but could such a model work in the real world? Is it actually a general intelligence capable of novel or dangerous acts, such as to motivate AI risk? This is what Chollet is talking about in the podcast when he says:

I’m pretty skeptical that we’re going to see an LLM do 80% in a year. That said, if we do see it, you would also have to look at how this was achieved. If you just train the model on millions or billions of puzzles similar to ARC, you’re relying on the ability to have some overlap between the tasks that you train on and the tasks that you’re going to see at test time. You’re still using memorization.


If you're reading, thanks for making it through this comment! I'd recommend reading Ryan's full post first (which Philb linked above), but there's been a bunch of disparate discussion there, on LessWrong, on HackerNews etc. If you want to pursue what the LLM-reasoning-sceptics think, I'd recommend following/reading Melanie Mitchell and Subbarao Kambhampati. Finally, if you think this topic/problem is worth collaborating on, then feel free to reach out to me. I'd love to hear from anyone who thinks it's worth investigating and would want to pool resources.

  1. ^

    (Ofc your time is valuable and you should pursue what you think is valuable, I'd just hope this could be the start of a cross-factional, positive-sum research program which would be such a breath of fresh air compared to other AI discourse atm)

  2. ^

    Ryan estimates he used 1000x the runtime compute per problem of Cole et al., and also spent $40,000 in API costs alone (I wonder how much it costs for just one run though?).

  3. ^

    In the original interview, Mike mentions that 'there is an asterisk on any score that's reported on against the public test set' for this very reason

  4. ^

    H/t to @Max Nadeau  for being on top of some of the clarifications on Twitter

  5. ^

    Perhaps I'm misinterpreting, and I am using them as a proxy for the response of AI Safety as a whole, but it's very much the 'vibe' I got from those reactions
