Benchmark Performance is a Poor Measure of Generalisable AI Reasoning Capabilities

James Fodor

Author's Note

This piece originated as a forum post but progressively grew sufficiently that I decided to make it into a preprint. You can find the pdf here. Alternatively I have uploaded the full text below. It can been seen as a partial response to @Benjamin_Todd's article on Teaching AI to reason, though I began working on it weeks prior to reading this article. If anyone would like to get in touch with me or discuss further feel free to send a private message, as I don't read comments much.

Introduction

The recent development of large language models (LLMs) based on the transformer architecture has led to extensive discussion as to how to best measure their capabilities. The most common way to assess LLMs is by their performance on standardised tests called benchmarks. It is often argued that recent substantial improvements in LLM benchmark scores, largely driven by state-of-the-art systems like GPT-4 and OpenAI’s o3, are indicative of large increases in the capability of LLMs. On this basis, it is often inferred that such models are becoming signiﬁcantly more capable, potentially outstripping human-level performance, on a wide range of real-world tasks. In this article I contend that this argument is ﬂawed in two key respects. First, I argue that inherent limitations with the benchmarking paradigm, along with speciﬁc limitations of existing benchmarks, combine to render benchmark performance highly unsuitable as a metric for generalisable competence across a range of tasks. Second, I argue that an alternative set of methods more suited for investigating the robustness and generalisability of LLM capabilities, namely adversarial stimuli and interpretability techniques, has yielded strong evidence that LLMs do not have robust competence in many language and reasoning tasks, and often fail to learn representations which facilitate generalisable inferences.

A note on terminology. In this article I use the term ‘large language models’ (LLMs) to refer to models based on the transformer architecture trained on large corpus of text data to interact with the user via natural language. Examples of such models include the Llama series, the GPT series, and the Gemini series. Although these models are capable of tasks beyond language processing, such as mathematical and coding tasks, this is still performed by next token prediction after training on linear strings of text. I avoid the term ‘artiﬁcial intelligence’, as it has no accepted clear meaning and does not refer to any speciﬁc model architecture.

Problems with benchmarks

A benchmark is a test designed to assess the ability of an LLM to perform a particulartype of task. They are designed to interrogate the ability of LLMs to perform tasks such as grammatical judgements, commonsense reasoning, logical inference, answering science or math questions, and producing computer code. Hundreds of benchmarks have been developed over the past decade, and it is outside my scope to review them here. Interested readers can consult existing literature reviews [1, 2, 3]. Here my purpose is to highlight several major limitations inherent in the practise of using benchmarks to assess the real-world capabilities and competence of LLMs, along with speciﬁc limitations of existing popular benchmarks. Critically, benchmarks aim to assess LLM capabilities in performing certain types of tasks that are similar, but not identical, to those contained in the benchmark itself. A high benchmark score is therefore not meaningful unless we have reason to believe the benchmark provides a good estimate for performance on related real-world tasks. In this section, however, I will argue that this is rarely the case.

Over-ﬁtting to the benchmarks

A major problem with many benchmarks is that shortly after they are made publicly available, they become incorporated into the data used to train the very LLMs they are intended to assess. Modern LLMs are easily large enough to memorise large swaths of data, with numerous studies ﬁnding evidence of benchmark tasks having been learned during model pre-training or subsequent ﬁne-tuning [4, 5, 6]. Further, even if LLMs do not directly memorise the answers to a task, exposing LLMs to example problems from a speciﬁc benchmark allows them to learn statistical regularities associated with that task. This can allow the LLM to achieve a higher score on that benchmark without any actual improvement in the LLM’s general reasoning capabilities [4].

More broadly, the development of LLMs has become guided by benchmark performance to the extent that the benchmarks lose much of their value in assessing LLM capabilities. This problem is an instance of Goodhart’s law, the adage that “when a measure becomes a target, it ceases to be a good measure”. Initially presented in the context of economics, it also applies in the context of evaluating LLMs [7, 8]. Unfortunately, its importance in this context often goes unappreciated. For instance, recently the World Economic Forum erroneously inferred that LLMs have exceeded human performance on various cognitive tasks, on the basis that LLMs have exceeded human performance on benchmarks designed to test those tasks [9]. Such an inference is only valid if the benchmarks in question are unbiased and representative indicators of LLM performance for the corresponding task as a whole. This measurement function of benchmarks, however, has been supplanted by their use as targets. New LLMs are designed speciﬁcally to achieve ‘state-of-the-art performance’ on various benchmarks, with this being rewarded by conference papers, citations, and media attention. Such incentives result in model architectures, parameters, and training regimes being ﬁtted speciﬁcally to these benchmarks, thereby reducing their relevance for assessing LLM performance [10]. Such over-ﬁtting to the test is manifested in the rapid pace with which new benchmarks become saturated, even while fundamental limitations of LLM competence are still evident, thereby requiring still newer benchmarks to assess performance [7, 11].

Similar problems have long been appreciated in the ﬁeld of psychometrics [12]. When people practise or receive coaching on components of an IQ test, performance on that task improves [13], but the test become less predictive of subsequent performance on other tasks [14]. This phenomenon results in recommendations to limit the number of times a given test is assigned to a particular individual. The problem is likely to be even worse in the case of LLMs, as compared to humans they can be much more thoroughly adapted to specific tasks through architectural design choices, hyperparameter selection, choice of training data, and ﬁne-tuning paradigm. This further highlights the difﬁculty of using benchmarks as both research targets and as evaluation metrics.

Supplementing these theoretical concerns, several recent studies have analysed this phenomenon empirically. One study found that ﬁne-tuning various LLMs on the training portion of question answering and text comprehension datasets resulted in a doubling of performance. They also found that ﬁne-tuning on some benchmarks led to a decline in performance on other benchmarks [15]. Another study found that leading LLMs such as GPT-4 had been exposed to many commonly-used public machine learning datasets in its training data. Indeed, in ﬁve out of seven cases GPT-4 had learned the data sufﬁciently well to be able to accurately sample from the dataset and reconstruct conditional distributions. These results highlight the difﬁculty of assessing LLMs using benchmarks which were publicly known at the time of their development [16].

Relevance of benchmark tasks

Another signiﬁcant problem with existing LLM benchmarks is the lack of research concerning their behavioural relevance. Few benchmarks provide substantive theoretical justiﬁcation concerning how or why certain tasks were selected, or empirical evidence that strong performance on the benchmark correlates with performance on tasks of real-world relevance. While psychometrics has grappled with these questions for decades and developed sophisticated (though still imperfect) methods for addressing them, theoretical and empirical evaluation of LLM benchmarks lags far behind [11, 17]. Too often, it is simply assumed that if a task requires intelligence for a human to perform then it will also be useful as a measure of the cognitive capabilities of an LLM. Decades of artiﬁcial intelligence research, however, has found that many tasks which require intelligence and ﬂexible cognitive capabilities for humans to perform can nonetheless be accomplished by artiﬁcial systems that are not intelligent and lack general cognitive capabilities. Examples of this phenomenon, sometimes called Moravec’s paradox [18], include executing search and sorting algorithms, automated theorem proving, playing games like chess and Go, competing in jeopardy, and many natural language tasks. Given the lack of careful research and poor track record of prediction, there is good reason to be skeptical about how informative current LLM benchmarks are about generalisable cognitive capabilities.

Survey evidence from users and developers of LLMs highlights this gap between benchmarks and real-world applications. Recently a series of interviews was conducted with 19 policy analysts, academic researchers, and industry professionals who have used benchmarks to inform decisions regarding adoption or development of LLMs [19]. Most respondents were skeptical of the relevance of benchmark performance for real-world tasks, as they found it difﬁcult to relate questions on the benchmarks to tasks of real-world relevance or customer requirements. One especially blunt respondent reported: ‘Some of the major benchmarks that people use today, MMLU, for example, (if you) just go through the questions and look at them, and you’ll realize that they are awful... a bunch of totally irrelevant stuff, things that are totally wrong.’

An additional factor limiting real-world applicability is that many benchmarks suffer from poor quality control and inadequate documentation of procedures for assessing accuracy of data. For example, one analysis found that 30% of Google’s emotion dataset is mislabelled [20]. An analysis of NLI benchmarks similarly found that many of the questions either had incorrect answers or were too vague to have a single objective solution [21]. Similar problems have been documented for the MMLU benchmark[22]. Furthermore, multiple prompts taken from a single benchmark commonly exhibit highly correlated performance across different LLMs, driven in part by semantic similarity between prompts. This indicates that benchmarks do not provide random sampling of problem types, which increases the risk that over-ﬁtting to the benchmark will result in failure of generalisation [23]. Overall, quality control is clearly on ongoing challenge for LLM benchmarks.

Analysis of recent benchmarks

In an effort to rectify some of these problems, recently several new benchmarks have been developed which aim to provide a more rigorous test of LLM capabilities. Several of these benchmarks have recently attracted attention due to OpenAI’s new reasoning models o1 and o3 achieving substantial performance gains on these tasks. Here I consider several of these newer benchmarks in more detail, discussing the extent to which they successfully resolve the problems highlighted above.

SciEval is a benchmark consisting of 18,000 questions covering physics, chemistry, and biology, which utilises dynamically generated data to reduce the problem of contamination of training data [24]. This dynamic component of the benchmark highlights just how extensive data contamination is, with GPT4’s accuracy on physics questions of falling from 65% on publicly available static data to 26% on dynamic data. While the use of dynamic data is a step forward, it is likely to be insufﬁcient to resolve the problem of data contamination, since LLMs can learn not only the precise numerical values but also the form and structure of the questions [25]. This so-called task contamination has been found to be responsible for about a 20 percentage point increase in performance of the GPT-3 series models when comparing older to newer benchmarks.

The GPQA benchmark consists of multiple-choice questions covering chemistry, physics, and biology which are said to be ‘Google-proof’ [26]. This claim is based on the fact that performance of non-experts on the questions only marginally increased when they were given access to the internet. The questions are also claimed to be difﬁcult, as domain experts achieved 65% accuracy compared to a mere 34% (on a 25% baseline) achieved by non-experts with unrestricted internet access. However, the authors also found that LLMs achieved about the same score on the most difﬁcult subset of questions as on the easier subset of questions, which contrasted sharply with the observation that human experts performed four times as well as non-experts on the difﬁcult questions compared to only twice as well on the easier questions. This discrepancy indicates that LLMs are likely performing the task in a different manner to humans, raising the question as to whether they have truly obtained generalisable competence in the relevant subjects, or instead are relying on superﬁcial heuristics to answer questions.

The FrontierMath benchmark is comprised of novel math problems developed speciﬁcally for the purpose by domain experts [27]. To prevent contamination of training data, it has not been made publicly available. While these are important steps forward, the authors still fall prey to the common mistake of assuming without evidence that their benchmark succeeds in testing the construct it was designed to test. Speciﬁcally, they claim that “the problems in FrontierMath demand deep theoretical understanding, creative insight, and specialized expertise”. While the novel math problems may well require deep theoretical insight for humans to solve, it is an entirely open question whether they require equivalent understanding or insight for LLMs to solve. This is especially relevant given that all tasks on the benchmark have numerical answers, meaning that correct reasoning is not assessed, only the solution. As such, it is unclear whether this benchmark will do much better than ensuring that models which achieve high scores have genuine mathematical competence, instead of merely utilising complex but superﬁcial heuristics for guessing the answers.

RE-Bench consists of seven machine learning assignments where performance is evaluated using metrics such as the accuracy or efﬁciency of the resulting code [28]. Unlike most other benchmarks, RE-Bench requires the LLM to generate workable code, rather than simply output a multiple-choice option or numerical answer. While undoubtedly a useful resource, this benchmark does not live up to its claim of testing the capabilities of LLMs to serve as ‘AI researchers’, as it only assesses their ability to perform modestly-sized, clearly-deﬁned, non-interacting coding tasks. Another major limitation is that while producing their answers, LLMs and humans were both able to submit preliminary solutions and receive scores on a test set as live feedback. While the 8-hour time limit meant that it was difﬁcult for humans to make much use of this information, the LLMs were able to quickly iterate many slightly different solutions to effectively brute-force increased scores against the test set. The authors also note that several times LLMs failed to adhere to the task instructions, which led them to produce solutions which performed well on the automated performance evaluation but were manually marked as incorrect for not adhering to the task. This raises the question of how often LLMs are able to achieve nominally correct answers despite following incorrect reasoning processes on other benchmarks where such manual inspection is not possible.

Overall, while these novel benchmarks represent a step forward for evaluation of LLMs, I conclude that they have failed to mitigate the major problems of data contamination, lack of real-world relevance, and poor evidence for generalisability that have plagued previous LLM benchmarks.

Generalisation abilities of LLMs

The purpose of benchmarks is to provide metrics of LLM performance which generalise beyond the speciﬁc questions contained in the benchmark to a range of tasks of real-world relevance. There are several aspects of successful generalisation, including performance of tasks from outside the training distribution [29], ability to combine previously learned components in novel ways (compositionality) [30], and robustness to task alternations such as noise, changes in format, and irrelevant information [31]. In the previous section I argued that current benchmarks are poorly suited for the purpose of measuring generalisable LLM performance on cognitive tasks. In this section I argue that we have strong evidence from adversarial, interpretability, and capability research that current LLMs are unable to learn to perform many cognitive tasks in a robust and consistent manner.

Adversarial stimuli and tasks

One useful technique for interrogating the capabilities of LLMs is to develop tasks or stimuli which are deliberately designed to be challenging for them. Such adversarial tasks are analogous to tasks developed in the heuristics and biases literature to study the limitations of human cognition [32]. In this section I highlight several recent studies which have bearing on the question of how well LLMs are able to learn to generalise beyond the problems they were trained on.

In one study focused on automated evaluation of LLMs, researchers developed a simple adversarial model which always produced the same output in response to all instructions [33]. However, this adversarial model was cleverly designed to reliably fool automated LLM evaluation software by manipulating the format of its output. This technique was able to fool even advanced evaluation models like GPT-4 about 80% of the time. The inability to detect or appropriately respond to what humans would easily recognise as an obvious attempt to circumvent the test highlights the lack of any human-level understanding of the input. Such fragility also indicates that robust generalisation to novel styles and formatting of responses is likely to be difﬁcult.

Other studies have used reversed text as an adversarial stimulus. This is motivated by the idea that if LLMs learn generalisable language competence analogous to that possessed by humans, they should show much greater competencies when trained on forward compared to reversed text. One prominent example relates to a widely publicised study which found that LLMs performed better than human experts at predicting the outcomes of neuroscience studies based only on their abstract [34]. A follow-up study replicated this effect with models trained on character-reversed neuroscience articles, even showing slightly better performance than models trained on regular text. These results indicate that LLMs do not perform the task by extracting the same sorts of features that humans use in assessing the

content of a scientiﬁc article [35]. Sensitivity to stimulus structure has also been investigated at the syntax level, with one study ﬁnding that ChatGPT performs better than humans at judging grammatical sentences, but when asked about ungrammatical sentences only achieves about 35% accuracy compared to humans 75% [36]. ChatGPT also tends to oscillate between answers even when given the same question much more than humans.

Adversarial methods have also been used to present LLMs with subtle variations of a task designed to elicit mistakes for models which inadequately process semantic content. One study focusing on theory of mind tasks found that GPT-4 performed very inconsistently on false-belief tasks [37]. On two of ﬁve task variants, GPT-4 achieved zero percent accuracy, while achieving nearly 100% accuracy on slightly different forms of the task. The model seemed to be easily distracted by irrelevant details or make false inferences based on which information was included. Similar ﬁndings have been observed when evaluating ChatGPT on arithmetic problems, where performance degraded by 70% when irrelevant information was included in the prompt [38].

Other studies have also found results indicating that LLMs fail to adequately represent the meaning of their prompts. An analysis of ChatGPT found that its answers to various questions about world facts or sentence meaning changed between 20% and 40% of the time when questions were paraphrased or reworded slightly [39]. Similarly, an analysis of BERT models on NLI tasks showed that the models relied mostly on superﬁcial heuristics such as presence of words like ‘all’ or ‘some’ in the premises, with performance breaking down on the subset of problems where such heuristics could not be used [40]. These results indicate that LLMs do not truly understand the meaning of the language they process in the way a human would, raising questions about the ability to reliably generalise to unseen examples or novel domains.

Failure to learn problem structure

While LLMs perform well on simple instances of logical and mathematical problems, they often fail with more com-plex problems even when the basic steps are the same in both cases. While humans can make errors of execution due to working memory constraints or lapses of attention, LLMs fail for very different reasons. Speciﬁcally, evidence indicates that LLMs learn superﬁcial heuristics from extensive training data while failing to learn the underlying structure of the problem or the steps needed to solve it. Memorisation of single steps observed in the training data is common, with relatively poor ability to combine multiple steps together [41]. Various research has found that LLMs systematically fail to learn robust representations of problem structure in ways that allow them to appropriately generalise to alterations of the problem or stimulus.

LLMs are often highly sensitive to the order of answers and speciﬁc symbols used to present them. For example, one study found changes in accuracy of up to 25 percentage points depending on which order multiple choice answers were presented in [10]. Changing the symbols from letters to unusual characters reduced accuracies by around 5 percentage points on average. Large reductions in accuracy of 20-30 percentage points were also observed across a range of models when they were provided with irrelevant incorrect answers along with the question prompt. Another study found that the performance of Llama3 on the MMLU task fell by about 25 percentage points when model was prompted to provide its answers in a slightly different way [42].

Analysis of mathematical tasks has also found that LLMs consistently fail to learn the underlying task structure. One study used GSM-Symbolic, a maths benchmark which consisting of variants of a standard set of arithmetic questions produced by altering variable names and values [43]. The authors found that these simple changes reduced performance by 1-9 percentage points, indicating some degree of over-ﬁtting to the existing static dataset. They also found that including superﬁcially important but ultimately irrelevant information reduced accuracies by up to 65 percentage points, with even o1-preview experiencing a reduction of 17 percentage points. A similar effect was observed in a separate analysis of GPT-4, where performance on a logical inference problem declined from over 95% to about 40% when irrelevant rules were introduced and the order of the premises was altered [44]. Such sensitivity to the alteration of variable names and inclusion of irrelevant information is not expected of a system which has adequately learned how to interpret and solve arithmetic problems.

Chain-of-thought prompting, a technique designed to help LLMs reason more carefully about complex problems, often increases LLM performance without improving generalisable reasoning capabilities. One analysis of such prompts found that for seven of the eight tasks examined, inserting a mistake near the beginning of the chain-of-thought resulted in the same answer being given more than 60% of the time [45]. This indicates that much of the ‘thought’ contains post-hoc rationalisation rather than genuine reasoning. Interestingly, the authors also found this degree of post hoc reasoning was greater for larger models than smaller ones. Another analysis found that when chain-of-thought prompting was used to train the Llama model to perform an arithmetic task, performance increased from 79% to 95% [46]. However, when an informative task-relevant prompt was replaced by a series of meaningless ﬁller tokens, performance only fell to 94%, indicating that the content of the prompt was not relevant to the improved performance. Another analysis focusing on use of chain-of-thought prompting for planning found that LLMs consistently failed to generalise appropriately to new problem instances, and only performed well when given prompts highly speciﬁc and tailored to each type of problem [47].

One especially insightful study used a careful series of analyses to illustrate how LLMs can fail to learn the underlying structure of a task despite achieving high performance on test sets [29]. In a series of experiments the authors showed that the BERT model could achieve near perfect performance on a logical deduction task when tested on a subset of problems sampled in the same way as the problems on which it was trained. However, if different sampling algorithms were instead used to generate the training and testing data, performance dropped signiﬁcantly, falling close to chance levels for more difﬁcult problems. This decline in performance occurred even though both algorithms sampled from the same space of problems, with the only difference being that they did so using slightly different techniques too subtle for humans to discern any overall differences between the resulting two sets of problems. The authors also identiﬁed statistical artefacts in the training distribution, which they showed BERT had utilised to accurately perform the task when tested on data sampled in the same way as the data on which it was trained. These statistical artefacts, however, were not present when the problems were sampled in a different way, resulting in incorrect predictions when BERT asked to generalise beyond its training problems. Since such statistical artefacts are inevitable in any sample of instances of a sufﬁciently complex problem, it is infeasible to identify and remove all of them. This is especially signiﬁcant since contemporary LLMs are so large that they can identify and utilise very complex and abstract statistical associations that may be impossible for humans to even understand. This study therefore powerfully highlights the inherent limits of the entirely statistical methods by which LLMs learn to solve abstract problems, and the difﬁculties in ensuring such performance will generalise to real world problems.

The promise of reasoning models

The past year has seen the emergence of ‘reasoning models’ such as OpenAI’s o1 and o3, DeepSeek-R1, or Google’s Gemini 2.0 Flash Thinking model. While little information about their architecture or training has been made public, these models are evidently LLMs with extensive ﬁne-tuning on step-by-step reasoning problems taken from domains such as mathematics, coding, and science [48]. The models are then trained to generate a large number of candidate solutions and then score each using heuristics learned from their training data. Some have claimed that such models represent a new paradigm that will resolve the problems and limitations with traditional LLMs [49, 50]. Currently, these systems are too recent for any signiﬁcant evaluation of their capabilities to have been published. However, there are powerful reasons to doubt the claim that reasoning models have circumvented the problems with benchmarking and generalisation discussed in previous sections.

From a theoretical perspective, it is unclear why additional ﬁne-tuning on step-by-step reasoning would substantially mitigate LLM limitations with generalisation or learning problem structure. As noted, reasoning models appear to work by generating large numbers of candidate chain-of-thought solutions and then using a heuristic function trained by reinforcement learning to select the most plausible candidates. Insofar as this outline is accurate, ‘reasoning models’ are not performing genuine reasoning, but are learning to mimic desired outputs by utilising heuristics and complex statistical regularities. While it may be more effective to learn to match steps in the reasoning process rather than directly predicting the ﬁnal answer, the system still lacks any evident mechanism for ensuring that it learns the under-lying structure of the problem rather than complex but ultimately superﬁcial or irrelevant statistical regularities. True formal reasoning requires structural modelling of a program and performing speciﬁed operations on the component elements of the problem until the desired endpoint is reached. Based on information at the time of writing, reasoning models appear instead to be matching and mimicking reasoning steps contained in their enormous training corpus, and combining them together in accordance with complex heuristics derived from reinforcement learning. This strategy may yield performance improvements for certain types of problems, but does not resolve the key issue of the inability to learn underlying problem structure so as to facilitate robust generalisable inferences.

It has been argued that reasoning models have the potential to generate massively new training datasets that will allow for rapid improvements in performance. Speciﬁcally, the claim is that ﬁne-tuning on step-by-step programming or mathematical reasoning tasks will enable the automatic generation of large new datasets of reasoning steps, which can then be combined with answers automatically checked for correctness [50]. This new data can then be fed back into the model to provide even more training data, thereby further improving performance in a virtuous feedback cycle. While this approach may work in certain cases, the methodology assumes that the reasoning steps generated by model are correct and logically lead to the generated answer. This is not something that can be easily checked, and it is also inconsistent with research which indicates that chain-of-thought prompts often improve performance even while consisting largely of post-hoc reasoning that does not logically entail the generated answer [45, 47]. The core problem is therefore left unsolved, namely of ensuring that the model learns the underlying logical structure of the problem and can apply this knowledge to relevant novel instances, instead of relying on superﬁcial heuristics which will not robustly generalise to most real-world applications.

Beyond these general theoretical concerns, there are speciﬁc reasons to be cautious about the claims made about the capabilities of reasoning models. Announcements by companies like OpenAI concerning the benchmark performance of newly developed models should be interpreted as marketing designed to appeal to potential customers and investors, rather than as research reports having substantial scientiﬁc value. Too often, commentators accept such announcements entirely at face value as evidence of substantial progress, with no discussion of the extent to which such results will be reproducible, generalisable, or robust. Furthermore, in achieving these benchmarks OpenAI and other companies are likely to be making full use of any publicly available training portions of existing benchmarks. They may even be developing internal datasets with tasks resembling those from known benchmarks to use for training, thereby allowing for improved performance on those speciﬁc benchmarks which may not generalise to other tasks.

Several recent examples highlight these concerns. Recently it emerged that OpenAI was one of the major funders of the FrontierMath benchmark, and had at least some of the solutions in their possession [51]. Exactly how these were used in the development of o3 is unclear, however there is more than enough reason to be suspicious about performance claims made regarding this benchmark. In addition, o3 was also trained on the public data of ARC-AI, a dataset comprised of abstract visual reasoning problems in the style of Raven’s progressive matrices [52]. When combined with the large amount of targeted research this benchmark has attracted in recent years, the high scores achieved by o3 should not be considered a reliable metric of general reasoning capabilities. Most recently, OpenAI has claimed that o3 achieved substantially improved performance on the CodeForces benchmark and International Olympiad in Informatics 2024 competition [53]. While the questions from these evaluations had already been made public by the time they were used for assessing o3, OpenAI claims their training cutoff was after the release of these problems, and that manual search revealed no evidence of contamination of the training data. They also claimed that the performance increase was due entirely to reinforcement learning of step-by-step reasoning, with no speciﬁc tailoring to the speciﬁc benchmarks. Though at present these claims are impossible to verify, even if taken at face value, information from the benchmarks could still have inﬂuenced the development of o3 through selection of hyperparameters or other development decisions. As an illustration of how this can occur, one study using older LLMs illustrated how information from a test dataset can be ‘illicitly’ learned by a series of seemingly benign intermediate evaluation steps [54]. As such, until OpenAI’s claims can be independently veriﬁed using novel problems, it should not be assumed that the claimed improvements in benchmark scores will translate into an equivalently large and robust improvement in generalised capabilities.

In the case of the recently-released DeepSeek-R1, although more information is available than for OpenAI’s o3 model, it is still highly unclear what data was used for training [48]. The authors mention a combination of synthetic and human-annotated data used for both chain-of-thought ﬁne-tuning and reinforcement learning, but no details are provided as to the nature of the data or what types of tasks were trained on. It is rumoured that it was trained on outputs generated by OpenAI’so1 model, which calls into question how generalisable its reasoning capabilities really are when applied to novel types of problems not reﬂected in its training data. DeepSeek-R1 has been made publicly available, which should facilitate community analysis of the robustness of its general reasoning cognitive capabilities.

Overall, there are strong theoretical and empirical reasons to doubt that AI reasoning models have substantially re-solved the limitations of previous generations of LLMs. New models like o3 still rely on the same underlying mechanisms that extract superﬁcial heuristics from huge datasets without learning the structure of a problem. Likewise, state-of-the-art scores on benchmarks which were already known and likely guided o3’s development are not a reliable indicator of its capabilities of genuine reasoning or generalisation to novel tasks.

Conclusion

It is undeniable that recent years have seen substantial progress in the development of large language models capable of performing many tasks relating to language, logic, coding, and mathematics. However, the extent of this progress is frequently exaggerated based on appeals to rapid increases in performance on various benchmarks. I have argued that these benchmarks are of limited value for measuring LLM progress because of problems of models being over-ﬁt to the benchmarks, lack real-world relevance of test items, and inadequate validation for whether the benchmarks predict general cognitive performance. Conversely, evidence from adversarial tasks and interpretability research indicates that LLMs consistently fail to learn the underlying structure of the tasks they are trained on, instead relying on complex statistical associations and heuristics which enable good performance on test benchmarks but generalise poorly to many real-world tasks. The result is that the LLMs lack robust reasoning capabilities that are consistent and invariant to irrelevant changes in problem format, style, numerical data, or context. Despite much recent hype, new reasoning models do not offer a general solution to these problems. Renewed hype highlights the importance of subjecting new LLMs to careful systematic evaluation to determine its reasoning capabilities and their limits across a range of scenarios. Benchmark performance alone is insufﬁcient to establish general reasoning competence.

References

[1] Yupeng Chang, Xu Wang, Jindong Wang, Yuan Wu, Linyi Yang, Kaijie Zhu, Hao Chen, Xiaoyuan Yi, Cunxiang Wang, Yidong Wang, et al. A survey on evaluation of large language models. ACM Transactions on Intelligent Systems and Technology, 15(3):1–45, 2024.

[2] Todor Ivanov and Valeri Penchev. AI Benchmarks and Datasets for LLM Evaluation. arXiv preprint arXiv:2412.01020, 2024.

[3] Shervin Minaee, Tomas Mikolov, Narjes Nikzad, Meysam Chenaghlu, Richard Socher, Xavier Amatriain, and Jianfeng Gao. Large language models: A survey. arXiv preprint arXiv:2402.06196, 2024.

[4] Md Tahmid Rahman Laskar, Sawsan Alqahtani, M Saiful Bari, Mizanur Rahman, Mohammad Abdullah Matin Khan, Haidar Khan, Israt Jahan, Amran Bhuiyan, Chee Wei Tan, Md Rizwan Parvez, et al. A systematic survey and critical review on evaluating large language models: Challenges, limitations, and recommendations. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 13785– 13816, 2024.

[5] Ali Satvaty, Suzan Verberne, and Fatih Turkmen. Undesirable Memorization in Large Language Models: A Survey. arXiv preprint arXiv:2410.02650, 2024.

[6] Xinyi Wang, Antonis Antoniades, Yanai Elazar, Alfonso Amayuelas, Alon Albalak, Kexun Zhang, and William Yang Wang. Generalization vs Memorization: Tracing Language Models’ Capabilities Back to Pre-training Data. arXiv preprint arXiv:2407.14985, 2024.

[7] Sourav Banerjee, Ayushi Agarwal, and Eishkaran Singh. The Vulnerability of Language Model Benchmarks: Do They Accurately Reﬂect True LLM Performance? arXiv preprint arXiv:2412.03597, 2024.

[8] Rachel L Thomas and David Uminsky. Reliance on metrics is a fundamental challenge for AI. Patterns, 3(5), 2022.

[9] James Fell. Stanford just released its annual AI Index report. Here’s what it reveals. /urlhttps://www.weforum.org/stories/2024/04/stanford-university-ai-index-report/, 2024.

[10] Norah Alzahrani, Hisham Abdullah Alyahya, Yazeed Alnumay, Sultan Alrashed, Shaykhah Alsubaie, Yusef Almushaykeh, Faisal Mirza, Nouf Alotaibi, Nora Altwairesh, Areeb Alowisheq, et al. When benchmarks are targets: Revealing the sensitivity of large language model leaderboards. arXiv preprint arXiv:2402.01781, 2024.

[11] Anka Reuel, Amelia Hardy, Chandler Smith, Max Lamparth, Malcolm Hardy, and Mykel J Kochenderfer. BetterBench: Assessing AI Benchmarks, Uncovering Issues, and Establishing Best Practices. arXiv preprint arXiv:2411.12990, 2024.

[12] Holli A DeVon, Michelle E Block, Patricia Moyle-Wright, Diane M Ernst, Susan J Hayden, Deborah J Lazzara, Suzanne M Savoy, and Elizabeth Kostas-Polston. A psychometric toolbox for testing validity and reliability. Journal of Nursing scholarship, 39(2):155–164, 2007.

[13] Claudia Bartels, Martin Wegrzyn, Anne Wiedl, Verena Ackermann, and Hannelore Ehrenreich. Practice effects in healthy adults: a longitudinal study on frequent repetitive cognitive testing. BMC neuroscience, 11:1–12, 2010.

[14] Jan Te Nijenhuis, Olga F Voskuijl, and Natasja B Schijve. Practice and coaching on IQ tests: Quite a lot of g. International Journal of Selection and Assessment, 9(4):302–308, 2001.

[15] Kun Zhou, Yutao Zhu, Zhipeng Chen, Wentong Chen, Wayne Xin Zhao, Xu Chen, Yankai Lin, Ji-Rong Wen, and Jiawei Han. Don’t make your LLM an evaluation benchmark cheater. arXiv preprint arXiv:2311.01964, 2023.

[16] Sebastian Bordt, Harsha Nori, and Rich Caruana. Elephants never forget: Testing language models for memo-rization of tabular data. arXiv preprint arXiv:2403.06644, 2024.

[17] Denis Federiakin. Improving LLM Leaderboards with Psychometrical Methodology. arXiv preprint arXiv:2501.17200, 2025.

[18] Demis Hassabis. Artiﬁcial Intelligence: Chess match of the century. Nature, 544(7651):413–415,2017.

[19] AmeliaHardy, Anka Reuel, Kiana Jafari Meimandi, Lisa Soder, Allie Grifﬁth, Dylan M Asmar, Sanmi Koyejo, Michael S Bernstein, and Mykel J Kochenderfer. More than Marketing? On the Information Value of AI Bench-marks for Practitioners. arXiv preprint arXiv:2412.05520, 2024.

[20] Edwin Chen. 30% of Google’s Emotions Dataset is Mislabeled. https://www.surgehq.ai/blog/30-percent-of-googles-reddit-emotions-dataset-is-mislabeled, 2022.

[21] Aikaterini-Lida Kalouli, Hai Hu, Alexander F. Webb, Lawrence S. Moss, and Valeria de Paiva. Curing the SICKand Other NLI Maladies. Computational Linguistics, 49(1):199–243, 03 2023.

[22] Aryo Pradipta Gema, Joshua Ong Jun Leang, Giwon Hong, Alessio Devoto, Alberto Carlo Maria Mancino, Rohit Saxena, Xuanli He, Yu Zhao, Xiaotang Du, Mohammad Reza Ghasemi Madani, et al. Are We Done with MMLU? arXiv preprint arXiv:2406.04127, 2024.

[23] Charlotte Siska, Katerina Marazopoulou, Melissa Ailem, and James Bono. Examining the robustness of LLM evaluation to the distributional assumptions of benchmarks. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 10406–10421, 2024.

[24] Liangtai Sun, Yang Han, Zihan Zhao, Da Ma, Zhennan Shen, Baocai Chen, Lu Chen, and Kai Yu. Scieval: A multi-level large language model evaluation benchmark for scientiﬁc research. In Proceedings of the AAAI Conference on Artiﬁcial Intelligence, volume 38, pages 19053–19061, 2024.

[25] Changmao Li and Jeffrey Flanigan. Task contamination: Language models may not be few-shot anymore. In Proceedings of the AAAI Conference on Artiﬁcial Intelligence, volume 38, pages 18471–18480, 2024.

[26] David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R Bowman. GPQA: A graduate-level google-proof Q&A benchmark. arXiv preprint arXiv:2311.12022, 2023.

[27] Elliot Glazer, Ege Erdil, Tamay Besiroglu, Diego Chicharro, Evan Chen, Alex Gunning, Caroline Falkman Olsson, Jean-Stanislas Denain, Anson Ho, Emily de Oliveira Santos, et al. Frontiermath: A benchmark for evaluating advanced mathematical reasoning in AI. arXiv preprint arXiv:2411.04872, 2024.

[28] Hjalmar Wijk, Tao Lin, Joel Becker, Sami Jawhar, Neev Parikh, Thomas Broadley, Lawrence Chan, Michael Chen, Josh Clymer, Jai Dhyani, et al. Re-bench: Evaluating frontier ai r&d capabilities of language model agents against human experts. arXiv preprint arXiv:2411.15114, 2024.

[29] Honghua Zhang, Liunian Harold Li, Tao Meng, Kai-Wei Chang, and Guy Van Den Broeck. On the paradox of learning to reason from data. In Proceedings of the Thirty-Second International Joint Conference on Artiﬁcial Intelligence, pages 3365–3373, 2023.

[30] Yuekun Yao and Alexander Koller. Structural generalization is hard for sequence-to-sequence models. arXiv preprint arXiv:2210.13050, 2022.

[31] Tomer Ullman. Large language models fail on trivial alterations to theory-of-mind tasks. arXiv preprint arXiv:2302.08399, 2023.

[32] Richard F West, Maggie E Toplak, andKeith E Stanovich. Heuristics and biases as measures of critical thinking: Associations with cognitive ability and thinking dispositions. Journal of educational psychology, 100(4):930, 2008.

[33] Xiaosen Zheng, Tianyu Pang, Chao Du, Qian Liu, Jing Jiang, and Min Lin. Cheating automatic LLM bench-marks: Null models achieve high win rates. arXiv preprint arXiv:2410.07137, 2024.

[34] Xiaoliang Luo, Akilles Rechardt, Guangzhi Sun, Kevin K Nejad, Felipe Yáñez, Bati Yilmaz, Kangjoo Lee, Alexandra O Cohen, Valentina Borghesani, Anton Pashkov, et al. Large language models surpass human experts in predicting neuroscience results. Nature human behaviour, pages 1–11, 2024.

[35] Xiaoliang Luo, Michael Ramscar, and Bradley C Love. Beyond Human-Like Processing: Large Language Models Perform Equivalently on Forward and Backward Scientiﬁc Text. arXiv preprint arXiv:2411.11061,2024.

[36] Vittoria Dentella, Fritz Günther, and Evelina Leivada. Systematic testing of three Language Models reveals low language accuracy, absence of response stability, and a yes-response bias. Proceedings of the National Academy of Sciences, 120(51):e2309583120, 2023.

[37] Natalie Shapira, Mosh Levy, Seyed Hossein Alavi, Xuhui Zhou, Yejin Choi, Yoav Goldberg, Maarten Sap, and Vered Shwartz. Clever hans or neural theory of mind. Stress testing social reasoning in large language models, 2023.

[38] Freda Shi, Xinyun Chen, Kanishka Misra, Nathan Scales, David Dohan, Ed H Chi, Nathanael Schärli, and Denny Zhou. Large language models can be easily distracted by irrelevant context. In International Conference on Machine Learning, pages 31210–31227. PMLR, 2023.

[39] Xenia Ohmer, Elia Bruni, and Dieuwke Hupke. From form(s) to meaning: Probing the semantic depths of language models using multisense consistency. Computational Linguistics, 50(4):1507–1556,2024.

[40] Reto Gubelmann and Siegfried Handschuh. Uncovering more shallow heuristics: Probing the natural language inference capacities of transformer-based pre-trained language models using syllogistic patterns. arXiv preprint arXiv:2201.07614, 2022.

[41] Nouha Dziri, Ximing Lu, Melanie Sclar, Xiang Lorraine Li, Liwei Jiang, Bill Yuchen Lin, Sean Welleck, Peter West, Chandra Bhagavatula, Ronan Le Bras, et al. Faith and fate: Limits of transformers on compositionality. Advances in Neural Information Processing Systems, 36, 2024.

[42] Kyle Moore, Jesse Roberts, Thao Pham, Oseremhen Ewaleifoh, and Doug Fisher. The Base-Rate Effect onLLM Benchmark Performance: Disambiguating Test-Taking Strategies from Benchmark Performance. arXiv preprint arXiv:2406.11634, 2024.

[43] Iman Mirzadeh, Keivan Alizadeh, Hooman Shahrokhi, Oncel Tuzel, Samy Bengio, and Mehrdad Farajtabar. GSM-symbolic: Understanding the limitations of mathematical reasoning in large language models. arXiv preprint arXiv:2410.05229, 2024.

[44] Xinyun Chen, Ryan A Chi, Xuezhi Wang, and Denny Zhou. Premise order matters in reasoning with large language models. arXiv preprint arXiv:2402.08939, 2024.

[45] Tamera Lanham, Anna Chen, Ansh Radhakrishnan, Benoit Steiner, Carson Denison, Danny Hernandez, Dustin Li, Esin Durmus, Evan Hubinger, Jackson Kernion, et al. Measuring faithfulness in chain-of-thought reasoning. arXiv preprint arXiv:2307.13702, 2023.

[46] Jacob Pfau, William Merrill, and Samuel R Bowman. Let’s Think Dot by Dot: Hidden Computation in Trans-former Language Models. arXiv preprint arXiv:2404.15758, 2024.

[47] Kaya Stechly, Karthik Valmeekam, and Subbarao Kambhampati. Chain of thoughtlessness: An analysis of COT in planning. arXiv preprint arXiv:2405.04776, 2024.

[48] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.

[49] Julian Horsey. OpenAI’s o3 Model Stuns the World with Gold Medal Win at IOI. Geeky Gadgets, 2024.

[50] Benjamin Todd. Teaching AI to reason: this year’s most important story. https://benjamintodd.substack.com/p/teaching-ai-to-reason-this-years, 2025.

[51] Roger Montii. OpenAI Secretly Funded Benchmarking Dataset Linked To o3 Model. Search Engine Journal, 2025.

[52] François Chollet. OpenAI o3 Breakthrough High Score on ARC-AGI-Pub. https://arcprize.org/blog/oai-o3-pub-breakthrough, 2024.

[53] Ahmed El-Kishky, Alexander Wei, Andre Saraiva, Borys Minaev, Daniel Selsam, David Dohan, Francis Song, Hunter Lightman, Ignasi Clavera, Jakub Pachocki, et al. Competitive Programming with Large Reasoning Mod-els. arXiv preprint arXiv:2502.06807, 2025.

[54] Jonibek Mansurov, Akhmed Sakip, and Alham Fikri Aji. Data Laundering: Artiﬁcially Boosting Benchmark Results through Knowledge Distillation. arXiv preprint arXiv:2412.15255, 2024.

tobycrisford 🔸Feb 214

In addition, o3 was also trained on the public data of ARC-AI, a dataset comprised of abstract visual reasoning problems in the style of Raven’s progressive matrices [52]. When combined with the large amount of targeted research this benchmark has attracted in recent years, the high scores achieved by o3 should not be considered a reliable metric of general reasoning capabilities.

This take seems to contradict Francois Chollet's own write-up of the o3 ARC results, where he describes the results as:

a genuine breakthrough, marking a qualitative shift in AI capabilities compared to the prior limitations of LLMs. o3 is a system capable of adapting to tasks it has never encountered before

(taken from your reference 52 , emphasis mine)

You could write this off as him wanting to talk-up the significance of his own benchmark, but I'm not sure that would be right. He has been very publicly sceptical of the ability of LLMs to scale to general intelligence, so this is a kind of concession from him. And he had already laid the groundwork in his Dwarkesh Patel interview to explain away high ARC performance as cheating if it tackled the problem in the wrong way, cracking it through memorization via an alternative route (e.g. auto-generating millions of ARC-like problems and training on those). He could easily have dismissed the o3 results on those grounds, but chose not to, which made an impression on me (a non-expert trying to decide how to weigh up the opions of different experts). Presumably he is aware that o3 trained on the public dataset, and doesn't view that as cheating. The public dataset is small, and the problems are explicitly designed to resist memorization, requiring general intelligence. Being told the solution to earlier problems is not supposed to help you solve later problems.

What's your take on this? Do you disagree with the write up in [52]? Or do you think I'm mischaracterizing his position (there are plenty of caveats outside the bit I selectively quoted as well - so maybe I am)?

The fact that the human-level ARC performance could only be achieved by extremely high inference-time compute costs seems significant too. Why would we get inference time scaling if chain-of-thought consisted of not much more than post-hoc rationalizations, instead of real reasoning?

For context, I used to be pretty sympathetic to the "LLMs do most of the impressive stuff by memorization and are pretty terrible at novel tasks" position, and still think this is a good model for the non-reasoning LLMs, but my views have changed a lot since the reasoning models, particularly because of the ARC results.

James FodorFeb 242

Hi Toby, thanks for the comment.

I have read about some of the work on tackling the ARC dataset, and I am not at all confident that the approaches which perform well have anything to do with generalisable reasoning. The problem remains that there is no validation that the benchmark measures what it claims to. I don't know what methods o3 used to solve it, but until I do I don't believe the marketing hype released by OpenAI that it must be generalisable reasoning.

As to why we'd see inference time scaling if chain-of-thought consisted of not much more than post-hoc rationalizations, this is still an open question but it seems to be partly driven by increased compute time and number of tokens. I don't have the full answer here, but the evidence we do have strongly cautions against just assuming these models are doing what we might describe as 'genuine reasoning'.

SummaryBotFeb 211

Executive summary: Benchmark performance is an unreliable measure of general AI reasoning capabilities due to overfitting, poor real-world relevance, and lack of generalisability, as demonstrated by adversarial testing and interpretability research.

Key points:

Benchmarks encourage overfitting—LLMs often train on benchmark data, leading to inflated scores without true capability improvements (a case of Goodhart’s law).
Limited real-world relevance—Benchmarks rarely justify why their tasks measure intelligence, and many suffer from data contamination and quality control issues.
LLMs struggle with generalisation—Studies show they rely on statistical shortcuts rather than learning underlying problem structures, making them sensitive to minor prompt variations.
Adversarial testing exposes flaws—LLMs fail tasks that require true reasoning, such as handling irrelevant information or understanding problem structure beyond superficial cues.
"Reasoning models" are not a breakthrough—New models like OpenAI's o3 use heuristics and reinforcement learning but still lack genuine generalisation abilities.
Benchmark reliance leads to exaggerated claims—Improved scores do not equate to real cognitive progress, highlighting the need for more rigorous evaluation methods.

This comment was auto-generated by the EA Forum Team. Feel free to point out issues with this summary by replying to the comment, and contact us if you have feedback.

Effective Altruism Forum
EA Forum