Change a single number in a math problem, and a human who understands the underlying logic will still get the right answer. Do the same to a large language model, and accuracy can fall off a cliff. That discrepancy sits at the heart of a growing body of research arguing that even the most advanced AI systems do not actually reason. They pattern-match, and when the patterns shift, they break.
Two recent preprints from researchers at Apple’s machine learning group, along with an independent reanalysis of a high-profile cognitive model called Centaur, are sharpening this critique with controlled experiments and reproducible evidence. Their collective finding: what passes for intelligent thought in today’s AI may be closer to sophisticated mimicry, a distinction that matters enormously for anyone relying on these tools in medicine, law, or behavioral science.
What “centaur” means, and why it matters here
The term “centaur” in AI refers to a workflow where a human expert and an AI system collaborate, each supposedly contributing what it does best. The idea gained traction in chess after Garry Kasparov proposed that human-machine teams could outperform either alone. It has since spread into fields like radiology, legal research, and scientific analysis, where AI handles pattern recognition at scale while humans supply judgment and contextual understanding.
But the centaur model rests on an assumption: that the AI half of the team is doing something meaningfully different from the human half, not just faster. If the machine is only recognizing surface patterns rather than reasoning about structure, the partnership has a blind spot. The human may defer to the AI on exactly the cases where the AI is least reliable: novel problems, edge cases, and inputs that do not resemble training data.
The “Illusion of Thinking” experiments
The central paper driving this debate is a preprint titled “The Illusion of Thinking,” produced by Apple’s machine learning research group and hosted on arXiv. The study uses a controlled task suite with explicit complexity scaling to test whether reasoning models genuinely work through problems or simply retrieve learned patterns. As problem complexity increased and tasks deviated from structures resembling training data, model performance degraded sharply.
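The design is easy to picture. Below is a minimal sketch of what complexity scaling looks like in practice, using Tower of Hanoi as a stand-in puzzle: the ground-truth move list is computable for any number of disks, so a proposed solution can be checked at every difficulty level. The paper's actual harness, prompts, and scoring are more elaborate and are not reproduced here; `solve_fn` is a hypothetical callable standing in for a prompted model.

```python
def hanoi_moves(n, src="A", aux="B", dst="C"):
    """Optimal move list for an n-disk Tower of Hanoi instance (ground truth)."""
    if n == 0:
        return []
    return (hanoi_moves(n - 1, src, dst, aux)    # clear the top n-1 disks onto the spare peg
            + [(src, dst)]                        # move the largest disk
            + hanoi_moves(n - 1, aux, src, dst))  # restack the n-1 disks on top of it

def sweep_complexity(solve_fn, max_disks=10):
    """Check a solver (n -> proposed move list) as instance size grows.

    `solve_fn` stands in for a prompted model; prompt construction and
    answer parsing are deliberately omitted from this sketch.
    """
    return {n: solve_fn(n) == hanoi_moves(n) for n in range(1, max_disks + 1)}

# A solver that applies the recursive rule keeps succeeding as n grows,
# while the reported pattern for reasoning models is a sharp drop once
# the required move sequence exceeds what pattern recall can cover.
```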
A separate, earlier paper from Apple-affiliated researchers, posted to arXiv in October 2024, reinforces this through a different lens. That study focused on mathematical reasoning benchmarks and found that LLM accuracy dropped significantly when numerical values in problems were swapped out or when irrelevant clauses were inserted into the problem text. If a model genuinely understood the math, a changed number or a distracting sentence should not matter. The fact that accuracy fell under these conditions points to pattern sensitivity rather than logical competence, the researchers argued.
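The perturbation logic itself is simple enough to sketch. The snippet below is illustrative only: it builds variants of an invented templated word problem with fresh numbers and an optional irrelevant clause, then scores any prompt-to-answer callable on both sets. It is not the benchmark's code, and the model call is left abstract.

```python
import random

def make_variants(n_variants=20, seed=0, add_distractor=False):
    """Build perturbed variants of one templated word problem.

    The template and the distractor sentence are invented for
    illustration; they are not drawn from the Apple benchmark.
    """
    rng = random.Random(seed)
    variants = []
    for _ in range(n_variants):
        apples = rng.randint(3, 20)
        price = rng.randint(2, 9)
        question = (f"A crate holds {apples} apples. "
                    f"Each apple sells for {price} dollars. ")
        if add_distractor:
            # Irrelevant clause: changes the surface form, not the answer.
            question += "Five of the apples are slightly smaller than average. "
        question += "How many dollars is the whole crate worth?"
        variants.append((question, apples * price))
    return variants

def accuracy(answer_fn, variants):
    """Score any callable mapping a prompt to an integer answer."""
    correct = sum(1 for q, gold in variants if answer_fn(q) == gold)
    return correct / len(variants)

# Because both sets share a seed, the clean and distractor variants use
# the same numbers, so any accuracy gap is attributable to the added
# clause rather than to harder arithmetic:
#   accuracy(model, make_variants())
#   accuracy(model, make_variants(add_distractor=True))
```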
Together, the two papers build a case that current reasoning models are performing a kind of sophisticated lookup: matching new inputs against patterns absorbed during training, then failing when the match is poor.
Centaur under the microscope
The other target of scrutiny is Centaur, a model built on Meta’s Llama 3.1 70B architecture with QLoRA adapters and trained on a dataset called Psych-101. That dataset spans 160 behavioral experiments, roughly 60,000 participants, and approximately 10.68 million individual choices, according to the peer-reviewed paper published in Nature. The Centaur team claimed the model could generalize to out-of-distribution psychological tasks, effectively predicting human behavior in experiments it had never encountered during training.
A reanalysis by independent researchers, however, suggests that Centaur’s apparent generalization may rest on a shortcut. In a preprint hosted on PsyArXiv, the team ran controlled experiments that systematically stripped either task information or participants’ choice history from the model’s inputs. The result: Centaur’s performance advantage appeared tied to exploiting sequential dependencies in participants’ choice patterns, not to understanding the psychological tasks themselves. When choice history was removed, the model’s edge shrank toward baseline.
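Conceptually, the ablation amounts to building the same prediction prompt under different conditions, with the task description or the choice history withheld, and comparing accuracy. The sketch below renders that logic under assumed field names and prompt wording; it is a schematic, not the reanalysis code.

```python
from typing import Callable, Sequence, Tuple

# (task description, participant's prior choices, participant's next choice)
Trial = Tuple[str, Sequence[str], str]

def build_prompt(task_text: str, history: Sequence[str],
                 keep_task: bool = True, keep_history: bool = True) -> str:
    """Assemble one prediction prompt under a given ablation condition."""
    parts = []
    if keep_task:
        parts.append(f"Task: {task_text}")
    if keep_history:
        parts.append("Previous choices: " + ", ".join(history))
    parts.append("Predict the participant's next choice:")
    return "\n".join(parts)

def ablation_accuracy(predict: Callable[[str], str], trials: Sequence[Trial],
                      keep_task: bool = True, keep_history: bool = True) -> float:
    """Score a predictor (prompt -> choice label) under one condition."""
    hits = sum(
        int(predict(build_prompt(t, h, keep_task, keep_history)) == c)
        for t, h, c in trials
    )
    return hits / len(trials)

# Reading the results: if dropping the task text barely moves the score
# while dropping the choice history pushes it toward a baseline model,
# the predictor is tracking sequential regularities in behavior rather
# than the structure of the experiment.
```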
In plain terms, Centaur seemed to be predicting what a participant would do next based on what they had already done, not based on the structure of the experiment. That is a useful statistical trick, but it is not cognitive modeling.
What remains uncertain
As of June 2025, neither the Apple researchers behind “The Illusion of Thinking” nor the original Centaur development team has issued a public response to these specific critiques, based on available sources. Without direct rebuttals, it is unclear whether the Centaur authors view the shortcut finding as a fundamental flaw or a narrow edge case that does not undermine their broader claims.
The evidentiary standing of the papers in this debate is also uneven. The Centaur paper in Nature is the only study that has cleared formal peer review. Both Apple preprints and the PsyArXiv reanalysis remain unreviewed. Preprints are standard in fast-moving AI research and allow the community to scrutinize methods quickly, but their findings carry less institutional weight. The critiques may ultimately prove correct, but they have not yet been validated through the same process the original Centaur claims underwent.
There is also a transparency gap. The PsyArXiv reanalysis describes controlled ablation experiments, but, beyond the preprint’s DOI record, its underlying datasets and code have not been independently verified through available reporting. Readers weighing the strength of the shortcut argument should note this limitation.
A broader open question hangs over the debate: how do the most prominent commercial reasoning models, such as OpenAI’s o-series, perform under similar controlled stress tests? The Apple preprint tested several frontier models, but the industry’s response to these findings has been limited. Until major labs engage publicly with the critique, the practical implications for deployed AI systems remain partly speculative.
Why the distinction between reasoning and pattern-matching is not academic
The strongest evidence in this debate comes from the controlled experimental designs across all three papers. The Apple preprints use complexity scaling and input perturbation as direct, quantifiable tests. These are not opinion pieces. They measure specific drops in performance under conditions that should not affect a system with genuine logical ability. The mathematical reasoning paper’s finding that adding irrelevant clauses degrades accuracy is particularly striking: a human who understands a math problem can simply ignore extraneous information. A pattern-matcher cannot, because the extra text changes the surface pattern.
The Centaur reanalysis follows similar logic. If removing task-relevant information does not hurt performance but removing choice history does, the model is tracking statistical regularities in behavior sequences rather than modeling the cognitive processes that produce those behaviors. For anyone hoping to use AI to study or predict human decision-making, this is a critical distinction. A system that mirrors response patterns without understanding why people make those choices has limited value as a scientific tool, even if its predictions look accurate on standard benchmarks.
For practitioners in medicine, law, and scientific research, where AI-assisted decision-making is gaining traction, these findings carry a concrete warning. Consider a diagnostic AI trained on thousands of radiology scans. It may perform impressively on cases that resemble its training data. But a rare presentation, an unusual patient history, or a scan with an artifact could push the input outside familiar patterns, and the system’s confidence may not reflect its actual reliability. A physician working in a centaur arrangement might defer to the AI’s read precisely when the AI is most likely to be wrong.
The gap between benchmark performance and genuine reasoning ability is not a new concern. Researchers have flagged it for years, notably in the 2021 “stochastic parrots” paper by Emily Bender and colleagues, which argued that language models generate text by statistical association rather than understanding. What the current wave of research adds is specific, reproducible experimental evidence showing where and how that gap manifests in models marketed as capable of reasoning.
Anyone evaluating AI tools for high-stakes applications should look beyond headline accuracy scores and ask a harder question: has this system been tested on problems that differ meaningfully from its training data? If it has not, the performance numbers may reflect memorization, not understanding, and the centaur workflow built around it may be less reliable than it appears.
*This article was researched with the help of AI, with human editors creating the final content.