When a team of researchers unveiled an AI system called Centaur in a Nature paper in July 2025, the promise was bold: a single model, trained on data from 160 psychology experiments comprising roughly 10.68 million recorded human choices, that could reliably stand in for real people in cognitive studies. If it worked as advertised, Centaur could let scientists pilot experiments without recruiting a single participant, saving months of time and thousands of dollars. But as of mid-2026, independent researchers are raising pointed questions about whether Centaur truly understands human cognition or has simply memorized the right answers.
What Centaur was built to do
Centaur was trained on a massive behavioral dataset called Psych-101, drawn from 60,092 participants across 160 experiments spanning risky decision-making, memory recall, learning, and other core cognitive tasks. Its creators positioned it as a potential “foundation model” for human cognition, one flexible enough to predict how people would behave in a wide variety of experimental setups. An earlier technical version had circulated as an arXiv preprint, giving the research community months to examine the methods before the journal publication appeared.
The ambition was significant. If a model could genuinely simulate human cognitive patterns, researchers could use it to pre-screen hypotheses, refine experimental designs, and even explore questions that would be impractical to test with large human samples. Centaur’s reported accuracy across its training tasks was high enough to make that vision seem plausible.
Where the cracks appeared
The problems surfaced when outside teams stopped testing Centaur on items that looked like its training data and started testing whether it could handle the same tasks presented differently.
A follow-up evaluation titled “Not Yet AlphaFold for the Mind,” posted as an arXiv critique by Ilia Sucholutsky, Yun-Fei Liu, and colleagues, examined whether Centaur could function as a reliable “synthetic participant” for new experiments. The researchers altered test items in ways that preserved their underlying logic but changed surface-level features like wording and presentation. Centaur’s performance dropped. That distinction is critical: a system that genuinely understands a task should handle a rephrased version just fine, while one that has memorized specific phrasings will stumble.
A separate critique from a team led by Liang Luo at Zhejiang University went further. The work was described as published in National Science Open, though the primary source available as of June 2026 is a summary reported by Science China Press via ScienceDaily. Luo and colleagues argued that Centaur’s behavior across those 160 cognitive tasks is consistent with overfitting. In machine learning, overfitting means a model has effectively memorized its training examples rather than learning transferable rules. The hallmark is exactly what the independent tests revealed: high scores on familiar material, poor scores on anything new.
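To see what that hallmark looks like in miniature, here is a deliberately simple illustration in Python, unrelated to Centaur or Psych-101: a very flexible polynomial fit reproduces its training points almost exactly but does worse on new points generated by the same underlying rule, while a simpler fit generalizes more evenly. The data and models are invented for illustration only.

```python
# Toy illustration of overfitting: a high-degree polynomial nearly memorizes its
# training points (train error ~0) but does worse on fresh points from the same
# process. This is a generic sketch, not a model of Centaur or its training data.
import numpy as np

rng = np.random.default_rng(0)

def noisy_curve(x):
    # The "true" process to be learned: a smooth curve plus small noise.
    return np.sin(x) + rng.normal(scale=0.1, size=x.shape)

x_train = np.linspace(0, 3, 10)
y_train = noisy_curve(x_train)
x_test = np.linspace(0, 3, 50)   # new inputs, same underlying rule
y_test = noisy_curve(x_test)

for degree in (2, 9):
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    # Expect the degree-9 fit to show near-zero training error but a larger test error.
    print(f"degree {degree}: train MSE {train_mse:.4f}, test MSE {test_mse:.4f}")
```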
What we still do not know
Several pieces of evidence that could settle the debate remain unavailable. The full set of training prompts used to build Centaur and the token-level overlap between training items and evaluation items have not been released beyond summary counts in the Nature paper. Without that data, it is hard to measure precisely how much the model’s accuracy depends on lexical similarity to material it has already seen. The raw per-participant response files and model-generated probability distributions from the generalization tests in the arXiv critique have not been deposited in an open repository either, limiting independent reproduction of the reported failure modes.
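To make clear what such an audit would involve, the sketch below computes the fraction of an evaluation item's n-grams that already appear in a training item. The item texts and the metric are placeholders invented here; they are not drawn from Psych-101 or from the Nature paper's released materials.

```python
# Hypothetical sketch of a token-overlap audit between training and evaluation
# prompts. All strings below are invented examples, not Centaur items.
from collections import Counter

def ngram_counts(text: str, n: int = 3) -> Counter:
    tokens = text.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def overlap_fraction(train_text: str, eval_text: str, n: int = 3) -> float:
    """Fraction of the evaluation item's n-grams that also occur in the training text."""
    train_grams = ngram_counts(train_text, n)
    eval_grams = ngram_counts(eval_text, n)
    if not eval_grams:
        return 0.0
    shared = sum(count for gram, count in eval_grams.items() if gram in train_grams)
    return shared / sum(eval_grams.values())

# Invented items: a near-paraphrase scores high, a genuinely reworded item scores low.
train_item = "you must choose between a sure gain of 50 dollars and a gamble"
eval_close = "you must choose between a sure gain of 80 dollars and a gamble"
eval_far = "pick either a guaranteed payout or a risky lottery ticket"

print(overlap_fraction(train_item, eval_close))  # high lexical overlap
print(overlap_fraction(train_item, eval_far))    # low lexical overlap
```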
There is also no direct head-to-head comparison between Centaur and a well-known brittleness benchmark called GSM-Symbolic. That benchmark, introduced in a separate study that used symbolic templates to vary math word problems, showed that large language models lose accuracy when only the numbers or clause structures change, even though the logical steps remain identical. The parallel is striking: both Centaur and the models tested in GSM-Symbolic appear competent until the surface statistics of the input shift. But without identical test items run through both systems, the comparison remains suggestive rather than proven.
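The toy sketch below mimics the GSM-Symbolic idea of instantiating one problem template with different surface numbers while the required arithmetic stays fixed. The template is invented here for illustration and is not an item from GSM-Symbolic or from Psych-101.

```python
# Toy version of template-based perturbation: same reasoning, different surface numbers.
import random

TEMPLATE = ("{name} buys {n_packs} packs of stickers with {per_pack} stickers in "
            "each pack, then gives away {given} stickers. How many are left?")

def make_variant(seed: int) -> tuple[str, int]:
    rng = random.Random(seed)
    name = rng.choice(["Ava", "Noor", "Mateo"])
    n_packs, per_pack = rng.randint(2, 9), rng.randint(3, 12)
    given = rng.randint(1, n_packs * per_pack - 1)
    question = TEMPLATE.format(name=name, n_packs=n_packs, per_pack=per_pack, given=given)
    answer = n_packs * per_pack - given  # the underlying logic never changes
    return question, answer

for seed in range(3):
    question, answer = make_variant(seed)
    print(question, "->", answer)

# A model that has learned the procedure should solve every variant; one that has
# memorized a particular phrasing may only get the familiar numbers right.
```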
The original Centaur team has not issued a public statement addressing the specific failure modes described in the Zhejiang critique. Whether they view the reported accuracy drops as a fundamental flaw or as an expected limitation of the current version is unclear from the available record.
A pattern bigger than Centaur
The questions surrounding Centaur fit into a broader and increasingly well-documented pattern in AI research: systems that appear to reason often turn out to be relying on sophisticated pattern-matching that collapses under controlled variation.
Work on large language models in medical reasoning, published in Nature Communications, found that multiple models lacked the metacognitive reliability needed for dependable clinical use, even when their surface-level answers looked correct. A separate study on distinguishing memorization from reasoning in multiple-choice benchmarks, available as an arXiv preprint, reported large accuracy drops on public datasets compared to private ones, interpreting the gap as evidence of data contamination. Neither study tested Centaur directly, but together they reinforce the same warning: impressive benchmark scores do not automatically mean a model has learned to think.
It is also worth noting that “overfitting” is not a binary label. A model can be somewhat overfit to particular phrasings or numerical ranges while still capturing useful structure about human decision-making. The Zhejiang critique argues that Centaur sits too far toward the memorization end of that spectrum, but without full transparency about training data and evaluation overlaps, outside observers are left inferring the degree of overfitting from indirect signals like paraphrase sensitivity and held-out task performance.
What this means for researchers using AI as a stand-in for humans
For scientists who have already started using AI models as stand-ins for human participants in pilot studies, the practical implications are immediate. Any experiment that relies on a model like Centaur to pre-screen hypotheses should include robustness checks: paraphrase the stimuli, swap numerical values, and reorder answer options. If accuracy drops significantly under those changes, the model is likely reflecting its training distribution rather than simulating genuine cognition. Until the field develops standardized generalization benchmarks for cognitive models, treating any single system as a reliable synthetic participant carries real risk of producing misleading pilot data that wastes time and funding when the hypotheses are eventually tested on actual people.
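A minimal sketch of those checks follows, assuming a generic callable model interface rather than Centaur's actual API, which is not specified here. The perturbation functions are simplistic placeholders and the number-scaling step only makes sense for stimuli where uniformly scaling quantities leaves the correct choice unchanged.

```python
# Hedged sketch of the robustness checks described above. `model` stands in for
# whatever interface a lab uses to query a synthetic participant; Centaur's real
# API is not assumed here, and the perturbations are deliberately simplistic.
import random

def reorder_options(options: list[str], seed: int) -> list[str]:
    # Shuffle answer options; the correct answer is tracked by its text, not its position.
    rng = random.Random(seed)
    shuffled = options[:]
    rng.shuffle(shuffled)
    return shuffled

def swap_numbers(text: str, scale: int = 2) -> str:
    # Multiply every bare integer in the stimulus by a constant. Only valid for items
    # where scaling all quantities preserves which option is correct.
    return " ".join(str(int(tok) * scale) if tok.isdigit() else tok for tok in text.split())

def accuracy(model, items) -> float:
    # items: (stimulus, options, correct_option_text) triples.
    hits = sum(model(stim, opts) == answer for stim, opts, answer in items)
    return hits / len(items)

def robustness_report(model, items) -> None:
    renumbered = [(swap_numbers(s), o, a) for s, o, a in items]
    reordered = [(s, reorder_options(o, i), a) for i, (s, o, a) in enumerate(items)]
    print(f"original {accuracy(model, items):.2f} | "
          f"numbers swapped {accuracy(model, renumbered):.2f} | "
          f"options reordered {accuracy(model, reordered):.2f}")
```

A large drop on the perturbed sets while the original set stays high is exactly the warning sign described above.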
More broadly, the Centaur controversy is sharpening a growing consensus that claims of “human-like” reasoning from AI systems must be backed by stringent tests of transfer. That means evaluating models on tasks that differ in wording, presentation, and context from anything seen during training, while holding underlying structure constant. It also means sharing enough information about datasets and overlaps for independent groups to audit whether strong results reflect genuine abstraction or subtle forms of data leakage.
Why the generalization question will define Centaur’s legacy
Whether Centaur ultimately proves to be a stepping stone toward more robust cognitive models or a cautionary example of premature generalization claims will depend on future replications, ablation studies, and transparency efforts. For now, the safest reading is that Centaur captures real regularities in how people respond within a specific experimental corpus, but its ability to generalize beyond that corpus remains genuinely contested. For researchers and policymakers weighing whether to trust model-driven findings about the human mind, that uncertainty is not a footnote. It is the central question.
*This article was researched with the help of AI, with human editors creating the final content.