Morning Overview

Researchers show ChatGPT can help produce original mathematical proofs

A series of recent research papers has shown that ChatGPT and related large language models can produce original, verifiable mathematical proofs, including solutions to problems that had not been publicly solved before. The findings span competition-level mathematics, formal theorem proving, and open conjectures in combinatorial optimization, collectively suggesting that AI systems are crossing a threshold from pattern-matching into something closer to genuine mathematical reasoning. The results also raise hard questions about verification, overconfidence, and what it means for a machine to “prove” anything at all.

Five of Six IMO Problems Solved

The strongest single result comes from a paper describing what its authors call a theorem-proving feat achieved by ChatGPT. Using a multi-instance prover and verifier protocol, researchers demonstrated that ChatGPT solved five out of six problems from the 2025 International Mathematical Olympiad, as detailed in the IMO-focused study. The protocol ended with formal verification in Lean, a proof assistant language widely used in academic mathematics, specifically to guard against the hallucinations that plague large language models when they attempt complex reasoning.

That design choice matters. Without a formal verification step, a model can produce text that looks like a proof but contains subtle logical gaps. The Lean verification layer acts as an independent check, accepting only arguments whose every step follows from established axioms and previously proven lemmas. The five-of-six result is striking because IMO problems are designed to challenge the best human competitors, typically elite high school and undergraduate students who train for years. For an AI system to clear that bar, even with human-designed scaffolding and multiple attempts per problem, represents a sharp jump in capability from just a few years ago.
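To make concrete what “accepting only arguments whose every step follows” means, here is a toy statement in Lean 4 (unrelated to any IMO problem). Lean compiles this proof only because the term on the last line type-checks against the library’s definition of addition; a plausible-sounding but logically gapped argument would simply fail to compile.

```lean
-- A minimal machine-checked statement in Lean 4.
-- `Nat.add_comm` is a library lemma; the checker verifies that
-- applying it to `a` and `b` really produces a proof of the goal.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```

Nothing about the proof is taken on trust: rename the lemma or swap the goal, and verification fails.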

The study’s protocol also hints at how human mathematicians might collaborate with AI tools. Rather than relying on a single monolithic answer, the researchers orchestrated many parallel proof attempts, filtered them through automated checkers, and only then submitted candidates to Lean for full formal verification. This layered approach (generation, self-critique, and mechanized checking) resembles the way research teams iterate on difficult problems, suggesting that future mathematical work may increasingly involve orchestrating AI “collaborators” rather than simply querying a single model.
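The control flow of that layered approach can be sketched as follows. The generator, critic, and formal checker below are stand-in stubs (the study’s actual components are not reproduced here), so the sketch shows only the generate-filter-verify structure: many cheap attempts, a cheap plausibility filter, then an expensive formal check on the survivors.

```python
def generate_candidates(problem: str, n: int = 8) -> list[str]:
    # Stand-in for sampling n independent proof attempts from a model.
    return [f"{problem}: attempt {i}" for i in range(n)]

def self_critique(candidate: str) -> bool:
    # Stand-in for a cheap automated plausibility filter
    # (here: arbitrarily keep odd-numbered attempts).
    return int(candidate.split()[-1]) % 2 == 1

def formal_verify(candidate: str) -> bool:
    # Stand-in for the expensive formal check; in this toy,
    # exactly one attempt "verifies".
    return candidate.endswith("5")

def layered_prover(problem: str) -> str | None:
    candidates = generate_candidates(problem)
    plausible = [c for c in candidates if self_critique(c)]   # cheap filter
    verified = [c for c in plausible if formal_verify(c)]     # costly check
    return verified[0] if verified else None
```

The economic logic is that formal verification is slow, so the cheap filter runs first and most candidates never reach it.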

Benchmarks Built to Test Originality

Solving known competition problems, however, is not the same as producing original mathematics. Competition solutions exist in training data, and skeptics have long argued that models might be retrieving patterns rather than reasoning from scratch. Two new benchmarks directly address that concern by focusing on questions that were deliberately kept out of the public domain until after evaluation.

The First Proof benchmark, developed by researchers affiliated with Cornell University, consists of ten research-level math questions that arose during the authors’ own work. The questions were withheld from public view until the benchmark’s release, and answers were initially encrypted to prevent contamination. This design creates a controlled test: if an AI system produces a correct, checkable proof for a problem it could not have seen during training, the result is harder to dismiss as memorization. The First Proof framework lays out scoring rules, human review procedures, and encryption protocols intended to establish a new standard for measuring whether AI can generate original proofs on previously non-public problems.
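The benchmark’s exact encryption protocol is not reproduced here, but a hash commitment is one standard way to publish a sealed answer that can later be verified as unchanged: release the digest before evaluation, then reveal the answer and let anyone recompute it. The answer string and nonce below are purely illustrative.

```python
import hashlib

def commit(answer: str, nonce: str) -> str:
    # Publish this digest before evaluation; the answer itself stays secret.
    return hashlib.sha256((nonce + answer).encode()).hexdigest()

def reveal_and_check(answer: str, nonce: str, published_digest: str) -> bool:
    # After evaluation, reveal answer + nonce; anyone can recompute the hash
    # and confirm the answer was fixed in advance.
    return commit(answer, nonce) == published_digest

digest = commit("illustrative sealed answer", "nonce-2025")
assert reveal_and_check("illustrative sealed answer", "nonce-2025", digest)
```

Any tampering with the revealed answer produces a different digest, so a correct AI-generated proof cannot be dismissed as having leaked from the answer key.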

Early results on First Proof are mixed but notable. Systems based on GPT-5-level architectures reportedly managed to fully solve a minority of the ten problems and produced substantial partial progress on several others. Human experts evaluated the submissions blind, assessing both correctness and novelty. In some cases, AI-generated arguments paralleled techniques the original researchers had used; in others, the models proposed different but valid approaches. Even when the proofs were incomplete, they sometimes suggested lemmas or constructions that human mathematicians found worth exploring.

A separate evaluation called the Godel Test takes a different angle. Rather than competition or textbook problems, it targets simple but previously unsolved conjectures in combinatorial optimization, chosen to be understandable yet genuinely open at the time of testing. The Godel Test study reports structured outcomes across five such conjectures and includes a case where GPT-5-derived reasoning, after verification, actually refuted one of the conjectures. Disproving a conjecture is, in some ways, a more convincing demonstration of reasoning than confirming one, because it requires the model to identify a specific counterexample or logical flaw rather than simply following a known proof strategy.

According to the authors, the refutation emerged through an iterative loop: the model proposed candidate counterexamples, an automated checker filtered out invalid ones, and promising cases were then explored more deeply with the model’s help. Once a genuine counterexample was identified, the team worked with proof assistants to formalize the argument that the conjecture could not hold in general. The result illustrates how AI systems can serve both as generators of mathematical ideas and as search engines over enormous combinatorial spaces that would be impractical for humans to explore unaided.
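The propose-and-check loop can be illustrated with a classic toy case rather than the study’s actual combinatorial conjectures, which are not reproduced here. Euler’s polynomial n² + n + 41 yields primes for every n from 0 to 39, so the false conjecture “it is prime for all n” survives many candidates before a checker finds the refuting case.

```python
def is_prime(n: int) -> bool:
    # Trial-division primality check, adequate for small candidates.
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True

def find_counterexample(conjecture, candidates):
    # The "proposer" supplies candidates; the automated checker tests each.
    # A single failing case refutes the conjecture outright.
    for c in candidates:
        if not conjecture(c):
            return c
    return None

# Toy conjecture (famously false): n^2 + n + 41 is prime for every n >= 0.
euler_like = lambda n: is_prime(n * n + n + 41)
print(find_counterexample(euler_like, range(100)))  # -> 40, since 40^2 + 40 + 41 = 41^2
```

The checker does the drudgery of exhaustive testing; in the research setting, the model’s role is to propose candidate regions of the search space worth testing at all.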

Where Formal Verification Still Stumbles

Not every attempt at automated theorem proving succeeds cleanly. PutnamBench, a benchmark accepted to the NeurIPS 2024 Datasets and Benchmarks Track, formalizes 640 problems from the Putnam competition across multiple proof assistant languages, including Lean 4, Isabelle, and a subset in Coq. The benchmark tested GPT-4-family models on producing formal proofs and found that common failure modes include syntax mismatches between Lean 3 and Lean 4, the two major versions of the proof assistant, along with misunderstandings of library functions and tactics.

Those errors are revealing. They suggest that current models sometimes struggle not with the underlying mathematics but with the precise formatting and ecosystem knowledge that verification tools demand. A proof that is logically sound in natural language can fail formal verification because the model calls the wrong lemma name, uses outdated syntax, or targets the wrong version of Lean. From the standpoint of mathematical truth, these are superficial issues; from the standpoint of building a fully automated pipeline, they are critical bottlenecks.
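A small illustration of the kind of mismatch involved: Lean 3 tactic proofs use `begin ... end` blocks and lowercase library names, both of which are gone in Lean 4. A model that emits the older style produces a proof that is mathematically fine but will not compile. The lemma below is a deliberately trivial example, not drawn from the benchmark.

```lean
-- Lean 3 style (no longer compiles in Lean 4):
-- lemma two_le_add (n : ℕ) : 2 ≤ n + 2 :=
-- begin
--   exact nat.le_add_left 2 n,
-- end

-- Lean 4 style: `begin ... end` is gone and library names
-- are capitalized differently (`Nat.le_add_left`).
theorem two_le_add (n : Nat) : 2 ≤ n + 2 :=
  Nat.le_add_left 2 n
```

The underlying argument is identical in both versions; only the surface syntax and the lemma’s name change, which is exactly why such failures are superficial mathematically yet fatal to an automated pipeline.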

A separate paper on simplifying formal proof generation with ChatGPT acknowledges that the challenge has a rich history, stretching back through decades of work in automated theorem proving and interactive proof assistants. The authors explore prompt-engineering strategies, tool integration, and decomposition techniques that can translate informal reasoning into proof assistant code with fewer errors. Their results indicate that modest changes in how problems are presented to the model, such as explicitly listing relevant lemmas or constraining the allowed tactics, can significantly increase the rate of successful formalization.
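A prompt that explicitly lists usable lemmas and constrains tactics might be assembled like this. The helper, lemma names, and tactic list are hypothetical choices for illustration, not the paper’s actual prompts.

```python
def build_formalization_prompt(statement: str,
                               lemmas: list[str],
                               tactics: list[str]) -> str:
    # Hypothetical template: naming permitted lemmas and restricting
    # tactics narrows the model's search space during formalization.
    lemma_list = "\n".join(f"- {name}" for name in lemmas)
    return (
        "Formalize the following statement in Lean 4.\n"
        f"Statement: {statement}\n"
        f"You may use only these lemmas:\n{lemma_list}\n"
        f"Allowed tactics: {', '.join(tactics)}."
    )

prompt = build_formalization_prompt(
    "For all natural numbers n, n <= n + 1.",
    ["Nat.le_succ", "Nat.le_add_right"],
    ["exact", "apply", "simp"],
)
print(prompt)
```

Constraining the output space this way trades generality for reliability: the model can no longer invent nonexistent lemma names, one of the failure modes the benchmarks repeatedly observed.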

Taken together, these benchmarks and tool-building efforts highlight a crucial point: formal verification is not yet a solved problem, even for proofs that human mathematicians would consider routine. Bridging the gap between natural mathematical language and machine-checkable formal code remains an active research area, and the reliability of AI-generated proofs depends as much on this translation layer as on the models’ raw reasoning capabilities.

Thinking on the Fly, or Faking It?

Beyond formal benchmarks, anecdotal experiments have added texture to the debate about what, exactly, large language models are doing when they solve math problems. Researchers at the University of Cambridge reported that ChatGPT appeared to reason spontaneously when given an ancient Greek mathematics puzzle involving the area of a square. The system broke the problem down into sub-questions, considered alternative approaches, and corrected its own missteps, behaviors that, in a human student, would be interpreted as signs of genuine understanding.

The Cambridge team was careful in its language, saying the model “seemed to think on the fly” rather than claiming it truly understood geometry. That distinction captures the central tension in this field. AI systems can now produce outputs that look like original reasoning, but determining whether the underlying process constitutes genuine mathematical thought or sophisticated pattern completion remains an open philosophical and scientific question. From a practical standpoint, however, the line may matter less than reliability: if a model can consistently generate correct, verifiable proofs, researchers will be tempted to use it regardless of whether it is “really” thinking.

UCLA mathematician Terence Tao, one of the most respected figures in contemporary mathematics, has weighed in on this tension in public discussions about AI and proof. In commentary reported by UCLA’s news office, Tao has emphasized both the promise and the limitations of current systems. On the one hand, he notes that AI tools can already assist with routine calculations, suggest plausible lemmas, and check large combinatorial cases that would be tedious for humans. On the other, he cautions that models are prone to subtle errors and that human mathematicians remain essential for setting research directions, interpreting results, and deciding which conjectures are worth pursuing.

Looking ahead, the emerging picture is one of hybrid workflows rather than fully autonomous discovery. Benchmarks like First Proof and the Godel Test show that large language models can sometimes generate or refute genuinely new statements, especially when paired with careful experimental design and strong verification. Datasets such as PutnamBench, along with ongoing work on simplifying the formalization pipeline, reveal the many ways those efforts can fail. Anecdotal case studies, from ancient Greek puzzles to cutting-edge research projects, suggest that models can behave in ways that feel strikingly like human mathematical reasoning, even if their internal mechanisms remain opaque.

Whether these systems are “doing mathematics” in the human sense may ultimately be less important than how the mathematical community chooses to integrate them. If AI-generated proofs are always accompanied by machine-checkable certificates, and if the culture of verification keeps pace with the technology, the next decade could see a rapid expansion in the scale and ambition of solvable problems. If, instead, proofs are accepted on the basis of persuasive natural-language arguments from opaque models, the risk of undetected errors will grow. For now, the frontier of AI mathematics is defined as much by new norms and tools for checking proofs as by the startling problems that models are beginning to solve.


*This article was researched with the help of AI, with human editors creating the final content.*