DeepMind’s AlphaProof system, working alongside AlphaGeometry 2, solved four of the six problems at the 2024 International Mathematical Olympiad, generating formally verified proofs through reinforcement learning trained on millions of auto-formalized problems. That result, reported in a recent Nature study, signals a broader shift. Artificial intelligence is not just assisting mathematicians but actively reshaping the mechanics of how proofs get written, checked, and discovered.
AlphaProof and the Reinforcement Learning Breakthrough
AlphaProof draws inspiration from AlphaZero, the game-playing agent that mastered chess and Go. But instead of board positions, it searches through formal proof states in the Lean proof assistant. The system learns to find formal proofs through reinforcement learning, training on millions of problems that were automatically translated into formal mathematical language. Each candidate proof is checked by a verification engine, creating a feedback loop where the model improves by confirming which reasoning steps actually hold up.
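The shape of that feedback loop can be sketched in a few lines. The toy below is a deliberately simplified stand-in, not AlphaProof's actual architecture: a "proof" is a short sequence of numeric steps, the verifier accepts only sequences that hit the goal exactly, and the policy is a bare reward tally. The one structural point it preserves is that the only training signal comes from verification.

```python
import random

# Toy stand-in for a proof checker: a "proof" is a list of steps,
# accepted only if the steps sum to the goal exactly. All components
# here are illustrative; AlphaProof's real internals are not public.
def verifier_accepts(goal: int, steps: list[int]) -> bool:
    return sum(steps) == goal

def sample_proof(policy: dict[int, int], goal: int, max_len: int = 5) -> list[int]:
    """Sample candidate steps, mostly exploiting moves the policy has rewarded."""
    steps = []
    for _ in range(max_len):
        # epsilon-greedy over a tiny action space {1, 2, 3}
        if random.random() < 0.3 or not policy:
            steps.append(random.choice([1, 2, 3]))
        else:
            steps.append(max(policy, key=policy.get))
        if sum(steps) >= goal:
            break
    return steps

def train(goal: int, episodes: int = 2000) -> dict[int, int]:
    """Reinforce only the steps that appear in verified proofs."""
    policy: dict[int, int] = {}
    for _ in range(episodes):
        steps = sample_proof(policy, goal)
        if verifier_accepts(goal, steps):  # reward comes only from the verifier
            for s in steps:
                policy[s] = policy.get(s, 0) + 1
    return policy

random.seed(0)
policy = train(goal=7)
print(policy)
```

The verifier here cannot be fooled, only satisfied, which is the property the real system inherits from Lean: a candidate that fails to check contributes no reward, however plausible it looks.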
The IMO 2024 performance is striking because competition problems sit at the boundary of what trained human competitors can handle under time pressure. AlphaProof operated in a formal setting, meaning every step it produced was machine-checkable rather than relying on the kind of intuitive leaps that characterize handwritten proofs. That distinction matters. A formally verified proof leaves no gaps for hidden errors, which is exactly the standard that skeptical mathematicians demand before trusting machine-generated reasoning.
AlphaProof’s training pipeline also hints at how future systems might be built. By auto-formalizing vast numbers of problems and then using reinforcement learning to navigate the resulting search space, the developers sidestepped the scarcity of hand-written formal proofs. The system effectively learned a policy for proof construction, guided not by human demonstrations but by the rigid constraints of a proof checker. In that sense, the IMO showing is less a party trick than a demonstration that formal proof search can scale to genuinely challenging mathematics.
From Olympiad Geometry to Combinatorics
AlphaProof is not the only system pushing these boundaries. AlphaGeometry, published as a peer-reviewed study in Nature, introduced a neuro-symbolic framework that pairs a neural language model trained on synthetic data with a symbolic deduction engine. The neural component proposes geometric constructions and intermediate steps, while the symbolic engine enforces logical consistency by verifying each deductive move. This division of labor allows the model to explore creative geometric ideas without sacrificing the rigor of formal reasoning.
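The symbolic half of that division of labor can be illustrated with a toy deduction engine. In the sketch below (the rules, facts, and names are illustrative stand-ins, not AlphaGeometry's actual representation), a forward-chaining loop saturates a fact set under deduction rules; the neural proposer would sit on top of this, injecting auxiliary constructions as new facts for the engine to exploit.

```python
# Minimal symbolic engine in the spirit of AlphaGeometry's split:
# deduction rules are applied to a fixed point, so any derived fact
# is logically forced by the starting facts. Toy example only.
def deduce(facts: set, rules) -> set:
    """Forward-chain rules until no new facts appear (the symbolic engine)."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for rule in rules:
            for new in rule(derived):
                if new not in derived:
                    derived.add(new)
                    changed = True
    return derived

# Toy rule: equality of segments is transitive.
def trans(facts):
    return {("eq", a, c)
            for (t1, a, b) in facts if t1 == "eq"
            for (t2, b2, c) in facts if t2 == "eq" and b2 == b and a != c}

facts = {("eq", "AB", "CD"), ("eq", "CD", "EF")}
closure = deduce(facts, [trans])
print(("eq", "AB", "EF") in closure)
```

Because the engine only ever adds consequences of its rules, a proposed construction either opens new deductive ground or changes nothing; it can never smuggle in an unjustified step.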
Its successor, AlphaGeometry2, extended this approach and was evaluated on historical Olympiad geometry problem sets. According to an arXiv preprint, the system matched the performance of top human gold medalists, solving a comparable fraction of problems under similar conditions, and it did so without relying on human solution traces during training. That result suggests that, at least in structured domains like Euclidean geometry, AI can internalize problem-solving heuristics that were once thought to require years of specialized human practice.
On the combinatorics side, FunSearch took a different path. Rather than proving theorems directly, it used a large language model inside an evolutionary search loop to generate candidate programs, which were then evaluated for mathematical quality. This approach, described in a Nature paper on new combinatorial constructions, produced novel cap set examples and improved heuristics for the bin packing problem. The key point is that the system discovered mathematical objects and algorithms that had not previously been catalogued by humans, underscoring that generative models can contribute genuinely new ideas rather than just remixing existing ones.
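The structure of that search loop is easy to sketch. In the miniature version below, the language model is replaced by random mutation and the "program" is just a weight vector parameterizing a bin-packing priority rule; the automatic evaluator keeps whichever candidate packs a fixed item list into the fewest bins. Everything here is an illustrative stand-in for FunSearch's actual components.

```python
import random

# FunSearch-style loop in miniature: evolutionary search over candidate
# "programs", each scored by an automatic evaluator. The LLM proposer is
# replaced by Gaussian mutation; the program is a 2-weight priority rule.
ITEMS = [0.42, 0.61, 0.25, 0.73, 0.18, 0.55, 0.37, 0.66, 0.29, 0.48]

def pack(weights, items, capacity=1.0):
    """Place each item into the feasible open bin with the highest score."""
    bins = []
    for item in items:
        best = None
        for b in bins:
            space = capacity - sum(b)
            if space >= item:
                score = weights[0] * space + weights[1] * item
                if best is None or score > best[0]:
                    best = (score, b)
        if best:
            best[1].append(item)
        else:
            bins.append([item])  # no bin fits: open a new one
    return len(bins)

def evolve(generations=200, seed=0):
    rng = random.Random(seed)
    best_w, best_n = (0.0, 0.0), pack((0.0, 0.0), ITEMS)
    for _ in range(generations):
        # mutate the incumbent (stand-in for sampling from a language model)
        cand = tuple(w + rng.gauss(0, 0.5) for w in best_w)
        n = pack(cand, ITEMS)
        if n <= best_n:  # evaluator: fewer bins is better
            best_w, best_n = cand, n
    return best_w, best_n

weights, bins_used = evolve()
print(bins_used)
```

With zero weights the rule degenerates to first-fit; a negative weight on remaining space turns it into best-fit, which on this item list saves a bin. The evaluator discovers that preference without being told it, which is the essence of the FunSearch recipe.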
The Formalization Bottleneck
Despite these advances, a persistent challenge is the gap between how mathematicians think and how proof assistants work. Human-written solutions, especially at the Olympiad level, often rely on high-level insights, clever substitutions, and informal diagrams. Turning such reasoning into a machine-checkable proof usually requires decomposing the argument into many small lemmas, each expressed in the rigid syntax of a proof assistant. One recent arXiv analysis of Olympiad problems found that a single competition question can expand into dozens of formal steps when translated into Lean, illustrating how quickly complexity explodes.
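The flavor of that expansion is visible even on a toy statement. The Lean 4 sketch below (core Lean only; the lemma name `two_mul_add` is made up for illustration) proves that the sum of two even numbers is even, a one-line claim on paper, by first carving out the algebraic identity as its own lemma:

```lean
-- "The sum of two even numbers is even," decomposed Lean-style.
def Even (n : Nat) : Prop := ∃ k, n = 2 * k

-- Helper lemma: the algebraic identity, stated and proved separately.
theorem two_mul_add (a b : Nat) : 2 * a + 2 * b = 2 * (a + b) := by
  rw [Nat.mul_add]

-- Main claim, assembled from the helper.
theorem even_add {m n : Nat} (hm : Even m) (hn : Even n) : Even (m + n) := by
  cases hm with
  | intro a ha =>
    cases hn with
    | intro b hb =>
      exact ⟨a + b, by rw [ha, hb, two_mul_add]⟩
```

A working mathematician would state this fact and move on; the proof assistant demands the witness, the distributivity step, and the case analysis spelled out. Multiply that overhead across a genuinely hard Olympiad argument and the dozens-of-steps expansion is unsurprising.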
This decomposition process is labor-intensive, and the shortage of large, high-quality formal corpora has constrained how fast neural provers can improve. DeepSeek-Prover, a system focused on autoformalization, directly targets this bottleneck by translating informal mathematical statements into formal ones at scale. By generating aligned pairs of informal and formal text, such systems aim to create the training data that modern language models need to learn robust proof strategies.
Meanwhile, Lean-STaR takes yet another approach: instead of insisting on purely formal reasoning from the outset, its model generates informal commentary interleaved with Lean proof steps. This mixed style has shown improved performance on the miniF2F benchmark, which aggregates problems from competitions like the AMC, AIME, and IMO across multiple proof assistants. The idea is that allowing the model to “think out loud” in natural language can guide it toward better formal moves, much as human mathematicians sketch ideas in prose before committing them to full rigor.
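In miniature, a Lean-STaR-style trace looks like ordinary Lean peppered with natural-language thoughts. In the sketch below the thought is rendered as a plain comment for illustration; in the actual system it is generated text that conditions the model's choice of the next tactic.

```lean
example (a b : Nat) (h : a = b) : a + 1 = b + 1 := by
  -- Thought: the two sides differ only in a versus b, so rewriting
  -- with the hypothesis h should close the goal by reflexivity.
  rw [h]
```

The informal line carries no logical weight, only the `rw [h]` step is checked, but it gives the model a place to plan before committing to a formal move.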
Mathematicians Are Cautious for Good Reason
The growing capability of these systems has not silenced skepticism. UCLA mathematician Terence Tao, speaking in coverage summarized by the UCLA newsroom, has emphasized that AI models are becoming increasingly proficient at generating convincing-looking arguments. The word “convincing” carries weight here: a proof that appears sound to a human reader but contains subtle errors can be more dangerous than an obviously flawed attempt, because it risks seeding entire research programs on a false foundation.
This is precisely where formal verification offers a safety net. When AlphaProof generates a solution in Lean, every logical step is checked by the proof assistant before the result is accepted. The system cannot bluff its way past the verifier; if a step does not follow from previous ones, the proof simply fails. However, most AI-generated mathematical reasoning today is not produced inside proof assistants. Standard large language models remain prone to hallucinations, occasionally fabricating lemmas, misapplying theorems, or skipping crucial cases even when they sound confident.
Researchers working under the AI4Math umbrella have argued that addressing this gap is both intellectually significant and practically essential. If AI is to assist with safety-critical domains, such as cryptographic protocol design or formal verification of hardware, its mathematical reasoning must be not only creative but also reliably correct. That requires better integration between generative models and the formal systems that can certify their outputs.
Early Wins in Formal Libraries
The idea of machines contributing accepted proofs to human-maintained libraries is no longer theoretical. GPT-f, an early language-model-driven proof search system, generated new short derivations that were incorporated into the Metamath library, a long-running repository of fully formalized mathematics. In that project, the model did not operate as an autonomous theorem prover; instead, it proposed candidate proof steps that were then validated by Metamath’s strict verifier. Only sequences that passed this check were admitted, ensuring that every machine-suggested contribution met the community’s standards of rigor.
These early successes hint at a future in which formal libraries grow through a combination of human and machine effort. AI systems could scan existing repositories for theorems with long or fragile proofs and search for shorter, more robust alternatives. They might also identify gaps, natural conjectures that humans have not yet explored, and either propose tentative statements or attempt full formal proofs. As with AlphaProof’s Olympiad performance, the key is that each contribution would be accompanied by a certificate of correctness that other mathematicians can trust.
For now, the field remains in a transitional phase. Systems like AlphaProof, AlphaGeometry2, and FunSearch demonstrate that AI can tackle problems at or beyond the level of elite human competitors in specific domains. Autoformalization tools and hybrid approaches such as Lean-STaR are beginning to ease the bottleneck that has long separated informal insight from formal verification. At the same time, leading mathematicians continue to voice concerns about overreliance on opaque models whose internal reasoning cannot be inspected or guaranteed.
The next few years are likely to be defined less by headline-grabbing competition results and more by infrastructure: building richer formal libraries, refining autoformalization pipelines, and integrating proof assistants into everyday mathematical practice. If that work succeeds, AI will not replace mathematicians so much as change what they do, shifting effort from checking routine arguments to exploring new conjectures, structures, and theories. In that world, the most important proofs may be those that no single human or machine could have found alone, but that emerge from a tightly coupled collaboration between both.
*This article was researched with the help of AI, with human editors creating the final content.*