Christina Morillo/Pexels

Artificial intelligence has moved from checking homework to attacking problems that professional mathematicians once treated as out of reach. Systems tuned for symbolic reasoning are now cracking long‑standing conjectures, acing elite contests and, in some cases, making human competitions feel almost obsolete. The question is no longer whether AI can handle hard math, but whether it can match the creativity and judgment of the best human minds.

I see the current moment as a stress test for what “mathematical ability” really means. Machines are already outperforming top students under exam conditions, yet they still lean on humans to translate problems, interpret results and decide which ideas matter. That tension, between raw problem‑solving power and deeper understanding, is where the race between AI and human mathematicians is actually being run.

From “impossible” puzzles to machine‑readable proofs

The most striking shift is that AI is now solving problems that experts once labeled "impossible" for computers, at least in practice. Recent work in automated theorem proving shows systems navigating dense symbolic arguments and discovering proofs that surprise specialists, a far cry from the days when software could only check steps that humans had already laid out. In one widely discussed project, reported in December, researchers built an AI that attacked advanced competition problems, but the system first required humans to translate the problems into a special computer language before it could begin work, a reminder that even the most capable models still depend on careful human encoding of mathematical structure.

That translation bottleneck is not just a technicality; it is a clue to what current systems are actually doing. They excel once a problem is expressed in the formal grammar they understand, yet they struggle with the messy front end of mathematics, where ideas are vague, diagrams are informal and notation is inconsistent. Reporting on how AI is solving "impossible" math emphasizes that human experts still curate the problems, design the formal languages and interpret the resulting proofs, even when the machine's solution rivals the world's best math students, a pattern that shows up clearly in the December coverage of these "impossible" problems.
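To make the bottleneck concrete, here is a minimal sketch of what "translating a problem into a special computer language" can look like, using the Lean proof assistant. The statement, the names and the proof are my own toy example, not the encoding used by any of the systems in the reporting; it simply shows that even a one-line informal claim must be restated in precise formal syntax before a machine can verify it.

```lean
-- Informal claim: "the sum of two even numbers is even."
-- Before an automated prover can touch it, a human must restate it in
-- formal syntax. Hypothetical toy example; assumes a recent Lean 4
-- toolchain where the `omega` tactic ships with core Lean.
theorem even_add_even (a b : Nat)
    (ha : ∃ k, a = 2 * k) (hb : ∃ k, b = 2 * k) :
    ∃ k, a + b = 2 * k := by
  cases ha with
  | intro x hx =>
    cases hb with
    | intro y hy =>
      -- The witness for the sum is x + y; `omega` discharges the arithmetic.
      exact ⟨x + y, by omega⟩
```

Everything that reads naturally in English, from "even" to "sum", has to be spelled out as explicit quantifiers and equations before the prover can even parse the problem, which is exactly the front-end work that still falls to humans.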

Gold‑medal AI at the International Mathematical Olympiad

The clearest public benchmark for mathematical talent is still the International Mathematical Olympiad, or IMO, where high school prodigies tackle six brutally difficult problems over two days. In 2025, AI systems finally crossed a symbolic line by achieving gold-medal performance on this stage, matching and in some cases surpassing the scores of the very best human contestants. Analysts who had been tracking progress noted that the baseline was already high: Google's AlphaProof had been described as a specialized system that set expectations for what might be possible at the IMO, a context laid out in detail in a July analysis that asked what the IMO would really tell us about AI math capabilities and how far these systems had come compared with earlier generations of solvers.

By the time the 2025 contest wrapped up, multiple AI teams were claiming gold-level results, and the conversation shifted from "if" to "how" they had done it. A July deep dive into the 2025 IMO winner's circle described how Google DeepMind's team moved from formal to natural language, training systems that could not only manipulate formal proofs but also translate the AI's solutions back into English so humans could follow the reasoning. The same account highlighted that three AI teams shared the top tier, underscoring that the IMO had become a proving ground not just for students but for competing AI research programs.

From benchmarks to real contest performance

For years, AI math progress was measured on static benchmarks, curated sets of problems that models could train on and gradually overfit. The 2025 IMO results marked a break from that pattern, because the problems were fresh, unseen and designed to foil rote pattern matching. One July analysis framed the shift explicitly as a move from benchmarks to gold, arguing that the real story was not just that systems solved Olympiad problems, but that they did so under the same 4.5-hour constraints and partial-credit scoring rules that human contestants face. At the 2025 International Mathematical Olympiad, large language models were reported to have solved five of the six problems, a performance that would comfortably earn a gold medal for a human competitor and that suggested these systems could generalize beyond the training distributions that had defined earlier math benchmarks, as described in the piece "From Benchmarks to Gold: How LLMs Cracked IMO and What Comes Next for Math AI."

That same analysis stressed research methods that went beyond raw scaling, including careful prompt design, self-verification loops and hybrid systems that combined language models with symbolic proof engines. By treating the IMO as a live, adversarial test rather than a static dataset, the teams behind these systems could probe where models still failed, such as combinatorics problems that required a single deep insight rather than many routine steps. The emphasis on benchmark design in that July report underscored a broader lesson: as AI closes the gap with top humans on headline metrics like contest scores, the real differentiator becomes how we construct tests that reveal whether the machine is reasoning or just exploiting statistical regularities.
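The "self-verification loop" idea is easier to see in schematic form. The sketch below is a hypothetical illustration of the general pattern, not code from any of the teams mentioned: `propose_solution` stands in for a language-model call and `verify` for a symbolic proof checker, and both are stubbed out so that only the loop structure matters.

```python
"""Schematic generate-and-verify loop: a model proposes candidate proofs and
a symbolic checker accepts or rejects them, with rejection feedback folded
into the next attempt. Hypothetical sketch only; the helper functions are
placeholders, not real APIs from any system named in the article."""

import random
from typing import Optional, Tuple


def propose_solution(problem: str, feedback: Optional[str]) -> str:
    # Placeholder for an LLM call that drafts a candidate proof, optionally
    # conditioned on the checker's feedback from the previous failed attempt.
    hint = f" (revising after: {feedback})" if feedback else ""
    return f"candidate proof of {problem!r}{hint}"


def verify(candidate: str) -> Tuple[bool, Optional[str]]:
    # Placeholder for a formal proof engine: either accept the candidate or
    # report where it breaks down. Randomized here purely for illustration.
    if random.random() < 0.3:
        return True, None
    return False, "step 4 does not follow from step 3"


def solve(problem: str, max_attempts: int = 8) -> Optional[str]:
    feedback: Optional[str] = None
    for _ in range(max_attempts):
        candidate = propose_solution(problem, feedback)
        ok, feedback = verify(candidate)
        if ok:
            return candidate  # only checker-approved proofs are returned
    return None  # budget exhausted: fail loudly rather than guess


if __name__ == "__main__":
    print(solve("toy Olympiad-style problem"))
```

The design choice the reporting highlights is visible in the last two lines of `solve`: the system never returns an unverified answer, which is what separates a hybrid prover from a language model that simply sounds confident.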

Gemini Deep Think and OpenAI’s experimental Olympians

Among the systems that reached gold-medal standard, Gemini Deep Think has become a touchstone for what integrated reasoning models can do. In a July announcement of breakthrough performance at IMO 2025, Google stated that an advanced version of Gemini Deep Think solved five of the six Olympiad problems within the official 4.5-hour competition time limit, with solutions that human judges found clear and, in many cases, easy to follow. That performance was framed as a significant milestone because it showed that a general-purpose model, augmented with a "Deep Think" mode for extended reasoning, could match the sustained concentration and multi-step planning that human contestants cultivate over years of training.

OpenAI, for its part, highlighted a different philosophy in its own July write-up of an IMO-level system. The author noted, "Besides the result itself, I am excited about our approach," emphasizing that the team reached gold-medal capability not through narrow, task-specific engineering but with a more general system that was, in the author's words, "very much an experimental model," adapted to Olympiad problems. That framing matters because it suggests a future where the same core model that chats, writes code and analyzes data can also tackle high-end mathematics without being rebuilt from scratch, a contrast to earlier generations of theorem provers that were essentially bespoke tools. The LessWrong discussion of OpenAI's claim makes clear that the company sees contest performance as a side effect of broader reasoning progress rather than an isolated stunt.

Humans versus machines in live competition

Once AI systems started posting gold-level scores, organizers began staging explicit showdowns between human teens and algorithms. July coverage of one such event described human teens competing against AI at an international math competition, after AI models achieved gold-level scores at the International Mathematical Olympiad and similar events, putting them head-to-head with the world's best high school problem solvers. The report described how the AI models, running on servers far from the contest hall, quietly submitted solutions that matched or exceeded those of the students, raising questions about how to keep such competitions meaningful in an era when a laptop could, in principle, outscore the entire field.

Broader reporting has reinforced that picture. One July summary noted that Google and OpenAI models outscored top teens in the world's toughest math showdown, describing how both companies' systems put up gold-medal-standard performances and, in some scoring schemes, edged out the best human contestants. The same piece emphasized that artificial intelligence is not just matching human performance but redefining its limits daily, suggesting that the gap may widen as models improve and training data expands. For the teenagers who have spent years preparing for these contests, the arrival of machine rivals is both a challenge and a reality check.

Beyond contests: long‑standing problems and research‑level math

Contest performance is flashy, but the deeper test of AI's mathematical value is whether it can contribute to research that professionals care about. One striking example came in November, when reports described how a 30-year-old problem drawn from the "problem list" of mathematician E had been cracked by AI, with the solution emerging in just six hours of machine time. The account noted that the problem had resisted human effort for decades and that the AI-assisted solution was valuable enough to be associated with tens of thousands of dollars in prize money, a sign that this was not a toy example but a genuinely important result.

At the same time, leading mathematicians caution that these breakthroughs do not yet amount to a wholesale replacement of human creativity. A June analysis quoted one expert saying, "But we are not there yet," adding that AI can contribute in a meaningful way to research-level problems, but that machines are certainly not at the point where they can autonomously set agendas, define new fields or decide which conjectures are worth decades of human attention. The same piece suggested that the next few years will reveal whether AI becomes a routine collaborator in high-end math or remains a specialized tool for certain classes of problems.

Government bets and the race for mathematical acceleration

Governments are not waiting for that verdict; they are already investing in programs that treat AI as a force multiplier for mathematical discovery. In the United States, the Defense Advanced Research Projects Agency has launched an initiative known as expMath, framed explicitly as a way to accelerate mathematical progress. According to the program description, "by accelerating mathematical progress," expMath has the potential to unlock breakthroughs in a wide range of critical areas, from cryptography to materials science, by building systems capable of proposing and proving useful abstractions rather than just checking existing proofs, a vision laid out in the agency's call for ideas and in media guidance titled "Math + AI = Tomorrow's breakthroughs."

These efforts reflect a strategic belief that faster math can translate into real-world advantages, whether in secure communications, optimization of logistics or the design of new algorithms for sensing and control. They also implicitly assume that AI will not just replicate what human mathematicians already do, but will help uncover structures and patterns that would be hard for any individual to spot. That is a more ambitious goal than winning contests, and it raises its own questions about verification, credit and the culture of proof, especially if machines begin to propose abstractions that are correct but difficult for humans to interpret, a possibility that expMath's focus on proposing and proving useful abstractions brings to the foreground in the program description.

Are competitions already obsolete?

As AI systems rack up wins, some mathematicians are starting to wonder whether traditional competitions still make sense. One December report quoted a leading figure saying that AI systems are now so good at such mathematical games that there is no point to these competitions, because AI can beat the best human contestants, likening the situation to chess, where computers became so strong that human tournaments no longer served as a meaningful test of the frontier. The same piece cited a mathematician at Imperial College London who argued that the real action would shift to problems that are too open-ended or too conceptually deep to be packaged into contest format, echoing his remark that "the chess computers got good, but people still play chess," just no longer as a way to prove where the frontier of intelligence lies.

I think that analogy cuts both ways. On one hand, once machines dominate a formal game, the game stops being a frontier of intelligence research, and the same may soon be true for Olympiad-style contests. On the other hand, chess did not die when engines surpassed grandmasters; it evolved, with humans using engines as training partners and analytical tools while still valuing human-only tournaments as cultural events. Math competitions may follow a similar path, becoming less about defining the limits of human reasoning and more about education, community and the joy of problem solving, even as AI quietly sets a higher bar in the background.

Why humans still matter in the age of math AI

Even if AI can outscore top teens and crack decades-old problems, it does not automatically render human mathematicians obsolete. One reason is that the most impactful work in science and technology often comes from human-machine teams rather than either side alone. A February study on collaborative decision making found that humans alone achieved 81% accuracy and AI alone achieved 73%, but the combination hit 90%, with the best results coming when the AI deferred to the human on ambiguous cases and the human learned to trust the model on routine ones. The same analysis stressed that the key was designing workflows where the AI's strengths in pattern recognition and exhaustive search complemented the human's strengths in context, ethics and long-term planning, a lesson that applies directly to mathematical research.
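As a toy illustration of that deferral pattern, a combined workflow can be as simple as a confidence threshold: trust the model on routine cases and hand ambiguous ones to a person. The routing rule, threshold and sample cases below are invented for the example and are not taken from the study.

```python
"""Toy human-AI deferral policy: accept the model's answer when its own
confidence is high, otherwise defer to the human reviewer. Threshold and
sample cases are hypothetical, for illustration only."""

from dataclasses import dataclass


@dataclass
class Case:
    model_answer: str
    model_confidence: float  # model's self-reported confidence in [0, 1]
    human_answer: str


def resolve(case: Case, defer_below: float = 0.8) -> str:
    # Routine case: the model's answer stands. Ambiguous case: the human decides.
    if case.model_confidence >= defer_below:
        return case.model_answer
    return case.human_answer


if __name__ == "__main__":
    cases = [
        Case("converges", 0.95, "converges"),  # routine: model is confident
        Case("diverges", 0.40, "converges"),   # ambiguous: the human call wins
    ]
    print([resolve(c) for c in cases])
```

The point of the sketch is the division of labor, not the numbers: each side handles the cases it is better suited to, which is how a team can outperform either member alone.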

In mathematics, I expect a similar division of labor to emerge. AI systems will increasingly handle the exploration of vast search spaces, the checking of intricate proofs and the generation of candidate conjectures, while humans will focus on setting the agenda, interpreting surprising patterns and connecting abstract results to the rest of science. That is not a consolation prize; it is a recognition that "beating" humans on a contest scoreboard is not the same as replacing the uniquely human capacity to decide which questions are worth asking in the first place.
