Google DeepMind’s Gemini 3.1 Pro has posted the highest scores yet recorded on two of the toughest AI reasoning benchmarks in existence, pulling ahead of OpenAI’s GPT 5.2 on both. The results sharpen a competitive dynamic that has real consequences for researchers, businesses, and anyone relying on AI tools to handle expert-level problems. What makes these gains significant is not just the margin of victory but the nature of the tests themselves, which were built specifically to resist shortcuts and expose whether a model truly reasons or merely retrieves.
Two Benchmarks Built to Break AI Models
The two tests at the center of this story, Humanity’s Last Exam (HLE) and GPQA Diamond, were designed with a shared philosophy: to make it nearly impossible for AI systems to score well through pattern-matching or web lookups alone. HLE is a multi-modal, closed-ended benchmark composed of high-difficulty academic questions spanning subjects like mathematics, biology, and other expert domains. Its creators structured the exam so that answering correctly demands genuine synthesis across text and images, not just recall of memorized facts. The benchmark also includes both public and private held-out components, a design choice intended to detect whether a model has been overtrained on test data rather than developing authentic reasoning ability.
GPQA, short for Graduate-Level Google-Proof Q&A, takes a complementary approach. Its Diamond subset, the tier most frequently cited by AI vendors, consists of questions that PhD-level domain experts answer correctly but that highly skilled non-experts get wrong even with unrestricted internet access and ample time. The benchmark’s validation protocol was specifically engineered to measure deep scientific inference and domain knowledge, filtering out problems that could be solved through simple retrieval. Together, these two benchmarks represent the current ceiling for measuring how well an AI system can think through hard problems rather than look up answers, and they have quickly become reference points in technical reports and product announcements from leading labs.
Where Gemini 3.1 Pro Pulled Ahead
On HLE, Gemini 3.1 Pro scored 62%, compared to GPT 5.2’s 58%. On the GPQA Diamond subset, the gap was similar: 85% for Gemini versus 81% for GPT 5.2. Both Google DeepMind and OpenAI have cited HLE percentages in their own communications, lending weight to the benchmark as a shared yardstick for frontier model performance. That both companies reference the same test suggests the AI industry has converged on HLE as a meaningful, if imperfect, measure of where the state of the art actually stands, especially for tasks that blend advanced reasoning with domain-specific knowledge.
Those four-point margins may look modest in isolation, but context matters. HLE was designed so that even small score increases represent significant jumps in capability, because the questions sit at the boundary of what current systems can handle. A model that answers 62% of those questions correctly is not just slightly better than one at 58%; it is clearing problems that the lower-scoring model could not solve at all. The same logic applies to GPQA Diamond, where the difficulty floor is set at graduate-level science and the questions resist the kind of surface-level pattern matching that inflates scores on easier benchmarks, turning each additional correct answer into evidence of a qualitatively stronger reasoning process.
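One rough way to see why a four-point gap matters at this difficulty level is to express it as a share of the questions the lower-scoring model still gets wrong. The short sketch below uses only the scores reported above and assumes, as a simplification, that the stronger model answers everything the weaker model answers; the benchmarks do not publish per-question overlap, so this is an illustration of the arithmetic rather than a claim about which specific questions each model solved.

```python
def relative_gain(low: float, high: float) -> float:
    """Fraction of the questions the lower-scoring model misses that the
    higher-scoring model answers correctly, under the simplifying assumption
    that the stronger model's correct set contains the weaker model's."""
    return (high - low) / (100.0 - low)

# Scores as reported in this article (vendor-reported, not independently audited).
hle_gain = relative_gain(58, 62)    # Humanity's Last Exam: 62% vs 58%
gpqa_gain = relative_gain(81, 85)   # GPQA Diamond: 85% vs 81%

print(f"HLE: roughly {hle_gain:.1%} of previously missed questions now solved")
print(f"GPQA Diamond: roughly {gpqa_gain:.1%} of previously missed questions now solved")
```

Read this way, the GPQA Diamond gap closes about a fifth of the remaining headroom, which is why the same absolute difference carries more weight near the top of a hard benchmark than it would near the middle of an easy one.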
Why Held-Out Test Data Matters More Than Headlines
One persistent concern in AI benchmarking is contamination: the risk that a model has seen test questions, or close variants, during training. If that happens, high scores reflect memorization rather than reasoning. The HLE benchmark addresses this directly through its private, held-out question sets, which are not publicly available and exist specifically to catch overfitting. This design, described in the benchmark’s arXiv preprint, gives outside observers a way to check whether a model’s performance holds up on questions it could not have encountered before. In an environment where training data often includes massive web scrapes, such protections are increasingly central to claims about genuine progress.
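As a minimal sketch of how a held-out partition can be used, the snippet below compares a model’s accuracy on public questions with its accuracy on a private set and flags a large drop as a possible sign of contamination. The function name, the threshold, and the input numbers are all hypothetical; HLE’s maintainers run their own protocol, and this only illustrates the underlying logic.

```python
def contamination_signal(public_acc: float, heldout_acc: float,
                         gap_threshold: float = 0.05) -> bool:
    """Flag a suspiciously large drop from public to held-out accuracy.

    A model that genuinely reasons should score similarly on both splits;
    one that has memorized leaked public questions tends to fall sharply
    on questions it has never seen. The 5-point threshold is an arbitrary
    illustration, not a value used by the HLE maintainers.
    """
    return (public_acc - heldout_acc) > gap_threshold

# Hypothetical numbers for illustration only.
print(contamination_signal(public_acc=0.62, heldout_acc=0.61))  # False: score holds up
print(contamination_signal(public_acc=0.62, heldout_acc=0.48))  # True: likely memorization
```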
No independent third-party audit of Gemini 3.1 Pro’s performance on the held-out HLE components has been published, and OpenAI has not released raw evaluation logs for GPT 5.2’s benchmark runs. That gap means the reported scores, while credible given the benchmark’s design safeguards, still carry some uncertainty. The AI field has seen previous cases where vendor-reported numbers looked different under independent replication, especially when evaluation prompts, decoding settings, or sampling strategies were not fully documented. Until outside researchers run their own evaluations on the private HLE partition and disclose their methodologies in detail, the strongest claim available is that Gemini 3.1 Pro outperformed GPT 5.2 under the conditions each company chose to test.
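Much of the reproducibility concern above comes down to undocumented evaluation settings. The sketch below shows the kind of run manifest an independent replication would need to see published alongside a score; every field name and value here is illustrative, not part of any lab’s actual reporting format.

```python
import json

# Illustrative evaluation manifest. The fields are hypothetical, but each one
# corresponds to a setting that has shifted reported scores in past replications.
eval_run = {
    "model": "gemini-3.1-pro",                    # exact model identifier as tested
    "benchmark": "HLE",                           # benchmark name and version
    "split": "private-heldout",                   # which partition was scored
    "prompt_template": "zero-shot, answer-only",  # exact prompt wording matters
    "temperature": 0.0,                           # decoding settings
    "max_output_tokens": 8192,
    "samples_per_question": 1,                    # no best-of-n resampling
    "scoring": "exact match with judged fallback",
    "date": "YYYY-MM-DD",                         # placeholder
}

print(json.dumps(eval_run, indent=2))
```

Publishing something like this alongside headline numbers would let outside researchers rerun the same configuration and check whether the reported gap survives.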
From Knowledge Recall to Multi-Modal Synthesis
The pattern emerging from these results points to a shift in what separates top-tier AI models. Earlier generations competed primarily on how much factual knowledge they could store and retrieve, racing up leaderboards on benchmarks that rewarded encyclopedic recall. The current frontier, as reflected by HLE and GPQA Diamond, rewards something different: the ability to integrate information across modalities, apply domain expertise, and reason through problems that do not have obvious lookup solutions. Gemini 3.1 Pro’s edge on both benchmarks suggests that Google DeepMind has made specific architectural or training advances in this direction, though the company has not disclosed the technical details behind the improvement or how much of the gain comes from model scale versus better data curation.
For researchers and organizations that depend on AI for complex analysis, this distinction has practical consequences. A model that scores well on GPQA Diamond is demonstrating the kind of scientific reasoning that could accelerate work in drug discovery, materials science, or climate modeling, where problems require chaining together multiple steps of expert-level inference. A model that scores well on HLE is showing it can handle questions that span text and images, a capability that matters for fields like medical imaging, engineering design, and any domain where evidence comes in more than one format. In both cases, the benchmarks are beginning to approximate the messy, multi-step reasoning pipelines that real-world workflows demand, rather than the one-shot trivia queries of earlier tests.
What the Benchmark Race Means for AI Users
The competitive pressure between Google DeepMind and OpenAI is producing real, measurable gains in model quality on the hardest available tests. That benefits anyone choosing between AI platforms for serious analytical work, because benchmark scores on tests like HLE and GPQA Diamond are among the few objective signals available. Unlike marketing claims or cherry-picked demos, these benchmarks were built by independent researchers with explicit anti-gaming measures, making them harder to manipulate. For enterprises deciding which model to standardize on, differences of a few percentage points at the top of these leaderboards can translate into fewer failure cases on edge problems that matter most.
Still, benchmarks are proxies, not guarantees. A model that excels on closed-ended academic questions may struggle with open-ended business problems, ambiguous instructions, or tasks that require judgment calls outside the scope of any test. The gap between benchmark performance and real-world reliability remains significant, and no single score can capture how well a model will perform in a specific deployment context. What the Gemini 3.1 Pro results do reliably indicate is that the frontier is moving toward systems that can tackle harder, more structured reasoning challenges than before. Users who care about those capabilities should pay close attention not just to headline numbers, but to which benchmarks are being used, how the tests are administered, and whether independent evaluations confirm the claims now shaping the next generation of AI tools.
*This article was researched with the help of AI, with human editors creating the final content.