
AI systems are racing through standardized tests, acing bar exams and coding challenges that once seemed safely human. Yet in daily life, the same tools still hallucinate facts, misread context and quietly reinforce bias in ways that are hard to detect. The disconnect points to a deeper problem: the industry is fixated on narrow scores that say little about whether an AI system is actually intelligent in the ways that matter to people.
The more urgent question is not how many benchmarks a model can dominate, but whether it can be trusted, understood and held accountable when it fails. From consumer chatbots to diagnostic tools in hospitals, the stakes are no longer theoretical, and the metrics we use to judge “smart” machines are starting to look dangerously out of date.
Benchmarks were built for labs, not for life
For decades, progress in machine learning has been defined by performance on a fixed set of tests, from image recognition leaderboards to language exams. Computer scientist Melanie Mitchell has described how the typical research method is to design a benchmark, train a system to excel on that benchmark and then treat the resulting score as a proxy for intelligence. In controlled settings this approach is efficient, but it rewards pattern matching over genuine understanding and often produces models that generalize poorly once they leave the test environment.
Even within the research community, there is growing recognition that the way benchmarks are built and used is deeply flawed. Reporting on AI evaluation has highlighted how weak implementation criteria and inconsistent documentation can turn supposedly rigorous tests into moving targets, where small tweaks in setup or data curation produce big swings in performance. When the rules are fuzzy and the incentives are to win at all costs, benchmark scores start to look less like scientific measurements and more like marketing assets.
The benchmark obsession is distorting the market
As AI systems have moved from labs into products, benchmark scores have become a kind of currency for investors, customers and regulators who are trying to compare opaque technologies. That has created what one evaluation specialist calls an AI industry “benchmark obsession,” where models that dominate standardized tests still struggle in real-world deployments. In practice, this means a chatbot that tops a reasoning leaderboard can fail to follow basic business rules in a bank’s customer service flow, or a vision model that shines on a curated dataset can misclassify images from a different camera or lighting condition.
The incentives around these scores are so strong that some companies have started gaming the tests themselves. A recent investigation into AI marketing described how firms manipulate results so they can claim their models are the best, making it harder for consumers and governments to make informed decisions about which systems to use. In one widely shared explainer, reporters detailed how AI companies massage benchmark conditions, cherry-pick favorable metrics and quietly discard runs that do not support the desired narrative. When the scoreboard itself is being bent, the notion that a single number can capture “how smart” a system is starts to collapse.
Accuracy is not the same as truth
Even when benchmarks are not being gamed, the metrics they rely on can be deeply misleading. Accuracy, precision and recall are technical measures that quantify how often a model’s outputs match a labeled dataset, and they are often presented as evidence of reliability. Yet as one analysis from the United Nations University bluntly put it, people should never assume that the accuracy of AI-generated information equals truth, especially when those metrics are calculated on historical data that may be biased or incomplete.
The gap between technical performance and lived reality is already visible to users. In one survey of student perspectives on AI tools in education, respondents warned that systems can be inaccurate in subtle ways that are hard to detect, making them potentially dangerous in academic and professional settings. A model that is “right” 95 percent of the time on a benchmark may still fabricate citations, misinterpret assignment prompts or reproduce stereotypes, and those failures are rarely captured in the headline score that gets marketed to the public.
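To make that gap concrete, here is a minimal sketch in Python, using invented numbers rather than any real benchmark: a classifier that is “right” 95 percent of the time on an imbalanced dataset while never catching a single one of the cases that actually matter. Accuracy, precision and recall end up telling very different stories.

```python
# Illustration only: the data and "model" are invented.
# A detector that labels every case as negative on a dataset where only
# 5 percent of cases are truly positive looks 95 percent accurate even
# though it never finds a single positive case.

y_true = [1] * 5 + [0] * 95          # 5 real positives, 95 negatives
y_pred = [0] * 100                   # model predicts "negative" every time

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

accuracy = (tp + tn) / len(y_true)                   # 0.95
precision = tp / (tp + fp) if (tp + fp) else 0.0     # no positive predictions at all
recall = tp / (tp + fn) if (tp + fn) else 0.0        # 0.0: every real case missed

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
# accuracy=0.95 precision=0.00 recall=0.00
```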
Experts say many popular tests are close to meaningless
As large language models have surged into public view, a new wave of evaluations has emerged to test everything from legal reasoning to medical knowledge. Yet several AI researchers and auditors now argue that the most widely cited exams tell us very little about how these systems will behave in practice. One investigation into model comparisons reported that everyone is judging AI by a small set of standardized tests, but experts say they are close to meaningless as indicators of real-world capability.
The problem is not just that these tests are narrow; it is that they are static while the systems they measure are constantly updated and fine-tuned. Once a benchmark becomes popular, it quickly leaks into training data, and models start to memorize patterns rather than demonstrate flexible reasoning. Mitchell has warned that this kind of contamination makes it impossible to know whether a high score reflects genuine understanding or simple overfitting, a concern that echoes through her broader critique of benchmark-driven evaluation. When the exam questions are effectively in the textbook, the resulting grades tell us little about how a system will handle unfamiliar situations.
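There are rough ways to probe for this kind of leakage. The sketch below, which assumes plain-text access to both the benchmark questions and a sample of the training corpus, flags items whose word n-grams overlap heavily with training text; it is a simplified illustration of the overlap checks some labs describe, not a definitive contamination audit, and the inputs shown are invented.

```python
# Rough contamination probe (illustrative only): flag benchmark items whose
# word n-grams also appear in a sample of the training corpus.

def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """Return the set of lowercase word n-grams in a text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap_score(item: str, corpus_ngrams: set[tuple[str, ...]], n: int = 8) -> float:
    """Fraction of the item's n-grams that also occur in the training sample."""
    item_ngrams = ngrams(item, n)
    if not item_ngrams:
        return 0.0
    return len(item_ngrams & corpus_ngrams) / len(item_ngrams)

# Hypothetical inputs: a slice of training text and two benchmark questions.
training_sample = "the quick brown fox jumps over the lazy dog near the river bank today"
benchmark_items = [
    "the quick brown fox jumps over the lazy dog near the river bank today",
    "explain why the sky appears blue at noon and red at sunset",
]

corpus = ngrams(training_sample)
for item in benchmark_items:
    score = overlap_score(item, corpus)
    flag = "possible leak" if score > 0.5 else "looks novel"
    print(f"{score:.2f}  {flag}  {item[:40]}...")
```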
Trust, rapport and human outcomes are missing from the scorecard
Some AI leaders are now arguing that the most important qualities of an AI system cannot be captured by traditional metrics at all. Product director Fernanda Dobal has described how today’s AI landscape feels remarkably similar to the early days of search engines, when companies bragged about the number of web pages indexed instead of whether users actually found what they needed. In her view, the industry is repeating that mistake by focusing on token counts, parameter sizes and benchmark scores instead of measuring whether systems build trust and rapport with the people who rely on them.
That perspective is not just philosophical; it is grounded in product experience. At Cleo, the consumer finance app where the assistant is embedded, executives have argued that the most meaningful metric is not how many questions the AI can answer, but how well it helps humans thrive, and the team tracks whether users feel supported and empowered rather than simply counting interactions. In a recent essay, the company’s leaders warned that the industry is still measuring AI all wrong, because it rarely asks whether systems are helping people make better financial decisions or avoid harm.
Practical intelligence beats abstract “smartness”
Behind these critiques is a more fundamental challenge to how we talk about machine intelligence at all. A growing group of researchers argues that asking whether AI is “intelligent” in the abstract is the wrong question, and that we should instead focus on what systems can do in concrete settings. One influential essay on this topic urges readers to embrace what it calls practical intelligence, arguing that a diagnostic AI that helps doctors save lives or a climate modeling system that improves disaster planning can truly be called intelligent, regardless of how it performs on abstract puzzles.
In this view, intelligence is not a single dimension that can be captured by a score, but a relationship between a system, its environment and the humans it serves. A navigation app that reliably guides drivers through a chaotic city, or a translation tool that helps refugees access legal aid, may never top academic benchmarks, yet they embody the kind of practical capability that matters most. By contrast, a model that dazzles on synthetic reasoning tests but fails to respect user consent or cultural nuance looks less like a breakthrough and more like a liability. Shifting the conversation toward practical intelligence would force companies to define success in terms of real-world outcomes instead of leaderboard positions.
Understanding, AGI and the limits of current tests
The debate over metrics is also tangled up with a larger argument about whether today’s systems actually understand anything at all. Sam Altman, the CEO of OpenAI, has said that we will reach AGI, or artificial general intelligence that can do any cognitive task a human can, and that some current models already show glimmers of that future. In a recent discussion on whether AI understands, philosophers and computer scientists pointed out that even if a system can produce useful answers, that does not mean it has the kind of grounded comprehension humans associate with understanding.
Benchmarks are poorly equipped to resolve this dispute, because they mostly test outputs rather than internal representations or causal reasoning. A model can ace a reading comprehension exam by exploiting statistical patterns in text without forming any mental model of the world, and current tests rarely distinguish between those paths to success. Mitchell has argued that this is why systems that look impressive on paper can still fail spectacularly when confronted with slightly altered problems or adversarial prompts. Until evaluations probe how models generalize, adapt and explain their own behavior, claims about AGI will rest on shaky empirical ground.
Why agent benchmarks matter, and where they fall short
One promising response to these concerns is the rise of benchmarks that evaluate AI agents in interactive environments rather than static question sets. Researchers at IBM have noted that much of today’s revolution in AI and natural language processing can be traced back to rigorous, standardized benchmarks, and they are now working on new suites that test how agents plan, act and recover from mistakes. These efforts aim to capture skills like tool use, long-term memory and collaboration, which are closer to how people experience AI in products like autonomous vehicles or workflow copilots.
Yet even these richer evaluations face familiar challenges. Designing realistic tasks that are still reproducible across labs is difficult, and there is still no consensus on what progress should look like for agents that operate in messy human environments. If the field simply replaces one set of narrow scores with another, the underlying problem will remain. The real opportunity is to pair agent benchmarks with field experiments, user studies and domain-specific audits that track whether systems actually improve safety, efficiency or well-being in the settings where they are deployed.
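For readers who want a sense of what such suites measure, here is a toy sketch of an interactive evaluation loop. The environment, agent and scoring below are invented placeholders, not any lab’s actual benchmark, but they show how agent evaluations can track step counts, recovery from errors and task completion instead of grading a single static answer.

```python
# Illustrative skeleton of an interactive agent evaluation (all names invented).
# Instead of grading one static answer, the harness replays a multi-step task
# and records completion, step count, and recovery after a failed action.

from dataclasses import dataclass

@dataclass
class EpisodeResult:
    completed: bool
    steps: int
    recovered_from_error: bool

class ToyEnv:
    """Stand-in environment: the task finishes after 3 successful actions,
    and the second action always fails once, forcing the agent to retry."""
    def reset(self):
        self.progress, self.failed_once = 0, False
        return self.progress
    def step(self, action):
        if self.progress == 1 and not self.failed_once:
            self.failed_once = True
            return self.progress, True, False          # (observation, error, done)
        self.progress += 1
        return self.progress, False, self.progress >= 3

class RetryAgent:
    """Stand-in agent that simply retries the same action after an error."""
    def act(self, observation):
        return "do_next_step"

def run_episode(agent, env, max_steps: int = 20) -> EpisodeResult:
    observation = env.reset()
    saw_error = recovered = False
    for step in range(1, max_steps + 1):
        observation, error, done = env.step(agent.act(observation))
        if error:
            saw_error = True
        elif saw_error:
            recovered = True            # made progress after an earlier failure
        if done:
            return EpisodeResult(True, step, recovered)
    return EpisodeResult(False, max_steps, recovered)

result = run_episode(RetryAgent(), ToyEnv())
print(result)   # EpisodeResult(completed=True, steps=4, recovered_from_error=True)
```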
From leaderboard culture to outcome-based evaluation
What would it look like to measure AI in a way that reflects its real impact on people’s lives rather than its performance on synthetic tests? One answer is to treat benchmarks as starting points, not finish lines, and to prioritize A/B testing in live environments where possible. Evaluation specialists have argued that the current benchmark culture has created a dangerous illusion of progress, and that teams should instead run controlled experiments to see how models affect key outcomes like error rates, user satisfaction and operational risk.
Some organizations are already moving in this direction by building evaluation frameworks that combine technical metrics with human-centered indicators. In finance, that might mean tracking whether an AI reduces the number of customers who fall into overdraft, not just how accurately it predicts spending patterns. In healthcare, it could involve measuring how a diagnostic tool changes treatment decisions and patient outcomes, rather than only its performance on historical scans. The shift is subtle but profound: instead of asking “How smart is this system?” the question becomes “What happens when people rely on it?”
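A minimal sketch of what that could look like in practice, with invented figures: rather than reporting a benchmark score, an evaluation compares a real-world outcome, here the share of users who fall into overdraft, between a control group and a group using the assistant, and asks whether the difference is likely to be real. The numbers and the simple significance test are placeholders for illustration, not results from any actual deployment.

```python
# Minimal outcome-based A/B comparison (all figures invented for illustration).
import math

control_users, control_overdrafts = 5000, 450     # no AI assistant
treated_users, treated_overdrafts = 5000, 380     # assistant enabled

p_control = control_overdrafts / control_users    # 0.090
p_treated = treated_overdrafts / treated_users    # 0.076

# Two-proportion z-test: is the drop in overdraft rate plausibly real?
p_pooled = (control_overdrafts + treated_overdrafts) / (control_users + treated_users)
se = math.sqrt(p_pooled * (1 - p_pooled) * (1 / control_users + 1 / treated_users))
z = (p_control - p_treated) / se

print(f"overdraft rate: control {p_control:.1%}, treated {p_treated:.1%}")
print(f"z statistic: {z:.2f}  (|z| > 1.96 is roughly significant at the 5% level)")
```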
Rethinking what we count as progress
Underneath the technical debates about benchmarks and metrics lies a more basic choice about what kind of AI future we want. If progress continues to be defined by leaderboard rankings and abstract scores, companies will keep optimizing for systems that look impressive in demos but are brittle, opaque and hard to govern. The reporting on today’s AI landscape, from Fernanda Dobal’s critique of shallow metrics to the warnings about manipulated tests and contaminated datasets, suggests that this path is already producing systems that are misaligned with human needs. The fact that students, doctors and policymakers are all voicing concerns about hidden inaccuracies and unmeasured harms is a sign that the current scorecard is not capturing what matters.
There is another path, one that treats intelligence as the capacity to help people solve real problems safely and fairly. That approach would elevate practical intelligence, trust, rapport and accountability as primary metrics, and it would demand evaluation methods that are as dynamic and context-aware as the systems they measure. It would also require humility from AI developers, who would need to accept that a single benchmark can never settle the question of how “smart” their creations are. If we can make that shift, the next generation of AI systems might be judged less by how they perform on a test and more by how they change the world for the people who use them.
Supporting sources: We’re measuring AI all wrong—and missing what matters most.