Why LLMs are stalling out and what that means for software security?

Large language models have been pitched as the next great leap in software development, yet mounting evidence suggests their capabilities are flattening rather than accelerating. That plateau carries direct consequences for software security: if the tools writing and reviewing code cannot reliably improve, the vulnerabilities they introduce or miss will compound. The tension between rising adoption and stalling performance creates a widening gap that security teams are only beginning to address.

Benchmark Gains Are Shrinking, Not Surging

The most telling sign of an LLM slowdown comes from the benchmarks designed to measure real coding ability. SWE-bench, developed by Princeton NLP and collaborators, introduced a dataset of 2,294 real GitHub issues and found that state-of-the-art models at the time achieved very low baseline solve rates. That result established a hard floor. Autonomous code-fixing on real-world repositories is far more difficult than synthetic test suites suggest. When researchers later audited those results with a stricter version of the benchmark, the picture got worse. SWE-bench+ flagged many “successful” patches as contaminated, and resolution rates dropped after filtering out leaked solutions and weak test cases. The apparent progress was partly an artifact of data leakage and permissive evaluation, not genuine capability growth.

Independent evaluations of coding and reasoning tasks show a similar pattern of compression at the top. Multiple highly capable models now cluster within a narrow band of scores and trade wins across scenarios rather than any single system pulling decisively ahead. The marginal distance between frontier models is shrinking, which means organizations betting on the next release to magically solve their code-quality problems are likely to be disappointed. For engineering leaders, this translates into a practical constraint: LLM-assisted development will not self-correct its own security shortcomings through sheer model improvement alone, and process changes must fill the gap.

Scaling Laws Hit Diminishing Returns

Part of the explanation lies in the physics of training itself. Research from DeepMind on compute-optimal training showed that performance gains require scaling training data tokens alongside model parameters, not just inflating model size. Their work on data-efficient language models demonstrated that Chinchilla, trained with more tokens at similar compute budgets, outperformed larger but under-trained predecessors. The industry absorbed this lesson, pivoting toward better data utilization and longer training runs, but it also exposed a ceiling: high-quality training data is finite, and much of the accessible web-scale text has already been consumed by current-generation models. Building bigger systems no longer guarantees proportional improvement.

Compounding this resource constraint, researchers have identified structural flaws in how LLMs generalize. These systems can learn spurious associations between inputs and outputs, effectively overfitting to subtle artifacts in their training prompts, leading to brittle behavior on new tasks. In a security context, that brittleness means a model might appear robust on familiar vulnerability patterns but fail catastrophically on slightly different code structures or novel exploit chains. This is not a problem that more GPUs alone can fix; it reflects architectural and data-distribution limits that make reliable performance on long-tail security edge cases inherently difficult.

AI-Written Code Is Already Dominant

The security implications of stalling LLM performance become urgent when set against the speed of adoption. Research tracking coding behavior since generative tools became widely available found that software development has shifted rapidly: within a few years, most new code appears to be produced by AI agents rather than human authors. That pace of change means the majority of fresh code entering production environments is now generated by systems whose reliability is, as the benchmarks show, overstated. Every line of AI-generated code that ships without adequate human review carries the same risk profile as any untested contribution from an unfamiliar developer, except it arrives at far greater volume and often bypasses traditional peer-review norms because it is perceived as “pre-vetted” by the model.

The conventional assumption has been that LLMs will get better fast enough to outrun their own mistakes, gradually reducing the defect rate in generated code. The evidence above suggests otherwise. If model capabilities are plateauing while AI-written code volumes keep climbing, the ratio of unreviewed, potentially vulnerable code to verified secure code is shifting in the wrong direction. Security teams that treat LLM output as a trusted peer rather than a junior contributor requiring oversight are accepting a growing, unmeasured risk. Over time, that risk compounds as AI-generated code becomes the training substrate for future models, baking historical vulnerabilities and insecure idioms into the next generation of tools.

Alignment Is Brittle, and Attackers Know It

Even where LLMs do perform well, their safety guardrails are weaker than they appear. Research published on arXiv demonstrated automated adversarial prompts that function as jailbreak-like suffixes and generalize across models, including black-box systems the attackers never directly trained against. This means that a single attack technique can compromise safety alignment on multiple commercial LLMs simultaneously. For software security, the implication is stark: if an LLM is integrated into a CI/CD pipeline, code review workflow, or chat-based developer assistant, an attacker who can inject a crafted prompt into the input chain could manipulate the model’s output in ways that bypass its built-in restrictions, nudging it to suggest insecure configurations, suppress warnings, or subtly weaken cryptographic routines.

The gap between attack sophistication and defense maturity is the core vulnerability. Adversarial prompt methods are scaling faster than the alignment techniques meant to stop them, and the stalling of general LLM capabilities means defensive improvements are not arriving fast enough to close that gap organically. Organizations using LLMs in security-sensitive contexts need to treat prompt injection and adversarial manipulation as first-class threat vectors, not theoretical concerns reserved for academic papers. That requires traditional security practices (threat modeling, red teaming, and layered defenses) to be applied directly to model inputs, outputs, and integration points, rather than assuming the vendor’s safety layer is sufficient.

Government Frameworks Are Catching Up

Regulators and standards bodies have begun responding to this mismatch between rapid deployment and uncertain capability growth. In the United States, a coalition of cybersecurity agencies led by CISA has issued joint guidance on deploying AI systems that treats generative models as high-risk components requiring secure design, development, and operation. The document emphasizes classic security disciplines (asset inventory, access control, monitoring, and incident response) applied specifically to AI pipelines, from data collection through model integration and runtime use. It also highlights threats unique to LLMs, including data poisoning, model theft, and prompt injection, and urges organizations to build compensating controls rather than relying on opaque vendor assurances.

In parallel, NIST has extended its risk-based approach to cover generative systems with an AI risk management framework tailored to these models. The framework encourages organizations to identify where generative tools sit in critical workflows, assess the impact of model errors or manipulations, and implement governance structures that keep humans in the loop for high-stakes decisions. For software development, that means explicitly defining when AI-generated code can be used, what review steps are mandatory, how provenance is tracked, and how incidents involving model behavior are investigated. These documents do not assume that LLM capabilities will rapidly self-improve; instead, they treat current limitations as enduring constraints that must be engineered around.

Designing for a Plateaued Future

Taken together, the benchmarks, scaling research, adoption data, adversarial studies, and emerging regulatory guidance point to a common conclusion: software security cannot be premised on optimistic forecasts of LLM progress. Instead, organizations need to design their development and security processes for a world in which model quality improves slowly and unevenly, while usage continues to expand. That starts with reframing LLMs as powerful but fallible tools whose outputs demand the same skepticism as any other untrusted input. Secure defaults include mandatory human review for security-sensitive changes, automated static and dynamic analysis on AI-authored code, and clear separation between suggestion tools and systems that can directly commit or deploy changes.

Longer term, the industry will need better metrics for the real security impact of LLM-assisted development: not just benchmark solve rates, but empirical data on vulnerability density, exploitability, and time-to-detection in AI-heavy codebases. Until those measurements mature, the prudent stance is defensive. Assume that plateaued capabilities, brittle alignment, and rapidly growing AI authorship collectively increase the attack surface, and then build controls (technical, organizational, and regulatory) that acknowledge that reality. The promise of LLMs in software development is still substantial, but realizing it safely will depend less on the next breakthrough model and more on the discipline with which current tools are constrained, audited, and integrated.

More from Morning Overview

*This article was researched with the help of AI, with human editors creating the final content.

IG

FB

PIN

LI

X