Morning Overview

AI-written code is fueling a surge in serious security flaws

Developers are adopting AI coding assistants at a rapid clip, but a growing body of peer-reviewed research shows that machine-generated code frequently ships with serious security flaws. Across multiple independent studies, AI tools such as GitHub Copilot have produced vulnerable outputs at rates that would alarm any security team. The risk is not theoretical: researchers have now traced AI-generated code to real-world vulnerability records, raising hard questions about whether the speed gains are worth the trade-off.

What is verified so far

The strongest evidence comes from a study titled “Asleep at the Keyboard? Assessing the Security of GitHub Copilot’s Code Contributions,” which was accepted to the IEEE Symposium on Security and Privacy 2022. The researchers designed security-focused scenarios and collected 1,689 generated programs from Copilot. Their finding was blunt: approximately 40% of those outputs contained vulnerabilities. The flaws ranged from common weaknesses like SQL injection and path traversal to subtler issues in memory handling and cryptographic operations. That failure rate means roughly two in five snippets a developer might accept from the tool could introduce an exploitable bug if left unreviewed.
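To make the SQL injection category concrete, here is a minimal sketch of the kind of flaw the study flags, written in Python with the standard-library sqlite3 module. The table, names, and payload are invented for illustration and are not taken from the paper; the contrast is between string-interpolated queries, a pattern frequently seen in generated code, and parameterized queries.

```python
import sqlite3

# In-memory database with one sample user (illustrative data only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, role TEXT)")
conn.execute("INSERT INTO users VALUES ('alice', 'admin')")

def find_user_unsafe(name):
    # Vulnerable pattern: interpolating input into the SQL string lets a
    # payload like "' OR '1'='1" rewrite the query and bypass the filter.
    return conn.execute(
        f"SELECT role FROM users WHERE name = '{name}'"
    ).fetchall()

def find_user_safe(name):
    # Parameterized query: the driver treats the value as data, not SQL,
    # so the same payload matches nothing.
    return conn.execute(
        "SELECT role FROM users WHERE name = ?", (name,)
    ).fetchall()

payload = "' OR '1'='1"
print(find_user_unsafe(payload))  # → [('admin',)] — injection succeeds
print(find_user_safe(payload))    # → [] — injection fails
```

The two functions differ by one line, which is precisely why unreviewed generated code is risky: the insecure variant runs, passes casual testing with benign input, and fails only under attack.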

A separate large-scale comparison of human-written and AI-generated code reinforces the pattern. That paper, titled “Human-Written vs. AI-Generated Code,” provides a structured comparison across defects, security vulnerabilities, and complexity. By quantifying how vulnerability incidence changes when code is generated by large language models rather than written by hand, the study moves the conversation past anecdote and into measurable territory. The data suggest that AI generation does not simply replicate the same bug rates humans produce; it introduces its own distribution of weaknesses, particularly in languages or ecosystems where training data may be less security-conscious.

A third study goes further by examining code already living in production repositories. Titled “AI Code in the Wild,” the research describes a detection pipeline that labels AI-generated code in popular repositories and cross-references those contributions against CVE-linked code changes. CVEs, or Common Vulnerabilities and Exposures, are the standard identifiers the security industry uses to track publicly disclosed flaws. By connecting AI-authored patches and commits to entries in the CVE system, the researchers established a direct link between machine-generated code and cataloged real-world vulnerabilities, even if the exact proportion of all CVEs that originate from AI remains unknown.
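The core join in that pipeline, matching AI-labeled contributions against CVE-linked changes, can be sketched in a few lines. Everything below is hypothetical: the commit hashes, file paths, the CVE identifier, and the data layout are invented for illustration, and the actual study uses its own detector and repository mining rather than this structure.

```python
# Toy cross-referencing step: intersect files touched by commits labeled
# AI-generated with files changed in CVE fix commits. All data is invented.

ai_labeled_commits = {
    "abc123": {"files": ["auth/login.py"], "ai_generated": True},
    "def456": {"files": ["utils/format.py"], "ai_generated": False},
}

cve_fix_commits = {
    "CVE-2099-0001": {"files": ["auth/login.py"]},  # hypothetical identifier
}

def link_ai_code_to_cves(ai_commits, cve_fixes):
    """Return (commit, CVE) pairs where an AI-labeled commit touched a file
    that a CVE fix later changed — a candidate link, not proof of causation."""
    links = []
    for sha, commit in ai_commits.items():
        if not commit["ai_generated"]:
            continue
        for cve, fix in cve_fixes.items():
            if set(commit["files"]) & set(fix["files"]):
                links.append((sha, cve))
    return links

print(link_ai_code_to_cves(ai_labeled_commits, cve_fix_commits))
# → [('abc123', 'CVE-2099-0001')]
```

Note the docstring's caveat: file overlap is only a candidate link, which is why the real pipeline's accuracy and coverage matter so much to how its results are read.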

One concrete example of how such flaws surface in practice is CVE-2024-9143, an OpenSSL vulnerability cataloged in the National Vulnerability Database maintained by NIST. That entry, which provides canonical identifiers and standardized severity metadata used across the industry, was highlighted in connection with Google’s AI-powered fuzzing efforts. The case illustrates a dual reality: AI can help find bugs, but AI-generated code can also introduce them, and the same tracking infrastructure ends up cataloging both sides of that equation.

What remains uncertain

Several gaps prevent a complete picture. No official developer survey with primary data has established exactly how many production environments now rely on AI coding tools. Industry estimates circulate widely, but they tend to come from vendor marketing or informal polls rather than controlled measurement. Without that baseline, it is difficult to calculate the aggregate risk that a 40% vulnerability rate poses across the global software supply chain.

Direct statements from GitHub or OpenAI about specific mitigation plans for insecure code generation remain sparse in the primary research record. Secondary reporting from technology outlets has referenced improvements to Copilot’s filtering and prompt engineering, but the academic studies reviewed here do not validate those claims with independent testing. Whether newer model versions have materially reduced the vulnerability rate documented in the IEEE-accepted paper is, based on available primary sources, still an open question.

There is also no institutional dataset from NIST or a comparable body that aggregates AI-linked CVEs over time. The “AI Code in the Wild” study builds its own detection pipeline, which is a significant methodological contribution, but it has not yet been replicated or adopted as a standard measurement tool. Until government or standards bodies begin tagging CVE entries by code origin, the field will rely on academic pipelines whose coverage and accuracy are still being refined. arXiv itself is a rapid-dissemination channel rather than a peer-review gatekeeper, which means some of these findings still await full journal or conference validation.

arXiv’s institutional backing is also worth noting for sourcing purposes. The platform is operated by Cornell University, including its Cornell Tech campus, and is sustained by a mix of institutional support and community contributions; it is an infrastructure provider, not a standards body. This does not affect the substance of the research, but readers should remember that arXiv papers vary in review status, and the strongest claim in this body of work, the 40% vulnerability rate, carries the additional weight of IEEE symposium acceptance.

How to read the evidence

Not all sources in this discussion carry equal weight. The IEEE-accepted Copilot study is the most battle-tested: it passed formal peer review at a top security venue, and its experimental design, generating code across 89 distinct scenarios, allows for reproducible testing. That makes its 40% figure the single most reliable data point available. The two other arXiv papers offer valuable extensions of the argument, one by comparing human and AI code at scale and the other by tracing AI contributions through live repositories, but both were published as preprints. Their findings are credible and methodologically transparent, yet they have not undergone the same level of external scrutiny.

The CVE-2024-9143 entry from NIST’s National Vulnerability Database is a different kind of evidence entirely. It does not measure AI code quality; it documents a specific flaw in OpenSSL that intersected with AI-driven security testing. Its value here is as a concrete example of how AI and vulnerability tracking systems interact, not as proof that AI wrote the vulnerable code in that particular case. Readers should resist the temptation to treat every CVE mentioned alongside AI research as an AI-caused bug. The causal chain matters, and a vulnerability discovered with AI is not automatically a vulnerability introduced by AI.

Interpreting these findings also requires understanding the publication pipeline itself. As the arXiv help materials emphasize, the platform hosts preprints that may later appear in journals or conferences after peer review. That means claims from preprints should be treated as provisional, especially when they propose sweeping conclusions about the safety or reliability of AI coding tools. By contrast, results that have passed peer review at established venues, like the IEEE symposium paper, can be regarded as more stable, though even they are subject to refinement as replication studies appear.

For practitioners, the practical takeaway is less about banning AI assistants and more about recalibrating trust. The existing evidence base shows that AI-generated code can be highly productive but is not, on its own, a safe default. Teams that adopt these tools should assume that suggested snippets carry at least the same risk profile as code from an inexperienced human contributor, and in some domains, a higher one. That implies mandatory review, integration with static analysis and fuzzing, and clear policies about where AI assistance is acceptable, such as boilerplate generation, versus where it should be restricted, such as cryptographic routines or authentication logic.
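One lightweight way to operationalize that kind of policy is a check that flags changes touching restricted areas for stricter review. The sketch below is illustrative only: the directory names and the idea of prefix-based restriction are assumptions for the example, not a tool or layout from the cited research, and a real team would wire something like this into its code-review or CI process.

```python
# Illustrative policy check: flag changed files under restricted areas
# (e.g., crypto or auth code), where AI-suggested changes warrant the
# strictest review. The prefixes below are a hypothetical repo layout.

RESTRICTED_PREFIXES = ("crypto/", "auth/")

def needs_strict_review(changed_files):
    """Return the changed files that fall under restricted prefixes,
    sorted for stable reporting."""
    return sorted(
        f for f in changed_files if f.startswith(RESTRICTED_PREFIXES)
    )

changed = ["auth/session.py", "docs/readme.md", "crypto/kdf.py"]
print(needs_strict_review(changed))
# → ['auth/session.py', 'crypto/kdf.py']
```

A check this simple obviously cannot judge code quality; its job is only to route sensitive changes to human reviewers rather than letting them ride through on autopilot.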

For policymakers and standards bodies, the research points to a growing measurement gap. Without standardized tagging of AI-authored changes in vulnerability databases, it will remain difficult to quantify how much risk AI coding tools add or subtract from the software ecosystem. Encouraging more rigorous disclosure practices, funding replication of detection pipelines like those used in “AI Code in the Wild,” and clarifying expectations around secure use of AI in development pipelines are all steps that could turn scattered academic results into actionable guidance.

In the meantime, organizations should treat AI coding assistants as powerful but fallible instruments. The verified data show that these systems can and do generate insecure code at nontrivial rates, and the unresolved questions about deployment scale and long-term impact argue for caution rather than complacency. Until the evidence base is broader and more standardized, the safest stance is to enjoy the productivity gains while assuming that every AI-suggested line still needs the same skeptical, security-aware review that any other untrusted code would receive.


*This article was researched with the help of AI, with human editors creating the final content.