Researchers at UC San Francisco and Wayne State University prompted generative-AI chatbots to write analysis code for pregnancy datasets, and the resulting models matched or exceeded benchmarks set by trained human teams in a 2022 crowdsourced competition. The finding sharpens a growing question across biomedicine: whether large language models can reliably replace human programmers in high-stakes clinical research, or whether their uneven performance introduces new risks that current validation methods are not built to catch.
What is verified so far
The central claim rests on a UCSF-led study in which multiple generative-AI chatbots were given natural-language prompts and asked to produce executable code for building prediction models from pregnancy data. The datasets included microbiome, blood, and placental samples. Of the eight LLM tools tested, only four produced usable code, a 50 percent failure rate that is easy to overlook when the headline focuses on the winners. The models that did work generated predictions for preterm birth, defined as delivery before 37 weeks of gestation.
The benchmark those AI-generated models were measured against comes from the preterm birth DREAM Challenge, a crowdsourced machine-learning competition that ran from July through September 2022. That challenge asked dozens of human teams to build classifiers for preterm birth using standardized evaluation metrics, including the area under the receiver operating characteristic curve (AUROC) and the area under the precision-recall curve (AUPR), along with accuracy, sensitivity, specificity, and the Matthews correlation coefficient. Organizers also submitted their own baseline models, giving the field a clear floor against which any new approach can be compared.
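For readers unfamiliar with these metrics, they can all be derived from a model's predictions and scores. The sketch below is a minimal pure-Python illustration with made-up toy labels, not the challenge's actual scoring code:

```python
import math

def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN, FN for binary labels (1 = preterm, 0 = term)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn

def metrics(y_true, y_pred):
    """Accuracy, sensitivity, specificity, and Matthews correlation."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    sensitivity = tp / (tp + fn)   # fraction of preterm cases caught
    specificity = tn / (tn + fp)   # fraction of term cases correctly cleared
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return accuracy, sensitivity, specificity, mcc

def auroc(y_true, scores):
    """AUROC as the probability a random positive outscores a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example (hypothetical data, for illustration only).
y_true = [1, 1, 0, 0, 1, 0, 0, 0]
y_pred = [1, 0, 0, 0, 1, 1, 0, 0]
scores = [0.9, 0.4, 0.3, 0.2, 0.8, 0.6, 0.1, 0.5]
print(metrics(y_true, y_pred))  # accuracy 0.75, sensitivity 2/3, specificity 0.8, MCC 7/15
print(auroc(y_true, scores))    # 13/15, about 0.867
```

The point of reporting several metrics together, as the challenge did, is that preterm birth is a minority outcome: accuracy alone can look strong for a model that simply predicts "term" for everyone, while sensitivity, MCC, and AUPR expose that failure.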
The underlying data are substantial. The PREMO dataset at UCSF’s Benioff Center for Microbiome Medicine contains more than 3,500 samples from roughly 1,300 individuals, and the National Science Foundation selected it for the NAIRR Pilot program, a signal of its perceived value for national AI research infrastructure. Separately, the NIH-funded ImmPort repository, which houses data from more than 1,000 immunology projects and issues monthly data releases, provides the broader data ecosystem that supports this kind of work. ImmPort’s coverage explicitly includes preterm birth research, connecting the repository to both the DREAM Challenge and the newer AI-code study.
A parallel line of evidence comes from outside biomedicine. A peer-reviewed study published in Scientific Reports found that large language models outperformed outsourced human coders on every metric across multiple information-extraction and inference tasks when both were scored against expert-generated labels. That study is not biomedical in scope, but it offers a rigorous methodological template for comparing AI and human performance on structured analytical work. Its design, which uses expert labels as the gold standard rather than treating human coders as automatically correct, is directly relevant to how the UCSF team framed its own comparison.
The broader infrastructure supporting these analyses is also documented. The NIAID data portal describes how federal repositories curate and govern immunological datasets, including those that intersect with pregnancy and neonatal outcomes. Within political science, meanwhile, researchers have already explored how automated algorithms compare to human coders when extracting information from complex texts; one influential paper in that field, available via a methodological journal, underscores that machine learning systems can rival or surpass manual coding when carefully trained and validated. Taken together, these strands of evidence establish that AI systems can, under certain conditions, match or exceed human performance on structured analytic tasks.
What remains uncertain
The most significant gap is the absence of a full peer-reviewed paper for the UCSF-Wayne State study itself. The primary public record so far is a university news release, which does not include raw code outputs, complete statistical tables, or a detailed comparison of how each of the eight LLMs performed relative to specific DREAM Challenge teams. Without that granularity, it is difficult to assess whether the AI-generated code genuinely surpassed the best human entries or merely beat the organizer baselines, which were designed as a floor rather than a ceiling.
The half of the LLMs that failed to produce usable code also raises questions that the available reporting does not resolve. There is no public breakdown of which tools succeeded, what types of errors the failing tools produced, or whether the failures were consistent across prompts or sporadic. In clinical contexts, a tool that works brilliantly half the time and crashes the other half presents a different risk profile than one that performs reliably at a slightly lower level. The reporting does not address how researchers plan to screen for such inconsistency in practice or what safeguards would be in place if AI-written code were deployed in real-world clinical pipelines.
There is also no primary evaluation data from NSF’s NAIRR Pilot program regarding how the PREMO dataset has been integrated with AI-driven analysis workflows. The institutional description confirms the dataset’s selection for the pilot but stops short of documenting outcomes or performance benchmarks within that program. Similarly, while the NIAID portal outlines governance and access policies for immunological databases, it does not speak to how well AI-generated code handles the specific data formats, missingness patterns, and batch effects that are common in these repositories. Translating a proof-of-concept into routine, regulated practice will require far more detail about error rates and failure modes.
The Scientific Reports study on LLM versus human coder performance, while methodologically strong, tested text-analysis tasks rather than biomedical prediction. Extrapolating its findings to clinical data pipelines requires caution. Text extraction and genomic or microbiome modeling involve different data structures, different failure modes, and different consequences when errors occur. A misclassified text passage is a research inconvenience; a misclassified preterm birth risk could affect clinical decisions. The current evidence does not yet show how often AI-written code might introduce subtle statistical mistakes—such as data leakage or inappropriate cross-validation—that pass basic checks but bias downstream risk estimates.
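To make the data-leakage concern concrete, here is a minimal sketch, using toy numbers rather than data from any of the studies discussed, of how a preprocessing step fitted on the full dataset quietly lets held-out information contaminate a pipeline:

```python
import statistics

def zscore(values, mean, stdev):
    """Standardize values using supplied statistics."""
    return [(v - mean) / stdev for v in values]

# A toy feature column split into training and held-out test samples.
train = [1.0, 2.0, 3.0, 4.0]
test = [10.0, 12.0]

# Leaky preprocessing: mean and spread computed on ALL samples, test included.
all_vals = train + test
leaky_test = zscore(test, statistics.mean(all_vals), statistics.stdev(all_vals))

# Correct preprocessing: statistics fitted on the training split only.
clean_test = zscore(test, statistics.mean(train), statistics.stdev(train))

# The two versions of the "same" test features differ: in the leaky pipeline,
# the held-out samples have already shaped the transformation applied to them,
# which can inflate apparent performance without failing any syntax check.
print(leaky_test)
print(clean_test)
```

A script that makes this mistake runs without errors and produces plausible-looking results, which is exactly why code review and validation against independent benchmarks matter more, not less, when the code is machine-written.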
Another open question is reproducibility. The UCSF news release does not specify whether the prompts, model versions, and parameter settings used for each chatbot will be made public. Without those details, other groups cannot easily replicate the results or test whether small changes in wording lead to materially different code. Experience from other domains suggests that LLM outputs can be sensitive to prompt phrasing, which complicates claims that a given performance level is stable rather than a one-off success.
How to read the evidence
The strongest evidence in this story comes from two peer-reviewed papers and one well-documented dataset. The DREAM Challenge paper establishes the benchmark: it describes exactly how human teams performed, what metrics were used, and what the organizer baselines looked like. Any claim that AI-generated code “outperformed humans” must be read against that specific benchmark, not against a vague notion of human ability. The Scientific Reports study on automated coding adds methodological weight to the general claim that AI can beat human coders, but it does so in a controlled, text-focused setting that does not fully mirror the stakes of clinical research.
Readers should also distinguish between code generation and scientific judgment. LLMs can assemble syntactically correct scripts and even select reasonable modeling approaches, but choices about which covariates to include, how to handle confounders, and when to flag a result as clinically meaningful still rest on domain expertise. The UCSF-Wayne State work, as described publicly, kept human researchers in the loop to review and run the AI-written code. That hybrid model is very different from an automated pipeline in which code is generated and executed with minimal oversight.
From a policy and governance perspective, the key takeaway is not that AI is ready to replace human analysts, but that oversight mechanisms need to anticipate AI-written code as a routine part of biomedical research. Data access committees, institutional review boards, and journal peer reviewers may soon face submissions where critical analytic steps were drafted by chatbots. Evaluating those steps will require transparency about prompts, model versions, and post hoc edits, as well as clear documentation of how results were validated against independent benchmarks like the DREAM Challenge.
For clinicians and patients, the current evidence justifies cautious optimism rather than alarm or uncritical enthusiasm. The PREMO dataset and related immunology repositories demonstrate that large, well-curated resources can support advanced AI methods, and early studies suggest that generative models can help unlock their value. At the same time, the 50 percent failure rate among tested chatbots and the lack of peer-reviewed detail on the successful runs underscore how far the field is from treating AI-generated analysis code as turnkey or self-validating.
In the near term, the most responsible path is to treat LLMs as powerful assistants whose outputs must be checked against established standards, not as autonomous analysts. As more detailed results from the UCSF-Wayne State project and the NAIRR Pilot emerge, they will help clarify where AI-written code genuinely advances prediction of preterm birth, and where human judgment, careful validation, and transparent reporting remain indispensable.
*This article was researched with the help of AI, with human editors creating the final content.*