Somewhere on Kaggle, the open data platform where anyone can upload a spreadsheet and call it a dataset, two files labeled as stroke and diabetes patient records became quietly popular. Researchers around the world downloaded them, trained machine learning models on them, and published the results in peer-reviewed journals. At least 124 clinical prediction models trace their training data back to those two files. The problem: a University of Birmingham preprint found that neither dataset has a verifiable connection to any hospital, patient registry, or clinical trial. Internal patterns in the records are consistent with simulated or outright fabricated data, not real patients.
As of spring 2026, academic journals are investigating dozens of studies built on the flagged datasets, according to reporting by Nature. The episode has forced an uncomfortable question into the open: how many AI tools designed to predict who will have a stroke or develop diabetes were trained on numbers that never described a real human being?
What the Birmingham preprint actually found
The Birmingham team did not simply note that the two Kaggle datasets lacked documentation. They ran statistical checks on the internal structure of the records and identified inconsistencies that authentic clinical data would be unlikely to produce. Variable distributions, correlations between fields, and missing-data patterns all pointed away from genuine patient records and toward synthetic generation. Those anomalies, combined with the total absence of provenance metadata, led the researchers to conclude that the datasets cannot be treated as a reliable foundation for clinical tools.
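The preprint's exact test battery is beyond the scope of this piece, but the general approach can be illustrated. The Python sketch below shows the kinds of internal checks reviewers run on tabular health data; the file and column names are hypothetical, and nothing here reproduces the Birmingham team's actual code.

```python
# Illustrative sanity checks on a tabular "patient" dataset. These are
# generic forensic heuristics, not the Birmingham team's actual tests;
# the file and column names are placeholders.
import pandas as pd
from scipy import stats

df = pd.read_csv("suspect_dataset.csv")

# Missing-data pattern: genuine clinical records almost never show zero
# or perfectly uniform missingness across every field.
print(df.isna().mean().round(3))

# Pairwise correlations: clinically linked variables (age and blood
# pressure, say) should correlate; near-zero everywhere is suspicious.
print(df.select_dtypes(include="number").corr().round(2))

# Terminal-digit check: human-recorded measurements tend to cluster on
# 0s and 5s, so a perfectly uniform last digit can signal values drawn
# straight from a random number generator.
last_digit = df["glucose"].dropna().astype(int) % 10
counts = last_digit.value_counts().reindex(range(10), fill_value=0)
chi2, p = stats.chisquare(counts)
print(f"terminal-digit chi-square p-value: {p:.3f}")
```

No single check is decisive on its own; it is the combination of several improbable patterns, plus missing provenance, that shifts the verdict from "poorly documented" to "likely not real."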
The downstream count is what makes the finding alarming. By tracing citations and code repositories, the preprint identified 124 published models that used one or both datasets for training or validation. These are not obscure student projects. Many appeared in indexed journals, complete with performance metrics suggesting the models worked well. But a model trained on fabricated inputs can produce confident risk scores that mean nothing. If a diabetes prediction tool learned its patterns from invented records, its output for a real 55-year-old patient with prediabetes is essentially a guess dressed in statistical confidence.
Journals are responding, but slowly
Nature’s reporting confirms that editorial teams at multiple journals have opened investigations into studies that cited the two Kaggle datasets as though they represented genuine health records. The reviews are examining methods sections, data availability statements, and peer-review records to determine whether authors failed to disclose the uncertain origins of their training data or actively misrepresented it.
No retractions tied specifically to the Birmingham findings had been announced as of May 2026. The outcomes, whether retractions, corrections, or formal expressions of concern, will set a precedent for how seriously the field treats data provenance failures. Until those decisions are public, the accountability picture remains incomplete.
The problem is bigger than two datasets
Fabricated training data is the most dramatic failure mode, but it is not the only one. A systematic review in The BMJ of prediction model studies built on supervised machine learning found that high risk of bias and incomplete reporting were widespread across the field. Missing performance metrics, opaque design choices, and validation strategies that do not test whether a model works on new patient populations were all common. Even when the underlying data is authentic, poor documentation can make a model look reliable on paper while it fails in practice.
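External validation, one of the gaps the review flagged, is conceptually simple: develop the model on one cohort, then score it on a population it has never seen. A minimal sketch, with hypothetical file names, predictors, and outcome column:

```python
# Minimal external-validation sketch: develop on one cohort, then test
# on a separate population. All names here are placeholders.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

train = pd.read_csv("hospital_a.csv")     # development cohort
external = pd.read_csv("hospital_b.csv")  # unrelated population

features = ["age", "bmi", "systolic_bp", "hba1c"]  # assumed predictors
model = LogisticRegression(max_iter=1000)
model.fit(train[features], train["diabetes"])

# Internal performance alone can flatter a model; the external number
# shows whether it transfers to patients it was never trained on.
for name, cohort in [("internal", train), ("external", external)]:
    scores = model.predict_proba(cohort[features])[:, 1]
    print(f"{name} AUC: {roc_auc_score(cohort['diabetes'], scores):.2f}")
```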
The U.S. Centers for Disease Control and Prevention has flagged a related concern from a health equity perspective. In published guidance on AI in public health, the CDC warned that training data problems, including sampling bias, incomplete records for demographic subgroups, and documentation errors, can systematically degrade AI performance for the populations that most need accurate screening. A stroke risk model trained on data that underrepresents Black patients, for example, may perform well on average benchmarks while producing less accurate predictions for the group with the highest stroke incidence in the United States.
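One way to surface that kind of gap is to report a model's discrimination for each subgroup rather than a single pooled number. A short sketch, again with hypothetical column names, assuming the model's risk scores have already been attached to a held-out cohort:

```python
# Per-subgroup evaluation sketch: a pooled AUC can hide poor performance
# in exactly the groups that need accurate screening most. Column names
# are illustrative.
import pandas as pd
from sklearn.metrics import roc_auc_score

cohort = pd.read_csv("held_out_cohort.csv")  # includes a "risk_score" column

print(f"pooled AUC: {roc_auc_score(cohort['stroke'], cohort['risk_score']):.2f}")

for group, sub in cohort.groupby("race_ethnicity"):
    if sub["stroke"].nunique() < 2:
        print(f"{group}: AUC undefined (single-class subgroup, n={len(sub)})")
        continue
    auc = roc_auc_score(sub["stroke"], sub["risk_score"])
    print(f"{group}: AUC = {auc:.2f} (n={len(sub)})")
```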
New standards aim to close the gap
Two frameworks published in The BMJ now provide concrete benchmarks for what responsible reporting should look like. The TRIPOD+AI statement updates guidance for reporting clinical prediction models that use regression or machine learning, with specific requirements around describing data sources, documenting provenance, and stating which populations a model is intended to serve. A companion tool, PROBAST+AI, provides a structured method for assessing bias and applicability, including bias introduced through data collection, labeling, and missing subgroups.
Together, the two frameworks establish a clear threshold: if a study cannot show where its data came from, how it was collected, and who it represents, the resulting model should not be trusted for clinical decisions. Had these standards been widely enforced before the Kaggle datasets gained traction, reviewers would have had a formal reason to reject papers that could not trace their training data to a real clinical source.
What no one can answer yet
The most pressing unanswered question is whether any of the 124 models reached real patients. No public health agency, including the U.S. Food and Drug Administration, has released a formal assessment of whether tools trained on the flagged datasets were deployed in active clinical care. The FDA’s existing framework for AI- and ML-based software as a medical device requires manufacturers to describe their training data, but enforcement depends on the tool reaching the regulatory pipeline in the first place. Research prototypes, clinical decision-support tools marketed under regulatory exemptions, or models embedded in electronic health record systems through institutional agreements may never undergo that scrutiny.
The identity and intent behind the original uploads also remain unknown. The Birmingham researchers documented the absence of provenance but did not identify a specific creator. Whether the data was fabricated to deceive or generated carelessly as a teaching exercise that later migrated into serious research is an open question. The distinction matters: intentional fabrication suggests a deeper integrity problem, while accidental misuse points to a quality-control failure by the researchers who adopted the files without verification.
The scope of contamination beyond these two datasets is similarly unclear. Kaggle hosts thousands of health-related datasets. The Birmingham preprint did not audit the full catalog, and no systematic review of open health data repositories has been published. Other widely used files for heart disease, cancer screening, or mental health prediction may carry similar provenance deficiencies.
Why data provenance is now a patient safety issue
No patient outcome data tied to the flagged models has been published, and the gap between “models were trained on bad data” and “patients received wrong diagnoses” is real. But the absence of documented harm is not the same as reassurance. Post-deployment monitoring of AI tools in health care remains patchy. Many models are evaluated on retrospective datasets or narrow pilot studies, with limited follow-up once they enter clinical workflows. If a risk score built on questionable data subtly shifts how clinicians allocate attention or order tests, the resulting missed diagnoses may never be traced back to the algorithm.
For anyone trying to evaluate a clinical AI tool, a few practical checks help. First, treat any prediction model with extra skepticism if the authors cannot clearly identify the origin of their data, including the institutions involved, the time period covered, and the inclusion and exclusion criteria for patients. Second, look for alignment with established reporting standards: studies that explicitly reference TRIPOD+AI and undergo structured bias assessment with PROBAST+AI are not guaranteed to be flawless, but they have at least confronted the basic questions. Third, consider whether the populations in the training data match the patients who will actually encounter the tool. Mismatches here are a known and well-documented source of inequitable performance.
The Kaggle episode did not create the problem of unreliable AI in medicine. But it gave the problem a name and a number: two datasets, 124 models, and an unknown count of patients whose care those models may have touched. In clinical AI, provenance is not a bureaucratic detail. It is a safety feature. Without it, even the most sophisticated algorithm is guesswork with a confidence interval.
*This article was researched with the help of AI, with human editors creating the final content.*