A wave of AI foundation models built on plant DNA sequences is giving researchers new tools to annotate genomes, predict gene function, and measure crop traits faster than traditional methods allow. Several teams have published models trained on dozens of plant species, and at least one U.S. national laboratory is pairing DNA-trained AI with high-throughput imaging to cut the time needed for plant phenotyping. The collective effort targets a stubborn problem: plant genomes are large, repetitive, and riddled with duplications that make standard annotation pipelines slow and error-prone.
Why Plant Genomes Are Harder to Decode
Most early DNA foundation models were designed for human or microbial genomes. The Nucleotide Transformer, for example, helped establish the now-familiar strategy of pretraining transformer architectures on large DNA corpora before adapting them to specific prediction tasks. That approach works well for many vertebrate and microbial genomes, where gene structures are relatively compact and reference annotations are mature.
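The pretraining strategy described above usually means a masked-language-modeling objective over DNA tokens: sequences are split into overlapping k-mers, a fraction of tokens are hidden, and the model learns to recover them. The sketch below shows only that data-preparation step, with all function names and the 15% mask rate being illustrative assumptions rather than any specific model's recipe.

```python
import random

def kmer_tokenize(seq, k=6):
    """Split a DNA string into overlapping k-mer tokens (stride 1)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def mask_tokens(tokens, mask_rate=0.15, mask_token="[MASK]", rng=None):
    """Randomly hide tokens; return (inputs, labels).

    Labels are None except at masked positions, mirroring the
    masked-language-modeling objective used to pretrain DNA models.
    """
    rng = rng or random.Random(0)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_rate:
            inputs.append(mask_token)
            labels.append(tok)   # the model must recover the original k-mer
        else:
            inputs.append(tok)
            labels.append(None)  # position not scored during training
    return inputs, labels

tokens = kmer_tokenize("ATGGCGTACGTTAGC", k=6)
inputs, labels = mask_tokens(tokens)
```

A transformer is then trained to predict the hidden k-mers from context, after which its internal representations can be adapted to downstream prediction tasks.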
Plant genomes, however, present a tougher target. Many crop species are polyploid, carrying multiple full copies of their genome that have diverged over evolutionary time. Their chromosomes are also enriched for transposable elements and other repeats, which can span most of the genome in some cereals. This combination of polyploidy and repetitive content confounds homology-based annotation tools that perform adequately in simpler organisms. Alignments become ambiguous, exon–intron boundaries are harder to resolve, and duplicated genes can be misassigned or collapsed during assembly.
The authors of a plant-focused DNA language model called PlantCAD2 frame this challenge as a double-edged sword: the very features that make plant genomes difficult for conventional pipelines also create a rich signal for models that can learn long-range sequence patterns. That tension between difficulty and opportunity is driving a burst of model development aimed squarely at flowering plants.
PlantCAD2 and Cross-Species Prediction
PlantCAD2 is pre-trained on dozens of angiosperm genomes, from well-studied crops to lesser-known relatives, and is explicitly designed to transfer what it learns across species boundaries. Instead of training a separate network for each plant, the model absorbs statistical regularities about coding regions, regulatory motifs, and evolutionary conservation from many genomes at once.
This cross-species transfer matters because only a small fraction of plant species have deep experimental annotations. For a newly sequenced legume or wild grass, collecting enough RNA-seq data and functional assays to support classical annotation can take years. In contrast, a pretrained DNA model can be applied immediately to highlight likely promoters, splice sites, and conserved noncoding elements. In practice, researchers can triage large genomic regions, focusing scarce experimental resources on the most promising loci.
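The triage workflow described above can be sketched as a sliding-window scan: score each window of a chromosome with the pretrained model, then rank windows to pick candidates for experimental follow-up. In this minimal sketch, GC content stands in for the model's score purely so the code runs without weights; the window sizes, function names, and scoring proxy are all illustrative assumptions.

```python
def sliding_windows(seq, size=50, step=25):
    """Yield (start, subsequence) windows across a chromosome-scale string."""
    for start in range(0, max(len(seq) - size + 1, 1), step):
        yield start, seq[start:start + size]

def placeholder_score(window):
    """Stand-in for a pretrained model's per-window score.

    A real pipeline would query the model here (e.g. a mean log-likelihood
    or a promoter-probability head); GC content is used only so this
    sketch runs without model weights.
    """
    return (window.count("G") + window.count("C")) / len(window)

def triage(seq, top_n=3, size=50, step=25):
    """Rank windows by score; return the top candidates for follow-up."""
    scored = [(placeholder_score(w), start)
              for start, w in sliding_windows(seq, size, step)]
    scored.sort(reverse=True)
    return scored[:top_n]
```

The same loop structure applies whatever the scoring model: the expensive experimental work is reserved for the handful of windows the model ranks highest.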
PlantCAD2 is part of a broader trend toward multi-species plant models. A related system, PlantCaduceus, is described as learning evolutionary conservation patterns across a panel of 16 flowering plants to support comparative genomics and crop improvement. Together, these models signal a shift from single-species tools toward shared representations that treat the diversity of angiosperms as an asset rather than an obstacle.
GeneCAD Turns Representations Into Gene Models
Locating functional elements is only half the annotation problem. Researchers also need coherent gene models: structured predictions that specify transcription start sites, exon–intron structures, untranslated regions, and coding sequences along each chromosome. GeneCAD is an end-to-end pipeline that builds on PlantCAD2’s internal representations and uses structure-aware decoding to generate full gene models directly from DNA.
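Structure-aware decoding of the kind described above means that per-base predictions are constrained by a grammar of valid gene anatomy: an intron can only follow an exon, and so on. The toy decoder below makes that idea concrete with a hand-written transition table and greedy decoding; both the grammar and the scores are illustrative assumptions, not GeneCAD's actual decoder.

```python
# Allowed state transitions in a minimal gene-structure grammar
# (an assumption for illustration; real gene models also track
# UTRs, strand, and reading frame).
ALLOWED = {
    "intergenic": {"intergenic", "exon"},
    "exon": {"exon", "intron", "intergenic"},
    "intron": {"intron", "exon"},  # an intron must return to an exon
}

def constrained_decode(scores):
    """Pick the best legal state at each base, given the previous state.

    `scores` is a list of dicts mapping state -> score per base, standing
    in for the representations a DNA language model would emit.
    """
    prev, path = "intergenic", []
    for base_scores in scores:
        legal = ALLOWED[prev]
        prev = max(legal, key=lambda s: base_scores.get(s, float("-inf")))
        path.append(prev)
    return path
```

Even when the raw scores favor an impossible label (say, an intron with no preceding exon), the decoder emits the best label that keeps the gene model well-formed.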
Traditional plant annotation workflows lean heavily on RNA-seq alignments and protein homology searches. Both requirements introduce friction. Generating transcriptomic data demands carefully designed experiments across tissues and conditions, while homology-based methods depend on the existence of well-annotated relatives. For many orphan crops and wild species, neither prerequisite is easily met.
GeneCAD’s appeal is its promise of scalability. By relying primarily on sequence context and patterns learned during pretraining, the system can operate even when RNA data are sparse or absent. According to its developers, this strategy could alleviate a longstanding bottleneck in plant genomics, where polyploidy and repeat content have made it difficult to automate high-quality annotations. If successful, such models would let genome projects move more quickly from raw assemblies to usable gene catalogs, especially in under-resourced species.
AgroNT and Benchmarking Progress
As more DNA foundation models appear, the field faces a basic question: how should their performance be compared? AgroNT, a transformer trained on DNA from a wide range of plant species, was introduced alongside the Plant Genomic Benchmark, an evaluation suite spanning multiple genomic tasks. This benchmark, often abbreviated PGB, assembles standardized datasets for problems such as promoter identification, chromatin feature prediction, and variant effect estimation across several crops.
PGB’s authors argue that shared benchmarks are essential to avoid a proliferation of incomparable claims. Without common test sets and metrics, each group can tune its model to a bespoke task and report gains that may not generalize. Publicly available benchmarks also make it easier for outside teams to reproduce results, probe failure modes, and test new architectures on equal footing.
Based on currently available descriptions, there is no published, head-to-head comparison of PlantCAD2 and AgroNT on identical cross-species prediction tasks. That gap leaves open questions about how these models differ in practice, beyond architectural details and training data choices. Wider adoption of PGB or similar suites, and the inclusion of more cross-species benchmarks, will determine whether the field converges on a small set of reliable, well-characterized models or fragments into many niche systems.
From DNA to Physical Traits at Oak Ridge
Genome-level AI is only one layer in the stack needed for crop improvement. Turning sequence predictions into real-world impact requires phenotyping: systematic measurement of traits such as plant height, leaf area, root architecture, and photosynthetic performance. Oak Ridge National Laboratory is pursuing this side of the problem by training a foundation model on large-scale hyperspectral and imaging data, using exascale computing to accelerate trait measurement in controlled and field environments.
The data fueling this work come from the lab’s Advanced Plant Phenotyping Laboratory, which is equipped with high-throughput imaging systems including hyperspectral cameras, thermal sensors, and automated conveyors. Plants can be imaged repeatedly over their life cycle, generating dense time series that capture subtle changes in physiology and morphology. By pretraining models on these multimodal datasets, researchers hope to infer difficult-to-measure traits (such as photosynthetic efficiency or stress responses) directly from image streams.
One envisioned application is to estimate photosynthetic activity from spectral signatures, replacing labor-intensive gas exchange measurements that currently limit experimental throughput. If an imaging model can reliably map pixel-level data to physiological traits, and a DNA model can predict which alleles modulate those traits, the combination could dramatically shorten the feedback loop between genotype and phenotype.
Although no public timeline has been announced for integrating DNA-centric and image-centric pipelines, the conceptual workflow is clear. A plant breeder or geneticist could start from a set of candidate variants flagged by a foundation model trained on plant genomes. Those variants would be introduced into experimental lines or identified in existing diversity panels. High-throughput imaging would then track how each genotype performs under different environmental conditions, with AI models extracting trait values in near real time.
Over time, such an integrated system could enable more efficient selection cycles, particularly for complex traits influenced by many loci and environmental interactions. Instead of waiting multiple seasons to evaluate field performance, researchers could use AI-derived trait predictions as early indicators, discarding weak candidates sooner and focusing resources on the most promising lines. For climate resilience, where rapid adaptation is critical, compressing this loop could be especially valuable.
Plant DNA foundation models and phenotyping foundation models are still in their early stages, and many technical questions remain about robustness, bias, and transferability across environments and species. Yet the trajectory is clear: as models like PlantCAD2, GeneCAD, AgroNT, and Oak Ridge’s imaging systems mature, they are beginning to turn the complexity of plant genomes and traits from a barrier into a computational resource. The next phase will test whether these tools can move beyond proof-of-concept demonstrations and deliver reliable gains in breeding programs and basic plant biology.
More from Morning Overview
*This article was researched with the help of AI, with human editors creating the final content.*