Morning Overview

New protein method generates 10M data points in 3 days, boosting AI models

A team at Rice University has built a lab platform that can map the activity of more than 10 million protein variants in a single experiment, then feed that data into AI models that learn to predict which mutations improve function. The whole cycle, from wet-lab run to trained model, takes about three days.

The platform, called Sequence Display, was described in a paper published in Nature Biotechnology in spring 2026. It was developed by a group led by Jonathan Silberg, a professor of bioengineering at Rice, alongside first author Yue Qin and collaborators across the university’s departments of biosciences and chemical engineering.

Why protein data has been the bottleneck

Protein language models work on the same principle as the large language models behind chatbots: they learn patterns from massive datasets. But while text data is freely available across the internet, experimental protein data is scarce and expensive to produce. A single enzyme activity assay can take hours per variant. Deep mutational scanning, the current gold standard for high-throughput protein characterization, typically yields hundreds of thousands to a few million measurements per campaign and requires multiple rounds of library construction and selection.

That data gap has limited how well AI models can predict the effects of mutations on protein function. Researchers often resort to training on evolutionary sequence data alone, which captures what nature has already tried but misses the functional measurements needed to guide engineering toward novel properties.

How Sequence Display works

The core innovation is an activity-linked barcoding system. Each protein variant in a large library is paired with a short DNA barcode that encodes its measured activity. When the entire pool is sequenced in bulk, researchers recover both the amino acid sequence and the functional readout for every variant simultaneously. That single-round design eliminates the iterative screening cycles that slow conventional directed-evolution workflows, according to the Rice University press release.
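The public materials do not specify the read chemistry, so the following is only a schematic illustration of the single-round idea: if each sequencing read carries both a variant's coding sequence and its activity-linked barcode (here faked as a `variant:barcode` string), one pass over the pooled reads recovers sequence and a functional readout together, with barcode abundance standing in for measured activity.

```python
from collections import Counter

# Hypothetical read layout -- the press release does not describe the actual
# chemistry, so assume each read is "<variant_seq>:<barcode>" for illustration.
def tally_variant_activity(reads):
    """Recover a variant -> barcode-count map from one bulk sequencing pool.

    In an activity-linked scheme, how often a variant's barcode appears can
    serve as a proxy for its activity, so a single sequencing run yields both
    the amino acid sequence and the functional readout for every variant.
    """
    barcode_to_variant = {}
    counts = Counter()
    for read in reads:
        variant, barcode = read.split(":")
        barcode_to_variant.setdefault(barcode, variant)
        counts[barcode] += 1
    return {barcode_to_variant[b]: n for b, n in counts.items()}

# Toy pool: "MKV" appears under barcode bc1 three times, "MKL" once.
pool = ["MKV:bc1", "MKL:bc2", "MKV:bc1", "MKV:bc1"]
activity = tally_variant_activity(pool)
# activity == {"MKV": 3, "MKL": 1}
```

The point of the sketch is structural: there is no iterative screen, only one bulk readout, which is what the press release credits for the speedup.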

The resulting datasets are then used to fine-tune pretrained protein language models. Rather than relying solely on evolutionary patterns, the models learn from millions of direct activity measurements, which sharpens their ability to predict how untested mutations will perform. The published study reports that this fine-tuning step produces variant-level predictions accurate enough to guide the next round of engineering without additional wet-lab screening.
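The paper's actual fine-tuning pipeline is not public in detail, but the supervised step it describes, learning a sequence-to-activity mapping from measured variants and then scoring untested ones, can be sketched with a deliberately crude stand-in: ridge regression on one-hot sequence features in place of a pretrained protein language model's embeddings. All names and numbers below are illustrative.

```python
import numpy as np

AA = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

def one_hot(seq):
    # Flattened per-position one-hot encoding: a crude stand-in for the
    # learned embeddings a pretrained protein language model would supply.
    x = np.zeros((len(seq), len(AA)))
    for i, aa in enumerate(seq):
        x[i, AA.index(aa)] = 1.0
    return x.ravel()

def fit_activity_model(seqs, activities, l2=1e-3):
    # Ridge regression, closed form: w = (X^T X + l2*I)^{-1} X^T y
    X = np.stack([one_hot(s) for s in seqs])
    y = np.asarray(activities, dtype=float)
    A = X.T @ X + l2 * np.eye(X.shape[1])
    w = np.linalg.solve(A, X.T @ y)
    return lambda s: float(one_hot(s) @ w)

# Toy training set: four variants with (made-up) measured activities.
seqs = ["MKVA", "MKLA", "MRVA", "MRLA"]
acts = [1.0, 0.4, 0.9, 0.3]
predict = fit_activity_model(seqs, acts)
score = predict("MKVA")  # recovers roughly the measured value, ~1.0
```

The real system replaces the one-hot features with a pretrained model and the toy regression with fine-tuning, but the workflow shape is the same: millions of direct measurements in, variant-level activity predictions out.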

What this could change for drug and enzyme design

Directed evolution, the Nobel Prize-winning strategy of mutating and selecting proteins through repeated rounds, typically requires weeks or months of cycling between mutagenesis, screening, and selection. If a single Sequence Display experiment can replace much of that cycling with a data-driven prediction step, labs could identify high-performing variants faster and with fewer physical experiments.

Enzyme optimization for industrial manufacturing, antibody engineering for therapeutics, and biosensor development are all areas where that acceleration would matter. A pharmaceutical company trying to improve the stability of a biologic drug, for instance, could use Sequence Display to survey millions of variants at once rather than testing a few thousand candidates over several months.

Open questions and limitations

The Nature Biotechnology paper establishes the platform and reports its throughput, but several practical questions remain. The specific protein targets tested, the exact accuracy metrics of the resulting AI models, and head-to-head comparisons against deep mutational scanning are details that require a close reading of the full manuscript and supplementary data. The institutional releases describe the method’s output in broad terms but do not disclose the error rate of the barcoding step or the fraction of data points that survive quality filtering before model training.

Generalizability is another concern. A method optimized for a well-characterized enzyme scaffold might not transfer cleanly to membrane proteins or large multi-domain complexes. The publicly available summaries do not specify which protein classes were tested, so the breadth of applicability remains unclear.

Cost is worth watching, too. Generating 10 million data points in one experiment implies significant sequencing throughput, likely requiring a high-output Illumina or similar next-generation sequencing run. Whether a mid-sized academic lab or a startup could afford to run Sequence Display without dedicated sequencing infrastructure is not addressed in the available materials.
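To make the sequencing-scale concern concrete, here is a back-of-envelope estimate. None of these numbers come from the paper; the per-variant coverage and flow-cell output are assumptions chosen only to show the order of magnitude.

```python
# Back-of-envelope only: assumed numbers, not figures from the paper.
variants = 10_000_000            # reported dataset size
reads_per_variant = 30           # assumed coverage for a stable barcode count
reads_needed = variants * reads_per_variant          # 300 million reads

# Rough output of one high-end Illumina flow cell (order of magnitude only).
reads_per_flow_cell = 10_000_000_000                 # ~10 billion reads
runs_needed = reads_needed / reads_per_flow_cell     # small fraction of a run
```

Even on these generous assumptions the experiment consumes only a slice of one high-output run, but it still presumes access to that class of instrument, which is exactly the infrastructure question the available materials leave open.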

And reproducibility, the standard stress test for any new experimental platform, has not yet been reported by independent groups. The paper is recent, and outside validation will take time. Until other labs replicate the throughput and accuracy claims with their own protein targets, the 10-million-data-point figure should be understood as a demonstrated capability from the originating group, not a field-wide benchmark.

What independent replication will need to show

For researchers considering whether to adopt Sequence Display, the practical first step is reading the full paper and its supplementary methods, assessing whether the barcoding chemistry is compatible with their protein system of interest, and estimating sequencing costs at the required depth. The platform’s long-term value will be judged not by the volume of data it produces but by how much that data improves the predictive power of downstream AI models on real engineering tasks.

The initial results are strong, and publication in Nature Biotechnology lends credibility to the core claims. But the distance between a single high-profile demonstration and routine lab adoption is real. The next milestone to watch for: independent groups reporting whether they can match the throughput and, more importantly, whether the AI models trained on Sequence Display data outperform those built on smaller, conventionally generated datasets. That comparison will determine whether this platform becomes a standard tool or remains an impressive proof of concept.

*This article was researched with the help of AI, with human editors creating the final content.*