Startup bets genomic data from 100M species can help build new medicines

Biotech startup Basecamp Research announced its Trillion Gene Atlas in March 2026, a project designed to collect genomic data from more than 100 million species and feed it into AI models that design new medicines. The initiative represents a dramatic private-sector bet that the biological diversity found in underexplored ecosystems, not just human genomes, holds the keys to the next generation of therapeutics. If the data proves as rich as the company claims, it could reshape how drug candidates are identified and engineered.

What the Trillion Gene Atlas Actually Promises

The core claim is scale. Basecamp Research says the Atlas will expand known evolutionary genetic diversity by 100x, drawing novel genomic material from species collected across thousands of sites worldwide. The company had previously announced the discovery of over one million new species, building what it described as a database purpose-built for generative foundation models in biology. The Trillion Gene Atlas takes that earlier work and extends it by orders of magnitude, targeting genomic sequences from 100 million species rather than cataloging individual organisms.

That number deserves scrutiny. Most large-scale biodiversity genomics programs aim for a fraction of this scale. A prominent proposal published in Nature called for sequencing roughly 105,000 species across Africa to safeguard biodiversity, and even that plan carried cost estimates in the billions and focused on producing reference genomes rather than feeding AI drug-design pipelines. Basecamp’s target is roughly 950 times larger, close to three orders of magnitude. The gap between these two scales raises a practical question: can a private company collect, sequence, and curate genetic material from that many species with the rigor that drug development demands?
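The scale comparison above comes down to a single division. As a quick check, the sketch below uses only the two species counts quoted in this article; the variable names are illustrative.

```python
# Species counts as quoted in this article.
atlas_target_species = 100_000_000    # Basecamp's stated Atlas target
africa_proposal_species = 105_000     # species in the Nature proposal for Africa

# How many times larger the Atlas target is than the African proposal.
ratio = atlas_target_species / africa_proposal_species
print(f"Atlas target is about {ratio:.0f}x the African proposal")
```

The result lands just under 1,000, which is why the two efforts sit about three orders of magnitude apart.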

Basecamp’s own framing emphasizes that the Atlas is intended as infrastructure for a new class of AI-native therapeutics companies. In a separate announcement aimed at investors and partners, the company positioned the Atlas as a way to turn previously inaccessible biodiversity into a structured dataset that can be licensed to pharmaceutical firms, biotechs, and industrial biology players. That commercial focus distinguishes the project from conservation-led genomics efforts and underscores the stakes: if the Atlas works, it could become a proprietary substrate for drug discovery rather than a purely public scientific resource.

Sequencing Hardware and the Cost Problem

One day after Basecamp’s announcement, Ultima Genomics confirmed that its UG200 Series was selected as the sequencing platform for the Trillion Gene Atlas. The partnership signals that Basecamp is banking on next-generation sequencing economics to make the project viable. Ultima’s technology is designed to drive per-genome costs far below what older platforms can achieve, which matters enormously when the target is not thousands of genomes but sequences from some 100 million species.

The Ultima announcement also carried a telling admission: “Even with the rapid advancements of large-scale sequencing projects, traditional efforts have fallen short.” That framing positions the Atlas as a response to a specific bottleneck: existing biodiversity sequencing initiatives have generated valuable but incomplete datasets, and the diversity of genetic information available for AI training remains limited relative to the scope of life on Earth. Whether Ultima’s hardware can close that gap at the scale Basecamp envisions is an engineering question that has not yet been answered publicly with peer-reviewed benchmarks or independent cost analyses.

Cost is not the only constraint. Sampling and sequencing at this scale will require robust logistics in the field, standardized protocols for preserving DNA integrity, and careful tracking of metadata such as location, environmental conditions, and host species. These contextual details are essential if downstream AI models are to learn meaningful structure–function relationships from the data. Without them, even trillions of raw sequences risk becoming a noisy archive rather than a reliable foundation for therapeutics design.
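To make the metadata point concrete, here is a minimal sketch of what a per-sample record and a completeness check might look like. It is purely illustrative: the field names, the `SampleRecord` class, the `missing_context` helper, and the example values are assumptions invented for this example, not Basecamp's actual schema or tooling.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SampleRecord:
    """Hypothetical metadata for one field sample (not Basecamp's schema)."""
    sample_id: str
    latitude: float
    longitude: float
    collected_on: str                    # ISO 8601 date string
    environment: str                     # free-text habitat description
    host_species: Optional[str] = None   # None for free-living samples


def missing_context(record: SampleRecord) -> list[str]:
    """Return the names of contextual fields that are empty or out of range."""
    problems = []
    if not -90.0 <= record.latitude <= 90.0:
        problems.append("latitude")
    if not -180.0 <= record.longitude <= 180.0:
        problems.append("longitude")
    if not record.collected_on.strip():
        problems.append("collected_on")
    if not record.environment.strip():
        problems.append("environment")
    return problems


# Made-up example: a sample whose habitat description was never filled in.
example = SampleRecord("S-0001", 64.1, -21.9, "2026-03-01", "")
print(missing_context(example))
```

A check like this would flag the sequence as lacking environmental context before it reaches a training set, which is the kind of gap that turns data into a noisy archive.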

How Existing Biodiversity Databases Compare

Basecamp is not building in a vacuum. The Barcode of Life Data System, known as BOLD, already serves as a centralized bioinformatics platform for DNA-based biodiversity data, housing millions of specimen records and barcodes covering hundreds of thousands of species. BOLD’s strength lies in its curation and validation pipeline, which ensures that sequence data is tied to verified specimens with clear taxonomic assignments.

The difference between BOLD and the Trillion Gene Atlas is not just size but intent. BOLD was designed for species identification and biodiversity monitoring. Basecamp’s project is designed to generate training data for AI models that can design proteins, enzymes, and gene therapies. That distinction matters because the quality standards for each use case diverge. A barcode sequence that reliably identifies a beetle species may not contain enough functional genomic information to train a model that designs a novel enzyme. Basecamp’s challenge is to produce data that satisfies both breadth and depth, and the company has not yet released independent validation of its data quality at the scale it claims.

There is also a governance contrast. BOLD is built around open scientific collaboration, whereas Basecamp’s Atlas is being assembled by a private company that plans to commercialize access. How much of the underlying sequence data, annotations, and model outputs will ultimately be public remains unclear. That uncertainty will shape how academic researchers, conservationists, and regulators view the project, particularly in countries that provide the biological samples.

From Genomes to Generative AI Tools

The Atlas does not exist in isolation from Basecamp’s existing product line. The company has already released ZymCTRL, an open-source generative AI tool that designs enzymes for industrial processes. That tool is tied to a preprint on bioRxiv, giving outside researchers at least some ability to evaluate the underlying science. ZymCTRL represents a proof of concept: if you train a generative model on sufficiently diverse enzyme sequences, it can produce novel designs with low similarity to its training data.

Earlier this year, Basecamp also announced AI models for programmable gene insertion, a breakthrough the company says tackles a longstanding challenge in genetic medicine. The logic connecting these products to the Atlas is straightforward: more diverse training data should, in theory, produce more capable generative models. If the Atlas delivers on its promise, future versions of these tools could be trained on vastly larger and more varied sequence sets, potentially improving their ability to design enzymes that function in extreme conditions or gene-editing systems that work across a wider range of cell types.

Basecamp has also highlighted a network of field expeditions, using a combination of in-house teams and local collaborators, to gather environmental samples from underexplored ecosystems. In earlier communications tracked through investor updates, the company emphasized that these sampling campaigns are designed not just to find new species but to capture functional diversity relevant to enzymes, metabolic pathways, and natural products. That focus on function aligns with the needs of generative models, which benefit from examples spanning a wide range of biochemical behaviors.

Ethics, Access, and Regulatory Questions

Turning global biodiversity into a proprietary training set raises ethical and legal questions that go beyond sequencing throughput. Many countries now operate under access and benefit-sharing frameworks inspired by the Nagoya Protocol, which seek to ensure that communities providing genetic resources share in any resulting commercial gains. Basecamp’s statements, including those directed at partners through corporate briefings, stress long-term relationships with host countries, but the specific terms of benefit sharing, data ownership, and downstream IP rights are not public.

Regulators and policymakers will also need to grapple with how AI-designed biological products derived from such datasets are evaluated and approved. If a therapeutic enzyme or gene therapy vector is generated by a model trained on sequences from as many as 100 million species, tracing functional ancestry or assigning provenance to any single organism becomes nearly impossible. That opacity could complicate safety assessments, environmental impact reviews, and equitable licensing arrangements.

Can the Atlas Deliver on Its Ambition?

For now, the Trillion Gene Atlas remains more a statement of intent than a completed resource. Basecamp’s claims about expanding known genetic diversity, scaling field collection, and integrating low-cost sequencing all point toward a plausible path, but there is limited independent evidence on execution. The company has not yet published large, peer-reviewed datasets from the Atlas itself, nor has it opened its full training corpus for external benchmarking.

Still, the direction is clear. By pairing ambitious biodiversity sampling with AI-native product development, Basecamp is testing a thesis that many in biotech quietly share: the next wave of therapeutics may come less from rational design based on a handful of model organisms and more from mining the vast, still-unmapped sequence space of life on Earth. Whether the Atlas ultimately fulfills that vision will depend on factors that go beyond marketing claims, from sequencing reliability and data governance to ethical sourcing and regulatory trust.

What is certain is that the project has already shifted the conversation. Biodiversity genomics is no longer framed solely as a conservation or academic endeavor; it is being reimagined as a strategic asset for AI-driven drug discovery. As Basecamp and its partners move from announcements to results, the rest of the field will be watching closely, not just to see whether the Trillion Gene Atlas can be built, but to decide what kind of biological future such an atlas should enable.

*This article was researched with the help of AI, with human editors creating the final content.