By Jorge Bastos, Nick Schaum — 26 Jun 2025

Covalent: The Biosciences Discovery Engine

Making the World’s Data Usable.

At Covalent.Bio, we’re building a Discovery Engine for biology: an ambitious platform that continuously ingests, standardizes, and structures the full breadth of human biological knowledge and data worldwide, both existing and newly generated.

Today, we sit on exabytes of public biological data, but most of it is wasted. Current AI and machine learning models in the biosciences rely on clean, standardized datasets like those in the Protein Data Bank. But beyond these well-organized sources lies a vast ocean of omics, clinical, and imaging data that remains largely untapped, buried under inconsistent metadata and scattered across fragmented repositories.

We’re here to change that.

Our goal is to tap into this vast potential by making all biological data machine-readable and interoperable, transforming the way we do science. Imagine a future where scientists no longer rely solely on what they’ve personally read or encountered in their careers, but can design experiments based on everything that’s ever been done! The amount of data available, even within individual subfields, is staggering—far beyond what any one individual or lab could possibly track or interpret. Right now, research is still largely driven by manually reading papers over months and years, and even then, most researchers only manage to stay current in a narrow slice of their own subfield.

What if every scientist had access to a discovery engine that could answer questions like: Has this experiment already been done? Has someone already tested this drug? What were the outcomes? What if they could perform meta-analyses across global datasets in seconds? We’re wasting enormous resources—time, money, human effort—simply because scientists lack the tools to access and connect this buried knowledge. Experiments are repeated needlessly, and critical insights are lost in obscure corners of the internet.

Now imagine this power in the hands of every scientist, biotech, and pharmaceutical company. Imagine the ability to instantly retrieve every clinical trial, every study involving a specific compound or drug class, and have those results interpreted in context. This would make every step of the R&D process faster, more efficient, and more likely to succeed. It could reduce the cost of developing a drug, not by a small percentage, but by orders of magnitude. A $1–2 billion drug could one day cost a fraction of that.

Dirty Metadata Prevent Data Reuse

So why hasn’t this happened already?

Many groups are applying AI and ML to biology, but they tend to focus on the easy cases, where the data are already standardized, such as protein structure prediction from amino acid sequences. These are important, but they represent just a sliver of biological research.

The real bottleneck isn’t the raw data; it’s the metadata. These are the essential descriptors that explain the context of an experiment: the organism, sex, age, cell type, antibody used, preparation conditions, and more. Metadata are what make data interpretable, reproducible, and useful. Without them, even the most carefully conducted experiment becomes meaningless.

Every scientist understands the importance of metadata, and every scientist knows how bad the current state of metadata is. Even centralized data repositories often suffer from inconsistent naming, typos, vague descriptors, and a lack of standardization. Metadata are often recorded in free-form text, with little to no enforcement of structure or naming conventions. As a result, connecting one experiment to another—or even finding relevant studies in the first place—is nearly impossible.
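To make the problem concrete, here is a minimal Python sketch of what cleaning a single free-text field might involve. It is only an illustration, not Covalent's pipeline: the vocabulary, the `normalize_organism` helper, and the similarity cutoff are all hypothetical, and it uses plain string matching from the standard library rather than an LLM.

```python
from difflib import get_close_matches
from typing import Optional

# Hypothetical controlled vocabulary: canonical term -> variants seen in free text.
CANONICAL_ORGANISMS = {
    "Homo sapiens": ["human", "h. sapiens", "homo sapiens"],
    "Mus musculus": ["mouse", "m. musculus", "mus musculus"],
}

# Flatten to a lookup table: variant -> canonical term.
VARIANT_TO_CANONICAL = {
    variant: canonical
    for canonical, variants in CANONICAL_ORGANISMS.items()
    for variant in variants
}


def normalize_organism(raw: str, cutoff: float = 0.8) -> Optional[str]:
    """Map a free-text organism label to a canonical name, or None if unknown."""
    # Lowercase and collapse whitespace so "  Human " and "human" compare equal.
    cleaned = " ".join(raw.strip().lower().split())
    if cleaned in VARIANT_TO_CANONICAL:
        return VARIANT_TO_CANONICAL[cleaned]
    # Fall back to fuzzy matching to catch simple typos.
    matches = get_close_matches(cleaned, list(VARIANT_TO_CANONICAL), n=1, cutoff=cutoff)
    return VARIANT_TO_CANONICAL[matches[0]] if matches else None


if __name__ == "__main__":
    for label in ["  Human ", "Mus muscullus", "zebrafish"]:
        print(repr(label), "->", normalize_organism(label))
```

Even this toy version shows why the problem is hard at scale: every field needs its own vocabulary, and anything the table has never seen (here, "zebrafish") simply falls through as unknown.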

This is the challenge Covalent is tackling.

AI Meets Metadata

We’re using large language models (LLMs) to standardize and enrich metadata at scale. By doing so, we’re transforming inaccessible, messy data into clean, machine-ready datasets. For the first time, we will feed the full spectrum of biological and biomedical data into machine learning systems.

ML is only as good as the data it learns from. Right now, the majority of the world’s bioscience data are invisible to these models. We aim to change that: to unlock discoveries faster, to enable decades of research to happen in years, and to accelerate breakthroughs across drug discovery, basic science, and clinical medicine.

Covalent will be the foundation for a new generation of AI-driven biology. We aim not only to aid millions of scientists, but also to use Covalent as a launchpad for dozens of companies and initiatives in every research area, disease space, and molecular pathway. And because our engine continuously ingests new data, it will keep getting smarter, forever.

This is the future of research. This is the future of medicine.

Help us build it.
