The Sequence Read Archive, Renewed
Messy Metadata Prevents Data Reuse
While constructing The Lifespan Database helped fill a community need, it is rather niche, and there is a sea of data waiting for the same standardization. One of the most widely used data types in the biological sciences comes from single-cell RNA-sequencing (scRNA-seq), a technology that profiles thousands of individual cells simultaneously. This profiling gives insight into the molecular state of each cell across almost the entire spectrum of cellular functions, letting one determine what type of cell it is, whether it is healthy or diseased, young or old, and much, much more.
Fortunately for us, there are several large, centralized repositories of scRNA-seq data, the largest of which is the Sequence Read Archive (SRA), run by the National Institutes of Health (NIH). This repository enables scientists who generate datasets to store them free of charge, and to openly share them with other scientists, which is critical to maintaining transparency and research reproducibility.
Unfortunately for us, and for the entire community, the SRA has extremely loose guidelines for how users document their contributed datasets. In brief, the metadata users upload is not controlled beyond requiring a few select fields to be completed. However, this is not comprehensive enough to fully describe the datasets, and worse, most of the fields allow free-form entry with no verification or enforcement of standards of any kind. And, users can create any additional metadata fields that they wish! While you will find the occasional impeccably detailed dataset, overall, this has led to a database full of errors, typos, inconsistencies, and generally a state of chaos.
And the consequences are devastating for science: it becomes effectively impossible for users to search for studies within the SRA, and in many cases, impossible to analyze and interpret datasets. It also prevents machine learning from extracting deeper insights from scRNA-seq experiments, insights that humans might never discover on their own. Without clean, standardized metadata, ML is at best useless, and at worst, misleading.
Typically, if a scientist finds a dataset with incomplete documentation, one of two things will happen: 1) they won’t use the dataset, or 2) they will spend potentially hours searching for that information through papers or emailing authors, which can take weeks. Not using the dataset obviously impedes progress - they might have made a new insight or combined that dataset with their own results, accelerating the development of a new cancer treatment. Spending hours locating metadata is potentially worse - all that manual labor could end up for naught if they can’t find what they were looking for, or discover the experiment actually isn’t relevant to their research. Compound that across millions of biologists worldwide and the waste becomes staggering. Imagine if scientists could instantly find this information, if they could find dozens of highly relevant studies in seconds instead of weeks, if they could scan the entire history of their field with the click of a button. This is what we are aiming for—but it is impossible without clean, standardized metadata.
Standardizing scRNA-seq Metadata
So, we decided to tackle this ourselves, taking some lessons from previous attempts at improvements that, unfortunately, fell short of the overhaul that is truly needed.
In brief, we:
- Filtered the 9 million SRA records for scRNA-seq experiments, returning 30,518 datasets (i.e. sequencing runs with a unique ID).
- Converted the complete text from each dataset page (see example) and any matched Gene Expression Omnibus (GEO) page to markdown (sometimes the GEO page has additional information not in SRA).
- Used a custom prompt to a large language model (LLM; think ChatGPT, Gemini, etc.) to extract key metadata categories from the text into JSON files, standardizing simple categories (e.g., sex, age) in the process.
- Further standardized complex categories (e.g., tissue, disease) using a combination of LLMs and purpose-built tools.
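To make the extraction step concrete, here is a minimal sketch in Python. Everything named here is an illustrative stand-in, not our actual pipeline: the category list, the prompt text, and `call_llm` (a placeholder for whatever LLM client you use) are all assumptions for demonstration.

```python
import json

# Hypothetical category list; the real pipeline extracts more fields.
CATEGORIES = ["organism", "strain", "sex", "age", "tissue", "disease"]

# Illustrative prompt; the real prompt is more detailed.
EXTRACTION_PROMPT = (
    "Extract the following metadata categories from the text below. "
    "Return a JSON object with exactly these keys, using null when a "
    f"category is absent: {', '.join(CATEGORIES)}.\n\nTEXT:\n"
)

def call_llm(prompt: str) -> str:
    """Placeholder for a real LLM API call (e.g., an OpenAI or Gemini client)."""
    raise NotImplementedError

def extract_metadata(page_markdown: str, llm=call_llm) -> dict:
    """Prompt the LLM with a dataset page and parse its JSON response."""
    raw = llm(EXTRACTION_PROMPT + page_markdown)
    record = json.loads(raw)
    # Guard against the model inventing or dropping keys.
    return {key: record.get(key) for key in CATEGORIES}
```

The key design point is the fixed schema: whatever the model returns, only the expected keys survive, with `null` filling anything the page did not document.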
These last two steps are really key: they’re what make this approach unique and, critically, scalable. Not only can this method tackle all existing datasets, but it can be used on all future datasets submitted to SRA (and with tuning, probably every scRNA-seq study in existence).
To enable LLM-based standardization, we used retrieval-augmented generation (RAG), which essentially means we supplied the LLM with additional, specific information to customize its task. In this case, we supplied a set of ontologies—databases of standardized terminologies for different subjects which align with the types of metadata we are extracting. For example, the Mondo ontology contains precise disease names for nearly every disease you can think of—e.g. Alzheimer’s Disease is actually “Alzheimer disease”—and a hierarchy that defines the closeness of each disease to one another, AND a list of synonyms commonly used for each term:
Wherever possible, we used ontologies to standardize our metadata:
In brief, the process begins with the LLM extracting contextual tokens from the source XML. It then performs a RAG search to identify the top three semantically similar terms from the relevant ontology. Each of these terms is expanded by retrieving their ten closest hierarchical neighbors. Finally, the LLM selects the best match based on the complete set of retrieved terms.
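The retrieve-and-expand loop can be sketched with a toy ontology fragment. This is only a sketch under simplifying assumptions: `difflib` fuzzy string matching stands in for the real semantic (embedding-based) retrieval, and the Mondo-style labels, synonyms, and parent links are a tiny illustrative subset.

```python
import difflib

# Toy Mondo-style fragment: canonical label -> synonyms, plus a parent map
# standing in for the real hierarchy. Illustrative only.
ONTOLOGY = {
    "Alzheimer disease": ["Alzheimer's Disease", "AD", "Alzheimers dementia"],
    "dementia": [],
    "neurodegenerative disease": [],
}
PARENTS = {"Alzheimer disease": "dementia", "dementia": "neurodegenerative disease"}

def retrieve_candidates(extracted: str, k: int = 3) -> list[str]:
    """Stand-in for semantic retrieval: fuzzy-match the extracted string
    against every label and synonym, then map hits back to canonical labels."""
    index = {}
    for label, synonyms in ONTOLOGY.items():
        for name in [label, *synonyms]:
            index[name.lower()] = label
    hits = difflib.get_close_matches(extracted.lower(), list(index), n=k, cutoff=0.4)
    # Deduplicate while preserving rank order.
    return list(dict.fromkeys(index[h] for h in hits))

def expand(label: str, depth: int = 10) -> list[str]:
    """Gather a candidate's hierarchical neighbors (here, just ancestors)."""
    out, cur = [label], label
    while cur in PARENTS and len(out) < depth:
        cur = PARENTS[cur]
        out.append(cur)
    return out
```

In the real pipeline, the final step hands the expanded candidate set back to the LLM, which picks the best match; the sketch stops at producing that candidate set.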
For now, we have chosen a semi-automated standardization process where we (or the user) can approve or modify the choice of standardization. For mouse strains, for example, we can look through the list of extracted values and simply check those that we want to standardize to the same term. For the C57BL/6J mouse strain, we see all sorts of strings that are clearly referring to the official name “C57BL/6J”:
Once we select these, the suggested standardized term “C57BL/6J” is returned to us. We click approve, and voilà! Every instance of those 39 strings in all 30,518 datasets is instantly standardized to C57BL/6J.
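Once a mapping is approved, applying it is the easy part. A minimal sketch, where the variant list is abbreviated and illustrative (the real approved list had 39 strings):

```python
# Hypothetical approved mapping: raw variant -> canonical strain name.
APPROVED = {
    "C57BL/6": "C57BL/6J",
    "c57bl6j": "C57BL/6J",
    "B6 (Jackson)": "C57BL/6J",
}

def standardize_strain(records: list[dict]) -> int:
    """Rewrite every approved variant in place; return how many changed."""
    changed = 0
    for rec in records:
        value = rec.get("strain")
        if value in APPROVED:
            rec["strain"] = APPROVED[value]
            changed += 1
    return changed
```

Because the approval produces a plain lookup table, the same pass can be rerun cheaply whenever new datasets arrive.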
There are also many types of metadata that don’t require mapping to an ontology, but simply coercion to a standardized format of our choosing. Sex is a simple example—typically users supply a limited range of entries, commonly “Female”, “female”, “F”, “f”. The entries may also have typos like “fmale” or “fe male”. But LLMs can handle these with ease.
For some types of metadata, this is quite complex. In fact, mouse strains are highly complicated, with multiple synonyms and common names, and even multiple “official” names referring to the same strain/genotype. With hundreds of thousands of mouse strains, this presents a challenge even using an LLM. Now imagine trying to do this without an LLM - it’s simply impossible.
What's Next?
While we believe this represents a good first step, there is much more metadata in these studies to extract and standardize, and LOTS of edge cases to consider (i.e. unique things that don't fit nicely into the most common metadata categories or standardization pipelines). And, scRNA-seq is just the beginning. SRA contains hundreds of thousands of "bulk" RNA-seq datasets, not to mention other types of sequencing data. There are also other repositories, other data types, and the huge amount of data not even in a centralized repository. There is enormous value in these resources, but the only way researchers—and machine learning—can extract that value is if they are standardized. That is our mission.