Automating Ageing Research


The Lifespan Database: Building an Open, Community-Driven Resource for Mouse Aging Data.

We set out with a clear initiative: to automate the standardization of key resources, and then expand across data types and scientific disciplines until we encompass the entirety of biology. This is a huge task, and to start, we needed a well-defined use case—one that was both impactful and feasible—to act as a proof of concept for our tools.

We chose to focus on mouse lifespan because the data are straightforward, we know the field well, and no project has yet systematically compiled, standardized, and openly shared lifespan data while enabling community contributions that can continuously expand the database with each new experiment. By focusing on this dataset first, we aim to showcase what's possible when the community is equipped with a centralized resource and intuitive tools that automate tedious steps like data extraction and metadata standardization. Ultimately, our goal is a lasting, dynamic database that grows with each new experiment for decades to come.

How We Built It

Our aim was not to build a comprehensive resource, but rather to provide the infrastructure and tools that enable the community to build with us. This entailed several steps:

  1. Identifying dozens of mouse lifespan experiments
  2. Building a Content Management System to automate metadata extraction, standardization, and verification
  3. Extracting and standardizing lifespan data
  4. Building a portal for users to find and interact with experiments
  5. Enabling community contributions

Identifying Studies

We performed this first step manually to demonstrate the futility of traditional approaches to research. In brief, we scanned the top 100 hits from 4 different search engines using the simple term “mouse lifespan”, and recorded the total time needed to verify whether each paper contained actual lifespan results and raw data. Searching for the raw data through supplementary files was the most time-consuming part.

| Search Tool | Total Time | # hits | # with raw data | Notes |
|---|---|---|---|---|
| PubMed | ~3 hrs | 30 / 100 | 7 / 100 | 6 more from the ITP, which supplies raw data elsewhere. |
| Google Scholar | ~2.5 hrs | 48 / 100 | 5 / 100 | Shorter time because several hits overlapped with PubMed, and fewer contained supplementary files to look through. |
| Google | ~2 hrs | 17 / 100 | 7 / 100 | Diverse hits, not all primary papers. |
| Elicit | ~4.5 hrs | 53 / 100 | 4 / 100 | Nearly every hit was relevant, but many were reviews or 50+ year-old papers without raw data. |

In total, this step took about 12 hours of manual labor and returned 131 papers containing lifespan experiments, of which 32 included raw data. Excluding ITP studies, however, dropped this to 12 / 113, or 10.6%. After emailing all authors (when an address was supplied), 2 sent back raw data. Several more hours were spent extracting email addresses and creating a table to track everything. Note that we did not filter for, e.g., interventions applied only to wild-type mice, or even for interventions per se. Since anyone can filter later however they like, it made sense to include any study that reports lifespan in any scenario.

If this process sounds wasteful, you are absolutely right. It is a demonstration of the extreme inefficiency of traditional research. In the time it took to perform this frustrating manual search, one could have created a set of tools to automate this process, creating a scalable and permanent solution. We didn’t do this (yet) because we were building the automation for the second, even harder step: metadata extraction and standardization.
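As a taste of what such automation could look like, here is a minimal sketch that builds a query URL for NCBI's E-utilities `esearch` endpoint. Fetching the URL returns matching PMIDs for downstream screening; the screening logic itself (checking each paper for raw lifespan data) is the hard part and is not shown.

```python
from urllib.parse import urlencode

# NCBI Entrez E-utilities search endpoint (real, public API)
EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def build_pubmed_search_url(term: str, retmax: int = 100) -> str:
    """Build an E-utilities esearch URL for a PubMed query.

    Fetching this URL returns a JSON payload listing matching PMIDs,
    which a pipeline could then screen for raw lifespan data.
    """
    params = {
        "db": "pubmed",
        "term": term,
        "retmax": retmax,
        "retmode": "json",
    }
    return f"{EUTILS}?{urlencode(params)}"

url = build_pubmed_search_url("mouse lifespan")
```

A real pipeline would add a rate limit and an API key per NCBI's usage policy, then fan out the returned PMIDs to the metadata-extraction step described below.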

Automated Metadata Extraction

In brief, we built a Content Management System (CMS) that enables 1-step extraction of metadata. Simply supply the PMID or a PDF of the paper you wish to process, and click Go. Our pipeline then performs the following steps:

  1. Downloads the PDF (if PMID supplied)
  2. Converts the PDF to markdown (MD)
  3. Generates JSON from MD
  4. Extracts metadata with an LLM following our custom prompts
  5. Allows users to verify, modify if necessary, and approve extracted metadata
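Step 5 can itself be partially automated with a schema check before human review. A minimal sketch below; the field names and types are illustrative assumptions, not our actual schema.

```python
# Illustrative required fields for one extracted-metadata record
REQUIRED_FIELDS = {
    "strain": str,
    "sex": str,
    "intervention": str,
    "n_mice": int,
    "median_lifespan_days": float,
}

def validate_metadata(record: dict) -> list[str]:
    """Return a list of problems found in an extracted-metadata record.

    An empty list means the record passes the automated check and can
    be queued for human verification and approval.
    """
    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            problems.append(
                f"{field}: expected {expected_type.__name__}, "
                f"got {type(record[field]).__name__}"
            )
    return problems

# Example record (values are illustrative)
record = {
    "strain": "UM-HET3",
    "sex": "female",
    "intervention": "rapamycin",
    "n_mice": 96,
    "median_lifespan_days": 945.0,
}
```

Anything the check flags goes back to a human reviewer; clean records still get a final eyeball before approval.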

Performing these steps by hand, searching for a dozen pieces of metadata across dozens or hundreds of papers, would take weeks or months, depending on how detailed you wanted to be. But LLMs, appropriately prompted, can do this in seconds. It takes a little upfront effort to build the infrastructure, but the return on investment is huge.

So far, we have standardized every experiment for which we could find raw lifespan data, totaling 75 experiments from 34 papers.

Extracting and Standardizing Lifespan Data

As for the lifespan data themselves, those we processed manually, and it was excruciating. Although the ITP data are highly standardized, data from other papers (in the rare cases when they were actually supplied) come in all sorts of formats: CSVs, PDF tables, images pasted into Word docs, you name it. Different papers also use different units (days, weeks, months), and worst of all is the use of dates, which, if not formatted exactly as Excel expects, are an absolute nightmare to work with.
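Once values are extracted, the unit conversions are easy to automate. A minimal sketch in Python, assuming an average month length of 30.44 days (a convention we adopt here, not something any paper specifies) and ISO-formatted dates:

```python
from datetime import date

# Assumed conversion factors; 30.44 approximates the average month
DAYS_PER_UNIT = {"days": 1.0, "weeks": 7.0, "months": 30.44}

def lifespan_in_days(value: float, unit: str) -> float:
    """Convert a lifespan reported in days, weeks, or months to days."""
    try:
        return value * DAYS_PER_UNIT[unit]
    except KeyError:
        raise ValueError(f"unknown unit: {unit!r}")

def lifespan_from_dates(birth: str, death: str) -> int:
    """Compute lifespan in days from ISO-formatted birth/death dates."""
    return (date.fromisoformat(death) - date.fromisoformat(birth)).days
```

The date parsing is the fragile part in practice: spreadsheets exported with locale-dependent formats have to be normalized to ISO before a function like this can touch them.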

Our first manual standardization of 13 papers took 20 hours, with another 24 hours to add the remaining papers and make refinements. In all, it took an entire work week, full time, to standardize the data from only a tiny fraction of existing lifespan studies. This step, like searching for papers and extracting metadata, is almost entirely automatable.

Building a Public Web Portal

The next step is easy to see for yourselves. This open-access database contains the standardized data and metadata from all experiments we have ingested so far, and it is searchable and filterable. Kaplan-Meier curves, with basic statistics, are viewable, shareable, and downloadable for every experiment. In the future, we would love to add features, like the ability to combine experiments on a single plot, to perform more advanced statistics with the click of a button, and to privately upload your own internal results to compare with those published.
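For the curious, the estimator behind such curves is simple. A minimal sketch below, assuming fully observed (uncensored) lifespans; a library like lifelines handles censoring and confidence intervals for real analyses.

```python
def kaplan_meier(death_days: list[int]) -> list[tuple[int, float]]:
    """Kaplan-Meier survival estimate for fully observed lifespans.

    Returns (time, survival probability) pairs at each distinct death
    time. With no censoring this reduces to the empirical survival
    function, but the product-limit form generalizes cleanly if
    censored animals are added later.
    """
    n_at_risk = len(death_days)
    survival = 1.0
    curve = []
    for t in sorted(set(death_days)):
        deaths = death_days.count(t)
        survival *= 1.0 - deaths / n_at_risk  # product-limit update
        curve.append((t, survival))
        n_at_risk -= deaths
    return curve
```

Plotting the pairs as a step function gives the familiar survival curve; the median lifespan is the first time at which survival drops to 0.5 or below.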

Enabling Community Contributions

Finally, the most important part: allowing any user to add a paper we are missing. This is what makes our database unique, the ability for the community to continuously expand it. All other lifespan databases in existence (see Appendix) are static, usually made by one or a few individuals at a single point in time and left unmaintained after publication (at best, some are updated every few years). This means no such database is ever up to date; it never contains the most recent studies. With community contributions, anyone can add any study at any time, allowing the database to grow organically and stay current.

To add a study, simply make an account (we don’t want bots uploading spam), and get started. Your contributions will be tracked and displayed (if you wish) on each study you contribute, and authors can claim and verify their studies should they already be in the database.

A Collaborative Effort

Critically, our mission is not to compete with existing efforts—it’s to amplify them, and ultimately benefit all scientists. By creating an open platform that simplifies data aggregation, standardization, and sharing, we hope to accelerate progress for everyone, allowing researchers to quickly showcase their own experiments, combine their data with others, and find gaps and new insights. Research moves faster when data are accessible, well-organized, and continuously updated. We hope you join us.

Appendix: Mapping the Existing Landscape

Although we didn’t use these resources (except for the ITP) here, if you are interested in mouse lifespan, you should visit the following:

  1. DrugAge—perhaps the most extensive, sortable, filterable, with core metadata and summary statistics, and several species.
  2. Rodent Aging Interventions Database (RAID)—one of the most comprehensive, with >160 studies, featuring summary statistics, sorted by % lifespan extension.
  3. Interventions Testing Program (ITP)—featuring ITP data only, but very extensive, with ~84 robust experiments from 2004-2020, easily downloadable data, summary statistics, and Kaplan-Meier curves.
  4. Longevity Intervention Database (LID)—predominantly ITP data, but also data from the Caenorhabditis ITP.
  5. Animal Lifespan Expectancy Comparisons (ALEC)—predominantly ITP data, and enables data upload for users to quickly visualize their own experiments.
  6. SurvCurv—decent in scope and depth with data from multiple species, but no data post-2015.
  7. 95 Things That Make Mice Live Longer—a nice list, but no studies after 2017.
  8. Collective Collection of Lifespan Data (COLLIDA)—its database is not yet public.

There are likely many more private databases, though one may be hard pressed to discover anything larger than Longevica’s, featuring 20,000 mice. And while these resources make for a good starting point, they are isolated efforts, with no portal enabling community contributions; that is the key feature of our database.
