COPO: providing context through metadata

What is the difference between one data point and another?

19 December 2023

Suppose a researcher is measuring quantities of a protective chemical secreted by a tree. She finds that samples from two leaves of the same tree contain very different quantities of this chemical.

She has two different data points with no explanation - until she returns to her notes and finds that the leaf with the higher amount had been lunch for a caterpillar.

If data is a list of measurements, a description, an amount of a chemical - then metadata is the data about the data. Where was a sample collected? When? How? And did anything eat the leaf?

Metadata is essential information. It allows researchers to make deductions, replicate experiments, and draw more accurate conclusions.

But this contextual information is often inaccessible. It may be presented in an ambiguous way, buried in lab notebooks, or omitted entirely. 

Even for researchers who want to record and share their metadata, there is the challenge of deciding how much information to log. Might someone want to know the time of day the data was collected? The prevailing wind direction? The species of caterpillar?

The Earlham Institute platform Collaborative OPen Omics (COPO) is designed to address some of these issues. It helps researchers with uploading, labelling, and tagging their work in a consistent way and is designed to make it easy to share both results and the metadata around them.

Dr Felix Shaw is a research software engineer at the Institute, working on COPO alongside fellow software engineers Debby Ku and Aaliyah Providence.

He describes the platform as a broker between the user and repository systems. 

“COPO makes it much easier to prepare metadata for uploading alongside research data,” he says. “It makes it findable and describable according to agreed terms.

“For example, if you need to find the genome of all the amphibians living in brackish water, in one small geographical area, caught between April and May this year, it is possible to make that search easily.”
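
To make that kind of search concrete, here is a minimal sketch of filtering sample records by their metadata. It is not COPO's own search code: the record fields, the `find_samples` helper, and the sample values are all assumptions invented for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class SampleRecord:
    """One sample's metadata. All field names here are illustrative."""
    sample_id: str
    taxon_class: str
    habitat: str
    latitude: float
    longitude: float
    collected_on: date


def find_samples(records, taxon_class, habitat, lat_range, lon_range, start, end):
    """Return the records that match every metadata criterion."""
    return [
        r for r in records
        if r.taxon_class == taxon_class
        and r.habitat == habitat
        and lat_range[0] <= r.latitude <= lat_range[1]
        and lon_range[0] <= r.longitude <= lon_range[1]
        and start <= r.collected_on <= end
    ]


# Made-up records for illustration only.
records = [
    SampleRecord("EX-1", "Amphibia", "brackish water", 52.6, 1.3, date(2023, 4, 20)),
    SampleRecord("EX-2", "Amphibia", "fresh water", 52.6, 1.3, date(2023, 4, 22)),
    SampleRecord("EX-3", "Amphibia", "brackish water", 52.6, 1.3, date(2023, 7, 1)),
]

matches = find_samples(
    records, "Amphibia", "brackish water",
    lat_range=(52.0, 53.0), lon_range=(1.0, 2.0),
    start=date(2023, 4, 1), end=date(2023, 5, 31),
)
print([r.sample_id for r in matches])
```

The search only works because every record describes habitat, location, and date in the same way, which is the consistency the platform aims to provide.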

Dr Felix Shaw is a Research Software Engineer at the Earlham Institute and regularly leads training sessions on software and data skills for the bioscience community.

Support for large datasets

Increasingly high-throughput technologies mean biologists of all descriptions are working with larger datasets than ever before. However, the infrastructure to support data sharing has not kept up with the demand.

Many scientists still do not use public repositories, and methods for effectively recording and managing metadata are lacking.

“A keystone of science is the concept of FAIR data,” says Dr Shaw. “Data should be findable, accessible, interoperable, and reusable.

“It’s a fundamental principle, essential for reproducibility of experiments. Metadata is what makes FAIR possible.”

He explains part of the process involves limiting the range and variability of search terms. Standardising data description requires both the researcher uploading the metadata and the person searching for it to agree on the descriptive words used.

“With COPO, we are using specific, controlled vocabularies and agreed terms, which standardises data through consistency,” he says.
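
As a rough illustration of what a controlled vocabulary does, the sketch below accepts a free-text description only if it maps onto an agreed term. The term list, the synonym table, and the `normalise_term` function are hypothetical examples, not COPO's actual vocabularies.

```python
# Hypothetical agreed terms and synonyms, for illustration only.
HABITAT_TERMS = {"brackish water", "fresh water", "marine"}
SYNONYMS = {"freshwater": "fresh water"}


def normalise_term(raw, vocabulary=HABITAT_TERMS, synonyms=SYNONYMS):
    """Map a free-text entry onto an agreed term, or reject it."""
    # Ignore differences in case and spacing.
    cleaned = " ".join(raw.strip().lower().split())
    # Replace a known synonym with its agreed form.
    cleaned = synonyms.get(cleaned, cleaned)
    if cleaned not in vocabulary:
        raise ValueError(f"'{raw}' is not an agreed term")
    return cleaned


print(normalise_term("  Brackish   Water "))
print(normalise_term("Freshwater"))
```

Because the uploader and the searcher both end up with the same agreed term, a search for it finds every matching sample, however the description was first typed.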

Quick and easy access

When submitting data to COPO, researchers fill in a sample manifest of approximately 80 columns. They are not required to complete every column, but the more information they make available, the easier it will be for others to find their work.

COPO checks that provided metadata are of the necessary quality for public submission. Each submission is given an accession number. This is a unique identifier which allows submissions to be retrieved easily. Work is submitted to the European Nucleotide Archive.
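
A quality check of this kind can be pictured with the short sketch below, which reads a small manifest and reports any rows missing a required value. The column names, the choice of required columns, and the `check_manifest` function are assumptions made for illustration. They do not describe COPO's real manifest or validation rules.

```python
import csv
import io

# Hypothetical required columns; a real manifest has many more columns.
REQUIRED_COLUMNS = ("sample_id", "scientific_name", "collection_date")


def check_manifest(rows, required=REQUIRED_COLUMNS):
    """Return a list of problems; an empty list means the rows passed."""
    problems = []
    # The header is line 1 of the file, so data rows start at line 2.
    for line_no, row in enumerate(rows, start=2):
        for column in required:
            if not (row.get(column) or "").strip():
                problems.append(f"line {line_no}: missing {column}")
    return problems


# A made-up two-row manifest. The optional time_of_day column may be left blank.
manifest_text = """sample_id,scientific_name,collection_date,time_of_day
S1,Quercus robur,2023-05-02,morning
S2,,2023-05-03,
"""

rows = list(csv.DictReader(io.StringIO(manifest_text)))
problems = check_manifest(rows)
print(problems)
```

In this sketch only the missing species name is reported. The blank optional column passes, mirroring the idea that not every column has to be filled in.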

The platform is already used widely across the UK and all over Europe, with 53,307 samples and 33,647 files uploaded to date. Plans include an expansion into the United States, giving American researchers the same access as those in Europe.

Dr Shaw thinks COPO’s future is bright. “In an ideal future, COPO would be a resource for researchers and people with an interest in science everywhere,” he states, “to easily and quickly access everything connected with the areas of research they want to know more about.”

Contact the team to find out more about COPO and the training the Institute offers in using the platform.

Amy Lyall
Article author

Scientific Communications and Outreach Officer