Pangenome annotation opens up a multiverse of genes
Researchers at the Earlham Institute are gaining deeper insights into genetic diversity with pangenome annotation.
Unlike traditional genome studies, which use a single reference, a pangenome includes the full range of variation within a species.
Annotating a pangenome allows researchers to capture details of a species’ genetic diversity, analyse evolution and disease resistance, and look at old data with fresh insights.
Different strategies and tools can focus on different levels of detail or organise the pangenome to highlight particular kinds of information. Alternatively, the pangenome can be annotated - a pan-annotation - to identify functional elements encoded in DNA. These could include genes and sequences involved in expression, regulation, or structural roles.
Dr Rowena Hill, a Postdoctoral Research Scientist in the Neil Hall Group at the Institute, and Dr Gillian Reynolds, a Senior Bioinformatician in the Core Bioinformatics (Swarbreck) Group, worked together to complete a Fusarium pan-annotation, intended to improve comparative genomics analysis for this group.
Fusarium is a varied group of filamentous fungi, commonly found in soil and associated with plants. Many species are harmless, but some produce toxins that can affect human and animal health, and others are significant pathogens of plants and animals.
Notable species include Fusarium oxysporum f.sp. cubense, the pathogen responsible for Panama disease in bananas, a disease that threatens the global banana industry.
Dr Hill says the small size of the Fusarium genomes makes them a good test case for developing a pangenome annotation workflow.
“There's all sorts of Fusarium species - opportunistic human pathogens, plant pathogens, decomposers, the gamut. Even really closely related species can have completely different lifestyles,” she says.
“And their small genome size - relative to, for instance, most animals or plants - meant it was easier to reannotate them.”
She says the Fusarium dataset contained 83 genomes.
“It was a mix of data. The genomes were annotated, but the annotations were produced by different people, using different methods. We weren’t able to compare one with another in a suitably fair way. So the question was - how can we make this consistent for all 83 and improve comparability?”
(Left) Dr Rowena Hill, Postdoctoral Scientist, and (right) Dr Gillian Reynolds, Senior Bioinformatician
Dr Reynolds works on applying and improving strategies for large-scale genome annotation and genome comparison.
She worked with Dr David Swarbreck, from the Institute’s Core Bioinformatics (Swarbreck) Group, to combine existing annotation tools - leading to better-quality and more comparable data.
Dr Reynolds says comparing genomic sequences against a pangenome, rather than a single reference genome, captures more diversity in the annotation and lowers the chance of technical bias.
“Different methods, different strategies, even things like different parameters can all affect the results of annotation,” she says.
Annotation involves employing bioinformatics tools and algorithms to analyse DNA data. These quickly process large datasets, predicting gene locations and functions. Researchers manually review the results to ensure clarity and accuracy.
This process transforms raw genome sequences into valuable information, allowing researchers to understand gene interactions and their contributions to biological processes. It provides a comprehensive and precise understanding of the genome.
But there are many different methods for annotating the data, depending on what is being searched for. These can produce different results.
“When you compare different systems of annotation, you can end up accidentally inferring results that are technical, not biological,” says Dr Reynolds.
“So - for example - the absence of gene families you are finding in a particular genome could be due to failure to annotate, not actual differences. If you don't know how the annotation methods all interplay, you might think that there's certain genes missing, or there are certain gene copies that are different from each other.
“When actually – if you use the same method on many genomes – you've got a more equal comparison. There’s a whole host of missing diversity that we're hoping to get back.
“We are taking genome annotations from disparate methods and harnessing the information contained in them to produce a more unified set of annotations.”
Researchers are moving away from a single point of reference to compare multiple genomes against each other. The result is unparalleled resolution on genetic diversity - promising new discoveries in evolution, agriculture, and human health.
In a recent article, we discussed what a pangenome is, and what it can tell us about life around us.
The pan-annotation process aims to make biologically meaningful comparisons by standardising annotation methods. The ultimate goal is to develop computationally efficient methods for comparing multiple genomes.
Dr Reynolds has also developed her own pangenome analysis program, in collaboration with Montana State University and the James Hutton Institute in Scotland. She is funded by the Institute’s Decoding Biodiversity strategic programme, and is finalising the algorithm, with hopes to begin testing soon.
She plans to encourage users to interpret variation information in the context of a standardised set of annotations. This means information can be interpreted with minimal technical bias.
“I see pangenome annotation being used in two ways,” she says. “It can be used by biologists, to make sure they've got comparable annotations to look at the organism they’re investigating in whatever way they want.
“And then it can also be used to demonstrate that bioinformatic method standardisation is important. We need to know we’re actually comparing the same things.”
Dr Reynolds added that the only limitation of this approach is that it requires a comparison between every possible pair of genomes. Because the number of pairs grows roughly with the square of the number of genomes, the comparisons become prohibitively expensive to compute as the dataset grows.
She intends to address this in her own pangenome work and says the next steps will include working on improving the method to avoid relying on pairwise comparisons.
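The scaling concern with all-against-all comparison can be made concrete with a little arithmetic. The short Python sketch below counts the unordered pairs among n genomes, which is n(n - 1)/2. The function name `pairwise_comparisons` and the dataset sizes other than 83 are illustrative assumptions, not part of the project's software.

```python
from math import comb


def pairwise_comparisons(n_genomes: int) -> int:
    """Return the number of unordered genome pairs: n * (n - 1) / 2."""
    return comb(n_genomes, 2)


# 83 is the size of the Fusarium dataset; the other sizes are hypothetical.
for n in (10, 83, 1000, 10000):
    print(f"{n:>6} genomes -> {pairwise_comparisons(n):>12,} pairs")
```

For the 83 Fusarium genomes this gives 3,403 pairs. A hypothetical collection of 10,000 genomes would need almost 50 million, which is why a method that avoids pairwise comparison becomes attractive as datasets grow.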
“For the Fusarium annotation we used a combination of different strategies. We lift over what we can between already existing annotations, and then we can fill in any gaps with AI-based gene prediction software,” she says.
“We are trying to make sure that all the genomes have a comparable gene content. It’s really about getting rid of that technical bias in our data and making sure that the comparisons we make are biologically meaningful.”
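The general idea of lifting over existing annotations and then filling the gaps with predictions can be illustrated with a toy example. The Python sketch below is not the team's pipeline: the `Gene` record, the `fill_gaps` function, and the simple "keep a predicted gene only if it overlaps nothing already kept" rule are all assumptions made for illustration, and strand is ignored for simplicity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gene:
    """A hypothetical, minimal gene record (coordinates are 1-based, inclusive)."""
    contig: str
    start: int
    end: int
    source: str  # e.g. "liftover" or "prediction"


def overlaps(a: Gene, b: Gene) -> bool:
    """True if two genes share at least one base on the same contig."""
    return a.contig == b.contig and a.start <= b.end and b.start <= a.end


def fill_gaps(lifted: list[Gene], predicted: list[Gene]) -> list[Gene]:
    """Keep every lifted-over gene, then add predicted genes only where
    they do not overlap anything already kept (earlier predictions win)."""
    merged = list(lifted)
    for candidate in predicted:
        if not any(overlaps(candidate, kept) for kept in merged):
            merged.append(candidate)
    return sorted(merged, key=lambda g: (g.contig, g.start))


# Made-up coordinates, purely for demonstration.
lifted = [Gene("chr1", 100, 500, "liftover"), Gene("chr1", 2000, 2600, "liftover")]
predicted = [
    Gene("chr1", 450, 900, "prediction"),    # overlaps a lifted gene: dropped
    Gene("chr1", 1000, 1500, "prediction"),  # falls in a gap: kept
    Gene("chr2", 10, 300, "prediction"),     # contig with no lifted genes: kept
]
for gene in fill_gaps(lifted, predicted):
    print(gene.contig, gene.start, gene.end, gene.source)
```

Giving priority to lifted-over genes is one possible design choice. A real workflow would also need to reconcile partial overlaps, strand, and conflicting gene models, which this sketch deliberately leaves out.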
Dr Hill adds this approach potentially allows new information to be sourced from existing genomes.
“The beautiful thing about it is that we can re-examine existing data with these computational approaches. We can make the most of what we've already generated.”
This project is also part of the Institute’s Decoding Biodiversity strategic research programme - our work to unlock the full potential of genomic data sets by developing new tools to harness large-scale genomic and metagenomic collections.
Ultimately, the pangenome annotation technique of comparing genomes could lead to more accurate, more scalable, and more biologically meaningful data, capturing subtle variations which might otherwise be missed.
It could also allow scientists to re-examine existing data, giving us the chance to see the true scale of diversity within a species.