New approaches for metagenome assembly with short reads

04 March 2019

Metagenomic assembly is challenging - especially as dedicated algorithms for metagenomic data are a relatively recent innovation. <a href="/martin-ayling">Martin Ayling</a>, <a href="">Matthew Clark</a> and <a href="/richard-leggett">Richard Leggett</a> talk us through the challenges and different approaches in a <a href="">recently published review</a> in <i>Briefings in Bioinformatics</i>.

As the review states in the first line of the conclusion: “Assembling genomes out of heterogeneous samples is an extremely challenging problem and one that remains unsolved.” Indeed, with the first specialised tools for dedicated assembly of metagenomic data only released in 2012, the field is still in its infancy.

Yet, metagenomics is a rapidly expanding field of research, with important implications especially in healthcare. Gleaning accurate information on the contents of the human gut, for example, can help advance personalised healthcare and speed up the diagnosis of life-threatening conditions.

Likewise, analysing environmental samples - envirogenomics - can help us explore the composition of the complex communities of microorganisms living all around us, even on and inside us: the human microbiome, the New York Subway, the virome of bats, the oceans, the crop rhizosphere and the extremophiles living in hot springs, to name but a few already-applied examples.

Additionally, metagenomics can provide a means to investigate species which are otherwise impossible to interrogate using classic genomics methods.

The challenges are abundant in such analyses, however. Firstly, the abundance and diversity of organisms in any sample are unknown. How do you know what is in there, and how much of each? When it comes to assembling the sequence, many of the assumptions that hold for samples gleaned from a single species no longer apply.

Another problem is related species: “in a genomic study, it may be assumed that all sequence reads derive from the same original genome. In metagenomic studies, this is emphatically not the case.” A metagenomic sample contains so many related species and subspecies that these can thoroughly confuse an assembler - with extensive overlaps in the k-mer set, for example.
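As a toy illustration (not taken from the review), the sketch below uses two short, made-up sequences differing by a single base - standing in for closely related strains in a metagenome - and counts how many k-mers they share. The sequences, the k-mer size and the function name are all hypothetical choices for demonstration only.

```python
# Illustrative sketch: shared k-mers between two hypothetical,
# closely related sequences, showing why assemblers built for
# single genomes can be confused by mixed samples.

def kmers(seq, k):
    """Return the set of all k-mers (substrings of length k) in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

# Two made-up 20 bp sequences differing by one substitution,
# standing in for strains of related species in a metagenome.
strain_a = "ACGTACGGACTTACGTAGCT"
strain_b = "ACGTACGGACTAACGTAGCT"

k = 5
shared = kmers(strain_a, k) & kmers(strain_b, k)
total = kmers(strain_a, k) | kmers(strain_b, k)
print(f"{len(shared)} of {len(total)} distinct k-mers are shared")
```

Even with a single differing base, half of the distinct 5-mers here are common to both strains - and real strains sharing long stretches of identical sequence produce far larger overlaps, which is exactly what tangles an assembly graph.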

Other significant challenges include memory and processing, initial classification of reads, graph partitioning and read pair information.

So, what are the approaches? Find out in the review - in which the team looks at a variety of short-read de novo tools, as well as reference-based approaches. Currently, there are no published tools dedicated solely to metagenomes gleaned from third-generation, long-read sequencing platforms, but good results have already been garnered using existing tools.

Essentially, assembling metagenomes requires different algorithmic approaches, which don’t abide by the assumptions made by most genome assembly tools. New tools are continuing to emerge, but there is no single tool that is best for all samples or questions.

Notes to editors


For more information, please contact:

Dr Peter Bickerton

Scientific Communications & Outreach Manager, Earlham Institute (EI)

  • +44 (0)1603 450 994

About Earlham Institute

The Earlham Institute (EI) is a world-leading research institute focusing on the development of genomics and computational biology. EI is based within the Norwich Research Park and is one of eight institutes that receive strategic funding from the Biotechnology and Biological Sciences Research Council (BBSRC) - £5.43m in 2017/18 - as well as support from other research funders. EI operates a National Capability to promote the application of genomics and bioinformatics to advance bioscience research and innovation.

EI offers a state-of-the-art DNA sequencing facility, unique in its operation of multiple complementary technologies for data generation. The Institute is a UK hub for innovative bioinformatics through research, analysis and interpretation of multiple, complex data sets. It hosts one of the largest computing hardware facilities dedicated to life science research in Europe. It is also actively involved in developing novel platforms to provide access to computational tools and processing capacity for multiple academic and industrial users, and in promoting applications of computational bioscience. Additionally, the Institute offers a training programme through courses and workshops, and an outreach programme targeting key stakeholders and wider public audiences through dialogue and science communication activities.