When one become two: separating DNA for more accurate nanopore analysis

17 May 2021
Image

A new software tool developed by Earlham Institute researchers will help bioinformaticians improve the quality and accuracy of their biological data, and avoid mis-assemblies. The fast, lightweight, user-friendly tool visualises genome assemblies and gene alignments from the latest next generation sequencing technologies.

Called Alvis, the new visualisation tool examines mappings between DNA sequence data and reference genome databases. This allows bioinformaticians to more easily analyse their data generated from common genomics tasks and formats by producing efficient, ready-made vector images. 

First author and post-doctoral scientist at the Earlham Institute Dr Samuel Martin in the Leggett Group, said: “Typically, alignment tools output plain text files containing lists of alignment data. This is great for computer parsing and for being incorporated into a pipeline, but it can be difficult to interpret by humans. 

“Visualisation of alignment data can help us to understand the problem at hand. As a new technology, several new alignment formats have been implemented by new tools that are specific to nanopore sequencing technology. 

“We found that existing visualisation tools were not able to interpret these formats; Alvis can be used with all common alignment formats, and is easily extensible for future ones.”

A key feature of the new command line tool is its unique ability to automatically highlight chimeric sequences - weak links in the DNA chain. This is where two sequences - from different parts of a genome or different species - are linked together by mistake to make one, affecting the data’s accuracy.

Chimera sequences can be problematic for bioinformaticians when identifying specific DNA. The chimera formation can physically happen to the DNA molecules during either sequencing library preparation, during the sequencing process on some platforms, and by assembly tools when trying to piece together a genome. 

 

Example contigalignment 1

During the development of the tool, the team compared genome assemblies with and without using Alvis chimera detection. The vector image produced (above) shows an example output, where the intuitive tool tracks all reads it recognises as chimeras. 

“Although chimeric sequences don’t make up a large proportion of samples, they can have a significant effect, so we have to be careful that we have identified them during analysis,” said Dr Martin.

“In the Alvis diagram example of chimera data, each rectangle across the page represents a read, and the coloured blocks inside them represent alignments. Most chimeras are easy to see because their alignments are different colours, meaning they map to different genomes. Others are more subtle because both alignments are to the same genome, but different regions.”

The Alvis tool can pinpoint visualisation of only chimeric sequences for further inspection, and output numerical data describing the chimeras. This demonstrates that by applying the tool and then bioinformatically splitting the chimeras, the quality of the assemblies is significantly improved.

Accessed nearly 600 times since being made available at the beginning of March this year, Dr Martin, adds: “We hope that Alvis continues to be useful to other researchers working with, for example, nanopore sequencing; improving their understanding of their data by visualising alignments,''.

“Alignments are so fundamental to bioinformatics that it could be of use to anyone working with long read sequencing data, as well as alignments generated by sequencing data from short-read  platforms. The diagrams that Alvis generates can be easily exported to directly use in publications, demonstrated in our study already.”

The paper Alvis: a tool for contig and read ALignment VISualisation and chimera detection is published in BMC Bioinformatics. 

Notes to editors.

Notes to editors.

 

For more information, please contact:

Hayley London

Marketing & Communications Officer, Earlham Institute (EI)

  • +44 (0)1603 450 107

hayley.london@earlham.ac.uk

About Earlham Institute

The Earlham Institute (EI) is a world-leading research institute focusing on the development of genomics and computational biology. EI is based within the Norwich Research Park and is one of eight institutes that receive strategic funding from Biotechnology and Biological Science Research Council (BBSRC) - £6.45M in 2015/2016 - as well as support from other research funders. EI operates a National Capability to promote the application of genomics and bioinformatics to advance bioscience research and innovation.

EI offers a state of the art DNA sequencing facility, unique by its operation of multiple complementary technologies for data generation. The Institute is a UK hub for innovative bioinformatics through research, analysis and interpretation of multiple, complex data sets. It hosts one of the largest computing hardware facilities dedicated to life science research in Europe. It is also actively involved in developing novel platforms to provide access to computational tools and processing capacity for multiple academic and industrial users and promoting applications of computational Bioscience. Additionally, the Institute offers a training programme through courses and workshops, and an outreach programme targeting key stakeholders, and wider public audiences through dialogue and science communication activities.

www.earlham.ac.uk