Rapid validation for genome assemblies? Introducing KAT: K-mer Analysis Toolkit

05 December 2016

A new bioinformatics tool has been released by the Earlham Institute that provides rapid validation for whole genome sequencing data as well as genome assemblies produced from Next Generation Sequencing (NGS) data.

Genome assembly projects are costly in both time and money, so identifying problems with your data only after assembly can be a real setback. With the K-mer Analysis Toolkit (KAT), researchers can assess and confirm their results at every stage.

Genome assembly with NGS technologies is like trying to do the hardest jigsaw puzzle you can imagine. The final jigsaw represents the full genome, and the individual pieces represent small fragments of the genome read out by the sequencer. Counterintuitively, to make the data more manageable, it is actually easier to first break these pieces into even smaller pieces called K-mers.

K-mers represent small fragments of the original genome with a fixed number (K) of DNA base pairs. A computer can efficiently work with large quantities of K-mers, then identify connections between these fragments to build up a representation of the original genome.
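To make the idea concrete, here is a minimal Python sketch of breaking one read into overlapping K-mers. The function name `kmers` and the toy read are illustrative assumptions only and are not taken from KAT itself.

```python
def kmers(sequence: str, k: int) -> list[str]:
    """Return every overlapping substring of length k, in order of appearance."""
    if k <= 0:
        raise ValueError("k must be positive")
    # A sequence of length n contains n - k + 1 overlapping k-mers.
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


# Hypothetical 7-base read, split into 3-mers.
read = "ATGGCGT"
print(kmers(read, 3))  # ['ATG', 'TGG', 'GGC', 'GCG', 'CGT']
```

Each K-mer overlaps its neighbour by K-1 bases, which is what allows the fragments to be chained back together.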

K-mer-based techniques are commonly used to generate genome assemblies efficiently. KAT, however, is built to examine and compare K-mer datasets, using each distinct K-mer’s underlying properties, such as frequency and nucleotide composition.

Initially, KAT can analyse sequencing data to identify error levels, biases and contamination. Information from this analysis can help researchers decide whether to proceed with downstream tasks such as genome assembly. KAT can then internally back-check your assembly to determine completeness and accuracy without any external reference data - a really useful feature when studying new organisms.

Lead Software Developer, Daniel Mapleson, said of the new tool: “Imagine genome assembly like Lego. Instead of trying to piece together long, 8x2-stud pieces with 6x2-stud pieces and 5x2-stud pieces, it’s more like making a staircase pattern out of the smaller 2x2-bit pieces, overlapping one stud at a time.

“However, K-mers are not only useful for assembling a genome; by counting the number of K-mers in a sequencing dataset you can learn a lot about it. By looking at the K-mer frequency profiles (K-mer spectra) we can assess the quality of the sequencing data in the first instance, such as working out if the dataset is clean, contains contaminants or is biased in some way. KAT can give answers to these questions quickly, even for non-model organisms where a reference is not available.”
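The K-mer spectrum described in the quote can be sketched in a few lines of Python: count how often each distinct K-mer occurs, then count how many distinct K-mers share each frequency. This is a simplified illustration under stated assumptions, not KAT's implementation. The function names and toy reads are hypothetical, and the sketch does not merge a K-mer with its reverse complement, which real tools typically do.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count occurrences of every k-mer across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def kmer_spectrum(reads, k):
    """Map each frequency to the number of distinct k-mers seen that many times."""
    return Counter(kmer_counts(reads, k).values())


# Three hypothetical reads; a real dataset would contain millions.
toy_reads = ["ATGGC", "TGGCA", "ATGGA"]
spectrum = kmer_spectrum(toy_reads, 3)
for frequency in sorted(spectrum):
    print(frequency, spectrum[frequency])
```

In real data, a large spike of K-mers seen only once or twice usually reflects sequencing errors, while the main peak sits near the sequencing coverage; extra or shifted peaks can hint at contamination or bias.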

Project Leader and corresponding author Bernardo Clavijo commented: “The first thing many researchers do after sequencing a genome is to check the K-mer spectra of their data. This tells you if the information you will need to assemble the genome is there before you spend a lot of time, effort and money on doing the rest of the analysis. Now with KAT, researchers can do all kinds of validation and information comparison at this initial stage; but to also carry this forward to validation, we have included the relevant information at the end of the assembly.

“In terms of assembly validation, the tool is particularly useful with diploid genomes that can carry more than one copy of a gene; certain regions can be falsely duplicated or deleted during assembly, leading the researcher to believe there are more or fewer copies of a gene than there really are. KAT can help to detect these artefacts by tracking both the data generated from the sequencer and data from the assembler, ultimately leading to faster, more accurate conclusions.”
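The comparison of sequencer data against assembler output can be illustrated with a small Python sketch that tabulates, for every K-mer found in the reads, how often it occurs in the reads and how many copies appear in the assembly. This is an assumption-laden toy, not KAT's actual algorithm or interface; `count_kmers`, `compare_reads_to_assembly` and the example sequences are all hypothetical.

```python
from collections import Counter


def count_kmers(sequences, k):
    """Count occurrences of every k-mer across a collection of sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def compare_reads_to_assembly(reads, assembly, k):
    """Tally distinct read k-mers by (frequency in reads, copies in assembly)."""
    read_counts = count_kmers(reads, k)
    assembly_counts = count_kmers(assembly, k)
    table = Counter()
    for kmer, read_freq in read_counts.items():
        # A Counter returns 0 for k-mers that never occur in the assembly.
        table[(read_freq, assembly_counts[kmer])] += 1
    return table


# Hypothetical reads and a single-contig assembly.
reads = ["ATGGC", "ATGGC", "TGGCA"]
assembly = ["ATGGC"]
table = compare_reads_to_assembly(reads, assembly, 3)
for (read_freq, copies), n in sorted(table.items()):
    print(f"read frequency {read_freq}, assembly copies {copies}: {n} distinct k-mers")
```

Read as a table, K-mers that are frequent in the reads but have zero copies in the assembly point to missing content, while K-mers at single-copy read frequency that appear twice in the assembly point to possible false duplication.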

The paper, titled “KAT: A K-mer Analysis Toolkit to quality control NGS datasets and genome assemblies”, is published in Bioinformatics.

For more information, read our article: KAT got your tongue? An analysis tool to quickly detect problems in sequencing data and genome assemblies.

Notes to editors

For more information, please contact:

Hayley London

Marketing & Communications Officer, Earlham Institute (EI)

+44 (0)1603 450 107

hayley.london@earlham.ac.uk

About Earlham Institute

The Earlham Institute (EI) is a world-leading research institute focusing on the development of genomics and computational biology. EI is based within the Norwich Research Park and is one of eight institutes that receive strategic funding from the Biotechnology and Biological Sciences Research Council (BBSRC) - £6.45M in 2015/2016 - as well as support from other research funders. EI operates a National Capability to promote the application of genomics and bioinformatics to advance bioscience research and innovation.

EI offers a state-of-the-art DNA sequencing facility, unique in its operation of multiple complementary technologies for data generation. The Institute is a UK hub for innovative bioinformatics through research, analysis and interpretation of multiple, complex data sets. It hosts one of the largest computing hardware facilities dedicated to life science research in Europe. It is also actively involved in developing novel platforms to provide access to computational tools and processing capacity for multiple academic and industrial users, and in promoting applications of computational bioscience. Additionally, the Institute offers a training programme through courses and workshops, and an outreach programme targeting key stakeholders and wider public audiences through dialogue and science communication activities.

www.earlham.ac.uk

About BBSRC

The Biotechnology and Biological Sciences Research Council (BBSRC) invests in world-class bioscience research and training on behalf of the UK public. Our aim is to further scientific knowledge, to promote economic growth, wealth and job creation and to improve quality of life in the UK and beyond.

Funded by Government, BBSRC invested over £509M in world-class bioscience in 2014-15 and is the leading funder of wheat research in the UK (over £100M investment on UK wheat research in the last 10 years). We support research and training in universities and strategically funded institutes. BBSRC research and the people we fund are helping society to meet major challenges, including food security, green energy and healthier, longer lives. Our investments underpin important UK economic sectors, such as farming, food, industrial biotechnology and pharmaceuticals.

For more information about BBSRC, our science and our impact see: http://www.bbsrc.ac.uk For more information about BBSRC strategically funded institutes see: http://www.bbsrc.ac.uk/institutes

Tags: Bioinformatics
