Precision genome annotation: Portcullis and Mikado
Take a look at our two latest bioinformatics tools that are helping to better identify information within diverse genomes.
It’s one thing getting accurate RNA sequencing data for genome analysis - but it’s another thing entirely to be able to assemble this correctly, and then to dissect this information to pick out the useful parts - i.e. the genes and their products.
Two tools developed in the Swarbreck Group at Earlham Institute have paved the way to drastically enhance how we can do this. Portcullis, developed by Daniel Mapleson, and Mikado, developed by Luca Venturini, have recently been released to the public and are available to any scientist who wants to accurately identify important regions within genomes.
Splice junctions are important. In plants and animals, one gene sequence does not necessarily only make one product. In fact, within the gene, there are various gaps.
Think of it like a stripy football shirt that is black and white. The black stripes might make a protein, but the white stripes are simply non-coding sections of material in between. These gaps are not ‘junk’, however, because they allow one gene to make many products, depending on which of the black stripes are stuck together.
Let’s say there are five black stripes and four white stripes. We might get a combination of 1+2+3+4+5, or we might get 1+3+4, or 1+4+5.
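To make the stripe arithmetic concrete, here is a minimal sketch in Python that lists every way of keeping two or more of the five black stripes in their original order, and shows which joins (splice junctions) a given combination uses. It is a toy model only: the function names are made up for illustration, and real genes allow far fewer combinations than this.

```python
from itertools import combinations


def isoforms(n_exons, min_exons=2):
    """Yield every in-order selection of exons (a toy model of splicing)."""
    exons = range(1, n_exons + 1)
    for k in range(min_exons, n_exons + 1):
        yield from combinations(exons, k)


def junctions(isoform):
    """Return the stripe-to-stripe joins (splice junctions) in one isoform."""
    return list(zip(isoform, isoform[1:]))


all_isoforms = list(isoforms(5))
print(len(all_isoforms))       # number of possible combinations in this toy model
print(junctions((1, 3, 4)))    # the joins used by the 1+3+4 product
```

Each distinct join is a splice junction that software has to detect correctly from the sequencing data.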
The problem with existing software was that it generated many false positives - it reported plenty of splice junctions that simply weren’t there. These needed to be filtered out, which is precisely what Portcullis does.
Compared to other software, Portcullis produces an extremely low percentage of false positives, with precision between 95% and 99%.
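For readers unfamiliar with the term, precision is the share of reported junctions that are genuine. The short Python sketch below shows the calculation. The counts are hypothetical, chosen only to illustrate the arithmetic, and are not results from Portcullis.

```python
def precision(true_positives, false_positives):
    """Fraction of reported junctions that are real."""
    reported = true_positives + false_positives
    if reported == 0:
        raise ValueError("no junctions were reported")
    return true_positives / reported


# Hypothetical counts: 9,700 genuine junctions and 300 false ones reported.
example = precision(9700, 300)
print(f"{example:.0%}")
```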
This is incredibly useful, as the software can aid the identification and selection of correct transcripts - i.e. the RNA copies of genes that tell cells which proteins to make - and help to investigate the effects of alternative splicing (the different ways in which the black stripes, or coding sections, of a gene can be joined together).
Portcullis has been battle-tested on the wheat genome, which is large and complex (five times larger than the human genome, and way more complicated) and comes with a huge number of alignments. Portcullis processed the data in under an hour using less than 20GB of RAM.
Wheat is a great example. As Luca says, “Portcullis was fundamental to the analysis of the wheat genome.” In fact, Dan points out that we have a saying: “If the software works on wheat, it'll work on anything!”
Furthermore, Portcullis is five times faster than the next best tool, which is always of benefit to a bioinformatician.
Luca named Mikado after the game of the same name, the aim of which is to extract the sticks of differing colours that carry the biggest score without disturbing the rest. This, he says, is “exactly what we are doing. Imagine genes as sticks and we want to capture the ones with the highest value without getting the others.”
The problem with transcript assembly and genome annotation has been that, in addition to getting too many false positives, there have been too many false negatives. Individual tools often miss genes completely, but Mikado can combine different transcript assembly methods to retrieve more real genes than any individual tool.
At the same time, by analysing the incoming transcripts in detail, Mikado can filter out false positives too. In addition, adjacent genes can often be annoyingly and incorrectly fused together by the transcript reconstruction software, yet Mikado can identify these situations and split the falsely-fused genes apart.
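The idea of pooling candidates from several assemblers and keeping the best one at each location can be sketched in a few lines of Python. This is an illustration of the general principle only, not Mikado's actual algorithm or interface: the `Transcript` fields, the assembler labels and the scores are all hypothetical.

```python
from typing import NamedTuple


class Transcript(NamedTuple):
    name: str
    source: str    # which assembler proposed it (hypothetical labels)
    locus: str     # the region of the genome it covers
    score: float   # hypothetical quality score, higher is better


def pick_best_per_locus(transcripts):
    """Keep only the highest-scoring transcript at each locus."""
    best = {}
    for t in transcripts:
        if t.locus not in best or t.score > best[t.locus].score:
            best[t.locus] = t
    return best


candidates = [
    Transcript("t1", "assembler_A", "locus1", 7.5),
    Transcript("t2", "assembler_B", "locus1", 9.0),
    Transcript("t3", "assembler_B", "locus2", 4.0),  # missed by assembler_A
]
chosen = pick_best_per_locus(candidates)
for locus, t in chosen.items():
    print(locus, t.name, t.source)
```

In this made-up example, `locus2` is only recovered because a second assembler was included, which is the benefit of combining methods.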
You can see how this might happen with very similar gene sequences. Many genes are duplicates and can carry almost identical sections of DNA in places, so it’s easy for an algorithm to assume that these join up rather than each forming its own stretch of information.
So, when it came to solving complex gene regions, instead of correctly identifying different genes, we were getting chimeras. What Luca wanted to do was to split these into different transcripts.
This is where Portcullis comes into its own again. Using Portcullis along with Mikado helped Luca to eliminate the false junctions that had led previous software to fuse gene sequences together incorrectly.
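A rough sketch of how a list of trusted junctions can expose a falsely fused gene is shown below in Python. It assumes, purely for illustration, that each junction is recorded as a (chromosome, start, end) tuple and that a transcript is kept only if every junction it uses is in the trusted set. The coordinates and names are invented, and this is not the actual logic of either tool.

```python
def is_supported(transcript_junctions, reliable):
    """True if every junction the transcript uses is in the reliable set.

    A transcript with no junctions (a single-exon gene) passes trivially.
    """
    return all(j in reliable for j in transcript_junctions)


# Hypothetical junctions that passed filtering.
reliable_junctions = {("chr1", 1200, 1500), ("chr1", 5200, 5600)}

# Hypothetical assembled transcripts and the junctions they contain.
transcripts = {
    "geneA": [("chr1", 1200, 1500)],
    "geneB": [("chr1", 5200, 5600)],
    # The chimera bridges the two genes with a junction that was filtered out.
    "fused_AB": [("chr1", 1200, 1500), ("chr1", 2000, 4800), ("chr1", 5200, 5600)],
}

kept = {name for name, js in transcripts.items()
        if is_supported(js, reliable_junctions)}
print(sorted(kept))
```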
Essentially, Luca has managed to go from there being a big mess to having a nicely annotated genome, with the correct genes identified.
Think of it like having a big pile of socks all jumbled up on the floor. Rather than having stripy socks paired with chequered ones, and blue socks with red socks, we now have blue, red, stripy and chequered socks all correctly paired and nicely organised into piles, while the rubbish - the socks with holes in them - is thrown away.
Luca also points out that this software can be “used on wheat without a glitch (granted, it took a week of hammering about to get to this stage).” Wheat being a very complex genome to analyse, this is no mean feat. Furthermore, this tool can be used across different species with much greater precision.
Thus, with these two pieces of software, we are able to much more accurately, efficiently and precisely identify important coding regions of various genomes, which is incredibly important when dissecting the essence of life.
The accurate identification of genes and splice variants is an important step towards determining the function of these genes and helping us meet the challenges of feeding ten billion people by 2050, responding to climate change and identifying factors that aid the development and health of a range of organisms.
On Portcullis, Dan Mapleson said, “Portcullis has been a real team effort, demonstrating how combining computational with biological expertise can produce software that addresses real problems in the field.
“We've shown that using Portcullis, we can generate an accurate set of splice junctions from RNAseq data, which can have a really positive impact on a number of downstream tasks, including genome annotation (via Mikado), and alternative splicing analysis.
“As a computer scientist, it's been a great project to work on, incorporating a number of powerful computational methods, such as machine learning, to deliver precise results, and multi-threading, to get those results rapidly.”
On Mikado, Luca Venturini added, “Mikado has been the product of over one year of myself as a developer working with experienced annotators in our group – an absolutely delightful collaboration. Mikado demonstrates that it is possible to integrate RNA-Seq assemblies coming from different tools, and to combine them seamlessly with data coming from novel technologies such as PacBio.
“Our test cases included exotic and difficult genomes to work with such as wheat and the ash tree, so we can say that even as a novel tool, Mikado has gone through a lot already! As a computational biologist, Mikado was the first large scale project I was involved with and it has been a pleasure to explore all the necessary techniques to make it as robust and efficient as it is today.”