How our tools can help you: Mikado

To produce scientific data at our scale, we also need a host of software tools that highlight and decipher analyses, pinpointing the information needed to answer ambitious biological questions.

May 23, 2018

Have you often wanted to know the best approach before you start analysing your RNA data? Mikado is a novel algorithm that lets you integrate multiple RNA-Seq assemblies into a coherent transcript annotation. The performance of RNA-Seq aligners and assemblers varies greatly across organisms and experiments, and the accuracy of transcript reconstruction is boosted by combining multiple approaches.

The open-source bioinformatics tool can remove redundancies and select the best transcript models according to user-specified metrics, while resolving common assembly artefacts. Lead developer Luca Venturini explains how Mikado leverages multiple transcriptome assembly methods for improved gene structure annotation.

What are Mikado's top three best features?

Lightweight – developed to be as modest in resources as possible. Mikado adapts to the scale of the dataset; an essential component, given that it was created for heavy-duty analysis on one of the most complex genomes, wheat.

Accurate and customisable – with its default parameters, Mikado is capable of improving the accuracy of most transcript assemblies by removing artefacts and selecting gene models which are biologically relevant. The tool is nonetheless very customisable, allowing users to tailor the types of transcripts they want to see in their annotation.

Flexible – can integrate data from multiple sources. As long as the input is formatted in one of the three most common file formats used in genomics (GTF, GFF3 and BED12), Mikado is not limited by technology. It can be applied to varied sources, such as data from the PacBio sequencing platform or ab initio predictions, which infer biological features from a computational model alone, without comparison to existing data.
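To make the file-format point concrete, here is a minimal sketch of how one transcript stored as a BED12 line can be turned into exon coordinates. It relies only on the standard BED12 column layout; the function name `bed12_exons` is made up for this illustration and is not part of Mikado.

```python
def bed12_exons(line):
    """Parse one BED12 line into (chrom, strand, name, exons).

    Exons are returned as 0-based, half-open (start, end) pairs,
    which is the coordinate convention BED itself uses.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 12:
        raise ValueError("expected 12 tab-separated BED fields")

    chrom, start, name, strand = fields[0], int(fields[1]), fields[3], fields[5]
    block_count = int(fields[9])
    # blockSizes and blockStarts are comma-separated, often with a trailing comma
    sizes = [int(x) for x in fields[10].split(",") if x]
    offsets = [int(x) for x in fields[11].split(",") if x]
    if not (len(sizes) == len(offsets) == block_count):
        raise ValueError("blockCount does not match blockSizes/blockStarts")

    # blockStarts are offsets relative to chromStart
    exons = [(start + off, start + off + size) for size, off in zip(sizes, offsets)]
    return chrom, strand, name, exons
```

A two-exon transcript starting at position 1000 with blocks `500,1000,` at offsets `0,3000,` would come back with exons at (1000, 1500) and (4000, 5000).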

Why the name, 'Mikado'?

The name references the eponymous game, 'Mikado', where the goal is to find and pick up the highest-scoring sticks without touching or moving the others in the bunch. This is conceptually similar to the job done by my programme: given a bunch of transcripts (the 'sticks'), it assesses their quality (the coloured band on the sticks), selects the highest-scoring, and picks them up while discarding the rest (similar to a Mikado player who picks up the highest-scoring stick while leaving the others on the table).

Also, similar to the Mikado game, it is possible to use sticks you have already retrieved as levers to obtain others still on the table; in the programme the main transcript can be 'leveraged' to find and pick up valid alternative splicing events.
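The analogy can be sketched in a few lines of Python. This is only an illustration of the idea, not Mikado's actual implementation: the `Transcript` fields, the `min_fraction` threshold and the rule that an alternative isoform must share at least one intron with the primary transcript are all assumptions made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Transcript:
    name: str
    locus: str          # hypothetical locus identifier
    score: float        # hypothetical quality score (the band on the stick)
    introns: frozenset  # set of (start, end) intron coordinates


def pick_transcripts(transcripts, min_fraction=0.5):
    """Pick the best transcript per locus, then use it as a lever for isoforms."""
    by_locus = defaultdict(list)
    for t in transcripts:
        by_locus[t.locus].append(t)

    picked = {}
    for locus, members in by_locus.items():
        ranked = sorted(members, key=lambda t: t.score, reverse=True)
        primary = ranked[0]  # the highest-scoring stick
        chosen = [primary]
        for candidate in ranked[1:]:
            # Assumed rule: an alternative isoform shares an intron with the
            # primary, is not identical to it, and scores reasonably well.
            shares_intron = bool(candidate.introns & primary.introns)
            differs = candidate.introns != primary.introns
            good_enough = candidate.score >= min_fraction * primary.score
            if shares_intron and differs and good_enough:
                chosen.append(candidate)
        picked[locus] = [t.name for t in chosen]
    return picked
```

Everything not picked is simply left on the table, which is how the bulk of redundant or low-quality models is discarded.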

How is Mikado invaluable to scientific research?

Mikado was employed during the wheat genome project to provide the transcriptomic backbone of the wheat genome annotation. It was essential in cutting through the noise of twelve million potential transcripts, before starting the ab initio genome annotation process.

This helped in two ways. Firstly, the accuracy of the dataset was increased by removing many fragments, chimeras, and assorted artefacts that would have been problematic to solve later on. Mikado also calculated and reported the open reading frames (ORFs) for each transcript, so the output of this stage was already a good-quality, evidence-based gene annotation. This annotation could then be validated to find the best models to feed into the Augustus ab initio gene predictor as training data. The accuracy of this essential gene predictor is dependent on the quality of the initial dataset; Mikado’s precise analysis helps guide the final gene annotation.

Secondly, the final Mikado dataset reduced the number of transcripts more than 30-fold, from 12 million to approximately 350,000. This massive reduction in the dataset size was instrumental in making subsequent steps computationally feasible.
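As a quick check of the scale involved, the two rounded figures quoted above can be plugged into a couple of lines of Python. The exact counts are not given here, so the result is approximate.

```python
before = 12_000_000  # potential transcripts before Mikado (rounded figure from the text)
after = 350_000      # transcripts in the final Mikado dataset (approximate)

fold = before / after  # how many times smaller the dataset became
kept = after / before  # fraction of transcripts retained

print(f"{fold:.1f}-fold reduction; {kept:.1%} of transcripts retained")
# prints "34.3-fold reduction; 2.9% of transcripts retained"
```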

Mikado was also used in the parallel sequencing and assembly effort coordinated by the International Wheat Genome Sequencing Consortium (IWGSC). In this project, we exploited the flexibility of Mikado to perform a different task: rather than integrating multiple transcriptome assemblies, we used it to compare and select the most appropriate gene models from different genome annotation pipelines.

For each gene locus, Mikado was able to compare the two different annotations and assess how well each fitted with external validating data, such as the degree of support from Illumina or PacBio data, or the homology to known genes. The integrated annotation is the backbone of the current gene annotation for the IWGSC genome assembly.
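The per-locus comparison can be pictured with a small sketch. The data layout (one model per locus per annotation, evidence scores between 0 and 1) and the unweighted sum are assumptions chosen for illustration; they do not describe how the IWGSC annotation was actually scored.

```python
def select_models(annotation_a, annotation_b, support):
    """For each locus, keep the model with the stronger external support.

    annotation_a, annotation_b: dict mapping locus -> model id (hypothetical layout)
    support: dict mapping model id -> dict of evidence scores,
             e.g. {"illumina": 0.9, "pacbio": 0.5, "homology": 0.2}
    """
    def total_support(model):
        # Unweighted sum of all evidence types; a model with no evidence scores 0
        return sum(support.get(model, {}).values())

    selected = {}
    for locus in sorted(set(annotation_a) | set(annotation_b)):
        candidates = [
            model
            for model in (annotation_a.get(locus), annotation_b.get(locus))
            if model is not None
        ]
        # On a tie, max() keeps the first candidate, i.e. the one from annotation_a
        selected[locus] = max(candidates, key=total_support)
    return selected
```

Loci present in only one of the two annotations are carried over unchanged, so the integrated set covers the union of both inputs.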

What impact has it had on the bioinformatics community?

Apart from the above-mentioned wheat projects, Mikado has been used in the Fraxinus project, which was published in Nature, and we are currently supporting other researchers through our GitHub page.

For example, we helped Dr Torresen, based at the University of Oslo, to use Mikado on a fish species of interest, cod, and helped with annotating Solanum verrucosum, a wild relative of potato. Also, in a recent project led by EI and JIC, utilities from Mikado were instrumental in the gene annotation analysis of a diatom species and its survival in highly variable environments.

How does Mikado differ from other similar bioinformatics software?

There are not many tools which perform like Mikado. The operations performed by Mikado are usually done later, in conjunction with ab initio predictors - whereas we act earlier, directly on the transcript assemblies.

A key feature of Mikado, however, is that we don’t rely on transcriptomic data alone, which is the approach favoured by current tools, but rather integrate data from multiple sources: homology, junction analysis and ORF calling.

This additional information allows us to correctly define loci at little additional computational cost. It also allows us to identify and remove blatant artefacts that are present in the results of many tools, such as chimeric transcripts that exhibit more than one ORF.

These kinds of artefacts tend to be common with many RNA-Seq-centric tools, which typically ignore the coding information of the transcripts they assemble, and they can wreak havoc on downstream analyses.
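A rough sketch of the multiple-ORF signal is shown below. It assumes ORFs have already been called and are given as half-open (start, end) intervals on each transcript; the function names are invented for this example, and Mikado's own handling of chimeras is more involved than a simple flag.

```python
def count_disjoint_orfs(orfs):
    """Greedy count of mutually non-overlapping ORFs on one transcript.

    orfs: iterable of half-open (start, end) intervals in transcript coordinates.
    """
    count = 0
    last_end = None
    # Classic interval scheduling: sort by end, take every ORF that
    # starts at or after the end of the last one taken.
    for start, end in sorted(orfs, key=lambda orf: orf[1]):
        if last_end is None or start >= last_end:
            count += 1
            last_end = end
    return count


def flag_possible_chimeras(transcript_orfs):
    """Return names of transcripts carrying more than one independent ORF."""
    return [
        name
        for name, orfs in transcript_orfs.items()
        if count_disjoint_orfs(orfs) > 1
    ]
```

Overlapping ORFs, for instance in different frames, count as one here, so only transcripts with clearly separate coding regions are flagged as candidates for a closer look.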

How does Mikado work closely with the RNA-Seq bioinformatics tool Portcullis?

Portcullis is, and has always been, an essential ingredient for Mikado. The junctions it defines as reliable are used internally by Mikado in many ways, from helping to assess a single transcript to completely excluding particular isoforms as probable artefacts.
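To illustrate how a set of reliable junctions could feed into both uses, here is a minimal sketch. The representation of junctions as a plain set of (start, end) intron coordinates, the `min_support` threshold and the function names are assumptions for the example; they are not the Portcullis output format or Mikado's internal API.

```python
def introns_of(exons):
    """Introns implied by a transcript's exons, given as half-open (start, end)."""
    ordered = sorted(exons)
    return [(left_end, right_start)
            for (_, left_end), (right_start, _) in zip(ordered, ordered[1:])]


def junction_support(exons, reliable_junctions):
    """Fraction of a transcript's introns found among the reliable junctions.

    Returns None for monoexonic transcripts, which have no junctions to check.
    """
    introns = introns_of(exons)
    if not introns:
        return None
    confirmed = sum(intron in reliable_junctions for intron in introns)
    return confirmed / len(introns)


def drop_unsupported(transcripts, reliable_junctions, min_support=1.0):
    """Keep transcripts whose junction support reaches min_support."""
    kept = {}
    for name, exons in transcripts.items():
        support = junction_support(exons, reliable_junctions)
        if support is None or support >= min_support:
            kept[name] = exons
    return kept
```

The support fraction could feed into a transcript's score, while the threshold acts as the hard filter that excludes isoforms built on unconfirmed junctions.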

In all the projects cited above, Portcullis was used as a companion to Mikado. Our transcript assembly and evaluation pipeline, ‘Daijin’, which is contained in Mikado, explicitly includes Portcullis.

How would you sum Mikado up in one sentence?

“Hunt for the transcripts you need and want, in a transparent and reproducible way.”

Article author

Luca Venturini

Computational Biologist