Aligning RNA-Seq reads and assembling them into transcript models is a common task in genome annotation, but the plethora of available tools makes it difficult to know in advance which protocol will perform best overall. Moreover, the high dynamic range of expression across the gene loci of an organism means that a single tool or set of parameters will rarely be optimal across the whole genome. Mikado turns this variability from a liability into an advantage by combining the output of multiple transcript assemblers and selecting the best assembly at each locus. Unlike many other tools, Mikado does not rely on expression depth for its analysis; instead, it looks for transcript models that appear to have valid splice junctions and that seem to encode complete proteins. By relying on ORF content rather than only on expression depth, Mikado can also identify and resolve artifactual chimeric transcripts, a common problem in RNA-Seq assemblies. Although Mikado preferentially looks for protein-coding genes, it can concurrently define non-coding RNA loci.
In Mikado, the criteria that define which kinds of gene models the experimenter is looking for can easily be adjusted to suit the species under study. The pipeline has been tested on multiple species of varying transcript complexity, and Mikado has been used in multiple wheat genome annotation projects to help define the genome annotations now released to the community.
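The idea of choosing one model per locus from the pooled output of several assemblers can be illustrated with a toy sketch. This is only an illustration under stated assumptions, not Mikado's implementation or API: the `Transcript` fields, the `score` function, the weights and the assembler names are all hypothetical, and loci are assumed to be already defined. The weights dictionary stands in for the kind of species-specific tuning described above.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class Transcript:
    """Toy transcript model; the fields are illustrative, not Mikado's data model."""
    name: str
    locus: str                 # locus identifier (assumed to be precomputed)
    assembler: str             # hypothetical name of the assembler that produced it
    junctions_total: int       # number of splice junctions in the model
    junctions_validated: int   # junctions judged reliable by some external check
    has_complete_orf: bool     # True if the ORF has both a start and a stop codon


def score(tx: Transcript, weights: Dict[str, float]) -> float:
    """Score a model from junction support and ORF completeness only.

    Expression depth is deliberately not used. The weights are hypothetical.
    """
    if tx.junctions_total:
        junction_fraction = tx.junctions_validated / tx.junctions_total
    else:
        junction_fraction = 0.0  # mono-exonic model: no junction evidence
    return (weights["junctions"] * junction_fraction
            + weights["complete_orf"] * float(tx.has_complete_orf))


def pick_best_per_locus(transcripts: Iterable[Transcript],
                        weights: Dict[str, float]) -> Dict[str, Transcript]:
    """Return the highest-scoring transcript for each locus."""
    best: Dict[str, Transcript] = {}
    for tx in transcripts:
        current = best.get(tx.locus)
        if current is None or score(tx, weights) > score(current, weights):
            best[tx.locus] = tx
    return best


# Pooled models from two hypothetical assemblers over two loci.
candidates = [
    Transcript("asm_a.1", "locus1", "assembler_a", 4, 4, False),
    Transcript("asm_b.1", "locus1", "assembler_b", 4, 3, True),
    Transcript("asm_a.2", "locus2", "assembler_a", 2, 2, True),
    Transcript("asm_b.2", "locus2", "assembler_b", 2, 1, True),
]

# Hypothetical weights favouring complete ORFs over junction support.
weights = {"junctions": 1.0, "complete_orf": 2.0}

best = pick_best_per_locus(candidates, weights)
for locus, tx in sorted(best.items()):
    print(locus, tx.name, tx.assembler)
```

With these weights, a different assembler wins at each locus, which is the point of pooling: no single assembler has to be the best everywhere. Changing the weights changes which models are preferred.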
Mikado is freely available under the GNU Lesser General Public License v3.0; source code is available at:
GitHub code repository: https://github.com/lucventurini/mikado
PyPI stable releases: https://pypi.python.org/pypi/Mikado/1.0
Full documentation, including details of the extensive library API, is available on “Read the Docs”.
Article preprint: https://www.biorxiv.org/content/early/2017/11/09/216994