How our tools can help you: BioJulia

Software developer Dr Ben Ward introduces BioJulia, a new tool in modern programming for computer science and biology that helps bioinformaticians navigate their way out of the infamous tool and file-format black hole.

To those who haven’t heard of your tool, how would you sum it up in one sentence?

BioJulia provides an ecosystem of compatible software packages offering much of the functionality biologists need to program their scripts and workflows quickly and simply. It is a collaborative project, hosted on GitHub, with many different contributors who have worked on different aspects over the years.

Where did the name come from?

Julia is the name of the programming language that the BioJulia ecosystem is built with.

It is a high-level, high-performance dynamic programming language for numerical computing. It provides a sophisticated compiler, distributed parallel execution, numerical accuracy, and an extensive mathematical function library.

It has been dubbed by some as ‘C for scientists’. The idea was to create a programming language that lets scientists rapidly develop and test their ideas, but that results in very fast code. Something as easy and friendly as Python, but as fast and efficient as C.

Bioinformatics projects exist for other languages, including ‘BioPython’, ‘BioRuby’ and so on, so we named BioJulia according to that convention.

What would you say are the top three best features about your tool?

Consistency. I really enjoy the fact that the Julia language allows us to build a much more consistent way of processing many different bioinformatics file formats. Bioinformatics is littered with different file formats, each surrounded by its own tool-sets with their own way of doing things.
In BioJulia, the aim is that if you want to filter a GFF, BED, BAM, or VCF file for certain genomic locations, the way you’d do that is the same. In other words, the function or subroutine you write to achieve that task should be written once and ‘just work’, no matter the file format.

For me, the kind of API design we’re aiming for in BioJulia means less cognitive load learning the correct commands and flags for various tools, and less time writing horrible, unstable glue scripts. Instead, I can think about what I’m actually doing and why.
Speed. Not only are the Julia language and its compiler good at producing very fast code, but we also pick data structures and algorithms in our packages to be quick and efficient. With a correctly written script using BioJulia, there’s little reason to assume the performance at runtime will not approach that of a compiled piece of software written in C or C++.
Interaction. There is real potential for interoperability and for interactive, exploratory bioinformatics. The Julia language can easily interact with code written in other languages including C, C++, Python, and R.
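The ‘write once, works for any format’ idea from the consistency point can be sketched in a few lines. This is not BioJulia code and does not show BioJulia’s API: it is a minimal Python illustration, and every name in it (`Interval`, `parse_bed_line`, `parse_gff_line`, `overlapping`) is hypothetical. The assumption is that each format-specific parser normalises its records to one shared interval type (here, 1-based inclusive coordinates), so the filtering function never needs to know which file format a record came from.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass(frozen=True)
class Interval:
    """A genomic interval in 1-based, inclusive coordinates (hypothetical type)."""
    chrom: str
    start: int
    end: int


def parse_bed_line(line: str) -> Interval:
    # BED is 0-based and half-open, so shift the start by one to normalise.
    chrom, start, end = line.split("\t")[:3]
    return Interval(chrom, int(start) + 1, int(end))


def parse_gff_line(line: str) -> Interval:
    # GFF is already 1-based and inclusive; start and end are columns 4 and 5.
    fields = line.split("\t")
    return Interval(fields[0], int(fields[3]), int(fields[4]))


def overlapping(records: Iterable[Interval], chrom: str,
                start: int, end: int) -> Iterator[Interval]:
    """Yield every record overlapping chrom:start-end, whatever its source format."""
    for rec in records:
        if rec.chrom == chrom and rec.start <= end and rec.end >= start:
            yield rec


# Made-up example records, for illustration only.
bed_lines = ["chr1\t99\t200", "chr1\t500\t600", "chr2\t99\t200"]
gff_lines = [
    "chr1\tsrc\tgene\t150\t300\t.\t+\t.\tID=g1",
    "chr1\tsrc\tgene\t700\t900\t.\t+\t.\tID=g2",
]

# The same filter is applied to both formats without modification.
bed_hits = list(overlapping(map(parse_bed_line, bed_lines), "chr1", 100, 250))
gff_hits = list(overlapping(map(parse_gff_line, gff_lines), "chr1", 100, 250))
```

Only the parsers know about file formats; the query logic is written once. Supporting a further format under this sketch would mean adding one more parser and leaving `overlapping` untouched.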

How is your tool invaluable to scientific research? How is it making an impact at EI?

The ‘tool and file-format hell’ in bioinformatics means significant amounts of time are wasted on scripts whose only job is to get the output of one tool into another.

If you’re doing an experiment that is very standard, you’re fine - there’s a stepping stone series of tools and files you have to deal with. But if your analysis is non-standard, or even if you’re experimenting with new methods and going off the beaten path, things get difficult, hacky, and hard to reproduce very quickly.

One of the main impacts we want BioJulia to have is to make scripts straightforward to write and read, both for common bioinformatics workflows and for interactive exploration and experimentation with data.

What impact has it made in the bioinformatics community?

The Julia language that BioJulia is based on has not even reached version 1.0 yet. New languages take some time to build a following (and for some, programming language choice is a sacred and immutable choice, like religion or … choice of text editor).

So impact takes time: adoption of the language in general has to grow, BioJulia itself has to improve, and user-bases have to form so that their feedback can further develop the project. I do get emails and messages from people about BioJulia - just yesterday someone found a flaw in some of the BCF file reading utilities, which I fixed.

I’ve been invited to Cambridge, Manchester, Oxford, and now London to present talks on the topic of BioJulia. When I was writing my PhD thesis I had someone from global biotech company PacBio asking me about Julia and BioJulia. So it’s great to know people are paying attention to what we’re doing.

Last year I added several utilities to BioJulia libraries to detect selection pressures in sequences, compute population genetic statistics (such as genetic diversity and Tajima’s D), run the McDonald-Kreitman test, and estimate divergence times between sequences.
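For readers unfamiliar with the statistics mentioned here, the following is a minimal Python sketch of nucleotide diversity and Tajima’s D using the standard textbook formulas. It is not the BioJulia implementation, and the function names are hypothetical. It assumes the input is a set of aligned sequences of equal length with no gaps or ambiguous bases, and it returns `None` where D is undefined (fewer than four sequences, or no segregating sites).

```python
from itertools import combinations
from math import sqrt
from typing import Optional, Sequence


def segregating_sites(seqs: Sequence[str]) -> int:
    """Count alignment columns that contain more than one distinct base."""
    return sum(1 for column in zip(*seqs) if len(set(column)) > 1)


def nucleotide_diversity(seqs: Sequence[str]) -> float:
    """Mean number of pairwise differences (pi) over all pairs of sequences."""
    pairs = list(combinations(seqs, 2))
    total = sum(sum(a != b for a, b in zip(s1, s2)) for s1, s2 in pairs)
    return total / len(pairs)


def tajimas_d(seqs: Sequence[str]) -> Optional[float]:
    """Tajima's D, or None when the statistic is undefined for this sample."""
    n = len(seqs)
    s = segregating_sites(seqs)
    if n < 4 or s == 0:
        return None  # the variance term is zero or meaningless in these cases

    # Constants that depend only on the sample size n.
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n ** 2 + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)

    # D compares two estimators of diversity: pi and S / a1.
    pi = nucleotide_diversity(seqs)
    return (pi - s / a1) / sqrt(e1 * s + e2 * s * (s - 1))


# A toy alignment, made up for illustration.
alignment = ["ATGCA", "ATGCA", "ATGTA", "TTGTA"]
pi = nucleotide_diversity(alignment)
d = tajimas_d(alignment)
```

In this toy alignment two of the five columns vary. The sketch reports diversity as a per-alignment total rather than per site; dividing by the alignment length would give the per-site value.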

Whenever anyone asked for a series of these computations for their project, it took very little time. I simply wrote a script for their project that read in their data files and produced a table of the stats they wanted as output, along with any particulars they asked for, such as bootstrapping for non-parametric significance tests or particular custom plotting of results.

I ended up doing this for several collaborators, including one group studying Cryptosporidium populations and another studying Albugo candida populations. Both of these organisms are pathogens.
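As an illustration of the kind of bootstrapping mentioned above, here is a minimal Python sketch that resamples alignment columns with replacement to put a percentile interval around nucleotide diversity. The interview does not say which resampling scheme was actually used, so the choice of resampling sites, the function names, and the default parameters are all assumptions made for this example.

```python
import random
from itertools import combinations
from typing import List, Sequence, Tuple


def nucleotide_diversity(seqs: Sequence[str]) -> float:
    """Mean number of pairwise differences over all pairs of sequences."""
    pairs = list(combinations(seqs, 2))
    total = sum(sum(a != b for a, b in zip(s1, s2)) for s1, s2 in pairs)
    return total / len(pairs)


def bootstrap_diversity(seqs: Sequence[str], replicates: int = 1000,
                        seed: int = 0) -> List[float]:
    """Recompute diversity on alignments built by resampling columns."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    columns = list(zip(*seqs))
    estimates = []
    for _ in range(replicates):
        # Draw as many columns as the alignment has, with replacement.
        sample = rng.choices(columns, k=len(columns))
        resampled = ["".join(col[i] for col in sample) for i in range(len(seqs))]
        estimates.append(nucleotide_diversity(resampled))
    return estimates


def percentile_interval(values: Sequence[float],
                        level: float = 0.95) -> Tuple[float, float]:
    """Simple percentile interval taken from the sorted bootstrap estimates."""
    ordered = sorted(values)
    lower = int((1 - level) / 2 * (len(ordered) - 1))
    upper = int((1 + level) / 2 * (len(ordered) - 1))
    return ordered[lower], ordered[upper]


# A toy alignment, made up for illustration.
alignment = ["ATGCA", "ATGCA", "ATGTA", "TTGTA"]
estimates = bootstrap_diversity(alignment, replicates=200, seed=1)
low, high = percentile_interval(estimates)
```

With real data the alignment would be far longer than five columns, and the number of replicates and the interval method would be chosen to suit the analysis.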


How does your tool differ from other similar bioinformatics software?

The differences lie in the marriage of three things: the Julia language, which is itself quite different from the other languages commonly used in scientific computing; the data structures and algorithms we pick from the literature; and the design choices we’re making with the APIs.

Scripting and writing code with BioJulia should be high-level, semantic (meaning the programmer’s intent is as plain as possible from the code itself), and generic (meaning the code you write works for any similar data-type or file, regardless of the internal representation of the data-type or file).

How does the tool link to other/EI tools?

So far, I’m not aware of any tools at EI that link to it specifically. But I’m working on integrating a work-in-progress C++ toolset, which we’re developing for analysing genome graphs and assemblies, into the BioJulia package ecosystem.

What advice would you give to those who are looking to develop their own bioinformatics tool?

Make use of quality libraries of software that already exist!
Obviously in BioJulia we didn’t follow this advice: we’re writing such libraries for other people to use.

Your tool will require several different parts. For example, it will probably need to read in data from files at some point. It will probably need to write results out to files too. So look for libraries that read and write the kinds of file formats your tool works on (if you’re coding in Julia, I would obviously suggest BioJulia!). Don’t create your own from scratch.

This is for several reasons, including that the authors of the existing library have already spent the time and made (and fixed!) the mistakes (bugs!) you may end up repeating!
Another reason is also a second piece of advice…
It’s good to collaborate!
Even if a library doesn't quite do exactly what you need for your tool, most of these things are distributed under permissive licences that let you copy, modify, and share code. So copy it, modify it and stick it in your tool. Then give the changes or improvements back!
Contact the original authors, or if you’re on GitHub, you can give changes back with something called a ‘pull request’. You may find that you’ve not only fixed your problem but fixed the future problems of others.

BioJulia is a much bigger effort than could be done by one person alone; I’m just privileged to have ended up in a position of significant influence over the project. And while I’ve written significant amounts of code for the project since, there are plenty of other code writers and others who have contributed in other ways.

The software is on GitHub, and anyone, anywhere, can use it at any time under the permissive MIT open-source license.
Ben will be talking at JuliaCon 2018, at University College London in August. His talk is titled “BioJulia and Bioinformatics in Julia: Past, Present, Future”.

Ben Ward is in the Bioinformatic Algorithms Group at Earlham Institute.
