BioJulia: helping biologists surmount the "two language problem"
BioJulia is a community of biologists using a modern programming language to make programming more accessible to biologists, while providing a much more efficient platform for performing computer-driven research. We chatted to Dr. Ben Ward, visiting scientist at Earlham Institute, about his involvement in BioJulia.
Modern genomics research has brought together two worlds: computer science and the biological sciences. Working in step with each other, they let us finally answer broad and complex research questions that were previously inaccessible to science.
The two-language problem.
Typically, if you want to write a computer program you have a choice. You can use a user-friendly language fit for humans, such as Python, which, as Ben says, “all the cool kids are using now.”
Alternatively, you can use a more involved and verbose language such as C or C++, which are frequently used to build operating systems and other performance-critical applications. The difference, essentially, is how much you are willing to take care of.
With C, you are explicitly asking the computer to perform tasks - specifically requesting memory and then cleaning up afterwards. You have to take care of all the little details, which is good as you have finer control over what happens. You can interact with the hardware (the computer), which is essential if you’re doing systems programming.
As a biologist you are more likely to be using a higher-level language, such as Python, in order to accomplish day-to-day data and file processing tasks, and to prototype new algorithms and software.
A biologist might use Python, or R, to get something working quickly - let’s say a new sequence alignment method or a SNP analysis - but then realise the need for more speed, greater efficiency and better performance, especially with competition for the computer cluster to run tasks.
This means that either we identify the slowest parts of the high-level code, rewrite them in C (or C++), and link them back into the Python workflow - or, in the worst-case scenario, we have to rewrite the entire thing in C.
Both of these are easier said than done. It’s like learning two languages and then getting them to talk to each other. It’s the computing equivalent of Spanglish with some Mandarin thrown in for good measure.
So there is a trade-off between high-level, easy-to-use dynamic languages and more verbose languages like C, which are closer to the hardware and allow greater efficiency and speed.
This is the two-language problem. It makes computing and programming harder than it need be for biologists (and other scientists too!).
As a biologist, it can be annoying having to learn and work with multiple programming languages. As Ben says, “this is how bugs and mistakes happen. The more complicated something is, the more likely it is that something won’t work.”
It also makes peer review of work involving custom scripts and new programs difficult, as someone might know about the field in question, and a little bit of Python - but nothing about C. Or, someone might know both C and Python but be unsure as to the field in question.
The Julia programming language.
Julia, developed at MIT, is a high-level, dynamic programming language, which is as easy to read and write as Python or R. Julia has a combination of features that make it fast, including a flexible and expressive type system, type inference, aggressive code specialization, and just-in-time (JIT) compilation.
Together these features allow high-level and expressive Julia code to be compiled down to efficient, specialised machine code that runs well on the hardware. The finer details of how this is achieved have been written up by the language creators in a paper.
Julia is also useful because it is interactive, meaning you can try things out as you type to make sure they work. This is much quicker than using C, which requires you to write code, compile it, run it, find issues, make modifications, compile again, run again, and so on.
The practical upshot of all this for us as users of the language is that we can prototype our biological software and pipelines once using the language and find out where the speed bottlenecks are, just as in the other high-level languages.
Crucially, however, when it comes to speeding up these bottlenecks, all we have to do is write better Julia code that lets the compiler generate more efficient machine code, instead of wasting time integrating two languages or rewriting something completely.
To get a more efficient program, you just need to spend time writing better Julia. A great time saver.
Applications of BioJulia: the community.
Dr. Ben Ward, of UEA and Earlham Institute, teamed up with other biologists at Cambridge and beyond who all decided: “I want to do biology with this language!”
“I discovered julia during my PhD, at the time I was using R very heavily, and to try julia out I wanted to make a small phylogenetics package that I called Phylogenetics.jl, to learn by doing. It was nothing special, it had phylogenetic trees and some really really basic functions. Through GitHub and the internet I met a few other biologists, one person had created a small package called BioSeq.jl which had code for working with DNA sequences. We all decided to found the BioJulia community and work together on code.”
Together, these biologists started BioJulia, a growing community that is actively developing efficient Julia packages for biologists.
The community's flagship package is Bio.jl. It ties together modules that handle the most common biology and bioinformatics tasks, providing a core code infrastructure in the Julia language for biologists.
Ben added, “for example, the Bio.Seq module can be used by people to write julia programs that process DNA, RNA, and protein sequences. Bio.Align allows people to do DNA, RNA, and protein sequence alignments in their julia programs. Other modules such as Bio.Phylo for phylogenetics and Bio.Var for SNP/variant data and population genetics are being actively developed.”
The community aims to get more members, who, even if they don’t want to be a developer, can get involved; it might be in web design, or making tutorials, or simply using the code and providing feedback and notification of bugs.
Since the code is both efficient and easier to read and understand, they hope to get more biologists using it. As more people use it to create more and more high performance applications and programs, the annoyances of the two-language problem should start to become a thing of the past, and bioinformatics should become a little less scary.
As Ben says, “it’s not always easy to understand how it works, unless you create the tool yourself.”
BioJulia for faster computing.
The focus of Ben’s postdoctoral studies, between the Earlham Institute and the UEA, is building tools and infrastructure for studying population genomics, so that people can start writing useful population genetics pipelines with Julia.
The most important aspect of this, according to Ben, is design. “It’s not enough to use a ‘fast language’ like C or C++ or Julia, the great thing about julia is that combining ease of comprehension and performance means you can spend more time thinking about the best algorithms to use or implement”.
Ben explains, “Let me give one example where paying attention to data structure and algorithm design pays off: A lot of quickly written (rushed!) scripts just treat DNA as a string of characters, which is not very efficient as one character is represented with one byte. However, you can actually represent every nucleotide with 2 bits [¼ of a byte].
“Imagine A = 00, T = 01, G = 10 and C = 11 (or something like that). So you save space in memory, as 32 nucleotides can be squeezed into a single 64 bit Integer. And then you can do certain operations using bit level operations on that one 64 bit Integer, effectively doing the same operation on 32 nucleotides at the same time in one operation - saving time.
“Compare this with a sloppy implementation, where each nucleotide is stored as one character: then 32 nucleotides require 32 operations, not 1!”
“It’s surprising sometimes how much you can improve the performance of software and pipelines before you even start to think about multiple cores and parallel programming, although julia makes parallel programming easier too.”
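The two-bit idea Ben describes can be sketched in a few lines. The example below is only an illustration, written in Python for readability; it is not BioJulia's implementation. It assumes the code table from the quote (A = 00, T = 01, G = 10, C = 11), and the function names `pack`, `unpack` and `complement` are hypothetical. With this particular table, complementing a base only flips its lower bit, so one XOR against a fixed mask complements all 32 packed nucleotides at once.

```python
# Illustrative sketch only: the 2-bit table follows the example in the text
# (A=00, T=01, G=10, C=11); real libraries may choose a different table.
CODE = {"A": 0b00, "T": 0b01, "G": 0b10, "C": 0b11}
BASE = {bits: base for base, bits in CODE.items()}

MASK64 = (1 << 64) - 1           # keep results within 64 bits
LOW_BITS = 0x5555555555555555    # the pattern 01 repeated 32 times


def pack(seq):
    """Pack up to 32 nucleotides into one 64-bit word (first base in the lowest bits)."""
    if len(seq) > 32:
        raise ValueError("at most 32 nucleotides fit in 64 bits")
    word = 0
    for i, base in enumerate(seq):
        word |= CODE[base] << (2 * i)
    return word


def unpack(word, length):
    """Decode the first `length` nucleotides of a packed word."""
    return "".join(BASE[(word >> (2 * i)) & 0b11] for i in range(length))


def complement(word):
    """Complement every slot with a single XOR.

    Flipping the low bit of each pair swaps A<->T (00<->01) and G<->C (10<->11).
    Unused slots are flipped too, so callers must track the sequence length.
    """
    return (word ^ LOW_BITS) & MASK64


seq = "GATTACA"
packed = pack(seq)
print(unpack(complement(packed), len(seq)))  # CTAATGT
```

Stored as text, 32 nucleotides take 32 bytes; packed this way they take 8, and the complement costs one XOR instead of a loop over 32 characters.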
Occasionally, Ben adds, there is a trade-off to make: “Sometimes you have ambiguous characters, such as an N in biological sequences. That means you have a bigger alphabet and so that two bit encoding is not possible.
“However, you can still fit each nucleotide into a ‘nibble’ (4 bits, ½ a byte). Even then, you can fit 16 nucleotides in a single 64 bit Integer and can do operations on 16 nucleotides at once, and this is still a significant speed gain and saving of space.
“Additional speedups can come from the implicit parallelism possible thanks to the Single Instruction Multiple Data (SIMD) capabilities of modern processors.”
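The four-bit encoding can be sketched in the same way. The example below is an assumption-laden illustration in Python, not BioJulia's code: it assumes a table with one bit per base, so that an ambiguity symbol is simply the OR of the bases it may stand for (N is all four bits), and the names `pack` and `count_compatible` are hypothetical. Under that assumption, a single AND across two packed words finds the bases shared at each of the 16 positions, and a couple of shifts count how many positions are compatible.

```python
# Illustrative sketch only: an assumed 4-bit table with one bit per base.
# An ambiguity symbol is the OR of the bases it may represent.
A, C, G, T = 0b0001, 0b0010, 0b0100, 0b1000
CODE = {
    "A": A, "C": C, "G": G, "T": T,
    "R": A | G,          # purine: A or G
    "Y": C | T,          # pyrimidine: C or T
    "N": A | C | G | T,  # any base
}

NIBBLE_LOW = 0x1111111111111111  # lowest bit of each of the 16 nibbles


def pack(seq):
    """Pack up to 16 symbols into one 64-bit word (first symbol in the lowest nibble)."""
    if len(seq) > 16:
        raise ValueError("at most 16 symbols fit in 64 bits")
    word = 0
    for i, symbol in enumerate(seq):
        word |= CODE[symbol] << (4 * i)
    return word


def count_compatible(word_a, word_b):
    """Count positions where the two packed sequences share at least one possible base."""
    shared = word_a & word_b   # per-position intersection, all 16 at once
    shared |= shared >> 1      # fold each nibble's four bits...
    shared |= shared >> 2      # ...down into its lowest bit
    return bin(shared & NIBBLE_LOW).count("1")


# A/A, C/N, G/G and N/T are compatible; T/A is not.
print(count_compatible(pack("ACGTN"), pack("ANGAT")))  # 4
```

Empty slots are stored as 0000, so they never count as compatible and shorter sequences need no special handling here.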
With a little Julia, and some intelligent thinking about how algorithms are written, the BioJulia team believe it is possible to make bioinformatics software faster, more efficient, and easier to comprehend - giving biologists more time to think about important research questions, rather than writing code.