As computational power increases and becomes more accessible, the age of bioinformatics will accelerate our ability to understand and tackle global challenges like never before. From discovering new antibiotics to fighting pandemics or making agriculture more sustainable, the promise is great and the applications are already rolling in.
Bioinformatics is a broad scientific research field that combines biology, computer science, data science, mathematics and statistics to drive the analysis of the vast amount of data associated with modern bioscience.
In this post-genomics revolution era, a huge amount of that information relates to the study of DNA, RNA, and proteins, and the complex networks and ecosystems in which living organisms interact, as well as the crucially important metadata - data about data - which puts “omics” data in context.
Pinning down precisely what bioinformatics is can be complicated, and sometimes divisive, and the answer depends on who you ask, but that definition will do for the purposes of this article.
So, why is bioinformatics important?
Data doesn’t end at publication
Bioinformatics is important because experiments do not exist in a vacuum. The 2020 coronavirus pandemic shows that rapid data analysis and interpretation do far more to help control the spread of disease when the underlying data is shared quickly and openly.
But it’s not all about producing new data, when so much already exists. Analysing data is hugely important. Sharing the results of this requires “showing your working”: the data you used, the methods you employed, the software you used (with versions and parameters). This all takes time and effort, and bioinformaticians can help.
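As a rough illustration of what “showing your working” can look like in practice, the sketch below bundles a checksum of the input data, the software name and version, and the parameters used into a single record that could be published alongside the results. The file name, tool name, version, and parameter names are all hypothetical and chosen purely for illustration; this is one possible approach, not a description of any particular EI pipeline.

```python
import hashlib
import json
import platform
import sys


def sha256_of_bytes(data: bytes) -> str:
    """Return the SHA-256 checksum of the input data as a hex string."""
    return hashlib.sha256(data).hexdigest()


def build_provenance_record(input_name, input_bytes, tool_name, tool_version, parameters):
    """Collect the data, software, and parameter details needed to rerun an analysis."""
    return {
        "input": {"name": input_name, "sha256": sha256_of_bytes(input_bytes)},
        "software": {"name": tool_name, "version": tool_version},
        # Sort parameters so the record is stable from run to run.
        "parameters": dict(sorted(parameters.items())),
        "environment": {
            "python": platform.python_version(),
            "platform": sys.platform,
        },
    }


# Hypothetical example: a tiny FASTA record and an imaginary aligner.
example_reads = b">read1\nACGTACGT\n"
record = build_provenance_record(
    input_name="reads.fasta",
    input_bytes=example_reads,
    tool_name="example-aligner",
    tool_version="1.2.3",
    parameters={"threads": 4, "min_quality": 20},
)
print(json.dumps(record, indent=2))
```

Saving a record like this next to every result file means that anyone repeating the analysis can check they are starting from the same data, the same software version, and the same settings.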
Dr Matt Bawn, a researcher in the Hall Group at EI and the Kingsley Group at QIB, feels that this is very important because “it allows reproducibility, which is one of the strongest reasons for publishing scripts and pipelines openly.” The lack of reproducibility of scientific findings has become something of a crisis in recent years and poses a stark threat to the trustworthiness of research outcomes.
With proper storage of data, along with the metadata that gives context - “arguably just as valuable as the actual data” says Dr Bawn - bioinformatics allows existing datasets to be reused and amplified. Given the right tools and appropriate practices, a bioinformatician can develop novel hypotheses and add significant long-term value.
Of course, all of that requires the raw data to be made properly available, readily accessible, and easily findable. That’s where tools such as EI’s COPO come in. COPO acts as a data broker for life scientists, helping them to label their datasets properly so that they remain reusable for many years to come.
When that data is not properly labelled, or not even made available, we have a big problem.
An alarming paper, ‘No raw data, no science: another possible source of the reproducibility crisis’, presented a stark argument to the life science community. Raw data and open source software are crucial for the reproducibility of science, and the fact that these research outputs are not always made available in journals is a cause for real concern.
The generation and widespread adoption of bioinformatics infrastructures, and particularly an open access attitude to sharing biological datasets, is one route by which we might tackle this problem.
Open access bioinformatics resources for life scientists at EI.
You can see our range of tools and algorithms for bioscience, which include genome assembly, quality control and annotation tools, metagenomics tools, and systems biology tools that enable powerful multi-omics research, here.
EI can also support life scientists who require access to additional computational power through the National Capability in e-Infrastructure. You can find out more by checking out CyVerse UK, or contacting the team, who can help set you up with a virtual machine.
It’s part of the full package
Some would argue that bioinformatics is not just important; it’s integral to how we design today’s experiments.
Dr Wilfried Haerty, Group Leader at the Earlham Institute, says: “Bioinformatics is part of the full package. You need to consider how your data will be analysed while developing the experimental design.”
“The full package here is data generation and data analysis, where one can’t be planned without the other,” explains Dr Karim Gharbi, Head of Genomics Pipelines at EI. “I’ve seen so many projects fail because data was generated without consideration for analysis, and vice versa.”
Dr Rob Davey, Head of Research e-Infrastructure, echoes those sentiments, adding that robust interpretation of any dataset is always based on context. “So much of this important context is kept in our brains, or described in our own way in our lab notebooks and papers.
“To make use of all this information, a full package needs to be able to find, query, and interpret more and more data based on its metadata. For this, we need FAIR data - standardised, well-described, and openly available data that is fit for purpose.”
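To make the idea of finding and querying data by its metadata a little more concrete, here is a minimal sketch. It assumes each dataset carries a simple dictionary of metadata; the `Dataset` class, the `find_datasets` function, the accession numbers, and the field names are all invented for illustration and do not reflect how COPO or any real repository stores its records.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    """A dataset identified by an accession, described by free-form metadata."""
    accession: str
    metadata: dict = field(default_factory=dict)


def find_datasets(datasets, **criteria):
    """Return the datasets whose metadata matches every key/value in criteria."""
    return [
        d for d in datasets
        if all(d.metadata.get(key) == value for key, value in criteria.items())
    ]


# Hypothetical catalogue; the last entry is missing its organism label.
catalogue = [
    Dataset("DS0001", {"organism": "Triticum aestivum", "assay": "RNA-seq", "tissue": "leaf"}),
    Dataset("DS0002", {"organism": "Triticum aestivum", "assay": "WGS"}),
    Dataset("DS0003", {"organism": "Salmonella enterica", "assay": "WGS"}),
    Dataset("DS0004", {"assay": "WGS"}),
]

wheat_wgs = find_datasets(catalogue, organism="Triticum aestivum", assay="WGS")
print([d.accession for d in wheat_wgs])  # prints ['DS0002']
```

Note that DS0004 can never be found by organism, whatever it actually contains, because that context was never written down. That is the practical cost of metadata that lives only in someone’s head or lab notebook.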
Rob Davey and Felix Shaw teaching the importance of bioinformatics at the EI Open Day in 2019
One area in which that full package can come together to provide fresh insights, while paying enormous dividends for humanity, is driving new drug discovery. The availability of open data combined with the use of various bioinformatics tools and algorithms has enabled scientists to fast-track discovery relevant to treating or vaccinating against coronavirus.
In another example of huge importance for global health, the news earlier this year that machine learning had discovered a new candidate drug, one that could potentially kill multi-antibiotic resistant bacteria, gripped the medical and scientific community.
The world has been aching to discover such a compound for decades, and bioinformatics has helped achieve what a generation of lab-based scientists and pharmaceutical companies have thus far not been able to (despite their best efforts).
Of course, as Dr Ben Ward of the Clavijo Group at EI points out, “without the lab to confirm the predictions of a binding site and design of a potential drug candidate through development and testing - to show that it can be made and that it works in a real world scenario - the bioinformatics is useless. It’s all just a prediction until we can experimentally validate it.”
It all comes down to scale.
We live in an era in which we can sequence an entire human genome in real time, in just a few hours, for a few hundred pounds. DNA sequencing laboratories all over the world are churning out so much data that we’re having to build supercomputers the size of small villages to store and process it all.
With the right algorithms, that data is a treasure trove of information. With bioinformatics, the potential insights we can glean are practically infinite. Where the bioinformatician shines is in developing new tools or improving existing ones, asking the right questions and then interpreting the data to help us answer even more questions.
This all depends on understanding the data. As Dr Ben Ward stresses: “With the right data an algorithm can deliver some valuable insight. Many of the algorithms we have developed here at EI are so powerful because we have spent a lot of time understanding the properties of sequencing data and what we can do with it.”
A new way of approaching biological discovery
Historically, we’ve had to look at one small part of a complex system and painstakingly work out the processes behind the intricate biochemistry of life. Now, we can compare the genome sequences of thousands of individuals or species, along with all associated data, and mine them for associations that lead to the discovery of novel targets and novel compounds, building on the fantastic efforts of biology gone by.
“The new paradigm is data-driven research,” says Dr Bawn, which allows us to “move away from the limitations of traditional research.”
“Rather than working from one hypothesis, which may come from historical bias, we can instead explore the data which might indicate something that we can then test using the traditional method.”
Of course, that does not mean that we’re saying goodbye to hypothesis-driven research. Quite the opposite, as Dr Ben Ward assures us. “It’s not just about flinging a load of data at a problem and seeing what sticks. It’s having an attitude of data-driven development and testing of scientific questions.”
Humanity is always on the hunt for new drugs, new antibiotics, new sustainable compounds, healthy alternatives to sugar, and so on. By combining bioinformatics with traditional research - and with an open attitude to sharing data, metadata, and methods - the results are beginning to show that we can both accelerate our understanding of biological data and speed up how we apply it to real world problems for the benefit of us all.
Bioinformatics training opportunities at Earlham Institute.
At Earlham Institute, through our National Capability in Advanced Genomics and Computational Training, we are committed to training the next generation of bioinformaticians and increasing bioinformatics research capacity worldwide.
You can check out our training courses and events here.