Sherlock: elementary genomics
Soon to be open source, how is Sherlock helping to store and analyse bioinformatics big data?
Sherlock is, of course, the name of a fictional detective, but more importantly it is also the name of a dog. I can’t say much about the detective, but the dog belongs to my colleague David, and together we started to develop a small software platform at the Earlham Institute, called Sherlock.
The aim of this platform is to store and analyse bioinformatics data using modern big data technologies. We borrow concepts and open source tools from the top players of the software industry, like Facebook and Amazon.
Most bioinformatics projects start with gathering a lot of data. Often you work on your own data (such as expression levels or mutation data from real samples), but almost always you also need some external reference data.
You need genome annotations, gene ontologies, tissue-specific expression datasets, drug-related databases, and many, many other datasets. One of the reasons why we use the term ‘bioinformatics’ in the first place is that we can’t correlate and process all these datasets manually; we need the help of computers and databases.
Sherlock helps in these first steps of a bioinformatics project. We have combined many different databases into a standard data lake, so it becomes relatively easy to clean up, integrate and filter the data you need for a given project.
Imagine your colleague asks you to give her or him the sequence of all the genes which are highly expressed in the human brain, or asks you to produce a list of all the intracellular molecules which participate in the regulation of these genes.
Without Sherlock, you would have to download six different public databases from the internet, write a dozen scripts, and in general spend a whole day producing the answers.
But with Sherlock, you need just 15 minutes to formulate the right question for the big data query engine, and by the time you are back from the coffee machine with your tasty caffè macchiato, you already have all the data to send to your colleague.
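To give a flavour of what such a question looks like, here is a toy sketch using Python’s built-in `sqlite3` module. The table names, columns, gene symbols and expression threshold are all invented for illustration; Sherlock’s real data lake schema (and its query engine) is not public yet, so treat this only as the shape of the SQL you would write.

```python
import sqlite3

# Toy stand-ins for two integrated datasets in the data lake:
# a tissue-specific expression table and a gene-regulation table.
# All names and values here are invented for this sketch.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE expression (gene TEXT, tissue TEXT, tpm REAL);
CREATE TABLE regulation (regulator TEXT, target TEXT, location TEXT);

INSERT INTO expression VALUES
  ('GFAP',   'brain', 820.0),
  ('SNAP25', 'brain', 640.0),
  ('ALB',    'liver', 900.0),
  ('GFAP',   'liver',   2.0);

INSERT INTO regulation VALUES
  ('REST',  'SNAP25', 'intracellular'),
  ('STAT3', 'GFAP',   'intracellular'),
  ('IGF1',  'GFAP',   'extracellular');
""")

# "All intracellular molecules regulating genes highly expressed
# in the human brain" as a single join, instead of a day of scripting.
rows = cur.execute("""
SELECT DISTINCT r.regulator
FROM expression e
JOIN regulation r ON r.target = e.gene
WHERE e.tissue = 'brain'
  AND e.tpm > 100
  AND r.location = 'intracellular'
ORDER BY r.regulator
""").fetchall()

print([r[0] for r in rows])  # ['REST', 'STAT3']
```

The point is not the particular engine but the workflow: once the reference databases live side by side in one queryable store, the colleague’s request collapses into one declarative query.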
Currently the tool is not open source yet, so we are the only users, and by ‘we’ I mean the Korcsmáros group at the Earlham Institute. In this lab we have spent a lot of energy on developing complex databases. The Korcsmáros group maintains web resources including SignaLink, NRF2-ome, and the Autophagy Regulatory Network database.
SignaLink is among the top 10 signalling network resources according to independent evaluations, and received more than 35,000 visitors in the past year. I think we are pretty good at understanding and processing various bioinformatics databases.
We are currently about to finalise the next major upgrade of the SignaLink database, and to do so we needed to evaluate, process and integrate dozens of different open databases (protein interaction databases, regulatory databases, and so on) for humans and three other species. All these databases are combined in SignaLink in a very specific way, focusing on signalling pathways.
Using an integrated database (like SignaLink) is easier and more efficient, assuming the user really needs the data in that same representation and shares the priorities behind the given database. Sherlock, on the other hand, is far more flexible.
Using Sherlock, you can build your own complex database, filtered and integrated according to your needs. We have added to Sherlock all the dozens of open databases we used to build SignaLink, and many more besides.
However, nothing comes without a cost: using Sherlock requires some effort from you. You need a deeper understanding of the original databases, and you need to be able to formulate your questions in SQL.
SQL is very popular among database experts and data scientists, being the de facto standard language for fetching information from relational databases. Until recently, using SQL on top of large (say, multi-terabyte) datasets was a big challenge.
However, with the continuous evolution of big data technologies, we now have a good, stable and scalable way to execute SQL queries directly on top of database files stored in standardised, cloud-compatible storage.
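The general pattern of querying stored dataset files in place, without first loading them into a database server, can be sketched with the standard library alone. Sherlock itself relies on a distributed query engine over cloud object storage rather than SQLite, so the snippet below is only a single-machine stand-in for that idea; the file path and table are invented for the example.

```python
import os
import sqlite3
import tempfile

# Write a small "dataset file" once, as a producer would.
# (Toy schema; in Sherlock the files would sit in an object store.)
path = os.path.join(tempfile.mkdtemp(), "annotations.db")
with sqlite3.connect(path) as conn:
    conn.execute("CREATE TABLE genes (symbol TEXT, chromosome TEXT)")
    conn.executemany("INSERT INTO genes VALUES (?, ?)",
                     [("TP53", "17"), ("BRCA1", "17"), ("MYC", "8")])

# Any consumer can then run SQL directly against the stored file,
# read-only, with no import step and no database server in between.
ro = sqlite3.connect(f"file:{path}?mode=ro", uri=True)
count = ro.execute(
    "SELECT COUNT(*) FROM genes WHERE chromosome = '17'"
).fetchone()[0]
print(count)  # 2
```

Scaled up, the same separation of cheap storage from a stateless SQL layer on top is what makes the multi-terabyte case tractable.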
At the Earlham Institute we are fortunate to have both the necessary technologies and also the computational power we need to develop and use Sherlock.
However, depending on the size of the data and of the projects, it should be relatively easy for any research group to install its own Sherlock instance on its own servers or in a public cloud.
In the next few months we are planning to fully open-source Sherlock, providing documentation and examples of how anyone can deploy, use and even extend it.
Using Sherlock is not necessarily expensive compared to other big data analytical frameworks. A reasonable installation (1 TB of object storage and a query engine with three moderate 16 GB RAM nodes) would not cost more than £200–£250 per month, depending on which cloud provider you prefer.
Such an installation is already very useful for data exploration and medium complexity analytics.
Having some hands-on industry experience with big data and cloud technologies, I personally would be happy to see them used much more heavily in the field of bioinformatics.
Sherlock uses only a few of these new technologies, and there are many more interesting use cases to cover with interactive notebooks, machine learning and data visualisation tools.
I am looking forward to seeing how all these technologies will impact our research.
Working at the Earlham Institute definitely gives you the opportunity to spend your time on valuable and interesting projects.
Recently, as a translational project, the group launched the development of EI’s NavigOmiX, an integrated software solution and workflow management tool that allows other research groups and SMEs to easily perform systems biology and ‘omics meta-analysis projects without advanced expertise in computational biology and programming. I am responsible for the technical design and prototype development of this solution.
Besides this, I am also helping other researchers in the group to develop a novel precision medicine pipeline based on classifying the single-point mutation profiles of patients.
Here I focus on software quality and on stabilising the analytical pipeline, enabling us to run the algorithm robustly at a larger patient scale.