Sherlock: elementary genomics

Soon to be open source, how is Sherlock helping to store and analyse bioinformatics big data?

March 18, 2019

How is Sherlock helping bioinformaticians to store and analyse big data? Máté Szalay-Bekő of the Korcsmáros Group fills us in on this soon-to-be open source software made at EI, as well as the group’s other exciting projects, from systems biology to personalised healthcare.

Hi Máté, what is Sherlock?

On the one hand, Sherlock is the name of a fictional detective, but more importantly it is also the name of a dog. I can’t say much about the detective in this context, but the dog belongs to my colleague David, and together we started to develop a small software platform at the Earlham Institute, called Sherlock.

The aim of this platform is to store and analyse bioinformatics data using modern big data technologies. We borrow concepts and open source tools from the top players of the software industry, like Facebook and Amazon.

Sherlock combines multiple databases into a standard data lake for bioinformatics projects

How can Sherlock help bioinformaticians?

Most bioinformatics projects start with gathering a lot of data. Often you work on your own data (like expression levels or mutation data you get from real samples), but almost all the time you also need to have some external reference data.

You need genome annotations, gene ontologies, tissue-specific expression datasets, drug-related databases, and many, many other datasets. One of the reasons we use the term ‘bioinformatics’ in the first place is that we can’t correlate and process all these datasets manually; we need the help of computers and databases.

Sherlock helps with these first steps of a bioinformatics project. We have combined a lot of different databases into a standard data lake, so it becomes relatively easy to clean up, integrate and filter the data you need for a given project.

What are some good examples of how Sherlock can be applied?

Imagine your colleague asks you to give her or him the sequence of all the genes which are highly expressed in the human brain, or asks you to produce a list of all the intracellular molecules which participate in the regulation of these genes.

Without Sherlock, you have to download six different public databases from the internet, write a dozen scripts, and in general spend a whole day on producing the answers.

But using Sherlock, you just need 15 minutes to formulate a correct question for the big data query engine, and when you arrive back from the coffee machine with your tasty caffè macchiato, you already have all the data to send to your colleague.
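To make the two example questions concrete, here is a minimal sketch of how they might look as SQL. It is not Sherlock's actual schema or query engine: Python's built-in sqlite3 module stands in for the big data query engine, and every table name, column name, gene identifier, value and the expression cut-off are hypothetical, invented purely for illustration.

```python
import sqlite3

# Hypothetical, tiny stand-ins for three reference tables in a data lake.
# All table names, column names and values below are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE expression (gene_id TEXT, tissue TEXT, tpm REAL);
    CREATE TABLE sequences  (gene_id TEXT, sequence TEXT);
    CREATE TABLE regulation (regulator TEXT, target_gene_id TEXT, location TEXT);

    INSERT INTO expression VALUES
        ('GENE_A', 'brain', 250.0),
        ('GENE_A', 'liver', 3.0),
        ('GENE_B', 'brain', 4.5),
        ('GENE_C', 'brain', 120.0);
    INSERT INTO sequences VALUES
        ('GENE_A', 'ATGGCC'),
        ('GENE_B', 'ATGTTT'),
        ('GENE_C', 'ATGAAA');
    INSERT INTO regulation VALUES
        ('REG_1', 'GENE_A', 'intracellular'),
        ('REG_2', 'GENE_A', 'extracellular'),
        ('REG_3', 'GENE_C', 'intracellular'),
        ('REG_4', 'GENE_B', 'intracellular');
""")

# Assumed cut-off for "highly expressed"; a real project would choose its own.
HIGH_EXPRESSION_TPM = 100.0


def brain_gene_sequences(conn, min_tpm=HIGH_EXPRESSION_TPM):
    """Return (gene_id, sequence) for genes highly expressed in the brain."""
    query = """
        SELECT e.gene_id, s.sequence
        FROM expression AS e
        JOIN sequences AS s ON s.gene_id = e.gene_id
        WHERE e.tissue = 'brain' AND e.tpm >= ?
        ORDER BY e.gene_id
    """
    return conn.execute(query, (min_tpm,)).fetchall()


def intracellular_regulators(conn, min_tpm=HIGH_EXPRESSION_TPM):
    """Return intracellular regulators of genes highly expressed in the brain."""
    query = """
        SELECT DISTINCT r.regulator
        FROM regulation AS r
        JOIN expression AS e ON e.gene_id = r.target_gene_id
        WHERE e.tissue = 'brain'
          AND e.tpm >= ?
          AND r.location = 'intracellular'
        ORDER BY r.regulator
    """
    return [row[0] for row in conn.execute(query, (min_tpm,))]


print(brain_gene_sequences(conn))
print(intracellular_regulators(conn))
```

The point of the sketch is only the shape of the work: once the reference datasets sit side by side as tables, each question becomes a single join with a filter rather than a set of download and parsing scripts.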

Sherlock can help to formulate the correct questions for big data queries with rapid results for analysis

Who is using Sherlock currently?

The tool is not open yet, so we are the only users, and by ‘we’ I mean the Korcsmáros group at the Earlham Institute. In this lab we have put a lot of energy into developing complex databases. The Korcsmáros group maintains web resources including SignaLink, NRF2-ome, and the Autophagy Regulatory Network database.

SignaLink is among the top 10 signalling network resources according to independent evaluations, and received more than 35,000 visitors in the past year. I think we are pretty good at understanding and processing various bioinformatics databases.

Currently we are about to finalize the next major upgrade of the SignaLink database and in order to do so, we needed to evaluate, process and integrate dozens of different open databases, such as protein interaction databases, regulatory databases, etc. for humans and three other species. All these databases are combined in SignaLink in a very specific way, focusing on signalling pathways.

How is Sherlock different from the databases developed before by the Korcsmáros group?

Using an integrated database (like SignaLink) is easier and more efficient, assuming the user really needs the data in the same representation and shares the priorities behind the given database. Sherlock, on the other hand, is far more flexible.

Using Sherlock, you can build your own complex database, filtered and integrated according to your needs. We have added to Sherlock all of the dozens of open databases we used to build SignaLink, and many more.

However, nothing comes without a cost: using Sherlock requires some effort from you. You need a deeper understanding of the original databases, and you need to be able to formulate your questions in SQL.

The SQL language is very popular among database experts and data scientists, being the de facto standard language to fetch information from relational databases. Up until recent years, using the SQL language on top of large (let’s say multi-terabyte size) datasets was a big challenge.

However, with the continuous evolution of big data technologies, we now have a good, stable and scalable way to execute SQL queries directly on top of database files that we keep in standardized, cloud-compatible storage.

How can more people use Sherlock?

At the Earlham Institute we are fortunate to have both the necessary technologies and also the computational power we need to develop and use Sherlock.

However, depending on the size of the data and the projects, it should be relatively easy for any research group to install their own Sherlock version into their own servers or into a public cloud.

In the next few months we are planning to completely open-source Sherlock, providing documentation and examples showing how anyone can deploy, use and even extend it.

Using Sherlock is not necessarily expensive compared to other big data analytical frameworks. A reasonable installation (a 1 TB object store and a query engine with three moderate 16 GB RAM nodes) would not cost more than £200-£250 monthly, depending on which cloud provider you prefer.

Such an installation is already very useful for data exploration and medium complexity analytics.

Sherlock can easily be installed on cloud or private servers and is cost-effective compared to other big data analytical frameworks

What other tools and resources would you like to see?

Having some hands-on industrial experience with big data and cloud technologies, I personally would be happy to see them being used much more heavily in the field of bioinformatics.

Sherlock is only using a few of these new technologies and there are many more interesting use cases to cover using interactive notebooks, machine learning and data visualization tools.

I am looking forward to seeing how all these technologies will impact our research.

What other exciting projects are you working on?

Working at the Earlham Institute definitely gives you the opportunity to spend your time on valuable and interesting projects.

Recently, as a translational project, the group has launched the development of EI’s NavigOmiX, an integrated software solution and workflow management tool to allow other research groups and SMEs to easily perform systems biology and ‘omics meta-analysis projects without advanced expertise in computational biology and programming. I am responsible for the technical design and the prototype development of this software solution.

Besides these, I am also helping other researchers in the group to develop a novel precision medicine pipeline based on the classification of the single point mutation profile of patients.

Here I am focusing on software quality and on stabilizing the analytical pipeline, so that we can run the algorithm robustly on a larger number of patients.

Máté is also working on the technical design of EI's NavigOmiX software and on the development of a novel precision medicine pipeline

Article author

Mate Szalay-Beko

Software Developer