Data lake approaches for storing and processing Big Data in biology
Cutting-edge big data technologies enabling high data quality and fast bioinformatics analytics.
Imagine a colleague asks you for the sequences of all the genes that are highly expressed in the human brain, or for a list of all the intracellular molecules that participate in the regulation of those genes. Before this project, you had to download six different public databases from the internet, write a dozen scripts, and in general spend a whole day producing the answers. With our resource, you need just 15 minutes to formulate the right question for the big data query engine, and by the time you are back from the coffee machine with your tasty caffè macchiato, you already have all the data to send to your colleague.
In our work we perform routine data-related tasks on a daily basis: fetching different biological data sets, converting them to a common format, filtering them, joining them with other data sets, and then sharing the results with colleagues and collaborators. This project helps us with exactly this type of task. It enables us to store all our data in a structured way, using an industry-standard data lake solution. On top of this data management layer we run the Presto big data query engine, which can access the data lake and execute all kinds of highly optimized, distributed analytics.
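The filter-and-join pattern described above can be sketched in a few lines. The table contents, field names, and the 10 TPM "highly expressed" cut-off below are purely illustrative, not real data or our actual schema:

```python
# Hypothetical sketch: filter an expression table to genes highly expressed
# in brain tissue, then join with an annotation table on gene_id.
# All values and the threshold are illustrative.

expression = [
    {"gene_id": "ENSG01", "tissue": "brain", "tpm": 55.2},
    {"gene_id": "ENSG02", "tissue": "brain", "tpm": 1.3},
    {"gene_id": "ENSG03", "tissue": "liver", "tpm": 80.0},
]
annotation = {
    "ENSG01": {"symbol": "GENE1", "sequence": "ATGGCT"},
    "ENSG02": {"symbol": "GENE2", "sequence": "ATGTTT"},
}

HIGH_EXPRESSION_TPM = 10.0  # illustrative cut-off

def brain_expressed_genes(expression, annotation, threshold=HIGH_EXPRESSION_TPM):
    """Keep brain samples above the threshold, then join on gene_id."""
    result = []
    for row in expression:
        if row["tissue"] == "brain" and row["tpm"] >= threshold:
            ann = annotation.get(row["gene_id"])
            if ann is not None:
                result.append({**row, **ann})
    return result
```

In practice these joins run inside the query engine rather than in application code, but the logic is the same.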
The data lake is the de facto storage solution in big data projects. It provides the flexibility to upload files in large volumes using different standardized and optimized data formats. It gives more structure than simple shared folders, while requiring less modelling than traditional SQL (or NoSQL) databases. We use S3 (a standardized object store solution, originally developed by Amazon) and have developed a simple data-load protocol that lets us store different biological databases in a versioned way.
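A minimal sketch of what such a versioned key layout might look like, assuming a source/version/file naming convention with lexicographically sortable version strings (e.g. ISO dates). The layout and names are assumptions for illustration, not our actual protocol:

```python
# Hypothetical versioned key convention for objects in an S3 data lake.
# A real data-load protocol would also record checksums and load metadata.

def object_key(source: str, version: str, filename: str) -> str:
    """Compose a versioned object key, e.g. 'ensembl/2024-01-15/genes.parquet'."""
    return f"{source}/{version}/{filename}"

def latest_version(keys: list, source: str) -> str:
    """Find the newest version prefix for a source, assuming version
    strings (e.g. ISO dates) sort lexicographically."""
    versions = {k.split("/")[1] for k in keys if k.startswith(source + "/")}
    return max(versions)
```

Keeping every load under its own version prefix means older snapshots of a public database remain queryable after an update.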
Query engines are used in big data projects to execute ad hoc queries on top of the data stored in data lakes. There are many widely used query engine solutions (e.g. Hive, Impala or Spark). We chose the Presto query engine because it performs very well in our use cases and is easy to operate. We developed our own Docker Swarm stack, which lets us scale our Presto cluster up or down easily, depending on our need for analytical capacity. Our dockerized solution is portable to any cloud environment, but can also be deployed on on-premises hardware.
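An ad hoc question like the brain-genes example can be expressed as a single SQL query submitted to the engine. The catalog, schema, table, and column names below are hypothetical placeholders; in practice the string would be sent to the cluster through a Presto client connection:

```python
# Sketch of composing an ad hoc Presto SQL query over tables in the data
# lake. The schema (hive.expression.tissue_tpm, hive.annotation.genes) and
# the threshold are illustrative assumptions.

def brain_genes_query(tpm_threshold: float = 10.0) -> str:
    """Compose a query joining expression and annotation tables to list
    genes highly expressed in brain tissue (hypothetical schema)."""
    return f"""
        SELECT a.gene_id, a.symbol, a.sequence
        FROM hive.expression.tissue_tpm AS e
        JOIN hive.annotation.genes AS a ON e.gene_id = a.gene_id
        WHERE e.tissue = 'brain' AND e.tpm >= {tpm_threshold}
    """
```

The point of the query engine is that this join and filter run distributed across the cluster, directly over the files in the lake, with no per-question scripting or downloads.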
We are adding automated tests to improve data quality, while an automated backup solution minimizes data loss during operation. Even so, our data platform is very cheap to run. We use internal resources at the Earlham Institute, but replicating the stack we run in our lab on standard cloud resources (a data lake holding up to 500 GB of data and a modest query engine: a Docker Swarm cluster with 32 GB of RAM) would cost no more than about £150 per month.
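As an illustration of the kind of automated data-quality test we mean, a check on a loaded gene table might validate that identifiers are present and unique and that sequences contain only DNA letters. The field names and rules here are illustrative assumptions, not our actual test suite:

```python
# Hypothetical data-quality check for a loaded gene table.
# Returns a list of human-readable problems; an empty list means the
# table passes.

VALID_BASES = set("ACGTN")

def check_gene_table(rows):
    problems = []
    seen = set()
    for i, row in enumerate(rows):
        gene_id = row.get("gene_id", "")
        if not gene_id:
            problems.append(f"row {i}: missing gene_id")
        elif gene_id in seen:
            problems.append(f"row {i}: duplicate gene_id {gene_id}")
        seen.add(gene_id)
        if not set(row.get("sequence", "")).issubset(VALID_BASES):
            problems.append(f"row {i}: invalid characters in sequence")
    return problems
```

Running checks like this automatically after every data load catches broken or truncated source files before they reach an analysis.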
Using this data lake approach enables us to increase both the quality and the efficiency of our bioinformatics work.