Data lake approaches for storing and processing Big Data in biology
Cutting-edge big data technologies that enable high data quality and fast bioinformatics analytics.
Imagine a colleague asks you for the sequences of all the genes that are highly expressed in the human brain, or for a list of all the intracellular molecules that participate in the regulation of these genes. Before this project, you had to download six different public databases from the internet, write a dozen scripts, and generally spend a whole day producing the answers. With our resource, you need just 15 minutes to formulate the right query for the big data query engine, and by the time you are back from the coffee machine with your tasty caffe macchiato, you already have all the data to send to your colleague.
In our work we perform routine data-related tasks on a daily basis: fetching different biological data sets, bringing them into a common format, filtering them, joining them with other data sets, and then sharing the results with our colleagues and collaborators. This project helps us with these types of tasks. It enables us to organize all our data in a structured way, using an IT industry standard Data Lake solution. On top of this data management layer we use the Presto big data query engine, which can access the data lake and execute many kinds of highly optimized, distributed analytics.
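To make the filter-and-join workflow concrete, here is a minimal sketch of the kind of question described above (genes highly expressed in a given tissue, together with their sequences). It is only an illustration under stated assumptions: Python's built-in sqlite3 stands in for the Presto engine so the example runs anywhere, and the table names (genes, expression), column names, the TPM threshold, and the toy rows are all hypothetical, not the project's actual schema or data.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with two hypothetical tables and toy rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE genes (gene_id TEXT PRIMARY KEY, symbol TEXT, sequence TEXT)"
    )
    conn.execute("CREATE TABLE expression (gene_id TEXT, tissue TEXT, tpm REAL)")
    # Placeholder genes and sequences, invented purely for illustration.
    conn.executemany(
        "INSERT INTO genes VALUES (?, ?, ?)",
        [("G1", "GENE_A", "ATGC"), ("G2", "GENE_B", "ATGG"), ("G3", "GENE_C", "ATGA")],
    )
    # Placeholder expression levels, invented purely for illustration.
    conn.executemany(
        "INSERT INTO expression VALUES (?, ?, ?)",
        [
            ("G1", "brain", 120.0),
            ("G2", "brain", 3.5),
            ("G3", "brain", 80.0),
            ("G2", "liver", 200.0),
        ],
    )
    return conn


def highly_expressed_genes(conn, tissue, min_tpm):
    """Return (symbol, sequence) pairs for genes at or above min_tpm in a tissue."""
    query = """
        SELECT g.symbol, g.sequence
        FROM expression AS e
        JOIN genes AS g ON g.gene_id = e.gene_id
        WHERE e.tissue = ? AND e.tpm >= ?
        ORDER BY g.symbol
    """
    return conn.execute(query, (tissue, min_tpm)).fetchall()
```

Calling highly_expressed_genes(build_demo_db(), "brain", 50.0) on the toy data returns the two placeholder genes whose brain expression meets the threshold. In the setting described here, a similar SQL statement would be submitted to the query engine, which would run it across the data lake instead of a local in-memory database.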
This data lake approach lets us improve both the quality and the efficiency of our bioinformatics work.