MultiPEN: A network-driven approach to combine multi-omics datasets
Combined analysis of multi-omics data to study living systems.
Start date: April 2015
End date: Ongoing
In recent years, omics technologies have become more accessible, and many studies now generate genome-wide datasets that measure the different levels of components of a biological system. These datasets could be an invaluable source of knowledge about the inner workings of an organism. However, handling such large amounts of data and extracting biological knowledge from multi-omics datasets remains challenging. Classical approaches to dealing with more than one type of omics data analyse each data type separately. However, there is evidence that a combined approach (i.e., one in which the omics data are analysed together) can reveal information that would otherwise be overlooked.
In this project, we are developing computational strategies for the combined analysis of omics data. We start by developing a network-driven integrative approach for gene expression and metabolomics data, which exploits a priori biological knowledge in the form of interaction networks.
In this work, we look at the problem of omics data integration using biological knowledge as a prior in a multivariate statistical model. We start with a penalised logistic regression approach for gene selection. This approach is used to analyse transcriptomics data and find the subset of genes that are potentially influential in separating two conditions (e.g. healthy vs disease). The model uses a protein-protein interaction network (represented as an undirected graph) as prior knowledge to identify groups of connected elements that collectively change between conditions. We are studying the network’s effect on gene selection by testing networks compiled from different sources and with different topological properties.
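To make the idea of a network prior in a penalised logistic regression more concrete, the following is a minimal sketch and not MultiPEN's actual implementation. The exact penalty is not stated here, so the sketch assumes a lasso term plus a graph-Laplacian smoothness term, which is one common form of network-constrained regression, and it fits the model with a simple proximal gradient loop. All function names, parameter values and the synthetic data are hypothetical.

```python
import numpy as np


def graph_laplacian(n_features, edges):
    """Unnormalised Laplacian L = D - A of an undirected graph."""
    A = np.zeros((n_features, n_features))
    for i, j in edges:
        A[i, j] = A[j, i] = 1.0
    return np.diag(A.sum(axis=1)) - A


def soft_threshold(w, t):
    """Proximal operator of t * ||w||_1 (shrinks weights towards zero)."""
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)


def fit_network_logistic(X, y, L, lam_l1=0.05, lam_net=0.1, n_iter=2000):
    """Assumed objective (illustrative only):
    mean logistic loss + lam_l1 * ||w||_1 + lam_net * w' L w.
    """
    n, p = X.shape
    # Step size from an upper bound on the curvature of the smooth part.
    X_aug = np.hstack([X, np.ones((n, 1))])
    lip = (0.25 * np.linalg.norm(X_aug, 2) ** 2 / n
           + 2.0 * lam_net * np.linalg.norm(L, 2))
    step = 1.0 / lip
    w, b = np.zeros(p), 0.0
    for _ in range(n_iter):
        z = np.clip(X @ w + b, -30.0, 30.0)
        resid = 1.0 / (1.0 + np.exp(-z)) - y  # predicted probability minus label
        grad_w = X.T @ resid / n + 2.0 * lam_net * (L @ w)
        # Gradient step on the smooth part, then soft-thresholding for the L1 term.
        w = soft_threshold(w - step * grad_w, step * lam_l1)
        b -= step * resid.mean()  # the intercept is not penalised
    return w, b


# Synthetic example: 6 "genes"; genes 0-2 are connected in the network
# and shifted between the two conditions, genes 3-5 are pure noise.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200).astype(float)
X = rng.normal(size=(200, 6))
X[:, :3] += 1.5 * y[:, None]
X -= X.mean(axis=0)

L = graph_laplacian(6, [(0, 1), (1, 2)])
w, b = fit_network_logistic(X, y, L)

# Rank features by absolute weight; zero weights are treated as "not selected".
ranking = np.argsort(-np.abs(w))
print("ranking:", ranking)
print("weights:", np.round(w, 3))
```

In this assumed formulation, the Laplacian term encourages connected genes to receive similar weights, so a connected group with a consistent but modest signal can be retained together, while the lasso term removes features with little support.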
Subsequently, we explore the challenges of multi-omics integration, for which we modify our network to include interactions with metabolites. This extended network is then used in our regression model for the combined analysis of transcriptomics and metabolomics data.
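As a rough illustration of how a gene network might be extended with metabolites, the sketch below merges gene-gene and gene-metabolite interactions into one undirected edge list and places both data types in a single feature matrix. It is a sketch under stated assumptions rather than the procedure MultiPEN uses: the function names, the per-feature standardisation and the placeholder identifiers (`geneA`, `metX` and so on) are all hypothetical.

```python
import numpy as np


def build_combined_network(genes, metabolites, gene_edges, gene_metabolite_edges):
    """Merge both edge lists into one undirected graph over genes + metabolites.

    Returns the node names (genes first, then metabolites) and a sorted list
    of unique index pairs (i, j) with i < j. Edges that mention an unknown
    node, and self-loops, are dropped.
    """
    nodes = list(genes) + list(metabolites)
    index = {name: k for k, name in enumerate(nodes)}
    edges = set()
    for a, b in list(gene_edges) + list(gene_metabolite_edges):
        if a in index and b in index and a != b:
            i, j = index[a], index[b]
            edges.add((min(i, j), max(i, j)))
    return nodes, sorted(edges)


def stack_omics(expression, metabolite_levels):
    """Standardise each feature, then join the two matrices column-wise.

    Rows are samples (same order in both matrices). Standardising is an
    assumption made here so that the two data types share a comparable scale.
    """
    def zscore(M):
        M = np.asarray(M, dtype=float)
        sd = M.std(axis=0)
        sd[sd == 0] = 1.0  # avoid dividing by zero for constant features
        return (M - M.mean(axis=0)) / sd

    return np.hstack([zscore(expression), zscore(metabolite_levels)])


# Placeholder identifiers and values, for illustration only.
genes = ["geneA", "geneB", "geneC"]
metabolites = ["metX"]
nodes, edges = build_combined_network(
    genes,
    metabolites,
    gene_edges=[("geneA", "geneB"), ("geneB", "geneA")],  # duplicate is merged
    gene_metabolite_edges=[("geneC", "metX"), ("geneZ", "metX")],  # geneZ unknown
)

expression = [[5.0, 2.0, 1.0], [6.0, 2.5, 3.0], [7.0, 1.5, 5.0]]
metabolite_levels = [[0.1], [0.4], [0.7]]
X = stack_omics(expression, metabolite_levels)
print(nodes, edges, X.shape)
```

The column order of the stacked matrix matches the node order of the combined network, so the same indices can be used for both the data and the graph.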
We are using MultiPEN to analyse transcriptomics and metabolomics data in a diet intervention study. We compare our results with traditional approaches such as differential expression analysis. We have found that MultiPEN is able to rank additional relevant genes whose changes in expression are too small to be considered significant by classic differential analysis, which is very promising.
MultiPEN is a computational tool for analysing transcriptomics and/or metabolomics data using a network-driven logistic regression approach. MultiPEN is developed in MATLAB, but it can be downloaded as a standalone application from our repository.
We also provide example data to test MultiPEN. Finally, we include an interaction network between genes and metabolites, which was compiled from Pathway Commons in August 2016.
MultiPEN on GitHub: https://github.com/TGAC/MultiPEN
Cerami E et al. Pathway Commons, a web resource for biological pathway data. Nucleic Acids Research (2011).
MultiPEN will be available as a Galaxy workflow running on EI’s cluster, allowing users to analyse large datasets and perform complex computations.
The volume of data generated by omics technologies is increasing rapidly. These datasets are being generated at an unprecedented scale to better understand the components and inner functions of a biological system. However, extracting biological knowledge from these omics datasets is still challenging. This project aims to develop MultiPEN, a tool to analyse and understand omics datasets. Our first case studies are within the Norwich Research Park (NRP). Currently, we are using MultiPEN in a diet intervention study to extract the key differences (in terms of genes and metabolites) between two conditions, for example, healthy vs disease.