Data science for integrative biology.

Programme Leadership.

This programme is coordinated by Dr Rob Davey, who has expertise in the development of data infrastructure, scientific data analysis algorithms, and high-performance computing environments. He also leads his own research group, which works across these fields.

Programme mission.

Our science needs new and innovative ways of processing the huge and varied datasets that are produced every day by our researchers. We’re building better data handling systems, more efficient and greener algorithms to make the best use of our large supercomputers, and exciting technology to put data together in new ways to answer complex biological questions.

State-of-the-art technologies are generating unprecedented amounts of complex data, from genomes to proteomes and transcriptomes, spanning mechanistic and functional diversity. Handling, interpreting and integrating these large-scale data into descriptive models of molecular function at a systems level requires continued development of algorithms, robust computational models, and interoperable analytical frameworks. Supported by our core capability, we will contribute to the newest developments in the data sciences and facilitate the extraction of meaningful signals from often noisy data.

This effort is fundamental to the success of our associated research programmes into ‘Understanding genome evolution to drive trait improvement’, ‘Understanding complexity in living systems’ and ‘Designing Future Wheat’, as the analyses developed in these programmes are intrinsically data-rich and therefore exposed to multiple challenges of computational complexity. This programme specifically answers the challenges that arise from managing large-scale datasets and their associated metadata, improving existing algorithms to maximise their efficiency on state-of-the-art computing architectures, and integrating the heterogeneous datasets generated in the other programmes, thereby enabling and enhancing their interpretation.

We will carry out fundamental research into software engineering methods to manage, share, visualise and integrate large and complex datasets. We will also develop research data management and dissemination layers, underpinned by community standards, that provide the granularity and searchability needed for EI’s large-scale and diverse data outputs. A key part of this programme will be the integration of the statistical, machine-learning, and network-based models described in the other programmes. We will exploit our algorithm optimisation expertise to drive computational advances in accuracy and efficiency across our research into assembly, variant calling, annotation and network analysis. These efforts will put platforms in place to consistently collect datasets and rapidly feed them into downstream integrative analyses, enabling the extensive and complex data interrogation processes required to bring together multiple heterogeneous datasets.

Standards to power data interoperability.

It is now an absolute requirement of data-intensive integrative biology that access to relevant, diverse, multi-scale datasets is fast and intuitive, and that analyses can handle these complex datasets while maintaining statistical power. Key objectives of this topic include:

  • Store the data generated throughout EI’s strategic and capability programmes, processed by microservice rules, in a metadata-aware data grid deployed on EI’s high-performance storage architecture.
  • Promote and use community-agreed ontologies to harmonise data description across biological datasets and networks, enabling integrative analyses throughout the wider Strategic Programme.
  • Develop and reuse standardised application programming interfaces (APIs) to feed data into Strategic Programme and Core Capability pipelines for interrogation, and National Capability platforms for data sharing.
  • Develop automated pipelines for the collection, management and integration of heterogeneous datasets arising from wide-scale, real-time data generation technologies such as in-field phenotyping and pathogen screening.

Computational advances in algorithm optimisation.

The data-intensive challenges of sequence assembly, annotation, gene expression, real-time image analysis, and knowledge mining require investigation and implementation of state-of-the-art computing technology. Key objectives in this area include:

  • Establish effective co-design strategies, using novel hardware and software, for the optimisation of algorithms and models developed in this work package.
  • Develop low-power, low-cost modular platforms suitable for in-field analysis, using bleeding-edge sample acquisition, remote sensing, and tracking technologies.

Data integration.

A key goal of genomic data analyses that use extensive whole-genome sequencing, transcriptomic, and methylomic datasets is the ability to derive effective models that predict phenotypic traits and outcomes. Key objectives for this topic include:

  • Develop methods to integrate ’omics datasets to investigate functional and regulatory complexity.
  • Develop heuristic approaches to propose new pathways and molecular interactions and to investigate the impact of nucleotide variation within functional sequence and genomic regions.
  • Develop intuitive, easy-to-use integrated data sharing, access and analysis mechanisms where users are able to consolidate different data types and sources, and integrate them according to their research goals.

The development of new bioinformatics tools, resources and algorithms will help researchers across the biotechnology and biomedical sectors work with data more effectively and make it easily shareable and usable by others. Our continued optimisation of computational architecture and software will increase resource efficiency for researchers. Our work on in-field technology will benefit the agritech sector and give agronomists and farmers better data quality and access. Multi-omics researchers will benefit from our holistic and open approach to data management, which allows for more robust scientific interrogation.