Research group

Data Science Group

Data infrastructure and algorithms.

Group activities.

The group researches how best to manage, represent and analyse data for open science.

We explore new hardware, algorithms and methodologies to develop tools for life science informatics, such as:

  • large-scale data visualisation
  • bioinformatics training
  • assembly algorithms for microbial metagenomics
  • novel infrastructure platforms for disseminating/publishing data and software

The group develops novel open software frameworks to enable scientists to describe, deposit and use research data, currently around exemplar use cases in plants. We lead the Collaborative Open Plant Omics (COPO) project, a web platform and API (Application Programming Interface) for grouping, describing and publishing raw and processed ’omics data, and research objects such as workflows, software, manuscripts, posters, presentations and images. We are active members of the international Wheat Information System (WheatIS) initiative, where we are developing the Grassroots infrastructure to expose EI’s wheat data, notably our recent release of the new wheat assembly (plus several others) as a single BLAST portal powered by our National Capability HPC architecture. We use the Grassroots architecture to contribute to the yellow rust Field Pathogenomics project, the CerealsDB project, and collaborate closely with URGI, INRA on the WheatIS.

We deploy community-focused analytical web platforms by leading the Galaxy Development and Training project and the computational infrastructure and informatics node for CyVerse UK. We aim to offer these services to the UK community, and provide training and best practice around research data management.

Our infrastructure resources are built on open principles and are designed to be queried programmatically with concise APIs, broadening their availability and reuse.

We also hold current grants relating to metagenomics and technology development. The MetaCortex project seeks to develop assembly tools for metagenomics datasets, focussed particularly on viral communities. A local collaborative seed grant aims to carry out preliminary work studying the composition of bee pollen baskets using Illumina and Nanopore sequencing. Finally, Project Genesys is a collaboration with Optalysys Ltd. to adapt their optical computing technology to sequence alignment tasks.

We are developing the Aequatus tool for representing and exploring multi-species orthology for non-model organisms, and the TGAC Browser software to visualise complex non-gold-standard data, both tied to large-scale analysis approaches and HPC infrastructure. Where existing solutions concentrate on very highly curated, static genomes with relatively infrequent release cycles, TGAC Browser and Aequatus aim to allow researchers to visualise and explore orthology for fragmented, incomplete genomes.

Some of our projects include:

  • Collaborative Open Plant Omics (COPO): a Collaborative Bioinformatics Plant Science Platform
  • CerealsDB: A community resource for wheat genomics
  • TGAC Galaxy Asset Development Project and Training
  • CyVerse UK
  • Building national hardware and software infrastructure for UK DNA Foundries
  • Facilitating synthetic biology literature mining and searching for the plant community
  • Metagenomics assembly