National Capability in e-Infrastructure

Programme leadership.

This programme is coordinated by Dr Rob Davey, who has expertise in the development of data infrastructure, scientific data analysis algorithms, and high-performance computing environments.

Summary

E-Infrastructure encompasses the integration and interconnection of computational hardware and software technology, data resources and services, communications protocols and networks, as well as the human resources and organisational structures required to support modern, internationally leading collaborative research. The Earlham Institute places great value on empowering collaborative research through world-class high-performance computing technology. We deploy and maintain some of the largest HPC systems for Life Science research in Europe, which power the development and deployment of versatile digital platforms for 'omics-based data sharing, discovery, and analysis.

Overview

E-Infrastructure (and in particular, its high-performance computing component) has become one of the primary instruments that underpin modern collaborative science. Following the advent of high-throughput ‘omics technologies, the life sciences community is now challenging traditional high-performance computing (HPC) domains in terms of demand for resources, services and capability. However, access to HPC resources has typically been in the hands of expert bioinformaticians and data scientists, and presents a steep learning curve for anyone outside these domains. It is therefore critical that we continue to give HPC-proficient researchers the computational capabilities they need to meet their scientific objectives, whilst also opening these complex yet powerful capabilities to the wider collaborative research community through open and intuitive e-Infrastructure. We also need to make sure that our users are trained to use our systems effectively. Reflecting EI’s commitment to e-Infrastructure, we were recently recognised with an HPCwire Readers’ and Editors’ Choice Award for our use of high-performance computing in the bread wheat genome project, presented at the leading supercomputing event SC16 in Salt Lake City, Utah.

Capability Objectives

Two recent projects demonstrate the expertise of e-infrastructure researchers at the Earlham Institute: the largest deployment of SGI UV 300 technology worldwide for Life Sciences (including the largest deployment of Intel SSD NVMe technology), and our role as infrastructure lead for the CyVerse UK project.

We are committed to delivering powerful open systems for data access and analysis, and we contribute our scientific outputs to advance the open science movement. In this way, we are contributing our expertise to the development of international initiatives such as DivSeek and the Wheat Initiative. The Earlham Institute has hosted numerous workshops and training events in genomics and bioinformatics that have delivered effective knowledge exchange with researchers and domain experts alike, and the training requirements around the infrastructure and HPC efforts in this NCG will be developed through the Training NCG (NC4). Our infrastructure projects will ultimately connect the UK to international data resource efforts such as the Wheat Information System (developed through the cross-Institute Designing Future Wheat project), and those of SeeD, IWYP, IWGSC, and the recently completed WISP BBSRC wheat research programme.

The infrastructure associated with NC3 is managed by the Earlham Institute’s Scientific Computing team, while the hardware and low-level software infrastructure is managed by the NBIP Computing for Infrastructure (CiS) team.

Large scale genome assembly.

For researchers with advanced HPC skills, the Earlham Institute provides direct access to some of the largest supercomputing resources for Life Sciences in Europe. In particular, access to a state-of-the-art SGI UV 300 platform with 12 terabytes of shared RAM, 256 Intel Xeon processors and 32 terabytes of Intel SSD NVMe storage technology allows UK researchers to undertake large genome assembly computations that are currently intractable on any other UK computing infrastructure (including Archer, the UK’s national HPC Service). For example, our recently published assembly, annotation and dissemination of the 17Gb hexaploid wheat genome required access to the UV 300, a computational infrastructure that is unique to EI and that enables our researchers to remain at the forefront of solving global challenges for crop improvement and food security.

CyVerse UK.

The Earlham Institute is the infrastructure lead for the CyVerse UK project, tasked with installing, configuring and deploying the hardware and management software required to run this extensive UK commitment to international cyberinfrastructure. CyVerse is the successor to the NSF-funded iPlant Collaborative project, and the Earlham Institute is the first federated node outside the US, delivering a sophisticated data management and analysis platform to house, process and distribute data through web-based interfaces, command-line tools and APIs. CyVerse is currently being used to: share complex pan-genomic GWAS pipelines (GWASer, Earlham Institute); run RENseq analysis (Earlham Institute); design SNP-based primers (Polymarker, Earlham Institute); reconstruct genetic regulatory networks (Earlham Institute, University of Warwick); analyse wheat genomes/transcriptomes (University of Liverpool); and analyse root-based image phenotypes (University of Nottingham).

The cyberinfrastructure hosts COPO and Designing Future Wheat Grassroots infrastructure virtual machines, and will power data processing aspects of the Brassica Information Portal, providing a robust and well-tested platform for delivering services resulting from strategic and competitively funded research. CyVerse UK is also proposed to be the hosting platform in the upcoming ELIXIR UK hub coordination effort for the European Multi-environment Plant pHenotyping And Simulation InfraStructure (EMPHASIS) project, aiming to develop and provide pan-European access to infrastructures addressing multi-scale phenotyping.

Galaxy.

Galaxy is a web-based workflow management platform that enables researchers to run potentially complex analytical pipelines without the need for formal bioinformatics training in HPC usage. Galaxy has been deployed and hosted at the Earlham Institute through the 2014 Bioinformatics and Biomathematics Training Fund, and this platform supports many users across the Norwich Research Park. We develop tools and workflows that are released openly to the community to make the most of our expertise, such as those described in our recent GigaScience paper on GeneSeqToFamily.

Related reading