Case Study: Achieving sustainable wheat through data infrastructure

Genomics data generation and provision of computational systems for the BBSRC-funded cross-institute Designing Future Wheat (DFW) project.

The challenge

Secure data storage, sharing and collaborative access pose a great challenge when dealing with large and complex research datasets and analysis. This is especially true for Designing Future Wheat (DFW), the first flagship cross-institute programme of its kind funded by BBSRC, bringing together biologists, breeders, and informaticians spanning eight research institutes and universities.

This 5-year programme aims to develop new wheat varieties (germplasm) containing the next generation of key traits and deliver them into the hands of academia and industry. Large-scale field trials are being run at partner sites and by a precompetitive consortium of wheat breeding companies, and the resulting data are being assessed to inform research on new wheat germplasm.

Within DFW, the Earlham Institute’s main responsibilities are in genomics data generation and analysis, and the provision of computational infrastructure and tools. This includes the Grassroots data management platform, developed at the Institute as an interoperable data sharing platform and repository to standardise access to wheat data.

Grassroots has allowed the collection and standardisation of quantitative and qualitative data such as treatments, trait measurements and phenotypes in field trials. As DFW partners are working together to produce large and heterogeneous datasets, the existing Grassroots infrastructure needed to handle the ambitious scale of data generated by the programme, while also allowing fast, unrestricted, openly licensed access to data and experimental information for the wider international wheat community. This required an integrated, flexible solution with access to underlying data storage, compute nodes for analysis, and servers for hosting web-based services.

**Alex Coulton, PhD Researcher, University of Bristol**

Our role

To overcome this challenge, EI’s dedicated e-infrastructure team developed, configured, and deployed a versatile cloud infrastructure, and provided first-level support and advice on research needs for the DFW programme.

CyVerse UK, led by EI and funded as an expert provision of resources for the UK community, makes an in-kind contribution by hosting virtual servers for Grassroots and other DFW partners. These servers add capacity by providing backends for data storage and frontends for user interfaces. For Grassroots, CyVerse UK stores over 30 TB of large and complex wheat research data, which is shared collaboratively between programme partners and typically released under the Toronto Agreement.

Infrastructure support also includes mechanisms for standardised data sharing and metadata management of field trials, testing servers, the CKAN digital repository to make DFW publications openly accessible, analytical services such as BLAST, and a search system to index and aggregate wheat data. CyVerse UK also hosts, at no direct cost to researchers, third-party services that DFW-associated researchers use for experimental design and for discovery relating to genetic traits and diseases, such as AutoCloner (University of Bristol) and KnetMiner data sources (Rothamsted Research).

In fulfilling the needs of this ambitious programme, EI plays a leading and integral role in building and providing robust data and compute resources. These deliver value-for-money infrastructure services not only for the Norwich Bioscience Institutes (Earlham Institute, Quadram Institute Bioscience, John Innes Centre and The Sainsbury Lab) but also for the wider UK Higher Education Institution research sector.

The NBIP Computing Infrastructure for Science (CiS) team in turn supports CyVerse UK at the data-centre level to address hardware, networking, and core infrastructure provision. The synergy between the EI e-infrastructure and the CiS research computing teams closely aligns with the UKRI Infrastructure Roadmap to provide long-term national expertise and service on data infrastructure, particularly for life science research groups, research programmes, and partnerships across the UK.

The impact

Research impact

EI makes important contributions to DFW and other BBSRC programmes and projects by making data freely accessible. This is often not the case when data from publicly funded projects are stored on commercial cloud services, which typically charge data egress fees that can make research prohibitively expensive.

These costs become significant at the scale of data sharing now common in multi-disciplinary collaborations, and can be upwards of £10k for large dataset downloads. The benefits-in-kind that EI has brought to the DFW programme provide free network use and data storage, and therefore improved value for money for research investment.

Socioeconomic impact

The NBIP CiS research computing infrastructure team creates a huge amount of unseen value in computing resources and data management. This delivers socioeconomic impact to our funders as a key value-for-money investment. Data-focused researchers understand the data needs of industry, breeders, and growers working to improve yields, reduce reliance on inorganic fertilisers, or introduce other economically beneficial traits to future-proof wheat as a crop. Coupled with their expertise, this infrastructure makes a substantial contribution to meeting the demands of future food security in the UK.

Working with us

If you would like to discuss a project with us and find out how you can work with the Earlham Institute, contact us.

Get in touch