Achieving sustainable wheat through data infrastructure
Genomics data generation and provision of computational systems for the BBSRC-funded cross-institute Designing Future Wheat (DFW) project.
Secure data storage, sharing and collaborative access pose a great challenge when dealing with large and complex research datasets and analysis. This is especially true for Designing Future Wheat (DFW), the first flagship cross-institute programme of its kind funded by BBSRC, bringing together biologists, breeders, and informaticians spanning eight research institutes and universities.
This 5-year programme aims to develop new wheat varieties (germplasm) containing the next generation of key traits and deliver them into the hands of academia and industry. Large-scale field trials are taking place at partner sites and through a precompetitive consortium of wheat breeding companies; the resulting data are being assessed and are contributing to research on new wheat germplasm.
Within DFW, the Earlham Institute’s main responsibilities are in genomics data generation and analysis, and the provision of computational infrastructure and tools. This includes the Grassroots data management platform, developed at the Institute as an interoperable data sharing platform and repository to standardise access to wheat data.
Grassroots has allowed the collection and standardisation of quantitative and qualitative data such as treatments, trait measurements and phenotypes in field trials. As DFW partners are working together to produce large and heterogeneous datasets, the existing Grassroots infrastructure needed to handle data at this ambitious scale for the programme, while also allowing fast, openly licensed, unrestricted access to data and experimental information for the wider international wheat community. This required an integrated, flexible solution with access to underlying data storage, compute nodes for analysis, and servers for hosting web-based services.
Working with colleagues at the EI as part of the DFW data management work package is proving extremely useful; their database expertise, and willingness to take on ideas and suggestions is resulting in a unique and widely applicable data management solution. Being able to build in a capability to upload diverse phenotypic data from the latest automated ground and UAV based automated phenotyping platforms deployed in DFW is resulting in an as up to date as possible data management resource, capable of maximising the availability of the trait data collected in DFW.
Andrew Riche, Agronomist, Plant Science, Rothamsted Research
To overcome this challenge, EI’s dedicated e-infrastructure team developed, configured, and deployed a versatile cloud infrastructure, and provides first-level support and advice on research needs for the DFW programme.
CyVerse UK, led by EI and funded to provide expert resources for the UK community, makes an in-kind contribution by hosting virtual servers for Grassroots and other DFW partners. This increases capacity by providing backends for data storage and frontends for user interfaces. For Grassroots, CyVerse UK enables over 30 TB of large and complex wheat research data to be stored and shared collaboratively between programme partners, typically released under the Toronto Agreement.
Infrastructure support also includes mechanisms for standardised data sharing and metadata management of field trials, testing servers, the CKAN digital repository to make DFW publications openly accessible, analytical services such as BLAST, and a search system to index and aggregate wheat data. CyVerse UK also hosts, at no direct researcher cost, third-party services for DFW-associated researchers for experimental design and discovery on genetic traits and diseases, such as AutoCloner (University of Bristol) and Knetminer data sources (Rothamsted Research).
In fulfilling the needs of this ambitious programme, EI plays a leading and integral role in building and providing robust data and compute resources that deliver value-for-money infrastructure services, not only for the Norwich Bioscience Institutes (Earlham Institute, Quadram Institute Bioscience, John Innes Centre and The Sainsbury Laboratory) but also for the wider UK Higher Education Institution research sector.
The NBIP Computing Infrastructure for Science (CiS) team in turn supports CyVerse UK at the data-centre level to address hardware, networking, and core infrastructure provision. The synergy between the EI e-infrastructure and the CiS research computing teams closely aligns with the UKRI Infrastructure Roadmap to provide long-term national expertise and service on data infrastructure, particularly for life science research groups, research programmes, and partnerships across the UK.
EI makes important contributions to DFW and other BBSRC programmes and projects by making data freely accessible. This is often not the case when data from publicly funded projects are stored on commercial cloud services, which typically charge data egress fees that can make research prohibitively expensive.
The cost becomes significant at the modern scales of data sharing in multi-disciplinary collaborations, where downloading large datasets can cost upwards of £10k. The benefits-in-kind that EI has brought to the DFW programme provide free network use and data storage, and therefore improved value for money for research investment.
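As a rough illustration of how egress fees can add up at this scale, the sketch below multiplies a dataset size by a per-gigabyte rate and a number of full downloads. Only the 30 TB figure comes from this article; the rate of 7 pence per GB and the five downloads are hypothetical placeholders chosen for illustration, not the pricing of any real provider.

```python
def egress_cost_pounds(dataset_gb: int, pence_per_gb: int, downloads: int) -> float:
    """Estimate the total egress cost in pounds for repeated full downloads."""
    # Work in whole pence to avoid floating-point rounding, then convert.
    total_pence = dataset_gb * pence_per_gb * downloads
    return total_pence / 100


DATASET_GB = 30_000        # 30 TB, as quoted for Grassroots (decimal units)
ASSUMED_PENCE_PER_GB = 7   # hypothetical rate, not a real provider's price
ASSUMED_DOWNLOADS = 5      # hypothetical number of partners fetching the full set

cost = egress_cost_pounds(DATASET_GB, ASSUMED_PENCE_PER_GB, ASSUMED_DOWNLOADS)
print(f"Estimated egress cost: GBP {cost:,.2f}")
```

Under these assumed values, a handful of full downloads is enough to pass the £10k mark, which is why free network use matters for multi-partner programmes.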
The NBIP CiS research computing infrastructure team creates a huge amount of unseen value in computing resources and data management. This delivers socioeconomic impact to our funders as a key value-for-money investment. When coupled with data-focused researchers (who understand the data needs of industry, breeders, and growers working to improve yields, reduce reliance on inorganic fertilisers, or introduce other economically beneficial traits to future-proof wheat as a crop), this provides a huge impact for the demands of future food security in the UK.
During my PhD, I developed an automated bioinformatics pipeline, AutoCloner, to help wheat and other polyploid crop researchers with gene cloning. This was previously a laborious and manual process, requiring the 3′ ends of PCR primers be placed on the locations of SNPs unique to the allele of interest. The Earlham Institute has been a key part of making this work, providing hosting capacity for the AutoCloner web interface, which allows researchers to access the tool without installing additional software. I’d like to thank Rob Davey and his team for their continued support.
Alex Coulton, PhD Researcher, University of Bristol
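The primer-placement rule mentioned in the quote above can be illustrated with a small sketch. It is not AutoCloner's implementation; it only assumes that the copies of a gene are already aligned to equal length and reports the columns where the sequence of interest differs from every other copy, which are the candidate positions for a primer's 3′ end. The function name and the example sequences are hypothetical.

```python
def unique_snp_positions(target: str, others: list[str]) -> list[int]:
    """Return 0-based alignment columns where the target base differs
    from the base in every other aligned sequence."""
    if any(len(seq) != len(target) for seq in others):
        raise ValueError("all sequences must be aligned to the same length")

    positions = []
    for i, base in enumerate(target):
        if base == "-":
            continue  # a gap in the target cannot anchor a primer end
        # Keep the column only if no other copy shares the base or has a gap.
        if all(seq[i] != base and seq[i] != "-" for seq in others):
            positions.append(i)
    return positions


# Hypothetical aligned fragments: one copy of interest and two other copies.
target_copy = "ATGCCGTA"
other_copies = ["ATGACGTA", "ATGACGTT"]

print(unique_snp_positions(target_copy, other_copies))  # [3]
```

Column 7 is not reported because the target shares its base with one of the other copies, so a primer ending there would not be specific to the allele of interest.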