
Grassroots Genomics

The Grassroots Genomics project at EI is a user-driven platform for integrating and reconciling wheat genomic information.

Project summary.

Support: grasshelpdesk@earlham.ac.uk

Value: £660,000

Grants:

BB/N023420/1: Federating access to wheat data services for efficient genome-specific marker design

BB/L024144/1: CerealsDB: A community resource for wheat genomics

BB/M025519/1: Using field pathogenomics to study wheat yellow rust dispersal and population dynamics at a national and international scale

Integrative research requires extensive multi-level approaches to enrich and expose data and workflows so that informatics infrastructures can process them effectively. The Grassroots Genomics project represents EI’s contribution to the international Wheat Information System (WheatIS), which aims to consolidate data and analyses and to facilitate consistent approaches to generating, processing and disseminating public wheat datasets. The Grassroots Genomics platform is powered by a lightweight set of middleware services, called the Grassroots API, which comprises: a data management layer to provide structure to unstructured filesystems; interfaces to interact with local or cloud-based analysis platforms; a search layer to provide multi-faceted metadata and literature querying; and a web server layer to deliver content and provide access to public programmatic interfaces.

The Grassroots infrastructure framework can be run locally or packaged in virtual containers and deployed on a variety of hardware. It is therefore a decentralised system: information generators retain control over their resources, while interconnected resources can access each other consistently. EI has an extensive National Capability to provide scientific computing hardware to the UK research community and is therefore well positioned to build a point of access to previously disparate resources, serving wheat breeders, biologists and bioinformaticians. Coupling the Grassroots Genomics project with BBSRC-funded efforts to bring Galaxy and CyVerse to EI provides community-standardised methodologies for data integration, interpretation and discovery.

Details.

Data Management

Grassroots Genomics uses iRODS to track files on a filesystem as data objects rather than as plain files and folders. We use the iRODS APIs to abstract data search and data access functionality, so that data are exposed and shared consistently for use in downstream analyses. iRODS instances are designed to be federated with one another, for example with the DSpace instance under development at INRA, which also uses iRODS. Federation at the data level across geographical and political boundaries is therefore supported out of the box.

Interfaces

Grassroots Genomics uses a standard Apache httpd webserver to serve web content to users, as well as an Apache Tomcat Java servlet container to host Java enterprise web applications for the platform. We have developed a single, simple API as an Apache module so that the webserver can interact consistently with the range of Grassroots services, such as BLAST capability, iRODS data management and ElasticSearch integration. We have also developed Grassroots services that can search 3rd party resources, such as the Ensembl, Agris, F1000 and BASE repositories.

Federation

The collaborative approach to building the international WheatIS infrastructure means that WheatIS is actually a federated network of nodes. Each node might house different data, metadata or analytical processes, but can be federated into the global network through consistent shared APIs on various middleware levels. The Grassroots Genomics platform can therefore interact with EI’s National Capability HPC infrastructure through iRODS, as well as the WheatIS node in INRA (France), the CerealsDB at Bristol University (UK), and Ensembl/EnsemblGenomes at the EBI (UK).
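The idea of nodes joining a network through a consistent shared API can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `federated_search` function, the `SearchFn` interface and the three example nodes are hypothetical and are not part of the Grassroots API or of WheatIS. Each node is modelled as a callable that takes a search term and returns a list of hits.

```python
from typing import Callable, Dict, List

# Hypothetical shared interface: every node exposes search(term) -> list of hits.
SearchFn = Callable[[str], List[Dict[str, str]]]


def federated_search(term: str, nodes: Dict[str, SearchFn]) -> List[Dict[str, str]]:
    """Send one term to every node and merge the hits, recording which node answered."""
    merged: List[Dict[str, str]] = []
    for name, search in nodes.items():
        try:
            hits = search(term)
        except Exception:
            # An unreachable node should not break the whole federated query.
            continue
        for hit in hits:
            merged.append({**hit, "node": name})
    return merged


# Three stand-in nodes (hypothetical); the second one simulates an outage.
def node_a(term: str) -> List[Dict[str, str]]:
    return [{"id": "a1", "title": f"{term} markers"}]


def node_b(term: str) -> List[Dict[str, str]]:
    raise ConnectionError("node offline")


def node_c(term: str) -> List[Dict[str, str]]:
    return [{"id": "c7", "title": f"{term} assembly"}]


results = federated_search("wheat", {"node_a": node_a, "node_b": node_b, "node_c": node_c})
for hit in results:
    print(hit)
```

Because every node answers through the same call signature, adding a node means adding one entry to the dictionary, and each node keeps control of how it produces its own hits.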

Indexing

We use metadata indexing to provide fast, federated querying. We index user-relevant fields of iRODS iCAT metadata catalogues so that Grassroots Genomics data objects can be found and accessed quickly, and we index content-mined literature text via Solr. We can also index content from 3rd party databases, such as CerealsDB, via our Grassroots APIs, so that database owners retain access to and control over their data.
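As a rough illustration of selecting user-relevant fields for indexing, the sketch below flattens iRODS-style attribute-value-unit (AVU) metadata triples into a flat document of the kind a search index could ingest. The `INDEXED_FIELDS` whitelist, the `to_index_document` helper, the attribute names and the example path are all hypothetical; no iRODS or Solr connection is made.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# An iRODS metadata entry is an attribute-value-unit (AVU) triple.
AVU = Tuple[str, str, str]

# Hypothetical whitelist of user-relevant attributes to index.
INDEXED_FIELDS = {"species", "cultivar", "assembly"}


def to_index_document(path: str, avus: Iterable[AVU]) -> Dict[str, object]:
    """Flatten the whitelisted AVUs of one data object into an index document."""
    fields: Dict[str, List[str]] = defaultdict(list)
    for attribute, value, _unit in avus:
        if attribute in INDEXED_FIELDS:
            fields[attribute].append(value)
    doc: Dict[str, object] = {"path": path}
    # Sort for a deterministic document; an attribute may carry several values.
    for attribute in sorted(fields):
        doc[attribute] = sorted(fields[attribute])
    return doc


example = to_index_document(
    "/exampleZone/wheat/sample.fa",  # hypothetical logical path
    [
        ("species", "Triticum aestivum", ""),
        ("cultivar", "Chinese Spring", ""),
        ("checksum", "abc123", ""),  # not whitelisted, so not indexed
    ],
)
print(example)
```

Only the whitelisted attributes reach the index document, while the data object itself stays where its owner stores it.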

Searching

We use ElasticSearch to filter and aggregate metadata, which allows us to search the already indexed data across WheatIS nodes. Its distributed “shard” model supports federation of search functionality, allowing WheatIS nodes to host and seamlessly federate their own ElasticSearch shards. ElasticSearch integration enables Grassroots Genomics users to submit a search term to multiple indexed repositories, resulting in a detailed and rich search platform.
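The filtering and aggregation described above can be sketched as a request body in the Elasticsearch query DSL. This is a minimal sketch, not the Grassroots implementation: the `build_search_body` helper and the field names `title`, `description`, `species` and `node` are assumptions chosen for illustration, and the code only builds the JSON body without contacting a cluster.

```python
import json
from typing import Any, Dict, Optional


def build_search_body(term: str, species: Optional[str] = None, size: int = 10) -> Dict[str, Any]:
    """Build a query-DSL body: full-text match, optional filter, one aggregation."""
    bool_query: Dict[str, Any] = {
        # Hypothetical text fields to match the search term against.
        "must": [{"multi_match": {"query": term, "fields": ["title", "description"]}}]
    }
    if species is not None:
        # Filters narrow the result set without affecting relevance scores.
        bool_query["filter"] = [{"term": {"species": species}}]
    return {
        "size": size,
        "query": {"bool": bool_query},
        # Count hits per source node (hypothetical keyword field "node").
        "aggs": {"hits_per_node": {"terms": {"field": "node"}}},
    }


body = build_search_body("yellow rust", species="Triticum aestivum")
print(json.dumps(body, indent=2))
```

A body of this shape would return the matching documents together with a per-node count, which is one way a single search term could be summarised across several indexed repositories.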

Analysis

A vital requirement for Grassroots Genomics is to provide infrastructure that connects data directly to analysis, so that researchers do not have to funnel data through their own networks and storage media. This “Platform as a Service” (PaaS) architecture forms the basis of many cloud solutions for bioinformatics analysis, and the WheatIS project follows these conventions to deliver software to users.

Grassroots currently supports running analysis jobs on local hardware and managed HPC clusters via the DRMAA library, to which the commonly-used schedulers (SLURM, LSF, PBS, SGE) conform. This makes deployment much easier, facilitating the setup of new Grassroots nodes that are compliant with the WheatIS network, and providing a solid basis for Grassroots Genomics HPC requirements. We currently have two active projects at EI to deploy and maintain Galaxy and CyVerse instances on EI’s HPC hardware. As such, the WheatIS network will be able to exploit the APIs of both platforms to enable data transfer and workflow initiation, enabling Grassroots Genomics users to access and analyse a huge array of datasets and pipelines in one place.

External Sites

International Wheat Genome Sequencing Consortium

Wheat @ URGI (INRA)

CerealsDB

Ensembl plants - wheat CSS sequence

Ensembl plants - barley genome

Wheat @ MIPS

Wheat @ UC Davis

Publications.

CerealsDB 3.0: expansion of resources and data integration.

Wilkinson PA, Winfield MO, Barker GLA, et al. BMC Bioinformatics. 2016;17:256. doi:10.1186/s12859-016-1139-x.

Collaborators.

URGI, INRA
WheatIS node partner

Wheat Initiative Wheat Information System Expert Working Group
WheatIS EWG partner consortium

WheatIS EWG Collaborators
List of WheatIS EWG collaborators

Bristol Wheat Genomics
University of Bristol group working on CerealsDB and other wheat resources

Impact statement.

In recent years, there has been a revolution in the generation of genomic data for cereal crops, especially wheat. Our goal is to engage the community of wheat researchers, from breeders to bioinformaticians, in generating, evaluating and integrating wheat data.