How are we related? A Compara-bly easy workflow to find gene families

26 February 2018

We release ‘GeneSeqToFamily’, an open-source Galaxy workflow that helps scientists to find gene families.

Published in GigaScience, the open-source Galaxy workflow makes it easier for researchers to find gene families, an important step when analysing the evolution, structure and function of genes across species.

Co-author Wilfried Haerty, Group Leader of Evolutionary Genomics at EI, explained why this tool is so useful to biologists: “The software developed at the Earlham Institute enables scientists to investigate species of interest using a flexible and reproducible pipeline. The performance of our workflow was assessed on vertebrate genome assemblies of various qualities (platypus, pig, horse, dog, mouse and human). The species were selected to assess the impact of genome quality on gene families identification. The mouse, dog and human genomes are of high quality whereas the three others are at different stages of analysis completion.”

Based on and expanding Ensembl’s existing EnsemblCompara Gene Trees pipeline, the GeneSeqToFamily workflow removes many complex prerequisites of the process, such as having to use the command line to install a large number of separate tools, by converting the whole process into Galaxy, a much simpler platform to use.

Importantly, the workflow is highly customisable, allowing users to choose parameters, change tools and run the software on their own genes, without having to use the Ensembl database.

Not just a workflow, GeneSeqToFamily contains a number of new, standalone Galaxy tools, including TreeBeST, hcluster_sg, T-Coffee and ETE. Developed at EI by Anil Thanki and Nicola Soranzo of the Data Infrastructure Group, the software makes the process of finding and generating phylogenetic trees easier, using a range of open platforms and databases. Anil Thanki, Scientific Programmer at EI, said: “We are excited to put our work in the open domain, where it allows biologists and bioinformaticians to use the Ensembl Compara GeneTrees Pipeline in a simple, graphical user interface and modify it if needed.”

The team hopes that the new workflow will help users unfamiliar with the complexities of Compara to analyse phylogenetic datasets more easily, while collating a number of useful gene family tools in one Galaxy workflow. Users can either select existing Ensembl databases as the reference sets for their analysis or provide their own data in the same format, with tools provided to help.

Earlham Institute is committed to providing tools and algorithms to support, enable and develop computational biology and life sciences research, with projects such as Galaxy helping to open access to a range of scientific tools and databases.

The Data Infrastructure Group, led by Dr Rob Davey, also supports resources such as CyVerse UK and COPO which, alongside Galaxy, expand the availability and usability of computational resources to the wider scientific community in the UK and internationally through EI’s National Capability in e-Infrastructure.

Notes to editors

For more information, please contact:

Hayley London

Marketing & Communications Officer, Earlham Institute (EI)

  • +44 (0)1603 450 107

About Earlham Institute

The Earlham Institute (EI) is a world-leading research institute focusing on the development of genomics and computational biology. EI is based within the Norwich Research Park and is one of eight institutes that receive strategic funding from the Biotechnology and Biological Sciences Research Council (BBSRC) - £6.45M in 2015/2016 - as well as support from other research funders. EI operates a National Capability to promote the application of genomics and bioinformatics to advance bioscience research and innovation.

EI offers a state-of-the-art DNA sequencing facility, unique in its operation of multiple complementary technologies for data generation. The Institute is a UK hub for innovative bioinformatics through research, analysis and interpretation of multiple, complex data sets. It hosts one of the largest computing hardware facilities dedicated to life science research in Europe. It is also actively involved in developing novel platforms to provide access to computational tools and processing capacity for multiple academic and industrial users, and in promoting applications of computational bioscience. Additionally, the Institute offers a training programme through courses and workshops, and an outreach programme targeting key stakeholders and wider public audiences through dialogue and science communication activities.


About BBSRC

The Biotechnology and Biological Sciences Research Council (BBSRC) invests in world-class bioscience research and training on behalf of the UK public. Our aim is to further scientific knowledge, to promote economic growth, wealth and job creation and to improve quality of life in the UK and beyond.

Funded by Government, BBSRC invested over £509M in world-class bioscience in 2014-15 and is the leading funder of wheat research in the UK (over £100M invested in UK wheat research in the last 10 years). We support research and training in universities and strategically funded institutes. BBSRC research and the people we fund are helping society to meet major challenges, including food security, green energy and healthier, longer lives. Our investments underpin important UK economic sectors, such as farming, food, industrial biotechnology and pharmaceuticals.

For more information about BBSRC, our science and our impact see:

For more information about BBSRC strategically funded institutes see: