Machine Learning for Bioinformatics: A User's Guide

Josh Colmer and Matthew Madgwick are PhD students in the Anthony Hall Group and Korcsmaros Group at EI. They are both experts in machine learning, with first-rate publications and applications already to their name.

Josh has recently first-authored a paper in New Phytologist, which introduces SeedGerm - a machine learning platform that enables scientists and seed companies to automatically quantify seed germination traits through computer vision techniques.

Matt, among his various projects, has been applying his machine learning expertise to study how the microbiome affects gut health and breast cancer, while contributing to a global effort to understand the systemic effects of COVID-19 on the human body.

The pair previously introduced us to the dos and don’ts of machine learning in “10 things you need to know about getting into machine learning”, which is where we’d direct a novice user.

In this second installment, they introduce five key concepts that any bioinformatician should know when applying machine learning.

Supervised vs Unsupervised

Machine learning (ML) algorithms can be broadly split into two main categories: supervised and unsupervised learning. The choice between these approaches depends on the task at hand.

The vast majority of ML applications use supervised learning, where ground truth (empirical evidence) is known, and the algorithm’s task is to learn and predict this ground truth from corresponding data. This can be further split into two main subcategories: regression and classification.

In a regression task the predicted variable is a continuous number, whereas the predicted variable is a category for classification tasks. There are many different algorithms that can be used for these tasks, from traditional methods such as Logistic Regression, Random Forest and Support Vector Machine to deep approaches such as Deep Neural Networks. An example application of supervised learning would be to predict which patients are healthy or have cancer from genomic data.

This differs from unsupervised learning, where the algorithm only sees the input data with no regard to the ground truth, which makes it ideal for tasks such as clustering and association analysis. Again, there are many different algorithms that are used but some of the most common are PCA, K-means and hierarchical clustering. Deep approaches can also be applied using the autoencoder architecture.

Above: The two figures demonstrate the difference between a supervised approach and an unsupervised approach for the same dataset. In the supervised approach, a support vector machine creates a dividing line, known as the hyperplane, which best splits the data based on their known ground truth. In the unsupervised approach, however, we are asking the algorithm to find 4 clusters of the data points which are most alike, with no regard to the ground truth. In this case, a K-means clustering algorithm was used.
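The contrast between the two approaches can be sketched in plain Python. This is a minimal illustration rather than the method behind the figures: a simple nearest-centroid classifier stands in for the support vector machine, the K-means loop uses a naive initialisation, and the data points, labels and function names are all made up for the example.

```python
import math


def centroid(points):
    """Mean position of a list of equal-length points."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def nearest(point, centroids):
    """Index of the centroid closest to point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


def fit_nearest_centroid(points, labels):
    """Supervised: the known labels decide which points form each centroid."""
    classes = sorted(set(labels))
    centroids = [
        centroid([p for p, lab in zip(points, labels) if lab == c]) for c in classes
    ]
    return classes, centroids


def k_means(points, k, n_iter=10):
    """Unsupervised: labels are never seen; groups come from the data alone."""
    centroids = list(points[:k])  # naive initialisation: the first k points
    for _ in range(n_iter):
        assignment = [nearest(p, centroids) for p in points]
        for i in range(k):
            members = [p for p, a in zip(points, assignment) if a == i]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[i] = centroid(members)
    return [nearest(p, centroids) for p in points]


# Hypothetical two-feature samples with a known ground truth.
points = [(1.0, 1.0), (5.0, 5.2), (1.2, 0.8), (5.1, 4.9), (0.9, 1.1), (4.8, 5.0)]
labels = ["healthy", "cancer", "healthy", "cancer", "healthy", "cancer"]

# Supervised: learn from the labels, then predict the label of a new sample.
classes, class_centroids = fit_nearest_centroid(points, labels)
predicted = classes[nearest((4.9, 5.1), class_centroids)]

# Unsupervised: ask for two clusters without showing the labels.
clusters = k_means(points, k=2)
```

The supervised model returns a named class for the new sample, whereas K-means only returns anonymous cluster numbers. Any meaning attached to those clusters has to come from the analyst afterwards.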

Feature Engineering

Feature engineering is one of the most important steps in the application of machine learning. The goal of feature engineering is to wrangle your data into a form that is more interpretable for the machine learning algorithm. This makes it easier for the algorithm to learn the structure of the data and achieve good model performance.

There are many different techniques you can use to transform your data. These include imputation, scaling, transformations, handling outliers, one-hot encoding, binning, and the list goes on. Here we will discuss scaling and binning in particular.

Scaling aims to bring two continuous features into the same range such that they are comparable. In some cases, this also increases the computation speed of the algorithm, due to faster convergence. Binning, on the other hand, is the process of assigning a subcategory to a feature based on a range or condition, and can be applied to both categorical and continuous features.
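As a rough sketch of both ideas, the snippet below applies min-max scaling (one of several possible scaling schemes) and equal-width binning to two made-up features. The function names and values are illustrative assumptions, not part of any particular library.

```python
def min_max_scale(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # a constant feature has no range to scale
    return [(v - lo) / (hi - lo) for v in values]


def equal_width_bins(values, n_bins):
    """Assign each value a bin index from 0 to n_bins - 1 using equal-width bins."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0 for _ in values]
    width = (hi - lo) / n_bins
    # The maximum sits on the upper edge, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


# Hypothetical measurements recorded on very different scales.
expression = [2.0, 4.0, 6.0, 10.0]
seed_area = [150.0, 300.0, 450.0, 750.0]

scaled_expression = min_max_scale(expression)
scaled_area = min_max_scale(seed_area)

# Turn the continuous area measurements into three discrete size categories.
area_bins = equal_width_bins(seed_area, 3)
```

After scaling, both features lie between 0 and 1 and can be compared directly, while the binned feature keeps only a coarse small/medium/large distinction.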

Feature engineering also plays a role in the shift to deep learning architectures; one advantage of the deep approach is that it incorporates feature engineering within the model.

Ultimately, the aim of feature engineering is to increase the robustness and performance of a model. That said, choosing the right features is important. If the features you transform are not representative of the problem then your results may be misleading.

Above: The distribution represents a feature within a dataset. We can simplify this data by grouping its values into a predefined set of bins, thereby creating discrete values from a continuous distribution.

Feature Selection

High dimensional datasets are commonly analysed in life sciences (e.g. gene expression) but we are typically concerned with only a subset of those features.

To reduce a high dimensional dataset to a more interpretable subset of features that capture the relationship between input and output we are interested in learning, we must carry out feature selection. The three forms of feature selection are filter, wrapper, and embedded methods. As embedded methods link closely to regularisation, which is discussed in the next section, filter and wrapper methods will be the focus here.

The filter method of feature selection involves using a metric to measure the strength of the relationship between a feature and the target in order to distinguish between relevant and irrelevant features. Examples of these metrics would be Pearson’s correlation and mutual information. Every feature is scored using the chosen metric, and the best scoring features are taken forward for training purposes. An example of applying a filter method of feature selection would be using a metric to select statistically relevant genes in an expression dataset so that the model is not influenced by irrelevant features.
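A filter method can be sketched in a few lines. The example below scores each feature by the absolute value of its Pearson correlation with the target and keeps the top k. The gene names, values and the choice of k are hypothetical, and mutual information could be swapped in as the scoring metric.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant feature carries no linear signal
    return cov / math.sqrt(var_x * var_y)


def filter_select(features, target, k):
    """Score every feature against the target and keep the k best."""
    scores = {name: abs(pearson(values, target)) for name, values in features.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k], scores


# Hypothetical expression values for three genes across five samples.
features = {
    "gene_a": [1.0, 2.0, 3.0, 4.0, 5.0],  # rises with the target
    "gene_b": [5.0, 1.0, 4.0, 2.0, 3.0],  # weak relationship with the target
    "gene_c": [9.0, 8.1, 6.9, 6.0, 5.2],  # falls as the target rises
}
target = [0.9, 2.1, 2.9, 4.2, 5.0]

selected, scores = filter_select(features, target, k=2)
```

Taking the absolute value means strongly negative relationships, such as gene_c here, are treated as just as informative as strongly positive ones.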

Wrapper methods focus on searching the feature space for the best performing combination of features based on the model’s validation error. Examples of wrapper methods would be recursive feature elimination (RFE) and sequential feature selection (SFS). Both examples are computationally expensive search algorithms that iteratively either add or remove features, calculating the error at each iteration until all features have been added/removed, allowing an estimate of the optimal subset to be selected.
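The search described above can be sketched as a greedy forward selection. In this illustration the model is a simple nearest-centroid classifier and the validation error is a leave-one-out error. Both are assumptions made to keep the example short, as are the function names and the toy data.

```python
import math


def loo_error(rows, labels, feature_idx):
    """Leave-one-out error of a nearest-centroid classifier on the chosen columns."""
    classes = sorted(set(labels))
    mistakes = 0
    for held in range(len(rows)):
        centroids = {}
        for c in classes:
            members = [
                rows[i] for i in range(len(rows)) if i != held and labels[i] == c
            ]
            centroids[c] = [sum(r[f] for r in members) / len(members) for f in feature_idx]
        point = [rows[held][f] for f in feature_idx]
        predicted = min(classes, key=lambda c: math.dist(point, centroids[c]))
        mistakes += predicted != labels[held]
    return mistakes / len(rows)


def forward_selection(rows, labels):
    """Add one feature at a time, then return the best subset seen along the way."""
    remaining = list(range(len(rows[0])))
    selected, history = [], []
    while remaining:
        # Try each unused feature and keep the one giving the lowest error.
        best = min(remaining, key=lambda f: loo_error(rows, labels, selected + [f]))
        selected = selected + [best]
        remaining.remove(best)
        history.append((list(selected), loo_error(rows, labels, selected)))
    return min(history, key=lambda item: item[1])


# Hypothetical data: column 0 separates the classes, columns 1 and 2 are noise.
rows = [
    (1.0, 5.0, 30.0), (1.2, 1.0, 10.0), (0.8, 3.0, 50.0),
    (3.0, 4.0, 20.0), (3.2, 2.0, 40.0), (2.9, 6.0, 0.0),
]
labels = ["A", "A", "A", "B", "B", "B"]

best_features, best_error = forward_selection(rows, labels)
```

Every candidate subset requires the model to be retrained and re-evaluated, which is why wrapper methods become expensive as the number of features grows.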

Above: Two examples of filter methods: a relevant feature selected by measuring its correlation with the target, and two features, selected using mutual information, whose distributions are effective at discriminating between two classes.

Overfitting and Regularisation

Sometimes in a machine learning project we encounter situations where our model performs significantly better on the training set compared to the test set. This difference is referred to as overfitting and is usually caused by fitting a model that is too complex for the training data.

The excessive complexity of the model enables it to learn an unrealistically accurate mapping from input to output. But, when tasked with making predictions based on unseen data, the mapping it has learned utilises noise that is specific to the training data and does not generalise, resulting in poor predictions. Techniques such as regularisation and cross-validation can be used to combat overfitting.

Regularisation, an embedded method of feature selection, comes in many forms but a shared goal of this technique is to reduce overfitting by penalising model complexity. When optimising a model, a regularisation term that is typically calculated as a function of either the weights or hyperparameters of the model can be added to the error, meaning the model that minimises the error is more likely to be one that has balanced predictive power and complexity.
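To make the idea of a penalty concrete, the sketch below uses an L2 (ridge) penalty, one common form of regularisation, on a one-feature model y = w * x with no intercept. The names lam, penalised_error and ridge_weight, and the data, are illustrative assumptions.

```python
def penalised_error(w, xs, ys, lam):
    """Sum of squared errors plus an L2 penalty on the single weight w."""
    sse = sum((y - w * x) ** 2 for x, y in zip(xs, ys))
    return sse + lam * w ** 2


def ridge_weight(xs, ys, lam):
    """Weight that minimises penalised_error for the model y = w * x."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)


# Hypothetical training data lying exactly on y = 2x.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

w_unpenalised = ridge_weight(xs, ys, lam=0.0)  # ordinary least squares fit
w_penalised = ridge_weight(xs, ys, lam=14.0)   # the penalty shrinks the weight
```

With no penalty the weight fits the training data exactly. With the penalty switched on, the weight that minimises the combined error is smaller, trading some training accuracy for a simpler model. How strong the penalty should be is itself a hyperparameter, usually chosen by validation.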

Cross-validation is a method of estimating a model’s error on unseen data before making predictions on the test data. The method first involves partitioning the training dataset into training and validation subsets. The model is trained on the training subsets and predictions are made on the validation subsets, with each error on the validation subsets being one estimate of the error on the test data.

Different combinations of training and validation subsets are attempted and the average error across these combinations is an estimate of the error on the test data, allowing overfitting to be identified and rectified before exposure to the test data.
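The procedure can be sketched as k-fold cross-validation. The example assumes a one-feature least-squares model and contiguous, unshuffled folds to keep it short. In practice the samples would usually be shuffled first, and all names and data here are hypothetical.

```python
def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous folds of near-equal size."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def fit_slope(xs, ys):
    """Least-squares slope for the model y = w * x (no intercept)."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


def cross_validate(xs, ys, k):
    """Return the mean validation MSE and the individual fold errors."""
    fold_errors = []
    for val_idx in k_fold_indices(len(xs), k):
        held_out = set(val_idx)
        train_x = [x for i, x in enumerate(xs) if i not in held_out]
        train_y = [y for i, y in enumerate(ys) if i not in held_out]
        w = fit_slope(train_x, train_y)  # train only on the training subset
        mse = sum((ys[i] - w * xs[i]) ** 2 for i in val_idx) / len(val_idx)
        fold_errors.append(mse)  # one estimate of the error on unseen data
    return sum(fold_errors) / len(fold_errors), fold_errors


# Hypothetical training set: roughly y = 2x with a little noise.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]

mean_error, fold_errors = cross_validate(xs, ys, k=3)
```

Each sample is used for validation exactly once, and the test set is never touched. A mean validation error far above the training error is the warning sign of overfitting.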

Above: Two trained models; the left displaying an overfitted classifier that has learned a complicated decision boundary that is unlikely to generalise, the right displaying a classifier that performs worse on the training data, but is likely to generalise better to unseen data.

Evaluation Metrics

Choosing the appropriate evaluation metric is a key part of the machine learning procedure, allowing you to interpret your model’s performance and communicate your results. When comparing similar models, it’s common practice to choose the model that minimises or maximises the evaluation metric that best quantifies its effectiveness at the learning task.

Selecting an inappropriate metric can lead to a dangerous misunderstanding of a model’s performance. An example of this would be selecting accuracy as the evaluation metric when working with an imbalanced target distribution. If the task is to classify samples as infected or uninfected, but 95% of samples belong to the uninfected class, a model that predicts every sample as being uninfected would score an accuracy of 95% despite being unable to discriminate between the two classes.

More appropriate classification metrics for this task are sensitivity and specificity, which consider the number of false negatives and false positives respectively. Using multiple evaluation metrics is likely to give a more detailed interpretation of the model but using a confusion matrix that displays true positives, true negatives, false positives, and false negatives is likely to be best.
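The infected/uninfected example can be worked through directly. The sketch below builds the four confusion matrix counts for a model that predicts every sample as uninfected. The function names are illustrative, and the data simply mirror the 95%/5% split described above.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, TN, FP, FN) for a binary classification task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    return tp, tn, fp, fn


def safe_ratio(numerator, denominator):
    """Avoid dividing by zero when a class is absent."""
    return numerator / denominator if denominator else 0.0


# 95 uninfected samples (0) and 5 infected samples (1).
y_true = [0] * 95 + [1] * 5
# A model that labels every sample as uninfected.
y_pred = [0] * 100

tp, tn, fp, fn = confusion_counts(y_true, y_pred)

accuracy = (tp + tn) / len(y_true)
sensitivity = safe_ratio(tp, tp + fn)  # share of infected samples detected
specificity = safe_ratio(tn, tn + fp)  # share of uninfected samples cleared
```

Accuracy comes out at 0.95, yet sensitivity is 0: the model never detects a single infected sample, which the accuracy figure alone hides.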

Common evaluation metrics for regression tasks would be the mean squared error (MSE), mean absolute error (MAE), and coefficient of determination (R²). Choosing an appropriate metric for classification tasks is more difficult as it usually depends on the relative importance of false positives and false negatives.
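The three regression metrics are short enough to write out in full. The sketch below uses their standard definitions. The function names and the small set of true and predicted values are made up for illustration.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences; penalises large errors heavily."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_absolute_error(y_true, y_pred):
    """Average of the absolute differences; less sensitive to outliers."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical measured values and model predictions.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 9.0]

mse = mean_squared_error(y_true, y_pred)
mae = mean_absolute_error(y_true, y_pred)
r2 = r_squared(y_true, y_pred)
```

MSE and MAE are expressed in (squared) units of the target and should be minimised, whereas R² is unitless and should be maximised, with 1 indicating a perfect fit.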

Accuracy, F1, precision, recall, and AUROC are all effective evaluation metrics for a classification task but selecting the optimal metric, whether in a regression or classification setting, is dependent on both the dataset and how the success of predictions should be quantified.

Above: Two confusion matrices that include the number of true positives, true negatives, false positives, and false negatives. Without viewing the confusion matrices or using multiple evaluation metrics, one could mistake the models as performing similarly despite the large differences in predictions.

Glossary

Clustering: An unsupervised method of grouping observations based on similarities in the values of their features.

Data wrangling: The process of restructuring, cleaning or organising a dataset in order to facilitate analysis or visualisation.

Deep architecture: A neural network model that has multiple intermediate layers. The more layers that the model has, the more complex the patterns it is able to utilise within the dataset.

Feature: An explanatory variable that has been recorded as part of a dataset.

Feature engineering: Extracting and generating features from raw data to quantify the observed patterns in a more model friendly way, leading to better results.

Fitting: The procedure of adjusting a model’s weights to more accurately map input to output. Gradient descent methods are most commonly used to calculate how a model’s weights should be changed throughout the fitting (training) procedure.

High dimensional data: A dataset where each observation possesses a high number of features.

Hyperparameters: Parameters of the model that affect the fitting of the model. They can be highly influential on the model’s performance and are typically related to the complexity of the model or the speed and duration of the fitting procedure.

Target: The property/characteristic/variable that we are trying to predict in a supervised machine learning task, sometimes referred to as the label or ground truth.

Weights: Trainable and non-trainable parameters in the model that quantify how the features are mapped from input to output, or input to the next layer in a neural network.

Case Study: SeedGerm

SeedGerm offers an easy-to-use, low-cost and scalable solution to the problem of scoring seed germination - and a handy application of machine learning. Josh explains how he used some of the techniques described in this article:

“When we were developing SeedGerm, our challenge was to create algorithms that could reliably interpret the stage of seed germination from photographs. To do this, we utilised an unsupervised machine learning algorithm to predict how likely a seed was to have germinated using engineered features relating to its size, shape, and colour. The unsupervised algorithm is retrained for every new dataset meaning it is highly generalisable and unlikely to overfit. When it came to making predictions, we realised that the germination rate and timing are both important to the user. To address this, we calculated a mix of classification and regression evaluation metrics to quantify and aim to maximise our model’s performance in relation to these concepts.”

Josh Colmer and Matthew Madgwick are BBSRC funded students on the Norwich Research Park Doctoral Training Partnership (NRPDTP)

