TABLE 1.
Overview of common clustering, classification, and regression metrics, including their advantages, disadvantages, and example uses in genetics and genomics.
| Metric name | Description | Advantages | Disadvantages | References |
|---|---|---|---|---|
| Adjusted Rand Index (ARI) | Compares the similarity between calculated clusters and a ground truth (or a different clustering) (Hubert and Arabie, 1985). For example, predicted clusters in a disease group and known disease subtypes | - Compared to the Rand Index, corrects for when the number or size of clusters could be impacted by chance (Hubert and Arabie, 1985). Important for genetics where there is high dimensionality - No bias toward certain cluster shapes (Steinley, 2004) | - Requires a known ground truth clustering set so cannot be used if you want to identify new variant or disease subtypes - Biased to cluster size, influenced by large clusters (Warrens and van der Hoef, 2022) - Not applicable to overlapping clusters (e.g., genes in multiple pathways in pathway analysis) | - Clustering of microbiome data (Shi et al., 2022) - Clustering of single-cell Hi-C data (Zhen et al., 2022) - Clustering differentially expressed cancer genes (Wang et al., 2022) |
| Adjusted Mutual Information (AMI) | Compares the similarity between calculated clusters and a ground truth (or a different clustering) (Vinh et al., 2010). Similar to ARI, but more suitable for rare disease subtypes (i.e., imbalanced clusters) | - Biased towards pure clusters, not dependent on cluster size. More suitable for imbalanced clusters (e.g., rare diseases) (Romano et al., 2016) | - Requires a known ground truth clustering set so cannot be used if you want to identify new variant or disease subtypes | - Creating gene regulatory networks (Shachaf et al., 2023) - Identifying genetic variant interactions (Cao et al., 2018) - Analysing biomarker similarities (Keup et al., 2021) |
| Fowlkes-Mallows Index | Compares the similarity between calculated clusters and a ground truth (or a different clustering). The geometric mean of precision and recall for the clustering (Fowlkes and Mallows, 1983) | - No bias toward certain cluster shapes so can compare different clustering algorithms (Fowlkes and Mallows, 1983) | - The index is biased toward a small number of clusters (Wagner and Wagner, 2007) | - Estimating the sequence similarity of two genomes (Ryšavý and Železný, 2017) - Creating genetic similarity matrices for population substructures (Lee et al., 2023) |
| Silhouette Index (SI) | Compares the similarity within clusters to the similarity between clusters (Rousseeuw, 1987). For example, finding the ‘best’ clustering to identify new disease subtypes | - Usually handles outliers better than DBI (Dixon et al., 2009) - Useful for identifying the optimal number of clusters (Shahapure and Nicholas, 2020) | - Cannot detect if the clustering is due to a bias in the data that is unrelated to the trait (Chhabra et al., 2021) - Assumptions rely on Gaussian clusters so unsuitable for rare disease clusters or sparse data (Thrun, 2018) | - Clustering Multiple Sclerosis (MS) patients based on GWAS data (Lopez et al., 2018) - Clustering schizophrenia patients based on clinical and genetic data (Yin et al., 2018) |
| Davies-Bouldin Index (DBI) | Compares the similarity between each cluster and the cluster most similar to it (Davies and Bouldin, 1979). For example, finding the ‘best’ clustering to identify new disease subtypes | - Simpler and more efficient computation than SI (Petrović, 2006) - Handles different shapes and cluster counts better than SI and CHI (Davies and Bouldin, 1979) | - Requires Euclidean distances which are not always suitable, e.g., in sparse datasets (Davies and Bouldin, 1979) - Cannot compare between datasets (Dixon et al., 2009) | - Gene expression clustering for systemic autoinflammatory diseases (Papagiannopoulos et al., 2024) - Clustering single-cell transcriptomes for identification of cell types and states (Zhao et al., 2023) |
| Calinski-Harabasz Index (CHI) | Compares the similarity within clusters to the distance from the cluster to the global centre (Caliński and Harabasz, 1974). For example, finding the ‘best’ clustering to identify new disease subtypes | - Simple and efficient computation, an important consideration for large genomics datasets (Caliński and Harabasz, 1974) | - Assumes that clusters have equal size and density (Caliński and Harabasz, 1974). Spherical cluster assumptions are unsuitable for imbalanced clusters (e.g., rare disease clusters) | - Risk stratification from electronic health record data (Huang et al., 2021) - Gene clustering from single-cell data with reduced uncertainty (Li et al., 2023) |
| Gap Statistic | Compares the within-cluster variation to the expected value under a reference distribution (Tibshirani et al., 2001). A method for selecting the optimal number of clusters, but can also be used as a metric, with higher values indicating the clustering is significantly better than random | - Useful for identifying optimal cluster numbers (Tibshirani et al., 2001) - Useful for evaluating the clusters with respect to random noise (Tibshirani et al., 2001). This is helpful in genomics where there is uncertainty over whether the disease or variants being clustered have subtypes or not | - Not as direct as the previously listed metrics - Relies on comparison with a random distribution rather than comparing clustering properties (Tibshirani et al., 2001) | - Clustering type 2 diabetes based on clinical biomarkers (Lugner et al., 2021) - Choosing the number of clusters for population clustering based on short tandem repeats (STRs) (Syukriani and Hidayat, 2023) |
| Accuracy | Percentage of samples correctly predicted. For example, the percentage of individuals correctly labelled as diseased or control | - Very simple to understand | - Heavily impacted by imbalanced datasets, which are common in genomics (Bone et al., 2015; Poldrack et al., 2020) | - Prediction of schizophrenia from genetic and clinical data on comorbid conditions (Chen et al., 2018) - Prediction of ADHD from genetic variants (Liu et al., 2021) |
| Precision | Percentage of samples predicted to be “positive” that are actually “positive”. For example, the percentage of identified variants that are predicted correctly | - Useful when false positives are more detrimental than false negatives | - Only considers the positive predictions (e.g., predicted cases) | - Identifying drug sensitive cancer cell lines (Naulaerts et al., 2017) - Analysing gene expression profiles from microarray data while maintaining high precision (Salem et al., 2017) |
| Recall | Percentage of “positive” samples that were correctly predicted. For example, the percentage of breast cancer cases correctly predicted | - Useful when false negatives are more detrimental than false positives | - Only considers the positive class (e.g., cases). You could get 100% recall by predicting everyone to be a case | - Improving recall of taxonomic metagenomic sequence classification (Girotto et al., 2017) - Early detection of cervical cancer with high recall (Gupta et al., 2021) |
| F1 | The harmonic mean of precision and recall. For example, minimising both missed diagnoses (false negatives) and incorrect diagnoses (false positives) in a genetic testing algorithm | - Focusses on the trade-off between precision and recall in one metric - More suitable for imbalanced data than accuracy, however, less so than AUROC (Jeni et al., 2013) | - Does not consider true negatives which can be important (e.g., identifying individuals who do not carry a specific mutation in carrier screening) | - Training Geneformer, a model using single-cell transcriptomes for context-aware predictions of, e.g., gene network dynamics (Theodoris et al., 2023) - Survival prediction of hepatocellular cancer based on clinical data and biomarkers (Książek et al., 2021) |
| Area Under the Receiver Operating Characteristic Curve (AUROC) | The area under the curve (AUC) of the true positive rate (TPR) plotted against the false positive rate (FPR). Often used to compare different ML models for predicting a certain disease or variant type | - Useful for objective model comparison, particularly when the optimal decision boundary is unknown - Visualises the trade-off between TPR and FPR | - Alone it provides little clinical significance as it is not at a specific decision boundary - Susceptible to biases from imbalanced and small datasets, which are common in genomics (however, less so than accuracy) (Faviez et al., 2020) - Gives false positives and false negatives the same weighting, which is often not the case in genomics (Ioannidis et al., 2011) | - Prediction of Parkinson’s disease from genetic variants (Ho et al., 2022) - Prediction of Alzheimer’s disease from gene expression data (Lee and Lee, 2020) |
| Matthews Correlation Coefficient (MCC) | A balanced metric to evaluate classification predictions considering true negatives (TN), true positives (TP), false negatives (FN), and false positives (FP) (Matthews, 1975; Baldi et al., 2000) | - Considers all confusion matrix components (TN, TP, FN, FP) - Handles imbalanced data better than accuracy, F1 and AUROC (Chicco and Jurman, 2020; 2023) | - Less well known, so less familiar to readers without an ML background | - Predicting melanoma from mRNA and methylation data (Bhalla et al., 2019) - Predicting cancer progression from RNAseq data (Singh et al., 2018) |
| Cohen’s kappa | Evaluates the level of agreement between two groups (originally between two raters, now often between predictions and ground truth) taking into account chance agreement (Ben-David, 2008) | - Accounts for agreement expected by chance (Ben-David, 2008) | - Less intuitive to set a threshold in clinical settings as it is a relative measure - Not robust to asymmetric confusion matrices or imbalanced data and can therefore give conflicting values to MCC (Jeni et al., 2013; Delgado and Tibau, 2019) | - Microbial risk assessment using next-generation sequencing (NGS) (Njage et al., 2019) - Predicting individuals’ lithium response from genetic variants (Stone et al., 2021) |
| Mean Absolute Error (MAE) | The average absolute difference between the predicted values and known values. For example, the average distance (in kg) that a model is from predicting birth weight | - Easy to interpret as it shares units with the measurements - Low sensitivity to outliers (Hodson, 2022) | - Cannot be used to compare the predictions of datasets with different variances | - Predicting bone mineral density from genetic variants (Wu et al., 2021) - Predicting gene expression from ‘landmark genes’ using cluster-based regression (Seok, 2021) |
| Root Mean Squared Error (RMSE) | Similar to MAE, but it is the square root of the mean squared difference between the predicted values and known values | - Easy to interpret as it shares units with the measurements | - Higher outlier sensitivity than MAE (Hodson, 2022) | - Predicting BMI from clinical and genetic data (Harrison et al., 2017) - Analysing association between body fat and cardiovascular risk (Saito et al., 2017) |
| R-squared (R2) | Proportion of variation in the target variable that the regression model explains. For example, the percentage of variation in height explained by a regression model with known biomarkers | - Unitless so easy to compare different models | - Relying on a high R2 during model tuning can lead to overfitting (Bohrnstedt and Carter, 1971) - Tends to increase as parameters are added (fixed with adjusted R2) (Bohrnstedt and Carter, 1971) | - Analysing association between genetic scores and birth weight, using R2 and adjusted R2 (Haulder et al., 2022) - Comparing predictability of genetic risk scores for different traits across different ancestral groups (Ekoru et al., 2021) |
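Most of the metrics in Table 1 have standard implementations in scikit-learn's `metrics` module. The sketch below computes them on small invented toy labels and values (not data from any study cited above) purely to illustrate the function calls; the gap statistic is omitted as it has no scikit-learn implementation and is usually computed via resampling against a reference distribution.

```python
# Illustrative sketch: metrics from Table 1 via scikit-learn, on toy data.
import numpy as np
from sklearn import metrics

# --- External clustering metrics: compare predicted clusters to a ground truth ---
true_clusters = [0, 0, 1, 1, 2, 2]
pred_clusters = [0, 0, 1, 2, 2, 2]
ari = metrics.adjusted_rand_score(true_clusters, pred_clusters)
ami = metrics.adjusted_mutual_info_score(true_clusters, pred_clusters)
fmi = metrics.fowlkes_mallows_score(true_clusters, pred_clusters)

# --- Internal clustering metrics: need the data points themselves ---
X = np.array([[0.0, 0.0], [0.1, 0.1], [5.0, 5.0],
              [5.1, 5.1], [10.0, 0.0], [10.1, 0.1]])
labels = [0, 0, 1, 1, 2, 2]
sil = metrics.silhouette_score(X, labels)       # higher is better, in [-1, 1]
dbi = metrics.davies_bouldin_score(X, labels)   # lower is better, >= 0
chi = metrics.calinski_harabasz_score(X, labels)  # higher is better

# --- Classification metrics: e.g., case (1) vs. control (0) predictions ---
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
y_prob = [0.9, 0.8, 0.4, 0.2, 0.1, 0.6, 0.7, 0.3]  # predicted case probabilities
acc = metrics.accuracy_score(y_true, y_pred)
prec = metrics.precision_score(y_true, y_pred)
rec = metrics.recall_score(y_true, y_pred)
f1 = metrics.f1_score(y_true, y_pred)
auroc = metrics.roc_auc_score(y_true, y_prob)   # needs scores, not hard labels
mcc = metrics.matthews_corrcoef(y_true, y_pred)
kappa = metrics.cohen_kappa_score(y_true, y_pred)

# --- Regression metrics: e.g., predicted vs. measured birth weight (kg) ---
y_meas = [3.2, 3.5, 2.9, 4.1]
y_hat = [3.0, 3.6, 3.1, 3.9]
mae = metrics.mean_absolute_error(y_meas, y_hat)
rmse = np.sqrt(metrics.mean_squared_error(y_meas, y_hat))
r2 = metrics.r2_score(y_meas, y_hat)
```

Note that AUROC is computed from predicted probabilities while the other classification metrics take thresholded labels; comparing, e.g., MCC and kappa on the same predictions is a quick way to see the disagreements on imbalanced data discussed in the table.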