Table 1.
Model name^a | Appropriate use-cases / assumptions of data | Advantages/strengths | Disadvantages/limitations | Examples in pediatric allergy / further reading |
---|---|---|---|---|
Unsupervised learning | | | | |
k-means / k-means++ [26], (kernel) fuzzy c-means [27], etc. | ▪ Cross-sectional continuous data ▪ Assumes clusters to be homogeneous, with similar within-cluster variation along all variables | ▪ Implementations widely available, with variants for various types of data and clustering objectives, e.g., fuzzy c-means (probabilistic labeling) [28] ▪ Computationally efficient | ▪ Unsuitable for categorical/mixed data ▪ Unsuitable if clusters have strongly different within-cluster variation ▪ Prioritizes within-cluster homogeneity over between-cluster separation | ▪ Endotypes of seasonal AR based on cytokine patterns [29] ▪ Immunologic endotypes of AD in infants [30] ▪ Phenotypes of asthma based on demographics, comorbidities, and medication [31] ▪ Biomarker-based phenotypes of AD [32] ▪ Transcriptomic clusters of asthma [33] |
k-medoids / partitioning around medoids (PAM) / FastPAM / CLARA / CLARANS [34–37] / fuzzy k-medoids [38], etc. | ▪ Cross-sectional continuous or mixed data ▪ More flexible than k-means, but still focuses on within-cluster homogeneity | ▪ Relatively insensitive to outliers ▪ Can use any distance metric [39] ▪ Somewhat computationally costly, but high-performance variants are available, e.g., [34, 35, 40] ▪ Implementations widely available, with variants for specific clustering objectives, e.g., soft labeling [38] | ▪ Spherical/convex clusters are more likely to be identified [41] ▪ DBSCAN and other methods may be more appropriate when cluster sizes differ strongly, or when between-cluster separation is more important than within-cluster homogeneity [42] | ▪ Longitudinal phenotypes of AR based on mHealth symptom/medication data [43] ▪ Phenotypes of AR based on demographic, heredity, and clinical data [44] ▪ Phenotypes of eczema based on longitudinal disease patterns [45] ▪ Longitudinal phenotypes of wheezing [46, 47] ▪ Endotypes of asthma based on exhaled breath condensate [48] |
Hierarchical clustering (HCA) / HCA on principal components (HCPC) [49], etc. | ▪ Cross-sectional continuous, categorical, or mixed data | ▪ Accommodates any distance metric and a variety of linkage functions [50] ▪ Collinear/high-dimensional data can be handled automatically in HCPC [49], which combines feature extraction with HCA on the principal components ▪ Implementations widely available, as are various hierarchical methods with different emphases, e.g., on homogeneity, separation, or outliers [51] | ▪ Computationally costly with large data [50, 52] ▪ Meaningful clusters may occur only at low levels of the hierarchy, requiring potentially many clusters (some of which may just be outlier clusters) ▪ Clusters merged once cannot be separated again, which can result in suboptimal solutions | ▪ Endotypes of rhinitis [53] ▪ Endotypes of seasonal AR [29] ▪ Phenotypes of AD based on allergic sensitization patterns [54] ▪ Phenotypes of asthma based on comorbidity, demographics, and asthma symptoms/lung function [55] ▪ Clusters of family adaptation to child’s FA [56] ▪ Clusters of asthma treatment outcome [57] |
Latent class analysis (LCA) / longitudinal latent class analysis (LLCA) | ▪ Longitudinal or cross-sectional categorical data | ▪ Computationally efficient ▪ Informative, providing probability of assignment (soft clustering) ▪ Statistically principled approach available for estimating the number of clusters ▪ Interpretability of original variables | ▪ Within-class heterogeneity is possible ▪ Compared with other methods used for trajectory analysis, performance may be lower [58] | ▪ Trajectories of wheezing, rhinoconjunctivitis, and eczema symptoms [59] ▪ Grass/mite sensitization trajectories [60] ▪ Longitudinal phenotypes of FA and AD [61] ▪ Subtypes of AR based on comorbidity, heredity, and sensitization [62] ▪ Sensitization patterns [63] |
Growth mixture modeling (GMM) | ▪ Longitudinal data (continuous, but some implementations can handle categorical variables) | ▪ Implementations widely available ▪ Allows for within-class variation ▪ Statistically principled approach available for estimating the number of clusters | ▪ More computationally demanding than, e.g., LCGA [64] | ▪ Trajectories of wheezing and allergic sensitization [65] ▪ Trajectories of allergic sensitization [66] |
Latent class growth analysis (LCGA) / group-based trajectory modeling (GBTM) | ▪ Longitudinal data (continuous, but some implementations can handle categorical data) ▪ Useful when a small sample or complex model causes convergence issues [67] | ▪ Implementations widely available ▪ High within-class homogeneity, as no within-class variation is allowed | ▪ May necessitate a larger number of classes because no within-class variation is allowed [68], which may prove problematic if the sample size is small | ▪ Trajectories of asthma/wheezing based on dispensing data and hospital admissions [69] ▪ Trajectories of wheezing [70] ▪ Trajectories of early-onset rhinitis [71] |
Supervised learning | | | | |
k-nearest neighbors (k-NN) | ▪ Primarily for classification, but may also be used for regression ▪ Handles varying degrees of noise, data sizes, and numbers of labels [74] | ▪ One of the most widely used methods, with easy implementation, available in most statistical software ▪ Robust to outliers/noise [75] | ▪ High computational demand on large datasets [76] | ▪ Prediction of persistent asthma [77] ▪ Prediction of asthma diagnosis [78] ▪ Multi-omics-based prediction of asthma [79] ▪ Prediction of asthma exacerbations based on blood markers, FeNO, and clinical characteristics [80] |
Support vector machine (SVM) | ▪ High-dimensional data (continuous, but categorical variables can be supported [81]) | ▪ Performs well on high-dimensional and complex predictor data | ▪ Prone to overfitting, more so than many other supervised learning methods [77] ▪ Low explainability | ▪ Prediction of asthma diagnosis [78, 82] ▪ Prediction of symptomatic peanut allergy based on microarray immunoassay [83] ▪ Prediction of AD based on transcriptome/microbiota data [84] |
Decision trees (DT) | ▪ Classification or regression ▪ Large, complex data [85] of continuous, categorical, or mixed nature | ▪ Easily interpretable output | ▪ Prone to overfitting ▪ Low accuracy compared with ensembles of DTs [86] | ▪ Prediction of asthma diagnosis [78, 82] ▪ Prediction of hospitalization need for asthma exacerbation [87] ▪ Prediction of symptomatic peanut allergy based on microarray immunoassay [83] |
Random forests (RF) | ▪ Classification or regression ▪ Continuous, categorical, or mixed data | ▪ Relatively low risk of overfitting [88] | ▪ Limited interpretability [88] | ▪ Prediction of persistent asthma [77] ▪ Prediction of asthma diagnosis [78, 82] ▪ Prediction of hospitalization need for asthma exacerbation [87] ▪ Prediction of AD based on transcriptome/microbiota data [84] ▪ Prediction of asthma exacerbations based on AI stethoscope and parental reporting [89] |
Bayesian network | ▪ Continuous (although often discretized), categorical, or mixed data, modeled with a probabilistic graphical model (directed acyclic graph) | ▪ Adaptive: the network can be refined with new information [90] ▪ Capable of displaying and analyzing complex relationships [91] ▪ Accommodates missingness by utilizing all variables in the model [90] ▪ Intuitive interpretation due to probabilistic labeling [92] | ▪ Loss of information in discretization [90] ▪ High computational demand on large datasets [93] | ▪ Prediction of asthma exacerbation [91] ▪ Prediction of response to short-acting bronchodilator medication [92] ▪ Metabolomic prediction of asthma [94] ▪ Prediction of eczema and asthma, respectively, based on SNP signatures [95] |
Naïve Bayes | ▪ Primarily for classification | ▪ Needs relatively little training data [75] ▪ Computationally efficient [96] | ▪ Strong assumption of independence between features [97] ▪ May need discretization of continuous variables [98] | ▪ Prediction of OFC outcome [99] ▪ Prediction of asthma [78] ▪ Prediction of persistent asthma [77] |
Multilayer perceptron (MLP) | ▪ Complex high-dimensional continuous data | ▪ Ability to learn complex patterns from high-dimensional big data | ▪ May require large data for training, particularly with a deep/complex network architecture [100] | ▪ Prediction of asthma diagnosis [78] ▪ Multi-omics-based prediction of asthma [79] |
Extreme gradient boosting (XGBoost) | ▪ Classification or regression ▪ Continuous and categorical data [75] ▪ High-dimensional and/or sparse data [101] | ▪ Improves accuracy through an ensemble of weak prediction models ▪ Computationally efficient [10] | ▪ Performance issues may arise with imbalanced data [101] ▪ High memory usage | ▪ Prediction of persistent asthma [77] ▪ Prediction of AD based on transcriptome/microbiota data [84] |
Adaptive boosting (AdaBoost) | ▪ Classification, but can be adapted for regression ▪ Useful when it is crucial to boost the performance of simple models through ensemble methods | ▪ Combines multiple weak classifiers into a strong classifier, improving accuracy ▪ Often achieves better performance with less parameter tweaking than other complex models | ▪ Sensitive to noisy data and outliers, which can decrease performance if not handled properly ▪ Can overfit if the number and size of weak classifiers are not controlled | ▪ Prediction of allergy [102] ▪ Allergy diagnosis framework [103] ▪ Asthma treatment outcome prediction [104] |
Logistic regression (LR) | ▪ Primarily for binary classification (but can be extended to multiclass, e.g., with a one-vs-rest strategy) ▪ Suitable for models where the outcome is a probability between 0 and 1 | ▪ Simple implementation ▪ Efficient to train ▪ Well understood and widely used ▪ Comparable performance to more advanced models in specific binary classification contexts [105] | ▪ Prone to underfitting when the relationship between features and target is non-linear, unless feature engineering is applied ▪ Susceptible to overfitting with high-dimensional data if not regularized appropriately ▪ Assumes a linear relationship between the predictors and the log-odds of the outcome, which can be limiting in complex scenarios [106] | ▪ Prediction of house dust mite-induced allergy [107] ▪ Prediction of asthma persistence [77] ▪ Prediction of FA [108] ▪ Prediction of AD based on transcriptome and microbiota data [84] |
^a The list is not intended to be comprehensive or to cover all relevant/possible use-cases, but rather to provide an overview of common and promising algorithms. Abbreviations: AD, atopic dermatitis; AI, artificial intelligence; AR, allergic rhinitis; FA, food allergy; FeNO, fraction of exhaled nitric oxide; N/A, not available; OFC, oral food challenge; SNP, single nucleotide polymorphism.
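As a concrete illustration of the first family of methods in the table, the sketch below implements Lloyd's k-means algorithm in plain Python on a toy two-dimensional dataset. The data, the `kmeans` helper, and its parameters are illustrative assumptions, not taken from any of the cited studies; the point is to make visible the property noted in the table, namely that k-means directly optimizes within-cluster homogeneity by alternating point assignment and center updates.

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Minimal Lloyd's k-means for 2-D points (illustrative sketch only)."""
    rng = random.Random(seed)
    # Naive random initialization; k-means++ instead chooses spread-out seeds.
    centers = rng.sample(points, k)
    for _ in range(iters):
        # Assignment step: each point joins the cluster of its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda c: (p[0] - centers[c][0]) ** 2
                                      + (p[1] - centers[c][1]) ** 2)
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster.
        new_centers = []
        for center, cluster in zip(centers, clusters):
            if cluster:
                new_centers.append((sum(p[0] for p in cluster) / len(cluster),
                                    sum(p[1] for p in cluster) / len(cluster)))
            else:
                new_centers.append(center)  # keep the center of an empty cluster
        if new_centers == centers:  # converged: assignments no longer change
            break
        centers = new_centers
    return centers, clusters

# Two well-separated toy "phenotype" blobs; k-means recovers them exactly.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11)]
centers, clusters = kmeans(points, k=2)
# centers converge to the two blob means, (0.5, 0.5) and (10.5, 10.5)
```

In practice one would of course use a library implementation such as `KMeans` in scikit-learn, which adds k-means++ seeding, multiple restarts, and vectorized distance computations; the sketch also shows why the table lists homogeneous, similarly-sized clusters as an assumption, since the squared-distance objective has no term rewarding between-cluster separation.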