2024 Dec 21;184(1):98. doi: 10.1007/s00431-024-05925-5

Table 1.

Commonly used machine learning models

Model name^a | Appropriate use-cases / assumptions of data | Advantages/strengths | Disadvantages/limitations | Examples in pediatric allergy / further reading
Unsupervised learning
k-means / k-means++ [26], (kernel) fuzzy c-means [27], etc.

▪ Cross-sectional continuous data

▪ Assumes clusters to be homogeneous with similar within-cluster variation along all variables

▪ Implementations widely available, with variants for various types of data and clustering objectives, e.g., fuzzy c-means (probabilistic labeling) [28]

▪ Computationally efficient

▪ Unsuitable for categorical/mixed data

▪ Unsuitable if clusters have strongly different within-cluster variation

▪ Prioritizes within-cluster homogeneity over between-cluster separation

▪ Endotypes of seasonal AR based on cytokine patterns [29]

▪ Immunologic endotypes of AD in infants [30]

▪ Phenotypes of asthma based on demographics, comorbidities, and medication [31]

▪ Biomarker-based phenotypes of AD [32]

▪ Transcriptomic clusters of asthma [33]
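
To make the workflow concrete, a minimal k-means sketch on synthetic two-cluster data (an illustration, not from the article; assumes numpy and scikit-learn are available — scikit-learn uses k-means++ initialization by default):

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic cross-sectional continuous data: two well-separated groups
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(6, 1, (50, 2))])

# k-means++ initialization is the scikit-learn default
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = km.labels_  # hard cluster assignment per observation
```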

k-medoids / partitioning around medoids (PAM) / FastPAM / CLARA / CLARANS [34–37] / fuzzy k-medoids [38], etc.

▪ Cross-sectional continuous or mixed data

▪ More flexible than k-means but still focuses on within-cluster homogeneity

▪ Relatively insensitive to outliers

▪ Can use any distance metric [39]

▪ Somewhat computationally costly, but high-performance variants are available, e.g., [34, 35, 40]

▪ Implementations widely available, with various iterations for specific clustering objectives, e.g., with soft labeling [38]

▪ Spherical/convex clusters are more likely to be identified [41]

▪ DBSCAN and other methods may be more appropriate in case of very different cluster sizes, and if between-cluster separation is more important than within-cluster homogeneity [42]

▪ Longitudinal phenotypes of AR based on mHealth symptoms/medication data [43]

▪ Phenotypes of AR based on demographic, heredity, and clinical data [44]

▪ Phenotypes of eczema based on longitudinal disease patterns [45]

▪ Longitudinal phenotypes of wheezing [46, 47]

▪ Endotypes of asthma based on exhaled breath condensate [48]

Hierarchical clustering (HCA) / HCA on principal components (HCPC) [49], etc.

▪ Cross-sectional continuous, categorical, or mixed data

▪ Accommodates any distance metric and a variety of linkage functions [50]

▪ Collinear/high-dimensional data can be managed automatically in HCPC [49], which combines feature extraction with HCA on the principal components

▪ Implementations widely available, as are various hierarchical methods with different focuses, e.g., on homogeneity, separation, or outliers [51]

▪ Computationally costly with large data [50, 52]

▪ Meaningful clusters may only occur at low levels of the hierarchy, requiring potentially many clusters (some of which may just be outlier clusters)

▪ Clusters merged once cannot be separated again, which can result in suboptimal solutions

▪ Endotypes of rhinitis [53]

▪ Endotypes of seasonal AR [29]

▪ Phenotypes of AD based on allergic sensitization patterns [54]

▪ Phenotypes of asthma based on comorbidity, demographics, and asthma symptoms/lung function [55]

▪ Clusters of family adaptation to child’s FA [56]

▪ Clusters of asthma treatment outcome [57]
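
A minimal hierarchical-clustering sketch with Ward linkage (illustrative synthetic data; assumes numpy and scipy are available — in practice the dendrogram would guide the choice of cut height):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Synthetic data: two groups in three dimensions
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (30, 3)), rng.normal(5, 1, (30, 3))])

# Ward linkage on Euclidean distances; cut the resulting tree into 2 clusters
Z = linkage(X, method="ward")
labels = fcluster(Z, t=2, criterion="maxclust")
```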

Latent class analysis (LCA) / longitudinal latent class analysis (LLCA)

▪ Longitudinal or cross-sectional categorical data

▪ Computationally efficient

▪ Informative by providing probability of assignment (soft clustering)

▪ Statistically principled approach to estimate number of clusters available

▪ Interpretability of original variables

▪ Inter-class heterogeneity is possible

▪ Compared to other methods used for trajectory analysis, performance may be lower [58]

▪ Trajectories of wheezing, rhinoconjunctivitis, and eczema symptoms [59]

▪ Grass/mite sensitization trajectories [60]

▪ Longitudinal phenotypes of FA and AD [61]

▪ Subtypes of AR based on comorbidity, heredity, and sensitization [62]

▪ Sensitization patterns [63]

Growth mixture modeling (GMM)

▪ Longitudinal data (continuous, but some implementations can handle categorical variables)

▪ Implementations widely available

▪ Allows for within-class variation

▪ Statistically principled approach to estimate number of clusters available

▪ More computationally demanding than, e.g., LCGA [64]

▪ Trajectories of wheezing and allergic sensitization [65]

▪ Trajectories of allergic sensitization [66]
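
Growth mixture models are typically fitted in specialized software (e.g., R or Mplus); as a simplified stand-in, a Gaussian mixture on synthetic "trajectory" rows illustrates the key idea of soft class membership with within-class variation (assumes numpy and scikit-learn; data and class structure are invented for illustration):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic "trajectories": 40 subjects x 4 time points, two latent classes
rng = np.random.default_rng(2)
low = rng.normal(0.0, 0.5, (20, 4))   # persistently low trajectories
high = rng.normal(3.0, 0.5, (20, 4))  # persistently high trajectories
X = np.vstack([low, high])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
probs = gmm.predict_proba(X)  # soft (probabilistic) class membership
labels = gmm.predict(X)
```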

Latent class growth analysis (LCGA) / group-based trajectory modelling (GBTM)

▪ Longitudinal data (continuous, but some implementations can handle categorical data)

▪ Suitable when a small sample or a complex model causes convergence issues in related methods such as GMM [67]

▪ Implementations widely available

▪ High homogeneity due to no within-class variation allowed

▪ May necessitate larger number of classes due to no within-class variation [68], which may prove problematic if sample size is small

▪ Trajectories of asthma/wheezing based on dispensing data and hospital admissions [69]

▪ Trajectories of wheezing [70]

▪ Trajectories of early-onset rhinitis [71]

▪ Trajectories of eczema [72, 73]

Supervised learning
k-nearest neighbors (k-NN)

▪ Primarily for classification but may also be used for regression

▪ Handles varying degrees of noise, data sizes, and numbers of labels [74]

▪ One of the most widely used methods, with easy implementation, available in most statistical software

▪ Robust to outliers/noise [75]

▪ High computational demand on large datasets [76]

▪ Prediction of persistent asthma [77]

▪ Prediction of asthma diagnosis [78]

▪ Multi-omics based prediction of asthma [79]

▪ Prediction of asthma exacerbations based on blood markers, FeNO, and clinical characteristics [80]
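
A minimal k-NN classification sketch on synthetic data (illustrative only; assumes numpy and scikit-learn — k, i.e., n_neighbors, is the main tuning parameter):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Synthetic two-class data
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, (40, 2)), rng.normal(4, 1, (40, 2))])
y = np.array([0] * 40 + [1] * 40)

# Each point is classified by majority vote among its 5 nearest neighbors
knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)
acc = knn.score(X, y)
```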

Support vector machine (SVM)

▪ High-dimensional data (continuous, but categorical variables can be supported [81])

▪ Performs well on high-dimensional and complex predictor data

▪ Prone to overfitting, more so than many other supervised learning methods [77]

▪ Low explainability

▪ Prediction of asthma diagnosis [78, 82]

▪ Prediction of symptomatic peanut allergy based on microarray immunoassay [83]

▪ Prediction of AD based on transcriptome/microbiota data [84]
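
An illustrative SVM sketch on synthetic high-dimensional data (assumes scikit-learn; the RBF kernel and its C/gamma hyperparameters would normally be tuned by cross-validation to limit the overfitting noted above):

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic high-dimensional data: 60 samples, 50 features, few informative
X, y = make_classification(n_samples=60, n_features=50, n_informative=5,
                           random_state=0)

# RBF kernel; C and gamma would normally be tuned by cross-validation
svm = SVC(kernel="rbf").fit(X, y)
train_acc = svm.score(X, y)  # training accuracy only; honest estimates need held-out data
```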

Decision trees (DT)

▪ Classification or regression

▪ Complex, large data [85] of continuous, categorical, or mixed nature

▪ Easily interpretable output

▪ Prone to overfitting

▪ Low accuracy compared to ensembles of DTs [86]

▪ Prediction of asthma diagnosis [78, 82]

▪ Prediction of hospitalization need for asthma exacerbation [87]

▪ Prediction of symptomatic peanut allergy based on microarray immunoassay [83]
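
A decision-tree sketch on synthetic data (assumes scikit-learn); export_text prints the interpretable if/else rules, and limiting max_depth is one common way to curb the overfitting noted above:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=100, n_features=4, random_state=0)

# Shallow tree: limiting depth curbs overfitting at some cost in accuracy
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
rules = export_text(tree)  # human-readable if/else rules
acc = tree.score(X, y)
```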

Random forests (RF)

▪ Classification or regression

▪ Continuous, categorical or mixed data

▪ Relatively low risk of overfitting [88]

▪ Limited interpretability [88]

▪ Prediction of persistent asthma [77]

▪ Prediction of asthma diagnosis [78, 82]

▪ Prediction of hospitalization need for asthma exacerbation [87]

▪ Prediction of AD based on transcriptome/microbiota data [84]

▪ Prediction of asthma exacerbations based on AI stethoscope and parental reporting [89]
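
A random-forest sketch on synthetic data (assumes scikit-learn); feature importances offer partial interpretability despite the limitation noted above:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Ensemble of decision trees on bootstrap samples and random feature subsets
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
importances = rf.feature_importances_  # per-feature contribution (sums to 1)
acc = rf.score(X, y)
```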

Bayesian network

▪ Continuous (although often discretized), categorical, or mixed data, based on a probabilistic graphical model (directed acyclic graph)

▪ Adaptive by possibility to refine the network with new information [90]

▪ Capability of displaying and analyzing complex relationships [91]

▪ Accommodating missingness by utilization of all variables in model [90]

▪ Intuitive interpretation due to probabilistic labeling [92]

▪ Loss of information in discretization [90]

▪ High computational demand on large datasets [93]

▪ Prediction of asthma exacerbation [91]

▪ Prediction of response to short-acting bronchodilator medication [92]

▪ Metabolomic prediction of asthma [94]

▪ Prediction of eczema and asthma, respectively, based on SNP signatures [95]
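
Bayesian networks are usually built with dedicated libraries; this dependency-free toy sketch (variable names and probabilities are hypothetical, not from the cited studies) shows how a DAG factorizes the joint distribution and how inference enumerates over hidden variables:

```python
# Toy Bayesian network over a directed acyclic graph:
# Atopy -> Wheeze, (Atopy, Wheeze) -> Asthma.
# All probabilities are hypothetical, for illustration only.
p_atopy = {True: 0.3, False: 0.7}
p_wheeze_given_atopy = {True: 0.6, False: 0.2}            # P(wheeze=True | atopy)
p_asthma_given = {(True, True): 0.7, (True, False): 0.3,  # P(asthma=True | atopy, wheeze)
                  (False, True): 0.25, (False, False): 0.05}

def joint(atopy, wheeze, asthma):
    """Joint probability factorized along the DAG."""
    p = p_atopy[atopy]
    p *= p_wheeze_given_atopy[atopy] if wheeze else 1 - p_wheeze_given_atopy[atopy]
    p *= p_asthma_given[(atopy, wheeze)] if asthma else 1 - p_asthma_given[(atopy, wheeze)]
    return p

# Posterior P(asthma=True | wheeze=True) by enumerating the hidden variable (atopy)
num = sum(joint(a, True, True) for a in (True, False))
den = sum(joint(a, True, s) for a in (True, False) for s in (True, False))
posterior = num / den
```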

Naïve Bayes

▪ Primarily for classification

▪ Needs relatively little training data [75]

▪ Computationally efficient [96]

▪ Strong assumptions of independence between the features [97]

▪ May need discretization of continuous variables [98]

▪ Prediction of OFC outcome [99]

▪ Prediction of asthma [78]

▪ Prediction of persistent asthma [77]
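
A Gaussian naïve Bayes sketch on synthetic continuous data (assumes numpy and scikit-learn; the Gaussian variant avoids the discretization noted above at the cost of a per-feature normality assumption):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 1, (30, 3)), rng.normal(3, 1, (30, 3))])
y = np.array([0] * 30 + [1] * 30)

# Features are treated as conditionally independent given the class
nb = GaussianNB().fit(X, y)
probs = nb.predict_proba(X)  # probabilistic predictions per class
acc = nb.score(X, y)
```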

Multilayer perceptron (MLP)

▪ Complex high-dimensional continuous data

▪ Ability to learn complex patterns from high-dimensional big data

▪ May require large data for training, particularly with a deep/complex network architecture [100]

▪ Prediction of asthma diagnosis [78]

▪ Multi-omics-based prediction of asthma [79]
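
A small MLP sketch on synthetic data (assumes scikit-learn; real deep architectures would typically use a dedicated framework and far more data):

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# Two small hidden layers; max_iter raised so training can converge
mlp = MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=2000,
                    random_state=0).fit(X, y)
acc = mlp.score(X, y)
```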

Extreme gradient boosting (XGBoost)

▪ Classification or regression

▪ Continuous and categorical data [75]

▪ High-dimensional and/or sparse data [101]

▪ Improving accuracy through ensemble of weak prediction models

▪ Computationally efficient [10]

▪ Performance issues may arise in imbalanced data [101]

▪ High memory usage

▪ Prediction of persistent asthma [77]

▪ Prediction of AD based on transcriptome/microbiota data [84]

Adaptive boosting (AdaBoost)

▪ Classification, but can be adapted for regression

▪ When it is important to boost the performance of simple models through ensemble methods

▪ Combines multiple weak classifiers to create a strong classifier, improving accuracy

▪ Often achieves better performance with less tweaking of parameters compared to other complex models

▪ Sensitive to noisy data and outliers, which can lead to decreased performance if not handled properly

▪ Can overfit if the number and size of weak classifiers is not controlled

▪ Prediction of allergy [102]

▪ Allergy diagnosis framework [103]

▪ Asthma treatment outcome prediction [104]
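
An AdaBoost sketch on synthetic data (assumes scikit-learn; the default weak learner is a depth-1 decision stump, and successive stumps reweight the samples earlier stumps misclassified):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

# Synthetic binary classification problem
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Default weak learner is a depth-1 decision tree ("stump");
# successive stumps focus on previously misclassified samples
ada = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X, y)
acc = ada.score(X, y)
```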

Logistic Regression (LR)

▪ Primarily for binary (but can be extended to multiclass, e.g., with one-vs-rest strategy) classification

▪ Suitable for models where the outcome is expressed as a probability between 0 and 1

▪ Simple implementation

▪ Efficient to train

▪ Well-understood and widely used

▪ Comparable performance to more advanced models in specific contexts of binary classification problems [105]

▪ Prone to underfitting when relationship between features and target is non-linear, unless feature engineering is applied

▪ Susceptible to overfitting with high-dimensional data if not regularized appropriately

▪ Assumes a linear relationship between the predictors and the log-odds of the outcome, which can be limiting in complex scenarios [106]

▪ Prediction of house dust mite-induced allergy [107]

▪ Prediction of asthma persistence [77]

▪ Prediction of FA [108]

▪ Prediction of AD based on transcriptome and microbiota data [84]
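
A logistic-regression sketch on synthetic data (assumes numpy and scikit-learn); predict_proba expresses the outcome as a probability between 0 and 1:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

lr = LogisticRegression().fit(X, y)
probs = lr.predict_proba(X)[:, 1]  # predicted probabilities in [0, 1]
acc = lr.score(X, y)
```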

^a The list is not intended to be comprehensive or to cover all relevant/possible use-cases, but rather to provide an overview of common and promising algorithms. Abbreviations: AD, atopic dermatitis; AI, artificial intelligence; AR, allergic rhinitis; FA, food allergy; FeNO, fraction of exhaled nitric oxide; N/A, not available; OFC, oral food challenge; SNP, single nucleotide polymorphism