Table 1.
Model name^a | Appropriate use-cases / assumptions of data | Advantages/strengths | Disadvantages/limitations | Examples in pediatric allergy / further reading |
---|---|---|---|---|
Unsupervised learning | | | | |
k-means / k-means++ [26], (kernel) fuzzy c-means [27], etc. | ▪ Cross-sectional continuous data ▪ Assumes clusters to be homogeneous, with similar within-cluster variation along all variables | ▪ Implementations widely available, with variants for various types of data and clustering objectives, e.g., fuzzy c-means (probabilistic labeling) [28] ▪ Computationally efficient | ▪ Unsuitable for categorical/mixed data ▪ Unsuitable if clusters have strongly different within-cluster variation ▪ Prioritizes within-cluster homogeneity over between-cluster separation | ▪ Endotypes of seasonal AR based on cytokine patterns [29] ▪ Immunologic endotypes of AD in infants [30] ▪ Phenotypes of asthma based on demographics, comorbidities, and medication [31] ▪ Biomarker-based phenotypes of AD [32] ▪ Transcriptomic clusters of asthma [33] |
k-medoids / partitioning around medoids (PAM) / FastPAM / CLARA / CLARANS [34–37] / fuzzy k-medoids [38], etc. | ▪ Cross-sectional continuous or mixed data ▪ More flexible than k-means, but still focuses on within-cluster homogeneity | ▪ Relatively insensitive to outliers ▪ Can use any distance metric [39] ▪ Somewhat computationally costly, but high-performance variants are available, e.g., [34, 35, 40] ▪ Implementations widely available, with variants for specific clustering objectives, e.g., soft labeling [38] | ▪ Spherical/convex clusters are more likely to be identified [41] ▪ DBSCAN and other methods may be more appropriate when cluster sizes differ strongly, or when between-cluster separation is more important than within-cluster homogeneity [42] | ▪ Longitudinal phenotypes of AR based on mHealth symptom/medication data [43] ▪ Phenotypes of AR based on demographic, heredity, and clinical data [44] ▪ Phenotypes of eczema based on longitudinal disease patterns [45] ▪ Longitudinal phenotypes of wheezing [46, 47] ▪ Endotypes of asthma based on exhaled breath condensate [48] |
Hierarchical clustering (HCA) / HCA on principal components (HCPC) [49], etc. | ▪ Cross-sectional continuous, categorical, or mixed data | ▪ Accommodates any distance metric and a variety of linkage functions [50] ▪ Collinear/high-dimensional data can be handled automatically in HCPC [49], which combines feature extraction with HCA on the principal components ▪ Implementations widely available, as are various hierarchical methods with different emphases, e.g., on homogeneity, separation, or outliers [51] | ▪ Computationally costly with large data [50, 52] ▪ Meaningful clusters may occur only at low levels of the hierarchy, requiring potentially many clusters (some of which may just be outlier clusters) ▪ Clusters merged once cannot be separated again, which can result in suboptimal solutions | ▪ Endotypes of rhinitis [53] ▪ Endotypes of seasonal AR [29] ▪ Phenotypes of AD based on allergic sensitization patterns [54] ▪ Phenotypes of asthma based on comorbidity, demographics, and asthma symptoms/lung function [55] ▪ Clusters of family adaptation to child’s FA [56] ▪ Clusters of asthma treatment outcome [57] |
Latent class analysis (LCA) / longitudinal latent class analysis (LLCA) | ▪ Longitudinal or cross-sectional categorical data | ▪ Computationally efficient ▪ Informative, providing probability of assignment (soft clustering) ▪ Statistically principled approach available for estimating the number of clusters ▪ Interpretability of original variables | ▪ Within-class heterogeneity is possible ▪ Compared with other methods used for trajectory analysis, performance may be lower [58] | ▪ Trajectories of wheezing, rhinoconjunctivitis, and eczema symptoms [59] ▪ Grass/mite sensitization trajectories [60] ▪ Longitudinal phenotypes of FA and AD [61] ▪ Subtypes of AR based on comorbidity, heredity, and sensitization [62] ▪ Sensitization patterns [63] |
Growth mixture modeling (GMM) | ▪ Longitudinal data (continuous, but some implementations can handle categorical variables) | ▪ Implementations widely available ▪ Allows for within-class variation ▪ Statistically principled approach available for estimating the number of clusters | ▪ More computationally demanding than, e.g., LCGA [64] | ▪ Trajectories of wheezing and allergic sensitization [65] ▪ Trajectories of allergic sensitization [66] |
Latent class growth analysis (LCGA) / group-based trajectory modeling (GBTM) | ▪ Longitudinal data (continuous, but some implementations can handle categorical data) ▪ Useful when a small sample or complex model causes convergence issues [67] | ▪ Implementations widely available ▪ High within-class homogeneity, as no within-class variation is allowed | ▪ May necessitate a larger number of classes because no within-class variation is allowed [68], which may prove problematic if the sample size is small | ▪ Trajectories of asthma/wheezing based on dispensing data and hospital admissions [69] ▪ Trajectories of wheezing [70] ▪ Trajectories of early-onset rhinitis [71] |
Supervised learning | | | | |
k-nearest neighbors (k-NN) | ▪ Primarily for classification, but may also be used for regression ▪ Handles varying degrees of noise, data sizes, and numbers of labels [74] | ▪ One of the most widely used methods, with easy implementation, available in most statistical software ▪ Robust to outliers/noise [75] | ▪ High computational demand on large datasets [76] | ▪ Prediction of persistent asthma [77] ▪ Prediction of asthma diagnosis [78] ▪ Multi-omics-based prediction of asthma [79] ▪ Prediction of asthma exacerbations based on blood markers, FeNO, and clinical characteristics [80] |
Support vector machine (SVM) | ▪ High-dimensional data (continuous, but categorical variables can be supported [81]) | ▪ Performs well on high-dimensional and complex predictor data | ▪ Prone to overfitting, more so than many other supervised learning methods [77] ▪ Low explainability | ▪ Prediction of asthma diagnosis [78, 82] ▪ Prediction of symptomatic peanut allergy based on microarray immunoassay [83] ▪ Prediction of AD based on transcriptome/microbiota data [84] |
Decision trees (DT) | ▪ Classification or regression ▪ Large, complex data [85] of continuous, categorical, or mixed nature | ▪ Easily interpretable output | ▪ Prone to overfitting ▪ Low accuracy compared with ensembles of DTs [86] | ▪ Prediction of asthma diagnosis [78, 82] ▪ Prediction of hospitalization need for asthma exacerbation [87] ▪ Prediction of symptomatic peanut allergy based on microarray immunoassay [83] |
Random forests (RF) | ▪ Classification or regression ▪ Continuous, categorical, or mixed data | ▪ Relatively low risk of overfitting [88] | ▪ Limited interpretability [88] | ▪ Prediction of persistent asthma [77] ▪ Prediction of asthma diagnosis [78, 82] ▪ Prediction of hospitalization need for asthma exacerbation [87] ▪ Prediction of AD based on transcriptome/microbiota data [84] ▪ Prediction of asthma exacerbations based on AI stethoscope and parental reporting [89] |
Bayesian network | ▪ Continuous (although often discretized), categorical, or mixed data, modeled with a probabilistic graphical model (directed acyclic graph) | ▪ Adaptive: the network can be refined with new information [90] ▪ Capable of displaying and analyzing complex relationships [91] ▪ Accommodates missingness by utilizing all variables in the model [90] ▪ Intuitive interpretation due to probabilistic labeling [92] | ▪ Loss of information in discretization [90] ▪ High computational demand on large datasets [93] | ▪ Prediction of asthma exacerbation [91] ▪ Prediction of response to short-acting bronchodilator medication [92] ▪ Metabolomic prediction of asthma [94] ▪ Prediction of eczema and asthma, respectively, based on SNP signatures [95] |
Naïve Bayes | ▪ Primarily for classification | ▪ Needs relatively little training data [75] ▪ Computationally efficient [96] | ▪ Strong assumption of independence between features [97] ▪ May need discretization of continuous variables [98] | ▪ Prediction of OFC outcome [99] ▪ Prediction of asthma [78] ▪ Prediction of persistent asthma [77] |
Multilayer perceptron (MLP) | ▪ Complex high-dimensional continuous data | ▪ Ability to learn complex patterns from high-dimensional big data | ▪ May require large data for training, particularly with a deep/complex network architecture [100] | ▪ Prediction of asthma diagnosis [78] ▪ Multi-omics-based prediction of asthma [79] |
Extreme gradient boosting (XGBoost) | ▪ Classification or regression ▪ Continuous and categorical data [75] ▪ High-dimensional and/or sparse data [101] | ▪ Improves accuracy through an ensemble of weak prediction models ▪ Computationally efficient [10] | ▪ Performance issues may arise with imbalanced data [101] ▪ High memory usage | ▪ Prediction of persistent asthma [77] ▪ Prediction of AD based on transcriptome/microbiota data [84] |
Adaptive boosting (AdaBoost) | ▪ Classification, but can be adapted for regression ▪ Useful when it is crucial to boost the performance of simple models through ensemble methods | ▪ Combines multiple weak classifiers into a strong classifier, improving accuracy ▪ Often achieves better performance with less parameter tweaking than other complex models | ▪ Sensitive to noisy data and outliers, which can decrease performance if not handled properly ▪ Can overfit if the number and size of weak classifiers are not controlled | ▪ Prediction of allergy [102] ▪ Allergy diagnosis framework [103] ▪ Asthma treatment outcome prediction [104] |
Logistic regression (LR) | ▪ Primarily for binary classification (but can be extended to multiclass, e.g., with a one-vs-rest strategy) ▪ Suitable for models where the outcome is a probability between 0 and 1 | ▪ Simple implementation ▪ Efficient to train ▪ Well understood and widely used ▪ Comparable performance to more advanced models in specific binary classification contexts [105] | ▪ Prone to underfitting when the relationship between features and target is non-linear, unless feature engineering is applied ▪ Susceptible to overfitting with high-dimensional data if not regularized appropriately ▪ Assumes a linear relationship between the predictors and the log-odds of the outcome, which can be limiting in complex scenarios [106] | ▪ Prediction of house dust mite-induced allergy [107] ▪ Prediction of asthma persistence [77] ▪ Prediction of FA [108] ▪ Prediction of AD based on transcriptome and microbiota data [84] |
^a The list is not intended to be comprehensive or to cover all relevant/possible use-cases, but rather to provide an overview of common and promising algorithms. Abbreviations: AD, atopic dermatitis; AI, artificial intelligence; AR, allergic rhinitis; FA, food allergy; FeNO, fraction of exhaled nitric oxide; N/A, not available; OFC, oral food challenge; SNP, single nucleotide polymorphism.
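As a concrete illustration of the first family of methods in the table, the sketch below implements Lloyd's k-means algorithm in plain Python on a toy two-dimensional dataset. The data, the `kmeans` helper, and its parameters are illustrative assumptions, not taken from any of the cited studies; the point is to make visible the property noted in the table, namely that k-means directly optimizes within-cluster homogeneity by alternating point assignment and center updates.

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Minimal Lloyd's k-means for 2-D points (illustrative sketch only)."""
    rng = random.Random(seed)
    # Naive random initialization; k-means++ instead chooses spread-out seeds.
    centers = rng.sample(points, k)
    for _ in range(iters):
        # Assignment step: each point joins the cluster of its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda c: (p[0] - centers[c][0]) ** 2
                                      + (p[1] - centers[c][1]) ** 2)
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster.
        new_centers = []
        for center, cluster in zip(centers, clusters):
            if cluster:
                new_centers.append((sum(p[0] for p in cluster) / len(cluster),
                                    sum(p[1] for p in cluster) / len(cluster)))
            else:
                new_centers.append(center)  # keep the center of an empty cluster
        if new_centers == centers:  # converged: assignments no longer change
            break
        centers = new_centers
    return centers, clusters

# Two well-separated toy "phenotype" blobs; k-means recovers them exactly.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11)]
centers, clusters = kmeans(points, k=2)
# centers converge to the two blob means, (0.5, 0.5) and (10.5, 10.5)
```

In practice one would of course use a library implementation such as `KMeans` in scikit-learn, which adds k-means++ seeding, multiple restarts, and vectorized distance computations; the sketch also shows why the table lists homogeneous, similarly-sized clusters as an assumption, since the squared-distance objective has no term rewarding between-cluster separation.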