Eur J Pediatr. 2024 Dec 21;184(1):98. doi: 10.1007/s00431-024-05925-5

Table 3.

Common/recommended model evaluation techniques

Technique | Rationale/use-case | Examples of implementation/approaches
Unsupervised learning
Robustness | Unsupervised learning solutions that are robust to noise are reasonably expected to be more generalizable and less overfit to the data they were trained on

▪ Bootstrapping

▪ Sensitivity analysis

Clinical interpretability | In unsupervised learning, algorithms may derive clusters/trajectories that differ by statistical measures but are so similar that they are not clinically distinguishable; clinically assessing the characteristics of the derived subgroups is therefore highly relevant. In supervised learning, this objective is less obvious, but clinically sensible predictor variables driving the predictions most strongly could indicate a sensible and practically useful solution

▪ Tables and other textual representations

▪ Figures, e.g., heatmaps, line plots, etc.

Homogeneity within clusters and dissimilarity across clusters | The classical definition of well-defined clusters/trajectories, i.e., that subjects within a derived subgroup are similar to each other and dissimilar to subjects in the other subgroups. From a clinical perspective, this may also be assessed visually, e.g., with heatmaps

▪ Silhouette score

▪ Calinski-Harabasz score

▪ Davies-Bouldin score

Model fit / complexity | Statistical measures of how well a model fits the observed data (with varying penalties for the complexity of the solution) provide an easily interpretable metric for selecting the optimal model in a data-driven manner

▪ Akaike information criterion (AIC; commonly suggests more complex solutions with a larger number of subgroups)

▪ Bayesian information criterion (BIC; in contrast to AIC, emphasizes simpler solutions [149])

Supervised learning
Performance metrics | Various measures are available to assess how accurately the model predicts on new input data; which measures are appropriate varies widely with the learning task and the data at hand. Ideally, multiple metrics should be evaluated to assess model performance, as some are more suitable for specific types of prediction models [148, 150, 151]

▪ R2, adjusted R2, MAE, RMSE, MAPE, etc. (regression)

▪ MCC, F1 score, Cohen’s kappa, Brier score, AUROC, AUPRC, log-loss, accuracy, recall, specificity, NPV/PPV, etc. (classification)

Validation curve(s) | Diagnostic tool to assess the performance of a model in relation to its hyperparameter settings (for example, the number of trees in a random forest model). More specifically, this can be used to evaluate hyperparameters that influence the model’s tendency to overfit/underfit

▪ Plot of a performance metric on the y-axis and the hyperparameter on the x-axis

The table describes common techniques used in model evaluation. The list is not meant to be comprehensive or universally applicable. Particularly for unique applications, it is recommended to review previous similar implementations or the relevant technical literature. Illustrative code sketches for several of the listed techniques are provided below the table.
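The sketches that follow are minimal Python examples on synthetic data; they are not the implementation of any particular study, and the datasets, variable names, and hyperparameters (e.g., k = 3 clusters, 100 resamples) are assumptions made purely for illustration. The first sketch corresponds to the robustness row: a k-means solution is re-fit on bootstrap resamples and compared with the reference solution using the adjusted Rand index, where consistently high agreement suggests a clustering that is not driven by noise.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Illustrative data; in practice X would be the (scaled) clinical feature matrix
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# Reference clustering on the full data (k = 3 is an assumption for this sketch)
ref_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

rng = np.random.default_rng(0)
scores = []
for _ in range(100):  # number of bootstrap resamples
    idx = rng.choice(len(X), size=len(X), replace=True)  # resample with replacement
    boot_model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X[idx])
    # Assign the original observations to the bootstrap clusters and
    # compare the assignment with the reference labels
    boot_labels = boot_model.predict(X)
    scores.append(adjusted_rand_score(ref_labels, boot_labels))

print(f"Mean adjusted Rand index across bootstrap resamples: {np.mean(scores):.2f}")
```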
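For the clinical interpretability row, one simple option is to tabulate feature summaries per derived subgroup and display them as a heatmap. The sketch below assumes a pandas DataFrame with a few invented clinical features and a cluster label column produced by a previous clustering step; all names and values are illustrative.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# One row per subject; features and cluster labels are invented for illustration
df = pd.DataFrame({
    "age_years":       [2, 3, 10, 11, 5, 6],
    "biomarker_level": [20, 35, 400, 520, 90, 110],
    "symptom_score":   [1, 2, 8, 9, 4, 5],
    "cluster":         [0, 0, 1, 1, 2, 2],  # labels from the clustering step
})

# Tabular summary: mean of each feature per derived subgroup
profile = df.groupby("cluster").mean()
print(profile)

# Heatmap of standardized cluster profiles for visual comparison
standardized = (profile - profile.mean()) / profile.std()
sns.heatmap(standardized, annot=True, cmap="vlag", center=0)
plt.title("Cluster profiles (standardized feature means)")
plt.tight_layout()
plt.show()
```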
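For the homogeneity/dissimilarity row, the three listed scores are available directly in scikit-learn and can be computed across candidate numbers of clusters; the data and range of k below are assumptions for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (silhouette_score,
                             calinski_harabasz_score,
                             davies_bouldin_score)

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)  # illustrative data

for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(f"k={k}: "
          f"silhouette={silhouette_score(X, labels):.2f} (higher is better), "
          f"Calinski-Harabasz={calinski_harabasz_score(X, labels):.1f} (higher is better), "
          f"Davies-Bouldin={davies_bouldin_score(X, labels):.2f} (lower is better)")
```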
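For the model fit/complexity row, a common pattern is to fit the model over a range of numbers of components and compare AIC and BIC. The sketch uses a Gaussian mixture model as an assumption; the same idea applies to other mixture or latent class models.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)  # illustrative data

n_components_range = range(1, 8)
aic, bic = [], []
for k in n_components_range:
    gmm = GaussianMixture(n_components=k, random_state=0).fit(X)
    aic.append(gmm.aic(X))  # lower is better; tends to favor more components
    bic.append(gmm.bic(X))  # lower is better; penalizes complexity more heavily

best_aic = n_components_range[int(np.argmin(aic))]
best_bic = n_components_range[int(np.argmin(bic))]
print(f"AIC prefers {best_aic} components, BIC prefers {best_bic} components")
```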
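For the performance metrics row, a minimal binary classification example computing several of the listed metrics on a held-out test set; the model, data, and train/test split are illustrative choices, not a recommendation.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (roc_auc_score, average_precision_score,
                             matthews_corrcoef, f1_score, brier_score_loss)
from sklearn.model_selection import train_test_split

# Illustrative, mildly imbalanced binary classification data
X, y = make_classification(n_samples=500, weights=[0.8, 0.2], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
proba = model.predict_proba(X_test)[:, 1]  # predicted probabilities
pred = model.predict(X_test)               # hard class labels

print(f"AUROC:       {roc_auc_score(y_test, proba):.2f}")
print(f"AUPRC:       {average_precision_score(y_test, proba):.2f}")
print(f"MCC:         {matthews_corrcoef(y_test, pred):.2f}")
print(f"F1 score:    {f1_score(y_test, pred):.2f}")
print(f"Brier score: {brier_score_loss(y_test, proba):.3f}")
```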
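For the validation curve row, scikit-learn's validation_curve computes cross-validated training and validation scores across a hyperparameter range, here the number of trees as in the table's example; plotting both curves makes overfitting visible as a persistent gap between them. Data, parameter range, and scoring are illustrative assumptions.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import validation_curve

X, y = make_classification(n_samples=500, random_state=0)  # illustrative data

# Performance (AUROC) as a function of the number of trees
param_range = [10, 50, 100, 200, 400]
train_scores, valid_scores = validation_curve(
    RandomForestClassifier(random_state=0), X, y,
    param_name="n_estimators", param_range=param_range,
    scoring="roc_auc", cv=5)

plt.plot(param_range, train_scores.mean(axis=1), "o-", label="Training score")
plt.plot(param_range, valid_scores.mean(axis=1), "o-", label="Cross-validation score")
plt.xlabel("n_estimators (number of trees)")
plt.ylabel("AUROC")
plt.legend()
plt.show()
```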