Abstract
Objectives
The rise in popularity and off-the-shelf availability of machine learning (ML) and AI-based methodology to develop new prediction models provides developers with ample choices to compare and select the best performing model out of many possible models. Many studies have shown that, in such comparisons on any particular dataset, the differences in performance between models developed using different techniques (e.g. logistic regression vs. random forest or neural networks) can often be small, especially when looking at crude performance measures such as the area under the ROC curve. This may lead to the conclusion that such models are essentially exchangeable and that model selection is arbitrary. However, as we will illustrate using a dataset on deep venous thrombosis, prediction models with similar discriminative performance may nonetheless generate different outcome probability estimates for individual patients and potentially lead to meaningfully different decision making.
Methods
We developed diagnostic prediction models to predict the presence of deep venous thrombosis (DVT) in a large dataset of patients with leg symptoms suspected of having DVT, using five modelling techniques: unpenalized logistic regression (ULR), ridge logistic regression (RLR), random forests (RF), support vector machine (SVM) and neural network (NN). Age, sex, d-dimer, history of DVT, diagnosis alternative to DVT, and having cancer were used as a fixed set of predictors. Model performance was evaluated in terms of discrimination, calibration, and stability of individual risk prediction for a set of patients across the models.
Results
Of the 6,087 suspected patients, 1,146 (19%) were diagnosed with DVT based on leg ultrasound (reference test). Three prediction models (ULR, RLR, NN) had similar discrimination, with AUC point estimates of 0.84. However, the 6,087 individuals’ estimated probabilities of DVT varied substantially across the five different modelling techniques, highlighting differences in prediction stability. Notably, the RF model tended to overestimate individual risks, while the SVM model tended to underestimate them compared to the other models. While the estimated probabilities were more similar for ULR, RLR and NN, classification measures (sensitivity, specificity, positive and negative predictive value) did differ because of differences in estimated probabilities of individuals near the risk threshold, illustrating that differences, even when relatively small, could potentially lead to different clinical decisions.
Conclusions
Prediction models developed with different modelling techniques yielded substantially different outcome probability estimates for the same individuals, even though the models had similar discriminative performance in this low-dimensional setting. Part of this variation can be explained by differences in calibration, but modelling choices also contribute, as estimated risks differed even between techniques with similar calibration performance. Hence, our findings highlight the impact of the choice of modelling technique on model performance, on individual estimated probabilities, and consequently on risk-based clinical decision making.
Supplementary Information
The online version contains supplementary material available at 10.1186/s41512-025-00216-5.
Keywords: Prediction model, Artificial intelligence, Discrimination, Decision making, Stability of predictions
Introduction
Healthcare professionals regularly combine multiple pieces of information to estimate an individual’s probability or risk of having an outcome (diagnosis) or developing an outcome in the future (prognosis) [1–3]. Clinical prediction models (CPMs) are the formal combination of multiple variables (or predictors) to estimate this probability. Traditionally, regression models such as logistic regression and Cox regression have been used to estimate these individual probabilities. Machine learning (ML) and artificial intelligence (AI) have become increasingly popular in prediction modelling studies [4].
Several studies have suggested that, in low-dimensional settings, where the number of predictors is relatively small compared to the number of observations, prediction models based on ML and AI can achieve better performance than regression-based models, particularly in terms of discrimination measures such as the area under the receiver operating characteristic curve (AUC), but these differences are often small [5–8]. Evaluation of prediction models typically focuses on the model’s ability to separate participants with and without a particular outcome based on the model’s estimated probabilities (discrimination) [9, 10]. CPMs estimate outcome risks based on measured associations between predictors and outcomes, rather than true individual risks. Nevertheless, when such models are used to support probability-based clinical decision-making, it is important that the estimated probabilities are discriminative, well-calibrated, and stable, since unreliable estimates can result in inconsistent guidance for individual patients [11].
Model instability refers to the variability in predictions that arises from small changes in the data or modelling approach. Riley et al. have highlighted the importance of assessing model stability, particularly during internal validation, to understand how predictions might vary across different samples or modelling techniques [12, 13]. Beyond this, it is also essential to distinguish between different sources of uncertainty in prediction models: aleatoric uncertainty, which stems from inherent randomness in the data (e.g., measurement noise), and epistemic uncertainty, which arises from limitations in model specification (e.g., choice of model or insufficient data) [14]. Differences in modelling approaches—such as logistic regression versus random forests—may not only reflect different assumptions, but also different ways of capturing or propagating these uncertainties.
In clinical practice, prediction models are not only used to estimate probabilities but also to make classification decisions—such as whether a patient is likely to have deep venous thrombosis (DVT)—by applying a predefined probability threshold [15]. The choice of such a threshold directly affects classification performance measures such as sensitivity, specificity, and positive and negative predictive values of the model for that threshold. Even when different models show similar discriminative ability (e.g., comparable AUCs), differences in the distribution of estimated probabilities can lead to varying classification results even when the same threshold is applied. This is particularly important when clinical decision making is based on this classification, e.g., in a situation where the clinical priority is to avoid false negatives, meaning a low threshold is chosen to maximize sensitivity. Therefore, knowing individual risk estimates across different models and how these estimates affect patient classifications when a probability threshold is applied is essential for evaluating their utility in real-world settings.
The objective of this study is to illustrate the variability between individuals’ probabilities when estimated with different modelling methods, including unpenalized logistic regression (ULR), ridge logistic regression (RLR), random forests (RF), support vector machine (SVM) and neural network (NN), in a setting where the models showed similar discrimination performance.
Methods
Data sources
We used data from 8 of the 13 prospective diagnostic studies previously collected for an individual participant data meta-analysis (IPDMA) of patients suspected of deep vein thrombosis (DVT) [16]. Further details on the construction of the dataset can be found elsewhere [16]. In the IPDMA, missing data were imputed. For our illustration, we only used the first imputation set for all models to avoid increased variation in the predictions due to variations in the impact of missing data handling.
Predictors and outcome
We selected a-priori six predictors: age (continuous), sex (categorical, male/female), d-dimer (categorical, yes/no), previous history of DVT (categorical, yes/no), alternative diagnosis as likely as or more likely than DVT (categorical, yes/no), and active malignancy (categorical, yes/no) based on previously developed prediction models for DVT [17, 18]. No data-driven predictor selection was performed, meaning all six predictors were forced in each of the five models. DVT, the outcome of interest, was defined as a dichotomous outcome (present/absent) [17].
Modelling techniques
We selected the following modelling techniques: unpenalized logistic regression (ULR), ridge logistic regression (RLR), random forest (RF), support vector machine (SVM), and neural network (NN). RF is an ensemble technique consisting of multiple decision trees, each trained on a bootstrapped subset of the full dataset with a random subset of predictors considered at each split. The model outputs predicted probabilities for each class, which can then be converted into classifications using a chosen threshold [19, 20]. SVM aims to find the boundary that best separates the outcome groups — for example, patients with and without DVT — by creating the widest possible margin between the two groups, which helps it make more reliable predictions on new patients [21]. NNs consist of layers of simple connected units; they learn patterns in the data by adjusting the strength of these connections during training, allowing them to capture complex relationships and make predictions. Last, we applied ULR and RLR, which regress the outcome on the predictor values using the logistic link function. ULR optimizes the binomial likelihood, whereas RLR optimizes the ridge-penalized binomial likelihood.
We emphasize that our aim was not to identify the best-performing model for clinical use, but to illustrate how different modeling approaches can yield varying individual predictions.
Data analysis
We used the R meta-package “caret” (Classification And Regression Training, https://cran.r-project.org/web/packages/caret/index.html) version 6.0.89 to obtain a series of models (Table S1) based on the common set of predictors (above).
Model development
We developed five models. For ULR and RLR, predictors were entered as linear terms without interactions. In contrast, the RF, SVM, and NN models can capture non-linear relationships and interactions automatically as part of their standard modeling approach.
We performed a 10-fold cross-validation for tuning algorithm-specific parameters (see Table S1) to optimize the AUC.
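The paper implemented this tuning with caret in R; as a rough illustration of the same idea, the sketch below uses Python with scikit-learn on synthetic data with approximately the DVT outcome prevalence. The data, parameter grid, and settings are stand-ins, not the study's actual pipeline.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in for the DVT data: 6 predictors, ~19% outcome prevalence.
X, y = make_classification(n_samples=1000, n_features=6, n_informative=4,
                           weights=[0.81], random_state=42)

# 10-fold cross-validation tuning an mtry-like parameter (max_features),
# optimizing the AUC, analogous to caret's trainControl + tuneGrid.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
grid = GridSearchCV(
    RandomForestClassifier(n_estimators=200, random_state=42),
    param_grid={"max_features": [1, 2, 3]},
    scoring="roc_auc",
    cv=cv,
)
grid.fit(X, y)
best_mtry = grid.best_params_["max_features"]  # tuned value
cv_auc = grid.best_score_                      # cross-validated AUC
```

The key point is that the hyperparameter is chosen by resampling on the development data, with the AUC as the optimization criterion, before the final model is refit on the full dataset.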
For RF, we implemented the model using the randomForest package, which derives probabilities from tree votes: the individual probability equals the fraction of trees that indicate outcome presence. For example, if an RF model has 500 trees and 400 of them indicate the presence of the outcome, the estimated probability is 80%. We used a grid search to tune mtry (the number of variables randomly sampled at each split) and fixed ntree (the number of trees) at 1000. All other parameters were kept at their default settings. Details of the tuning procedure are provided in Table S2.
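The vote-to-probability conversion described above is simple arithmetic; using the hypothetical tree counts from the text:

```python
# Vote-based probability in a random forest: the estimated probability equals
# the fraction of trees that classify the outcome as present.
n_trees = 500
votes_present = 400  # trees indicating outcome presence (example from the text)
probability = votes_present / n_trees
print(probability)  # 0.8
```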
For SVM, we used the kernlab [22] package, which implements support vector machines using the optimizers from the libsvm software [23]. The probabilities are obtained after a second regression model has been trained on the SVM outputs.
For NN, we used the nnet package through caret to build a model with one hidden layer. Neural networks inherently output probabilities through their activation functions, typically using a sigmoid function (S-shape) in the output layer for binary classification. When no hidden layer is used, a neural network is equivalent to logistic regression. With a hidden layer, it becomes a more flexible model that can capture non-linear relationships.
Sample size
We used the R package pmsampsize [24] to calculate the sample size required to develop a new prediction model based on logistic regression. This calculation requires the overall fraction of participants expected to have DVT (18%), the number of candidate predictor parameters (n = 6), and the anticipated c-statistic (0.81). This c-statistic was informed by previous work on clinical prediction models for suspected DVT that reported similar levels of discriminative performance [25]. The minimum required sample size was 266 participants with 48 events, corresponding to 7.98 events per candidate predictor parameter (EPP), assuming a shrinkage factor of 0.90 and a Cox-Snell R2 (R2cs) of 0.18.
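The actual calculation was done with the R package pmsampsize; the following only verifies the arithmetic behind the reported events and events-per-parameter figures:

```python
# Back-of-envelope check of the reported minimum sample size figures.
n_min = 266        # reported minimum number of participants
prevalence = 0.18  # anticipated outcome fraction (18%)
n_parameters = 6   # candidate predictor parameters

expected_events = n_min * prevalence                   # 47.88, reported as 48 events
events_per_parameter = expected_events / n_parameters  # 7.98 EPP, as reported
```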
Assessment of model performance and classification
We applied each model to the dataset to obtain estimated probabilities for each individual in the dataset. We determined the models’ discriminative performance by calculating the model’s c-index (AUC). A perfect model would have an AUC equal to 1. We also assessed models’ calibration graphically by calibration plots [11]. If a model is well calibrated, there is perfect agreement between model-estimated risks (x-axis) and observed outcome frequencies (y-axis), resulting in a diagonal line in the plot.
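As an illustration of these two measures, the sketch below (Python with scikit-learn, on simulated data rather than the DVT cohort) computes the c-index and the points of a calibration plot. Because the outcomes are drawn from the predicted probabilities themselves, the simulated model is well calibrated by construction.

```python
import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
p_hat = rng.uniform(0.0, 1.0, 5000)  # hypothetical model-estimated probabilities
y = rng.binomial(1, p_hat)           # outcomes drawn from p_hat: calibrated by design

auc = roc_auc_score(y, p_hat)        # c-index: threshold-independent discrimination
obs, pred = calibration_curve(y, p_hat, n_bins=10)
# obs:  observed outcome frequency per risk bin (y-axis of the calibration plot)
# pred: mean estimated probability per bin (x-axis); well calibrated => obs close to pred
```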
In addition, we assessed patient classification with sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV). While AUC is probability threshold-independent, classification measures depend on the threshold applied to the predicted model probabilities. In this study, we used a fixed threshold of 0.02 to classify patients as having or not having DVT. This low threshold was explicitly chosen to directly reflect the clinical importance of prioritizing sensitivity, given the potentially life-threatening consequences of a missed DVT diagnosis [17, 26, 27]. This approach aligns with clinical decision-making in high-risk scenarios and allows for meaningful comparison of model predictions under a uniform classification rule.
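These threshold-dependent measures follow directly from the 2x2 confusion table at a fixed cutoff. A minimal sketch (Python, on a made-up five-patient toy example, not the study data) shows how a low threshold such as 0.02 trades specificity for sensitivity:

```python
import numpy as np

def classification_measures(y, p_hat, threshold=0.02):
    """Sensitivity, specificity, PPV, and NPV at a fixed probability threshold.

    Assumes both classes occur among predictions and outcomes (otherwise some
    denominators are zero, as for the degenerate models in Table 2)."""
    y = np.asarray(y)
    pred = (np.asarray(p_hat) >= threshold).astype(int)
    tp = int(np.sum((pred == 1) & (y == 1)))
    tn = int(np.sum((pred == 0) & (y == 0)))
    fp = int(np.sum((pred == 1) & (y == 0)))
    fn = int(np.sum((pred == 0) & (y == 1)))
    return {"sensitivity": tp / (tp + fn), "specificity": tn / (tn + fp),
            "ppv": tp / (tp + fp), "npv": tn / (tn + fn)}

# Toy example: the low 0.02 threshold classifies most patients as positive,
# pushing sensitivity up at the cost of specificity.
measures = classification_measures(y=[1, 1, 0, 0, 0],
                                   p_hat=[0.30, 0.05, 0.010, 0.030, 0.015],
                                   threshold=0.02)
```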
Estimated probabilities for the same individuals were compared across different modeling techniques using a scatter plot. Additionally, for a random selection of 10 individuals in the dataset, we illustrated the estimated probabilities across the modeling techniques. To explore potential differences in estimated probabilities, we further examined the strength of the association of the six predictors with the outcome across the different modeling techniques using model coefficients (for ULR and RLR) and variable importance measures (for RF, SVM, and NN). Additionally, we used the iml package to generate Individual Conditional Expectation (ICE) plots, which helped visualize how changes in each predictor affected the estimated probabilities for individual patients.
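The pairwise agreement between models' estimated probabilities can be summarized with Pearson correlations, as in the upper panels of a scatterplot matrix. A sketch with simulated predictions (model B mimics model A with small perturbations; model C is unrelated; all values are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000
p_a = rng.uniform(0.0, 1.0, n)                         # "model A" probabilities
p_b = np.clip(p_a + rng.normal(0.0, 0.05, n), 0, 1)    # near-duplicate of model A
p_c = rng.uniform(0.0, 1.0, n)                         # unrelated "model C"

# Pairwise Pearson correlations between the models' estimated probabilities;
# values near 1 indicate strong agreement, lower values indicate divergence.
corr = np.corrcoef([p_a, p_b, p_c])
```

Note that a high correlation only shows that two models rank and scale patients similarly; it does not by itself guarantee that the two models agree on which side of a clinical threshold each patient falls.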
All statistical analyses were performed in R version 4.0.3 [28] with the crossTable, rms, pROC, ggplot, and caret packages.
Results
The dataset included 6,087 patients, of whom 1,146 (19%) had confirmed DVT. The mean age was 59 (SD ± 17) years, and 3,760 (62%) of participants were female. Further details on the study participants can be found in the original publication [16] and in Table 1.
Table 1.
Characteristics of the study population in terms of predictors, overall, and stratified by outcome presence
| | Total, n = 6087 (100) | DVT, n = 1146 (19) | No DVT, n = 4941 (81) |
|---|---|---|---|
| Age, mean (SD) | 59 (± 17) | 61 (± 17) | 59 (± 17) |
| Sex, count (%) | | | |
| Female | 3760 (62) | 602 (53) | 3158 (64) |
| Male | 2337 (38) | 544 (48) | 1783 (36) |
| D-dimer, count (%) | | | |
| No | 3114 (51) | 99 (9) | 3015 (61) |
| Yes | 2973 (49) | 1047 (91) | 1926 (39) |
| Cancer, count (%) | | | |
| No | 5486 (90) | 918 (80) | 4568 (93) |
| Yes | 601 (10) | 228 (20) | 373 (8) |
| Previous history of DVT, count (%) | | | |
| No | 5747 (94) | 1055 (92) | 4692 (95) |
| Yes | 340 (6) | 91 (8) | 249 (5) |
| Alternative diagnosis, count (%) | | | |
| No | 3145 (52) | 919 (80) | 2226 (45) |
| Yes | 2942 (48) | 227 (20) | 2715 (55) |
Tuning parameters
The values for hyperparameters of the ML models after tuning are presented in Table S2.
Discrimination
The five models achieved AUCs ranging from 0.81 to 0.84, with the RF model showing the lowest AUC (0.81, 95% CI 0.80–0.82), the SVM model a slightly higher AUC (0.82, 95% CI 0.81–0.83), and the ULR (0.84, 95% CI 0.83–0.85), RLR (0.84, 95% CI 0.83–0.85), and NN (0.84, 95% CI 0.83–0.85) models the highest AUCs.
Calibration
The calibration plots are shown in Fig. 1 and calibration statistics for five models are provided in Table S4. The ULR and NN models showed good calibration across the range of predicted probabilities, while RLR slightly underestimated higher risks. In contrast, RF and SVM demonstrated clear miscalibration, with RF overestimating and SVM underestimating predicted probabilities in parts of the risk range.
Fig. 1.
Calibration plot for five models. Each plot illustrates the agreement between the probabilities estimated by the algorithm (x-axis) and the observed proportion of individuals in the dataset with the outcome (y-axis). Perfect calibration is indicated by the red diagonal line. The distribution of individuals with (1) and without (0) the outcome (DVT) is presented at the bottom of each calibration plot. The calibration of the model is indicated by the black line, while the grey area illustrates the 95% confidence interval of this calibration. ULR = unpenalized logistic regression, RLR = ridge logistic regression, RF = random forests, SVM = support vector machine and NN = neural network
These differences indicate that, despite similar discrimination, models varied in how accurately their estimated probabilities reflected observed outcomes. Such variation in calibration helps explain the differences in individual predicted risks observed across models, which we investigated in the next subsection.
Classification
Table 2 summarizes the classification performance of the five fitted models. All models had similar AUCs, but sensitivity, specificity, and predictive values varied across models. This variation partially reflects differences in calibration: RF and SVM showed poorer calibration and therefore diverged in classification performance. However, ULR, NN, and to some extent RLR showed good calibration, yet differences in classification measures were still observed because individuals near the risk threshold received different estimated probabilities across models. RF provided the most balanced sensitivity (0.81) and specificity (0.81), resulting in the highest PPV (0.50) among all models. The neural network also performed relatively well, with a sensitivity of 0.82 and a specificity of 0.75.
Table 2.
Performance measures
| Model | AUC (95% CI) | Sensitivity (95% CI) | Specificity (95% CI) | PPV (95% CI) | NPV (95% CI) |
|---|---|---|---|---|---|
| Unpenalized logistic regression | 0.84 (0.83-0.85) | 0.98 (0.98-0.99) | 0.31 (0.30-0.33) | 0.25 (0.24-0.26) | 0.98 (0.98-0.99) |
| Ridge logistic regression | 0.84 (0.83-0.85) | 1.00 | 0 | 0.19 (0.18-0.19) | NaN |
| Random forest | 0.81 (0.80-0.82) | 0.81 (0.79-0.83) | 0.81 (0.80-0.82) | 0.50 (0.47-0.52) | 0.94 (0.94-0.95) |
| Support vector machine | 0.82 (0.81-0.83) | 1.00 | 0 | 0.19 (0.18-0.19) | NaN |
| Neural network | 0.84 (0.83-0.85) | 0.82 (0.79-0.84) | 0.75 (0.73-0.76) | 0.43 (0.41-0.45) | 0.95 (0.94-0.95) |
CI Confidence interval. The 95% confidence intervals for AUC were calculated using the DeLong method and for classification measures using the Wilson score interval. PPV Positive predictive value. NPV Negative predictive value. NaN Not a Number
Individual estimated probabilities
As can be seen from the calibration plots in Fig. 1 and also from the plots on the diagonal of Fig. 2, the range of estimated probabilities substantially differed between modelling techniques. While RF provided a broad range of probabilities (from 0 to 1), NN showed a narrower range (from 0 to 0.7). Likewise, the estimated probabilities were not equally distributed across the range of probabilities for the different modelling techniques, although most individual probabilities were found at the lower end of the distribution. For example, most individual probabilities provided by the RF model were close to 0.
Fig. 2.
Pairwise scatterplot matrix comparing estimated probabilities from five models. ULR = unpenalized logistic regression, RLR = ridge logistic regression, RF = random forests, SVM = support vector machine and NN = neural network. Each panel below the diagonal displays scatter plots of estimated probabilities between two models, where plots showing a diagonal line would indicate perfect agreement of estimated probabilities between models. Panels on the diagonal show the distribution (density) of estimated probabilities for each individual model. Panels above the diagonal present the Pearson correlation coefficients (indicated as Corr.) quantifying the similarity in estimated probabilities between model pairs. A correlation (Corr.) value close to 1 indicates strong agreement between two models, while lower values reflect divergence in their predictions
Figure 3 and Table S3 show a comparison of the estimated probabilities for ten random individuals, indicating that estimated probabilities can differ substantially across methods within the same individual. For example, for the individual with ID 195, RLR predicted a probability of 0.028 of DVT being present, which is above the threshold (0.02) commonly used for diagnosing DVT in practice, whereas NN and ULR estimated probabilities of 0.012 and 0.011, respectively, which are below the threshold. Even larger differences were observed for the models with lower AUC estimates: for the same individual, RF estimated a probability of 0.00 and SVM a probability of 0.10. A similar pattern was observed for ID 2463. Further, the individual with ID 1142 had a probability of 0.36 based on the ULR model, while the other methods provided estimated probabilities ranging from 0.03 (RF) to 0.37 (NN) for that same individual. Although these differences are large, they would not lead to different clinical decisions at the threshold of 0.02.
Fig. 3.
Comparison of estimated probabilities across five different modeling methods for ten random individuals. ULR = unpenalized logistic regression, RLR = ridge logistic regression, RF = random forests, SVM = support vector machine and NN = neural network
Predictors
We generated the Individual Conditional Expectation (ICE) plots (see Figures S1-S6) to assess the contribution of individual predictors. The plots revealed that d-dimer value and alternative diagnosis were the most important predictors across all models. However, differences were found for the RF and SVM models as compared to the ULR, RLR, and NN models.
Discussion
We showed that prediction models developed using different modelling techniques can yield comparable discrimination in a low-dimensional setting but differ substantially in calibration, which can partly explain why individual estimated probabilities varied across models. Even though two models may rank patients similarly in terms of estimated probabilities (reflected in the similarly high discrimination of both models), they may provide very different absolute probability estimates for the same individual. This can potentially lead to different clinical decisions for the same patient across modelling techniques. For example, we observed that ULR and NN produced highly correlated probabilities, which frequently but not always led to the same decision at the 2% probability threshold, while SVM and RF showed weaker correlations and poorer calibration. Once adequate calibration and discrimination are established, examining the stability of individual predicted probabilities across models offers additional insight into those estimates. When well-calibrated models produce similar predictions, this consistency increases confidence in their reliability.
Our findings emphasize the need for careful consideration of individual-level predictions when comparing and choosing models for decision making in daily practice. To evaluate their practical utility, methods such as decision curve analysis and net benefit can be used to assess whether predicted risks lead to improved decision-making across relevant thresholds [29]. In practice, though, developing and validating a single, well-suited model is preferable; our use of multiple models here served only to demonstrate the variability in individual predictions.
These differences in estimated probabilities reflect variability in predictions that can arise from the modelling choices themselves. Riley et al. have highlighted the importance of assessing such variability to understand how well models generalize [12, 13]. Our study extends this idea by showing that predictions can vary not only within one modelling method, but also between different modelling techniques. This variation also relates to two types of uncertainty in prediction modelling: aleatoric uncertainty, which comes from randomness in the data, and epistemic uncertainty, which stems from modelling choices [14]. Different modelling approaches handle epistemic uncertainty in different ways, which helps explain why predictions can vary even when models have similar AUCs.
In medical practice, decisions often rely on applying a decision threshold to the model’s estimated probability. Performance measures such as sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV) can be strongly influenced by how individual risks are distributed around that threshold. In our study, even models with similar discrimination and good calibration (ULR, RLR, and NN) produced different classification results. This highlights the importance of examining the full distribution of predicted risks and understanding how these distributions interact with clinically meaningful probability thresholds. Evaluating only threshold-independent metrics can overlook important practical implications for patient care. Accurately predicting the probability of DVT presence in suspected patients can help decide on their further diagnosis management and subsequent treatments.
Comparison to previous research
A previous study showed that prognostic predictions for individual probabilities of cardiovascular disease (CVD) varied between and within different types of machine learning and regression models. In that study, a patient with a probability of 9.5–10.5% predicted by QRISK3 (a well-established CVD risk score) had a probability of 2.9–9.2% in a random forest and 2.4–7.2% in a neural network [30]. A study evaluating the performance of three different existing regression-based prediction models for cardiovascular disease concluded that their application may result in considerable misclassification for individuals with the highest probabilities [31].
Strengths and limitations of this study
This analysis included individual participant data from more than 6,000 participants with suspected DVT, of whom 1,146 were diagnosed with DVT. One strength of our approach is that it shows the impact of different models' estimated probabilities in a real-life scenario where a probability-based classification threshold (0.02) is typically used and where missing a DVT diagnosis can be life-threatening, due to clot progression to pulmonary embolism if the initial DVT is left untreated. Another strength is that we applied consistent model development procedures to enable a fair comparison of predictive performance across five modelling approaches, including both traditional regression and machine learning techniques. Each model was tuned using internal resampling (e.g., selecting the regularization parameter for ridge regression, the number of predictors sampled at each split (mtry) for RF, and the cost parameter for SVM).
This study also has some limitations. We used the caret package for model training and tuning across all methods. While caret provides a convenient and unified framework, recent research has shown that certain tuning methods, especially the popular one-standard-error rule in cross-validation, can lead to poor calibration of prediction models, even when discrimination is good [32].
Finally, this study was designed as an illustrative comparison rather than a model-development study. We do not recommend fitting multiple models for every medical prediction task. The choice of a modelling approach should be guided by the clinical context and the specific requirements of the task, for example, interpretability, model performance (e.g., discrimination, calibration), data structure, and how well the model supports decision-making in situations where certain errors (e.g., false negatives) carry serious consequences (such as in DVT or breast cancer). Understanding how modelling decisions affect both individual predictions and performance metrics can help researchers align model development with clinical priorities and practical constraints.
Implications for researchers and future research
Our intention in including models with poor calibration (e.g. RF and SVM) was not to advocate for their clinical use, but to highlight that calibration can differ substantially between modeling approaches, even when discrimination (AUC) is similar. This highlights that reliance on AUC alone provides an incomplete assessment of model performance, as it does not reflect how accurately predicted probabilities correspond to observed outcomes.
Systematic reviews have found that calibration is assessed far less often than discrimination, irrespective of whether models were built using ML, AI, or statistical techniques [9, 10]. The lack of calibration assessment limits the use of clinical prediction models and, consequently, several guidelines have stressed the need to report calibration alongside discrimination [33]. Moreover, as illustrated by Table S3, prediction intervals for individual risks can differ across methods. Particularly for SVM and RF, prediction intervals were wide for some individuals, indicating large uncertainty in the estimated risks these methods produce for these patients. Beyond calibration, the stability of predictions across modelling approaches may also offer useful insights into the robustness of model results [13].
As shown in this study, details on the different modelling techniques and parameters used are necessary for critical appraisal. To provide guidance on reporting and critical assessment, TRIPOD + AI and PROBAST + AI have recently been published [34, 35].
Conclusion
Even in a low-dimensional setting where different modelling techniques achieve similar discrimination, they can produce markedly different individual risk estimates. These differences arise not only from variation in calibration but also from the modelling choices themselves. These findings highlight that the selection of a modelling technique can meaningfully affect model performance, individual estimated probabilities and the choice on clinical decisions that depend on risk thresholds.
Supplementary Information
Supplementary Material 1: Table S1. Details on model development [22, 36–38]. Table S2. Tuning of parameters based on repeated cross-validation. Table S3. Risk probabilities and prediction interval for ten random individuals. Table S4. Calibration statistics (95% Confidence Intervals) for five models. Figure S1: ICE Plot of age for the 5 different models. Figure S2. ICE Plot of sex for the 5 different models. Figure S3. ICE Plot of history of previous DVT for the 5 different models. Figure S4. ICE Plot of dichotomized d-dimer value for the 5 different models. Figure S5. ICE Plot of active malignancy for the 5 different models. Figure S6. ICE Plot of alternative diagnosis for the 5 different models.
Acknowledgements
We gratefully acknowledge the following authors for sharing of individual participant data from the deep vein thrombosis (DVT) studies: G.J. Geersing, N.P.A. Zuithoff, C. Kearon, D.R. Anderson, A.J. ten Cate-Hoek, J.L. Elf, S.M. Bates, A.W. Hoes, R.A. Kraaijenhagen, R. Oudega, R.E.G. Schutgens, S.M. Stevens, S.C. Woller, P.S. Wells and K.G.M. Moons.
Disclaimer
The views expressed in this paper are the personal views of the authors and may not be understood or quoted as being made on behalf of or reflecting the position of the regulatory agency/agencies or organizations with which the authors are employed/affiliated.
Transparency
MY and VdJ affirm that the manuscript is an honest, accurate, and transparent account of the study being reported; that no important aspects of the study have been omitted; and that any discrepancies from the study as planned have been explained.
Abbreviations
- DVT
Deep venous thrombosis
- ULR
Unpenalized logistic regression
- RLR
Ridge logistic regression
- RF
Random forests
- SVM
Support vector machine
- NN
Neural network
- IPDMA
Individual participant data meta-analysis
Authors’ contributions
The study concept and design were conceived by CLAN, JAAD, GJG, MvS and KGMM, and later further adapted by MY, ES, and VdJ. TT and GJG provided insight into the dataset and clinical problem. CLAN and MY performed data analysis and wrote the first draft of this manuscript, which was revised by all authors who have provided their final approval for this version. MY, the corresponding author, is the guarantor of the manuscript. The corresponding author attests that all listed authors meet authorship criteria and that no others meeting the criteria have been omitted.
Funding
GJG is supported by a Veni (016.166.030) and Vidi (91719304) grant from the Dutch Research Council (NWO/ZonMw).
Data availability
The DVT data that support the findings of this study are not publicly available, according to the conditions determined by the authors of the DVT studies, but are available from GJG on reasonable request by e-mail. Analytical code is available via the repository https://github.com/Maerziya/EstimatingRisks.
Declarations
Ethics approval and consent to participate
Not required for this work.
Competing interests
One of the authors (Karel GM Moons) is the Editor-in-Chief of BMC Diagnostic and Prognostic Research.
Footnotes
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Maerziya Yusufujiang, Constanza L Andaur Navarro, Valentijn MT de Jong and Maarten van Smeden contributed equally to this work.
References
- 1. Steyerberg EW, Moons KGM, van der Windt DA, Hayden JA, Perel P, Schroter S, et al. Prognosis research strategy (PROGRESS) 3: prognostic model research. PLoS Med. 2013;10(2):e1001381.
- 2. Moons KGM, Kengne AP, Woodward M, Royston P, Vergouwe Y, Altman DG, et al. Risk prediction models: I. Development, internal validation, and assessing the incremental value of a new (bio)marker. Heart. 2012;98(9):683–90.
- 3. Moons KGM, Royston P, Vergouwe Y, Grobbee DE, Altman DG. Prognosis and prognostic research: what, why, and how? BMJ. 2009;338:b375.
- 4. Yang C, Kors JA, Ioannou S, John LH, Markus AF, Rekkas A, et al. Trends in the conduct and reporting of clinical prediction model development and validation: a systematic review. J Am Med Inform Assoc. 2022;29(5):983–9.
- 5. Steele AJ, Denaxas SC, Shah AD, Hemingway H, Luscombe NM. Machine learning models in electronic health records can outperform conventional survival models for predicting patient mortality in coronary artery disease. PLoS ONE. 2018;13(8):e0202344.
- 6. Hale AT, Stonko DP, Brown A, Lim J, Voce DJ, Gannon SR, et al. Machine-learning analysis outperforms conventional statistical models and CT classification systems in predicting 6-month outcomes in pediatric patients sustaining traumatic brain injury. Neurosurg Focus. 2018;45(5):1–7.
- 7. Kareemi H, Vaillancourt C, Rosenberg H, Fournier K, Yadav K. Machine learning versus usual care for diagnostic and prognostic prediction in the emergency department: a systematic review. Acad Emerg Med. 2020:1–13. 10.1111/acem.14190.
- 8. Weng SF, Reps J, Kai J, Garibaldi JM, Qureshi N. Can machine-learning improve cardiovascular risk prediction using routine clinical data? PLoS ONE. 2017;12(4):e0174944.
- 9. Andaur Navarro CL, Damen JAA, Takada T, Nijman SWJ, Dhiman P, Ma J, et al. Completeness of reporting of clinical prediction models developed using supervised machine learning: a systematic review. BMC Med Res Methodol. 2022;22(1):12.
- 10. Heus P, Damen JAAG, Pajouheshnia R, Scholten RJPM, Reitsma JB, Collins GS, et al. Poor reporting of multivariable prediction model studies: towards a targeted implementation strategy of the TRIPOD statement. BMC Med. 2018;16(1):1–12.
- 11. Van Calster B, McLernon DJ, van Smeden M, Wynants L, Steyerberg EW, Bossuyt P, et al. Calibration: the Achilles heel of predictive analytics. BMC Med. 2019;17(1):230.
- 12. Riley RD, Collins GS, Kirton L, Snell KI, Ensor J, Whittle R, et al. Uncertainty of risk estimates from clinical prediction models: rationale, challenges, and approaches. BMJ. 2025;388:e080749.
- 13. Riley RD, Collins GS. Stability of clinical prediction models developed using statistical or machine learning methods. Biom J. 2023;65(8):2200302.
- 14. Kendall A, Gal Y. What uncertainties do we need in Bayesian deep learning for computer vision? In: Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS'17). Red Hook, NY, USA: Curran Associates Inc.; 2017. pp. 5580–90. https://dl.acm.org/doi/10.5555/3295222.3295309.
- 15. Patel BS, Steinberg E, Pfohl SR, Shah NH. Learning decision thresholds for risk stratification models from aggregate clinician behavior. J Am Med Inform Assoc. 2021;28(10):2258–64.
- 16. Geersing GJ, Zuithoff NPA, Kearon C, Anderson DR, Ten Cate-Hoek AJ, Elf JL, et al. Exclusion of deep vein thrombosis using the Wells rule in clinically important subgroups: individual patient data meta-analysis. BMJ Online. 2014;348(March):1–13.
- 17. Geersing GJ, Zuithoff NPA, Kearon C, Anderson DR, Ten Cate-Hoek AJ, Elf JL, et al. Exclusion of deep vein thrombosis using the Wells rule in clinically important subgroups: individual patient data meta-analysis. BMJ. 2014;348:g1340.
- 18. Takada T, van Doorn S, Parpia S, de Wit K, Anderson DR, Stevens SM, et al. Diagnosing deep vein thrombosis in cancer patients with suspected symptoms: an individual participant data meta-analysis. J Thromb Haemost. 2020;18(9):2245–52.
- 19. Breiman L. Random forests. Mach Learn. 2001;45(1):5–32.
- 20. Iniesta R, Stahl D, McGuffin P. Machine learning, statistical learning and the future of biological research in psychiatry. Psychol Med. 2016;46:2455–65. 10.1017/S0033291716001367.
- 21. Cortes C, Vapnik V. Support-vector networks. Mach Learn. 1995;20(5):273–97.
- 22. Karatzoglou A, Smola A, Hornik K. kernlab: Kernel-Based Machine Learning Lab. R package version 0.9-33. 2024. 10.32614/CRAN.package.kernlab. https://CRAN.R-project.org/package=kernlab.
- 23. Chang CC, Lin CJ. LIBSVM: a library for support vector machines. ACM Trans Intell Syst Technol. 2011;2(3). 10.1145/1961189.1961199.
- 24. Riley RD, Ensor J, Snell KIE, Harrell FE, Martin GP, Reitsma JB, et al. Calculating the sample size required for developing a clinical prediction model. BMJ. 2020;368:m441. 10.1136/bmj.m441.
- 25. de Jong VMT, Moons KGM, Eijkemans MJC, Riley RD, Debray TPA. Developing more generalizable prediction models from pooled studies and large clustered data sets. Stat Med. 2021;40(15):3533–59.
- 26. Kearon C. Diagnosis of suspected venous thromboembolism. Hematology Am Soc Hematol Educ Program. 2016;2016(1):397–403. 10.1182/asheducation-2016.1.397.
- 27. Manski CF. Partial identification in econometrics. In: The New Palgrave Dictionary of Economics. London: Palgrave Macmillan; 2008. pp. 1–9. https://link.springer.com/rwe/10.1057/978-1-349-95121-5_2407-1. Cited 24 Sept 2025.
- 28. R Core Team. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing; 2021. https://www.R-project.org/.
- 29. Van Calster B, Vickers AJ. Calibration of risk prediction models: impact on decision-analytic performance. Med Decis Making. 2015;35(2):162–9.
- 30. Li Y, Sperrin M, Ashcroft DM, Van Staa TP. Consistency of variety of machine learning and statistical models in predicting clinical risks of individual patients: longitudinal cohort study using cardiovascular disease as exemplar. BMJ. 2020;371.
- 31. Van Staa TP, Gulliford M, Ng ESW, Goldacre B, Smeeth L. Prediction of cardiovascular risk using Framingham, ASSIGN and QRISK2: how well do they predict individual rather than population risk? PLoS ONE. 2014;9(10).
- 32. Dunias ZS, Van Calster B, Timmerman D, Boulesteix AL, van Smeden M. A comparison of hyperparameter tuning procedures for clinical prediction models: a simulation study. Stat Med. 2024;43(6):1119–34.
- 33. Moons KGM, Altman DG, Reitsma JB, Ioannidis JPA, Macaskill P, Steyerberg EW, et al. Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): explanation and elaboration. Ann Intern Med. 2015;162(1):W1–73. 10.7326/M14-0698.
- 34. Collins GS, Moons KGM, Dhiman P, Riley RD, Beam AL, Van Calster B, et al. TRIPOD + AI statement: updated guidance for reporting clinical prediction models that use regression or machine learning methods. BMJ. 2024;385:e078378. 10.1136/bmj-2023-078378.
- 35. Moons KGM, Damen JAA, Kaul T, Hooft L, Andaur Navarro CL, Dhiman P, et al. PROBAST + AI: an updated quality, risk of bias, and applicability assessment tool for prediction models using regression or artificial intelligence methods. BMJ. 2025;388:e082505.
- 36. Hastie T, Qian J. An introduction to glmnet. 2022;5:1–19.
- 37. Liaw A, Wiener M. Package 'randomForest'. 2022.
- 38. Ripley B, Venables W. nnet: Feed-Forward Neural Networks and Multinomial Log-Linear Models. 2025. https://cran.r-project.org/web/packages/nnet/index.html. Cited 22 Apr 2025.