eBioMedicine
2024 Dec 24;111:105526. doi: 10.1016/j.ebiom.2024.105526

Why implementing machine learning algorithms in the clinic is not a plug-and-play solution: a simulation study of a machine learning algorithm for acute leukaemia subtype diagnosis

Gernot Pucher a,b, Till Rostalski b, Felix Nensa c, Jens Kleesiek c, Hans Christian Reinhardt a, Christopher Martin Sauer a,b
PMCID: PMC11732467  PMID: 39721215

Summary

Background

Artificial intelligence (AI) and machine learning (ML) algorithms have shown great promise in clinical medicine. Despite the increasing number of published algorithms, most remain unvalidated in real-world clinical settings. This study aims to simulate the practical implementation challenges of a recently developed ML algorithm, AI-PAL, designed for the diagnosis of acute leukaemia, and to report on its performance.

Methods

We conducted a detailed simulation of the AI-PAL algorithm's implementation at the University Hospital Essen. Cohort building was performed using our Fast Healthcare Interoperability Resources (FHIR) database, identifying all initially diagnosed patients with acute leukaemia and selected differential diagnoses. The algorithm's performance was assessed by reproducing the original study's results.

Findings

The AI-PAL algorithm demonstrated significantly lower performance in our simulated clinical implementation than in prior published results. The area under the receiver operating characteristic curve dropped to 0.67 (95% CI: 0.61–0.73) for acute lymphoblastic leukaemia and to 0.71 (95% CI: 0.65–0.76) for acute myeloid leukaemia. Recalibrating the probability cutoffs that determine confident diagnoses increased the number of confident positive diagnoses of acute leukaemia from 98 to 160, highlighting the necessity of local validation and adjustment.

Interpretation

The findings underscore the challenges of implementing ML algorithms in clinical practice. Despite robust development and validation in research settings, ML models like AI-PAL may require significant adjustments and recalibration to maintain performance in different clinical settings. Our results suggest that clinical decision support algorithms should undergo local performance validation before integration into routine care to ensure reliability and safety.

Funding

This study was supported by the DFG-funded UMEA Clinician Scientist Program and the Ministry of Culture and Science of the State of North Rhine-Westphalia.

Keywords: Machine learning, Artificial intelligence, Real-world evaluation, Clinical implementation, Implementation gap


Research in context.

Evidence before this study

Multi-centre data from confirmed acute leukaemia cases were used to develop AI-PAL, a machine learning model to differentiate subtypes of acute leukaemia. It has the potential to allow earlier treatment initiation in these acutely ill patients. No results on model performance in a real-world setting are available yet.

Added value of this study

Through simulation of a clinical implementation, we identified several issues that need to be addressed before use in clinical routine can be considered. For instance, model performance was significantly lower at our institution, and the embedded confidence cutoff values prevented the algorithm from outputting clinically helpful results for ALL and, to a lesser extent, for AML. By adjusting the algorithm and updating the confidence cutoff values, clinically relevant predictions could be obtained. This study highlights challenges of clinical model implementation that are transferable to other models and use cases.

Implications of all the available evidence

This work suggests that, after major adjustments, the algorithm could provide clinical benefit in differentiating subtypes of acute leukaemia and assist in faster diagnosis. Future studies should perform para-clinical testing of the algorithm and prospectively establish real-world model performance, quantifying potential benefits and harms.

Introduction

Artificial intelligence (AI) and machine learning (ML) algorithms have repeatedly been shown to be promising approaches to important problems in clinical medicine.1 Examples cover the whole patient journey, including screening, diagnostic testing, decision-making, treatment and follow-up care.1 While novel algorithms are abundant and new ones are published weekly, most remain in a pre-clinical stage, leading to an ever-widening implementation gap.2 This trend is well described for intensive care medicine, which is at the forefront of ML development due to its abundant, multimodal data.3,4

The reasons for this gap are diverse and include ethical, technological, liability/regulatory, workforce, social, and patient safety barriers.5 While many of these barriers matter for the bedside use of AI/ML in clinical care, the development of algorithms typically stalls much earlier, i.e., before pre-clinical testing.4 Consequently, most algorithms are never independently validated or tested in real-world clinical practice. Potential explanations may lie with the proposed algorithm itself, for instance poor documentation, missing code limiting reproducibility, an unclear clinical use case or poor model performance. They may also lie with external factors, including a lack of data for validation, missing technical infrastructure or limited scientific expertise.6 To bring more AI and ML algorithms to the bedside and leverage their potential, the focus should therefore be as much on advancing published algorithms towards clinical implementation as on developing novel ones.

Recently, Alcazer et al. presented AI-PAL, an ML algorithm to diagnose acute leukaemia based on laboratory measurements.7 Acute leukaemia is a life-threatening diagnosis requiring urgent medical care. Major subgroups include acute lymphoblastic leukaemia (ALL), acute myeloid leukaemia (AML) and acute promyelocytic leukaemia (APL).8 As treatment for these three entities differs significantly, establishing the correct diagnosis is paramount. Currently, this is resource-intensive and relies on a combination of cytology, immunophenotyping, cytogenetics, and molecular genetics.8 It requires trained medical personnel, expensive medical equipment and may take multiple days to complete.7 A fast and reliable diagnosis from routine laboratory tests is therefore a very promising approach. The promising classification performance presented by Alcazer et al. thus motivated us to simulate the necessary steps towards clinical implementation of this ML algorithm at our large academic cancer centre.

This study focuses on the clinical requirements and the simulation of an end-to-end real-world implementation, including the adjustments required before the model could enter clinical use. Instead of solely validating the algorithm retrospectively on an in silico designed cohort, we simulated how the algorithm could be implemented in clinical practice, for which patients it would be used, and which issues could arise during routine clinical use. We highlight key issues that arose during this process and present potential solutions for how the model should be adjusted before attempting clinical implementation. Furthermore, we provide an illustrative example of a likely classification performance in a clinical real-world setting based on our hospital data. Lastly, we summarize the required steps for implementation of this model and share important lessons learnt during this process, which are likely applicable to the clinical implementation of other ML/AI algorithms beyond this use case.

Methods

Cohort building

Cohort building was performed using our Fast Healthcare Interoperability Resources (FHIR)-based clinical research database, the Smart Hospital Information Platform (SHIP), of the University Medical Centre Essen (UME), Germany. Its Department of Haematology and Stem Cell Transplantation is a regional referral centre for leukaemia with approximately 100 beds. All patients with confirmed ICD-10 codes C92.4 (APL), C92.0 (AML) and C91.0 (ALL) were identified. Furthermore, we extracted all patients with relevant differential diagnoses using ICD-10 codes C92.1 (chronic myeloid leukaemia, CML), D46.9 (myelodysplastic syndrome, MDS), D61.9 (aplastic anaemia, AA), and C83.1 (mantle cell lymphoma, MCL). The cohort was restricted to patients with an initial diagnosis by filtering to hospital stays with a first-time ICD-10 code and a documented onset of diagnosis within 15 days of hospital admission. This cut-off was chosen because sometimes only the onset month was available. We manually validated this approach in 10% of cases and found it to adequately capture only patients with an initial acute leukaemia diagnosis. Only adult patients (≥18 years) with all nine required laboratory values within 72 h of hospital admission were included. If multiple laboratory results were available within the 72-h window, only the earliest result was used. We excluded all patients with hospital stays before 2015-01-01, as well as outpatient hospital encounters, to ensure comparability of laboratory testing and diagnostic standards.
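The inclusion logic above can be sketched in Python with pandas; the table layouts and column names (`encounter_id`, `admission`, `onset`, `outpatient`, `test`, `value`) are illustrative assumptions, not the actual SHIP schema:

```python
from datetime import datetime, timedelta
import pandas as pd

ONSET_WINDOW = timedelta(days=15)   # documented onset within 15 days of admission
LAB_WINDOW = timedelta(hours=72)    # all required labs within 72 h of admission

def build_cohort(encounters: pd.DataFrame, labs: pd.DataFrame,
                 required_labs: list) -> pd.DataFrame:
    """Apply the inclusion criteria to hypothetical encounter and lab tables."""
    enc = encounters[
        (encounters["age"] >= 18)
        & (encounters["admission"] >= datetime(2015, 1, 1))
        & (~encounters["outpatient"])
        & ((encounters["onset"] - encounters["admission"]) <= ONSET_WINDOW)
    ]
    # keep only the required tests drawn within 72 h of admission
    lab = labs[labs["test"].isin(required_labs)].merge(
        enc[["encounter_id", "admission"]], on="encounter_id")
    lab = lab[(lab["time"] - lab["admission"]) <= LAB_WINDOW]
    # if a test was measured repeatedly, keep only the earliest result
    lab = lab.sort_values("time").drop_duplicates(["encounter_id", "test"])
    # require all labs to be present
    counts = lab.groupby("encounter_id")["test"].nunique()
    complete = counts[counts == len(required_labs)].index
    wide = (lab[lab["encounter_id"].isin(complete)]
            .pivot(index="encounter_id", columns="test", values="value")
            .reset_index())
    return enc.merge(wide, on="encounter_id")
```

Keeping the earliest laboratory result mirrors the clinical use case, where a prediction would be made on the first available values after admission.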

An extract-transform-load (ETL) approach was implemented in Python (version 3.10.14) using the open-source package FHIR-PYrate (version 0.2.1).9 The pre-specified information was extracted and stored as Python pickle files and CSV files for further processing. Data quality was assessed by checking for inconsistent, inaccurate, incomplete, or unreasonable values, and the inclusion and exclusion criteria outlined above were applied to the dataset. Study design and results reporting followed the STROBE guidelines for cohort studies.10

Ethics

This work is a subproject of ‘CURATE: Compiling and Utilizing electronic health Records to Advance Treatment through Endpoint analysis’ for which a waiver from the Medical Ethics Committee of the University Duisburg Essen was obtained on January 8, 2024 (23-11573-BO).

Classification performance

To reproduce the results published by Alcazer et al. on the extracted UME cohort, an automated prediction pipeline using the unaltered, publicly available code of AI-PAL was implemented in R (version 4.3.1).7 The trained AI-PAL model was loaded and presented iteratively with the unseen laboratory values from the whole UME cohort. The model provides different outputs: each patient is assigned to exactly one of the three classes AML, ALL or APL using the prediction probabilities of the single-label multiclass classification. Based on these probability values and the labels of the true diagnoses, the area under the receiver operating characteristic curve (AUROC) was calculated for each acute leukaemia (AL) type. Bootstrapping with 2000 iterations was used to calculate 95% confidence intervals.
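The per-class AUROC with bootstrapped confidence intervals can be sketched as follows (a NumPy sketch rather than the original R pipeline; the pairwise AUROC formulation and resampling details are assumptions):

```python
import numpy as np

def auroc(y, p):
    """AUROC as the probability that a positive case outranks a negative one."""
    pos, neg = p[y == 1], p[y == 0]
    gt = (pos[:, None] > neg[None, :]).sum()
    eq = (pos[:, None] == neg[None, :]).sum()   # ties count half
    return (gt + 0.5 * eq) / (len(pos) * len(neg))

def auroc_with_ci(y_true, prob, positive_class, n_boot=2000, seed=0):
    """One-vs-rest AUROC for one AL subtype with a bootstrap percentile 95% CI."""
    y = (np.asarray(y_true) == positive_class).astype(int)
    p = np.asarray(prob, dtype=float)
    rng = np.random.default_rng(seed)
    stats = []
    while len(stats) < n_boot:
        idx = rng.integers(0, len(y), len(y))
        if y[idx].min() == y[idx].max():   # resample must contain both classes
            continue
        stats.append(auroc(y[idx], p[idx]))
    lo, hi = np.percentile(stats, [2.5, 97.5])
    return auroc(y, p), (lo, hi)
```

The pairwise formulation is O(n²) but exact and adequate at cohort sizes like the ones reported here.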

Additionally, the model outputs class-specific labels of prediction confidence for each presented patient. If the prediction probability of a class lies above a pre-defined cutoff for the positive predictive value (PPV), an instance is labelled as a ‘confident prediction’, e.g. ‘confident prediction of AML’. If the prediction probability of a class lies below a pre-defined cutoff for the negative predictive value (NPV), an instance is labelled as a ‘confident exclusion’, e.g. ‘confident exclusion of AML’. Prediction probabilities between these cutoff values are labelled as ‘not confident’. For each presented patient, the model can provide at most one confident prediction and up to two confident exclusions.
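This labelling rule can be summarized in a small sketch (a hypothetical Python rendering of the logic; the actual cutoff values are embedded in the published AI-PAL R code):

```python
def confidence_labels(probs: dict, ppv_cutoff: dict, npv_cutoff: dict) -> dict:
    """Assign a per-class confidence label from class prediction probabilities.

    probs:       class -> predicted probability (sums to 1 across classes)
    ppv_cutoff:  class -> probability above which a 'confident prediction' is made
    npv_cutoff:  class -> probability below which a 'confident exclusion' is made
    """
    labels = {}
    for cls, p in probs.items():
        if p >= ppv_cutoff[cls]:
            labels[cls] = "confident prediction"
        elif p <= npv_cutoff[cls]:
            labels[cls] = "confident exclusion"
        else:
            labels[cls] = "not confident"
    return labels
```

Because the multiclass probabilities sum to one, at most one class can exceed a PPV cutoff above 0.5, which is consistent with the model providing at most one confident prediction per patient.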

Results from the automated prediction pipeline were compared with results from the AI-PAL webservice based on ten randomly selected corresponding cases. Laboratory values were entered manually through the graphical web interface of AI-PAL.11 All selected cases produced identical results.

Besides the pretrained AI-PAL model, the R libraries pROC, dplyr and tidyr were used. Only after these performance measurements for confirmed acute leukaemia subtypes had been obtained did we separately test how well AI-PAL classifies common differential diagnoses of acute leukaemia. Based on expert judgement, we chose CML, MDS, AA and MCL. A schematic flowchart summarizing the described simulation approach is shown in Fig. 1.

Fig. 1.

Fig. 1

Schematic overview of the simulation approach.

Recalibration of probability cutoffs for confident diagnoses and exclusions

The published R code of the AI-PAL model includes PPV and NPV cutoff values for each AL type that define confident diagnoses and exclusions. These cutoff values were determined on the original AI-PAL training cohort. To optimize the PPV and NPV probability cutoffs for the UME cohort, and thereby improve the model's ability to make confident positive and negative diagnoses, we transformed the multiclass problem into three independent binary classification problems following the one-vs-all approach, as suggested by Alcazer et al. This allowed us to calculate optimal cutoffs for each class separately, treating one class as the positive class and combining all others into the negative class. The entire UME dataset (N = 545) was used for probability cutoff calibration via bootstrap resampling, without a separate split into calibration and test sets within this cohort.

To identify the thresholds that optimize PPV and NPV, these metrics were evaluated at every potential threshold along the ROC curve. For each threshold, the confusion matrix was computed and PPV and NPV were extracted. The optimal threshold was selected as the one that maximizes PPV or NPV, independently for each metric. To improve the robustness and reliability of the cutoff estimates, bootstrap resampling was applied: after 1000 bootstrap iterations, the mean cutoff values for PPV and NPV were calculated for each class.
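A possible Python rendering of this threshold search (a sketch under the stated one-vs-all framing; the helper names and the tie-break towards the first maximizing threshold are assumptions, not the original R implementation):

```python
import numpy as np

def confusion_at(t, y, p):
    """Confusion-matrix counts when predicting positive for probabilities >= t."""
    pred = p >= t
    tp = int(np.sum(pred & (y == 1))); fp = int(np.sum(pred & (y == 0)))
    tn = int(np.sum(~pred & (y == 0))); fn = int(np.sum(~pred & (y == 1)))
    return tp, fp, tn, fn

def optimal_cutoffs(y, p, n_boot=1000, seed=0):
    """Scan every candidate threshold, keep the ones maximizing PPV and NPV,
    and average over bootstrap resamples."""
    y = np.asarray(y); p = np.asarray(p, dtype=float)
    rng = np.random.default_rng(seed)

    def best(yb, pb):
        cuts = np.unique(pb)
        ppv_cut = npv_cut = cuts[0]
        max_ppv = max_npv = -1.0
        for t in cuts:
            tp, fp, tn, fn = confusion_at(t, yb, pb)
            if tp + fp > 0 and tp / (tp + fp) > max_ppv:
                max_ppv, ppv_cut = tp / (tp + fp), t
            if tn + fn > 0 and tn / (tn + fn) > max_npv:
                max_npv, npv_cut = tn / (tn + fn), t
        return ppv_cut, npv_cut

    ppvs, npvs = [], []
    while len(ppvs) < n_boot:
        idx = rng.integers(0, len(y), len(y))
        if y[idx].min() == y[idx].max():   # resample must contain both classes
            continue
        a, b = best(y[idx], p[idx])
        ppvs.append(a); npvs.append(b)
    return float(np.mean(ppvs)), float(np.mean(npvs))
```

Averaging the per-resample optima stabilizes the cutoffs against the sampling noise a single pass over the cohort would leave in.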

Role of funders

The funding sources did not have a role in study design, data collection, data analysis, interpretation, or the writing of the report. The sole responsibility for the content of this publication lies with the authors.

Results

Prerequisites—missing differential diagnoses

As mentioned by Alcazer et al., the AI-PAL algorithm requires further adjustments and evaluations before clinical testing. We identified the inclusion of only confirmed cases of acute leukaemia in algorithm training as a major challenge. In clinical routine, other diseases share clinical and laboratory features with acute leukaemia and a diagnostic tool should be robust to these.12 Based on expert opinion and supported by literature,13 we selected myelodysplastic syndrome (MDS), aplastic anaemia (AA), mantle cell lymphoma (MCL) and chronic myeloid leukaemia (CML) as likely candidates and therefore included these in model evaluation.

Cohort—different disease frequencies and variable distributions

A total of 20,283 hospital encounters with an acute leukaemia diagnosis and an additional 28,599 encounters with relevant differential diagnoses were identified at the University Medical Centre Essen (Fig. 2). Cohort size decreased considerably after restricting to patients with an initial diagnosis and inpatient encounters only. Restriction to adult patients primarily excluded cases of paediatric ALL. Most identified cases had all required laboratory measurements taken within 72 h of admission. Ultimately, 545 patients with an initial diagnosis of acute leukaemia and 304 additional patients with a relevant differential diagnosis were identified, resulting in a total of 849 patients.

Fig. 2.

Fig. 2

Flow-chart of the cohort building steps and case numbers for the UME acute leukaemia (black) and differential diagnoses (red).

The proportion of acute leukaemia subtypes differed between the cohorts, with AML being more frequent at UME (78.5%) than in the retrospective cohort of AI-PAL (53.0%), and APL being less common (2.4% vs 13.1%, respectively; Table 1). Patient characteristics generally showed a pattern similar to the AI-PAL cohort, with, for instance, APL patients having lower median fibrinogen levels than ALL or AML patients. Notably, the distributions of laboratory measurements frequently differed between the cohorts, most pronouncedly for monocytes. Another notable difference was a higher median age for ALL in the UME cohort compared with AI-PAL. However, across variables the interquartile ranges always overlapped between the cohorts.

Table 1.

Comparison of patient characteristics and laboratory measurements between UME and the AI-PAL retrospective cohort by type of acute leukaemia.

Variables ALL
AML
APL
UME (N = 104) AI-PAL (N = 480) UME (N = 428) AI-PAL (N = 745) UME (N = 13) AI-PAL (N = 185)
Age, years 50 [30–63] 36.5 [17–60] 60 [51–67.3] 64 [51.1–73] 53 [41–60] 51 [33–63]
White blood cell count, G/L 4.56 [2.9–7.1] 9.7 [3.8–32.2] 4.5 [2.4–9.4] 9.5 [2.5–42.3] 1.2 [0.8–2.8] 3.2 [1.2–17.7]
MCV, fl 90.1 [84.3–95.4] 86.4 [81.6–91.5] 93.8 [88.6–98.9] 95.7 [90.1–101.3] 87.5 [85.9–94.5] 89.7 [84.1–93.9]
MCHC, G/l 343 [335–351] 336 [319–346] 342 [332–351] 336 [318–346] 353 [348–360] 350 [342–358]
Lymphocytes, G/L 1.4 [0.9–2.9] 2.5 [1.2–4.1] 1.2 [0.7–2.1] 1.9 [1.1–3.8] 0.6 [0.3–1.1] 0.9 [0.5–1.7]
Monocytes, % 11.7 [8.4–17.1] 1.1 [0–3.0] 14.8 [8.6–37.6] 4.0 [1–12.5] 21.9 [5.9–33.1] 0.5 [0–2.1]
Monocytes, G/L 0.5 [0.3–0.9] 0.1 [0–0.4] 0.6 [0.3–2.0] 0.2 [0–2.4] 0.1 [0.1–0.7] 0 [0–0.1]
Platelets, G/L 95 [36–175] 53 [30–131] 75 [37–160] 66 [36–120] 34 [29–65] 31 [17–52]
PT, % 87 [78–96] 86 [77–94] 89 [76–100] 78 [66–88] 68 [58–74] 62 [53–71]
Fibrinogen, G/L 3.22 [2.5–4.1] 3.9 [3.1–5] 3.9 [3.1–4.7] 3.9 [3.1–5.1] 1.6 [1–2.2] 1.6 [1–2.2]
LDH, UI/L 331 [203–599] 566 [351–1112] 275 [209–496] 397 [236–764] 285 [257–323] 363 [239–553]

Major differences between the cohorts are highlighted in bold. Median [Interquartile Range].

Classification—lower model performance

Evaluation of the unmodified AI-PAL algorithm at UME showed lower performance than in the four cohorts of the AI-PAL external validation (Fig. 3, Supplementary Table S1). This was most pronounced for ALL and AML, with AUROCs of 0.67 (95% CI: 0.61–0.73) and 0.71 (95% CI: 0.65–0.76), respectively. This is a further degradation compared with the cohort from Créteil, which showed the worst performance during the original validation, with an AUROC of 0.78 (95% CI: 0.67–0.89) for ALL and 0.80 (95% CI: 0.70–0.90) for AML. For the 13 observed cases of APL in the UME cohort, the confidence intervals are too wide to determine definitive subgroup performance (AUROC 0.91, 95% CI: 0.75–0.99).

Fig. 3.

Fig. 3

Comparison of the AUROC from the AI-PAL training and validation cohort with the UME cohort.

Confident diagnoses and exclusions—recalibration of cutoff values required

To provide confident positive or negative diagnoses based on prediction probabilities, AI-PAL comes with fixed PPV and NPV cutoff values for each acute leukaemia subtype. Using these cutoffs, not a single ALL case at UME was classified as ‘confident ALL’ (N = 0/104), while 1 ALL patient was classified as ‘confident APL’ and 9 ALL patients as ‘confident AML’ (Table 2a). Meanwhile, 12 true ALL patients were classified as ‘confident not ALL’ (Supplementary Table S2a). Model performance was better for AML, where 22.2% (N = 99/428) of true AML cases were classified as ‘confident AML’. Furthermore, 75.9% (N = 325/428) of AML patients were labelled ‘confident not APL’ and 27.6% (N = 118/428) ‘confident not ALL’, while no ‘confident’ misclassification was observed. Of the 13 true cases of APL, 1 was classified as ‘confident not APL’, while 3 were correctly classified as ‘confident APL’. ALL was ‘confidently excluded’ in 7 patients.

Table 2.

Frequency (%) of confident classification of acute leukaemia types (a) and its differential diagnoses regardless of peripheral blast count (b) and with detection of peripheral blood blasts (c) without recalibration of PPV and NPV cutoff values.

a
AL type APL diagnosed APL excluded AML diagnosed AML excluded ALL diagnosed ALL excluded
ALL, N = 104 1.0 65.4 8.7 0 0 11.5
AML, N = 428 0 75.9 22.2 0 0 27.6
APL, N = 13 23.1 7.7 0 0 0 53.8
b
Differential diagnoses APL diagnosed APL excluded AML diagnosed AML excluded ALL diagnosed ALL excluded
AA, N = 211 0 56.9 2.4 0 0 2.8
CML, N = 36 0 89.2 5.4 0 0 2.7
MDS, N = 39 0 82.1 7.7 0 0 7.7
MCL, N = 18 0 77.8 11.1 0 0 5.6
c
Differential diagnoses with blastsa APL diagnosed APL excluded AML diagnosed AML excluded ALL diagnosed ALL excluded
AA, N = 9 0 44.4 22.2 0 0 22.2
CML, N = 17 0 100 5.9 0 0 5.9
MDS, N = 4 0 100 50 0 0 50
MCL, N = 1 0 0 0 0 0 0

False confident predictions are highlighted in bold and correct ones in italic. For each patient, up to one confident positive diagnosis and up to two confident exclusions of diagnoses are possible. AA: aplastic anaemia, CML: Chronic myelogenous leukaemia, MDS: Myelodysplastic syndrome, MCL: Mantle cell lymphoma.

a

Detection of blasts in peripheral blood by automated cell differentiation within 72 h of admission.

To explore the robustness of the algorithm to missing values, we performed a sensitivity analysis including additional patients with 1–2 missing laboratory values at initial diagnosis (Supplementary Table S2a). Overall, AUROCs by AL subtype were similar (Supplementary Table S2b), while confident classifications decreased further compared with patients for whom all laboratory values were available (Supplementary Table S2c).

In separate analyses, the laboratory values of the pre-defined differential diagnoses of AL were fed into the model to check how it would classify a patient incorrectly suspected of having AL. The highest frequencies of ‘confident’ misclassification were observed for MCL and MDS, of which 11.1% (N = 2/18) and 7.7% (N = 3/39) were misclassified as AML, respectively (Table 2b). In a sensitivity analysis, we restricted the differential diagnoses to patients with detectable blasts in the peripheral blood. Case numbers for this subgroup were much lower; however, there seems to be a trend towards higher misclassification, with two of four MDS patients with blasts and two of nine patients with AA being ‘confidently’ classified as AML (Table 2c).

Recalibration—increased confident diagnoses and exclusions

To improve model performance with regard to confident predictions, we recalibrated the PPV and NPV cutoff values for each AL type using a one-vs-all approach. With these adjustments, considerably more ‘confident’ predictions were obtained for the UME cohort (Table 3). For instance, 12.5% of ALL cases (previously 0%) were now classified as ‘confident ALL’, and higher proportions of ALL patients were labelled ‘confident not APL’ (84.6%, previously 65.4%) and ‘confident not AML’ (16.3%, previously 0%), while misclassification as ‘confident AML’ and ‘confident APL’ increased only slightly, from 8.7% to 12.5% and from 1.0% to 1.9%, respectively.

Table 3.

Frequency of confident classification (%) of (a) leukaemia types and (b) its differential diagnoses after recalibration of PPV and NPV cutoff values.

a
AL type APL diagnosed APL excluded AML diagnosed AML excluded ALL diagnosed ALL excluded
ALL, N = 104 1.9 84.6 12.5 16.3 12.5 10.6
AML, N = 428 0.5 89.3 32.9 3.3 2.1 23.4
APL, N = 13 46.2 7.7 0 61.5 0 53.8
b
Differential diagnoses APL diagnosed APL excluded AML diagnosed AML excluded ALL diagnosed ALL excluded
AA, N = 211 0 56.9 2.3 0 0 2.8
CML, N = 36 0 89.2 5.4 0 0 2.7
MDS, N = 39 0 82.1 7.6 0 0 7.6
MCL, N = 18 0 77.8 11.1 0 0 5.5

False confident predictions are highlighted in bold and correct ones in italic.

Deployment—implausible predictions

Before clinical deployment, additional checks and safety features should be implemented. For instance, no prediction should be output when negative values for age or laboratory measurements are entered by the user. The same holds for inputs outside the specified patient population, such as paediatric patients. Lastly, a warning should be shown if the entered values are out-of-distribution (OOD).
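Such pre-prediction checks could look as follows (an illustrative sketch; the 4-standard-deviation OOD heuristic and the `reference` table of per-lab training means and standard deviations are assumptions, not part of AI-PAL):

```python
def check_inputs(age: float, labs: dict, reference: dict) -> list:
    """Safety checks before a prediction is made.

    labs:      lab name -> entered value
    reference: lab name -> (training mean, training standard deviation)
    Returns a list of refusal/warning messages; an empty list means
    the prediction may proceed.
    """
    messages = []
    if age < 18:
        messages.append("refuse prediction: patient outside adult population")
    for name, value in labs.items():
        if value < 0:
            messages.append(f"refuse prediction: negative value for {name}")
            continue
        mean, sd = reference[name]
        if abs(value - mean) > 4 * sd:   # crude OOD heuristic (assumption)
            messages.append(f"warning: {name} is out-of-distribution")
    return messages
```

In a deployed tool, refusals would block the prediction entirely, whereas OOD warnings might still allow a prediction but flag it for clinical review.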

Discussion

Clinical implementation of AI and ML algorithms is typically not a plug-and-play operation. While algorithms may be developed with great care and multi-centric data, results can rarely be transferred to other healthcare settings without adjustment to the local context.14 Using the example of an innovative diagnostic support algorithm for acute leukaemia classification, AI-PAL,7 our work showed significantly lower performance during a simulated clinical implementation than reported in the published data. With an AUROC of 0.67 (95% CI: 0.61–0.73) for ALL and 0.71 (95% CI: 0.65–0.76) for AML, model performance was considerably lower than in the least-performing original validation cohort from Créteil. Together with the inability to confidently classify, for instance, any ALL case correctly, direct implementation of the unadjusted algorithm would not be helpful in clinical routine.

This is striking, as the prediction classes, i.e., leukaemia subtypes, are internationally standardized diagnoses,8 rendering major differences in diagnosis unlikely. The same applies even more to the predictors used in the study: laboratory values follow international norms, such as DIN EN ISO 15189, and may therefore also be considered robust. Yet even with standardized and objective inputs, such as age and laboratory measurements, an internationally standardized diagnosis as the prediction target, and a similar study setting, i.e., a large teaching hospital in a neighbouring European country, generalizability of findings and predictive performance is not a given. Instead, differences in disease incidence, variable distributions (in particular for monocytes) and healthcare practice seem sufficient to shift model performance significantly, necessitating model recalibration. This effect has previously also been observed for ML algorithms used for image analysis.15

In our opinion, and supported by the evidence presented above, AI/ML algorithms should not be implemented into clinical practice without prior local performance validation. This was highlighted at large scale by the failed external validation of a widely implemented proprietary sepsis prediction model from EPIC.16 Furthermore, algorithms are frequently developed in hypothetical research settings dictated by data availability rather than clinical practice. This has pointedly been critiqued by Markowetz with the statement ‘All models are wrong and yours are useless (…)’.17

As indicated in this work, implementation in routine practice would further lower model performance when other clinically similar disease entities are added to the case mix, to a degree that will depend on hard-to-model factors such as physician experience, local epidemiology and clinical practices. Only upon real-world model evaluation during routine care will these factors become evident and a robust performance estimate become derivable. Furthermore, clinical use requires additional adjustments, including consideration of differential diagnoses and safety mechanisms such as warnings for entered values that are out-of-distribution or implausible, e.g. negative laboratory values or age.18

Future studies should therefore present results of prospective, real-world evaluation of decision support algorithms as the gold standard for model performance. Though it is tempting to take a shortcut, algorithms without reported real-world clinical evaluation results should not yet be used in routine care. This is particularly true for decision support algorithms, which may bias physicians and result in unwanted behaviour that might be hard to detect. Perhaps the guiding principle should be: do not trust an algorithm you haven't tweaked yourself (to your local setting).

This study reports validation results for AI-PAL from outside France and from an independent research group. Furthermore, instead of a mere retrospective validation, we performed a simulation of real-world performance after clinical implementation. Hereby, we identified several key challenges that limit clinical use of AI-PAL without major adjustments. Beyond AI-PAL, we hope that our stepwise approach to simulating a clinical implementation can be leveraged by others to estimate the performance of other, hitherto not clinically evaluated AI and ML algorithms. This includes adjusting the cohort to include misidentified patients, observing variable shift, estimating local model performance, developing potential improvement strategies, and specifying model impact on clinical decision-making.

Our work itself is not without limitations. Importantly, the presented model performances should not be considered ground truth, as they may differ significantly in other settings. Instead, the results should be interpreted as one possible outcome of a clinical real-world evaluation at our institution. Furthermore, we appreciate that only a few cases of APL (N = 13) could be analysed, and exact model performance for this subgroup therefore remains uncertain.

We appreciate that model retraining would likely increase the predictive power of the model. However, we refrained from doing so, as the focus of this work was to showcase the adjustments necessary to make a published algorithm fit for routine clinical care, not to derive the best-performing local prediction model. The steps necessary to clinically implement AI-PAL were extensively simulated through thought experiments involving experienced clinicians and haematologists. However, it cannot be excluded that additional methodological issues could arise during actual clinical testing.

In conclusion, our implementation simulation of a promising machine learning algorithm for acute leukaemia diagnosis identified critical adjustments necessary before clinical testing. Furthermore, the analyses revealed a major drop in predictive performance without model recalibration. The fundamental nature of these issues calls into question whether any clinical decision support algorithm should be used without prior local performance evaluation. Future studies may address the shortcomings identified in this work and report local model performance derived during clinical use.

Contributors

G. Pucher, M.Sc.: data access and verification, data curation, formal analysis, methodology, project administration, software, visualisation, writing—original draft.

Dr. T. Rostalski: data access and verification, code validation, visualisation, writing—review & editing.

Prof. Dr. F. Nensa: data curation, funding acquisition, resources, software, writing—review & editing.

Prof. Dr. J. Kleesiek: data curation, funding acquisition, resources, software, writing—review & editing.

Prof. Dr. H. C. Reinhardt: supervision, funding acquisition, resources, software, writing—review & editing.

Dr. Dr. C. M. Sauer: conceptualisation, funding acquisition, investigation, methodology, project administration, resources, supervision, validation, visualisation, writing—original draft.

All authors read and approved the final version of the manuscript.

Data sharing statement

The R code used to perform these analyses is based on AI-PAL, and both the original and adjusted code are available from the public GitHub repository:

https://github.com/gernotpuc/case-study-ml-acute-leukemia.

In accordance with German law, the data underlying this simulation study cannot be shared publicly; however, we will share an anonymized dataset with researchers for validation purposes upon reasonable request on an individual basis.

Declaration of interests

G. Pucher received consulting fees from Trafficon GmbH.

F. Nensa received funding for his work by Siemens Healthineers.

H. C. Reinhardt received consulting and lecture fees from Abbvie, AstraZeneca, Vertex and Merck. H. C. Reinhardt received research funding from AstraZeneca and Gilead Pharmaceuticals. H. C. Reinhardt is a co-founder of CDL Therapeutics GmbH.

C. M. Sauer received consulting fees from Pacmed B.V., and lecture fees from Bristol Myers Squibb.

T. Rostalski received unrelated consultant fees from aQua-Institut GmbH for AI application support.

Acknowledgements

We acknowledge support by the Open Access Publication Fund of the University of Duisburg-Essen.

C. M. Sauer is supported by the German Research Foundation (DFG) funded UMEA Clinician Scientist Program (grant FU356/12-2).

G. Pucher is funded through the program ‘Netzwerke 2021’, an initiative of the Ministry of Culture and Science of the State of North Rhine-Westphalia.

F. Nensa currently receives funding through the German Research Foundation (DFG), the German Ministry of Education and Research (BMBF), the Ministry of Economic Affairs, Industry, Climate Action and Energy of the State of North Rhine-Westphalia (MWIKE), and Siemens Healthineers.

H. C. Reinhardt was supported by the German Research Foundation (DFG) (SFB1399, grant no. 413326622; SFB1430, grant no. 424228829; and SFB1530, grant no. 455784452), the Else Kröner-Fresenius-Stiftung (EKFS-2014-A06 and 2016_Kolleg.19), the Deutsche Krebshilfe (1117240 and 70113041), as well as the German Ministry of Education and Research (BMBF e:Med Consortium InCa, grants 01ZX1901 and 01ZX2201A). Additional funding was received from the program ‘Netzwerke 2021’, an initiative of the Ministry of Culture and Science of the State of North Rhine-Westphalia, for the CANTAR project.

J. Kleesiek was supported by the Comprehensive Cancer Center Cologne Essen, the Helmholtz Information & Data Science School for Health, the Deutsche Krebshilfe, and KI Translation Essen through the European Regional Development Fund.

Appendix A. Supplementary data

Supplementary data related to this article can be found at https://doi.org/10.1016/j.ebiom.2024.105526.

Supplementary Tables
mmc1.docx (17.4KB, docx)
