The use of machine learning, a cluster of artificial intelligence (AI) methods, has been increasing exponentially in all domains of our lives, including research in medicine. Figure 1A presents the trends in the number of publications in PubMed returned by searching for “artificial intelligence + medicine” as a keyword, and Figure 1B shows the trends when searching for “artificial intelligence + cardiology.” Both panels indicate an exponentially increasing number of studies involving artificial intelligence; however, the growth in AI research in cardiology appears roughly 2.5‐fold higher than the growth observed in AI in medicine overall. This is another example of the cardiology domain, historically, being very receptive to new data science approaches.
Figure 1. Trends in artificial intelligence publications in medicine versus cardiology.
A, Trends in the number of publications in PubMed returned by searching “artificial intelligence + medicine” as a keyword. B, Trends when searching for “artificial intelligence + cardiology.”
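For readers who wish to reproduce such trend counts, below is a minimal sketch using NCBI's E‐utilities via Biopython. The email address is a placeholder, and the query strings and year range approximate, rather than exactly replicate, the keyword searches behind Figure 1.

```python
# pip install biopython
from Bio import Entrez

Entrez.email = "you@example.org"  # placeholder: NCBI requires a contact address


def pubmed_count(term: str, year: int) -> int:
    """Count PubMed records matching `term` published in a given year."""
    handle = Entrez.esearch(db="pubmed", term=term,
                            mindate=str(year), maxdate=str(year), datetype="pdat")
    record = Entrez.read(handle)
    handle.close()
    return int(record["Count"])


# Arbitrary illustrative year range; NCBI rate limits apply (<=3 requests/second)
for year in range(2014, 2024):
    med = pubmed_count("artificial intelligence AND medicine", year)
    card = pubmed_count("artificial intelligence AND cardiology", year)
    print(year, med, card)
```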
Delayed Promise of Machine Learning or AI in Medicine
Despite the machine learning and AI hype of the last decade, only a very small proportion of these studies is ever translated into clinical care. 1 , 2 The simplest explanation is that the vast majority of these studies are “AI in medicine” applications rather than “AI for medicine.” The difference between the two approaches is that AI for medicine stems from an actual clinical need for which AI is determined to be the solution, and it accounts for real‐world clinical considerations, needs, and limitations. On the other hand, a large proportion of the existing AI literature consists of AI in medicine applications, which are not necessarily clinically driven. They are frequently about finding suitable clinical data on which to test or compare various machine learning algorithms.
A machine learning model for medicine, or AI for medicine, needs to carry certain properties to be translated into clinical care. These properties may be summarized under 4 headings: actionable, generalizable, explainable, and trustworthy AI, although there are overlaps and interactions among these 4 categories.
Actionable AI, in medicine, implies AI models that are designed to solve an immediate clinical problem while considering the limitations of health care systems and clinical workflows, as well as the availability of technical infrastructure. An actionable AI should also provide its results early enough to leave time for interventions. 3 Such solutions stem from developing the simplest possible machine learning models, relying on a minimal amount of data, and solving an actual problem defined by clinicians. Accuracy is another important component of actionable AI, and the required level of accuracy depends on the current gold standard in the specific domain, clinical expectations in terms of sensitivity and specificity, and the workflow capacity to handle false positives.
Generalizable AI is a broad term that implies a model that is not overfitted to the training data or to the characteristics of the cohort used in model building. A generalizable AI model is expected to perform similarly in new data, even from different cohorts. This property may be assessed by well‐designed internal validation on the derivation cohort supported by an external validation. It is worth noting that no information from the test data should leak into model training at any step of the data analysis. For example, even a simple data imputation process should learn its parameters only from the training data, and the same imputation should then be applied to the validation and test data, as sketched below. Another aspect of generalizability is being time invariant, implying that an AI model remains valid over time regardless of technological or clinical practice changes that affect input data quality, the prevalence of risk factors, or the outcome. Generalizability of an AI model over time may be assessed by subgroup analysis across different time periods, as well as by prospective validation, continuous accuracy monitoring, and model recalibration.
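As a concrete illustration of leakage‐free preprocessing, the following minimal sketch (synthetic data; scikit‐learn's SimpleImputer stands in for whatever imputation a given study actually uses) fits the imputer on the training split only and reuses its learned parameters on the test split.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
X[rng.random(X.shape) < 0.1] = np.nan         # ~10% of values missing at random
y = rng.integers(0, 2, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

imputer = SimpleImputer(strategy="mean")
X_train_imp = imputer.fit_transform(X_train)  # means learned from training data only
X_test_imp = imputer.transform(X_test)        # training means reused; no leakage
```

Wrapping the imputer and the model in a single scikit‐learn Pipeline and cross‐validating the pipeline enforces the same discipline automatically.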
Explainable AI implies interpretability of the association between the model input and the predicted outcome. Because of their nonparametric nature, machine learning models are less intuitive than parametric statistical models such as linear or logistic regression. Yet, one needs to acknowledge that these parametric models typically rely on very strong assumptions about the distribution of the data and, more important, about how the input and output should be functionally associated. Such assumptions are frequently violated in real‐world problems, yielding parametric statistical models that are highly interpretable yet whose interpretations lead to the wrong conclusions, as illustrated in the sketch below. 4 Considering the highly complex and nonlinear architecture of certain AI models, explainability may not always be technically possible, which should not preclude the use of AI models that can significantly improve patient outcomes. However, explainability always has a positive impact on clinicians' acceptance of AI models. 5
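A small synthetic example of this failure mode, assuming a U‐shaped relationship between a single input and the risk of the outcome: logistic regression discriminates no better than chance, suggesting “no association,” while a random forest, free of the linearity assumption, detects the strong signal.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=(2000, 1))
p = 1 / (1 + np.exp(-(x[:, 0] ** 2 - 3)))     # U-shaped risk violates linearity
y = rng.binomial(1, p)

lr = cross_val_score(LogisticRegression(), x, y, cv=5, scoring="roc_auc").mean()
rf = cross_val_score(RandomForestClassifier(random_state=0), x, y,
                     cv=5, scoring="roc_auc").mean()
print(f"logistic regression AUC: {lr:.2f}")   # near 0.5: apparent 'no association'
print(f"random forest AUC:       {rf:.2f}")   # substantially higher
```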
Trustworthy AI, in medicine, implies an AI model that clinicians feel comfortable using when making clinical decisions about their patients or health care systems. Obviously, trustworthiness may be expressed as a combination of the 3 properties above, and it may take time to form, as it also requires behavioral adaptation to new‐era solutions. Another important component of trustworthy AI is validation and reproducibility of the models by others, which highlights the importance of establishing standards and reporting guidelines. 6
The sample size needed in machine learning is expected to be larger than in parametric statistical methods, because the number of parameters to be learned by a machine learning model can be far higher than in more traditional statistical models. Hence, “machine learning is data hungry” is a well‐established adage with some truth to it. Yet, it is also turning into a cliché, as the sample size needed to train a machine learning model is problem specific and depends entirely on the complexity of the underlying input–output relationship, which is almost never known in real‐world problems. Therefore, it is possible to obtain very robust machine learning models with a small sample size, given representative training data and a well‐designed validation strategy; a learning curve, sketched below, is one practical way to probe whether a sample is large enough. For these reasons, sample size was not listed as one of the 4 categories above: if the sample size were a problem, it would manifest as a problem in at least one of those 4 categories anyway.
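The following is a minimal learning‐curve sketch on synthetic data (the data generator and model are placeholders): if validation performance plateaus well before the full sample is used, the available sample size is adequate for the complexity of the problem at hand.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, learning_curve

# Synthetic stand-in for a problem with a fairly simple input-output relationship
X, y = make_classification(n_samples=400, n_features=8, n_informative=3,
                           random_state=0)

sizes, _, val_scores = learning_curve(
    RandomForestClassifier(random_state=0), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5),
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    scoring="roc_auc",
)

for n, score in zip(sizes, val_scores.mean(axis=1)):
    print(f"n={n:3d}  validation AUC={score:.2f}")  # AUC often plateaus early
```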
Machine Learning Use in Critical Congenital Heart Disease Screening
As one of the most prevalent birth defects, congenital heart disease (CHD) refers to structural defects of the heart that are present at birth. Such defects range from mild to life threatening, the latter being classified as critical CHD (CCHD). Early diagnosis is key to reducing CHD‐related mortality. 7 Oxygen saturation (SpO2)‐based screening is recommended over physical examination alone to improve the accuracy and timeliness of CHD and CCHD screening, especially in asymptomatic babies. 8 , 9 Yet, there is no consensus in the literature about the efficacy of SpO2 screening, with varying sensitivity reported. 10 In this issue of the Journal of the American Heart Association (JAHA), Siefkes et al. 11 propose machine learning‐based CCHD screening that uses pulse oximetry measurements beyond SpO2 alone.
In their study in this issue of JAHA, Siefkes et al. 11 investigate whether adding pulse oximetry features to SpO2 and analyzing the data using machine learning methods improves the detection of CCHD, especially coarctation of the aorta (CoA). The authors prospectively collected data at 6 sites. The most important non‐standard‐of‐care data collection is the set of pulse oximetry features generated by preductal (right hand) and postductal (any foot) measurements simultaneously recorded in all infants for 5 minutes, up to 4 times (0–24, 24–48, and >48 hours after birth). These and other features were analyzed using logistic regression and machine learning algorithms including random forest, gradient boosting, decision tree, and XGBoost models. The study concludes that random forest performs best, with superior CCHD and CoA detection accuracy compared with SpO2 alone; a schematic of such a multi‐model comparison is sketched below.
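The following is a minimal sketch of that kind of multi‐model comparison, not the authors' actual pipeline: the data, the features, and the event rate are synthetic placeholders, and default hyperparameters are used throughout.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier

# Placeholder data: rows = infants, columns = hypothetical pulse oximetry
# features (e.g., pre-/postductal SpO2 statistics); y = CCHD label
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))
y = rng.binomial(1, 0.1, size=300)            # low event rate, as in the study

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(random_state=0),
    "gradient boosting": GradientBoostingClassifier(random_state=0),
    "XGBoost": XGBClassifier(eval_metric="logloss", random_state=0),
}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for name, model in models.items():
    auc = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
    print(f"{name:20s} AUC = {auc.mean():.2f} +/- {auc.std():.2f}")
```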
Discussion of Siefkes et al.'s Study From the AI for Medicine Perspective
Is It Actionable? As noted, SpO2‐based CCHD screening is recommended, yet the literature on its efficacy is conflicting, and there are studies reporting its low sensitivity. Siefkes et al.'s study 11 therefore stems from a real, clinically important problem demanding a solution, for which the authors selected machine learning. Their solution mostly relies on data collected as part of standard of care, with a minimal number of added data points, the pulse oximetry features, which increases its acceptability. Another important point is that the proposed model, if automated, would provide decision support to clinicians in a fast and timely manner, leaving room for timely interventions. Also, the proposed model is a shallow machine learning model that does not require high‐performance computing, reducing the need for expensive infrastructure. Lastly, the reported accuracy is superior to that of existing SpO2‐alone screening. All of these properties make the proposed model highly actionable.
Is It Generalizable? The analytical cohort in Siefkes et al.'s study 11 has the big advantage of data prospectively collected from 6 sites, yet it has a small sample size with a low event rate of CCHD, which was even lower for the CoA outcome. Therefore, the potential of having data from multiple sites was not fully used, as the low event rate prevented the authors from holding out data from some of the sites for external validation.
The implemented cross‐validation strategy, repeatedly splitting the data into training and test sets, is adequate for a proof‐of‐concept study. As acknowledged by the authors, however, follow‐up studies are needed to assess generalizability from many aspects, including both prospective and external validation.
The main input data used in the machine learning model are measured with a simple pulse oximeter. Future advances in pulse oximetry technology are therefore unlikely to have a large negative impact on the proposed machine learning model, which is a big advantage for the success of prospective validation studies.
Overall, there is a lack of evidence for the generalizability of the proposed machine learning model, and future studies in this direction are needed.
Is It Explainable? In Siefkes et al.'s study, 11 the feature importance analysis was used as part of the feature selection process rather than to gain insight into the contribution of each predictor to the outcome. Therefore, the study could benefit from calculating Shapley additive explanations (SHAP) values 12 or implementing feature importance and direction analysis 4 , 13 to better understand the underlying mechanism identifying children with CCHD and CoA. Hence, the proposed model is not yet explainable but could be made so by implementing the appropriate analysis, as sketched below.
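The following is a minimal sketch of such a SHAP analysis on synthetic placeholder data. A gradient boosting classifier stands in for the study's random forest here because its single‐output trees keep the SHAP output a simple 2‐dimensional array across shap versions; the features are hypothetical.

```python
# pip install shap
import numpy as np
import shap
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Placeholder data standing in for the screening features
X, y = make_classification(n_samples=300, n_features=6, n_informative=3,
                           random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X, y)

explainer = shap.TreeExplainer(model)      # exact attributions for tree ensembles
shap_values = explainer.shap_values(X)     # (n_samples, n_features), log-odds scale

# Global summary: mean |SHAP| gauges each feature's importance; the correlation
# between a feature and its SHAP values gauges the direction of its effect
for j in range(X.shape[1]):
    importance = np.abs(shap_values[:, j]).mean()
    direction = np.corrcoef(X[:, j], shap_values[:, j])[0, 1]
    print(f"feature {j}: mean |SHAP|={importance:.3f}  direction={direction:+.2f}")
```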
Is It Trustworthy? Considering that the proposed machine learning model is highly actionable yet of unproven generalizability and explainability, it is too early to suggest that the method is ready for prime time or to expect clinicians to embrace it in clinical practice. Hence, it is very early to assess the trustworthiness of the proposed machine learning model; however, the authors provide their model's source code in a public repository, a big step toward trustworthy AI.
Conclusions
There is a clear unmet need for better screening for CCHD and CoA to enable timely interventions. The study by Siefkes et al. 11 attempts to solve this clinical problem using machine learning. Although the proposed approach does not yet meet all the criteria to suggest its use in clinical practice, it starts by meeting the actionable AI criteria, supported by a proof‐of‐concept study suggesting that pulse oximetry features analyzed by machine learning can yield a superior CCHD and CoA screening tool. Translation of machine learning algorithms into clinical practice is a long journey, and Siefkes et al.'s study is a great start on which to build the next complementary studies to achieve this goal.
Disclosures
None.
The opinions expressed in this article are not necessarily those of the editors or of the American Heart Association.
This article was sent to John L. Jefferies, MD, MPH, Guest Editor, for editorial decision and final disposition.
See Article by Siefkes et al.
References
- 1. Akbilgic O, Davis RL. The promise of machine learning: when will it be delivered? J Card Fail. 2019;25:484–485. doi: 10.1016/j.cardfail.2019.04.006
- 2. Fornwalt BK, Pfeifer JM. Promise and frustration: machine learning in cardiology. Circ Cardiovasc Imaging. 2021;14:e012838. doi: 10.1161/circimaging.121.012838
- 3. Balch JA, Loftus TJ. Actionable artificial intelligence: overcoming barriers to adoption of prediction tools. Surgery. 2023;174:730–732. doi: 10.1016/j.surg.2023.03.019
- 4. Butler L, Gunturkun F, Karabayir I, Akbilgic O. Logistic regression is also a black box. Machine learning can help. In: Shaban‐Nejad A, Michalowski M, Bianco S, eds. AI for Disease Surveillance and Pandemic Intelligence: Intelligent Disease Detection in Action. Cham: Springer International Publishing; 2022:323–331. doi: 10.1007/978-3-030-93080-6_23
- 5. Kundu S. AI in medicine must be explainable. Nat Med. 2021;27:1328. doi: 10.1038/s41591-021-01461-z
- 6. Rajpurkar P, Chen E, Banerjee O, Topol EJ. AI in health and medicine. Nat Med. 2022;28:31–38. doi: 10.1038/s41591-021-01614-0
- 7. Eckersley L, Sadler L, Parry E, Finucane K, Gentles TL. Timing of diagnosis affects mortality in critical congenital heart disease. Arch Dis Child. 2016;101:516–520. doi: 10.1136/archdischild-2014-307691
- 8. de‐Wahl Granelli A, Wennergren M, Sandberg K, Mellander M, Bejlum C, Inganäs L, Eriksson M, Segerdahl N, Agren A, Ekman‐Joelsson BM, et al. Impact of pulse oximetry screening on the detection of duct dependent congenital heart disease: a Swedish prospective screening study in 39,821 newborns. BMJ. 2009;338:a3037. doi: 10.1136/bmj.a3037
- 9. O'Brien LM, Stebbens VA, Poets CF, Heycock EG, Southall DP. Oxygen saturation during the first 24 hours of life. Arch Dis Child Fetal Neonatal Ed. 2000;83:F35–F38. doi: 10.1136/fn.83.1.f35
- 10. Mahle WT, Newburger JW, Matherne GP, Smith FC, Hoke TR, Koppel R, Gidding SS, Beekman RH 3rd, Grosse SD. Role of pulse oximetry in examining newborns for congenital heart disease: a scientific statement from the American Heart Association and American Academy of Pediatrics. Circulation. 2009;120:447–458. doi: 10.1161/circulationaha.109.192576
- 11. Siefkes H, Oliveria LC, Koppel R, Hogan W, Garg M, Manalo E, Cresalia N, Lai Z, Tancredi D, Lakshminrusimha S, et al. Machine learning based critical congenital heart disease screening using dual‐site pulse oximetry measurements. J Am Heart Assoc. 2024;13:e033786. doi: 10.1161/JAHA.123.033786
- 12. Lundberg SM, Lee S‐I. A unified approach to interpreting model predictions. In: Proceedings of the 31st International Conference on Neural Information Processing Systems. Long Beach, CA: Curran Associates Inc; 2017:4768–4777.
- 13. Karabayir I, Butler L, Goldman SM, Kamaleswaran R, Gunturkun F, Davis RL, Ross GW, Petrovitch H, Masaki K, Tanner CM, et al. Predicting Parkinson's disease and its pathology via simple clinical variables. J Parkinsons Dis. 2022;12:341–351. doi: 10.3233/JPD-212876