Dear JAMIA Editors,
In their recent article, “Risk prediction of delirium in hospitalized patients using machine learning: an implementation and prospective evaluation study,” Jauk et al implemented and prospectively evaluated a machine learning algorithm that predicts delirium in hospitalized patients using electronic health record (EHR) data available at admission and on the first evening after admission.1 This is an important problem in hospital medicine and neurology, where delirium, a preventable condition, is under-recognized and under-treated and often leads to extended lengths of stay, increased health care costs, and acceleration of existing cognitive decline.
Jauk et al demonstrated noteworthy accomplishments that will lead to exciting future investigations: (1) integrating the delirium predictive model into a clinical workflow within the EHR, (2) unobtrusively using data captured and documented in the normal care of patients, and (3) evaluating the performance of their model prospectively in the clinical setting. However, this study also illustrates a critical flaw in our approach to applying artificial intelligence and machine learning, one that prompts the question: what is the “ground truth” on which we are training our models? We should pause before implementing machine learning algorithms in clinical contexts and assess the underlying classification task of those algorithms.
The authors acknowledged a limitation of their study: basing the occurrence of delirium on the presence of International Classification of Diseases, Tenth Revision (ICD-10; codes, terms, and text © World Health Organization, Third Edition, 2007) codes F05 (“delirium due to known physiological condition,” including all subcategories) and F10.4 (“alcohol withdrawal state with delirium”) assigned as diagnoses for the encounter. They recognize that a “lack of clear diagnostic criteria” for delirium “might be one reason why the incidence of delirium according to ICD codes in an administrative database (1.5% in this study) is lower than the one reported in prospective studies (ranging from 10%–40%).” Indeed, in the roadmap to advance delirium research from the Network for Investigation of Delirium: Unifying Scientists (NIDUS), Oh et al describe the need for a refined definition of, and a reference standard for the diagnosis of, delirium.2 However, the lack of a clear reference standard for delirium is not enough to explain such a deviation from previously measured incidence rates. In this case, it is apparent that the ground truth missed cases in which delirium was present. Thus, efforts should have been made to evaluate and improve the ground truth before using it to train the predictive models, because what the algorithms are predicting may be the biases of the diagnostic coding process rather than the condition itself.
Much like how the lack of a gold standard for the diagnosis of cancer limits the utility of machine learning algorithms for diagnosing early-stage cancer,3 the lack of clear diagnostic criteria to define delirium, along with the dependence on the presence or absence of diagnosis codes, limits the utility of this machine learning algorithm. If the algorithm performs as well prospectively as it does on the training set, it would only successfully identify cases that would have been coded with a diagnosis of delirium.
Defining clinical conditions using available data, that is, defining “digital phenotypes,” is both an art and a science in biomedical informatics. Definitions of clinical conditions vary widely depending on the data used, as with congestive heart failure, for which data beyond ICD codes are needed to improve positive predictive value.4 Research informatics support teams have been established at academic health centers to help researchers define patient cohorts with various clinical conditions based on the best data available. Including different data modalities (diagnosis codes, laboratory values, vital signs, references to specific symptoms in notes) in the digital phenotype definition refines the accuracy of the cohort.
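To make the idea of a multimodal digital phenotype concrete, the sketch below combines diagnosis codes, a structured screening result, and note text into a single candidate-case rule. This is a minimal illustration, not the method of Jauk et al: the `Encounter` fields, the term list, and the assumption that a Confusion Assessment Method (CAM) result is available are all hypothetical; only the ICD-10 codes F05 and F10.4 come from the study under discussion.

```python
from dataclasses import dataclass, field

# Hypothetical encounter record; field names are illustrative, not from the study.
@dataclass
class Encounter:
    icd10_codes: set = field(default_factory=set)
    cam_positive: bool = False  # Confusion Assessment Method screen (assumed available)
    note_text: str = ""

DELIRIUM_ICD10 = {"F05", "F10.4"}  # codes used in the study under discussion
DELIRIUM_TERMS = ("delirium", "acute confusional state", "altered mental status")

def meets_delirium_phenotype(enc: Encounter) -> bool:
    """Flag an encounter if ANY modality suggests delirium.

    Diagnosis codes, a structured assessment, and free-text mentions each
    contribute; flagged cases would then go to manual chart review, as in
    typical phenotype-validation workflows.
    """
    coded = any(code.startswith(c) for code in enc.icd10_codes for c in DELIRIUM_ICD10)
    mentioned = any(term in enc.note_text.lower() for term in DELIRIUM_TERMS)
    return coded or enc.cam_positive or mentioned
```

An "any modality" rule such as this favors sensitivity over specificity, which is the appropriate trade-off when the flagged set feeds a manual review step rather than being used directly as a label.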
We see evidence in this study of the pitfall of depending on diagnosis codes alone to define the digital phenotype of delirium in the comparison of expert nurses’ risk ratings for delirium with the risk calculated by the algorithm. There was a wide range of both predicted risks and nursing risk assessments of delirium, and the predictive model correlated with the expert nursing assessments. However, in this study, only 0 of 33 patients from the initial nursing assessment evaluation and 2 of 86 from the second nursing assessment had coded diagnoses of delirium (1 was correctly identified by the algorithm alone, and 1 was correctly identified by expert nursing alone).
The authors expanded the cohort definition of delirium by searching free-text patient summaries for words related to delirium and, for positive hits, manually checking the cases for evidence of delirium. However, a substantial gap remained between the incidence rate in this study and benchmark incidence rates. Additionally, the authors recognized that delirium is “not always coded in the participating hospital, and sometimes it is not even mentioned in the discharge summary.”
The authors cite a lack of available data in the EHR as limiting both the diagnosis of delirium and the performance of the prediction models. In particular, if a patient is new to the system, there are no prior data. Even with the data collected during the first day included, the lack of prior data posed challenges for the prediction model. This can be attributed to a dependence on structured data (demographic data, diagnosis data, laboratory data, nursing assessments, and procedures). Clinical notes, even in the emergency department setting, are a valuable source of clinically relevant data.5 Natural language processing technologies, available today and constantly improving, can extract phenotypic data from unstructured free-text notes. Such methods could make sufficient data accessible both to improve the accuracy of the digital phenotype of delirium and to improve the prediction model, even for patients with no prior encounters.
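As a toy illustration of extracting phenotypic signals from free text, the sketch below tags symptom mentions as affirmed or negated using a simple regex heuristic in the spirit of NegEx-style negation detection. The symptom terms and negation cues are illustrative assumptions; a real clinical NLP pipeline would rely on validated lexicons and dedicated tooling rather than this minimal heuristic.

```python
import re

# Illustrative term list and negation cues; real pipelines use validated lexicons.
SYMPTOM_TERMS = ["disorientation", "agitation", "inattention", "fluctuating mental status"]
NEGATION_CUES = re.compile(r"\b(no|denies|without|negative for)\b[^.;]*$", re.IGNORECASE)

def extract_symptom_mentions(note: str) -> dict:
    """Map each symptom term found in the note to 'affirmed' or 'negated'.

    Negation check: does a negation cue appear earlier within the same
    sentence fragment (delimited by '.' or ';')? A toy heuristic, not a
    validated clinical NLP method.
    """
    mentions = {}
    lowered = note.lower()
    for term in SYMPTOM_TERMS:
        idx = lowered.find(term)
        if idx == -1:
            continue
        # Text from the start of the containing fragment up to the term.
        fragment_start = max(lowered.rfind(".", 0, idx), lowered.rfind(";", 0, idx)) + 1
        preceding = lowered[fragment_start:idx]
        mentions[term] = "negated" if NEGATION_CUES.search(preceding) else "affirmed"
    return mentions
```

Even features this simple, fed into a prediction model alongside structured data, illustrate how notes could supply signal for patients with no prior encounters in the system.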
This work demonstrates great opportunities to refine digital phenotypes for delirium, as well as for other conditions, so that they serve as a more accurate ground truth for developing prediction algorithms. Natural language processing technologies can extend the search for useful data beyond codes in the EHR to free-text reports and notes, improving both phenotype definition and prediction, but effort is needed to curate and optimize the ground truth we use to train future predictive models.
FUNDING
This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.
AUTHOR CONTRIBUTIONS
Both authors conceived the correspondence. JR drafted the correspondence and WT reviewed and edited the correspondence. Both authors had final approval of the correspondence and are accountable for all aspects of the work.
CONFLICT OF INTEREST STATEMENT
None declared.
REFERENCES
- 1. Jauk S, Kramer D, Großauer B, et al. Risk prediction of delirium in hospitalized patients using machine learning: an implementation and prospective evaluation study. J Am Med Inform Assoc 2020; 27 (9): 1383–92.
- 2. Oh ES, Akeju O, Avidan MS, et al. A roadmap to advance delirium research: recommendations from the NIDUS Scientific Think Tank. Alzheimers Dement 2020; 16 (5): 726–33.
- 3. Adamson AS, Welch HG. Machine learning and the cancer-diagnosis problem—no gold standard. N Engl J Med 2019; 381 (24): 2285–7.
- 4. Rosenman M, He J, Martin J, et al. Database queries for hospitalizations for acute congestive heart failure: flexible methods and validation based on set theory. J Am Med Inform Assoc 2014; 21 (2): 345–52.
- 5. Rousseau JF, Ip IK, Raja AS, et al. Can automated retrieval of data from emergency department physician notes enhance the imaging order entry process? Appl Clin Inform 2019; 10: 189–98.
