Dear Editors of the Journal of the American Medical Informatics Association,
We are grateful for the opportunity to respond to the letter by Rousseau and Tierney, who opened a discussion of our recent article “Risk prediction of delirium in hospitalized patients using machine learning: An implementation and prospective evaluation study”.1 We agree with them on several points, but would like to elaborate on 3 aspects: (1) to provide more insight into the barriers we faced when defining the outcome variable; (2) to illustrate why a low incidence of delirium during prospective evaluation did not necessarily affect the training of our model; and (3) to discuss methods of evaluating prediction models that go beyond data analysis but are essential for implementation.
Considering the first aspect, we agree with Rousseau and Tierney that natural language processing can be a way to improve prediction modeling by enriching structured data with unstructured data from free-text notes. In our hospital network (Steiermärkische Krankenanstaltengesellschaft, Graz, Austria) as well, clinical notes provide far more detail about a patient than structured information ever could. However, natural language processing methods cannot overcome the problem of missing data for new patients, as clinical notes are mainly documented in the hospital information system toward the end of a patient's stay, or even afterward.
In contrast to the example mentioned by Rousseau and Tierney that uses “data beyond ICD codes” for defining a cohort with congestive heart failure,2 to our knowledge there are neither established biomarkers nor specific medications or procedures to identify patients with delirium. Neuroleptics are commonly given to patients with schizophrenia or other mental disorders as well; nor are sedatives used exclusively for patients in delirium. Frequency, dosage, time of drug administration, or co-prescribed medication might help to differentiate patients with delirium from those with other disorders. However, in our hospitals, the use of such data is limited, because electronic drug prescribing has only recently been implemented. As the first record included in our training data was documented in 2010, no electronic drug records are available for modeling.
We had planned to include nursing diagnoses indicating delirium in the outcome variable, but nearly all of these cases also had an equivalent International Classification of Diseases (ICD)–Tenth Revision code. We saw no added benefit in this information and concentrated on ICD codes and delirium mentions in free text.
Considering the second aspect, Rousseau and Tierney find it apparent that the “ground truth” of our training dataset “misses cases of delirium when it was present.” There is much research on the incidence of delirium in inpatient settings, but reported rates vary widely across studies. Based on the NICE guidelines,3 the occurrence rate ranges between 3% and 42% for general medicine and between 9% and 24% for general surgery. Healthcare professionals in our hospital network estimated the occurrence rate of delirium at between 5% and 15% in the general medicine and surgical departments.
In the data used for our prospective evaluation, delirium was documented (as an ICD code or text word) in 1.5% of the cases, so when assuming a real occurrence rate of 7%, we were missing around 5.5% of delirium cases. As noted in the article, 3 factors might have influenced the observed occurrence rate: (1) the phenomenon of self-defeating prophecy (eg, when a high risk is correctly predicted by the system and the delirium is successfully prevented through interventions, resulting in a false positive case), (2) different degrees of severity of delirium (eg, milder forms of delirium might go undocumented or be overlooked more often than severe forms), and (3) a lack of diagnostic criteria (eg, for medical doctors not specialized in neurology or psychiatry, it may be difficult to diagnose delirium correctly).
However, when training a model, the low occurrence rate of the prospective evaluation has only a minor impact. Let us illustrate this with an example: as described in our article, we trained a random forest model on 14 929 hospitalizations. We screened the clinical text notes of all patients for delirium or related words, using approximate string matching with Levenshtein distance thresholds that depended on the length of the searched string. For all matching strings, the corresponding paragraph of the text was extracted and verified manually. Patients with a coded diagnosis F05 (Delirium due to known physiological condition) and those identified in text screening formed a cohort of 4828 cases. The remaining 10 101 control subjects had neither an ICD–Tenth Revision code related to delirium nor delirium mentioned in their clinical text notes.
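As a minimal illustration (not our production pipeline), the screening step can be sketched in Python; the keyword list and the length-dependent edit-distance thresholds below are hypothetical placeholders:

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance.
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def matches(token: str, keyword: str) -> bool:
    # Longer keywords tolerate more edits (hypothetical thresholds).
    max_dist = 0 if len(keyword) <= 4 else 1 if len(keyword) <= 8 else 2
    return levenshtein(token.lower(), keyword.lower()) <= max_dist

# Hypothetical German delirium-related search terms.
KEYWORDS = ["Delir", "delirant", "Durchgangssyndrom"]

def screen_note(note: str) -> bool:
    # Flag a note if any token approximately matches any keyword;
    # in practice, flagged paragraphs were then verified manually.
    return any(matches(tok, kw) for tok in note.split() for kw in KEYWORDS)
```

In the actual study, flagged paragraphs were extracted and verified manually rather than accepted automatically.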
We want to emphasize that the models were trained on data from the whole hospital network, so the occurrence rate for delirium and the quality of delirium coding might differ from those in the prospective evaluation data. Assuming a real occurrence rate of 7% and an occurrence of 1.5% in our data, 556 cases with delirium would have been wrongly included as controls (5.5% of the 10 101 controls). In contrast, our cohort of delirium cases (n = 4828) is quite accurate; it is very unlikely that these patients did not suffer from delirium. The outcome variable would thus have been correct for 14 373 of the 14 929 hospitalizations in our training cohort (96.3%).
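The label-noise arithmetic above can be reproduced in a few lines, using the assumed occurrence rates as stated:

```python
n_total = 14_929               # hospitalizations in the training data
n_cases = 4_828                # labeled delirium cases (ICD code or text mention)
n_controls = n_total - n_cases # 10 101 labeled controls

assumed_rate = 0.07            # assumed real occurrence of delirium
observed_rate = 0.015          # documented occurrence in the data
missed_fraction = assumed_rate - observed_rate  # 5.5 percentage points

# Controls that would in fact be unrecognized delirium cases.
mislabeled = round(n_controls * missed_fraction)   # 556
correctly_labeled = n_total - mislabeled           # 14 373
accuracy = correctly_labeled / n_total             # about 0.963

print(mislabeled, correctly_labeled, round(accuracy * 100, 1))
```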
Our third and last aspect refers to the comment that “what algorithms are predicting might be the bias of determining the diagnosis, not the condition itself.” We fully agree with this, which highlights the importance of evaluating prediction models implemented in clinical workflows. Whether a machine learning–based prediction model is reliable and useful for health care should be decided by the users. Even though only 2 patients in our comparison with clinical experts had coded delirium, the experts' risk assessment correlated with the algorithm's prediction. This result supports the model's usefulness for detecting patients at high risk, not exclusively those with a coded diagnosis. A study scrutinizing technology acceptance and user experience of the delirium prediction algorithm is currently under review, with mostly positive results regarding usefulness and ease of use.
In sum, our studies suggest that prediction models can be very useful for clinical outcomes for which reliable biomarkers do not yet exist and diagnosis or detection remains limited. This comes with limitations in the training and evaluation of such models. More research is needed to determine how much bias in an outcome variable is acceptable, but for our use case, approximately 96.3% correctly labeled cases appears to be sufficient.
We thank Rousseau and Tierney for their interest in our research and for the discussion.
Sincerely,
Stefanie Jauk, on behalf of the coauthors
CONFLICTS OF INTEREST
The author discloses no conflicts.
ACKNOWLEDGMENTS
We gratefully acknowledge all employees at the participating departments at the hospital LKH Graz II who were involved in the implementation and evaluation process, especially those participating in the expert group meetings. We want to acknowledge Herbert Wurzer, Hubert Hauser, and Ewald Tax, who supported the implementation in the clinical departments. Special thanks go to Christian Jagsch, who was at our disposal for clinical questions during the development; to Elisabeth Lampl for her support in data assessment for evaluation; and to the team of Medical Informatics and Process Management (MIP) of KAGes for their technical support. SJ gratefully acknowledges all coauthors as well as Franz Quehenberger for their contribution. SJ acknowledges Michel Oleynik for his feedback on the manuscript, Harry Freitas Da Cruz for sharing his methodological knowledge on calibration, and the PhD program in Advanced Medical Biomarker Research at the Medical University of Graz.
REFERENCES
- 1. Jauk S, Kramer D, Großauer B, et al. Risk prediction of delirium in hospitalized patients using machine learning: An implementation and prospective evaluation study. J Am Med Inform Assoc 2020; 27 (9): 1383–92.
- 2. Rosenman M, He J, Martin J, et al. Database queries for hospitalizations for acute congestive heart failure: flexible methods and validation based on set theory. J Am Med Inform Assoc 2014; 21 (2): 345–52.
- 3. National Clinical Guideline Centre. Delirium: Diagnosis, Prevention and Management. London: National Clinical Guideline Centre; 2010. https://www.ncbi.nlm.nih.gov/pubmedhealth/PMH0033845/pdf/PubMedHealth_PMH0033845.pdf. Accessed April 3, 2018.
