Skip to main content
The Journals of Gerontology Series A: Biological Sciences and Medical Sciences logoLink to The Journals of Gerontology Series A: Biological Sciences and Medical Sciences
. 2020 Oct 30;77(3):524–530. doi: 10.1093/gerona/glaa275

Ascertainment of Delirium Status Using Natural Language Processing From Electronic Health Records

Sunyang Fu 1,2, Guilherme S Lopes 1, Sandeep R Pagali 3, Bjoerg Thorsteinsdottir 3, Nathan K LeBrasseur 4,5, Andrew Wen 1, Hongfang Liu 1, Walter A Rocca 1, Janet E Olson 1, Jennifer St Sauver 1, Sunghwan Sohn 1,
Editor: Anne B Newman
PMCID: PMC8893184  PMID: 35239951

Abstract

Background

Delirium is underdiagnosed in clinical practice and is not routinely coded for billing. Manual chart review can be used to identify the occurrence of delirium; however, it is labor-intensive and impractical for large-scale studies. Natural language processing (NLP) has the capability to process raw text in electronic health records (EHRs) and determine the meaning of the information. We developed and validated NLP algorithms to automatically identify the occurrence of delirium from EHRs.

Methods

This study used a randomly selected cohort from the population-based Mayo Clinic Biobank (N = 300, age ≥65). We adopted the standardized evidence-based framework confusion assessment method (CAM) to develop and evaluate NLP algorithms to identify the occurrence of delirium using clinical notes in EHRs. Two NLP algorithms were developed based on CAM criteria: one based on the original CAM (NLP-CAM; delirium vs no delirium) and another based on our modified CAM (NLP-mCAM; definite, possible, and no delirium). The sensitivity, specificity, and accuracy were used for concordance in delirium status between NLP algorithms and manual chart review as the gold standard. The prevalence of delirium cases was examined using International Classification of Diseases, 9th Revision (ICD-9), NLP-CAM, and NLP-mCAM.

Results

NLP-CAM demonstrated a sensitivity, specificity, and accuracy of 0.919, 1.000, and 0.967, respectively. NLP-mCAM demonstrated sensitivity, specificity, and accuracy of 0.827, 0.913, and 0.827, respectively. The prevalence analysis of delirium showed that the NLP-CAM algorithm identified 12 651 (9.4%) delirium patients, the NLP-mCAM algorithm identified 20 611 (15.3%) definite delirium cases, and 10 762 (8.0%) possible cases.

Conclusions

NLP algorithms based on the standardized evidence-based CAM framework demonstrated high performance in delineating delirium status in an expeditious and cost-effective manner.

Keywords: Confusion assessment method, Delirium, Electronic health records, Natural language processing


Delirium is a syndrome with symptoms that present as confusion and is characterized by an acute change in mental status, fluctuating course, lack of attention, and disorganized thinking or altered level of consciousness (1). Delirium is common in hospitalized older adults (2,3), with prevalence in the postoperative setting ranging between 21% and 35% (4), and in intensive care unit patients between 60% and 85% (5). Delirium has been associated with multiple predisposing and precipitating risk factors, including infections, use of specific medications, and the presence of a wide range of other chronic conditions (6,7). In particular, persons with dementia are at high risk of delirium, and delirium has been associated with subsequent cognitive impairment (8) and additional adverse outcomes (9), including longer hospital stays, increased likelihood of nursing home placement, and an increased risk of death (3,10–13).

Delirium is underdiagnosed in clinical practice and is not routinely coded for billing (14). A study of patients undergoing elective surgery indicated that delirium was given an International Classification of Diseases, 9th Revision (ICD-9) code in only 3% of patient records (2). Causes of delirium are multi-factorial, but it has been estimated that 30%–40% of cases may be preventable (3). However, further research is necessary to identify the best ways to detect, prevent, and manage delirium. Such research is currently limited by challenges in identifying persons with delirium from electronic health records (EHRs). Inouye et al. (2) have developed methods for identifying persons with delirium from chart review of medical records. However, manual chart review is time-consuming and costly to extract information from clinical notes for large patient populations. Natural language processing (NLP) has been adopted to computationally extract clinical information from EHRs for a wide range of applications ranging from advancing EHR-based clinical research (15,16) to supporting clinical decision making (17,18). Different NLP frameworks have been developed to convert clinical narratives into structured data, including MedLEE (19), MetaMap (20), KnowledgeMap (21), MedTagger (22), and cTAKES (23). In this study, we adopted the MedTagger framework with domain-specific customizability to develop and validate an NLP algorithm to identify the occurrence of delirium using clinical notes derived from the Mayo Clinic EHR.

Materials and Methods

Study Population

This study was approved by the Mayo Clinic Institutional Review Board and the Olmsted Medical Center Institutional Review Board. The study population consisted of participants of the Mayo Clinic Biobank; details regarding this population have been previously published (24). Briefly, the Mayo Clinic Biobank is an institutional resource comprised of volunteers who have donated biological specimens, provided risk factor data, and have given permission to access clinical data from their EHRs for clinical research studies. Participants were contacted as part of a prescheduled medical examination at Mayo Clinic sites between April 2009 and September 2015. All participants were 18 years or older at the time of consent. Approximately 57 000 participants have been enrolled, and 24 224 of these participants were 65 years of age or older at the time of consent. Among these participants, we identified all persons who received an ICD-9 code for delirium or alteration of consciousness after date of enrollment (N = 731; ICD-9 codes: 290.3, 290.41, 291.0, 292.81, 293.0, 293.1, 348.30, and 780.09). We randomly sampled persons (N = 300) from this population for the annotation guideline development and gold standard identification phases of the study (details below). Among 300 randomly selected individuals aged 65 and older, 48.3% were female. A total of 615 visit records were noted for these 300 individuals of which 247 were inpatient records, 55 observation records, 186 emergency records, and 127 outpatient records. Among the 247 inpatient records, 89 visits were ICU records. Half of the persons from the sample population (n = 150) were used to develop NLP algorithms to identify occurrences of delirium, and the remaining sample (n = 150) was set aside to evaluate the performance of the NLP algorithm. Figure 1 showed a population flow process including all samples during the cohort screening and sampling.

Figure 1.

Figure 1.

Workflow of cohort screening and sampling for the corpus annotation and NLP development. NLP = natural language processing.

Annotation Guideline Development

Delirium is diagnosed based on a constellation of established clinical symptoms. Therefore, our primary criterion for identification of delirium was an explicit mention of a delirium episode (eg, “Patient experienced acute post-operative delirium this morning”) documented in the clinical notes. If there was no clear diagnosis or mention of delirium, then we identified delirium based on whether symptoms documented in the clinical notes satisfied the “Confusion Assessment Method” (CAM) criteria (25).

Briefly, CAM has 4 features that are used to facilitate the diagnosis of delirium, including (A) acute onset and fluctuating course, (B) inattention, (C) disorganized thinking, and (D) altered level of consciousness. Each feature was further represented by a list of specific delirium-related concepts, such as deteriorating mental status, drowsiness, mumbling gibberish, impaired orientation, and encephalopathy (25). For example, the CAM feature “Disorganized thinking” was represented by several expressions, including “mumbling gibberish,” “rambling speech,” “unclear flow of ideas,” etc. Figure 2 summarizes CAM features.

Figure 2.

Figure 2.

Comparison of original CAM and the modified CAM. CAM = confusion assessment method.

In our study, we developed 2 versions of criteria: the original CAM and the modified CAM (mCAM) (Table 1). For the original CAM, we operationalized the results as “definitive delirium” or “no delirium.” “Definitive delirium” status was achieved when the medical records described symptoms that match criteria A and B and either C or D of the CAM criteria within 1 month. The average duration of time from the first symptom to the last symptom was 13.8 days (range: 0–28 days). Most persons with delirium met the criteria within 48 hours (38%).

Table 1.

Confusion Assessment Method (CAM)

A: acute onset and fluctuating B: inattention
course
Do the abnormal behaviors?
• Come and go
• Fluctuate during the day
• Increase/decrease in severity
Does the patient:
• Have difficulty focusing attention
• Become easily distracted
• Have difficulty keeping track of what is said
C: disorganized thinking
Is the patient’s thinking?
• Disorganized
• Incoherent
D: altered level of consciousness
What is the patient’s level of consciousness?
• Alert (normal)
• Vigilant (hyper-alert)
• Lethargic (drowsy but easily roused)
• Stuporous (difficult to rouse)
• Comatose (unrousable)
Original CAM:
Definitive: A and B and (C or D)
Modified CAM:
Definitive: At least 3 unique CAM criteria
Possible: Any 2 criteria and does not meet the definitive criteria as above

Two Mayo geriatricians and one palliative care physician (S.P., Z.X., B.T.) helped to define the review criteria and noted that symptoms of delirium may be poorly documented (missing information related to delirium features and concepts) (26). Therefore, we created mCAM to address this issue. Using the mCAM criteria, definitive delirium status was defined as when the medical records describe symptoms that matched 3 of 4 CAM criteria (eg, CAM B + C + D or A + C + D). Possible delirium status was defined as when symptoms matched exactly 2 CAM criteria. The comprehensive annotation guideline can be found in Supplementary Material 1.

Corpus Annotation

Corpus annotation is the process of manual chart review, marking interpretative linguistic (eg, syntax, negation) or predefined clinical information (eg, delirium-related concepts) to a corpus that can be used for NLP algorithm development and evaluation (27–29). There were 2 phases involved in our process: (a) training phase to be familiar with the annotation process and refine annotation guidelines and (b) production phase to create the gold standard for NLP algorithm development and evaluation.

Training phase

One geriatrician and one psychologist (S.P. and G.S.L.) annotated a random sample of records obtained from 15 patients who had an ICD-9 code for delirium or alteration of consciousness and another 15 patients who did not have code to identify documentation of delirium and/or keywords and terms related to the 4 CAM components. In the training phase, annotators reviewed the full medical records from 30 days prior to the date of the ICD-9 code through 30 days following receipt of the initial diagnosis code. They identified delirium-related terms previously described by Puelle et al. (30) for the identification of delirium from medical records as well as additional keywords associated with episodes of delirium observed in the sample records. Discrepancies between reviewers were discussed and resolved, and annotation criteria were updated.

Production phase

A new sample of 44 patients with an ICD-9 code for delirium and 141 patients with an ICD-9 code for the alteration of consciousness was matched by age (±1 year) and sex to 115 patients without a code for delirium (Figure 1). All 300 patient records within ± 30 days anchored by ICD-9 delirium diagnosis (total 8761 documents) were double-annotated by 2 reviewers (G.S.L. and D.I.) using the final annotation guidelines. Interannotator agreement (IAA) was calculated at the patient-level delirium status (eg, definitive delirium, no delirium) and for each concept (eg, confusion). All conflicting cases were adjudicated by a geriatrician and palliative care physician with expertise in geriatrics (S.P. and B.T.). The results of the final adjudicated annotation served as the gold standard for the development and test of the NLP algorithm.

NLP Algorithm Development

The NLP algorithm was developed to automate EHR chart review to identify patients with delirium based on 2 definitions of delirium (CAM and mCAM). To identify a patient’s delirium status, the NLP algorithm screens the clinical notes to extract direct mention of delirium (physician’s diagnosis) and delirium-related clinical concepts that match the CAM criteria. Each concept was then normalized into a standard form based on the CAM instruments. For example, “unresponsiveness” and “decreased responsiveness” were normalized into “disconnected.” Furthermore, the normalized CAM instruments were mapped into the CAM Features A, B, C, and D. As an example in Figure 3, a patient who experienced acute altered mental status (CAM Feature A), inattention (CAM Feature B), and disorganized thinking (CAM Feature C) was considered “definitive” delirium.

Figure 3.

Figure 3.

An example for detecting delirium status based on CAM. CAM = confusion assessment method.

To implement the algorithm, we adopted the open-source NLP pipeline MedTaggerIE (22), an open-source unstructured information management architecture–based information extraction framework. This system separates task-specific NLP knowledge engineering (ie, CAM criteria) from the generic routine NLP, which enables words and phrases containing clinical information (ie, keywords relevant to CAM features) to be directly coded by subject matter experts. The tool has been utilized in various clinical NLP tasks and adopted by multiple studies of phenotyping algorithm development (15,31,32). Additionally, we utilized the Mayo Clinic big data NLP platform (33), a distributed parallel computing environment to support data sets of extremely large volume which integrates NLP with existing EHR data stores. This enabled us to execute the NLP algorithm without manually retrieving large sets of documents prior to execution.

Figure 4 shows the NLP workflow. The generic NLP includes sentence segmentation, tokenization, temporal status detection (eg, present, history), and assertion detection (eg, negation, possible, hypothetical). The task-specific NLP includes the detection of keywords relevant to delirium in the text using regular expressions and normalized to specific delirium concepts. The summarization component applies heuristic rules (ie, CAM criteria) for assigning the delirium status.

Figure 4.

Figure 4.

Architecture of the NLP. NLP = natural language processing.

The NLP algorithms were developed in the following 3 steps: (a) prototype algorithm development based on CAM, (b) formative algorithm development using the training data after the acceptable performance was reached (accuracy >0.95), and (c) final algorithm evaluation on the independent test set. The algorithm was applied and refined on the training data. Incorrect cases were manually reviewed by 2 domain experts (G.S.L. and D.K.I.) and iteratively refined until all issues were resolved. The comprehensive algorithms can be found at https://github.com/OHNLP/AgingNLP/tree/master/delirium.

Statistical Analysis

The manual annotation of delirium status by 2 annotators was assessed by F1-score (34). F1-score (eqn (1)) is a well-established metric in the information retrieval and machine learning community. It measures both positive predictive value (precision) and sensitivity (recall) of the test object. The performance of the algorithm was assessed by using sensitivity (eqn (2)), specificity (eqn (3)), and accuracy (eqn (4)), to assess concordance between delirium status identified using the NLP algorithms and delirium status identified via manual chart review (gold standard).

  • F1-score = 2 × (Precision × Recall)/(Precision + Recall) (1)

  • Sensitivity = True positive/(True positive + False negative) (2)

  • Specificity = True negative/(False positive + True negative) (3)

  • Accuracy = (True positive + True negative)/(Total positive + Total negative) (4)

Prevalence Analysis

To further compare the effectiveness between the NLP algorithm and ICD-9, we applied NLP-CAM and NLP-mCAM to screen all hospitalized patients who visited Mayo Clinic Rochester from April 2009 to September 2015. About 134 910 patients aged 65 years or older who were hospitalized at the Mayo Clinic in Rochester Minnesota were identified. For the purposes of comparison, we calculated the prevalence of positive delirium cases based on ICD-9, NLP-CAM, and NLP-mCAM.

RESULTS

Interannotator Agreement

Among the 300 patients, 8761 clinical documents were double-reviewed, and 7515 delirium-related concepts were annotated. The IAA of patient-level delirium status (N = 300) between 2 annotators in F1-score was 0.94. Agreement between the 2 annotators at the concept level (ie, whether 2 annotators identify the same delirium-related terms, eg, Figure 3, box 1, mental status decline; N = 7515) in F1-score was 0.87. Overall, there were high agreements between the 2 annotators at both the delirium status and individual delirium concept levels.

Concordance in Delirium Status Between NLP Algorithms and Gold Standard

The NLP-CAM algorithm demonstrated a sensitivity, specificity, and accuracy of 0.919, 1.000, and 0.967, respectively, at identifying delirium compared to the gold standard (Table 2). The NLP-mCAM algorithm demonstrated a sensitivity, specificity, and accuracy of 0.827, 0.913, and 0.827, respectively, at identifying definite, possible, and no delirium compared to the gold standard (Table 3).

Table 2.

2 × 2 Contingency Table of NLP-CAM

Gold Standard
Delirium No delirium Total
NLP Delirium 57 0 57
No delirium 5 88 93
Total 62 88 150
Sensitivity = 0.919 Specificity = 1.000 Accuracy = 0.967

Note: NLP-CAM = natural language processing–confusion assessment method.

Table 3.

3 × 3 Contingency Table of NLP-mCAM

Gold Standard
Definitive delirium Possible delirium No delirium Total
NLP Definitive delirium 60 4 2 66
Possible delirium 1 8 10 19
No delirium 3 6 56 65
Total 64 18 68 150
Sensitivity = 0.938 Sensitivity = 0.444 Sensitivity = 0.824 Accuracy = 0.827
Specificity = 0.930 Specificity = 0.917 Specificity = 0.890 Sensitivity* = 0.827
Specificity* = 0.913

Note: NLP-mCAM = natural language processing–modified confusion assessment method.

*Micro average: calculated by using global counts of all categories.

Prevalence Analysis

When the NLP-CAM algorithm was applied to the clinical notes of patients who were aged 65 or older and hospitalized at the Mayo Clinic in Rochester Minnesota between April 2009 and September 2015, 12 651 (9.4%) patients were identified as having delirium. The NLP-mCAM algorithm yielded 20 611 (15.3%) definite delirium cases and 10 762 (8.0%) possible cases. About 5490 (4.1%) of the sample population were identified as having delirium through ICD-9 screening. The results were consistent with the published literature, between 15% and 20% for patients with age greater or equal to 65 (4,35). These results reflect the ability of the NLP-based phenotyping algorithm to identify more likely delirium cases than conventional code-based screening methods.

Discussion

In this study, we developed and evaluated 2 highly accurate NLP algorithms (NLP-CAM and NLP-mCAM) in the task of automatically identifying patients with delirium from clinical notes, even without a definitive diagnosis of delirium, using descriptive terminology consistent with and meeting standard CAM criteria. The implementation of both algorithms demonstrated excellent performance in accurately classifying delirium. In addition, when applied to a large group of randomly selected patients, the NLP algorithms were able to identify more patients with delirium compared to structured data (ICD codes). These results suggest that these algorithms may have high sensitivity to capture occurrences of delirium by mining relevant keywords and concepts in clinical free text. Compared with conventional manual chart review, NLP provides more systematic and scalable solutions in identifying clinical concepts. In general, human annotators have a lower test–retest reliability score (eg, missing true positives due to human error) and manual chart review is impractical to do on a large number of documents. NLP can solve these issues. However, the performance of NLP algorithms is affected by the quality of EHR documentation (eg, presence, absence, and consistency of information required for delirium) because NLP algorithms are based on the records in EHR documents.

We noticed that the accuracy of the algorithm was lower when classifying the delirium based on the modified CAM criteria compared with the original CAM criteria. This may be due to the introduction of the “possible” category. Our modified CAM changed the task from a 2-class to a 3-class classification problem. This implementation causes a relatively lower performance measure than NLP-CAM even though NLP-mCAM is able to identify more number of total delirium cases (definite and possible). During the NLP evaluation, we performed an error analysis to identify the most common causes of errors. We found that extracting “disorganized thinking” related concepts from clinical notes was a major challenge. The same concept can be expressed in many different ways. For example, the same concept can be expressed through different behaviors, such as speech, cognitive process, decision making, and patient reactions. Therefore, the NLP algorithms needed to be both accurate and generalizable in capturing these concepts. To address this problem, we used relaxed word distance, that is, allowing number of words between 2 anchor words, to help fine-tune the rules in multiple iterations (36). This process is, however, time-consuming. In the future, we will explore advanced machine learning techniques to capture the common meaning of the various expressions to aid the development effort.

Implications for Research

The process of manually abstracting clinical concepts for delirium ascertainment is time-consuming, costly, and non-scalable. NLP algorithms are distinctive in their ability to extract critical information from free text in EHRs. NLP techniques offer a sophisticated way of handling free text with high levels of accuracy, allowing efficient mining of unstructured data for broad applications. The NLP-CAM algorithm was developed strictly based on the original definition of CAM with the objective of achieving a high degree of precision. Researchers may find NLP-CAM to be helpful for identifying delirium with high confidence. However, we recognize that delirium symptoms are likely to be poorly documented in the clinical notes (26). Thus, strictly applying the original definition may not be sensitive enough to capture highly likely or possible cases. We therefore modified the criteria definition to adapt to the real-world EHR. Instead of strictly following the CAM criteria, we developed a modified definition for definite cases and added an additional “possible” category. The modified NLP-mCAM algorithm also had a good performance in identifying definite delirium cases and also identified a significant number of possible cases. Depending on the research study, investigators may wish to include only definite cases or may want to include possible cases. The NLP-CAM and NLP-mCAM algorithms therefore offer investigators the flexibility to apply either algorithm depending on the needs of the study.

Implications for Clinical Practice

The NLP algorithms also have many potential clinical applications especially when it comes to proactively identify patients at high risk of delirium based on prior history, or flagging hospitalized patients, who per clinical documentation are showing signs of delirium in real time. Because delirium is underreported and not all patients have a formal assessment for delirium diagnosis, the use of NLP algorithms on routine EHRs can facilitate the early detection of delirium. This can be achieved by integrating the NLP algorithms into clinical workflow through application programming interface technologies, which allows outputs from NLP to be delivered to clinicians by mobile applications or EHR system (eg, EPIC). Such clinical decision tools could facilitate the implementation of preventive measures to reduce the incidence of delirium (through risk factors) and institute early intervention strategies to avoid escalating symptoms and associated complications of delirium.

Limitations

Our study has several limitations. The algorithms initially were developed after a review of clinical notes at a single institution, with the structure tailored to a specific EHR system. Although we have successfully demonstrated the external validity of other NLP algorithms on EHRs in another hospital setting (37,38), additional work is necessary to demonstrate the portability of these algorithms to other institutions and EHRs. The NLP algorithms do not intend to replace a formal delirium assessment. Instead, the NLP algorithms can be used to automate manual chart review based on CAM criteria. The system was developed based on manual chart review from EHRs, the success of NLP algorithms depends on the level of detail and accuracy of the medical records. If there is no documentation of the features and concepts about delirium (hypoactive and hyperactive), NLP cannot identify the cases. We also note that hyperactive delirium is more likely to be documented in the medical records than hypoactive delirium. Therefore, we expect to consistently underestimate the presence of hypoactive delirium.

Conclusions

We adopted the standardized evidence-based framework CAM to develop and evaluate NLP algorithms to identify the occurrence of delirium from EHRs. Our NLP algorithms demonstrated excellent performance in identifying patients with delirium using clinical notes in an expeditious and cost-effective manner. These algorithms represent a promising alternative to manual chart review used in EHR-based delirium research projects and artificial intelligence-based clinical decision support.

Funding

This work was supported by National Institute on Aging R21 AG58738, National Institute on Aging R01 AG34676, andNational Institute on Aging R01 AG068007.

Supplementary Material

glaa275_suppl_Supplementary_Materials_1

Acknowledgments

The authors would like to gratefully acknowledge Donna M. Ihrke and Xin Zhang for case validation.

Conflict of Interest

The authors declare that there is no conflict of interest regarding the publication of this article.

Author Contributions

S.F., J.S., and S.S. conceptualized and design the study. S.F. and S.S. wrote a draft of the manuscript and performed the analysis. G.S.L., B.T., and D.I. conducted manual annotation and B.T. and S.P. adjudicated discrepancies. S.F. and S.S. developed NLP algorithms. A.W. implemented algorithms to the Mayo big data environment. All authors participated in the interpretation of the data and contributed to manuscript editing and revisions.

References

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

glaa275_suppl_Supplementary_Materials_1

Articles from The Journals of Gerontology Series A: Biological Sciences and Medical Sciences are provided here courtesy of Oxford University Press

RESOURCES