Abstract
Purpose
Performance status (PS), an essential indicator of patients’ functional abilities, is often documented in clinical notes of patients with cancer. The use of natural language processing (NLP) in extracting PS from electronic medical records (EMRs) has shown promise in enhancing clinical decision-making, patient monitoring, and research studies. We designed and validated a multi-institute NLP pipeline to automatically extract performance status from free-text patient notes.
Patients and Methods
We collected data from 19,481 patients in the Harris Health System (HHS) and 333,862 patients from the Veterans Affairs Corporate Data Warehouse (VA-CDW) and randomly selected 400 patients from each data source to train and validate (50%) and test (50%) the proposed pipeline. We designed an NLP pipeline using an expert-derived rule-based approach in conjunction with extensive post-processing to improve its accuracy. To demonstrate the pipeline's application, we assessed compliance with the PS documentation rate suggested by the American Society of Clinical Oncology (ASCO) Quality Metric and investigated potential disparities in PS reporting for stage IV non-small cell lung cancer (NSCLC). We used logistic regression to test for differences by race/ethnicity, preferred language, marital status, and gender.
Results
The test results showed 92% accuracy on the HHS cohort and 98.5% accuracy on the VA data. For stage IV NSCLC patients, the proposed pipeline achieved an accuracy of 98.5%. Furthermore, our analysis revealed a documentation rate of over 85% for PS among NSCLC patients, exceeding the ASCO Quality Metric threshold. No disparities were observed in the documentation of PS.
Conclusion
Our proposed NLP pipeline shows promising results in extracting PS from free-text notes from various health institutions. It may be used in longitudinal cancer data registries.
Keywords: Non-small cell lung cancer, natural language processing, performance status, American Society of Clinical Oncology, quality metric, Eastern Cooperative Oncology Group
Introduction
The field of health care has witnessed a tremendous increase in the adoption of electronic medical records (EMRs) over the past decade. EMRs contain a vast amount of valuable clinical information, including the performance status (PS). PS helps health care professionals assess the overall health and functional ability of patients with cancer.1 It provides valuable information that assists in determining appropriate treatment options such as surgery, chemotherapy, or radiation therapy.2 Patients with a poor PS may be more prone to treatment-related complications or may not tolerate intensive therapies as well.3 Furthermore, PS offers insights into the patient's general well-being, disease progression, and survival outlook.3 Patients with a better PS generally have a more favorable prognosis compared with those with a poorer PS.4-7
There are two prominent PS measures: the Eastern Cooperative Oncology Group (ECOG) PS8,9 and the Karnofsky PS (KPS).10 ECOG PS ranges from zero to five, where lower values indicate better functioning and higher values indicate worse functioning. KPS ranges from zero to 100, with 100 representing normal functioning and 0 indicating complete disability or death. Extracting PS data manually from free-text clinical notes is a time-consuming and error-prone task for health care professionals, and computer automation, ie, natural language processing (NLP), would be beneficial for tackling this challenge.
NLP is a branch of artificial intelligence (AI) that enables computers to understand, interpret, and generate human language. In health care, NLP is used to process and analyze vast amounts of unstructured data found in medical records, research articles, and other health care documents. This technology supports applications such as clinical documentation improvement11-13 and patient care.14,15 In the context of cancer specifically, NLP plays a crucial role in areas such as clinical trials,16 pathology report analysis,17,18 and palliative care.19-24 The integration of NLP techniques with EMRs has also opened new possibilities for PS extraction.25-27 NLP automation may reduce labor-intensive chart review and help cancer programs better allocate resources. Previous attempts to extract PS from free-text notes were limited to ECOG and did not include KPS.26
The primary goal of this work was to develop an NLP pipeline to extract PS from unstructured medical notes from multiple institutes. The algorithm was validated on patients with cancer in the Harris Health System (HHS) data repository and the Veterans Affairs (VA) health care system. As a use case, we applied the proposed pipeline to assess the quality of care in patients with stage IV non-small cell lung cancer (NSCLC), determining whether care met the Quality Oncology Practice Initiative (QOPI) metric developed by ASCO.35 Additionally, we investigated potential disparities by race/ethnicity, marital status, language, and gender in the documentation of PS in the notes of HHS cancer patients.
Methods
The project was approved by Baylor College of Medicine IRB (H-35366) and Michael E. DeBakey VA Medical Center (MEDVAMC) Research & Development committee. This is a secondary database analysis, and the waiver of consent was approved by the institutional review board (IRB).
Study Design and Cohort
We utilized retrospective datasets from two separate health care systems in the United States to develop and externally validate the NLP pipeline for PS: the Harris Health System (HHS)28 and the national Veterans Affairs (VA) health care system. These integrated systems provide patients with a high probability of receiving uninterrupted oncologic care and longitudinal follow-up.
HHS database
A retrospective cohort study covering 2011-2022 was conducted at HHS, the largest safety-net health care system for underserved and uninsured patients in the Houston metropolitan area.29 To establish the cohort, we used unique identifiers to link ambulatory and hospitalized patients from both the institutional Cancer Registry and the Epic Clarity and Caboodle data warehouses, which yielded 4,395,926 notes from 19,481 patients (Supplement Table 1). Patients were eligible if they had a newly diagnosed cancer with confirmed histology and invasive staging. We excluded patients whose cancer diagnosis occurred more than 60 days before the initial Epic encounter and patients younger than 18 years. A total of 15,239 notes were acquired for further subsampling to develop the NLP algorithm, as shown in Figure 1.
Figure 1.
STROBE diagram of (A) Harris Health System (HHS) patients with cancer and stage IV non-small cell lung cancer (NSCLC) patients, and (B) Veterans Affairs Corporate Data Warehouse (VA-CDW) patients with cancer.
VA database
The VA cohort comprised veterans older than 50 years as of 10/15/2022; approximately 16.5 million patients were included. We included patients with a neoplasm diagnosis at more than two outpatient visits after 01/01/2016 and a new diagnosis in calendar year 2017, identified with the following International Classification of Diseases (ICD)-10 code patterns: 'C%', 'D0%', 'D1%', 'D2%', 'D3%', and 'D4%'. We collected the notes from cancer diagnoses during outpatient visits in calendar year 2017. As shown in Figure 1, this process yielded a population of 2,033,012 patients with 9,653,944 notes containing at least one of the keywords in Supplement Table 2.
Annotation process
To establish a reliable benchmark for evaluating the performance of the NLP algorithm, we constructed a corpus of annotated notes from the HHS and VA data. Two medical oncology residents reviewed the notes with the following instructions: (1) look for any performance status documentation reported as "ECOG", "ECOG PS", "PS", "Performance status", "KPS", or "Karnofsky performance status" within the notes and record the relevant value; (2) if no value is provided, report 'not available' (NA); and (3) highlight the text and report to the senior oncologist if the value cannot be determined. In case of discrepancy, a senior oncologist served as adjudicator. We randomly selected a total of 400 patients from the HHS database for chart review, and the same annotation process was repeated for 400 patients from the VA cohort. Of the 400 patients from each institute, 200 from each data source were used for training and validation and 200 for testing. Figure 2 summarizes the study design and cohort selection for development and validation of the NLP algorithm and its use case in NSCLC patients. We assessed the annotators' agreement, ie, inter-observer agreement, using Cohen's kappa analysis.30 Inter-observer agreement refers to the degree of consistency, or concordance, between the ratings provided by different observers. Cohen's kappa (κ) measures the level of agreement between two raters beyond what would be expected by chance alone by comparing the observed agreement with the expected agreement:
$$\kappa = \frac{P_o - P_e}{1 - P_e}$$

where $P_o$ is the relative observed agreement among raters and $P_e$ is the hypothetical probability of chance agreement.
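As an illustration of this formula, the following minimal Python sketch computes kappa directly from $P_o$ and $P_e$; the rater labels are hypothetical examples, not study data:

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa from two equal-length sequences of labels."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n                     # observed agreement
    count_a, count_b = Counter(a), Counter(b)
    p_e = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)   # chance agreement
    return (p_o - p_e) / (1 - p_e)

# Hypothetical PS annotations from two reviewers ('NA' = not documented)
rater1 = ["0", "1", "1", "2", "NA", "3", "1", "0"]
rater2 = ["0", "1", "2", "2", "NA", "3", "1", "0"]
print(f"kappa = {cohens_kappa(rater1, rater2):.2f}")  # kappa = 0.84
```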
Figure 2.
Study design and cohort selection for development, validation, and use case of the NLP algorithm.
NLP algorithm
We developed a rule-based NLP pipeline in MATLAB (R2022b) that uses regular expressions to identify entities associated with reporting PS. The list of regular expressions is provided in Supplement Table 2. These expressions play a critical role in controlling the false-negative rate. We iteratively revised this list to minimize the number of PS mentions missed in the notes.
After extracting the PS-related entity, we developed a method to identify the associated numerical value, in the range of 0-4 for ECOG and 0%-100% for KPS, within a maximum of 50 characters following the captured entity. Relying solely on this step, however, could result in a high error rate, so we implemented an extensive set of post-processing rules to optimize the accuracy and reliability of the algorithm. These rules were crafted according to recommendations from the medical oncology team (see Supplement Table 3 for detailed algorithm specifications). In total, 200 notes from 200 patients at HHS and the same number from the VA were randomly selected for training and validation, and an equally sized set was used to test the performance of the algorithm. To determine the sample sizes needed for testing and training, we used the method suggested by Juckett,31 which estimates the minimum sample size needed to achieve reliable performance metrics.
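The pipeline itself was implemented in MATLAB with the regular expressions listed in Supplement Table 2; as a rough illustration of the entity-then-value logic described above, the Python sketch below uses simplified, hypothetical patterns rather than the study's full list:

```python
import re

# Hypothetical, simplified patterns for illustration only; the study's full
# regular-expression list appears in Supplement Table 2.
PS_ENTITY = re.compile(
    r"\b(?P<entity>ECOG(?:\s+PS)?|KPS|Karnofsky(?:\s+performance\s+status)?"
    r"|performance\s+status|PS)\b[:\s]*",
    re.IGNORECASE,
)
ECOG_VALUE = re.compile(r"\b[0-4]\b")            # ECOG scores range 0-4
KPS_VALUE = re.compile(r"\b(?:100|[1-9]?0)\b")   # KPS is reported in tens, 0-100

def extract_ps(note: str, window: int = 50):
    """Yield (entity, value) pairs; value is 'NA' when no number follows."""
    for m in PS_ENTITY.finditer(note):
        entity = m.group("entity")
        # the numeric value must appear within 50 characters of the entity
        scope = note[m.end(): m.end() + window]
        value_re = KPS_VALUE if entity.upper().startswith(("KPS", "KARNOFSKY")) else ECOG_VALUE
        found = value_re.search(scope)
        yield entity, (found.group() if found else "NA")

if __name__ == "__main__":
    text = "Seen in clinic today. ECOG PS: 2. Karnofsky performance status 70."
    for entity, value in extract_ps(text):
        print(f"{entity} -> {value}")
    # ECOG PS -> 2
    # Karnofsky performance status -> 70
```

In the actual pipeline, the output of this matching step is further corrected by the post-processing rules in Supplement Table 3.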
To visualize the various combinations of PS entities and associated numbers, we constructed a tree diagram illustrating the language variation in the documentation of PS scores.
NLP algorithm performance
To assess the performance of the NLP algorithm, a confusion matrix was employed for the six ECOG levels (0 to 4, plus missing ECOG). A confusion matrix is a table used to evaluate the performance of a classification algorithm (in this case, over the different PS values). The matrix is typically a 2 × 2 table but can be extended to larger dimensions for multi-class classification problems. The confusion matrix has four main components:
True Positives (TP): The number of instances that were correctly classified as the positive class.
True Negatives (TN): The number of instances that were correctly classified as the negative class.
False Positives (FP): The number of instances that were incorrectly classified as the positive class when they belong to the negative class.
False Negatives (FN): The number of instances that were incorrectly classified as the negative class when they belong to the positive class.
The overall accuracy was calculated by summing the true positive values across all levels and dividing the result by the total number of samples.
Sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV) were calculated in a one-against-all manner for all PS levels. For these calculations, TP, TN, FP, and FN are defined as follows:
$TP_i$: instances of class $i$ correctly predicted as class $i$.
$FN_i$: instances of class $i$ incorrectly predicted as another class.
$FP_i$: instances of other classes incorrectly predicted as class $i$.
$TN_i$: instances of other classes correctly predicted as not class $i$.
Using these definitions, the sensitivity, specificity, PPV, and NPV for score $i$ are:

$$\mathrm{Sensitivity}_i = \frac{TP_i}{TP_i + FN_i}, \qquad \mathrm{Specificity}_i = \frac{TN_i}{TN_i + FP_i}$$

$$\mathrm{PPV}_i = \frac{TP_i}{TP_i + FP_i}, \qquad \mathrm{NPV}_i = \frac{TN_i}{TN_i + FN_i}$$
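As a sketch of these calculations, the following Python function (assuming NumPy is available) derives the one-against-all metrics and the overall accuracy from a confusion matrix; the example input is the HHS confusion matrix reported in Table 1:

```python
import numpy as np

def one_vs_all_metrics(cm: np.ndarray):
    """Per-class sensitivity, specificity, PPV, and NPV from a square
    confusion matrix cm, where cm[i, j] = actual class i predicted as j."""
    total = cm.sum()
    per_class = []
    for i in range(cm.shape[0]):
        tp = cm[i, i]
        fn = cm[i, :].sum() - tp     # class i predicted as another class
        fp = cm[:, i].sum() - tp     # other classes predicted as class i
        tn = total - tp - fn - fp    # other classes predicted as not class i
        per_class.append({
            "sensitivity": tp / (tp + fn),
            "specificity": tn / (tn + fp),
            "ppv": tp / (tp + fp),
            "npv": tn / (tn + fn),
        })
    return per_class, np.trace(cm) / total  # per-class metrics, overall accuracy

# HHS confusion matrix from Table 1 (rows/cols: PS 0-4, then 'not reported')
hhs = np.array([
    [37, 0, 0, 0, 2, 3],
    [1, 54, 1, 0, 0, 2],
    [1, 0, 21, 0, 0, 2],
    [0, 0, 1, 10, 0, 2],
    [0, 0, 0, 1, 7, 0],
    [0, 0, 0, 0, 0, 55],
])
metrics, accuracy = one_vs_all_metrics(hhs)
print(f"overall accuracy = {accuracy:.0%}")  # 92%
```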
Other Variables
We curated sociodemographic variables for the development cohort, including marital status, language, race/ethnicity, gender, age, insurance status, BMI, smoking status, comorbidities, and a socioeconomic index. For the socioeconomic index, we used the area deprivation index (ADI) as a measure of socioeconomic neighborhood deprivation.32
Statistical Analysis and Evaluation
Continuous variables were presented as mean and standard deviation, and categorical variables as number and percentage. For the use-case analysis, we used logistic regression to test the association of race/ethnicity, language, marital status, and gender (independent variables) with the documentation of PS (dependent variable). Documentation of PS in a patient's notes was coded as one and lack of documentation as zero. Odds ratios (OR) with 95% confidence intervals (95% CI) were reported. Each model was adjusted for age, sex, and NCI comorbidity index. All statistical analyses were performed using SPSS, version 27 (SPSS Inc, Chicago, Illinois).
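The models were fit in SPSS; for illustration only, an equivalent sketch in Python with statsmodels (hypothetical file and column names) would be:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical file and column names; the study fit the same model form
# (documented PS = 1, not documented = 0) in SPSS v27.
df = pd.read_csv("nsclc_cohort.csv")

model = smf.logit(
    "ps_documented ~ C(race_ethnicity, Treatment('White'))"
    " + age + C(sex) + nci_comorbidity_index",
    data=df,
).fit()

# Exponentiate the coefficients to obtain odds ratios with 95% CIs.
or_table = np.exp(pd.concat([model.params, model.conf_int()], axis=1))
or_table.columns = ["OR", "95% CI lower", "95% CI upper"]
print(or_table)
```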
Results
Assessment of NLP Performance
For annotation accuracy, the inter-observer agreement was >0.93 (Supplement Table 4). In the test cohorts, the accuracy of the rule-based NLP pipeline was 92% at HHS and 98.5% at the VA. Tables 1 and 2 show the confusion matrices and other performance indices. On the HHS test set, the algorithm achieved a precision of 95% for PS 0, 100% for PS 1, 91% for PS 2, 91% for PS 3, 78% for PS 4, and 86% for PS not reported. Recall for PS 0-4 was 88%, 93%, 88%, 77%, and 88%, respectively, and 100% for PS not reported.
Table 1.
Confusion Matrix and Performance Indices for Two Different Validation Data Sets, the Local Harris Health System (HHS) and National Veterans Health Administration (VA) Databases.
HHS database, N = 200 (rows: actual PS; columns: PS predicted by the NLP)

| Actual PS | 0 | 1 | 2 | 3 | 4 | Not reported | Recall |
|---|---|---|---|---|---|---|---|
| 0 | 37 | 0 | 0 | 0 | 2 | 3 | 88% |
| 1 | 1 | 54 | 1 | 0 | 0 | 2 | 93% |
| 2 | 1 | 0 | 21 | 0 | 0 | 2 | 88% |
| 3 | 0 | 0 | 1 | 10 | 0 | 2 | 77% |
| 4 | 0 | 0 | 0 | 1 | 7 | 0 | 88% |
| Not reported | 0 | 0 | 0 | 0 | 0 | 55 | 100% |
| Precision | 95% | 100% | 91% | 91% | 78% | 86% | Accuracy = 92% |

VA database, N = 200 (rows: actual PS; columns: PS predicted by the NLP)

| Actual PS | 0 | 1 | 2 | 3 | 4 | Not reported | Recall |
|---|---|---|---|---|---|---|---|
| 0 | 36 | 0 | 0 | 0 | 0 | 0 | 100% |
| 1 | 0 | 49 | 1 | 0 | 0 | 0 | 98% |
| 2 | 0 | 0 | 25 | 0 | 0 | 0 | 100% |
| 3 | 0 | 0 | 0 | 7 | 0 | 0 | 100% |
| 4 | 0 | 0 | 0 | 0 | 5 | 0 | 100% |
| Not reported | 1 | 1 | 0 | 0 | 0 | 75 | 97% |
| Precision | 97% | 98% | 96% | 100% | 100% | 100% | Accuracy = 98.5% |
Table 2.
Confusion Matrix for Evaluating Natural Language Processing (NLP) on Performance Status (PS) Report Documentation for Stage IV Non-small Cell Lung Cancer (NSCLC) Patients, n = 200.
| Actual | Predicted (NLP): PS reported | Predicted (NLP): PS not reported | |
|---|---|---|---|
| PS reported | 165 | 0 | Sensitivity = 100% |
| PS not reported | 3 | 32 | Specificity = 91.4% |
| | PPV = 98.2% | NPV = 100% | Accuracy = 98.5% |
We visualized the rule-based PS extraction with a tree diagram in Supplement Figure 1. Summing the frequency rates of all the entities yields an overall accuracy of 90.3%, which represents a lower bound on accuracy across all notes in the HHS data. Among these combinations, those including "ECOG" (as shown in Supplement Figure 1) are particularly reliable in capturing the correct PS value, because the numeric value falls within a range that the NLP algorithm can identify unambiguously. We conducted a similar analysis for "PS" and "performance status," although these entities were used much less commonly than ECOG.
Use-Case
Next, we applied the validated NLP algorithm to a subset of patients with NSCLC at HHS. Among 757 NSCLC patients, 75% spoke English, and the cohort comprised 25% White, 42% Black, 25% Hispanic, and 9% Asian or other backgrounds. The average age was 58.1 years; 41% were female, and 15% were aged 65 or older. Socioeconomic indicators ranged across ADI-Q1 to ADI-Q4, and insurance coverage included 33% Medicaid, 9% Medicare, 4% private insurance, 2% other/unknown, and 51% uninsured. The cohort had an average BMI of 24.1, with 32% current smokers, 27% non-smokers, and 40% former smokers. Additionally, the patients had varying rates of COPD (32%), hypertension (51%), liver disease (12%), renal disease (4%), and diabetes (5%) (Table 3). Only hypertension showed a statistically significant difference between the groups with and without documented PS (P < 0.01).
Table 3.
Patients’ Characteristics of Stage IV Non-small Cell Lung Cancer (NSCLC) Patients (N = 757).
| Characteristic | All (N = 757) | With PS, n = 643 | Without PS, n = 114 |
|---|---|---|---|
| Marital status, married, N (%) | 243 (32%) | 211 (33%) | 32 (28%) |
| Language | | | |
| English, N (%) | 570 (75%) | 486 (76%) | 84 (74%) |
| Spanish, N (%) | 136 (18%) | 116 (18%) | 20 (18%) |
| Others, N (%) | 51 (6%) | 43 (6%) | 8 (8%) |
| Race/Ethnicity | | | |
| White, N (%) | 186 (25%) | 161 (25%) | 25 (22%) |
| Black, N (%) | 315 (42%) | 264 (41%) | 51 (45%) |
| Hispanic, N (%) | 189 (25%) | 164 (26%) | 25 (22%) |
| Asian/Other, N (%) | 67 (9%) | 54 (8%) | 13 (11%) |
| Gender, female, N (%) | 314 (41%) | 264 (41%) | 50 (44%) |
| Age, M (SD) | 58.1 (9.1) | 58.3 (8.9) | 57.6 (9.5) |
| Age ≥ 65, N (%) | 114 (15%) | 98 (15%) | 16 (14%) |
| Socioeconomic indicators (ADI a) | | | |
| ADI-Q1, N (%) | 206 (27%) | 174 (27%) | 32 (28%) |
| ADI-Q2, N (%) | 168 (22%) | 142 (22%) | 26 (23%) |
| ADI-Q3, N (%) | 181 (24%) | 152 (24%) | 29 (25%) |
| ADI-Q4, N (%) | 196 (26%) | 170 (26%) | 26 (23%) |
| Insurance status | | | |
| Medicaid, N (%) | 248 (33%) | 211 (33%) | 37 (32%) |
| Medicare, N (%) | 71 (9%) | 61 (9%) | 10 (9%) |
| Private, N (%) | 32 (4%) | 30 (5%) | 2 (2%) |
| Other/Unknown, N (%) | 18 (2%) | 11 (2%) | 7 (6%) |
| Uninsured, N (%) | 388 (51%) | 330 (51%) | 58 (51%) |
| BMI, M (SD) | 24.1 (5.1) | 24.1 (5.5) | 24.1 (4.2) |
| Smoking status | | | |
| Current smoker, N (%) | 245 (32%) | 216 (34%) | 29 (25%) |
| Non-smoker, N (%) | 207 (27%) | 175 (27%) | 32 (28%) |
| Former smoker, N (%) | 305 (40%) | 252 (39%) | 53 (46%) |
| Comorbidities | | | |
| COPD, N (%) | 242 (32%) | 209 (33%) | 33 (29%) |
| Hypertension, N (%) | 389 (51%) | 319 (50%)** | 70 (61%)** |
| Liver disease, N (%) | 93 (12%) | 80 (12%) | 13 (11%) |
| Renal disease, N (%) | 31 (4%) | 28 (4%) | 3 (3%) |
| Diabetes, N (%) | 35 (5%) | 31 (5%) | 4 (4%) |
a Area deprivation index.
**P < 0.01.
The results revealed that 85% of stage IV NSCLC patients had their PS documented no later than 60 days following their diagnosis (Supplement Table 5).
For race/ethnicity, Black individuals had an OR of 1.24 compared with White individuals, indicating 24% higher odds of PS documentation (P = 0.40); Hispanic individuals had an OR of 0.98 (P = 0.95), Asian/Other groups an OR of 1.55 (P = 0.24), and non-White individuals overall an OR of 1.18 (P = 0.48). For language, Spanish speakers had an OR of 0.99 (P = 0.99) and non-English speakers an OR of 1.10 (P = 0.66). For gender and marital status, males had an OR of 1.12 (P = 0.56) and married individuals an OR of 1.25 (P = 0.32). In sum, we observed no difference in the odds of PS documentation among Black, Hispanic, and Asian patients compared with White patients as the reference (Table 4), no significant difference between Spanish or other non-English speakers and English speakers, and the same non-significant pattern between males and females.
Table 4.
The Odds of Documenting Eastern Cooperative Oncology Group (ECOG) Performance Status Among 757 Patients With Stage IV Non-small Cell Lung Cancer (NSCLC) in the Harris Health System (HHS) Database.
| Comparison | Odds ratio (95% confidence interval) | P-value |
|---|---|---|
| Race/Ethnicity | | |
| Black (n = 316) vs White (n = 194) | 1.24 (0.74-2.08) | 0.40 |
| Hispanic (n = 178) vs White (n = 194) | 0.98 (0.54-1.78) | 0.95 |
| Asian/Other (n = 96) vs White (n = 194) | 1.55 (0.74-3.24) | 0.24 |
| Non-White (n = 563) vs White (n = 194) | 1.18 (0.73-1.91) | 0.48 |
| Language | | |
| Spanish (n = 118) vs English (n = 570) | 0.99 (0.58-1.69) | 0.99 |
| Non-English (n = 187) vs English (n = 570) | 1.10 (0.70-1.74) | 0.66 |
| Non-Spanish languages (n = 69) vs English (n = 570) | 1.07 (0.48-2.37) | 0.85 |
| Gender and marital status | | |
| Male (n = 443) vs female (n = 314) | 1.12 (0.75-1.68) | 0.56 |
| Married (n = 243) vs unmarried (n = 514; single, divorced, widowed) | 1.25 (0.80-1.94) | 0.32 |
Discussion
In this study, we proposed a multi-institute NLP pipeline to extract PS scores from cancer patients' notes. The algorithm performed well in both a local (HHS) and a national (VA) database, with accuracy of ≥90%. Additionally, we investigated possible disparities across racial/ethnic, language, and gender groups in the local cohort.
When applied to the internal HHS derivation database, the model achieved a high accuracy of 92%. However, performance was comparatively lower for ECOG scores of 3 and 4 (recall of 77% and precision of 78%, respectively). The primary source of error in these cases was class imbalance, which affected the accuracy calculation because the dataset was skewed toward certain classes. Various methods, such as the Synthetic Minority Over-sampling Technique (SMOTE), over-sampling, and under-sampling, can be used to address the imbalance issue.33,34 Moreover, unusual reporting formats (such as 'ECOG was 2+ now +3'), as depicted in Supplement Table 6, also contributed to some of the NLP errors. Additionally, the algorithm failed to detect some PS levels because of the tabular format in which the PS was reported. Specifically, as shown in Supplement Figure 2, certain physicians in the HHS database presented the PS level in a tabular format by bolding the designated value, which the current algorithm was unable to capture because the formatting cues in the original notes were not available. Our analysis indicated that approximately 3.7% of all reports in the HHS database followed this tabular format. During testing on VA notes, the model demonstrated strong performance in classifying most classes, exhibiting high precision and an overall accuracy of 98.5%. It is important to highlight that the primary distinction between the VA and HHS cohorts is the greater prevalence and variety of PS reported in tabular format within the VA data (Supplement Figure 3). Additionally, we observed that the VA data were more uniform and organized in terms of PS reporting.
Compared with similar work, Cohen et al26 developed an NLP algorithm for ECOG PS extraction from EMRs. Their algorithm was tailored to ECOG and does not extract KPS from the notes. Its overall accuracy was 93% across all cancer types, whereas our algorithm achieved 92% on HHS and 98.5% on VA data.
During the annotation phase, the panel of expert annotators observed several issues. One of the most common challenges was inconsistency in documenting the PS: some physicians insert a block of predefined text about PS before reporting the final score. Another challenge was ambiguity or vagueness, with some physicians documenting scores such as '2-3' or '1.5' (refer to Supplement Table 6). In those cases, we used the higher score, rounding fractional values up.
To demonstrate a use case for the pipeline, we used the NLP to evaluate compliance with the ASCO quality metric for performance status documentation in patients with metastatic NSCLC.35 Poor performance status is associated with worse survival outcomes in patients with metastatic NSCLC who receive chemotherapy or immunotherapy.36,37 Performance status cannot be easily extracted from the medical record and often requires labor-intensive manual chart review, which is a barrier to evaluating whether providers are meeting the metric. The Harris Health System is an integrated county health care system for the underserved population in the Houston metropolitan area. We evaluated provider compliance with performance status documentation in patients with metastatic NSCLC and found that the documentation rate was 85% within 60 days of diagnosis, exceeding the 75% threshold recommended by ASCO. Although we initially hypothesized that factors such as race or language barriers might contribute to disparities in PS documentation, our analysis showed no difference in PS documentation by race or primary language.
The potential applications for this performance status NLP include not only compliance with national quality metrics, but also cancer registries that require documentation of performance status and screening for potential clinical trial candidates where good performance status is an inclusion criterion. The proposed pipeline has the potential to be implemented in clinical environments. To achieve this, we intend to implement the pipeline within the VA cancer data registry and evaluate its performance in a real-world environment to assess its potential for wider adoption in routine cancer database registries and health care systems.
The developed algorithm uses a set of regular expressions as keywords to identify PS in the text, so its effectiveness is confined to these specific keywords. As a result, although the algorithm works well with the HHS and VA databases, it may not perform as effectively at other institutions where the documentation language differs. This limitation is common to rule-based NLP. By employing large language models (LLMs), which are trained on extensive text corpora, we aim to enhance generalizability; another study is warranted to utilize LLMs to extract PS scores.
Conclusion
We built an NLP pipeline that successfully extracted performance status (PS) concepts and associated PS values from cancer patients' notes, with an average recall of 94%, average precision of 94%, and average accuracy of 95% at the document level. The small number of samples for some classes, eg, ECOG scores of 3 and 4, may affect system performance. Using an automated system to capture these data is more efficient than manual review, improves access to the data, and can support cancer database registries. We observed no racial, language, or gender disparity in the documentation of PS among NSCLC patients at HHS, and the documentation rate surpassed the American Society of Clinical Oncology (ASCO) Quality Metric threshold.
Supplemental Material
Supplemental Material for A Multi-Institutional Natural Language Processing Pipeline to Extract Performance Status From Electronic Health Records by Arash Maghsoudi, PhD, Yvonne H. Sada, MD, Sara Nowakowski, PhD, Danielle Guffey, MS, Huili Zhu, MD, Sudha R. Yarlagadda, MD, Ang Li, MD, and Javad Razjouyan, PhD in Cancer Control
Acknowledgments
We would like to extend our acknowledgment to Mehrnaz Azarian, MD, who helped us with submission of this study. All the authors have reviewed and approved the manuscript. All authors have authorized the submission of their manuscript via third party and approved any statements or declarations, eg, conflicting interests, funding, etc.
Funding: The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by seed funding from Baylor College of Medicine, Houston, Texas, United States; the Center for Innovations in Quality, Effectiveness and Safety (CIN 13-413), Michael E. DeBakey VA Medical Center, Houston, TX, United States; National Institutes of Health (NIH), National Heart, Lung, and Blood Institute (NHLBI) K25 funding (#1K25HL152006-01); VA Clinical Science Research & Development (IK2 CX001981); and Artificial Intelligence/Machine Learning Consortium to Advance Health Equity and Researcher Diversity (AIM-AHEAD) funding (OD032581-01S1).
Supplemental Material: Supplemental material for this article is available online.
Ethical Statement
Research Approval
The project was approved by Baylor College of Medicine IRB (H-47441) and Michael E. DeBakey VA Medical Center (MEDVAMC) Research & Development committee. Since we did not have any direct patient contact, no informed consent was needed for this study.
ORCID iD
Danielle Guffey https://orcid.org/0000-0003-3721-614X
References
1. Datta SS, Ghosal N, Daruvala R, et al. How do clinicians rate patient's performance status using the ECOG performance scale? A mixed-methods exploration of variability in decision-making in oncology. Ecancermedicalscience. 2019;13:913.
2. Schröder C, Gramatzki D, Vu E, et al. Radiotherapy for glioblastoma patients with poor performance status. J Cancer Res Clin Oncol. 2022;148(8):2127-2136.
3. Allende-Pérez S, Rodríguez-Mayoral O, Peña-Nieves A, Bruera E. Performance status and survival in cancer patients undergoing palliative care: retrospective study. BMJ Support Palliat Care. 2022.
4. Chen W-J, Kong D-M, Li L. Prognostic value of ECOG performance status and Gleason score in the survival of castration-resistant prostate cancer: a systematic review. Asian J Androl. 2021;23(2):163-169.
5. Farmakis IT, Barco S, Mavromanoli AC, Konstantinides SV, Valerio L. Performance status and long-term outcomes in cancer-associated pulmonary embolism: insights from the Hokusai-VTE cancer study. Cardio Oncology. 2022;4(4):507-518.
6. Martinez-Salamanca JI, Shariat SF, Rodriguez JC, et al. Prognostic role of ECOG performance status in patients with urothelial carcinoma of the upper urinary tract: an international study. BJU Int. 2012;109(8):1155-1161.
7. Dall'Olio FG, Maggio I, Massucci M, Mollica V, Fragomeno B, Ardizzoni A. ECOG performance status ≥2 as a prognostic factor in patients with advanced non-small cell lung cancer treated with immune checkpoint inhibitors: a systematic review and meta-analysis of real-world data. Lung Cancer. 2020;145:95-104.
8. Oken MM, Creech RH, Tormey DC, et al. Toxicity and response criteria of the Eastern Cooperative Oncology Group. Am J Clin Oncol. 1982;5(6):649-655.
9. Zubrod C, Schneiderman M, Frei E. Appraisal of methods for the study of chemotherapy in man: comparative therapeutic trial of nitrogen mustard and triethylene thiophosphoramide. J Chron Dis. 1960;11:7-33.
10. Karnofsky DA. The clinical evaluation of chemotherapeutic agents in cancer. In: Evaluation of Chemotherapeutic Agents. New York, NY: Columbia University Press; 1949:191-205.
11. Gramling R, Javed A, Durieux BN, et al. Conversational stories & self organizing maps: innovations for the scalable study of uncertainty in healthcare communication. Patient Educ Counsel. 2021;104(11):2616-2621.
12. Wang Y. NLP applications—clinical documents. In: Natural Language Processing in Biomedicine: A Practical Guide. Berlin: Springer; 2024:325-349.
13. Gramling CJ, Durieux BN, Clarfeld LA, et al. Epidemiology of Connectional Silence in specialist serious illness conversations. Patient Educ Counsel. 2022;105(7):2005-2011.
14. Shen Z. Natural language processing (NLP) applications in patient care: a systematic analysis. Quarterly Review of Business Disciplines. 2020;7(3):223-244.
15. Nawab K, Ramsey G, Schreiber R. Natural language processing to extract meaningful information from patient experience feedback. Appl Clin Inf. 2020;11(2):242-252.
16. Lindvall C, Deng C-Y, Moseley E, et al. Natural language processing to identify advance care planning documentation in a multisite pragmatic clinical trial. J Pain Symptom Manag. 2022;63(1):e29-e36.
17. Santos T, Tariq A, Gichoya JW, Trivedi H, Banerjee I. Automatic classification of cancer pathology reports: a systematic review. J Pathol Inf. 2022;13:100003.
18. López-Úbeda P, Martín-Noguerol T, Aneiros-Fernández J, Luna A. Natural language processing in pathology: current trends and future insights. Am J Pathol. 2022;192(11):1486-1495.
19. Udelsman BV, Lilley EJ, Qadan M, et al. Deficits in the palliative care process measures in patients with advanced pancreatic cancer undergoing operative and invasive nonoperative palliative procedures. Ann Surg Oncol. 2019;26:4204-4212.
20. Sarmet M, Kabani A, Coelho L, Dos Reis SS, Zeredo JL, Mehta AK. The use of natural language processing in palliative care research: a scoping review. Palliat Med. 2023;37(2):275-290.
21. Lindvall C, Lilley EJ, Zupanc SN, et al. Natural language processing to assess end-of-life quality indicators in cancer patients receiving palliative surgery. J Palliat Med. 2019;22(2):183-187.
22. Udelsman B, Chien I, Ouchi K, Brizzi K, Tulsky JA, Lindvall C. Needle in a haystack: natural language processing to identify serious illness. J Palliat Med. 2019;22(2):179-182.
23. Udelsman BV, Lee KC, Lilley EJ, Chang DC, Lindvall C, Cooper Z. Variation in serious illness communication among surgical patients receiving palliative care. J Palliat Med. 2020;23(3):411-414.
24. van den Broek-Altenburg E, Gramling R, Gothard K, Kroesen M, Chorus C. Using natural language processing to explore heterogeneity in moral terminology in palliative care consultations. BMC Palliat Care. 2021;20:1-11.
25. Agaronnik N, Lindvall C, El-Jawahri A, He W, Iezzoni L. Use of natural language processing to assess frequency of functional status documentation for patients newly diagnosed with colorectal cancer. JAMA Oncol. 2020;6(10):1628-1630.
26. Cohen AB, Rosic A, Harrison K, et al. A natural language processing algorithm to improve completeness of ECOG performance status in real-world data. Appl Sci. 2023;13(10):6209.
27. Herath DH, Wilson-Ing D, Ramos E, Morstyn G. Assessing the Natural Language Processing Capabilities of IBM Watson for Oncology Using Real Australian Lung Cancer Cases. Alexandria, VA: American Society of Clinical Oncology; 2016.
28. Li A, La J, May SB, et al. Derivation and validation of a clinical risk assessment model for cancer-associated thrombosis in two unique US health care systems. J Clin Oncol. 2023;JCO.22.01542.
29. Li A, da Costa WL Jr, Guffey D, et al. Developing and optimizing a computable phenotype for incident venous thromboembolism in a longitudinal cohort of patients with cancer. Res Pract Thromb Haemost. 2022;6(4):e12733.
30. McHugh ML. Interrater reliability: the kappa statistic. Biochem Med. 2012;22(3):276-282.
31. Juckett D. A method for determining the number of documents needed for a gold standard corpus. J Biomed Inf. 2012;45(3):460-470.
32. Knighton AJ, Savitz L, Belnap T, Stephenson B, VanDerslice J. Introduction of an area deprivation index measuring patient socioeconomic status in an integrated health system: implications for population health. EGEMs. 2016;4(3):1238.
33. Wong SC, Gatt A, Stamatescu V, McDonnell MD. Understanding data augmentation for classification: when to warp? In: 2016 International Conference on Digital Image Computing: Techniques and Applications (DICTA); Gold Coast, QLD, Australia; November 30-December 2, 2016.
34. Torgo L, Ribeiro RP, Pfahringer B, Branco P. SMOTE for regression. In: Portuguese Conference on Artificial Intelligence. Berlin, Heidelberg: Springer; 2013:378-389.
35. American Society of Clinical Oncology. 2023 QOPI Reporting Track. https://old-prod.asco.org/sites/new-www.asco.org/files/content-files/practice-patients/2023-QOPI-Reporting-Track-Public-Posting.pdf
36. Pater JL, Loeb M. Nonanatomic prognostic factors in carcinoma of the lung: a multivariate analysis. Cancer. 1982;50(2):326-331.
37. Sehgal K, Gill RR, Widick P, et al. Association of performance status with survival in patients with advanced non-small cell lung cancer treated with pembrolizumab monotherapy. JAMA Netw Open. 2021;4(2):e2037120.