American Journal of Epidemiology
2024 Jul 26;194(4):1097–1105. doi: 10.1093/aje/kwae240

Automated identification of fall-related injuries in unstructured clinical notes

Wendong Ge 1,2, Lilian M Godeiro Coelho 2,2, Maria A Donahue 3, Hunter J Rice 4, Deborah Blacker 5,6,7, John Hsu 8,9,10, Joseph P Newhouse 11,12,13,14, Sonia Hernández-Díaz 15, Sebastien Haneuse 16,17, Brandon Westover 18,19,3, Lidia M V R Moura 20,21,✉,3
PMCID: PMC11978607  PMID: 39060160

Abstract

Fall-related injuries (FRIs) are a major cause of hospitalizations among older patients, but identifying them in unstructured clinical notes poses challenges for large-scale research. In this study, we developed and evaluated natural language processing (NLP) models to address this issue. We utilized all available clinical notes from the Mass General Brigham health-care system for 2100 older adults, identifying 154 949 paragraphs of interest through automatic scanning for FRI-related keywords. Two clinical experts directly labeled 5000 paragraphs to generate benchmark-standard labels, while 3689 validated patterns were annotated, indirectly labeling 93 157 paragraphs as validated-standard labels. Five NLP models, including vanilla bidirectional encoder representations from transformers (BERT), the robustly optimized BERT approach (RoBERTa), ClinicalBERT, DistilBERT, and support vector machine (SVM), were trained using 2000 benchmark paragraphs and all validated paragraphs. BERT-based models were trained in 3 stages: masked language modeling, general boolean question-answering, and question-answering for FRIs. For validation, 500 benchmark paragraphs were used, and the remaining 2500 were used for testing. Performance metrics (precision, recall, F1 scores, area under the receiver operating characteristic curve [AUROC], and area under the precision-recall [AUPR] curve) were used to compare the models, with RoBERTa showing the best performance. Precision was 0.90 (95% CI, 0.88-0.91), recall was 0.91 (95% CI, 0.90-0.93), the F1 score was 0.91 (95% CI, 0.89-0.92), and the AUROC and AUPR were both 0.96 (95% CI, 0.95-0.97). These NLP models accurately identify FRIs from unstructured clinical notes, potentially enhancing clinical-notes–based research efficiency.

Keywords: fall-related injuries, Medicare program, natural language processing, BERT, machine learning

Introduction

Fall-related injuries (FRIs) are a leading cause of emergency visits and hospitalizations in adults over age 65 years.1 In 2015, falls cost over $50 billion in US health-care spending, mostly covered by Medicare and Medicaid.2,3 Depending on their severity, FRIs can exacerbate existing physical limitations or cause severe outcomes, such as head injuries or bone fractures, leading to anxiety and reduced independence.4,5

In today’s data-driven era, efficiently identifying FRIs from medical records is a challenge. Manually reviewing electronic health records (EHRs) is a time-consuming process that is susceptible to errors and influenced by reviewer fatigue, affecting both reliability and quality.6-9 A fast and accurate alternative is automation of chart review through natural language processing (NLP).9-11 Readily available digital models that accurately identify true FRIs can enhance clinical research.12,13

NLP models effectively extract unstructured EHR data to accurately detect various health conditions.14-17 While previous studies used the support vector machine (SVM) model for FRI identification,16,17 advancements such as bidirectional encoder representations from transformers (BERT) show promise in recognizing various medical conditions, including dementia and stroke.18,19 Leveraging these new NLP models can enhance FRI identification in clinical notes. Thus, we aimed to develop an effective NLP model tailored to identify falls and FRIs in EHRs among Medicare beneficiaries in a large regional health-care system.

Methods

Data sources

We pulled clinical notes from the Mass General Brigham (MGB) Enterprise Data Warehouse, which includes 7 Massachusetts hospitals: Massachusetts General Hospital, Brigham and Women’s Hospital, McLean Hospital, Newton-Wellesley Hospital, Faulkner Hospital, Salem Hospital, and Spaulding Rehabilitation Hospital.

The MGB Institutional Review Board approved the study, and informed consent was waived. The data and software code that produced the findings are available from the corresponding author.

Sample selection

To define our cohort, we used the following eligibility criteria: (1) age of 65 years or older as of January 1, 2016, and (2) more than 6 months of enrollment in Medicare Parts A and B from January 1, 2016, to December 31, 2019, or death. We identified 42 302 eligible Medicare beneficiaries from the MGB cohort.

To capture FRI instances in diverse settings, we stratified our cohort into 3 mutually exclusive subgroups: (1) beneficiaries with at least 1 inpatient-visit (including emergency and observation visits) primary diagnosis with an International Classification of Diseases, Tenth Revision, code indicative of FRIs (Table S1) (n = 13 793; 33% of the cohort); (2) beneficiaries with at least 1 FRI-related code as the primary diagnosis in an outpatient setting, at least 1 outpatient evaluation and management visit with a selected professional service type (surgery, office/home, preventive physical examinations, urgent care visits), or at least 1 inpatient stay (general, surgical, rehabilitation, emergency department, or skilled nursing facility) (n = 25 693; 61% of the cohort); and (3) any beneficiary not in sample A or B (n = 2816; 7% of the cohort).

We created a sample of 2100 subjects, with 700 from each subgroup, to ensure a balanced representation of FRI instances in diverse settings. Of the 2100 subjects, we excluded patients without notes during the specified time frame, which left 1678 patients whose notes we gathered from the MGB Research Patient Data Registry.20

Definitions and keywords

FRIs were defined as unintended stumbles, trips, or slips leading to an injury with these operational definitions21,22: (1) based only on symptoms such as new-onset fractures, joint dislocation, bruises, cuts, abrasions, open wounds, or back pain; (2) based on health-care use, including a hospital visit or admission for management (medical examination, radiological procedures, reassurance, surgery, suturing); and (3) based on symptoms and health-care use, such as history of falls with a hospital visit and any sequelae relating to a fall.

Using these definitions and manual chart review, we created a collection of FRI-related keywords (Table S2) to automatically scan the notes in our dataset and identify candidate instances where FRIs might have occurred. Keywords were found in 1 518 528 sentences within 154 949 paragraphs from 1669 patients. We assumed that our list of keywords was comprehensive; paragraphs containing no keywords were treated as unrelated to FRIs and were not examined in the subsequent analysis.
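As an illustration, the automated keyword scan can be sketched as follows (the keywords shown are hypothetical examples; the study's full list appears in Table S2):

```python
import re

# Hypothetical FRI-related keywords for illustration only.
FRI_KEYWORDS = ["fall", "fell", "slipped", "tripped", "fracture"]

# One case-insensitive pattern with word boundaries around each keyword.
KEYWORD_RE = re.compile(
    r"\b(" + "|".join(map(re.escape, FRI_KEYWORDS)) + r")\b", re.IGNORECASE
)

def paragraphs_of_interest(paragraphs):
    """Return paragraphs containing at least one FRI-related keyword."""
    return [p for p in paragraphs if KEYWORD_RE.search(p)]

notes = [
    "Patient fell in the bathroom and sustained a hip fracture.",
    "Routine follow-up for hypertension; no acute complaints.",
]
hits = paragraphs_of_interest(notes)  # only the first paragraph matches
```

Word boundaries prevent spurious matches inside longer words (eg, "fall" would not match "rainfall").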

Validated pattern labeling

We created a list of validated patterns for our NLP models, defining them as expressions that determine the presence of an FRI without extra context. These patterns improve NLP performance by enhancing linguistic understanding. Our study identified 2 types: positive patterns, signaling an FRI, and negative patterns, refuting it. (See examples in Figure S1.) Before listing these patterns, clinical experts were trained in reliability by reviewing sentences, achieving high agreement (>90%, Cohen’s κ > 0.80; Appendix S1). They then examined 9000 sentences to develop the initial pattern list.

Graphical user interface, validated patterns, and active learning

To enhance labeling efficiency, we developed a graphical user interface for quick sentence review and labeling, aiding in selecting samples for active learning (AL) and improving model efficiency. We also used validated pattern regular expressions to automatically label sentences, reducing redundancy and increasing efficiency. This involved searching 1 518 528 sentences and labeling similar patterns. Integrating AL, a machine learning (ML) technique, we queried our database for informative unlabeled sentences, optimizing NLP annotation and ensuring a balanced dataset.

We initially labeled random sentences and then trained an AL classifier to label and assign vectors to unlabeled sentences.23 We generated an embedding map using uniform manifold approximation and projection (UMAP)24 and selected candidate sentences using 2 strategies, uncertainty (entropy-based) and diversity (embedding-based), as detailed in Appendix S2 and Figure 1.
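The entropy-based uncertainty strategy can be sketched as follows (a minimal stdlib example; the study's pipeline additionally used embedding-based diversity sampling with UMAP, as detailed in Appendix S2):

```python
import math

def entropy(p):
    """Binary-prediction entropy; highest when p is near 0.5."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def select_uncertain(probs, k):
    """Indices of the k sentences the classifier is least certain about."""
    ranked = sorted(range(len(probs)), key=lambda i: entropy(probs[i]),
                    reverse=True)
    return ranked[:k]

# Classifier probabilities for 4 unlabeled sentences (illustrative values).
probs = [0.95, 0.50, 0.10, 0.55]
chosen = select_uncertain(probs, 2)  # the two sentences nearest p = 0.5
```

Sentences near a probability of 0.5 carry the most information for the annotator, which is why they are queried first.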

Figure 1.

Figure 1

Active learning pipeline for sentence-level labeling to gather validated patterns. The graph represents the process by which validated patterns were generated. GUI, graphical user interface.

Validated patterns cleaning

In initial labeling, we found 2079 positive and 1858 negative candidate validated patterns. Experts then reviewed these, discarding low-quality ones. We used the NegEx algorithm for a quality check,25 identifying negations in positive pattern matches. Manual review followed, and we removed patterns for which negation altered FRI status. This resulted in a final list of 1888 positive and 1801 negative validated patterns (Figure S2). A more comprehensive explanation of the methods used for labeling and cleaning is given in Appendix S3.
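A minimal sketch in the spirit of the NegEx quality check (the trigger list is a small illustrative subset; the actual NegEx algorithm uses a much larger trigger vocabulary and explicit scope rules):

```python
import re

# Illustrative subset of negation triggers; NegEx defines many more.
NEGATION_TRIGGERS = re.compile(r"\b(no|denies|without|negative for)\b",
                               re.IGNORECASE)

def is_negated(sentence, pattern):
    """True when the pattern match is preceded by a negation trigger."""
    m = re.search(pattern, sentence, re.IGNORECASE)
    if not m:
        return False
    return bool(NEGATION_TRIGGERS.search(sentence[: m.start()]))

# A positive pattern match can be flipped by an upstream negation:
is_negated("Patient denies any recent fall.", r"\bfall\b")   # negated
is_negated("Patient had a fall last week.", r"\bfall\b")     # not negated
```

Positive patterns whose matches were frequently negated in this way were flagged for manual review and removal.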

Training and testing data

We created datasets labeled as “benchmark” and “validated” by evaluating FRIs at the paragraph level for our training and testing data. Paragraphs were capped at 512 tokens, providing enough context for experts to evaluate FRIs without exceeding the 512-token input limit of BERT models. Validated-standard data were paragraphs automatically labeled using validated patterns, whereas benchmark-standard data were directly labeled by clinical experts.

We generated our validated-standard data by scanning the 154 949 keyword-identified paragraphs for the presence of any of our previously defined validated patterns. Through this process, we obtained 29 525 paragraphs that were only matched by positive validated patterns and 63 632 paragraphs that were only matched by negative validated patterns. This gave us a validated-standard dataset of 93 157 paragraphs.
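The pattern-based labeling step can be sketched as follows (the patterns shown are hypothetical; per the text above, only paragraphs matched exclusively by one pattern class received a validated-standard label, and the rest were candidates for expert review):

```python
import re

def label_with_patterns(paragraph, positive_patterns, negative_patterns):
    """Return 1 (FRI+) or 0 (FRI-) when exactly one pattern class matches,
    else None (paragraph left unlabeled for possible expert review)."""
    pos = any(re.search(p, paragraph, re.IGNORECASE) for p in positive_patterns)
    neg = any(re.search(p, paragraph, re.IGNORECASE) for p in negative_patterns)
    if pos and not neg:
        return 1
    if neg and not pos:
        return 0
    return None

# Hypothetical validated patterns for illustration.
POS = [r"\bfell and (hit|injured)\b", r"\bstatus post fall\b"]
NEG = [r"\bno (falls?|injur(y|ies))\b", r"\bfall precautions\b"]

label_with_patterns("Pt fell and hit her head.", POS, NEG)     # 1
label_with_patterns("No falls since last visit.", POS, NEG)    # 0
label_with_patterns("Discussed diet and exercise.", POS, NEG)  # None
```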

Clinical experts then labeled 5000 unmatched paragraphs to create our benchmark-standard data. These paragraphs were randomly assigned to 2 clinical experts who had been trained and underwent interrater agreement testing to assure accurate assessment of FRIs (>90% agreement).

We divided the labeled data into model development (training) and model evaluation (testing) sets. All 93 157 validated-standard paragraphs were included in the training set. Benchmark-standard data were split such that 2000 paragraphs (40%) were for training, 500 paragraphs (10%) were for validation, and 2500 paragraphs (50%) were held out for model evaluation. All model performance metrics were determined on the held-out testing data. Table S3 presents the proportions of positive and negative examples in validated and benchmark datasets.

We meticulously selected appropriate hyperparameters to attain optimal performance in our ML models. We used random search to fine-tune hyperparameters while exploring them more efficiently than grid search.26 For each iteration of the random search, a set of hyperparameters (learning rate and batch size) was selected randomly from predefined ranges. A validation dataset was used to determine the best epoch number and select the best model in the corresponding epoch. We monitored the model’s performance on the validation set across different epochs. We chose the epoch number where the model achieved the best performance evaluated through precision, recall, F1 score, and accuracy.

Specifically, we set the batch size to 8, the dimension of the hidden vector to 768, and the sequence length to 512. Our chosen objective function was cross-entropy. For optimization, we employed the Adam algorithm,27 a first-order gradient-based optimization algorithm designed to improve ML models that deal with unpredictable or changing data while utilizing adaptive estimates of lower-order moments. The initial learning rate was established at 0.00002, and we applied a weight decay of 0.01.
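The random search loop can be sketched as follows (the candidate ranges and the scoring function are stand-ins; in the study, each sampled configuration was trained and then scored on the 500-paragraph validation set):

```python
import random

random.seed(0)

# Hypothetical predefined ranges; the study searched learning rate and
# batch size, fixing sequence length (512) and weight decay (0.01).
LEARNING_RATES = [1e-5, 2e-5, 3e-5, 5e-5]
BATCH_SIZES = [8, 16, 32]

def validation_score(lr, batch_size):
    """Stand-in for training a model and scoring it on the validation
    set; a real implementation would return, eg, the F1 score."""
    return -abs(lr - 2e-5) - 0.001 * abs(batch_size - 8)  # toy surrogate

best, best_score = None, float("-inf")
for _ in range(10):  # each iteration samples one configuration at random
    cfg = (random.choice(LEARNING_RATES), random.choice(BATCH_SIZES))
    score = validation_score(*cfg)
    if score > best_score:
        best, best_score = cfg, score
```

Unlike grid search, random search covers the same ranges with far fewer evaluations, which matters when each evaluation is a full fine-tuning run.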

Classification models

We developed 5 NLP models for binary classification of FRIs using paragraphs of text. Our initial model fine-tuned BERT, a Transformer-based model.28 We also adapted 3 versions of BERT: RoBERTa (larger dataset, different methods),29 ClinicalBERT (medical-notes–trained),30 and DistilBERT (fewer parameters, comparable performance).31

Additionally, we created an SVM model, a classic ML algorithm using a bag-of-words approach, transforming paragraphs into unigram and bigram vectors for classification.18,19 This model contrasts with BERT’s need for pretraining on large datasets.32
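The SVM's bag-of-words featurization can be sketched as follows (a minimal unigram-and-bigram counter; the study's SVM then learns a classifier over such vectors):

```python
def ngram_vector(paragraph):
    """Bag-of-words counts over unigrams and bigrams of a paragraph."""
    tokens = paragraph.lower().split()
    counts = {}
    for i, tok in enumerate(tokens):
        counts[tok] = counts.get(tok, 0) + 1          # unigram
        if i + 1 < len(tokens):
            bigram = tok + " " + tokens[i + 1]        # adjacent pair
            counts[bigram] = counts.get(bigram, 0) + 1
    return counts

vec = ngram_vector("patient fell down stairs")
# 4 unigrams + 3 bigrams = 7 distinct features, each with count 1
```

Unlike BERT's contextual embeddings, this representation ignores word order beyond adjacent pairs, which is one source of the performance gap discussed below.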

We also compared our models with a binary rule-based classifier (yes/no), considered a broader expert system where no learning is involved and the decisions are made on the basis of predefined rules per expert knowledge about what features indicate positivity or negativity for FRI. This pathway is important because rule-based systems may provide a straightforward decision-making process without the complexity of an ML model. We offer a more detailed description of the classification models in Appendix S4.
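A toy version of such a rule-based classifier (the cues and their priority ordering are hypothetical illustrations of expert-defined rules; no learning is involved):

```python
def rule_based_fri(paragraph):
    """Yes/no decision from predefined expert rules, checked in priority
    order; the first matching cue determines the verdict."""
    text = paragraph.lower()
    rules = [
        ("no fall", False),           # explicit denial
        ("fall precautions", False),  # prophylactic mention, not an event
        ("fell", True),
        ("found on the floor", True),
        ("fall", True),               # generic mention, lowest priority
    ]
    for cue, verdict in rules:
        if cue in text:
            return verdict
    return False  # default: no FRI evidence found

rule_based_fri("Pt fell at home, laceration to forehead")  # True
rule_based_fri("On fall precautions during admission")     # False
```

The fixed priority list makes every decision traceable, but the rules cannot adapt to unseen phrasings, which is the trade-off against the ML models.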

Model training

In the training stage, we used the 2000 benchmark-standard training data paragraphs and 93 157 validated-standard training data paragraphs to train the NLP models. For the 4 BERT-based models (vanilla BERT, RoBERTa, ClinicalBERT, and DistilBERT), we fine-tuned the model parameters to adapt them to our FRI binary classification task along 3 stages: (1) masked language modeling (MLM), (2) general boolean question-answering (QA), and (3) QA for FRI.

In the MLM stage, we masked some words and used the residual words in the paragraph to predict the masked words; this adapts the models to the type of text (clinical notes) relevant to our task, without making direct use of the FRI labels. In our general boolean QA stage, we used a BoolQ dataset, which allowed us to adapt the BERT-based models for the task of answering binary questions (unrelated to FRI).33 Finally, in the FRI QA stage, we had the model answer our main question, “Has the patient had a fall-related injury?”. For this stage, we gave the model labeled paragraphs and trained the models to return a probability that an FRI was present. This probability can be converted into a binary answer (yes/no) by comparing the probabilities with a threshold value.
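The masking step of the MLM stage can be sketched as follows (stdlib only; real BERT training operates on subword tokens and applies additional replace/keep variants at a 15% masking rate, all handled by the training library):

```python
import random

random.seed(42)
MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15):
    """Randomly hide tokens; the model is trained to recover each hidden
    token from the surrounding (residual) words."""
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if random.random() < mask_prob:
            targets[i] = tok      # remember the original word to predict
            masked.append(MASK)
        else:
            masked.append(tok)
    return masked, targets

tokens = "patient slipped on ice and fractured her wrist".split()
masked, targets = mask_tokens(tokens)
```

Because this stage uses only the raw note text, it adapts the models to clinical language without consuming any FRI labels.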

Model testing

We used the 2500 held-out benchmark-standard paragraphs for model testing to evaluate each model’s performance. Overall model evaluations used precision (positive predictive value), recall (sensitivity), F1 score, area under the receiver operating characteristic (AUROC) curve, area under the precision-recall (AUPR) curve, false-positive rate (FPR), and false-negative rate (FNR). Since we designed our models to perform binary classification, these parameters provided a thorough view of the model’s ability to identify FRIs accurately.

Precision, recall, and F1 score specifically address the concern about false-positive and false-negative results, which are critical in our binary classification task (yes/no). Precision measures the proportion of correctly identified positive cases among all predicted positive cases, while recall measures the proportion of correctly identified positive cases among all actual positive cases. The F1 score combines precision and recall into a single measure, providing a balanced assessment of the models’ performance. This score ranges from 0 to 1, with 1 being the best score. A higher F1 score indicates that the model has a good balance between precision and recall, which is desirable in many NLP tasks.

Additionally, we included other widely used performance metrics such as AUROC, AUPR, FPR, and FNR to evaluate the models’ performance comprehensively. The AUROC assesses the models’ ability to discriminate between positive and negative cases, while the AUPR focuses on the precision-recall trade-off. The FPR and FNR allow us to examine the rates of false-positive and false-negative predictions, respectively. Note that the area under the curve does not require choosing a threshold, whereas the other metrics (precision, recall, F1 score) do. In our study, we conducted a comprehensive analysis by comparing different precision-recall curves to determine the optimal classification threshold of 0.5 (Figure S3). This involved assessing the trade-off between precision and recall at various thresholds. By carefully examining these curves, we could identify the threshold that maximized the models’ performance in accurately identifying FRIs.
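The threshold-dependent metrics can be computed from a 2 × 2 confusion matrix. The counts below are reconstructed for illustration from Table 1's RoBERTa row together with the error analysis (1300 predicted positive − 131 false positives = 1169 true positives; 2500 − 1169 − 131 − 103 = 1097 true negatives); note that the study reports FPR and FNR as fractions of all 2500 test paragraphs rather than of the negative or positive class:

```python
def classification_metrics(tp, fp, fn, tn):
    """Precision, recall, F1, FPR, and FNR from confusion-matrix counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp)   # positive predictive value
    recall = tp / (tp + fn)      # sensitivity
    f1 = 2 * precision * recall / (precision + recall)
    # Per the study's usage: error rates as fractions of all test examples.
    fpr, fnr = fp / total, fn / total
    return {"precision": precision, "recall": recall, "f1": f1,
            "fpr": fpr, "fnr": fnr}

# Counts reconstructed from the reported RoBERTa results.
m = classification_metrics(tp=1169, fp=131, fn=103, tn=1097)
# precision ~ 0.90, f1 ~ 0.91, fpr = 0.0524, fnr = 0.0412
```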

All performance metrics are presented as mean values with 95% CIs. To estimate the 95% CIs, we used the nonparametric resampling bootstrap, with supporting distribution computations performed via the Python library SciPy (“scipy.stats.t”).34 While other methods for estimating the 95% CI exist, we specifically opted for bootstrapping due to its flexibility for estimating uncertainty. It is a widely used resampling technique that allowed us to generate multiple samples with replacement from the test dataset and obtain reliable estimates of the performance metrics. Repeating the process 10 000 times, we calculated the 95% CIs as the 2.5th and 97.5th percentiles of the resampled metrics.35 Specifically, the test dataset contained 2500 examples; in each iteration, we sampled 2500 examples with replacement from the test dataset and used the sampled data to calculate the metrics.
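A percentile-bootstrap sketch of this CI procedure (toy labels and a reduced replicate count for brevity; the study resampled its 2500 test paragraphs 10 000 times):

```python
import random

random.seed(1)

def bootstrap_ci(labels, preds, metric, n_boot=10_000, alpha=0.05):
    """Nonparametric percentile bootstrap: resample (label, prediction)
    pairs with replacement, recompute the metric each time, and take the
    2.5th and 97.5th percentiles of the replicates."""
    n = len(labels)
    stats = []
    for _ in range(n_boot):
        idx = [random.randrange(n) for _ in range(n)]
        stats.append(metric([labels[i] for i in idx],
                            [preds[i] for i in idx]))
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

def accuracy(y, p):
    return sum(a == b for a, b in zip(y, p)) / len(y)

# Toy data with 80% agreement between labels and predictions.
y = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1] * 25
p = [1, 0, 1, 0, 0, 1, 0, 1, 1, 1] * 25
lo, hi = bootstrap_ci(y, p, accuracy, n_boot=1000)  # CI straddles 0.8
```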

Error analysis

To further evaluate model accuracy, we conducted an error analysis of paragraphs on which clinical experts and BERT models disagreed regarding the correct classification (“errors”). We looked at false-positive paragraphs (paragraphs labeled by the clinician as having no FRI [FRI-negative (FRI−)] but labeled by the model as having FRI [FRI-positive (FRI+)]) and false-negative paragraphs (paragraphs labeled by clinicians as FRI+ but labeled by the model as FRI−).

Based on our post hoc review, we stratified these paragraphs into 4 groups: (1) short paragraphs (not enough context for clinicians to reliably assess FRI); (2) contradictory paragraphs (included evidence for and against FRI); (3) clinician labeling error (the label in the benchmark-standard dataset was wrong, and the correct label was clear and in agreement with the model’s classification upon second review); and (4) model labeling error (the label in the benchmark-standard dataset was correct and in disagreement with the model’s classification).

Results

Performance comparison

We compared the SVM model and the 4 BERT-based models: vanilla BERT, RoBERTa, ClinicalBERT, and DistilBERT. Figure 2 shows box plots for each model’s median representing precision, recall, and F1 score, and Table 1 summarizes each model’s performance.

Figure 2.

Figure 2

Comparison of the performance of 5 natural language processing models for extracting fall-related injury data from electronic health records. A) Performance as measured by precision; B) performance as measured by recall; C) performance as measured by F1 score. The orange lines in the box plots represent median values. The box represents the interquartile range (IQR), which spans from the 25th percentile (Q1) to the 75th percentile (Q3). The whiskers extend to the minimum and maximum values within 1.5 times the IQR from Q1 and Q3, respectively. Outliers are shown as individual points outside the whiskers, representing values beyond 1.5 times the IQR. Performance was measured in the 2500 labeled benchmark and 93 157 validated paragraph samples, using bootstrapping to obtain 95% CIs. After evaluation of the precision, recall, and F1 scores for all of the BERT models, RoBERTa was found to be the best model. BERT, bidirectional encoder representations from transformers; RoBERTa, robustly optimized BERT pretraining approach; SVM, support vector machine.

Table 1.

Performance of 5 natural language processing models for extracting data on fall-related injuries from electronic health records.

Model          Prediction,a n       Ground truth,b n     Precision (95% CI)   Recall (95% CI)      F1 score (95% CI)
               FRI+      FRI−       FRI+      FRI−
Vanilla BERT   1205      1295       1272      1228       0.87 (0.85-0.89)     0.82 (0.80-0.84)     0.84 (0.83-0.86)
ClinicalBERT   1274      1226       1272      1228       0.88 (0.86-0.90)     0.88 (0.86-0.90)     0.88 (0.87-0.89)
DistilBERT     1273      1227       1272      1228       0.83 (0.80-0.85)     0.83 (0.81-0.85)     0.83 (0.81-0.84)
RoBERTa        1300      1200       1272      1228       0.90 (0.88-0.91)     0.91 (0.90-0.93)     0.91 (0.89-0.92)
SVM            1282      1218       1272      1228       0.80 (0.78-0.83)     0.81 (0.79-0.83)     0.81 (0.79-0.82)

Prediction and ground-truth counts each total 2500 test paragraphs per model.

Abbreviations: BERT, bidirectional encoder representations from transformers; FRI, fall-related injury; RoBERTa, robustly optimized BERT pretraining approach; SVM, support vector machine.

a Raw number of instances belonging to a particular class (positive/negative) as assigned by the machine learning model.

b Raw number of actual, correct, or observed outcomes or labels associated with the instances in our dataset; serves as a reference for the model’s prediction.

RoBERTa was the best-performing model, with a precision of 0.90 (95% CI, 0.88-0.91), recall of 0.91 (95% CI, 0.90-0.93), and an F1 score of 0.91 (95% CI, 0.89-0.92). ClinicalBERT was the second-best model, with a precision of 0.88 (95% CI, 0.86-0.90), recall of 0.88 (95% CI, 0.86-0.90), and an F1 score of 0.88 (95% CI, 0.87-0.89). These models were superior to vanilla BERT, which had a precision of 0.87 (95% CI, 0.85-0.89), recall of 0.82 (95% CI, 0.80-0.84), and an F1 score of 0.84 (95% CI, 0.83-0.86).

DistilBERT was the worst-performing BERT model, with a precision of 0.83 (95% CI, 0.80-0.85), recall of 0.83 (95% CI, 0.81-0.85), and an F1 score of 0.83 (95% CI, 0.81-0.84). The SVM model achieved a precision of 0.80 (95% CI, 0.78-0.83), recall of 0.81 (95% CI, 0.79-0.83), and an F1 score of 0.81 (95% CI, 0.79-0.82), performing worse than all of the BERT-based models. Areas under the curve for the different NLP models are shown in Figure 3, with the RoBERTa model having the best performance for both the AUROC curve (AUROC = 0.96; 95% CI, 0.95-0.97) and the AUPR curve (AUPR = 0.96; 95% CI, 0.95-0.97).

Figure 3.

Figure 3

Receiver operating characteristic (ROC) curves and precision-recall curves (PRCs) for 5 natural language processing models for extracting fall-related injury data from electronic health records. A) Area under the ROC curve (AUROC); B) area under the precision-recall curve (AUPRC). The RoBERTa model had the best AUROC (0.962) and AUPRC (0.964). BERT, bidirectional encoder representations from transformers; RoBERTa, robustly optimized BERT pretraining approach; SVM, support vector machine.

The performance of the rule-based system was as follows: precision = 0.86 (95% CI, 0.84-0.88), recall = 0.65 (95% CI, 0.62-0.67), and F1 score = 0.74 (95% CI, 0.72-0.76).

Error analysis

We conducted an error analysis on the best-performing model, RoBERTa. In Tables S4 and S5, we describe the different errors made by the clinical experts or the model. We evaluated 131 instances of false positivity, for an FPR of 5.24% (expert label: FRI−; model label: FRI+). The most common reason identified for false positivity (84 paragraphs, 64.12%) was short paragraphs; for these, we concluded that expert labels were not reliable (based on insufficient data) and that the “true” correct label was not clear. The next most common explanations were model labeling error (19 paragraphs, 14.50%), clinician labeling error (18 paragraphs, 13.74%), and contradictory paragraphs (10 paragraphs, 7.63%); for the contradictory paragraphs, we again concluded that the true label was not clear. This post hoc analysis suggested that among the 131 apparent false-positive cases, 94 were better considered “uncertain” rather than true errors, and 18 were not errors. Thus, the “true” FPR may have been as low as 0.76% for the RoBERTa model.

We identified 103 paragraphs as instances of false negativity, for an FNR of 4.12% (expert label: FRI+; model label: FRI−). The most common explanation upon manual review was clinician labeling error (63 paragraphs, 61.16%). This was followed by model labeling errors (27 paragraphs, 26.21%) and contradictory paragraphs (13 paragraphs, 12.62%). Taken together, this post hoc analysis suggested that among the 103 apparent false-negative cases, only 27 were in fact model errors; thus, the FNR may have been as low as 1.08% for the RoBERTa model.

Discussion

Findings

We developed and evaluated 5 NLP models for identifying FRIs in unstructured clinical notes. Among these models, RoBERTa was the best performer. Our error analysis further revealed that when discrepancies arose between the model’s predictions and expert assessments, the model was often correct upon more careful review, or the “true” labels were actually uncertain. We conclude that our NLP model can accurately extract reliable FRI data from clinical notes.

FRIs are complex health events with significant implications for older adults.36 High-performance NLP models automate FRI identification in clinical databases, enabling efficient, large-scale patient identification and information extraction.37 This reduces manual chart review, facilitates the conduct of epidemiologic studies, and enhances understanding of clinical conditions and outcomes.38

NLP models aid in proactive patient care by enabling early FRI detection in EHRs, allowing timely intervention and implementation of targeted fall prevention strategies.13 This can prevent complications and enhance patient safety. Additionally, these models assist health-care institutions in retrospective FRI analyses, identifying quality improvement areas and informing fall prevention guidelines, thus contributing to improved patient care.39

Previous research on identifying FRIs in EHRs used conventional NLP models like SVMs. Two significant studies developed statistical text mining models using SVM, based on ambulatory care notes labeled for fall status.18,19 Although these studies used older SVM models, their results showed improved surveillance of falls and FRIs using medical data.

Our study advanced labeling efficiency using a graphical user interface, AL, and validated patterns, creating a substantial dataset for fine-tuning BERT classifiers. BERT’s training on vast datasets (Wikipedia: 2.5 billion words; Books: 800 million words) allows nuanced semantic representations.28 BERT’s superiority over older models is evident.

However, comparing BERT with SVM involves distinct feature representations. SVM uses uni- and bigrams, while BERT utilizes complex, high-dimensional representations.28,33 This complexity might advantage BERT in capturing intricate data patterns. To ensure fairness, we used BERT’s output as features for the SVM, enabling a direct comparison of SVM’s performance with BERT’s rich features, highlighting the impact of feature representations on model effectiveness.

RoBERTa excelled over other BERT models due to its training on a larger, more diverse text corpus and extended training time, enhancing language understanding.29 It focuses on MLM tasks, omitting next-sentence prediction, which improves performance on short texts. These refinements improve accuracy and the capture of domain-specific nuances, making RoBERTa a leading NLP model.29

In tasks requiring precise medical understanding and domain-specific knowledge, such as the present study, BERT is typically preferred due to its contextual understanding and accuracy in medical language processing. However, as NLP models like BERT continue to evolve, they will undoubtedly expand into new clinical and research settings, opening up further opportunities for advancements in medical NLP applications. Comparing BERT’s performance with that of newer models, such as the generative pre-trained transformer (GPT),40 in the context of FRI identification holds significant potential for advancing research in this field.

Finally, one could argue that employing a multilayer perceptron (MLP) directly as a decision layer might be a cleaner alternative than framing the binary classification task as a QA task. However, the choice between a QA task and an MLP as a decision layer hinges on the problem’s specific requirements and the data’s nature. Indeed, an MLP as a decision layer could prove valuable, given its simpler model architecture and binary classification nature (eg, yes/no).41 Nevertheless, we opted for a QA task due to certain key factors. First, our data encompassed diverse presentations of falls and FRIs, demanding an understanding of intricate relationships within the dataset. Second, we anticipated incorporating additional questions into our algorithm to capture these nuanced aspects in the future. For these reasons, we concluded that a binary classification was not the preferred approach.

Limitations

Our study had limitations that should be acknowledged. Firstly, we did not evaluate the time aspect of FRIs, which means that the FRI occurrences could range from same-day events to events taking place several years before. This lack of a “time context” might have affected the accuracy and completeness of the model’s predictions. In future studies, we plan to develop a model that considers the time dimension by using the same labeling pattern to determine when the FRI happened in the corresponding clinical note.

Secondly, our evaluation of FRI events focused more on the falling aspect rather than the specific injuries resulting from those falls. This limitation suggests that while the model may excel in detecting falls, it might be less accurate in pinpointing the subsequent injuries. In future studies, we aim to explore this dimension to improve the model’s ability to identify the occurrence of falls and related injuries more accurately.

Thirdly, inherent in our methodology, we employed a set of keywords to identify relevant notes for inclusion in our project. While this approach facilitated the identification of a substantial dataset, we acknowledge the potential for systematic bias. Such bias may have resulted in exclusion of notes containing mentions of FRIs that did not precisely match our chosen criteria. To mitigate keyword-related limitations, we authenticated our data with manual reviews and conducted sensitivity analyses, systematically adjusting criteria and making our keywords more comprehensive.

Additionally, in our methods, we employed paragraphs shorter than 512 tokens because BERT has a maximum input length, so clinical notes exceeding that limit must be truncated. Therefore, when using our model on full-length notes, it must be applied with a “sliding window” approach, in which the model is applied to successive chunks of 512 tokens. The note as a whole is judged to describe an FRI if an FRI is detected in any of its component chunks.
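This sliding-window application can be sketched as follows (the overlapping stride and the stand-in chunk classifier are illustrative assumptions; in practice each chunk would be scored by the trained BERT model):

```python
def note_has_fri(tokens, classify_chunk, window=512, stride=256):
    """Apply a fixed-length chunk classifier across a full-length note;
    the note is FRI+ if any chunk is classified FRI+."""
    if len(tokens) <= window:
        return classify_chunk(tokens)
    for start in range(0, len(tokens), stride):
        chunk = tokens[start:start + window]
        if classify_chunk(chunk):
            return True
        if start + window >= len(tokens):  # last chunk reached the end
            break
    return False

# Stand-in classifier: flags chunks containing the token "fell".
flagged = note_has_fri(["filler"] * 600 + ["fell"] + ["filler"] * 100,
                       lambda chunk: "fell" in chunk)
```

An overlapping stride (here half the window) guards against an FRI mention being split across a chunk boundary, at the cost of scoring some tokens twice.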

Lastly, despite implementation of various strategies to reduce the model’s errors, our error analysis revealed that improvements in training clinical labelers and providing more context around FRI events could enhance the accuracy of the benchmark- and validated-standard datasets, consequently improving the model’s performance. To mitigate errors effectively, researchers in future studies can adopt essential strategies to refine the model’s performance, ensuring its reliability and accuracy for clinical applications.

A crucial strategy involves continuously monitoring and evaluating the model’s performance using a separate validation dataset or real-time user feedback. These assessments allow us to gather insights into the model’s real-world performance and identify specific situations where errors might occur. We can promptly address these issues to release updated model versions with improved performance.

Automating data-labeling in health care through regular expression (ie, “regex”) patterns promises efficiency and reduces human error. However, this approach faces challenges like the need for infrastructure, computing resources, and specialized software. In addition, the scarcity of a skilled workforce in this area, where potential employees are often attracted to other industries, hinders health-care research.
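As a crude illustration of regex-based weak labeling (these patterns are hypothetical examples, not the study’s validated patterns, and real clinical negation detection is considerably more involved):

```python
import re

# Hypothetical fall-mention patterns (illustrative only).
FALL_PATTERNS = [
    re.compile(r"\b(fell|falls?|fallen)\b.{0,40}\b(floor|ground|stairs?)\b", re.I),
    re.compile(r"\bfound\s+down\b", re.I),
    re.compile(r"\bmechanical\s+fall\b", re.I),
]

# Very simple negation cue within the same sentence (illustrative only).
NEGATION = re.compile(r"\b(no|denies|without)\b[^.]{0,30}\bfall", re.I)

def label_paragraph(text):
    """Return 1 if the paragraph matches a fall pattern and is not
    plainly negated, else 0 — a sketch of automated weak labeling."""
    if NEGATION.search(text):
        return 0
    return 1 if any(p.search(text) for p in FALL_PATTERNS) else 0
```

Labels produced this way are noisy by design; as in our study, a sample should be manually validated before such patterns are trusted at scale.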

To overcome these challenges, the establishment of best practices with detailed documentation and transparency is crucial. Sharing datasets on public portals can help others learn and improve these methods, enhancing data-labeling automation in health-care research.

Conclusion

We created an NLP model that accurately detects FRIs in older patients’ clinical notes. The RoBERTa model showed high accuracy, aiding efficient FRI research. In future studies, investigators should compare newer models like GPT with BERT, exploring their effectiveness in identifying FRIs in unstructured text.

Supplementary Material

Web_Material_kwae240
web_material_kwae240.pdf (476.7KB, pdf)

Acknowledgments

A preprint of this article is available from the Social Science Research Network (https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4398944).

Contributor Information

Wendong Ge, Department of Neurology, Massachusetts General Hospital, Boston, MA 02114, United States.

Lilian M Godeiro Coelho, Department of Neurology, Massachusetts General Hospital, Boston, MA 02114, United States.

Maria A Donahue, Department of Neurology, Massachusetts General Hospital, Boston, MA 02114, United States.

Hunter J Rice, Department of Neurology, Massachusetts General Hospital, Boston, MA 02114, United States.

Deborah Blacker, Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA 02115, United States; Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA 02115, United States; Department of Psychiatry, Massachusetts General Hospital, Boston, MA 02114, United States.

John Hsu, Department of Health Care Policy, Harvard Medical School, Boston, MA 02115, United States; Mongan Institute, Massachusetts General Hospital, Boston, MA 02114, United States; Department of Medicine, Harvard Medical School, Boston, MA 02115, United States.

Joseph P Newhouse, Department of Health Care Policy, Harvard Medical School, Boston, MA 02115, United States; National Bureau of Economic Research, Cambridge, MA 02138, United States; Department of Health Policy and Management, Harvard T.H. Chan School of Public Health, Boston, MA 02115, United States; John F. Kennedy School of Government, Harvard University, Cambridge, MA 02138, United States.

Sonia Hernández-Díaz, Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA 02115, United States.

Sebastien Haneuse, Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA 02115, United States; Department of Neurology, Beth Israel Deaconess Medical Center, Boston, MA 02215, United States.

Brandon Westover, Department of Neurology, Harvard Medical School, Boston, MA 02115, United States; Department of Psychiatry, Harvard Medical School, Boston, MA 02215, United States.

Lidia M V R Moura, Department of Neurology, Massachusetts General Hospital, Boston, MA 02114, United States; Department of Neurology, Harvard Medical School, Boston, MA 02115, United States.

Supplementary material

Supplementary material is available at American Journal of Epidemiology online.

Funding

This work was supported by National Institutes of Health (NIH) grant 1R01AG073410-01 from the National Institute on Aging. D.B. received individual funding from the NIH (grants 5P30 AG062421-03, 2P01AG036694-11, 5U01AG032984-12, 1U24NS100591-04, 1R01AG058063-04, R01AG063975-03, 5R01AG062282-04, 3R01AG062282-03S1, 5R01AG066793-02, 1U19AG062682-03, 2P01AG032952-11, 2T32MH017119-34 [Billing Agreement 010289.0001], 3P01AG032952-12S3, 1U01AG068221-01, 1U01AG076478- 01, and 5R01AG048351-05). J.H. received funding from NIH grants U01AG076478, P01AG032952, R01AG062282, and R01AG073410. J.P.N. received funding from NIH grants 3P01AG032952, 1U01AG07647, 1R01AG062282, and 1R01MD010456. M.B.W. received funding from NIH grants R01NS102190, R01NS102574, R01NS107291, RF1AG064312, RF1NS120947, R01AG073410, R01HL161253, and R01NS126282, as well as the National Science Foundation (grant 2014431). L.M.V.R.M. received support from the Centers for Disease Control and Prevention (grant U48DP006377), the NIH (National Institute on Aging grant 5R01AG062282-02), and the Epilepsy Foundation of America. The remaining authors (W.G., L.M.G.C., M.A.D., H.J.R., S.H.-D., S.H., and B.W.) received no specific financial support.

Conflict of interest

B.W. is a cofounder of Beacon Biosignals; this relationship is not related to the submitted work. None of the other authors have any potential conflicts to declare.

Disclaimer

The lead authors (W.G. and L.M.G.C.) affirm that this article is an honest, accurate, and transparent account of the study described and the results reported, that no important aspects of the study have been omitted, and that any discrepancies from the study as planned (and, if relevant, registered) have been explained.

Data availability

Data are available from the corresponding author.



