2024 Jul 26;194(4):1097–1105. doi: 10.1093/aje/kwae240

Figure 2.

Comparison of the performance of 5 natural language processing models for extracting fall-related injury data from electronic health records. A) Performance as measured by precision; B) performance as measured by recall; C) performance as measured by F1 score. The orange lines in the box plots represent median values. The box represents the interquartile range (IQR), spanning the 25th percentile (Q1) to the 75th percentile (Q3). The whiskers extend to the minimum and maximum values within 1.5 times the IQR from Q1 and Q3, respectively; values beyond that range are shown as individual outlier points. Performance was measured on the 2500 labeled benchmark samples and the 93 157 validated paragraph samples, using bootstrapping to obtain 95% CIs. After evaluation of the precision, recall, and F1 scores for all of the BERT models, RoBERTa was found to be the best model. BERT, bidirectional encoder representations from transformers; RoBERTa, robustly optimized BERT pretraining approach; SVM, support vector machine.
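The caption describes the evaluation procedure: precision, recall, and F1 are computed on labeled samples, and 95% CIs are obtained by bootstrapping. A minimal sketch of that procedure is shown below; the function names and the use of a simple percentile bootstrap are assumptions for illustration, not the authors' actual implementation.

```python
import random


def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall, and F1 for binary labels (1 = fall-related injury)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1


def bootstrap_ci(y_true, y_pred, metric_index, n_boot=1000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for one metric (0 = precision, 1 = recall, 2 = F1).

    Resamples (label, prediction) pairs with replacement n_boot times and
    returns the (alpha/2, 1 - alpha/2) percentiles of the metric.
    """
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        resampled_true = [y_true[i] for i in idx]
        resampled_pred = [y_pred[i] for i in idx]
        stats.append(precision_recall_f1(resampled_true, resampled_pred)[metric_index])
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lo, hi
```

Running this once per model and per metric yields the point estimates and 95% CIs plotted in panels A-C; in practice, library routines such as scikit-learn's metric functions would typically replace the hand-rolled counts.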