Table 4.

| Category | Model | F1 | Sensitivity | Specificity | Precision | NPV | Accuracy | AUROC (95% CI) | P values† | AUPRC (95% CI) |
|---|---|---|---|---|---|---|---|---|---|---|
| Rule-based | Exact Match | 0.254 | 0.983 | 0.852 | 0.770 | 0.391 | 0.777 | – | – | – |
| | Augmented Match | 0.658 | 0.908 | 0.756 | 0.860 | 0.703 | 0.833 | – | – | – |
| | Guo et al. (single) | 0.766 | 0.674 | 0.951 | 0.887 | 0.837 | 0.851 | – | – | – |
| | Guo et al. (combined) | 0.788 | 0.706 | 0.951 | 0.892 | 0.850 | 0.862 | – | – | – |
| Machine Learning | Random Forest | 0.901 | 0.837 | 0.957 | 0.977 | 0.728 | 0.874 | 0.896 (0.868, 0.925) | <0.001 | 0.964 (0.952, 0.974) |
| | Support Vector Machine | 0.900 | 0.827 | 0.979 | 0.988 | 0.721 | 0.874 | 0.905 (0.880, 0.928) | <0.001 | 0.970 (0.959, 0.977) |
| | Linear Regression | 0.889 | 0.811 | 0.971 | 0.984 | 0.701 | 0.861 | 0.890 (0.864, 0.917) | <0.001 | 0.962 (0.951, 0.972) |
| | XGBoost | 0.901 | 0.837 | 0.957 | 0.977 | 0.728 | 0.874 | 0.897 (0.870, 0.922) | <0.001 | 0.964 (0.948, 0.977) |
| Deep Learning | ClinicalBERT_TGD | 0.923 | 0.906 | 0.975 | 0.940 | 0.960 | 0.954 | 0.944 (0.913, 0.967) | – | 0.941 (0.912, 0.965) |

† P values compare the AUROC of ClinicalBERT_TGD with that of each machine learning baseline, using the two-sided DeLong test.
Guo et al. (single): the best single-rule algorithm, which required ≥2 diagnosis codes and ≥1 keyword.
Guo et al. (combined): the best combined rule, satisfied when either the gender field indicates transgender or the record has ≥1 diagnosis code plus ≥1 TGD keyword.
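
As a reading aid, the two rule-based criteria described in the notes above can be sketched as simple predicates. This is only an illustrative sketch, not the original implementation: the `PatientRecord` type and its field names are hypothetical, and the combined rule is assumed to group as "gender field, or (diagnosis code and keyword)".

```python
from dataclasses import dataclass


@dataclass
class PatientRecord:
    # Hypothetical fields, named for illustration only.
    gender_field_transgender: bool  # structured gender field indicates transgender
    n_diagnosis_codes: int          # number of relevant diagnosis codes on record
    n_keywords: int                 # number of TGD keyword matches in the notes


def best_single_rule(rec: PatientRecord) -> bool:
    """Single rule: at least 2 diagnosis codes AND at least 1 keyword."""
    return rec.n_diagnosis_codes >= 2 and rec.n_keywords >= 1


def best_combined_rule(rec: PatientRecord) -> bool:
    """Combined rule: gender field indicates transgender,
    OR (at least 1 diagnosis code AND at least 1 keyword).
    The grouping of the OR/AND is an assumption."""
    return rec.gender_field_transgender or (
        rec.n_diagnosis_codes >= 1 and rec.n_keywords >= 1
    )
```

Under these assumptions, a record with one diagnosis code and one keyword match would satisfy the combined rule but not the single rule, which is consistent with the combined rule being the more permissive of the two.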