Table 4.
PHI category | MIST | NLM-Scrubber | Emory HIDE | MIT deid rule-based | NeuroNER | MIT deid transformer-based | Our best model (radiology + i2b2 + augmentation) |
---|---|---|---|---|---|---|---|
All PHI | 75.5 (94.7, 62.7) | 74.1 (64.1, 87.5) | 92.2 (96.6, 88.2) | 74.0 (81.7, 67.6) | 93.6 (94.5, 92.6) | 78.4 (95.1, 66.7) | 97.9 (98.0, 97.7) |
Macro-averaged | 61.5 (94.6, 53.7) | 58.8 (51.8, 83.6) | 72.9 (82.1, 66.1) | 28.0 (35.6, 26.0) | 68.6 (75.1, 65.6) | 53.5 (67.9, 48.8) | 89.4 (91.5, 88.0) |
Dates | 75.1 (97.4, 61.2) | 97.9 (98.3, 97.5) | 96.4 (96.8, 96.0) | 89.0 (96.0, 83.0) | 97.9 (98.4, 97.5) | 83.4 (98.0, 72.6) | 98.9 (99.1, 98.6) |
Provider names | 80.8 (93.0, 71.4) | None | 86.6 (97.5, 77.9) | None | 87.0 (82.0, 92.6) | 54.3 (84.2, 40.1) | 95.6 (92.9, 98.4) |
Locations | 79.6 (85.2, 74.7) | None | 83.0 (93.4, 74.7) | 30.8 (51.1, 22.0) | 86.3 (82.0, 77.9) | 45.0 (60.5, 35.8) | 89.4 (90.6, 88.2) |
Vendors and software | 88.1 (86.7, 89.7) | None | 76.5 (88.6, 67.2) | 6.2 (28.6, 3.4) | 75.9 (82.0, 70.7) | None | 65.0 (78.3, 55.6) |
IDs | 11.1 (100, 5.9) | None | 90.6 (98.1, 84.1) | 0 (0, 0) | 84.8 (81.1, 88.9) | 55.3 (76.3, 43.3) | 97.3 (95.9, 98.8) |
Patient names | 19.0 (100, 10.5) | 45.4 (37.3, 57.8) | 0 (0, 0) | 42.1 (37.9, 47.5) | 48.0 (100, 31.6) | None | 95.9 (100, 92.1) |
Phone numbers | 77.0 (100, 62.5) | 33.0 (19.9, 95.6) | 76.9 (100, 62.5) | 0 (0, 0) | 0 (0, 0) | 29.5 (20.6, 52.0) | 84.0 (84.0, 84.0) |
Notes: Cells marked “None” indicate that the corresponding model cannot detect that PHI category. Rule-based models could not be retrained and were penalized by differences in what was considered PHI in the original study, which sometimes excluded years or name titles from being labeled as PHI. Our best model was trained on both radiology reports and i2b2 notes with our data augmentation approach. The “All PHI” category corresponds to the PHI versus non-PHI task, in which labels and predictions are binarized as either PHI or non-PHI. For each PHI category, the best score is shown in bold and underlined.
PHI: protected health information.
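
As a reading aid, the sketch below shows one way the cells of Table 4 can be interpreted. It assumes each cell is written as F1 (precision, recall), which this excerpt does not state explicitly but which is consistent with the arithmetic: the F1 value matches the harmonic mean of the two numbers in parentheses for the cell checked here, and the “Macro-averaged” row matches the unweighted mean of the per-category rows with “None” cells skipped. The function names and the `"O"` tag for non-PHI tokens are hypothetical and chosen only for illustration; they are not taken from the study's code.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_average(cells):
    """Unweighted mean over PHI categories, skipping "None" cells.

    `cells` maps a category name to (F1, precision, recall) or None.
    """
    scored = [c for c in cells.values() if c is not None]
    n = len(scored)
    return tuple(round(sum(c[i] for c in scored) / n, 1) for i in range(3))


def binarize(labels, non_phi="O"):
    """Collapse every PHI category into one positive class, as in the
    "All PHI" row. The "O" tag is an assumed labeling convention."""
    return [label != non_phi for label in labels]


# Per-category cells copied from Table 4 as (F1, precision, recall).
best_model = {
    "Dates": (98.9, 99.1, 98.6),
    "Provider names": (95.6, 92.9, 98.4),
    "Locations": (89.4, 90.6, 88.2),
    "Vendors and software": (65.0, 78.3, 55.6),
    "IDs": (97.3, 95.9, 98.8),
    "Patient names": (95.9, 100, 92.1),
    "Phone numbers": (84.0, 84.0, 84.0),
}
nlm_scrubber = {
    "Dates": (97.9, 98.3, 97.5),
    "Provider names": None,
    "Locations": None,
    "Vendors and software": None,
    "IDs": None,
    "Patient names": (45.4, 37.3, 57.8),
    "Phone numbers": (33.0, 19.9, 95.6),
}

print(macro_average(best_model))    # (89.4, 91.5, 88.0), as in the table
print(macro_average(nlm_scrubber))  # (58.8, 51.8, 83.6), as in the table
print(round(f1(100, 92.1), 1))      # 95.9, the "Patient names" F1 of the best model
print(binarize(["O", "DATE", "NAME"]))  # [False, True, True]
```

Because the published values are rounded to one decimal place, recomputing F1 from the rounded precision and recall can differ from the printed F1 by about 0.1 in some cells. Skipping “None” cells also means that macro-averages are taken over different numbers of categories for different models, which is worth keeping in mind when comparing that row across columns.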