Table 2:
Performance of models developed using the i2b2 2014 challenge training set and evaluated on the i2b2 2014 challenge test set. Models use all lower case text and an uncased vocabulary unless otherwise specified. Each token is treated as a distinct entity. Binary evaluation involves collapsing all labeled entities into a single “PHI” group.
| Model | Multi-class PPV | Multi-class Se | Multi-class F1 | PHI vs. not PHI PPV | PHI vs. not PHI Se | PHI vs. not PHI F1 |
|---|---|---|---|---|---|---|
| BERTlarge | 98.66 | 98.15 | 98.40 | 99.08 | 98.57 | 98.82 |
| BERTlarge,cased | 98.56 | 97.77 | 98.16 | 99.00 | 98.20 | 98.60 |
| BERTbase | 98.61 | 97.90 | 98.25 | 98.98 | 98.27 | 98.62 |
| BERTbase,cased | 98.36 | 97.38 | 97.87 | 98.90 | 97.91 | 98.40 |
| SciBERTsci | 98.34 | 97.88 | 98.11 | 98.80 | 98.33 | 98.57 |
| SciBERTbase | 98.25 | 98.06 | 98.15 | 98.66 | 98.47 | 98.57 |
| BioBERT | 95.27 | 91.60 | 93.36 | 96.95 | 93.18 | 95.03 |
| † Dernoncourt et al. | 98.16 | 98.32 | 98.23 | 97.92 | 97.83 | 97.88 |
| Hartman et al. | 85.7 | 99.1 | 91.7 | - | - | - |
| Liu et al. | 97.94 | 96.04 | 96.98 | 99.30 | 97.28 | 98.28 |
† The PHI vs. not PHI evaluation in Dernoncourt et al. used a subset of classes based upon HIPAA and is not directly comparable to other results.
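To make the two evaluation settings in the caption concrete, the following is a minimal sketch of token-level scoring, not the evaluation script used for the table. It assumes micro-averaging over tokens, that non-PHI tokens carry the label `"O"`, and a hypothetical helper named `token_scores`; none of these details are confirmed by the source.

```python
from typing import Sequence, Tuple


def token_scores(
    gold: Sequence[str],
    pred: Sequence[str],
    binary: bool = False,
    outside: str = "O",  # assumed label for non-PHI tokens
) -> Tuple[float, float, float]:
    """Return token-level (PPV, Se, F1), micro-averaged over all tokens.

    With binary=True, every labeled entity type is collapsed into a
    single "PHI" class before scoring.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same number of tokens")

    if binary:
        gold = ["PHI" if g != outside else outside for g in gold]
        pred = ["PHI" if p != outside else outside for p in pred]

    pairs = list(zip(gold, pred))
    # True positive: a PHI token predicted with exactly the gold label.
    tp = sum(1 for g, p in pairs if g != outside and g == p)
    # False positive: a PHI prediction that does not match the gold label.
    fp = sum(1 for g, p in pairs if p != outside and g != p)
    # False negative: a gold PHI token that was missed or mislabeled.
    fn = sum(1 for g, p in pairs if g != outside and g != p)

    ppv = tp / (tp + fp) if tp + fp else 0.0
    se = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * ppv * se / (ppv + se) if ppv + se else 0.0
    return ppv, se, f1


# Toy example: one NAME token is mislabeled as DATE, one "O" token is flagged.
gold = ["O", "NAME", "NAME", "DATE", "O"]
pred = ["O", "NAME", "DATE", "DATE", "LOCATION"]

multi = token_scores(gold, pred)                 # PPV 0.50, Se ~0.67
binary = token_scores(gold, pred, binary=True)   # PPV 0.75, Se 1.00
```

In this toy example the NAME token predicted as DATE counts as an error in the multi-class setting but as a correct detection once the labels are collapsed, which is one way the binary scores can come out higher than the multi-class ones.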