Author manuscript; available in PMC: 2021 Aug 3.
Published in final edited form as: Proc ACM Conf Health Inference Learn (2020). 2020 Apr 2;2020:214–221. doi: 10.1145/3368555.3384455

Table 2:

Performance of models developed using the i2b2 2014 challenge training set and evaluated on the i2b2 2014 challenge test set. Models use lowercased text and an uncased vocabulary unless otherwise specified. Each token is treated as a distinct entity. Binary evaluation collapses all labeled entities into a single “PHI” group. PPV: positive predictive value; Se: sensitivity. All values are percentages.

                    Multi-class             PHI vs. not PHI
Model               PPV     Se      F1      PPV     Se      F1
BERTlarge           98.66   98.15   98.40   99.08   98.57   98.82
BERTlarge,cased     98.56   97.77   98.16   99.00   98.20   98.60
BERTbase            98.61   97.90   98.25   98.98   98.27   98.62
BERTbase,cased      98.36   97.38   97.87   98.90   97.91   98.40
SciBERTsci          98.34   97.88   98.11   98.80   98.33   98.57
SciBERTbase         98.25   98.06   98.15   98.66   98.47   98.57
BioBERT             95.27   91.60   93.36   96.95   93.18   95.03
Dernoncourt et al.  98.16   98.32   98.23   97.92   97.83   97.88
Hartman et al.      85.7    99.1    91.7    -       -       -
Liu et al.          97.94   96.04   96.98   99.30   97.28   98.28

The PHI vs. not PHI evaluation in Dernoncourt et al. used a subset of classes based upon HIPAA and is not directly comparable to other results.
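
The two evaluation modes described in the caption can be illustrated with a minimal Python sketch. It is not the authors' evaluation code. It assumes micro-averaged, token-level scoring, and that non-PHI tokens carry the label "O"; neither is stated in this table. The function names (f1_from, token_scores) and the toy labels are hypothetical. The last lines check that the F1 values reported for BERTlarge are consistent with the harmonic mean of the reported PPV and Se.

```python
from typing import Sequence, Tuple


def f1_from(ppv: float, se: float) -> float:
    """F1 as the harmonic mean of PPV (precision) and Se (recall)."""
    return 2 * ppv * se / (ppv + se) if (ppv + se) else 0.0


def token_scores(
    gold: Sequence[str],
    pred: Sequence[str],
    outside: str = "O",   # assumed label for non-PHI tokens
    binary: bool = False,
) -> Tuple[float, float, float]:
    """Token-level (PPV, Se, F1), micro-averaged over all PHI tokens.

    With binary=True every non-outside label is collapsed into a single
    "PHI" group before scoring, so confusing one PHI type with another
    is no longer counted as an error.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    if binary:
        gold = [g if g == outside else "PHI" for g in gold]
        pred = [p if p == outside else "PHI" for p in pred]

    # A true positive is a token predicted as PHI with the correct label.
    tp = sum(1 for g, p in zip(gold, pred) if p != outside and p == g)
    n_pred = sum(1 for p in pred if p != outside)  # tokens predicted as PHI
    n_gold = sum(1 for g in gold if g != outside)  # tokens that are PHI

    ppv = tp / n_pred if n_pred else 0.0
    se = tp / n_gold if n_gold else 0.0
    return ppv, se, f1_from(ppv, se)


# Toy example (made-up labels, one label per token).
gold = ["O", "NAME", "NAME", "O", "DATE", "O", "AGE"]
pred = ["O", "NAME", "DATE", "O", "DATE", "LOCATION", "O"]

multi = token_scores(gold, pred)                 # (0.5, 0.5, 0.5)
binary = token_scores(gold, pred, binary=True)   # (0.75, 0.75, 0.75)

# Consistency check against the BERTlarge row of the table.
f1_multi = round(f1_from(98.66, 98.15), 2)       # 98.4, reported as 98.40
f1_binary = round(f1_from(99.08, 98.57), 2)      # 98.82
```

In the toy example the binary scores are higher because the NAME token mislabeled as DATE still counts as correctly flagged PHI. The same effect is visible in the table, where most models score higher on PHI vs. not PHI than on the multi-class evaluation.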