Author manuscript; available in PMC: 2021 Aug 3.

Published in final edited form as: Proc ACM Conf Health Inference Learn (2020). 2020 Apr 2;2020:214–221. doi: 10.1145/3368555.3384455

Table 2:

Performance of models developed using the i2b2 2014 challenge training set and evaluated on the i2b2 2014 challenge test set. Models use all lowercase text and an uncased vocabulary unless otherwise specified. Each token is treated as a distinct entity. Binary evaluation involves collapsing all labeled entities into a single “PHI” group. PPV: positive predictive value; Se: sensitivity. All values are percentages.

	Multi-class			PHI vs. not PHI
	PPV	Se	F1	PPV	Se	F1
BERT_large	98.66	98.15	98.40	99.08	98.57	98.82
BERT_large,cased	98.56	97.77	98.16	99.00	98.20	98.60
BERT_base	98.61	97.90	98.25	98.98	98.27	98.62
BERT_base,cased	98.36	97.38	97.87	98.90	97.91	98.40
SciBERT_sci	98.34	97.88	98.11	98.80	98.33	98.57
SciBERT_base	98.25	98.06	98.15	98.66	98.47	98.57
BioBERT	95.27	91.60	93.36	96.95	93.18	95.03
Dernoncourt et al.†	98.16	98.32	98.23	97.92	97.83	97.88
Hartman et al.	85.7	99.1	91.7	-	-	-
Liu et al.	97.94	96.04	96.98	99.30	97.28	98.28

† The PHI vs. not PHI evaluation in Dernoncourt et al. used a subset of classes based upon HIPAA and is not directly comparable to other results.
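
The two evaluation settings in the table can be illustrated with a minimal sketch of token-level scoring. It is not the authors' evaluation code. It assumes micro-averaging over all non-outside tokens, an outside label of `"O"`, and hypothetical entity labels (`NAME`, `DATE`, `LOCATION`); the function names are likewise illustrative. The binary setting reuses the same scoring after every labeled token has been collapsed into a single "PHI" class.

```python
from typing import List, Sequence, Tuple

OUTSIDE = "O"  # assumed label for tokens that are not PHI


def collapse_to_binary(labels: Sequence[str]) -> List[str]:
    """Map every labeled (non-outside) token to a single 'PHI' class."""
    return [OUTSIDE if lab == OUTSIDE else "PHI" for lab in labels]


def token_scores(gold: Sequence[str], pred: Sequence[str]) -> Tuple[float, float, float]:
    """Token-level PPV, sensitivity, and F1, micro-averaged over labeled tokens.

    Each token counts as its own entity: a true positive is a token whose
    predicted label is not OUTSIDE and matches the gold label exactly.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same number of tokens")

    true_pos = sum(1 for g, p in zip(gold, pred) if g != OUTSIDE and g == p)
    n_pred = sum(1 for p in pred if p != OUTSIDE)  # tokens predicted as PHI
    n_gold = sum(1 for g in gold if g != OUTSIDE)  # tokens that are PHI

    ppv = true_pos / n_pred if n_pred else 0.0
    se = true_pos / n_gold if n_gold else 0.0
    f1 = 2 * ppv * se / (ppv + se) if (ppv + se) else 0.0
    return ppv, se, f1


# Hypothetical six-token example, not data from the i2b2 corpus.
gold = ["O", "NAME", "NAME", "O", "DATE", "O"]
pred = ["O", "NAME", "DATE", "O", "DATE", "LOCATION"]

multi_class = token_scores(gold, pred)
binary = token_scores(collapse_to_binary(gold), collapse_to_binary(pred))
```

In this toy example the third token is labeled `DATE` instead of `NAME`. That counts as an error in the multi-class setting but as a correct PHI detection in the binary setting, which is why the binary scores in the table are generally at least as high as the multi-class ones for the same model.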