Table 2.
Method | Precision, value | Precision, P value | Recall, value | Recall, P value | F1 score, value | F1 score, P value |
Stanza [48] | 81.5 | N/Ab | 75.0 | N/A | 78.1 | N/A |
BERTc (baseline), mean (SD) | 70.7 (2.7) | N/A | 87.3 (1.5) | N/A | 78.1 (1.1) | N/A |
DAPTd (BioBERT), mean (SD) | 73.4e (2.2) | <.001 | 86.5 (1.7) | <.001 | 79.4 (0.6) | .08 |
DAPT (ClinicalBERT), mean (SD) | 76.2e (3.5) | <.001 | 83.4 (3.1) | <.001 | 79.5e (1.0) | .002 |
ITPTf (NCBIg-disease), mean (SD) | 75.0e (1.8) | <.001 | 85.3 (1.1) | >.99 | 79.8e (0.6) | <.001 |
DAPT (BioBERT)+ITPT (NCBI-disease), mean (SD) | 77.7e (2.6) | <.001 | 85.1 (2.8) | .08 | 81.1e (1.1) | <.001 |
DAPT (ClinicalBERT)+ITPT (NCBI-disease), mean (SD) | 78.6e (3.2) | <.001 | 84.4 (1.5) | .56 | 81.3e (1.2) | <.001 |
aDocument-level precision, recall, and F1 score are reported using official evaluation scripts.
bN/A: not applicable.
cBERT: Bidirectional Encoder Representations from Transformers.
dDAPT: domain-adaptive pretraining.
eRepresents results that are significantly better than the Bidirectional Encoder Representations from Transformers baseline (approximate randomization test; P<.05). Although the baseline Bidirectional Encoder Representations from Transformers model has the highest recall, its recall differences from the other models are not significant, except for the 2 variants that use domain-adaptive pretraining alone (P<.001 for both).
fITPT: intermediate-task pretraining.
gNCBI: National Center for Biotechnology Information.
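
The P values in Table 2 come from an approximate randomization test (footnote e). The table does not describe how the test was set up, so the following is only a minimal sketch of the general technique under stated assumptions: the two systems are compared on paired per-document scores, the test statistic is the absolute difference in mean score, and the test is two-sided. The function name `approximate_randomization`, its parameters, and the example scores are hypothetical and are not taken from the source.

```python
import random


def approximate_randomization(scores_a, scores_b, trials=10000, seed=0):
    """Two-sided paired approximate randomization test (illustrative sketch).

    scores_a, scores_b: per-document scores of two systems on the same
    documents (an assumption; the source does not state the unit of pairing).
    Returns an estimated P value for the observed difference in mean score.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("score lists must be paired and of equal length")
    rng = random.Random(seed)
    n = len(scores_a)
    observed = abs(sum(scores_a) - sum(scores_b)) / n

    at_least_as_extreme = 0
    for _ in range(trials):
        diff = 0.0
        for a, b in zip(scores_a, scores_b):
            # Under the null hypothesis the system labels are exchangeable,
            # so swap each pair's outputs with probability 0.5.
            if rng.random() < 0.5:
                a, b = b, a
            diff += a - b
        # Small tolerance guards against floating-point rounding.
        if abs(diff) / n >= observed - 1e-12:
            at_least_as_extreme += 1

    # Add-one smoothing keeps the estimate strictly above zero.
    return (at_least_as_extreme + 1) / (trials + 1)


# Hypothetical per-document scores, for illustration only.
baseline = [0.70, 0.81, 0.65, 0.90, 0.77, 0.72, 0.84, 0.69]
variant = [0.74, 0.83, 0.71, 0.91, 0.80, 0.78, 0.85, 0.75]
p_value = approximate_randomization(baseline, variant)
print(f"estimated P value: {p_value:.4f}")
```

A result would be marked as significant in the sense of footnote e when the estimated P value falls below .05. A test on corpus-level F1 would instead recompute F1 from the shuffled predictions in each trial rather than averaging per-document scores.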