2021 Apr 30;9(4):e24020. doi: 10.2196/24020

Table 2.

Evaluation results on Observation concepts in the test set^a.

| Method | Precision, value | Precision, P value | Recall, value | Recall, P value | F1 score, value | F1 score, P value |
|---|---|---|---|---|---|---|
| Stanza [48] | 81.5 | N/A^b | 75.0 | N/A | 78.1 | N/A |
| BERT^c (baseline), mean (SD) | 70.7 (2.7) | N/A | 87.3 (1.5) | N/A | 78.1 (1.1) | N/A |
| DAPT^d (BioBERT), mean (SD) | 73.4^e (2.2) | <.001 | 86.5 (1.7) | <.001 | 79.4 (0.6) | .08 |
| DAPT (ClinicalBERT), mean (SD) | 76.2^e (3.5) | <.001 | 83.4 (3.1) | <.001 | 79.5^e (1.0) | .002 |
| ITPT^f (NCBI^g-disease), mean (SD) | 75.0^e (1.8) | <.001 | 85.3 (1.1) | >.99 | 79.8^e (0.6) | <.001 |
| DAPT (BioBERT)+ITPT (NCBI-disease), mean (SD) | 77.7^e (2.6) | <.001 | 85.1 (2.8) | .08 | 81.1^e (1.1) | <.001 |
| DAPT (ClinicalBERT)+ITPT (NCBI-disease), mean (SD) | 78.6^e (3.2) | <.001 | 84.4 (1.5) | .56 | 81.3^e (1.2) | <.001 |

^a Document-level precision, recall, and F1 score are reported using official evaluation scripts.

^b N/A: not applicable.

^c BERT: Bidirectional Encoder Representations from Transformers.

^d DAPT: domain-adaptive pretraining.

^e Represents results that are significantly better than the Bidirectional Encoder Representations from Transformers baseline (approximate randomization test; P=.05). Although the recall of baseline Bidirectional Encoder Representations from Transformers is the highest, the differences are not significant except those for 2 domain-adaptive pretraining variants.

^f ITPT: intermediate-task pretraining.

^g NCBI: National Center for Biotechnology Information.
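
Two quantities in Table 2 can be illustrated in a few lines of Python: the F1 score as the harmonic mean of precision and recall, and a P value obtained from a paired approximate randomization test. This is a minimal sketch and not the official evaluation script or the authors' test code. The function names, the per-item score lists, the number of trials, and the seed are all hypothetical, and the table does not state at which level (document, run, or entity) the authors applied the test.

```python
import random


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same scale as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def approximate_randomization_test(scores_a, scores_b, trials=10_000, seed=0):
    """Two-sided paired approximate randomization test on the mean difference.

    scores_a and scores_b are hypothetical paired per-item scores for two
    systems (an assumption; the table does not specify the unit of pairing).
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("paired test needs equally long score lists")
    rng = random.Random(seed)
    n = len(scores_a)
    observed = abs(sum(scores_a) - sum(scores_b)) / n
    at_least_as_extreme = 0
    for _ in range(trials):
        diff = 0.0
        for a, b in zip(scores_a, scores_b):
            # Under the null hypothesis the system labels are exchangeable,
            # so each pair is swapped with probability 0.5.
            if rng.random() < 0.5:
                a, b = b, a
            diff += a - b
        if abs(diff) / n >= observed:
            at_least_as_extreme += 1
    # Add-one smoothing keeps the estimated P value strictly above zero.
    return (at_least_as_extreme + 1) / (trials + 1)


# Stanza row of Table 2: precision 81.5, recall 75.0.
print(round(f1_score(81.5, 75.0), 1))  # prints 78.1
```

For the rows reported as mean (SD), the F1 score is presumably averaged over runs, so applying `f1_score` to the mean precision and mean recall only approximates the tabulated value. In this sketch, a result from `approximate_randomization_test` below the chosen significance level would correspond to the entries marked as significantly different from the baseline.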