2021 Apr 30;9(4):e24020. doi: 10.2196/24020

Table 2.

Evaluation results on Observation concepts in the test set^a.

Method	Precision, value	Precision, P value	Recall, value	Recall, P value	F1 score, value	F1 score, P value
Stanza [48]	81.5	N/A^b	75.0	N/A	78.1	N/A
BERT^c (baseline), mean (SD)	70.7 (2.7)	N/A	87.3 (1.5)	N/A	78.1 (1.1)	N/A
DAPT^d (BioBERT), mean (SD)	73.4^e (2.2)	<.001	86.5 (1.7)	<.001	79.4 (0.6)	.08
DAPT (ClinicalBERT), mean (SD)	76.2^e (3.5)	<.001	83.4 (3.1)	<.001	79.5^e (1.0)	.002
ITPT^f (NCBI^g-disease), mean (SD)	75.0^e (1.8)	<.001	85.3 (1.1)	>.99	79.8^e (0.6)	<.001
DAPT (BioBERT)+ITPT (NCBI-disease), mean (SD)	77.7^e (2.6)	<.001	85.1 (2.8)	.08	81.1^e (1.1)	<.001
DAPT (ClinicalBERT)+ITPT (NCBI-disease), mean (SD)	78.6^e (3.2)	<.001	84.4 (1.5)	.56	81.3^e (1.2)	<.001

^aDocument-level precision, recall, and F1 score are reported using official evaluation scripts.

^bN/A: not applicable.

^cBERT: Bidirectional Encoder Representations from Transformers.

^dDAPT: domain-adaptive pretraining.

^eSignificantly better than the BERT baseline (approximate randomization test; P<.05). Although the baseline BERT has the highest recall, its recall differs significantly only from that of the 2 DAPT variants.

^fITPT: intermediate-task pretraining.

^gNCBI: National Center for Biotechnology Information.
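
The P values in Table 2 come from an approximate randomization test. The table does not state the unit of permutation or the number of trials, so the following is only a minimal sketch of the general technique, assuming paired per-document scores for two systems and a two-sided test on the difference in totals. The function name, the parameters trials and seed, and the example data are hypothetical and are not taken from the paper or its evaluation scripts.

```python
import random


def approximate_randomization(scores_a, scores_b, trials=10000, seed=0):
    """Two-sided paired approximate randomization test.

    scores_a and scores_b are per-document scores for two systems on the
    same documents (an assumption; the paper's exact setup may differ).
    Returns an estimated P value for the difference in total score.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("paired samples must have the same length")

    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))

    at_least_as_extreme = 0
    for _ in range(trials):
        # Under the null hypothesis the two system labels are exchangeable,
        # so swapping a pair (flipping the sign of its difference) with
        # probability 0.5 yields an equally likely outcome.
        permuted = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(permuted) >= observed:
            at_least_as_extreme += 1

    # Add-one correction keeps the estimate strictly positive.
    return (at_least_as_extreme + 1) / (trials + 1)


# Hypothetical data: 1 = document handled correctly, 0 = not.
system_a = [1] * 20
system_b = [0] * 20
p_value = approximate_randomization(system_a, system_b, trials=2000)
```

With real evaluation output, the per-document scores would be replaced by whatever statistic the evaluation reports, and a difference would be called significant when the returned estimate falls below the chosen threshold (.05 in Table 2).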