2021 Apr 30;9(4):e24020. doi: 10.2196/24020

Table 3.

Evaluation results on Observation concepts in the test set for different intermediate-task pretraining and domain-adaptive pretraining combinations^a.

ITPT^b and BERT^c model		Precision, mean (SD)	Recall, mean (SD)	F1 score, mean (SD)
BERT		75.0 (1.8)	85.3 (1.1)	79.8 (0.6)
NCBI^d-disease
	+DAPT^e (BioBERT)	77.7 (2.6)	85.1 (2.8)	81.1 (1.1)
	+DAPT (ClinicalBERT)	78.6 (3.2)	84.4 (1.5)	81.3 (1.2)
	BERT	71.6 (3.4)	88.9 (2.4)	79.2 (1.5)
i2b2^f 2010
	+DAPT (BioBERT)	75.6 (1.9)	86.2 (1.4)	80.5 (1.4)
	+DAPT (ClinicalBERT)	73.2 (2.0)	89.0 (1.8)	80.3 (0.7)
	BERT	70.7 (2.7)	88.7 (1.5)	78.6 (1.3)
ShARe-CLEF^g 2013
	+DAPT (BioBERT)	72.9 (2.5)	88.3 (2.3)	79.8 (0.8)
	+DAPT (ClinicalBERT)	74.2 (2.6)	86.5 (3.8)	79.8 (0.9)

^aDocument-level precision, recall, and F1 score are reported using official evaluation scripts.

^bITPT: intermediate-task pretraining.

^cBERT: Bidirectional Encoder Representations from Transformers.

^dNCBI: National Center for Biotechnology Information.

^eDAPT: domain-adaptive pretraining.

^fi2b2: Integrating Biology and the Bedside.

^gShARe-CLEF: Shared Annotated Resources-Conference and Labs of the Evaluation Forum.
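
Footnote a states that document-level precision, recall, and F1 score come from official evaluation scripts, and the table reports each as mean (SD). The sketch below is not that script. It is a rough illustration under stated assumptions: that document-level scoring compares the set of predicted concept labels with the set of gold labels in each document, that counts are micro-averaged over documents, and that the SD is the sample standard deviation over repeated runs. The function names (`document_level_prf`, `summarize`) and the toy data are hypothetical, and the source does not specify any of these details.

```python
from statistics import mean, stdev


def document_level_prf(gold, pred):
    """Micro-averaged precision, recall, and F1 over per-document label sets.

    `gold` and `pred` map a document ID to an iterable of concept labels.
    Assumption: a label counts once per document, however often it occurs.
    """
    tp = fp = fn = 0
    for doc_id in gold.keys() | pred.keys():
        g = set(gold.get(doc_id, ()))
        p = set(pred.get(doc_id, ()))
        tp += len(g & p)   # labels found in both gold and prediction
        fp += len(p - g)   # predicted labels absent from gold
        fn += len(g - p)   # gold labels that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def summarize(scores):
    """Return (mean, sample SD) of per-run scores, as percentages.

    Assumption: SD is the sample standard deviation (n - 1 denominator).
    """
    return round(100 * mean(scores), 1), round(100 * stdev(scores), 1)


# Hypothetical toy data: two documents, one spurious and one missed label.
gold = {"doc1": {"C1", "C2"}, "doc2": {"C3"}}
pred = {"doc1": {"C1", "C4"}, "doc2": {"C3"}}
precision, recall, f1 = document_level_prf(gold, pred)

# Hypothetical F1 scores from three runs with different random seeds.
f1_mean, f1_sd = summarize([0.79, 0.80, 0.81])
```

Because the table averages each metric over runs, a reported mean F1 score need not equal the harmonic mean of the reported mean precision and mean recall, although the two are usually close.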