2022 Nov 7;36(1):105–113. doi: 10.1007/s10278-022-00712-w

Table 2.

Performance of the models using both the clinical history and impression sections. Bold text represents the best performance on each dataset. The p-value column reports the model comparison, with ClinicalBERT⁺⁺ as the reference (ref)

Labels	Model	Precision	Recall	F1-score	p-value	Support
Acute	BERT	0.94	0.95	0.95	p > 0.5	1436
Critical		0.82	0.86	0.83		136
Non-acute		0.97	0.97	0.97		2247
Overall (weighted)		0.96	0.96	0.96		3819
Acute	ClinicalBERT⁺⁺	0.93	0.96	0.95	ref	1436
Critical		0.92	0.84	0.88		136
Non-acute		0.98	0.96	0.97		2247
Overall (weighted)		0.96	0.96	0.96		3819
	External test
Acute	BERT	0.31	0.89	0.45	p < 0.01	325
Critical		0.23	0.33	0.27		24
Non-acute		0.97	0.61	0.75		1690
Overall (weighted)		0.85	0.65	0.69		2039
Acute	ClinicalBERT⁺⁺	0.64	0.55	0.60	ref	325
Critical		0.81	0.54	0.65		24
Non-acute		0.92	0.94	0.93		1690
Overall (weighted)		0.87	0.88	0.87		2039
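
The "Overall (weighted)" rows can be read as per-class scores averaged with each class weighted by its support. The sketch below assumes that reading (the same convention as a support-weighted average in common evaluation toolkits); the table itself does not state the formula. It uses the first BERT block of the table, and the function name `support_weighted` is a hypothetical helper introduced here for illustration.

```python
def support_weighted(values, supports):
    """Average per-class scores, weighting each class by its support."""
    total = sum(supports)
    return sum(v * s for v, s in zip(values, supports)) / total


# Per-class rows for BERT in the first block of the table,
# in the order Acute, Critical, Non-acute.
supports = [1436, 136, 2247]
precision = [0.94, 0.82, 0.97]
recall = [0.95, 0.86, 0.97]
f1 = [0.95, 0.83, 0.97]

# Assumption: "weighted" means weighted by support.
overall = {
    "precision": support_weighted(precision, supports),
    "recall": support_weighted(recall, supports),
    "f1": support_weighted(f1, supports),
}

for name, value in overall.items():
    print(f"{name}: {value:.3f}")
```

Because the per-class inputs are already rounded to two decimals, the recomputed averages can differ from the reported overall values in the last digit; here they land within about 0.01 of the reported 0.96.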