
Table 5. Performance of baseline models in text classification.

Model, learning strategy, and sentiment   Precision  Recall  F1-score  Standard error (F1)  Lower CI (F1)  Upper CI (F1)
Bert-base-cased
  SFT^a
    Harmful_outcome                       0.0000     0.0000  0.0000    0.0000               0.0000         0.0000
    Favorable_outcome                     0.6470     0.8800  0.7457    0.0435               0.6603         0.8310
    Ambiguous_outcome                     0.7000     0.4375  0.5384    0.0498               0.4406         0.6361
Bio_ClinicalBERT
  SFT
    Harmful_outcome                       0.0000     0.0000  0.0000    0.0000               0.0000         0.0000
    Favorable_outcome                     0.6486     0.9600  0.7741    0.0418               0.6921         0.8560
    Ambiguous_outcome                     0.8571     0.3750  0.5217    0.0499               0.4237         0.6196
gpt4-1106-preview-chat
  Zero-shot
    Harmful_outcome                       0.6667     0.6667  0.6667    0.0471               0.5743         0.7590
    Favorable_outcome                     0.7368     0.5600  0.6364    0.0481               0.5421         0.7306
    Ambiguous_outcome                     0.4545     0.6250  0.5263    0.0499               0.4284         0.6241
  Few-shot
    Harmful_outcome                       0.5000     0.6667  0.5714    0.0494               0.4744         0.6683
    Favorable_outcome                     0.6429     0.7200  0.6792    0.0466               0.5877         0.7706
    Ambiguous_outcome                     0.3333     0.2500  0.2857    0.0451               0.1971         0.3742
  Many-shot
    Harmful_outcome                       0.6667     0.6667  0.6667    0.0471               0.5743         0.7590
    Favorable_outcome                     0.8519     0.9200  0.8846    0.0319               0.8219         0.9472
    Ambiguous_outcome                     0.7857     0.6875  0.7333    0.0442               0.6466         0.8199
^a SFT: supervised fine-tuning.
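
The F1-scores in Table 5 are the harmonic mean of the per-class precision and recall, and the reported lower and upper bounds are consistent with a normal-approximation interval of F1 ± 1.96 × standard error. The snippet below is a minimal sketch under that assumption (the 1.96 multiplier, i.e., a 95% interval, is inferred from the reported bounds rather than stated in this excerpt); last-digit differences from the table stem from rounding in the published precision, recall, and standard-error values.

```python
# Minimal sketch (not from the article): recompute the derived columns of
# Table 5 from per-class precision, recall, and the reported SE(F1).
# The 95% normal-approximation interval F1 +/- 1.96 * SE is an assumption
# that matches the reported bounds up to rounding.

def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; defined as 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def f1_confidence_interval(f1: float, se: float, z: float = 1.96) -> tuple:
    """Normal-approximation interval: F1 +/- z * SE(F1)."""
    return (f1 - z * se, f1 + z * se)

# Check against the Bert-base-cased / SFT / Favorable_outcome row.
precision, recall, se = 0.6470, 0.8800, 0.0435
f1 = f1_score(precision, recall)
low, high = f1_confidence_interval(f1, se)
print(f"F1 = {f1:.4f}, 95% CI = ({low:.4f}, {high:.4f})")
# Prints F1 = 0.7457, 95% CI = (0.6605, 0.8310); the table reports
# (0.6603, 0.8310), the small gap being rounding in the published values.
```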