Table 5. Performance of baseline models in text classification.
Model, learning strategy, and sentiment | Precision | Recall | F1-score | Standard error (F1) | Lower CI (F1) | Upper CI (F1) |
---|---|---|---|---|---|---|
Bert-base-cased | ||||||
SFT | ||||||
Harmful_outcome | 0 | 0 | 0 | 0 | 0 | 0 |
Favorable_outcome | 0.6470 | 0.8800 | 0.7457 | 0.0435 | 0.6603 | 0.8310 |
Ambiguous_outcome | 0.7000 | 0.4375 | 0.5384 | 0.0498 | 0.4406 | 0.6361 |
Bio_ClinicalBERT | ||||||
SFT | ||||||
Harmful_outcome | 0 | 0 | 0 | 0 | 0 | 0 |
Favorable_outcome | 0.6486 | 0.96 | 0.7741 | 0.0418 | 0.6921 | 0.8560 |
Ambiguous_outcome | 0.8571 | 0.375 | 0.5217 | 0.0499 | 0.4237 | 0.6196 |
gpt4-1106-preview-chat | ||||||
Zero-shot | ||||||
Harmful_outcome | 0.6667 | 0.6667 | 0.6667 | 0.0471 | 0.5743 | 0.7590 |
Favorable_outcome | 0.7368 | 0.56 | 0.6364 | 0.0481 | 0.5421 | 0.7306 |
Ambiguous_outcome | 0.4545 | 0.625 | 0.5263 | 0.0499 | 0.4284 | 0.6241 |
Few-shot | ||||||
Harmful_outcome | 0.5 | 0.6667 | 0.5714 | 0.0494 | 0.4744 | 0.6683 |
Favorable_outcome | 0.6429 | 0.72 | 0.6792 | 0.0466 | 0.5877 | 0.7706 |
Ambiguous_outcome | 0.3333 | 0.25 | 0.2857 | 0.0451 | 0.1971 | 0.3742 |
Many-shot | ||||||
Harmful_outcome | 0.6667 | 0.6667 | 0.6667 | 0.0471 | 0.5743 | 0.7590 |
Favorable_outcome | 0.8519 | 0.92 | 0.8846 | 0.0319 | 0.8219 | 0.9472 |
Ambiguous_outcome | 0.7857 | 0.6875 | 0.7333 | 0.0442 | 0.6466 | 0.8199 |
SFT: supervised fine-tuning.
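
The F1 and interval columns can be cross-checked from the precision, recall, and standard error columns. The sketch below is a minimal illustration, not the authors' code. It assumes F1 is the harmonic mean of precision and recall and that the bounds are a normal-approximation interval, F1 ± 1.96 × SE. The 1.96 multiplier (a 95% interval) is inferred from the tabulated values and is not stated in the table. The names `f1_score`, `normal_ci`, and `Z_95` are hypothetical.

```python
Z_95 = 1.96  # assumed multiplier, inferred from the table values, not stated in the source


def f1_score(precision, recall):
    """Harmonic mean of precision and recall; defined as 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def normal_ci(estimate, std_err, z=Z_95):
    """Symmetric normal-approximation interval around an estimate."""
    return estimate - z * std_err, estimate + z * std_err


# Favorable_outcome row for Bert-base-cased (SFT), values copied from Table 5
f1 = f1_score(0.6470, 0.8800)
lower, upper = normal_ci(f1, 0.0435)
print(f"F1={f1:.4f}, CI=({lower:.4f}, {upper:.4f})")
```

Under these assumptions, the recomputed values for this row agree with the tabulated F1 and interval bounds to within about 0.001. The small differences are consistent with rounding of the reported inputs.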