Table 2.
Performance of retrieval-augmented large language models in matching physician clinical trial recommendations.
| Performance | Precision (%) | Recall (%) | F1-score |
| --- | --- | --- | --- |
| Baseline GPT-4 | 0.0 | 0.0 | 0 |
| Retrieval-augmented GPT-4 | 63.0 | 100.0 | 0.77 |
| Subgroups (cancer types) | | | |
| Head and neck cancers | 72.7 | 100.0 | 0.84 |
| Thyroid cancers | 33.3 | 100.0 | 0.50 |
| Skin cancers | 50.0 | 100.0 | 0.67 |
| Salivary gland cancers | 36.4 | 100.0 | 0.53 |
| Other cancers | —ᵃ | — | — |
| Subgroups (biomarkers) | | | |
| Present | 72.7 | 100.0 | 0.84 |
| None | 62.1 | 100.0 | 0.77 |
ᵃ Not applicable.
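
The F1-scores in Table 2 are consistent with the standard definition of F1 as the harmonic mean of precision and recall, F1 = 2PR / (P + R). The source does not state the formula, so this definition is an assumption. The sketch below recomputes each reported F1 from the reported precision and recall; the names `f1_score`, `TABLE_2_ROWS`, and `check_rows` are illustrative and do not come from the original study, and treating the baseline's undefined 0/0 case as 0 is a convention chosen here to match the table.

```python
def f1_score(precision_pct, recall_pct):
    """Harmonic mean of precision and recall, both given as percentages.

    Returns 0.0 when precision and recall are both zero (assumed convention).
    """
    p = precision_pct / 100.0
    r = recall_pct / 100.0
    if p + r == 0:
        return 0.0
    return 2 * p * r / (p + r)


# (precision %, recall %, reported F1) copied from Table 2.
# "Other cancers" is omitted because its values are not applicable.
TABLE_2_ROWS = {
    "Baseline GPT-4": (0.0, 0.0, 0.0),
    "Retrieval-augmented GPT-4": (63.0, 100.0, 0.77),
    "Head and neck cancers": (72.7, 100.0, 0.84),
    "Thyroid cancers": (33.3, 100.0, 0.50),
    "Skin cancers": (50.0, 100.0, 0.67),
    "Salivary gland cancers": (36.4, 100.0, 0.53),
    "Biomarkers present": (72.7, 100.0, 0.84),
    "Biomarkers none": (62.1, 100.0, 0.77),
}


def check_rows(rows, tol=0.005):
    """Map each row name to True if the recomputed F1 is within tol of the reported F1."""
    return {
        name: abs(f1_score(p, r) - reported) <= tol
        for name, (p, r, reported) in rows.items()
    }


if __name__ == "__main__":
    for name, ok in check_rows(TABLE_2_ROWS).items():
        print(f"{name}: {'consistent' if ok else 'mismatch'}")
```

Because recall is 100.0% in every non-baseline row, F1 is driven entirely by precision in this table, which is why the subgroups with the lowest precision (thyroid and salivary gland cancers) also show the lowest F1-scores.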