Table 4. Performance of large language models on estimating symptoms based on the interview data.^a
| Model | Method | Accuracy | Precision (positive predictive value^b) | Negative predictive value^b | Recall | F1-score |
| --- | --- | --- | --- | --- | --- | --- |
| GPT-3.5 Turbo | Fine-tuning | 0.817 (0.002) | 0.828 (0.002) | 0.866 (0.001) | 0.818 (0.001) | 0.821 (0.002) |
| GPT-4 Turbo | In-context learning | 0.537 (0.008) | 0.551 (0.009) | 0.989 (0.003) | 0.550 (0.007) | 0.546 (0.008) |
| GPT-4 Turbo | Zero-shot | 0.644 (0.004) | 0.649 (0.003) | 0.965 (0.011) | 0.681 (0.002) | 0.657 (0.003) |
| GPT-4 Turbo | Zero-shot (with RAGc) | 0.708 (0.005) | 0.715 (0.007) | 0.954 (0.000) | 0.745 (0.005) | 0.722 (0.005) |
^aFor each setting, we report the mean and SD of the scores over 3 trials. Because we use a nonzero temperature parameter for the large language models, performance varies across trials.
^bThe definitions of negative predictive value and positive predictive value are given in Multimedia Appendix 2.
^cRAG: retrieval-augmented generation.
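For reference, the following are the standard definitions of the reported metrics in terms of true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN); we assume Multimedia Appendix 2 follows these conventional forms:

$$
\text{Precision (PPV)} = \frac{TP}{TP + FP}, \qquad
\text{NPV} = \frac{TN}{TN + FN},
$$

$$
\text{Recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}.
$$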