2024 Oct 24;8:e58418. doi: 10.2196/58418

Table 4.

Performance of large language models on estimating symptoms based on the interview data.a

| Model | Method | Accuracy | Precision (positive predictive value)b | Negative predictive valueb | Recall | F1-score |
|---|---|---|---|---|---|---|
| GPT-3.5 Turbo | Fine-tuning | 0.817 (0.002) | 0.828 (0.002) | 0.866 (0.001) | 0.818 (0.001) | 0.821 (0.002) |
| GPT-4 Turbo | In-context learning | 0.537 (0.008) | 0.551 (0.009) | 0.989 (0.003) | 0.550 (0.007) | 0.546 (0.008) |
| GPT-4 Turbo | Zero-shot | 0.644 (0.004) | 0.649 (0.003) | 0.965 (0.011) | 0.681 (0.002) | 0.657 (0.003) |
| GPT-4 Turbo | Zero-shot (with RAGc) | 0.708 (0.005) | 0.715 (0.007) | 0.954 (0.000) | 0.745 (0.005) | 0.722 (0.005) |

aFor each setting, we report the mean (SD) of the scores over 3 trials. Because the large language models were run with a nonzero temperature parameter, performance varies across trials.

bThe definitions of negative predictive value and positive predictive value are given in Multimedia Appendix 2.

cRAG: retrieval-augmented generation.
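For readers without access to Multimedia Appendix 2, the metrics in Table 4 follow the standard confusion-matrix definitions. The sketch below illustrates those conventional formulas; it is not taken from the paper's code, and the per-symptom averaging used in the study may differ.

```python
def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Standard binary classification metrics from confusion-matrix counts.

    tp/fp/tn/fn: true positives, false positives, true negatives,
    false negatives.
    """
    precision = tp / (tp + fp)               # positive predictive value
    npv = tn / (tn + fn)                      # negative predictive value
    recall = tp / (tp + fn)                   # sensitivity
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return {"accuracy": accuracy, "precision": precision,
            "npv": npv, "recall": recall, "f1": f1}
```

Note that a high negative predictive value alongside modest precision (as in the in-context learning row) typically indicates the model rarely misses absent symptoms but over-predicts present ones.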