2024 Jul 19;30(10):2886–2896. doi: 10.1038/s41591-024-03139-8

Fig. 3. Head-to-head comparison between DeepDR-LLM, nontuned LLaMA, PCP and endocrinology resident in both English and Chinese.

a, Evaluators were invited to rate management recommendations for patients with diabetes, based on three domains, namely the extent of inappropriate content, the extent of missing content and the likelihood of possible harm, using 100 cases randomly selected from CNDCS. b, The total scores of management recommendations generated by LLaMA, DeepDR-LLM, PCPs and endocrinology residents, using 100 cases randomly selected from CNDCS. Box plot (n = 100), median and quartiles; whiskers, data range. The comparison was performed using two-sided Friedman tests. Post-hoc pairwise comparisons were performed using two-sided Wilcoxon signed-rank tests. P values for multiple comparisons were adjusted using the Bonferroni method. **P = 0.010, ***P < 0.001.
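The testing procedure described in the caption (an omnibus Friedman test across four related groups, then pairwise Wilcoxon signed-rank tests with Bonferroni adjustment) can be sketched as follows. This is a minimal illustration and not the authors' analysis code: the caption does not say which software was used, all function names are hypothetical, the Wilcoxon P value uses a normal approximation without tie or continuity correction, and the Friedman statistic omits the tie correction. A real analysis would more likely use an established statistics package.

```python
import math
from itertools import combinations


def average_ranks(values):
    """Rank values in ascending order; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def friedman_statistic(rows):
    """Friedman chi-square statistic (no tie correction).

    rows: one sequence per case, holding the k related scores for that case.
    """
    n, k = len(rows), len(rows[0])
    rank_sums = [0.0] * k
    for row in rows:
        for col, rank in enumerate(average_ranks(row)):
            rank_sums[col] += rank
    total = sum(s * s for s in rank_sums)
    return 12.0 / (n * k * (k + 1)) * total - 3.0 * n * (k + 1)


def chi2_sf_df3(x):
    """Upper-tail probability of a chi-square variable with 3 degrees of freedom."""
    if x <= 0:
        return 1.0
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


def wilcoxon_two_sided_p(x, y):
    """Two-sided Wilcoxon signed-rank P value (normal approximation, zero differences dropped)."""
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        return 1.0
    ranks = average_ranks([abs(d) for d in diffs])
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean) / sd
    return math.erfc(abs(z) / math.sqrt(2))


def bonferroni(p_values):
    """Multiply each P value by the number of comparisons, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def compare_four_groups(scores):
    """scores: dict mapping each of four group names to per-case scores in the same case order.

    Returns (Friedman statistic, Friedman P value, Bonferroni-adjusted pairwise P values).
    """
    names = list(scores)
    if len(names) != 4:
        raise ValueError("this sketch hard-codes 3 degrees of freedom (four groups)")
    rows = list(zip(*(scores[name] for name in names)))
    stat = friedman_statistic(rows)
    pairs = list(combinations(names, 2))
    raw = [wilcoxon_two_sided_p(scores[a], scores[b]) for a, b in pairs]
    return stat, chi2_sf_df3(stat), dict(zip(pairs, bonferroni(raw)))
```

With four groups there are six pairwise comparisons, so each raw Wilcoxon P value is multiplied by 6 before being compared with the significance threshold.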
