2023 Aug 24;9(10):1459–1462. doi: 10.1001/jamaoncol.2023.2954

Table 2. Scoring of LLM Chatbot Treatment Recommendations According to Prompt Template^a.

Scoring criteria	Prompt template
Scoring criteria	What is a recommended treatment for [diagnosis description] according to NCCN? (n = 26)	What is a recommended treatment for [diagnosis description]? (n = 26)	How do you treat [diagnosis description]? (n = 26)	What is the treatment for [diagnosis description]? (n = 26)
1. How many treatment recommendations were provided?
0	2 (7.7)	0	0	1 (3.8)
1	2 (7.7)	0	0	0
>1	22 (84.6)	26 (100)	26 (100)	25 (96.2)
2. How many of the recommended treatments were in accordance with 2021 NCCN guidelines?
0	0	0	0	0
Some but not all	5 (19.2)	11 (42.3)	11 (42.3)	8 (30.8)
All	19 (73.1)	15 (57.7)	15 (57.7)	17 (65.4)
NA	2 (7.7)	0	0	1 (3.8)
3. If some but not all to question 2, how many recommendations were correct in their entirety per 2021 NCCN guidelines?
0	0	0	1 (3.8)	0
≥1	5 (19.2)	11 (42.3)	10 (38.5)	7 (26.9)
NA^b	21 (80.8)	15 (57.7)	15 (57.7)	19 (73.1)
4. How many recommended treatments were hallucinated?
0	25 (96.2)	23 (88.5)	23 (88.5)	20 (76.9)
≥1	1 (3.8)	3 (11.5)	3 (11.5)	6 (23.1)
5. If ≥1 to question 4, how many of the hallucinated treatments are now a recommended treatment in the most current NCCN guidelines?
0	1 (3.8)	3 (11.5)	3 (11.5)	6 (23.1)
≥1	0	0	0	0
NA	25 (96.2)	23 (88.5)	23 (88.5)	20 (76.9)

Abbreviations: LLM, large language model; NA, not applicable; NCCN, National Comprehensive Cancer Network.

^a Data reported as No. (%) using majority rule of annotators’ scores.

^b Slight misalignment of categorical scores from questions 2 and 3 resulted from majority rules.
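
The No. (%) cells above can be reproduced from per-annotator scores with a short tally. The following is a minimal sketch, not the authors' analysis code: the function names `majority_score` and `tabulate`, the assumption of three annotators per response, and the per-annotator scores themselves are all hypothetical. Only the resulting counts (25 and 1 of 26, question 4, first prompt template) are taken from the table.

```python
from collections import Counter


def majority_score(scores):
    """Return the score given by the most annotators, or None on a tie for first place."""
    ranked = Counter(scores).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # no single majority category
    return ranked[0][0]


def tabulate(majority_scores, categories):
    """Map each category to (count, percentage of all responses, 1 decimal place)."""
    n = len(majority_scores)
    counts = Counter(majority_scores)
    return {c: (counts[c], round(100 * counts[c] / n, 1)) for c in categories}


# Hypothetical annotations for 26 responses (assumed: 3 annotators each).
# 24 unanimous "0", 1 split response resolved to "0", 1 resolved to "≥1".
annotations = (
    [["0", "0", "0"]] * 24
    + [["0", "0", "≥1"]]
    + [["≥1", "≥1", "0"]]
)

resolved = [majority_score(a) for a in annotations]
table = tabulate(resolved, ["0", "≥1"])

for category, (count, pct) in table.items():
    print(f"{category}\t{count} ({pct})")
```

Because each question is resolved independently, the majority category for question 2 and the majority category for question 3 can disagree for the same response, which is the kind of misalignment footnote b describes.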
