2023 Aug 24;9(10):1459–1462. doi: 10.1001/jamaoncol.2023.2954

Table 2. Scoring of LLM Chatbot Treatment Recommendations According to Prompt Template [a]

| Scoring criteria | Prompt template: "What is a recommended treatment for [diagnosis description] according to NCCN?" (n = 26) | Prompt template: "What is a recommended treatment for [diagnosis description]?" (n = 26) | Prompt template: "How do you treat [diagnosis description]?" (n = 26) | Prompt template: "What is the treatment for [diagnosis description]?" (n = 26) |
|---|---|---|---|---|
| 1. How many treatment recommendations were provided? | | | | |
| 0 | 2 (7.7) | 0 | 0 | 1 (3.8) |
| 1 | 2 (7.7) | 0 | 0 | 0 |
| ≥1 | 22 (84.6) | 26 (100) | 26 (100) | 25 (96.2) |
| 2. How many of the recommended treatments were in accordance with 2021 NCCN guidelines? | | | | |
| 0 | 0 | 0 | 0 | 0 |
| Some but not all | 5 (19.2) | 11 (42.3) | 11 (42.3) | 8 (30.8) |
| All | 19 (73.1) | 15 (57.7) | 15 (57.7) | 17 (65.4) |
| NA | 2 (7.7) | 0 | 0 | 1 (3.8) |
| 3. If some but not all to question 2, how many recommendations were correct in their entirety per 2021 NCCN guidelines? | | | | |
| 0 | 0 | 0 | 1 (3.8) | 0 |
| ≥1 | 5 (19.2) | 11 (42.3) | 10 (38.5) | 7 (26.9) |
| NA [b] | 21 (80.8) | 15 (57.7) | 15 (57.7) | 19 (73.1) |
| 4. How many recommended treatments were hallucinated? | | | | |
| 0 | 25 (96.2) | 23 (88.5) | 23 (88.5) | 20 (76.9) |
| ≥1 | 1 (3.8) | 3 (11.5) | 3 (11.5) | 6 (23.1) |
| 5. If 1 to question 4, how many of the hallucinated treatments are now a recommended treatment in the most current NCCN guidelines? | | | | |
| 0 | 1 (3.8) | 3 (11.5) | 3 (11.5) | 6 (23.1) |
| ≥1 | 0 | 0 | 0 | 0 |
| NA | 25 (96.2) | 23 (88.5) | 23 (88.5) | 20 (76.9) |

Abbreviations: LLM, large language model; NA, not applicable; NCCN, National Comprehensive Cancer Network.

[a] Data reported as No. (%) using majority rule of annotators’ scores.

[b] Slight misalignment of categorical scores from questions 2 and 3 resulted from majority rule.
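As an illustration of the reporting convention in the footnotes, the following is a minimal Python sketch of how a majority-rule label and a "No. (%)" cell might be computed. The function names `majority_label` and `format_count`, the tie handling (returning `None`), and the example annotator scores are assumptions made for this sketch; the source does not describe the authors' actual procedure or the number of annotators.

```python
from collections import Counter


def majority_label(scores):
    """Return the most common label among annotators, or None if there is
    no unique winner. Tie handling is an assumption of this sketch."""
    counts = Counter(scores).most_common()
    if not counts:
        return None
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # two or more labels tie for first place
    return counts[0][0]


def format_count(n, total):
    """Format a count as 'No. (%)' with one decimal place, writing a bare
    '0' for empty cells and '100' without a decimal, as in Table 2."""
    if n == 0:
        return "0"
    if n == total:
        return f"{n} (100)"
    return f"{n} ({100 * n / total:.1f})"


# Hypothetical scores from three annotators for one chatbot output.
label = majority_label(["All", "All", "Some but not all"])

# Counts taken from the "0" row under question 1 of Table 2 (n = 26 per template).
row = [format_count(n, 26) for n in (2, 0, 0, 1)]
print(label, row)
```

In this sketch each question is aggregated on its own, so the majority label for question 2 does not by construction determine the majority label for question 3.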