2023 Aug 24;9(10):1459–1462. doi: 10.1001/jamaoncol.2023.2954

Table 1. Oncologist Scoring of LLM Chatbot Treatment Recommendations^a

All values are No. (%). The agreement value applies to each question as a whole and is shown on the question row. Columns prefixed "Cancer type" and "Extent of disease" correspond to the column groups "Prompts by cancer type" and "Prompts by extent of disease."

| Scoring criteria | Agreement across 4 unique prompts (n = 26)^b | All prompts (n = 104) | Cancer type: breast cancer (n = 20) | Cancer type: lung cancer (n = 20)^c | Cancer type: non–small cell lung cancer (n = 20) | Cancer type: small cell lung cancer (n = 12) | Cancer type: prostate cancer (n = 32) | Extent of disease: not specified (n = 20) | Extent of disease: localized (n = 64) | Extent of disease: advanced (n = 20) |
|---|---|---|---|---|---|---|---|---|---|---|
| 1. How many treatment recommendations were provided? | 22 (84.6) | | | | | | | | | |
| 0 | | 3 (2.9) | 1 (5.0) | 1 (5.0) | 0 | 0 | 1 (3.1) | 0 | 3 (4.7) | 0 |
| 1 | | 2 (1.9) | 0 | 0 | 0 | 0 | 2 (6.3) | 0 | 2 (3.1) | 0 |
| ≥1 | | 99 (95.2) | 19 (95.0) | 19 (95.0) | 20 (100) | 12 (100) | 29 (90.6) | 20 (100) | 59 (92.2) | 20 (100) |
| 2. How many of the recommended treatments were in accordance with 2021 NCCN guidelines?^c | 9 (34.6) | | | | | | | | | |
| 0 | | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Some but not all | | 35 (33.7) | 2 (10.0) | 6 (30.0) | 8 (40.0) | 7 (58.3) | 12 (37.5) | 2 (10.0) | 26 (40.6) | 7 (35.0) |
| All | | 66 (63.5) | 17 (85.0) | 13 (65.0) | 12 (60.0) | 5 (41.7) | 19 (59.4) | 18 (90.0) | 35 (54.7) | 13 (65.0) |
| NA | | 3 (2.9) | 1 (5.0) | 1 (5.0) | 0 | 0 | 1 (3.1) | 0 | 3 (4.7) | 0 |
| 3. If some but not all to question 2, how many recommendations were correct in their entirety per 2021 NCCN guidelines? | 11 (42.3) | | | | | | | | | |
| 0 | | 1 (0.9) | 0 | 0 | 0 | 1 (8.3) | 0 | 0 | 1 (1.6) | 0 |
| ≥1 | | 33 (31.7) | 1 (5.0) | 7 (35.0) | 8 (40.0) | 6 (50.0) | 11 (34.4) | 2 (10.0) | 25 (39.1) | 6 (30.0) |
| NA^d | | 70 (67.3) | 19 (95.0) | 13 (65.0) | 12 (60.0) | 5 (41.7) | 21 (65.6) | 18 (90.0) | 38 (59.4) | 14 (70.0) |
| 4. How many recommended treatments were hallucinated? | 16 (61.5) | | | | | | | | | |
| 0 | | 91 (87.5) | 19 (95.0) | 18 (90.0) | 19 (95.0) | 9 (75.0) | 26 (81.3) | 19 (95.0) | 55 (85.9) | 17 (85.0) |
| ≥1 | | 13 (12.5) | 1 (5.0) | 2 (10.0) | 1 (5.0) | 3 (25.0) | 6 (18.8) | 1 (5.0) | 9 (14.1) | 3 (15.0) |
| 5. If >1 to question 4, how many of the hallucinated treatments are now a recommended treatment in the most current NCCN guidelines? | 16 (61.5) | | | | | | | | | |
| 0 | | 13 (12.5) | 1 (5.0) | 2 (10.0) | 1 (5.0) | 3 (25.0) | 6 (18.8) | 1 (5.0) | 9 (14.1) | 3 (15.0) |
| ≥1 | | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| NA | | 91 (87.5) | 19 (95.0) | 18 (90.0) | 19 (95.0) | 9 (75.0) | 26 (81.3) | 19 (95.0) | 55 (85.9) | 17 (85.0) |

Abbreviations: LLM, large language model; NA, not applicable; NCCN, National Comprehensive Cancer Network.

^a Data reported as No. (%) using majority rule of annotators’ scores.

^b Diagnosis descriptions for which the output of each of the 4 prompts for a given diagnosis description yielded the same score by the rating oncologists.

^c Lung cancer was queried separately from non–small cell lung cancer and small cell lung cancer.

^d Slight misalignment of categorical scores from questions 2 and 3 resulted from majority rules. For example, for question 3, NA = 70 instead of 69 (66 + 3) because of majority voting.
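
The footnotes on majority rule, agreement across the 4 prompts, and the misalignment between questions 2 and 3 can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the footnotes do not state the number of raters (three is assumed here), the function names `majority_vote` and `prompts_agree` are hypothetical, the votes are invented, and the `fallback` tie-break stands in for whatever rule resolved split votes, which the footnotes do not describe.

```python
from collections import Counter

def majority_vote(votes, fallback=None):
    """Return the label chosen by more than half of the raters, else `fallback`."""
    label, count = Counter(votes).most_common(1)[0]
    return label if count > len(votes) / 2 else fallback

def prompts_agree(final_scores):
    """True when every prompt for one diagnosis description got the same final score."""
    return len(set(final_scores)) == 1

# Hypothetical votes from three raters on a single chatbot output.
q2_votes = ["All", "NA", "Some but not all"]
# A rater who answered "All" or "NA" on question 2 answers "NA" on question 3.
q3_votes = ["NA", "NA", ">=1"]

# Question 2 has no majority, so an assumed tie-break supplies the final label.
q2_final = majority_vote(q2_votes, fallback="Some but not all")
# Question 3 has a clear majority for "NA".
q3_final = majority_vote(q3_votes)

print(q2_final, "|", q3_final)  # Some but not all | NA

# Agreement across the 4 prompts for one diagnosis description.
print(prompts_agree(["All", "All", "All", "All"]))               # True
print(prompts_agree(["All", "All", "Some but not all", "All"]))  # False
```

In this invented case the output counts as "Some but not all" for question 2 but as "NA" for question 3, which is one way per-question voting could produce a question 3 NA count that is one higher than the question 2 "All" plus "NA" total.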