. 2023 Aug 24;9(10):1459–1462. doi: 10.1001/jamaoncol.2023.2954

Table 1. Oncologist Scoring of LLM Chatbot Treatment Recommendations^a.

Scoring criteria	Agreement across 4 unique prompts, No. (%) (n = 26)^b	All prompts, No. (%) (n = 104)	Cancer type: breast cancer, No. (%) (n = 20)	Cancer type: lung cancer, No. (%) (n = 20)^c	Cancer type: non–small cell lung cancer, No. (%) (n = 20)	Cancer type: small cell lung cancer, No. (%) (n = 12)	Cancer type: prostate cancer, No. (%) (n = 32)	Extent of disease: not specified, No. (%) (n = 20)	Extent of disease: localized, No. (%) (n = 64)	Extent of disease: advanced, No. (%) (n = 20)
1. How many treatment recommendations were provided?
0	22 (84.6)	3 (2.9)	1 (5.0)	1 (5.0)	0	0	1 (3.1)	0	3 (4.7)	0
1		2 (1.9)	0	0	0	0	2 (6.3)	0	2 (3.1)	0
≥2		99 (95.2)	19 (95.0)	19 (95.0)	20 (100)	12 (100)	29 (90.6)	20 (100)	59 (92.2)	20 (100)
2. How many of the recommended treatments were in accordance with 2021 NCCN guidelines? ^c
0	9 (34.6)	0	0	0	0	0	0	0	0	0
Some but not all		35 (33.7)	2 (10.0)	6 (30.0)	8 (40.0)	7 (58.3)	12 (37.5)	2 (10.0)	26 (40.6)	7 (35.0)
All		66 (63.5)	17 (85.0)	13 (65.0)	12 (60.0)	5 (41.7)	19 (59.4)	18 (90.0)	35 (54.7)	13 (65.0)
NA		3 (2.9)	1 (5.0)	1 (5.0)	0	0	1 (3.1)	0	3 (4.7)	0
3. If some but not all to question 2, how many recommendations were correct in their entirety per 2021 NCCN guidelines?
0	11 (42.3)	1 (0.9)	0	0	0	1 (8.3)	0	0	1 (1.6)	0
≥1		33 (31.7)	1 (5.0)	7 (35.0)	8 (40.0)	6 (50.0)	11 (34.4)	2 (10.0)	25 (39.1)	6 (30.0)
NA^d		70 (67.3)	19 (95.0)	13 (65.0)	12 (60.0)	5 (41.7)	21 (65.6)	18 (90.0)	38 (59.4)	14 (70.0)
4. How many recommended treatments were hallucinated?
0	16 (61.5)	91 (87.5)	19 (95.0)	18 (90.0)	19 (95.0)	9 (75.0)	26 (81.3)	19 (95.0)	55 (85.9)	17 (85.0)
≥1	16 (61.5)	13 (12.5)	1 (5.0)	2 (10.0)	1 (5.0)	3 (25.0)	6 (18.8)	1 (5.0)	9 (14.1)	3 (15.0)
5. If ≥1 to question 4, how many of the hallucinated treatments are now a recommended treatment in the most current NCCN guidelines?
0	16 (61.5)	13 (12.5)	1 (5.0)	2 (10.0)	1 (5.0)	3 (25.0)	6 (18.8)	1 (5.0)	9 (14.1)	3 (15.0)
≥1		0	0	0	0	0	0	0	0	0
NA		91 (87.5)	19 (95.0)	18 (90.0)	19 (95.0)	9 (75.0)	26 (81.3)	19 (95.0)	55 (85.9)	17 (85.0)

Abbreviations: LLM, large language model; NA, not applicable; NCCN, National Comprehensive Cancer Network.

^a Data reported as No. (%) using majority rule of annotators’ scores.

^b Diagnosis descriptions for which the output of each of the 4 prompts for a given diagnosis description yielded the same score by the rating oncologists.

^c Lung cancer was queried separately from non–small cell lung cancer and small cell lung cancer.

^d Slight misalignment of categorical scores from questions 2 and 3 resulted from majority rules. For example, for question 3, NA = 70 instead of 69 (66 + 3) because of majority voting.
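
The footnotes describe three mechanical steps: a majority rule over annotators' scores, a check that all 4 prompts for a diagnosis description received the same score, and reporting as No. (%). The following is a minimal sketch of those steps, not the authors' code. The function names, the three-annotator example, and the handling of ties (returning `None`) are assumptions made for illustration; the table does not state the number of annotators or how ties were resolved.

```python
from collections import Counter


def majority_score(scores):
    """Return the score chosen by a strict majority of annotators, else None.

    Tie handling is an assumption; the source does not describe it.
    """
    label, count = Counter(scores).most_common(1)[0]
    return label if count > len(scores) / 2 else None


def all_prompts_agree(majority_scores):
    """True when every prompt for one diagnosis description has the same score."""
    return len(set(majority_scores)) == 1


def pct(count, total):
    """Format a count in the table's 'No. (%)' style, one decimal place."""
    return f"{count} ({100 * count / total:.1f})"


# Hypothetical scores from three annotators for one chatbot output (question 2).
votes = ["All", "All", "Some but not all"]
print(majority_score(votes))

# Hypothetical majority scores for the 4 prompts of one diagnosis description.
print(all_prompts_agree(["0", "0", "0", "0"]))

# Counts taken from the table: question 2, "All", all prompts.
print(pct(66, 104))
```

Because the vote is taken separately for each question, the category totals for question 2 and question 3 need not line up exactly, which is the effect footnote d reports.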
