Table 2. Scoring of LLM Chatbot Treatment Recommendations According to Prompt Template<sup>a</sup>
Scoring criteria, by prompt template | What is a recommended treatment for [diagnosis description] according to NCCN? (n = 26) | What is a recommended treatment for [diagnosis description]? (n = 26) | How do you treat [diagnosis description]? (n = 26) | What is the treatment for [diagnosis description]? (n = 26) |
---|---|---|---|---|
1. How many treatment recommendations were provided? | ||||
0 | 2 (7.7) | 0 | 0 | 1 (3.8) |
1 | 2 (7.7) | 0 | 0 | 0 |
>1 | 22 (84.6) | 26 (100) | 26 (100) | 25 (96.2) |
2. How many of the recommended treatments were in accordance with 2021 NCCN guidelines? | ||||
0 | 0 | 0 | 0 | 0 |
Some but not all | 5 (19.2) | 11 (42.3) | 11 (42.3) | 8 (30.8) |
All | 19 (73.1) | 15 (57.7) | 15 (57.7) | 17 (65.4) |
NA | 2 (7.7) | 0 | 0 | 1 (3.8) |
3. If some but not all to question 2, how many recommendations were correct in their entirety per 2021 NCCN guidelines? | ||||
0 | 0 | 0 | 1 (3.8) | 0 |
≥1 | 5 (19.2) | 11 (42.3) | 10 (38.5) | 7 (26.9) |
NA<sup>b</sup> | 21 (80.8) | 15 (57.7) | 15 (57.7) | 19 (73.1) |
4. How many recommended treatments were hallucinated? | ||||
0 | 25 (96.2) | 23 (88.5) | 23 (88.5) | 20 (76.9) |
≥1 | 1 (3.8) | 3 (11.5) | 3 (11.5) | 6 (23.1) |
5. If ≥1 to question 4, how many of the hallucinated treatments are now a recommended treatment in the most current NCCN guidelines? | ||||
0 | 1 (3.8) | 3 (11.5) | 3 (11.5) | 6 (23.1) |
≥1 | 0 | 0 | 0 | 0 |
NA | 25 (96.2) | 23 (88.5) | 23 (88.5) | 20 (76.9) |
Abbreviations: LLM, large language model; NA, not applicable; NCCN, National Comprehensive Cancer Network.
<sup>a</sup> Data are reported as No. (%), determined by majority rule of the annotators’ scores.
<sup>b</sup> The slight misalignment between the categorical scores for questions 2 and 3 resulted from applying the majority rule.
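
The majority-rule tallying described in the notes above can be sketched roughly as follows. This is a minimal illustration, not the authors' procedure: the number of annotators (three here), the handling of responses with no majority (returned as `None`), and the names `majority_label`, `format_no_pct`, `tally`, and `example` are all assumptions made for the example, and the scores are invented.

```python
from collections import Counter


def majority_label(scores):
    """Return the label given by a strict majority of annotators, else None."""
    label, count = Counter(scores).most_common(1)[0]
    return label if count > len(scores) / 2 else None


def format_no_pct(count, total):
    """Format a count as 'No. (%)' in the table's style, e.g. '2 (7.7)'."""
    if count == 0:
        return "0"  # the table prints zero counts without a percentage
    # One decimal place, with a trailing '.0' dropped (so 100.0 becomes 100).
    pct = f"{100 * count / total:.1f}".rstrip("0").rstrip(".")
    return f"{count} ({pct})"


def tally(annotations, categories):
    """Count majority labels per category across all responses."""
    counts = Counter(majority_label(scores) for scores in annotations.values())
    total = len(annotations)
    return {cat: format_no_pct(counts[cat], total) for cat in categories}


# Hypothetical scores from three annotators for four chatbot responses
# (invented data, used only to exercise the functions).
example = {
    "response_1": ["All", "All", "Some but not all"],
    "response_2": ["Some but not all", "Some but not all", "All"],
    "response_3": ["All", "All", "All"],
    "response_4": ["NA", "NA", "NA"],
}

print(tally(example, ["0", "Some but not all", "All", "NA"]))
```

Because the majority label is computed separately for each question, a response can land in "Some but not all" for question 2 while its question 3 majority falls in a different bucket, which is one way the small mismatch between the two questions' totals could arise.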