Table 1. Accuracy and error breakdown of prostate-specific antigen (PSA) testing recommendations by retrieval-augmented generation–large language model (RAG-LLM) and junior clinicians.
| Group and category<sup>a</sup> | Unnecessary tests: short interval, n (%) | Unnecessary tests: did not require, n (%) | Unnecessary tests: subtotal, n (%) | Missed tests: long interval, n (%) | Missed tests: failed to offer, n (%) | Missed tests: subtotal, n (%) | Total errors, n (%) | P value |
|---|---|---|---|---|---|---|---|---|
| Overall (n=220) | ||||||||
| LLM | 5 (2.3) | 0 (0) | 5 (2.3) | 0 (0) | 5 (2.3) | 5 (2.3) | 10 (4.5) | —<sup>b</sup> |
| Human, closed-book | 11 (5.0) | 32 (14.5) | 43 (19.5) | 26 (11.8) | 14 (6.4) | 40 (18.2) | 83 (37.7) | <.001 |
| Human, open-book | 10 (4.5) | 23 (10.5) | 33 (15.0) | 14 (6.4) | 10 (4.5) | 24 (10.9) | 57 (25.9) | <.001 |
| Category 1: PSA screening recommended (n=55) | ||||||||
| LLM | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 5 (9.1) | 5 (9.1) | 5 (9.1) | — |
| Human, closed-book | 0 (0) | 0 (0) | 0 (0) | 3 (5.5) | 10 (18.2) | 13 (23.6) | 13 (23.6) | .04 |
| Human, open-book | 0 (0) | 0 (0) | 0 (0) | 1 (1.8) | 10 (18.2) | 11 (20) | 11 (20) | .11 |
| Category 2: PSA screening not recommended (n=45) | ||||||||
| LLM | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | — |
| Human, closed-book | 0 (0) | 14 (31.1) | 14 (31.1) | 3 (6.7) | 0 (0) | 3 (6.7) | 17 (37.8) | <.001 |
| Human, open-book | 0 (0) | 10 (22.2) | 10 (22.2) | 0 (0) | 0 (0) | 0 (0) | 10 (22.2) | .001 |
| Category 3: normal PSA follow-up (n=45) | ||||||||
| LLM | 5 (11.1) | 0 (0) | 5 (11.1) | 0 (0) | 0 (0) | 0 (0) | 5 (11.1) | — |
| Human, closed-book | 8 (17.8) | 8 (17.8) | 16 (35.6) | 0 (0) | 3 (6.7) | 3 (6.7) | 19 (42.2) | .001 |
| Human, open-book | 7 (15.6) | 7 (15.6) | 14 (31.1) | 1 (2.2) | 0 (0) | 1 (2.2) | 15 (33.3) | .01 |
| Category 4: elevated PSA (n=40) | ||||||||
| LLM | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | — |
| Human, closed-book | 0 (0) | 0 (0) | 0 (0) | 20 (50) | 1 (2.5) | 21 (52.5) | 21 (52.5) | <.001 |
| Human, open-book | 1 (2.5) | 0 (0) | 1 (2.5) | 12 (30) | 0 (0) | 12 (30) | 13 (32.5) | <.001 |
| Category 5: others (n=35) | ||||||||
| LLM | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | — |
| Human, closed-book | 3 (8.6) | 10 (28.6) | 13 (37.1) | 0 (0) | 0 (0) | 0 (0) | 13 (37.1) | <.001 |
| Human, open-book | 2 (5.7) | 6 (17.1) | 8 (22.9) | 0 (0) | 0 (0) | 0 (0) | 8 (22.9) | .002 |
<sup>a</sup>The denominator for each percentage is the number of case scenarios in the category multiplied by 5, because each of the 44 case scenarios was independently evaluated by 5 junior clinicians. The overall total is therefore n=220, and the category denominators (eg, n=55 for category 1 and n=45 for category 2) are calculated the same way.
<sup>b</sup>Not available.
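The denominator rule in the footnote can be checked with a short sketch. The per-category scenario counts below are not stated in the table; they are back-calculated here from the reported n values (n divided by 5) and should be read as an assumption. The helper name `pct` is likewise hypothetical.

```python
# Evaluations per scenario, as stated in the table footnote (5 junior clinicians).
RATERS_PER_SCENARIO = 5

# Assumed scenario counts per category, derived as n / 5 from the table's n values.
scenarios_per_category = {
    "Category 1": 11,  # n=55
    "Category 2": 9,   # n=45
    "Category 3": 9,   # n=45
    "Category 4": 8,   # n=40
    "Category 5": 7,   # n=35
}

# Denominator for each category = scenarios x evaluations per scenario.
denominators = {
    name: count * RATERS_PER_SCENARIO
    for name, count in scenarios_per_category.items()
}
total_scenarios = sum(scenarios_per_category.values())
overall_n = sum(denominators.values())


def pct(errors: int, n: int) -> float:
    """Return errors as a percentage of n, rounded to 1 decimal place."""
    return round(100 * errors / n, 1)


print(total_scenarios, overall_n)  # 44 220
print(pct(83, 220))                # 37.7 (human, closed-book, overall total errors)
print(pct(5, 55))                  # 9.1  (LLM, category 1, failed to offer)
```

Running the same helper over any cell's count and its category denominator is a quick way to confirm that a reported percentage matches its n.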