2025 Nov 19;27:e78393. doi: 10.2196/78393

Table 1. Accuracy and error breakdown of prostate-specific antigen (PSA) testing recommendations by retrieval-augmented generation–large language model (RAG-LLM) and junior clinicians.

| Group and category^a | Unnecessary tests: short interval, n (%) | Unnecessary tests: did not require, n (%) | Unnecessary tests: subtotal, n (%) | Missed tests: long interval, n (%) | Missed tests: failed to offer, n (%) | Missed tests: subtotal, n (%) | Total errors, n (%) | P value |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Overall (n=220) | | | | | | | | |
| LLM | 5 (2.3) | 0 (0) | 5 (2.3) | 0 (0) | 5 (2.3) | 5 (2.3) | 10 (4.5) | ^b |
| Human, closed-book | 11 (5.0) | 32 (14.5) | 43 (19.5) | 26 (11.8) | 14 (6.4) | 40 (18.2) | 83 (37.7) | <.001 |
| Human, open-book | 10 (4.5) | 23 (10.5) | 33 (15.0) | 14 (6.4) | 10 (4.5) | 24 (10.9) | 57 (25.9) | <.001 |
| Category 1: PSA screening recommended (n=55) | | | | | | | | |
| LLM | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 5 (9.1) | 5 (9.1) | 5 (9.1) | |
| Human, closed-book | 0 (0) | 0 (0) | 0 (0) | 3 (5.5) | 10 (18.2) | 13 (23.6) | 13 (23.6) | .04 |
| Human, open-book | 0 (0) | 0 (0) | 0 (0) | 1 (1.8) | 10 (18.2) | 11 (20) | 11 (20) | .11 |
| Category 2: PSA screening not recommended (n=45) | | | | | | | | |
| LLM | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | |
| Human, closed-book | 0 (0) | 14 (31.1) | 14 (31.1) | 3 (6.7) | 0 (0) | 3 (6.7) | 17 (37.8) | <.001 |
| Human, open-book | 0 (0) | 10 (22.2) | 10 (22.2) | 0 (0) | 0 (0) | 0 (0) | 10 (22.2) | .001 |
| Category 3: normal PSA follow-up (n=45) | | | | | | | | |
| LLM | 5 (11.1) | 0 (0) | 5 (11.1) | 0 (0) | 0 (0) | 0 (0) | 5 (11.1) | |
| Human, closed-book | 8 (17.8) | 8 (17.8) | 16 (35.6) | 0 (0) | 3 (6.7) | 3 (6.7) | 19 (42.2) | .001 |
| Human, open-book | 7 (15.6) | 7 (15.6) | 14 (31.1) | 1 (2.2) | 0 (0) | 1 (2.2) | 15 (33.3) | .01 |
| Category 4: elevated PSA (n=40) | | | | | | | | |
| LLM | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | |
| Human, closed-book | 0 (0) | 0 (0) | 0 (0) | 20 (50) | 1 (2.5) | 21 (52.5) | 21 (52.5) | <.001 |
| Human, open-book | 1 (2.5) | 0 (0) | 1 (2.5) | 12 (30) | 0 (0) | 12 (30) | 13 (32.5) | <.001 |
| Category 5: others (n=35) | | | | | | | | |
| LLM | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | |
| Human, closed-book | 3 (8.6) | 10 (28.6) | 13 (37.1) | 0 (0) | 0 (0) | 0 (0) | 13 (37.1) | <.001 |
| Human, open-book | 2 (5.7) | 6 (17.1) | 8 (22.9) | 0 (0) | 0 (0) | 0 (0) | 8 (22.9) | .002 |
^a The denominators used for all percentage calculations represent the number of cases in each category multiplied by 5, as each of the 44 case scenarios was independently evaluated by 5 junior clinicians. Accordingly, the overall total is shown as n=220, and the denominators for each category (eg, n=55 for category 1, n=45 for category 2, etc) follow the same calculation method.

^b Not available.
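
The denominator rule described in footnote a can be checked with a few lines of arithmetic. The sketch below is illustrative only and is not part of the study's analysis. The per-category scenario counts (11, 9, 9, 8, and 7) are not stated in the table; they are an assumption, back-calculated here by dividing each reported denominator by 5. The function names are likewise hypothetical.

```python
# Number of junior clinicians who rated each scenario, per footnote a.
RATERS_PER_SCENARIO = 5

# ASSUMPTION: scenario counts per category are back-calculated from the
# reported denominators (55, 45, 45, 40, 35); the table does not list them.
SCENARIOS_PER_CATEGORY = {1: 11, 2: 9, 3: 9, 4: 8, 5: 7}


def denominator(n_scenarios: int, raters: int = RATERS_PER_SCENARIO) -> int:
    """Number of individual ratings: scenarios multiplied by raters."""
    return n_scenarios * raters


def percent(count: int, denom: int) -> float:
    """Error count as a percentage of the denominator, to 1 decimal place."""
    return round(100 * count / denom, 1)


denominators = {c: denominator(n) for c, n in SCENARIOS_PER_CATEGORY.items()}
overall = sum(denominators.values())

print(denominators)                  # {1: 55, 2: 45, 3: 45, 4: 40, 5: 35}
print(overall)                       # 220
print(percent(83, overall))          # 37.7 (closed-book total errors, overall)
print(percent(13, denominators[1]))  # 23.6 (closed-book total errors, category 1)
```

The same two helpers can be applied to any cell in the table to confirm that its count and percentage agree with the category denominator.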