2025 Nov 19;27:e78393. doi: 10.2196/78393

Table 1. Accuracy and error breakdown of prostate-specific antigen (PSA) testing recommendations by retrieval-augmented generation–large language model (RAG-LLM) and junior clinicians.

Group and category^a	Unnecessary tests, n (%)			Missed tests, n (%)			Total errors, n (%)	P value
	Short interval	Did not require	Subtotal	Long interval	Failed to offer	Subtotal
Overall (n=220)
LLM	5 (2.3)	0 (0)	5 (2.3)	0 (0)	5 (2.3)	5 (2.3)	10 (4.5)	—^b
Human, closed-book	11 (5.0)	32 (14.5)	43 (19.5)	26 (11.8)	14 (6.4)	40 (18.2)	83 (37.7)	<.001
Human, open-book	10 (4.5)	23 (10.5)	33 (15.0)	14 (6.4)	10 (4.5)	24 (10.9)	57 (25.9)	<.001
Category 1: PSA screening recommended (n=55)
LLM	0 (0)	0 (0)	0 (0)	0 (0)	5 (9.1)	5 (9.1)	5 (9.1)	—
Human, closed-book	0 (0)	0 (0)	0 (0)	3 (5.5)	10 (18.2)	13 (23.6)	13 (23.6)	.04
Human, open-book	0 (0)	0 (0)	0 (0)	1 (1.8)	10 (18.2)	11 (20)	11 (20)	.11
Category 2: PSA screening not recommended (n=45)
LLM	0 (0)	0 (0)	0 (0)	0 (0)	0 (0)	0 (0)	0 (0)	—
Human, closed-book	0 (0)	14 (31.1)	14 (31.1)	3 (6.7)	0 (0)	3 (6.7)	17 (37.8)	<.001
Human, open-book	0 (0)	10 (22.2)	10 (22.2)	0 (0)	0 (0)	0 (0)	10 (22.2)	.001
Category 3: normal PSA follow-up (n=45)
LLM	5 (11.1)	0 (0)	5 (11.1)	0 (0)	0 (0)	0 (0)	5 (11.1)	—
Human, closed-book	8 (17.8)	8 (17.8)	16 (35.6)	0 (0)	3 (6.7)	3 (6.7)	19 (42.2)	.001
Human, open-book	7 (15.6)	7 (15.6)	14 (31.1)	1 (2.2)	0 (0)	1 (2.2)	15 (33.3)	.01
Category 4: elevated PSA (n=40)
LLM	0 (0)	0 (0)	0 (0)	0 (0)	0 (0)	0 (0)	0 (0)	—
Human, closed-book	0 (0)	0 (0)	0 (0)	20 (50)	1 (2.5)	21 (52.5)	21 (52.5)	<.001
Human, open-book	1 (2.5)	0 (0)	1 (2.5)	12 (30)	0 (0)	12 (30)	13 (32.5)	<.001
Category 5: others (n=35)
LLM	0 (0)	0 (0)	0 (0)	0 (0)	0 (0)	0 (0)	0 (0)	—
Human, closed-book	3 (8.6)	10 (28.6)	13 (37.1)	0 (0)	0 (0)	0 (0)	13 (37.1)	<.001
Human, open-book	2 (5.7)	6 (17.1)	8 (22.9)	0 (0)	0 (0)	0 (0)	8 (22.9)	.002

^a The denominators used for all percentage calculations represent the number of cases in each category multiplied by 5, as each of the 44 case scenarios was independently evaluated by 5 junior clinicians. Accordingly, the overall total is shown as n=220, and the denominators for each category (eg, n=55 for category 1, n=45 for category 2, etc) follow the same calculation method.

^b Not available.
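
The denominator rule described in the table's denominator footnote can be checked with a few lines of arithmetic. The sketch below is illustrative only. The per-category case counts are not stated in the table and are inferred here by dividing each category's n by 5, and the names `RATERS_PER_CASE`, `CASES_PER_CATEGORY`, `denominator`, and `percent` are hypothetical.

```python
# Number of junior clinicians who rated each case scenario (from the footnote).
RATERS_PER_CASE = 5

# Case scenarios per category. ASSUMPTION: inferred as n / 5 from the
# category headers (n=55, 45, 45, 40, 35); not stated directly in the table.
CASES_PER_CATEGORY = {1: 11, 2: 9, 3: 9, 4: 8, 5: 7}


def denominator(n_cases: int) -> int:
    """Denominator for a group of cases: cases multiplied by raters per case."""
    return n_cases * RATERS_PER_CASE


def percent(count: int, n_cases: int) -> float:
    """Percentage of ratings with an error, rounded to 1 decimal place."""
    return round(100 * count / denominator(n_cases), 1)


# Per-category denominators and the overall denominator.
denominators = {cat: denominator(n) for cat, n in CASES_PER_CATEGORY.items()}
total_cases = sum(CASES_PER_CATEGORY.values())
overall_denominator = denominator(total_cases)

# Two cells recomputed from their counts:
overall_closed_book = percent(83, total_cases)          # total errors, overall
category1_llm = percent(5, CASES_PER_CATEGORY[1])       # LLM total errors, category 1
```

Under these assumptions the inferred case counts sum to the 44 scenarios named in the footnote, and the same `percent` helper can be applied to any count in the table to confirm the value printed beside it.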