2024 Aug 2;7(8):e2425373. doi: 10.1001/jamanetworkopen.2024.25373

Table 4. Chatbots as a Grader: Comparison of Grades and Ranks Given by Surgeon-Reviewers vs Chatbot-Reviewers^a.

Abstract version	Surgeon-grader, median (IQR)	Chatbot 1-grader, median (IQR)	P value^b	Surgeon-grader, median (IQR)	Chatbot 2-grader, median (IQR)	P value^b	Chatbot 1-grader, median (IQR)	Chatbot 2-grader, median (IQR)	P value^b
10-Point scale
Resident	7.0 (6.0-8.0)	7.0 (6.7-7.5)	.89	7.0 (6.0-8.0)	7.5 (7.5-7.8)	.24	7.0 (6.7-7.5)	7.5 (7.5-7.8)	.12
Senior author	7.0 (6.0-8.0)	7.3 (6.4-8.0)	.86	7.0 (6.0-8.0)	7.5 (7.5-7.8)	.13	7.3 (6.4-8.0)	7.5 (7.5-7.8)	.30
Chatbot 1	7.0 (6.0-8.0)	7.2 (6.5-7.8)	.10	7.0 (6.0-8.0)	8.2 (8.0-8.5)	.003	7.2 (6.5-7.8)	8.2 (8.0-8.5)	.02
Chatbot 2	7.0 (6.0-8.0)	7.3 (6.2-7.5)	.76	7.0 (6.0-8.0)	7.9 (7.0-8.0)	.14	7.3 (6.2-7.5)	7.9 (7.0-8.0)	.08
20-Point scale
Resident	14.0 (12.0-17.0)	14.0 (13.0-15.0)	.79	14.0 (12.0-17.0)	16.9 (16.0-17.5)	.02	14.0 (13.0-15.0)	16.9 (16.0-17.5)	.003
Senior author	15.0 (13.0-17.0)	13.5 (13.0-15.5)	.28	15.0 (13.0-17.0)	17.0 (16.5-18.0)	.03	13.5 (13.0-15.5)	17.0 (16.5-18.0)	.004
Chatbot 1	14.0 (12.0-16.0)	14.5 (13.0-15.0)	.48	14.0 (12.0-16.0)	17.8 (17.5-18.5)	.002	14.5 (13.0-15.0)	17.8 (17.5-18.5)	.003
Chatbot 2	14.0 (13.0-16.0)	14.0 (13.0-15.0)	.79	14.0 (13.0-16.0)	16.8 (14.5-18.0)	.04	14.0 (13.0-15.0)	16.8 (14.5-18.0)	.01
Rank, quartile (range)
Resident	3.0 (1.0-4.0)	2.5 (2.0-4.0)	.70	3.0 (1.0-4.0)	3.0 (2.0-4.0)	.54	2.5 (2.0-4.0)	3.0 (2.0-4.0)	.78
Senior author	2.0 (1.0-4.0)	2.5 (1.0-3.0)	>.99	2.0 (1.0-4.0)	3.0 (2.0-4.0)	.45	2.5 (1.0-3.0)	3.0 (2.0-4.0)	.56
Chatbot 1	3.0 (2.0-4.0)	1.5 (1.0-3.0)	.05	3.0 (2.0-4.0)	1.0 (1.0-2.0)	.002	1.5 (1.0-3.0)	1.0 (1.0-2.0)	.51
Chatbot 2	2.0 (1.0-3.0)	3.0 (2.0-4.0)	.10	2.0 (1.0-3.0)	2.5 (2.0-4.0)	.11	3.0 (2.0-4.0)	2.5 (2.0-4.0)	.94

^a Abstracts were either written by a research resident within the first 6 months of their research year, were the final submitted version edited by a senior author, or were generated by chatbot 1 (Chat Generative Pretrained Transformer [GPT] version 3.5) or chatbot 2 (Chat-GPT version 4.0).

^b Statistical significance was P < .05.
