2024 Aug 2;7(8):e2425373. doi: 10.1001/jamanetworkopen.2024.25373

Table 4. Chatbots as a Grader: Comparison of Grades and Ranks Given by Surgeon-Reviewers vs Chatbot-Reviewers^a

| Abstract version | Surgeon-grader grade, median (IQR) | Chatbot 1-grader grade, median (IQR) | P value^b | Surgeon-grader grade, median (IQR) | Chatbot 2-grader grade, median (IQR) | P value^b | Chatbot 1-grader grade, median (IQR) | Chatbot 2-grader grade, median (IQR) | P value^b |
|---|---|---|---|---|---|---|---|---|---|
| 10-Point scale | | | | | | | | | |
| Resident | 7.0 (6.0-8.0) | 7.0 (6.7-7.5) | .89 | 7.0 (6.0-8.0) | 7.5 (7.5-7.8) | .24 | 7.0 (6.7-7.5) | 7.5 (7.5-7.8) | .12 |
| Senior author | 7.0 (6.0-8.0) | 7.3 (6.4-8.0) | .86 | 7.0 (6.0-8.0) | 7.5 (7.5-7.8) | .13 | 7.3 (6.4-8.0) | 7.5 (7.5-7.8) | .30 |
| Chatbot 1 | 7.0 (6.0-8.0) | 7.2 (6.5-7.8) | .10 | 7.0 (6.0-8.0) | 8.2 (8.0-8.5) | .003 | 7.2 (6.5-7.8) | 8.2 (8.0-8.5) | .02 |
| Chatbot 2 | 7.0 (6.0-8.0) | 7.3 (6.2-7.5) | .76 | 7.0 (6.0-8.0) | 7.9 (7.0-8.0) | .14 | 7.3 (6.2-7.5) | 7.9 (7.0-8.0) | .08 |
| 20-Point scale | | | | | | | | | |
| Resident | 14.0 (12.0-17.0) | 14.0 (13.0-15.0) | .79 | 14.0 (12.0-17.0) | 16.9 (16.0-17.5) | .02 | 14.0 (13.0-15.0) | 16.9 (16.0-17.5) | .003 |
| Senior author | 15.0 (13.0-17.0) | 13.5 (13.0-15.5) | .28 | 15.0 (13.0-17.0) | 17.0 (16.5-18.0) | .03 | 13.5 (13.0-15.5) | 17.0 (16.5-18.0) | .004 |
| Chatbot 1 | 14.0 (12.0-16.0) | 14.5 (13.0-15.0) | .48 | 14.0 (12.0-16.0) | 17.8 (17.5-18.5) | .002 | 14.5 (13.0-15.0) | 17.8 (17.5-18.5) | .003 |
| Chatbot 2 | 14.0 (13.0-16.0) | 14.0 (13.0-15.0) | .79 | 14.0 (13.0-16.0) | 16.8 (14.5-18.0) | .04 | 14.0 (13.0-15.0) | 16.8 (14.5-18.0) | .01 |
| Rank, quartile (range) | | | | | | | | | |
| Resident | 3.0 (1.0-4.0) | 2.5 (2.0-4.0) | .70 | 3.0 (1.0-4.0) | 3.0 (2.0-4.0) | .54 | 2.5 (2.0-4.0) | 3.0 (2.0-4.0) | .78 |
| Senior author | 2.0 (1.0-4.0) | 2.5 (1.0-3.0) | >.99 | 2.0 (1.0-4.0) | 3.0 (2.0-4.0) | .45 | 2.5 (1.0-3.0) | 3.0 (2.0-4.0) | .56 |
| Chatbot 1 | 3.0 (2.0-4.0) | 1.5 (1.0-3.0) | .05 | 3.0 (2.0-4.0) | 1.0 (1.0-2.0) | .002 | 1.5 (1.0-3.0) | 1.0 (1.0-2.0) | .51 |
| Chatbot 2 | 2.0 (1.0-3.0) | 3.0 (2.0-4.0) | .10 | 2.0 (1.0-3.0) | 2.5 (2.0-4.0) | .11 | 3.0 (2.0-4.0) | 2.5 (2.0-4.0) | .94 |
^a Abstracts were written by a research resident within the first 6 months of their research year, were the final submitted version edited by a senior author, or were generated by chatbot 1 (Chat Generative Pretrained Transformer [GPT] version 3.5) or chatbot 2 (Chat-GPT version 4.0).

^b Statistical significance was P < .05.
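
As a reading aid, the significance threshold in footnote b can be applied mechanically to one comparison from the table. The sketch below is illustrative only: it transcribes the published P values for the surgeon-grader vs chatbot 2-grader comparison and flags those strictly below .05. The names `ALPHA`, `surgeon_vs_chatbot2`, and `significant` are hypothetical, and nothing here reproduces the authors' statistical test, which this table does not name.

```python
# Threshold from footnote b: statistical significance was P < .05.
ALPHA = 0.05

# P values transcribed from Table 4, surgeon-grader vs chatbot 2-grader column.
# Keys are the abstract versions being graded (hypothetical data layout).
surgeon_vs_chatbot2 = {
    "10-point": {"Resident": 0.24, "Senior author": 0.13,
                 "Chatbot 1": 0.003, "Chatbot 2": 0.14},
    "20-point": {"Resident": 0.02, "Senior author": 0.03,
                 "Chatbot 1": 0.002, "Chatbot 2": 0.04},
    "Rank": {"Resident": 0.54, "Senior author": 0.45,
             "Chatbot 1": 0.002, "Chatbot 2": 0.11},
}


def significant(p_values, alpha=ALPHA):
    """Return the abstract versions whose P value is strictly below alpha."""
    return [version for version, p in p_values.items() if p < alpha]


for scale, p_values in surgeon_vs_chatbot2.items():
    print(scale, significant(p_values))
```

Because the comparison is strict, a reported value of exactly .05 (as in the surgeon-grader vs chatbot 1-grader rank for the chatbot 1 abstract) would not be flagged; the rounded published value cannot settle whether the unrounded P value fell below the threshold.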