Table 4. Chatbots as a Grader: Comparison of Grades and Ranks Given by Surgeon-Reviewers vs Chatbot-Reviewers^a
Abstract version | Surgeon-grader grade, median (IQR) | Chatbot 1-grader grade, median (IQR) | P value^b | Surgeon-grader grade, median (IQR) | Chatbot 2-grader grade, median (IQR) | P value^b | Chatbot 1-grader grade, median (IQR) | Chatbot 2-grader grade, median (IQR) | P value^b |
---|---|---|---|---|---|---|---|---|---|
10-Point scale | |||||||||
Resident | 7.0 (6.0-8.0) | 7.0 (6.7-7.5) | .89 | 7.0 (6.0-8.0) | 7.5 (7.5-7.8) | .24 | 7.0 (6.7-7.5) | 7.5 (7.5-7.8) | .12 |
Senior author | 7.0 (6.0-8.0) | 7.3 (6.4-8.0) | .86 | 7.0 (6.0-8.0) | 7.5 (7.5-7.8) | .13 | 7.3 (6.4-8.0) | 7.5 (7.5-7.8) | .30 |
Chatbot 1 | 7.0 (6.0-8.0) | 7.2 (6.5-7.8) | .10 | 7.0 (6.0-8.0) | 8.2 (8.0-8.5) | .003 | 7.2 (6.5-7.8) | 8.2 (8.0-8.5) | .02 |
Chatbot 2 | 7.0 (6.0-8.0) | 7.3 (6.2-7.5) | .76 | 7.0 (6.0-8.0) | 7.9 (7.0-8.0) | .14 | 7.3 (6.2-7.5) | 7.9 (7.0-8.0) | .08 |
20-Point scale | |||||||||
Resident | 14.0 (12.0-17.0) | 14.0 (13.0-15.0) | .79 | 14.0 (12.0-17.0) | 16.9 (16.0-17.5) | .02 | 14.0 (13.0-15.0) | 16.9 (16.0-17.5) | .003 |
Senior author | 15.0 (13.0-17.0) | 13.5 (13.0-15.5) | .28 | 15.0 (13.0-17.0) | 17.0 (16.5-18.0) | .03 | 13.5 (13.0-15.5) | 17.0 (16.5-18.0) | .004 |
Chatbot 1 | 14.0 (12.0-16.0) | 14.5 (13.0-15.0) | .48 | 14.0 (12.0-16.0) | 17.8 (17.5-18.5) | .002 | 14.5 (13.0-15.0) | 17.8 (17.5-18.5) | .003 |
Chatbot 2 | 14.0 (13.0-16.0) | 14.0 (13.0-15.0) | .79 | 14.0 (13.0-16.0) | 16.8 (14.5-18.0) | .04 | 14.0 (13.0-15.0) | 16.8 (14.5-18.0) | .01 |
Rank, quartile (range) | |||||||||
Resident | 3.0 (1.0-4.0) | 2.5 (2.0-4.0) | .70 | 3.0 (1.0-4.0) | 3.0 (2.0-4.0) | .54 | 2.5 (2.0-4.0) | 3.0 (2.0-4.0) | .78 |
Senior author | 2.0 (1.0-4.0) | 2.5 (1.0-3.0) | >.99 | 2.0 (1.0-4.0) | 3.0 (2.0-4.0) | .45 | 2.5 (1.0-3.0) | 3.0 (2.0-4.0) | .56 |
Chatbot 1 | 3.0 (2.0-4.0) | 1.5 (1.0-3.0) | .05 | 3.0 (2.0-4.0) | 1.0 (1.0-2.0) | .002 | 1.5 (1.0-3.0) | 1.0 (1.0-2.0) | .51 |
Chatbot 2 | 2.0 (1.0-3.0) | 3.0 (2.0-4.0) | .10 | 2.0 (1.0-3.0) | 2.5 (2.0-4.0) | .11 | 3.0 (2.0-4.0) | 2.5 (2.0-4.0) | .94 |
^a Abstracts were written by a research resident within the first 6 months of their research year, were the final submitted version edited by a senior author, or were generated by chatbot 1 (Chat Generative Pretrained Transformer [GPT] version 3.5) or chatbot 2 (Chat-GPT version 4.0).
^b Statistical significance was defined as P < .05.