Abstract
Artificial intelligence (AI) large language models (LLMs) now produce human-like general text and images. LLMs’ ability to generate persuasive scientific essays that undergo evaluation under traditional peer review has not been systematically studied. To measure perceptions of quality and the nature of authorship, we conducted a competitive essay contest in 2024 with both human and AI participants. Human authors and four distinct LLMs generated essays on controversial topics in stroke care and outcomes research. A panel of Stroke editorial board members (mostly vascular neurologists), blinded to author identity and with varying levels of AI expertise, rated the essays for quality, persuasiveness, best in topic, and author type. Among 34 submissions (22 human, 12 LLM) scored by 38 reviewers, human and AI essays received mostly similar ratings, though AI essays were rated higher for composition quality. Author type was accurately identified only 50% of the time, with prior LLM experience associated with improved accuracy. In multivariable analyses adjusted for author attributes and essay quality, only persuasiveness was independently associated with odds of a reviewer assigning AI as author type (aOR 1.53 [95% CI: 1.09-2.16], p = 0.01). In conclusion, a group of experienced editorial board members struggled to distinguish human vs. AI authorship, with a bias against best in topic for essays judged to be AI-generated. Scientific journals may benefit from educating reviewers on the types and uses of AI in scientific writing and developing thoughtful policies on the appropriate use of AI in authoring manuscripts.
Background
Artificial intelligence (AI) large language models (LLMs) have rapidly emerged as a significant cultural phenomenon, with OpenAI’s ChatGPT recording 100 million active monthly users within two months of its launch.1 LLMs have demonstrated an ability to pass multiple professional certifying and licensure exams, such as the Bar exam and Medical Boards, and are generating publicly available text and images at an unprecedented pace.2 As the influence of artificial intelligence expands, organizations, including scientific publishers, governments, and academic and research institutions, face the complex challenge of establishing regulatory frameworks to manage the integration of AI-derived content.3 Debates have emerged over inadvertent plagiarism, violation of copyright protections, privacy concerns regarding the data used to train these models, the potential for substandard outputs or “model hallucinations”, and their broader societal repercussions.4
There is little consensus at present as to whether the use of LLMs in scientific research and publishing is permissible or appropriate and whether it is an innovation to be embraced or a form of forgery to be shunned.5 To explore the role of LLMs in scientific writing and the perceptions of manuscript reviewers, we launched a competitive essay contest for the journal Stroke. We sought to evaluate how editorial board members would score persuasive essays on three controversial topics in the stroke field when blinded to author status (human vs. one of four different LLMs), what factors would influence the assignment of human vs. AI authorship, and whether the perceived author status would influence the reviewer’s assessment.
Methods
Participants
Participants were instructed to submit persuasive essays of up to 1000 words and six references on one of three available topics. Authors were required to disclose the total time spent on the submission, including time spent on formatting, whether they were a trainee, and the country in which they practiced. They also affirmed that they had not used AI in any manner when preparing their submission and that they were aware there would be AI-generated submissions as well. The authors of the highest-scoring human-authored essay in each category and of the single highest-scoring trainee essay would receive a token prize, present their essays at the 2024 International Stroke Conference, and have them highlighted on the Stroke website (https://www.ahajournals.org/str/call-for-submissions).
Two authors (LHS and RK) developed iterative prompts for four leading LLMs, ChatGPT 3.5 and ChatGPT 4 (OpenAI, Inc), Bard (Alphabet, Inc), and LLaMA-2 (Meta, Inc), to generate one essay per topic responsive to the exact text of the essay topic (prompt sequences available on request). Additional prompting was used to provide context for the LLMs regarding style, tone, journal formatting requirements, and word length. No editing of the text was permitted post-generation. However, because of the known limitations of LLMs in generating citations, literature citations were manually reviewed and corrected when necessary, and evaluation of citation quality was therefore omitted from the analytical component of this study.
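As a purely illustrative sketch, the snippet below shows how a context-setting prompt sequence of this kind could be issued programmatically to one of the GPT models through the OpenAI Python client. The system and user messages are hypothetical placeholders (reusing the text of topic 1); they are not the actual contest prompt sequences, which were developed iteratively and are available on request.

```python
# Illustrative only: hypothetical prompt scaffolding for one GPT model via the
# OpenAI Python client (not the contest's actual prompt sequences or workflow).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

messages = [
    {
        "role": "system",
        "content": (
            "You are writing a persuasive scientific essay for the journal Stroke: "
            "formal academic tone, at most 1000 words, and up to six references."
        ),
    },
    {
        "role": "user",
        "content": "Essay topic (verbatim): Do statins increase the risk of hemorrhagic stroke?",
    },
]

response = client.chat.completions.create(model="gpt-4", messages=messages)
essay_text = response.choices[0].message.content  # generated text used without post-editing
print(essay_text)
```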
Essay Topics
The topics chosen for the contest revolved around three important and unsettled questions in stroke clinical care and outcomes research, where definitive answers remain elusive and current evidence is open to interpretation.
1. Do statins increase the risk of hemorrhagic stroke?
2. What are the best acute treatment options for patients with mild neurological deficits and large vessel occlusion?
3. When measuring stroke outcome, should we use a dichotomized modified Rankin Scale (mRS) or ordinal shift analysis of mRS?
Reviewers
All members of the Stroke editorial board were invited to review the essays, and 38 members participated. None had submitted essays, and all were blinded to author status. They provided information on years since completing training, degree of experience or proficiency with ChatGPT or other LLMs (none, little, moderate, skilled, expert), field of specialty training from a dropdown list, and whether they were based in the US or another country. Before the evaluation process, each essay was screened by the journal staff to ensure it met the criteria for inclusion in the rating phase. Reviewers rated the essays as described below and were also asked to give their best guess and level of certainty as to the nature of the author on a 5-point Likert scale (1, definitely human; 2, likely human; 3, uncertain if human or AI; 4, likely AI; 5, definitely AI).
For each submission, they answered three initial yes or no questions: (1) Does the manuscript adhere to the specified contest requirements regarding length and format? (2) Among all the essays reviewed in the given topic category, does this essay stand out as the best? (3) Are the conclusions presented in the essay well-supported by the medical literature and references cited?
This was followed by rating each essay on a scale of 1 to 5, with 1 being the best and 5 being the worst, in two domains: (1) The quality of the essay’s composition, including its structure, use of vocabulary, and grammatical style; (2) The persuasiveness of the content or the stance taken by the author.
Not all reviewers rated all essays. The reported sample sizes pertain to individual essay-reviewer pairings, and analyses were conducted at the level of the essay-reviewer dyad unless specifically stated. All dyads (n=1271) were included in the multivariable models. Statistical analyses were pre-planned and conducted to compare the perceived quality of the essays across all submissions and to assess the raters’ accuracy in identifying the author type. We considered the assignment of authorship correct if it was rated either ‘definitely’ or ‘likely’ on the Likert scale. We also explored factors influencing the selection of the “best essay” as well as the assignment of author-type attribution. Unadjusted bivariate comparisons were performed with Fisher’s exact test, comparing essays rated 1-2 vs 3-5 on the Likert scales for quality of composition and persuasiveness, with 1 as ‘best’ and 5 as ‘worst’. We conducted multivariable analyses using hierarchical linear mixed models fit by maximum likelihood to evaluate the association of author and reviewer characteristics with the perceived author type, the accurate identification of a human-written vs an LLM-generated essay, and the selection of the “best essay”. The essays were the unit of analysis, and the multilevel models included author-level fixed effects and rater-level random effects. All analyses were conducted using Python 3.6. All statistical tests were 2-sided, with the level of significance set at 0.05.
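The following is a minimal sketch of how these two analysis steps could be implemented in Python; the data file, column names, and covariates shown are hypothetical placeholders, not the study’s actual analysis code.

```python
# Hypothetical sketch of the pre-planned analyses: Fisher's exact test on
# dichotomized ratings and a hierarchical linear mixed model fit by maximum
# likelihood with rater-level random effects. All column names are assumed.
import pandas as pd
from scipy.stats import fisher_exact
import statsmodels.formula.api as smf

dyads = pd.read_csv("essay_reviewer_dyads.csv")  # one row per essay-reviewer dyad

# Unadjusted bivariate comparison: composition ratings dichotomized as 1-2 vs 3-5,
# AI- vs human-authored essays.
dyads["top_composition"] = (dyads["composition_rating"] <= 2).astype(int)
contingency = pd.crosstab(dyads["author_is_ai"], dyads["top_composition"])
odds_ratio, p_value = fisher_exact(contingency)

# Multivariable hierarchical linear mixed model: author-level fixed effects and
# reviewer-level random intercepts, fit by maximum likelihood (reml=False).
model = smf.mixedlm(
    "perceived_ai ~ persuasiveness + composition_rating + trainee + us_based",
    data=dyads,
    groups=dyads["reviewer_id"],
)
result = model.fit(reml=False)
print(result.summary())
```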
The final data set was fully deidentified before analysis, and all participants and reviewers were assigned coded identity variables. The participant authors were aware that the essay contest would be used as a vehicle to compare human and AI-generated content, the full editorial board of Stroke voted to endorse the conduct of the contest, and service as an essay reviewer was completely voluntary. All human authors were aware that their essay submissions, professional attributes collected, and essay reviews would be used in the evaluation of the journal contest outcomes and process evaluation. As LLMs are non-sentient, consent for their participation was not pursued.
Results
Participant and Reviewer Characteristics
We received 22 essays from human authors, each of whom contributed one essay. The distribution of essays among the three proposed topics varied, with 7, 12, and 3 essays for topics 1, 2, and 3, respectively. Each of the four LLMs (Bard, LLaMA-2, GPT 3.5, and GPT 4) was prompted to generate one essay per topic.
All 69 citations in the AI-generated manuscripts were manually reviewed by the manuscript authors. Within the AI-generated essays, 18 references were removed (eight did not directly address the manuscript content and 10 were fabrications/hallucinations) and 14 new references were added by the manuscript authors. All preserved references required correction of minor formatting or typographical errors. Because of the known challenges of AI-generated references, we chose to correct the references so that the evaluation of the AI essays would focus on the content. Therefore, we did not evaluate the third question, “Are the conclusions presented in the essay well-supported by the medical literature and references cited?” The three winning essays and the highest-scoring trainee essay will be posted on the Stroke website, and the prompt sequences and output for each essay question by each LLM are available for review in the Supplementary Material. The human authors were predominantly trainees (63%) and based in English-speaking countries (59%). The median [IQR] time spent composing the essays was 6 hours [4, 10], with a wide range of 1.25 to 96 hours.
A total of 38 unique reviewers rated the essay entries, with a median [IQR] of 34 [33, 34] essays scored per rater (26 reviewers scored all 34 essays, 5 scored 33 essays, 1 scored 32 essays, and 6 scored under 20 essays). Half the reviewers were U.S.-based, with a median of 17.5 years [15.0, 25.8] of experience. Reviewers classified their proficiency with AI/LLMs as non-existent (47%), minimal (45%), moderate (8%), or expert (0%). Most reviewers were vascular neurologists (55%), while general neurologists, statisticians, neuroendovascular specialists, and physical medicine and rehabilitation clinicians made up 5% each, with the remaining 26% from various other specialties.
Essay Quality Ratings
The distributions of raw Likert scores for persuasiveness and quality of composition are shown in Figure 1. In unadjusted bivariate analyses comparing the writing quality of AI- versus human-authored essays, AI essays were more often rated highly (1-2) for quality of composition (56% vs 38%, OR 2.09 [1.63-2.68], P<0.01), with borderline significance for persuasiveness (45% vs 39%, OR 1.28 [1.00-1.63], P=0.05).
Figure 1:
Comparisons of Writing Quality by Artificial Intelligence (AI) vs Human Authors (Bivariate Analysis of Scores 1-2 vs 3-5)
Essay Authorship Assignment
Reviewers frequently misclassified human-written essays as AI-generated and vice versa. Human essays were misclassified as AI-generated or of uncertain authorship in 277/747 (44%) dyads, while AI-authored essays were misclassified as definitely or likely human-written in 156/436 (36%) of dyads and as either human-written or of uncertain authorship in 273/436 (63%) of dyads. Overall, reviewers correctly assigned authorship in only 633/1271 dyads, for an accuracy of 50% (Table 1). In multivariable analyses adjusting for the number of authors, author country, trainee status, and measures of essay quality, the only variable independently associated with the odds of a reviewer assigning AI as the author was the essay’s persuasiveness rating (aOR 1.53 [95% CI: 1.09-2.16], p = 0.01).
Table 1:
Author Type Attribution and Level of Certainty by Reviewers by Essay-Reviewer Dyad
| | Definitely AI | Likely AI | Uncertain | Likely Human | Definitely Human | Total Dyad Ratings |
|---|---|---|---|---|---|---|
| Human Essay (n=22) | 43 | 131 | 191 | 282 | 188 | 835 |
| AI Essay (n=12) | 29 | 134 | 117 | 117 | 39 | 436 |
| Total | 72 | 265 | 308 | 399 | 227 | 1271 |
Essays authored by the different LLMs were correctly identified as AI-generated at low but comparable rates. The study was not powered to detect differences between LLMs; however, essays produced by LLaMA-2 were correctly recognized as AI-generated at a numerically higher rate than those from the other LLMs (47.8% vs 38.9%, p=0.17) (Table 2). No self-reported rater or author characteristic was associated with assigning a human or AI author type. However, a reviewer’s self-report of “any” vs. “no” AI/LLM experience was associated with higher odds of correctly assigning author status (aOR 1.37 [1.03-1.83], p = 0.03).
Table 2:
Reviewer Attribution of AI-Generated Author Type by Large Language Model Engine
| LLM ID | Definitely AI | Likely AI | Uncertain | Likely Human | Definitely Human | TOTAL | % Assigned as AI |
|---|---|---|---|---|---|---|---|
| GPT 3.5 | 8 | 32 | 21 | 32 | 10 | 103 | 38.8% |
| GPT 4 | 5 | 35 | 20 | 34 | 6 | 100 | 40.0% |
| BARD | 8 | 31 | 21 | 29 | 14 | 103 | 37.9% |
| LLAMA2 | 8 | 36 | 17 | 22 | 9 | 92 | 47.8% |
| TOTAL | 29 | 134 | 79 | 117 | 39 | 398 | 41.0% |
In unadjusted analyses, AI-authored essays were chosen as best in topic at rates similar to human-authored ones (7.8% vs 7.3%, OR 1.26 [95% CI: 0.46-3.42], P=0.66), and this did not change after adjustment for raters’ self-reported years since training and AI/LLM experience. Within the group of LLM-authored essays, Bard-generated essays were rated best in topic significantly more often than those of the three other LLMs (20.4% vs 3.1%, p<0.0001). Of 95 best-in-topic ratings, Bard essays accounted for 23/34 (67.6%) of those given to AI-authored essays and 23/95 (24.2%) of all such ratings (Table 3).
Table 3:
Univariate Comparisons of Best Essay Selection by Author Type
| Author | Rated Best in Topic Area | Overall Dyad Pairs |
|---|---|---|
| Human | 61 (7.3%) | 835 |
| AI | 34 (7.8%) | 436 |
| GPT 3.5 | 4 (3.5%) | 114 |
| GPT 4 | 3 (2.7%) | 111 |
| BARD | 23 (20.4%) | 113 |
| LLAMA2 | 4 (4.1%) | 98 |
Essays chosen as best in topic had similar odds of being AI-generated vs human-authored (aOR 1.22 [95% CI: 0.44-3.35], p=0.71). However, essays had significantly lower odds of being chosen as the best essay if reviewers assigned them as AI- vs human-generated, in both unadjusted (OR 0.04 [95% CI: 0.01-0.17], P<0.01) and adjusted analyses (aOR 0.04 [95% CI: 0.01-0.17], P=0.004). Similarly, essays had significantly lower odds of being chosen as the best essay if reviewers assigned them as of uncertain vs human authorship, in both unadjusted (OR 0.32 [95% CI: 0.15-0.70], P=0.004) and adjusted analyses (aOR 0.33 [95% CI: 0.15-0.70], P<0.001).
Discussion
To our knowledge, our study is the first to systematically evaluate how experienced editorial board reviewers perceive the quality and composition of, and attribute authorship to, a blinded cohort of human- vs AI-generated medical opinion essays under peer review. Other small-scale observational reports have raised the concern that LLM-generated content is approaching a level of quality sufficient to conceal its non-human origins, but our study is the first to conduct a comprehensive assessment.4,6 Many scientific journals have recently provided guidance to authors on the allowable use of LLMs in journal submissions,7 but these policies rely on author attestation that an LLM was not used to subvert the authorship process or be solely responsible for content generation.8,9
The present study highlights several key themes. First, reviewers had great difficulty accurately assigning authorship to the essays, performing essentially at the level of a coin flip, likely because of the sophisticated and mature use of relevant scientific language and concepts in the essays. Interestingly, no demographic or professional attributes of the authors or reviewers were predictive of accuracy in identifying the source of the essays. However, reviewers with previous exposure to AI technologies were modestly more successful in recognizing AI-authored content. These findings suggest that familiarity with AI may enhance one’s ability to differentiate between human- and machine-generated text in medical publications.
Being based in the US vs another country did not influence reviewer accuracy, nor did experience with the essay topics, even though approximately half of the reviewers were trained as vascular neurologists. Surprisingly, AI-generated essays had twice the odds of being rated at the highest level of quality compared with human-authored essays, suggesting that the formulaic nature of persuasive scientific essays lends itself particularly well to LLM training and output.
Second, not all LLMs performed alike, and differences in their training data sets, model constructs, and output styles led to differences in reviewer attitudes toward their work. LLaMA-2 stood out numerically (but not statistically significantly) as the easiest to identify as LLM-generated, whereas Bard captured a disproportionate number of best-in-topic essays. A more in-depth analysis leveraging natural language processing is underway to identify specific linguistic factors or constructs associated with author attribution and essay quality. In the interest of focusing the contest on the quality of the essays’ content, and given the known vulnerabilities of AI-generated citations, we chose to correct the citations so as to avoid obvious detection of the AI-generated essays. The consequence of this decision was that we could not evaluate the third question regarding the quality of the literature cited, but given the evident bias against AI-generated essays, we believe this strengthens the overall findings of the study.
Third, reviewers had a clear bias against content they judged to be AI-generated. AI- vs human-generated essays were rated as best in topic at similar rates, but when reviewers assigned AI authorship to the essay the odds of receiving a best in topic rating were vanishingly small (aOR 0.04).
If most journal editorial board members and ad hoc reviewers are similar in composition to those at Stroke in terms of their professional attributes and experience with LLMs, our study suggests that journal reviewers may not currently possess the skills to accurately distinguish AI-generated content from human authorship. This underscores the need for explicit training focused on LLMs to improve their discernment capabilities.3,10 As LLMs continue to improve in performance, it is likely that this will become an increasingly difficult task for reviewers and editors, even those who are expert in LLM capabilities.11 Providing focused education on the risks and benefits of AI-assisted publication preparation could reduce potential implicit bias affecting reviewers’ judgments of manuscript quality, enabling reviewers to evaluate submissions more on the merit of the content rather than the perceived source of authorship.12 Furthermore, the lack of clear and uniform policies regarding AI-generated contributions across the many heterogeneous journals represents a gap in the current editorial framework and an opportunity for harmonization.8 Journals must establish comprehensive guidelines that address the allowable use of AI in all phases of manuscript preparation, review and dissemination. These policies should articulate the extent to which AI involvement is acceptable and the necessary disclosures authors must make regarding AI assistance.9
Lastly, the future of content creation for medical journals will likely evolve toward a more collaborative approach, in which AI-generated initial drafts are enhanced through human expertise, and AI-generated experiments or self-directed simulations provide additional tests of the generalizability of research findings.3,13 Such a hybrid model of content generation promises to uphold the quality and integrity of manuscripts and to spur innovation and discovery. It could also alleviate some of the burden of low-value, labor-intensive tasks, such as formatting and synthesizing large bodies of established prior work, allowing human authors to focus their intellect on high-level conceptual thinking and critical analysis.14 The burden of recruiting and sustaining a panel of uncompensated expert reviewers and ensuring timely review is a daunting challenge that is only getting harder with the proliferation of open access journals.15 There may be a future in which AI-assisted manuscript review also allows reviewers to focus on the higher conceptual aspects of medical manuscripts or grant submissions and relieves them of some of the more monotonous tasks.16
Our study has some limitations. First, the raters were aware that some essays were AI-generated, which might have biased their evaluations. Second, we corrected any citation errors in the AI-generated essays prior to review, which could influence the perceived accuracy of the AI output. Furthermore, the AI-generated essays were developed by two cardiovascular clinician-scientists skilled in prompt engineering, potentially enhancing the submissions beyond the typical quality expected from less experienced authors. In addition, the reviewers in this contest had limited experience with AI, which might make them less able to detect human vs AI content, but they were also editorial board members and thus experienced researchers and reviewers, which should have made them more discerning and capable of distinguishing high-quality writing. Both of these facts may limit the generalizability of our results to other groups of reviewers but strengthen the observation that there was little distinguishable difference in the quality of the text of the essays themselves. Lastly, we did not base our sample size on a power calculation, so caution should be exercised when applying our conclusions more broadly.
In conclusion, our study illuminates the complexities and inherent challenges faced by reviewers in distinguishing between human and AI-generated content in the emerging medical literature. The diminished accuracy in identifying AI-generated content and the bias demonstrated against it suggest the need for specific training in AI and LLMs for editorial members to ensure an unbiased assessment of submissions based purely on content quality. The absence of such training is a lacuna in the editorial process that should be addressed through the development of explicit policies and training regarding AI-assisted contributions. A hybrid approach to content creation, combining AI’s efficiency with human expertise, appears to be a promising pathway. Such collaboration could shift the intellectual focus of professionals towards more strategic and analytical endeavors.
Supplementary Material
Disclosures:
Dr Silva reports employment by UNIFESP; compensation from ISchemaView for consultant services; compensation from Bayer for consultant services; compensation from Boehringer Ingelheim for consultant services; employment by Sociedade Beneficente Israelita Brasileira Albert Einstein; and compensation from Pfizer for other services.
Dr Khera reports grants from BridgeBio; a provisional patent for Methods of generating digital twin-based datasets; an ownership stake in Ensight-AI, Inc.; employment by Yale School of Medicine; grants from National Institutes of Health; a patent pending for Methods For Neighborhood Phenomapping For Clinical Trials, No. 63/177,117; a provisional patent for Format independent detection of cardiovascular disease from printed ECG images with deep learning licensed to Ensight-AI, Inc.; grants from Novo Nordisk; grants from Blavatnik Family Foundation; a provisional patent for Machine Learning Method for Adaptive Trial Enrichment; an ownership stake in Evidence2Health; grants from Doris Duke Charitable Foundation; a provisional patent for A multi-modal video-based progression score for aortic stenosis using artificial intelligence; a provisional patent for Artificial intelligence-guided screening of under-recognized cardiomyopathies adapted for POCUS; grants from Bristol-Myers Squibb; stock holdings in Evidence2Health; grants from National Institutes of Health; a provisional patent for Biometric contrastive learning for data-efficient deep learning from electrocardiographic images licensed to Ensight-AI, Inc.; and a provisional patent for Articles and methods for detecting hidden cardiovascular disease from portable electrocardiography.
Dr Schwamm reports compensation from Medtronic for consultant services.
Non-standard Abbreviations and Acronyms:
- AI: artificial intelligence
- LLMs: large language models
- mRS: modified Rankin Scale
Appendix
Reviewer Contributors
We gratefully acknowledge the following Stroke Editorial Board members who contributed their time and effort by participating in the blinded grading and rating of essays for the Stroke AI essay contest. The following members granted permission to be listed.
Maurizio Acampa, MD PhD, Stroke Unit, Department of Medical Sciences, Surgery and Neurosciences, University of Siena, Siena, Italy
Eric E. Adelman, MD, Department of Neurology, University of Wisconsin – Madison WI USA
Johannes Boltze, MD, PhD, University of Warwick, Coventry, United Kingdom
Joseph P. Broderick, MD, UC Gardner Neuroscience Institute, University of Cincinnati, Cincinnati OH, USA
Amy Brodtmann MBBS, PhD, School of Translational Medicine, Monash University, Clayton VIC, Australia
Hanne Christensen, MD, PhD, DMSci, Copenhagen University Hospital, Bispebjerg Denmark
Lachlan Dalli, PhD, Department of Medicine, Monash University, Clayton VIC, Australia
Kelsey Rose Duncan, MD, MBA, Department of Neurology, Case Western Reserve University, Cleveland OH USA
Islam Y. Elgendy, MD, Division of Cardiovascular Medicine, Gill Heart Institute, University of Kentucky, Lexington, KY USA
Adviye Ergul, MD, PhD, Medical University of South Carolina, Charleston, SC USA
Larry B. Goldstein, MD Dept of Neurology, University of Kentucky, Lexington, KY USA
Janice L Hinkle, RN, PhD, CNRN, M Fitzpatrick School of Nursing, Villanova University, Villanova, PA USA
Michelle C. Johansen MD PhD - The Johns Hopkins University School of Medicine, Baltimore MD, USA
Katarina Jood MD, PhD, University of Gothenburg, Sweden
Scott E. Kasner, MD, Department of Neurology, University of Pennsylvania, Philadelphia PA USA
Steven R. Levine, MD, SUNY Downstate Health Sciences University, Brooklyn NY USA
Zixiao Li, MD, PhD, Department of Neurology, Capital Medical University, Beijing, China
Gregory Lip MD FRCP DFM, Cardiovascular Medicine, University of Liverpool, United Kingdom
Elisabeth B Marsh, MD, Johns Hopkins School of Medicine, Baltimore MD USA
Keith W Muir, MD, University of Glasgow, Glasgow, Scotland
Johanna Maria Ospel, MD PhD, Cumming School of Medicine, University of Calgary, AB Canada
Joanna Pera, MD, PhD, Department of Neurology, Jagiellonian University Medical College, Krakow, Poland
Terence J Quinn MD, University of Glasgow, Glasgow, United Kingdom
Silja Räty, MD, PhD, Department of Neurology, Helsinki University Hospital, Finland
Professor Anna Ranta, MD, PhD, Department of Medicine - University of Otago, Wellington, NZ
Lorie Gage Richards, PhD, OTR/L, Department of Occupational and Recreational Therapies, University of Utah, Salt Lake City, UT USA
Jose Rafael Romero, MD, Dept of Neurology, Boston University Chobanian and Avedisian School of Medicine, Boston MA USA
Joshua Z Willey MD MS, Columbia University Vagelos College of Physicians and Surgeons, New York NY USA
Argye E. Hillis MD, PhD, Dept of Neurology, Johns Hopkins University School of Medicine, Baltimore MD USA
Janne M. Veerbeek, PT, PhD Neurocenter, Luzerner Kantonsspital, Lucerne, Switzerland
Footnotes
Stroke Essay Contest Prompts and Responses
References
1. Ghassemi M, Birhane A, Bilal M, Kankaria S, Malone C, Mollick E, Tustumi F. ChatGPT one year on: who is using it, how and why? Nature. 2023;624:39–41.
2. Singhal K, Azizi S, Tu T, Mahdavi SS, Wei J, Chung HW, Scales N, Tanwani A, Cole-Lewis H, Pfohl S, et al. Large Language Models Encode Clinical Knowledge. Nature. 2023;620:172–180.
3. Liebrenz M, Schleifer R, Buadze A, Bhugra D, Smith A. Generating scholarly content with ChatGPT: ethical challenges for medical publishing. Lancet Digit Health. 2023;5:e105–e106.
4. Májovský M, Černý M, Kasal M, Komarc M, Netuka D. Artificial Intelligence Can Generate Fraudulent but Authentic-Looking Scientific Medical Articles: Pandora’s Box Has Been Opened. J Med Internet Res. 2023;25:e46924.
5. Hosseini M, Resnik DB, Holmes K. The ethics of disclosing the use of artificial intelligence tools in writing scholarly manuscripts. Res Ethics. 2023;19:449–465.
6. Safrai M, Orwig KE. Utilizing artificial intelligence in academic writing: an in-depth evaluation of a scientific review on fertility preservation written by ChatGPT-4. J Assist Reprod Genet. 2024; doi:10.1007/s10815-024-03089-7.
7. Flanagin A, Pirracchio R, Khera R, Berkwits M, Hswen Y, Bibbins-Domingo K. Reporting Use of AI in Research and Scholarly Publication-JAMA Network Guidance. JAMA. 2024;331:1096–1098.
8. Inam M, Sheikh S, Minhas AMK, Vaughan EM, Krittanawong C, Samad Z, Lavie CJ, Khoja A, D’Cruze M, Slipczuk L. A review of top cardiology and cardiovascular medicine journal guidelines regarding the use of generative artificial intelligence tools in scientific writing. Curr Probl Cardiol. 2024;49:102387.
9. Tang A, Li KK, Kwok KO, Cao L, Luong S, Tam W. The importance of transparency: Declaring the use of generative artificial intelligence (AI) in academic writing. J Nurs Scholarsh. 2024;56:314–318.
10. Eysenbach G. The Role of ChatGPT, Generative Language Models, and Artificial Intelligence in Medical Education: A Conversation With ChatGPT and a Call for Papers. JMIR Med Educ. 2023;9:e46885.
11. Clusmann J, Kolbinger FR, Muti HS, Carrero ZI, Eckardt JN, Laleh NG, Löffler CML, Schwarzkopf SC, Unger M, Veldhuizen GP, et al. The future landscape of large language models in medicine. Commun Med. 2023;3:141.
12. Lee P, Bubeck S, Petro J. Benefits, Limits, and Risks of GPT-4 as an AI Chatbot for Medicine. N Engl J Med. 2023;388:1233–1239.
13. Leung TI, de Azevedo Cardoso T, Mavragani A, Eysenbach G. Best Practices for Using AI Tools as an Author, Peer Reviewer, or Editor. J Med Internet Res. 2023;25:e51584.
14. Baffy G, Burns MM, Hoffmann B, Ramani S, Sabharwal S, Borus JF, Pories S, Quan SF, Ingelfinger JR. Scientific Authors in a Changing World of Scholarly Communication: What Does the Future Hold? Am J Med. 2020;133:26–31.
15. Perlis RH, Kendall-Taylor J, Hart K, Ganguli I, Berlin JA, Bradley SM, Haneuse S, Inouye SK, Jacobs EA, Morris A. Peer Review in a General Medical Research Journal Before and During the COVID-19 Pandemic. JAMA Netw Open. 2023;6:e2253296.
16. Biswas S, Dobaria D, Cohen HL. Focus: Big Data: ChatGPT and the Future of Journal Reviews: A Feasibility Study. Yale J Biol Med. 2023;96:415–420.