2025 Aug 7;167(1):214. doi: 10.1007/s00701-025-06622-4

Can we trust academic AI detective? Accuracy and limitations of AI-output detectors

Gökberk Erol 1, Anıl Ergen 2, Büşra Gülşen Erol 3, Şebnem Kaya Ergen 4, Tevfik Serhan Bora 5, Ali Deniz Çölgeçen 6, Büşra Araz 7, Cansel Şahin 7, Günsu Bostancı 8, İlayda Kılıç 9, Zeynep Birce Macit 8, Umut Tan Sevgi 10, Abuzer Güngör 11
PMCID: PMC12331776  PMID: 40773066

Abstract

Objective

This study evaluates the reliability and accuracy of AI-generated text detection tools in distinguishing human-authored academic content from AI-generated texts, highlighting potential challenges and ethical considerations in their application within the scientific community.

Methods

This study analyzed the detectability of AI-generated academic content using abstracts and introductions created by ChatGPT versions 3.5, 4, and 4o, alongside human-written originals from the pre-ChatGPT era. Articles were sourced from four high impact neurosurgery journals and categorized into four categories: originals and generated by ChatGPT 3.5, ChatGPT 4, and ChatGPT 4o. AI-output detectors (GPTZero, ZeroGPT, Corrector App) were employed to classify 1,000 texts as human- or AI-generated. Additionally, plagiarism checks were performed on AI-generated content to evaluate uniqueness.

Results

A total of 250 human-authored articles and 750 ChatGPT-generated texts were analyzed using three AI-output detectors (Corrector, ZeroGPT, GPTZero). Human-authored texts consistently had the lowest AI likelihood scores, while AI-generated texts exhibited significantly higher scores across all versions of ChatGPT (p < 0.01). Plagiarism detection revealed high originality for ChatGPT-generated content, with no significant differences among versions (p > 0.05). ROC analysis demonstrated that AI-output detectors effectively distinguished AI-generated content from human-written texts, with areas under the curve (AUC) ranging from 0.75 to 1.00 for all models. However, none of the detectors achieved 100% reliability in distinguishing AI-generated content.

Conclusions

While models like ChatGPT enhance content creation and efficiency, they raise ethical concerns, particularly in fields demanding trust and precision. AI-output detectors exhibit moderate to high success in distinguishing AI-generated texts, but false positives pose risks to researchers. Improving detector reliability and establishing clear policies on AI usage are critical to mitigate misuse while fully leveraging AI’s benefits.

Keywords: Academic writing, AI-output detector, Artificial intelligence, ChatGPT, Ethical considerations, Plagiarism detection

Introduction

Artificial Intelligence (AI) and Machine Learning (ML) have made significant progress in recent years, with increasing applications in medicine [3, 7, 9, 19, 23, 24, 28, 30]. Advances in natural language processing (NLP) and text generation are particularly significant. Chat Generative Pre-Trained Transformer (ChatGPT), a prominent example of a large language model (LLM), was released by OpenAI in November 2022 and has demonstrated the potential of AI models by answering questions in a human-like way, processing and analyzing large amounts of data, summarizing texts, translating languages, and generating content [5, 14, 16, 22, 27].

With advancing technology, ChatGPT 4 was launched in March 2023 and ChatGPT 4o was released in May 2024. The scale of the underlying neural networks has grown with each update, and the newer versions offer enhanced capabilities in advanced reasoning, complex instruction-following, and creativity [12]. Studies have also shown that the newer versions answer medical questions more accurately than their predecessors, generate better answers, and can interpret image inputs [17, 26]. Researching a topic and writing an academic paper are important academic skills, yet LLM tools, including ChatGPT, have become widely used in academic article production because they can quickly and effortlessly generate clear, contextually relevant, and coherent text, making them an attractive option for writers.

Language-based AI has already entered the scientific community with its wide range of uses [2]. ChatGPT has even been cited as an author in academic papers in early published studies [18, 20]. While ChatGPT can assist in various aspects of scientific research, there are several challenges and limitations when using AI for scientific research reports, including issues related to quality control, ethical concerns, and the amplification of existing biases in data and algorithms. Therefore, there are growing concerns about the misuse of AI chatbots. One major concern is hallucination, where LLMs generate answers that sound correct but are actually wrong or make no sense [1, 26]. While this may be beneficial in tasks that require creative thinking, as it suggests an ability to generate knowledge beyond the training data, it poses challenges in fields like medicine, where strict accuracy to real facts is essential [13].

It is especially difficult for reviewers to determine whether academic content, such as an abstract, has been generated by ChatGPT. This poses a potential problem for the academic community, as fake abstracts and publications can be presented at conferences and scientific meetings or become published articles. The generation of fake scientific research reports using AI is troubling, as it can erode credibility and trust in the scientific world and may result in misguided or even harmful decisions based on the inaccurate information. There is therefore an ethical dilemma about how much AI should be allowed in publications. Accordingly, detectors are now being introduced to identify AI-generated text in academic articles. These detectors do not evaluate the accuracy of the content; rather, their algorithms gather clues from the writing style of the text [31]. That said, it is not uncommon for an AI detector to wrongly flag genuinely human-written text. This has far-reaching consequences, such as the possibility of career-altering accusations against researchers whose work is, in fact, original. In this study, we use papers published before the advent of AI-assisted text generation, along with abstracts and introductions generated by various versions of ChatGPT with matching titles, to evaluate the accuracy of AI-output detectors.

Materials and methods

AI-output detectors were used to analyze human-written and ChatGPT-generated abstracts and introductions, including articles from various stages of the ChatGPT era. Institutional review board approval was not required, as ChatGPT is publicly available.

Data collection

For our study, we established an archive of articles. Additionally, we aimed to determine whether there were differences between the ChatGPT 3.5, 4, and 4o eras. Consequently, we categorized the articles into four historical periods for analysis: the pre-ChatGPT era (prior to November 30, 2022, used as a control group), the ChatGPT 3.5 era (beginning November 30, 2022), the ChatGPT 4 era (beginning March 14, 2023), and the ChatGPT 4o era (beginning May 13, 2024).

Articles were collected from four high-impact neurosurgical journals selected for the open accessibility of their back issues. As exclusion criteria, “letters to the editor”, “editorials”, and “case reports” published in 2021 and earlier were excluded. Open access articles were systematically downloaded retrospectively from the designated dates, starting with the first issue. A team of six English-proficient medical students, divided into two groups, searched the database for eligible articles, and eligibility was then verified by two neurosurgeons.

The texts evaluated in our study consist of the abstracts and introduction sections of the original articles. This selection was made because of the word limits in the free versions of the chatbot detectors and the increasing preference for using the summarization capabilities of LLMs in the academic field. We created a control group that we are certain was not generated by ChatGPT by downloading 250 articles from pre-ChatGPT-era issues. Using the titles of these articles, we generated 250 abstract and introduction sections with each of the three versions of ChatGPT. All of these texts were analyzed using contemporary AI-output detectors.

Text production

We employed the ChatGPT 3.5, ChatGPT 4, and ChatGPT 4o versions. According to its official website, ChatGPT 4 has a knowledge cutoff of December 2023, while the newer ChatGPT 4o has a knowledge cutoff of October 2023 (https://platform.openai.com/docs/models).

As the prompt, we used the sentence: “Please write an abstract and introduction section of a neurosurgery article titled as ‘title of the original article’. The abstract should include objective, methods, results, conclusions sections and word count of approximately 300.” The texts matching the control-group article titles were generated by six medical students using the different ChatGPT models with this prompt. Two neurosurgeons reviewed the appropriateness of the abstract and introduction sections.

Plagiarism and AI detection

While scanning the texts, we excluded the title and reference lists. For both the published articles and those we produced with ChatGPT, we used a standardized method in which only the abstract and introduction texts were included in the detector analysis.

To investigate the human or AI likeness of our control group and of the texts generated by different versions of ChatGPT, three of the most popular AI-output detectors were utilized: GPTZero (https://app.gptzero.me), ZeroGPT (https://www.zerogpt.com), and Corrector App (https://corrector.app/ai-content-detector/). These detectors classify the provided text as either AI or human with a score ranging from 0 to 100, where a higher percentage suggests a greater probability that the text is AI-generated. The results were recorded for a total of 1,000 texts.

We conducted plagiarism checks on the ChatGPT-produced abstracts using the 'Plagiarism Detector' tool (https://plagiarismdetector.net), which assigns a uniqueness score on a scale from 0 to 100%. On this platform, a higher score indicates that the text is more unique.

The valid versions of these online applications were used between June 1 and June 15, 2024.

Statistical analyses

The G*Power 3.1.9.4 software package was used for the sample size calculation. In this study, the same article was reproduced with three different versions of ChatGPT (GPT-3.5, GPT-4, GPT-4o), and the plagiarism and chatbot detector results were compared. Since comparisons were made between different versions of the same article, the Repeated Measures ANOVA method was used to calculate the sample size. For the interpretation of the effect size, based on Cohen's (1988) criteria, f = 0.10 was considered a small, f = 0.25 a medium, and f = 0.40 a large effect size. To detect significant differences in this study, the effect size was set at f = 0.10, the probability of error at α = 0.05, and the power at (1-β) = 97.4%. The power analysis indicated that at least 250 articles should be included in the study (Fig. 1).

Fig. 1.

Fig. 1

The sample size calculation was performed using the G*Power 3.1.9.4 software package. In this study, the same article was reproduced using three different versions of ChatGPT (GPT-3.5, GPT-4, GPT-4o), and the results of plagiarism and chatbot detection were compared. Since comparisons were made between different versions of the same article, the Repeated Measures ANOVA method was used for sample size calculation. As a result of the power analysis, it was determined that at least 250 articles should be included in the study
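Cohen's f, the effect-size measure used in the power analysis above, is the standard deviation of the group means divided by the common within-group standard deviation. A minimal sketch of the computation and the Cohen (1988) interpretation bands; the group means and within-group SD in the example are made-up illustrative numbers, not the study's data:

```python
import math

def cohens_f(group_means, within_sd):
    """Cohen's f: SD of the group means divided by the common within-group SD."""
    k = len(group_means)
    grand = sum(group_means) / k
    sd_means = math.sqrt(sum((m - grand) ** 2 for m in group_means) / k)
    return sd_means / within_sd

def effect_size_label(f):
    # Cohen (1988) criteria: 0.10 small, 0.25 medium, 0.40 large
    if f >= 0.40:
        return "large"
    if f >= 0.25:
        return "medium"
    if f >= 0.10:
        return "small"
    return "negligible"

# Hypothetical detector means for four groups, assumed within-group SD of 20
f = cohens_f([36.9, 94.2, 85.8, 93.6], within_sd=20.0)
label = effect_size_label(f)  # a large effect under Cohen's criteria
```

Setting f = 0.10 in the power analysis, as the authors did, is conservative: it sizes the study to detect even a small effect.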

The data analysis was conducted using the Statistical Package for the Social Sciences (SPSS) version 26.0. The results from the plagiarism detector and AI-output detectors for both the published articles and those generated by the ChatGPT versions are presented as mean and standard deviation. Normality testing showed that the results from Plagiarism Detector, Corrector, ZeroGPT, and GPTZero were within normal distribution limits. The reference range for normal distribution was set at ± 1.96.

Post hoc tests were used to identify differences between groups showing significant variance. Pearson correlation analysis was employed to examine the relationship between the recency of the ChatGPT versions and the results from the plagiarism detector and AI-output detectors for the generated articles. The correlation coefficient was interpreted as follows: 0.00–0.30 indicates a low level of correlation, 0.30–0.70 a moderate level, and 0.70–1.00 a high level. To estimate the similarity between AI-generated texts and original texts based on the Corrector, ZeroGPT, and GPTZero results, ROC analysis was used. Throughout the study, significance levels of 0.05 and 0.01 were considered.
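The Pearson correlation and its interpretation bands can be sketched in a few lines; this is an illustrative implementation, not the SPSS procedure the authors used:

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def correlation_level(r):
    # Interpretation bands used in the study:
    # 0.00-0.30 low, 0.30-0.70 moderate, 0.70-1.00 high
    r = abs(r)
    if r >= 0.70:
        return "high"
    if r >= 0.30:
        return "moderate"
    return "low"
```

Applied to the coefficients reported later in the paper, `correlation_level(0.823)` classifies the GPTZero correlation as "high" and `correlation_level(0.603)` classifies the Corrector correlation as "moderate", matching Table 2.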

Results

We included 250 articles that are unquestionably human-authored. ChatGPT was then instructed to generate, for each of these titles from the four high-impact neurosurgery journals, a scientific abstract of under 300 words structured into objective, methods, results, and conclusions sections, along with an introduction. A total of 750 abstracts and introductions were generated using the GPT-3.5, GPT-4, and GPT-4o versions (examples are provided in Supplementary Data 1).

The plagiarism detector and chatbot detector results of the published articles and ChatGPT generated versions are presented in Table 1.

Table 1.

The text materials’ plagiarism and AI detector results

Detector             Group          N     Mean    S.D.    p           Difference
Plagiarism Detector  Published (a)  250   –       –       0.074       –
                     GPT 3.5 (b)    250   98.95   2.26
                     GPT 4 (c)      250   99.35   2.15
                     GPT 4o (d)     250   99.29   1.82
Corrector            Published (a)  250   36.90   28.19   < 0.001**   a < c < b,d
                     GPT 3.5 (b)    250   94.19   11.92
                     GPT 4 (c)      250   85.84   17.62
                     GPT 4o (d)     250   93.56   9.56
ZeroGPT              Published (a)  250   27.55   24.83   < 0.001**   a < b < c < d
                     GPT 3.5 (b)    250   81.48   25.65
                     GPT 4 (c)      250   87.39   16.30
                     GPT 4o (d)     250   94.35   8.75
GPTZero              Published (a)  250   5.88    5.83    < 0.001**   a < b < c < d
                     GPT 3.5 (b)    250   81.71   23.59
                     GPT 4 (c)      250   96.83   7.11
                     GPT 4o (d)     250   99.58   2.23

*p < 0.05; **p < 0.01; Test: Repeated Measures ANOVA; Difference: Bonferroni test

In our study, a total of 750 abstracts and introductions were generated using different versions of ChatGPT. Each was evaluated by a plagiarism detection tool that provides an originality score, where 100% indicates no detected similarity. Of these, 637 texts (84.9%) received an originality score of 100%. The mean originality scores were 98.95% for GPT-3.5, 99.35% for GPT-4, and 99.29% for GPT-4o. These minor differences were not statistically significant (p > 0.05), suggesting that the ChatGPT versions perform similarly in producing texts that are rated as original by the detection tool (Table 1).

Analyzing the abstracts and introductions from both the original published versions and the ChatGPT-generated versions using the Corrector detector yielded an average AI likelihood score of 36.90% for the original content, 94.19% for the GPT-3.5 version, 85.84% for the GPT-4 version, and 93.56% for the GPT-4o version. According to these data, there was a significant difference between the Corrector results for the original articles and those generated with the ChatGPT versions (p < 0.01). The original articles had the lowest Corrector scores, and the GPT-4 scores were lower than those of the GPT-3.5 and GPT-4o versions, which were similar to each other. The Corrector detector results for the original and generated content are shown in Fig. 2.

Fig. 2.

Fig. 2

The Corrector, ZeroGPT, and GPTZero detector results for the published texts and the texts generated by the ChatGPT versions

The abstracts and introductions from both the original published versions and the ChatGPT-generated versions were evaluated using the ZeroGPT detector, resulting in an average AI likelihood score of 27.55% for the published version, 81.48% for the GPT-3.5 version, 87.39% for the GPT-4 version, and 94.35% for the GPT-4o version. According to the data, there was a significant difference between the ZeroGPT detector results for the published articles and those generated by the various ChatGPT versions (p < 0.001). The original articles had the lowest AI likelihood scores, and the scores increased as newer GPT versions were analyzed. The ZeroGPT detector results for the original articles and generated content are shown in Fig. 2.

The abstracts and introductions from both the original published articles and the ChatGPT-generated versions were assessed using the GPTZero detector, producing average AI likelihood scores of 5.88% for the published versions, 81.71% for GPT-3.5, 96.83% for GPT-4, and 99.58% for GPT-4o. The data revealed a statistically significant difference between the GPTZero scores of the published articles and the ChatGPT-generated versions (p < 0.001). The published articles showed the lowest AI likelihood scores, while scores increased with newer GPT versions. Figure 2 displays the GPTZero results for both the published articles and the ChatGPT-generated versions.

The results of the correlation analysis between the recency of the ChatGPT versions and the plagiarism detector and chatbot detector scores of the generated articles are presented in Table 2.

Table 2.

Correlation analysis between the recency of ChatGPT versions and the detector results

Detector             Coefficient   Recency of ChatGPT Versions   Level of Consistency
Plagiarism Detector  r             0.066                         –
                     p             0.072
Corrector            r             0.603**                       Moderate
                     p             < 0.001
ZeroGPT              r             0.695**                       Moderate
                     p             < 0.001
GPTZero              r             0.823**                       High
                     p             < 0.001

*p < 0.05; **p < 0.01; r: correlation coefficient

There was no significant correlation between the recency of the ChatGPT versions and the plagiarism detector results of the generated articles (p > 0.05).

A moderately positive correlation was found between the recency of the ChatGPT version used to generate an article and the results of the Corrector and ZeroGPT detectors (p < 0.001): the Corrector and ZeroGPT scores of the generated articles increased with newer ChatGPT versions. A highly positive correlation was found between the recency of the ChatGPT versions and the GPTZero results of the generated articles (p < 0.001). These findings indicate that the consistency of the GPTZero results for the generated articles improved as updated versions of ChatGPT were used.

The area under the ROC curve quantifies how well genuinely human-written texts can be separated from AI-generated ones. An AUC of 0.50 would indicate that AI-generated and published articles are indistinguishable, whereas a perfect test would have an AUC of 1.00. AUC values are interpreted as follows: 0.90–1.00 = excellent, 0.80–0.90 = good, 0.70–0.80 = fair, 0.60–0.70 = poor, and 0.50–0.60 = fail. In this context, the Corrector, ZeroGPT, and GPTZero results were found to be successful in distinguishing AI-generated texts from original texts, and the results were statistically significant (p < 0.05) (Fig. 3). Table 3 presents the ROC analysis results for the three models (Corrector, ZeroGPT, and GPTZero). For the Corrector model, the cut-off value was 79.32, with a sensitivity of 92.4% and a specificity of 90.8%. For the ZeroGPT model, the cut-off value was 75.3, with 94.4% sensitivity and 93.2% specificity. For the GPTZero model, the cut-off value was 31.5, with 100% sensitivity and 99.6% specificity.

Fig. 3.

Fig. 3

ROC curve for estimating the similarity of AI-generated texts to original texts based on Corrector, ZeroGPT, and GPTZero results

Table 3.

ROC analysis results for estimating the similarity of AI-generated texts to original texts based on Corrector, ZeroGPT, and GPTZero results

Test Variable   Area*   Std. Error   p         95% CI (Lower)   95% CI (Upper)   Cut-off   Sensitivity (%)   Specificity (%)
Corrector       0.96    0.01         < 0.001   0.95             0.98             79.32     92.4              90.8
ZeroGPT         0.98    0.01         < 0.001   0.97             0.99             75.3      94.4              93.2
GPTZero         1.00    0.00         < 0.001   1.00             1.00             31.5      100               99.6

*Area Under The Curve (AUC)
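The AUC and optimal cut-off values reported above can be derived directly from raw detector scores. A minimal pure-Python sketch (the scores in the tests are made-up toy values, not the study's data): AUC is computed as the probability that a randomly chosen AI-generated text outscores a randomly chosen human-written one, and the cut-off is chosen by maximizing Youden's J (sensitivity + specificity − 1), a common way to balance the two.

```python
def roc_points(scores_pos, scores_neg):
    """Sweep all observed thresholds; return (fpr, tpr, threshold) triples.
    scores_pos: detector scores for AI-generated texts (positives);
    scores_neg: detector scores for human-written texts (negatives)."""
    thresholds = sorted(set(scores_pos) | set(scores_neg), reverse=True)
    points = []
    for t in thresholds:
        tpr = sum(s >= t for s in scores_pos) / len(scores_pos)  # sensitivity
        fpr = sum(s >= t for s in scores_neg) / len(scores_neg)  # 1 - specificity
        points.append((fpr, tpr, t))
    return points

def auc(scores_pos, scores_neg):
    """AUC = probability a random positive outscores a random negative
    (ties count half); equivalent to the Mann-Whitney U formulation."""
    wins = sum((p > n) + 0.5 * (p == n) for p in scores_pos for n in scores_neg)
    return wins / (len(scores_pos) * len(scores_neg))

def youden_cutoff(scores_pos, scores_neg):
    """Threshold maximizing Youden's J = tpr - fpr."""
    return max(roc_points(scores_pos, scores_neg),
               key=lambda point: point[1] - point[0])[2]
```

An AUC of 0.50 corresponds to chance-level separation and 1.00 to perfect separation, matching the interpretation scale described in the text.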

An example of the scoring of the plagiarism detector and AI-output detectors is shown in Fig. 4.

Fig. 4.

Fig. 4

Interfaces of the plagiarism and AI-output detectors used: A) 'Plagiarism Detector' tool (https://plagiarismdetector.net); B) Corrector App (https://corrector.app/ai-content-detector/); C) ZeroGPT (https://www.zerogpt.com); D) GPTZero (https://app.gptzero.me). Due to the word limits of the detection tools, only the abstracts and introduction sections were used in this study, rather than the full articles

Discussion

AI text generators and detectors rely on similar underlying technologies and principles. Key among these are machine learning (ML) and natural language processing (NLP), which allow detection tools to analyze data and differentiate between content generated by artificial intelligence (AI) and that authored by humans. Certain characteristics make AI-output detectors capable of identifying AI-generated content. Perplexity and burstiness are two key metrics for distinguishing AI-generated content from human writing. Perplexity measures the "surprise" an AI model experiences with new text; predictable content is often AI-generated, while higher perplexity suggests human authorship. Burstiness evaluates variability in sentence structure, length, and complexity, with human writing displaying greater diversity compared to the uniformity of AI-generated text [8, 32, 33]. These metrics aid in identifying differences in unpredictability and variability between human and AI content. However, relying solely on AI-output detectors is not advisable [18, 23]. AI-output detectors cannot achieve perfect accuracy, as their functioning is predominantly based on probabilistic algorithms. Moreover, each detector is trained on distinct datasets, leading to potential variability in their outputs. Consequently, these detectors often produce differing results when analyzing the same content.
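The burstiness signal described above can be approximated very simply: one common proxy is the variability of sentence lengths within a text. The sketch below (an illustration of the general idea, not any specific detector's algorithm) uses the coefficient of variation of sentence lengths, so uniform sentence structure scores near zero and varied structure scores higher.

```python
import re
from statistics import mean, pstdev

def burstiness(text):
    """Coefficient of variation of sentence lengths (in words):
    higher values indicate more varied sentence structure, a loose
    proxy for the 'burstiness' metric used by AI-output detectors."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    lengths = [len(s.split()) for s in sentences]
    if len(lengths) < 2:
        return 0.0
    return pstdev(lengths) / mean(lengths)

uniform = "The cat sat here. The dog sat here. The bird sat here."
varied = "Stop. The committee, after months of deliberation, finally voted. Why?"
```

Here `burstiness(uniform)` is 0.0 (every sentence is four words), while `burstiness(varied)` is much higher; real detectors combine such stylistic signals with model-based perplexity estimates rather than relying on any single measure.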

Chatbot technology holds promise as a valuable tool when used responsibly. By assisting researchers with the writing and formatting process, it can help reduce the workload associated with preparing scientific manuscripts. This could be especially beneficial when paired with a researcher’s own scientific expertise, allowing them to focus more on the substance of their research rather than on writing mechanics. It can also be a powerful tool for the growing number of scientists who publish in a non-native language, which may increase fairness within the worldwide scientific community. By improving the quality and clarity of their work, it may help researchers from diverse linguistic backgrounds share their findings more effectively, leading to a more inclusive and accessible body of scientific knowledge.

As demonstrated in our study, all versions of ChatGPT successfully generated academic-style articles in less than a minute when provided with a well-defined title and topic, and remarkably, with only a title and correct input, it produced coherent abstracts reflecting key themes and plausible cohort sizes. All AI-generated texts were evaluated by two neurosurgeons and deemed suitable for a scientific article. This capability of ChatGPT underscores its potential as a valuable tool for scientists, aiding in the articulation of their knowledge in a more scholarly manner.

Despite its potential benefits, this technology also poses serious risks, particularly when misused to generate scientifically structured abstracts containing fabricated or misleading information for academic papers and conferences. ChatGPT’s ability to produce convincing content challenges the reliability of academic discourse, as reviewers often struggle to discern whether abstracts are genuine or entirely AI-generated, threatening the integrity of scientific research and public trust in published findings. Researchers must exercise caution, avoiding blind trust in AI-generated outputs, which may include inaccuracies, outdated information, or even fictitious statistics presented persuasively. Makiev et al. investigated the ability of orthopedic residents and professors to differentiate between real and AI-generated orthopedic abstracts. The residents correctly identified 34.9% of the abstracts, compared to 31.7% by the professors. The study highlights the challenges humans face in distinguishing AI-generated content [18]. In Gao et al.’s study, blind human reviewers correctly identified 68% of the AI-generated abstracts as generated and 86% of the original abstracts as genuine. However, they misclassified 32% of generated abstracts as real and 14% of original abstracts as generated. This study shows that humans can identify some summaries generated by ChatGPT, but the accuracy is not perfect [11]. To prevent such deceptive abstracts from being presented at scientific conferences, it is crucial to develop more advanced AI detection tools in the near future, as the current systems remain inadequate in identifying these false submissions. Also, some studies suggest that AI detectors should be integrated into the research editorial process, depending on the guidelines of the publisher or conference [2, 5, 11].

Recent literature offers important insights into the accuracy of AI-output detectors, presenting studies that demonstrate their success in identifying AI-generated summaries, as well as instances that reveal their shortcomings in detection [31]. In Gao et al.’s study, which included 50 generated abstracts and examined the accuracy of AI-output detectors, 17 (34%) of the abstracts scored below 50% on the AI-output detector; 5 of them scored below 1% [11]. In Hua et al.’s study, 14 AI-generated abstracts were assessed using two AI-output detectors. The average "fake" scores, indicating the likelihood of 100% AI-generated content, were 65.4% and 10.8% for the earlier and updated chatbot-generated summaries, respectively, as evaluated by the first detector, and 69.5% and 42.7% for the second detector [12]. In a study by Makiev et al. involving 21 abstracts (10 real and 11 generated), two AI-output detectors were used; the first correctly identified only 9/21 (42.9%) of the abstracts' authors, while the second identified 14/21 (66.6%) [18]. In our study, of the texts generated by ChatGPT 3.5, 59 received a score below 50% from GPTZero, none received a score below 50% from ZeroGPT, and 1 received a score below 50% from Corrector. For texts generated by ChatGPT 4, 2 received a score below 50% from both GPTZero and Corrector, while none received a score below 50% from ZeroGPT. Similarly, for texts generated by ChatGPT 4o, none scored below 50% on GPTZero or ZeroGPT, and 1 scored below 50% on Corrector. According to our results, 750 AI-generated abstracts were evaluated using three AI-output detectors, which demonstrated statistically significant success in predicting the likelihood that the AI-generated texts closely resembled original, human-written texts. Also, according to our results, the performance of all three detectors in evaluating generated articles improved as newer versions of ChatGPT were used.

Although AI-output detectors have useful applications, the frequency with which these tools misclassify original, unaided creative work as AI-generated raises significant concerns. Such errors could have far-reaching consequences, including career-impacting accusations against researchers whose work is entirely their own creation. In Gao et al.’s study, only 1 out of 50 original abstracts (2%) scored above 50% on the AI-output detector, indicating that most original abstracts were correctly identified as human-generated [11]. Rashidi et al. evaluated 14,400 abstracts published between 1980 and 2023, and found that the AI detector incorrectly identified up to 8% of known real abstracts as AI-generated text [23]. AI detection tools analyze text based on patterns, syntax, and structure. Since AI-generated content often exhibits repetitive and formulaic characteristics, these tools may occasionally misidentify certain human writing styles as AI-generated. Texts with highly uniform sentence structures or minimal tonal variation, even when authored by humans, can still be flagged as AI-generated. In our study, to test the predictive capabilities of these detection tools, a total of 250 articles from well-known scientific journals published before the release of ChatGPT were evaluated. Among these, 76 (30.4%) were scored above 50% by Corrector, and 40 (16%) by ZeroGPT. GPTZero demonstrated superior performance, as it did not score any original article above 50%. This finding highlights the current limitations of detection tools, underscoring the need for new or combined approaches to address these shortcomings. Therefore, AI content detectors should be regarded as supplementary tools rather than definitive solutions. While they can assist in content evaluation, overreliance on their assessments should be avoided.
Similar to ChatGPT, these tools are still in their early stages of development and should not be considered substitutes for human review and editing. Importantly, misclassification of original human-authored texts as AI-generated can lead to ethical challenges, including unjustified allegations of misconduct and reputational harm. To mitigate these risks, transparency in the use of AI detectors and clear editorial policies are essential to ensure fairness and protect academic integrity.

AI detection models are not always accurate, as they may misidentify human-written text as AI-generated or fail to distinguish between AI-generated and AI-paraphrased content. Therefore, they should not be the sole basis for taking negative actions against academics. There is no clear consensus on a specific indicator value for detecting AI-generated text. Turnitin’s AI writing detection model flags text as AI-generated when the percentage is between 20 and 100%, indicating content classified as AI-generated or AI-paraphrased using tools like word spinners. However, a higher rate of false positives has been observed when the detected percentage is between 1 and 20%. To avoid misclassification, AI detection scores in the range of 1% to 19% are not given any emphasis [34]. Among the cut-off values that maximized sensitivity and specificity in our study, the closest to this threshold was 31.5% for GPTZero.
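The tiered treatment of detector scores described above can be expressed as a simple decision rule. This is an illustrative sketch modeled on the thresholds reported in the text, not Turnitin's actual implementation:

```python
def flag_ai_score(score):
    """Illustrative flagging policy (assumed, based on the thresholds
    described in the text): scores of 20-100% are flagged as likely
    AI-generated or AI-paraphrased; scores of 1-19% fall in the
    high-false-positive range and are deliberately ignored; scores
    below 1% are treated as human-written."""
    if score >= 20:
        return "flagged"
    if score >= 1:
        return "ignored"
    return "not flagged"
```

The point of the middle "ignored" band is precisely the false-positive risk discussed above: acting on low but nonzero scores would disproportionately penalize genuinely human-authored texts.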

The rise of AI-generated manuscripts presents significant challenges to scientific publishing, particularly in terms of authorship, originality, and ethical responsibility. Large language models (LLMs) such as ChatGPT can produce coherent and well-structured abstracts, often including fabricated data that mimics scientific reasoning. These texts typically evade plagiarism detection systems, as they are generated independently rather than copied from existing sources. In our study, authenticity scores were remarkably high—98.95% for GPT-3.5, 99.35% for GPT-4, and 99.29% for GPT-4o—highlighting this limitation. While AI-output detectors may assist experienced reviewers, they remain imperfect and should not replace human judgment. Beyond technical concerns, the integration of LLMs into scientific writing also raises important ethical issues, including transparency, authorship responsibility, data protection, and intellectual property rights [21]. Moreover, there is a risk that users may unintentionally commit plagiarism when relying on AI-generated content lacking proper source attribution or factual accuracy [4, 10, 21]. These challenges underline the need for clear editorial policies and robust ethical guidelines that define acceptable uses of AI in academic writing. Ensuring accurate attribution, protecting intellectual property, and promoting transparency in editorial workflows will be essential to uphold academic integrity and the credibility of scientific communication as LLMs become more integrated into research environments.

Looking ahead, the growing use of AI-generated content in scientific publishing calls for a comprehensive approach. Our findings indicate that AI detection tools alone are not always reliable. We recommend a hybrid evaluation framework that includes human review to ensure fair and accurate decisions. In the future, developing transparent and interpretable detection models tailored to scientific writing would improve trust in these systems. Future research should therefore assess not only stylistic and structural features but also the factual veracity of AI-generated texts, including their citation behavior and their tendency to hallucinate references. As LLMs improve, establishing ethical, technical, and procedural safeguards that support their responsible use in academic writing will become more important than ever.
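One way the hybrid framework recommended above could be operationalized is to act automatically only on confident detector scores and route borderline cases to human reviewers. The sketch below illustrates this triage idea; the thresholds and action labels are hypothetical illustrations, not validated values from our study.

```python
# Sketch of a hybrid detector-plus-human triage rule.
# The 20/80 thresholds and action labels are hypothetical, for illustration only.

def triage(ai_score, low=20.0, high=80.0):
    """Map a detector's AI-likelihood score (0-100%) to an editorial action.

    Scores below `low` are treated as likely human-written, scores above
    `high` as likely AI-generated, and everything in between is deferred
    to a human reviewer rather than acted on automatically.
    """
    if ai_score < low:
        return "accept as human-written"
    if ai_score > high:
        return "flag for editorial investigation"
    return "refer to human reviewer"

for score in (5.0, 50.0, 95.0):
    print(score, "->", triage(score))
```

The key design choice is the deliberately wide middle band: the detector never makes a negative decision on its own in the score range where, as our results suggest, its error rate is highest.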

Beyond the technical performance of AI detection tools, the integration of AI-generated content into academic and clinical applications requires ethical considerations informed by existing literature. Keskinbora et al. emphasize the importance of developing AI systems with transparency, auditability, and reliability features to prevent unfair outcomes based on biases or data misuse, and argue that interdisciplinary collaboration, including ethicists, should shape the AI development process from the outset [15]. In line with our findings, Mathiesen and Broekman argue that AI and machine learning ought to serve as complementary aids rather than substitutes for human decision-making, highlighting the ethical necessity of safeguarding data integrity, promoting equitable access to resources, and preserving human accountability [35]. Stengel et al. also support the view that human judgment remains more reliable. They evaluated large language models (LLMs) on European Neurosurgery Board exam questions and found that while these models could outperform humans in theory-based tasks, they performed worse in interpretation-heavy questions, expressing concern about overreliance on AI for evaluation and certification [29]. In the future, as the abilities of artificial-intelligence systems to enhance processes, create new knowledge, and aid clinical decision-making improve, they are expected to increasingly rival human expertise, raising fundamental concerns of professional accountability and system reliability [6]. Indeed, previous studies have shown that the primary concerns regarding AI applications are reliability and accountability, followed by privacy, equity, and sustainability [25]. An open and comprehensive regulatory framework will be necessary to adequately address these challenges. At a broader level, policymakers and regulatory bodies are aware of the urgency of these issues, and related legislative work is ongoing.
We believe that systematic training and the development of educational resources, in collaboration with national and international professional organizations and experts in data science, law, and policy-making, will be necessary to ensure the responsible and successful integration of AI into the evolving fields of neurosurgery and scientific communication.

Limitations

In our study, each prompt was submitted to ChatGPT only once, and the initial response was used for analysis without any modification. No paraphrasing tools or other forms of post-processing were applied to the generated responses. However, AI-generated content may evade detection under certain conditions. Specifically, when the same prompt is submitted multiple times, or when the output is altered using rephrasing tools, the ability of detection systems to identify AI-generated text may be reduced. Anderson et al. demonstrated that paraphrasing AI-generated texts decreased detection scores from 99.52% to 0.02% in one trial and from 99.98% to 61.96% in a second, highlighting a significant limitation of current detection tools [2]. Furthermore, due to the maximum input limit of the AI-output detectors utilized, only the abstract and introduction sections of the original papers could be analyzed; to enable an accurate comparison, only the abstract and introduction sections were generated by the AI. Moreover, this study focused solely on outputs generated by ChatGPT. At the time of data collection, other large language models were either not widely adopted or lacked the stability and accessibility necessary for large-scale analysis. Future studies should aim to evaluate the performance and detectability of outputs from these emerging models as access and tool development continue to evolve. Another limitation of our research is the absence of a systematic factual accuracy assessment of the generated abstracts and introductions. While structural coherence and novelty were the primary objectives of this study, factual consistency is a significant aspect—particularly in scientific text—that merits thorough examination. Due to the scale of our dataset (750 texts), a reliable factuality assessment would require topic-specific expert review, which was beyond the scope of the present study.
Future research should consider incorporating factual verification metrics alongside linguistic and stylistic evaluations.

Conclusion

AI-generated text, such as that produced by ChatGPT, can ease the writing process by offering drafts for scientists to refine; however, it must be carefully reviewed for factual accuracy, as inaccuracies could undermine the credibility of scientific publications. While AI-output detectors may serve as supplementary tools in peer review or abstract evaluation, they often misclassify texts and require improvement. The scientific community continues to debate the ethical use and legitimacy of AI-assisted writing, with no definitive consensus on best practices or permissible use. Clear guidelines are essential for integrating AI tools into scientific publishing, ensuring they act as supportive resources rather than fully reliable solutions.

Author contributions

All authors contributed to the study conception and design. G.E, A.E and A.G wrote the main manuscript text. G.E and A.E prepared the figures. Data collection was performed by A.D.Ç, B.A, C.Ş, G.B, İ.K and Z.B.M. Material preparation and data analysis were performed by G.E, A.E, B.G.E, Ş.K.E, T.S.B and U.T.S. All authors reviewed the manuscript. All authors read and approved the final manuscript.

Funding

This study did not receive any funding or financial support during the preparation of this manuscript.

Data availability

No datasets were generated or analysed during the current study.

Declarations

Ethics approval

According to the policy of the institution where this research was conducted, ethics committee approval is not mandatory for this kind of study.

Competing interests

The authors declare no competing interests.

Disclosures

DeepL and ChatGPT4o were used for language editing.

Footnotes

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

  • 1. Alkaissi H, McFarlane SI (2023) Artificial hallucinations in ChatGPT: implications in scientific writing. Cureus 15(2):e35179
  • 2. Anderson N, Belavy DL, Perle SM, Hendricks S, Hespanhol L, Verhagen E, Memon AR (2023) AI did not write this manuscript, or did it? Can we trick the AI text detector into generated texts? The potential future of ChatGPT and AI in sports & exercise medicine manuscript generation. BMJ Open Sport Exerc Med 9(1):e001568. 10.1136/bmjsem-2023-001568
  • 3. Atici MA, Sagiroglu S, Celtikci P, Ucar M, Borcek AO, Emmez H et al (2020) A novel deep learning algorithm for the automatic detection of high-grade gliomas on T2-weighted magnetic resonance images: a preliminary machine learning study. Turk Neurosurg 30(2):199–205
  • 4. Bassiri-Tehrani B, Cress PE (2023) Unleashing the power of ChatGPT: revolutionizing plastic surgery and beyond. Aesthet Surg J 43(11):1395–1399
  • 5. Bhatia P (2023) ChatGPT for academic writing: a game changer or a disruptive tool? J Anaesthesiol Clin Pharmacol 39:1–2
  • 6. Boaro A, Mezzalira E, Siddi F, Bagattini C, Gabrovsky N, Marchesini N et al (2025) Knowledge, interest and perspectives on artificial intelligence in neurosurgery. A global survey. Brain Spine 5:104156
  • 7. Briganti G, Le Moine O (2020) Artificial intelligence in medicine: today and tomorrow. Front Med (Lausanne) 5:27
  • 8. Cingillioglu I (2023) Detecting AI-generated essays: the ChatGPT challenge. Int J Inf Learn Technol 40(3):259–268
  • 9. Duman EA, Sagiroglu S, Celtikci P, Demirezen MU, Borcek AO, Emmez H et al (2022) Utilizing deep convolutional generative adversarial networks for automatic segmentation of gliomas: an artificial intelligence study. Turk Neurosurg 32(1):16–21
  • 10. Dutton JJ (2023) Artificial intelligence and the future of computer-assisted medical research and writing. Ophthalmic Plast Reconstr Surg 39:203–205
  • 11. Gao CA, Howard FM, Markov NS, Dyer EC, Ramesh S, Luo Y et al (2022) Comparing scientific abstracts generated by ChatGPT to original abstracts using an artificial intelligence output detector, plagiarism detector, and blinded human reviewers. 10.1101/2022.12.23.521610
  • 12. Hua HU, Kaakour AH, Rachitskaya A, Srivastava S, Sharma S, Mammo DA (2023) Evaluation and comparison of ophthalmic scientific abstracts and references by current artificial intelligence chatbots. JAMA Ophthalmol 141(9):819–824
  • 13. Huang KT, Mehta NH, Gupta S, See AP, Arnaout O (2024) Evaluation of the safety, accuracy, and helpfulness of the GPT-4.0 large language model in neurosurgery. J Clin Neurosci 123:151–156
  • 14. Jeblick K, Schachtner B, Dexl J, Mittermeier A, Stüber AT, Topalis J et al (2024) ChatGPT makes medicine easy to swallow: an exploratory case study on simplified radiology reports. Eur Radiol 34(5):2817–2825
  • 15. Keskinbora KH (2019) Medical ethics considerations on artificial intelligence. J Clin Neurosci 64:277–282
  • 16. Korngiebel DM, Mooney SD (2021) Considering the possibilities and pitfalls of Generative Pre-trained Transformer 3 (GPT-3) in healthcare delivery. NPJ Digit Med 4(1):93
  • 17. Kozel G, Gurses ME, Gecici NN, Gökalp E, Bahadir S, Merenzon MA et al (2024) Chat-GPT on brain tumors: an examination of artificial intelligence/machine learning’s ability to provide diagnoses and treatment plans for example neuro-oncology cases. Clin Neurol Neurosurg 239
  • 18. Makiev KG, Asimakidou M, Vasios IS, Keskinis A, Petkidis G, Tilkeridis K et al (2023) A study on distinguishing ChatGPT-generated and human-written orthopaedic abstracts by reviewers: decoding the discrepancies. Cureus 15(11):e49166
  • 19. Nernekli K, Persad AR, Hori YS, Yener U, Celtikci E, Sahin MC, Sozer A, Sozer B, Park DJ, Chang SD (2024) Automatic segmentation of vestibular schwannomas: a systematic review. World Neurosurg 188:35–44. 10.1016/j.wneu.2024.04.145
  • 20. O’Connor S, ChatGPT (2023) Open artificial intelligence platforms in nursing education: tools for academic progress or abuse? Nurse Educ Pract 66:103537
  • 21. Pressman SM, Borna S, Gomez-Cabello CA, Haider SA, Haider C, Forte AJ (2024) AI and ethics: a systematic review of the ethical considerations of large language model use in surgery research. Healthcare (Basel) 12(8):825. 10.3390/healthcare12080825
  • 22. Shen Y, Heacock L, Elias J, Hentel KD, Reig B, Shih G et al (2023) ChatGPT and other large language models are double-edged swords. Radiology 307(2):e230163
  • 23. Rashidi HH, Fennell BD, Albahra S, Hu B, Gorbett T (2023) The ChatGPT conundrum: human-generated scientific manuscripts misidentified as AI creations by AI text detection tool. J Pathol Inform 14:100342
  • 24. Rashidi HH, Tran NK, Betts EV, Howell LP, Green R (2019) Artificial intelligence and machine learning in pathology: the present landscape of supervised methods. Acad Pathol 6:2374289519873088. 10.1177/2374289519873088
  • 25. Reddy S (2022) Explainability and artificial intelligence in medicine. Lancet Digit Health 4(4):e214–e215. 10.1016/S2589-7500(22)00029-2
  • 26. Sahin MC, Sozer A, Kuzucu P, Turkmen T, Sahin MB, Sozer E et al (2024) Beyond human in neurosurgical exams: ChatGPT’s success in the Turkish neurosurgical society proficiency board exams. Comput Biol Med 169
  • 27. Sevgi UT, Erol G, Doğruel Y, Sönmez OF, Tubbs RS, Güngor A (2023) The role of an open artificial intelligence platform in modern neurosurgical education: a preliminary study. Neurosurg Rev 46(1):86
  • 28. Sozer A, Borcek AO, Sagiroglu S, Poshtkouh A, Demirtas Z, Karaaslan MM et al (2023) The first case of glioma detected by an artificial intelligence algorithm running on real-time data in neurosurgery: illustrative case. J Neurosurg Case Lessons 5(19):CASE22536
  • 29. Stengel FC, Stienen MN, Ivanov M, Gandía-González ML, Raffa G, Ganau M et al (2024) Can AI pass the written European board examination in neurological surgery? Brain Spine 4:102765
  • 30. Thirunavukarasu AJ, Ting DSJ, Elangovan K, Gutierrez L, Tan TF, Ting DSW (2023) Large language models in medicine. Nat Med 29(8):1930–1940
  • 31. Weber-Wulff D, Anohina-Naumeca A, Bjelobaba S, Foltýnek T, Guerrero-Dib J, Popoola O et al (2023) Testing of detection tools for AI-generated text. Int J Educ Integr 19(1):1–39
  • 32. Liao W, Liu Z, Dai H, Xu S, Wu Z, Zhang Y, Huang X, Zhu D, Cai H, Li Q, Liu T, Li X (2023) Differentiating ChatGPT-generated and human-written medical texts: quantitative study. JMIR Med Educ 9:e48904. 10.2196/48904
  • 33. Taguchi K, Gu Y, Sakurai K (2024) The impact of prompts on zero-shot detection of AI-generated text. Available from: http://arxiv.org/abs/2403.20127. Accessed 29 Mar 2025
  • 34. Turnitin (2024) AI writing detection in the classic report view. Available from: https://guides.turnitin.com/hc/en-us/articles/28457596598925-AI-writing-detection-in-the-classic-report-view. Accessed 19 Feb 2025
  • 35. Mathiesen T, Broekman M (2022) Machine learning and ethics, pp 251–256. Available from: https://link.springer.com/10.1007/978-3-030-85292-4_28



Articles from Acta Neurochirurgica are provided here courtesy of Springer
