Abstract
The integration of large language models (LLMs) in academic research has transformed traditional research methodologies. This review investigates the current state, applications, and limitations of LLMs, particularly ChatGPT, in medical and scientific research. I performed a systematic review of recent literature and LLM development reports in artificial intelligence-assisted research tools, including commercial LLM services (GPT-4o, Claude 3, Gemini Pro) and specialized research platforms (Genspark, Scispace). I evaluated their performance, applications, and limitations across stages of the research process. Recent advancements in LLMs shows potential for improving research efficiency, particularly in literature review, data analysis, and manuscript preparation. Performance comparison revealed varying strengths: GPT-4o and o1 outperformed in the overall area, Claude 3 in writing and coding, and Gemini Pro in multimodal processing. Therefore, it is important to choose and use each model wisely according to its advantages. However, hallucination risks, inherent biases, plagiarism concerns, and privacy issues are concerns in LLMs. The emergence of Retrieval-Augmented Generation models and specialized research tools has improved accuracy and current information access. LLMs offer effective support for research productivity, but they should serve as complementary tools rather than primary research drivers. The successful application of these tools depends on a thorough understanding of their limitations, strict adherence to ethical guidelines, and preservation of researcher autonomy.
Keywords: Artificial intelligence, Natural language processing, Biomedical research, Generative pre-trained transformer, Research ethics
Highlights
Large language models, including ChatGPT, are transforming academic research by enhancing efficiency in literature review, data analysis, and manuscript preparation. Frontier generative AI models show potential advantages in reasoning, writing, and multimodal processing. However, ensuring ethical awareness, rigorous validation of AI-generated outputs, and preservation of researcher autonomy is essential to maintain research integrity and academic standards.
Introduction
The introduction of ChatGPT in late 2022 represented a significant advancement in artificial intelligence (AI) applications within all areas, including healthcare and biomedical research [1-4]. This technology showed remarkable natural language processing (NLP) capabilities and contextual understanding, with potential implications for clinical practice, medical education, and healthcare service [3,5]. LLMs have seen significant advancement, demonstrating expert-level performance in domains ranging from professional licensing examinations to technical documentation [6-8].
Additionally, there have been numerous reports of increased productivity in general business operations. Research conducted by the Nielsen Norman Group demonstrated meaningful productivity enhancements: 125% improvement in programming efficiency, 62% acceleration in document creation, and 14% advancement in customer support effectiveness. Microsoft's comprehensive analysis of Copilot implementation further validates these findings, with 70% of users reporting increased productivity. The research indicates that users completed routine tasks 30% faster, while 85% reported improved draft quality, with 77% of participants considering the tool indispensable following adoption. Additional efficiency gains were clear through decreased time spent managing emails and faster file navigation (75%) [9-12].
The academic research community has increasingly used these technologies in its methodologies, particularly in data analysis, visualization, and manuscript preparation. Several research teams have created AI systems that can help with many parts of the research process. There was even a study by an AI scientist who automated the entire research process and succeeded in planning and writing papers [13]. However, major challenges remain, especially concerning the accuracy of AI-generated content. These AI systems often generate convincing but inaccurate information, leading to citation errors and paper retractions in academic work [14]. This issue originates from their recognition patterns in language rather than true understanding of data. Recent advances have addressed many initial limitations by integrating language processing with verified information sources and specialized research tools incorporating citation database functionality. Although these improvements have enhanced the reliability of AI-assisted research, meticulous human oversight remains essential to ensure academic integrity.
This review investigates the evolution and impact of language models in medical research practices. It analyzes current capabilities, practical applications, and emerging best practices for academic application and provides guidance for utilizing these powerful tools to maintain scientific principles and ethical suitability.
Introduction of commercial large language model services
1. Generative pre-trained transformer
Generative pre-trained transformer (GPT) remains the most widely utilized service among competing models, demonstrating superior functionality and performance in various tasks and domains, from coding to content creation. A version of GPT-3 fine-tuned by reinforcement learning from human feedback (helped enhance instruction- following and generated safer, more helpful responses. OpenAI released ChatGPT in November 2022. Prior to the release of GPT-4, OpenAI introduced several beta features, including the code interpreter, which was later renamed advanced data analysis. This feature has received attention for its ability to directly upload and analyze data, as well as execute and display coding results within the GPT platform. Subsequently, GPT-4 was launched in March 2023 as a multimodal model capable of processing text and image inputs, incorporating all these beta features into a comprehensive system.
Access to GPT-4 was available through a paid subscription (ChatGPT Plus) and an application programming interface (API). Later in 2023, OpenAI released GPT-4 Turbo, which came with significant upgrades including a 128K-token context length and knowledge base extending to April 2023. These improvements allowed the model to process more inputs and access more recent information. In May 2024, OpenAI introduced GPT-4o, a revolutionary model that integrates text, audio, image, and video processing capabilities within a single neural network. This model maintains the text processing performance of GPT-4 Turbo while being twice as fast and half the cost, along with significantly improved real-time conversational abilities via enhanced audio input and output functionalities, further improvements in understanding across 50 languages, offering a more unified and natural interaction experience.
In December 2024, OpenAI released the o1 Pro model with enhanced reasoning capabilities at $200 after a 12-day update period, significantly improving multilingual responses and casual query handling. The enhanced reasoning of o1 Pro has attracted much attention for its deep insights into research planning and problem-solving. However, it does not offer practical features such as Canvas-based writing and editing or code-based statistical analysis and DALLE functions that produce images from text. Therefore, if a user wants to conduct statistical analysis, they must use the previous 4o model.
2. Claude
Anthropic, founded primarily by previous OpenAI researchers, is an AI safety and research company. Their model Claude, launched in March 2023 to compete with GPT, has shown outstanding coding and writing capabilities, further improved with the release of Claude 3 in March 2024.
The artifact feature, added to GPT 3.5 Sonnet, received praise from users for displaying visualized coding results through JavaScript. In October 2024, Claude showed artificial general intelligence potential through demonstrations of control of virtual computers to complete tasks. Although GPT remains dominant, the community recognizes Claude's strong coding abilities. A new data visualization feature, currently in beta preview, provides data analysis capabilities, handles Excel files in commaseparated values (CSV) format, and includes multimodal support for processing images and files. However, it has limited web searching capabilities and experiences frequent hallucinations.
3. Perplexity
While GPT and Claude have significantly reduced their hallucination rates recently, their initial limitations in accuracy were less suitable for academic research applications. Perplexity emerged as a prominent solution to address this gap.
Perplexity AI, established in 2022, is an AI-powered search engine designed to provide accurate and reliable responses by conducting real-time searches across multiple sources with proper citations. The platform's key advantages include real-time information searching, enhanced credibility through transparent source attribution, and the ability to perform focused searches within specific platforms. For instance, subscribers to the premium model can utilize academic mode, which exclusively sources responses from peer-reviewed publications, eliminating potentially unreliable information from blogs, general internet sources, and Wikipedia to support scholarly accuracy and reliability. Furthermore, the Page feature enables users to compile multiple well-referenced responses into a comprehensive summary document, facilitating thorough literature reviews and background research prior to manuscript preparation.
Limitations include restricting certain advanced features to paid subscription models and the inherent potential for generating inaccurate information, characteristic of large language model (LLM) systems. These attributes and constraints suggest that Perplexity AI should be utilized selectively based on specific user requirements and research objectives.
4. Gemini
Since its initial release in December 2023, Google's Gemini model has rapidly developed through Bard updates and the Gemini 1.5 release, culminating in the latest version announcement at Google I/O in May 2024. Notably, its multimodal capabilities and extensive context processing capacity of up to 2 million tokens represent significant advantages, demonstrating superior performance compared to GPT-4 across various benchmarks.
Gemini Pro has shown remarkable improvement over its predecessor, Bard, gaining recognition as an actual multimodal model by introducing the more efficient Gemini 1.5 Flash and its ability to process real-time voice, text, and image inputs simultaneously. The recently introduced Gemini Pro 1.5 Deep Research feature has received positive evaluations for its potential contribution to academic research, offering high-quality literature reviews with precise references.
Integration with existing Google services, including Google Docs and Slides, provides enhanced document creation and management functionality. Additionally, while Google AI Studio offers limited access to the latest models for nonsubscribers, its "Stream Realtime" feature enables advanced virtual assistance capabilities through real-time visual information processing via mobile devices and cameras, facilitating interactive dialogue-based interactions.
5. Microsoft (MS) Copilot
GPT models became accessible through the Microsoft Azure platform. In contrast to existing web-based services like GPT or Claude, MS Copilot differentiates itself by integrating AI capabilities into familiar productivity tools such as Word, PowerPoint, and Excel, offering a more user-friendly approach to AI-assisted workflows. MS Copilot was expected for its potential use in academic work, especially for writing documents, organizing data, and creating visual presentations. However, current reviews show that it does not achieve this aim as well as other similar tools. Despite this limitation, its main advantage is working directly within familiar Microsoft Office programs like Word and Excel. This integration with everyday work tools suggests that MS Copilot could become more useful than other AI services as it continues to improve.
Comparative analysis of leading AI-language models
1. Performance
GPT-o1 pro, Claude 3.5 sonnet, and Gemini 2.0 pro experimental are currently evaluated as the highest-performing models. Claude 3 Opus demonstrated strengths in reasoning, mathematics, and coding, while Gemini 2.0 Pro excelled in long-context and multimodal processing. GPT-4o and o1 continue to demonstrate excellence in multifunction and stability.
2. Pricing
Claude 3 Haiku, Gemini 1.5 Flash, and GPT-3.5 Turbo proved economical for API usage. Consumer subscription services, including ChatGPT Plus, Perplexity Pro, and Gemini Advanced, maintain similar pricing at approximately $20 monthly.
3. Real-time information
Perplexity, leveraging real-time web search capabilities, demonstrated superior performance in maintaining current information. Gemini Pro, through its integration with Google Search, effectively reflected relatively recent information. In contrast, GPT-4 Turbo (trained through April 2023), GPT-4o (October 2023), and Claude 3 (August 2023) showed relatively limited access to current information due to their training cutoff dates.
4. Specialized capabilities
Gemini 1.5 Pro offers extended context processing of up to 2 million tokens. At the same time, Claude 3 provides context windows of 1M tokens (Opus model) and 200k tokens (other models), and GPT-4 Turbo handles 128k tokens. Multimodal functionality is supported across Gemini, GPT-4o, and Claude 3 models. Claude emphasizes safety and ethics through Constitutional AI, while Perplexity and Bard prioritize real-time web search capabilities.
Each model maintains advantages, disadvantages, and specialized features, requiring careful consideration of intended use, budget limitations, and required functionalities in model selection. Perplexity is optimal for current information and concise responses, and Claude 3 outperforms in safety and extended context processing. GPT-4 Turbo and GPT-4o offer superior adaptability and stability, and Gemini models are particularly suitable for long-context, multimodal applications with Google ecosystem integration. From a cost-effectiveness perspective, GPT-3.5 Turbo, Claude 3 Haiku, Gemini 1.5 Flash, and Perplexity showed superior value. Based on the technological advances discussed, model selection requires comprehensive evaluation of benchmark results, user feedback, and ethical considerations (Table 1).
Table 1.
Recommendation of use artificial intelligence (AI) tools for research stages
| Research stage | Recommended AI models and key advantages |
|---|---|
| Research planning and hypothesis generation | GPT-4o/O1 Pro: enhanced reasoning capabilities for innovative research ideas; advanced problem-solving in mathematics and coding; creative hypothesis generation |
| Claude 3: systematic research methodology development; structured experimental design guidance; comprehensive study planning frameworks | |
| Gemini Pro 2.0 experimental: high-quality inference, web searching, creative hypothesis generation | |
| Literature review | Gemini Pro 1.5: deep research function for literature review; 2M token context |
| Perplexity: real-time academic database search; exclusive academic mode for peer-reviewed publications; accurate citation generation | |
| GPT with Consensus/Scispace: PDF analysis and detailed Q&A capabilities(copilot); systematic literature summarization; integration with reference management tools (Zotero, Endnote) | |
| Genspark autopilot agent: broad-spectrum web searching; timesaving task | |
| Stanford STORM: multiagent conversation-based contents generation; accurate references, PDF generation | |
| Data analysis and visualization | GPT-4o/O1 pro (only code request): multilanguage programming support (R, Python, SPSS); complete code documentation; advanced statistical analysis function |
| Claude 3: superior JavaScript-based visualizations; enhanced data visualization design; artifact feature for interactive graphics | |
| Manuscript preparation | Claude 3: Superior academic writing capabilities; maintenance of scholarly tone; enhanced logical flow |
| GPT-4o: multilingual support; integration with document formats (Word/PPT/Excel); Structured section organization | |
| Editing and proofreading | Claude 3: Grammar correction; academic paraphrasing suggestions; consistency checking |
| MS Copilot: direct integration with Microsoft Office; real-time editing capabilities; format standardization |
Potential applications of AI in academic research
1. Hypothesis generation and research design
AI technology has recently created new ways to plan and design research studies. ChatGPT shows excellent capabilities at identifying research gaps and generating creative research ideas through its systematic analysis of scientific literature. Particularly, the GPT o1 pro model, which has enhanced reasoning capabilities, has provided solutions through previously unthought of methods or insights, even when faced with complex mathematical or coding problems.
When requesting abstracts or research ideas for specific conferences or topics, ChatGPT can generate a variety of novel ideas. As with any AI interaction, the more specific and detailed is the prompt, the more targeted and useful will be the response. Some examples of ChatGPT capabilities are listed.
• Generate abstract ideas suitable for pediatric surgery conferences based on existing research.
• Assign roles such as medical researcher and medical article writer.
• Request in structured document format (JSON format) rather than free text.
By applying these techniques, scholars can utilize LLMs for research idea generation. Through step-bystep questioning, researchers can expand their thinking to develop outlines, establish research directions, synthesize existing research, and conduct preliminary investigations before beginning main research activities.
LLM is also a powerful tool for helping researchers plan their studies. It provides detailed guidance on study design elements, including participant selection, sampling strategies, data collection methods, and analytical approaches. For example, if a researcher asks it to design a randomized controlled trial, it offers comprehensive recommendations for randomization procedures, control group establishment, intervention protocols, outcome measurements, sample size calculations, and statistical analyses.
As a multimodal model with image recognition capabilities, it can analyze uploaded flow chart figures that summarize research methodologies and automatically generate corresponding methods based on visual content (Fig. 1).
Fig. 1.
Artificial intelligence generation of study design and methods sections for “Evaluating AI-Assisted Tacrolimus Dosage Prediction in Liver Transplantation” using image data with a prompt. AI, artificial intelligence; SNUH, Seoul National University Hospital.
2. Applications of LLM service in systematic literature review
AI has completely changed research of academic literature. ChatGPT offers researchers an innovative approach to efficiently navigate extensive scholarly databases, transcending traditional keyword-based search limitations. NLP capabilities facilitate intuitive, context-aware literature retrieval. For instance, researchers can query complex topics such as "recent advancement in AI-driven drug synthesis" across major scientific databases, including PubMed, Scopus, and Web of Science, obtaining comprehensive summaries of related abstracts and findings. This functionality enables rapid identification of core concepts, research trends, and key contributors in the field.
However, many researchers frequently become frustrated and discontinue using LLMs in their research due to the limitation known as the hallucination problem. LLMs may produce incorrect or unverified content when they generate new text from their training data. This creates significant risks in academic research, potentially leading to the citation of inaccurate references or the presentation of faked research findings. There are several methods to overcome this.
1) GPTs (Consensus, Scispace)
GPT is a general model and does not perform specific functions. GPTs is a service available within ChatGPT that allows users to configure processes to perform specific functions. Commercial services also operate their apps within GPTs, and individuals can create their GPTs according to their preferences. Within the research tab, well-known paper database search services like Consensus and Scispace are implemented as GPTs.
Consensus is an AI-based service that summarizes and searches for key content in papers, helping researchers more efficiently review academic literature. This service is integrated with major academic databases like PubMed and uses natural language queries to find relevant papers and extract key findings. To minimize the hallucination problem, Consensus provides sources for all information. It also has a voting feature for research topics with debates, revealing supported hypotheses among existing studies.
Scispace offers a similar function, with the feature of directly uploading, analyzing, and conducting Q&A with paper PDFs. It also provides templates and formatting tools for paper writing. The paid model has recently offered various features such as multiple language selection and integration with citation tools like Zotero, paraphraser, AI writer, AI detector, and other functions to assist research.
When using these services through ChatGPT GPTs, hallucinations are reduced, and citations to source articles are provided with greater accuracy than with standard ChatGPT. However, since it primarily uses opensource databases, it is difficult to access papers from closed databases requiring paid subscriptions, and this can lead to unintended hallucinations due to insufficient information. To overcome this, you can either add specific paid database papers or download and upload all related research PDFs, allowing a user to appropriately combine traditional paper writing methods with LLM usage.
2) Gemini 1.5 Pro with Deep Research
Before the last update in December 2024, Gemini demonstrated relatively low performance compared to existing models like ChatGPT and Claude, resulting in limited attention. However, the recent update introducing the Gemini Pro 1.5 with Deep Research and the 2.0 Experimental model has significantly enhanced performance, leading to consistently positive user feedback across research and various other domains. This improvement is attributed to integrating Gemini's advanced reasoning capabilities with a 1M token context window, enabling the generation of comprehensive reports that provide valuable and accessible insights.
Building on Google's dominance in pre-AI search technology, the Deep Research feature reduces hallucinations by combining traditional search capabilities with advanced reasoning and is an invaluable tool for preliminary literature review. This access is primarily limited to advanced model subscribers; users can try a free version with some restrictions through the Google AI Studio website.
3) Genspark autopilot agent
The Genspark Autopilot Agent, introduced in 2024 by Genspark, a pioneering AI-focused company, leverages cutting-edge asynchronous processing to automate tasks like research, fact-checking, and data analysis. Genspark, known for its innovative approach to AI-driven solutions, aims to enhance productivity by combining real-time data collection, transparent sourcing, and community-driven result sharing. This tool has been instrumental in streamlining workflows and providing accurate, up-to-date insights for professionals across various domains.
When requesting information on a specific topic, this service initiates a background investigation process distinct from conventional services. After conducting a comprehensive analysis for approximately 3–5 minutes, it notifies users of completion via email. It performs comprehensive literature reviews autonomously, like a research assistant or staff member. Upon completion, the report includes detailed metrics regarding the number of documents reviewed, complete source citations, and a comparative analysis of time efficiency, demonstrating a reduction in research time compared to traditional manual methods (Fig. 2).
Fig. 2.
Genspark autopilot search results on “VAE vs. GAN for Structured Data Synthesis.” VAE, Variational Autoencoders; GAN, Generative Adversarial Networks.
4) Stanford STORM
STORM, developed by a Stanford University research team, represents an innovative and transformative approach to document creation that addresses the challenges in writing articles from web searches [15]. The system focuses on the crucial pre-writing phase of research and outline preparation, functioning as a multiagent collaborative system where autonomous agents engage in chat-like dialogues to gather and validate information.
These agents simulate expert-level conversations, each playing distinct roles such as researcher, fact-checker, and editor. They work to gather diverse perspectives and curate information from reliable sources. Through this interactive agent-to-agent dialogue system, STORM synthesizes comprehensive research findings into well-structured documents, effectively combining the depth of expert knowledge with the efficiency of automated information processing. This innovative methodology enables the system to automatically generate thoroughly researched, well-documented content while maintaining high accuracy and credibility in the final output (Fig. 3).
Fig. 3.
Stanford STORM search results: multiagent approach for surgical video quality evaluation artificial intelligence models.
3. Data collection and analysis
ChatGPT's advanced data analysis function enables data analysis and visualization in medical research. The platform supports multiple programming languages essential for medical statistics, including R, SPSS, and Excel VBA, with expertise in Python for machine learning applications. The system efficiently generates statistical code and provides visualization outputs upon request, facilitating data analysis in medical research.
Data can be uploaded in Excel and CSV formats, enabling statistical analysis. This tool makes it easy to organize data and create visual elements like charts and graphs for research papers. It also includes missing value identification through overview functions, statistical processing through data preprocessing, and table and figure generation.
However, when handling medical data, researchers should avoid uploading raw data or any information that could potentially identify individuals due to privacy concerns and data protection regulations. Additionally, to prevent uploaded data from being incorporated into model training datasets, researchers should disable the "improve the model for everyone" option in the settings tab before proceeding with data analysis.
Furthermore, when working with unprocessed data, GPT may perform preprocessing operations autonomously, which could affect data integrity. Therefore, preprocessed datasets are recommended to maintain analytical consistency. To address reproducibility concerns that may arise during statistical analyses, researchers should request whole Python code documentation of successful analytical procedures. This code can then be backed up and executed locally or through cloud-based services such as Google Colab, ensuring consistent and reliable statistical outcomes.
Claude's data analysis function can be accessed through its beta preview option, offering an advanced analytical function comparable to GPT. Its JavaScript-based artifact function enables superior visualization designs compared to GPT. However, several important considerations should be noted. (1) The system occasionally loses track of previously input data, resulting in need for periodic verification of dataset usage. (2) Even for paid users, conversation limits may interrupt ongoing analyses, requiring regular backup procedures. (3) The platform exclusively accepts CSV files, requiring conversion from Excel formats. (4) Unlike GPT, it cannot generate complete Word documents, PowerPoint presentations, or Excel files, which should be factored into result preparation strategies.
Academic writing and manuscript preparation
1. Introduction
ChatGPT helps researchers write academic papers in several ways. It can help create first drafts of paper sections, including introduction, methods, results, and discussion, using the provided research information and ideas. However, when large amounts of text must be generated, the output quality produced by ChatGPT generally worsens as it progresses, resulting in responses that may be inconsistent or lacking relevance. It is better to break down requests into smaller units or focused segments to maintain quality and consistency.
For example, when writing a research paper on "Complications in Liver Transplant Recipients," a researcher can start with an outline of the overall structure. Then, one can develop specific questions for each subtopic in the introduction section, such as
(1) Current status of liver transplant recipients
(2) Types of posttransplant complications
(3) Clinical significance and impact of complications in liver transplantation outcomes
(4) Critical analysis of existing literature and identification of knowledge gaps
(5) Research hypothesis and distinguishing features of the present investigation compared to prior studies.
The research questions for the Liver Transplant Complications Study are as follows:
Q1) What is the global incidence and status of liver transplantation?
Q2) What are the most common complications in liver transplant recipients?
Q3) What is the relationship between posttransplant complications and long-term outcomes, and what is the significance of complication prevention and management?
Q4) Provide a systematic review of existing studies on posttransplant complications, including their clinical significance and limitations.
This literature review can be conducted more efficiently using tools like Scispace or Perplexity. For those familiar with reference management software (Endnote, Mendeley, Zotero), ChatGPT's API can be utilized to summarize abstracts and full texts.
Q5) Based on answers to Q1–4, develop novel research hypotheses that differentiate this study from previous work.
If questions are too broad or lack detail, a researcher should consider these subquestions:
For Q1 (global transplant status):
- What are the regional differences in liver transplant rates?
- What are the primary indications for transplantation by region?
For Q2 (complications):
- What are the rates of surgical vs. medical complications?
- What is the incidence of each common complication in liver transplants?
For Q3 (outcomes):
- What is the impact on patient survival rates?
- How do complications affect the quality of life?
- What are the economic implications of posttransplant complications?
For Q4 (literature review):
- Are there any well-designed multicenter studies?
- What are the key findings from major multicenter studies?
- What are common limitations of previous studies?
- Are there any unmet needs about this issue?
Each subquestion should be specific and answerable with available data. This structured approach helps ensure comprehensive coverage while maintaining focus on the primary research objectives.
2. Methods
When composing the methods section, AI-assisted tools can improve the writing process in several ways. If the research design was developed using GPT, researchers can request a methodological framework formatted according to standard academic research. The AI can generate comprehensive methodology descriptions for studies utilizing statistical analyses through the GPT or Claude analytical function. Moreover, these platforms can guide appropriate statistical methods and corresponding analytical codes (R or Python) when researchers seek clarification on methodological approaches or specific analytical techniques. This systematic approach helps document research methods efficiently and accurately.
3. Results
LLMs are most effective when preparing the results section of scientific manuscripts because this section emphasizes objective data presentation rather than analytical interpretation. Researchers can generate publication-standard descriptions by submitting statistical outputs, tables, and figures with appropriate prompting. A straightforward command requesting professional medical formatting typically yields publication-quality descriptions (e.g., "Describe this table in the style of the results section of a medical article in professional English").
However, since LLMs might describe all aspects of tables and figures, it is essential to specify the content to be emphasized. Additionally, if tables or figures lack appropriate information, the LLM may describe them based on its interpretation. Therefore, information should be provided in text or explained in the prompt to achieve the desired results.
4. Discussion
The discussion section involves logical analysis of the research findings. Though AI can enhance the writing style and structure of the manuscript, the core interpretations and conclusions must originate from the researchers' expertise. The exclusive use of AI for results interpretation, without incorporating researchers' intellectual contributions, reduces the scholarly value of their work. Therefore, researchers should develop their interpretations and conclusions and then use AI only to refine the presentation of these ideas, ensuring that the final work reflects their scholarly perspective while maintaining academic integrity.
There remains an ongoing debate regarding using LLMs in academic writing. However, if the fundamental purpose of scientific papers is to effectively communicate research findings to advance scientific knowledge, using LLMs strategically to improve readability and communication efficiency is likely to produce clear benefits. Researchers from non-English-speaking countries often face significant language barriers that can delay manuscript preparation or diminish the chances of publication despite the valuable results of their research findings. Therefore, utilizing these tools to overcome linguistic barriers and labor-intensive research communication can accelerate sharing of scientific findings.
The Open Access movement started by platforms like arXiv, which facilitated rapid knowledge transfer through improved communication tools, is important in rapidly evolving fields like AI. These technological advances in research communication strengthen efforts to make scientific knowledge more accessible and accelerate academic progress.
Limitations and precautions
1. Hallucination and inaccurate information generation
One of the most critical considerations when utilizing ChatGPT in research is the phenomenon of hallucination, which occurs when LLMs like ChatGPT generate information that appears factual but is incorrect or logically inconsistent. This phenomenon primarily originated from the incompleteness and bias in training data, inherent limitations of statistical language models, and vulnerability to out-of-distribution data that extend beyond the training dataset. These hallucinations pose risk of distorting research outcomes. If researchers blindly accept ChatGPT-generated information as factual and base their research hypotheses or data analyses upon it, the resulting research lacks credibility. For example, establishing research backgrounds using incorrectly cited literature or conducting analyses based on inaccurate statistical data generated by ChatGPT can seriously compromise research validity and lead to incorrect conclusions [14]. To reduce these risks, researchers must use critical evaluation and fact-checking when using ChatGPT- generated content rather than accepting it as presented. This requires cross-validation with reliable sources, expert review when necessary, and transparent documentation of ChatGPT use in the research process, including clear attribution of information sources.
Recently, the World Association of Medical Editors and the International Committee of Medical Journal Editors published guidelines addressing concerns about the potential deterioration in research quality due to the indiscriminate use of LLMs [16]. Most medical journals have now incorporated specific guidelines regarding AI-assisted writing in their submission requirements. According to these guidelines, AI should be utilized only for improving readability and editing purposes, and any use of generative AI in research must be disclosed in the acknowledgment section.
2. Bias and fairness issues
LLMs like ChatGPT can reflect social biases in training data, including gender, race, religion, geography, and political views [17,18]. When asked about professional roles, for instance, ChatGPT might reinforce gender stereotypes, associating nursing with female caregivers and doctors with predominantly male roles. If not carefully monitored, these inherent biases can undermine the objectivity of research.
Such biases can significantly impact research outcomes. If researchers accept ChatGPT's biased outputs without critical evaluation, their study outcomes may demonstrate systematic bias or unintended discrimination against particular demographic groups.
Although researchers are developing technical solutions to address these biases, technology alone cannot solve the problem. Researchers must critically evaluate AI-generated content, understand its limitations, and work to minimize bias in their research design. Success requires both technological understanding and careful research practices to maintain scientific integrity.
3. Plagiarism and copyright issues
As ChatGPT is trained on extensive public text data, its outputs may unintentionally mirror existing published works, raising plagiarism concerns in academic writing. While plagiarism has always been an ethical issue in research, AI-generated content presents new challenges. It is difficult to detect when AI-generated text reproduces content from its training data, hindering detection of plagiarism [19-21].
Although various detection tools like GPTzero and ZeroGPT are emerging to identify AI-written content, they have substantial limitations. Simple prompting techniques can bypass these detectors, and they occasionally misidentify human-written text as AI-generated—known as false positives. Due to these technical limitations, current AI detection tools cannot reliably identify AI-generated content.
Therefore, when utilizing ChatGPT in research, researchers must make efforts to prevent plagiarism and copyright abuse. Rather than directly using ChatGPT-generated text, it is advisable to use it as reference material and paraphrase it in your own words. Furthermore, researchers should employ plagiarism detection tools to review similarities with existing works and indicate citations and sources when necessary. Kovari has presented exemplary practices for preventing ChatGPT-related plagiarism in the educational field, suggesting source citation using AI-generated text, utilizing plagiarism detection tools, and strengthening research ethics education [22]. Proper education can help researchers to understand plagiarism and copyright issues for ethical use of AI tools. Universities and research institutions should create clear guidelines for using AI in research. It is also essential for universities and research institutions to establish clear guidelines for using ChatGPT and provide them to researchers. Stanford recently opened a course on this, teaching use of AI while adhering to ethical guidelines.
4. Security and privacy issues
There are significant risks of sensitive research information leakage when utilizing ChatGPT in research. During interactions with ChatGPT, researchers may input unpublished research ideas, confidential research data, or personally identifiable information of research subjects. This information could be incorporated into ChatGPT's training data, leading to potential information leakage [23]. According to a recent study, attackers can extract training data from language models through specific queries [24]. This finding highlights the risk that LLMs like ChatGPT may store sensitive information during training, which could be vulnerable to extraction by harmful users. Therefore, researchers must be fully aware of these risks of sensitive information leakage when using ChatGPT and implement appropriate security and privacy protection measures.
To minimize such risks, researchers must anonymize or pseudonymize research data before utilizing ChatGPT. Research institutions should establish comprehensive security policies and guidelines regarding ChatGPT utilization and provide regular security training to researchers to ensure data protection and privacy compliance.
5. Limitations in reflecting up-to-date information
LLMs are trained on data with a specific cutoff date. For instance, ChatGPT trained until September 2023 cannot access events or research published after this date. While web browsing functions have been implemented to overcome this limitation, using the model without this feature can affect the accuracy and timeliness of provided information. This limitation is particularly significant in fields requiring current information. For example, ChatGPT cannot provide information about technological advances about AI research beyond its training data. This constraint is especially challenging in rapidly evolving fields such as medicine, information technology, and economics, where current information is crucial. When researching topics like the COVID-19 pandemic, the model may lack information about recent vaccine developments or new viral variants.
Retrieval-Augmented Generation (RAG) models are emerging as a promising solution to overcome ChatGPT's limitations in accessing current information. RAG models combine information retrieval and text generation by searching external knowledge bases (e.g., recent web documents, databases) in real time to generate responses based on current data. For example, when asked about "U.S. annual pediatric surgery statistics in 2024," a RAG model can analyze recent news, research reports, and statistical data to generate a comprehensive response. Lewis et al. [25] first introduced RAG models, demonstrating their ability to exceed traditional generative model limitations by providing more accurate and current information. Subsequently, Shuster et al. [26] showed that RAG models reduce hallucination compared to conventional conversational models, producing more evidence-based and reliable responses. These models show promise for research applications requiring current information and timely insights.
6. Excessive dependence and reduced critical thinking
Although ChatGPT offers valuable support for enhancing research productivity, researchers must maintain a balanced approach to its use. Over-dependence on AI assistance could compromise researchers' capacity for independent critical thinking and original insight generation [16,19,27]. Critical thinking is a core research skill involving objectively analyzing and evaluating information. When using ChatGPT, researchers must carefully assess its outputs. This is especially crucial for students and graduate researchers who are still developing their ability to assess information accuracy due to limited research experience.
When using ChatGPT in research, it is essential to remember that the researcher is the primary investigator. The researcher's expertise and judgment are crucial throughout the process, from developing ideas to interpreting results. It is essential to recognize its limitations and be ready to seek additional resources or expert advice when needed. ChatGPT can strengthen research skills but can never replace the researcher's critical role in the scientific process.
Conclusion and future research directions
This study provides a systematic analysis of ChatGPT's applications and challenges in academic research. I investigated its utility across the research process, including literature review, hypothesis development, data analysis, and scientific writing. My findings suggest that, while ChatGPT offers clear benefits for research productivity, researchers must carefully consider its limitations and potential challenges.
Key considerations include the model's potential for generating inaccurate information, inherent biases, intellectual property concerns, data privacy risks, limited access to current information, possible deterioration of critical thinking skills through over-reliance, and associated ethical challenges.
Future research should address several key directions to overcome ChatGPT's limitations and enhance its research applications. Technical improvements should focus on reducing hallucination, minimizing bias, incorporating current information, and increasing model explainability. Promising developments in this direction include RAG models, which enhance accuracy and timeliness by combining current data retrieval with text generation.
Ethical considerations require comprehensive AI research guidelines and ensuring transparency in AI utilization within research contexts, including legal and institutional frameworks for AI implementation in academic research.
Furthermore, interdisciplinary collaboration among AI developers, researchers, ethicists, and legal scholars is essential to maximize ChatGPT's potential while minimizing associated risks. Opportunities include developing field-specific RAG models optimized for medicine, biology, and engineering, researching fine-tuning techniques while considering unique research data characteristics, and creating specialized tools for different research domains. These developments could establish ChatGPT as a reliable and ethical research tool that maintains scientific integrity while advancing research capabilities.
In conclusion, ChatGPT demonstrates high potential for enhancing research efficiency, generating research ideas, and reducing barriers to academic investigation. However, researchers remain responsible for verifying the accuracy and reliability of AI-generated content and are also obligated to address ethical considerations. Researchers should maintain their own critical thinking when utilizing ChatGPT, ensuring they remain aware of its limitations and adhere to ethical responsibilities throughout the process.
Footnotes
Conflicts of interest
No potential conflict of interest relevant to this article was reported.
Funding
This review was supported by a grant from the Korea Health Technology R&D Project through the Korea Health Industry Development Institute (KHIDI), funded by the Ministry of Health and Welfare, Republic of Korea (grant number: HI23C159101).
Acknowledgments
This author used generative AI to write the manuscript and for editing and proofreading. GPT-4 version 2024.12.13 from OpenAI (https://openai.com/policies/terms-of-use) and the Claude 3 Sonnet version 2024.10.24 from Anthropic (https://www.anthropic.com/legal/consumer-terms) were useful to me.
Author contribution
JML is the only author listed in this manuscript.
References
- 1.Dave T, Athaluri SA, Singh S. ChatGPT in medicine: an overview of its applications, advantages, limitations, future prospects, and ethical considerations. Front Artif Intell. 2023;6:1169595. doi: 10.3389/frai.2023.1169595. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 2.Haque MA, Li S. Exploring ChatGPT and its impact on society. AI Ethics. 2025;5:791–803. [Google Scholar]
- 3.Sallam M. ChatGPT utility in healthcare education, research, and practice: systematic review on the promising perspectives and valid concerns. Healthcare (Basel) 2023;11:887. doi: 10.3390/healthcare11060887. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4.Sallam M. The utility of ChatGPT as an example of large language models in healthcare education, research and practice: systematic review on the future perspectives and potential limitations. medRxiv [Preprint] 2023 doi: 10.1101/2023.02.19.23286155. Available from: [DOI] [Google Scholar]
- 5.Mesko B. The ChatGPT (generative artificial intelligence) revolution has made artificial intelligence approachable for medical professionals. J Med Internet Res. 2023;25:e48392. doi: 10.2196/48392. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6.Gilson A, Safranek CW, Huang T, Socrates V, Chi L, Taylor RA, et al. How does ChatGPT perform on the United States medical licensing examination (USMLE)? The implications of large language models for medical education and knowledge assessment. JMIR Med Educ. 2023;9:e45312. doi: 10.2196/45312. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7.Humar P, Asaad M, Bengur FB, Nguyen V. ChatGPT Is equivalent to first-year plastic surgery residents: evaluation of ChatGPT on the plastic surgery in-service examination. Aesthet Surg J. 2023;43:NP1085–9. doi: 10.1093/asj/sjad130. [DOI] [PubMed] [Google Scholar]
- 8.Takagi S, Watari T, Erabi A, Sakaguchi K. Performance of GPT-3.5 and GPT-4 on the Japanese medical licensing examination: comparison study. JMIR Med Educ. 2023;9:e48002. doi: 10.2196/48002. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9. What can copilot’s earliest users teach us about generative AI at work? Work trend index special report [Internet] Microsoft; 2023 [cited 2024 Dec 24]. Available from: https://www.microsoft.com/en-us/worklab/work-trend-index/copilots-earliest-users-teach-us-about-generative-ai-atwork.
- 10.Brynjolfsson E, Li D, Raymond LR. Generative AI at work. Cambridge (MA): National Bureau of Economic Research; 2023. Available from: http://www.nber.org/papers/w31161.pdf. [Google Scholar]
- 11.Noy S, Zhang W. Experimental evidence on the productivity effects of generative artificial intelligence. Science. 2023;381:187–92. doi: 10.1126/science.adh2586. [DOI] [PubMed] [Google Scholar]
- 12.Peng S, Kalliamvakou E, Cihon P, Demirer M. The impact of AI on developer productivity: evidence from GitHub copilot. arXiv:2302.06590v1 [Preprint] doi: 10.48550/ARXIV.2302.06590. 2023 [cited 2024 Dec 24]. Available from: [DOI] [Google Scholar]
- 13.Lu C, Lu C, Lange RT, Foerster J, Clune J, Ha D. The AI scientist: towards fully automated open-ended scientific discovery. arXiv:2408.06292v3 [Preprint] 2024 [cited 2024 Dec 24]. Available from: https://doi.org/10.48550/ARXIV. 2408.06292. [Google Scholar]
- 14.Chelli M, Descamps J, Lavoué V, Trojani C, Azar M, Deckert M, et al. Hallucination rates and reference accuracy of ChatGPT and bard for systematic reviews: comparative analysis. J Med Internet Res. 2024;26:e53164. doi: 10.2196/53164. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15.Shao Y, Jiang Y, Kanell TA, Xu P, Khattab O, Lam MS. Assisting in writing wikipedia-like articles from scratch with large language models. arXiv:2402.14207v2 [Preprint] doi: 10.48550/ARXIV.2402.14207. 2024 [cited 2024 Dec 24]. Available from: [DOI] [Google Scholar]
- 16.Chetwynd E. Ethical use of artificial intelligence for scientific writing: current trends. J Hum Lact. 2024;40:211–5. doi: 10.1177/08903344241235160. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17.Li Y, Zhang L, Zhang Y. Fairness of ChatGPT. arXiv:2305. 18569v2 [Preprint] doi: 10.48550/ARXIV.2305.18569. 2024 [cited 2025 Jan 5]. Available from: [DOI] [Google Scholar]
- 18.Wu J, Song Y, Wu DC. Does ChatGPT show gender bias in behavior detection? Humanit Soc Sci Commun. 2024;11:1706. [Google Scholar]
- 19.Rahimi F, Talebi Bezmin Abadi A. ChatGPT and publication ethics. Arch Med Res. 2023;54:272–4. doi: 10.1016/j.arcmed.2023.03.004. [DOI] [PubMed] [Google Scholar]
- 20.Elali FR, Rachid LN. AI-generated research paper fabrication and plagiarism in the scientific community. Patterns. 2023;4:100706. doi: 10.1016/j.patter.2023.100706. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 21.Kim SJ. Research ethics and issues regarding the use of ChatGPT-like artificial intelligence platforms by authors and reviewers: a narrative review. Sci Ed. 2024;11:96–106. [Google Scholar]
- 22.Kovari A. Ethical use of ChatGPT in education—Best practices to combat AI-induced plagiarism. Front Educ. 2025;9:1465703. [Google Scholar]
- 23.Nasr M, Carlini N, Hayase J, Jagielski M, Cooper AF, Ippolito D, et al. Scalable extraction of training data from (production) language models. arXiv:2311.17035v [Preprint] doi: 10.48550/arXiv.2311.17035. 2023 [cited 2025 Jan 5]. Available from: [DOI] [Google Scholar]
- 24.Kim M, Kim Y, Kang HJ, Seo H, Choi H, Han JY, et al. Finetuning LLMs with medical data: can safety be ensured? NEJM AI. 2025;2(1) DOI: 10.1056/AIcs2400390. [Google Scholar]
- 25.Lewis P, Perez E, Piktus A, Petroni F, Karpukhin V, Goyal N, et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. arXiv:2005.11401v4 [Preprint] doi: 10.48550/ARXIV.2005.11401. 2020 [cited 2025 Jan 5]. Available from: [DOI] [Google Scholar]
- 26.Shuster K, Poff S, Chen M, Kiela D, Weston J. Retrieval augmentation reduces hallucination in conversation. arXiv: 2104.07567v1 [Preprint] doi: 10.48550/ARXIV.2104.07567. 2021 [cited 2025 Jan 5]. Available from: [DOI] [Google Scholar]
- 27.Liebrenz M, Schleifer R, Buadze A, Bhugra D, Smith A. Generating scholarly content with ChatGPT: ethical challenges for medical publishing. Lancet Digit Health. 2023;5:e105–6. doi: 10.1016/S2589-7500(23)00019-5. [DOI] [PubMed] [Google Scholar]



