Author manuscript; available in PMC: 2025 Dec 16.
Published in final edited form as: JAMA Surg. 2025 May 1;160(5):588–589. doi: 10.1001/jamasurg.2024.6025

Practical Guide to Artificial Intelligence, Chatbots, and Large Language Models in Conducting and Reporting Research

Tyler J Loftus 1, Adil Haider 2,3, Gilbert R Upchurch Jr 4
PMCID: PMC12704455  NIHMSID: NIHMS2122588  PMID: 39774809

In 1966, Joseph Weizenbaum described the first chatbot: a computer program using natural language processing to decompose user sentences, identify key words in context, and generate appropriate responses.1 Weizenbaum anticipated and criticized the notion that his chatbot was intelligent: “…once a particular program is unmasked, once its inner workings are explained, its magic crumbles away; it stands revealed as a mere collection of procedures.” Subsequently, decades of research enhanced natural language processing capabilities, most notably via large language models (LLMs, deep learning models that process and generate text, usually by leveraging transformer architectures), which underlie contemporary chatbots. Even LLMs—heralded for their potential to transform science, industry, and society—are mere collections of procedures. Yet already, an LLM trained on both clinical and general English language text can generate clinical notes that pass the Turing Test (ie, physicians could not consistently distinguish between human- and LLM-generated text) while maintaining high linguistic readability and clinical relevance.2 In surgery, LLMs may be particularly useful in extracting surgical risk factors or outcomes from clinical notes and operative reports, learning from text inputs when performing prognostication and decision-support tasks, and serving as educational tools for surgical concept summarization, active knowledge retrieval practice, and virtual surgical simulation. The purpose of this article is to summarize the limitations and opportunities in applying artificial intelligence (AI), chatbots, and LLMs in conducting and reporting surgical research (Box).

Box. Artificial Intelligence, Chatbots, and Large Language Models (LLMs).

  • Contemporary chatbots leverage LLMs that process and generate text, usually with transformer model architectures.

  • LLM “hallucinations” can be mitigated by high-quality training data, retrieval augmented generation, and detecting and rewriting incorrect outputs.

  • The billions of tokens (usually words) and parameters (mathematical expressions of what models learn) used to train LLMs require substantial computational power.

  • LLM prompts should include context, instructions with constraints, and uncertainty estimations, while omitting extraneous information.

  • Training data should represent all patient populations and settings for which the LLM may be applied.

  • LLMs have the potential to anchor clinical decisions in objectivity, as a bulwark against the bias and error inherent to hypothetico-deductive reasoning.

Limitations

LLMs are limited by their propensity to “hallucinate” by generating misinformation that appears feasible, just as human memory consolidation includes the generation of false, synthetic “memories” that fill gaps in our recollection or understanding of life events. Technical mitigation strategies (ie, techniques to ensure that technology works correctly) include incorporation of comprehensive, factual training data, retrieval augmented generation (pulling facts from external databases to ensure accuracy), and the detection, flagging, and rewriting of incorrect outputs.3 The latter process can also avoid providing sensitive or potentially harmful answers (eg, newer commercial LLMs often avoid giving medicolegal advice). For health care LLMs to progress beyond the research and development phase, technical mitigation must be comprehensive.
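The retrieval augmented generation strategy above can be sketched in a few lines. This is a minimal illustration, not a clinical system: the toy word-overlap retriever, the example corpus, and the prompt wording are all hypothetical, and in practice the assembled prompt would be passed to an LLM API.

```python
def retrieve(query, corpus, k=1):
    """Toy retriever: rank documents by word overlap with the query.
    Real systems use vector embeddings, but the principle is the same."""
    q_words = set(query.lower().split())
    scored = sorted(
        corpus,
        key=lambda doc: len(q_words & set(doc.lower().split())),
        reverse=True,
    )
    return scored[:k]

def build_rag_prompt(query, corpus):
    """Ground the prompt in retrieved facts before querying the model,
    so answers are anchored in an external source rather than invented."""
    facts = retrieve(query, corpus)
    context = "\n".join(f"- {fact}" for fact in facts)
    return (
        "Answer using only the facts below.\n"
        f"Facts:\n{context}\n"
        f"Question: {query}"
    )

# Illustrative external knowledge base
corpus = [
    "Graves disease is an autoimmune cause of hyperthyroidism.",
    "Appendectomy is the standard treatment for acute appendicitis.",
]
prompt = build_rag_prompt("What causes Graves disease?", corpus)
```

Because the model is instructed to answer only from retrieved facts, fabricated details are easier to detect: any claim absent from the supplied context is suspect.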

Commercial or closed-source LLMs (eg, GPT-4, Gemini) do not disclose training data or model architecture, such that model development and foundation cannot be rigorously investigated, though it remains possible to measure their task-specific performance. Even when LLMs are developed in university medical centers as academic research products, models may be publicly unavailable or subject to high licensing fees, missing opportunities to build trust through open science.

Data Management Considerations

The optimal size of training datasets depends on the use case, the quality and diversity of training data, and whether there are opportunities to fine-tune a foundation model for specific tasks (eg, identifying specific events in operative reports). Rather than perform loosely informed sample size determinations, it may be preferable to generate model learning curves illustrating the point at which adding additional training data no longer improves model performance (ie, the point at which epistemic uncertainty is minimized). Dataset size is typically measured in tokens, or the discrete units into which text is divided (tokens vary by model but are almost always based on subwords using the byte pair encoding algorithm; usually 1 token = 1 word). The number of tokens (often >100 billion) should increase proportionately with the number of model parameters or weights (often >200 billion), which are numerical values that connect nodes in neural networks and determine how nodes interact with one another in generating outputs.4
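A single merge step of the byte pair encoding algorithm mentioned above can be sketched as follows. Real tokenizers run thousands of merges over large corpora; this toy example, with an illustrative surgical string, shows only the core operation of fusing the most frequent adjacent pair into one subword token.

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Count adjacent token pairs and return the most common one."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0]

def merge_pair(tokens, pair):
    """Replace every occurrence of `pair` with a single merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

tokens = list("lobectomy lobar lobe")  # start from individual characters
pair = most_frequent_pair(tokens)      # the most common adjacent pair
tokens = merge_pair(tokens, pair)      # fuse it into a subword token
```

Repeating this step builds a vocabulary of progressively longer subwords (eg, "lob", "lobe"), which is why frequent words become single tokens while rare clinical terms split into several.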

Some commercial LLMs pose substantial risk for leaking sensitive data, including prompts fed to the model. OpenAI offers GPT-4 on Microsoft’s Azure cloud computing platform, which protects users’ training data (for fine-tuning a foundation model), prompts, and model outputs from being shared with OpenAI or other Azure customers. Alternatively, local research environments can be engineered to allow investigators to perform experiments with open-source LLMs without data ever leaving the local environment (eg, the HiPerGator supercomputer at the University of Florida).

The large-volume, high-quality data needed to train and fine-tune LLMs are often unavailable within smaller health care enterprises; including data from those sources is important in building generalizable tools. It is similarly important that LLMs perform well in smaller, community, and resource-constrained settings that often serve vulnerable populations. It might be possible to achieve these goals via federated learning approaches in which individual sites share model parameters with a central server that aggregates them in generating a global model that is shared with individual sites iteratively until converging on an optimal solution, without ever sharing patient data among sites. For other (non-LLM) AI model architectures, federated learning offers not only privacy and security, but also a predictive performance advantage, presumably by learning from more data and more edge (rare) cases.5
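One round of the federated averaging described above can be sketched as a weighted mean of site parameters. The hospital sites, dataset sizes, and parameter vectors below are illustrative; real federated learning repeats this aggregation over many rounds of local training, and only parameters, never patient data, leave each site.

```python
def federated_average(site_params, site_sizes):
    """Aggregate local model parameters on a central server,
    weighting each site by the size of its local dataset."""
    total = sum(site_sizes)
    n_params = len(site_params[0])
    global_params = []
    for j in range(n_params):
        weighted = sum(p[j] * n for p, n in zip(site_params, site_sizes))
        global_params.append(weighted / total)
    return global_params

# Three hypothetical hospitals with different amounts of local data
site_params = [[0.2, 1.0], [0.4, 2.0], [0.6, 3.0]]
site_sizes = [100, 300, 600]
global_model = federated_average(site_params, site_sizes)
```

The global model is then redistributed to each site for another round of local training, iterating until convergence, so smaller community hospitals contribute to, and benefit from, a model no single site could train alone.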

Data Analytic Considerations

Training LLMs requires substantial computational power. Fine-tuning a pretrained LLM requires far fewer resources; in-context learning (the model understands and performs tasks based on examples within the prompt) is far less expensive than fine-tuning; and running an LLM is computationally inexpensive. In 1 study, training an LLM with 20 billion parameters on both clinical and general English language text took 20 days despite using 560 graphics processing units (GPUs) in a state-of-the-art GPU clustering architecture.2 These computational requirements are infeasible for most investigator teams and academic research environments. Alternatively, pretrained, open-source LLMs can be used, but most were not trained on clinical text for use in health care settings. It is difficult to choose from among these options because commercial LLMs do not share information on training data or model architecture, and open-source LLM performance varies by task or use case. Rigorous scientific experimentation and publication of results, positive or negative, would illuminate this domain.

Most LLMs are operationalized as question-answer interfaces. Engineering the question (also known as a prompt) helps obtain the desired output. Best practices for prompt engineering include the provision of adequate context (which can take the form of examples of correct outputs, a pattern to emulate, or assigning a role to the chatbot) followed by clear instructions on what should be delivered, including constraints (eg, “capture only Graves disease with compressive symptoms”) and an assessment of uncertainty (eg, “is the answer correct?”), while omitting extraneous information.
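The prompt engineering practices above can be assembled into a simple template: context (a role and an example output), clear instructions with constraints, and a request for an uncertainty assessment. The clinical task, field names, and wording below are hypothetical illustrations.

```python
def build_prompt(role, example, instruction, constraint, text):
    """Assemble a prompt from context, instructions with constraints,
    and an uncertainty check, omitting extraneous information."""
    return (
        f"You are {role}.\n"
        f"Example of a correct output: {example}\n"
        f"Task: {instruction}\n"
        f"Constraint: {constraint}\n"
        "After answering, state how confident you are and why.\n"
        f"Text: {text}"
    )

prompt = build_prompt(
    role="a surgical research assistant extracting diagnoses",
    example='{"diagnosis": "Graves disease", "compressive_symptoms": true}',
    instruction="Extract the diagnosis from the operative note as JSON.",
    constraint="Capture only Graves disease with compressive symptoms.",
    text="Total thyroidectomy for Graves disease with tracheal compression.",
)
```

Supplying an example output (a one-shot prompt) constrains the response format, and the explicit uncertainty request gives reviewers a handle for flagging low-confidence extractions.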

Health Equity Considerations

Like other forms of artificial intelligence, LLMs can potentiate health disparities.6 LLMs trained on text that represents human biases may exhibit similar biases when performing text processing and generation tasks. Although careful management of training data mitigates risk for encoding overt, explicit bias, it remains difficult to consistently identify and mitigate risk for encoding subtle, implicit bias. LLMs could also worsen health disparities by performing well for patients in digitally avid settings that make large contributions to training datasets while performing poorly in settings with limited digital resources. This juxtaposition underscores the importance of community-engaged research that includes data and stakeholder input from all settings that may deploy the LLM.

Conversely, when LLMs achieve high accuracy and avoid hallucinations, they have potential to address health disparities by anchoring research and health care operations in objectivity, as a bulwark against subjectivity, bias, and attendant error in the hypothetico-deductive reasoning process that dominates clinical decision-making.7 This advantage is likely greatest when applying LLMs to augment decisions made under time constraints with incomplete access to relevant information, or information so vast and complex that human cognition is inefficient or ineffective—scenarios that are relatively common in surgical care (eg, clinical reasoning and decision-making in managing rare or complex surgical diseases, trauma care, emergency surgery, and when time constraints are imposed by busy schedules in high-volume settings). Before realizing this potential advantage, there must be extensive technology readiness assessments, model stress testing to understand when and how LLMs fail in clinical settings, and generation of high-level evidence from randomized or pragmatic, digital trials. Under these circumstances, LLMs have the potential to transform surgical research and care.

Footnotes

Conflict of Interest Disclosures: Dr Loftus reported grants from the National Institute of General Medical Sciences. Dr Haider reported being co-founder of the startup BostonHealthAI. No other disclosures were reported.

Additional Contributions: We thank Benjamin Shickel, PhD, University of Florida, for imparting his technical expertise in computer engineering.

Contributor Information

Tyler J. Loftus, Department of Surgery, University of Florida Health, Gainesville.

Adil Haider, Medical College, Aga Khan University, Karachi, Pakistan; Deputy Editor, JAMA Surgery.

Gilbert R. Upchurch, Jr, Department of Surgery, University of Florida Health, Gainesville.

REFERENCES

  • 1. Weizenbaum J. ELIZA—a computer program for the study of natural language communication between man and machine. Commun ACM. 1966;9(1):36–45. doi: 10.1145/365153.365168
  • 2. Peng C, Yang X, Chen A, et al. A study of generative large language model for medical research and healthcare. NPJ Digit Med. 2023;6(1):210. doi: 10.1038/s41746-023-00958-w
  • 3. Devaraj A, Sheffield W, Wallace BC, Li JJ. Evaluating factuality in text simplification. Presented at: 62nd Annual Meeting of the Association for Computational Linguistics. August 11, 2024; Bangkok, Thailand.
  • 4. Hoffmann J, Borgeaud S, Mensch A, et al. Training compute-optimal large language models. arXiv:2203.15556 [cs.CL]. 2022. doi: 10.48550/arXiv.2203.15556
  • 5. Dayan I, Roth HR, Zhong A, et al. Federated learning for predicting clinical outcomes in patients with COVID-19. Nat Med. 2021;27(10):1735–1743. doi: 10.1038/s41591-021-01506-3
  • 6. Johnson-Mann CN, Loftus TJ, Bihorac A. Equity and artificial intelligence in surgical care. JAMA Surg. 2021;156(6):509–510. doi: 10.1001/jamasurg.2020.7208
  • 7. Loftus TJ, Tighe PJ, Filiberto AC, et al. Artificial intelligence and surgical decision-making. JAMA Surg. 2020;155(2):148–158. doi: 10.1001/jamasurg.2019.4917