Author manuscript; available in PMC 2025 Mar 1. Published in final edited form as: J Am Acad Orthop Surg. 2024 Jan 3;32(5):205–210. doi: 10.5435/JAAOS-D-23-00787

Ethical Considerations of Artificial Intelligence in Health Care: Examining the Role of Generative Pretrained Transformer-4

Suraj Sheth 1, Hayden P Baker 1, Hannes Prescher 1, Jason A Strelzow 1
PMCID: PMC11536186  NIHMSID: NIHMS2032415  PMID: 38175996

Abstract

The integration of artificial intelligence technologies, such as large language models (LLMs), in health care holds potential for improved efficiency and decision support. However, ethical concerns must be addressed before widespread adoption. This article focuses on the ethical principles surrounding the use of Generative Pretrained Transformer-4 and its conversational model, ChatGPT, in healthcare settings. One concern is potential inaccuracies in generated content. LLMs can produce believable yet incorrect information, risking errors in medical records. The opacity of training data exacerbates this problem, hindering accuracy assessment. To mitigate this risk, LLMs should be trained on precise, validated medical data sets. Model bias is another critical concern because LLMs may perpetuate biases from their training, leading to medically inaccurate and discriminatory responses. Sampling, programming, and compliance biases all contribute, necessitating careful consideration to avoid perpetuating harmful stereotypes. Privacy is paramount in health care, and the use of public LLMs raises risks. Strict data-sharing agreements and Health Insurance Portability and Accountability Act (HIPAA)-compliant training protocols are necessary to protect patient privacy. Although artificial intelligence technologies offer promising opportunities in health care, careful consideration of ethical principles is crucial. Addressing concerns of inaccuracy, bias, and privacy will ensure responsible and patient-centered implementation, benefiting both healthcare professionals and patients.


The appropriate use of artificial intelligence (AI)–based instruments in healthcare delivery has long been a topic of debate. Advances in the complexity of deep learning models have enabled these technologies to serve as diagnostic tools and to provide treatment recommendations. AI has been shown to achieve diagnostic performance on par with subject-matter experts in identifying skin cancer lesions, with potential for additional diagnoses.1,2 Similarly, deep learning models were able to detect clinically important abnormalities on chest radiography at a performance level similar to that of practicing radiologists.3 Although the utility of AI as an adjunct in health care has been repeatedly demonstrated, concerns regarding how best to integrate these tools into clinical practice remain. Recent developments in large language models that are able to interface with patients and physicians in highly coherent and intelligent exchanges have brought these concerns again to the forefront.

Generative Pretrained Transformer-4 (GPT-4) is the fourth iteration of the generative pretrained transformer series, one of the largest and most powerful autoregressive language models available today.4 Developed by OpenAI, a San Francisco-based company that introduced the first model in the series in 2018,5 GPT-4 is a language model that learns language patterns by processing large amounts of written text.5 ChatGPT, a conversational application built on the GPT models, has been further developed to generate human-like responses to text-based prompts. ChatGPT uses an autoregressive language model, which means that it is primarily trained to generate words by using previous text to predict the appropriate words that come next.4 ChatGPT was trained using a deep learning architecture called the transformer, introduced by Vaswani et al6 in 2017. The transformer is a type of neural network that is designed to process sequences of data.
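To make the autoregressive behavior described above concrete, the following minimal Python sketch generates text one word at a time, with each prediction conditioned only on the words that came before it. The tiny hand-coded bigram table is a hypothetical stand-in for the far larger transformer network that GPT-4 actually uses; it illustrates only the prediction loop, not the real model.

```python
# A minimal, illustrative sketch of autoregressive next-word prediction.
# The toy bigram "model" below is hypothetical; it only demonstrates the
# loop of predicting each word from the words that precede it.

import random

# Hypothetical toy statistics: for each word, candidate next words.
bigram_model = {
    "the": ["patient", "physician", "diagnosis"],
    "patient": ["reported", "was"],
    "reported": ["pain", "improvement"],
}

def generate(prompt, max_words=5):
    words = prompt.split()
    for _ in range(max_words):
        candidates = bigram_model.get(words[-1])
        if not candidates:                        # no known continuation; stop
            break
        words.append(random.choice(candidates))   # predict/sample the next word
    return " ".join(words)

print(generate("the"))  # e.g., "the patient reported pain"
```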

The training process for ChatGPT involved feeding the model large amounts of text data and then using an optimization algorithm to adjust the parameters of the model and minimize a predefined loss function.4 The GPT-4 model contains more than 175 billion parameters and was trained on a diverse range of text data found on the internet, including websites, forums, and social media, as well as books, scientific articles, Wikipedia, and news articles through 2021. The model is built on more than 10 times as many sources as any previous language model and requires 800 GB of memory to store.7 However, the specific data sources have not been made public. The objective of the training process was to teach the model to predict the next word in a sequence based on the preceding words, with the goal of generating coherent text. To ensure the quality of the training data, GPT-4 uses a flagging system to identify and exclude sources that may contain biased or low-quality text.8 In addition, many argue that the training material, the flagging/self-policing system implemented by the company, and the lack of transparency inherently introduce bias.9-12 How these biases influence and potentially change the outputs of these tools presents a challenging situation when such tools may be used for medical or health-related indications. However, ChatGPT does acknowledge that incorrect or misleading data can slip through the flagging system, which makes it possible for certain outputs to be misleading, untruthful, harmful, incorrect, or biased.8
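The optimization step described above can be sketched in a few lines. The toy NumPy example below trains a single linear layer to assign higher probability to the correct next word by repeatedly reducing a cross-entropy loss; the vocabulary size, encoded context, and learning rate are illustrative assumptions and bear no relation to GPT-4's actual architecture or scale.

```python
# A minimal sketch of loss-minimizing parameter updates for next-word
# prediction. All values here are illustrative assumptions, not GPT-4's setup.

import numpy as np

rng = np.random.default_rng(0)
vocab_size, context_dim, lr = 5, 8, 0.1

W = rng.normal(scale=0.1, size=(context_dim, vocab_size))  # model parameters

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# One (context, next-word) training pair; in practice there are billions.
context = rng.normal(size=context_dim)   # stand-in for an encoded prompt
next_word = 3                            # index of the true next word

for step in range(100):
    probs = softmax(context @ W)          # predicted next-word distribution
    loss = -np.log(probs[next_word])      # cross-entropy loss on the true word
    grad = np.outer(context, probs)       # gradient of the loss w.r.t. W
    grad[:, next_word] -= context
    W -= lr * grad                        # gradient descent update

print(f"final loss: {loss:.4f}")  # loss decreases as the model fits the pair
```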

Since its release on November 30, 2022, ChatGPT has garnered viral attention, with more than 13 million active daily users as of January 2023.5 ChatGPT has several capabilities beyond text generation, including question answering, summarization, translation, and conversational AI.8 Impressively, ChatGPT recently passed the final written examination for the Master of Business Administration program at the University of Pennsylvania Wharton School of Business.13 ChatGPT also passed all three steps of the United States Medical Licensing Examination (USMLE).14

There exists increasing interest in the use of ChatGPT in healthcare settings to assist with documentation and medical decision making, and to provide patients with information and support.15 However, the use of AI technology in health care raises notable ethical concerns regarding privacy, bias, and the accuracy of information. The purpose of this article was to discuss these ethical principles in the context of healthcare delivery and AI. As with any technological advancement, it is important to highlight potential pitfalls of this emerging technology to ensure that it is used responsibly and in the best interest of patients.

Role of Generative Pretrained Transformer in Health Care

Physicians perform a wide range of tasks, including data analysis, decision making, text generation/editing, information dissemination, patient education, communication with insurance companies, writing discharge summaries, and medical note-taking. In addition, the creation of academic medical literature and scientific manuscripts is another labor-intensive function frequently required and/or encouraged within health care. Generative AI and large language models (LLMs), such as GPT-4, have the potential to improve efficiency in physician workflows and reduce documentation burden and burnout. Recent studies have demonstrated the success of ChatGPT in these documentation tasks. Cascella et al16 prompted ChatGPT to compose medical notes for patients admitted to the intensive care unit and showed that it was able to do so coherently based on a series of laboratory data and hemodynamic parameters. ChatGPT has also been used to generate factually accurate and complete radiology reports17,18 and patient discharge summaries.19 Through the automation of these processes, patient-physician and interphysician communication may be enhanced while reducing or avoiding the workload associated with increasing reliance on technology and the electronic medical record. Given these promising advancements, it is crucial to acknowledge that the integration of AI and tools such as ChatGPT into healthcare workflows necessitates collaboration with the Privacy and Risk Management Offices of healthcare institutions to address compliance and risk concerns, underscoring the importance of early and proactive engagement.

Inaccuracy

Although emerging LLMs have demonstrated impressive abilities to summarize large bodies of literature and collate solutions to clinical problems, the capacity of these systems to “hallucinate” (create inaccurate but believable-appearing statements in their responses)20 presents a fundamental challenge in health care, where accuracy is essential to clinical success. Two recent cases are illustrative. In February 2023, during an official demonstration, Google’s Bard model made an incorrect statement about the James Webb Space Telescope.21 Similarly, in Microsoft’s official demonstration of its new Bing Chat feature, built in part on OpenAI’s GPT-4 model, the model incorrectly summarized the financial statements of two different companies, generating false figures for key performance indicators in an industry where accuracy is crucial.22 In both cases, these critical errors were not immediately noted during the demonstration but rather were identified later by fact checkers online. In effect, the system was “convincing” enough for people to accept the wrong information as “facts” during the immediacy of the presentation.22 In medical contexts, the ability of these models to produce a grammatically correct facsimile of human-generated summaries and reports, while populating them with both correct and “hallucinated” incorrect information, can lead to the generation and propagation of errors in medical documentation. For these reasons, thoughtful implementation with checks and balances is critical, particularly in the evidence-based medical community. Given the importance of an accurate medical history for providing high-quality patient care, and the already high rates of error in medical note-taking,23 any implementation of LLMs for clinical note creation must proceed with caution, including extensive review and fact checking by physicians.

Ethical and factual errors may lead to unintended consequences and are present with each AI learning model. The first and most obvious concern is the opacity of the training data used to create publicly available LLMs. Although many companies report generalized features of their models, including the number of parameters and training protocols, details about the data sources fed into the models are scarce. This presents a challenge because it prevents users, including physicians in the medical context, from assessing the accuracy of the inputs that are used to create responses. If incorrect or outdated information is used to train these systems, they are more likely to generate responses and suggest solutions that are false, inaccurate, or obsolete. With the ever-evolving nature of the clinical standard of care, the quality and recency of the data sources of LLMs used in clinical settings are of critical importance for timely, effective delivery of care. Medical education and continuing medical education requirements are examples of the importance of staying “relevant” and “up-to-date” with modern, novel, and practice-changing information within a rapidly growing medical field. Recent tests comparing the performance of ChatGPT, trained on a large base of general content, with PubMedGPT, an LLM trained exclusively on biomedical abstracts and research articles, found that ChatGPT consistently scored higher on the USMLE Step examinations than PubMedGPT.24 The authors suggested that this was due to the nature of scientific writing, which tends to be restrained and cautious in tone.24

The use of ChatGPT and other LLMs powering “chatbots” presents an opportunity for patients to interact with medical information in a more intuitive way, in the form of a discussion not unlike a patient-physician interaction. Although this interface provides an easier way for patients to seek, understand, and manage their illnesses, the accuracy and relevance of the output must be carefully monitored. Furthermore, commercial models, including Microsoft’s Bing Chat, plan to integrate advertisements within their AI chatbot programs to generate revenue.25 Ethically, this presents challenges because patients may be confronted with paid messaging embedded in answers to questions that they (and providers) may presume to be unbiased. Directed, consumer-targeted advertisement further blurs the boundaries between consumerism and independent medical advice. Such blending of commercial and medical advice may directly affect patients’ understanding of their disease and their subsequent decision making, and may consciously or subconsciously undermine medical autonomy. Previous work has demonstrated the effect of advertisement on decision-making bias, which can lead to suboptimal or even negative clinical outcomes.26,27

Bias

Model bias is another concern. Model bias exists because of the limits of AI learning, whereby the data inputted during model training are inherently limited and potentially flawed. This type of bias inherently induces errors in accuracy or mistakes of commission and omission. Given the lack of information about the accuracy and timeliness of the content AI models are trained on and the rules that govern the kinds of responses the models generate, there are notable concerns that large, generative AI models based on natural language processing (NLP) could reproduce biases found in the textual corpora they are trained on. In theory, such bias can potentiate medically inaccurate and discriminatory information in the AI model responses. These issues had already been reported before the popular rise of the current ChatGPT-4. For example, a paper produced by Google researchers in 2019 found that NLP models often reproduced biases about people with both physical and mental disabilities and overreported information about homelessness and drug addiction.28 This work expanded on previous research noting sex and racial biases in model outputs.29,30 Simply listed, these biases are the result of sampling bias, programming-specific bias (colloquially called “programmatic morality bias”), and compliance bias.9,30-34

Sampling bias develops because of the specific inputs used to train the AI’s NLP model. Despite the large data sets available to AI training teams, the information fed into these programs is inherently flawed. Information available online most frequently reflects the “loudest voices” and the biases of the writers. In addition, societies and groups without access or with limited voice are underrepresented. Sampling bias can therefore amplify groupthink and prevent important, balancing facts, opinions, and content from being “trained into the system.” This sampling bias has been well documented in previous literature on risk assessment tools, facial recognition, and other areas using AI.33,34 Within the medical context, the use of these tools to generate medical information increases the risk that physicians may inadvertently perpetuate harmful stereotypes about historically marginalized groups, violating two important tenets of medical ethics: nonmaleficence and justice.35

Because of the “training” required to build AI NLP programs, the creators’ influence can affect the output that any system creates. Although typically well intentioned, the effect of such creator bias and the resulting programming bias cannot be ignored. One could consider the comparison with a parent-child relationship, in which the “habits, lessons, and opportunities provided” by the parent(s) can have a profound effect on childhood development and growth. Although ChatGPT and related AI creators describe their efforts to mold the training and content, this morality filter is still not without its problems. Examples exist across various topics, and many successful and prominent bypasses of these content restrictions have been documented.36

Compliance bias represents an additional substantial concern, whereby the text generated by the AI system is of such perceived high quality, and carries the assumed accuracy implied by its large underlying data set, that it is “believed” as fact despite potential errors. These generated “hallucinations” are coherent collections of mainstream data that simply seem factual but are fictional. Such bias presents a challenge because the adoption and use of AI tools are accelerating with few checks and balances. As with any novel tool or medical test, a degree of scientific scrutiny and skepticism is critical. Despite this, there exist few, if any, policy or position statements within the medical community on this technology and its adoption. Although these biases are not confined to medicine, it is in this realm that true physical harm can result directly from inherently flawed AI training, oversight, and reliance.

Privacy

Protection of a patient’s right to privacy is paramount in modern medical ethics. In their current form, publicly available LLMs such as ChatGPT represent a privacy risk because the organization providing the model as a service can view all queries submitted to the model.37 Data entered into these large systems often contain private information, and it is unclear whether those inputs are collected, analyzed, or used to further train the system. Furthermore, if the systems are using private, sensitive consumer data, it is unclear how such inputs affect the accuracy and appropriateness of the outputs. With these issues in mind, a number of countries have expressed concerns regarding data safety. In Spring 2023, privacy concerns led Italy to temporarily ban ChatGPT altogether, stating that the service may not comply with the European Union’s General Data Protection Regulation and that it had concerns about “the mass collection and storage of personal data for the purpose of ‘training’ the algorithms underlying the operation of the platform.”38 Similarly, the United Kingdom National Cyber Security Centre issued warnings stating, “Individuals and organizations should take great care with the data they choose to submit in prompts.”37 Finally, in March 2023, the European Data Protection Board announced that it would be starting a task force aimed at understanding and evaluating data protection as it relates to ChatGPT services.

Complex Decision Making

Decision making in medicine is rarely simple and based solely on evidence-based algorithms. The question of what is best for patients lies at the core of a physician’s clinical judgment, which is developed over years of experience. A treatment plan for a particular disease process is designed based not only on symptoms but also on the patient’s particular goals of care. Patient autonomy is a stalwart principle in that process. ChatGPT may be able to generate appropriate responses to patients inquiring about a specific collection of symptoms, but these recommendations lack clinical context and thus may not be patient-centered. There exists inherent vulnerability in illness, and medical advice should revolve around an empathetic understanding of the patient’s symptoms and expert medical knowledge. ChatGPT fails in at least one of these elements. This concern is reflected in studies that surveyed patients’ perceptions of using AI-based technologies in medicine.39 Patients showed markedly higher confidence in clinicians than in AI technologies in making clinical diagnoses that guide their treatment.40

The role of large language models such as ChatGPT in health care is yet to be defined. Haupt et al41 recently published a proposal that aptly summarizes AI’s role in three distinct relationships: as a tool to augment the patient-physician relationship, as a substitute for clinical judgment, and as a source of direct-to-consumer health advice. The authors appropriately identify a lack of legal protection for ChatGPT as a barrier to widespread implementation. Most importantly, however, replacing human judgment with AI degrades the patient-physician relationship and the trust that is placed in the physician to do what is best for the patient.

Summary

Medical ethics focuses on the delivery of a high standard of clinical care while safeguarding a patient’s right to privacy. Despite the concerns outlined above, there remains a paucity of guidance statements and policies at the medical society and subspecialty level to guide patients, healthcare professionals, and corporate partners on AI usage and future directions. The increasing sophistication of generative AI and LLMs is an exciting technological advancement that has the potential to transform many aspects of knowledge-based work. However, the numerous limitations of these models, namely their potential inaccuracies and biases arising from erroneous inputs, flawed compilation, or a combination of both, as well as privacy issues, present challenges to their implementation in healthcare settings. Careful use and additional refinement of these tools, done in a way that upholds clinical ethical ideals, are essential to providing a high standard of patient-centric care moving forward.
