Recent progress in generative artificial intelligence (AI) has given rise to large language models (LLMs) that can be prompted to craft persuasive essays,1 pass professional competency exams,2–4 and write patient-friendly, empathetic messages.5 Amid growing recognition of the capabilities of LLMs, many have expressed concerns about their use in medicine and healthcare, citing known risks of confabulation, fragility, and factual inaccuracy.6 As these risks are measured and mitigated, what is coming into focus is a set of unresolved questions about the human values that will remain embedded in AI models, both in their creation and in their use, and about how the “values of an LLM” may misalign with human values even if models no longer confabulate and have been scrubbed of obviously toxic output. Such “human values” pertain broadly to the principles, standards, and preferences that reflect human goals and guide human behaviors (Glossary). As we review here, LLMs and new foundation models, as technically impressive as they are, are only the latest incarnation in a long line of probabilistic models integrated into medical decision-making, all of which have required their creators and implementers to make value judgments.
Many of the challenges we address here were evident to the pioneers of medical decision analysis of the 1950s7 and to scholars in subsequent decades8–11 who conducted careful and creative studies of both human and algorithmic decision-making to disentangle probability, the chance of an event occurring, from utilities, the quantified value judgments that are often only indirectly articulated in medical decision-making (Glossary). This nuanced understanding of individual values and risks is what makes the thoughtful clinician so indispensable. These considerations have renewed relevance now that unprecedentedly capable AI models such as LLMs are ubiquitously available. In this article, we first describe how value judgments enter predictive models, using familiar clinical settings and new AI language models. We then connect early work in reasoning about probabilities and utilities to the emerging issues of newer AI models, identifying unresolved challenges and future opportunities in designing high-performance and safe AI models.
HOW HUMAN VALUES SHAPE AI MODELS
Myriad examples spanning automated chest X-ray interpretation,12 skin disease classification,13 and algorithmic decisions about healthcare resource allocation14 have illustrated how the data used in training AI models encode individual and societal values that may become cemented in the model. As reviewed recently in the Journal,15 biased training data may both amplify and reveal the values and biases present in society. Conversely, studies have also shown that AI can be used to reduce bias. For example, researchers applied deep learning models to knee X-ray images and identified factors within the knee that were missed by standard severity measures graded by radiologists, reducing unexplained pain disparities between Black and White patients.16
Despite growing recognition of bias in AI models, particularly with respect to training data, less appreciated are the many additional entry points for human values along the development and deployment journey of an AI model. Explicit consideration and modeling of human values, and of how they interact with risk assessment and probabilistic reasoning, have been largely absent amid the otherwise impressive recent successes of medical AI.
A MOTIVATING CLINICAL EXAMPLE
To make these abstract concepts concrete, imagine that you are an endocrinologist who has been asked to prescribe recombinant human growth hormone (hGH) to an 8-year-old boy whose height is falling below the 3rd percentile for his age, whose post-stimulation hGH level is less than 2 ng/L (normal > 10 ng/L in the US and > 7 ng/L in many other countries), and who is found to have a rare loss-of-function mutation in the gene encoding hGH. We posit that in this setting proceeding with hGH treatment is straightforward and uncontroversial. Much more controversial would be the administration of supraphysiologic doses of hGH to a 14-year-old boy whose height has consistently been at the 10th percentile for age, who has a post-stimulation hGH peak of 8 ng/L, who harbors no known functional mutations affecting height, who has no other known cause for his short stature, and whose bone age is 15 years (i.e., not delayed). Only part of the controversy is due to divergence in the hGH threshold that experts, informed by dozens of studies, use in making the diagnosis of isolated growth hormone deficiency (IGHD).17 At least as much of the controversy stems from the risk-benefit tradeoff as seen from the perspectives of the patient, his parents, the health professional, the pharmaceutical company, and the payor. The pediatric endocrinologist might weigh the rare adverse effects of 2 years of daily hGH injections against the likelihood of no or minimal gain in adult stature. The boy might feel that even the possibility of a 2 cm gain is worth the effort. The payor and the pharmaceutical company may further disagree.
In 2024, the second case would elicit a default recommendation from an LLM such as the Generative Pre-trained Transformer 4 (GPT-4) model, a recommendation that reflects not only the model’s training data but also the process by which it was trained, including methods such as supervised fine-tuning and reinforcement learning from human feedback (described below). Beyond this, each of the people or parties involved in the second case—the patient, the parents, the physician, the drug maker, and the payor—could instruct GPT-4 to inject “custom values” reflecting their viewpoint on the matter (Figure 1). This “tunability” of the model’s output is a desirable feature of these models, but it raises several questions.
Figure 1. How contemporary AI models may be “steered” to capture different human values.

Large language models (LLMs) such as GPT-4 encode human values based both on their training data and on how they are subsequently “tuned.” As this example illustrates, they can further be powerfully “steered” to adopt different roles. The human prompts in this example concern an identical case of a 14-year-old short-statured boy; GPT-4 is instructed to “adopt” three different perspectives: (A) the treating physician, (B) the insurance company, and (C) the boy’s parents. GPT-4 prompts and output are abridged to fit.
Whose values does a given AI model reflect? How should we proceed in adapting generalist AI models18 for medical decision-making? Will AI models facilitate rational decision-making that reflects the values of the patient or those of other parties? How will financial forces shape the creation and use of these models in medicine? How steerable should an AI model be when used by a physician for an evaluation and treatment plan?
At every stage of model creation and model use, human values enter (Figure 2). We illustrate this first with a simple statistical model familiar to clinicians (the estimated glomerular filtration rate, eGFR) and then, using LLMs, we show that beyond the data underlying an AI model, the model design, training and use cases will encode “human values”. These examples are not intended to be exhaustive but only to illustrate how human values enter across the spectrum of model complexity.
Figure 2. Entry points and choices for human values in traditional clinical equations and new AI models.

In both traditional clinical equations (e.g., eGFR) and new AI models (e.g., LLMs), human values enter at every stage, including in choices about training data, model development, and model use. While the examples are highly varied, the same questions can often be used to elucidate human values in both traditional clinical equations and newer AI models.
IMPLICIT AND EXPLICIT VALUES IN FAMILIAR CLINICAL EQUATIONS
Consider a familiar clinical equation, the creatinine-based estimated glomerular filtration rate (eGFR), a widespread index of kidney function used to diagnose and stage chronic kidney disease, including setting thresholds to determine if an individual is eligible for kidney transplantation or donation, and to determine dose reductions and contraindications for many prescription drugs.19 The eGFR is a simple regression equation developed to estimate the measured glomerular filtration rate (mGFR), which is the gold standard but more onerous to assess.20 This regression equation could hardly be considered an AI model but nonetheless illustrates many principles about human values and probabilistic reasoning.
Human values first enter eGFR in the data used to fit the equation. Most of the original cohorts involved Black and White participants;21,22 generalizability to many other race groups was unknown. Human values further enter into this equation in the choice of accuracy (with respect to mGFR) as the primary target to optimize in estimating kidney function, what constitutes an acceptable accuracy level, how accuracy should be measured, and in the use of eGFR as a threshold to trigger clinical decisions (such as eligibility for transplant or prescription drugs). Finally, values enter in the choice of inputs to the model and their impact on its outputs. For example, until 2021, guidelines23 recommended adjusting creatinine levels in the eGFR equation based on a patient’s age, sex, and race (categorized as Black or non-Black). Race adjustment was introduced to improve the equation’s accuracy (with respect to mGFR), but in 2020, major hospitals started challenging race-based eGFR, citing concerns including delayed transplant eligibility and the reification of race as biology.24,25 Studies showed that how the eGFR model is formulated with regard to race can have profound and varying effects on both accuracy and clinical outcomes,26–28 and thus selectively focusing on accuracy or on a subset of outcomes reflects value judgments that could obscure transparent decision-making.29,30 Ultimately a national task force recommended31 a new equation32 refit without race to balance both performance and equity concerns.33 eGFR illustrates that even a simple clinical equation has many entry points for human values.
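To make concrete how such an equation turns inputs into a clinical estimate, here is a minimal sketch of the 2021 race-free CKD-EPI creatinine equation32 in Python; the coefficients are transcribed here for illustration only and should be checked against the published equation before any use, and the example patient values are hypothetical.

```python
def egfr_ckd_epi_2021(scr_mg_dl: float, age_years: float, female: bool) -> float:
    """Estimated GFR (mL/min/1.73 m^2) from the 2021 race-free CKD-EPI
    creatinine equation; coefficients transcribed for illustration only."""
    kappa = 0.7 if female else 0.9          # sex-specific creatinine "knot"
    alpha = -0.241 if female else -0.302    # exponent applied below the knot
    ratio = scr_mg_dl / kappa
    egfr = (142
            * min(ratio, 1.0) ** alpha
            * max(ratio, 1.0) ** -1.200
            * 0.9938 ** age_years)
    if female:
        egfr *= 1.012
    return egfr

# Hypothetical example: a 60-year-old woman with serum creatinine 1.1 mg/dL
print(round(egfr_ckd_epi_2021(1.1, 60, female=True), 1))
```

Even in a function this short, value-laden choices are visible: which inputs are included (age and sex, and formerly race), which coefficients are fit to which cohorts, and how the resulting number is used downstream as a threshold for transplant eligibility or drug dosing.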
THE VALUES EMBEDDED IN LARGE LANGUAGE MODELS
In contrast to clinical equations with few predictor variables, large language models (LLMs) may be composed of an inscrutable combination of tens to hundreds of billions of parameters or more. We say “inscrutable” because the exact way that a query leads to a response in most LLMs is not mappable. GPT-4’s parameter count is undisclosed; its predecessor GPT-3 has 175 billion parameters.34 More parameters do not necessarily equate with more capability: smaller models trained with more compute, such as the LLaMA35 models, or models carefully fine-tuned with human feedback can outperform their larger counterparts. For example, the InstructGPT36 model (a 1.3-billion-parameter model) outperformed GPT-3 when human raters were asked which model’s outputs they preferred.
The exact training details of GPT-4 are not publicly available, but details for predecessor models, including GPT-3 and InstructGPT, and for many open-source LLMs have been published. Many artificial intelligence models now come with “model cards”37 (Glossary); evaluation and safety data for GPT-4 have been released in an analogous “system card”38 provided by the model’s creator, OpenAI. The creation of LLMs can be broadly divided into two phases: an initial “pre-training” phase followed by a “fine-tuning” phase to refine the model’s output.39 In the pre-training phase, large corpora including raw internet text are provided to the model, which is trained to predict the next word. This seemingly simple “autocomplete” process yields a powerful base model, but one that may also exhibit harmful behavior. Values enter here in the choice of pre-training dataset(s) for GPT-4 and in the decision to scrub inappropriate content (e.g., erotic content) from the pre-training data.38 Despite these efforts, the base model may be neither useful nor free of harmful output.38 It is in the next phase, “fine-tuning,” that much of the useful and non-toxic behavior emerges.
In the fine-tuning phase, “supervised fine-tuning” (SFT) and “reinforcement learning from human feedback” (RLHF) are used to change, often profoundly, the language model’s behavior. In the SFT phase, hired human contractors write example responses to prompts, which are used directly to train the model. In RLHF, human raters rank model outputs for example inputs. These comparisons are then used to learn a “reward model” that is in turn used to further optimize the language model with reinforcement learning.36 A surprisingly modest level of human participation can fine-tune these large models. For example, the InstructGPT model used a team of only about 40 human contractors, recruited from crowdsourcing websites, who passed a screening test that was used to “select a group of labelers who were sensitive to the preferences of different demographic groups.”36 With LLMs like GPT-4, further complexity emerges from the infinite ways in which the model can be “steered” (Figure 1) to encode values long after the model is first trained.6 Many of these same considerations of how human values shape general-purpose LLMs apply not only to GPT-4 but also to the ecosystem of competing LLMs40 produced by other organizations. There is also a growing cadre of medical large language models, for example, Google’s Med-PaLM models.41 Finally, we note that LLMs will often not be used in a standalone manner, but rather after they have been customized and embedded in a larger system, creating further entry points for values.
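To illustrate the mechanics of RLHF at the level of the reward model, the sketch below implements the pairwise preference objective described for InstructGPT36 with a toy linear “reward model”; the feature vectors and comparisons are invented placeholders for the neural networks and human rankings used in practice, and this is not the proprietary pipeline behind GPT-4.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def pairwise_preference_loss(r_chosen: float, r_rejected: float) -> float:
    """Loss used to train RLHF reward models: minimized when the
    rater-preferred response scores higher than the rejected one."""
    return -math.log(sigmoid(r_chosen - r_rejected))

def reward(weights, features):
    """Toy linear 'reward model' standing in for a large neural network."""
    return sum(w * f for w, f in zip(weights, features))

weights = [0.0, 0.0, 0.0]
# Each comparison pairs features of a rater-preferred response (chosen)
# with features of the response the rater rejected (hypothetical data).
comparisons = [([1.0, 0.2, 0.0], [0.1, 0.9, 0.3]),
               ([0.8, 0.1, 0.1], [0.2, 0.7, 0.6])]

def average_loss():
    return sum(pairwise_preference_loss(reward(weights, c), reward(weights, r))
               for c, r in comparisons) / len(comparisons)

print(round(average_loss(), 3))  # before training: -log(0.5) ~= 0.693

learning_rate = 0.5
for _ in range(100):  # plain gradient descent on the preference loss
    for chosen, rejected in comparisons:
        margin = reward(weights, chosen) - reward(weights, rejected)
        grad_wrt_margin = -(1.0 - sigmoid(margin))  # d(loss)/d(margin)
        for i in range(len(weights)):
            weights[i] -= learning_rate * grad_wrt_margin * (chosen[i] - rejected[i])

print(round(average_loss(), 3))  # after training: close to 0
```

The point of the sketch is that the preferences of the comparatively small group of human raters who produce these comparisons are distilled directly into the reward signal that subsequent reinforcement learning optimizes.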
As illustrated by these two extreme examples, a simple clinical equation (eGFR) and a powerful LLM (GPT-4), human decisions, and therefore human values, play an indispensable role in shaping model outputs. Do these AI models capture patient and physician values, which themselves may be quite varied? How can we openly guide AI implementations in medicine? As described below, a principled approach to these questions may arise from revisiting medical decision analysis.
MEASURING HUMAN VALUES: MEDICAL DECISION ANALYSIS AND UTILITY ELICITATION
MEDICAL DECISION ANALYSIS
Although unfamiliar to many practicing clinicians, medical decision analysis provides a systematic approach to complex medical decisions by disentangling probabilistic reasoning about uncertain outcomes related to a decision (e.g., whether to administer hGH in the controversial case in Figure 1) from considerations of the subjective values attached to those outcomes, quantified as “utilities” (e.g., the value to the boy of an additional 2 cm of height). Decision analysis requires one first to identify all potential decisions, outcomes, and the probabilities associated with each outcome, and then to incorporate patient (or other party) utilities attached to these outcomes to select the optimal choice. As a result, the validity of a decision analysis depends on how comprehensively the outcomes are specified as well as how well the utilities are measured and the probabilities are estimated. Ideally, this method can help ensure that decisions are evidence-based and aligned with patient preferences, bridging the gap between objective data and personal values. This approach was introduced to medicine decades ago7,10 and has been applied both to individual patient decisions42 and to population health evaluations, such as recommendations for colorectal cancer screening in the general population.43
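As a minimal worked example of this arithmetic, the sketch below computes expected utilities for the controversial hGH case and selects the option with the higher value; every probability and utility in it is a hypothetical placeholder rather than an estimate from the literature.

```python
# Hypothetical decision analysis for the controversial hGH case:
# all probabilities and utilities below are illustrative placeholders.
decisions = {
    "treat_hGH": [
        # (probability of outcome, utility attached to that outcome)
        (0.30, 0.90),   # meaningful height gain despite burden of injections
        (0.65, 0.70),   # little or no gain after 2 years of daily injections
        (0.05, 0.40),   # adverse effect of treatment
    ],
    "no_treatment": [
        (1.00, 0.75),   # no gain, no treatment burden
    ],
}

def expected_utility(branches):
    """Sum of probability-weighted utilities for one decision branch."""
    assert abs(sum(p for p, _ in branches) - 1.0) < 1e-9
    return sum(p * u for p, u in branches)

for name, branches in decisions.items():
    print(name, round(expected_utility(branches), 3))

best = max(decisions, key=lambda d: expected_utility(decisions[d]))
print("preferred decision:", best)
```

In this toy example the two expected utilities are nearly equal (0.745 versus 0.75), illustrating how small changes in the elicited utilities, say from the boy versus the payor, can flip the "optimal" decision.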
While we do not foresee physicians dramatically altering diagnostic practice using decision analysis in the era of LLMs, the core principle of utility elicitation offers lessons on aligning AI models for medicine. These lessons include the fundamental incompatibility of utilities from competing parties,44 the importance of how information is presented,45 and the benefits of enumerating and measuring both probabilities and utilities even when uncertainty remains in both.10
UTILITY ELICITATION
Many methods have been developed in medical decision analysis to obtain utilities. Most conventional approaches involve direct elicitation of the value from an individual. The simplest is a rating scale, in which an individual scores their preference for an outcome on a numeric scale, such as a linear rating scale (e.g., 1–10), with the most extreme health outcomes (e.g., perfect health and death) at either end.46 While rating scales are intuitive, a major challenge is that they do not capture the uncertainty inherent in healthcare decisions. Time tradeoff is another commonly used method: individuals are asked how much time in “good health” they would trade for a given quantity of time in a lesser health state.47 The standard gamble is another popular approach for determining utilities. Here, an individual is asked to choose between two options: either they live t years in a normal health state with a given probability, p, and risk dying with probability 1-p, or they live t years in a lesser health state with certainty. The question is repeated at different values of p until the individual is indifferent between the two options, allowing a utility to be calculated from that indifference point.46
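To make the arithmetic concrete, the sketch below converts standard-gamble and time-tradeoff responses into utilities, using the conventional anchors of perfect health at 1 and death at 0; the indifference points shown are invented for illustration.

```python
def utility_standard_gamble(indifference_p: float) -> float:
    """Standard gamble: at indifference between (a) a gamble with probability
    p of full health and 1-p of death and (b) the health state for certain,
    the state's utility equals p (anchors: full health = 1, death = 0)."""
    return indifference_p

def utility_time_tradeoff(years_full_health: float, years_in_state: float) -> float:
    """Time tradeoff: if a person is indifferent between x years in full
    health and t years in the health state, the state's utility is x / t."""
    return years_full_health / years_in_state

# Hypothetical respondent: indifferent at p = 0.85 in the standard gamble,
# and willing to trade 10 years in the health state for 8.5 years in full health.
print(utility_standard_gamble(0.85))        # 0.85
print(utility_time_tradeoff(8.5, 10.0))     # 0.85
```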
In addition to eliciting the preferences of individual patients, methods for obtaining the utilities of a group of patients have also been developed. In particular, focus group discussions, where patients are brought together to discuss a specific experience, can be useful in understanding their perspectives.48,49 To effectively aggregate the utilities from a group, many structured group discussion techniques have been proposed. For example, the nominal group technique allows participants to write down their thoughts and preferences independently, followed by idea sharing and group discussion. Finally, the preferences of the group are aggregated by a voting process.50 Although these structured discussion techniques can overcome issues of groupthink, there are inherent limitations in the voting procedures for obtaining group preferences.44 In addition, as is true of all such exercises, the aggregated decision is not necessarily reflective of individual preferences.51 These findings highlight the difficulty in defining optimal decisions when more than one stakeholder is involved.
In practice, eliciting utilities directly from each individual is time-consuming. As a solution, population-level utility scores are commonly obtained using questionnaires sent to a randomly selected portion of the population. Examples include the EQ-5D,52 the SF-6D utility weights,53 the Health Utilities Index,54 and the cancer-specific QLQ-C30 instrument.55 Population-level utilities can then be generated from these survey data using methods such as the time tradeoff and the standard gamble. The discrete choice experiment is another survey-based method for understanding preferences: individuals are presented with a series of paired options and asked to choose between them, from which quantitative health utilities can be calculated.56 In each of these approaches, an individual’s utility may differ from the group utility, which raises the issue of individual autonomy when group-derived utilities are applied to individuals.
UNRESOLVED CHALLENGES AND FUTURE DIRECTIONS
The examples from medical decision analysis, current methods for utility elicitation, and their limitations point to several unresolved issues and key questions for contemporary AI models in medicine.
WHOSE VALUES SHOULD BE INJECTED?
As discussed above, human values can profoundly shape the inputs and outputs of both simple clinical regression models and advanced AI models. For example, with LLMs, fine-tuning methods, including SFT and RLHF, refine LLM outputs based on human input from crowdsourced workers hired and instructed by the model developers. This transmutes the question of which values are encoded in models into the question of whose values are encoded. The values that should govern the range of model behavior in clinical care and the healthcare system remain unresolved, but efforts to develop principles for responsible medical AI are underway.57,58 The potential biases of crowdsourced inputs and the variability in values across cultures further compound this challenge. Studies that develop and evaluate AI in areas where resources may be limited, including low- and middle-income countries, are needed.59,60 Emerging work characterizing the “psychology” of LLMs is promising.61 Future studies of AI in realistic clinical settings that rigorously evaluate how AI affects human decision-making and skill development are urgently needed.62,63 Undoubtedly, such studies will both rediscover and exploit many lessons from the psychology and medical decision-making literature about the cognitive biases and heuristics that can both enhance decision-making and lead it astray.8
DATASET SHIFT
Dataset shift64 refers to changes in data characteristics that can undermine the accuracy and reliability of AI models. Such shifts can arise from evolving medical practices, demographic changes in the population, or the emergence of novel diseases. When incorporating human values into AI systems, shifts in societal values and differences in values among subpopulations can lead to inappropriate treatment recommendations, poor alignment with common societal expectations, and a potential loss of trust in AI-driven tools among both clinicians and patients.65 Ensuring that models are periodically retrained and that model outputs are regularly monitored can help foster the safe and effective application of AI in medicine,66–68 as with non-AI diagnostic tests and procedures.69,70 AI governance teams can also help provide oversight,71,72 and agencies worldwide are grappling with how to regulate AI models, a challenge that will become more complex with foundation models73,74 and with models that can reason over multiple data types.75–77 Finally, consideration of the values of individual patients may cause physicians to ignore or override AI recommendations; the liability implications remain an active focus of legal scholars.78 As medical AI becomes more integrated into care, recognizing and mitigating the risks associated with dataset shift will be paramount in aligning AI outputs with human values.
ALTERNATIVES TO DIRECT UTILITY MEASUREMENT
Although the utility elicitation methods described above can capture human values, they are often limited to well-controlled study settings and miss the nuances of decision-making as individuals grapple with real-world healthcare scenarios. They can also be sensitive to framing and context,45 prone to bias,79,80 and difficult to scale. Decision curve analysis81,82 is an alternative paradigm for evaluating diagnostic tests and predictive models without requiring explicit utility elicitation. Another emerging line of research employs data-driven methods to extract human values and integrate them as long-term objectives in order to support continual learning that can adapt to shifting data and values.
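As an illustration of how decision curve analysis sidesteps direct utility elicitation, the sketch below computes net benefit at a chosen threshold probability, which implicitly encodes how many false positives a decision-maker will accept per true positive;81 the patient counts and thresholds are hypothetical.

```python
def net_benefit(true_pos: int, false_pos: int, n: int, threshold: float) -> float:
    """Net benefit at a given threshold probability: the threshold encodes
    the value judgment of how many false positives one true positive is worth."""
    odds = threshold / (1.0 - threshold)
    return true_pos / n - (false_pos / n) * odds

# Hypothetical model applied to 1,000 patients, evaluated at two thresholds
# reflecting different tolerances for unnecessary intervention.
print(round(net_benefit(true_pos=80, false_pos=150, n=1000, threshold=0.10), 3))  # 0.063
print(round(net_benefit(true_pos=80, false_pos=150, n=1000, threshold=0.30), 3))  # 0.016
```

The value judgment has not disappeared; it has simply been compressed into the choice of threshold probability.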
The subfield of reinforcement learning focuses on having a computer “agent” learn what actions to take in a given state and environment in order to maximize a specified “reward.” RLHF, seen above with LLMs, is one example of reinforcement learning. A key component is the reward function, which quantifies the desirability of each state. Given the myriad clinical scenarios and patient-specific utility variations, crafting this function is challenging but remains an active frontier.
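As a toy sketch of how patient-specific utilities might enter such a reward function (an illustration, not a method drawn from the cited literature), the example below scores the same clinical state under two different sets of elicited preference weights; the state representation and all weights are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class State:
    """Hypothetical clinical state reached after an action has been taken."""
    predicted_height_gain_cm: float   # expected benefit of the action
    injection_years: float            # burden of daily injections
    adverse_event: bool               # whether an adverse effect occurred

def reward(state: State, utility_per_cm: float, disutility_per_year: float,
           adverse_event_penalty: float) -> float:
    """Toy reward combining a clinical outcome with patient-specific
    utility weights; specifying these weights is where values concentrate."""
    r = utility_per_cm * state.predicted_height_gain_cm
    r -= disutility_per_year * state.injection_years
    if state.adverse_event:
        r -= adverse_event_penalty
    return r

# The same state scored under two different sets of elicited preferences.
s = State(predicted_height_gain_cm=2.0, injection_years=2.0, adverse_event=False)
print(reward(s, utility_per_cm=1.0, disutility_per_year=0.3, adverse_event_penalty=5.0))  # values height gain
print(reward(s, utility_per_cm=0.2, disutility_per_year=1.0, adverse_event_penalty=5.0))  # values avoiding burden
```

The same state earns a positive reward under one set of preferences and a negative reward under the other, which is precisely why crafting the reward function is where the value judgments concentrate.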
CONCLUSION
At every stage of training and deploying an AI model, human values enter. Echoing the insights of decision analysts from decades ago, we recognize that AI models are far from immune to the shifts and discrepancies of values across individuals and societies. Past utilities may no longer be relevant, or they may even reflect pernicious societal biases. Our shared responsibility is to ensure that the AI models we deploy accurately and explicitly reflect patient values and goals. As Pauker and Kassirer noted in the Journal more than three decades ago in reviewing progress in medical decision analysis,10 “the threat to physicians of a mathematical approach to medical decision making simply has not materialized.” Similarly, rather than replacing physicians, AI has made the consideration of values, as reflected in the guidance of a thoughtful physician, more essential than ever.
Acknowledgments
We thank Dr. Mihaela van der Schaar and Mr. James Diao for helpful discussions.
Glossary
- Alignment
The degree to which an artificial intelligence system’s behaviors and actions are congruent with human values.
- Generative Artificial Intelligence
A form of artificial intelligence designed to produce new and original data outputs, including those that resemble human-made content (e.g., text, code, images, audio, and video).
- Human Values
A broad term for the principles, standards, and preferences that reflect human goals and guide human behaviors.
- Large Language Model (LLM)
A type of artificial intelligence model that interprets and generates text. LLMs are often “pre-trained” with large text corpora and then fine-tuned with human feedback via supervised fine-tuning and reinforcement learning from human feedback.
- Model Card
A comprehensive overview of a machine learning model’s characteristics, including its training and evaluation data and training procedure, existing evaluations (e.g., observed safety or bias challenges and existing remediation strategies), intended use cases, and performance across populations (e.g., key demographic or clinical groups). Similar to the “System Card” released for GPT-4.38
- Reinforcement Learning from Human Feedback (RLHF)
A method of fine-tuning LLMs where humans rank responses to prompts; reinforcement learning is then used to optimize the output to align with human preferences.
- Supervised Fine-tuning (SFT)
A method of fine-tuning LLMs that uses human-written responses to example prompts.
- Utility
The quantitative measure used in decision analysis to assess the value of a health state or outcome. Utilities may be elicited directly from individual patients or groups, or they can be learned from data. Utilities may be applied to individuals, groups, or populations.
Footnotes
Disclosure forms provided by the authors are available with the full text of this article at NEJM.org.
References
- 1.Noy S, Zhang W. Experimental evidence on the productivity effects of generative artificial intelligence. Science 2023;381(6654):187–92. [DOI] [PubMed] [Google Scholar]
- 2.Nori H, King N, McKinney SM, Carignan D, Horvitz E. Capabilities of GPT-4 on Medical Challenge Problems [Internet]. arXiv [cs.CL]. 2023;Available from: http://arxiv.org/abs/2303.13375 [Google Scholar]
- 3.OpenAI. GPT-4 Technical Report [Internet]. arXiv [cs.CL]. 2023;Available from: http://arxiv.org/abs/2303.08774 [Google Scholar]
- 4.Singhal K, Azizi S, Tu T, et al. Large language models encode clinical knowledge. Nature 2023;620(7972):172–80. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5.Ayers JW, Poliak A, Dredze M, et al. Comparing Physician and Artificial Intelligence Chatbot Responses to Patient Questions Posted to a Public Social Media Forum. JAMA Intern Med 2023;183(6):589–96. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6.Lee P, Bubeck S, Petro J. Benefits, limits, and risks of GPT-4 as an AI chatbot for medicine. N Engl J Med 2023;388(13):1233–9. [DOI] [PubMed] [Google Scholar]
- 7.Ledley RS, Lusted LB. Reasoning foundations of medical diagnosis; symbolic logic, probability, and value theory aid our understanding of how physicians reason. Science 1959;130(3366):9–21. [DOI] [PubMed] [Google Scholar]
- 8.Tversky A, Kahneman D. Judgment under Uncertainty: Heuristics and Biases. Science 1974;185(4157):1124–31. [DOI] [PubMed] [Google Scholar]
- 9.Szolovits P, Pauker SG. Categorical and Probabilistic Reasoning in Medical Diagnosis. Artif Intell 1978;11(1):115–44. [Google Scholar]
- 10.Pauker SG, Kassirer JP. Decision analysis. N Engl J Med 1987;316(5):250–8. [DOI] [PubMed] [Google Scholar]
- 11.McNeil BJ, Keller E, Adelstein SJ. Primer on certain elements of medical decision making. N Engl J Med 1975;293(5):211–5. [DOI] [PubMed] [Google Scholar]
- 12.Seyyed-Kalantari L, Zhang H, McDermott MBA, Chen IY, Ghassemi M. Underdiagnosis bias of artificial intelligence algorithms applied to chest radiographs in under-served patient populations. Nat Med 2021;27(12):2176–82. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13.Daneshjou R, Vodrahalli K, Novoa RA, et al. Disparities in dermatology AI performance on a diverse, curated clinical image set. Sci Adv 2022;8(32):eabq6147. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14.Obermeyer Z, Powers B, Vogeli C, Mullainathan S. Dissecting racial bias in an algorithm used to manage the health of populations. Science 2019;366(6464):447–53. [DOI] [PubMed] [Google Scholar]
- 15.Ferryman K, Mackintosh M, Ghassemi M. Considering Biased Data as Informative Artifacts in AI-Assisted Health Care. N Engl J Med 2023;389(9):833–8. [DOI] [PubMed] [Google Scholar]
- 16.Pierson E, Cutler DM, Leskovec J, Mullainathan S, Obermeyer Z. An algorithmic approach to reducing unexplained pain disparities in underserved populations. Nat Med 2021;27(1):136–40. [DOI] [PubMed] [Google Scholar]
- 17.Rodari G, Profka E, Giacchetti F, Cavenaghi I, Arosio M, Giavoli C. Influence of biochemical diagnosis of growth hormone deficiency on replacement therapy response and retesting results at adult height. Sci Rep 2021;11(1):14553. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 18.Moor M, Banerjee O, Abad ZSH, et al. Foundation models for generalist medical artificial intelligence. Nature 2023;616(7956):259–65. [DOI] [PubMed] [Google Scholar]
- 19.Levey AS, Grams ME, Inker LA. Uses of GFR and Albuminuria Level in Acute and Chronic Kidney Disease. N Engl J Med 2022;386(22):2120–8. [DOI] [PubMed] [Google Scholar]
- 20.Levey AS, Coresh J, Tighiouart H, Greene T, Inker LA. Measured and estimated glomerular filtration rate: current status and future directions. Nat Rev Nephrol 2020;16(1):51–64. [DOI] [PubMed] [Google Scholar]
- 21.Levey AS, Bosch JP, Lewis JB, Greene T, Rogers N, Roth D. A more accurate method to estimate glomerular filtration rate from serum creatinine: A new prediction equation. Ann Intern Med 1999;130(6):461–70. [DOI] [PubMed] [Google Scholar]
- 22.Levey AS, Stevens LA, Schmid CH, et al. A New Equation to Estimate Glomerular Filtration Rate. Ann Intern Med 2009;150(9):604–12. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 23.KDIGO CKD Work Group. KDIGO 2012 Clinical Practice Guideline for the Evaluation and Management of Chronic Kidney Disease. Kidney Int Suppl 2013;3:1–150. [DOI] [PubMed] [Google Scholar]
- 24.Eneanya ND, Yang W, Reese PP. Reconsidering the Consequences of Using Race to Estimate Kidney Function. JAMA 2019;322(2):113–4. [DOI] [PubMed] [Google Scholar]
- 25.Vyas DA, Eisenstein LG, Jones DS. Hidden in Plain Sight — Reconsidering the Use of Race Correction in Clinical Algorithms. N Engl J Med [Internet] 2020;Available from: https://www.nejm.org/doi/full/10.1056/NEJMms2004740 [DOI] [PubMed] [Google Scholar]
- 26.Diao JA, Wu GJ, Taylor HA, et al. Clinical Implications of Removing Race From Estimates of Kidney Function. JAMA [Internet] 2020;Available from: 10.1001/jama.2020.22124 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 27.Levey AS, Titan SM, Powe NR, Coresh J, Inker LA. Kidney Disease, Race, and GFR Estimation. Clin J Am Soc Nephrol [Internet] 2020;Available from: 10.2215/CJN.12791019 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 28.Ahmed S, Nutt CT, Eneanya ND, et al. Examining the Potential Impact of Race Multiplier Utilization in Estimated Glomerular Filtration Rate Calculation on African-American Care Outcomes. J Gen Intern Med [Internet] 2020;Available from: 10.1007/s11606-020-06280-5 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 29.Manski CF, Mullahy J, Venkataramani AS. Using measures of race to make clinical predictions: Decision making, patient health, and fairness. Proc Natl Acad Sci U S A 2023;120(35):e2303370120. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 30.Pierson E. Accuracy and Equity in Clinical Risk Prediction. N Engl J Med 2024;390(2):100–2. [DOI] [PubMed] [Google Scholar]
- 31.Delgado C, Baweja M, Crews DC, et al. A Unifying Approach for GFR Estimation: Recommendations of the NKF-ASN Task Force on Reassessing the Inclusion of Race in Diagnosing Kidney Disease. Am J Kidney Dis [Internet] 2021;Available from: 10.1053/j.ajkd.2021.08.003 [DOI] [PubMed] [Google Scholar]
- 32.Inker LA, Eneanya ND, Coresh J, et al. New Creatinine- and Cystatin C–Based Equations to Estimate GFR without Race. N Engl J Med [Internet] 2021;Available from: 10.1056/NEJMoa2102953 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 33.Diao JA, Inker LA, Levey AS, Tighiouart H, Powe NR, Manrai AK. In search of a better equation - performance and equity in estimates of kidney function. N Engl J Med 2021;384(5):396–9. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 34.Brown T, Mann B, Ryder N, et al. Language models are few-shot learners. Adv Neural Inf Process Syst 2020;33:1877–901. [Google Scholar]
- 35.Touvron H, Lavril T, Izacard G, et al. LLaMA: Open and Efficient Foundation Language Models [Internet]. arXiv [cs.CL]. 2023;Available from: http://arxiv.org/abs/2302.13971 [Google Scholar]
- 36.Ouyang L, Wu J, Jiang X, et al. Training language models to follow instructions with human feedback. Adv Neural Inf Process Syst 2022;35:27730–44. [Google Scholar]
- 37.Mitchell M, Wu S, Zaldivar A, et al. Model Cards for Model Reporting. In: Proceedings of the Conference on Fairness, Accountability, and Transparency. New York, NY, USA: Association for Computing Machinery; 2019. p. 220–9. [Google Scholar]
- 38.OpenAI. GPT-4 System Card. 2023;Available from: https://cdn.openai.com/papers/gpt-4-system-card.pdf [Google Scholar]
- 39.Stats, STAT! [Internet]. [cited 2024 Mar 25];Available from: https://evidence.nejm.org/browse/evidence-media-type/stats-stat [Google Scholar]
- 40.Yang J, Jin H, Tang R, Han X, Feng Q, Jiang H. Harnessing the power of llms in practice: A survey on chatgpt and beyond. arXiv preprint arXiv [Internet] 2023;Available from: https://arxiv.org/abs/2304.13712 [Google Scholar]
- 41.Singhal K, Tu T, Gottweis J, et al. Towards Expert-Level Medical Question Answering with Large Language Models [Internet]. arXiv [cs.CL]. 2023;Available from: http://arxiv.org/abs/2305.09617 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 42.O’Meara JJ 3rd, McNutt RA, Evans AT, Moore SW, Downs SM. A decision analysis of streptokinase plus heparin as compared with heparin alone for deep-vein thrombosis. N Engl J Med 1994;330(26):1864–9. [DOI] [PubMed] [Google Scholar]
- 43.Zauber AG, Lansdorp-Vogelaar I, Knudsen AB, Wilschut J, van Ballegooijen M, Kuntz KM. Evaluating test strategies for colorectal cancer screening: a decision analysis for the U.S. Preventive Services Task Force. Ann Intern Med 2008;149(9):659–69. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 44.Arrow KJ. A Difficulty in the Concept of Social Welfare. J Polit Econ 1950;58(4):328–46. [Google Scholar]
- 45.McNeil BJ, Pauker SG, Sox HC Jr, Tversky A. On the elicitation of preferences for alternative therapies. N Engl J Med 1982;306(21):1259–62. [DOI] [PubMed] [Google Scholar]
- 46.Torrance GW. Measurement of health state utilities for economic appraisal. J Health Econ 1986;5(1):1–30. [DOI] [PubMed] [Google Scholar]
- 47.Garza AG, Wyrwich KW. Health utility measures and the standard gamble. Acad. Emerg. Med 2003;10(4):360–3. [DOI] [PubMed] [Google Scholar]
- 48.Wong LP. Focus group discussion: a tool for health and medical research. Singapore Med J 2008;49(3):256–60; quiz 261. [PubMed] [Google Scholar]
- 49.Powell RA, Single HM. Focus groups. Int J Qual Health Care 1996;8(5):499–504. [DOI] [PubMed] [Google Scholar]
- 50.Gallagher M, Hares T, Spencer J, Bradshaw C, Webb I. The nominal group technique: a research tool for general practice? Fam Pract 1993;10(1):76–81. [DOI] [PubMed] [Google Scholar]
- 51.Patty JW, Penn EM. Measuring Fairness, Inequality, and Big Data: Social Choice Since Arrow. Annu Rev Polit Sci 2019;22(1):435–60. [Google Scholar]
- 52.Rabin R, de Charro F. EQ-5D: a measure of health status from the EuroQol Group. Ann Med 2001;33(5):337–43. [DOI] [PubMed] [Google Scholar]
- 53.Norman R, Viney R, Brazier J, et al. Valuing SF-6D Health States Using a Discrete Choice Experiment. Med Decis Making 2014;34(6):773–86. [DOI] [PubMed] [Google Scholar]
- 54.Horsman J, Furlong W, Feeny D, Torrance G. The Health Utilities Index (HUI®): concepts, measurement properties and applications. Health Qual Life Outcomes 2003;1(1):1–13. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 55.Aaronson NK, Ahmedzai S, Bergman B, et al. The European Organization for Research and Treatment of Cancer QLQ-C30: a quality-of-life instrument for use in international clinical trials in oncology. J Natl Cancer Inst 1993;85(5):365–76. [DOI] [PubMed] [Google Scholar]
- 56.Ryan M, Bate A, Eastmond CJ, Ludbrook A. Use of discrete choice experiments to elicit preferences. Qual Health Care 2001;10 Suppl 1(Suppl 1):i55–60. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 57.Goldberg CB, Adams L, Blumenthal D, et al. To do no harm — and the most good — with AI in health care. NEJM AI [Internet] 2024;1(3). Available from: https://ai.nejm.org/doi/abs/10.1056/AIp2400036 [DOI] [PubMed] [Google Scholar]
- 58.World Health Organization. Ethics and governance of artificial intelligence for health [Internet]. 2021. [cited 2024 Apr 1];Available from: https://hash.theacademy.co.ug/wp-content/uploads/2022/05/WHO-guidance-Ethics-and-Governance-of-AI-for-Health.pdf [Google Scholar]
- 59.Mehta MC, Katz IT, Jha AK. Transforming Global Health with AI. N Engl J Med 2020;382(9):791–3. [DOI] [PubMed] [Google Scholar]
- 60.Mate A, Madaan L, Taneja A, et al. Field Study in Deploying Restless Multi-Armed Bandits: Assisting Non-profits in Improving Maternal and Child Health. AAAI 2022;36(11):12017–25. [Google Scholar]
- 61.Shiffrin R, Mitchell M. Probing the psychology of AI models. Proc. Natl. Acad. Sci. U. S. A 2023;120(10):e2300963120. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 62.Agarwal N, Moehring A, Rajpurkar P, Salz T. Combining human expertise with artificial intelligence: Experimental evidence from radiology. 2023;Available from: https://www.nber.org/papers/w31422 [Google Scholar]
- 63.Tu T, Palepu A, Schaekermann M, et al. Towards Conversational Diagnostic AI [Internet]. arXiv [cs.AI]. 2024;Available from: http://arxiv.org/abs/2401.05654 [Google Scholar]
- 64.Finlayson SG, Subbaswamy A, Singh K, et al. The Clinician and Dataset Shift in Artificial Intelligence. N Engl J Med 2021;385(3):283–6. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 65.Yu K-H, Kohane IS. Framing the challenges of artificial intelligence in medicine. BMJ Qual Saf 2019;28(3):238–41. [DOI] [PubMed] [Google Scholar]
- 66.Beam AL, Manrai AK, Ghassemi M. Challenges to the Reproducibility of Machine Learning Models in Health Care. JAMA [Internet] 2020;Available from: 10.1001/jama.2019.20866 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 67.Yu K-H, Lee T-LM, Yen M-H, et al. Reproducible Machine Learning Methods for Lung Cancer Detection Using Computed Tomography Images: Algorithm Development and Validation. J Med Internet Res 2020;22(8):e16709. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 68.Yu K-H, Beam AL, Kohane IS. Artificial intelligence in healthcare. Nature Biomedical Engineering 2018;2(10):719–31. [DOI] [PubMed] [Google Scholar]
- 69.Manrai AK, Funke BH, Rehm HL, et al. Genetic Misdiagnoses and the Potential for Health Disparities. N Engl J Med 2016;375(7):655–65. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 70.Manrai AK, Patel CJ, Ioannidis JPA. In the Era of Precision Medicine and Big Data, Who Is Normal? JAMA - Journal of the American Medical Association 2018;319(19):1981–2. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 71.Price WN, Sendak M, Balu S, Singh K. Enabling collaborative governance of medical AI. Nature Machine Intelligence 2023;5(8):821–3. [Google Scholar]
- 72.Nong P, Hamasha R, Singh K, Adler-Milstein J, Platt J. How Academic Medical Centers Govern AI Prediction Tools in the Context of Uncertainty and Evolving Regulation. NEJM AI 2024;0(0):AIp2300048. [Google Scholar]
- 73.Meskó B, Topol EJ. The imperative for regulatory oversight of large language models (or generative AI) in healthcare. NPJ Digit Med 2023;6(1):120. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 74.Minssen T, Vayena E, Cohen IG. The Challenges for Regulating Medical Use of ChatGPT and Other Large Language Models. JAMA 2023;330(4):315–6. [DOI] [PubMed] [Google Scholar]
- 75.Acosta JN, Falcone GJ, Rajpurkar P, Topol EJ. Multimodal biomedical AI. Nat Med 2022;28(9):1773–84. [DOI] [PubMed] [Google Scholar]
- 76.Yu K-H, Zhang C, Berry GJ, et al. Predicting non-small cell lung cancer prognosis by fully automated microscopic pathology image features. Nat Commun 2016;7:12474. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 77.Nasrallah MP, Zhao J, Tsai CC, et al. Machine learning for cryosection pathology predicts the 2021 WHO classification of glioma. Med 2023;4(8):526–40.e4. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 78.Price WN 2nd, Gerke S, Cohen IG. Potential Liability for Physicians Using Artificial Intelligence. JAMA 2019;322(18):1765–6. [DOI] [PubMed] [Google Scholar]
- 79.van Osch SMC, Wakker PP, van den Hout WB, Stiggelbout AM. Correcting biases in standard gamble and time tradeoff utilities. Med Decis Making 2004;24(5):511–7. [DOI] [PubMed] [Google Scholar]
- 80.Lugnér AK, Krabbe PFM. An overview of the time trade-off method: concept, foundation, and the evaluation of distorting factors in putting a value on health. Expert Rev Pharmacoecon Outcomes Res 2020;20(4):331–42. [DOI] [PubMed] [Google Scholar]
- 81.Vickers AJ, Elkin EB. Decision curve analysis: a novel method for evaluating prediction models. Med Decis Making 2006;26(6):565–74. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 82.Fitzgerald M, Saville BR, Lewis RJ. Decision curve analysis. JAMA 2015;313(4):409–10. [DOI] [PubMed] [Google Scholar]
