Abstract
Artificial intelligence (AI), particularly large language models (LLMs), is increasingly integrated into mental health care. This study examined racial bias in psychiatric diagnosis and treatment across four leading LLMs: Claude, ChatGPT, Gemini, and NewMes-v15 (a local, medical-focused LLaMA 3 variant). Ten psychiatric patient cases representing five diagnoses were presented to these models under three conditions: race-neutral, race-implied, and race explicitly stated (i.e., stating that the patient is African American). The models’ diagnostic recommendations and treatment plans were qualitatively evaluated by a clinical neuropsychologist and a social psychologist, who scored 120 outputs for bias by comparing responses generated under the race-implied and race-explicit conditions with those from the race-neutral condition. Results indicated that LLMs often proposed inferior treatments when patient race was explicitly or implicitly indicated, though diagnostic decisions demonstrated minimal bias. NewMes-v15 exhibited the highest degree of racial bias, while Gemini showed the least. These findings underscore critical concerns about the potential for AI to perpetuate racial disparities in mental healthcare, emphasizing the necessity of rigorous bias assessment in algorithmic medical decision support systems.
Subject terms: Medical ethics, Signs and symptoms, Outcomes research
Introduction
Large language models (LLMs), a type of artificial intelligence (AI), have been heralded as potentially powerful tools for increasing efficiency, broadening access, and improving outcomes in mental health care1,2. Given the high burden of documentation, estimated to consume up to 40% of a provider’s time3, LLMs that can quickly synthesize input data to generate custom patient reports have been seen as a solution4. Because LLMs can “understand” plain text and audio, this capability can extend to automatically extracting and processing symptoms and other information from a clinical interview5,6. Additionally, by proposing diagnoses and interventions based on information gleaned from massive medical databases, LLMs can also optimize treatment4. In an ideal situation, a provider can share a patient’s information with an LLM and, in real time, receive a detailed report that includes an accurate diagnostic assessment and a sensible treatment plan, along with an explanation of the LLM’s reasoning and relevant references7. The potential and promise of LLMs in psychiatric practice are well documented8,9. Recent research showing LLMs such as ChatGPT Plus (GPT-4) to be superior to early-career physicians suggests that the promise of LLMs in healthcare in general may now be within reach10.
These potential gains, however, may be overshadowed by flaws and risks inherent to LLMs. One risk is that LLMs might perpetuate or exacerbate inequities in mental health care, as they have shown racial bias in general medicine studies11–14. For example, LLMs have been shown to replicate medical biases in understanding the health of African Americans, such as assuming thicker skin and lower lung capacity compared to white patients11. Accordingly, there is strong evidence that LLMs tend to make significantly more errors when processing mental health information from minority groups, and these problems are more common in LLMs with smaller parameter sizes15. This suggests that LLMs may harbor unfounded assumptions when it comes to mental health as well, replicating existing biases in psychiatric diagnosis and treatment in minorities. For example, in African American patients, LLMs may replicate past tendencies in the medical field to underdiagnose conditions such as depression16 and anxiety17 and overdiagnose conditions such as schizophrenia18,19, or to generally suggest less effective20 or riskier treatments21, partly as a function of race. LLMs might even learn and replicate stigmatizing language found in EHR mental health care notes22. Because mental health is a “high stakes” domain that is defined partly by the “fuzziness” of symptoms8, and high-quality evidence has found that LLMs tend to be more racially biased in diagnosing mental health conditions than any other health conditions23, LLMs trained on biased data are likely to perpetuate racism unless specific countermeasures are implemented.
Even implied race may trigger a biased output by LLMs, since they have been found to respond differently when requests use a dialect typically used by African Americans (sometimes referred to as African American Vernacular English, or AAVE)24. Therefore, there is concern that if an LLM interprets the transcript of a clinical interview in which the patient uses AAVE, the LLM’s output might be biased. A similar concern exists if the LLM assumes a patient’s race based on their name, as some research on LLMs guessing race from usernames has suggested25,26. Recent research has also demonstrated that ChatGPT-4 can use cues from patient prompts to detect race in mental healthcare, and provides lower-empathy responses to Black participants27. Considering the well-documented susceptibility of LLMs to racial bias and their rapid uptake in clinical care28, this study examines four popular LLMs in psychiatry, a medical field prone to bias16–19,29–31, by providing them with real cases to diagnose and treat, adding implicit or explicit racial information for comparison. We benchmark three commercially available generalist LLMs (Gemini, ChatGPT, and Claude) and one “local” model (NewMes-v15) running entirely on a local computer (i.e., not requiring online resources).
Results
Systematic variations were observed in output based on the presence of racial characteristics in patient reports. When averaging the scores across all LLMs and conditions (for diagnosis and treatment), the mean bias score was 1.93 (95% CI [1.91–2.14]; SD = 0.97; median = 2) for the explicit condition and 1.37 (95% CI [1.36–1.38]; SD = 0.91; median = 1) for the implicit condition, indicating a higher likelihood of bias in the presence of explicit racial information. Among the psychiatric conditions examined, responses to schizophrenia cases demonstrated the highest likelihood of bias (driven by treatment bias), while depression demonstrated the lowest.
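For illustration, the following is a minimal sketch (not the authors’ code) of how such descriptive statistics, including the t-distribution 95% confidence interval described in the Methods, can be computed for a set of 0–3 bias ratings; the function name and example scores are hypothetical.

```python
# Minimal sketch (not the authors' code): descriptive statistics for a set of
# 0-3 bias ratings, with a 95% CI for the mean derived from a t-distribution
# as described in the Methods. The example scores are hypothetical.
import numpy as np
from scipy import stats

def describe_bias(scores):
    scores = np.asarray(scores, dtype=float)
    n = scores.size
    mean = scores.mean()
    sd = scores.std(ddof=1)
    sem = sd / np.sqrt(n)
    ci_low, ci_high = stats.t.interval(0.95, n - 1, loc=mean, scale=sem)
    return {"mean": mean, "sd": sd, "median": float(np.median(scores)),
            "ci95": (ci_low, ci_high)}

print(describe_bias([2, 1, 3, 2, 0, 2, 1, 2]))  # hypothetical ratings
```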
Diagnostic assessments showed relative consistency across different LLMs and case presentations, with mean bias scores remaining at or below 1.5 for most conditions. The notable exception was schizophrenia, where bias scores often exceeded 1.5 on average for some LLMs (Claude and NewMes), though diagnostic reasoning patterns remained largely consistent. This suggests that while racial characteristics influenced model outputs, their impact on the “thought” process behind the diagnosis was limited. Figure 1 shows a heatmap of the average bias scores for diagnosis (both for implied and explicitly stated race).
Fig. 1. Heatmap of diagnosis for each LLM and condition.
For each illness, each model was provided with an implicit and an explicit version of the case that included race. Redder cells correspond to more biased responses relative to the neutral condition.
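A heatmap of this kind can be produced from a table of mean bias scores; the following is an illustrative sketch only (not the authors’ plotting code), with placeholder values and a red colormap mirroring the convention that redder cells indicate more bias.

```python
# Minimal sketch (illustrative only, not the authors' plotting code): a
# heatmap of mean bias scores, with models as rows and diagnoses as columns.
# The numbers below are placeholders.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

mean_bias = pd.DataFrame(
    {"Depression": [0.6, 0.9, 1.1, 1.3],
     "Anxiety": [0.8, 1.2, 1.4, 1.6],
     "Schizophrenia": [1.1, 1.7, 1.5, 1.9]},
    index=["Gemini", "Claude", "ChatGPT", "NewMes"],
)

ax = sns.heatmap(mean_bias, cmap="Reds", vmin=0, vmax=3, annot=True, fmt=".1f")
ax.set_title("Mean diagnosis bias score by LLM and diagnosis (illustrative)")
plt.tight_layout()
plt.show()
```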
Treatment recommendations demonstrated more pronounced bias. Models frequently proposed divergent treatment approaches when racial characteristics were present, either explicitly or implicitly. This was most evident in schizophrenia and anxiety cases, where the majority of models received bias scores of 2.0 or above for treatment recommendations. Overall, several concerning issues emerged for all LLMs in treatment recommendations. For instance, Gemini demonstrated increased focus on reducing alcohol use in anxiety cases only when it was explicitly stated that the patient was African American. Claude suggested guardianship for depression cases with explicit racial characteristics, but not in the neutral or implicit conditions. ChatGPT emphasized substance use as a potential problem in a patient with an eating disorder when the race was explicitly stated, but not in the neutral condition. Both ChatGPT and NewMes omitted medication recommendations for an ADHD case when racial characteristics were explicitly stated but suggested them when those racial characteristics were missing. Figure 2 shows a heatmap of the average bias scores for treatment (both for implied and explicitly stated race).
Fig. 2. Heatmap of treatment for each LLM by condition.
For each illness, each model was provided with an implicit and an explicit version of the case that included race. Redder cells correspond to more biased responses relative to the neutral condition.
Among LLMs, the NewMes-v15 model showed the highest susceptibility to bias, receiving the maximum bias score of 3.0 for treatment recommendations more frequently than any other model. Conversely, Gemini demonstrated the lowest overall bias scores across conditions.
Statistical analysis confirmed overall differences between LLMs in their bias ratings. A Kruskal–Wallis H-test revealed a statistically significant effect of LLM on ratings (H(3) = 16.65, P = 0.00083), with a medium overall effect size (η² = 0.088). The a priori power for this test was high (0.93) for detecting the pattern of differences specified by the assumed shifts, where the smallest non-zero shift (0.3) represents a small-to-medium standardized effect. Gemini received the lowest mean bias rank (60.08), followed by Claude (75.47), ChatGPT (87.97), and NewMes (98.47). Post-hoc Dunn’s tests with Bonferroni correction (adjusted α = 0.0083) confirmed specific differences: Gemini was rated as significantly less biased than ChatGPT (mean rank diff. = -27.9, P = 0.005) and than NewMes (mean rank diff. = -38.4, P = 0.00011). No other pairwise differences were statistically significant. These results should be interpreted in light of the limitations of comparing ordinal scores based on qualitative judgments; they rest on 40 ratings per LLM, which may not be sufficient for robust statistical inference. That said, because bias is, to some extent, a subjective judgment based on experience, smaller samples of qualitative ratings are relatively acceptable.
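These omnibus and post-hoc comparisons can be reproduced on the supplementary ratings with standard nonparametric routines. The sketch below is illustrative only: the original analysis used SPSS, and the scikit-posthocs package, file name, and column names are assumptions.

```python
# Minimal sketch (the original analysis used SPSS; scikit-posthocs and the
# file/column names here are assumptions). One row per rating, with columns
# "llm" (model name) and "bias" (the 0-3 score).
import pandas as pd
from scipy import stats
import scikit_posthocs as sp

df = pd.read_csv("bias_ratings.csv")  # hypothetical file name

# Kruskal-Wallis omnibus test across the four LLMs
groups = [g["bias"].to_numpy() for _, g in df.groupby("llm")]
h_stat, p_value = stats.kruskal(*groups)

# Eta-squared effect size estimate for Kruskal-Wallis: (H - k + 1) / (n - k)
k, n = df["llm"].nunique(), len(df)
eta_squared = (h_stat - k + 1) / (n - k)

# Dunn's post-hoc pairwise tests with Bonferroni correction
dunn = sp.posthoc_dunn(df, val_col="bias", group_col="llm", p_adjust="bonferroni")

print(f"H({k - 1}) = {h_stat:.2f}, P = {p_value:.5f}, eta^2 = {eta_squared:.3f}")
print(dunn.round(5))
```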
Discussion
To our knowledge, this is the first evaluation of racial bias across multiple psychiatric diagnoses and multiple LLMs, including generalist LLMs and a locally trained medical-specific LLM. Our study is also the first to compare cases that include both explicit and implicit racial characteristics, which is important considering that LLMs can pick up on inferred characteristics that humans do not notice25,32, and that even minor changes in the way information is presented to an LLM can affect accuracy and bias33. We found most LLMs exhibited some form of bias when dealing with African American patients, at times recommending dramatically different treatments for the same psychiatric illness and an otherwise identical patient.
The observed variations raise significant concerns about the integration of AI in psychiatric care and should prompt reconsideration of whether current AI systems can maintain neutrality and display minimal bias. The heightened bias in treatment recommendations suggests that AI systems may amplify existing biases in psychiatric care21, even when the diagnosis is accurate. These findings align with recent research in non-psychiatric disciplines suggesting a “white” treatment bias by LLMs14,34, and underscore a fundamental challenge in AI development: while LLMs can process vast amounts of medical information, they can also propagate and exacerbate biases embedded in their training data35. Our results with a locally trained medical LLM suggest that even specialized, annotated training data may not fully mitigate these biases and, in fact, may be more likely to replicate them. While evidence36 suggests that local LLMs can perform nearly as well as online versions that run on high-performance servers with high computing demands, at least on synthetic benchmarks examining issues such as hallucinations and accuracy, our research suggests otherwise for problems like racial bias. Given growing interest among healthcare institutions, clinicians, and researchers in local LLMs, which do not require the internet and are less expensive and more secure, and given the push toward local LLMs by major technology companies (e.g., Meta37), this higher likelihood of racial bias in local LLMs may represent a significant challenge.
Our results are consistent with preliminary data from the fields of nephrology, pulmonology, dermatology, and cardiology11,38–41. While most studies focus on accuracy rather than bias, a fundamental issue appears to be that LLMs can be extremely convincing42,43, especially when they fabricate supporting evidence in response to medical questions44. In the psychiatric arena, where diagnoses do not typically rely on “objective” laboratory, imaging, or genetic tests45, and where symptoms can be “soft” and non-specific46, the ability of LLMs to deliver convincing, yet racist, treatment advice to a time-pressured provider based on subtle cues can be particularly problematic.
Several limitations warrant consideration in interpreting these results. First, the rapid pace of AI development means that our findings represent a snapshot of performance that may quickly become outdated. This is especially true considering the increasing size of LLMs; our largest LLMs have been estimated to have over 175 billion parameters (Gemini and ChatGPT), while our smallest LLM (NewMes, based on LLaMA 3) had only 8 billion. Meanwhile, at the time of writing, it is likely that ChatGPT-4o has at least 1.5 trillion parameters, several orders of magnitude more than our smallest model. However, the fundamental medical dataset used in training (i.e., a medical dataset that is likely itself biased due to lack of representation of minority groups47) is unlikely to evolve rapidly between updates, suggesting that new models may still have the same problem. Even if parameter size is inversely related to racism or bias (as noted in other research15), smaller LLMs, refined LLMs, or quantized LLMs will likely continue to encounter issues with implied race. In our study, our refined model based on a smaller LLM performed the worst, but it is unclear why, considering that it was the most specifically trained on medical data. Furthermore, research using much larger LLMs has shown that even explicitly unbiased LLMs still form racist associations, and smaller LLMs had fewer biases48, so it is possible that model size is not the driver of the greater bias in our local LLM.
Second, our methodology for removing racial characteristics from the neutral cases may not have been sufficiently comprehensive to eliminate all racial cues. Although we tried to mitigate this by inputting the cases into an LLM beforehand, it is still possible that the LLMs behaved as though a racial element were present, without our knowledge. Research has shown that LLMs capture implicit information that a human might not notice25,32. Therefore, the difference between the neutral and the implicit or explicit conditions could partly rest on an assumption on our part. It is also possible that the cases we used, drawn from an open-source dataset, were embedded within the LLMs’ training material, so a model may have been able to “guess” the race in a way that we did not anticipate, although we attempted to address this by asking LLaMA 3.1 405B to guess the race, which it could not.
Third, the qualitative nature of bias assessment introduces potential subjectivity, and some observed variations might be attributable to random fluctuations rather than systematic bias. Similarly, we assessed only one output for each of the two cases per condition, so we did not examine how consistent the bias was, and our results may over- or underestimate it. Other researchers conducting identical research may find different results, which is to be expected: LLM outputs often differ even when the same information is repeated (even at very low temperatures, as in our study). However, given the relative consistency of treatment bias we observed across LLMs despite prompting only once, and the fact that we found bias even when race was only implied, our results are unlikely to be due to random LLM variation. Furthermore, other research has found that even when racial bias is defined and tested in different ways, many of the same LLMs we assessed maintain consistent levels of racial bias across repeated trials38, further supporting our findings. Still, the field would benefit from assessing more than two cases per diagnosis, perhaps focusing on a single diagnosis, to balance practical considerations (resource requirements) with empirical quality. We would also suggest that future researchers choose health conditions that implicitly or explicitly suggest an ethnicity, i.e., conditions that are much more commonly diagnosed in certain ethnic groups in certain areas (e.g., alcohol use disorder in Native Americans, although this association is itself partly a product of racialized medicine), and compare those results to outputs when the LLM is told the patient’s actual ethnicity, as that may provide a valuable and deeper examination of race.
The variation we found across LLMs suggests that how these models are trained has a significant impact on their likelihood of showing racial bias. Traditional bias mitigation strategies that are standard practice, such as adversarial training49, explainable AI methods50, data augmentation51, and resampling52, may not be enough. Future research should prioritize the development and validation of transparent mitigation strategies throughout the AI development lifecycle. Among other measures, this would entail investigating methods for detecting and quantifying bias in training data, developing more robust model architectures resistant to demographic bias, establishing standardized protocols for clinical bias testing, and building tools that correct biased clinical output in real-time. Some developers and researchers are already attempting to build systems using these methods to reduce bias in AI outputs, with some success53.
Ultimately, success in integrating AI in psychiatric care will partly depend on addressing documented racial and other biases. Our findings serve as a call to action for stakeholders across the healthcare AI ecosystem to help ensure that these technologies enhance health equity rather than reproduce or exacerbate existing inequities. This will require close collaboration between researchers, clinicians, healthcare institutions, and policymakers to establish and maintain robust standards for AI adoption. It will also require training healthcare professionals to be more critical of AI outputs and increasing awareness of historical biases that inevitably will make it into AI outputs. Until then, such systems should be deployed with caution and consideration for how even subtle racial characteristics may be affecting AI “judgment.”
Methods
An ethics exemption was obtained under code STUDY00003831 from the Cedars-Sinai IRB, as the cases used were drawn from a publicly available database54 and no consent was needed. We followed the Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) reporting guidelines; the checklist is included in the Supplementary Note.
Case selection and curation
We randomly selected 10 cases from PubMed Central’s PMC-Patients dataset54 that mentioned the following five psychiatric illness types (broadly defined; two cases per diagnosis): depression, anxiety, schizophrenia, eating disorders, and attention-deficit/hyperactivity disorder (ADHD). These diagnoses were selected due to historical evidence of clinical bias in how they are approached in African American patients16–19,29–31. Further explanation of the rationale for each diagnosis is included in Supplementary Data no. 1. We first filtered the cases by inclusion of a keyword related to each illness, then used a simple random sorting algorithm in Microsoft Excel to select two cases per diagnosis. The cases were then altered to remove the actual diagnosis and details that could imply race, such as name, geographic location, and explicit racial characteristics such as skin color. To ensure that racial cues were removed, these “neutral” cases were given to a local LLM (LLaMA 3.1 405B) with the prompt “can you guess the race of the patient, and if so, what makes you think that?” For all 10 cases, LLaMA 3.1 405B was unable to detect the patient’s race. These cases and the standard prompt are detailed in Supplementary Data no. 1.
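The filtering and random selection step was performed in Microsoft Excel; the following pandas equivalent is a minimal, hypothetical sketch (file and column names are assumptions) of the same logic.

```python
# Minimal sketch (illustrative only): keyword filtering and random selection
# of two cases per diagnosis. The study performed this step in Microsoft
# Excel; the file and column names in this pandas equivalent are hypothetical.
import pandas as pd

cases = pd.read_csv("pmc_patients_subset.csv")  # hypothetical file name

keywords = ["depression", "anxiety", "schizophrenia", "eating disorder", "ADHD"]
selected = []
for kw in keywords:
    matches = cases[cases["summary"].str.contains(kw, case=False, na=False)]
    selected.append(matches.sample(n=2, random_state=0))  # two cases per diagnosis

selected_cases = pd.concat(selected).reset_index(drop=True)
```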
LLM selection
Four LLMs were selected for this study, representing the largest, highest-performing, and most dominant LLMs on Livebench55 as of August 5, 2024. Where their APIs allowed, all models were set to a temperature of 0.2 and a top-p of 1.0, with top-k left at its default and a maximum output limit of 2048 tokens, for consistency. The four LLMs were Google’s Gemini 1.5 Pro (estimated 200B parameters, August 1, 2024 version56), Claude 3.5 Sonnet (estimated 175B, June 20, 2024 version57), ChatGPT-4o (estimated 180B, May 13, 2024 version58), and NewMes-v15, a “local,” freely available version of Meta’s LLaMA 3 8B LLM that was the top-ranked model on the Open Medical-LLM Leaderboard59. Local LLMs are versions that have been “cut down” for size and efficiency36. They are increasingly sought after by healthcare systems for cost, privacy, and customizability reasons60. Local models such as this version of NewMes-v15 are developed by downloading an existing LLM (Meta’s LLaMA 3 8B) and further training it on a wide variety of open medical datasets, specializing the model for medical use in addition to its original capabilities as a conversational LLM. Unlike the other three LLMs, NewMes-v15 ran completely locally, without internet access, on a high-performance computer with 24 GB of VRAM. Although other local models33 were considered, NewMes-v15 was chosen because of its performance on the leaderboard metrics from synthetic tests most similar to what this study aimed to do.
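As one example of querying a model with these decoding settings, the following is a minimal sketch assuming the OpenAI Python client; the helper name and model string are illustrative, and Gemini, Claude, and the locally run NewMes-v15 were accessed through their own interfaces.

```python
# Minimal sketch (assumption: the OpenAI Python client; the other models were
# queried through their own APIs or run locally). It applies the decoding
# settings reported above: temperature 0.2, top-p 1.0, 2048 max output tokens.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def query_llm(prompt: str, case_text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": f"{prompt}\n\n{case_text}"}],
        temperature=0.2,
        top_p=1.0,
        max_tokens=2048,
    )
    return response.choices[0].message.content
```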
Input of cases
Using a new chat session each time, we provided each LLM with the same prompt requesting a diagnosis and treatment (Supplementary Data no. 1). For each case, the LLM was presented with three conditions: the “neutral” condition, a case with no reference to the patient’s race; the “implicit” condition, which added implicit racial details in the form of a patient name linked in population research studies with African Americans61; and the “explicit” condition, which specified that the patient was African American and included the same patient name as the implicit condition. This resulted in 120 responses (four LLMs × 10 cases × three conditions).
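A minimal sketch of how the three conditions can be assembled from a race-neutral case is shown below; the wording and placeholder name are hypothetical, and the study used names validated in population research (ref. 61).

```python
# Minimal sketch (illustrative only) of assembling the three conditions from a
# race-neutral case. The wording and the placeholder name are hypothetical;
# the study used names validated in population research (ref. 61).
def build_conditions(neutral_case: str, name: str = "DeShawn") -> dict:
    return {
        # Neutral: the case with no reference to race
        "neutral": neutral_case,
        # Implicit: only a patient name associated with African Americans
        "implicit": f"The patient's name is {name}. {neutral_case}",
        # Explicit: the same name plus an explicit statement of race
        "explicit": (f"The patient's name is {name} and the patient is "
                     f"African American. {neutral_case}"),
    }

# Each variant is then sent to each LLM in a fresh chat session, e.g.:
# for label, case_text in build_conditions(neutral_case).items():
#     output = query_llm(diagnosis_and_treatment_prompt, case_text)
```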
Analysis
Responses were scored by an American social psychologist (Rater 1) and an American licensed clinical neuropsychologist (Rater 2), and any disagreements were resolved by an American licensed psychiatrist who examined the scoring and notes to arrive at a final score. Rater 1 holds a PhD in social health psychology, with five years of experience working on racial bias in older adult care and nine years of experience in social health psychology. Rater 2 is a clinical neuropsychologist with five years of experience delivering healthcare, and leads American Psychological Association initiatives in diversity, equity, and inclusion in neuropsychology. The licensed psychiatrist is a medical director of a very large university-affiliated health system, with experience in clinical, research, administrative, and technology industry settings. The raters were blinded to which LLM they were assessing during their first assessment. The responses in the explicit and implicit conditions were compared to the neutral response and scored on a scale from 0 to 3 for diagnosis and treatment (separately) as follows: 0 for minimal difference in responses (e.g., simple rephrasing); 1 for some difference that could be attributed to normal variation (e.g., providing a different but valid treatment, such as another medication in the same antidepressant class); 2 for significant difference that could potentially be attributed to race (e.g., recommending a different medication class or different diagnostic reasoning); and 3 for evidence of a racist response (e.g., adding an assumption of alcoholism). The raters were told not to assess the outputs on accuracy. Only the difference in the LLM’s response across the three conditions was assessed, and both Rater 1 and Rater 2 left notes summarizing the reasoning for their ratings.
The initial agreement between the two raters was very high, with most disagreements differing by only one scale point; κ = 0.949, SE = 0.025, κw = 0.977. In the Supplementary Data, we provide the LLM responses (Supplementary Data no. 2), each rater’s notes after resolution, and the final ratings (Supplementary Data nos. 3–7). Supplementary Data no. 8 provides summary statistics, Supplementary Data no. 9 provides the full scores for analysis (separated by implicit/explicit and condition), and Supplementary Data no. 10 provides the scores for the kappa calculation.
Prior to assigning ratings for this study, and to validate the 0–3 bias assessment scale, two independent raters were trained on the scoring criteria using the supplementary data from Omiye et al.11. Both raters were provided with that study’s responses for pain thresholds from ChatGPT-4 and Claude for each of the five runs. Each rater then independently scored these 10 responses using our scale. Inter-rater reliability was calculated using Cohen’s kappa; inter-rater agreement for the pilot was high (κ = 0.91).
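Inter-rater agreement of this kind can be computed with standard routines; the following minimal sketch uses scikit-learn, with hypothetical example ratings and linear weighting shown as one option for the weighted kappa.

```python
# Minimal sketch (not the authors' code): inter-rater agreement on the 0-3
# bias ratings using Cohen's kappa and a weighted kappa from scikit-learn.
# The example ratings are hypothetical; linear weighting is one option for
# the weighted statistic.
from sklearn.metrics import cohen_kappa_score

rater1 = [0, 1, 2, 2, 3, 1, 0, 2, 3, 1]
rater2 = [0, 1, 2, 3, 3, 1, 0, 2, 3, 1]

kappa = cohen_kappa_score(rater1, rater2)
weighted_kappa = cohen_kappa_score(rater1, rater2, weights="linear")
print(f"kappa = {kappa:.3f}, weighted kappa = {weighted_kappa:.3f}")
```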
We completed all analyses in SPSS Version 27 and Microsoft Excel. Confidence intervals for the mean bias score were derived from a t-distribution.
Acknowledgements
This study received no funding.
Author contributions
A.B.: Conceptualization, data curation, formal analysis, investigation, methodology, project administration, validation, visualization (figures), writing—original draft, writing—review and editing; E.M.S.: Formal analysis, validation, writing—review and editing; E.A.: Conceptualization, investigation, supervision, project administration, resources, validation, investigation, writing—review and editing.
Data availability
Data are provided as Supplementary Data and are also available on OSF at https://osf.io/qgpvy/.
Competing interests
The authors declare no competing interests.
Footnotes
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
The online version contains supplementary material available at 10.1038/s41746-025-01746-4.
References
- 1.Torous, J. & Greenberg, W. Large language models and artificial intelligence in psychiatry medical education: augmenting but not replacing best practices. Acad. Psychiatry 1–3. 10.1007/s40596-024-01996-6 (2024). [DOI] [PubMed]
- 2.Barile, J. et al. Diagnostic accuracy of a large language model in pediatric case studies. JAMA Pediatr.178, 313–315 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3.Medscape. Physician Compensation Report. https://www.medscape.com/sites/public/physician-comp/2023 (2023).
- 4.Omar Sr, M. et al. Applications of large language models in psychiatry: a systematic review. medRxiv. 10.1101/2024.03.28.24305027 (2024). [DOI] [PMC free article] [PubMed]
- 5.Itauma, O. & Itauma, I. AI Scribes: boosting physician efficiency in clinical documentation. Int. J. Bioinform. Biosci.14, 09–18 (2024). [Google Scholar]
- 6.Tierney, A. A. et al. Ambient artificial intelligence scribes to alleviate the burden of clinical documentation. NEJM Catal. Innov. Care Deliv.5, CAT. 23.0404 (2024).
- 7.Saab, K. et al. Capabilities of gemini models in medicine. arXiv:2404.18416 (2024).
- 8.Stade, E. C. et al. Large language models could change the future of behavioral healthcare: a proposal for responsible development and evaluation. NPJ Ment. Health Res.3, 12 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9.Obradovich, N. et al. Opportunities and risks of large language models in psychiatry. NPP—Digit. Psychiatry Neurosci.2, 8 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10.Goh, E. et al. Large language model influence on diagnostic reasoning: a randomized clinical trial. JAMA Netw. Open7, e2440969 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11.Omiye, J. A., Lester, J. C., Spichak, S., Rotemberg, V. & Daneshjou, R. Large language models propagate race-based medicine. NPJ Digit. Med.6, 195 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 12.Khera, R., Simon, M. A. & Ross, J. S. Automation bias and assistive AI: risk of harm from AI-driven clinical decision support. JAMA330, 2255–2257 (2023). [DOI] [PubMed] [Google Scholar]
- 13.Zack, T. et al. Assessing the potential of GPT-4 to perpetuate racial and gender biases in health care: a model evaluation study. Lancet Digit. Health6, e12–e22 (2024). [DOI] [PubMed]
- 14.Ayoub, N. F. et al. Inherent bias in large language models: a random sampling analysis. Mayo Clin. Proc.: Digital Health2, 186–191 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15.Wang, Y. et al. Unveiling and mitigating bias in mental health analysis with large language models. arXiv preprint arXiv:2406.12033 (2024).
- 16.Bailey, R. K., Mokonogho, J. & Kumar, A. Racial and ethnic differences in depression: current perspectives. Neuropsychiatr. Dis. Treat.15, 603–609 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17.Vanderminden, J. & Esala, J. J. Beyond symptoms: race and gender predict anxiety disorder diagnosis. Soc. Ment. Health9, 111–125 (2019). [Google Scholar]
- 18.Anglin, D. M. & Malaspina, D. Ethnicity effects on clinical diagnoses in patients with psychosis: comparisons to best estimate research diagnoses. J. Clin. Psychiatry69, 941 (2008). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 19.Anglin, D. M. & Malaspina, D. Ethnicity effects on clinical diagnoses compared to best-estimate research diagnoses in patients with psychosis: a retrospective medical chart review. J. Clin. Psychiatry69, 941–945 (2008). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 20.Pesa, J., Liu, Z., Fu, A. Z., Campbell, A. K. & Grucza, R. Racial disparities in utilization of first-generation versus second-generation long-acting injectable antipsychotics in Medicaid beneficiaries with schizophrenia. Schizophr. Res.261, 170–177 (2023). [DOI] [PubMed] [Google Scholar]
- 21.Chung, H., Mahler, J. C. & Kakuma, T. Racial differences in treatment of psychiatric inpatients. Psychiatr. Serv. (Washington, DC)46, 586–591 (1995). [DOI] [PubMed] [Google Scholar]
- 22.De Choudhury, M., Pendse, S. R. & Kumar, N. Benefits and harms of large language models in digital mental health. arXiv preprint arXiv:2311.14693 (2023).
- 23.Omar, M. et al. Sociodemographic biases in medical decision making by large language models. Nat Med. 1-9. 10.1038/s41591-025-03626-6 (2025). [DOI] [PubMed]
- 24.Hofmann, V., Kalluri, P. R., Jurafsky, D. & King, S. AI generates covertly racist decisions about people based on their dialect. Nature 1–8, 10.1038/s41586-024-07856-5 (2024). [DOI] [PMC free article] [PubMed]
- 25.Xu, C. et al. Do llms implicitly exhibit user discrimination in recommendation? an empirical study. arXiv preprint arXiv:2311.07054 (2023).
- 26.Bai, X., Wang, A., Sucholutsky, I. & Griffiths, T. L. Explicitly unbiased large language models still form biased associations. Proc. Natl. Acad. Sci.122, 8 (2025). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 27.Gabriel, S., Puri, I., Xu, X., Malgaroli, M. & Ghassemi, M. Can AI relate: testing large language model response for mental health support. arXiv preprint arXiv:2405.12021 (2024).
- 28.Meng, X. et al. The application of large language models in medicine: a scoping review. Iscience27, 109713 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 29.Gordon, K. H., Perez, M. & Joiner, T. E. Jr The impact of racial stereotypes on eating disorder recognition. Int. J. Eat. Disord.32, 219–224 (2002). [DOI] [PubMed] [Google Scholar]
- 30.Morgan, P. L., Hillemeier, M. M., Farkas, G. & Maczuga, S. Racial/ethnic disparities in ADHD diagnosis by kindergarten entry. J. Child Psychol. Psychiatry55, 905–913 (2014). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 31.Gordon, K. H., Brattole, M. M., Wingate, L. R. & Joiner, T. E. Jr The impact of client race on clinician detection of eating disorders. Behav. Ther.37, 319–325 (2006). [DOI] [PubMed] [Google Scholar]
- 32.Etgar, S., Oestreicher-Singer, G. & Yahav, I. Implicit bias in LLMs: bias in financial advice based on implied gender. Available at SSRN (2024).
- 33.Zhou, H. et al. A survey of large language models in medicine: progress, application, and challenge. arXiv preprint arXiv:2311.05112 (2023).
- 34.Yang, Y., Liu, X., Jin, Q., Huang, F. & Lu, Z. Unmasking and quantifying racial bias of large language models in medical report generation. Commun. Med.4, 176 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 35.Liu, Y., Gautam, S., Ma, J. & Lakkaraju, H. Confronting LLMs with traditional ML: Rethinking the fairness of large language models in tabular classifications. In Proc. 2024 conference of the North American chapter of the Association for Computational Linguistics: Human Language Technologies (eds Duh, K et al.) Vol. 1: Long papers 3603–3620 (Association for Computational Linguistics, 2024).
- 36.Kurtic, E., Marques, A., Pandit, S., Kurtz, M. & Alistarh, D. Give me BF16 or give me death? Accuracy-performance trade-offs in LLM quantization. arXiv preprint. 10.48550/arXiv.2411.02355 (2024).
- 37.Meta. Introducing Quantized Llama Models with Increased Speed and a Reduced Memory Footprint. https://ai.meta.com/blog/meta-llama-quantized-lightweight-models/ (2024).
- 38.Pfohl, S. R. et al. A toolbox for surfacing health equity harms and biases in large language models. Nat. Med. 1–11. 10.1038/s41591-024-03258-2 (2024). [DOI] [PMC free article] [PubMed]
- 39.Fliorent, R. et al. Artificial intelligence in dermatology: advancements and challenges in skin of color. Int. J. Dermatol.63, 455–461 (2024). [DOI] [PubMed] [Google Scholar]
- 40.Chase, A. C. Ethics of AI: perpetuating racial inequalities in healthcare delivery and patient outcomes. Voices Bioeth.6, 10.7916/vib.v6i.5890 (2020).
- 41.Doshi, H., Chudow, J., Ferrick, K. & Krumerman, A. Machine learning in atrial fibrillation—racial bias and a call for caution. J. Med. Artif. Intell.4, 10.21037/jmai-21-12 (2021).
- 42.Carrasco-Farre, C. Large language models are as persuasive as humans, but why? About the cognitive effort and moral-emotional language of LLM arguments. arXiv:2404.09329 (2024).
- 43.Palmer, A. & Spirling, A. Large language models can argue in convincing ways about politics, but humans dislike AI authors: implications for governance. Political Sci.75, 281–291 (2023). [Google Scholar]
- 44.Gravel, J., D’Amours-Gravel, M. & Osmanlliu, E. Learning to fake it: limited responses and fabricated references provided by ChatGPT for medical questions. Mayo Clin. Proc.: Digit. Health1, 226–234 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 45.Yatham, L. N. Biomarkers for clinical use in psychiatry: where are we and will we ever get there?. World Psychiatry22, 263 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 46.Mølstrøm, I.-M., Henriksen, M. G. & Nordgaard, J. Differential-diagnostic confusion and non-specificity of affective symptoms and anxiety: an empirical study of first-admission patients. Psychiatry Res.291, 113302 (2020). [DOI] [PubMed] [Google Scholar]
- 47.Seker, E., Talburt, J. R. & Greer, M. L. Preprocessing to Address Bias in Healthcare Data. In Challenges of Trustable AI and Added-Value on Health (eds Séroussi B et al.) Vol. 294, 327–331 (2022). [DOI] [PubMed]
- 48.Bai, X., Wang, A., Sucholutsky, I. & Griffiths, T. L. Explicitly unbiased large language models still form biased associations. Proc. Natl Acad. Sci. USA122, e2416228122 (2025). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 49.Li, X., Cui, Z., Wu, Y., Gu, L. & Harada, T. Estimating and improving fairness with adversarial learning. arXiv preprint arXiv:2103.04243 (2021).
- 50.Wang, Y.-C., Chen, T.-C. T. & Chiu, M.-C. An improved explainable artificial intelligence tool in healthcare for hospital recommendation. Healthc. Anal.3, 100147 (2023). [Google Scholar]
- 51.Chlap, P. et al. A review of medical image data augmentation techniques for deep learning applications. J. Med. Imaging Radiat. Oncol.65, 545–563 (2021). [DOI] [PubMed] [Google Scholar]
- 52.Chen, F., Wang, L., Hong, J., Jiang, J. & Zhou, L. Unmasking bias in artificial intelligence: a systematic review of bias detection and mitigation strategies in electronic health record-based models. J. Am. Med. Inform. Assoc.31, 1172–1183 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 53.Alamoodi, A. et al. A novel evaluation framework for medical LLMs: combining fuzzy logic and MCDM for medical relation and clinical concept extraction. J. Med. Syst.48, 1–12 (2024). [DOI] [PubMed] [Google Scholar]
- 54.Zhao, Z., Jin, Q., Chen, F., Peng, T. & Yu, S. Pmc-patients: a large-scale dataset of patient summaries and relations for benchmarking retrieval-based clinical decision support systems. arXiv preprint arXiv:2202.13876 (2022).
- 55.White, C. et al. Livebench: a challenging, contamination-free LLM benchmark. arXiv preprint arXiv:2406.19314 (2024).
- 56.Google. Release Updates. https://gemini.google.com/updates (2024).
- 57.Anthropic. Claude 3.5 Sonnet. https://www.anthropic.com/news/claude-3-5-sonnet (2024).
- 58.OpenAI. ChatGPT—Release Notes. https://help.openai.com/en/articles/6825453-chatgpt-release-notes#h_3cb68a1d55 (2024).
- 59.Ura, A., Minervini, P. & Fourrier, C. The Open Medical-LLM Leaderboard: Benchmarking Large Language Models in Healthcare. https://huggingface.co/blog/leaderboard-medicalllm (2024).
- 60.Wiest, I. C. et al. Privacy-preserving large language models for structured medical information retrieval. NPJ Digit. Med.7, 257 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 61.Crabtree, C. et al. Validated names for experimental studies on race and ethnicity. Sci. Data10, 130 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]