Abstract
Background
We evaluated whether a large language model could assist in selecting psychopharmacological treatments for adults with treatment-resistant depression.
Methods
We generated 20 clinical vignettes reflecting treatment-resistant depression among adults based on distributions drawn from electronic health records. Each vignette was evaluated by 2 expert psychopharmacologists to determine and rank the 5 best next-step pharmacologic interventions, as well as contraindicated or poor next-step treatments. Vignettes were then presented in random order, permuting gender and race, to a large language model (Qwen 2.5:7B), augmented with a synopsis of published treatment guidelines. Model output was compared to expert rankings, as well as to those of a convenience sample of community clinicians and an additional group of expert clinicians.
Results
The augmented model prioritized the expert-designated optimal choice for 114/320 vignettes (35.6 %, 95 % CI 30.6–41.0 %; Cohen’s kappa = 0.34, 95 % CI 0.28–0.39). There were no vignettes for which any of the model choices were among the poor or contraindicated treatments. Results were not meaningfully different when the gender or race of the vignette was permuted to examine risk for bias. A sample of community clinicians identified the optimal treatment choice for 12/91 vignettes (13.2 %, 95 % CI 7.7–21.6 %; Cohen’s kappa = 0.10, 95 % CI 0.03–0.18), while an additional group of expert psychopharmacologists identified optimal treatment for 9/140 (6.4 %, 95 % CI 3.4–11.8 %; Cohen’s kappa = 0.03, 95 % CI 0.01–0.08).
Conclusion
An augmented language model demonstrated moderate agreement with expert recommendations and avoided contraindicated treatments, suggesting potential as a tool for supporting complex psychopharmacologic decision-making in treatment-resistant depression.
Keywords: Major depression, Artificial intelligence, Machine learning, Psychopharmacology, Expert consensus
1. Introduction
A major challenge in the clinical management of major depressive disorder is the difficulty of identifying optimal next-step treatments. First-line pharmacologic treatments are well-established [1], but there is a paucity of evidence regarding the effectiveness of next-step treatments, particularly head-to-head data [2], [3], [4]. Some expert guidelines [5], [6] address this gap by simply listing multiple next-step options, leaving precise treatment selection to a vaguely worded notion of case formulation [7] or to clinician and patient preference. The problem is compounded by the fact that most community psychiatrists or primary care physicians will not manage a high volume of patients with treatment-resistant depression, so their experience with next-step treatments is often variable and limited [8].
Expert consultation represents one strategy for addressing this lack of experience, in which a clinician with psychopharmacologic expertise provides a set of recommended treatment options. However, such expertise is often inaccessible, particularly in under-resourced areas, leading to wide variation in practice and potentially suboptimal care. While efforts to encourage depression screening in primary care settings have been successful, they have led to a growing group of patients for whom no one is available to manage their psychopharmacologic care, especially those who have failed first-line approaches. Over the past 2 decades, there have been numerous efforts to develop automated approaches to guide treatment [9], from computerized algorithms [10], [11] to app-based care [12], [13]. Despite pilot studies suggesting effectiveness, none of these approaches has become widely used. This gap led the US Preventive Services Task Force to note that screening is only useful when resources exist to provide depression care [14]. In much of the world, including many areas of the United States, such resources may not exist.
To address this gap, we aimed to apply and extend a method previously developed to guide bipolar depression treatment [15], modifying it to focus on treatment decisions for treatment-resistant depression. Using a set of deidentified and permuted clinical vignettes based on clinical cases of outpatient major depressive disorder, we sought both expert opinion and community clinician opinion about optimal next-step treatment. We then applied a large language model [16] to select treatment options and compared the clinician and model-selected choices. We focused this effort on a model that can be implemented locally on a laptop computer, in light of potential translational advantages for dissemination and security [17].
2. Materials and methods
2.1. Vignette generation
We generated 20 vignettes reflecting outpatient major depressive disorder with at least one prior antidepressant treatment failure and probabilities approximating treatment history and comorbidities in the electronic health records of outpatient clinics associated with 2 large academic medical centers and their affiliated community hospitals. We then generated 4 versions of each vignette, reflecting male and female gender and White or Black race. The vignettes also included both medical and psychiatric comorbidity, as well as past psychiatric illness, treatment history, and current presentation. (Supplemental Text).
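To illustrate the permutation step, the following minimal Python sketch expands a single vignette template across the four gender-by-race versions described above. The template text, field names, and clinical details are placeholders for illustration only; the study’s actual vignettes appear in the Supplemental Text.

```python
from itertools import product

# Hypothetical template; the study's actual vignettes are in the Supplemental Text.
TEMPLATE = (
    "A 52-year-old {race} {gender} presents with major depressive disorder "
    "and at least one prior antidepressant treatment failure. {history}"
)

def expand_vignette(history: str) -> list[str]:
    """Return the 4 demographic permutations (man/woman x White/Black) of one vignette."""
    return [
        TEMPLATE.format(gender=gender, race=race, history=history)
        for gender, race in product(["man", "woman"], ["White", "Black"])
    ]

versions = expand_vignette("Comorbid hypertension; two adequate SSRI trials without response.")
assert len(versions) == 4
```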
2.2. Vignette evaluation
We used a Qualtrics survey to collect optimal treatment options from two groups of psychiatric prescribers, defined as either psychopharmacology experts or community clinicians based on number of years in practice since training and practice setting. Study procedures were the same across groups. Each prescriber was presented with 10 vignettes and asked to rank the top 5 best next-step treatments, as well as 5 poor or contraindicated treatment options. The subset of 10 vignettes, drawn from the full set of 20, was presented in random order, permuting gender and race. (In previous pilot work, presenting a longer corpus of vignettes to experts led to fatigue and early discontinuation [15].) Using the same approach, community clinicians attending a psychopharmacologic treatment course were asked to rank best and worst next-step treatments; community clinicians were compensated with entry into a prize drawing. Informed consent was implied by voluntary completion of the survey, and the privacy of human subjects was preserved. All procedures were performed in compliance with relevant laws and institutional guidelines and were approved by the appropriate institutional committee (IRB protocol #2023P001987, approved August 22, 2023).
Optimal treatment for a given vignette was assigned based on mean ranking of each medication; poor treatment options were those that appeared at least twice in the lists provided by expert clinicians. To avoid inconsistency, the latter were eliminated from optimal treatment lists.
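As a sketch of this aggregation rule, the Python below derives the optimal and poor treatment lists for one vignette from per-expert annotations. Function and variable names are ours, not the study’s code, and assume each expert supplies a ranked best list and a poor/contraindicated list.

```python
from collections import Counter, defaultdict
from statistics import mean

def derive_gold_standard(expert_rankings, expert_poor_lists):
    """Aggregate expert annotations for one vignette (illustrative only).

    expert_rankings: list of ranked medication lists (best first), one per expert.
    expert_poor_lists: list of poor/contraindicated medication lists, one per expert.
    """
    # Mean rank per medication across experts; lower is better.
    positions = defaultdict(list)
    for ranking in expert_rankings:
        for rank, med in enumerate(ranking, start=1):
            positions[med].append(rank)
    mean_rank = {med: mean(r) for med, r in positions.items()}

    # Poor options: any medication named at least twice across expert lists.
    counts = Counter(med for lst in expert_poor_lists for med in lst)
    poor = {med for med, n in counts.items() if n >= 2}

    # Remove poor options from the optimal list to avoid inconsistency.
    optimal = sorted((m for m in mean_rank if m not in poor), key=mean_rank.get)
    return optimal, poor
```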
2.3. Model design
To generate primary recommendations, we used a locally run open-source language model (Qwen 2.5:7B-Instruct; https://huggingface.co/Qwen/Qwen2.5-7B-Instruct), accessed in Python via Ollama, with model temperature set at 0 (i.e., results as close to deterministic as possible). Exploratory analyses used a private Microsoft Azure instance of GPT-4 (GPT-4o; version 11–20) accessed in Python via API, with the same parameters. The prompt included 3 sections (see Supplement): context and task; background knowledge to use; and the clinical vignette itself. For the background knowledge, we selected extracts from the CANMAT 2023 Depression Update, providing first-, second-, and third-line antidepressant treatments including strengths and weaknesses, as well as first- and second-line adjunctive medication treatments. The model was asked to return a ranked list of the five best next-step pharmacologic interventions and a list of five poor or contraindicated pharmacologic interventions. Vignettes were presented four times each, with gender permuted between male and female and race between Black and White.
For comparison, we also examined a simpler, or naïve, model that received no guideline knowledge, using solely the first and third sections of the original prompt.
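As an illustration of this setup, the following Python sketch queries the local model via the ollama client. The prompt wording and variable contents are placeholders standing in for the full prompt given in the Supplement; the naïve variant simply omits the background-knowledge section.

```python
import ollama

GUIDELINE_SYNOPSIS = "..."  # extracts from the CANMAT 2023 update (section 2 of the prompt)
VIGNETTE = "..."            # one gender/race permutation of a vignette (section 3)

prompt = (
    # Section 1: context and task (placeholder wording, not the study's exact text).
    "You are assisting a clinician selecting next-step pharmacotherapy for "
    "treatment-resistant depression. Return a ranked list of the 5 best "
    "next-step pharmacologic interventions and a list of 5 poor or "
    "contraindicated interventions.\n\n"
    # Section 2: background knowledge (omitted entirely for the naive model).
    f"Background knowledge:\n{GUIDELINE_SYNOPSIS}\n\n"
    # Section 3: the clinical vignette itself.
    f"Vignette:\n{VIGNETTE}"
)

response = ollama.generate(
    model="qwen2.5:7b-instruct",
    prompt=prompt,
    options={"temperature": 0},  # results as close to deterministic as possible
)
print(response["response"])
```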
2.4. Analysis
In the primary analysis, we compared the augmented model to the expert-annotated selections, evaluating how often this model identified the expert-defined optimal medication choice. As there are many possible ways to compare such predictions, we adopted a method similar to prior work [15]. To determine whether augmentation was required, we then compared these results to an unaugmented model (i.e., the LLM without inclusion of guideline knowledge). Performance of the two models was compared using McNemar’s test for agreement. To facilitate comparison to real-world standards, we then examined two comparison groups. First, we compared results to the community clinician sample, again calculating the proportion of vignettes for which community clinician choices matched expert selections. Then, we compared results to a second set of psychopharmacology experts, who completed the same survey.
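A sketch of the paired model comparison, with placeholder random indicators standing in for the per-vignette outcomes (whether each model’s top choice matched the expert optimum); the 2x2 table of paired outcomes feeds McNemar’s test from statsmodels.

```python
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

rng = np.random.default_rng(0)
augmented_correct = rng.random(320) < 0.36  # placeholder indicators, one per vignette
base_correct = rng.random(320) < 0.36       # placeholder indicators for the base model

# 2x2 table of paired outcomes: rows = augmented correct/incorrect,
# columns = base correct/incorrect.
table = np.array([
    [np.sum(augmented_correct & base_correct), np.sum(augmented_correct & ~base_correct)],
    [np.sum(~augmented_correct & base_correct), np.sum(~augmented_correct & ~base_correct)],
])
result = mcnemar(table, exact=True)
print(f"McNemar p = {result.pvalue:.3f}")
```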
We adopted a similar approach to evaluating how often models or clinicians selected at least one expert-defined poor medication choice. In this context, a poor medication was one selected by both expert annotators as being contraindicated or less useful.
To confirm stability of results across gender and race, we examined model responses stratified by the 4 gender-race pairs (Black men, Black women, White men, White women). In a manner analogous to ANOVA with post-hoc pairwise testing to control the type I error rate, these 4 groups were compared via Cochran’s Q test (a generalized form of McNemar’s test), followed by post-hoc pairwise McNemar’s tests with Bonferroni-corrected p-values.
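A sketch of this subgroup analysis, again with placeholder data: each row represents one of the 80 underlying vignette presentations, each column one gender-race version, and each entry indicates agreement with the expert top choice. The omnibus Cochran’s Q test is followed by Bonferroni-corrected pairwise McNemar tests.

```python
from itertools import combinations
import numpy as np
from statsmodels.stats.contingency_tables import cochrans_q, mcnemar

# Placeholder indicators of top-choice agreement; rows = 80 vignette
# presentations, columns = the 4 demographic versions of each presentation.
rng = np.random.default_rng(1)
groups = ["Black man", "Black woman", "White man", "White woman"]
wide = (rng.random((80, 4)) < 0.35).astype(int)

omnibus = cochrans_q(wide)
print(f"Cochran's Q = {omnibus.statistic:.2f}, p = {omnibus.pvalue:.3f}")

# Post-hoc pairwise McNemar tests with Bonferroni correction.
pairs = list(combinations(range(len(groups)), 2))
for i, j in pairs:
    a, b = wide[:, i].astype(bool), wide[:, j].astype(bool)
    table = np.array([[np.sum(a & b), np.sum(a & ~b)],
                      [np.sum(~a & b), np.sum(~a & ~b)]])
    p = min(1.0, mcnemar(table, exact=True).pvalue * len(pairs))
    print(f"{groups[i]} vs {groups[j]}: Bonferroni-adjusted p = {p:.3f}")
```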
As an exploratory analysis, we also repeated contrasts using a frontier language model (GPT-4o). The aim of this effort was to determine the extent to which using a locally run model might yield suboptimal decision support compared to a standard cloud-based model.
As comparison of ranked lists in the context of partial rankings (i.e., when not all potential options receive a score) lacks a widely accepted metric [18], [19], we elected to describe model performance in terms of the proportion of times the optimal (top-ranked) expert choice was selected as the first choice, reported both as a proportion and as Cohen’s kappa. Secondarily, we reported how often the experts’ top choice was among the top 3, or top 5, options provided by the model; that is, whether the model could identify the optimal choice among its preferred options. We similarly determined how frequently the model’s top 5 choices included a medication identified as a poor choice by the annotating experts.
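These metrics can be written compactly; the following is an illustrative helper (function and variable names are ours, using scikit-learn for the kappa calculation, with medication names treated as categorical labels).

```python
from sklearn.metrics import cohen_kappa_score

def score_model(expert_top, model_rankings):
    """Top-1 match rate, Cohen's kappa on the top choice, and top-3/top-5 hit rates.

    expert_top: the expert-optimal medication per vignette.
    model_rankings: the model's ranked 5-medication list per vignette.
    """
    n = len(expert_top)
    model_top = [ranking[0] for ranking in model_rankings]
    top1 = sum(e == m for e, m in zip(expert_top, model_top)) / n
    kappa = cohen_kappa_score(expert_top, model_top)  # medication names as categories
    top3 = sum(e in ranking[:3] for e, ranking in zip(expert_top, model_rankings)) / n
    top5 = sum(e in ranking[:5] for e, ranking in zip(expert_top, model_rankings)) / n
    return {"top1": top1, "kappa": kappa, "top3": top3, "top5": top5}
```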
3. Results
We first compared model-generated recommendations to those of the expert gold standard. Agreement between the two annotating experts was modest: their selections overlapped by an average of 0.75 of 5 medications (SD 0.79), or about 15 %. The unaugmented model identified the optimal expert treatment in 116/320 (36.2 %) of vignettes; for 140/320 (43.8 %), the optimal treatment was among the first 3 identified by the model, and for 152/320 (47.5 %), the optimal treatment was among the model’s top 5. For 5/320 (1.6 %), the model identified at least one medication considered a poor choice by at least one expert.
We then compared the unaugmented model to an augmented model, which was primed with treatment guidelines for major depressive disorder. The augmented model identified the optimal expert treatment in 114/320 (35.6 %) of vignettes; for 169/320 (52.8 %), the optimal treatment was among the first 3 identified by the model, and for 174/320 (54.4 %), the optimal treatment was among the model’s top 5 (Fig. 1). The augmented model did not select any medications considered to be a poor choice by the experts. (For analogous results using GPT-4o, see Supplemental Table 1 and Supplemental Figure 1; in general, performance of this model was not superior to the local models.)
Fig. 1.
Comparison of Models and Clinician Medication Selection. Fig. 1 shows the comparison of treatment selection accuracy between two versions of a large language model and two groups of psychiatric providers, based on concordance with expert-designated optimal choices.
Next, we compared model results to two distinct groups of psychiatric prescribers who evaluated vignettes. The first group comprised 18 community clinicians, including 3 psychiatrists (17 %), 4 psychiatric nurse practitioners (22 %), 2 non-psychiatric nurse practitioners (11 %), and 9 physician assistants (50 %). These clinicians worked in a wide range of settings, including hospital-based practices, private practices, community mental health centers, and other community health clinics. They reported a mean of 10.4 (SD 12.4) years in practice since training and treated an average of 18.2 (SD 10.8) individuals with major depressive disorder per week. Each vignette was scored by a median of 3 [25 %–75 % interval 1–9] community clinicians. The second group of prescribers included 16 expert psychopharmacologists. These psychiatrists were based solely in hospital settings or solo private practices, reported a mean of 29.4 (SD 7.2) years in practice, and treated an average of 9.9 (SD 8.2) individuals with major depressive disorder per week. Each vignette was scored by a median of 6 [25 %–75 % interval 5–7] psychopharmacology experts.
The community clinicians identified the optimal treatment choice for 12/91 vignettes (13.2 %, 95 % CI 7.7–21.6 %; Cohen’s kappa = 0.10, 95 % CI 0.03–0.18), and a poor choice in 30/91 (33.0 %, 95 % CI 24.2–43.1 %). Expert psychopharmacologists identified optimal treatment for 9/140 (6.4 %, 95 % CI 3.4–11.8 %; Cohen’s kappa = 0.03, 95 % CI 0.01–0.08), and a poor choice in 15/140 (10.7 %, 95 % CI 6.6–16.9 %) (Table 1).
Table 1.
Comparison of ratings from models or clinicians to expert recommendation.
| Source | Top Choice Kappa | 95 % CI | Top Choice n | Top Choice % | 95 % CI | Poor Choice n | Poor Choice % | 95 % CI |
|---|---|---|---|---|---|---|---|---|
| Other Experts | 0.03 | 0.01–0.08 | 9 | 6.4 % | 3.4–11.8 % | 15 | 10.7 % | 6.6–16.9 % |
| Community Clinicians | 0.10 | 0.03–0.18 | 12 | 13.2 % | 7.7–21.6 % | 30 | 33.0 % | 24.2–43.1 % |
| Augmented Model | 0.34 | 0.28–0.39 | 114 | 35.6 % | 30.6–41.0 % | 0 | 0.0 % | 0.0–1.2 % |
| Black man | 0.36 | 0.25–0.47 | 30 | 37.5 % | 27.7–48.5 % | 0 | 0.0 % | 0.0–4.6 % |
| Black woman | 0.36 | 0.25–0.47 | 30 | 37.5 % | 27.7–48.5 % | 0 | 0.0 % | 0.0–4.6 % |
| White man | 0.31 | 0.20–0.41 | 26 | 32.5 % | 23.2–43.4 % | 0 | 0.0 % | 0.0–4.6 % |
| White woman | 0.33 | 0.22–0.44 | 28 | 35.0 % | 25.5–45.9 % | 0 | 0.0 % | 0.0–4.6 % |
| Base Model | 0.34 | 0.29–0.40 | 116 | 36.2 % | 31.2–41.7 % | 5 | 1.6 % | 0.7–3.6 % |
| Black man | 0.33 | 0.22–0.44 | 28 | 35.0 % | 25.5–45.9 % | 2 | 2.5 % | 0.7–8.7 % |
| Black woman | 0.34 | 0.24–0.45 | 29 | 36.2 % | 26.6–47.2 % | 2 | 2.5 % | 0.7–8.7 % |
| White man | 0.37 | 0.26–0.48 | 31 | 38.8 % | 28.8–49.7 % | 0 | 0.0 % | 0.0–4.6 % |
| White woman | 0.33 | 0.22–0.44 | 28 | 35.0 % | 25.5–45.9 % | 1 | 1.2 % | 0.2–6.7 % |
Shown are the number and percentage of vignettes for which each source selected the expert-designated optimal treatment (Top Choice) and a poor or contraindicated treatment (Poor Choice), along with Cohen’s kappa and 95 % confidence intervals. Results for race- and gender-permuted vignettes are included for both model types.
Finally, we examined the gender- and race-defined subgroups to determine whether the models exhibited bias in recommendations. Supplemental Tables 1 and 2 report model performance within subgroups. For both base and augmented models, rates of agreement with the optimal choice were similar across subgroups.
4. Discussion
In this comparison of a clinical decision support tool applying a large language model to an expert-defined gold standard treatment, we found that the model was able to identify optimal treatment for around a third (35.6 %) of cases, which compared favorably to community clinicians. In the absence of an evidence-based definition of optimal care for treatment-resistant depression, gold standard treatment options were defined by two expert psychopharmacologists. While a model augmented with treatment guidelines did not improve on a naïve model in selecting optimal treatment, it was significantly less likely to identify poor or contraindicated treatments. These results were similar across gender- and race-defined groups, suggesting an absence of bias in the model results.
Our finding of little advantage to incorporating additional guidance in the model prompt, with the important exception of a diminished likelihood of identifying poor or suboptimal treatments, contrasts with prior work in bipolar disorder [15], where augmentation yielded greater advantage. Notably, while we incorporated recently updated Canadian guidelines [5], the US equivalent has not been updated in more than a decade [20]. The lack of improvement in treatment selection accuracy after augmentation may reflect less precision in depression treatment guidelines compared with those for bipolar disorder, which would render the incorporation of such guidelines less impactful.
Our results are, however, consistent with a prior preliminary report [21] that found naïve models without further augmentation identified poor treatment options more often. There is little incremental cost to inclusion of such materials, but they may diminish the ability to deploy a single model to support multiple clinical contexts; thus, the benefit of such extended prompts merits more investigation. We also found little difference in performance between the locally run model and a cloud-based frontier model. This promising result suggests the feasibility of using smaller language models, which are far simpler to apply without concerns about transmitting confidential data outside of an institution or office.
While the ability of the models to select optimal treatments did not exceed human comparators, they were substantially less likely to advocate for contraindicated or poor treatment choices, suggesting the potential for this approach to decrease adverse outcomes associated with prescribing. Given the challenge in accessing any psychopharmacological care in many community settings, this result more generally hints at the potential utility of incorporating language models for clinical decision support in management of treatment-resistant depression. How best to incorporate such models into clinical practice cannot be determined from our results.
Clinical vignettes adapted from real-world presentations represent a reasonable starting point for investigating clinical decision support in major depression, but a key next step is prospective investigation in clinical settings. Such studies could begin by silently predicting optimal treatment based on history available in electronic health records and then comparing to actual treatment decisions. However, larger studies that randomize (e.g., at the level of clinician, or site) will ultimately be required to demonstrate the effectiveness of such tools. One potential application would simply provide the clinician, at point of care, with suggested treatment options and medications to avoid [22].
Unfortunately, despite the oft-repeated quest for biomarkers to drive care in psychiatry, efforts at clinical decision support have generally failed at the implementation stage – that is, their application has not been sufficiently straightforward to impact clinical practice in a meaningful way. Whether approaches using large language models, perhaps integrated with electronic health records, can overcome this barrier remains to be seen.
4.1. Limitations
This study has some important limitations. Most notably, the vignettes may omit relevant details that could yield better next-step treatment predictions. They focus on biological variables, necessary to operationalize biological therapies, but in many community cases of resistant depression, psychological and social variables predominate, and treatment choices must include psychological and social options. Moreover, while the vignettes are derived from real-world clinical presentations, the process of deidentification and permutation may have inadvertently removed useful data. An important area for investigation will be what clinical detail is necessary to drive clinical decision support, and whether non-clinical details, such as genomics or EEG, may facilitate improved treatment identification. Another limitation is that, while the gold standard was derived from expert ratings, the range of responses between experts is notable. Although the expert-defined optimal treatments served as a realistic benchmark in the absence of a definitive standard, the notable heterogeneity in prescribing patterns even among expert clinicians highlights the challenge of developing clinical decision support tools. At the same time, it presents an opportunity for AI tools to reduce such variability through more standardized, evidence-informed decision-making. If anything, this result underscores the need for more systematic approaches to implementing prescribing guidance, whether or not such guidance includes AI-driven clinical decision support.
4.2. Conclusion
Despite these limitations, this study suggests that language models running on a laptop may generate psychopharmacologic decision support that approximates expert opinion, and in the scenarios considered it likely exceeds the community standard. Given the paucity of expertise available in some settings for management of treatment resistance, randomized trials to evaluate the effectiveness of these recommendations in real-world settings are warranted.
Funding
This work was supported in part by RF1MH132335 and R01MH123804 to Dr. Perlis. The sponsor did not have any role in the study design; collection, analysis, and interpretation of the data; writing of the report; or decision to submit the article for publication.
Declaration of Competing Interest
Dr. Perlis has served on scientific advisory boards or received consulting fees from Alkermes, Circular Genomics, and Genomind. He holds equity in Circular Genomics. He is a paid Editor in Chief at JAMA+ AI, and AI Editor at JAMA Network-Open. Dr. Goldberg is a Consultant at Alkermes, Genomind, Luye Pharma, Neurelis, Neuroma, Otsuka, Sunovion, Supernus; serves on speakers bureaus for Alkermes, Axsome, Intracellular; receives royalties from American Psychiatric Publishing, Cambridge University Press. Dr. Ostacher is a full time employee of the United States Department of Veterans Affairs. He has received grant funding from Otsuka and Freespira; payment from Neurocrine for service on a Data Safety and Monitoring Committee; and personal payments for expert testimony. Dr. Malhi has received grant or research support from National Health and Medical Research Council, Australian Rotary Health, NSW Health, American Foundation for Suicide Prevention, Ramsay Research and Teaching Fund, Elsevier, AstraZeneca, Janssen-Cilag, Lundbeck, Otsuka and Servier; and has been a consultant for AstraZeneca, Janssen-Cilag, Lundbeck, Otsuka and Servier. He is the recipient of an investigator-initiated grant from Janssen-Cilag (PoET Study), joint grant funding from the University of Sydney and National Taiwan University (Ignition Grant) and grant funding from The North Foundation.
Acknowledgements
The authors have no acknowledgements to report.
Footnotes
Supplementary data associated with this article can be found in the online version at doi:10.1016/j.xjmad.2025.100142.
Data availability
Vignettes, templates, and surveys used for this study are available from the corresponding author for non-commercial use.
References
- 1. Cipriani A., Furukawa T.A., Salanti G., Chaimani A., Atkinson L.Z., Ogawa Y., et al. Comparative efficacy and acceptability of 21 antidepressant drugs for the acute treatment of adults with major depressive disorder: a systematic review and network meta-analysis. Lancet. 2018;391(10128):1357–1366. doi: 10.1016/S0140-6736(17)32802-7.
- 2. Carter B., Strawbridge R., Husain M.I., Jones B.D.M., Short R., Cleare A.J., et al. Relative effectiveness of augmentation treatments for treatment-resistant depression: a systematic review and network meta-analysis. Int Rev Psychiatry. 2020;32(5–6):477–490. doi: 10.1080/09540261.2020.1765748.
- 3. Strawbridge R., Carter B., Marwood L., Bandelow B., Tsapekos D., Nikolova V.L., et al. Augmentation therapies for treatment-resistant depression: systematic review and meta-analysis. Br J Psychiatry. 2019;214(1):42–51. doi: 10.1192/bjp.2018.233.
- 4. Malhi G.S., Mann J.J. Depression. Lancet. 2018;392(10161):2299–2312. doi: 10.1016/S0140-6736(18)31948-2.
- 5. Lam R.W., Kennedy S.H., Adams C., Bahji A., Beaulieu S., Bhat V., et al. Canadian Network for Mood and Anxiety Treatments (CANMAT) 2023 update on clinical guidelines for management of major depressive disorder in adults. Can J Psychiatry. 2024;69(9):641–687. doi: 10.1177/07067437241245384.
- 6. Butlen-Ducuing F., Haberkamp M., Aislaitner G., Bałkowiec-Iskra E., Mattila T., Doucet M., et al. The new European Medicines Agency guideline on antidepressants: a guide for researchers and drug developers. Eur Psychiatry. 2023;67(1). doi: 10.1192/j.eurpsy.2023.2479.
- 7. Macneil C.A., Hasty M.K., Conus P., Berk M. Is diagnosis enough to guide interventions in mental health? Using case formulation in clinical practice. BMC Med. 2012;10:111. doi: 10.1186/1741-7015-10-111.
- 8. Rathnam S., Hart K.L., Sharma A., Verhaak P.F., McCoy T.H., Doshi-Velez F., et al. Heterogeneity in antidepressant treatment and major depressive disorder outcomes among clinicians. JAMA Psychiatry. 2024;81(10):1003–1009. doi: 10.1001/jamapsychiatry.2024.1778.
- 9. Adli M., Rush A.J., Möller H.J., Bauer M. Algorithms for optimizing the treatment of depression: making the right decision at the right time. Pharmacopsychiatry. 2003;36(Suppl 3):S222–S229. doi: 10.1055/s-2003-45134.
- 10. Adli M., Wiethoff K., Baghai T.C., Fisher R., Seemüller F., Laakmann G., et al. How effective is algorithm-guided treatment for depressed inpatients? Results from the randomized controlled multicenter German Algorithm Project 3 trial. Int J Neuropsychopharmacol. 2017;20(9):721–730. doi: 10.1093/ijnp/pyx043.
- 11. Kurian B.T., Trivedi M.H., Grannemann B.D., Claassen C.A., Daly E.J., Sunderajan P. A computerized decision support system for depression in primary care. Prim Care Companion J Clin Psychiatry. 2009;11(4):140–146. doi: 10.4088/PCC.08m00687.
- 12. Man C., Nguyen C., Lin S. Effectiveness of a smartphone app for guiding antidepressant drug selection. Fam Med. 2014;46(8):626–630.
- 13. Chin T., Huyghebaert T., Svrcek C., Oluboka O. Individualized antidepressant therapy in patients with major depressive disorder. Can Fam Physician. 2022;68(11):807–814. doi: 10.46747/cfp.6811807.
- 14. Siu A.L., Bibbins-Domingo K., Grossman D.C., Baumann L.C., Davidson K.W., Ebell M., et al. Screening for depression in adults: US Preventive Services Task Force recommendation statement. JAMA. 2016;315(4):380–387. doi: 10.1001/jama.2015.18392.
- 15. Perlis R.H., Goldberg J.F., Ostacher M.J., Schneck C.D. Clinical decision support for bipolar depression using large language models. Neuropsychopharmacology. 2024. doi: 10.1038/s41386-024-01841-2.
- 16. Bedi S., Liu Y., Orr-Ewing L., Dash D., Koyejo S., Callahan A., et al. Testing and evaluation of health care applications of large language models: a systematic review. JAMA. 2025;333(4):319–328. doi: 10.1001/jama.2024.21700.
- 17. Buckley T.A., Crowe B., Abdulnour R.E.E., Rodman A., Manrai A.K. Comparison of frontier open-source and proprietary large language models for complex diagnoses. JAMA Health Forum. 2025;6(3). doi: 10.1001/jamahealthforum.2025.0040.
- 18. Wilke L., Meyer D. Comparing partial rankings. UCSD; 2014.
- 19. Fagin R., Kumar R., Mahdian M., Sivakumar D., Vee E. Comparing partial rankings. SIAM J Discret Math. 2006;20(3):628–648. doi: 10.1137/05063088X.
- 20. National Guideline Clearinghouse. Practice Guideline for the Treatment of Patients with Major Depressive Disorder, third edition. 2010.
- 21. Perlis R.H. Research letter: application of GPT-4 to select next-step antidepressant treatment in major depression. medRxiv. 2023. doi: 10.1101/2023.04.14.23288595.
- 22. Jacobs M., Pradier M.F., McCoy T.H., Perlis R.H., Doshi-Velez F., Gajos K.Z. How machine-learning recommendations influence clinician treatment selections: the example of antidepressant selection. Transl Psychiatry. 2021;11(1):108. doi: 10.1038/s41398-021-01224-x.