1. Introduction
The rapid advancement of artificial intelligence (AI)-based technologies over the course of the last decade has led to dramatic innovation in healthcare technology across a wide variety of functional areas and clinical domains (Bains et al., 2024; J. I. Lim et al., 2024; Nguyen et al., 2024; Sarma et al., 2021; Williams et al., 2024a, 2024b). One historical challenge in the use of AI-based approaches in behavioral health, however, has been working with unstructured text-based data, such as clinical notes and interview transcripts. Recent advances in AI technology have led to the development of large language models (LLMs), such as OpenAI's ChatGPT, with the potential to address this gap by facilitating the processing of unstructured text.
LLMs, designed for natural language tasks, are text-to-text predictive models (also known as generative models) trained on very large unstructured text datasets, and they have shown great promise in natural language processing and understanding. Within behavioral health, LLMs have been investigated for a wide variety of uses, such as documentation creation, decision support, provider assistance, and autonomous intervention agents (Galatzer-Levy et al., 2023; Sharma et al., 2024, 2023; So et al., 2024; Taylor et al., 2024; Tierney et al., 2024; Xu et al., 2023). Reasoning, variably defined as the generation of conclusions from inputs in circumstances that require the application of logical techniques, has become an area of particular interest for LLMs given the potential real-world value of automated reasoning systems. The promise of LLMs, however, is predicated on the quality of the knowledge encoded within the models by the training process, and studies have noted that unfiltered base models can generate dangerous or harmful responses (Grabb et al., 2024).
A significant body of literature has explored methods for improving the performance of large language models on requests that require complex multi-step reasoning, including chain-of-thought reasoning (Diao et al., 2024; Li et al., 2024; OpenAI, 2024; Wei et al., 2023), implicit reasoning (Hao et al., 2024; Wang and Zhou, 2024), self-reflection/verification (Gero et al., 2023; Jeong et al., 2024; Renze and Guven, 2024), structured prompt engineering (Verhees et al., 2025), multi-agentic reasoning (S. Lim et al., 2024; Motwani et al., 2025; Wang et al., 2024), and many other approaches. In LLMs, reasoning appears to be an emergent property arising from training on large-scale datasets collected from publicly available written and internet literature, much of which encodes examples of language-based reasoning. These sources, however, may limit the applicability of the models to the specialized tasks found in the practice of psychiatry and behavioral health. One study of user intention found that 78% of patient respondents were willing to use ChatGPT for self-diagnosis (Shahsavar and Choudhury, 2023), and in our experience, patients frequently use LLMs to evaluate their own mental health concerns, obtain diagnostic and treatment recommendations, and even receive autonomous psychotherapy. They do so despite the potential risks of such uses, the stipulations of major vendors against using these tools for medical advice, and the limited accuracy, interpretability, and explainability of implicit or automated reasoning techniques (Chen et al., 2025; Petrov et al., 2025).
Few studies have directly examined the efficacy of LLMs on tasks related to knowledge or reasoning within behavioral health. Xu et al. (2023) developed and examined the efficacy of LLMs for predicting mental health-related metrics from Reddit posts, finding that LLMs were able to classify and stratify suicidality and depression from these short text snippets better than chance. Galatzer-Levy et al. (2023) examined the ability of the Med-PaLM 2 large language model to analyze patient interviews and case vignettes and predict psychometric scores and diagnoses, finding that the model showed promise in both applications. Verhees et al. (2025) examined how a systematic approach to prompt engineering can enhance model performance in the extraction of transdiagnostic psychopathological criteria from unstructured clinical text. Here, we present a study investigating 1) the ability of the GPT family of LLMs (OpenAI, San Francisco, CA) to reason clinically about behavioral health and 2) the efficacy of integrating clinical expert-derived reasoning (through the use of decision trees) into the models to improve the accuracy of diagnostic prediction.
2. Methods
2.1. Dataset
The primary dataset for this study comprised case vignettes extracted from the DSM-5-TR Clinical Cases (Barnhill, 2023) book. Each example consisted of a multi-paragraph narrative vignette about a psychiatric clinical case and one or more DSM-5-TR (American Psychiatric Association, 2022) diagnoses assigned to the patient by the author of the vignette (“author-designated diagnoses”). The vignettes were organized by the DSM-5-TR diagnostic category corresponding to the patient’s primary diagnosis. A total of 106 cases were retrieved from the book. Cases from the sections on Elimination Disorders, Gender Dysphoria, Personality Disorders, and Paraphilic Disorders were discarded due to these DSM categories not being covered by the Handbook of Differential Diagnosis (see below). The remaining 93 cases were split into training and testing sets (of 38 and 55 cases respectively), using sampling stratified on primary diagnosis DSM category.
2.2. Inference approach and large language models
2.2.1. Large language models and parameters
For comparison, three successive versions of the commonly used GPT family of LLMs (OpenAI, San Francisco, CA) were evaluated: GPT-3.5, GPT-4, and GPT-4o, which we term the “generalist large language models” (gLLMs). Additional technical detail is available in Appendix B.
2.2.2. Prompting strategies for prediction
Two prompting strategies for prediction were implemented and compared in this study in order to evaluate the capability of the study gLLMs to reason clinically about mental health and to integrate external expert reasoning into their predictions. In the first approach, termed the "base" prompting approach, the gLLM was directly prompted to assign diagnoses to the vignette, without the inclusion of outside knowledge or the use of iterative prompting. This was performed by combining the System Prompt and the Direct Prompt and querying the model. In the second approach, termed the "decision tree" (DT) prompting approach, a decision tree-based system was implemented. In this approach, the LLM is iteratively prompted with specific behavioral health questions regarding the input vignette. Candidate diagnoses are predicted based on the answers to these questions, the list of candidates is then narrowed through additional queries to the LLM, and a matching and reconciliation process produces the final diagnoses. An overview of the experimental process is displayed in Figure 1. Prompt templates for both approaches are shown in Table 1. All study prompting was conducted in August 2024.
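To make the Base approach concrete, the following is a minimal sketch of a single direct query, assuming the openai Python client described in Appendix B; the prompt text is abridged from Table 1, and `base_predict` is an illustrative name rather than the study's actual code.

```python
import json
from openai import OpenAI  # assumes the openai Python client used via the OpenAI API (Appendix B)

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = (
    "I am going to give you an academic psychiatry clinical case that describes a patient "
    "with one or more psychiatric DSM-5-TR diagnoses, and you are going to answer questions "
    "about that case that I provide."
)

def base_predict(vignette: str, model: str = "gpt-4o-2024-05-13") -> list[str]:
    """Base approach: one direct query for a JSON list of diagnoses (prompt abridged from Table 1)."""
    direct_prompt = (
        f"The clinical case is as follows: {vignette} "
        "Please provide me with a list of DSM-5-TR diagnoses, without specifiers or "
        "modifiers, that you believe apply to this patient based solely on the clinical "
        'case. Please format them as a JSON list titled "diagnoses" with one diagnosis '
        "per entry. Do not include any other text in your response."
    )
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # decoding settings per Appendix B
        top_p=1,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": direct_prompt},
        ],
    )
    # The Direct Prompt requests a bare JSON object, e.g. {"diagnoses": ["Insomnia Disorder"]}
    return json.loads(response.choices[0].message.content)["diagnoses"]
```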
Figure 1. Experimental Process.

Overview of the experimental process; note that the provided example is synthetic for demonstrative purposes. There are seven numbered stages. In stage 1 (Query), the initial diagnostic query is formulated from the case vignette. In stage 2 (Model), either the Base or Decision Tree (DT; see Figure 2) prompting approach is used to evaluate the query, using the generalist large language model (gLLM) assigned to the experiment and the prompt templates from Table 1. In stage 3 (Raw Response), the raw diagnosis candidate responses from the gLLM are received; in this example, there are five raw responses (a–e). In stage 4 (Simplify), the simplification procedure outlined in Appendix C is performed in order to standardize the candidates. For response a, the specifier "mild" is removed from the first diagnosis (rule 1), and responses b and e are simplified to generalized diagnoses (rules 2 and 3). In stage 5 (Fuzzy Match), each simplified response is compared to DSM-5-TR diagnoses using the fuzzy string matching system (see Appendix C) and assigned a matching score to the closest diagnosis. In stage 6 (Reconcile), responses with a matching score of 99% or less are reconciled by the semi-automated reconciliation protocol (see Appendix C); in this case, a is matched to major depressive disorder using rule 3 (minor typographical change), c is matched to diabetes using rule 1, and d is matched to other/unspecified schizophrenia spectrum and other psychotic disorder using rule 5. Responses b and e do not need reconciliation because their matching scores are 100%. In stage 7 (Scoring), the final results are scored against the author-designated ground truth diagnoses and assigned as True Positive (TP) or False Positive (FP). Any author-designated diagnoses that were not produced as responses would be designated as False Negative (FN).
Table 1.
Prompt templates used for experiments.
| Prompt | Template |
|---|---|
| System Prompt | I am going to give you an academic psychiatry clinical case that describes a patient with one or more psychiatric DSM-5-TR diagnoses, and you are going to answer questions about that case that I provide. |
| Direct Prompt | The clinical case is as follows: <X> Please provide me with a list of DSM-5-TR diagnoses, without specifiers or modifiers, that you believe apply to this patient based solely on the clinical case. Please format them as a JSON list titled "diagnoses" with one diagnosis per entry. Do not include any other text in your response. Do not include any incorrect, inappropriate, or candidate diagnoses. For example, if the diagnoses are "Insomnia Disorder" and "Bipolar I Disorder", you would reply with: {"diagnoses": ["Insomnia Disorder", "Bipolar I Disorder"]} |
| Decision Tree Question Prompt | The clinical case is as follows: <X> Please answer the following question 'yes' or 'no' without explanation, based on the facts stated in the clinical case. Answer with only the words 'yes' or 'no'. If there is insufficient information, answer 'no'. The question is as follows: <Y>. |
| Pairwise Comparison Prompt | The following two DSM-5-TR diagnoses are candidate diagnoses for this patient: <X> and <Y>. We are interested in deciding if both diagnoses are necessary for the patient, or if one of these two diagnoses is better explained by the other. Please respond with a list of which of these two diagnoses are necessary for the patient in JSON format. For example, if the candidate diagnoses are 'major depressive disorder' and 'adjustment disorder', respond with ["major depressive disorder", "adjustment disorder"] if both diagnoses are necessary, or with either ["major depressive disorder"] or ["adjustment disorder"] if one diagnosis is better than the other. |
2.2.3. Decision tree implementation
The DSM-5-TR Handbook of Differential Diagnosis (the "Handbook") was used as the expert knowledge source for the initial development of the decision tree prompting model (First, 2024). This handbook consists of a series of 28 symptom-based decision trees for diagnosis. Each tree consists of a series of yes or no questions, the answers to which lead either to other questions or to diagnoses. All 28 trees were extracted and implemented as iterated yes or no prompts ("question prompts"). For each tree, a one-paragraph prompt was written describing the symptom category pertaining to the tree and asking whether the vignette describes a patient experiencing that category of symptoms ("screening prompts"). See Table 1 for prompt templates and Appendix A for examples.
To make predictions using the decision tree approach, each vignette was processed through each of the 28 decision trees. First, the screening prompt for each tree (see Supplemental Materials 1 for prompt text) was used to determine which decision trees could be applicable to the vignette; then, for each applicable tree, the question prompts were applied iteratively by combining the System Prompt with the Question Prompt, collecting diagnoses from each tree into a list of candidate diagnoses for the vignette. To narrow the list of candidate diagnoses into the final list of diagnoses ("model-predicted diagnoses"), the model was prompted to compare each pair of candidate diagnoses and determine whether both diagnoses were necessary, or only one of them, using the Pairwise Comparison Prompt (Table 1). An overview of the decision tree approach is provided in Figure 2.
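As an illustration of this procedure, the following is a hedged sketch of one tree traversal; it mirrors the unrefined Handbook structure, in which a single walk reaches one leaf, and `DTNode`, `ask_yes_no`, and `run_tree` are our illustrative names, not the study's implementation (`client` and `SYSTEM_PROMPT` as in the Base approach sketch in Section 2.2.2).

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class DTNode:
    """A node of an adapted Handbook decision tree: an internal yes/no question or a leaf diagnosis."""
    question: Optional[str] = None   # question prompt text (None at a leaf)
    yes: Optional["DTNode"] = None   # branch taken on a 'yes' answer
    no: Optional["DTNode"] = None    # branch taken on a 'no' answer
    diagnosis: Optional[str] = None  # candidate diagnosis emitted at a leaf

def ask_yes_no(vignette: str, question: str, model: str = "gpt-4o-2024-05-13") -> bool:
    """Combine the System Prompt with the Decision Tree Question Prompt (Table 1)."""
    prompt = (
        f"The clinical case is as follows: {vignette} "
        "Please answer the following question 'yes' or 'no' without explanation, based on "
        "the facts stated in the clinical case. Answer with only the words 'yes' or 'no'. "
        f"If there is insufficient information, answer 'no'. The question is as follows: {question}"
    )
    response = client.chat.completions.create(
        model=model, temperature=0, top_p=1,
        messages=[{"role": "system", "content": SYSTEM_PROMPT},
                  {"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip().lower().startswith("yes")

def run_tree(vignette: str, node: DTNode, candidates: list[str]) -> None:
    """Walk one decision tree, appending any candidate diagnosis reached at a leaf."""
    if node.diagnosis is not None:
        candidates.append(node.diagnosis)
        return
    branch = node.yes if ask_yes_no(vignette, node.question) else node.no
    if branch is not None:
        run_tree(vignette, branch, candidates)
```

Candidates pooled across all screened-in trees would then be narrowed with the Pairwise Comparison Prompt (Table 1).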
Figure 2. Decision Tree Approach.

Overview of the decision tree prompting process; note that the provided examples are synthetic for demonstrative purposes. Prompts have been shortened for clarity and display. a) The screening prompt for each of the 28 trees (see Supplemental Materials 1) is used to determine which trees should be used for a given vignette. In this case, the speech disturbance tree is used (more than one tree could be selected for a single vignette). b) The vignette is then processed through each selected tree. The decision trees consist of iterated yes/no prompts and can make use of branching logic (i.e., the answer to one prompt directs which prompt is presented next); some answers lead to the addition of a candidate diagnosis. In this case, the "Yes" answer displayed leads to the addition of the candidate diagnosis "Language Disorder". Multiple candidate diagnoses could be generated by a single tree. c) Once all candidate diagnoses from all selected trees have been collected, each pair of diagnoses is inserted into a pairwise comparison prompt (see Table 1 for the full prompt template) to the generalist large language model (gLLM) used for the experiment. This may result in the elimination of one of the candidate diagnoses. In this case, the diagnosis "Language Disorder" has been eliminated.
2.3. Decision tree refinement
In the first phase of the project, the decision tree prompts from the Handbook were refined through experimentation on the 38 training set vignettes. The prompts were initially implemented using the exact decision trees provided in the source handbook. The model was then used to generate initial predictions on the training cases. During this process, each prompt and its LLM response were logged for review, and these logs were reviewed for every vignette. Based on this review, nine common categories of incorrect inference based on the decision trees were identified, eight based on disorder groupings and one related to problems in the overall tree structure.
Once these categories were developed, they were used to refine the trees, screening prompts, and question prompts. The primary refinement approach was to address the discovered common themes using known best practices for LLM prompt optimization, such as task decomposition and sequential tasking (Zhou et al., 2023), i.e., dividing a complex task into smaller, specific tasks that are executed sequentially (e.g., breaking "Is the patient experiencing a Manic Episode?" into a series of criterion-specific questions). For each category, a set of refinements was developed and implemented in the decision tree prompting system. Common refinements included expanding definitions of specialized words used in behavioral health (e.g., "egosyntonic") and expanding references to criteria to include the full criteria (e.g., prompting specifically for each of the criteria of a Manic Episode rather than prompting the model to determine if "criteria for a Manic Episode are met"). A summary of the categories and implemented refinements is presented in Table 2.
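As a hedged illustration of this kind of task decomposition, the sketch below splits a compound episode-level question into criterion-specific question prompts; the probe wording and the aggregation threshold are illustrative placeholders, not the study's refined prompts, and `ask_yes_no` is reused from the sketch in Section 2.2.3.

```python
# Compound form that the unrefined trees relied on:
COMPOUND = "Were criteria met for a Manic Episode?"

# Decomposed, criterion-specific form (illustrative wording only):
MANIC_PROBES = [
    "Did the patient have a distinct period of abnormally elevated, expansive, or "
    "irritable mood with increased energy or activity lasting at least one week?",
    "During that period, did the patient show inflated self-esteem or grandiosity?",
    "During that period, did the patient have a decreased need for sleep?",
    "During that period, was the patient more talkative than usual or pressured to keep talking?",
    "During that period, did the patient engage in risky activities with high potential "
    "for painful consequences?",
]

def meets_manic_episode(vignette: str) -> bool:
    """Ask each criterion-specific probe sequentially; the threshold below is a
    placeholder for the DSM-5-TR criterion B symptom count, not the exact rule."""
    answers = [ask_yes_no(vignette, q) for q in MANIC_PROBES]
    return answers[0] and sum(answers[1:]) >= 3
```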
Table 2.
Findings and results of decision tree refinement process by category.
| Depressive, Bipolar and Related Disorders. The initial DT prompts were not able to reliably assess whether criteria for a Major Depressive Episode (MDE), Manic Episode, or Hypomanic Episode were met in the absence of specification (i.e., they could not reliably assess the question "Were criteria met for a Manic Episode?"). To address this issue, a common set of sequential, decomposed question prompts was created to specifically assess DSM-5-TR criteria for MDE, Manic Episode, and Hypomanic Episode. The DT approach was not able to reliably differentiate between different affective disorders (e.g., Major Depressive Disorder vs. Bipolar I Disorder) because the underlying decision trees from the Handbook provided multiple diagnostic options at leaf nodes (e.g., the tree result might be "Bipolar I Disorder or Major Depressive Disorder or Schizoaffective Disorder" if criteria for mania were met). To address this issue, a set of question prompts was created to specifically differentiate between these disorders. Finally, the DT approach inconsistently made diagnoses of Persistent Depressive Disorder due to different trees in the Handbook using different criteria for diagnosis. To address this issue, a consistent set of prompts was created to diagnose this disorder. |
| Substance-Related and Addictive Disorders. The initial DT approach often incorrectly diagnosed a Substance Use Disorder due to apparent misinterpretations of the term "substance" (e.g., diagnosing Pica as a Substance Use Disorder). To address this, the term "substance" was clarified in question prompts to "a substance of abuse or medication." The DT approach was often not able to differentiate between Substance Intoxication, Substance Withdrawal, and Other Adverse Effect of Medication because the trees in the Handbook often provided all of these as possible diagnostic options at leaf nodes. To address this, a specific set of prompts was created to determine which of these three diagnoses, if any, applied. Additionally, the DT approach often diagnosed Substance Intoxication or Withdrawal in circumstances where the vignette included a history of a past episode of intoxication or withdrawal; this was addressed by clarifying in question prompts that the diagnoses referred to "current" intoxication or withdrawal. |
| Obsessive-Compulsive and Related Disorders. The DT approach often diagnosed Obsessive-Compulsive Disorder when an anxiety disorder diagnosis was more appropriate; this was traced to incorrect responses to question prompts assessing whether the patient had ego-dystonic thoughts, which suggested that the large language model could not reliably interpret the term "ego-dystonic." This was addressed by adding an explicit definition to all prompts using the term. |
| Personality Disorders. The DT approach often diagnosed Borderline Personality Disorder in any vignette that included mention of self-injury; this was found to be due to the use of inconsistent definitions across different trees in the Handbook. To address this issue, a consistent defining clause was added to all relevant question prompts: “a persistent and pervasive pattern of instability of interpersonal relationships, self-image, and affects, and marked impulsivity, beginning by early adulthood.” |
| Trauma- and Stressor-Related Disorders. The DT approach was found to diagnose Adjustment Disorder for almost all vignettes; this was found to be due to the trees in the Handbook not explicitly making mention of DSM-5-TR criterion C or D for Adjustment Disorder (“The stress-related disturbance does not meet the criteria for another mental disorder and is not merely an exacerbation of a preexisting mental disorder,” and “The symptoms do not represent normal bereavement and are not better explained by prolonged grief disorder”). To address this issue, a specific set of prompts was implemented to assess for the full DSM-5-TR criteria. |
| Anxiety Disorders. The DT approach was found to inconsistently diagnose Generalized Anxiety Disorder; this was found to be due to inconsistent definitions used in the Handbook trees. This was addressed through the development of standardized language (“excessive worry and anxiety about several events or situations, occurring more days than not for at least 6 months, about a number of different unrelated issues, events or activities”). |
| Schizophrenia Spectrum and Other Psychotic Disorders. The DT approach was found to inconsistently make the differential diagnosis between Schizophrenia, Schizophreniform Disorder, and Schizoaffective Disorder; this appeared to be due to difficulty determining whether the patient had psychotic symptoms in the absence of a mood episode using the Handbook trees. To address this, a set of decomposed, sequential prompts was developed to differentiate between these three disorders. |
| Sleep/Wake Disorders and Sexual Dysfunctions. The DT approach was found to diagnose Insomnia Disorder in almost all vignettes that included sleep disruption. To address this, expanded language was added to the relevant question prompts: "occurring at least 3 nights per week for at least 3 months, and is not explainable by a mental disorder other than insomnia disorder." The DT approach was also found to inconsistently make diagnoses in the trees for insomnia, hypersomnolence, and sexual dysfunction; this appeared to be due to these Handbook trees being designed differently from the other trees. To address this, the trees were refactored to use a single question prompt for each diagnosis. |
| Structural Limitations. The DT approach was found to miss diagnoses in circumstances where the patient had more than one diagnosis with similar presenting symptoms, such that multiple diagnoses fell within a single tree; this was due to the structure of the Handbook trees allowing only one diagnosis per tree to be assigned. This was addressed by altering the implementation to allow all potential diagnoses in the tree to be considered, with structural limitations only for diagnoses that are mutually exclusive. The DT approach also produced extra diagnoses in circumstances where the Handbook trees had leaf nodes with multiple possible diagnoses (e.g., "MAJOR DEPRESSIVE DISORDER; SCHIZOPHRENIA"). This was addressed by adding additional question prompts or referrals to other trees to differentiate between the possible diagnoses. The DT approach was found to inconsistently attribute disorders that were secondary to nonpsychiatric medical conditions to primary psychiatric disorders, due to an apparent inability to distinguish between the physiological effects of a nonpsychiatric condition and psychological reactions to having a medical condition. This was addressed through the addition of specific clarifying language to relevant question prompts. After making the above changes, the DT approach was found to make Other Specified/Unspecified diagnoses excessively. This was found to be a consequence of altering the tree implementation to allow all potential diagnoses to be considered, as the Handbook trees were designed to use these diagnoses as final catch-all leaves. This was addressed by only allowing Other Specified/Unspecified diagnoses to be made if no other diagnoses were made from a decision tree. |
2.4. Diagnosis matching and simplification
To improve the initial tractability of the task, facilitate ease of comparison, and match the task most closely to the diagnostic power of the decision trees from the Handbook (which makes several simplifications to DSM-based diagnoses), all diagnoses (including author-designated and model-predicted) were systematically simplified using a standardized procedure (see Appendix C). After prediction and simplification, all diagnoses further underwent a semi-automated matching process to associate each with an exact DSM-5-TR diagnosis (or simplified diagnosis per the above protocol) when possible (see Appendix C). We opted to make use of simplification and the semi-automated matching approach for several reasons: firstly, the decision trees in the Handbook of Differential Diagnosis make use of simplifications that needed to be applied systematically to the model outputs for comparison; secondly, we wanted to separate the question of whether the model could consistently avoid extra specifiers and other text from that of whether the model was able to reason appropriately about the diagnosis (and so, for example, felt it would be non-informative to penalize the model for adding an abbreviation to a diagnosis name).
2.5. Analysis
For each vignette, base and DT approach predictions were created for each gLLM and then compared to the vignette author-designated diagnoses. Model-generated diagnoses were scored as a True Positive (TP) if they matched one of the author-designated diagnoses, or otherwise as a False Positive (FP). Author-designated diagnoses were scored as a False Negative (FN) if there was no matching model-generated diagnosis. This scoring approach was also used for associated categories. After scoring, the positive predictive value (PPV) and sensitivity were calculated for each testing set vignette, the composite F1 performance score was calculated from these scores, and then all were averaged across the dataset for reporting (also known as macro averaging, see Appendix D for equations); in circumstances where the calculation would cause a division by zero, 0 was substituted for the result. We chose to use the macro average to give each vignette equal weight in the overall metric. For statistical testing, the paired Student’s t-test was used. For each gLLM, a two-tailed test was performed to compare PPV, sensitivity and F1 between the Base and DT approaches using a significance level of 0.05.
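The per-vignette scoring and macro averaging can be sketched as follows; the (TP, FP, FN) tallies below are hypothetical stand-ins for the per-vignette counts, and `scipy.stats.ttest_rel` provides the paired two-tailed test.

```python
from statistics import mean
from scipy.stats import ttest_rel

def vignette_metrics(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Per-vignette PPV, sensitivity, and F1; 0 is substituted wherever a denominator is zero."""
    ppv = tp / (tp + fp) if (tp + fp) else 0.0
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * ppv * sensitivity / (ppv + sensitivity) if (ppv + sensitivity) else 0.0
    return ppv, sensitivity, f1

# Hypothetical per-vignette (TP, FP, FN) tallies for the two approaches
base_counts = [(1, 2, 0), (2, 1, 1), (0, 3, 1)]
dt_counts = [(1, 0, 0), (2, 0, 1), (1, 1, 0)]

# Macro averaging: metrics are computed per vignette, then averaged across the set
base_f1 = [vignette_metrics(*c)[2] for c in base_counts]
dt_f1 = [vignette_metrics(*c)[2] for c in dt_counts]
print(f"macro F1: base={mean(base_f1):.3f}, DT={mean(dt_f1):.3f}")

# Paired two-tailed Student's t-test across vignettes (alpha = 0.05)
result = ttest_rel(base_f1, dt_f1)
print(f"t={result.statistic:.3f}, p={result.pvalue:.4f}")
```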
3. Results
3.1. Predictions, diagnosis matching and simplification
Prediction, diagnosis matching, and simplification were carried out on the 55 test set vignettes; all vignettes were successfully processed, and the maximum model context length was never exceeded for any of the evaluated gLLMs. There were a total of 79 author-designated diagnoses and 75 author-designated categories across the test set vignettes. After completing diagnosis matching (for complete simplification and matching detail, see Supplemental Materials 2), a small number of Base approach diagnoses were not matchable to a DSM-5-TR diagnosis, a DSM-5-TR Z code, or a non-psychiatric illness; three such diagnoses were noted for gpt-3.5, one for gpt-4, and three for gpt-4o. This represented less than 1% of the total number of predicted diagnoses for each model. At the diagnosis level of specificity, the Base approach predicted 171–177 diagnoses, with TP, FP, and FN ranges of 55–62, 109–122, and 17–24, respectively; the DT approach predicted 74–96 diagnoses, with TP, FP, and FN ranges of 44–55, 30–42, and 24–35, respectively. At the category level of specificity, the Base approach predicted 142–146 categories, with TP, FP, and FN ranges of 69–70, 72–77, and 5–6, respectively; the DT approach predicted 67–84 categories, with TP, FP, and FN ranges of 53–61, 14–23, and 14–22, respectively. Diagnosis and category counts and classifications after analysis are found in Table 3, and full scoring details by vignette, approach, and model are available in Supplemental Materials 3.
Table 3.
Result and classification counts for predicted diagnoses across experiments.
| Level | gLLM | Approach | Author Dx | Model Dx | TP | FP | FN |
|---|---|---|---|---|---|---|---|
| Diagnosis | gpt-3.5 | Base | 79 | 177 | 55 | 122 | 24 |
| Diagnosis | gpt-3.5 | DT | 79 | 74 | 44 | 30 | 35 |
| Diagnosis | gpt-4 | Base | 79 | 174 | 61 | 113 | 18 |
| Diagnosis | gpt-4 | DT | 79 | 96 | 54 | 42 | 25 |
| Diagnosis | gpt-4o | Base | 79 | 171 | 62 | 109 | 17 |
| Diagnosis | gpt-4o | DT | 79 | 89 | 55 | 34 | 24 |
| Category | gpt-3.5 | Base | 75 | 146 | 69 | 77 | 6 |
| Category | gpt-3.5 | DT | 75 | 67 | 53 | 14 | 22 |
| Category | gpt-4 | Base | 75 | 146 | 70 | 76 | 5 |
| Category | gpt-4 | DT | 75 | 84 | 61 | 23 | 14 |
| Category | gpt-4o | Base | 75 | 142 | 70 | 72 | 5 |
| Category | gpt-4o | DT | 75 | 81 | 61 | 20 | 14 |
Numbers reflect results after post-processing. gLLM: Generalist Large Language Model, TP: True Positives, FP: False Positives, FN: False Negatives, DT: Decision Tree Approach, Dx: Diagnosis
3.2. Statistical analysis
Analysis results by gLLM, prompting approach (Base or DT), and level of granularity (diagnosis or category) are shown in Table 4. At the diagnosis level of specificity, the DT approach had significantly higher PPV and F1 scores than the Base approach (with average increases of +22% and +0.13, respectively) for all gLLMs, with no significant differences in sensitivity (see Figure 3 for diagnosis-level results by vignette). At the category level of specificity, the DT approach had significantly higher PPV (average +20%) for all gLLMs but significantly lower sensitivity for the gpt-3.5 and gpt-4 gLLMs (average −13% across all 3 models), leading to a significant increase in F1 for the gpt-4 and gpt-4o gLLMs (average +0.09 across all 3 models) (see Figure 4 for category-level results by vignette).
Table 4.
Results of performance analysis of predictions for diagnosis and diagnostic category, by experiment.
| gLLM | Approach | Sensitivity (Diagnosis) | PPV (Diagnosis) | F1 (Diagnosis) | Sensitivity (Category) | PPV (Category) | F1 (Category) |
|---|---|---|---|---|---|---|---|
| gpt-3.5 | Base | 69.24% | 33.21% | 0.4251 | 91.21% * | 54.27% | 0.6424 |
| gpt-3.5 | DT | 56.36% | 60.30% * | 0.5521 * | 74.24% | 80.91% * | 0.7479 |
| gpt-4 | Base | 76.36% | 38.70% | 0.4872 | 93.94% * | 54.45% | 0.6515 |
| gpt-4 | DT | 70.00% | 60.73% * | 0.6255 * | 80.61% | 72.12% * | 0.7376 * |
| gpt-4o | Base | 76.67% | 40.42% | 0.5015 | 92.12% | 56.64% | 0.6598 |
| gpt-4o | DT | 70.91% | 65.27% * | 0.6585 * | 81.52% | 76.36% * | 0.7642 * |
* denotes values that are statistically significantly higher by paired two-tailed t-testing between the Base and DT approaches. All metrics were calculated using macro-averaging across vignettes. gLLM: Generalist Large Language Model, PPV: Positive Predictive Value, DT: Decision Tree Approach
Figure 3. Per-vignette evaluation results, with analysis by diagnosis.

This color map displays the computed sensitivity, PPV, and F1 statistic for every vignette, using the predicted specific diagnoses for evaluation. Vignettes are sorted based on the associated diagnostic category chapter of the DSM-5-TR Clinical Cases book in which they are found. Results are displayed per generalist large language model (gLLM) used and by predictive approach used, i.e., the Base approach or the Decision Tree (DT) approach. All values are normalized to color using the 0–1 range and a color legend is available in the bottom right of the map. Each small box represents a specific result for a specific vignette. The overall macro-averaged results are available in Table 4.
Figure 4. Per-vignette evaluation results for analysis by category.

This color map displays the computed sensitivity, PPV, and F1 statistic for every vignette, using the predicted diagnostic categories for evaluation. Vignettes are sorted based on the associated diagnostic category chapter of the DSM-5-TR Clinical Cases book in which they are found. Results are displayed per generalist large language model (gLLM) used and by predictive approach used, i.e., the Base approach or the Decision Tree (DT) approach. All values are normalized to color using the 0–1 range and a color legend is available in the bottom right of the map. Each small box represents a specific result for a specific vignette. The overall macro-averaged results are available in Table 4.
4. Discussion
In this paper, we sought to evaluate the capabilities of the GPT family of large language models when applied to psychiatric reasoning and diagnosis, and to evaluate whether directly integrating clinician-expert guidance (in the form of decision trees) into the models improved their psychiatric performance. To this end, we evaluated two paradigms for utilizing the LLMs to make diagnoses: 1) directly prompting the models to predict diagnoses without access to knowledge external to the model (the Base approach), and 2) adapting expert-created diagnostic decision trees into sequential prompts in order to produce candidate predictions, and then prompting the model to determine which candidates were most appropriate. Overall, we found that both approaches were able to make appropriate diagnostic predictions, with some trade-offs noted between the two approaches. This finding demonstrates that the underlying large language models do have psychiatric reasoning capabilities, despite not having been trained for this specific purpose.
4.1. How do LLMs perform when directly prompted to estimate diagnoses?
In direct prediction efforts using the Base approach, we found that the model was able to correctly produce the majority of correct diagnoses, with the sensitivity mean ranging from 69% (gpt-3.5) to 77% (gpt-4o). This is concordant with the findings of Galatzer-Levy et al. (2023), who used a different family of models on a restricted subset of vignettes and diagnoses and found 77.5% accuracy in predicting the correct primary diagnosis. When assessing whether the model predicted diagnoses in the correct DSM-5-TR category (rather than whether the specific diagnoses were correct), the Base approach demonstrated impressive sensitivity means of 91% (gpt-3.5) to 94% (gpt-4). We interpret this result as demonstrating that the large language models have an inherent capacity to extract symptoms and mental health concerns from narratives and reason about likely diagnoses using this information; indeed, the degree of concordance between the predicted diagnoses and the author-designated diagnoses is superior to that found for most of the mental disorders studied in the DSM-5 field trials (Clarke et al., 2013; Regier et al., 2013) (which found pooled intraclass kappa statistics of 0.46 for schizophrenia, 0.56 for bipolar disorder, and 0.28 for major depressive disorder).
However, we also found that fewer than half of the predicted diagnoses were correct, with the PPV mean ranging from 33% (gpt-3.5) to 40% (gpt-4o) for specific diagnoses and from 54% to 57%, respectively, for diagnostic categories. We hypothesize that these significant overdiagnosis rates reflect the models' limited capability to apply criteria and clinical judgement to determine whether a patient's symptoms meet the DSM-5-TR-defined threshold for a behavioral health diagnosis.
The interpretation of sensitivity and PPV findings must be done in the context of an intended use case, as acceptable performance depends on the risks and potential consequences of correct or incorrect findings. For example, higher sensitivity and lower PPV may be most acceptable for screening tests that are intended to feed into further confirmatory testing (in such circumstances, false positives can be caught later in the assessment course, but false negatives could lead to harm). On the other hand, in circumstances where a false positive itself could lead to harm (e.g., by generating anxiety or overtreatment), the PPV is essential. In the context of this study, we find the propensity of the LLMs for overdiagnosis (i.e., low PPV) particularly concerning given how frequently, in our practice, patients use publicly available LLMs for self-diagnosis; this usage most closely resembles our Base approach and is likely subject to the same limitation. Clinicians should ensure that patients who use LLMs to help manage their own mental health are adequately informed about this risk. The degree of risk and potential harm may vary by patient, and future work could study how emergent patient-driven uses of public LLMs impact both the general population and people with mental disorders. Since such uses will likely only increase as these technologies continue to permeate the public consciousness, the development of guidelines to prevent inappropriate clinical use of these technologies is of great importance.
All three gLLMs output some diagnoses that could not be reconciled with a DSM-5-TR diagnosis. These "Non-DSM" diagnoses fell into two broad categories: medical features (such as chronic pain or obesity) and psychiatric features (such as suicidal ideation or postpartum psychosis). We see the medical features as model alignment errors, as the model was clearly instructed to focus on psychiatric DSM-5-TR diagnoses in the System Prompt and the Direct Prompt; it may be possible in future work to filter these diagnoses out with additional post-processing prompts. The Non-DSM psychiatric features included symptoms (such as suicidal ideation), associated factors (such as bullying victimization), and commonly used diagnosis-like phrases that do not appear in the DSM-5-TR (such as postpartum psychosis). Notably, the majority of these features (including postpartum psychosis, postpartum depression, sexual orientation questioning, and suicidal behavior disorder) are related to diagnoses that were either in prior versions of the DSM or have been suggested in the literature as future additions to the DSM. This suggests that the parametric knowledge encoded in the LLMs is not always able to correctly separate concepts that are in the DSM from those that are not, likely due to the probabilistic nature of the models' functioning.
4.2. Does the integration of expert decision trees improve the diagnostic capabilities of LLMs?
In decision tree-based prediction efforts using the DT approach, we found the model was again able to correctly produce the majority of correct diagnoses, with sensitivity means ranging from 56% (gpt-3.5) to 71% (gpt-4o); we did not find a statistically significant difference between the two approaches in sensitivity scores for specific diagnoses when using the paired Student's t-test. When looking at diagnostic categories, we did find that for the gpt-3.5 and gpt-4 models, the DT approach had statistically significantly lower sensitivity scores than the Base approach. When examining PPV, however, we found that the DT approach demonstrated a statistically significant improvement for all models at both the specific diagnosis and diagnostic category levels. This led to a significant improvement in F1 score using the DT approach for all experiments except the evaluation of performance by diagnostic category for the gpt-3.5 model.
We hypothesize that these results demonstrate that the integration of the decision trees improved the capability of the model to apply diagnostic criteria in order to determine whether behavioral health symptoms met the threshold of a diagnosis, leading to an improvement in PPV across all experiments. The concomitant reduction in sensitivity is consistent with the known tradeoff between the two metrics when the "threshold" for predicting a positive class is increased. It may also reflect increased difficulty for the model in answering specific questions about diagnostic criteria from the unstructured vignette (as opposed to the easier task of extracting broad symptoms). Of note, we found that significant adaptation of the decision trees found in the Handbook was required in order to optimize performance for our study. This is consistent with previously reported results (Nori et al., 2023; Zhou et al., 2023) that have demonstrated the impact of prompt engineering techniques on model performance.
4.3. How has the progression of GPT models impacted their performance in psychiatric reasoning?
We found that, broadly, performance tended to improve with the use of successive GPT models. The biggest jump in performance by F1 score was between the gpt-3.5 and gpt-4 models for both approaches, with a more modest increase between gpt-4 and gpt-4o. This is consistent with reported results in other domains (Shahriar et al., 2024), and may reflect in part that changes to the large language models are aimed both to improve performance and to reduce the cost of prediction – goals that often require tradeoffs. Notably, costs are significantly different between the three models. These costs are priced in units of 1M input and output tokens, and are $0.50/$1.50 for gpt-3.5, $10/$30 for gpt-4, and $5/$15 for gpt-4o.
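For illustration, the per-query cost at these prices can be computed as follows; the token counts in the example are assumed, not measured from the study.

```python
# (input $, output $) per 1M tokens, as listed above
PRICES = {"gpt-3.5": (0.50, 1.50), "gpt-4": (10.00, 30.00), "gpt-4o": (5.00, 15.00)}

def query_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    price_in, price_out = PRICES[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000

# e.g., a hypothetical 2,000-token vignette prompt with a 100-token response:
for m in PRICES:
    print(f"{m}: ${query_cost(m, 2_000, 100):.4f}")
# gpt-3.5: $0.0012, gpt-4: $0.0230, gpt-4o: $0.0115
```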
4.4. What are the potential limitations, ethical considerations, impact, and opportunities for further investigation presented by this work?
This effort represents the first comprehensive evaluation of the performance of GPT models in psychiatric diagnosis with and without the use of external expert guidance. Executing the study required two technical compromises that may represent potential limitations. In order to make use of standardized, commonly available input examples that were not designed specifically for AI applications, case vignettes were obtained from the APA's published casebook. Given that the GPT models were trained on corpora that include published books, it is possible that current or former versions of the casebook were included in the training dataset. This might have provided the models with an advantage in making diagnoses using these examples that would not be present when analyzing other data. We believe that the probability of significant influence on our results is low due to the length of the vignette examples. Additionally, there is no published material on the application of the Handbook decision trees to the casebook vignettes, reducing the probability that memorization would affect the DT results. The use of vignettes from a single source may also limit the extensibility of our results; however, we are reassured by the fact that the vignettes are written by a wide variety of authors with variable styles. Since the vignettes largely describe patients with at least one psychiatric diagnosis, we made use of a high-prevalence population, and our system prompt (Table 1) assumes that the subject has at least one diagnosis. This may limit the applicability of our results in a low-prevalence population due to the spectrum effect.
Additionally, in order to facilitate direct comparison between author-designated diagnoses and predicted diagnoses, semi-automated review was used with a standardized diagnosis simplification system. This approach could have hidden important differences between the model’s output and the author-designated diagnoses. We believe that this potential trade-off was worthwhile given that our goal was to evaluate the model’s ability to reason about diagnosis and make use of expert guidelines, but future efforts may wish to investigate the performance of the models in more specific areas, such as the generation of appropriate DSM specifiers or in differentiation between types and severity of neurocognitive disorders or substance use disorders. Additionally, we excluded vignettes with primary diagnoses from DSM-5-TR chapters not covered by the Handbook, such as personality disorders and paraphilic disorders; future efforts could investigate the adaptation of other expert resources that apply to those areas.
We believe that the main impact of our findings on psychiatry should be to emphasize the enduring importance of expert and consensus knowledge in the development of semi-automated and automated systems based on generative artificial intelligence models, such as LLMs. By integrating expert-derived reasoning processes with LLMs, the developers of clinical decision support (CDS) systems can enable the automated processing of unstructured text-based inputs (such as clinical notes and narratives, or even patient-reported text and transcribed interviews) through known and well-understood pathways. This can build confidence and trust in generative CDS, ensure that recommendations are explainable, and, as demonstrated by our results, improve system performance.
Future efforts could investigate the use of LLMs for other types of psychiatric reasoning. DSM diagnosis has limited interrater reliability (Clarke et al., 2013; Regier et al., 2013), and in the absence of a human comparison arm, our ability to evaluate the significance of the model's performance is restricted. Additionally, the true prevalence of mental illness may differ significantly in other situations (such as among private individuals making use of public LLM-based chatbots), and in the absence of true negative vignettes, our ability to evaluate the performance of the model in such circumstances is limited. The advent of LLMs and other advanced artificial intelligence-based modeling tools could allow for the development of new diagnostic schema that could overcome some of the limitations of the DSM through the automated interpretation of large volumes of patient-related data. Such an approach could allow for more precise quantification of language-based phenotypes, motivate new approaches for disorder subtyping, and allow for the discovery of new links between behavioral phenotypes and neurobiological mechanisms, with the overall goal of matching the right treatments to the right patients at the right time. Additionally, future work making use of larger datasets (including negative examples), clinically derived datasets, and more broadly examining the impact of hyperparameters, alternative model selection, prompting strategies, and other prompt optimization tools, such as pre-computed or real-time simulated reasoning, could further elucidate how best to incorporate LLMs for these use cases.
Several published works in the literature have surveyed and discussed the ethical, legal and privacy implications of the use of LLMs in behavioral health (Chen et al., 2024; Hua et al., 2025; Obradovich et al., 2024; Orrù et al., 2025). Future studies aimed at developing clinical systems will need to address these implications in order to ensure that AI systems maximize potential benefits and mitigate associated risks, including bias, misalignment, data leakage and diversion, and inequitable allocation of benefits. We do not believe that the approach that we have demonstrated could be used effectively as a replacement for human clinical judgement due to these risks. Instead, our findings can be used to improve autonomous generative systems that are already generally available and to develop CDS systems that augment human performance in behavioral health.
4.5. Conclusion
In this work, we have demonstrated that the GPT family of large language models has the emergent capability for psychiatric reasoning and that it is able to interpret case vignettes and apply expert guidelines to make diagnoses. We found that directly prompting the models without external information led the models to predict the majority of correct diagnoses, with the limitation of significant overdiagnosis. We found that incorporating adapted expert decision-tree based diagnostic guidelines reduced overdiagnosis and improved overall model performance. These results illustrate the potential risks and benefits of the use of large language models for language analysis in behavioral health and motivate the need for systems that integrate language modeling with expert knowledge for use in clinical applications.
Supplementary Material
Supplemental Materials 2 (diagnosis reconciliation detail). Columns:
“Case” – Case number
“Model_Dx” – Original output of algorithm
“Match_Score” – Score of best fuzzy string match to a DSM-5-TR diagnosis string
“Reconciled_Dx” – Final output after reconciliation (blank if no reconciliation was required)
“Dx_Reconciliation_Type” – Coded reason for reconciliation (see codes below)
Diagnosis Reconciliation Codes:
0 – No reconciliation required
1 – Same diagnosis (format or minor language change)
2 – Generalization due to excess specification
3 – Intentionally unused
4 – Changed to other / unspecified due to insufficient output to assign to specific diagnosis
5 – Not a DSM diagnosis
6 – Renamed or removed from a prior edition of DSM
7 – Z code
8 – Generalization due to specific simplification rule
Supplemental Materials 3 (per-vignette scoring detail): see “scoring_base.xlsx” and “scoring_DT.xlsx”
Columns:
“case_number_dx” – Case number
“answers_dx” – List of ground truth diagnoses
“answers_cat” – List of ground truth diagnostic categories
“model_dx” – Model predicted diagnoses (after reconciliation)
“model_cat” – Model predicted diagnostic categories
“dx_TP” – Count of true positive diagnoses
“dx_TP_list” – List of true positive diagnoses
“dx_FP” – Count of false positive diagnoses
“dx_FP_list” – List of false positive diagnoses
“dx_FN” – Count of false negative diagnoses
“dx_FN_list” – List of false negative diagnoses
“dx_recall” – Calculated recall for diagnoses
“dx_precision” – Calculated precision for diagnoses
“dx_f1” – Calculated F1 statistic for diagnoses
“cat_TP” – Count of true positive diagnostic categories
“cat_TP_list” – List of true positive diagnostic categories
“cat_FP” – Count of false positive diagnostic categories
“cat_FP_list” – List of false positive diagnostic categories
“cat_FN” – Count of false negative diagnostic categories
“cat_FN_list” – List of false negative diagnostic categories
“cat_recall” – Calculated recall for diagnostic categories
“cat_precision” – Calculated precision for diagnostic categories
“cat_f1” – Calculated F1 statistic for diagnostic categories
Acknowledgements
We thank the UCSF AI Tiger Team, UCSF Academic Research Services, UCSF Research Information Technology, and the UCSF Chancellor’s Task Force for Generative AI for their support in developing the LLM resources used for this project. This research was made possible through the use of content belonging to the American Psychiatric Association; express permission was obtained from the American Psychiatric Association for the use of such content (DSM-5-TR Clinical Cases and DSM-5-TR Handbook of Differential Diagnosis, Copyright © 2023 and 2024, American Psychiatric Association, All Rights Reserved, including rights for text and data mining (TDM), Artificial Intelligence (AI) training, and similar technologies).
Funding:
This work was supported by the National Institute of Mental Health of the National Institutes of Health [grant number R25 MH060482].
Appendices
Appendix A. Decision Tree Prompting
For the initial experiments (prior to refinement), question prompts were created based on suggested language in the DSM-5-TR Handbook of Differential Diagnosis. The following is an example from the “Depressed Mood” decision tree:
Are there at least 2 weeks of depressed mood or diminished interest plus associated characteristic symptoms (e.g., changes in weight and appetite, fatigue, feelings of worthlessness or guilt, changes in sleep, suicidal thoughts)?
A screening prompt was developed for each decision tree, such as the following example from the “Self-Injurious Behavior” tree:
For the purposes of this discussion, self-injurious behavior refers to intentional self-inflicted acts to injure or mutilate one’s own body. This includes cutting, burning, head banging, hair pulling, skin picking, self-biting, and hitting of various parts of one’s own body, but does not include socially or culturally sanctioned practices (such as piercing or artistic scarification). It does not include behavior intended to end one’s own life. Based on this definition, is there evidence in the clinical case that the patient has self-injurious behavior?
Appendix B. LLM Technical Specifications
For this paper, three specific commercial large language models developed by OpenAI (San Francisco, CA) were used: gpt-3.5, gpt-4-turbo, and gpt-4o. To enable reproducibility, specific model versions were used as follows: gpt-3.5-turbo-0125, gpt-4-0125-preview, and gpt-4o-2024-05-13. A Python interface was used to execute queries to the gLLMs using the OpenAI API. All configurable content filters were turned off, and all queries were made with a temperature of 0, top_p of 1, and without frequency or presence penalties. If a query returned either an error or an unparseable response, the query was repeated with exponential backoff until a valid response was received. All queries ultimately produced parseable responses using this method, and no queries were rejected due to content filtering. To ensure that the models did not learn from experimental inputs during the study, all queries were executed under an agreement to not use inputs, outputs, or any other data generated during the study for training purposes. Each query to the model was performed independently, without including context from other queries.
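A minimal sketch of this query loop, assuming the openai Python client; the helper name, the `parse` callback, and the backoff constants are illustrative choices, not the study's code.

```python
import time
from openai import OpenAI

client = OpenAI()

def query_until_valid(messages: list[dict], model: str, parse, max_wait: float = 64.0):
    """Issue a chat completion with the decoding settings described above, retrying
    with exponential backoff until `parse` accepts the response text."""
    wait = 1.0
    while True:
        try:
            response = client.chat.completions.create(
                model=model,
                messages=messages,
                temperature=0,        # deterministic-as-possible decoding
                top_p=1,
                frequency_penalty=0,  # no frequency or presence penalties
                presence_penalty=0,
            )
            return parse(response.choices[0].message.content)
        except Exception:             # API error or unparseable response
            time.sleep(wait)
            wait = min(wait * 2, max_wait)  # exponential backoff, capped
```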
Appendix C. Diagnosis Simplification and Matching
Simplification:
All DSM specifiers, modifiers and codes were removed; if after this step diagnoses were identical, they were combined.
All neurocognitive disorders were collapsed into “Delirium” or “Neurocognitive Disorder,” removing disease-specific language and combining the mild and major classes.
All substance-specific disorders were collapsed into “Substance Use Disorder,” “Substance Intoxication,” or “Substance Withdrawal” (removing identification of the specific substance).
All breathing-related sleep disorder diagnoses (as defined by the DSM Sleep-Wake Disorders section) were combined into a single “Breathing-Related Sleep Disorder” diagnosis.
All DSM diagnoses making reference to a specific other causative medical condition were combined into a “due to another medical condition” diagnosis based on the category.
All “Other Specified” and “Unspecified” diagnoses were combined into single “Other Specified or Unspecified” diagnoses for each category.
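A minimal sketch of two of these rules (specifier stripping and substance collapsing), using simple string heuristics for illustration rather than the study's actual implementation:

```python
def simplify_diagnosis(dx: str) -> str:
    """Apply two illustrative simplification rules: drop trailing comma-separated
    specifiers, then collapse substance-specific disorders to their generic form."""
    dx = dx.split(",")[0].strip()  # e.g., "Major Depressive Disorder, severe" -> base diagnosis
    lowered = dx.lower()
    for generic in ("Use Disorder", "Intoxication", "Withdrawal"):
        if lowered.endswith(generic.lower()) and not lowered.startswith("substance"):
            return f"Substance {generic}"  # e.g., "Alcohol Use Disorder" -> "Substance Use Disorder"
    return dx

assert simplify_diagnosis("Alcohol Use Disorder, moderate") == "Substance Use Disorder"
assert simplify_diagnosis("Major Depressive Disorder, severe") == "Major Depressive Disorder"
```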
Matching:
First, all candidate diagnosis strings were scored for matching against the list of DSM-5-TR diagnoses using a fuzzy string matching algorithm (WRatio, rapidfuzz Python library). If the diagnosis exactly matched a single DSM-5-TR diagnosis (with strength >99%), that diagnosis was used. Otherwise, the diagnosis was added to a reconciliation list with the highest matching diagnosis and the matching score with that diagnosis. The reconciliation list was then reviewed and reconciliations were resolved using the following process:
If the diagnosis was clearly non-psychiatric (e.g., “Hypertension”), the diagnosis was retained (as a false positive) and the category “Non-DSM” was assigned
If the diagnosis was clearly related to a Z code (e.g., “Nonsuicidal Self-Injury”), the matching Z code diagnosis was used and the category “Other Conditions That May Be a Focus of Clinical Attention” was assigned
If the diagnosis was clearly associated with one specific DSM-5-TR diagnosis but contained a minor typographical error or minor additional language (e.g., “Generalized Anxiety Disorder (GAD)” rather than “Generalized Anxiety Disorder”), the diagnosis was normalized to the base diagnosis and assigned the associated category
If the diagnosis contained excess specifying language (e.g., “Major Depressive Disorder, severe” rather than “Major Depressive Disorder”) or was a specific instance of a diagnosis (e.g., “Cannabis-Induced Psychotic Disorder” rather than “Substance/Medication-Induced Psychotic Disorder”), the diagnosis was normalized to the base diagnosis and assigned the associated category
If the diagnosis was clearly related to a diagnostic category but contained insufficient information to assign a specific diagnosis (e.g., “Depressive Episode”), the appropriate other/unspecified diagnosis and associated category were assigned (e.g., “Other Specified or Unspecified Depressive Disorder”)
If the diagnosis was a valid entity in a prior edition of the DSM that had been changed in the DSM-5-TR or was clearly associated with an entity in the DSM-5-TR, the diagnosis was replaced with the associated new DSM-5-TR diagnosis and category
Otherwise, the diagnosis was retained (as a false positive) and assigned the category “Non-DSM”
All reconciliations were tracked with a numeric code, and once a specific reconciliation was established, it was put into a database and reused if the same string appeared again in the analysis for consistency. The full set of reconciliations is available in Supplemental Materials 2.
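A minimal sketch of the scoring step, using the rapidfuzz library named above; `dsm5_diagnoses` is a hypothetical list of simplified DSM-5-TR diagnosis strings, and the cache mirrors the reuse of established reconciliations.

```python
from rapidfuzz import fuzz, process

reconciliation_cache: dict[str, str] = {}  # established reconciliations, reused for consistency

def match_diagnosis(candidate: str, dsm5_diagnoses: list[str]) -> tuple[str, float, bool]:
    """Score a candidate against the DSM-5-TR diagnosis list with WRatio; anything
    scoring 99 or below is flagged for the semi-automated reconciliation protocol above."""
    if candidate in reconciliation_cache:
        return reconciliation_cache[candidate], 100.0, False
    best, score, _index = process.extractOne(candidate, dsm5_diagnoses, scorer=fuzz.WRatio)
    return best, score, score <= 99
```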
Appendix D. Performance Metric Equations
The following equations were used to calculate performance metrics on a per-vignette basis and then averaged to produce the reported results (macro averaging):
Eq. (D.1): $\mathrm{PPV} = \dfrac{TP}{TP + FP}$

Eq. (D.2): $\mathrm{Sensitivity} = \dfrac{TP}{TP + FN}$

Eq. (D.3): $F_1 = \dfrac{2 \times \mathrm{PPV} \times \mathrm{Sensitivity}}{\mathrm{PPV} + \mathrm{Sensitivity}}$
Footnotes
Previous Presentation: The authors appreciated the opportunity to present early partial components of this project at the annual meetings of the Northern California Psychiatric Society (conference abstract/poster, Mar 16, 2024), the American Medical Informatics Association (conference abstract/talk, Nov 12, 2024), the Technology in Psychiatry Summit (symposium abstract/talk, Dec 7, 2024), and the American College of Neuropsychopharmacology (conference abstract/poster, Dec 10, 2024). No text, tables or figures used in these presentations were reused for this work.
References
- Bains JK, Williams CYK, Johnson D, Schwartz H, Sabbineni N, Butte AJ, Kornblith AE, 2024. Enhancing emergency department charting: Using Generative Pre-trained Transformer-4 (GPT-4) to identify laceration repairs. Acad Emerg Med. 10.1111/acem.14995
- Chen D, Liu Y, Guo Y, Zhang Y, 2024. The revolution of generative artificial intelligence in psychology: The interweaving of behavior, consciousness, and ethics. Acta Psychol (Amst) 251, 104593. 10.1016/j.actpsy.2024.104593
- Chen Y, Benton J, Radhakrishnan A, Uesato J, Denison C, Schulman J, Somani A, Hase P, Wagner M, Roger F, Mikulik V, Bowman SR, Leike J, Kaplan J, Perez E, 2025. Reasoning Models Don’t Always Say What They Think. 10.48550/arXiv.2505.05410
- Diao S, Wang P, Lin Y, Pan R, Liu X, Zhang T, 2024. Active Prompting with Chain-of-Thought for Large Language Models, in: Ku L-W, Martins A, Srikumar V (Eds.), Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Presented at the ACL 2024, Association for Computational Linguistics, Bangkok, Thailand, pp. 1330–1350. 10.18653/v1/2024.acl-long.73
- First MB, 2024. DSM-5-TR® handbook of differential diagnosis. American Psychiatric Association Publishing, Arlington, VA.
- Galatzer-Levy IR, McDuff D, Natarajan V, Karthikesalingam A, Malgaroli M, 2023. The Capability of Large Language Models to Measure Psychiatric Functioning. 10.48550/arXiv.2308.01834
- Gero Z, Singh C, Cheng H, Naumann T, Galley M, Gao J, Poon H, 2023. Self-Verification Improves Few-Shot Clinical Information Extraction. 10.48550/arXiv.2306.00024
- Hao S, Sukhbaatar S, Su D, Li X, Hu Z, Weston J, Tian Y, 2024. Training Large Language Models to Reason in a Continuous Latent Space. 10.48550/arXiv.2412.06769
- Hua Y, Beam A, Chibnik LB, Torous J, 2025. From statistics to deep learning: Using large language models in psychiatric research. International Journal of Methods in Psychiatric Research 34, e70007. 10.1002/mpr.70007
- Jeong M, Sohn J, Sung M, Kang J, 2024. Improving medical reasoning through retrieval and self-reflection with retrieval-augmented large language models. Bioinformatics 40, i119–i129. 10.1093/bioinformatics/btae238
- Li M, Zhou H, Yang H, Zhang R, 2024. RT: a Retrieving and Chain-of-Thought framework for few-shot medical named entity recognition. Journal of the American Medical Informatics Association 31, 1929–1938. 10.1093/jamia/ocae095
- Lim JI, Rachitskaya AV, Hallak JA, Gholami S, Alam MN, 2024. Artificial intelligence for Retinal Diseases. Asia Pac J Ophthalmol (Phila) 100096. 10.1016/j.apjo.2024.100096
- Lim S, Kim Y, Choi C-H, Sohn J, Kim B-H, 2024. ERD: A Framework for Improving LLM Reasoning for Cognitive Distortion Classification, in: Naumann T, Ben Abacha A, Bethard S, Roberts K, Bitterman D (Eds.), Proceedings of the 6th Clinical Natural Language Processing Workshop. Association for Computational Linguistics, Mexico City, Mexico, pp. 292–300. 10.18653/v1/2024.clinicalnlp-1.25
- Motwani SR, Smith C, Das RJ, Rafailov R, Laptev I, Torr PHS, Pizzati F, Clark R, de Witt CS, 2025. MALT: Improving Reasoning with Multi-Agent LLM Training. 10.48550/arXiv.2412.01928
- Nguyen T, Ong J, Masalkhi M, Waisberg E, Zaman N, Sarker P, Aman S, Lin H, Luo M, Ambrosio R, Machado AP, Ting DSJ, Mehta JS, Tavakkoli A, Lee AG, 2024. Artificial intelligence in corneal diseases: A narrative review. Cont Lens Anterior Eye 102284. 10.1016/j.clae.2024.102284
- Obradovich N, Khalsa SS, Khan WU, Suh J, Perlis RH, Ajilore O, Paulus MP, 2024. Opportunities and risks of large language models in psychiatry. NPP—Digit Psychiatry Neurosci 2, 1–8. 10.1038/s44277-024-00010-z
- OpenAI, 2024. Learning to reason with LLMs [WWW Document]. URL https://openai.com/index/learning-to-reason-with-llms/ (accessed 4.7.25).
- Orrù G, Melis G, Sartori G, 2025. Large language models and psychiatry. International Journal of Law and Psychiatry 101, 102086. 10.1016/j.ijlp.2025.102086
- Petrov I, Dekoninck J, Baltadzhiev L, Drencheva M, Minchev K, Balunović M, Jovanović N, Vechev M, 2025. Proof or Bluff? Evaluating LLMs on 2025 USA Math Olympiad. 10.48550/arXiv.2503.21934
- Renze M, Guven E, 2024. Self-Reflection in LLM Agents: Effects on Problem-Solving Performance, in: 2024 2nd International Conference on Foundation and Large Language Models (FLLM). pp. 476–483. 10.1109/FLLM63129.2024.10852493
- Sarma KV, Harmon S, Sanford T, Roth HR, Xu Z, Tetreault J, Xu D, Flores MG, Raman AG, Kulkarni R, Wood BJ, Choyke PL, Priester AM, Marks LS, Raman SS, Enzmann D, Turkbey B, Speier W, Arnold CW, 2021. Federated learning improves site performance in multicenter deep learning without data sharing. Journal of the American Medical Informatics Association. 10.1093/jamia/ocaa341
- Sharma A, Lin IW, Miner AS, Atkins DC, Althoff T, 2023. Human–AI collaboration enables more empathic conversations in text-based peer-to-peer mental health support. Nat Mach Intell 5, 46–57. 10.1038/s42256-022-00593-2
- Sharma A, Rushton K, Lin IW, Nguyen T, Althoff T, 2024. Facilitating Self-Guided Mental Health Interventions Through Human-Language Model Interaction: A Case Study of Cognitive Restructuring, in: Proceedings of the CHI Conference on Human Factors in Computing Systems, CHI ’24. Association for Computing Machinery, New York, NY, USA, pp. 1–29. 10.1145/3613904.3642761
- So J, Chang J, Kim E, Na J, Choi J, Sohn J, Kim B-H, Chu SH, 2024. Aligning Large Language Models for Enhancing Psychiatric Interviews through Symptom Delineation and Summarization. 10.48550/arXiv.2403.17428
- Taylor N, Kormilitzin A, Lorge I, Nevado-Holgado A, Cipriani A, Joyce DW, 2024. Model development for bespoke large language models for digital triage assistance in mental health care. Artificial Intelligence in Medicine 102988. 10.1016/j.artmed.2024.102988
- Tierney AA, Gayre G, Hoberman B, Mattern B, Ballesca M, Kipnis P, Liu V, Lee K, 2024. Ambient Artificial Intelligence Scribes to Alleviate the Burden of Clinical Documentation. NEJM Catalyst 5, CAT.23.0404. 10.1056/CAT.23.0404
- Verhees FG, Huth F, Meyer V, Wolf F, Bauer M, Pfennig A, Ritter P, Kather JN, Wiest IC, Mikolas P, 2025. clickBrick Prompt Engineering: Optimizing Large Language Model Performance in Clinical Psychiatry. 10.1101/2025.06.28.25330267
- Wang Q, Wang Z, Su Y, Tong H, Song Y, 2024. Rethinking the Bounds of LLM Reasoning: Are Multi-Agent Discussions the Key? 10.48550/arXiv.2402.18272
- Wang X, Zhou D, 2024. Chain-of-Thought Reasoning Without Prompting, in: Globerson A, Mackey L, Belgrave D, Fan A, Paquet U, Tomczak J, Zhang C (Eds.), Advances in Neural Information Processing Systems. Curran Associates, Inc., pp. 66383–66409.
- Wei J, Wang X, Schuurmans D, Bosma M, Ichter B, Xia F, Chi E, Le Q, Zhou D, 2023. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. 10.48550/arXiv.2201.11903
- Williams CYK, Miao BY, Kornblith AE, Butte AJ, 2024a. Evaluating the use of large language models to provide clinical recommendations in the Emergency Department. Nat Commun 15, 8236. 10.1038/s41467-024-52415-1
- Williams CYK, Zack T, Miao BY, Sushil M, Wang M, Kornblith AE, Butte AJ, 2024b. Use of a Large Language Model to Assess Clinical Acuity of Adults in the Emergency Department. JAMA Netw Open 7, e248895. 10.1001/jamanetworkopen.2024.8895
- Xu X, Yao B, Dong Y, Gabriel S, Yu H, Hendler J, Ghassemi M, Dey AK, Wang D, 2023. Mental-LLM: Leveraging Large Language Models for Mental Health Prediction via Online Text Data. 10.48550/arXiv.2307.14385
- Zhou D, Schärli N, Hou L, Wei J, Scales N, Wang X, Schuurmans D, Cui C, Bousquet O, Le Q, Chi E, 2023. Least-to-Most Prompting Enables Complex Reasoning in Large Language Models. Presented at The Eleventh International Conference on Learning Representations, Kigali, Rwanda. 10.48550/arXiv.2205.10625
Supplementary Materials
Diagnosis reconciliation table (Supplemental Materials 2). Columns:
“Case” – Case number
“Model_Dx” – Original output of algorithm
“Match_Score” – Score of best fuzzy string match to a DSM-5-TR diagnosis string
“Reconciled_Dx” – Final output after reconciliation (blank if no reconciliation was required)
“Dx_Reconciliation_Type” – Coded reason for reconciliation (see codes below)
Diagnosis Reconciliation Codes:
0 – No reconciliation required
1 – Same diagnosis (format or minor language change)
2 – Generalization due to excess specification
3 – Intentionally unused
4 – Changed to other / unspecified due to insufficient output to assign to specific diagnosis
5 – Not a DSM diagnosis
6 – Renamed or removed from a prior edition of DSM
7 – Z code
8 – Generalization due to specific simplification rule
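As a usage illustration only, the reconciliation table can be loaded and summarized with pandas. “reconciliations.xlsx” is a hypothetical file name standing in for the Supplemental Materials 2 workbook; the column names are those listed above.

```python
import pandas as pd

# Hypothetical file name; the actual workbook is provided as
# Supplemental Materials 2.
recon = pd.read_excel("reconciliations.xlsx")

# Rows that required manual reconciliation (code 0 = none required).
manual = recon[recon["Dx_Reconciliation_Type"] != 0]

# Frequency of each coded reconciliation reason, per the codes above.
print(manual["Dx_Reconciliation_Type"].value_counts().sort_index())
```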
Scoring tables: see “scoring_base.xlsx” and “scoring_DT.xlsx” (the base-model and decision-tree conditions, respectively).
Columns:
“case_number_dx” – Case number
“answers_dx” – List of ground truth diagnoses
“answers_cat” – List of ground truth diagnostic categories
“model_dx” – Model predicted diagnoses (after reconciliation)
“model_cat” – Model predicted diagnostic categories
“dx_TP” – Count of true positive diagnoses
“dx_TP_list” – List of true positive diagnoses
“dx_FP” – Count of false positive diagnoses
“dx_FP_list” – List of false positive diagnoses
“dx_FN” – Count of false negative diagnoses
“dx_FN_list” – List of false negative diagnoses
“dx_recall” – Calculated recall for diagnoses
“dx_precision” – Calculated precision for diagnoses
“dx_f1” – Calculated F1 statistic for diagnoses
“cat_TP” – Count of true positive diagnostic categories
“cat_TP_list” – List of true positive diagnostic categories
“cat_FP” – Count of false positive diagnostic categories
“cat_FP_list” – List of false positive diagnostic categories
“cat_FN” – Count of false negative diagnostic categories
“cat_FN_list” – List of false negative diagnostic categories
“cat_recall” – Calculated recall for diagnostic categories
“cat_precision” – Calculated precision for diagnostic categories
“cat_f1” – Calculated F1 statistic for diagnostic categories
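To make the column definitions concrete, the following minimal sketch (an illustration, not the study's scoring script) derives the per-case TP/FP/FN fields by set comparison of diagnosis strings, assuming both lists have already been normalized and reconciled as described above.

```python
# Per-case scoring by set comparison of ground-truth and predicted
# diagnosis strings (already normalized/reconciled). Field names mirror
# the scoring workbook columns listed above.
def score_case(answers: set[str], predictions: set[str]) -> dict:
    tp_list = sorted(answers & predictions)   # dx_TP_list
    fp_list = sorted(predictions - answers)   # dx_FP_list
    fn_list = sorted(answers - predictions)   # dx_FN_list
    tp, fp, fn = len(tp_list), len(fp_list), len(fn_list)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"dx_TP": tp, "dx_FP": fp, "dx_FN": fn,
            "dx_TP_list": tp_list, "dx_FP_list": fp_list,
            "dx_FN_list": fn_list, "dx_recall": recall,
            "dx_precision": precision, "dx_f1": f1}
```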
