Author manuscript; available in PMC: 2025 Oct 1.
Published in final edited form as: J Trauma Stress. 2024 Jul 18;37(5):754–760. doi: 10.1002/jts.23082

An overview of diagnostics and therapeutics using large language models

Matteo Malgaroli 1, Daniel McDuff 2
PMCID: PMC11444874  NIHMSID: NIHMS2008175  PMID: 39024299

Abstract

There is an acute need for solutions to treat stress and trauma–related sequelae, and there are well-documented shortages of qualified human professionals. Artificial intelligence (AI) presents an opportunity to create advanced screening, diagnosis, and treatment solutions that relieve the burden on people and can provide just-in-time interventions. Large language models (LLMs), in particular, are promising given the role language plays in understanding and treating traumatic stress and other mental health conditions. In this article, we provide an overview of state-of-the-art LLM applications in diagnostic assessments, clinical note generation, and therapeutic support. We discuss the open research directions and challenges that need to be overcome to realize the full potential of deploying language models in clinical contexts. We highlight the need for increased representation in AI systems to ensure there are no disparities in access. Public datasets and models will help drive progress toward better models; however, privacy-preserving model training will be necessary to protect patient data.


Automating clinical tasks using artificial intelligence (AI) could provide scalable solutions when the availability of qualified humans is a barrier. Furthermore, when properly utilized, technology has the potential to improve the assessment and treatment of stress disorders through accurate and personalized models (Malgaroli & Schultebraucks, 2020). Reducing clinician burden has become increasingly crucial, as the risk of exposure to traumatic events has escalated globally (World Health Organization [WHO], 2022), driven by a rising number of conflicts, natural disasters, and the COVID-19 pandemic. In 2023, over 114,000,000 people were displaced by war, violence, and persecution (United Nations High Commissioner for Refugees, 2023). Concurrently, severe weather events, including hurricanes and wildfires, have become more frequent, leading to devastation as well as financial and human losses (WHO, 2022). The surge in trauma exposure strains an already stretched global clinical workforce, making it increasingly challenging to provide mental health support and treatment to affected individuals. Delivering care in high-risk zones, where combat or natural disasters are ongoing, presents further challenges. This situation is particularly critical in lower middle–income economies where clinical services are already scarce and underresourced. Given these challenges, innovative solutions are needed to support scalable assessment and intervention, with the goal of reducing the incidence of stress and trauma–related disorders.

Natural language processing (NLP), a branch of AI that focuses on analyzing and generating linguistic data, is becoming increasingly important for mental health automation (Malgaroli, Hull, et al., 2023) given the central role of language in diagnostic evaluations and treatment. NLP methods have been used to screen for mental health symptoms (Malgaroli, Hull, et al., 2023), diagnose posttraumatic stress disorder (PTSD; Schultebraucks et al., 2022), capture conversational stress markers (Malgaroli, Tseng, et al., 2023), evaluate adjustment to traumatic events (Son et al., 2023), and support PTSD treatment (Norman et al., 2020). Despite their potential, the broader adoption of earlier NLP systems was limited by three major constraints (Malgaroli, Hull, et al., 2023). First, they were primarily designed for a single language, usually English. Second, although they could capture semantic meaning, they lacked contextual understanding. Third, they required fine-tuning using case-specific data to perform accurately beyond their initial training settings (Singhal et al., 2023). Advancements in NLP were made possible by the development of attention layers (Bahdanau et al., 2014) and transformer architectures (Vaswani et al., 2017), which capture associations between words, thus allowing contextual understanding. Because these model architectures scale efficiently with data volume, they can be trained on large corpora and encode large amounts of information. Large language models (LLMs) are generative transformer architectures, so named because of their massive size, containing billions of parameters (Brown et al., 2020).
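For readers less familiar with these architectures, the minimal sketch below (our own didactic example, not code from any cited model) shows scaled dot-product attention, the core operation through which transformers capture associations between words; the array shapes and values are purely illustrative.

```python
# Minimal didactic sketch of scaled dot-product attention (Vaswani et al., 2017).
# Shapes and values are illustrative only.
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Each row of Q attends over all rows of K/V, yielding a context-aware
    mixture of the value vectors."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # pairwise token-token associations
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the sequence
    return weights @ V                               # context-weighted mixture of values

rng = np.random.default_rng(0)
tokens = rng.normal(size=(4, 8))                     # a toy "sentence" of 4 embedded tokens
contextual = scaled_dot_product_attention(tokens, tokens, tokens)
print(contextual.shape)                              # (4, 8): each token now reflects its context
```

Stacking many such layers, each with learned query, key, and value projections, is what allows these architectures to scale efficiently and encode contextual meaning.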

LLMs are trained on vast datasets, in some cases spanning multiple languages, using self-supervised learning techniques such as masked-language modeling and next-sentence prediction. By training on trillions of words of text, LLMs amass extensive knowledge (Brown et al., 2020), allowing them to interpret and produce human-like text. The relationship between training scale and LLMs’ capabilities is hotly debated, centering on whether performance on language tasks increases continuously with scale (Schaeffer et al., 2023) or whether discontinuous “leaps” indicative of “emergent” behaviors manifest only at certain scales (Brown et al., 2020). LLMs can be further adapted for use in medical domains through additional specialized training (Singhal et al., 2023). Following instruction tuning, LLMs are used via text prompts containing guidelines, and they generate text that adheres to the specified instructions. For instance, in mental health applications, LLMs can be prompted to offer diagnostic assessments based on interview transcripts (Galatzer-Levy et al., 2023) or provide supportive responses to help-seekers (Sharma, Lin, et al., 2023). These applications address the limitations of conventional clinical methods, which grapple with issues of subjectivity, scalability, and accessibility (Malgaroli, Hull, et al., 2023).
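As a concrete illustration of one of these self-supervised objectives, the snippet below uses the Hugging Face transformers fill-mask pipeline to show masked-language modeling at inference time; the model choice and example sentence are ours and serve only as an illustration.

```python
# Illustration of masked-language modeling, one of the self-supervised
# objectives described above. The model and sentence are illustrative choices.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# The model predicts the hidden token from its surrounding context, which is
# how masked-language training yields contextual word representations.
for prediction in fill_mask("After the accident she had trouble [MASK] at night."):
    print(f"{prediction['token_str']:>12}  p={prediction['score']:.3f}")
```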

In this manuscript, we provide an overview of the rapidly developing clinical research applications of LLMs in mental health, with a particular focus on their potential use for PTSD and stress-related pathology. We cover open questions and challenges that need to be addressed to realize the potential of LLMs in the complex case of traumatic stress. Furthermore, we discuss the ethical and practical implications of using LLMs, which require thorough investigation to maximize the benefits and minimize the potential harms of these technologies.

Overview of LLM applications

LLMs for screening, diagnostics, and symptom detection

Screening, diagnosis, and symptom assessment are important components of medical care. In an individual’s journey toward improved well-being, these processes help facilitate appropriate treatment recommendations. Computer and AI systems capable of performing or assisting clinicians in screening and diagnosis have been built, with impressive results (Y. Liu et al., 2020; McDuff et al., 2023; Rauschecker et al., 2020; Szolovits & Pauker, 1978). Several research studies have found that LLMs are capable of diagnostic performance that is on par with or exceeds that of humans. These LLMs include systems that aim to produce a single final diagnosis as well as a differential diagnosis (Y. Liu et al., 2020; McDuff et al., 2023; Rauschecker et al., 2020). The opportunity to improve diagnostic capabilities through AI is particularly crucial for psychiatric evaluations, which currently rely on self-report tools or screening instruments. These methods yield subjective linguistic descriptions rather than biological values that can be mathematically parsed, a limitation unique compared with other areas of medicine. The diagnostic accuracy of language models across mental health conditions suggests that LLMs are well suited to this domain (Malgaroli, Hull, et al., 2023) given the language-based nature of mental health measurement and interventions.

LLMs are a natural fit and a powerful tool for improving the quality and scalability of mental health assessments given the volume of medical data and knowledge encoded in language. One study (Galatzer-Levy et al., 2023) examined the zero-shot performance of Med-PaLM 2, a variant of the PaLM 2 LLM trained on medical knowledge (Singhal et al., 2023), in assessing psychiatric functioning. In this context, zero-shot refers to the ability of an LLM to perform a task for which it has not explicitly been trained. The study used transcripts from clinical interviews to predict the results of structured depression and PTSD assessments in terms of both diagnostic labels and symptom scores. Additionally, the study employed clinical case summaries to request Diagnostic and Statistical Manual of Mental Disorders (5th ed.; DSM-5; American Psychiatric Association, 2013) diagnoses from the LLM across different disorder categories, including mood disorders, psychotic disorders, and addiction. The results demonstrated high accuracy in labeling DSM-5 diagnoses and scoring screening surveys without the need for fine-tuning. Furthermore, the model provided reasonable confidence ratings and written explanations for interview scoring (Galatzer-Levy et al., 2023). These explanations were consistent with expert clinical interpretations of the features (i.e., words and phrases) that were predictive of the respective scores.
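The study’s actual prompts are not reproduced here; the sketch below only illustrates the general shape of a zero-shot prompt that asks a model to score a standard screening instrument from an interview transcript. The prompt wording and the complete helper are hypothetical placeholders for whichever LLM interface is used.

```python
# Hedged sketch of zero-shot scoring of a screening instrument from an
# interview transcript. The prompt wording and the `complete` helper are
# hypothetical; they do not reproduce Galatzer-Levy et al. (2023).
from typing import Callable

PROMPT_TEMPLATE = """You are assisting with research on psychiatric assessment.
Read the interview transcript below and estimate the respondent's total
PHQ-8 depression score (0-24). Reply with a JSON object of the form
{{"phq8_total": <int>, "rationale": "<one sentence>"}}.

Transcript:
{transcript}
"""

def estimate_phq8(transcript: str, complete: Callable[[str], str]) -> str:
    """`complete` is any function that maps a prompt to LLM-generated text."""
    return complete(PROMPT_TEMPLATE.format(transcript=transcript))
```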

Although these initial results related to evaluating psychiatric functioning using LLMs are promising, words alone do not capture all the information necessary to complete assessment tasks. Grounding, or tuning, in nonlinguistic data that capture actions, experiences, or perceptions is an essential part of deriving meaning from language (Glenberg et al., 2005) and could be used to expand and generalize LLMs’ capabilities. Grounding of this kind is necessary to connect the knowledge contained within a language model to events or actions in the physical world (Ahn et al., 2022). For these reasons, models have been combined with data from other modalities, most commonly images (Huber et al., 2018; Li et al., 2022; Mostafazadeh et al., 2017), audio (Xu et al., 2021), video (L. Zhou et al., 2019), or physical actions (Ahn et al., 2022). Initial studies in this area suggest the benefits of grounding in sensor data to improve the accuracy of LLMs in stress detection (X. Liu et al., 2023).

There are many sources and types of information that are relevant to behavioral and mental health (Mohr et al., 2017), and many “stock” LLMs may not have received training samples that contain examples of these types of data. Combining the representations encoded in language models with quantitative physiological and behavioral measurements from time series data through few-shot tuning (a technique that adapts models using minimal examples) can lead to substantial improvements in tasks including mental health screening (X. Liu et al., 2023). A comprehensive study of different LLMs across a range of datasets found that these results were generalizable (Kim et al., 2024).
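To make the idea concrete, the sketch below serializes daily wearable summaries into a few-shot text prompt for stress screening; the feature names, values, and labels are invented for illustration and do not reproduce the methods of the cited studies.

```python
# Illustrative sketch of serializing wearable time-series summaries into a
# few-shot prompt. Features, values, and labels are invented examples.
EXAMPLES = [
    ({"resting_hr": 58, "hrv_rmssd_ms": 62, "sleep_hours": 7.8, "steps": 9400}, "low stress"),
    ({"resting_hr": 81, "hrv_rmssd_ms": 24, "sleep_hours": 4.9, "steps": 2100}, "high stress"),
]

def serialize(features: dict) -> str:
    return ", ".join(f"{name}={value}" for name, value in features.items())

def build_prompt(query_features: dict) -> str:
    shots = "\n".join(
        f"Readings: {serialize(feats)} -> Assessment: {label}" for feats, label in EXAMPLES
    )
    return (
        "Classify the wearer's stress level from daily wearable summaries.\n"
        f"{shots}\n"
        f"Readings: {serialize(query_features)} -> Assessment:"
    )

print(build_prompt({"resting_hr": 74, "hrv_rmssd_ms": 35, "sleep_hours": 6.1, "steps": 4300}))
```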

LLMs for clinical documentation

The ability to estimate screening survey results is, perhaps, the most accessible task for researchers to evaluate LLMs as mental health assistants. However, many other capabilities are necessary for LLMs to be useful in serving populations suffering from trauma-related symptoms or PTSD. More broadly, LLMs can help boost productivity and efficiency in tasks that involve reading and writing language (Cambon et al., 2023). The high administrative burden in medicine is well established. LLMs are effective at summarizing and extracting information from unstructured text (Brown et al., 2020). These capabilities make LLMs promising tools for relieving some of the overhead in clinical work and increasing the efficiency of human clinicians (Webster, 2023).

One of the first studies to examine LLMs’ note-taking capabilities for mental health used Generative Pre-trained Transformer–4 (GPT-4) to identify relevant clinical features from semistructured interviews with North Korean defectors, focusing on the participants’ traumatic experiences and mental health issues (So et al., 2024). The findings suggest that LLMs were capable of producing interview summaries that included relevant traumatic events and functioning problems and categorized them in a manner consistent with DSM-5 PTSD criteria. Importantly, the use of LLMs for clinical note-taking in the mental health domain is in its infancy. Clinical documentation takes many forms and is complex; consequently, it is nontrivial to define performance metrics for how LLMs perform in this domain. To ensure proper evaluation, future studies utilizing LLMs for this purpose need to examine their deployment alongside assessments of the quality and alignment of these systems relative to the needs of human clinicians.
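As a hedged sketch of what structured note generation can look like, the snippet below builds a prompt asking a model to summarize an interview as JSON organized by DSM-5 PTSD symptom clusters; the schema and wording are our own illustration, not the approach of So et al. (2024).

```python
# Hedged sketch: prompting a model for a structured interview summary organized
# by DSM-5 PTSD symptom clusters. The schema and wording are illustrative only.
import json

NOTE_SCHEMA = {
    "index_trauma": "string",
    "intrusion_symptoms": "list of strings",
    "avoidance_symptoms": "list of strings",
    "negative_mood_and_cognition": "list of strings",
    "arousal_and_reactivity": "list of strings",
    "functional_impairment": "string",
}

def build_note_prompt(transcript: str) -> str:
    return (
        "Summarize the clinical interview below as JSON matching this schema, "
        "quoting the participant where possible and omitting speculation:\n"
        f"{json.dumps(NOTE_SCHEMA, indent=2)}\n\nInterview:\n{transcript}"
    )
```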

LLMs for mental health interventions

In addition to assisting with screening symptoms and compiling documentation, LLMs have the potential to support humans in providing therapeutic interventions (Schueller & Morris, 2023). NLP models, including LLMs, can be used to assist humans in cognitive restructuring (Sharma, Rushton, et al., 2023), automate the labeling of peer-counseling responses in motivational interviews, and analyze the behavior of counselors at scale (Shah et al., 2022). Language models can be used to support individuals engaged in specific interventions and enhance how these interventions are delivered (e.g., increasing empathy). In a large-scale study, Sharma, Lin, et al. (2023) developed a system to help peers respond with greater empathy to mental health support messages. The results indicated that the model increased empathy, and the findings also provided a framework for evaluation that could be generalized to other stylistic properties of text generation. The support of LLM-generated responses, which clinicians can review and finalize, has also been shown to alleviate health care burden across a broader scope of medical tasks, including patient portal messaging (Garcia et al., 2024). Importantly, in all these cases, a human is “in the loop”—knowledgeable about the LLM-generated responses and able to make assessments or adjustments to the model or how it is being employed.

Researchers are also exploring the use of LLMs to administer interventions directly, aiming to boost adherence to and the use of self-guided interventions. In a study that included over 15,000 participants, Sharma, Rushton, et al. (2023) designed an LLM-based interactive system to support individuals in engaging with self-guided cognitive restructuring. The system assisted individuals in identifying cognitive biases and used GPT-3 to interactively reframe negative thoughts more skillfully while incorporating psychoeducation at every step. The study results indicated a positive emotional shift and an increase in reported mastery in dealing with negative thoughts.

Future research directions for LLMs include their potential to serve as therapeutic agents, moving beyond previous rule-based chatbots to provide flexible interventions tailored to individual needs. Despite the significant potential of this new modality to deliver therapeutic interventions, numerous steps are required before deployment is possible, including evaluating the readiness of LLMs for safety-critical use and assessing the quality and equity of their outputs. One step in this direction is the design of BOLT by Chiu and colleagues (2024), a computational framework that examines the quality of LLM-generated psychotherapy interventions within the motivational interviewing (MI) counseling framework. BOLT simulates therapist–patient interactions, annotates LLM therapeutic behavior at the utterance level using MI treatment integrity ratings, and compares their frequencies with human-established ratings of high- and low-quality counseling therapy sessions. Study results suggest that interventions delivered by general-use LLMs (i.e., the GPT-4, GPT-3.5, and Llama-2 models) resemble low-quality sessions and are not consistent with human-delivered, high-quality care. These findings underscore the importance of designing specialized LLMs that are fine-tuned for clinical applications, with guardrails in place to ensure the appropriateness and quality of the interventions before considering their deployment in the real world. Beyond encoding knowledge, these systems need interfaces that are appropriate, reliable, engaging, and efficient for effective clinical utilization.
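To make utterance-level behavioral assessment more concrete, the simplified sketch below compares the frequency of illustrative MI behavior codes between an LLM-generated session and a high-quality reference session; the codes, data, and counting scheme are didactic stand-ins rather than the BOLT implementation.

```python
# Simplified, illustrative comparison of MI behavior-code frequencies between an
# LLM-generated session and a high-quality reference session. Codes and data are
# invented stand-ins; this is not the BOLT framework (Chiu et al., 2024).
from collections import Counter

MI_CODES = ["reflection", "open_question", "affirmation", "advice_without_permission"]

def code_frequencies(coded_turns):
    """`coded_turns` is a list of MI codes, one per therapist utterance."""
    counts = Counter(coded_turns)
    total = max(len(coded_turns), 1)
    return {code: counts[code] / total for code in MI_CODES}

llm_session = ["advice_without_permission", "open_question", "advice_without_permission"]
reference_session = ["reflection", "open_question", "reflection", "affirmation"]

for code in MI_CODES:
    print(f"{code:28s} LLM={code_frequencies(llm_session)[code]:.2f}  "
          f"reference={code_frequencies(reference_session)[code]:.2f}")
```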

Open problems and opportunities

The fast evolution of the capabilities of LLMs has resulted in a large number of open research questions concerning properties such as reasoning ability, consistency, and biases.

Representation

Multilingual models (e.g., BLOOM; Le Scao et al., 2022) will be important to realize the full potential of AI systems and ensure that there are no disparities in access to AI-driven tools. Yet, at present, NLP tools have richer representations and better performance in English than in many other languages (Jin et al., 2023; Petrov et al., 2024), and their representations of information relevant to economically disadvantaged countries are weaker (K. Zhou et al., 2022). Both factors could exacerbate inequalities. Worse still, in most cases little validation is performed on text corpora that are not primarily in English. Thus, not only is there a divide in performance, but that gap is likely to grow without a concerted effort toward equity.

Biases

All AI systems are a function of the data used to train them. Datasets have biases and, unchecked, these result in biases in model outputs, including differential quality in the responses generated. For example, LLM testing has revealed a tendency to produce content that is consistent with gender stereotypes (Acerbi & Stubbersfield, 2023). Biases may produce inaccurate outputs, but they present many other issues in the mental health domain. Disparities across demographic groups mean that some groups will present with higher levels of risk and poorer experiences. As such, when designing LLMs to be used in clinical settings, it is crucial to evaluate how their performance may differentially impact diverse groups (Gabriel et al., 2024). Anticipating and mitigating biases through specialized model training and prompting techniques is essential.
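A minimal sketch of such a subgroup evaluation appears below; the data are synthetic, and in practice the grouping variables, metrics, and acceptable performance gaps would be specified by the study design.

```python
# Minimal sketch of auditing screening recall across demographic groups.
# The records are synthetic; real audits would use study-defined groups and metrics.
from collections import defaultdict

def recall_by_group(records):
    """records: iterable of (group, is_true_case, model_flagged) tuples."""
    hits, cases = defaultdict(int), defaultdict(int)
    for group, is_case, flagged in records:
        if is_case:
            cases[group] += 1
            hits[group] += int(flagged)
    return {group: hits[group] / cases[group] for group in cases}

synthetic = [("A", True, True), ("A", True, True), ("A", False, False),
             ("B", True, False), ("B", True, True), ("B", False, True)]
print(recall_by_group(synthetic))  # large between-group gaps would warrant mitigation
```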

Public datasets and models

“Openness” in the machine learning community is credited for the speed at which progress in AI has been attained. Open models, code, and data are necessary for the distributed and democratized development of the technology. In the domain of traumatic stress, there is a limited number of open datasets that can be used for benchmarking and model comparison. More comprehensive datasets are needed to realize the benefits of using LLMs to improve assessment and treatment. The lack of open, comprehensive, and ecologically valid benchmark datasets has limited, and will continue to limit, progress in the field.

Engagement

Treatment engagement is a long-standing challenge. LLMs present several opportunities for making systems that are more responsive and personalized, two properties that can make a platform engaging. First, AI systems can interact with the user rapidly, allowing someone to receive a response within fractions of a second and without the bandwidth limitations of human providers. This can contribute to maintaining engagement in the short term. Second, it is well established that digital platforms can be personalized very effectively, making them “sticky.” Generating content tailored to a specific person can help support longer-term engagement. However, these systems need to be designed with care, as optimizing for engagement alone can have negative consequences (e.g., social media addiction).

Explanations and reasoning

Language models are attractive compared to traditional classifiers, as they can be prompted to provide explanations or reasoning for their responses (Galatzer-Levy et al., 2023) and can be designed to reason through solutions to problems (Brown et al., 2020). However, these models do not always act in a logical fashion and can produce hallucinations and confabulations. Integrating different NLP- and human-based evaluations into systems where LLMs are deployed would provide additional layers of interpretability, allowing for further review of results when necessary.

Adoption

The integration of LLMs into medicine necessitates considerations beyond examining their potential to improve health care outcomes. This is a complex socioeconomic issue that requires dialogue among all involved stakeholders, including state lawmakers, health care systems, clinicians, developers, and individuals with lived experience. Ethical implications form an integral part of this ongoing discourse. As such, the adoption of LLMs must ensure that their deployment is not only technologically feasible but also socially responsible.

Privacy

“Federated learning” refers to the process of training models using data from many “clients” (e.g., devices) such that the data never leave the client. Recent demonstrations of LLMs trained in a fully federated manner have been published (Ye et al., 2024). Federated methods are likely to be necessary in health care given the importance of protecting patient data and the intellectual property they represent. However, these systems are technically challenging to design, implement, and evaluate, and strong cross-disciplinary efforts will be needed to create them.
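The sketch below illustrates the core federated averaging idea on a toy least-squares problem: each client computes an update on data that never leave that client, and the server aggregates only model parameters. It is a didactic toy, not OpenFedLLM or a clinical-grade system.

```python
# Didactic sketch of federated averaging (FedAvg): clients train locally on data
# that never leave the client; the server averages parameters only. This is a toy
# example, not OpenFedLLM (Ye et al., 2024) or a production health care system.
import numpy as np

def local_update(global_weights, client_data, lr=0.1):
    """One gradient step of least-squares regression on the client's own data."""
    X, y = client_data
    grad = X.T @ (X @ global_weights - y) / len(y)
    return global_weights - lr * grad

def fedavg_round(global_weights, clients):
    updates = [local_update(global_weights, data) for data in clients]
    return np.mean(updates, axis=0)   # the server sees parameters, never raw data

rng = np.random.default_rng(1)
clients = [(rng.normal(size=(20, 3)), rng.normal(size=20)) for _ in range(4)]
weights = np.zeros(3)
for _ in range(50):
    weights = fedavg_round(weights, clients)
print(weights)
```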

Risks of harm

As with any system, it is inevitable that these models will make mistakes. Safeguards are needed to minimize the negative impact of these errors. The negligent deployment of AI without concentrated human-centered research and transparency will lead to harm and impact trust in the technology. Human-in-the-loop systems will be necessary in some cases to provide safety; however, there is no simple single solution that will remove the risk of negative outcomes.

Conclusions

LLMs offer significant opportunities to scale and innovate approaches to understanding, diagnosing, and treating stress-related pathology. LLMs encode vast clinical knowledge, can recognize stress patterns, and generate human-like interactions, which can allow for the development of personalized treatment plans; enhanced diagnostic models; and, ultimately, improved patient outcomes. AI is also likely to be an important component of systems that offer just-in-time interventions in situations that require rapid triage. However, ethical considerations, including data privacy and the potential for AI bias, must be addressed to ensure these technologies benefit all sectors of society equitably. As LLMs are integrated into mental health care, interdisciplinary collaborations are essential to harness their transformative potential effectively.

Acknowledgments

Matteo Malgaroli was supported by the National Institute of Mental Health (K23MH134068).

Daniel McDuff receives salary and compensation from Google, LLC, which owns MedPalm 2 and the Gemini series of large language models. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

REFERENCES

1. Acerbi A, & Stubbersfield JM (2023). Large language models show human-like content biases in transmission chain experiments. Proceedings of the National Academy of Sciences, 120(44), Article e2313790120. 10.1073/pnas.2313790120
2. Ahn M, Brohan A, Brown N, Chebotar Y, Cortes O, David B, Finn C, Fu C, Gopalakrishnan K, Hausman K, Herzog A, Ho D, Hsu J, Ibarz J, Ichter B, Irpan A, Jang E, Jauregui Ruano R, Jeffrey K, Jesmonth S, …Zeng A (2022). Do as I can, not as I say: Grounding language in robotic affordances. arXiv. 10.48550/arXiv.2204.01691
3. American Psychiatric Association. (2013). Diagnostic and statistical manual of mental disorders (5th ed.). 10.1176/appi.books.9780890425596
4. Bahdanau D, Cho K, & Bengio Y (2014). Neural machine translation by jointly learning to align and translate. arXiv. 10.48550/arXiv.1409.0473
5. Brown T, Mann B, Ryder N, Subbiah M, Kaplan JD, Dhariwal P, Neelakantan A, Shyam P, Sastry G, Askell A, Agarwal A, Herbert-Voss A, Krueger G, Henighan T, Child R, Ramesh A, Ziegler D, Wu J, Winter C, …Amodei D (2020). Language models are few-shot learners. In Larochelle H, Ranzato M, Hadsell R, Balcan MF, & Lin H (Eds.), Advances in Neural Information Processing Systems 33 (pp. 1877–1901). NeurIPS.
6. Cambon A, Hecht B, Edelman B, Ngwe D, Jaffe S, Heger A, Vorvoreanu M, Peng S, Hofman J, Farach A, Bermejo-Cano M, Knudsen E, Bono J, Sanghavi H, Spatharioti S, Rothschild D, Goldstein DG, Kalliamvakou E, Cichon P, …Teevan J (2023). Early LLM-based tools for enterprise information workers likely provide meaningful boosts to productivity [Technical report]. https://www.microsoft.com/en-us/research/publication/early-llm-based-tools-for-enterprise-information-workers-likely-provide-meaningful-boosts-to-productivity
7. Chiu YY, Sharma A, Lin IW, & Althoff T (2024). A computational framework for behavioral assessment of LLM therapists. arXiv. 10.48550/arXiv.2401.00820
8. Gabriel S, Puri I, Xu X, Malgaroli M, & Ghassemi M (2024). Can AI relate: Testing large language model response for mental health support. arXiv. 10.48550/arXiv.2405.12021
9. Galatzer-Levy IR, McDuff D, Natarajan V, Karthikesalingam A, & Malgaroli M (2023). The capability of large language models to measure psychiatric functioning. arXiv. 10.48550/arXiv.2308.01834
10. Garcia P, Ma SP, Shah S, Smith M, Jeong Y, Devon-Sand A, Tai-Seale M, Takazawa K, Clutter D, Vogt K, Lugtu C, Rojo M, Lin S, Shanafelt T, Pfeffer MA, & Sharp C (2024). Artificial intelligence–generated draft replies to patient inbox messages. JAMA Network Open, 7(3), Article e243201. 10.1001/jamanetworkopen.2024.3201
11. Glenberg AM, Havas D, Becker R, & Rinck M (2005). Grounding language in bodily states. In Pecher D & Zwaan RA (Eds.), Grounding cognition: The role of perception and action in memory, language, and thinking (pp. 115–128). Cambridge University Press.
12. Huber B, McDuff D, Brockett C, Galley M, & Dolan B (2018). Emotional dialogue generation using image-grounded language models. In CHI ‘18: Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems (pp. 1–12). 10.1145/3173574.3173851
13. Jin Y, Chandra M, Verma G, Hu Y, Choudhury MD, & Kumar S (2023). Better to ask in English: Cross-lingual evaluation of large language models for healthcare queries. arXiv. 10.48550/arXiv.2310.13132
14. Kim Y, Xu X, McDuff D, Breazeal C, & Park HW (2024). Health-LLM: Large language models for health prediction via wearable sensor data. arXiv. 10.48550/arXiv.2401.06866
15. Le Scao T, Fan A, Akiki C, Pavlick E, Ilic S, Hesslow D, Castagné R, Luccioni AS, Yvon F, Gallé M, Tow J, Rush AM, Biderman S, Webson A, Ammanamanchi PS, Wang T, Sagot B, Muennighoff N, Villanova del Moral A, … Wolf T (2022). BLOOM: A 176B-parameter open-access multilingual language model. arXiv. 10.48550/arXiv.2211.05100
16. Li LH, Zhang P, Zhang H, Yang J, Li C, Zhong Y, Wang L, Yuan L, Zhang L, Hwang J-N, Chang K-W, & Gao J (2022). Grounded language-image pre-training. arXiv. 10.48550/arXiv.2112.03857
17. Liu X, McDuff D, Kovacs G, Galatzer-Levy I, Sunshine J, Zhan J, Poh M-Z, Liao S, Di Achille P, & Patel S (2023). Large language models are few-shot health learners. arXiv. 10.48550/arXiv.2305.15525
18. Liu Y, Jain A, Eng C, Way DH, Lee K, Bui P, Kanada K, de Oliveira Marinho G, Gallegos J, Gabriele S, et al. (2020). A deep learning system for differential diagnosis of skin diseases. Nature Medicine, 26(6), 900–908. 10.1038/s41591-020-0842-3
19. Malgaroli M, Hull TD, Zech JM, & Althoff T (2023). Natural language processing for mental health interventions: A systematic review and research framework. Translational Psychiatry, 13(1), Article 309. 10.1038/s41398-023-02592-2
20. Malgaroli M, & Schultebraucks K (2020). Artificial intelligence and posttraumatic stress disorder (PTSD). European Psychologist, 25(4), 272–282. 10.1027/1016-9040/a000423
21. Malgaroli M, Tseng E, Hull TD, Jennings E, Choudhury TK, & Simon NM (2023). Association of health care work with anxiety and depression during the COVID-19 pandemic: Structural topic modeling study. JMIR AI, 2(1), Article e47223. 10.2196/47223
22. McDuff D, Schaekermann M, Tu T, Palepu A, Wang A, Garrison J, Singhal K, Sharma Y, Azizi S, Kulkarni K, Hou L, Cheng Y, Liu Y, Mahdavi SS, Prakash S, Pathak A, Semturs C, Patel S, Webster DR, …Natarajan V (2023). Towards accurate differential diagnosis with large language models. arXiv. 10.48550/arXiv.2312.00164
23. Mohr DC, Zhang M, & Schueller SM (2017). Personal sensing: Understanding mental health using ubiquitous sensors and machine learning. Annual Review of Clinical Psychology, 13, 23–47. 10.1146/annurev-clinpsy-032816-044949
24. Mostafazadeh N, Brockett C, Dolan B, Galley M, Gao J, Spithourakis GP, & Vanderwende L (2017). Image-grounded conversations: Multimodal context for natural question and response generation. arXiv. 10.48550/arXiv.1701.08251
25. Norman KP, Govindjee A, Norman SR, Godoy M, Cerrone KL, Kieschnick DW, & Kassler W (2020). Natural language processing tools for assessing progress and outcome of two veteran populations: Cohort study from a novel online intervention for posttraumatic growth. JMIR Formative Research, 4(9), Article e17424. 10.2196/17424
26. Petrov A, La Malfa E, Torr P, & Bibi A (2024). Language model tokenizers introduce unfairness between languages. In Advances in Neural Information Processing Systems, 36 (pp. 1–28). NeurIPS.
27. Rauschecker AM, Rudie JD, Xie L, Wang J, Duong MT, Botzolakis EJ, Kovalovich AM, Egan J, Cook TC, Bryan RN, Nasrallah IM, Mohan S, & Gee JC (2020). Artificial intelligence system approaching neuroradiologist-level differential diagnosis accuracy at brain MRI. Radiology, 295(3), 626–637. 10.1148/radiol.2020190283
28. Schaeffer R, Miranda B, & Koyejo S (2023). Are emergent abilities of large language models a mirage? arXiv. 10.48550/arXiv.2304.15004
29. Schueller SM, & Morris RR (2023). Clinical science and practice in the age of large language models and generative artificial intelligence. Journal of Consulting and Clinical Psychology, 91(10), 559–561. 10.1037/ccp0000848
30. Schultebraucks K, Yadav V, Shalev AY, Bonanno GA, & Galatzer-Levy IR (2022). Deep learning-based classification of posttraumatic stress disorder and depression following trauma utilizing visual and auditory markers of arousal and mood. Psychological Medicine, 52(5), 957–967. 10.1017/S0033291720002718
31. Shah RS, Holt F, Hayati SA, Agarwal A, Wang Y-C, Kraut RE, & Yang D (2022). Modeling motivational interviewing strategies on an online peer-to-peer counseling platform. Proceedings of the ACM on Human-Computer Interaction, 6, Article 527.
32. Sharma A, Lin IW, Miner AS, Atkins DC, & Althoff T (2023). Human–AI collaboration enables more empathic conversations in text-based peer-to-peer mental health support. Nature Machine Intelligence, 5(1), 46–57. 10.1038/s42256-022-00593-2
33. Sharma A, Rushton K, Lin IW, Nguyen T, & Althoff T (2023). Facilitating self-guided mental health interventions through human-language model interaction: A case study of cognitive restructuring. arXiv. 10.48550/arXiv.2310.15461
34. Singhal K, Tu T, Gottweis J, Sayres R, Wulczyn E, Hou L, Clark K, Pfohl S, Cole-Lewis H, Neal D, Schaekermann M, Wang A, Amin M, Lachgar S, Mansfield P, Prakash S, Green B, Dominowska E, y Arcas BA, … Natarajan V (2023). Towards expert-level medical question answering with large language models. arXiv. 10.48550/arXiv.2305.09617
35. So J.-h., Chang J, Kim E, Na J, Choi J, Sohn J.-y., Kim B-H, & Chu SH (2024). Aligning large language models for enhancing psychiatric interviews through symptom delineation and summarization. arXiv. 10.48550/arXiv.2403.17428
36. Son Y, Clouston SA, Kotov R, Eichstaedt JC, Bromet EJ, Luft BJ, & Schwartz HA (2023). World Trade Center responders in their own words: Predicting PTSD symptom trajectories with AI-based language analyses of interviews. Psychological Medicine, 53(3), 918–926. 10.1017/S0033291721002294
37. Szolovits P, & Pauker SG (1978). Categorical and probabilistic reasoning in medical diagnosis. Artificial Intelligence, 11(1–2), 115–144.
38. United Nations High Commissioner for Refugees. (2023). Mid-year trends 2023. https://www.unhcr.org/mid-year-trends-report-2023
39. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, & Polosukhin I (2017). Attention is all you need. In Advances in Neural Information Processing Systems 30. NeurIPS.
40. Webster P (2023). Six ways large language models are changing healthcare. Nature Medicine, 29(12), 2969–2971. 10.1038/s41591-023-02700-1
41. World Health Organization. (2022). World mental health report: Transforming mental health for all. https://www.who.int/publications/i/item/9789240049338
42. Xu X, Dinkel H, Wu M, & Yu K (2021). Text-to-audio grounding: Building correspondence between captions and sound events. In ICASSP 2021–2021 IEEE International Conference on Acoustics, Speech and Signal Processing (pp. 606–610). ICASSP.
43. Ye R, Wang W, Chai J, Li D, Li Z, Xu Y, Du Y, Wang Y, & Chen S (2024). OpenFedLLM: Training large language models on decentralized private data via federated learning. arXiv. 10.48550/arXiv.2402.06954
44. Zhou K, Ethayarajh K, & Jurafsky D (2022). Richer countries and richer representations. arXiv. 10.48550/arXiv.2205.05093
45. Zhou L, Kalantidis Y, Chen X, Corso JJ, & Rohrbach M (2019). Grounded video description. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 6578–6587). IEEE/CVF.
