Author manuscript; available in PMC: 2026 Feb 18.
Published in final edited form as: Behav Ther. 2025 Mar 10;56(4):680–688. doi: 10.1016/j.beth.2025.02.005

Feasibility of Using ChatGPT to Generate Exposure Hierarchies for Treating Obsessive-Compulsive Disorder

Emily E Bernstein 1, Adam C Jaroszewski 1, Ryan J Jacoby 1, Natasha H Bailen 1, Jennifer Ragan 1, Aisha Usmani 1, Sabine Wilhelm 1
PMCID: PMC12912708  NIHMSID: NIHMS2142106  PMID: 40541371

Abstract

Obsessive-compulsive disorder (OCD) is a chronic, severe condition. Although exposure and response prevention (ERP), the first-line treatment for OCD, is highly effective, too few clinicians are equipped to deliver it. One barrier is the time and expertise required to develop personalized exposure hierarchies. In this study, we examined the feasibility and promise of using large language models (LLMs) to generate appropriate exposure suggestions for OCD treatment. We used ChatGPT-4 (Generative Pretrained Transformer, Version 4) to generate 10-item exposure hierarchies for simulated patient cases that were systematically varied along the following dimensions: OCD subtype, symptom complexity or number, level of detail, patient age, and patient gender. Expert clinicians also generated hierarchies for a subset of prompts. ChatGPT-generated hierarchies were first rated for completeness and degree to which input information was incorporated. Three OCD experts blinded to the aims of the study then rated each ChatGPT- and expert-generated hierarchy’s appropriateness, specificity, variability, safety/ethics, and overall usefulness or quality. ChatGPT generated partial (n = 15) or complete (n = 55) responses to 70 of 72 prompts and incorporated most input information (M = 4.29 out of 5, SD = 0.85). The only significant predictor of degree of input information incorporated was number of OCD symptoms; prompts with the most symptoms were rated as incorporating less input information than prompts with both low and moderate number of symptoms, ps < .05. Overall, ChatGPT-generated hierarchies were viewed as appropriate (M = 4.47, SD = 0.58), specific (M = 4.17, SD = 0.65), variable (M = 3.96, SD = 0.79), safe/ethical (M = 4.89, SD = 0.24), and useful (M = 3.99, SD = 0.82). 
However, expert human-generated hierarchies were still rated as significantly more appropriate, specific, variable, and useful, ps < .05, but not more or less safe and ethical than ChatGPT-generated hierarchies, p = .24. Only the level of symptom detail included in prompts was associated with ratings of ChatGPT-generated hierarchies, ps < .05, such that hierarchies were rated significantly better when prompts had been more detailed. Results suggest that LLMs such as ChatGPT hold great promise in helping generate effective OCD exposure hierarchies, while also highlighting key limitations that require resolution prior to clinical implementation. Given that few clinicians specialize in OCD treatment, it would be advantageous to establish how face-to-face or digital treatment can be augmented with this technology.

Keywords: obsessive-compulsive disorder, cognitive behavioral therapy, exposure and response prevention, large language models, artificial intelligence


Obsessive-compulsive disorder (OCD) is characterized by persistent, intrusive thoughts coupled with compulsions (repetitive actions or rituals) intended to neutralize or rid oneself of the obsessions (American Psychiatric Association, 2013). OCD occurs in approximately 2% of the population and was named the 10th leading cause of impairment among all health conditions by the World Health Organization, given disproportionately high rates of poor outcomes such as unemployment (up to 41%) and suicidal behaviors (up to 27%) (Eaton et al., 2008; Nagy et al., 2020; Pinto et al., 2006; Ruscio et al., 2010). Without treatment, OCD has a chronic and severe course, underscoring the critical importance of access to effective treatment for OCD sufferers (Singh et al., 2023).

Cognitive behavioral therapy (CBT), and specifically exposure and response prevention (ERP), has been extensively validated as the first-line treatment for OCD (Abramowitz et al., 2001; Ferrando & Selai, 2021; Rosa-Alcázar et al., 2008). ERP involves systematically confronting the thoughts, images, objects, and situations that make the individual anxious or provoke their obsessions while limiting engagement in compulsive responses (Hezel & Simpson, 2019). Exposures can occur both in real life (e.g., touching surfaces perceived to be contaminated) and in one’s imagination by reading or listening to an exposure script (e.g., imagining contracting a deadly virus). During ERP, patients learn that their feared outcomes are less likely to come true than they initially thought (e.g., that they do not show signs of illness despite not washing their hands after touching something that they considered to be contaminated) and that they can tolerate distress without engaging in compulsive or ritualized behaviors. Despite robust supporting evidence for ERP, the majority (57.3%) of individuals with OCD cannot access any treatment, let alone ERP with a qualified clinician (Kohn et al., 2004; Marques et al., 2010).

Technological advances in recent years have begun to reduce the treatment gap. First, the availability of high-quality, virtual provider training has increased the number of providers equipped to deliver ERP by lowering barriers like cost and the logistics of traveling to in-person programs. Program evaluations show that clinicians new to treating OCD can be effectively trained in ERP through virtual, light touch instruction (Jacoby et al., 2019). Second, smartphone- and internet-based platforms have been developed to deliver or support ERP (typically with the help of clinicians or coaches), which can reduce the cost, associated stigma, and other barriers to care as users can access content whenever and wherever is convenient. Multiple trials have shown that guided digital CBT for OCD is feasible, safe, and effective, delivered by both internet (Andersson et al., 2012; Kyrios et al., 2018; Lundström et al., 2022; Mahoney et al., 2014) and smartphone app (Boisseau et al., 2017; Gershkovich et al., 2021; Hwang et al., 2021).

Even for those who are able to access ERP for OCD, response and remission rates across in-person and virtual treatment remain too low (Geiger et al., 2024). Personalizing treatment, particularly developing a hierarchy of exposures that are considered to be useful and safe, is difficult; this difficulty contributes to providers delivering suboptimal ERP, as well as to patients and coaches struggling with digital and self-guided treatments (Gillihan et al., 2012). Treatment personalization requires creativity and clinical acumen, as effective exposures must directly address a patient’s specific obsessions and compulsions, be appropriately graded (i.e., not too difficult or too easy), and be both feasible and safe within a patient’s context. This can also be time consuming, further straining already burdened clinicians and discouraging self-help or digital treatment users who may struggle to develop their own hierarchies. Given the high heterogeneity of obsessions and compulsions seen in OCD (e.g., McKay et al., 2004), example hierarchies and exposures provided in a smartphone app, treatment manual, or training may not be directly relevant to a given patient’s unique symptoms. For example, an individual who worries that having an unacceptable taboo thought while getting dressed means that they need to change their clothes to prevent something bad from happening may struggle to apply ERP principles to their specific obsessions and compulsions if the examples provided to them include more traditional contamination and harm themes. Accordingly, greater guidance in planning exposures is needed to realize the promise of scaling trainings and treatments for OCD.

An exciting potential solution to these challenges is to leverage large language models (LLMs). Such generative AI-based platforms, like OpenAI’s ChatGPT, use natural language processing algorithms to engage in human-like conversation. The utility of these platforms has already been demonstrated across a wide array of tasks, including in the medical space. For example, early research shows that ChatGPT can support clinical decision making, pass medical exams, answer patient queries, and assist in scientific writing and literature reviews with largely “passing” performance (Li et al., 2024). Within psychiatry specifically, LLMs have shown preliminary promise in supporting diagnosis and clinical decision making (Elyoseph et al., 2024; Fahmi et al., 2023; Levkovich & Elyoseph, 2023b; Omar et al., 2024) and in directly supporting patients through psychoeducation and therapeutic conversations (Li et al., 2023; Omar et al., 2024). LLMs, and ChatGPT in particular, appear more adept at following clinical guidelines and answering simple questions, and less adept when presented with more severe or complex cases or asked to provide personalized clinical guidance (Ayoub et al., 2024; Dergaa et al., 2024; Levkovich & Elyoseph, 2023a; Omar et al., 2024; Sezgin et al., 2023). In sum, publicly available platforms like ChatGPT perform increasingly well on clinical tasks while continuing to exhibit concerning weaknesses, including inaccurate references and sometimes even dangerous suggestions; to safely and optimally integrate these promising technologies into clinical workflows, it is recommended that models be evaluated by experts and further trained on specialized datasets (Li et al., 2024).

In the present study we aimed to demonstrate the current ability of a widely used and publicly available LLM (GPT-4) to produce exposure hierarchies for OCD treatment. We evaluated the impact of various clinical (e.g., subtype and complexity of OCD symptoms) and demographic (e.g., patient age) features of simulated patient case descriptions on the ChatGPT-generated output. We also benchmarked ChatGPT performance against expert, doctoral-level OCD therapists. Given the low rate of clinicians offering, let alone specializing in, OCD treatment, it would be advantageous to establish whether face-to-face and digital care can be augmented with this technology and what present weaknesses could be addressed with future, expert-informed fine tuning.

Material and Methods

We used ChatGPT (GPT-4) to generate 10-item exposure hierarchies for a series of simulated patient cases (described further below). All model testing was performed in new ChatGPT sessions to limit influence from previous prompts. Prompts were written by authors (EB, RJ), experts in the diagnosis and treatment of OCD and providers in an OCD specialty clinic, and manually entered into the ChatGPT website. Responses were copied verbatim into a database for review (see Data Analysis). The project was determined to not meet the criteria for human subject research as defined by Mass General Brigham Institutional Review Board (IRB) policies and Health and Human Services regulations set forth in 45 CFR 46, and therefore did not require IRB approval.

Prompts consisted of (1) context for the request, (2) how the model was meant to respond, (3) the specific request, and (4) the output format (Heston & Khun, 2023). Specifically, prompts used the following template: “I am a mental health provider creating a treatment plan. Please respond as if you are an expert in Exposure and Response Prevention (ERP) treatment for Obsessive Compulsive Disorder (OCD). Please create a 10-item graded exposure hierarchy for a patient whose primary obsession(s) is (are)__. The corresponding compulsion(s) is (are)__. The patient is__years old and identifies as a__.” Given the known impact of prompt engineering on generative LLM output (Heston & Khun, 2023), we standardized prompt formats and systematically varied the following dimensions: OCD subtype (contamination, accidental harm, sexual/physical violence), symptom complexity or number (low [1 obsession and 1 compulsion], moderate [2 obsessions and 4 compulsions], high [3 obsessions and 6 compulsions]), level of detail (low, high), patient age (15 years old, 40 years old), and patient gender (woman, man).
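The five varied dimensions define a full factorial grid of 3 × 3 × 2 × 2 × 2 = 72 cases. As an illustration, the grid can be enumerated in a few lines of Python; this is our own sketch, not the authors' tooling, and the variable names are invented:

```python
import itertools

# Dimension levels as described in the paper; the symptom-number labels
# stand in for the 1/1, 2/4, and 3/6 obsession/compulsion combinations.
SUBTYPES = ["contamination", "accidental harm", "sexual/physical violence"]
SYMPTOM_NUMBER = ["low", "moderate", "high"]
DETAIL = ["low", "high"]
AGES = [15, 40]
GENDERS = ["woman", "man"]


def build_prompt_grid():
    """Enumerate one simulated case per combination of the five dimensions."""
    grid = []
    for subtype, number, detail, age, gender in itertools.product(
            SUBTYPES, SYMPTOM_NUMBER, DETAIL, AGES, GENDERS):
        grid.append({"subtype": subtype, "number": number,
                     "detail": detail, "age": age, "gender": gender})
    return grid
```

Filling the template slots (obsessions, compulsions, age, gender) from each case dictionary would then yield the 72 standardized prompts.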

An example of a low detail prompt is: “I am a mental health provider creating a treatment plan. Please respond as if you are an expert in Exposure and Response Prevention (ERP) treatment for Obsessive Compulsive Disorder (OCD). Please create a 10-item graded exposure hierarchy for a patient whose primary obsession is fear and disgust of being contaminated by bodily excretions (urine and feces). The corresponding compulsion is excessive handwashing. The patient is 15 years old and identifies as a man.” An example of a high detail prompt is: “I am a mental health provider creating a treatment plan. Please respond as if you are an expert in Exposure and Response Prevention (ERP) treatment for Obsessive Compulsive Disorder (OCD). Please create a 10-item graded exposure hierarchy for a patient whose primary obsession is fear of being contaminated by bodily excretions (urine and feces). He is afraid that urine and feces will get on his skin after using the bathroom or in public restrooms, unknowingly enter his body, and get him sick and that he will then unknowingly spread this sickness to others. He also reports feeling disgusted after using the restroom at the thought of fecal matter being on his skin. The corresponding compulsion is washing his hands with very hot water for more than 20 minutes every time he uses the bathroom or has a thought about being contaminated. He also washes his hands in a ritualized order by going in between each finger in order and then washing up the arm to the elbow. The patient is 15 years old and identifies as a man.”

In total, 72 prompts were submitted to ChatGPT, capturing all possible combinations of the aforementioned dimensions. When ChatGPT failed to produce a hierarchy (i.e., responded with an error), the response was still pasted into the database for review, and investigators then prompted ChatGPT a second time with: “What activities might help a patient with these symptoms?” This output was used for blinded reviews (described below). To compare human- versus ChatGPT-generated hierarchies, expert clinicians (EB, AJ, RJ) generated hierarchies for 18 of the 72 case description prompts (short and long versions of each subtype/number-of-symptoms combination; gender and age were counterbalanced).
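The submit-then-retry procedure can be sketched as a small wrapper. In the study, prompts were entered manually into the ChatGPT website, so `model_call` below is a hypothetical stand-in for whatever client is used; it is assumed to return the model's text, or `None` when a content-filter error produces no output:

```python
FOLLOW_UP = "What activities might help a patient with these symptoms?"


def request_hierarchy(prompt, model_call):
    """Submit a prompt; if no hierarchy is produced (e.g., a content
    filter blocks the response), retry once with the generic follow-up
    question used in the study. Returns (text, source-of-response).
    """
    response = model_call(prompt)
    if response is not None:
        return response, "initial"
    retry = model_call(FOLLOW_UP)
    return retry, "follow-up"
```

In the study this fallback was needed for only two of the 72 prompts, both describing sexual/violent obsessions at the highest symptom load and detail.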

DATA ANALYSIS

Initial Review

Two staff psychologists (EB, AJ) each reviewed all ChatGPT responses and evaluated them along the following dimensions: (1) task completion and (2) degree to which input information was incorporated in the output. Task completion was rated categorically: not at all, partial, or complete. A ChatGPT response was marked as “complete” if it included a list of 10 scenarios that fit the definition of an exposure with response prevention. A response was marked as “not at all” complete if ChatGPT failed to generate any suggestions after the initial prompt. All other responses were marked “partial.” A third psychologist (RJ) reviewed responses to resolve any discrepancies. The degree to which input information (i.e., prompt content) was incorporated was rated on a 5-point scale from 1 = not at all to 5 = completely. A response received a 1 (not at all) if, for example, the prompt referred to contamination and violence obsessions for an adult and the response provided symmetry exposures in a school environment. A response received a 5 (completely) if all suggestions fit the prompt and all prompt content was incorporated. Note that this rating was not an evaluation of the quality of the suggestions. Interrater reliability was assessed by calculating intraclass correlation coefficients (ICC) for each rating dimension, based on a mean rating (k = 2), absolute agreement, two-way mixed-effects model. Raters had good agreement on both “task completion” (ICC = .76) and “input information incorporated” (ICC = .74).
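The agreement statistic used here, an average-measures, absolute-agreement ICC from a two-way model (ICC(A,k) in McGraw and Wong's taxonomy), can be computed from the ANOVA mean squares. Below is a stdlib-only sketch written by us for illustration, not the authors' analysis code:

```python
def icc_a_k(scores):
    """ICC(A,k): two-way model, absolute agreement, mean of k raters.

    `scores` is a list of n subjects, each a list of the same k ratings.
    Computed as (MS_rows - MS_err) / (MS_rows + (MS_cols - MS_err) / n).
    """
    n, k = len(scores), len(scores[0])
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(scores[i][j] for i in range(n)) / n for j in range(k)]
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)   # subjects
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)   # raters
    ss_err = ss_total - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (ms_rows + (ms_cols - ms_err) / n)
```

Because the model uses absolute agreement, a constant offset between raters lowers the coefficient even when their rank orderings agree perfectly.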

We present descriptive statistics and quantitatively evaluate whether any prompt dimension (e.g., prompt length, OCD subtype) predicted worse task completion (chi-square tests) or input information incorporation (t-tests, ANOVAs).

Blinded Review

Three staff psychologists (NB, JR, AU) with expertise in OCD and ERP served as blinded raters. Raters split their time between hospital (M = 56.33%, SD = 14.15%) and private practice (M = 40.00%, SD = 18.03%) settings and were on average 13.33 years (SD = 9.07) post-terminal degree. Each rater had treated over 100 patients with OCD and reported using ERP in the treatment of nearly all of them (M = 99.67% of patients with OCD, SD = 0.58). They were presented with all ChatGPT-generated (n = 72) and human-generated (n = 18) hierarchies but were blinded to the generation source and aims of the project. They were told that team members were developing new programs to train and support people in delivering ERP for OCD and were looking for outside experts to assess how the pilot program went. They rated each hierarchy along the following dimensions using a 5-point scale (1 = strongly disagree to 5 = strongly agree): (1) appropriateness (how appropriate the suggestions are given the patient’s symptoms), (2) specificity (how actionable the suggestions are), (3) variability (to what degree suggestions fit the expected range of exposures needed for a full hierarchy; raters were reminded that instructions were always to generate 10-item hierarchies, so hierarchies should not be penalized for length), (4) safety/ethics (to what degree suggestions are safe and ethical for the patient to conduct), and (5) overall usefulness or quality. ICCs for rating dimensions were based on a mean rating (k = 3), absolute agreement, two-way mixed-effects model. Raters had low agreement on hierarchy specificity (ICC = .31) but moderate agreement on appropriateness (ICC = .49), variability (ICC = .52), and overall usefulness (ICC = .61). Safety ratings had high negative skew, little variability, and many ties between raters, violating the assumptions of ICC and related tests.
However, raw ratings show there was strong interrater agreement on hierarchy safety: on 74.4% of hierarchies (n = 67), all three raters rated 5/5 (strongly agree) and, on 95.6% of hierarchies (n = 86), two out of three raters rated 5/5. After rating all 72 possible hierarchies, raters were then informed that some hierarchies were AI-generated and some not. They were subsequently presented with 18 pairs of hierarchies (one ChatGPT-generated, one human-generated, matched by prompt), asked to identify which was AI-generated, and rated their confidence in this judgment (0–100% confident). We present descriptives of ratings, and quantitatively evaluate whether any variable (e.g., prompt length, OCD subtype) biased the above ratings (t-tests, ANOVAs).
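The raw safety-agreement figures reported above reduce to simple proportions over rater triples. A sketch with hypothetical rating data (the study's actual ratings are not public):

```python
def safety_consensus(trios):
    """Proportion of hierarchies on which all three raters, and at least
    two of three raters, gave the top safety rating of 5 (strongly agree).

    `trios` is a list of (r1, r2, r3) rating tuples.
    """
    n = len(trios)
    unanimous = sum(1 for t in trios if all(r == 5 for r in t))
    majority = sum(1 for t in trios if sum(1 for r in t if r == 5) >= 2)
    return unanimous / n, majority / n
```

Applied to the study's 90 hierarchies, this kind of tally yields the reported 74.4% unanimous and 95.6% two-of-three agreement on top safety ratings.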

Results

INITIAL REVIEW

ChatGPT generated partial (n = 15, 20.8%) or complete (n = 55, 76.4%) responses to 70 of 72 initial prompts. The two prompts (2.8%) that resulted in errors (i.e., were flagged by OpenAI’s content filtering system, with no output generated) both described sexual/violent obsessions with a high number of symptoms and high detail. With a single follow-up prompt, ChatGPT produced hierarchy suggestions for these as well. Although ChatGPT occasionally failed to produce all 10 exposure suggestions, hierarchies were more often (13 out of 15, 86.7%) labeled “partial” because they included activities that did not meet the definition of an exposure (e.g., discussing one’s feelings or restructuring unhelpful thoughts). Using the initial output, suggested hierarchies were rated as having integrated most input information (M = 4.29 out of 5, SD = 0.85).

OCD subtype, symptom detail, patient age, and patient gender were all unrelated to completeness or input information integration, ps > .05 (see Table 1). Number of symptoms was not associated with task completion, but prompts with a high number of symptoms (M = 3.88, SD = 1.05) were rated as integrating less input information than prompts with both low (M = 4.56, SD = 0.54, p = .01) and moderate (M = 4.44, SD = 0.76, p = .048) number of symptoms.

Table 1.

Predictors of Task Completion

Predictor | Task Completion | Input Integration
OCD Sub-Type | χ2 = 6.11, p = .19, V = .21 | F(2,69) = 1.90, p = .16, η2 = .05
Number of Symptoms | χ2 = 4.65, p = .32, V = .18 | F(2,69) = 4.94, p = .01, η2 = .13
Symptom Detail | χ2 = 5.72, p = .06, V = .28 | t(68.15) = −0.69, p = .49, d = .16
Age | χ2 = 0.76, p = .68, V = .10 | t(67.84) = 0.28, p = .78, d = .06
Gender | χ2 = 3.05, p = .22, V = .21 | t(59.64) = 1.39, p = .17, d = .33
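The input-integration comparisons across the three symptom-number groups in Table 1 are one-way ANOVAs, where the F statistic is the ratio of between-group to within-group mean squares. An illustrative stdlib-only sketch (our own code, not the authors' analysis scripts):

```python
def one_way_f(groups):
    """One-way ANOVA F statistic and degrees of freedom.

    `groups` is a list of score lists, e.g., input-integration ratings
    grouped by low/moderate/high symptom number.
    """
    all_vals = [x for g in groups for x in g]
    grand = sum(all_vals) / len(all_vals)
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between = len(groups) - 1
    df_within = len(all_vals) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within), df_between, df_within
```

With three groups of 24 prompts each, this gives the F(2, 69) values reported in Table 1; the significant omnibus test for symptom number would then be followed by the pairwise comparisons reported in the text.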

BLINDED REVIEW

Blinded ratings are summarized in Table 2. Overall, ChatGPT-generated hierarchies were viewed as largely appropriate (M = 4.47, SD = 0.58), specific (M = 4.17, SD = 0.65), variable (M = 3.96, SD = 0.79), safe (M = 4.89, SD = 0.24), and overall useful (M = 3.99, SD = 0.82). However, expert human-generated hierarchies were still rated as significantly more appropriate, specific, variable, and useful, ps < .05, though, importantly, just as safe and ethical, p = .24. Whereas ratings of the expert human-generated hierarchies never fell below a 3 out of 5, there was more variability in the ChatGPT-generated hierarchies, and some did receive unacceptable scores.

Table 2.

Blinded Ratings of Hierarchies

Dimension | ChatGPT-Generated (N = 72): M (SD), Range | Human-Generated (N = 18): M (SD), Range | Test
Appropriateness | 4.47 (0.58), 2.33–5.0 | 4.94 (0.13), 4.67–5.0 | t(87.56) = 6.43, p < .001, d = 1.14
Specificity | 4.17 (0.65), 2.33–5.0 | 4.78 (0.34), 4.0–5.0 | t(50.85) = 5.51, p < .001, d = 1.18
Variability | 3.96 (0.79), 1.67–5.0 | 4.63 (0.41), 3.33–5.0 | t(52.08) = 5.01, p < .001, d = 1.07
Safety | 4.89 (0.24), 3.67–5.0 | 4.81 (0.29), 4.0–5.0 | t(23.53) = 1.08, p = .29, d = .30
Overall | 3.99 (0.82), 2.0–5.0 | 4.70 (0.39), 3.67–5.0 | t(57.80) = 5.34, p < .001, d = 1.11

Delving into the ChatGPT-generated hierarchies specifically, OCD subtype, number of symptoms, patient age, and patient gender were all unrelated to hierarchy ratings, ps > .05 (see Table 3). Symptom detail was a significant predictor of appropriateness, specificity, variability, and overall utility ratings, ps < .05 (but not safety ratings, p = .15), such that hierarchies were rated significantly better on these dimensions when prompts were more detailed. Examining qualitative feedback provided by the raters, specificity and overall usefulness ratings were typically lower when raters noted insufficient detail regarding ritual prevention. Variability ratings were typically lower in cases where raters felt there were not enough challenging exposures or where suggestions were focused on limited contexts (e.g., exposure only to public toilets, not including imaginal exposures).

Table 3.

Predictors of Ratings for ChatGPT-Generated Hierarchies

Predictor | Appropriateness | Specificity | Variability | Safety | Overall Usefulness
OCD Sub-Type | F(2,69) = 0.73, p = .49 | F(2,69) = 2.52, p = .09 | F(2,69) = 0.54, p = .59 | F(2,69) = 1.29, p = .28 | F(2,69) = 1.26, p = .29
Symptom Number | F(2,69) = 0.33, p = .72 | F(2,69) = 1.71, p = .19 | F(2,69) = 0.09, p = .92 | F(2,69) = 0.41, p = .66 | F(2,69) = 0.03, p = .97
Symptom Detail | F(1,70) = 6.92, p = .01 | F(1,70) = 11.51, p = .001 | F(1,70) = 8.31, p = .005 | F(1,70) = 2.16, p = .15 | F(1,70) = 6.93, p = .01
Age | F(1,70) = 1.05, p = .31 | F(1,70) = 0.13, p = .72 | F(1,70) = 0.72, p = .40 | F(1,70) = 0.65, p = .42 | F(1,70) = 0.18, p = .67
Gender | F(1,70) = 1.36, p = .25 | F(1,70) = 0.02, p = .90 | F(1,70) = 0.72, p = .40 | F(1,70) = 0.03, p = .87 | F(1,70) = 2.86, p = .10

After unblinding, raters correctly identified the ChatGPT-generated hierarchy in 94.44% (Rater 1: 17/18), 66.67% (Rater 2: 12/18), and 66.67% (Rater 3: 12/18) of pairs, with mean confidence of 86.72% (SD = 15.32%), 46.72% (SD = 16.09%), and 55.11% (SD = 14.70%), respectively.

Discussion

ChatGPT was able to generate graded exposure hierarchies for a range of OCD presentations and levels of complexity. Suggested exposures were, on average, rated as fairly appropriate, specific or actionable, variable in difficulty, safe, ethical, and useful. Whereas novice clinicians may struggle to design exposures for certain OCD subtypes, such as violent or sexual obsessions, or for more complex symptom presentations, ChatGPT showed no such bias (at least once a follow-up prompt was used to bypass the original error flags with sexual/violent obsessions). Additionally, no biases were detected for the gender or age of the patient. Encouragingly, all suggestions provided by ChatGPT were rated as safe and ethical, and safety/ethics ratings did not differ statistically from those for hierarchies created by OCD experts. This feasibility study demonstrates clear promise for using LLMs to support ERP hierarchy generation.

Still, important weaknesses in task performance indicate that ChatGPT should not be relied on exclusively for this task. The ultimate goal of this clinical support tool should be to match or exceed expert-level performance, and further testing and safeguards are needed to ensure that no harm is done. First, although the appropriateness, specificity, variability, and overall utility of hierarchies benefited when prompts contained more clinical detail, such prompts were also less likely to fully integrate all symptoms. However, the fact that some symptoms were occasionally omitted from the output, particularly where the prompt included a high number of symptoms (3 obsessions and 6 compulsions), is plausibly an artifact of the prompt constraining the output to just 10 hierarchy items. The larger issue regarding task completion scores was the inclusion of suggestions that were not exposures; although these alternative activities were sometimes benign or otherwise therapeutic (e.g., recognizing unhelpful thinking patterns), at other times they would be contraindicated (e.g., providing reassurance about the safety of home appliances). Relatedly, ChatGPT suggested using relaxation strategies during exposures in ways that would typically not be recommended (Greist et al., 2002). These results highlight a limitation of existing LLMs, which draw on broad publicly available data and thus can repeat misinformation or common misunderstandings about exposure therapy (Monteith et al., 2024; Spallek et al., 2023).

Additionally, although exposure suggestions were largely reviewed positively, a meaningful number were rated as inadequate. Specifically, raters indicated that 2.78% of generated hierarchies were not appropriate, 5.56% were not specific, 15.28% were not variable, and 18.06% were not useful (average rating ≤ 3). In contrast, no human-generated hierarchies received these scores. Unsurprisingly then, ChatGPT- and human-generated hierarchies were distinguishable more often than not, and ChatGPT-generated hierarchies received significantly lower ratings (as well as more variable ratings) across these dimensions compared to human-generated hierarchies. Some drivers of this difference include ChatGPT providing insufficient ritual prevention (e.g., limiting handwashing in addition to touching contaminated surfaces), omitting imaginal exposures where they would be appropriate, omitting difficult exposures (e.g., using a sharp knife close to a loved one), and not deviating from the specific examples provided in the prompt (e.g., all contamination exposures focused on toilets, no exposures combined multiple feared stimuli). Such weaknesses underscore the importance of leveraging clinical expertise to fine-tune models for this task. Continuous monitoring and updating would also be important; as models suggest increasingly challenging exposures, risk of suggesting dangerous or unethical exposures may increase.

Results also highlight the limitations of existing chatbots, which attempt to intuit the intent of a prompt without ever asking follow-up questions. This is problematic in clinical practice, where terminology can be imprecise, differential diagnosis critical, and personalization required (Dergaa et al., 2024; Omar et al., 2024). This was made clear in the finding that prompts with low detail received significantly lower ratings than prompts with high detail. In these cases, OCD experts drew on deep patient experience and clinical knowledge to creatively build hierarchies; in contrast, ChatGPT often generated suggestions that were too narrow or irrelevant. Future iterations must implement thoughtful prompt engineering as well as a user interface that ensures sufficient detail in the input. Additionally, to fully realize the promise of this approach and limit errors, many more OCD subtypes and presentations, as well as potentially relevant demographic and medical factors (e.g., psychiatric or medical comorbidities), must be tested.

Overall, this feasibility study should encourage greater exploration of how LLMs can empower more providers to deliver high-quality, personalized ERP. Findings align with broader calls for specially trained models (Thirunavukarasu et al., 2023). Most immediately, much of the critical literature is behind paywalls and therefore inaccessible for training widely available platforms; moreover, privileging certain sources (e.g., expert manuals) over others would likely improve the quality and reliability of the output. Clinical science advances quickly, encouraging focused attention to frequent retraining or updating of clinical models (Hulman et al., 2023; Zhavoronkov, 2023). Careful development and implementation strategies are essential to guard against the inherent risks of these tools. For example, users should be reminded to confirm the safety and feasibility of suggestions for a given patient’s context and not to enact any suggestions outside their competence. Involving provider and patient stakeholders throughout this process is essential (Vial et al., 2022). User-centered design and usability testing ensure that the tool fits the intended users’ needs, context, and preferences, integrates into clinical workflows, and ultimately improves outcomes. Central to this process will be addressing outstanding ethical and privacy questions. For example, such applications must uphold data security and privacy standards, and patients must be informed about how their data would be used. Moreover, given how new this technology is, it remains an open question how comfortable clinicians and/or patients will be sharing personal information with this type of application.

Conclusions

The first-line treatment for OCD and a range of other disorders (e.g., anxiety disorders, body dysmorphic disorder) involves exposure and response prevention (Ferrando & Selai, 2021; Greenberg & Wilhelm, 2011; Parker et al., 2018). To be effective, exposure and response prevention exercises should be tailored to the patient and graded (ranging from easy to very challenging). Generating appropriately targeted exercises is difficult for many patients and clinicians, as it typically requires considerable session time, creativity, expertise in treating a given population, and clinical experience. This is a barrier both to clinicians using exposures and to patients completing self-guided treatments (e.g., app-based programs) and sustaining gains after treatment. Despite its strong evidence base, many patients never receive adequate exposure therapy (Kohn et al., 2004; Marques et al., 2010). While some limitations still need to be addressed prior to clinical implementation, LLMs have the potential to allow more professional and paraprofessional support persons to deliver personalized, high-quality ERP for OCD, to enable patients to use self-help apps more effectively, and thus to reduce the treatment gap and associated disease burden.

Acknowledgments

The authors would like to thank Sara Velazquez for her valuable work on this project.

Research support was provided by the Hope Fund (Wilhelm, PI).

EEB has received research support from Koa Health Digital Solutions LLC. During the preparation of this manuscript, EEB had a consulting agreement with Otsuka Pharmaceutical Development & Commercialization, Inc. and was on the Scientific Advisory Board for AugMend Health Inc. During the preparation of this manuscript, RJJ was also supported by a grant from the National Institute of Mental Health (K23MH120351). The sponsor did not have a role in the study design; in the collection, analysis, and interpretation of data; in the writing of the report; or in the decision to submit the article for publication. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. ACJ has received research support from Koa Health Digital Solutions LLC. NHB, JR, and AU have no conflicts to disclose. SW has received royalties from Elsevier Publications, Guilford Publications, New Harbinger Publications, Springer, and Oxford University Press. Dr. Wilhelm has also received speaking honoraria from various academic institutions and foundations, including a research award from the National Alliance on Mental Illness. In addition, she received honoraria for her role on the Scientific Advisory Board for One-Mind (PsyberGuide), Koa Health Digital Solutions LLC, Noom, Inc., and Jimini Health. Dr. Wilhelm has received research support from Koa Health Digital Solutions LLC.

Footnotes

The project was determined to not meet the criteria for human subject research as defined by Mass General Brigham Institutional Review Board (IRB) policies and Health and Human Services regulations set forth in 45 CFR 46, and therefore did not require IRB approval.

The datasets generated and/or analyzed during the current study are not publicly available due to pending patent applications, but are available from the corresponding author on reasonable request.

References

  1. Abramowitz JS, Brigidi BD, & Roche KR (2001). Cognitive-behavioral therapy for obsessive-compulsive disorder: a review of the treatment literature. Research on Social Work Practice, 11(3), 357–372. 10.1177/104973150101100305.
  2. American Psychiatric Association (2013). Diagnostic and statistical manual of mental disorders: DSM-5. American Psychiatric Publishing.
  3. Andersson E, Enander J, Andrén P, Hedman E, Ljótsson B, Hursti T, Bergström J, Kaldo V, Lindefors N, Andersson G, & Rück C (2012). Internet-based cognitive behaviour therapy for obsessive–compulsive disorder: a randomized controlled trial. Psychological Medicine, 42(10), 2193–2203. 10.1017/S0033291712000244.
  4. Ayoub NF, Lee Y, Grimm D, & Divi V (2024). Head-to-head comparison of ChatGPT versus Google Search for medical knowledge acquisition. Otolaryngology–Head and Neck Surgery, 170(6), 1484–1491. 10.1002/ohn.465.
  5. Boisseau CL, Schwartzman CM, Lawton J, & Mancebo MC (2017). App-guided exposure and response prevention for obsessive compulsive disorder: an open pilot trial. Cognitive Behaviour Therapy, 46(6), 447–458. 10.1080/16506073.2017.1321683.
  6. Dergaa I, Fekih-Romdhane F, Hallit S, Loch AA, Glenn JM, Fessi MS, Ben Aissa M, Souissi N, Guelmami N, Swed S, El Omri A, Bragazzi NL, & Ben Saad H (2024). ChatGPT is not ready yet for use in providing mental health assessment and interventions. Frontiers in Psychiatry, 14, 1277756. 10.3389/fpsyt.2023.1277756.
  7. Eaton WW, Martins SS, Nestadt G, Bienvenu OJ, Clarke D, & Alexandre P (2008). The burden of mental disorders. Epidemiologic Reviews, 30(1), 1–14. 10.1093/epirev/mxn011.
  8. Elyoseph Z, Levkovich I, & Shinan-Altman S (2024). Assessing prognosis in depression: comparing perspectives of AI models, mental health professionals and the general public. Family Medicine and Community Health, 12(Suppl 1), e002583. 10.1136/fmch-2023-002583.
  9. Fahmi H, Kakamad YN, Abbas HA, Hassan DQH, Hasan SJ, Omer D, Kakamad SH, HamaSalih HM, Hassan MN, Rahim HM, Salih RQ, Abdalla BA, Mohammed SH, & Mahmood YM (2023). Role of ChatGPT and Google Bard in the diagnosis of psychiatric disorders: a comparative study. Barw Medical Journal. 10.58742/4vd6h741.
  10. Ferrando C, & Selai C (2021). A systematic review and meta-analysis on the effectiveness of exposure and response prevention therapy in the treatment of Obsessive-Compulsive Disorder. Journal of Obsessive-Compulsive and Related Disorders, 31, 100684. 10.1016/j.jocrd.2021.100684.
  11. Geiger Y, Van Oppen P, Visser H, Eikelenboom M, Van Den Heuvel OA, & Anholt GE (2024). Long-term remission rates and trajectory predictors in obsessive-compulsive disorder: findings from a six-year naturalistic longitudinal cohort study. Journal of Affective Disorders, 350, 877–886. 10.1016/j.jad.2024.01.155.
  12. Gershkovich M, Middleton R, Hezel DM, Grimaldi S, Renna M, Basaraba C, Patel S, & Simpson HB (2021). Integrating exposure and response prevention with a mobile app to treat obsessive-compulsive disorder: feasibility, acceptability, and preliminary effects. Behavior Therapy, 52(2), 394–405. 10.1016/j.beth.2020.05.001.
  13. Gillihan SJ, Williams MT, Malcoun E, Yadin E, & Foa EB (2012). Common pitfalls in exposure and response prevention (EX/RP) for OCD. Journal of Obsessive-Compulsive and Related Disorders, 1(4), 251–257. 10.1016/j.jocrd.2012.05.002.
  14. Greenberg JL, & Wilhelm S (2011). Cognitive-behavioral therapy for body dysmorphic disorder: a review and future directions. International Journal of Cognitive Therapy, 4(4), 349–362. 10.1521/ijct.2011.4.4.349.
  15. Greist JH, Marks IM, Baer L, Kobak KA, Wenzel KW, Hirsch MJ, Mantle JM, & Clary CM (2002). Behavior therapy for obsessive-compulsive disorder guided by a computer or by a clinician compared with relaxation as a control. The Journal of Clinical Psychiatry, 63(2), 138–145. 10.4088/JCP.v63n0209.
  16. Heston T, & Khun C (2023). Prompt engineering in medical education. International Medical Education, 2(3), 198–205. 10.3390/ime2030019.
  17. Hezel D, & Simpson HB (2019). Exposure and response prevention for obsessive-compulsive disorder: a review and new directions. Indian Journal of Psychiatry, 61(7), 85. 10.4103/psychiatry.IndianJPsychiatry_516_18.
  18. Hulman A, Dollerup OL, Mortensen JF, Fenech ME, Norman K, Støvring H, & Hansen TK (2023). ChatGPT- versus human-generated answers to frequently asked questions about diabetes: a Turing test-inspired survey among employees of a Danish diabetes center. PLOS ONE, 18(8), e0290773. 10.1371/journal.pone.0290773.
  19. Hwang H, Bae S, Hong JS, & Han DH (2021). Comparing effectiveness between a mobile app program and traditional cognitive behavior therapy in obsessive-compulsive disorder: evaluation study. JMIR Mental Health, 8(1), e23778. 10.2196/23778.
  20. Jacoby RJ, Berman NC, Reese HE, Shin J, Sprich S, Szymanski J, Pollard CA, & Wilhelm S (2019). Disseminating cognitive-behavioral therapy for obsessive compulsive disorder: comparing in person vs. online training modalities. Journal of Obsessive-Compulsive and Related Disorders, 23, 100485. 10.1016/j.jocrd.2019.100485.
  21. Kohn R, Saxena S, Levav I, & Saraceno B (2004). The treatment gap in mental health care. Bulletin of the World Health Organization, 82(11), 858–866.
  22. Kyrios M, Ahern C, Fassnacht DB, Nedeljkovic M, Moulding R, & Meyer D (2018). Therapist-assisted internet-based cognitive behavioral therapy versus progressive relaxation in obsessive-compulsive disorder: randomized controlled trial. Journal of Medical Internet Research, 20(8), e242. 10.2196/jmir.9566.
  23. Levkovich I, & Elyoseph Z (2023a). Identifying depression and its determinants upon initiating treatment: ChatGPT versus primary care physicians. Family Medicine and Community Health, 11(4). 10.1136/fmch-2023-002391.
  24. Levkovich I, & Elyoseph Z (2023b). Suicide risk assessments through the eyes of ChatGPT-3.5 versus ChatGPT-4: vignette study. JMIR Mental Health, 10, e51232. 10.2196/51232.
  25. Li H, Zhang R, Lee Y-C, Kraut RE, & Mohr DC (2023). Systematic review and meta-analysis of AI-based conversational agents for promoting mental health and well-being. npj Digital Medicine, 6(1), 1–14. 10.1038/s41746-023-00979-5.
  26. Li J, Dada A, Puladi B, Kleesiek J, & Egger J (2024). ChatGPT in healthcare: a taxonomy and systematic review. Computer Methods and Programs in Biomedicine, 245, 108013.
  27. Lundström L, Flygare O, Andersson E, Enander J, Bottai M, Ivanov VZ, Boberg J, Pascal D, Mataix-Cols D, & Rück C (2022). Effect of internet-based vs face-to-face cognitive behavioral therapy for adults with obsessive-compulsive disorder: a randomized clinical trial. JAMA Network Open, 5(3), e221967. 10.1001/jamanetworkopen.2022.1967.
  28. Mahoney AEJ, Mackenzie A, Williams AD, Smith J, & Andrews G (2014). Internet cognitive behavioural treatment for obsessive compulsive disorder: a randomised controlled trial. Behaviour Research and Therapy, 63, 99–106. 10.1016/j.brat.2014.09.012.
  29. Marques L, LeBlanc NJ, Weingarden HM, Timpano KR, Jenike M, & Wilhelm S (2010). Barriers to treatment and service utilization in an internet sample of individuals with obsessive-compulsive symptoms. Depression and Anxiety, 27(5), 470–475. 10.1002/da.20694.
  30. McKay D, Abramowitz JS, Calamari JE, Kyrios M, Radomsky A, Sookman D, Taylor S, & Wilhelm S (2004). A critical evaluation of obsessive–compulsive disorder subtypes: symptoms versus mechanisms. Clinical Psychology Review, 24(3), 283–313. 10.1016/j.cpr.2004.04.003.
  31. Monteith S, Glenn T, Geddes JR, Whybrow PC, Achtyes E, & Bauer M (2024). Artificial intelligence and increasing misinformation. The British Journal of Psychiatry, 224(2), 33–35. 10.1192/bjp.2023.136.
  32. Nagy NE, El-serafi DM, Elrassas HH, Abdeen MS, & Mohamed DA (2020). Impulsivity, hostility and suicidality in patients diagnosed with obsessive compulsive disorder. International Journal of Psychiatry in Clinical Practice, 24(3), 284–292. 10.1080/13651501.2020.1773503.
  33. Omar M, Soffer S, Charney AW, Landi I, Nadkarni GN, & Klang E (2024). Applications of large language models in psychiatry: a systematic review. Frontiers in Psychiatry, 15. 10.3389/fpsyt.2024.1422807.
  34. Parker ZJ, Waller G, Duhne PGS, & Dawson J (2018). The role of exposure in treatment of anxiety disorders: a meta-analysis. International Journal of Psychology and Psychological Therapy, 18(1), 111–141.
  35. Pinto A, Mancebo MC, Eisen JL, Pagano ME, & Rasmussen SA (2006). The Brown longitudinal obsessive compulsive study: clinical features and symptoms of the sample at intake. The Journal of Clinical Psychiatry, 67(5), 703–711. 10.4088/JCP.v67n0503.
  36. Rosa-Alcázar AI, Sánchez-Meca J, Gómez-Conesa A, & Marín-Martínez F (2008). Psychological treatment of obsessive–compulsive disorder: a meta-analysis. Clinical Psychology Review, 28(8), 1310–1325. 10.1016/j.cpr.2008.07.001.
  37. Ruscio AM, Stein DJ, Chiu WT, & Kessler RC (2010). The epidemiology of obsessive-compulsive disorder in the National Comorbidity Survey Replication. Molecular Psychiatry, 15(1), 53–63. 10.1038/mp.2008.94.
  38. Sezgin E, Chekeni F, Lee J, & Keim S (2023). Clinical accuracy of large language models and Google Search responses to postpartum depression questions: cross-sectional study. Journal of Medical Internet Research, 25, e49240. 10.2196/49240.
  39. Singh A, Anjankar VP, & Sapkale B (2023). Obsessive-compulsive disorder (OCD): a comprehensive review of diagnosis, comorbidities, and treatment approaches. Cureus. 10.7759/cureus.48960.
  40. Spallek S, Birrell L, Kershaw S, Devine EK, & Thornton L (2023). Can we use ChatGPT for mental health and substance use education? Examining its quality and potential harms. JMIR Medical Education, 9, e51243. 10.2196/51243.
  41. Thirunavukarasu AJ, Ting DSJ, Elangovan K, Gutierrez L, Tan TF, & Ting DSW (2023). Large language models in medicine. Nature Medicine, 29(8), 1930–1940. 10.1038/s41591-023-02448-8.
  42. Vial S, Boudhraâ S, & Dumont M (2022). Human-centered design approaches in digital mental health interventions: exploratory mapping review. JMIR Mental Health, 9(6), e35591. 10.2196/35591.
  43. Zhavoronkov A (2023). Caution with AI-generated content in biomedicine. Nature Medicine, 29(3), 532. 10.1038/d41591-023-00014-w.