Abstract
While Large Language Models (LLMs) are being quickly adapted to many domains, including healthcare, their strengths and pitfalls remain under-explored. In our study, we examine the effects of prompt engineering on guiding LLMs to deliver parts of a Problem-Solving Therapy (PST) session via text, particularly during the symptom identification and assessment phase for personalized goal setting. We present evaluations of the models’ performance by automatic metrics and experienced medical professionals. We demonstrate that the models’ capability to deliver protocolized therapy can be improved with the proper use of prompt engineering methods, albeit with limitations. To our knowledge, this study is among the first to assess the effects of various prompting techniques on enhancing a generalist model’s ability to deliver psychotherapy, focusing on overall quality, consistency, and empathy. Given the current shortage of mental health professionals amid significant need, exploring LLMs’ potential to deliver psychotherapy could enhance the utility of AI-based and AI-enhanced care services.
Introduction
Numerous studies have demonstrated the potential of Large Language Model (LLM) usage in medical applications, ranging from question-answering services to producing medical notes [1, 2]. Deployment of LLMs in AI-supported care services holds the potential to mitigate healthcare costs and broaden access to care. Companies are already offering products that attempt to undertake the roles of administrative and medical clinicians through the use of LLMs.1 One of the fields that could greatly benefit from additional resources is psychotherapy, with as many as 20% of people worldwide needing mental health care [3] in the context of a global scarcity of mental health professionals [4].
Previously, studies have demonstrated the advantages of using relevant and empathetic responses in mental health dialogues [5, 6]. Althoff et al. [5], in particular, showed that more successful human counselors use fewer templated replies and produce varied responses to similar questions. LLMs’ ability to generate coherent and contextually appropriate responses may provide an ideal tool for simulating such behavior, though concerns remain about their safety and utility as patient-facing tools. In this paper, we describe a study of LLMs’ ability to provide such relevant and empathetic responses in real-time psychotherapy dialogues.
In this study, we explore the ability of off-the-shelf LLMs to deliver Problem-Solving Therapy (PST) for family caregivers, targeting common caregiving symptoms, such as fatigue and anxiety, through a dialogue system that provides caregivers the tools to self-monitor symptoms, problem-solve, and take appropriate actions. PST is an effective form of cognitive behavioral therapy with a specific protocol, making it a good test case for LLM performance, and providing the model with specific guidelines to follow.
We demonstrate that non-medical LLMs can deliver PST with reasonable quality, owing to their remarkable ability to use information provided in the prompt to improve their performance [7]. We adapt an LLM flow that already showed good performance for medical Q&A [1] to medical dialogue generation. We explore the extent to which a general-purpose LLM such as GPT-4 can be improved by methods that do not modify the model’s weights and compare it with a previously developed human-curated rule-based system [8] in the context of PST for family caregivers, using standardized patients portrayed by actors. The actors were provided sample responses and narratives describing the personas they were playing to ensure consistent behavior in their interactions with the model. To evaluate the LLM’s performance on this domain-specific task meaningfully, we recruited multiple clinicians to rate the dialogues without knowing how the dialogues were created. Hence, our work has two primary objectives. The first is to investigate the use of prompt engineering methods to improve a general-purpose LLM’s ability to deliver steps of PST. The second is to provide comprehensive automated and human evaluations of the empathy and overall quality of the models’ output.
Related Work
Existing work to improve the accuracy of LLMs for medical applications leverages (1) fine-tuning the model’s weights through further training on application-specific data and/or (2) developing novel ways to query the model to elicit a high-quality response by altering the prompt or including relevant information or examples in it, a method called prompt engineering. Because the results of fine-tuning reflect the quality of the fine-tuning corpora, effective fine-tuning typically necessitates expensive manual review and in-domain expert curation to produce meaningful improvements in the model’s responses [9], in addition to the costly hardware needed to run fine-tuning algorithms. In the present study, we focus on prompt engineering, a promising option that has been shown to produce comparable improvements without resource-intensive fine-tuning [1]. To our knowledge, this is the first study on improving the accuracy of an LLM delivering PST through prompt engineering [10].
Usage of Large Language Models as Therapeutic Chatbots. Literature on the usage of LLMs in psychotherapy is nascent. A few published studies have explored LLMs’ ability to lead a therapeutic conversation with a user [11, 12, 10]. Fu et al. described using an LLM to augment a human therapist, providing helpful suggestions rather than singularly leading the conversation, i.e., keeping a human in the loop who can detect hallucinations generated by the LLM and increase the controllability of the system [12]. While such a study provides a setup well-suited for current LLMs, our work explores the current limits of the models’ performance in fully autonomous settings to explore the potential for full deployment. Wang et al. fine-tuned a GPT-2 model to generate suitable PST responses [10]. However, their work did not explore prompt engineering methods that recently rose to prominence due to their effectiveness with larger models [1], producing results comparable to those of fine-tuned models [13]. A recent study [11] explored zero-shot prompt engineering to facilitate a diagnostic conversation based on the Diagnostic and Statistical Manual of Mental Disorders-5 standard [14]. The authors explored questions similar to ours, achieving an increase in the empathy of the model’s responses via zero-shot prompting. In this study, we expand on prompt engineering methods by using recently emerged techniques including Zero-Shot [15], Few-Shot [7], and Zero-Shot Chain-of-Thought [16], which we will refer to as Chain-of-Thought (CoT) hereafter.
Zero-shot prompting: in this setup, a model receives only a natural language instruction to perform a task, without prior demonstrations. Relying on the model’s pre-trained knowledge, this approach maximizes convenience and potential for robustness, although it is typically the least effective because no examples clarify the task’s format or expected output [15].

Few-shot prompting: in few-shot, a model is given a small number of input-output pairs as a guide. This approach enables LLMs to continue generating appropriate output for similar inputs without fine-tuning the model’s weights [7]. Optimal example selection is key in few-shot prompting. Brown et al. [7] demonstrated that incorporating up to eight high-quality examples in prompts significantly improved GPT-3’s performance on various natural language benchmarks.

Chain-of-Thought (CoT): CoT leads LLMs through a step-by-step reasoning process, effectively encouraging them to “think out loud”. Although the specific phrasing can vary without drastically affecting performance, its presence is crucial for improving problem-solving capabilities [16]. This technique encourages the model to generate intermediate tokens, leading to more robust responses. CoT has been shown to significantly boost the problem-solving performance of LLMs across complex tasks, from arithmetic to commonsense challenges [16].
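As a minimal illustration of how these three techniques differ in practice, the sketch below assembles chat-style prompts for a hypothetical caregiver message; the message structure follows the common chat-completion format, and all strings are illustrative placeholders rather than the prompts used in this study.

```python
# Illustrative sketch (not the study's actual prompts) contrasting zero-shot,
# few-shot, and Chain-of-Thought prompting for a single therapy turn.

ROLE_INSTRUCTION = (
    "You are a Problem-Solving Therapy (PST) assistant for family caregivers. "
    "Ask open-ended questions to identify and assess one significant symptom."
)

def zero_shot(user_msg: str) -> list[dict]:
    # Instruction only: the model relies entirely on its pre-trained knowledge.
    return [
        {"role": "system", "content": ROLE_INSTRUCTION},
        {"role": "user", "content": user_msg},
    ]

def few_shot(user_msg: str, examples: list[tuple[str, str]]) -> list[dict]:
    # Curated (caregiver, therapist) pairs are prepended as in-context demonstrations.
    messages = [{"role": "system", "content": ROLE_INSTRUCTION}]
    for caregiver_turn, therapist_turn in examples:
        messages.append({"role": "user", "content": caregiver_turn})
        messages.append({"role": "assistant", "content": therapist_turn})
    messages.append({"role": "user", "content": user_msg})
    return messages

def chain_of_thought(user_msg: str) -> list[dict]:
    # Zero-shot CoT: a short cue nudges the model toward step-by-step reasoning.
    cot_instruction = ROLE_INSTRUCTION + " Think about the user's input step by step."
    return [
        {"role": "system", "content": cot_instruction},
        {"role": "user", "content": user_msg},
    ]
```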
Methods
We used various prompt engineering techniques and their combinations to improve LLM performance when delivering a PST session as a therapist bot. We were guided by findings from Nori et al., who showed that by utilizing various standard prompt engineering techniques, a generalist model, in particular the state-of-the-art GPT-4, can respond to medical questions with accuracy on par with models specifically fine-tuned for this task [1]. We utilize the same methods and examine whether their insights can be applied to psychotherapy. While in this study we used GPT-4 through Microsoft Azure’s Application Programming Interface (API), at the time of writing GPT-4 is also available to a wide audience as the backend of OpenAI’s ChatGPT conversational agent.2
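For context, the snippet below is a hedged sketch of how one might query a GPT-4 deployment through the Azure OpenAI Python SDK with an engineered system prompt; the endpoint, key, API version, deployment name, and temperature are placeholders and assumptions, not settings reported in this study.

```python
import os
from openai import AzureOpenAI  # assumes the openai>=1.x Python SDK

# Endpoint, key, API version, and deployment name below are placeholders.
client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-02-01",
)

def pst_turn(system_prompt: str, history: list[dict], user_msg: str) -> str:
    """Send the engineered system prompt plus the running dialogue to GPT-4."""
    messages = (
        [{"role": "system", "content": system_prompt}]
        + history
        + [{"role": "user", "content": user_msg}]
    )
    response = client.chat.completions.create(
        model="gpt-4",    # name of the Azure deployment (placeholder)
        messages=messages,
        temperature=0.7,  # illustrative value; decoding settings are not reported here
    )
    return response.choices[0].message.content
```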
In preliminary experiments, our team evaluated multiple prompting strategies for delivering PST and found that model behavior was often inconsistent with expectations. Models at times produced low-quality or generally incoherent responses when performing full PST. To mitigate this, we decided to focus on the two PST steps that were the hardest to control and most prone to producing unexpected and potentially harmful dialogue outputs: problem selection and identification (which, in our scenario for family caregivers, we named caregiving symptom assessment) and goal setting. By merging our technical insights in prompt engineering with clinicians’ knowledge of PST, we designed prompts specifically relevant to these two steps of PST. As Van Veen et al. [2] have previously shown, human assessments of LLM performance do not always closely correlate with automated metrics, especially in domains requiring expert knowledge, such as healthcare. Hence, in addition to using an automated empathy evaluation method developed by Sharma et al. [6] to evaluate the models’ responses, we had a group of clinicians familiar with PST evaluate our dialogues without knowledge of whether a dialogue was LLM-derived. We provide more details about the prompt design and the evaluation process below.
Designing the prompts
We started with a naive prompt shown in Table 1 and used it as a baseline. We then layered in the three prompt engineering techniques: zero-shot, few-shot, and chain-of-thought prompting. We also experimented with combinations of these techniques for better downstream performance.
Table 1:
Our prompts categorized by various types of prompting techniques used
| Type of Prompt | Content of the Prompt |
|---|---|
| Baseline | Your responsibility is to guide the conversation with a caregiver through the principles of PST to improve one significant symptom the caregiver is experiencing. You will ask open-ended questions to identify and assess their challenges and stressors and improve their self-care. Avoid focusing on the care receiver. Remember, your job is to help the caregiver. |
| | When the caregiver asks for goal suggestions by saying ‘Can you suggest some goals for me?’, take this as your cue to thoroughly review the conversation you’ve had with them. Concentrate on identifying their unique needs and aspirations as discussed. After this review, generate two concise, achievable, and personalized goals that directly address and support their expressed needs and aspirations. Ensure these goals are not only realistic but designed to inspire and boost the caregiver’s motivation. |
| Zero-Shot | In the process of identifying and assessing the caregiver’s symptoms, assess all five aspects if they have not been mentioned in the conversation: symptom identification, symptom frequency, symptom context, symptom severity, and previous measures. |
| Few-Shot | Below are ideal dialogue examples illustrating how you, the assistant, should evaluate and address the challenges and stressors of caregivers, referred to here as the user, during conversations. |
| | **Attached would be 9 excerpts from multi-turn PST therapy with a human therapist** |
| Chain-of-Thought | Think about the user’s input step by step. |
Baseline: Leveraging insights from the RoleLLM framework [17], which highlights substantial improvements in language model performance through role-conditioned instruction tuning, we started by implementing a role-playing approach to frame our chatbot as a PST assistant. This foundational prompt within our system prompt architecture, shown in Table 1, is composed of two parts that serve distinct functions. First, it defines the chatbot’s role as an assistant to caregivers, delineating its core functions, tasks, and desired behaviors to facilitate conversations aligned with PST principles, aimed at the identification and analysis of caregivers’ symptoms. Second, it governs the model’s behavior during the goal-setting stage, encouraging it to suggest appropriate, personalized goals.

Zero-Shot: To enhance the chatbot’s ability to generate structured dialogue following the PST protocol, we built upon the baseline prompt with the instructions shown in the second row of Table 1, reflecting the sub-stages necessary for accomplishing symptom identification and assessment as well as the goal suggestion steps of PST. Central to our strategy is precise language explaining five predefined criteria necessary for the successful identification and assessment of caregiver symptoms: Symptom Identification, Symptom Frequency, Symptom Context, Symptom Severity, and Previous Measures [8]. This approach attempts to enhance the depth and relevance of the chatbot’s interactions, grounding its dialogue framework in these five essential aspects of the symptom identification and analysis step of PST.

Few-Shot: Our implementation of the few-shot prompting method incorporates nine high-quality examples curated by clinicians as in-context learning examples to capture the complexities of symptom identification and analysis within the PST framework [18]. These examples demonstrate to the model a set of ideal responses to realistic user inputs, providing comprehensive coverage of the symptom identification and analysis phase. The first six examples present single-turn dialogues addressing the five sub-steps of PST listed in the zero-shot prompt (see the second row in Table 1), spanning symptom identification and assessment, including frequency, context, severity, and prior mitigation efforts. The remaining examples feature multi-turn dialogues designed to enhance the chatbot’s capability to conduct coherent and contextually relevant conversations within PST. The multi-turn examples are essential for demonstrating the intricacies of ongoing therapeutic dialogue, including the need for follow-up questions and the reiteration of points made to the user during the dialogue.

Chain-of-Thought: Our implementation of CoT attempts to systematically improve the chatbot’s analytical and problem-solving capabilities by forcing it to be more ‘thoughtful’ in its style of output, which can lead to better outcomes as demonstrated in previous studies [1, 15].
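To make this layering concrete, the sketch below shows one way the Table 1 components could be concatenated into a single system prompt; the baseline and CoT strings are abbreviated from Table 1, while the zero-shot criteria wording and the few-shot example placeholder are illustrative assumptions rather than the exact prompts used in the study.

```python
# Sketch of assembling a layered system prompt from the Table 1 components.
# Baseline and CoT text are abbreviated from Table 1; other strings are placeholders.

BASELINE = (
    "Your responsibility is to guide the conversation with a caregiver through "
    "the principles of PST to improve one significant symptom the caregiver is "
    "experiencing. ..."  # truncated; see Table 1 for the full text
)

ZERO_SHOT_CRITERIA = (
    "In the process of identifying and assessing the caregiver's symptoms, assess "
    "all five aspects if they have not been mentioned in the conversation: symptom "
    "identification, symptom frequency, symptom context, symptom severity, and "
    "previous measures."
)

FEW_SHOT_EXAMPLES = "..."  # nine clinician-curated example dialogues (placeholder)

COT_CUE = "Think about the user's input step by step."

def compose_system_prompt(use_zero_shot: bool, use_few_shot: bool, use_cot: bool) -> str:
    """Concatenate the enabled prompt components into one system prompt."""
    parts = [BASELINE]
    if use_zero_shot:
        parts.append(ZERO_SHOT_CRITERIA)
    if use_few_shot:
        parts.append(FEW_SHOT_EXAMPLES)
    if use_cot:
        parts.append(COT_CUE)
    return "\n\n".join(parts)
```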
Model Development
We developed the models gradually, each step aimed at mitigating the faults noticed at the previous stage (Table 2). We produced models of various complexity, which we then evaluated for their ability to achieve the intended objectives.
Table 2:
Overview of Model Structures
Model 0: The baseline model, Model 0, introduces the chatbot in its role of guiding caregivers through step 3, “Identify and Assess Symptom,” and step 4, “Goal Setting,” following the principles of PST.

Models 1 and 2: Expanding on Model 0, Models 1 and 2 improve the chatbot’s ability to follow the PST structure by introducing key prompt engineering techniques. Model 1 builds upon Model 0 by integrating a set of structured guidelines into the prompt via zero-shot prompting; this approach systematically guides the chatbot to assess the caregiver’s challenges and stressors in a specific order. Model 2 advances Model 0 through the integration of few-shot learning, which leverages selected single-turn and multi-turn example dialogues from previously recorded sessions to set guidelines for the conversation flow.

Models 3 through 6: In our further explorations, inspired by the methodology of Nori et al. [1], we utilize various combinations of prompt engineering techniques across Models 3 to 6. Model 3 integrates the zero-shot structured symptom assessment with few-shot learning to produce more robust output that more closely follows the intended response style. Model 4 additionally incorporates CoT prompting to improve the model’s problem-solving results. Models 5 and 6 adjust these components to explore their combined effects on the chatbot’s PST performance, with a focus on more effective goal setting.

Model 7: While not a model per se, we also curated four dialogues from the rule-based chatbot baseline, following our previous work [8], to compare against our methods.
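Building on the prompt-assembly sketch above, the explicitly described variants can be summarized as a hypothetical configuration map; Models 5 and 6 are omitted because the text describes their adjustments only qualitatively.

```python
# Hypothetical map of model variants to the prompt components they combine.
# Only variants whose composition is stated explicitly above are listed.
MODEL_COMPONENTS = {
    0: dict(use_zero_shot=False, use_few_shot=False, use_cot=False),  # baseline only
    1: dict(use_zero_shot=True,  use_few_shot=False, use_cot=False),  # + structured criteria
    2: dict(use_zero_shot=False, use_few_shot=True,  use_cot=False),  # + curated examples
    3: dict(use_zero_shot=True,  use_few_shot=True,  use_cot=False),
    4: dict(use_zero_shot=True,  use_few_shot=True,  use_cot=True),   # all techniques
}

# Example: the system prompt for Model 3, reusing compose_system_prompt()
# from the earlier sketch.
# prompt_model_3 = compose_system_prompt(**MODEL_COMPONENTS[3])
```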
Generating Dialogues for Evaluation
To achieve a higher level of consistency and protect caregivers’ privacy, we used personas to generate dialogues. We first crafted four caregiver personas with exemplary replies based on our prior work with family caregivers. We recruited three research team members with direct professional experiences interacting with family caregivers. They were asked to portray one or two of the personas and converse with each bot therapist (Model 0 through 6 in Table 2). They were instructed to use consistent and exemplary replies as much as possible. We collected 28 PST dialogues (4 personas * 7 models). We randomized the sequence of the models and did not disclose which specific models they interacted with. After we collected all the dialogues, the research team examined the conversations and found that Models 1 and 6 had at least one dialogue in which the bot therapist asked all five symptom assessment questions at once. When the caregiver responded to one of the questions, the bot therapist could not recognize that one question had been answered and repeated all the questions. This behavior was undesired and would automatically receive low ratings in the human evaluation. Thus, we eliminated them in this preliminary step prior to human evaluation. The models that proceeded to human evaluation were Models 0, 2, 3, 4, and 5. In addition, the research team curated a set of four dialogues with human-delivered PST sessions. These dialogues were curated based on the four personas and the research team’s prior rule-based bot-delivered PST protocol [8]. In total, we had 24 dialogues (6 models * 4 personas) that proceeded to human evaluation.
Evaluating Dialogues - Human Evaluation
We recruited seven clinicians, namely four nurses and three clinical psychologists, to evaluate the quality of the therapy responses. The evaluators were unaware of the models and evaluated the dialogues in random sequence. Evaluators rated each dialogue on two aspects, conversational quality and perception of relational capacity, adopted from the chatbot evaluation mechanisms by Zhang et al. [19]. For conversational quality, since the generated dialogues focused specifically on the “symptom assessment” and “goal setting” steps of PST, the evaluators were asked to rate the quality of these two steps. To evaluate the symptom assessment step, we asked the evaluators to consider whether the therapist assessed all five aspects of symptoms. To evaluate the goal-setting step, evaluators were asked to assess the appropriateness of the goals suggested by the therapist in the dialogue. For the perception of relational capacity, we focused on therapist empathy, which has been shown to be a predictor of therapy outcomes [20]. We employed the three communication mechanisms for measuring empathy developed by Sharma et al. [6]: Emotional Reactions (ER), Interpretations (IP), and Explorations (EX). Strong empathetic communication expresses emotions reacting to what the user said (Emotional Reactions), communicates an understanding of the user’s feelings or experiences (Interpretations), and explores the user’s feelings and experiences that are not stated (Explorations) [6]. Evaluators rated all three aspects. Additionally, we asked evaluators to provide a rating based on their overall impression of the therapy session. Details about the evaluation components and questions are included in Table 3. Moreover, to gain a deeper understanding of the ratings and the factors contributing to them, we also asked the evaluators to provide a brief rationale for each rating.
Table 3:
Evaluation components
| Dimensions | Statements or Questions | Measurements |
|---|---|---|
| Symptom assessment | The therapist successfully assessed the five aspects of the caregiver’s symptoms. | 5-point Likert scale (1 = Completely disagree; 5 = Completely agree) |
| Goal setting | The goal suggested by the therapist is appropriate for the caregiver. | |
| Emotional Reactions | The therapist expressed emotions such as warmth, compassion, concern, or similar feelings towards the caregivers. | 3-point Likert scale (0 - therapist not expressing them at all; 1 - therapist expressing them to some weak degree; 2 - therapist expressing them strongly.) We used the framework from Sharma et al. [6] to evaluate peer-to-peer dialogues for both algorithmic and human evaluation. |
| Interpretations | The therapist communicated an understanding of feelings and experiences inferred from the caregiver’s responses. | |
| Explorations | The therapist improved their understanding of the caregiver by exploring the feelings and experiences not stated in the caregiver’s response. | |
| Overall | Overall, how was this therapy session? | 5-point Likert scale (1 - very bad; 5 - very good) |
Evaluating Dialogues - Automatic Evaluation
In addition to human evaluation, we adapted the algorithm from Sharma et al. [6] to rate the conversations’ empathy. We used the default implementation of the algorithm provided on the authors’ GitHub repository, a base RoBERTa classifier trained on Reddit dialogue data labeled by the authors. The classifier quantifies the model’s Emotional Reactions (ER), measuring its ability to express positive emotions when responding to a user’s post; Interpretations (IP), evaluating the model’s ability to produce a relevant shared experience; and Explorations (EX), which captures a level of active interest and engagement with the user’s post.
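Since the exact interface of the released implementation may differ, the sketch below shows only the general shape of a RoBERTa-based empathy scorer using the Hugging Face transformers library; the checkpoint name is a placeholder, not the trained model released by Sharma et al., and dialogue-level scores would be aggregated from per-turn predictions.

```python
import torch
from transformers import RobertaTokenizer, RobertaForSequenceClassification

# Placeholder checkpoint: the actual trained empathy models are released by
# Sharma et al. on their GitHub repository. This generic RoBERTa classifier
# only illustrates the scoring loop (labels 0/1/2 = none / weak / strong).
CHECKPOINT = "roberta-base"  # placeholder, not the trained empathy model

tokenizer = RobertaTokenizer.from_pretrained(CHECKPOINT)
model = RobertaForSequenceClassification.from_pretrained(CHECKPOINT, num_labels=3)
model.eval()

def empathy_level(seeker_post: str, response_post: str) -> int:
    """Score one (seeker, response) pair on a 0-2 scale for one mechanism."""
    inputs = tokenizer(seeker_post, response_post, return_tensors="pt",
                       truncation=True, max_length=256)
    with torch.no_grad():
        logits = model(**inputs).logits
    return int(logits.argmax(dim=-1).item())

# Dialogue-level scores, as reported in Table 4, would then be averages of
# per-turn levels for each of ER, IP, and EX.
```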
Data Analysis
Each of the six models included in the evaluation received 28 sets of ratings (4 persona-based dialogues * 7 evaluators). We computed the averages and standard deviations of the ratings for each model. We report on the models with the highest and lowest ratings on each evaluation component. Three team members conducted a rapid deductive analysis of the qualitative rationales provided by the evaluators to identify specific factors contributing to the ratings.
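As an illustration of the rating aggregation (the numbers below are made up, not study data), the means and standard deviations reported in Table 4 can be computed per model and evaluation dimension as follows.

```python
import pandas as pd

# Illustrative ratings only: one row per (model, dimension) score from an evaluator.
ratings = pd.DataFrame(
    {
        "model": [0, 0, 2, 2],
        "dimension": ["overall", "overall", "overall", "overall"],
        "score": [4, 3, 5, 4],
    }
)

# Mean and standard deviation per model and dimension, as reported in Table 4.
summary = ratings.groupby(["model", "dimension"])["score"].agg(["mean", "std"]).round(2)
print(summary)
```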
Results
Table 4 shows the mean scores with standard deviations of both the human and automated evaluations. In general, Models 2, 3, and 4 were rated higher compared to the human-curated rule-based system (Model 7) in all evaluated areas and delivered the two steps of PST successfully. Overall, evaluators rated Model 2 the best, which is also the highest-rated model for goal-setting and the exploration dimension of empathy. The human-curated rule-based model was rated the lowest across all evaluated aspects except for symptom assessment.
Table 4:
Evaluation results. The results are reported in Mean (Standard Deviation) format.
| Model | Symptom Assessment (1-5), Human | Goal Setting (1-5), Human | Emotional Reactions (0-2), Human | Emotional Reactions (0-2), Algorithm | Interpretations (0-2), Human | Interpretations (0-2), Algorithm | Explorations (0-2), Human | Explorations (0-2), Algorithm | Overall (1-5), Human |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 3.15 (1.35) | 4.00 (1.09) | 1.56 (0.58) | 0.83 (0.17) | 1.59 (0.64) | 0.04 (0.06) | 1.50 (0.69) | 0.79 (0.21) | 3.68 (1.06) |
| 2 | 4.68 (0.55) | 4.74 (0.45) | 1.71 (0.46) | 0.93 (0.08) | 1.64 (0.49) | 0.06 (0.10) | 1.39 (0.74) | 1.22 (0.08) | 4.36 (0.73) |
| 3 | 4.75 (0.44) | 4.46 (0.74) | 1.57 (0.57) | 0.92 (0.19) | 1.68 (0.55) | 0.11 (0.02) | 1.21 (0.74) | 1.29 (0.27) | 3.82 (0.90) |
| 4 | 4.80 (0.41) | 4.25 (0.93) | 1.68 (0.48) | 0.91 (0.18) | 1.75 (0.44) | 0.02 (0.03) | 1.32 (0.61) | 1.31 (0.16) | 3.82 (0.79) |
| 5 | 3.70 (1.35) | 3.86 (1.38) | 1.64 (0.49) | 0.93 (0.25) | 1.71 (0.46) | 0 (0) | 1.35 (0.69) | 0.76 (0.18) | 3.61 (1.13) |
| 7 | 4.62 (0.5) | 3.59 (1.01) | 1.22 (0.80) | 0.89 (0.14) | 1.22 (0.89) | 0.04 (0.06) | 0.77 (0.65) | 1.31 (0.13) | 3.19 (1.18) |
Symptom assessment. For symptom assessment, the improved models (Models 2-5) were all able to assess a caregiver’s symptom in at least four turns of dialogue. When the “caregiver” shared a challenge such as “I’m feeling overwhelmed caring for my father with cancer and looking for support,” the bot therapist would first provide an empathetic response, followed by a therapeutic component that continued the PST process. For example, one response by a bot to this specific “caregiver” was, “Hello, I’m really glad you reached out for support. It’s incredibly important to take care of yourself, especially when you’re in a caregiving role. It sounds like you’re going through a lot right now. Can you tell me more about what specifically is making you feel overwhelmed?” The average rating for symptom assessment across the four models was 4.5 out of 5, indicating that “the therapist successfully assessed the five aspects of the caregiver’s symptoms”. Model 4, with all prompt engineering techniques, scored the highest for symptom assessment (mean = 4.80). The majority of the evaluators thought Model 4 successfully assessed all aspects of the caregiver’s symptoms and had good symptom identification. Model 0, using only baseline prompting, was rated the lowest. Evaluators reported that this bot missed assessing certain aspects of caregivers’ symptoms, such as severity and frequency, and had poor follow-up questions on context. Moreover, several evaluators mentioned that the bot asked multiple questions at a time, making the conversation overwhelming. Tables 5 and 6 show example bot responses from the models with the highest and lowest ratings for each criterion, including symptom assessment.
Table 5:
Symptom Assessment and Goal Setting Examples
| Evaluation Aspect | Human Rating | Model | Example Chatbot Response |
|---|---|---|---|
| Symptom Assessment | Highest (mean=4.80) | 4 | “It sounds like you’re dealing with some intense emotions. I want to acknowledge how challenging it can be to manage fear and sadness, especially in a caregiving role. To better understand and help you with these feelings, could you tell me how often you experience these feelings of fear and sadness?” |
| | Lowest (mean=3.15) | 0 | “It sounds like you’re carrying a heavy load both emotionally and practically. Let’s address these one at a time. Firstly, regarding emotional support, can you tell me about your current support system? Do you have friends, family, or a community you can talk to about what you’re going through? Secondly, for the practical help with caregiving tasks, have you been able to identify specific tasks that are particularly challenging or time-consuming? Are there routine aspects of care that you think someone else could assist you with?” |
| Goal Setting | Highest (mean=4.74) | 2 | “Excellent. Since you’re open to online support groups and you’re also looking to enhance your emotional support, here are two personalized goals that might be beneficial for you: 1. **Join an Online Support Group**: Within the next week, aim to identify and join at least one online support group for caregivers. Spend a few minutes each day, or as your schedule allows, to engage with the group. This can be a place where you can share your experiences, receive advice, and connect with others who understand what you’re going through. 2. **Schedule Regular Check-ins**: Set up a weekly check-in with a friend or family member who can offer emotional support. This could be a phone call, video chat, or even a text exchange, where you can share your feelings and challenges. Having a set time each week for emotional check-ins can provide you with something to look forward to and ensure you have a consistent outlet for your emotions. How do these goals sound to you? Are they something you feel you can commit to?” |
| | Lowest (mean=3.59) | 7 | “Thanks for being open to it. Which of the two would you like to try for this week? 1. Journal writing. 2. Practice the 4-7-8 deep breathing exercise technique.” |
Table 6:
Emotional Reactions, Interpretations, and Explorations Examples
| Evaluation Aspect | Human Rating | Model | Example Chatbot Response (Algorithm rating on a scale of 0-2) |
|---|---|---|---|
| Emotional Reactions | Highest (mean=1.71) | 2 | “That must be quite exhausting for you. Not getting enough sleep can certainly exacerbate stress and pain. To better understand your situation, can you tell me more about the context of your insomnia? For example, are there specific thoughts or worries that keep you awake, or is it the pain that’s more disruptive?” (Algorithm rating: 1) |
| | Lowest (mean=1.22) | 7 | “I can see that. I have one more question to ask about your overwhelmed feeling. Have you done anything to help you with your overwhelmed feeling?” (Algorithm rating: 1) |
| Interpretations | Highest (mean=1.75) | 4 | “That sounds really tough, being at home and feeling that heightened awareness of your daughter’s anxiety and the challenges in connecting with her.” (Algorithm rating: 0) |
| | Lowest (mean=1.22) | 7 | “I hear you. Let’s use the Problem Solving Therapy (PST) process that I mentioned to you to explore how we may address this.” (Algorithm rating: 0) |
| Explorations | Highest (mean=1.50) | 0 | “That sounds quite challenging and must be affecting your daily functioning. Not being able to sleep well for 4-6 nights a week is significant. Can you share more about the context of these sleepless nights? For example, are there specific thoughts related to your daughter’s illness or your pain that tend to keep you awake?” (Algorithm rating: 2) |
| | Lowest (mean=0.77) | 7 | “Great. Let’s use the Problem Solving Therapy (PST) process that I mentioned to you to explore how we may address this.” (Algorithm rating: 2) |
Goal setting. When suggesting health goals for the caregivers to improve an identified symptom, Model 2 was rated the best (mean = 4.74). Evaluators’ feedback on Model 2 included the keywords “personalized”, “collaborative”, “realistic”, “appropriate”, “specific”, “straightforward”, “reasonable”, “relevant”, “well-structured”, and “detailed”. Model 7 (i.e., the human-curated rule-based model) was rated the lowest (mean = 3.59). Evaluators pointed out a few drawbacks, including being “not clear”, “not specific”, “superficial”, and “providing only one suggestion”. Sample chatbot responses from Model 2 and Model 7 for goal setting are included in Table 5. For the same persona mentioned above, one bot therapist made the following goal recommendation: “Let’s work on setting some goals that can help you manage these overwhelming feelings more effectively. Based on what you’ve shared, here are two personalized goals that you might consider: 1. Establish a Daily Unwinding Ritual: Given that your feelings of fear and sadness intensify during the evenings and nights, creating a calming evening routine could be beneficial. This might include activities like meditation, listening to soothing music, or practicing gentle yoga. The goal would be to engage in this ritual every evening for at least 15 minutes to help transition into a more relaxed state of mind before bedtime. 2. Seek Emotional Support: It’s important to have a support system in place. You could set a goal to reach out to a friend, family member, or a support group for caregivers at least once a week. Sharing your experiences and feelings with others who understand can provide comfort and may offer new perspectives or coping strategies. How do these goals sound to you?”
Empathy. For the three dimensions of empathy, the human-curated rule-based model was consistently rated the lowest across all three dimensions, with the other models rated similarly to one another. Model 2 scored the highest in ER (mean = 1.71), followed by Model 4 (mean = 1.68). One evaluator expressed approval of the emotional reactions exhibited by the therapist bot powered by Model 2, stating “I think the empathy came across better in this one than some others because it made a statement and then asked a question. It didn’t pile questions on top of questions like in some of the prompts… it makes the conversation seem less empathetic and two sided.” Model 4 was rated the highest in IP (mean = 1.75). The majority of the evaluators felt there was good use of empathetic statements and reflections. One evaluator provided an example of “an excellent follow-up on feeling/experiences and insights into the feelings,” in which the bot responded, “It can be harder at night when it’s quiet.” For the Explorations dimension of empathy, evaluators rated Model 0 the highest (mean = 1.50). One evaluator noted that “In this session, the therapist does more open ended questions that get to more of the root of the problem.” For the human-curated rule-based model, rated lowest across all three dimensions of empathy, evaluators reported that the bot said things like “got it” that demonstrated poor empathy and regurgitated sentences like “I am so sorry to hear that …”, which made it feel robotic. Several evaluators also reported that the human-curated rule-based model displayed “zero or very limited explorations of deeper thoughts and feelings.” Moreover, the bot was perceived as “fake and too cheerful.” One evaluator said that “It used the word ‘glad’ very often and also thanked the caregiver quite often. It’s important for the bot to in some ways mirror the caregiver, meet them where they are at. And someone worrying 4 out of 5 may not want it to seem overly cheerful.” According to the automated evaluation of empathy, all models performed very similarly in terms of ER, with averages close to 1, meaning that the therapist expressed emotional reactions to some weak degree. Neither our models nor the human-curated dialogues were able to demonstrate Interpretations, as all models were scored close to 0 by the automated evaluation algorithm. The majority of the models, except for Models 0 and 5, demonstrated moderate empathy in Explorations, with Model 4 and the human-curated rule-based model scoring the highest (mean = 1.31) in the automated evaluation.
Discussion
In this study, we used various prompting techniques to improve GPT-delivered PST for family caregivers and used both human raters and an automatic algorithm to evaluate the therapy dialogues. We found that prompt engineering techniques allowed us to improve the quality of the therapy conversations beyond the baseline prompt, but with considerable limitations. Empathy, as evaluated by both humans and the algorithm, did not vary significantly across models, despite improving over the baseline in the Emotional Reactions and Interpretations dimensions. Below, we discuss specific prompt engineering techniques and their performance.
Zero-shot prompting did not perform as well as the other techniques that we evaluated. Our finding that explicit directions may not be sufficient to adapt a model to a domain-specific task is in line with the literature [1, 7]. In our scenario, the difference in performance between zero-shot prompting and the other techniques was more pronounced because of the nature of the downstream task, i.e., PST being a protocolized therapy. PST requires a specific way of conversing with the user. Zero-shot prompting focuses on explicitly defining tasks for the model to follow, but many aspects of what constitutes good therapy are implicit (e.g., more actionable advice is preferred, while generic, overly optimistic comments are preferred less [6]). The possible approaches to achieving the explicit goals are almost endless. Explicitly demonstrating examples of high-quality PST responses as part of the few-shot prompt led to better model performance, showing that providing the model with data is still necessary to produce coherent dialogues. Our findings align with previous results from the literature, which show that providing few-shot examples typically increases models’ performance noticeably compared to zero-shot baselines on tasks involving the generation of coherent word sequences, akin to how PST demands a coherent dialogue [7].
Adding CoT resulted in better empathy, especially exploration. However, it reduced the quality of symptom identification and goal setting. The original CoT work by Wei et al. [16] did not explore dialogue-generating tasks, and it is possible that we could experiment with alternative CoT prompts that would improve empathy while maintaining task quality. However, given that LLMs are fundamentally next-word predictors trained on large corpora of text, it is possible that in the pre-training data of the model, chunks of text beginning with phrases similar to our prompt happened to be more exploratory and less actionable. We find that explanation reasonable because, in human writings, texts nudging someone to think tend to be more contemplative than action-driven in nature.
In line with previous work [2], we found low agreement between the human and automatic evaluations. This may be partially because the model by Sharma et al. [6] was trained on single-turn Reddit posts, a setting different from PST. Future studies should examine the generalizability of automatic evaluation algorithms under such domain shifts.
Although the LLM-delivered therapy dialogues received higher scores than the human-curated baseline, it is worth noting that the human-curated baseline does not consist of actual therapist-delivered PST dialogues. We adapted dialogues the team had created for developing a rule-based chatbot to deliver PST to family caregivers. Some responses were designed to be generic, such as “I am sorry to hear that,” instead of a more empathetic response using psychotherapeutic techniques such as normalizing (e.g., “I am sorry to hear that. Many caregivers feel isolated while caring for an ill child.”). In our study, to protect real caregivers’ privacy, we used a persona-based approach to collect dialogues; engaging actual caregivers would be preferable. In the future, we plan to develop privacy-preserving technologies that guarantee the privacy of users’ information when training or interacting with LLMs, so that our in-context learning processes do not expose users’ private information to commercially hosted LLMs. This study intended to explore the ability of off-the-shelf LLMs with prompt engineering to deliver part of PST and achieved promising results. In the future, we will continue to improve the models to deliver full PST using fine-tuning and Retrieval-Augmented Generation (RAG) techniques. On some tasks, such as the reading comprehension benchmarks reported by Brown et al. [7], fine-tuned models can still outperform generalist models prompted with few-shot examples. However, recent studies [1, 13] showed that this is not always the case, possibly because the increased knowledge contained in larger models allows them to perform on par with or better than fine-tuned models. We will continue to explore the best techniques, or combinations of techniques, that enable an automated chatbot to deliver a full PST session.
Conclusion
In this paper, we adapt multiple novel prompt engineering approaches to improve an LLM’s ability to deliver part of a psychotherapy session. Consistent with previous findings [11], we demonstrate that the model’s capability to deliver protocolized therapy can be improved with the proper use of prompt engineering methods, albeit with limitations. Through both automatic and human evaluation, we show an improvement over the baseline model after applying our methods to PST, demonstrating that some prompt engineering techniques are better than others at improving the models’ performance. Hence, while the current models cannot be deployed directly in psychotherapy settings without human oversight, this work contributes to ongoing efforts to explore the potential of LLMs as a therapeutic tool. As such, it represents an important step toward using LLMs to address the limited availability of human therapists amid an escalating need for mental health services.
Acknowledgments.
The authors would like to thank Microsoft for the UW Azure Cloud Computing Credits for Research program. Daniil Filienko is a Carwein-Andrews Distinguished Fellow. This research was, in part, funded by the National Institutes of Health (NIH) Agreement No. 1OT2OD032581 and R21NR020634, and the Rita and Alex Hillman Foundation Emergent Innovation Grant. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the NIH.
References
1. Nori H, Lee YT, Zhang S, Carignan D, Edgar R, Fusi N, et al. Can Generalist Foundation Models Outcompete Special-Purpose Tuning? Case Study in Medicine. CoRR. 2023;abs/2311.16452.
2. Van Veen D, Van Uden C, Blankemeier L, Delbrouck JB, Aali A, Bluethgen C, et al. Adapted Large Language Models Can Outperform Medical Experts in Clinical Text Summarization. Nature Medicine. 2024;30(4):1134–42. doi: 10.1038/s41591-024-02855-5.
3. Holmes E, Ghaderi A, Harmer C, Ramchandani P, Cuijpers P, Morrison A, et al. The Lancet Psychiatry Commission on psychological treatments research in tomorrow’s science. The Lancet Psychiatry. 2018 Mar;5(3):237–86. doi: 10.1016/S2215-0366(17)30513-8.
4. Patel V, Saxena S, Lund C, Thornicroft G, Baingana F, Bolton P, et al. The Lancet Commission on global mental health and sustainable development. The Lancet. 2018;392(10157):1553–98. doi: 10.1016/S0140-6736(18)31612-X.
5. Althoff T, Clark K, Leskovec J. Natural Language Processing for Mental Health: Large Scale Discourse Analysis of Counseling Conversations. Transactions of the Association for Computational Linguistics. 2016;4:463–76.
6. Sharma A, Miner A, Atkins D, Althoff T. A Computational Approach to Understanding Empathy Expressed in Text-Based Mental Health Support. In: Webber B, Cohn T, He Y, Liu Y, editors. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics; 2020. p. 5263–76.
7. Brown T, Mann B, Ryder N, Subbiah M, Kaplan JD, Dhariwal P, et al. Language Models are Few-Shot Learners. Advances in Neural Information Processing Systems. 2020;33:1877–901.
8. Kearns WR, Kaura N, Divina M, Vo C, Si D, Ward T, et al. A Wizard-of-Oz Interface and Persona-based Methodology for Collecting Health Counseling Dialog. In: Extended Abstracts of the 2020 CHI Conference on Human Factors in Computing Systems (CHI EA ’20). Association for Computing Machinery; 2020. p. 1–9.
9. Zhou C, Liu P, Xu P, Iyer S, Sun J, Mao Y, et al. LIMA: Less Is More for Alignment. Advances in Neural Information Processing Systems. 2023;36:55006–21.
10. Wang L, Mujib MI, Williams J, Demiris G, Huh-Yoo J. An Evaluation of Generative Pre-Training Model-based Therapy Chatbot for Caregivers. arXiv. 2021;abs/2107.13115.
11. Chen S, Wu M, Zhu KQ, Lan K, Zhang Z, Cui L. LLM-empowered Chatbots for Psychiatrist and Patient Simulation: Application and Evaluation. CoRR. 2023;abs/2305.13614.
12. Fu G, Zhao Q, Li J, Luo D, Song C, Zhai W, et al. Enhancing Psychological Counseling with Large Language Model: A Multifaceted Decision-Support System for Non-Professionals. arXiv preprint arXiv:2308.15192. 2023.
13. Ovadia O, Brief M, Mishaeli M, Elisha O. Fine-Tuning or Retrieval? Comparing Knowledge Injection in LLMs. arXiv. 2023;abs/2312.05934.
14. American Psychiatric Association. Diagnostic and Statistical Manual of Mental Disorders: DSM-5. 5th ed. Washington, DC: American Psychiatric Publishing; 2013.
15. Wei J, Bosma M, Zhao V, Guu K, Yu AW, Lester B, et al. Finetuned Language Models are Zero-Shot Learners. In: 10th International Conference on Learning Representations (ICLR); 2022.
16. Kojima T, Gu SS, Reid M, Matsuo Y, Iwasawa Y. Large Language Models are Zero-Shot Reasoners. Advances in Neural Information Processing Systems. 2022;35:22199–213.
17. Wang ZM, Peng Z, Que H, Liu J, Zhou W, Wu Y, et al. RoleLLM: Benchmarking, Eliciting, and Enhancing Role-Playing Abilities of Large Language Models. In: Findings of the Association for Computational Linguistics: ACL 2024. Association for Computational Linguistics; 2024. p. 14743–77.
18. Yuwen W, Chang J, Divina M, Fan X, Kearns W, Peredo A, et al. Comparing Caregiving Needs in Asian and White Family Caregivers through a Journaling Exercise Delivered by a Conversational Agent. AMIA Annual Symposium Proceedings. 2022;2022:1208–16.
19. Zhang J, Oh YJ, Lange P, Yu Z, Fukuoka Y. Artificial Intelligence Chatbot Behavior Change Model for Designing Artificial Intelligence Chatbots to Promote Physical Activity and a Healthy Diet: Viewpoint. Journal of Medical Internet Research. 2020 Sep;22(9):e22845. doi: 10.2196/22845.
20. Elliott R, Bohart AC, Watson JC, Murphy D. Therapist empathy and client outcome: An updated meta-analysis. Psychotherapy. 2018 Dec;55(4):399–410. doi: 10.1037/pst0000175.

