Abstract
Direct observation of psychotherapy and provision of performance-based feedback is the gold-standard approach for training psychotherapists. At present, this requires experts and trained human coding teams, which is slow, expensive, and labor intensive. Machine learning and speech signal processing technologies provide a way to scale up feedback in psychotherapy. We evaluated an initial proof-of-concept automated feedback system that generates Motivational Interviewing (MI) quality metrics and provides easy access to other session data (e.g., transcripts). The system automatically provides a report of session-level metrics (e.g., therapist empathy) as well as therapist behavior codes at the talk-turn level (e.g., reflections). We assessed usability, therapist satisfaction, perceived accuracy, and intentions to adopt. A sample of twenty-one novice (n = 10) or experienced (n = 11) therapists each completed a 10-minute session with a standardized patient. The system received the audio from the session as input and then automatically generated feedback that therapists accessed via a web portal. All participants found the system easy to use and were satisfied with their feedback, 83% found the feedback consistent with their own perceptions of their clinical performance, and 90% reported they were likely to use the feedback in their practice. We discuss the implications of applying new technologies to the evaluation of psychotherapy.
The acquisition of skills requires a regular environment, an adequate opportunity to practice, and rapid and unequivocal feedback about the correctness of thoughts and actions. When these conditions are fulfilled, skill eventually develops, and the intuitive judgments and choices that quickly come to mind will mostly be accurate.
- Daniel Kahneman
Psychotherapy research focuses on the development and evaluation of interventions that address mental health problems. There are now thousands of clinical trials demonstrating the efficacy of specific interventions for a range of problems (see, e.g., APA Division 12 website on ESTs; https://www.div12.org/treatments). Recent research and quality improvement efforts have focused on how best to disseminate and implement these treatments in community settings. In the last decade, health systems have spent at least 2 billion dollars on various efforts to train providers in the provision of specific evidence-based treatments (McHugh & Barlow, 2010). However, these and other efforts to provide psychotherapy training, supervision, and quality assurance are hampered by the impracticality, if not near impossibility, of offering providers rapid, objective, performance-based feedback. According to McHugh and Barlow (2010), trainings would ideally include "… objective assessment of fidelity including clinician competence and number and percentage of clinicians who complete training, achieve competence, and sustain competence…" (p. 83). Given current technology, this is not a realistic goal.
The current gold-standard for monitoring provider fidelity to treatment relies on human raters for assessment, which is slow and unsustainable in the vast majority of clinical settings. Human evaluation of psychotherapy sessions is simply a non-starter in community settings, and thus virtually all of the more than 80 million psychotherapy sessions in a given year (Olfson & Marcus, 2010) are not evaluated. However, recent developments in machine learning and speech signal processing now offer a route to rapid, performance-based feedback for counseling and psychotherapy (Imel et al., 2015). The current research describes a pilot feasibility study of a proof-of-concept system, developed via user-centered design, that provides rapid, performance-based feedback.
The Effects of Skill-based Feedback in Psychotherapy Training
There is a well-developed literature on the necessary conditions for skill development across domains (Kulik & Kulik, 1988; Schooler & Anderson, 2008). One crucial component is regular feedback on whether the skill has been performed correctly. Feedback, in general, has been shown to increase skill performance (Kluger & DeNisi, 1996), though certain types of feedback have been found to be more important than others. For example, feedback on whether or not a task was performed correctly has been shown to be an extremely effective means for teaching skills (Hattie & Timperley, 2007). In contrast, feedback about the characteristics of the person being evaluated has negative effects on learning skills (Hattie & Timperley, 2007). More immediate feedback has a stronger impact on skill development than delayed feedback (Epstein, Epstein, & Brosvic, 2001; Kulik & Kulik, 1988).
In psychotherapy, there have been successful examples of feedback interventions, though most have significant drawbacks. For example, informing clinicians whether their clients are improving can bolster outcomes for at-risk clients (e.g., Kendrick et al., 2016). Feedback on clients' symptoms has several advantages, most notably that it is an efficient way of providing feedback and does not require significant time from trained staff. However, client outcome feedback also has limitations. First, changes in therapist performance are not durable after the feedback is removed (Lambert, Harmon, Slade, Whipple, & Hawkins, 2005). Second, this type of feedback does not target any specific therapist behaviors (e.g., what did the therapist do to lead to the problem, or what can they do to solve it), and there is not yet evidence that symptom-based feedback has a measurable impact on therapist behavior.
Motivational Interviewing (MI) researchers have been at the forefront of studying how to train therapists to use specific skills (e.g., increase empathy, use of open questions and reflections; Baer et al., 2009). Ordinarily, trainings for community providers involve an in-person workshop. The training itself typically includes lectures on theory and research of MI, demonstrations, as well as practice with role-played clients where providers receive some feedback on their use of MI. Less commonly, trainee therapists submit tapes demonstrating their utilization of MI skills (e.g., Miller, Yahne, Moyers, Martinez, & Pirritano, 2004). In a recent meta-analysis of training studies, workshop-based training increased provider utilization of MI skills, but without ongoing performance-based feedback or coaching, the training effect deteriorated over time (Schwalbe, Oh, & Zweben, 2014). Unfortunately, ongoing performance-based feedback of provider behavior is rare in community settings (see Creed et al., 2016 for an exception). A major barrier to high-quality implementation of behavioral interventions is the need for observer-rated fidelity (Proctor et al., 2009). As noted previously, the time-consuming nature of behavioral coding of provider fidelity prohibits its use in community settings, with a few rare exceptions.
Without ongoing feedback, providers are unlikely to maintain newly learned skills that are present when they are initially trained. The post-training support necessary to maintain new skills is clear (i.e., performance-based feedback), but the standard process of generating this feedback (i.e., using humans as the assessment tool via behavioral coding) is too slow and expensive to support wide-scale adoption. Fortunately, initial research suggests that modern methods from computer science have the potential to speed up the feedback process.
Technology-based Evaluation of Psychotherapy
Around the time that Carl Rogers first recorded psychotherapy sessions in the 1940s (Rogers, 1951), natural language processing (NLP) began to emerge as a subfield of computer science (Jones, 1994). Currently, NLP is a subfield of machine learning with the primary aim of training computers to learn, understand, and analyze language (Hirschberg & Manning, 2015). A psychotherapy session is essentially a conversation between two individuals, rich with semantic data that often goes unanalyzed due to the laborious nature of human coding. NLP techniques allow researchers to take this conversational data, often in the form of large collections of unstructured text, and help answer important process questions such as: "What did the client and therapist talk about during this session?" or "How empathic was this therapist?". Recent examples of machine learning and NLP methods applied to psychotherapy include: (a) identifying therapist reflections (Can et al., 2016), (b) identifying the primary content of psychotherapy sessions and when it occurred within the session (Gaut, Steyvers, Imel, Atkins, & Smyth, 2015; Imel, Steyvers, & Atkins, 2015; Howes et al., 2013; Howes et al., 2014), (c) classifying different treatment approaches such as CBT or psychodynamic therapy (Imel et al., 2015), and (d) evaluating how empathic a therapist was based solely on audio recordings of a session (Xiao, Imel, Georgiou, Atkins, & Narayanan, 2015; see also Hasan et al., 2016; Pace et al., 2016; Pérez-Rosas et al., 2017). Finally, using a sample of more than 300 MI sessions, NLP models were able to estimate a full range of MI fidelity metrics based on therapist statements (e.g., open questions, reflections) with accuracy that approaches human performance (Tanana, Hallgren, Imel, Atkins, & Srikumar, 2016).
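To give a concrete, if highly simplified, sense of how such utterance-level models operate, the sketch below trains a bag-of-words classifier to label therapist statements as reflections or questions. It is purely illustrative: the utterances and labels are invented, and the studies cited above used substantially richer features and models.

```python
# Minimal illustration of utterance-level text classification, in the spirit of
# the MI coding models cited above (not the actual models from those studies).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data: therapist utterances with invented MI behavior labels.
utterances = [
    "It sounds like you're worried about how drinking affects your family.",
    "So part of you wants to cut back, and part of you isn't sure you can.",
    "How much would you say you drink in a typical week?",
    "Have you tried to quit before?",
]
labels = ["reflection", "reflection", "question", "question"]

# Bag-of-words features plus a linear classifier: a common, simple baseline.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
model.fit(utterances, labels)

# Predict a code for a new therapist statement.
print(model.predict(["You're feeling pulled in two directions about this."]))
```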
Current Investigation
There are now several studies demonstrating that machine learning algorithms can use session audio or transcripts to generate ratings of psychotherapy sessions that are consistent with traditional human-derived observer ratings. However, no study has attempted to use these methods to provide feedback to providers, and no research has explored how such feedback should be delivered to clinicians when it is generated rapidly by machine learning models. MI training research incorporating provider feedback has typically presented therapist fidelity scores in a paper or electronic document or in the context of a supervisory or consultative interaction with a trainer (Miller et al., 2004). The use of computer-generated feedback introduces both opportunities and complications: (a) immediate, objective feedback presented in a visually appealing, interactive manner could enhance provider engagement and learning; however, (b) providers may be skeptical of their work being evaluated by a computer.
In this research, we present an initial evaluation of a web-based interactive tool that automatically provides feedback to providers on specific psychotherapy skills.1 To do so, we developed a tool that rapidly generated machine learning-based feedback on Motivational Interviewing for a sample of 21 therapists who recorded 10-minute sessions with standardized patients describing problems with substance use. The primary aim of the study was to assess the usability of the tool itself, and thus the sample is not powered to conduct a reliable test of consistency with human codes (see Tanana et al., 2016; Xiao et al., 2015 for prior large-scale evaluations). However, we provide an exploration of the correspondence of machine-generated MI ratings with human ratings. We expected providers would be generally satisfied with the feedback and its presentation, find the feedback easy to use and interpret, perceive the quantitative feedback as a reasonably accurate reflection of their actual performance, and be inclined to adopt the technology if it were available.
Method
Participants
Participants were 11 experienced (i.e., licensed) and 10 novice (i.e., trainee) therapists (total N = 21), recruited to record a brief substance abuse counseling session with a standardized patient (SP). Note that we systematically recruited providers with different levels of experience to ensure meaningful variability in therapy performance and supervision, but we did not hypothesize or test specific differences between these groups.
Experienced therapists were local practicing, licensed professionals recruited via a snowball sampling procedure. The authors utilized their professional connections in the local community to recruit experienced therapists from local university-based and community mental health sites. These experienced therapists were also asked to identify other empathic, licensed community therapists, who were then contacted via email by a research assistant. We specifically recruited experienced providers who had prior knowledge of or training in MI, but this was not required, as we were also interested in feedback from therapists with no prior training in MI. The majority of experienced therapists had received their doctorate and had been licensed more than two years (64%, n = 7). Experienced therapists reported a range of MI backgrounds: frequent use of MI in their practice (n = 2, 18%), membership in an MI organization (e.g., MINT; n = 1, 9%), familiarity with and formal training in MI (n = 7, 64%), and familiarity but no formal training (n = 1, 9%).
Beginning therapists were recruited from a local university's master's and doctoral level clinical training programs. Across the full sample, clinicians were 86% White/Caucasian (n = 18), 10% Hispanic/Latin@ (n = 2), and 5% Asian American/Pacific Islander (n = 1), with a mean age of 42.1 years (SD = 11.7). Trainees were all enrolled in introductory training courses (e.g., counseling microskills and counseling theories) and were within the first year of their training program. Of the novice therapists (n = 10), 80% were master's students (n = 8), and they reported having either no training or experience with MI (n = 4, 40%) or some familiarity with MI but no training (n = 5, 60%; 1 missing response).
Procedure.
After recruitment, participants completed study consent and demographic information via an online survey platform. A research assistant then contacted them to schedule and complete a 10-minute substance abuse session with a standardized patient (SP). The session was recorded with two lavalier microphones that were clipped to both the therapist and SP, allowing high-quality audio data capture. Following the session, participants received computerized feedback via web portal (described below). Participants were then asked to complete a web-based survey regarding their satisfaction with the tool.
Standardized Patients (SPs).
Doctoral students (n = 4; 2 female, 2 male) functioned as standardized patients. Three SP profiles were created, all of which focused on presenting concerns related to substance abuse. SPs are commonly utilized in the psychotherapy and medical training literature when the target of investigation is provider behavior. SPs also avoid issues with missing data and audio quality that arise when samples are requested from community-based therapists (Baer et al., 2004). One profile described an individual struggling with methamphetamine use who was required to attend therapy. Two profiles described college-age students who were experiencing negative consequences related to drinking. SPs were trained to respond in ways consistent with their profiles but maintained flexibility in responding to individual therapists (i.e., there was no set script).
Feedback System.
In previous publications we provided descriptions of the visual design (Gibson et al., 2016) as well as development and validation of the speech signal and machine learning components of the feedback system (Atkins et al., 2014; Xiao et al., 2015; Tanana et al., 2016). Briefly, the system first separates audio segments of speech from non-speech using a process called voice activity detection (VAD); then, all the speech segments are separated into two groups, one for each speaker, in a process called speaker diarization. Person-specific speech segments are transcribed using an automatic speech recognition (ASR) pipeline developed with the Kaldi software library (Povey et al., 2011). Using the automatically transcribed words, we then used a role matching model to identify which speech segments belong to the counselor and which belong to the client (details on each of these steps can be found in Xiao et al., 2016). The system next used the lexical content from the ASR session transcripts to predict specific MI fidelity codes for each session (see measures section below for details on MI fidelity codes).2 The ASR transcript results were used as inputs to a support vector regression model (Drucker, Burges, Kaufman, Smola, & Vapnik, 1997) based on information fused from a maximum entropy (maxent) language model (Berger, Pietra, & Pietra, 1996), a maximum likelihood language model (Jurafsky & Martin, 2008), and ASR lattice rescoring. In total, the speech processing and machine learning pipeline yields predicted scores for each MI fidelity code, which are either single values for the entire session (i.e., global scores) or utterance-specific (i.e., behavioral codes). Descriptions for each of the MI fidelity codes that were included in the feedback report are listed in Table 1. The prediction model was trained with a fully transcribed and behaviorally coded set of 345 MI substance abuse treatment sessions (see Lord et al., 2015; Xiao et al., 2015). In addition, ASR language models were also trained on transcripts from a larger, general psychotherapy corpus (see Imel et al., 2015).
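As a rough illustration of the dataflow just described, the sketch below chains the major stages (voice activity detection, diarization, ASR, role matching, and code prediction) as simple placeholder functions. The stage names mirror the description above, but every function body is an invented stand-in; the deployed system relies on the Kaldi-based ASR and fused language models described in Xiao et al. (2016).

```python
# Illustrative dataflow for the feedback pipeline described above.
# Every function body here is a placeholder stand-in, not the deployed system.
from dataclasses import dataclass
from typing import List

@dataclass
class Segment:
    start: float      # seconds into the session
    end: float
    speaker: str      # "A"/"B" from diarization, later mapped to therapist/client
    text: str = ""    # filled in by ASR
    code: str = ""    # utterance-level MI behavior code

def detect_speech(audio_path: str) -> List[Segment]:
    """Voice activity detection: split audio into speech segments (placeholder)."""
    return [Segment(0.0, 4.2, ""), Segment(4.5, 9.1, "")]

def diarize(segments: List[Segment]) -> List[Segment]:
    """Speaker diarization: assign each segment to one of two speakers (placeholder)."""
    for i, seg in enumerate(segments):
        seg.speaker = "A" if i % 2 == 0 else "B"
    return segments

def transcribe(segments: List[Segment]) -> List[Segment]:
    """Automatic speech recognition on each segment (placeholder for the ASR step)."""
    for seg in segments:
        seg.text = "..."  # an ASR hypothesis would go here
    return segments

def assign_roles(segments: List[Segment]) -> List[Segment]:
    """Role matching: decide which diarized speaker is the therapist vs. the client."""
    role_map = {"A": "therapist", "B": "client"}   # placeholder decision
    for seg in segments:
        seg.speaker = role_map[seg.speaker]
    return segments

def predict_codes(segments: List[Segment]) -> List[Segment]:
    """Predict an MI behavior code for each therapist utterance (placeholder model)."""
    for seg in segments:
        if seg.speaker == "therapist":
            seg.code = "simple_reflection"  # a trained model would predict this
    return segments

def run_pipeline(audio_path: str) -> List[Segment]:
    """Chain the stages in the order described in the text."""
    return predict_codes(assign_roles(transcribe(diarize(detect_speech(audio_path)))))

if __name__ == "__main__":
    print(run_pipeline("session_audio.wav"))
```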
Table 1.
Descriptions of each MI metric provided in the feedback tool^a
| MI Metric | Description |
|---|---|
| Overall MI Fidelity | The Overall MI Fidelity score ranges from 0–12 where 12 represents excellent fidelity to Motivational Interviewing. You receive 0, 1, or 2 points for your performance on each of the 6 MI fidelity metrics (MI Spirit, Empathy, Reflection to Question Ratio, Percent Open Questions, Percent Complex Reflections, and Percent MI Adherence). You receive 1 point for scores that meet (basic) proficiency benchmarks, and 2 points for scores that meet benchmarks for (advanced) competence. |
| MI Adherence | MI counselors who are ‘adherent’ are those who ask open questions, make complex reflections, support and affirm their clients, and emphasize their client’s autonomy. Non-adherent counselors are confrontational, directing, warning, and advice giving. This measure is the total number of MI adherent behaviors divided by the sum of adherent and non-adherent behaviors. Higher is better and anything less than 100% indicates some non-adherent behaviors. |
| MI Spirit | MI Spirit captures the general counseling style of Motivational Interviewing, which 1) is a collaborative approach, 2) shows interest in and evokes the client’s perspective, and 3) does not impose views on the client but rather supports the client’s ability to make their own choices for their life. |
| Empathy | Empathy is a rating of how well the therapist understands the client’s perspective and makes efforts to see the world as their client sees it. |
| Reflections to Questions Ratio | An effective MI counselor uses more reflective statements (i.e., summaries of what the client has said) vs. questions. This measure is a ratio of the total number of reflections divided by the total number of questions, and thus, higher is better. |
| Percent Open Questions | When MI counselors ask questions, they strive to ask ‘open’ questions that invite a range of possible answers and may invite the client’s perspective or encourage self-exploration. Closed questions are ones that can be answered in a single, or few, words. This measure is the total number of open questions divided by the sum of open and closed questions. Higher is better. |
| Percent Complex Reflections | Reflections are summaries of what the client has expressed and said. A simple reflection is an almost verbatim restatement of what the client said. A complex reflection is a summary that adds meaning or emphasis or might integrate additional information. A complex reflection is one way that a therapist conveys they are trying to understand their client and his or her worldview. This measure is the total number of complex reflections divided by the sum of complex and simple reflections. Higher is better. |
Note. Descriptions of MI fidelity metrics were adapted from the Motivational Interviewing Skills Code Version 2.1 (Miller, Moyers, Ernst, & Amrhein, 2008).
^a Participants could access this information by clicking on the ‘i’ button next to each score.
The therapist-directed, web-based tool (see the example report in Figure 1 and interactive examples online at http://sri.utah.edu/psychtest/misc/demoinfo.html)3, which provided visualizations of MI fidelity scores and session content, was developed through an iterative, user-centered design process (Gibson et al., 2016; Norman, 2013). Following their SP session, therapists could access their automated feedback report via a password-protected web portal. The report included visual summaries of MI fidelity as well as a session timeline visualization indicating whether the therapist or client was talking at each point throughout the session. In addition, the session timeline linked to the ASR-based transcript of the session as well as predicted MI fidelity codes for each utterance.
Figure 1. Example of the MI feedback report.
Measures
Motivational Interviewing Fidelity.
All SP sessions were rated by machine learning models that had been trained to evaluate fidelity to MI. In the feedback portal, every therapist utterance in the ASR transcript received a specific MI code. In addition, the system provided session-level ratings that were compared to standard benchmarks for MI spirit, empathy, reflection to question ratio, and percent open questions. Both MI spirit and empathy are Likert ratings scored from 1–5, and the other metrics are aggregated from the utterance-level labels.
The machine learning models of MI fidelity were trained using two of the most common observer-rated measures of MI fidelity, the Motivational Interviewing Skills Code (MISC) and the Motivational Interviewing Treatment Integrity (MITI) scale. The MISC v2.1 (Miller, Moyers, Ernst, & Amrhein, 2008) is an utterance-level coding system that we used to train machine learning models for behavior codes (e.g., simple and complex reflections, open and closed questions, giving information). Human reliability (intra-class correlations) for the data used to train these models ranged from .61 (MI non-adherent) to .92 (giving information), with 6 of the 7 utterance-level codes above .7 (M = .80). Human reliability reflects the amount of measurement error in the data used to train the machine learning models and, as such, provides an upper bound for the correlations of machine-generated predictions. As with human raters, the models classify each utterance, and therapist behavior codes are tallied and used to calculate the specific fidelity metrics presented to the counselor in the interactive report.
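To make this aggregation step concrete, the sketch below computes the ratio metrics defined in Table 1 from hypothetical utterance-level tallies and scores a 0–12 Overall MI Fidelity composite. The counts, global ratings, and proficiency/competence thresholds are invented placeholders for illustration; they are not the benchmarks used by the deployed system.

```python
# Illustrative aggregation of utterance-level MI codes into session-level metrics.
# Counts and benchmark thresholds are hypothetical; see Table 1 for definitions.

def session_metrics(counts: dict) -> dict:
    """Compute the ratio metrics in Table 1 from tallied behavior-code counts."""
    questions = counts["open_q"] + counts["closed_q"]
    reflections = counts["simple_ref"] + counts["complex_ref"]
    return {
        "reflection_to_question": reflections / max(questions, 1),
        "pct_open_questions": counts["open_q"] / max(questions, 1),
        "pct_complex_reflections": counts["complex_ref"] / max(reflections, 1),
        "pct_mi_adherent": counts["mi_adherent"]
            / max(counts["mi_adherent"] + counts["mi_nonadherent"], 1),
    }

def fidelity_points(value: float, basic: float, advanced: float) -> int:
    """0/1/2 points depending on whether a metric meets basic or advanced benchmarks."""
    return 2 if value >= advanced else 1 if value >= basic else 0

# Hypothetical session: tallied therapist behavior codes.
counts = {"open_q": 6, "closed_q": 4, "simple_ref": 5, "complex_ref": 5,
          "mi_adherent": 9, "mi_nonadherent": 1}
metrics = session_metrics(counts)

# Overall MI Fidelity (0-12): 0-2 points per metric; all thresholds below are placeholders.
overall = (
    fidelity_points(4.0, basic=3.5, advanced=4.0)                    # MI Spirit (1-5 rating)
    + fidelity_points(4.0, basic=3.5, advanced=4.0)                  # Empathy (1-5 rating)
    + fidelity_points(metrics["reflection_to_question"], 1.0, 2.0)   # reflection:question ratio
    + fidelity_points(metrics["pct_open_questions"], 0.5, 0.7)       # % open questions
    + fidelity_points(metrics["pct_complex_reflections"], 0.4, 0.5)  # % complex reflections
    + fidelity_points(metrics["pct_mi_adherent"], 0.9, 1.0)          # % MI adherent
)
print(metrics, overall)
```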
To train machine learning models of session-level ratings of empathy and MI spirit, we used the MITI v3.1 (Moyers, Martin, Manuel, Miller, & Ernst, 2010), a less intensive version of the MISC (see Xiao et al., 2015). The MI Spirit composite score was calculated by aggregating ratings of evocation, collaboration, and autonomy/support. Human agreement on the data used to train these models was adequate (MI Spirit, ICC = .68; Empathy, ICC = .75). The accuracy of both utterance-level and session-level models in a sample of over 300 sessions has been previously reported (see Tanana et al., 2016; Xiao et al., 2016).
Finally, a single human rater coded all sessions using the MITI 3.1. The coder is a member of the Motivational Interviewing Network of Trainers and was the trainer of the two previous coding teams that generated the data used to train the machine learning models described earlier (see the reliability estimates above). Each session was coded in its entirety in a single pass, as recommended by the MITI manual. Agreement (intra-class correlations) between the human ratings and the machine-generated codes varied from .23 (empathy) to .80 (closed questions), M = .48, which was on average 62% of human agreement (SD = 23).
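For readers less familiar with the agreement statistic reported here, the sketch below computes a two-way, absolute-agreement intraclass correlation for a single rating (commonly labeled ICC(2,1)) between human and machine session-level scores. The data are invented, and this report does not specify the exact ICC variant used, so the formulation shown is only one plausible choice.

```python
# Illustrative ICC(2,1) (two-way random effects, absolute agreement, single rating)
# between human and machine session-level scores. The data below are invented.
import numpy as np

def icc_2_1(scores: np.ndarray) -> float:
    """scores: (n_sessions, n_raters) matrix of ratings."""
    n, k = scores.shape
    grand = scores.mean()
    row_means = scores.mean(axis=1)   # per-session means
    col_means = scores.mean(axis=0)   # per-rater means
    msr = k * np.sum((row_means - grand) ** 2) / (n - 1)   # between-sessions mean square
    msc = n * np.sum((col_means - grand) ** 2) / (k - 1)   # between-raters mean square
    resid = scores - row_means[:, None] - col_means[None, :] + grand
    mse = np.sum(resid ** 2) / ((n - 1) * (k - 1))         # residual mean square
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

# Columns: human rating, machine rating of empathy (1-5) for hypothetical sessions.
ratings = np.array([[4, 4], [3, 4], [5, 4], [2, 3], [4, 5], [3, 3]], dtype=float)
print(round(icc_2_1(ratings), 2))
```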
Therapist Evaluation of System.
After therapists received feedback on their session, we provided them a survey that assessed usability, satisfaction, perceived accuracy, and intentions to adopt the technology. The survey was developed for this study because many of the items assessed features unique to the specific feedback tool. The present study focuses on two sets of clinically relevant items. The first set included four items that evaluated: (1) ease of use of the interface, (2) how representative the feedback was of their performance, (3) overall satisfaction, and (4) whether they would use the tool in their clinical practice. In addition, we asked participants five questions about how easy it was to understand their feedback on several key metrics: (1) empathy, (2) MI Spirit, (3) reflection to question ratio, (4) percent open questions, and (5) percent complex reflections. Across items, users responded on a qualitative Likert scale with answers ranging from “Strongly Disagree” to “Strongly Agree” or “Very Unhelpful” to “Very Helpful.” Users were asked to provide qualitative comments at the end of the survey, and we include a representative selection of responses (both supportive and critical) to illustrate the quantitative results.
Results
Satisfaction and Usability
Figure 2 reports participant feedback on clinical feasibility. All 21 participants endorsed that the tool was easy to use (strongly agree, n = 13, 62%; slightly agree, n = 8, 38%). Eighteen of 21 participants (86%) either strongly (n = 8, 38%) or slightly (n = 10, 48%) agreed that the feedback was representative of their clinical performance (1 slightly disagreed and 2 were unsure). All 21 participants indicated that they were satisfied with the computer-generated feedback report (strongly agree, n = 14, 67%; slightly agree, n = 7, 33%). Nineteen of 21 participants (90%) agreed that if the tool were available they would use it in their clinical practice (strongly agree, n = 15, 71%; slightly agree, n = 4, 19%; unsure, n = 2, 10%).
Figure 2. Therapist ratings of feedback report feasibility.
We also assessed how easy participants found it to understand the scores they received in different MI domains (see Figure 3). Across all scores, no fewer than 17 participants (81%) and up to all 21 participants found their scores either very or somewhat easy to understand. Seventeen participants (81%) found the MI spirit rating very (n = 11) or somewhat (n = 7) easy to understand (3 found it somewhat hard). Nineteen participants (90%) found the Empathy rating very (n = 16) or somewhat (n = 3) easy to understand (1 found it somewhat hard, and 1 very hard). Nineteen participants (90%) found the reflection-to-question ratio metric very (n = 14) or somewhat (n = 5) easy to understand (1 found it somewhat hard). All 21 participants (100%) found the percent open questions measure very (n = 17) or somewhat (n = 5) easy to understand. All 21 participants (100%) found the percent complex reflections rating very (n = 17) or somewhat (n = 5) easy to understand. A sample of participants’ free-text comments is presented in Table 2. We included a selection of satisfied responses as well as a sample from participants who were more confused by their individual feedback.
Figure 3. Therapist ease of understanding each of the MI metric feedback categories.
Table 2.
Participant qualitative feedback comments
| Participant | Qualitative Feedback |
|---|---|
| 1 | “I really like this kind of quantitative feedback as part of counselor training (or counselor eval in the case of working professionals). The facts tell an important part of the story-- and alongside personal feedback in the case of students I think this offers some insights that cannot be compared to the qualitative feedback of a supervisor.” |
| 2 | “I REALLY like this kind of information. I found it extremely useful and an interesting way to facilitate self-reflection on my therapeutic style. I realized that my reflection:question ratio has quite a bit of room for improvement and I have the goal of improving it now in my actual therapy. The behavioral counts specifically were fascinating. Also I thought arousal was an interesting measure. I found myself identifying the higher arousal states and wondering why I or my client was aroused and rereading the associated transcripts. I feel like I developed some interesting insight from doing that.” |
| 3 | “I found the ratio of reflection to question very confusing. It may also be helpful to have a brief descriptor of MI somewhere in the feedback? Unless the intended audience is counselors who are trained in this specific skill. As a beginner I was somewhat confused why these particular areas were highlighted over others.” |
| 4 | “Empathy score was unclear as to what my score meant in general. Did I have a good level of empathy or a ‘bad’ level?” |
Discussion
To our knowledge, this is the first evaluation of machine learning-based technology to provide feedback to counselors. Previous research has demonstrated the accuracy of the machine learning models that generated the feedback (e.g., Tanana et al., 2016; Xiao et al., 2015) and described the technical details of the processing pipeline (Xiao et al., 2016), but this is the first evaluation of an integrated system that provides machine-generated feedback directly to therapists based on a session audio recording.
At present, specific performance-based feedback such as that provided by fidelity codes is rare during training and often completely absent for post-licensure therapists. There is a substantial literature on the positive effects of feedback based on client outcomes (Lambert et al., 2002), and feedback on client outcomes is quickly becoming a best-practice standard for mental health treatment (Rush, 2015). Yet, symptom-focused feedback is distal from the clinical encounter and offers no specific reflection on therapist behavior. If the client’s symptoms are not improving, the feedback is simply to “do something different.” We provided an initial evaluation of how therapists may react to feedback from a tool that dramatically reduces the time and labor required to generate such feedback.
The usability results were very encouraging. Psychotherapy and counseling are inherently human and interpersonal processes, and thus, the idea of using computational algorithms to provide performance-based feedback could easily be perceived as inappropriate or simply “not possible” to therapist end-users. To the contrary, therapists in the present study were generally satisfied with the feedback they received. The majority of therapists noted that the feedback was easy to understand and perceived the feedback to be representative of their clinical work. In addition, the vast majority indicated they would consider adopting the technology if it were available to them, a strong indication that they found value in both the content of the feedback and its presentation.
Easily generated, performance-based evaluations of psychotherapy interactions offer a breadth of possible clinical and training uses, including support of post-licensure therapist training as well as pre-licensure supervision. The most direct application would be to use the sort of automated feedback evaluated in this study as an adjunct to standard workshop-based training. Rather than restrict feedback to brief slices of role plays that the trainer can observe or a select few sessions that are sent to a trainer and scored, a therapist could elect to receive ongoing feedback on a particular treatment (in this case MI). Meta-analyses (e.g., Schwalbe et al., 2014) suggest that this sort of ongoing feedback might help therapists maintain the skill gains they acquired during training. Independent of a specific training, automated feedback might provide a scalable mechanism for therapists to monitor changes in their behavior within and between clients over time. Objective feedback on utilization of specific skills may either challenge or support therapists’ intuitions about what they are doing with their clients, and serve to increase their ability to self-reflect and be intentional about their interactions with clients.
There is also an opportunity for tools that provide automated evaluations of psychotherapy to support the practice of supervision. Typical supervision includes a supervisee telling their supervisor about their current caseload, perhaps focusing on a particularly difficult clinical situation and discussing potential strategies. However, supervisees do not always disclose their mistakes or struggles (Ladany, Hill, Corbett, & Nutt, 1996), and thus storytelling-based supervision can mean that feedback is distal to what actually occurred during the session. On occasion, supervision might include review of session recordings and feedback based directly on these observations. Ideally, supervisors would be able to review large numbers of recordings from their supervisees and ground their feedback in direct observation. However, supervisors are usually busy clinicians themselves, and it is not feasible to ask them to watch hours of tape in order to provide more detailed feedback. If all of a supervisee’s sessions could be recorded and evaluated, supervisors might be able to quickly scan a large swath of sessions, inspecting outliers and initiating a discussion when therapists are behaving differently (e.g., talking a lot more, asking more questions, receiving a particularly low empathy score). The student therapist and supervisor may also be able to use the tool as a way to initiate conversations about specific skills and to provide specificity about strengths and growth edges. The supervisor could watch particular moments from a session that are highlighted by the system and then elect to focus on them during a supervision meeting. It would also be possible to track student progress over time, providing more directly observable behavioral indicators than the typical competency evaluations used in most training environments. Similar actions might be possible at the clinic level, where training or clinical directors could peruse aggregated metrics for what is happening in the clinic.
Beyond ongoing feedback with trainees and practicing therapists, automated feedback could be useful in scaffolding the education of brand new therapists in training. At present, counseling skills classes might include a lecture on a particular skill followed by observation and role plays. Feedback might be provided by fellow students, and the instructor then observes a small sample of a student’s work in class or comments on a few transcripts of longer interactions during the semester. It should be possible to use the natural language processing models evaluated in this paper to build exercises in which a beginning therapist receives more frequent feedback. For example, a text-based chatbot client could interact with a trainee therapist and provide instant feedback on every therapist statement. This strategy has shown promise as a way of training non-therapists to increase their utilization of active listening skills (Tanana et al., under review).
Limitations
There are a number of important limitations to this initial pilot study. Participants were sampled from a small pool of non-trainee (experienced) therapists and counseling students. Recruited therapists were known to the authors, and student therapists may have been familiar with the faculty authors and knew the students performing the recruitment. It is possible that these relationships influenced participants’ evaluations of the tool. However, we expect these influences to be minimal. First, none of the faculty investigators directly participated in data collection, and survey results were anonymous. Second, the study emphasized our interest in feedback concerning problems, issues, and improvements to the system, which we expected and hoped to learn more about given the pilot nature of the system being evaluated. The free-text responses at the end of the survey suggest that participants were willing to provide negative feedback. However, further assessment is necessary with a larger sample, outside of this local therapist community.
Agreement between computer-generated and human-generated MI fidelity codes was lower than expected for some specific fidelity codes. For example, the correlation between human and machine models in Xiao et al. (2016) was .65, while the ICC in this report was .23. The present pilot study was not designed as a validation study of the machine learning prediction models, which have shown more robust results in previous research with far larger samples (Atkins et al., 2014; Tanana et al., 2016; Xiao et al., 2015). However, even in prior model-development papers, performance for rarer codes (typically MI-inconsistent behaviors) was similarly poor. Ongoing research that seeks to improve the accuracy of machine models for specific codes will be necessary. In addition, robustly evaluating accuracy across different clinical domains, and enhancing models for greater generalizability, should be an important focus of future work.
Despite variable correspondence between human and machine ratings, even the current pilot technology would be a substantial improvement over the feedback vacuum that is the current state of affairs in clinical practice. Furthermore, it might be misleading to benchmark the utility of machine-based ratings solely against a team of human raters who have been trained to reliability within the confines of a well-funded research study. When feedback is available in clinical practice, it typically comes from a supervisor offering feedback in an ad hoc manner, based on the therapist’s verbal report or a very small or selected sample of the therapist’s work. The consistency of a given supervisor’s feedback with that of another supervisor is wholly unknown and is likely to be substantially lower than reliability in highly resourced clinical studies where reliability is emphasized and monitored.
In the context of human and machine agreement, it is also useful to consider the tradeoffs between ease of repeated measurement and reliability. As an analogy, outcome measurement in psychotherapy research often reflects a tension between practicality and precision. Reliability of the latent construct (e.g., depression, anxiety, substance use) increases with the number of items, but each additional item makes a measure less practical for real-world clinical use. However, repeated administrations of a simpler, shorter measure can often replicate the reliability of a longer, more in-depth assessment (Bauer et al., in press). With a technology that allows immediate performance-based feedback and does not require human labor, we may see a similar tradeoff in which repeated measures at 60% of human reliability may ultimately be superior to infrequent (if not rare) assessments with higher reliability. Finally, numerous studies in machine learning have led to the axiom that “more data beats a cleverer algorithm,” and the accuracy of models improves almost universally with more data (Domingos, 2012). Thus, there is good reason to think that the accuracy of machine learning models will not be the rate-limiting factor in the ultimate success of machine-based feedback to therapists.
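One way to formalize this tradeoff is the Spearman-Brown prophecy formula, which projects the reliability of an average of k repeated measurements. Treating repeated machine ratings of a therapist as approximately parallel measurements is a strong simplifying assumption, but the sketch below applies the formula to the average human-machine ICC reported above (.48) across a purely illustrative range of session counts.

```python
# Spearman-Brown prophecy formula: reliability of the mean of k parallel measurements.
# Treating repeated machine ratings as parallel measurements is a simplification.
def spearman_brown(r_single: float, k: int) -> float:
    return k * r_single / (1 + (k - 1) * r_single)

r_single = 0.48   # average human-machine ICC reported in this study
for k in (1, 2, 4, 8):
    print(k, round(spearman_brown(r_single, k), 2))
# With four aggregated sessions, the projected reliability is roughly .79.
```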
An additional limitation of the present research is that the current feedback is specific to MI and thus may not be useful to counselors who are not interested in this feedback or do not use this approach frequently. It is not yet possible to offer feedback to therapists based on their utilization of other treatments (e.g., CBT or psychodynamic therapy). Despite the narrow initial focus of the feedback, MI is a useful starting point for the development of automated feedback. First, it is among the most well-studied treatments available (Lundahl, Kunz, Brownell, Tollefson, & Burke, 2010). Second, many of the components of MI are highly consistent with basic counseling skills that are foundational across many treatments (e.g., empathy, active listening). Thus, there is reason to expect that therapists who do not necessarily practice MI on a regular basis might find aspects of this system helpful. Indeed, many of the therapists in our sample did not consider themselves devoted to MI, but found the feedback quite helpful.
Conclusions
The combination of modern digital recording, speech signal processing, and machine learning technology presents the opportunity for a positive, disruptive shift in training and quality assurance in psychotherapy. It is conceivable that in the near future a therapist will be able to finish a session with a client, return to their computer to write a session note, and receive a complete feedback report of the previous hour’s work. This report may include session-level metrics of empathy and annotations of reflections and questions as well as an automatically generated transcription, but may ultimately include any number of ratings. This potential new wave of technology-enhanced psychotherapy feedback will allow therapists to dive into their therapeutic work in new and exciting ways, and may provide a new mechanism by which they can improve their work with clients.
Given the novelty of machine learning based feedback, the future development of systems like this one should be concerned with more than the accuracy of the ratings. This initial study suggests that a proof-of-concept system was acceptable to therapists. Yet, this new technology could allow feedback on a scale that will be highly disruptive to current methods of quality assurance and training, in both positive and negative ways. It may ultimately be possible for a clinic director to discover therapists who are struggling with their clients and intervene quickly, offering support or training. At the same time, technology that provides such a detailed level of oversight could be used punitively and bureaucratically, disrupting the work of therapists who are doing well with their clients, and leading therapists to feel that they are being surveilled in ways similar to a call center employee whose phone calls are recorded for quality assurance. Some users may be tempted to imbue computer-based systems with more authority than is appropriate (i.e., who can argue with a computer?; Hirsch et al., 2018), and thus implementation efforts should also work to help therapists, supervisors, and administrators be appropriately skeptical of the information they receive (Hirsch, Merced, Narayanan, Imel, & Atkins, 2017). The adoption of advanced technology into the very human process of psychotherapy should proceed cautiously and with the full involvement of stakeholders (e.g., therapists, supervisors, trainers, clinic administrators), who are likely to have idiosyncratic views of the technology either as suspect and intrusive or as welcome and necessary to improve the quality of psychotherapy clients receive.
Clinical Impact Statement.
Question:
How do therapists experience automated evaluations of their sessions?
Findings:
Therapists endorsed strong satisfaction, usability, and perceived accuracy of the automated feedback.
Meaning:
Machine learning technologies have the potential to dramatically scale up the amount of feedback therapists receive after their sessions.
Next Steps:
Building on this pilot study, both usability and accuracy should be tested in larger and different types of therapist samples. Additional work should focus on the potential impact of automated feedback on therapist behavior in session.
Funded by:
● National Institute on Drug Abuse
● National Institutes of Health, National Institute on Alcohol Abuse and Alcoholism
Footnotes
1. By automatic, we mean that no human evaluation was necessary to generate the specific feedback scores or the report presenting them. However, the technology tested does not provide feedback in ‘real time’ during the session. It is available shortly after the session is completed, with the amount of time dependent on the processing speed of the computers used to run the models.
2. It is likely that acoustic features of human speech are also predictive of MI fidelity codes. However, the models currently in the feedback system do not yet incorporate this information. The addition of acoustic features is a focus of ongoing research.
3. The session available online is not a research study session, but a fully role-played session.
References
- Atkins DC, Steyvers M, Imel ZE, & Smyth P (2014). Scaling up the evaluation of psychotherapy: Evaluating motivational interviewing fidelity via statistical text classification. Implementation Science, 9(49), 1–11. 10.1186/1748-5908-9-49
- Baer J, Rosengren D, Dunn C, Wells E, Ogle RL, & Hartzler B (2004). An evaluation of workshop training in motivational interviewing for addiction and mental health clinicians. Drug and Alcohol Dependence, 73, 99–106.
- Baer JS, Wells EA, Rosengren DB, Hartzler B, Beadnell B, & Dunn C (2009). Agency context and tailored training in technology transfer: A pilot evaluation of motivational interviewing training for community counselors. Journal of Substance Abuse Treatment, 37(2), 191–202. 10.1016/j.jsat.2009.01.003
- Bauer AM, Baldwin SA, Anguera JA, Arean PA, & Atkins DC (in press). Comparing approaches to mobile depression assessment for measurement-based care. Journal of Medical Internet Research.
- Berger AL, Pietra VJD, & Pietra SAD (1996). A maximum entropy approach to natural language processing. Computational Linguistics, 22(1), 39–71. Retrieved from http://dl.acm.org/citation.cfm?id=234285.234289
- Can D, Marín RA, Georgiou PG, Imel ZE, Atkins DC, & Narayanan SS (2016). “It sounds like…”: A natural language processing approach to detecting counselor reflections in motivational interviewing. Journal of Counseling Psychology, 63(3), 343–350. 10.1037/cou0000111
- Creed TA, Frankel SA, German RE, Green KL, Jager-Hyman S, Taylor KP, … Beck AT (2016). Implementation of transdiagnostic cognitive therapy in community behavioral health: The Beck Community Initiative. Journal of Consulting and Clinical Psychology, 84(12), 1116–1126. 10.1037/ccp0000105
- Domingos P (2012). A few useful things to know about machine learning. Communications of the ACM, 55, 78–87.
- Drucker H, Burges CJC, Kaufman L, Smola AJ, & Vapnik V (1997). Support vector regression machines. In Mozer MC, Jordan MI, & Petsche T (Eds.), Advances in Neural Information Processing Systems 9 (pp. 155–161). MIT Press. Retrieved from http://papers.nips.cc/paper/1238-support-vector-regression-machines.pdf
- Epstein ML, Epstein BB, & Brosvic GM (2001). Immediate feedback during academic testing. Psychological Reports, 88(3), 889–894. 10.2466/pr0.2001.88.3.889
- Gaut G, Steyvers M, Imel ZE, Atkins DC, & Smyth P (2015). Content coding of psychotherapy transcripts using labeled topic models. IEEE Journal of Biomedical and Health Informatics, 21(2), 476–487. 10.1109/JBHI.2015.2503985
- Gibson J, Gray G, Hirsch T, Imel ZE, Narayanan SS, & Atkins DS (2016). Developing an automated report card for addiction counseling: The counselor observer ratings expert for MI (CORE-MI). Paper presented at CHI, San Jose, CA.
- Hasan M, Kotov A, Carcone A, Dong M, Naar S, & Hartlieb KB (2016). A study of the effectiveness of machine learning methods for classification of clinical interview fragments into a large number of categories. Journal of Biomedical Informatics, 62, 21–31. 10.1016/j.jbi.2016.05.004
- Hattie J, & Timperley H (2007). The power of feedback. Review of Educational Research, 77(1), 81–112. 10.3102/003465430298487
- Hirsch T, Merced K, Narayanan S, Imel ZE, & Atkins DC (2017). Designing contestability: Interaction design, machine learning, and mental health. Paper presented at DIS: Designing Interactive Systems, Edinburgh, United Kingdom. 10.1145/3064663.3064703
- Hirsch T, Soma C, Merced K, Kuo P, Dembe A, Caperton D, Atkins DC, & Imel ZE (2018). “It’s hard to argue with a computer”: Investigating psychotherapists’ attitudes towards automated evaluation. Paper presented at the ACM Conference on Designing Interactive Systems (DIS ’18), Hong Kong.
- Hirschberg J, & Manning CD (2015). Advances in natural language processing. Science, 349(6245), 261–266. 10.1126/science.aaa8685
- Howes C, Purver M, & McCabe R (2013, March). Investigating topic modelling for therapy dialogue analysis. In Proceedings of the Computational Semantics in Clinical Text Workshop at IWCS, Potsdam.
- Howes C, Purver M, & McCabe R (2014). Linguistic indicators of severity and progress in online text-based therapy for depression. In Proceedings of the Workshop on Computational Linguistics and Clinical Psychology: From Linguistic Signal to Clinical Reality (pp. 7–16).
- Imel ZE, Baldwin SA, Baer JS, Hartzler B, Dunn C, Rosengren DB, & Atkins DC (2014). Evaluating therapist adherence in motivational interviewing by comparing performance with standardized and real patients. Journal of Consulting and Clinical Psychology, 82(3), 472–481. 10.1037/a0036158
- Imel ZE, Steyvers M, & Atkins DC (2015). Computational psychotherapy research: Scaling up the evaluation of patient-provider interactions. Psychotherapy, 52(1), 19–30. 10.1037/a0036841
- Jones KS (1994). Natural language processing: A historical review. In Zampolli A, Calzolari N, & Palmer M (Eds.), Current Issues in Computational Linguistics: In Honour of Don Walker (pp. 3–16). Dordrecht: Springer Netherlands. 10.1007/978-0-585-35958-8_1
- Jurafsky D, & Martin JH (2008). Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition (2nd ed.). Prentice Hall.
- Kendrick T, Moore M, Gilbody S, Churchill R, Stuart B, & El-Gohary M (2016). Routine use of patient reported outcome measures (PROMs) for improving treatment of common mental health disorders in adults. Cochrane Database of Systematic Reviews, (6). 10.1002/14651858.CD011119
- Kluger AN, & DeNisi A (1996). The effects of feedback interventions on performance: A historical review, a meta-analysis, and a preliminary feedback intervention theory. Psychological Bulletin, 119(2), 254–284.
- Kulik JA, & Kulik CLC (1988). Timing of feedback and verbal learning. Review of Educational Research, 58(1), 79–97. 10.3102/00346543058001079
- Ladany N, Hill CE, Corbett MM, & Nutt EA (1996). Nature, extent, and importance of what psychotherapy trainees do not disclose to their supervisors. Journal of Counseling Psychology, 43(1), 10–24. 10.1037/0022-0167.43.1.10
- Lambert MJ, Harmon C, Slade K, Whipple JL, & Hawkins EJ (2005). Providing feedback to psychotherapists on their patients’ progress: Clinical results and practice suggestions. Journal of Clinical Psychology, 61(2), 165–174. 10.1002/jclp.20113
- Lambert MJ, Whipple JL, Vermeersch DA, Smart DW, Hawkins EJ, Nielsen SL, & Goates M (2002). Enhancing psychotherapy outcomes via providing feedback on client progress: A replication. Clinical Psychology & Psychotherapy, 9(2), 91–103. 10.1002/cpp.324
- Lord SP, Can D, Yi M, Marin R, Dunn CW, Imel ZE, … Atkins DC (2015). Advancing methods for reliably assessing motivational interviewing fidelity using the motivational interviewing skills code. Journal of Substance Abuse Treatment, 49, 50–57. 10.1016/j.jsat.2014.08.005
- Lundahl BW, Kunz C, Brownell C, Tollefson D, & Burke BL (2010). A meta-analysis of motivational interviewing: Twenty-five years of empirical studies. Research on Social Work Practice, 20(2), 137–160. 10.1177/1049731509347850
- McHugh RK, & Barlow DH (2010). The dissemination and implementation of evidence-based psychological treatments: A review of current efforts. American Psychologist, 65(2), 73–84. 10.1037/a0018121
- Miller WR, Moyers TB, Ernst D, & Amrhein P (2008). Manual for the Motivational Interviewing Skill Code (MISC), Version 2.1. Albuquerque, NM: Center on Alcoholism, Substance Abuse, and Addictions, The University of New Mexico.
- Miller WR, & Rollnick S (2012). Motivational interviewing: Helping people change (3rd ed.). New York, NY: Guilford Press.
- Miller WR, Yahne CE, Moyers TB, Martinez J, & Pirritano M (2004). A randomized trial of methods to help clinicians learn motivational interviewing. Journal of Consulting and Clinical Psychology, 72(6), 1050–1062. 10.1037/0022-006X.72.6.1050
- Moyers TB, Manuel JK, & Ernst D (2014). Motivational interviewing treatment integrity coding manual 4.1. Unpublished manual.
- Moyers TB, Martin T, Manuel JK, Miller WR, & Ernst D (2010). Revised global scales: Motivational interviewing treatment integrity 3.1.1 (MITI 3.1.1). Unpublished manuscript, University of New Mexico, Albuquerque, NM. Retrieved from http://www.marrch.org/associations/4671/files/02-09-MITI-3-1-1.pdf
- Norman D (2013). The Design of Everyday Things: Revised and Expanded Edition. Basic Books.
- Olfson M, & Marcus SC (2010). National trends in outpatient psychotherapy. The American Journal of Psychiatry, 167(12), 1456–1463. 10.1176/appi.ajp.2010.10040570
- Pace B, Tanana M, Xiao B, Dembe A, Soma C, Steyvers M, & Imel ZE (2016). What about the words? Natural language processing in psychotherapy. Psychotherapy Bulletin, 51(1), 17–18.
- Pérez-Rosas V, Mihalcea R, Resnicow K, Singh S, Ann L, Goggin KJ, & Catley D (2017). Predicting counselor behaviors in motivational interviewing encounters. Paper presented at the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers. Retrieved from http://www.aclweb.org/anthology/E17-1106
- Povey D, Ghoshal A, Boulianne G, Burget L, Glembek O, Goel N, … Vesely K (2011). The Kaldi speech recognition toolkit. Paper presented at IEEE Signal Processing Society. Retrieved from https://infoscience.epfl.ch/record/192584
- Proctor EK, Landsverk J, Aarons G, Chambers D, Glisson C, & Mittman B (2009). Implementation research in mental health services: An emerging science with conceptual, methodological, and training challenges. Administration and Policy in Mental Health and Mental Health Services Research, 36(1), 24–34.
- Reese RJ, Duncan BL, Bohanske RT, Owen JJ, & Minami T (2014). Benchmarking outcomes in a public behavioral health setting: Feedback as a quality improvement strategy. Journal of Consulting and Clinical Psychology, 82, 731–742.
- Rogers CR (1951). Studies in client-centered psychotherapy III: The case of Mrs. Oak - A research analysis. Psychological Service Center Journal, 3(1–2), 47–165.
- Rush AJ (2015). Isn’t it about time to employ measurement-based care in practice? The American Journal of Psychiatry, 172(10), 934–936. 10.1176/appi.ajp.2015.15070928
- Schooler LJ, & Anderson JR (2008). The disruptive potential of immediate feedback. Research Showcase, 702–708. Retrieved from http://repository.cmu.edu/cgi/viewcontent.cgi?article=1079&context=psychology
- Schwalbe CS, Oh HY, & Zweben A (2014). Sustaining motivational interviewing: A meta-analysis of training studies. Addiction, 109(8), 1287–1294.
- Slade K, Lambert MJ, Harmon SC, Smart DW, & Bailey R (2008). Improving psychotherapy outcome: The use of immediate electronic feedback and revised clinical support tools. Clinical Psychology and Psychotherapy, 15(5), 287–303. 10.1002/cpp.594
- Tanana M, Hallgren KA, Imel ZE, Atkins DC, & Srikumar V (2016). A comparison of natural language processing methods for automated coding of motivational interviewing. Journal of Substance Abuse Treatment, 65, 43–50. 10.1016/j.jsat.2016.01.006
- Tanana MJ, Soma CS, Srikumar V, Atkins DC, & Imel ZE (2018). Development and evaluation of ClientBot: A patient-like conversational agent to train basic counseling skills. Journal of Medical Internet Research (under review).
- Xiao B, Huang C, Imel ZE, Atkins DC, Georgiou P, & Narayanan SS (2016). A technology prototype system for rating therapist empathy from audio recordings in addiction counseling. PeerJ Computer Science, 2:e59. 10.7717/peerj-cs.59
- Xiao B, Imel ZE, Georgiou PG, Atkins DC, & Narayanan SS (2015). “Rate my therapist”: Automated detection of empathy in drug and alcohol counseling via speech and language processing. PLoS ONE, 10(12).
