Author manuscript; available in PMC 2016 Sep 4.
Published in final edited form as: Signal Inf Process Assoc Annu Summit Conf APSIPA Asia Pac. 2013 Jan 17;2012:6411762.

Analyzing the Language of Therapist Empathy in Motivational Interview based Psychotherapy

Bo Xiao 1, Dogan Can 2, Panayiotis G Georgiou 1, David Atkins 3, Shrikanth S Narayanan 1,2
PMCID: PMC5010859  NIHMSID: NIHMS812826  PMID: 27602411

Abstract

Empathy is an important aspect of social communication, especially in medical and psychotherapy applications, and measures of empathy can offer insights into the quality of therapy. We use an N-gram language model based maximum likelihood strategy to classify empathic versus non-empathic utterances, and report the precision and recall of classification for various parameters. High recall is obtained with unigram features, while bigram features achieve the highest F1-score. Based on the utterance level models, a group of lexical features is extracted at the therapy session level. The effectiveness of these features in modeling session level annotator perceptions of empathy is evaluated through correlation with expert-coded session level empathy scores; our combined feature set achieves a correlation of 0.558 between predicted and expert-coded scores. The results also suggest that longer-term empathy perception may be related more to isolated, salient empathic events.

Index Terms: Empathy, Motivational Interview, Language Model

1. Introduction

Empathy is a natural human ability that is studied across disciplines including psychology, neuroscience and social science [1]. The Merriam-Webster dictionary defines empathy as “the action of understanding, being aware of, being sensitive to, and vicariously experiencing the feelings, thoughts, and experience of another of either the past or present without having the feelings, thoughts, and experience fully communicated in an objectively explicit manner”. In general, empathy stands for the mental ability to feel for, and take the perspective of, others.

In social interactions, when empathy is expressed through verbal and non-verbal behaviors, the other party feels acknowledged, resulting in better and more efficient communication. Showing empathy is therefore deemed an important skill and is often related to better performance in domains centered on human interaction, such as medical care and psychotherapy [2, 3].

As one form of psychotherapy, Motivational Interview (MI) emphasizes the client’s own will to make a change: the therapist should try to understand the client and facilitate this change, instead of dictating what the client should do. Hence empathy is one of the indexes of therapist quality in MI. Conventionally, empathy is measured by observing audio or audiovisual recordings following expert-designed coding manuals. Due to the abstract nature of empathy, coders must be trained to ensure reliability; training and reviewing the recordings require extensive time, and the coding process is difficult to scale up. Researchers are therefore seeking computational techniques to automate this process and provide tools that facilitate their analysis, and multimodal Behavioral Signal Processing (BSP) approaches offer promising avenues to address this problem. In addition, BSP aims not only to aid, but also to transform observational practice through insights and increased observational capabilities.

As special cases of human interaction, medical care and psychotherapy dialogs are usually more structured: the conversation follows a certain implicit protocol and usually targets specific diagnostic and informational goals. The manifestation of a care provider’s empathy is embedded in their communication cues and patterns, and one of the key sources of such cues is language use. This paper focuses on computationally analyzing empathy as expressed in spoken language. There is promising support for this line of work: in [3], domain experts studied empathic behavior exemplified in conversation transcripts. Moreover, language modeling for classification of a group of abstract behaviors (e.g., acceptance, blame, humor) in distressed couple interactions has been shown to be effective, even with Automatic Speech Recognition (ASR) derived lexical features [4]. This motivates us to computationally study the relation between empathy expression and the corresponding language use.

In this paper we utilize two datasets of MI based psychotherapy sessions. In the first set, empathic language is annotated at the utterance level, from which we learn empathic and non-empathic language models; we report precision and recall for classifying utterances into the empathic and non-empathic classes. The second set of sessions is given a session level score of therapist empathy. We use the language models learned on the first set to extract a group of session level lexical features and obtain significant correlations with the expert-coded session level empathy scores. We therefore suggest that for MI, therapist empathy can be partially evaluated by means of computational language modeling. The experimental results also imply that coders tend to assess a therapist’s empathy by accounting for salient empathic events in a session.

Section 2 introduces the two datasets and the details of observational behavior coding. Section 3 explains how the language models are built and how features are extracted. Experimental results are reported in Section 4, followed by discussion in Section 5. Finally, we conclude the study in Section 6.

2. Data Sets

Both sets of data employed in the current study are from clinical trials using MI on substance use (drug abuse, alcohol use disorders, etc.) by college students. All sessions were manually transcribed, and only the therapist portions of the transcripts are utilized. Similar text pre-processing is applied to the two sets: speaking turns are split into utterances, either by the coder’s segmentation in the first set or by periods in the transcripts in the second set; word-external punctuation, quotes, words within parentheses, and other special symbols are then removed; capitalized characters are converted to lowercase; hyphens, apostrophes, underscores, as well as special notes in brackets such as [laughs] are retained.
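As a concrete illustration, the following is a minimal Python sketch of such a per-utterance cleanup. The function name and the exact regular expressions are our assumptions for illustration, not the authors' published pipeline.

```python
import re

def preprocess_utterance(text: str) -> str:
    # Rough sketch of the cleanup described above; rules and their order
    # are assumptions, not the authors' exact pipeline.
    text = re.sub(r"\([^)]*\)", " ", text)      # drop words within parentheses
    text = text.replace('"', " ")               # drop quotes
    # Keep hyphens, apostrophes, underscores, and bracketed notes such as
    # [laughs]; remove any other word-external punctuation.
    text = re.sub(r"[^\w\s'\-\[\]]", " ", text)
    return " ".join(text.lower().split())       # lowercase, collapse whitespace

# Example:
# preprocess_utterance('So, "you said" (aside) it\'s hard -- [laughs]')
# -> "so you said it's hard -- [laughs]"
```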

The first set comes from parts of three MI studies, referred to as ESP21, ESPSB and HMCBI. Three well-trained coders evaluated these sessions based on audio and the original transcripts, following the Motivational Interviewing Skill Code (MISC) manual [5], which describes therapist and client behaviors at the utterance level and assesses the therapist’s overall competence. In addition, the coding team introduced a code called “Brownie points”, assigned to an utterance whenever it locally exemplifies one of the global assessments of the therapist. “Empathy” is one such assessment, with the coding instruction describing it as “therapists show active interest in making sure they understand what the client is saying”. Brownie points make it easier to pinpoint typical empathic language as perceived by the coders. In total, 28 sessions were analyzed. To maximize the available training examples of empathic language, we treat an utterance as empathic if any one of the coders put an empathy marker on it; consequently, an utterance is considered non-empathic if none of the coders marked it as empathic. In total, 854 empathic and 6439 non-empathic utterances are identified. We call this set the MISC dataset.

The second set comes from parts of three other MI studies, referred to as ARC, iCHAMP and GOALS. The session level therapist coding scheme, Motivational Interviewing Treatment Integrity (MITI) [6], was used to give a global score of therapist empathy on a Likert scale ranging from 1 to 7, with 7 being highly empathic. Trained human coders used audio and the original transcripts to perform the coding. In total, 88 sessions are collected; the numbers of sessions scored 3 through 7 are 2, 24, 32, 29 and 1, respectively, and scores of 1 and 2 are not observed in the data. The empathy scores in this set are therefore mainly 4, 5 and 6. We call this set the MITI dataset.

In summary, the two datasets are presented in Table 1.

Table 1. Summary of MISC and MITI datasets

Dataset   Unit        Ratings
MISC      Utterance   Empathic: 854; Non-empathic: 6439
MITI      Session     Empathy rating on a 1–7 Likert scale (97% in the 4–6 range)

3. Language Modeling

3.1. Maximum likelihood classifier

In a Maximum Likelihood sense, we build a classifier based on language models of the empathic and non-empathic classes. Let E and N denote these two classes, and let an utterance formed by a word sequence {wi | i = 1, 2, ⋯, l} be denoted w; the likelihoods of w under the N-gram language models of the two classes are P(w|E) and P(w|N). The decision rule of the classifier is given in (1).

\[ w \rightarrow \arg\max_{C} P(w \mid C), \quad C \in \{E, N\} \tag{1} \]

However, such a classifier would suffer from over-training because of the small data size and the disjoint samples of the two classes. To tackle this issue we utilize a language model trained on a large separate dataset and mix it with both class models in two steps. First, let the model derived from the large dataset be denoted L and the mixed universal background model (UBM) be denoted B; the UBM is obtained by (2),

\[ P(w \mid B) = \lambda_1 P(w \mid L) + \sum_{C = E, N} \frac{1 - \lambda_1}{2} P(w \mid C) \tag{2} \]

where λ1 is the weight on L. Second, the UBM is mixed with both class models to obtain the final models, as in (3),

\[ P'(w \mid C) = \lambda_2 P(w \mid C) + (1 - \lambda_2) P(w \mid B), \quad C \in \{E, N\} \tag{3} \]

where λ2 is the weight on class E or N, and P′(w|E), P′(w|N) are the mixed language models for the two classes, respectively. The classifier is updated as in (4).

\[ w \rightarrow \arg\max_{C} P'(w \mid C), \quad C \in \{E, N\} \tag{4} \]
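To make the structure of (2)-(4) concrete, here is a minimal Python sketch. It assumes the utterance likelihoods under the three component models have already been computed elsewhere (e.g., with SRILM), and it applies the interpolation at the utterance level for readability; in practice the mixing would typically be done per N-gram probability. The default weights are the best-performing values reported in Section 4.1.

```python
def ubm_prob(p_L, p_E, p_N, lam1):
    """Eq. (2): interpolate the large-corpus model L with both class models."""
    return lam1 * p_L + (1.0 - lam1) / 2.0 * (p_E + p_N)

def mixed_prob(p_C, p_B, lam2):
    """Eq. (3): interpolate a class model with the UBM."""
    return lam2 * p_C + (1.0 - lam2) * p_B

def classify(p_E, p_N, p_L, lam1=0.5, lam2=0.7):
    """Eq. (4): maximum likelihood decision between classes E and N.
    p_E, p_N, p_L are the utterance's probabilities under the empathic,
    non-empathic, and large-corpus models (computed elsewhere)."""
    p_B = ubm_prob(p_L, p_E, p_N, lam1)
    return "E" if mixed_prob(p_E, p_B, lam2) >= mixed_prob(p_N, p_B, lam2) else "N"
```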

3.2. Session level feature extraction

The models generated above can also be used to extract session level lexical features reflecting the therapist’s overall empathy level. Let the set of such features be denoted F. We define d(w) as the difference in log probability of an utterance under the two models, as in (5).

\[ d(w) = \log P'(w \mid E) - \log P'(w \mid N) \tag{5} \]

We also use variants of (4) to assign empathy decisions at the utterance level. For each of these variants, as in (6), E denotes the set of utterances from the whole session U that are estimated as belonging to the empathic class.

The first feature f1 ∈ F is the sum of d(w) over w ∈ U, interpreted as the cumulative evidence of empathy in a session. Second, d(w) can be binarized by its polarity and summed to give f2 ∈ F, so that the session level feature is the count of positive decisions made at the utterance level with a threshold of 0 (which can be viewed as a decision under equal priors). Third, from a saliency point of view, we accept w ∈ E only if d(w) exceeds 0 by a moderate margin δ3; we denote this feature f3 ∈ F. In addition, we take the ratio of empathic utterances in a session as f4, i.e., the thresholded count (with its own threshold δ4) normalized by the number of utterances |U|. We also design a feature f5 with a varying threshold δ5(i) = δ5 × li, where li is the number of words in the i-th utterance of the session (utterance end symbol </s> included), so that longer utterances face a higher threshold. Finally, a feature f6 combines the normalization of f4 with the length-dependent threshold of f5. The features f1 to f6 are summarized in (6).

\[
\begin{aligned}
f_1 &= \textstyle\sum_{w \in U} d(w) \\
f_2 &= \left|\{\, w \mid w \in U,\; d(w) > 0 \,\}\right| \\
f_3 &= \left|\{\, w \mid w \in U,\; d(w) > \delta_3 \,\}\right| \\
f_4 &= \left|\{\, w \mid w \in U,\; d(w) > \delta_4 \,\}\right| \times \frac{1}{|U|} \\
f_5 &= \left|\{\, w_i \mid w_i \in U,\; d(w_i) > \delta_5 \times l_i \,\}\right| \\
f_6 &= \left|\{\, w_i \mid w_i \in U,\; d(w_i) > \delta_6 \times l_i \,\}\right| \times \frac{1}{|U|}
\end{aligned} \tag{6}
\]
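A minimal sketch of computing f1 through f6 for one session, assuming the per-utterance scores d(w) and utterance lengths li are given and the thresholds have already been tuned; the variable names are ours.

```python
def session_features(d_vals, lengths, d3, d4, d5, d6):
    # Sketch of Eq. (6). d_vals[i] is d(w_i) for the i-th utterance of the
    # session, lengths[i] is its word count l_i (including </s>); the four
    # thresholds are assumed to have been tuned as in Eq. (7).
    n = len(d_vals)
    f1 = sum(d_vals)
    f2 = sum(1 for d in d_vals if d > 0)
    f3 = sum(1 for d in d_vals if d > d3)
    f4 = sum(1 for d in d_vals if d > d4) / n
    f5 = sum(1 for d, l in zip(d_vals, lengths) if d > d5 * l)
    f6 = sum(1 for d, l in zip(d_vals, lengths) if d > d6 * l) / n
    return [f1, f2, f3, f4, f5, f6]
```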

For simplicity, unless explicitly stated, we use δϕ to denote any of δ3 to δ6. The value of δϕ can be optimized on a development set. Note that reasonable values of δϕ range from 0 to the maximum difference of log probability over the set of possible utterances; within these bounds we can search for the δϕ that maximizes the effectiveness of the feature.

For evaluation, let F = {fϕ(i)} denote the stream of feature fϕ and Y = {y(i)} the session level empathy scores for K sessions (i = 1, 2, …, K). We set the objective of the optimization to (7), where Corr(F, Y) is the correlation between F and Y. The optimization is applied for each ϕ = 3, …, 6.

\[ \delta_\phi^{*} = \arg\max_{\delta_\phi} \operatorname{Corr}(F, Y) \tag{7} \]
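A sketch of the optimization in (7) as an exhaustive grid search, matching the stepwise search with step size 0.01 used in Section 4.2; the grid bounds in the usage example are illustrative.

```python
import numpy as np

def optimize_delta(d_per_session, y, grid, feature_fn):
    # Eq. (7) by exhaustive search over candidate thresholds.
    # feature_fn(d_vals, delta) computes one of f3..f6 for a single session.
    best = (None, -np.inf)
    for delta in grid:
        f = [feature_fn(d_vals, delta) for d_vals in d_per_session]
        c = np.corrcoef(f, y)[0, 1]   # Pearson correlation Corr(F, Y)
        if c > best[1]:
            best = (delta, c)
    return best

# Hypothetical usage for f3 (grid bounds are illustrative):
# f3_fn = lambda d_vals, delta: sum(1 for d in d_vals if d > delta)
# delta3, corr3 = optimize_delta(dev_d, dev_y, np.arange(0.0, 5.0, 0.01), f3_fn)
```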

4. Experiments

4.1. Empathic utterance classification — MISC dataset

We use the SRILM toolkit [7] to implement the N-gram language models. The original models P(w|E) and P(w|N) are smoothed with the Kneser-Ney algorithm. The Switchboard text corpus [8] serves as the large dataset (L) for generating the UBM (B).

On the MISC dataset, a 5-fold cross-validation is carried out: the empathic and non-empathic utterances are each split equally into 5 parts, and in each fold one part of each class is held out. The remaining data are used to train the classifier as described in Section 3.1.

For evaluation we employ the precision and recall defined in (8), where CE denotes the set of utterances marked by the experts as empathic.

\[ \text{precision} = \frac{\left|\{\, w \mid w \in E \text{ and } w \in C_E \,\}\right|}{|E|}, \qquad \text{recall} = \frac{\left|\{\, w \mid w \in E \text{ and } w \in C_E \,\}\right|}{|C_E|} \tag{8} \]
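The metrics in (8), written out as a short Python function over sets of utterance identifiers; the F1-score reported below is the harmonic mean of the two.

```python
def precision_recall_f1(predicted, gold):
    # Eq. (8): `predicted` is the set of utterances classified as empathic (E),
    # `gold` the set marked empathic by the experts (C_E).
    tp = len(predicted & gold)
    p = tp / len(predicted) if predicted else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if (p + r) > 0 else 0.0
    return p, r, f1
```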

To test the effect of mixing parameters, we choose λ1 and λ2 from {0.1, 0.3, 0.5, 0.7, 0.9}, respectively. The empathy classification results on the held out sets using unigram, bigram or trigram features are shown in Figure 1, where points with the same λ1 or λ2 value are linked with a solid or dotted line, respectively.

Figure 1. Precision and recall of classifying empathic utterances with various λ1 and λ2.

We observe that unigram features result in higher recall and lower precision, while bigram features are higher in precision but lower in recall. The performance with trigram features is worse than with bigram features. The highest F1-score of 0.56, with 0.48 precision and 0.66 recall, is achieved using bigram features with λ1 = 0.5 and λ2 = 0.7.

The experiment shows that words in isolation, i.e., unigrams, do not separate empathic utterances as reliably as word usage in context, i.e., bigrams. However, likely due to increased sparsity at higher orders of context, bigram and trigram features are not as robust in recall as unigram features; in addition, trigram features perform worse than bigram features in both precision and recall.

4.2. Session level empathy — MITI dataset

In this experiment we test the effectiveness of the features proposed in Section 3.2. Restricted by the size of the MITI dataset, we use as much data as possible to optimize the δϕ parameters through leave-one-out cross-validation (88 rounds, each with 87 development sessions and 1 test session; see Footnote 1). On the held-out test sessions we evaluate the correlation between Y and the features F.

We take the bigram model learned on the whole MISC dataset with λ1 = 0.5 and λ2 = 0.7, i.e., the model yielding the highest F1-score, as an example. To optimize δϕ, we ran a simple stepwise search with a step size of 0.01 in each round (f1 and f2 do not require optimization). The baseline correlations of f1, ⋯, f6 with Y over all sessions are reported in Table 2. f1 fails to correlate significantly with Y; f2 has a positive correlation significant at p = 0.001; f3 to f6 give better correlations, above 0.4. In Figure 2 we plot the f3 feature value on the horizontal axis against the corresponding Y on the vertical axis, along with histograms of f3 for the different values of Y. There is a visible tendency for larger f3 to be associated with a larger session level empathy score.

Table 2. Correlations of the lexical features F with Y

Feature       f1            f2            f3
Correlation   −0.109        0.345         0.408
p-value       0.31          1.0 × 10^−3   7.9 × 10^−5

Feature       f4            f5            f6
Correlation   0.427         0.404         0.438
p-value       3.4 × 10^−5   9.6 × 10^−5   2.0 × 10^−5

Feature       f1–f6 combined   f3–f6 combined
Correlation   0.558            0.495
p-value       1.6 × 10^−8      9.4 × 10^−7

Figure 2. Feature f3 versus session level empathy score Y.

In addition, we are interested in the combined performance of f1 to f6. Fitting the above F and Y with a linear regression model, the predicted Ŷ has a correlation of 0.558 with Y. Compared with the nested models using only f3 to f6 individually, the extended multivariate model significantly improves accuracy under an F-test at α = 0.05. We should also note that the features in F are often highly correlated, as they are not independently generated; for instance, f2 and f5 have a correlation of 0.972. We therefore adopted a Bayesian linear regression approach to mitigate the multicollinearity issue, and achieved a correlation of 0.531 between Ŷ and Y.
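A sketch of this combination step under stated assumptions: the paper does not spell out the exact fitting protocol, so this version uses leave-one-out prediction, mirroring the cross-validation used for the thresholds, with scikit-learn's ordinary and Bayesian linear regressors standing in for the models described.

```python
import numpy as np
from sklearn.linear_model import BayesianRidge, LinearRegression
from sklearn.model_selection import LeaveOneOut

def combined_correlation(X, y, model):
    # Predict each session's score from the feature matrix X (one row per
    # session, columns f1..f6) with that session left out of the fit, then
    # correlate the predictions Y-hat with the expert scores Y.
    y_hat = np.empty_like(y, dtype=float)
    for train, test in LeaveOneOut().split(X):
        model.fit(X[train], y[train])
        y_hat[test] = model.predict(X[test])
    return np.corrcoef(y_hat, y)[0, 1]

# Hypothetical usage:
# corr_ols   = combined_correlation(X, y, LinearRegression())  # 0.558 reported
# corr_bayes = combined_correlation(X, y, BayesianRidge())     # 0.531 reported
```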

5. Discussion

5.1. High empathy words in unigram

To better understand the major distinguishing features of empathic and non-empathic language, we report in this section the most discriminative words between the two models, ranked by D(w), the product of d(w) and the number of occurrences of the word in the MISC dataset, as defined in (9) below. With λ1 = 0.5 and λ2 = 0.7, the result is listed in Table 3, where the words with the most positive (empathic) and most negative (non-empathic) D(w) are displayed separately.

\[ D(w) = d(w) \times \mathrm{count}(w) \tag{9} \]

Table 3. Words having prominent discriminative power

Empathy: you’re, you, it, like, sounds, so, and, you’ve, your, of, that, to, it’s, a, with, kind, not, really, for, kinda, time, friends, maybe

Non-Empathy: they, mm-hmm, what, we, alcohol, this, yeah, think, about, okay, drinks, right, if, do, is, that’s, they’re, b_a_c, us, um-hum

We can see more second person pronouns and more reflective-listening words such as “sounds” on the empathy side, while the non-empathy side contains more first and third person pronouns and more neutral following words such as “mm-hmm”. This matches highly reflective therapy talk such as “It sounds like you’re …”, reflections being accepted in therapy as a highly empathic language technique.
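A sketch of the ranking behind Table 3, assuming hypothetical dictionaries mapping each word to its unigram log probability under the mixed E and N models:

```python
from collections import Counter

def rank_discriminative_words(tokens, logp_E, logp_N, top_k=20):
    # Eq. (9): D(w) = d(w) * count(w), with d(w) the unigram log probability
    # difference between the mixed E and N models. logp_E / logp_N are
    # hypothetical dicts from vocabulary word to log probability.
    counts = Counter(tokens)
    D = {w: (logp_E[w] - logp_N[w]) * c
         for w, c in counts.items() if w in logp_E and w in logp_N}
    ranked = sorted(D, key=D.get, reverse=True)
    return ranked[:top_k], ranked[-top_k:][::-1]  # empathic, non-empathic ends
```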

5.2. Empathy perception as salient events

In Section 4.2 the f1 feature does not yield a significant result, while features obtained via thresholding, such as f3, are significantly correlated with overall empathy. One interpretation is that the degree of empathy in a session as perceived by the coder is not simply the cumulative empathy of each utterance; rather, enough occurrences of salient empathic utterances act as “highlights” that strengthen the coder’s decision. This matches existing theories of perception such as the Gestalt theory of perception [9]. It is also worth noting that our non-empathic training samples consist largely of generic, neutral language rather than the exact opposite of empathy, so a higher probability under the non-empathy model does not indicate strong opposition to empathy.

5.3. Related work on modeling empathy

There have been a few studies of computational models of empathy in the literature. In [10] the authors constructed a virtual-environment system involving a user and a virtual agent. In training mode, a human trainer guided the virtual agent to act in an empathic manner; in test mode, the system decides when and how the virtual agent should act empathically. Timing, location and intention information within the virtual environment were employed as features, and Naive Bayes and Decision Tree models were adopted for learning. Experiments showed the system could provide the basis for empathic behavior control of the virtual agent. In [11] the authors suggested that the occurrence and attributes of emotional interaction (i.e., empathic, antipathetic or unconcerned) are related to facial expression and gaze in multi-person interaction. Computer vision techniques were used to detect “who is facing whom and when”, and empathy level annotations were provided by human evaluators. The authors built a Bayesian learning model to estimate the level of empathy from the extracted cues; experiments showed that the system was able to infer empathic behavior.

5.4. Future work: towards an evaluator model of empathy

We have analyzed language modeling of empathy in a Maximum Likelihood sense, but many more aspects can be incorporated. For example, the client’s language is not utilized in this study; a more complete model should consider the therapist’s empathic language in the context of the conversation with the client. As studied in [3], opportunities for the therapist to express empathic language arise within this context, and by tracking or hypothesizing such opportunities one could obtain a more accurate measure of how well the therapist is doing. As the above discussion of word use shows, empathic language is often realized as reflection of the client’s talk, so locating the therapist’s reflections [12] might be helpful for evaluating empathy.

Moreover, recall that empathy is expressed not only in language but also via many other modalities, such as manner of speaking (acoustic features), body gesture and motion, facial expression, and eye contact. A better evaluator model of empathy should ideally incorporate these modalities and conduct reasoning in the context of the conversation. For example, we have successfully used such BSP approaches in behavioral coding of distressed couple interactions [13].

6. Conclusion

In this paper we introduced empathy as an important aspect of social communication, especially in medical and psychotherapy applications. For psychotherapy based on Motivational Interview, the characteristics of empathic and non-empathic behavior are learned with N-gram language models, and a language-based classifier of empathic versus non-empathic utterances is proposed in a Maximum Likelihood sense. High recall is obtained with unigram features, while bigram features achieve the highest F1-score. Based on the language models, a group of lexical features is proposed and tested via correlation with expert-coded session level empathy scores; the combined features achieve a correlation of 0.558 between predicted and expert-coded scores. The study suggests that in psychotherapy scenarios, where language use is constrained by the application, computational language modeling can provide useful insights into the expressed empathy behavior of therapists. Moreover, the experiments show that human coders tend to assess session level empathy as a gestalt of salient empathic behavior.

Footnotes

1. Note that no separate training set is needed here: the language models are already trained on the MISC dataset.

References

1. Decety J. A social cognitive neuroscience model of human empathy. In: Social Neuroscience: Integrating Biological and Psychological Explanations of Social Behavior. 2007:246–270.
2. Bellet P, Maloney M. The importance of empathy as an interviewing skill in medicine. Journal of the American Medical Association. 1991;266(13):1831–1832.
3. Suchman A, Markakis K, Beckman H, Frankel R. A model of empathic communication in the medical interview. Journal of the American Medical Association. 1997;277(8):678–682.
4. Georgiou P, Black M, Lammert A, Baucom B, Narayanan S. “That’s aggravating, very aggravating”: Is it possible to classify behaviors in couple interactions using automatically derived lexical features? Proc. ACII, Springer. 2011:87–96.
5. Miller W, Moyers T, Ernst D, Amrhein P. Manual for the Motivational Interviewing Skill Code (MISC), version 2.1. Center on Alcoholism, Substance Abuse and Addictions (CASAA), University of New Mexico. 2008.
6. Moyers T, Martin T, Manuel J, Miller W. The Motivational Interviewing Treatment Integrity (MITI) code: version 2.0. Unpublished manuscript. Albuquerque, NM: University of New Mexico, Center on Alcoholism, Substance Abuse and Addictions. 2008.
7. Stolcke A. SRILM - an extensible language modeling toolkit. Proc. ICSLP. 2002;2:901–904.
8. Godfrey J, Holliman E, McDaniel J. SWITCHBOARD: Telephone speech corpus for research and development. Proc. ICASSP, IEEE. 1992;1:517–520.
9. Humphrey G. The psychology of the gestalt. Journal of Educational Psychology. 1924;15(7):401–412.
10. McQuiggan S, Lester J. Modeling and evaluating empathy in embodied companion agents. International Journal of Human-Computer Studies. 2007;65(4):348–360.
11. Kumano S, Otsuka K, Mikami D, Yamato J. Analyzing empathetic interactions based on the probabilistic modeling of the co-occurrence patterns of facial expressions in group meetings. Proc. FG, IEEE. 2011:43–50.
12. Can D, Georgiou P, Atkins D, Narayanan S. A case study: Detecting reflections with linguistic features in motivational interviewing based psychological therapy. Submission to IS. 2012.
13. Black M, Katsamanis A, Baucom B, Lee C, Lammert A, Christensen A, Georgiou P, Narayanan S. Toward automating a human behavioral coding system for married couples’ interactions using speech acoustic features. Speech Communication. 2011.
