Abstract
Psychotherapy represents a broad class of medical interventions received by millions of patients each year. Unlike most medical treatments, its primary mechanisms are linguistic; i.e., the treatment relies directly on a conversation between a patient and provider. However, the evaluation of patient-provider conversation suffers from critical shortcomings, including intensive labor requirements, coder error, non-standardized coding systems, and inability to scale up to larger data sets. To overcome these shortcomings, psychotherapy analysis needs a reliable and scalable method for summarizing the content of treatment encounters. We used a publicly available psychotherapy corpus from Alexander Street Press, comprising a large collection of transcripts of patient-provider conversations, to compare coding performance for two machine learning methods. We used the Labeled Latent Dirichlet Allocation (L-LDA) model to learn associations between text and codes, to predict codes in psychotherapy sessions, and to localize specific passages of within-session text representative of a session code. We compared the L-LDA model to a baseline lasso logistic regression model using predictive accuracy and model generalizability (measured by calculating the area under the curve (AUC) of the receiver operating characteristic (ROC) curve). The L-LDA model outperforms the lasso logistic regression model at predicting session-level codes, with average AUC scores of .79 and .70, respectively. For fine-grained coding, both L-LDA and logistic regression are able to identify specific talk-turns representative of symptom codes; however, model performance for talk-turn identification is not yet as reliable as human coders. We conclude that the L-LDA model has the potential to be an objective, scalable method for accurate automated coding of psychotherapy sessions that performs better than comparable discriminative methods at session-level coding and can also predict fine-grained codes.
Index Terms: machine learning, labeled latent Dirichlet allocation, clinical communication, conversation analysis, multi-label document classification
I. Introduction
Across medical specialties, the basic medium of information gathering and intervention between the provider (i.e., MD, psychologist, nurse) and patient is a conversation. The patient describes problems and the provider listens, asks questions, and recommends solutions and specific treatment strategies. The content of this conversation can be useful across a broad variety of contexts, such as helping a primary care provider to detect and prevent suicide [1], promoting patient adherence to treatment recommendations [2], reducing cold severity and duration [3], and predicting a surgeon’s history of malpractice lawsuits [4].
Psychotherapy (sometimes called counseling or behavioral treatment) represents a particular class of interventions that has a special focus on the provider-patient interaction. With psychotherapy, the interaction contains the treatment’s active ingredients rather than simply being a means of developing rapport and forming a diagnosis. Psychotherapy ranges from brief, single session interventions [5] to multi-session interventions over weeks or months [6] and research suggests that psychotherapy is effective for a broad range of mental health disorders [7].
The typical method of summarizing the content of this conversation is based on the provider’s recollection and self-report of what happened, as recorded in the medical record. Many methods exist for obtaining summary measures from transcribed text, e.g., by separating a transcript into broad semantic topics [9]–[13], [15], [16], detailed behavioral features (such as requests for clarification [17]), or syntactic parts of speech [18], among others. These summary measures can be used as context to extract and evaluate treatment information, including patient diagnosis, analysis of client communication, and evaluation of suicide risk [19]–[24].
At present, the evaluation of psychotherapy sessions and other types of patient-provider communication relies on human raters who summarize sessions by attaching codes (also called labeling or annotating) in order to quantify the information in treatment encounters [27]. The process of attaching these codes, called observational coding, provides theory-driven organizational systems through which complex linguistic data can be structured for further analysis. Codes can represent the subject of conversation (e.g., medications, spousal relationships), symptoms expressed (e.g., depression, anxiety, anger), or specific verbal behaviors in individual utterances or talk-turns for providers (e.g., open or closed questions by the therapist, degree of empathy) or patients (e.g., signaling intent to change or maintain behavior).
Observational coding has critical shortcomings, including intensive labor requirements, coder error, non-standardized coding systems (new codes require new training), and inability to scale up to larger coding projects [13]. Each hour of therapy takes roughly 10 hours to code, and the number of alcohol and drug abuse sessions in the U.S. healthcare system alone runs into the hundreds of thousands per year. The burden of human coding leads typical psychotherapy research studies to be small, which contributes to the considerable heterogeneity across studies investigating the relationships between therapist behavior and patient outcome [25]. Accordingly, human-based coding is not a feasible method for evaluating the content of treatment encounters on a large scale. An objective, scalable method for summarizing the content of actual treatment encounters is needed.
We can describe the implementation of coding systems for text as multiple-label classification problems where multiple codes are attached to each document [28]. Machine learning approaches for automatic multiple-label document classification have been successfully used in various domains [29]–[31], [35], [36], including medical applications for disease diagnosis and medical error detection [37]–[39]. One such class of tools called topic models [40] has been used to assess the fidelity of therapist treatment [13] through prediction of behavioral codes, compare type of psychotherapy treatment [15], and predict therapy outcomes in schizophrenic patients [41], [42].
In this paper, we illustrate the ability of one specific type of topic model, Labeled Latent Dirichlet Allocation (L-LDA) [12], [43], to semi-automatically infer subject and symptom codes from a large heterogeneous psychotherapy corpus; i.e., what topics and symptoms were discussed during treatment. Every session in the corpus was manually annotated with general discussion content and patient symptom codes such that the observable outcomes of the manual annotation process are codes for the session as a whole. However, implicit in the coding process is a fine-grained, or local, evidence-accumulation process where each word, utterance or talk-turn in a session affects the decision to attach a given code. Establishing a link between specific within-session passages of text and overall codes for the session (session-level codes) is fundamental to understanding the coding procedure. We implement a model that, in addition to learning session-level coding systems, can localize specific passages of text representative of a session code. In other words, the model is able to infer codes at a local (talk-turn) level from codes that were provided at the global (session) level.
Previous work on computational analysis of psychotherapy transcripts used topic models to summarize therapy corpora and extract features for use in predictive models for therapy type [15] or as a stand-alone model to predict behavioral codes [13]. Our current work expands upon past research by using topic models to predict session content, by providing a detailed quantitative evaluation of predictive performance that includes comparisons to baseline models, and by developing methodology for talk-turn annotation using session-level metadata.
The model is evaluated and compared against a baseline discriminative model (lasso logistic regression) using standard performance measures: the receiver operating characteristic (ROC) curve and the area under the curve (AUC). Additionally, we provide R-precision [26] scores for talk-turn prediction. Session-level R-precision scores can be found in the supplementary files. For all performance evaluation, we use 10-fold cross-validation at the session level to emphasize the models’ ability to predict novel data.
As we will discuss in the experimental results section, the accuracy of the proposed techniques, in terms of code prediction, is not yet at the level of human annotators. Thus, these approaches are not yet ready to be used for fully automated annotation of therapy transcripts in an off-the-shelf manner. Nonetheless, as we outline in the discussion section later in the paper, the current techniques could potentially be used as components within a semi-automated approach, for example to assist in therapist training by using the model to rank specific talk-turns within a trainee’s session by their likelihood of containing specific codes and presenting them to a supervising therapist. There are multiple publicly available L-LDA software packages [32]–[34] that could be used to support such efforts. More broadly, the work described in this paper represents the next step towards a long-term goal of fully automatic code prediction for psychotherapy transcripts.
II. Data
The primary source of data comes from a psychotherapy corpus maintained by Alexander Street Press and made available via library subscription. At the time of the present analyses, the corpus contained 1,181 therapy sessions with approximately 8 million words. Each session was conducted with a unique therapist and client. On average each session contains 250 talk-turns, which are defined as uninterrupted passages of time during which either the patient or therapist speaks. Talk-turn length ranges from several words to several sentences. Sessions were conducted by prominent psychotherapists and serve as exemplars of different treatment approaches. Each session includes metadata such as patient age, patient gender, type of psychotherapy, and two types of nominal content codes (i.e., labels) referring to subjects discussed in the session (161 possible codes) and patient symptoms discussed in the session (48 possible codes). We use subject and symptom codes because we are interested in the relationship between language and the codes’ semantic meanings (as opposed to codes for type of therapy, client gender, etc.). The list of symptom and subject codes was derived from the DSM-IV manual and other primary psychology/psychiatry texts. All codes annotated in the psychotherapy corpus are session-level codes, meaning that a single label is applied (as a binary present/absent label) to the entire session; the original corpus did not include any labels for specific subunits such as sentences, talk-turns, or paragraphs. Each session is annotated with multiple codes (min = 1 code, max = 17 codes), and the average session is annotated with approximately 5 codes.

Prior to analysis we applied a number of preprocessing steps, including stop word removal and n-gram extraction, to convert the original corpus into a form suitable for text analysis. We chose stop words from standard lists used in natural language processing and augmented these lists with words from the corpus that were not on standard stop word lists but that contain little semantic content (e.g., “mm-hmm”) (see the Models section for details on preprocessing and the supplementary files for full stop word lists). Stop words were removed from both patient and therapist speech. If a talk-turn was composed entirely of stop words, we removed it from the data. The resulting representation of the text consisted of sparse vector counts of terms for each document, including unigrams (single words such as “medicine”, “anger”), bigrams (e.g., “side effect”), and trigrams (e.g., “it sounds like”).
In order to evaluate the ability of the model to find representative talk-turns, we conducted additional coding to generate labels for talk-turns within selected sessions. The aim of the additional coding was to generate data for specific within-session sections of text (in this case talk-turns) against which to test our model. These coded talk-turns were only used for model evaluation, not for model training. We focused on five symptom codes: anger, anxiety, depression, low self-esteem, and suicide. These codes were chosen first for their therapeutic importance and second for their high frequency of annotation, in order to provide a sufficient amount of additional data. We restricted the number of symptom codes to limit the amount of human coding required for talk-turn annotation. For each of these symptoms, we randomly selected 200 client talk-turns of at least 50 characters in length (before stop word removal) from sessions that had the symptom code attached. On average, the selected talk-turns were approximately 277 characters in length before stop word removal. The process led to a total of 993 talk-turns. Each talk-turn was rated in terms of the representativeness of the symptom on a scale of 1 (atypical) to 7 (very typical) by each of 6 graduate students or post-doctoral fellows with training in clinical/counseling psychology.
III. Models
We approach the problems of session coding and identifying representative talk-turns through the use of Labeled Latent Dirichlet Allocation (L-LDA) [43], [12], a semi-supervised extension of Latent Dirichlet Allocation (LDA). We first present the LDA model and then the L-LDA model. The model presentation is aimed at readers who have some experience with topic models; for readers new to topic modeling, we recommend reading a tutorial introduction [8]. Then, we show how these models can be used for document classification and how to apply the models to predicting codes and talk-turns in the general psychotherapy corpus. Finally, we present lasso logistic regression (LLR) as a baseline model against which to compare L-LDA.
A. Latent Dirichlet Allocation
LDA is an unsupervised modeling approach that learns a set of latent topics across a corpus of text. As opposed to L-LDA, there are no labels that are part of the data to learn from. The only data provided to LDA are a set of documents, where documents are treated as a “bag-of-words”; i.e., sparse vectors of word counts for each document. Thus, the order of words is not relevant for the model. We use both individual words and multi-word terms (n-grams) in the vocabulary for our model—but for simplicity will refer to both as “words.”
Standard applications of topic models assume that the text corpus can be naturally divided into documents. For example, a corpus of scientific articles is naturally divided into documents according to article. In the case of spoken dialogue, choosing a rule for partitioning a corpus into documents is less straightforward. Documents can be defined as sentences, paragraphs, entire sessions or through any type of feasible partitioning. As in past research [13], [15], for the General Psychotherapy Corpus we define documents to be individual talk-turns (although other definitions are possible as well). Using talk-turns to define documents yields a larger set of documents with more localized word co-occurrences compared to defining documents at the session level. We have found in our experiments that these localized word co-occurrences tend to result in more specific topic-word distributions and improve classification performance.
LDA specifies a generative process for the creation of text documents. From this generative process we learn a predictive model by reverse-engineering the process, i.e., learning the parameters most likely to have generated the data. In LDA, each document (in this case talk-turn) is represented as a mixture of topics, where each topic is defined as a multinomial distribution over words. The creation of each document begins by sampling a document-specific distribution over topics. To generate each word in the document, a topic is sampled from the document-specific distribution over topics and a word is sampled from that topic. Formally, let T be the total number of topics in the model and V be the size of the vocabulary (number of unique words in the corpus). Then we can specify the marginal distribution over words for a document d as:

$$P(w \mid d) = \sum_{t=1}^{T} P(w \mid z_w = t)\, P(z_w = t \mid d)$$

where $z_w$ indicates the topic from which word w was drawn, $P(w \mid z_w = t)$ is a V-dimensional distribution over words for topic t, and $P(z_w = t \mid d)$ is a T-dimensional distribution over topics for document d. To simplify notation, we will let $\phi^{(t)} = P(w \mid z_w = t)$ represent the distribution over words for topic t and $\theta^{(d)} = P(z_w \mid d)$ represent the distribution over topics for each document d.
LDA incorporates a priori knowledge about topics likely to occur in a document by placing a Dirichlet prior on the distribution over topics, $\theta^{(d)}$, for each document. The Dirichlet prior is the conjugate prior of the multinomial distribution and is used to express the prior probability of observing a topic in a given document before observing any data. The Dirichlet distribution is parameterized by the vector $(\alpha_1, \ldots, \alpha_T)$, where $\alpha_t$ can be interpreted as the prior observation count for the number of times topic t is sampled in a document before having observed any actual words from that document. Thus, we can view the distribution over topics for a document d as a sample from this group-level prior distribution over topics.
In a similar manner, LDA also incorporates prior information about which words are likely to occur in a given topic. LDA does this by placing another Dirichlet prior on the distribution over words, $\phi^{(t)}$, for each topic t. This second Dirichlet distribution is parameterized by the vector $(\beta_1, \ldots, \beta_V)$, where $\beta_w$ represents the prior observation count of word w before observing any documents. Here we can interpret each topic as a sample from this group-level prior distribution over words. We follow the common practice of setting the Dirichlet parameters uniformly (i.e., $(\beta_1, \ldots, \beta_V) = (\beta, \ldots, \beta)$), which corresponds to the assumption that each word is equally likely a priori.
B. Labeled LDA Model
L-LDA is a semi-supervised variant of LDA in which some topics are placed in correspondence with labels that can be associated with a document. Documents in the training phase are assumed to have been pre-assigned to a subset of labels from a larger lexicon of possible labels. In the context of the psychotherapy corpus, possible labels include symptom and content codes, and the L-LDA model infers a unique topic for each code. These topics are learned by restricting inference to only the word tokens in documents annotated with the topic’s corresponding label. We use a separate unsupervised set of topics, called background topics, to account for words not associated with the known codes. These background topics allow the model to capture some of the linguistic variability in the data that is not directly related to subject and symptom codes. Without these background topics, many words would have to be explained by the topics associated with the symptom and content codes, which would decrease the generalizability of those topics. During training of the L-LDA model, when sampling the topic for a word token in a document (as described below), only topics that belong to labels associated with the document (including background labels) can be sampled. All other topics have zero probability of being expressed in the document.
Formally, let $T = T_c + T_b$ be the total number of topics. A subset of $T_c$ topics are in one-to-one correspondence with the labels associated with documents. The remaining $T_b$ topics capture background information. During the generative process, for each document d, we restrict the space of possible document mixtures by restricting the hyperparameters of the Dirichlet prior on θ according to a binary topic assignment vector $\Lambda^{(d)} = (\Lambda^{(d)}_1, \ldots, \Lambda^{(d)}_T)$. We define:

$$\Lambda^{(d)}_t = \begin{cases} 1 & \text{if topic } t \text{ corresponds to a label attached to document } d \text{ or is a background topic } (t > T_c) \\ 0 & \text{otherwise} \end{cases}$$

We then define the hyperparameters for document d as $\alpha^{(d)} = (\alpha^{(d)}_1, \ldots, \alpha^{(d)}_T) = \Lambda^{(d)} \times \alpha$, where $\times$ denotes element-wise multiplication. Note that the only topics that can be expressed for a particular document are topics corresponding to a code associated with the document or background topics.
Letting D be the number of documents in the collection, the generative process of the L-LDA model can be described as follows:

- For each topic $t \in \{1, \ldots, T\}$:
  - Sample a multinomial distribution over words $\phi^{(t)} \sim \text{Dirichlet}(\beta_1, \ldots, \beta_V)$.
- For each document $d \in \{1, \ldots, D\}$:
  - Use the labels associated with document d to set the hyperparameters $\alpha^{(d)} = \Lambda^{(d)} \times \alpha$.
  - Sample a multinomial distribution over topics $\theta^{(d)} \sim \text{Dirichlet}(\alpha^{(d)}_1, \ldots, \alpha^{(d)}_T)$.
  - For each word token $i \in \{1, \ldots, N_d\}$:
    - Sample a topic indicator $z_i \sim \text{Categorical}(\theta^{(d)})$.
    - Sample a word token $w_i \sim \text{Categorical}(\phi^{(z_i)})$.

where $N_d$ is the number of word tokens in document d. Note that α and β are hyperparameters for the model. The graphical model for L-LDA is presented in Figure 1.
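To make the generative process concrete, the following Python sketch simulates one document under L-LDA. It is a minimal illustration under stated assumptions, not the implementation used in the paper: the dimensions below are toy values (the real corpus has roughly 28,000 terms, 209 code topics, and 50 background topics), and all function and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (hypothetical)
V, Tc, Tb = 200, 5, 3        # vocabulary size, code topics, background topics
T = Tc + Tb
alpha, beta = 1 / 50, 1 / 100  # uniform hyperparameters, as in the paper

# phi[t] ~ Dirichlet(beta, ..., beta): one word distribution per topic
phi = rng.dirichlet(np.full(V, beta), size=T)

def generate_document(codes, n_tokens):
    """Generate one talk-turn restricted to its codes plus all background topics."""
    lam = np.zeros(T)                 # binary topic-assignment vector Lambda^(d)
    lam[list(codes)] = 1.0            # topics tied to this document's labels
    lam[Tc:] = 1.0                    # background topics are always eligible
    eligible = np.flatnonzero(lam)
    theta = np.zeros(T)               # theta^(d) ~ Dirichlet(Lambda^(d) x alpha)
    theta[eligible] = rng.dirichlet((lam * alpha)[eligible])
    z = rng.choice(T, size=n_tokens, p=theta)           # topic indicators z_i
    w = np.array([rng.choice(V, p=phi[t]) for t in z])  # word tokens w_i
    return w, z

words, topics = generate_document(codes={0, 2}, n_tokens=40)
```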
Fig. 1. Graphical model of L-LDA.
C. Training the L-LDA Model
The variables we would like to infer are the topic assignment variables $z_w$ for each word w, the document mixtures $\theta^{(d)}$, and the topic distributions $\phi^{(t)}$. For sampling we use a collapsed Gibbs sampler [14], which integrates out $\phi^{(t)}$ and $\theta^{(d)}$ so that we only sample the topic assignments $z_w$. The topic assignments $z_w$ are then used to generate point estimates of $\phi^{(t)}$ and $\theta^{(d)}$.
The Gibbs sampling procedure considers each word token in the text collection in turn, and estimates the probability of assigning the current word token to each topic, conditioned on the current topic assignments to all other word tokens. From this conditional distribution we sample a topic assignment for the current word token. We write the conditional distribution as $P(z_i = t \mid z_{-i}, w_i, d, \cdot)$, where $z_i = t$ represents the topic assignment of token i to topic t, $z_{-i}$ refers to the topic assignments of all other word tokens, and “·” refers to all other known or observed information such as all other word indices $w_{-i}$, distributions over topics for all other documents, and the hyperparameters α and β. The conditional distribution can be calculated as follows [14]:

$$P(z_i = t \mid z_{-i}, w_i, d, \cdot) \propto \frac{C^{VT}_{w_i t} + \beta}{\sum_{v=1}^{V} C^{VT}_{v t} + V\beta}\left(C^{DT}_{d t} + \alpha_t\right) \tag{1}$$

where t is restricted to the set of topics defined by the union of (a) codes attached to document d and (b) background topics $t > T_c$. All other topics have probability 0 for document d (as specified by the generative model) and are not eligible to be sampled. In the equation above, $C^{VT}$ and $C^{DT}$ are matrices of counts with dimensions V × T and D × T respectively; $C^{VT}_{wt}$ contains the number of times word w occurred with topic t and $C^{DT}_{dt}$ contains the number of times a word token in document d was assigned to topic t. These matrices are incremented using the sampled topic assignment variables at each step of the Gibbs sampler for every word w.
The Gibbs sampling algorithm is initialized by assigning each word token in document d randomly to one of the set of eligible topics for document d (i.e., the codes t attached to document d or the background topics $t > T_c$). For each word token, the count matrices $C^{VT}$ and $C^{DT}$ are first decremented by one for the entries that correspond to the current topic assignment. Then, a new topic is sampled from the distribution in Equation 1 and the count matrices $C^{VT}$ and $C^{DT}$ are incremented with the new topic assignment. Each Gibbs sample consists of the set of topic assignments for all N word tokens in the corpus, achieved by a single pass through all documents.
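The following sketch renders this training procedure in Python, assuming documents are given as lists of word indices and each document’s eligible topics (its codes plus the background topics) are known. It is a minimal, unoptimized rendering of Equation 1; the function and variable names are ours, not from any released implementation. The returned count matrices can then be converted into the point estimates described next.

```python
import numpy as np

def train_llda(docs, doc_topics, V, T, alpha=1/50, beta=1/100, iters=100, seed=0):
    """Collapsed Gibbs training sketch for L-LDA (Equation 1).

    docs       : list of word-index lists, one per document (talk-turn)
    doc_topics : list of eligible-topic index lists per document
                 (the document's codes plus all background topics)
    """
    rng = np.random.default_rng(seed)
    CVT = np.zeros((V, T))           # word-topic counts
    CDT = np.zeros((len(docs), T))   # document-topic counts
    z = []                           # topic assignment per token

    # Initialize each token to a random eligible topic
    for d, doc in enumerate(docs):
        zd = [rng.choice(doc_topics[d]) for _ in doc]
        for w, t in zip(doc, zd):
            CVT[w, t] += 1
            CDT[d, t] += 1
        z.append(zd)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            elig = np.asarray(doc_topics[d])
            for i, w in enumerate(doc):
                t_old = z[d][i]
                CVT[w, t_old] -= 1          # decrement counts for current assignment
                CDT[d, t_old] -= 1
                # Equation 1, restricted to the eligible topics
                p = (CVT[w, elig] + beta) / (CVT[:, elig].sum(axis=0) + V * beta) \
                    * (CDT[d, elig] + alpha)
                t_new = rng.choice(elig, p=p / p.sum())
                CVT[w, t_new] += 1          # increment with the new assignment
                CDT[d, t_new] += 1
                z[d][i] = t_new
    return CVT, CDT, z
```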
The sampling algorithm gives us samples for the topic assignment variables $z_w$ for each word w. However, we are interested in estimating the word-topic distributions $\phi^{(t)}$ and topic-document distributions $\theta^{(d)}$. We can approximate the probability of choosing the k-th word from the distribution over words for topic t, $\phi^{(t)}$, using the word-topic count matrix (computed from the sampled topic assignment variables) as follows:

$$\hat{\phi}^{(t)}_k = \frac{C^{VT}_{kt} + \beta}{\sum_{v=1}^{V} C^{VT}_{vt} + V\beta}$$

Here $\hat{\phi}^{(t)}_k$ can be interpreted as the estimated probability of choosing word $w_k$ from topic t. We can also estimate the probability of choosing the t-th topic from the distribution over topics for document d, $\theta^{(d)}$, using the count matrix $C^{DT}$ (also computed from the sampled topic assignment variables) as follows:

$$\hat{\theta}^{(d)}_t = \frac{C^{DT}_{dt} + \alpha_t}{\sum_{t'=1}^{T} C^{DT}_{dt'} + \sum_{t'=1}^{T} \alpha_{t'}}$$

Here $\hat{\theta}^{(d)}_t$ can be interpreted as the estimated probability of expressing topic t in document d. We later use $\hat{\phi}^{(t)}$ to qualitatively examine topics corresponding to session codes and $\hat{\theta}^{(d)}$ to estimate the topics (and therefore symptom and content codes) expressed in document d.
D. Prediction with the L-LDA Model
We evaluate the model by predicting labels for documents unseen by the model during training, using the word-topic counts ($C^{VT}$) learned during training. The goal for prediction is to infer a document-topic count vector for each new document d′, where the inferred count vector contains information about the likely topics (and associated codes) for d′.
For a new document d′, we set $\Lambda^{(d')} = (1, \ldots, 1)$ so that any topic can be part of the document mixture. We run a Gibbs sampling procedure where we compute the posterior distribution over topic assignments:

$$P(z_i = t \mid z_{-i}, w_i, d', \cdot) \propto \frac{C^{VT}_{w_i t} + \beta}{\sum_{v=1}^{V} C^{VT}_{v t} + V\beta}\left(C^{DT}_{d' t} + \alpha_t\right) \tag{2}$$
where $\alpha_t = \alpha$. The posterior for this sampling procedure is similar to the posterior used in the sampling procedure during training except that the word-topic count matrix $C^{VT}$ is not updated. Holding $C^{VT}$ constant formalizes the assumption that the word-topic counts are learned and that prediction consists of learning just the document-topic counts. Another difference from the sampling procedure used during training is that we sample the posterior probabilities $P(z_i = t \mid z_{-i}, w_i, d, \cdot)$ at each iteration (after burn-in) instead of the word-topic count assignments (that were sampled during training). While either word-topic counts or posterior probabilities can be used to compute prediction scores, we found that using posterior probabilities provided more accurate code predictions and required fewer samples for accurate prediction. We use the posterior samples to compute topic scores that represent the likelihood that a document should be annotated with the code corresponding to each topic. We compute a score $\eta_{t,d}$ for each topic t and test document d as follows:

$$\eta_{t,d} = \frac{1}{N_d} \sum_{i=1}^{N_d} \gamma_{t,d,i}$$

where the variable $\gamma_{t,d,i}$ estimates the probability in Equation 2 that the t-th topic was assigned to the i-th word token in document d. Thus $\eta_{t,d}$ can be interpreted as the average probability of assigning a word from document d to topic t. To calculate each word’s posterior estimate of topic assignment ($\gamma_{t,d,i}$), we average over the posterior samples of the probability of assigning word i to topic t. We compute $\gamma_{t,d,i}$ as follows:

$$\gamma_{t,d,i} = \frac{1}{K} \sum_{k=1}^{K} p(z_{t,d,i})_k$$

where $p(z_{t,d,i})_k$ is the k-th sample of the posterior probability expressed in Equation 2 and K is the total number of samples.
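A minimal sketch of this prediction procedure, under the same assumptions as the training sketch above (word-index documents, toy hyperparameters, hypothetical names):

```python
import numpy as np

def predict_scores(doc, CVT, alpha=1/50, beta=1/100, burnin=10, samples=10, seed=0):
    """Prediction sketch for one unseen document (Section III-D).

    Holds the trained word-topic counts CVT fixed, samples topic assignments
    for the new document, and averages per-token posteriors (gamma) into
    per-topic scores (eta)."""
    rng = np.random.default_rng(seed)
    V, T = CVT.shape
    col_totals = CVT.sum(axis=0)            # fixed during prediction
    cdt = np.zeros(T)                       # document-topic counts for d'
    z = rng.integers(T, size=len(doc))      # all topics eligible at test time
    for t in z:
        cdt[t] += 1

    gamma = np.zeros((T, len(doc)))         # running sum of posterior probs
    for it in range(burnin + samples):
        for i, w in enumerate(doc):
            cdt[z[i]] -= 1
            # Equation 2: CVT is not updated at prediction time
            p = (CVT[w] + beta) / (col_totals + V * beta) * (cdt + alpha)
            p /= p.sum()
            z[i] = rng.choice(T, p=p)
            cdt[z[i]] += 1
            if it >= burnin:
                gamma[:, i] += p            # accumulate posterior samples
    gamma /= samples                        # gamma_{t,d,i}: averaged posteriors
    eta = gamma.mean(axis=1)                # eta_{t,d}: mean over word tokens
    return eta
```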
IV. Learning Topics from Labeled Sessions with L-LDA
A. Text Preprocessing
The original corpus contained 8 million word tokens and 40,000 unique words. Before fitting the L-LDA model, we applied a number of preprocessing steps to the corpus. We first removed any words that occurred 5 or fewer times in the entire corpus, on the assumption that these words are unlikely to be useful for categorization; this step reduced the vocabulary from 40,000 to 27,000 unique words. After removing infrequent words, we removed words that we judged to contain little semantic content. We performed a preliminary filtering using a common stop word list (see the unigram stop word list in the supplementary files). Next, we applied a part-of-speech tagger [18] to remove determiners, adverbs, pronouns, interjections, particles, modal words, punctuation, and numbers. We used part-of-speech tags to create additional stop word lists for bigrams and trigrams, and performed a second stop word filtering using these lists. A final stop word filtering was done for interjections that are common in psychotherapy but were not identified by the part-of-speech tagger. See the supplementary files for a full list of stop words. The final corpus contained 28,000 unique words (including generated bigrams and trigrams) and 1.4 million word tokens.
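A rough sketch of this kind of pipeline is shown below, using NLTK as a stand-in tokenizer, tagger, and stop word source. The exact stop lists, tag sets, and ordering of steps in the paper differ (for example, the paper’s retained trigrams such as “it sounds like” suggest n-grams were formed before unigram stop word removal), so this is illustrative only, and the corpus-specific stop word additions are hypothetical.

```python
from collections import Counter
import nltk
from nltk import pos_tag, word_tokenize
from nltk.util import ngrams

# Requires: nltk.download("punkt"), nltk.download("averaged_perceptron_tagger"),
# nltk.download("stopwords")
STOP = set(nltk.corpus.stopwords.words("english")) | {"mm-hmm", "uh-huh", "um"}
# Penn Treebank tags to drop: determiners, adverbs, pronouns, interjections,
# particles, modals, numbers (punctuation is dropped by the isalpha filter)
DROP_TAGS = {"DT", "RB", "RBR", "RBS", "PRP", "PRP$", "UH", "RP", "MD", "CD"}

def preprocess_talk_turn(text):
    """Convert one talk-turn into sparse counts of filtered uni/bi/trigrams."""
    tokens = [w.lower() for w in word_tokenize(text) if w.isalpha()]
    kept = [w for w, tag in pos_tag(tokens)
            if tag not in DROP_TAGS and w not in STOP]
    terms = list(kept)
    terms += ["_".join(g) for g in ngrams(kept, 2)]   # bigrams, e.g. "side_effect"
    terms += ["_".join(g) for g in ngrams(kept, 3)]   # trigrams
    return Counter(terms)

print(preprocess_talk_turn("The side effects of the medicine worry me."))
```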
B. Model Parameters for Training L-LDA
The L-LDA model requires a number of decisions to be made and parameters to be selected before training the model, including the number of background topics Tb, the settings for the priors α and β, the number of iterations and the number of burn-in samples.
For the number of background topics, we chose $T_b = 50$ for the results reported in this paper, and found that the model was not particularly sensitive to the number of background topics as long as $T_b$ was at least 20. We used uniform α and β, where each element was set to 1/50 and 1/100, respectively. These are typical values used in training LDA models, and we found that the method was reasonably robust to small perturbations in these values. Our results are from a model using 100 training iterations and 20 iterations for prediction, where the last S = 10 iterations are used for generating prediction scores. We ran several models that varied the number of iterations and burn-in samples and found results similar to the model we report.
C. Inferred Topics
Prior to assessing predictive performance measures, we qualitatively examined the topics generated by the L-LDA model (Table I). We examined three types of topics, corresponding to subjects, symptoms, and background content. For the subject and symptom labels, we illustrate the topics learned by the model for the five most common labels. For the background topics, we picked an illustrative set of five topics. Qualitatively, subject and symptom topics are very interpretable; e.g., the medications topic consists of examples of medications, words used to describe the administration of medication, and words used to describe the effects of medication. The background topics shown in Table I also have intuitive interpretations and contain words that are not covered by the content codes in the psychotherapy corpus. For example, there are background topics that explain word usage related to people, jobs, and sleeping (background topics 9, 36, and 39, respectively). Without these background topics, the high-probability words associated with them would have to be redistributed over the content topics for subjects and symptoms, potentially decreasing their interpretability and predictive power.
TABLE I.
The most likely terms inferred for the topics associated with the five most common subject and symptoms and an illustrative set of background topics. For each topic, the 10 most likely N-grams are shown.
| Subject | Inferred Topic Distribution |
|---|---|
| medications | medicine, mg, dose, wellbutrin, medicines, lamictal, prescription, effects, side_effects, ability |
| relationships | relationship, women, feels, friend, relationships, boyfriend, date, position, example, react |
| parent-child relations | mother, father, love, remember, relationship, parents, brother, emotional, loved, needed |
| depressive disorder | depression, medication, doctor, medicine, prozac, depressed, zoloft, generic, wellbutrin, add |
| spousal relationships | wife, marriage, married, husband, relationship, mhm, children, attitude, divorce, got_married |
| Symptom | Inferred Topic Distribution |
|---|---|
| anxiety | anxiety, anxious, panic, nervous, depression, worried, worst, fine, experience, helps |
| depression | depressed, depression, doctor, pain, die, needed, drugs, low, xanax, mg |
| anger | angry, feelings, anger, express, get_angry, be_angry, reaction, feels, pissed, ’m_feeling |
| low self-esteem | love, teaching, boyfriend, positive, stupid, attractive, fit, negative, sorta, criticism |
| irritability | annoyed, irritable, message, safe, dishes, cause, wife, skin, irritated, cats |
| Background | Inferred Topic Distribution |
|---|---|
| background 9 | friends, family, mom, dad, close, sister, brother, daughter, men, lives |
| background 13 | care, stop, took, weight, takes, ready, lose, take_care, amount, body |
| background 23 | house, room, walk, bed, door, walking, rid, front, throw, clean |
| background 36 | job, wants, work, business, works, office, busy, baby, buy, paper |
| background 39 | morning, sleep, hours, friday, sleeping, monday, tomorrow, saturday, wake, bed |
V. Session-Level Prediction
A. Cross-Validation and Scoring
To test generalizability of the model to new data, we use 10-fold cross-validation where for each fold the sessions are partitioned into two disjoint sets: (a) a training set with 90% of the sessions used to train an L-LDA model and (b) a validation set with the other 10% of sessions used to evaluate the trained model. We compute an AUC score for each validation set (for each code) and report the average of the AUCs across the 10 validation sets.
To compute an AUC score for a topic corresponding to a particular code and a particular validation set, we proceed as follows. For each session in a validation set, we predict scores at the talk-turn level (as described earlier) and then aggregate scores for all talk-turns in a session. We define the session likelihood score for session s and topic t as:

$$\eta_{t,s} = \frac{1}{D_s} \sum_{d \in d(s)} \eta_{t,d}$$

where $D_s$ is the number of talk-turns in session s and $d(s)$ is the set of all talk-turns (documents) in session s. For each topic t, using these scores, we rank the sessions in the validation set and compute the area under the curve (AUC) using the known subject and symptom codes attached to each session.
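A minimal sketch of this aggregation and scoring step, assuming the per-talk-turn scores $\eta_{t,d}$ are already computed, and using scikit-learn’s AUC implementation; the score and label values below are hypothetical:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def session_score(eta_talk_turns):
    """Session likelihood score for one topic: mean eta over the session's talk-turns."""
    return float(np.mean(eta_talk_turns))

# Hypothetical validation fold for one code: eta scores per talk-turn per session,
# and binary indicators of whether the code was attached to each session
scores = [session_score(s) for s in ([0.12, 0.40, 0.08], [0.02, 0.05], [0.30, 0.22])]
labels = [1, 0, 1]
print(roc_auc_score(labels, scores))   # AUC for this code on this fold
```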
B. Results for Session-Level Predictions
For each subject and symptom code we computed the AUC for each cross-validation fold and took the average across folds to measure classification performance. AUC values range from .5 (chance-level performance, e.g., randomly generated rankings) to 1 (perfect predictive accuracy). In practice, performance above the level of chance can occur even from models whose scores are randomly distributed (and unrelated to the content of the sessions); this is especially the case for codes that occur infrequently. In order to assess the significance of the predictive accuracy of our model relative to chance performance, we calculated a set of AUC scores for each code using 1000 randomly generated rankings and computed the corresponding 90% confidence intervals.
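A sketch of this chance-interval computation follows; the code frequency in the example is hypothetical:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def chance_auc_interval(labels, n_rankings=1000, level=0.90, seed=0):
    """Chance AUC interval for one code, estimated from random rankings."""
    rng = np.random.default_rng(seed)
    n = len(labels)
    aucs = [roc_auc_score(labels, rng.permutation(n)) for _ in range(n_rankings)]
    return tuple(np.quantile(aucs, [(1 - level) / 2, (1 + level) / 2]))

# A rare code (attached to 5 of 100 sessions) yields a wide chance interval
print(chance_auc_interval(np.array([1] * 5 + [0] * 95)))
```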
Additionally, we compare the L-LDA model to a standard machine learning classifier, lasso logistic regression (LLR). LLR is often used in classification settings where the number of features is larger than the number of observations because of its ability to force the weights of uninformative features to zero. For more details about LLR, see the Supplementary Files.
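A minimal version of such a baseline can be written with scikit-learn; the synthetic data and the regularization setting C below are placeholders, not the paper’s configuration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
# Synthetic stand-in for the sparse document-term counts and one code's labels
X_train, y_train = rng.poisson(0.1, (80, 500)), rng.integers(0, 2, 80)
X_val, y_val = rng.poisson(0.1, (20, 500)), rng.integers(0, 2, 20)

# L1 ("lasso") penalty drives the weights of uninformative terms to zero;
# C controls regularization strength and is a placeholder value here
clf = LogisticRegression(penalty="l1", solver="liblinear", C=1.0)
clf.fit(X_train, y_train)
print(roc_auc_score(y_val, clf.predict_proba(X_val)[:, 1]))
```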
The results of the AUC analysis are shown in Figure 2. The widths of the 90% chance confidence intervals for each code correspond closely with the inverse of the code frequencies (lower frequencies are associated with larger chance confidence intervals). The L-LDA model showed higher predictive accuracy than the LLR model and both models performed significantly better than the chance model for a large number of codes. For the L-LDA model, the average AUC score over all codes is .789 (SD=.137) and average AUC for subject and symptom codes are .800 (SD=.131) and .753 (SD=.150), respectively.
Fig. 2.
Session-level AUC scores for the labeled topic model, the lasso logistic regression model, and chance performance. Codes are reported along the y-axis and are ordered by labeled topic model performance. For subject codes, one in every four code names is shown.
All but 10 of 209 codes had AUC scores above .5. The 5 codes with lowest AUC scores are gender roles, withdrawn, recollections, general pain, and self-fulfilling prophecy. The language associated with each of these codes contains a broad spectrum of variation that may have contributed to poor model performance. The 5 codes with highest AUC are hallucinogen abuse, drug addiction, alcohol dependence, passiveness, and attraction.
For the LLR model, average AUC score over all codes is .702 (SD=.145) and average AUC for subject and symptom codes are .713 (SD=.146) and .667 (SD=.137), respectively. There were 29 codes with AUC scores below 0.5. Overall, the LLR model performed significantly worse on average than L-LDA (p<.001 in a Wilcoxon sign test).
A common goal of document classification is to identify the relationships between specific classifiers and characteristics of the data that lead to high classification performance. Previous comparisons between L-LDA and discriminative models have shown that the L-LDA model can outperform discriminative models on low-frequency codes [12]. We analyzed this relationship on the general psychotherapy corpus and found only a weak correlation (R=0.22) between the AUC difference for the two models and code frequency. This correlation indicates that the L-LDA model performs slightly better relative to LLR at predicting low-frequency codes than high-frequency codes. Post-hoc qualitative analysis suggests that highly predictable codes contain distinctive language that facilitates prediction. For example, sessions that discuss hallucinogen abuse and drug addiction contain a range of highly specific, drug-related terms. Conversely, we expect that hard-to-predict codes, such as gender roles, are attached to sessions containing a broad spectrum of language.
VI. Talk-Turn Prediction
A. L-LDA Talk-Turn Prediction
As a second test of performance, we assessed the ability of the L-LDA model to find talk-turns that are representative of a session-level code. This comparison is novel in that the L-LDA model is trained using only session-level codes, but can then generalize the topics learned to identify representative talk-turns within each session. The evaluation procedure tests the models’ abilities to distinguish the most representative talk-turns (as judged by human raters) from all other talk-turns.
We had 6 human coders generate ratings at the talk-turn level for 993 talk-turns using 5 symptom codes chosen from the set of general psychotherapy codes: anger, anxiety, depression, low self-esteem, and suicidal behavior. Each talk-turn was assigned a rating from 1 (atypical) to 7 (very typical) by each of the 6 coders. To keep model performance measures on the same scale as the session-level performance measures, we converted the continuous human ratings to binary scores (thus allowing us to compute classification performance measures). To binarize ratings, we chose a rating threshold and considered any ratings above the threshold to be representative of a symptom and any ratings below to be not representative. While there are many ways of choosing this threshold, we chose it such that the top c% of ratings would be considered representative. We computed performance for c ∈ {5, 10, 20}% to emphasize the model’s ability to predict highly representative talk-turns.
Since raters did not rate talk-turns for the other symptom codes in the psychotherapy corpus, we created a mapping from the more detailed labels in the psychotherapy corpus to the five selected symptom codes. The motivation behind creating these code mappings is that a single symptom code (e.g., depression) might be aptly described by multiple codes in the psychotherapy corpus (e.g., depression, depressive disorder, hopelessness, …). To create the mappings, we had a clinical psychologist mark which codes from the general psychotherapy corpus are related to each of the five symptom codes. See Appendix A for more detail.
In addition to AUC scores we report the R-precision, a measure of precision at the threshold at which precision equals recall. To generate model predictions, AUC scores, and R-precisions at the talk-turn level, we proceeded as follows:
An L-LDA model was trained on each of the 10 training data sets used for session-level cross-validation. For each training set, any session that contained any of the coded talk-turns was removed, making the prediction problem somewhat more difficult by denying the model access both to the coded talk-turns and to any other talk-turns from the same sessions. We removed these sessions to avoid optimistic performance results, since in application the model would be identifying codes for talk-turns from novel sessions.
Each of the 10 trained models made predictions on the 993 labeled talk-turns. The $\eta_{t,d}$ scores were computed for each general psychotherapy code t and each talk-turn (document) d as described earlier, using each model’s word-topic count matrix. To compute a score for a symptom code, we averaged the model scores from each of the related general psychotherapy codes (as defined by the code mapping described above). The 10 scores for each code and each talk-turn were then averaged across the 10 trained models.
For each code, AUC scores were generated as follows. The 993 talk-turns were ranked by their averaged model-based scores. These rankings were then compared to the ratings from each individual rater, where the ratings were binarized by using the highest 5%, 10%, and 20% of that individual’s ratings, leading to 3 different AUC scores, one for each percentile cutoff. Overall AUC scores for the model, for each of the 3 cutoffs and each of the 5 codes, were then computed by averaging across the model’s AUCs computed relative to each individual rater.
For each code, R-precisions were generated as follows. The 993 talk-turns were ranked by their averaged model-based scores. These rankings were then compared to the ratings from each individual rater, where the ratings were binarized by using the highest 5%, 10%, and 20% of that individual’s ratings. For each rater and rating cutoff c ∈ {5, 10, 20}, we compute the R-precision as the number of true positives in the top c% of ratings divided by the number of talk-turns in the top c% of ratings. The R-precision ranges from 0 to 1, and it can be shown that the R-precision is equal to recall for the top c% of ratings. We compute overall R-precision scores as the average of model R-precision scores computed relative to each individual rater.
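The following sketch implements this computation under our reading of the definition (the model’s top-R ranked talk-turns scored against a rater’s top c%, with R set to the number of rater positives, so that precision equals recall); the example scores are randomly generated stand-ins:

```python
import numpy as np

def r_precision(model_scores, rater_scores, c=0.10):
    """Precision of the model's top-R talk-turns against a rater's top c%,
    where R is the number of rater-positive talk-turns (precision = recall)."""
    n = len(rater_scores)
    r = max(1, int(round(c * n)))
    positives = set(np.argsort(rater_scores)[::-1][:r])   # rater's top c%
    top_model = np.argsort(model_scores)[::-1][:r]        # model's top R
    return sum(int(i in positives) for i in top_model) / r

rng = np.random.default_rng(0)
model, rater = rng.random(993), rng.random(993)
print(r_precision(model, rater, c=0.05))
```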
B. Results for Talk-Turn Predictions
Table II shows example talk-turns for all 5 symptoms tested. Talk-turns are ordered by model representativeness score. We also report the human representativeness rating (1–7 scale) averaged across raters. Several talk-turns illustrate that the model learns words associated with a symptom and not just the symptom keyword itself. For example, the first example talk-turn for depression in Table II is rated by the model as most representative and is also judged by humans as highly representative. This talk-turn does not contain the word depression but only expressions related to depression (e.g., “I’m crying”). The first talk-turn for anxiety presents another interesting example. It is given the highest representativeness score by the model but only received a low human rating. The model may have learned to associate the word “roommate” with anxiety (through the other sessions in the training set), resulting in a high likelihood score.
Table II.
Model predictions for most representative talk-turns for each symptom code. Talk-turns are ordered by model score. The average human Likert rating (1–7) is reported to compare model scoring with human scoring.
| Symptom | Average Rating | Talk-Turn |
|---|---|---|
| Anger | 6.2 | Nobody every got angry; they never got angry. I don’t ever remember my parents screaming at each other, ever. I mean throughout all my childhood I can’t remember them having a yelling fight. it was never that way. and I just never knew how to scream at anybody. |
| | 7 | I have occasionally felt bad about the things I’ve told you about, I have. but it’s interesting that this is the first time that a lot of anger has come out. you know, there’s another side that I really affirm here, that there’s a lot of anger that I have toward her that she’s always been able to seem to get out and express at me. |
| | 6.8 | I don’t know. but I didn’t get mad at Harold when he gave me genital warts. I felt mad, I mean, I felt betrayed and lied to and cheated on, but I didn’t - I just dealt with it, I just deal with things, and I’ve always thought that that was a positive quality, I mean, I just-i don’t think that anger is necessarily productive. but I guess in some ways it can be. I just-i work through things, I talk through things, I’m calm, I don’t get mad or yell and scream. if i-you know, I can argue with people if I don’t - you know, it’s not like I won’t express my opinions or, you know, talk about something that bothers me. but I don’t yell and scream and I don’t get angry. |
| | 7 | At night, and then I take my zyprexa and I fall asleep in two hours. the one thing I’d say I notice about her is she will be talking like this and then all of a sudden I don’t what happens, something happens and she just gets real angry, real fast, like that. we will be talking and all of sudden she will think of something that got her angry and it will be like boom. |
| | 3.6 | Even just now, when you ask me that, I don’t know, it just feels like, why are you asking me these questions? I don’t understand them. I feel like … it’s just really uncomfortable. |
| Anxiety | 2 | The only-the only thing I can-like I thought back to this. when I was a senior in college I met a girl who was a roommate of my roo-well, my roommate’s - my roommate and i-my roommate had his fiance and she had a roommate. and this, anyhow, to make it all work.... |
| | 3.2 | When I got there, he said that I needed to go to the hospital. so I went, he sent me to … when I got to …, they did another ekg. they told me I had a heart attack 2 weeks before that. |
| | 7 | And, um, had a little anxiety about it. I go 2 nights a week, monday and wednesday from 6 to 10, and yeah, had a little anxiety attack about it ’cause just the whole like possible failure, and like oh god, I’m like I really want, ’cause I really want it and I’m really, you know, I’m good at it, but it’s like oh god, it’s pressure, you know, that type of thing. well, I ended up calling … remember I was seeing him? |
| | 5 | I don’t really know. I’ve always been kind of just like - I’m always just really scared of - I don’t really have like a lot of, in my family there’s not really a lot of people that would help me if something like that was to happen. so I think that just kind of like fuels this like fear in me with like employment in general. it’s just kind of like, ‘ well what if there is a cutback or what if somebody buys us out or … ’ just kind of like, I just want to be okay if that happened. |
| | 2.2 | Yeah. I think I just … I never … I don’t know. since probably before I came to shimer was the last time I actually like really either showed interest in a guy. like even if I was interested, I haven’t within the past three years, like done anything about it really. brendan, close, like we’ve actually kissed and … but like … that was. |
| Depression | 6 | As like two to three months ago, I was crying and this was more or less yknow I didn’t want to be doing this but now I’m crying, I’m like, I don’t care that I’m crying, yknow. |
| | 5 | Well I think one of the things I wanted to ask you about was what we talked about last week the matter of guilt when I touched on that briefly. I’m a little bit confused because it seems to me that a person has desires to kind of change their ways as it were that one of the motives of them wanting to do that is them some feeling of guilt or something approximating guilt about the way they’re presently acting. and yet you said that you thought that I should feel that way, people in general too, but in this particular case me should not feel guilty for example about vanity because I’ve done that. so you think that they should feel that way but it seems to me that one of the motivating forces for me is a certain sense of guilt. or not exactly guilt but maybe something like … well I suppose it is guilt the guilt of throwing away a good part of my life. I feel guilty about that even in a moral sense as well as a practical one. so what do you mean by that? how do you work around … .... |
| | 6.6 | Um, the pamelor 50, I take that at night, that seems to be doing okay. I mean I’m still a little depressed but, you know, basically that seems to be doing okay. nr : doctor, patients are leaving. |
| | 3.6 | It might be. I guess I feel the anger because it, like a given situation turned out unhappy or sad instead of happy. |
| | 3.6 | And yknow, be supportive for my nephews and my nieces and I found myself kind of leaning with my dad in just being sad, just being, yknow, it was just different. |
| Low Self-Esteem | 5 | Umm … well certainly if I’m not being obsessive and worrying over things. because the truth is, it’s obvious when I’m in that state, even if I don’t tell people. you can see it all over my face. and he notices. and so if I’m not that way, and if I’m confident and mature. |
| | 4.2 | Which probably isn’t a good thing. but I just don’t ’ do it. and they tell me,“ you should do it. you should do it. you should do it. ”. and I say, notes don’t work for me. I make up excuses. I tell them that, no, that’s not me. leave me alone. let me do my own thing. and that’s sort of me not taking criticism well. |
| | 3.4 | It’s funny, because when I talk about my relationship with my parents with cole it’s always that I’ll say something negative and I’ll say, “but I know they love me, but-but-but-but,” you know? like I’ll always have an aside for “yeah, like, but it’s okay because … ” right, I guess I feel like it’s not okay to be. I know it is okay to be angry at times. and of course now I’m thinking, but I don’t blame them, they’re who they are, you know? yeah, I always have to have an excuse for them. but -. |
| | 2.6 | I think I mentioned it a little bit earlier but like, I was talking to my friend about it. I mean he - we were sort of sitting in the lunchroom and he saw a girl that he thought was attractive walk by and he was like, “wow, she’s a really attractive girl. ”. and I was like, “eh. ”. he’s like “what, you don’t think she is? ”. and I was like, “yeah, but who cares? ”. that’s just sort of been my sort of feeling lately, that it’s like yeah, she’s an attractive girl. whatever. |
| | 2.6 | No.. if someone is looking yeah because well first of all one of the answers that I think that I can’t find is like this question of what is a man looking for? |
Again, we compare the L-LDA model to lasso logistic regression. Table III shows AUC scores for the L-LDA and LLR models with 5%, 10%, and 20% cutoffs used to create binary scores from the human representativeness ratings. For the L-LDA and LLR models, we computed an AUC score for the model relative to each individual rater and then averaged the AUCs across raters. To compare the models against human raters, we also calculated a human reliability score to serve as a measure of inter-annotator agreement. These scores give us an upper bound on performance against which to compare our model (assuming that human raters are performing optimally). To calculate this reliability score, we compared each individual rater against each of the other raters by computing pairwise AUC scores. We express human reliability as AUC scores so that L-LDA performance and human reliability are expressed in the same units and fair comparisons can be made. For an individual rater, we calculate AUCs using the ratings of the individual rater (analogous to the model scores) to predict the binarized ratings (at 5%, 10%, and 20% cutoffs) of each of the other raters. We compute human reliability as the average of all AUCs calculated from each pair of raters and report the computed scores in Table III. To compute human reliability in terms of R-precision, we perform an analogous computation using R-precision instead of AUC.
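A sketch of this pairwise reliability computation follows, with simulated ratings in place of the real ones; the quantile-based binarization is our approximation of taking each rater’s top c% of ratings:

```python
import itertools
import numpy as np
from sklearn.metrics import roc_auc_score

def human_reliability(ratings, c=0.20):
    """Average pairwise AUC: each rater's raw ratings predict every other
    rater's binarized (approximately top c%) ratings."""
    aucs = []
    for a, b in itertools.permutations(range(len(ratings)), 2):
        cutoff = np.quantile(ratings[b], 1 - c)
        y = (ratings[b] > cutoff).astype(int)
        if y.min() == y.max():
            continue                       # skip degenerate binarizations
        aucs.append(roc_auc_score(y, ratings[a]))
    return float(np.mean(aucs))

rng = np.random.default_rng(0)
ratings = rng.integers(1, 8, size=(6, 993))   # simulated: 6 raters, 1-7 scale
print(human_reliability(ratings, c=0.20))
```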
Table III.
Talk-turn coding performance for the L-LDA model, the lasso logistic regression (L1 LR) model, and human reliability. AUC and R-precision scores are shown for the top 5%, 10%, and 20% of talk-turns as rated by human coders. Human reliability is expressed in AUC and R-precision scores to enable direct comparison to model performance.
AUC scores:

| Code | No. Talk-Turns | L-LDA (5%) | Human (5%) | L1 LR (5%) | L-LDA (10%) | Human (10%) | L1 LR (10%) | L-LDA (20%) | Human (20%) | L1 LR (20%) |
|---|---|---|---|---|---|---|---|---|---|---|
| anger | 197 | 0.89 | 0.94 | 0.91 | 0.84 | 0.94 | 0.85 | 0.75 | 0.87 | 0.73 |
| anxiety | 200 | 0.76 | 0.80 | 0.77 | 0.72 | 0.78 | 0.72 | 0.66 | 0.72 | 0.64 |
| depression | 198 | 0.73 | 0.87 | 0.74 | 0.67 | 0.84 | 0.73 | 0.66 | 0.84 | 0.70 |
| low self-esteem | 200 | 0.70 | 0.82 | 0.74 | 0.64 | 0.81 | 0.68 | 0.60 | 0.78 | 0.63 |
| suicidal behavior | 198 | 0.78 | 0.96 | 0.77 | 0.70 | 0.87 | 0.70 | 0.67 | 0.82 | 0.67 |
| average | | 0.77 | 0.88 | 0.79 | 0.71 | 0.85 | 0.73 | 0.67 | 0.81 | 0.67 |

R-precision scores:

| Code | No. Talk-Turns | L-LDA (5%) | Human (5%) | L1 LR (5%) | L-LDA (10%) | Human (10%) | L1 LR (10%) | L-LDA (20%) | Human (20%) | L1 LR (20%) |
|---|---|---|---|---|---|---|---|---|---|---|
| anger | 197 | 0.57 | 0.58 | 0.67 | 0.56 | 0.72 | 0.59 | 0.44 | 0.68 | 0.41 |
| anxiety | 200 | 0.22 | 0.33 | 0.26 | 0.23 | 0.43 | 0.19 | 0.31 | 0.46 | 0.28 |
| depression | 198 | 0.10 | 0.39 | 0.26 | 0.23 | 0.53 | 0.24 | 0.31 | 0.56 | 0.30 |
| low self-esteem | 200 | 0.08 | 0.36 | 0.14 | 0.11 | 0.44 | 0.14 | 0.25 | 0.45 | 0.25 |
| suicidal behavior | 198 | 0.42 | 0.78 | 0.12 | 0.34 | 0.64 | 0.13 | 0.38 | 0.60 | 0.178 |
The table shows that both L-LDA and LLR perform well at identifying representative talk-turns relative to human reliability. On average, L-LDA AUC scores are between 10 and 18% lower than average inter-rater AUC scores. The L-LDA model performs distinctly better at identifying talk-turns representative of anger than talk-turns representative of the other tested symptoms. The unique lexicon of words used to express anger may influence the model’s performance. In addition, the other 4 symptoms may be expressed in broader language that is more difficult to capture through uni-, bi-, and trigrams. In addition to variation in performance by symptom, the model performs better when identifying the top 5% of representative talk-turns than the top 10%, suggesting that the model can identify the most relevant talk-turns in a session with reasonable precision. The comparison between L-LDA and the baseline model shows that the LLR model performs about the same or marginally better than the L-LDA model at each of the three cutoffs (p=0.28, p=0.11, and p=0.78, respectively, in pairwise t-tests).
VII. Discussion and Conclusion
In this article, we have presented the Labeled Latent Dirichlet Allocation Model as a method for the semi-automatic code annotation of psychotherapy sessions. L-LDA outperforms standard discriminative methods at identification of session-level codes, replicating results from prior psychotherapy process research and general applications in multi-document classification. In addition to session-level coding, machine-learning methods show promise for annotation of psychotherapy transcripts at fine-grained levels of detail, such as for talk-turn annotation. L-LDA and LLR can identify talk-turns representative of session-level codes with accuracy close to that of trained human coders.
Machine learning methods for document classification often focus either on topic-based classification involving large documents and many topics, or on sentiment classification involving a small set of sentiment labels and often shorter documents [31]. Our work involves both topic-based classification (for session-level prediction) and analysis more similar to sentiment classification (talk-turn prediction for a small set of class labels). The generative nature of L-LDA provides a natural bridge between these two types of document classification problems by inferring labels for talk-turns based on session-level metadata. Topic-based classification is performed by integrating topic information over the constituent parts of a document (in our case talk-turns), and sentiment classification is performed using a mapping from topic-based class labels to sentiment labels. In this way, L-LDA provides richer information than many sentiment classification methods and more flexibility than some topic-based classification models. Examining the mapping from topic-based classes to sentiment classes is a direction for future work, and we suspect that incorporating this information will lead to improved predictive performance.
Promising results in annotation of psychotherapy transcripts suggest potential for application to clinical settings in addition to reducing labor costs and improving the scalability of observational coding. For example, in the process of training junior therapists, supervising therapists review records of the junior therapist’s sessions. Supervising therapists are often in charge of many junior therapists and are in need of tools that make the review process more efficient. One method for making this process more efficient would be to use text-based models that predict important topics discussed in the sessions (such as depression, suicide, etc.). The supervisor can get a quick summary of session content and can locate specific passages in the session by content labels. Additionally, the supervisor can provide feedback to the model on which passages were relevant to that topic and thus improve future code annotation.
L-LDA is a model of the semantics of language and, like all models, approximates the true underlying process of generating speech to convey meaning. L-LDA makes several simplifying assumptions about text generation that provide starting points for further model development: the "bag-of-words" assumption discards information about the temporal structure of language and its relation to semantics, and the model also ignores syntactic dependencies. An important direction for semantic analysis of psychotherapy sessions is therefore to incorporate sequential information and context. This would require significant feature engineering but could draw on existing text processing techniques such as word and sentence embeddings, as in the sketch below.
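As one illustration of the embedding direction (not something the present paper implemented), the following sketch trains word vectors on tokenized talk-turns with gensim (4.x API assumed) and represents each turn as the mean of its word vectors, retaining some distributional context that a bag-of-words representation discards. The toy data and `turn_embedding` helper are hypothetical.

```python
# Hypothetical starting point for the embedding direction discussed above;
# assumes gensim (4.x API) is installed. Toy data for illustration only.
import numpy as np
from gensim.models import Word2Vec

talk_turns = [
    ["i", "feel", "angry", "all", "the", "time"],
    ["tell", "me", "more", "about", "that"],
]

w2v = Word2Vec(sentences=talk_turns, vector_size=50, window=5,
               min_count=1, workers=1, seed=0)

def turn_embedding(tokens, model):
    """Mean word vector for a talk-turn; zero vector if no token is known."""
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

features = np.stack([turn_embedding(t, w2v) for t in talk_turns])
print(features.shape)  # (n_turns, 50); usable as input to a classifier
```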
The work presented above analyzes the relationship between the semantic information contained in spoken language and subjects and symptoms that involve not just semantics but also emotion and behavior. To gain a deeper understanding of psychotherapy, semantic language models need to be extended to cover behavior. Considerable information is carried by behavioral cues such as tone, laughter, or body language, which modulate the semantic meaning of a statement. While these cues are likely correlated with language, we expect that jointly analyzing behavior and language will lead to a deeper understanding of the psychotherapy process and its effect on patient outcomes.
In conclusion, we used data from patient-provider interactions in psychotherapy to illustrate the potential of machine learning methods to automate the coding of key aspects of clinical conversation and to illuminate the linguistic processes behind psychotherapy. L-LDA is a robust automated coding method that outperforms a baseline logistic regression discriminative method at predicting session-level codes and that can localize within-session information using only session-level metadata.
Acknowledgments
Funding for the preparation of this manuscript was provided by the National Institutes of Health / National Institute on Alcohol Abuse and Alcoholism (NIAAA) under award number R01/AA018673 (David C. Atkins/Mark Steyvers, co-PIs) and by the National Institute on Drug Abuse under award number R34/DA034860.
Appendix A. Codes Used in Talk-Turn Prediction
Several of the symptoms for which we performed additional local coding are closely associated with more than one code in the psychotherapy corpus (e.g., the anger symptom is closely associated with both anger and frustration). The human raters who judged the representativeness of the 5 symptoms were unaware of the variety of content codes used in the psychotherapy corpus, so a rater's concept of suicide might not map onto the (narrow) concept of suicide in the corpus. We therefore had a clinical psychologist create associated code sets by selecting from the list of psychotherapy symptom codes (see Table IV), with the constraint that each code set contain the matching code term. These meta code sets were created prior to any evaluation of the model.
Table IV.
Symptom codes and sets of associated codes
| Symptom Code | Code Set |
|---|---|
| anger | anger, frustration |
| anxiety | anxiety, fear, nervousness, social anxiety, stress, death anxiety, fearfulness, panic, paranoia, restlessness |
| depression | depression, grief, guilt, hopelessness, loneliness, shame, crying, depressive disorder, despair, dysphoria, loss of appetite, problems concentrating, sadness, suicidal behavior, withdrawn |
| low self-esteem | low self-esteem, self-esteem |
| suicidal behavior | hospitalization, suicide, cutting, dysphoria, death |
For each of the 5 symptoms in the ratings experiment, we take the set of codes from the psychotherapy corpus that are closely associated with the symptom (e.g., for the anger symptom, we take the set {anger, frustration}) and average the model predictions across the codes in the set. This yields a model representativeness score for each talk-turn in the ratings experiment that can be compared to the binarized human ratings (highly representative/not highly representative), as sketched below. Combining predictions among closely-related labels is a simple implementation of the idea that labels in multi-label document classification are often dependent and that leveraging such dependencies can improve predictive performance [28].
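A minimal sketch of this averaging follows, with hypothetical names (`code_sets`, `pred`) and random stand-in data: per-code talk-turn predictions are averaged into a symptom-level representativeness score and then compared to binarized human ratings via AUC.

```python
# Minimal sketch of the code-set averaging described in Appendix A.
# `pred` and the ratings are random stand-ins, not real model output.
import numpy as np
from sklearn.metrics import roc_auc_score

code_sets = {
    "anger": ["anger", "frustration"],
    "low self-esteem": ["low self-esteem", "self-esteem"],
}

def symptom_scores(pred, code_set):
    """pred maps code name -> (n_turns,) array of per-turn predictions;
    the symptom score is the per-turn mean over the associated code set."""
    return np.mean([pred[c] for c in code_set], axis=0)

rng = np.random.default_rng(0)
pred = {c: rng.random(100) for c in code_sets["anger"]}  # 100-turn session
human_binary = rng.integers(0, 2, 100)  # binarized 'highly representative'
scores = symptom_scores(pred, code_sets["anger"])
print(roc_auc_score(human_binary, scores))
```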
Footnotes
The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.
Results were previously presented at ABCT 2014, 48th Annual Convention of the Association for Behavioral and Cognitive Therapies in Philadelphia, PA, November 21–24, 2014.
Contributor Information
Garren Gaut, Department of Cognitive Science, University of California Irvine, 2316 Social & Behavioral Sciences Gateway Building, Irvine, CA, USA 92697-5100, phone: (949) 824-7642.
Mark Steyvers, Department of Cognitive Science, University of California Irvine, 2316 Social & Behavioral Sciences Gateway Building, Irvine, CA, USA 92697-5100, phone: (949) 824-7642.
Zac E Imel, Department of Educational Psychology, University of Utah, Salt Lake City, UT, USA.
David C Atkins, Department of Psychiatry and Behavioral Science, University of Washington, WA, USA.
Padhraic Smyth, Department of Computer Science, University of California, Irvine, CA, USA.
References
1. Feldman MD, Franks P, Duberstein PR, et al. Let’s not talk about it: suicide inquiry in primary care. Ann Fam Med. 2007;5(5):412–418. doi:10.1370/afm.719.
2. Golin CE, Liu H, Hays RD, et al. A prospective study of predictors of adherence to combination antiretroviral medication. J Gen Intern Med. 2002;17(10):756–765. doi:10.1046/j.1525-1497.2002.11214.x.
3. Rakel D, Barrett B, Zhang Z, Hoeft T, Chewning B, Marchand L, Scheder J. Perception of empathy in the therapeutic encounter: effects on the common cold. Patient Educ Couns. 2011;85(3):390–397. doi:10.1016/j.pec.2011.01.009.
4. Ambady N, LaPlante D, Nguyen T, et al. Surgeons’ tone of voice: a clue to malpractice history. Surgery. 2002;132(1):5–9. doi:10.1067/msy.2002.124733.
5. Miller WR, Rollnick S. Motivational Interviewing: Preparing People for Change. 3rd ed. New York: Guilford; 2002.
6. Christensen A, Atkins DC, Berns S, et al. Traditional versus integrative behavioral couple therapy for significantly and chronically distressed married couples. J Consult Clin Psychol. 2004;72(2):176. doi:10.1037/0022-006X.72.2.176.
7. United States Public Health Service, Office of the Surgeon General. Surgeon General’s report. 1999. Available: http://profiles.nlm.nih.gov/ps/access/NNBBHS.pdf.
8. Steyvers M, Griffiths T. Probabilistic topic models. In: Handbook of Latent Semantic Analysis. ch. 21. London, UK: Psychology Press; 2007:424–440.
9. Purver M. Topic segmentation. In: Spoken Language Understanding: Systems for Extracting Semantic Information from Speech. Hoboken, NJ: Wiley; 2011:291–317.
10. Dowman M, Savova V, Griffiths TL, Kording KP, Tenenbaum JB, Purver M. A probabilistic model of meetings that combines words and discourse features. IEEE Trans Audio Speech Lang Processing. 2008;16(7):1238–1248.
11. Howes C, Purver M, McCabe R. Investigating topic modeling for therapy dialogue analysis. In: Proceedings of IWCS; 2013. Available: http://www.ling.uni-potsdam.de/iwcs2013/Papers/CSCT-4.pdf.
12. Rubin TN, Chambers A, Smyth P, et al. Statistical topic models for multi-label document classification. Mach Learn. 2012;88(1–2):157–208.
13. Atkins DC, Steyvers M, Imel ZE, et al. Scaling up the evaluation of psychotherapy: evaluating motivational interviewing fidelity via statistical text classification. Implement Sci. 2014;9(49):1–11. doi:10.1186/1748-5908-9-49.
14. Griffiths TL, Steyvers M. Finding scientific topics. Proc Natl Acad Sci USA. 2004;101(Suppl 1):5228–5235.
15. Imel ZE, Steyvers M, Atkins DC. Computational psychotherapy research: scaling up the evaluation of patient-provider interactions. Psychotherapy. 2014. Available: http://www.psiexp.ss.uci.edu/research/papers/ImelAtkinsSteyvers2014.pdf.
16. Mitchell M, Hollingshead K, Coppersmith G. Quantifying the language of schizophrenia in social media. In: Proceedings of CLPsych; 2015. Available: http://m-mitchell.com/clpsych2015/pdf/CLPsych02.pdf.
17. Purver MRJ. The theory and use of clarification requests in dialogue [PhD dissertation]. London, UK: Department of Computer Science, King’s College London; 2004.
18. Toutanova K, Manning CD. Enriching the knowledge sources used in a maximum entropy part-of-speech tagger. In: Proceedings of EMNLP/VLC; 2000. Available: http://www-nlp.stanford.edu/cmanning/papers/emnlp2000.pdf.
19. Perotte A, Pivovarov R, Natarajan K, et al. Diagnosis code assignment: models and evaluation metrics. J Am Med Inform Assoc. 2014;21(2):231–237. doi:10.1136/amiajnl-2013-002159.
20. Mayfield E, Laws MB, Wilson IB, et al. Automating annotation of information-giving for analysis of clinical conversation. J Am Med Inform Assoc. 2014;21(e1):e122–e128. doi:10.1136/amiajnl-2013-001898.
21. Iyer SV, Harpaz R, LePendu P, et al. Mining clinical text for signals of adverse drug-drug interactions. J Am Med Inform Assoc. 2014;21(2):353–362. doi:10.1136/amiajnl-2013-001612.
22. Nadkarni PM, Ohno-Machado L, Chapman WW. Natural language processing: an introduction. J Am Med Inform Assoc. 2011;18(5):544–551. doi:10.1136/amiajnl-2011-000464.
23. Chapman WW, Nadkarni PM, Hirschman L, et al. Overcoming barriers to NLP for clinical text: the role of shared tasks and the need for additional creative solutions. J Am Med Inform Assoc. 2011;18(5):540–543. doi:10.1136/amiajnl-2011-000465.
24. Poulin C, Shiner B, Thompson P, et al. Predicting the risk of suicide by analyzing the text of clinical notes. PLoS ONE. 2014;9(1):e85733. doi:10.1371/journal.pone.0085733.
25. Webb CA, DeRubeis RJ, Barber JP. Therapist adherence/competence and treatment outcome: a meta-analytic review. J Consult Clin Psychol. 2010;78:200–211. doi:10.1037/a0018912.
26. Manning CD, Raghavan P, Schütze H. Introduction to Information Retrieval. Cambridge, UK: Cambridge University Press; 2008.
27. Imel ZE, Baldwin SA, Baer JS, Hartzler B, Dunn C, Rosengren DB, Atkins DC. Evaluating therapist adherence in motivational interviewing by comparing performance with standardized and real patients. J Consult Clin Psychol. 2014;82(3):472. doi:10.1037/a0036158.
28. Tsoumakas G, Katakis I. Multi-label classification: an overview. In: Data Warehousing and Mining: Concepts, Methodologies, Tools, and Applications. ch. 6. Hershey, PA: Information Science Reference; 2007:64–74.
29. Sebastiani F. Machine learning in automated text categorization. ACM Computing Surveys. 2002;34(1):1–47.
30. Cormack GV. Email spam filtering: a systematic review. Foundations and Trends in Information Retrieval. 2007;1(4):335–455.
31. Pang B, Lee L. Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval. 2008;2(1–2):1–135.
32. Ramage D, Rosen E. Stanford Topic Modeling Toolbox [computer software]. 2009. Available: http://nlp.stanford.edu/software/tmt/tmt-0.4/
33. Ott M. JGibbLabeledLDA [computer software]. 2013. Available: https://github.com/myleott/JGibbLabeledLDA.
34. Shuyo N. Labeled Latent Dirichlet Allocation [computer software]. Cybozu Labs Inc.; 2010. Available: https://github.com/shuyo/iir/blob/master/lda/llda.py.
35. Dumais S, Chen H. Hierarchical classification of web content. In: Proceedings of the 23rd ACM SIGIR; 2000. Available: http://www.msr-waypoint.com/en-us/um/people/sdumais/sigir00.pdf.
36. Billsus D, Pazzani M. A hybrid user model for news story classification. In: Proceedings of UM; 1999. Available: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.66.5846&rep=rep1&type=pdf.
37. Li Q, Melton K, Lingren T, et al. Phenotyping for patient safety: algorithm development for electronic health record based automated adverse event and medical error detection in neonatal intensive care. J Am Med Inform Assoc. 2014;21(5):776–784. doi:10.1136/amiajnl-2013-001914.
38. Ye Y, Tsui FR, Wagner M, et al. Influenza detection from emergency department reports using natural language processing and Bayesian network classifiers. J Am Med Inform Assoc. 2014;21(5):871–875. doi:10.1136/amiajnl-2013-001934.
39. Marafino BJ, Davies JM, Bardach NS, et al. N-gram support vector machines for scalable procedure and diagnosis classification, with applications to clinical free text data from the intensive care unit. J Am Med Inform Assoc. 2014;21(5):815–823. doi:10.1136/amiajnl-2014-002694.
40. Blei DM, Ng AY, Jordan MI. Latent Dirichlet allocation. Journal of Machine Learning Research. 2003;3:993–1022.
41. Howes C, Purver M, McCabe R, Healey PG, Lavelle M. Predicting adherence to treatment for schizophrenia from dialogue transcripts. In: Proceedings of SIGDIAL; 2012. Available: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.361.9153&rep=rep1&type=pdf#page=97.
42. Howes C, Purver M, McCabe R. Using conversation topics for predicting therapy outcomes in schizophrenia. Biomed Inform Insights. 2013;6(Suppl 1):39–50. doi:10.4137/BII.S11661.
43. Ramage D, Hall D, Nallapati R, et al. Labeled LDA: a supervised topic model for credit attribution in multi-labeled corpora. In: Proceedings of EMNLP; 2009. Available: http://ai.stanford.edu/nm-ramesh/emnlp09.pdf.

