Abstract
Objective:
Efficient and effective fidelity measurement is necessary to improve the rigor, and reduce the burden, of motivational interviewing (MI) implementation studies, whether fidelity serves as an outcome or as the basis for quality improvement strategies. This article reports on such a measure, developed with rigorous methodology and tested in community-based substance use treatment settings.
Methods:
This scale development study analyzed data from a National Institute on Drug Abuse–funded trial testing the Leadership and Organizational Change for Implementation (LOCI) strategy. Using item response theory (IRT) methods and Rasch modeling, we analyzed coded recordings (N = 1,089) of intervention sessions from 238 providers across 60 substance use treatment clinics in nine agencies participating in an implementation trial focused on MI.
Results:
These methods yielded a reliable and valid 12-item scale with single-construct dimensionality, strong item-session maps, good rating scale functioning, and acceptable item fit. Separation reliability was high, as was agreement within adjacent rating categories. No items were significantly misfitting, though one was borderline. The LOCI community providers were less likely to be scored in the advanced competence range, and items were more difficult, compared with the original development sample.
Conclusions:
The 12-item Motivational Interviewing Coach Rating Scale (MI-CRS) showed excellent performance in a large sample of community-based substance use treatment providers using real recordings. The MI-CRS is the first efficient and effective fidelity measure appropriate for ethnically diverse groups, for interventions delivering MI alone or integrating MI with other treatments, and for both adolescents and adults. Follow-up coaching by trained supervisors may be needed for community-based providers to achieve the highest level of MI competence.
Keywords: Fidelity, Community, Substance use, Methodology
1. Introduction
Motivational interviewing (MI) is an evidence-based practice for individuals receiving substance use treatment (SAMHSA, 2019). To support its adoption and sustainment in community practice settings, MI has been the focus of numerous implementation science efforts (Pas & Bradshaw, 2021). Measuring intervention fidelity is critical to these efforts. MI competence ratings have the potential to serve both as an outcome of implementation strategies and as the basis for ongoing quality improvement efforts (Aarons et al., 2011). To fulfill these roles, MI fidelity measurements must not only be feasible (i.e., efficient for community practice settings) but also psychometrically sound (i.e., effective at providing precise measurements; Schoenwald et al., 2011).
The most recent review of MI fidelity measures (Hurlocker et al., 2020) identified five observer-rated tools with strong psychometric properties. Two of these measures were recommended for implementation research contexts in which the fidelity measure can be used as both an outcome measure and as a quality improvement tool. The first measure was the widely used Motivational Interviewing Treatment Integrity Codes (MITI; Moyers et al., 2016), and the second was the Independent Tape Rating Scale (ITRS; Martino et al., 2008). Both are top-tier measures; however, in terms of measurement efficiency, they may be resource intensive in community practice settings. The resource requirements include the time for training and certification (approximately 40 hours; Hurlocker et al., 2020) as well as the time for rating each session (at least 60 minutes). Additionally, the measures have not been sufficiently developed or evaluated with state-of-the-art measurement models, with ethnically diverse samples, or with adolescents and emerging adults.
With respect to measurement effectiveness, psychometric evaluations should address the common data features that accompany fidelity measurement efforts in community practice settings (Schoenwald et al., 2011). These features, which often require flexible and specialized models, are readily accommodated by measurement models based in Item Response Theory (IRT; e.g., Bock & Gibbons, 2021). For instance, IRT models can evaluate the role of nested data (e.g., clients within providers; De Boeck & Wilson, 2004; Kamata, 2001), the influence of observational coders (Eckes, 2015; Linacre, 1994a), the performance of rating scales (Linacre, 2002), the stability of items across distinct uses of the instrument (Bond et al., 2021; Engelhard Jr, 2013), and the reliability of scores at different levels of measurement (e.g., session-level scores, provider-level scores; Raudenbush et al., 2003). Further, IRT-based models routinely accommodate missing data, and some, such as the Rasch measurement model (Bond et al., 2021), have modest sample size requirements (Linacre, 1994b). Combined, these features make it possible to provide a rigorous psychometric evaluation that directly accommodates the data-related challenges of measuring intervention fidelity in community practice settings.
Our prior work utilized IRT to develop and evaluate the MI Coach Rating Scale (MI-CRS), an observational instrument for measuring the competence of provider MI delivery (Naar, Chapman, et al., 2021). The MI-CRS includes 12 items that are rated on a 4-point Likert-type scale, and coder training requires 12 hours for completion. Coders can be supervisors or consultants (i.e., not highly specialized, extensively trained observational coders), and ratings typically are assigned based on observation of a 20-minute segment of the provider interaction (first five minutes, middle 10 minutes, and last five minutes). The initial development and evaluation were based on actual and standardized interactions between providers and clients and included adult patients as well as adolescents and families. However, most sessions addressed health-related behaviors, such as those associated with obesity and HIV.
The goal of the current study is to extend the original psychometric study of the MI-CRS to community-based substance use settings. Specifically, we used the MI-CRS in the context of a large-scale implementation study across Arizona and California—the Leadership and Organizational Change for Implementation (LOCI) trial. This trial randomized workgroups within organizations to LOCI leadership training versus a webinar control condition (Aarons et al., 2017). The LOCI implementation strategy was intended to support MI implementation and sustainment at all levels of the organization, and the MI-CRS served as a primary outcome measure. The current study used IRT-based measurement models to evaluate the psychometric performance of the MI-CRS in community-based substance use treatment sessions. Additionally, the current study demonstrates the use of IRT methods that accommodate data features characteristic of fidelity measurement efforts in community practice settings. The evaluation not only demonstrates the use of a highly flexible psychometric methodology, but also provides critical evidence to inform the ongoing use, and potential modification, of the MI-CRS in community-based substance use treatment settings.
2. Method
2.1. Overview of parent study methods
The LOCI protocol has been previously described (Aarons et al., 2017). In short, in a cluster randomized trial, sixty substance use treatment clinics from nine agencies were randomly assigned (within agency) to either the LOCI implementation strategy or a leadership webinar condition (control). Providers in both conditions received a two-day MI training from a member of the Motivational Interviewing Network of Trainers (MINT). Following the training, the study asked providers in both conditions to record all sessions for submission. One session per month per provider was randomly selected for coding with the MI-CRS, and 10% of these sessions were rated by two coders. To ensure real-world applicability, coding was completed by an MI training company (behaviorchangeconsulting.org) external to the academic institutions of the authors. The company provides MI training, coaching, and coding to academic institutions as well as to private and public agencies across a variety of behavioral and physical health settings. For the current study, the primary coder was a MINT member while the secondary coders were not. All coders completed a minimum of one day of MI workshop training and 3 hours of coding training, which is the minimum training for quality monitoring. For the purpose of research outcomes, coders co-code standard expert-consensus recordings until they reach absolute agreement on 8 of 12 items and no more than a one-category discrepancy from the expert consensus. Research coders also attend a monthly coding lab to co-code a session and discuss discrepancies.
2.2. Measure
The MI-CRS is a 12-item measure (see Table 1), rated on a 4-point ordered categorical rating scale (1 = Poor, 2 = Fair, 3 = Good, 4 = Excellent), that can be used by MI coaches as well as researchers and is designed to be rated in one pass of real or simulated encounters. The coding manual includes item-specific behavioral definitions for each category of the rating scale. The items represent essential MI components such as a collaborative stance, autonomy support, open questions to elicit motivational language (i.e., change talk), reflections of change talk, affirmations, and summaries. An average score across the 12 items (i.e., ranging from 1 to 4) is utilized as a single dimension of MI competence (see the illustrative scoring sketch following Table 1). Psychometric evaluation of these items is the focus of the current study.
Table 1.
MI-CRS Items Used in the LOCI Sample
| # | Item Text |
|---|---|
| | The counselor… |
| 1. | …cultivates empathy and compassion with client(s) |
| 2. | …fosters collaboration with client(s) |
| 3. | …supports autonomy of client(s) |
| 4. | …works to evoke client(s) ideas and motivations for change |
| 5. | …balances the client’s agenda with focusing on the target behaviors |
| 6. | …demonstrates reflective listening skills |
| 7. | …uses reflections strategically |
| 8. | …reinforces strengths and positive behavior change with affirmations/affirming reflection |
| 9. | …uses summaries effectively |
| 10. | …asks questions in an open-ended way |
| 11. | …solicits feedback from client(s) |
| 12. | …manages counter change talk and discord |
Note. MI-CRS = MI Coach Rating Scale. Each item is rated on a 4-point ordered categorical rating scale (1 = Poor, 2 = Fair, 3 = Good, 4 = Excellent). The coding manual includes item-specific, behaviorally-defined category definitions.
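To make the scoring rule above concrete, the following is a minimal sketch of computing a session-level MI-CRS score as the mean of the 12 item ratings. It is illustrative only (not the study's scoring code), and the shorthand item keys and rating values are hypothetical.

```python
# Hypothetical ratings (1 = Poor ... 4 = Excellent) for the 12 MI-CRS items
# from a single coded session; keys are shorthand labels, not official names.
ratings = {
    "empathy_compassion": 3, "collaboration": 3, "autonomy_support": 2,
    "evoking": 2, "agenda_balance": 3, "reflective_listening": 2,
    "strategic_reflections": 2, "affirmations": 3, "summaries": 1,
    "open_questions": 3, "solicits_feedback": 2, "counter_change_talk": 2,
}

assert len(ratings) == 12 and all(1 <= r <= 4 for r in ratings.values())

# Overall competence score on the 1-4 metric (simple mean across items).
mi_crs_score = sum(ratings.values()) / len(ratings)
print(f"MI-CRS session score: {mi_crs_score:.2f}")
```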
2.3. Data source
The LOCI MI-CRS data included all coded provider/client sessions across the two conditions. A 20-minute portion of each session was coded, and for sessions that exceeded 20 minutes (as is common for treatment sessions), the coded portion was divided across the first five minutes, middle 10 minutes, and last five minutes. The final data had a maximum of 11 repeated measurements of provider competence (Mean = 4.6, Median = 4.0, SD = 3.1) nested within 238 providers. Of 1,089 rated sessions, 984 (90.4%) were rated by one coder and 105 (9.6%) were rated by two coders.
2.4. Data analysis strategy
The psychometric evaluation closely followed the methods and interpretive guidelines of Naar et al. (2021), which detailed the original development and evaluation of the MI-CRS. The prior evaluation was based in IRT, specifically, Rasch measurement models (Bond et al., 2021). The current study extends prior work to real treatment sessions—rather than simulated interactions—and community-based treatment for substance use disorders. The analyses included three types of Rasch measurement models, detailed next, to evaluate specific aspects of psychometric performance.
The Rasch model is a single-parameter logistic measurement model and a special case of a one-parameter IRT model. One difference is that the Rasch model assumes each item is equally discriminating; however, the model, and its extensions, are well-suited to evaluating nested data and rater data. The standard, dichotomous model formulation, with "facets" for persons and items, is

$$\ln\!\left(\frac{P_{ni1}}{P_{ni0}}\right) = B_n - D_i$$

According to the model, $P_{ni1}$ is the probability of endorsement (or a correct response) by person $n$ on item $i$, whereas $P_{ni0}$ is the probability of nonendorsement (or an incorrect response). These terms provide the odds of person $n$ endorsing item $i$, expressed in logits (i.e., the natural log of the odds). The log-odds of endorsement is a function of the interaction between person $n$ with ability $B_n$ and item $i$ with difficulty $D_i$. For example, on an achievement test, if the respondent has a high level of knowledge and the item is easy, the likelihood of a correct response is high.
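As a worked illustration of this formula (not code from the study), the sketch below computes the Rasch endorsement probability from a person ability and an item difficulty in logits; the example values are hypothetical.

```python
import numpy as np

def rasch_probability(ability: float, difficulty: float) -> float:
    """Dichotomous Rasch model: P(endorsement) = exp(B - D) / (1 + exp(B - D)),
    where B is person ability and D is item difficulty, both in logits."""
    return float(1.0 / (1.0 + np.exp(-(ability - difficulty))))

# A person located 1 logit above an item's difficulty endorses it ~73% of the time.
print(round(rasch_probability(ability=1.0, difficulty=0.0), 2))  # 0.73
```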
The standard model accommodates rating scale responses; however, the LOCI sample also included data "facets" beyond persons and items, specifically: sessions, providers, and coders. As such, the standard model was extended to a Many-Facet Rasch Model (MFRM; Linacre, 1994a), represented by

$$\ln\!\left(\frac{P_{pisck}}{P_{pisc(k-1)}}\right) = B_p - D_i - C_s - S_c - F_k$$

Two main differences exist from the prior model. First, rather than an item being endorsed or not, the model reflects the log-odds of endorsement in each of four categories—Poor, Fair, Good, Excellent—where $F_k$ is the difficulty of the "step" from one category $(k-1)$ to the next higher category $(k)$. Second, the log-odds of endorsement is not only a function of provider competence, $B_p$, item difficulty, $D_i$, and step difficulty, $F_k$; it also depends on the challenge of delivering MI in a specific session, $C_s$, and the severity of the coder assigning ratings, $S_c$. Combining these terms, if a provider has a high level of MI competence, the MI component is not difficult, the session facilitates MI delivery, and the coder is not severe, there is a high probability of the rating being Excellent.
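The sketch below illustrates how such a model turns the facet terms into category probabilities, using the rating-scale parameterization implied by the equation above. It is a simplified illustration with hypothetical parameter values, not output from Facets.

```python
import numpy as np

def category_probabilities(provider, item, session, coder, steps):
    """Category probabilities (Poor, Fair, Good, Excellent) under a many-facet
    rating scale model. provider = competence B_p, item = difficulty D_i,
    session = challenge C_s, coder = severity S_c (all in logits);
    steps = step difficulties [F_1, F_2, F_3] for the transitions
    Poor->Fair, Fair->Good, Good->Excellent."""
    lam = provider - item - session - coder
    # Cumulative sums of (lam - F_k) give the unnormalized log-probabilities;
    # the lowest category (Poor) is fixed at 0.
    log_num = np.concatenate(([0.0], np.cumsum(lam - np.asarray(steps, float))))
    probs = np.exp(log_num - log_num.max())  # subtract max for numerical stability
    return probs / probs.sum()

# Hypothetical example: a moderately competent provider, an average item,
# a neutral session and coder, and steps spaced 1.4 logits apart.
print(category_probabilities(1.0, 0.0, 0.0, 0.0, steps=[-1.4, 0.0, 1.4]).round(2))
```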
The MFRM accommodates each data "facet", but it does not specifically address the nesting of item responses (level 1) within sessions (level 2) within providers (level 3). Evaluation of nesting required a mixed-effects (i.e., multilevel) formulation of a Rasch-equivalent measurement model, a hierarchical generalized linear measurement model (HGLMM; Kamata, 2001; Raudenbush et al., 2003), represented by

$$\log\!\left(\frac{P(Y_{isp} \le \text{Poor})}{P(Y_{isp} > \text{Poor})}\right) = \beta_0 + \sum_{q=2}^{12} \beta_q X_{qisp} + u_{sp} + v_p$$

This model excluded inter-rater data. To accommodate the four-point rating scale, the model used an ordinal outcome distribution (cumulative probability) with a logit link function. This formulation estimates the log-odds of a Poor rating, and two additional thresholds (not listed above) estimate the log-odds of ratings at-or-below Fair and at-or-below Good. The model differentiates the 12 MI-CRS items by including a series of dummy-coded indicators, $X_{qisp}$, with Item 1 serving as the reference item (i.e., the intercept, $\beta_0$). The intercept reflects the difficulty of Item 1 rated as Poor ($\beta_0$), at-or-below Fair ($\beta_0$ + Threshold 2), or at-or-below Good ($\beta_0$ + Threshold 3). For the remaining items, difficulty is computed as the sum of $\beta_0$, the respective item coefficient (e.g., $\beta_2$ for Item 2), and the threshold. Two random effects, representing the nesting of items within sessions ($u_{sp}$) and sessions within providers ($v_p$), conform to Rasch "ability" estimates—that is, the competence levels—for sessions and providers.1
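To make the nesting and dummy coding concrete, the following is a minimal sketch of the long-format data layout such a model assumes (one row per item response, with Item 1 as the reference category). The IDs, ratings, and pandas-based preparation are hypothetical; in the study the model was fit in the HLM software.

```python
import pandas as pd

# Hypothetical long-format data: item responses nested in sessions nested in
# providers (only 3 of the 12 items and 2 sessions shown for brevity).
long = pd.DataFrame({
    "provider_id": [101, 101, 101, 102, 102, 102],
    "session_id":  [1, 1, 1, 2, 2, 2],
    "item":        [1, 2, 3, 1, 2, 3],
    "rating":      [2, 3, 2, 1, 2, 2],   # 1 = Poor ... 4 = Excellent
})

# Dummy-code the items, dropping Item 1 so it is absorbed into the intercept.
dummies = pd.get_dummies(long["item"], prefix="item", drop_first=True).astype(int)
model_data = pd.concat([long, dummies], axis=1)
print(model_data)
```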
From this general IRT-based Rasch measurement framework, the associated models provide multiple tests of psychometric functioning. The standard Rasch rating scale model evaluated dimensionality; the MFRM evaluated rating scale functioning, item fit, reliability, and coder performance; and the HGLMM evaluated nesting effects, multilevel reliability, and the suitability of the MI-CRS for measuring a full continuum of MI competence in the LOCI sample (see Supplemental Materials for the parameter estimates from each model). The Results section includes a description of each test, along with thresholds for interpretation, key findings, and measurement implications. The standard model was estimated with Winsteps (Linacre, 2021b), the MFRM with Facets (Linacre, 2021a), and the HGLMM with HLM (Raudenbush et al., 2020).
3. Results
Following Naar et al. (2021), the analyses evaluated psychometric performance of the MI-CRS across indicators of dimensionality and local independence, rating scale functioning, item fit, reliability and separation, coder reliability, and item invariance.
3.1. Dimensionality & local dependence
Dimensionality evaluates whether the MI-CRS items measured a single dimension of provider competence or multiple distinct dimensions. Related to dimensionality, local dependence identifies items with redundant content. The test for dimensionality was a principal component analysis of standardized residuals (Linacre, 2021b) from the standard Rasch rating scale model with facets for items and sessions. If a single dimension exists, the matrix of residuals will reflect random noise, but with more than one dimension, the residuals will exhibit a meaningful, systematic pattern. To evaluate the extent of dimensionality, the two indicators were the percentage of explained variance, which should be ≥50% for a single dimension, and the magnitude of the eigenvalue for the first PCA contrast, which should be <2.0 (Linacre, 2021b). To evaluate local dependence, the study examined the magnitude of residual correlations for each pair of items.
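For illustration, the sketch below shows one way to obtain the eigenvalue of the first contrast from a matrix of standardized Rasch residuals; it is a simplified stand-in for the Winsteps procedure, and the simulated residuals are hypothetical.

```python
import numpy as np

def first_contrast_eigenvalue(std_residuals: np.ndarray) -> float:
    """Largest eigenvalue from a principal component analysis of the
    correlations among standardized residuals (sessions x items).
    Values below 2.0 are taken as consistent with a single dimension."""
    corr = np.corrcoef(std_residuals, rowvar=False)   # item-by-item residual correlations
    return float(np.linalg.eigvalsh(corr)[-1])        # eigvalsh returns ascending order

# Hypothetical check: 200 sessions x 12 items of pure noise should yield a
# first-contrast eigenvalue well under the 2.0 threshold.
rng = np.random.default_rng(0)
print(round(first_contrast_eigenvalue(rng.standard_normal((200, 12))), 2))
```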
3.1.1. Summary of findings
The items and sessions explained 55% of the total observed variance (target value: >50%), and the eigenvalue for the first contrast was 1.7 (target value: <2.0). The average positive residual correlation between item pairs was small (.05), with a maximum observed correlation of .13. Combined, the results indicate that the MI-CRS, based on the LOCI sample, measures a single dimension of provider MI competence, and the study found no evidence of redundant items. Thus, the results that follow are based on simultaneous evaluation of all 12 MI-CRS items, with the final scoring approach providing a single, overall competence score.
3.2. Rating scale functioning
The MI-CRS has a 4-point ordered categorical rating scale (1 = Poor, 2 = Fair, 3 = Good, 4 = Excellent), and the MFRM (with facets for items, sessions, providers, and coders) provided three indicators for evaluating the scale’s performance (Linacre, 2002). The first was that each category should be well-utilized, which was evaluated based on the descriptive percentage of endorsements received. Next, each category should be interpreted consistently across coders. The study evaluated this indicator based on the category’s Outfit mean-squared statistic relative to a target value of ≤1.5. Finally, each category should be meaningfully distinct from adjacent categories, which means that it achieves a high probability of endorsement for a distinct segment of the underlying competence construct. We evaluated this indicator based on each category’s maximum probability of endorsement, along with the distance between each category. The distance is based on category thresholds (i.e., the point where one category transitions to the next) and should span a range of ≥1.4 logits. Thus, the transition from Poor to Fair should be well-spaced from the next transition from Fair to Good, and in that segment, Fair should be the most probable rating.
3.2.1. Summary of findings
The first three categories (i.e., Poor, Fair, Good) were well utilized, receiving 19%, 43%, and 32% of ratings, respectively, but the upper category (i.e., Excellent) received only 5% of the total ratings. The study found no evidence of inconsistent category interpretation, with Outfit mean-squared statistics (target value: ≤1.5) of 0.9 (Poor), 1.0 (Fair), 1.0 (Good), and 1.2 (Excellent). Additionally, Figure 1 illustrates the desired pattern, in which the response categories form a series of “hills”—that is, each category was the most probable rating for a distinct segment of the MI competence construct. Specifically, the maximum probabilities were ≥99% (Poor), 57% (Fair), 60% (Good), and ≥99% (Excellent), and each category was distinct from adjacent categories, spaced at 2.06 logits (target value: ≥1.4 logits) from Poor to Fair, 1.96 logits from Fair to Good, and 2.26 logits from Good to Excellent. Combined, the results indicate that the 4-point ordered categorical rating scale performed as intended. As such, subsequent models retain the four-category structure, and the study did not combine any categories or otherwise revise them.
Figure 1. Rating scale category probability curves from the Many-Facet Rasch measurement model based on the LOCI sample.
Note. For each category, the curve reflects the probability of endorsement (y-axis) at each level of the MI competence construct (x-axis). The x-axis, from left to right, represents the difference between the level of competence for a session and the difficulty of the competence component being rated. The leftmost position reflects advanced MI components and low MI competence, and the rightmost position reflects basic MI components and high MI competence. If a session has a high level of competence and a basic component of MI is being rated, the probability of an Excellent rating approaches 100%. Conversely, if a session has low competence and an advanced component of MI is being rated, the probability of a Poor rating approaches 100%.
3.3. Item fit
Item fit statistics identify items that contributed greater noise than precision to measurements of MI competence. For example, fit statistics could identify item content that led to inconsistent interpretations and responses, or content that reflected a different underlying dimension or process. Significantly misfitting items are candidates for revision, removal, or replacement. Item fit was evaluated in the MFRM, specifically, based on mean-squared Infit and Outfit statistics (which reflect distinct types of unexpected responses) relative to a value of ≤1.5.
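As a simplified illustration of these statistics (not the Facets computation), the sketch below computes Infit and Outfit mean-squares for a single item from observed ratings, model-expected ratings, and model variances; all values are hypothetical.

```python
import numpy as np

def infit_outfit(observed, expected, variance):
    """Infit and Outfit mean-square statistics for one item. Outfit is the
    unweighted mean of squared standardized residuals; Infit weights each
    squared residual by its model variance. Values near 1.0 indicate good fit."""
    observed, expected, variance = map(np.asarray, (observed, expected, variance))
    squared_residuals = (observed - expected) ** 2
    outfit = np.mean(squared_residuals / variance)
    infit = squared_residuals.sum() / variance.sum()
    return infit, outfit

# Hypothetical ratings of one item across five sessions.
infit, outfit = infit_outfit(
    observed=[3, 2, 4, 2, 3],
    expected=[2.6, 2.2, 3.1, 2.4, 2.9],
    variance=[0.7, 0.6, 0.8, 0.7, 0.7],
)
print(round(infit, 2), round(outfit, 2))
```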
3.3.1. Summary of findings
No items evidenced significant misfit. However, one item was borderline, “The counselor reinforces strengths and positive behavior change” (item 8), which had Infit and Outfit mean-squared statistics of 1.5 and 1.4 (target value: ≤1.5). This item was of approximately average difficulty, and the misfit pattern identified a tendency for unexpected ratings whether a session had a typical or extreme (low or high) level of competence. This finding means that coding training should carefully address this item to avoid such unexpected ratings.
3.4. Reliability & separation
The analyses evaluated three types of reliability for measurements of sessions (i.e., scoring sessions) and for measurements of providers (i.e., scoring providers). First, Rasch separation reliability ranges from 0 to 1 and reflects the proportion of variance not attributable to measurement error (Schumacker & Smith Jr, 2007). The study evaluated separation reliability relative to a target value of ≥.70. Second, from the separation reliability estimate, the strata statistic was computed. Strata is not bounded by 1 and indexes the number of levels of MI competence that can be meaningfully differentiated based on the MI-CRS items. The target value depends on the intended use, and with the MI-CRS used for feedback and evaluation purposes, the threshold for strata was a value ≥3.0. For sessions, separation and strata estimates came from the standard Rasch model, and for providers, the estimates came from the MFRM. Third, a mixed-effects formulation of a Rasch-equivalent measurement model, the HGLMM detailed previously, estimated the reliability of measurements at the level of sessions and providers. The model also estimated the percentage of item response variance attributable to sessions and attributable to providers.
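For illustration, the following sketch computes separation reliability, the separation index, and strata from a set of measures and their standard errors, using standard Rasch definitions; it is not the study's code, and the measures and SEs shown are hypothetical.

```python
import numpy as np

def separation_statistics(measures, standard_errors):
    """Rasch separation reliability, separation index, and strata from
    competence measures (in logits) and their standard errors. Strata
    approximates the number of statistically distinct competence levels."""
    measures = np.asarray(measures, float)
    se = np.asarray(standard_errors, float)
    observed_var = measures.var()               # observed variance of the measures
    error_var = np.mean(se ** 2)                # mean-square measurement error
    true_var = max(observed_var - error_var, 0.0)
    reliability = true_var / observed_var       # target: >= .70
    separation = np.sqrt(true_var / error_var)  # "true" SD relative to error RMSE
    strata = (4 * separation + 1) / 3           # target: >= 3.0 for feedback use
    return reliability, separation, strata

# Hypothetical provider measures with a common standard error.
rel, sep, strata = separation_statistics(
    measures=[-1.2, -0.4, 0.1, 0.8, 1.5], standard_errors=[0.35] * 5)
print(round(rel, 2), round(sep, 2), round(strata, 2))
```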
3.4.1. Summary of findings
For measuring MI competence at the level of sessions, reliability was high. Specifically, separation reliability was .89 (target value: ≥.70), strata was 4.1 (target value ≥3.0), and multilevel reliability was .83 (target values: ≥.70). Reliability also was high for measurements at the level of providers. Separation reliability was .91, strata was 4.7, and multilevel reliability was .68. Of the variance in item responses, the percentage attributable to sessions and providers was 27% and 22%, respectively.
3.5. Coder reliability
The study evaluated coder reliability in the MFRM (with facets for items, sessions, providers, and coders). With observational coders, the primary objective is for a given session to be rated in an identical manner by different coders (Linacre, 2021a). In other words, the ratings do not reflect expert judgments; instead, they are intended to reflect a standard categorization of observed behaviors. As such, the primary indicator of coder reliability was the descriptive rate of exact agreement. The threshold for agreement was a value of ≥70.0%. Additionally, the study used coder Infit and Outfit statistics, relative to a value of ≤1.5, to identify coder-specific patterns of inconsistent or unpredictable ratings.
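The sketch below illustrates the exact and adjacent-category agreement computations on double-coded ratings; the two coders' ratings are hypothetical.

```python
import numpy as np

def agreement_rates(coder_a, coder_b):
    """Percentage of items with identical ratings (exact agreement) and with
    ratings no more than one category apart (adjacent agreement)."""
    a, b = np.asarray(coder_a), np.asarray(coder_b)
    exact = float(np.mean(a == b) * 100)
    adjacent = float(np.mean(np.abs(a - b) <= 1) * 100)
    return exact, adjacent

# Hypothetical double-coded MI-CRS ratings for one session (12 items per coder).
exact, adjacent = agreement_rates(
    coder_a=[2, 3, 2, 2, 3, 2, 1, 3, 2, 3, 2, 2],
    coder_b=[2, 3, 3, 2, 2, 2, 1, 3, 2, 4, 2, 2],
)
print(f"exact: {exact:.1f}%, adjacent: {adjacent:.1f}%")
```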
3.5.1. Summary of findings
For sessions rated by more than one coder, the descriptive rate of exact agreement on the 4-point ordered categorical rating scale was 48.3% (target value: ≥70.0%). Extended to agreement in adjacent categories, the rate increased to 94.9%. The study found no evidence of significantly inconsistent or unpredictable ratings across coders, with Infit and Outfit values ranging from 1.0 to 1.3 (target value: ≤1.5).
3.6. Targeting of items to sessions
An important psychometric indicator is the extent to which the set of 12 MI-CRS items can be used to measure MI competence in the LOCI sample of sessions. For instance, if the items are overly “easy”, they would be suitable only for assessing sessions with low levels of competence, or if the items are all “difficult”, they would only be suitable for assessing high levels of competence. Ideally, the distribution of items covers a wide range that is well-aligned with the distribution of sessions, which means the items can be used to measure the full range of competence that is expected to be observed in community-based practice. To evaluate these distributions, the HGLMM provided estimates for providers, sessions, and items. We computed item estimates from item coefficients, and the provider and session estimates were based on empirical Bayes residuals (i.e., Rasch-equivalent “ability” estimates at the respective levels). Figure 2 illustrates the item “difficulty” locations at each rating scale threshold (the lowest category, Poor, is not depicted), along with the MI competence measures for providers and sessions (of note, the session score includes the score for the associated provider).
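To show the idea behind such an item-session map, the sketch below prints a minimal text version of a Wright map from session measures and item difficulties on a shared logit scale; the values and item labels are hypothetical, and the layout only mimics the structure of Figure 2.

```python
import numpy as np

def wright_map(session_measures, item_difficulties, lo=-4.0, hi=4.0, step=1.0):
    """Minimal text item-session map: counts of sessions ('#') and item labels
    placed in one-logit bands on a shared scale, highest band first."""
    edges = np.arange(lo, hi + step, step)
    for top, bottom in zip(edges[::-1][:-1], edges[::-1][1:]):
        n_sessions = int(np.sum((session_measures >= bottom) & (session_measures < top)))
        items = [name for name, d in item_difficulties.items() if bottom <= d < top]
        print(f"{bottom:+.1f} to {top:+.1f} | {'#' * n_sessions:<12} | {', '.join(items)}")

rng = np.random.default_rng(1)
wright_map(
    session_measures=rng.normal(0.0, 1.2, size=30),
    item_difficulties={"summaries": 1.8, "strategic reflections": 1.2,
                       "open questions": 0.1, "empathy": -0.9},
)
```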
3.6.1. Summary of findings
The results are illustrated in Figure 2. Four important attributes exist. First, the distribution of sessions is wide, which means that the LOCI sample evidenced a wide range of MI competence, spanning low to high levels. Second, the distribution of items, as a result of the well-functioning rating scale, spans a wide range of competence and is well-aligned with the sessions. The alignment between items and sessions means that the items were well-suited to measuring the sample of sessions—that there are items (and rating scale categories) capable of assessing each level of competence. As further evidence, the mean item was reasonably well aligned with the mean session (i.e., at a similar vertical level). Third, no significant “gaps” occurred in the distribution of items. A gap would be a segment of the competence continuum that was not assessed by any items, which could indicate the need for additional items. Fourth, the ordering of items, from least-to-most commonly observed, aligned with theoretical expectations. For example, “Uses summaries effectively” (item 9) and “Uses reflections strategically” (item 7) were less commonly observed than components such as “Manages counter change talk and discord” (item 12) and “Cultivates empathy and compassion with client(s)” (item 1). The figure summarizes several important measurement features, all of which indicate that the MI-CRS items were well-suited to measuring the wide range of MI competence observed in the LOCI sample.
Figure 2. Map of Provider, Session, and Item Estimates from a Rasch-Equivalent Hierarchical Generalized Linear Measurement Model.
Note. The figure is based on a mixed-effects formulation of a Rasch-equivalent measurement model and illustrates the MI competence score distributions for providers (left panel, grey) and sessions (left panel, black), as well as the “difficulty” distribution for items at different rating scale thresholds (right panel; Poor category is omitted). On the left side of the vertical dividing line are markers for the distribution of sessions, with locations for the mean, ±1 SD, and ±2 SDs. On the right side, the markers reflect the item distribution at the category threshold for Good (i.e., the transition from a rating of Poor or Fair to a rating of Good). Indicators for evaluating the figure are detailed in the Results.
3.7. Item Invariance
A stringent assumption of IRT-based measurement models is that the difficulty level of each item is invariant across different uses of the instrument—that is, the item performs in the same way across measurement contexts. To test invariance, HGLMMs provided item difficulty estimates (and SEs) for comparing the LOCI sample to two independent samples reported by Naar et al. (2021). The study used two indicators of item invariance. The first was the correlation coefficient for the 12 MI-CRS items from the LOCI and comparison samples from the original instrument development. Strong correlations would provide evidence of consistent item performance across samples. The second, using methods detailed by Kamata (2001) and Bond et al. (2021), was to construct 95% confidence regions for the cross-plot of item estimates from the LOCI and comparison samples. Significant departures were identified as items falling outside of the confidence region. Items within the confidence region provide evidence of stable performance across samples.
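The sketch below shows a simple analogue of this check (not the exact Kamata/Bond et al. construction): after centering each sample's item difficulties, an item is flagged when the between-sample difference exceeds 1.96 times the joint standard error. All difficulties and SEs are hypothetical.

```python
import numpy as np

def invariance_flags(d1, se1, d2, se2, z=1.96):
    """Flag items whose difficulty estimates differ significantly across two
    samples. Difficulties are centered within sample (Rasch difficulties are
    identified only up to a constant) before comparing the differences to
    z times the joint standard error."""
    d1, d2 = np.asarray(d1, float), np.asarray(d2, float)
    se1, se2 = np.asarray(se1, float), np.asarray(se2, float)
    diff = (d1 - d1.mean()) - (d2 - d2.mean())
    joint_se = np.sqrt(se1 ** 2 + se2 ** 2)
    return np.abs(diff) > z * joint_se

# Hypothetical difficulties/SEs for four items; the last item is noticeably
# harder in the second sample and is flagged.
print(invariance_flags(
    d1=[-0.8, -0.2, 0.3, 0.7], se1=[0.10, 0.10, 0.10, 0.12],
    d2=[-0.8, -0.2, 0.3, 1.5], se2=[0.12, 0.11, 0.11, 0.14],
))
```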
3.7.1. Summary of findings
The correlations between the 12 MI-CRS items from the LOCI sample and the two comparison samples were moderately high at .84 and .82. Although multiple items were located outside of the 95% confidence regions for estimates from the LOCI and comparison samples, most were near the boundary or had extreme scores in both samples (with one sample being even more extreme). In one comparison sample, the most extreme departure was for “The counselor demonstrates reflective listening skills” (item 6), which, in the LOCI sample, tended to be endorsed more strongly (i.e., an easier item). In both comparison samples, the most consequential departure was for “The counselor uses summaries effectively” (item 9), which was endorsed less strongly in the LOCI sample (i.e., a more difficult item). Combined, the results provide evidence of a general level of item invariance—that the MI-CRS items perform similarly in completely independent samples. However, with multiple items near the 95% confidence boundaries, there was some indication of differential performance, with community providers demonstrating lower competency than the original development samples on some items.
4. Discussion
Efficient and effective competence measurement is critical to ensure fidelity in clinical trials, to assess implementation outcomes, and to establish sustainable fidelity monitoring and feedback systems in real-world settings. The Motivational Interviewing Coach Rating Scale (MI-CRS) was rigorously developed for MI supervisors/coaches to listen to real or standard patient interactions and immediately rate 12 items in a “one pass” coding system. Implementation science studies can also use the measure as a primary outcome. Our results extend findings of the psychometric properties of the MI-CRS to real-world substance use and integrated care treatment settings.
First, dimensionality results indicated that the MI-CRS continues to measure a single underlying construct of MI competence, in contrast to other conceptions of MI skill as comprising at least two dimensions (Magill et al., 2017). The study found no indication of a need for additional items. The items also measured a wide range of MI competence, at the level of both providers and sessions. Additionally, the distributions of scores for items and sessions were well-aligned.
The original psychometric analysis showed excellent performance of the rating scale, and in this study, the four categories were meaningfully distinct, performed as intended, and showed no evidence of inconsistent interpretation across coders. However, compared to prior studies, the highest competency category was less frequently endorsed for substance use treatment providers, and items appeared to be more difficult than in the original samples. MI may have been more difficult for substance use treatment providers than for providers addressing other health-related behaviors, such as those in the original development sample. A more likely explanation is that in the LOCI study participants did not receive standard follow-up coaching from a trained MI supervisor. Thus, we do not find it surprising that the highest levels of competency were reached less often. Several studies have demonstrated that workshop training alone is not sufficient for MI competency and that follow-up feedback or coaching is necessary (Madson et al., 2009).
Consistent with the original psychometric analysis, item-session maps were indicative of a well-performing instrument, and the variance in competence due to sessions and due to providers was roughly equivalent. With substantial variance at the level of both providers and sessions, scoring and evaluation are possible at both levels, which is compatible with a range of uses in real-world implementation efforts. Item fit was generally good, with only one item, related to reinforcing strengths and positive behavior change, exhibiting borderline misfit. We have revised the instructions for this item’s rating anchors to more specifically reflect the MI skill of affirmations and to indicate that high-quality affirmations are specific (e.g., “It is great that you filled out that job application last week”) rather than generic (e.g., “You are really making progress”): 1 = No affirmations or occasional low-quality affirmations; 2 = Occasional high-quality affirmations or consistent low-quality affirmations; 3 = Counselor more often than not uses high-quality affirmations; 4 = Counselor consistently uses high-quality affirmations.
Reliability was high both for sessions and providers. For inter-rater reliability, absolute agreement was suboptimal in both the current sample and the initial development samples. However, absolute agreement is a stringent target, and agreement within adjacent categories was very high. Not only are ratings made on a four-point scale rather than as a dichotomous rating, but all twelve of the components being rated are applicable in all sessions and may occur dynamically throughout the session. As such, ample opportunity for disagreement exists. The MI-CRS has shown good reliability using more typical measures such as the intraclass correlation (Naar et al., 2022).
A possible limitation of this study was its use of the Rasch measurement model rather than the IRT graded response model; however, our decision was largely guided by practical considerations. For instance, the more lenient sample size requirements of the Rasch model permitted thorough evaluations of the MI-CRS in the LOCI sample. With a sufficiently large sample, IRT models could evaluate whether meaningful differences occurred in item discrimination, and the more complex model would most likely provide better model fit. At the same time, the MI-CRS was developed, and has since been refined, using a Rasch-based framework.
Acknowledging the limitations, the simpler model has benefits for new measurement development and evaluation in the context of real-world implementation efforts. The ultimate consideration is the impact on the resulting scores, and based on the original evaluation (Naar, Chapman, et al., 2021), scores based on the Rasch model and IRT graded response model were nearly perfectly correlated (i.e., ≥ 0.99).
In summary, a 12-item measure of MI competence, designed to be rated in one pass of real or simulated encounters and originally targeting health behaviors, showed excellent psychometric properties in community-based substance use treatment settings. To our knowledge, ours is the first MI competence measure developed with rigorous measurement methodology such as IRT. The MI-CRS requires less training and coding time than the existing recommended measures (Hurlocker et al., 2020) and may be utilized for implementation research outcomes as well as real-world quality monitoring and improvement. The measure has shown good psychometric properties in multiple contexts, with diverse clients, and for a range of target behaviors. Studies show that the MI-CRS is able to detect change in competence following implementation interventions (Naar et al., 2022; Naar, Pennar, et al., 2021). In these studies, a coaching model utilizing the MI-CRS within coaching sessions for immediate evaluation, feedback, and practice following initial workshop training resulted in significant improvements in MI competence in HIV settings. Future research should test MI-CRS-based coaching in substance use settings, compare real provider-patient sessions with standard patient interactions, and evaluate the measure in additional samples to assess the generalizability of findings to other implementation contexts.
Supplementary Material
Highlights.
- Efficient and effective fidelity measurement is critical for research outcomes and for real-world quality management in Motivational Interviewing implementation.
- The Motivational Interviewing Coach Rating Scale was developed with Item Response Theory methods and was evaluated in a sample of 1,089 substance use treatment sessions.
- The 12-item Motivational Interviewing Coach Rating Scale showed excellent performance in a large sample of community-based substance abuse treatment providers using real recordings.
Acknowledgments
Funding for this project came from National Institute on Drug Abuse grants R01DA049891 and R01DA038466 (PI: Aarons).
Footnotes
Authors’ Statement
The instrument is copyrighted by Florida State University and licensed to a non-profit agency, Behavior Change Consulting (behaviorchangeconsulting.org). Funding for this project came from National Institute on Drug Abuse grants R01DA049891 and R01DA038466 (PI: Aarons).
Sylvie Naar: Conceptualization, Original Draft Preparation;
Jason Chapman: Methodology, Software, Writing – draft of methods and results; Greg Aarons: Original Investigation, Data curation, Writing- Reviewing and Editing.
The ordinal distribution confers a “difficulty” interpretation for item parameters. However, to obtain the “competence” interpretation for sessions and providers, the sign was reversed for the corresponding empirical Bayes residual estimates.
Contributor Information
Sylvie Naar, Florida State University.
Jason E. Chapman, Oregon Social Learning Center.
Gregory A. Aarons, University of California, San Diego; UC San Diego ACTRI Dissemination and Implementation Science Center.
References
- Aarons GA, Ehrhart MG, Moullin JC, Torres EM, & Green AE (2017). Testing the leadership and organizational change for implementation (LOCI) intervention in substance abuse treatment: A cluster randomized trial study protocol. Implementation Science, 12(1), 29.
- Aarons GA, Hurlburt M, & Horwitz SM (2011). Advancing a conceptual model of evidence-based practice implementation in public service sectors. Administration and Policy in Mental Health and Mental Health Services Research, 38(1), 4–23.
- Bock RD, & Gibbons RD (2021). Item response theory. John Wiley & Sons.
- Bond TG, Yan Z, & Heene M (2021). Applying the Rasch model: Fundamental measurement in the human sciences (4th ed.). Routledge.
- De Boeck P, & Wilson M (2004). Explanatory item response models: A generalized linear and nonlinear approach (Vol. 10). Springer.
- Eckes T (2015). Introduction to many-facet Rasch measurement: Analyzing and evaluating rater-mediated assessments (2nd ed.). Peter Lang.
- Engelhard G Jr (2013). Invariant measurement: Using Rasch models in the social, behavioral, and health sciences. Routledge.
- Hurlocker MC, Madson MB, & Schumacher JA (2020). Motivational interviewing quality assurance: A systematic review of assessment tools across research contexts. Clinical Psychology Review, 82, 101909.
- Kamata A (2001). Item analysis by the hierarchical generalized linear model. Journal of Educational Measurement, 38, 79–93.
- Linacre JM (1994a). Many-facet Rasch measurement. MESA Press.
- Linacre JM (1994b). Sample size and item calibration stability. Rasch Measurement Transactions, 7(4), 328.
- Linacre JM (2002). Optimizing rating scale category effectiveness. Journal of Applied Measurement, 3(1), 85–106.
- Linacre JM (2021a). Facets Rasch measurement computer program (Version 3.83.6). Winsteps.com.
- Linacre JM (2021b). A user’s guide to Winsteps and Ministep: Rasch-model computer programs (Program Manual 4.8).
- Madson MB, Loignon AC, & Lane C (2009). Training in motivational interviewing: A systematic review. Journal of Substance Abuse Treatment, 36, 101–109. https://doi.org/10.1016/j.jsat.2008.05.005
- Magill M, Apodaca TR, Borsari B, Gaume J, Hoadley A, Gordon REF, … Moyers T (2017). A meta-analysis of motivational interviewing process: Technical, relational, and conditional process models of change. Journal of Consulting and Clinical Psychology, 86. https://doi.org/10.1037/ccp0000250
- Martino S, Ball SA, Nich C, Frankforter TL, & Carroll KM (2008). Community program therapist adherence and competence in motivational enhancement therapy. Drug and Alcohol Dependence, 96(1–2), 37–48.
- Moyers TB, Rowell LN, Manuel JK, Ernst D, & Houck JM (2016). The motivational interviewing treatment integrity code (MITI 4): Rationale, preliminary reliability and validity. Journal of Substance Abuse Treatment, 65, 36–42.
- Naar S, Chapman JE, Cunningham PB, Ellis DA, Todd L, & MacDonell K (2021). Development of the Motivational Interviewing Coach Rating Scale (MI-CRS) for health equity implementation contexts. Health Psychology.
- Naar S, MacDonell K, Chapman JE, Todd L, Wang Y, Sheffler J, & Fernandez MI (2022). Tailored motivational interviewing in adolescent HIV clinics: Primary outcome analysis of a stepped-wedge implementation intervention trial. JAIDS Journal of Acquired Immune Deficiency Syndromes, 90(1), S74–S83.
- Naar S, Pennar A, Wang B, Brogan Hartlieb K, & Fortenberry D (2021). Tailored Motivational Interviewing (TMI): Translating basic science in skills acquisition in a behavioral intervention to improve community health worker motivational interviewing competence for youth living with HIV. Health Psychology.
- Pas ET, & Bradshaw CP (2021). Introduction to the special issue on optimizing the implementation and effectiveness of preventive interventions through motivational interviewing. Prevention Science, 22(6), 683–688.
- Raudenbush SW, Bryk AS, & Congdon R (2020). HLM 8: Hierarchical linear and nonlinear modeling for Windows (computer software). Scientific Software International, Inc.
- Raudenbush SW, Johnson C, & Sampson RJ (2003). A multivariate, multilevel Rasch model with application to self-reported criminal behavior. Sociological Methodology, 33(1), 169–211.
- Schoenwald SK, Garland AF, Chapman JE, Frazier SL, Sheidow AJ, & Southam-Gerow MA (2011). Toward the effective and efficient measurement of implementation fidelity. Administration and Policy in Mental Health and Mental Health Services Research, 38(1), 32–43.
- Schumacker RE, & Smith EV Jr (2007). Reliability: A Rasch perspective. Educational and Psychological Measurement, 67, 394–409.
- Substance Abuse and Mental Health Services Administration. (2019). Enhancing motivation for change in substance use disorder treatment (SAMHSA Publication No. PEP19-02-01-003).