Author manuscript; available in PMC: 2025 May 1.
Published in final edited form as: Behav Ther. 2023 Sep 26;55(3):605–620. doi: 10.1016/j.beth.2023.09.002

Using Adherence and Competence Measures Based on Practice Elements to Evaluate Treatment Fidelity for Two CBT Programs for Youth Anxiety

Stephanie Violante 1, Bryce D McLeod 2, Michael A Southam-Gerow 3, Bruce F Chorpita 4, John R Weisz 5
PMCID: PMC11055981  NIHMSID: NIHMS1935392  PMID: 38670672

Abstract

Measures designed to assess the quantity and quality of practices found across treatment programs for specific youth emotional or behavioral problems may be a good fit for evaluating treatment fidelity in effectiveness and implementation research. Treatment fidelity measures must demonstrate certain reliability and validity characteristics to realize this potential. This study examines the extent to which two observational measures, the Cognitive-Behavioral Treatment for Anxiety in Youth Adherence Scale (CBAY-A) and the CBAY-Competence Scale (CBAY-C), can assess the quantity (the degree to which prescribed therapeutic techniques are delivered as intended) or quality (the competence with which prescribed techniques are delivered) of practices found in two distinct treatment programs for youth anxiety. Treatment sessions (N = 796) from 55 youth participants (M age = 9.89 years, SD = 1.71; 46.0% female; 55.0% White) with primary anxiety problems who participated in an effectiveness study were independently coded by raters for quantity, quality, and the youth-clinician alliance. Youth received one of three treatments: (a) Standard (i.e., a cognitive-behavioral therapy program); (b) Modular (i.e., a cognitive-behavioral and parent-training program); and (c) Usual Clinical Care. Interrater reliability for the CBAY-A items was good across the Standard and Modular conditions but mixed for the CBAY-C items. Across the Standard and Modular conditions, the CBAY-A Model subscale scores demonstrated evidence of construct validity, but the CBAY-C Model subscale scores showed mixed evidence. The results provide preliminary evidence that the CBAY-A can be used across different treatment programs but raise concerns about the generalizability of the CBAY-C.

Keywords: treatment fidelity, CBT, youth anxiety


Treatment fidelity is commonly assessed in psychosocial treatment evaluation and implementation research (Sutherland et al., 2022). Most observational treatment fidelity measures are designed to evaluate the quantity (i.e., the degree to which prescribed therapeutic techniques are delivered as intended) or quality (i.e., the competence with which the prescribed techniques are delivered) of the techniques from a specific treatment protocol (called protocol fidelity; Cox et al., 2019; Regan et al., 2019). This measurement approach is a good fit for research questions related to intervention development and evaluation (e.g., a manipulation check). However, measures that can assess the quantity and quality of techniques from a type of treatment (e.g., cognitive-behavioral therapy [CBT]) for a specific problem area (e.g., anxiety; e.g., Southam-Gerow et al., 2016) may have valuable applications for assessing treatment fidelity in some effectiveness and implementation research aimed at improving mental health services in community-based mental health centers (Sutherland & McLeod, 2022).

As treatments move down the translational pipeline, the goals of fidelity measurement may shift (Sutherland et al., 2022). Early in the pipeline, fidelity measurement is used to operationalize the core techniques in a treatment protocol hypothesized to engage the mechanisms of change. This level of specification allows researchers to use a measure to assess protocol fidelity and utilize the resulting data to refine the treatment. As a treatment moves further down the pipeline, questions shift toward effectiveness and implementation, and the goal of the research becomes improving mental health services for youth. For some research questions at these later stages, there may be some benefits to using measures that can assess fidelity across multiple protocols (McLeod et al., 2013; Sutherland et al., 2022). First, the approach is more flexible because it is not protocol-specific (McLeod et al., 2022). When usual care is used as a comparison condition, such measures can assess protocol fidelity and differentiation (e.g., whether elements of the comparison condition are found in usual care; McLeod et al., 2013). Second, when multiple protocols from the same treatment approach are studied, using a single measure to assess protocol fidelity offers more efficiency than separate protocol-based measures (Schoenwald et al., 2011).

The Cognitive-Behavioral Therapy Adherence Scale for Youth Anxiety (CBAY-A; Southam-Gerow et al., 2016) and the Cognitive-Behavioral Therapy Competence Scale for Youth Anxiety (CBAY-C; McLeod et al., 2018) are observational fidelity measures with a parallel structure. The two are designed to assess the quantity and quality of practice elements (i.e., discrete therapeutic techniques used in treatments for a specific problem area; Chorpita et al., 2007) found across multiple CBT protocols for youth anxiety rather than specific treatment techniques found in a single protocol. Items are rated on 7-point extensiveness and competence scales that estimate the quantity or quality of each practice element (McLeod et al., 2013). Previous research supports the content validity of the CBAY-A and CBAY-C. Data on the initial interrater reliability, construct validity, and discriminant validity of the item, subscale, and scale scores are also promising (McLeod et al., 2018; Southam-Gerow et al., 2016). However, the measures’ score reliability and validity data have only been reported for a single treatment protocol, the Coping Cat program (McLeod et al., 2018; Southam-Gerow et al., 2016), limiting the generalizability of the findings.

The current study aimed to evaluate whether scores on the CBAY-A and CBAY-C can assess protocol fidelity across multiple CBT programs. Though the CBAY-A and CBAY-C hold potential benefits for effectiveness and implementation research, psychometric data are needed to support their expanded use with multiple protocols. First, interrater reliability data are necessary for the two measures when used with a treatment protocol for youth anxiety that is not Coping Cat (McLeod et al., 2022). Further, evidence of construct validity is needed, including (a) strong relations with scores on fidelity measures designed for a specific treatment protocol, (b) moderate relations between quantity (e.g., adherence) and quality (e.g., competence) measures (Carroll et al., 2000; Hogue et al., 2008), and (c) low relations with other therapy process measures, such as the client-clinician alliance (McLeod et al., 2022). Finally, scores on the CBAY-A and CBAY-C items should distinguish between a treatment protocol and usual clinical care (i.e., discriminative validity; Weisz et al., 2012). Together, these features would help establish that scores on the CBAY-A and CBAY-C assess protocol fidelity (i.e., they can detect the quantity and quality of treatment techniques from a specific treatment protocol; Regan et al., 2019) and thus support using the measures for manipulation checks across multiple treatment protocols for youth anxiety.

In short, we used the CBAY-A and CBAY-C to code treatment sessions from a randomized controlled effectiveness trial for youths with primary anxiety problems. The youth and clinicians who participated in the trial were assigned to one of three conditions: (a) a standard manual condition that delivered the Coping Cat program (Kendall & Hedtke, 2006), (b) a modular condition that delivered the Modular Approach to Therapy with Children (MATCH; Chorpita & Weisz, 2005), or (c) usual care. The Coping Cat program (Kendall & Hedtke, 2006) is a youth-focused CBT program that targets anxiety disorders. The MATCH program comprises techniques from CBT and parent management training that target anxiety, depression, and disruptive behavior problems. The youth in the present study presented with primary anxiety problems, so the clinicians who delivered MATCH relied on CBT techniques designed to target anxiety. However, other techniques from CBT and parent management approaches could be utilized if needed to address depressive or disruptive behavior problems.

Method

Data Sources and Participants

Treatment data were collected from 55 youth aged 7 to 15 (M age = 9.89 years, SD = 1.71; 46.0% female, 54.0% male; 54.6% White, 5.5% Black, 1.8% Asian American, 3.6% Latinx, 30.9% multiracial, 3.6% other race) who participated in a randomized controlled trial called the Child STEPS Multisite Trial (Weisz et al., 2012). The Child STEPS Multisite Trial included a total of 174 youth participants. To be included in the Child STEPS Multisite Trial, the youth had to meet criteria for a DSM-IV-TR disorder based on the Children’s Interview for Psychiatric Syndromes (CHIPS; Weller et al., 1999a,b) or have clinically elevated problems (a T-score above 65) on the Child Behavior Checklist (CBCL) or Youth Self Report (YSR; Achenbach & Rescorla, 2000) in one or more problem areas: anxiety, depression, or disruptive behavior. A youth assigned to a participating clinician as part of the usual referral process was screened and invited to consent if found eligible for the study. The trial employed a cluster randomization design in which clinicians were randomly assigned to one of three conditions: standard manualized treatment (Standard), modular manualized treatment (Modular), or usual care. Randomization was stratified by clinician education level (master’s vs. doctoral degree).

Youth from the Child STEPS Multisite Trial were included in the current study if they met three criteria. First, the youth had to have a CHIPS diagnosis of specific phobia, separation anxiety disorder, generalized anxiety disorder, social phobia, obsessive-compulsive disorder, posttraumatic stress disorder, or panic disorder without agoraphobia, or a T-score > 65 on the CBCL or YSR (see Weisz et al., 2012). Second, the youth had to have at least two audibly recorded treatment sessions. Third, the youth had to receive treatment from a single clinician. Thirteen youths with primary anxiety problems and six clinicians from the original sample did not meet these criteria. Recorded treatment sessions collected as part of the Child STEPS Multisite Trial served as the data for the current investigation. The IRB from each institution approved this research. Parents provided written informed consent, and youths provided written or oral assent.

Fifty-five youths with primary anxiety problems met the inclusion criteria for the study. Twenty-two youths were allocated to the Standard condition (M age = 9.77 years, SD = 1.51; 50.0% female, 50.0% male; 72.7% White, 4.5% Asian American, 4.5% Latinx, 18.2% multiracial), 16 youths to the Modular condition (M age = 9.94 years, SD = 1.88; 43.7% female, 56.3% male; 43.8% White, 12.5% Black, 37.5% multiracial, 6.3% other race), and 17 youths to the Usual Care condition (M age = 10.00 years, SD = 1.87; 41.2% female, 58.8% male; 41.2% White, 5.9% Black, 5.9% Latinx, 41.2% multiracial, 5.9% other race). Weisz et al. (2012) found that youth in the Modular condition showed significant improvement on several clinical measures compared to the youth in the Standard and Usual Care conditions. See Table 1 for descriptive information.

Table 1.

Youth Descriptive Data and Comparisons

Variable M (SD) or %
F or χ2 p
Standard
(n = 22)
Modular
(n = 16)
Usual Care
(n = 17)
Age 9.77 (1.51) 9.94 (1.88) 10.00 (1.87) 0.91 .91
Sex 0.33 .85
 Female 50.0% 43.7% 41.2%
 Male 50.0% 56.3% 58.8%
Race/Ethnicity 10.56 .39
 White 72.7% 43.8% 41.2%
 Black - 12.5% 5.9%
 Asian American 4.5% - -
 Latinx 4.5% - 5.9%
 Multiracial 18.2% 37.5% 41.2%
 Other - 6.3% 5.9%
CBCL T-scores
 Total (pre) 65.27 (7.49) 63.63 (10.39) 66.35 (5.23) 0.50 .61
 Internalizing (pre) 70.00 (6.72) 69.56 (9.33) 68.82 (5.68) 0.13 .88
 Externalizing (pre) 59.00 (11.28) 55.06 (11.64) 60.18 (8.76) 1.04 .36
Family Income 3.77 .15
 Up to 60k per year 54.5% 31.3% 70.6%
Number of Sessions 21.91 (11.17) 20.69 (6.15) 20.87 (11.95) 0.08 .92
Weeks in Treatment 32.05 (13.50) 30.38 (7.71) 38.06 (30.43) 0.73 .49
Session Length 40.49 (13.53) 43.34 (12.27) 40.65 (10.70) 4.22 .02
Number of Coded Sessions 16.32 (8.71) 15.25 (5.98) 11.35 (7.79) 2.10 .13

Note. Standard = Standard condition; Modular = Modular condition; Usual Care = Usual Care condition; CBCL = Child Behavior Checklist.

A total of 39 clinicians met the inclusion criteria for the current study. Clinicians volunteered to participate in the Child STEPS Multisite Trial and were randomly assigned to condition. The Standard condition consisted of 16 clinicians (M age = 43.56 years, SD = 9.96; 81.2% female, 18.8% male; 50.0% White, 18.8% Black, 12.5% Asian American, 6.3% multiracial, 12.5% not reported). The clinicians averaged 7.17 years (SD = 7.75) of clinical experience and included master’s-level social workers (37.5%), master’s level psychologists (31.3%), doctoral-level psychologists (6.3%), and degrees classified as “other” (25.0%; e.g., marriage and family therapist). The Modular condition consisted of 10 clinicians (M age = 35.20 years, SD = 6.81; 80.0% female, 20.0% male; 40.0% White, 10.0% Black, 40.0% Asian American, 10.0% other race). These clinicians averaged 5.25 (SD = 4.83) years of clinical experience and included master’s-level social workers (50.0%), master’s level psychologists (20.0%), doctoral-level psychologists (10.0%), and degrees classified as “other” (20.0%). There were 13 clinicians in the Usual Care condition (M age = 40.00 years; SD = 9.18; 76.9% female, 23.1% male; 61.5% White). These clinicians averaged 4.69 years (SD = 5.34) of clinical experience and included master’s-level social workers (38.5%), doctoral-level psychologists (30.8%), master’s-level psychologists (23.1%), and degrees classified as “other” (7.7%). See Table 2 for descriptive information.

Table 2.

Clinician Descriptive Data and Comparisons

Variable M (SD) or %
F or χ2 p
Standard
(n = 16)
Modular
(n = 10)
Usual Care
(n = 13)
Age 43.56 (9.96) 35.20 (6.81) 40.00 (9.18) 2.66 .08
Sex 0.09 .96
 Female 81.2% 80.0% 76.9%
 Male 18.8% 20.0% 23.1%
Race/Ethnicity 8.99 .34
 White 50.0% 40.0% 61.5%
 Black 18.8% 10.0% -
 Asian American 12.5% 40.0% 23.1%
 Multiracial 6.3% - -
 Other - 10.0% -
 Not reported 12.5% - -
Degree Type 4.90 .56
 MSW 37.5% 50.0% 38.5%
 MA Psych 31.3% 20.0% 23.1%
 PsyD/PhD 6.3% 10.0% 30.8%
 Other 25.0% 20.0% 7.7%
License 62.5% 60.0% 23.1% 5.70 .22
Years of Experience 7.17 (7.75) 5.25 (4.83) 4.69 (5.34) 0.57 .57
Theoretical Orientation 3.22 .92
 CBT 31.3% 40.0% 30.8%
 Psychodynamic 25.0% 20.0% 15.4%
 Family Systems 6.3% - 7.7%
 Eclectic 31.3% 40.0% 30.8%
 Other 6.3% - 15.4%

Note. Standard = Standard condition; Modular = Modular condition; Usual Care = Usual Care treatment condition; MSW = Master of Social Work; PsyD = Doctor of Psychology; PhD = Doctor of Philosophy; CBT = cognitive behavioral therapy.

Treatment Conditions

Standard

The Standard condition delivered the Coping Cat program, a child-focused CBT program for youth diagnosed with anxiety disorders (Kendall & Hedtke, 2006). The program consists of 16–20 sessions designed to address anxiety through skill-building (e.g., cognitive restructuring, relaxation, problem-solving), graduated exposure to feared objects or situations, and continued practice of skills both in (e.g., role plays) and out (i.e., homework assignments) of a session. Coping Cat sessions are delivered in a predetermined sequence. If the youth had a secondary depressive or disruptive behavior disorder, a treatment program for depression (PASCET; Weisz et al., 1999) or disruptive behavior (Defiant Children; Barkley, 2013) was delivered once the Coping Cat program was completed.

Modular

The Modular condition delivered the MATCH program (Chorpita & Weisz, 2005), which consists of modules that address anxiety, depression, and disruptive behavior problems. Modules are made up of CBT and parent-training techniques corresponding to those found in (a) Coping Cat (Kendall & Hedtke, 2006), (b) PASCET, a CBT program for depression (Weisz et al., 1999), and (c) Defiant Children, a behavioral parent training program for conduct problems (Barkley, 2013). For each problem area, clinical flowcharts detail a default order of modules. Scores on the youth and caregiver baseline measures determined the primary problem areas and the corresponding flowchart. As youths in this study presented with primary anxiety problems, all clinicians used the flowchart associated with anxiety. However, if an emergent event or comorbid condition arose during treatment, clinicians could reference the flowcharts to incorporate modules designed to address those conditions and then return to a focus on anxiety. The treatment protocol also allowed a clinician to shift away from anxiety if evidence suggested another problem (e.g., disruptive behavior) warranted primary clinical consideration.

Usual Care

The clinicians randomly allocated to the Usual Care condition were instructed to continue their typical treatment approach, including the rate and nature of supervision (see Weisz et al., 2012). The clinicians in Usual Care reported the following theoretical orientations: 30.8% CBT, 30.8% eclectic, 15.4% psychodynamic, 15.4% “other,” and 7.7% family systems.

Therapist Training and Consultation

The same training and consultation procedures were offered to clinicians in the Standard and Modular conditions, which included a treatment protocol, training workshop, and weekly consultation with an expert. A six-day training workshop was provided to each clinician, with two days devoted to each problem area (anxiety, depression, disruptive behavior). Project consultants with a Ph.D. in clinical psychology and expertise in CBT were trained in MATCH components by experts in each treatment protocol. Each week, the project consultants engaged clinicians in a review of measurement feedback on client progress and the practices delivered (Chorpita et al., 2008). A protocol adherence check indicated that 92.8% of the session content in the Standard condition was protocol specific (i.e., included techniques from the treatment protocol) and 7.2% was not; 82.9% of the content in the Modular condition was protocol specific, and 17.1% of the content was not. In Usual Care, 91.4% of content was not in the Standard or Modular protocols, indicating that most techniques delivered in Usual Care were distinct from those delivered in the Standard and Modular conditions. See Weisz et al. (2012) for details.

Measures

Cognitive-Behavioral Treatment for Anxiety in Youth Adherence Scale (CBAY-A; Southam-Gerow et al., 2016)

The CBAY-A is a 22-item observational measure that assesses the quantity of CBT for youth anxiety in three areas: (a) 4 Standard items that represent techniques found in most CBT sessions (e.g., Agenda Setting), (b) 12 Model items that assess model-specific content (e.g., Relaxation, Cognitive Anxiety, Problem Solving, Exposure), and (c) 6 Delivery items that assess how model items are delivered (e.g., Modeling, Rehearsal). Coders watch entire sessions and rate each item on a 7-point extensiveness scale: 1 = not at all, 3 = somewhat, 5 = considerably, 7 = extensively. CBAY-A item and scale scores have demonstrated evidence of interrater reliability and construct validity (Southam-Gerow et al., 2016; McLeod et al., 2018). Because the current study focused on practice elements of CBT for youth anxiety, only Model items were used. A single CBAY-A Model subscale used in previous research with the Coping Cat program (see Southam-Gerow et al., 2016; McLeod et al., 2018) was created to represent adherence to the Standard and Modular protocols. The Model subscale included 11 items: “Psychoeducation,” “Emotion Education,” “Fear Ladder,” “Relaxation,” “Cognitive,” “Problem Solving,” “Self-Reward,” “Coping Plan,” “Exposure: Prep,” “Exposure,” and “Exposure: Debrief.” One model item, “Maintenance,” was not included because it is not part of the Coping Cat or MATCH protocols (see Southam-Gerow et al., 2016). The Model subscale scores were created by averaging item scores across coders and then averaging items on the subscale.
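The subscale-scoring rule described above (average each item across coders, then average the item means) can be sketched as follows. This is an illustrative Python sketch, not the study's analysis code, and the example ratings are hypothetical.

```python
def model_subscale(coder_a, coder_b):
    """Average each of the 11 Model item ratings (1-7 extensiveness scale)
    across the two coders, then average the item means into one
    subscale score for the session."""
    item_means = [(a + b) / 2 for a, b in zip(coder_a, coder_b)]
    return sum(item_means) / len(item_means)

# Hypothetical ratings of one session by two coders on the 11 Model items
score = model_subscale([3, 1, 1, 5, 1, 1, 1, 1, 4, 6, 3],
                       [3, 1, 1, 4, 1, 1, 1, 1, 4, 6, 4])
```

Averaging across coders before averaging across items reduces rater-specific measurement error in the session-level score.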

Cognitive-Behavioral Treatment for Anxiety in Youth Competence Scale-Revised (CBAY-C; McLeod et al., 2018).

The CBAY-C is an observational measure designed to capture the quality (i.e., skill and responsiveness) of clinician delivery of CBT for youth anxiety. The CBAY-C parallels the content found in the CBAY-A and consists of 4 standard items (e.g., Homework Assignment), 12 model items (e.g., Emotion Education), and 7 delivery items (e.g., Modeling). Coders watch full sessions and rate each item on a 7-point Likert-type competence scale: 1 = very poor, 3 = acceptable, 5 = good, 7 = excellent. Scores on a previous version of the CBAY-C have shown evidence of item-level interrater reliability (ICCs ranged from .37 to .80; M = .67, SD = .11) and construct validity (McLeod et al., 2018). One Model subscale score was created with the 11 items used to create the CBAY-A Model subscale that consisted of “Psychoeducation,” “Emotion Education,” “Fear Ladder,” “Relaxation,” “Cognitive,” “Problem Solving,” “Self-Reward,” “Coping Plan,” “Exposure: Prep,” “Exposure,” and “Exposure: Debrief.” The CBAY-C Model subscale was created by averaging scores across coders on each item and then averaging all items on the subscale. Item scores of 0 were considered missing, and only items scored by both coders were included in the subscale scores.

MATCH and Standard Consultation Records (Ward et al., 2013).

The MATCH and Standard Consultation Records represent two consultant-rated measures designed to capture the techniques delivered by clinicians in the Standard and Modular conditions (Ward et al., 2013). The items on the Standard Consultation Record consist of techniques from the Coping Cat protocol (e.g., “FEAR Plan”, “Problem Solving”). In contrast, the MATCH Consultation Record items consist of MATCH protocol techniques (e.g., “Cognitive STOP”). The consultation records were filled out for each session during weekly consultation meetings by project consultants collaborating with clinicians. Scores on the Standard Consultation Record and MATCH Consultation Record have demonstrated good interrater reliability when rated by independent observers (ICCs ranged from .50 to 1.00; M = .80) and evidence of convergent validity relative to consultation records scored by independent observers (Ward et al., 2013). For each item, the consultant could select one of three options: (a) “no selection,” (b) “covered-part” (partial coverage of the session content), or (c) “covered-full” (full coverage of the session content). Each Consultation Record item score was recoded for this study on a 3-point scale. If “no selection” was selected, the item was assigned a value of “0”; if “covered-part” was chosen, the item was assigned a value of “1”; and if “covered-full” was selected, the item was assigned a value of “2”. Subscale scores for each session were created by averaging all items. If an item was missing for a given session, the average of the remaining items comprised the subscale score.
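The recoding and averaging rules for the Consultation Records can be summarized in a short sketch (illustrative Python; the item names and responses below are hypothetical stand-ins, not verbatim Consultation Record content):

```python
# Recode the consultant's selection onto the 3-point scale described above
RECODE = {"no selection": 0, "covered-part": 1, "covered-full": 2}

def consultation_subscale(responses):
    """responses: item name -> response string, or None for a missing item.
    Missing items are dropped and the remaining recoded items averaged."""
    values = [RECODE[r] for r in responses.values() if r is not None]
    return sum(values) / len(values)

# Hypothetical session: the missing third item is excluded from the mean
session = {"FEAR Plan": "covered-full",
           "Problem Solving": "covered-part",
           "Exposure": None}
score = consultation_subscale(session)
```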

Therapy Process Observational Coding System for Child Psychotherapy-Revised Strategies Scale (TPOCS-RS; McLeod et al., 2015).

The TPOCS-RS is a 47-item observational measure designed to assess clinician delivery of therapeutic techniques. There are five theory-based subscales: Cognitive (e.g., Cognitive Education), Behavioral (e.g., Relaxation), Psychodynamic (e.g., Explores Past), Family (e.g., Parenting Skills), and Client-Centered (e.g., Validates Client). There are 17 additional items representing techniques that play a meaningful role in treatment but are not associated with a specific theory-based subscale (e.g., homework, play therapy). Coders rate the extent to which the clinician engages in each item during an entire session using a 7-point Likert-type extensiveness scale ranging from 1 = not at all to 7 = extensively. Item and subscale scores have demonstrated reliability and validity in past studies (e.g., McLeod & Weisz, 2010; McLeod et al., 2015, McLeod et al., 2022). The TPOCS-RS Anxiety subscale was used to estimate protocol adherence in both conditions because the techniques used to address primary anxiety problems were the same in the Standard and Modular conditions. This subscale has been used in previous research focused on the Coping Cat and MATCH programs (see McLeod et al., 2015; McLeod et al., 2022; Smith et al., 2017; Southam-Gerow et al., 2010). The Anxiety subscale comprised six items: “Relaxation,” “Cognitive Education,” “Cognitive Distortion,” “Coping Skills,” “Operant-Child,” and “Respondent.” Items on the subscale were averaged to produce the Anxiety subscale scores. We also created condition-specific subscales to facilitate the evaluation of discriminant validity (see McLeod et al., 2022). 
First, we generated the TPOCS-RS Non-Standard subscale, composed of the child-focused CBT techniques not used in the Anxiety subscale: “Functional Analysis,” “Skill Building,” and “Behavioral Activation.” Second, for the Modular condition, we created a TPOCS-RS Non-Modular subscale composed of the “Functional Analysis” and “Skill Building” items; “Behavioral Activation” was excluded because it is part of the MATCH techniques that address depression, and our goal was a subscale of techniques not found in the MATCH protocol. Scores on each TPOCS-RS subscale were generated by computing a mean score for each item across coders and then averaging the item scores on each subscale. In the Standard condition, interrater reliability (ICC(2,2)) was .84 for the TPOCS-RS Anxiety subscale and .56 for the Non-Standard subscale. In the Modular condition, interrater reliability was .80 for the Anxiety subscale and .58 for the Non-Modular subscale.

Therapy Process Observational Coding System-Alliance Scale (TPOCS-A; McLeod & Weisz, 2005).

The TPOCS-A is a 9-item observational measure of the youth-clinician alliance in youth treatment. The TPOCS-A is designed to objectively capture two commonly emphasized dimensions of alliance: bond (i.e., affective aspects of the youth-clinician relationship) and task (i.e., client participation in the treatment activities). Items are rated on a 6-point Likert-type scale ranging from 0 = not at all to 5 = a great deal. Scores on the TPOCS-A have displayed evidence of interrater reliability (ICCs ranged from .40 to .75; M = .59, SD = .10), internal consistency (α = .95), convergent validity with a self-report alliance measure, and predictive validity with youth clinical outcomes (Liber et al., 2010; McLeod et al., 2021; McLeod & Weisz, 2005). The TPOCS-A scale score was created by averaging the nine items. In this study, internal consistency in the Standard, Modular, and Usual Care conditions was α = .89, .85, and .92, respectively. Interrater reliability (ICC(2,2)) was .86, .82, and .88, respectively.
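For reference, the internal consistency coefficients reported above follow the standard Cronbach's alpha formula, computed from item variances and the variance of the total score. The sketch below is a generic implementation under that standard formula, not the study's analysis code.

```python
def cronbach_alpha(items):
    """items: list of k lists, one per scale item, each holding that
    item's scores across sessions (aligned by session).
    alpha = k/(k-1) * (1 - sum(item variances) / variance(total score))."""
    k = len(items)

    def var(xs):  # sample variance
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

    totals = [sum(col) for col in zip(*items)]
    return k / (k - 1) * (1 - sum(var(i) for i in items) / var(totals))
```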

Observational Coding Procedures

Three doctoral students were part of a team that coded the CBAY-A and TPOCS-A (M age = 26.00 years, SD = 2.00; 100% female; 66.7% White, 33.3% Asian American), two comprised the CBAY-C team (M age = 26.50 years, SD = 2.12; 100.0% female; 50.0% White, 50.0% Latina), and two formed the TPOCS-RS team (M age = 27.00 years, SD = 1.41; 100.0% female; 50.0% White, 50.0% Latina). The training progressed through the same steps for each measure. Training took approximately six weeks and involved weekly meetings and regular independent coding (i.e., about 12 hours per week). First, coders received didactic instruction and discussion of the scoring manuals, reviewed sessions with the trainers, and engaged in exercises designed to expand their understanding of each item. Second, coders engaged in coding, and results were discussed in weekly meetings. Lastly, each coder independently coded 40 recordings, and reliability was assessed against master codes. To be certified for independent coding, each coder had to demonstrate “good” reliability on each item (ICC > .59; Cicchetti, 1994). Following certification, regular meetings were held to help prevent coder drift. Sessions were randomly assigned to coders who were naïve to study hypotheses. Each session was double-coded, and the mean score was used in analyses to reduce measurement error. A total of 876 sessions were held. For this study, all sessions were coded except those in which (a) the session was the first or last session (as these sessions may contain intake or termination content), (b) the audible content was shorter than 15 minutes, or (c) less than 75% of the dialogue was in English. A total of 796 sessions (90.9% of the full sample) met these criteria and were coded (94.7% Standard, n = 359; 93.8% Modular, n = 244; 81.4% Usual Care, n = 193).

Clinical Measures Collected in Original RCT

In the current investigation, T-scores on three Child Behavior Checklist (CBCL; Achenbach & Rescorla, 2000) scales were used for descriptive purposes and group comparisons: Total, Internalizing (broadband), Externalizing (broadband).

Results

Preliminary Analyses

Sample bias analyses were conducted to ascertain whether the 55 youths and 39 clinicians included in the study differed from the remaining participants in the anxiety subsample of the Child STEPS Multisite Trial (see Weisz et al., 2012). The 13 youths with primary anxiety problems and six clinicians who were not included in the current study did not differ from those in this sample. Next, we compared the youths and clinicians in the current sample who were in the Standard, Modular, and Usual Care conditions. One significant difference was found: sessions in the Modular condition (M = 43.34 minutes, SD = 12.27) were significantly longer than the sessions in the Standard (M = 40.49 minutes, SD = 13.54; t = 2.63, p = .009) and Usual Care conditions (M = 40.65 minutes, SD = 10.70; t = 2.40, p = .02). We also examined whether the proportion of treatment sessions coded across the three conditions differed. No significant difference was found, F(1, 3) = .22, p = .80. See Tables 1 and 2 for condition comparisons.

Interrater Reliability and Construct Validity

The treatment data used for this study are nested with repeated measurements within youth, youth within clinicians, and clinicians within sites. The interrater reliability and construct validity analyses do not rely on analytic models that may be influenced by nesting (e.g., fixed effect GLM; Zucker, 1990), so the nested structure of the data does not inflate Type I error. We thus followed standard practice for these analyses within the treatment fidelity field and conducted the analyses at the session level (e.g., Barber et al., 2006; Carroll et al., 2000; Hogue et al., 2008).

Interrater Reliability

The interrater reliability for the CBAY-A and CBAY-C Model items was calculated using the intraclass correlation coefficient (ICC). The ICC(2,2) model based on a two-way random-effects model was used as it provides a reliability estimate of the average score of all coders and allows for the generalizability of the findings to other samples (Shrout & Fleiss, 1979). Based on recommendations by Cicchetti (1994), ICCs below .40 were considered poor, between .40 and .59 were considered fair, between .60 and .74 were considered good, and .75 and above were considered excellent. Based on the total sample, ICCs for the CBAY-A Model items ranged from .77 to .93 (M ICC = .85, SD = .05), with all items in the excellent range. ICCs for the CBAY-C Model items ranged from .59 to .86 (M ICC = .76, SD = .09), with all items considered at least fair, and most items in the excellent range. These findings indicate that the CBAY-A and CBAY-C items displayed fair to excellent interrater reliability.
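The ICC(2,k) statistic used here follows the two-way random-effects, average-measures formula of Shrout and Fleiss (1979): ICC(2,k) = (BMS − EMS) / (BMS + (JMS − EMS)/n), where BMS, JMS, and EMS are the between-sessions, between-coders, and residual mean squares and n is the number of sessions. As a generic illustration of that formula (not the study's analysis code):

```python
def icc_2k(ratings):
    """ICC(2,k): two-way random-effects model, average of k coders
    (Shrout & Fleiss, 1979). ratings: one [coder1, ..., coderk] list
    of scores per session."""
    n = len(ratings)        # targets (sessions)
    k = len(ratings[0])     # raters (coders)
    grand = sum(sum(r) for r in ratings) / (n * k)
    row_means = [sum(r) / k for r in ratings]
    col_means = [sum(r[j] for r in ratings) / n for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for r in ratings for x in r)
    ss_err = ss_total - ss_rows - ss_cols
    bms = ss_rows / (n - 1)                 # between-sessions mean square
    jms = ss_cols / (k - 1)                 # between-coders mean square
    ems = ss_err / ((n - 1) * (k - 1))      # residual mean square
    return (bms - ems) / (bms + (jms - ems) / n)
```

Note that a constant offset between coders (a coder bias) lowers ICC(2,k) only slightly, because the between-coders variance enters the denominator divided by n.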

We next evaluated whether the interrater reliability of the CBAY-A and CBAY-C items was comparable across the Standard and Modular conditions. Table 3 indicates that interrater reliability for the CBAY-A and CBAY-C items differed across the conditions. In the Standard condition, ICCs for the CBAY-A Model items ranged from .69 to .94 (M ICC = .84, SD = .07), with all items in the good to excellent range. ICCs for the CBAY-C Model items ranged from .55 to .84 (M ICC = .71, SD = .08), with a single item in the fair range (“Relaxation”). In the Modular condition, ICCs for the CBAY-A Model items ranged from −.02 to .91 (M ICC = .71, SD = .26), with a single item in the poor range (“Self-Reward”). ICCs for the CBAY-C Model items ranged from .34 to .89 (M ICC = .50, SD = .20), with two items (“Coping Plan,” “Exposure”) in the poor range. For two CBAY-C items (“Problem Solving,” “Self-Reward”), ICCs could not be calculated due to lack of variance. In the Usual Care condition, CBAY-A Model item ICCs ranged from −.03 to .83 (M ICC = .51, SD = .38), with two items in the poor range (“Problem Solving,” “Exposure: Prep”) and one item in the fair range (“Fear Ladder”). ICCs for four CBAY-A items (e.g., “Self-Reward,” “Coping Plan”) could not be calculated in the Usual Care condition due to lack of variance. For the CBAY-C Model items, we were unable to calculate ICCs for eight items (e.g., “Relaxation,” “Fear Ladder”) due to low variance; for the remaining items, ICCs ranged from −.89 to .71 (M ICC = .23, SD = .97), with one item in the poor range (“Emotion Education”). Overall, interrater reliability was comparable across the Standard and Modular conditions for most CBAY-A items but was more variable for the CBAY-C items across the two conditions.

Table 3.

CBAY-A and CBAY-C Item and Subscale Descriptive Data and Interrater Reliability

	Standard			Modular			Usual Care		Total
Item	Range	M (SD)	ICC	Range	M (SD)	ICC	Range	M (SD)	ICC	ICC
CBAY-A
Psychoeducation 6 1.96 (1.16) .685 6 2.17 (1.45) .851 2.5 1.23 (0.54) .621 .772
Emotion Education 6 1.93 (1.50) .857 4.5 1.18 (0.61) .754 5 1.16 (0.57) .829 .858
Fear Ladder 6 1.66 (1.18) .828 5.5 1.81 (1.29) .806 2.5 1.04 (0.23) .563 .819
Relaxation 6 1.57 (1.24) .899 5.5 1.18 (0.68) .839 3.5 1.12 (0.42) .774 .891
Cognitive 6 1.82 (1.40) .872 4.5 1.30 (0.80) .749 4 1.24 (0.69) .821 .853
Problem Solving 6 1.20 (0.76) .809 2 1.02 (0.15) .691 1.5 1.01 (0.12) −.008 .809
Self-Reward 6 1.23 (0.90) .940 1 1.01 (0.10) −.022 0 1.00 (0.00) NV .934
Coping Plan 6 1.90 (1.39) .850 3.5 1.13 (0.48) .603 2 1.01 (0.15) NV .849
Exposure: Prep 5 1.62 (1.15) .843 5 1.74 (1.10) .794 1 1.03 (0.16) −.029 .825
Exposure 6 1.72 (1.36) .894 5.5 2.00 (1.51) .913 1 1.02 (0.15) NV .902
Exposure: Debrief 5 1.40 (0.86) .765 5 1.59 (1.06) .867 1 1.01 (0.11) NV .818
Model Subscale 1.6 1.64 (0.31) .797 1.4 1.58 (0.39) .875 0.6 1.08 (0.14) .789 .842

CBAY-C
Psychoeducation 3.5 3.70 (0.84) .672 4 3.68 (0.75) .484 2.5 3.00 (0.84) .857 .592
Emotion Education 5 4.31 (1.03) .776 2 3.50 (0.64) .444 0.5 2.75 (0.27) −.889 .786
Fear Ladder 4 3.66 (1.07) .758 4 3.66 (1.05) .832 0 3.50 (0.00) NV .771
Relaxation 3.5 3.99 (0.82) .538 2.5 3.46 (0.72) .764 0.5 2.67 (0.26) NV .824
Cognitive 4.5 4.41 (1.00) .646 1.5 3.50 (0.87) .889 3 3.08 (0.70) .714 .814
Problem Solving 4 4.50 (1.30) .674 0 NR NV 0 NR NV .715
Self-Reward 3.5 4.18 (1.12) .841 0 NR NV 0 NR NV .861
Coping Plan 4.5 4.03 (0.93) .696 2 3.94 (0.73) .336 0 NR NV .649
Exposure: Prep 5 3.35 (1.23) .797 4.5 2.65 (0.70) .567 0 NR NV .836
Exposure 4.5 3.39 (1.01) .674 4 2.97 (0.84) .378 0 NR NV .687
Exposure: Debrief 4.5 3.65 (1.20) .729 3 3.02 (0.84) .625 0 NR NV .781
Model Subscale 5 3.86 (1.01) .769 4.3 3.35 (0.88) .665 2.8 2.89 (0.55) .462 .794

Note. Standard = Standard condition; Modular = Modular condition; Usual Care = Usual Care condition; CBAY-A = The Cognitive-Behavioral Therapy Adherence Scale for Youth Anxiety; CBAY-C = The Cognitive-Behavioral Therapy Competence Scale for Youth Anxiety; ICC = intraclass correlation coefficient; NV = ICC not calculated due to lack of variance; NR = no ratings available (item not observed in the condition).

Construct Validity

To evaluate the performance of the CBAY-A and CBAY-C Model subscale scores across the two treatment programs, the construct validity analyses were conducted separately for the Standard and Modular conditions. These analyses focused on the magnitude and pattern of the Pearson product-moment correlations between the observer-rated fidelity measures (CBAY-A Model, TPOCS-RS Anxiety, TPOCS-RS Non-Standard/Non-Modular), consultation records (Consultation Standard, Consultation MATCH Anxiety), and an observer-rated measure of the youth-clinician alliance (TPOCS-A). Within each condition, we hypothesized that scores on the CBAY-A Model subscale would demonstrate evidence of convergent validity via a large correlation with the TPOCS-RS Anxiety subscale and the corresponding Consultation Record scale (Standard, Modular Anxiety), as scores on these three measures are intended to represent the same construct (i.e., quantity). We hypothesized that the CBAY-C Model subscale would evidence small to medium correlations with scores on the adherence measures (CBAY-A Model, TPOCS-RS Anxiety, Consultation Record), as scores on these measures represent different treatment fidelity components (i.e., quantity and quality). We hypothesized that, within each condition, the CBAY-A and CBAY-C Model subscales would demonstrate evidence of discriminant validity via small to medium correlations with scores on the TPOCS-RS Non-Standard/Non-Modular subscale and the TPOCS-A (Hogue et al., 2008). The magnitude of the correlations was interpreted following Rosenthal and Rosnow’s (1984) guidelines: correlations are small if .10 ≤ r < .24, medium if .24 ≤ r < .36, and large if r ≥ .36. Follow-up contrasts of the absolute values of correlations were calculated using Fisher’s r-to-z transformation.
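The Fisher r-to-z contrast can be sketched as follows. This is a minimal illustration of the independent-samples form of the test (the function names and sample values are ours); contrasts between correlations that share a variable, as in the analyses below, use adjusted variants of the same transformation:

```python
import math

def fisher_z(r):
    """Fisher r-to-z transformation: z = arctanh(r)."""
    return math.atanh(r)

def compare_correlations(r1, n1, r2, n2):
    """Two-tailed z test for the difference between two independent
    correlations via Fisher's r-to-z. Returns (z, p)."""
    z1, z2 = fisher_z(r1), fisher_z(r2)
    # Standard error of the difference between two z-transformed rs
    se = math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    z = (z1 - z2) / se
    # Two-tailed p from the standard normal CDF
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p
```

The transformation matters because the sampling distribution of r is skewed for large |r|; after transformation, differences between correlations can be tested on an approximately normal scale.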

Standard Condition.

As seen in Table 4, the CBAY-A Model subscale, the TPOCS-RS Anxiety subscale, and the Consultation Record Standard evidenced the largest correlations. The CBAY-A Model subscale evidenced a large correlation with the TPOCS-RS Anxiety subscale (r = .66) and a medium correlation with the Consultation Record Standard (r = .35). The CBAY-A Model subscale had a medium correlation with the CBAY-C Model subscale (r = .24) and small correlations with the TPOCS-A (r = .15) and the TPOCS-RS Non-Standard subscale (r = .06). The CBAY-C Model subscale had large correlations with the TPOCS-RS Anxiety subscale (r = .40) and the TPOCS-A (r = .41), and small correlations with the Consultation Record Standard (r = −.06) and the TPOCS-RS Non-Standard subscale (r = −.01).

Table 4.

Correlations Between the CBAY-A Model Subscale, Consultation Record Standard, TPOCS-RS Subscales, CBAY-C Model Subscale, and TPOCS-A in the Standard Condition

M (SD) 2. 3. 4. 5. 6.
1. CBAY-A Model 1.64 (0.31) .352** .659** .240** .062 .154**
2. Consultation Standard 0.21 (0.10) .357** −.061 .041 .019
3. TPOCS-RS Anxiety 2.06 (0.59) .399* −.066 .201**
4. CBAY-C Model 3.86 (1.01) −.010 .407**
5. TPOCS-RS Non-Standard 1.09 (0.19) −.010
6. TPOCS-A 3.44 (0.59)

Note. CBAY-A = The Cognitive-Behavioral Therapy Adherence Scale for Youth; Consultation Standard = Standard Consultation Record; TPOCS-RS = Therapy Process Observational Coding System for Child Psychotherapy-Revised Strategies Scale; CBAY-C = The Cognitive-Behavioral Therapy Competence Scale for Youth Anxiety; TPOCS-A = Therapy Process Observational Coding System-Alliance Scale.

* p < .05. ** p < .01.

Comparisons indicated that the correlation between the CBAY-A Model subscale and the TPOCS-RS Anxiety subscale (r = .66) was significantly larger than the correlation between the CBAY-A Model subscale and the (a) Consultation Standard scale (r = .35), z = 5.61, p < .001; (b) CBAY-C Model subscale (r = .24), z = 6.98, p < .001, (c) the TPOCS-RS Non-Standard subscale (r = .06), z = 9.73, p < .001, and (d) the TPOCS-A (r = .15), z = 8.31, p < .001.

The correlation between the CBAY-A Model subscale and the Consultation Record Standard scale (r = .35) was significantly higher than the correlation between the CBAY-A Model subscale and the (a) TPOCS-RS Non-Standard subscale (r = .06), z = 4.05, p < .001, and (b) TPOCS-A (r = .15), z = 2.76, p = .006. However, the correlation between the CBAY-A Model and the Consultation Standard scale (r = .35) was not significantly higher than the correlation between the CBAY-A Model and the CBAY-C Model subscale (r = .24), z = 1.56, p = .119.

The correlation between the CBAY-C Model subscale and the CBAY-A Model subscale (r = .24) was significantly higher than the correlation between the CBAY-C Model subscale and the TPOCS-RS Non-Standard subscale (r = .01), z = 2.88, p = .004, and significantly lower than the correlation between the CBAY-C Model subscale and the TPOCS-A (r = .41), z = −2.44, p = .015. The construct validity of the CBAY-A and CBAY-C Model subscales was supported in the Standard condition. However, the correlation between the CBAY-C and TPOCS-A was higher than hypothesized.

Modular Condition.

As seen in Table 5, the CBAY-A Model subscale, the Consultation Record Modular Anxiety, and the TPOCS-RS Anxiety subscale evidenced the largest correlations in the Modular condition. Correlations were large between the CBAY-A Model subscale and (a) the TPOCS-RS Anxiety subscale (r = .71) and (b) the Consultation Record Modular Anxiety scale (r = .46). The CBAY-A Model subscale had a medium correlation with the TPOCS-A (r = .28), a small correlation with the CBAY-C Model subscale (r = .11), and a small correlation with the TPOCS-RS Non-Modular subscale (r = −.19). The CBAY-C Model subscale evidenced small correlations with the TPOCS-RS Anxiety subscale (r = .16), the Consultation Record Modular Anxiety (r = −.14), the TPOCS-RS Non-Modular subscale (r = −.12), and the TPOCS-A (r = .05).

Table 5.

Correlations Between the CBAY-A Model Subscale, Consultation Record Modular, TPOCS-RS Subscales, CBAY-C Model Subscale, and TPOCS-A in the Modular Condition

M (SD) 2. 3. 4. 5. 6.
1. CBAY-A Model 1.46 (0.33) .459** .712** .112 −.192 .277**
2. Consultation Modular 0.23 (0.16) .455** −.138 −.098 .075
3. TPOCS-RS Anxiety 1.59 (0.43) .155* −.171** .213
4. CBAY-C Model 3.35 (0.88) −.119 .054
5. TPOCS-RS Non-Modular 1.10 (0.24) −.132
6. TPOCS-A 3.29 (0.57)

Note. CBAY-A = The Cognitive-Behavioral Therapy Adherence Scale for Youth; Consultation Modular = MATCH Consultation Record; TPOCS-RS = Therapy Process Observational Coding System for Child Psychotherapy-Revised Strategies Scale; CBAY-C = The Cognitive-Behavioral Therapy Competence Scale for Youth Anxiety; TPOCS-A = Therapy Process Observational Coding System-Alliance Scale.

* p < .05. ** p < .01.

Comparisons indicated that the correlation between the CBAY-A Model subscale and the TPOCS-RS Anxiety subscale (r = .71) was significantly larger than the correlation between the CBAY-A Model subscale and the (a) Consultation Modular Anxiety scale (r = .46), z = 4.27, p < .001, (b) CBAY-C Model subscale (r = .11), z = 7.79, p < .001, (c) TPOCS-RS Non-Modular subscale (r = .19), z = 7.63, p < .001, and (d) TPOCS-A (r = .28), z = 6.48, p < .001. The correlation between the CBAY-A Model subscale and the Consultation Record Modular Anxiety scale (r = .46) was significantly higher than the correlation between the CBAY-A Model subscale and the (a) TPOCS-RS Non-Modular subscale (r = .19), z = 3.30, p = .001, (b) TPOCS-A (r = .28), z = 2.23, p = .026, and (c) CBAY-C Model subscale (r = .11), z = 3.79, p < .001.

The correlation between the CBAY-C Model subscale and the CBAY-A Model subscale (r = .11) was not significantly different from the correlation between the CBAY-C Model subscale and the TPOCS-RS Non-Modular subscale (r = .12), z = 0.09, p = .928, or from the correlation between the CBAY-C Model subscale and the TPOCS-A (r = .05), z = 0.53, p = .596. Overall, in the Modular condition, construct validity was supported for the CBAY-A Model subscale but not for the CBAY-C Model subscale.

Variance Components Analysis

Variance components analysis was used to determine whether measurement targets were associated with variation in scores on the CBAY-A Model and CBAY-C Model subscales. Mixed models in SAS/STAT Software 9.4 were used (see Barber et al., 2004; McLeod et al., 2015) to partition the variance in the subscale scores into sources. Restricted maximum likelihood estimation was used to estimate random factors representing potential sources of variance in treatment fidelity scores, taking into account the nesting within the data (see e.g., Barber et al., 2004; McLeod et al., 2015): (a) Condition; (b) Clinician (nested in condition); (c) Youth (nested in clinician, condition); (d) Time (nested in youth, clinician, condition); and (e) Coder. “Condition” refers to the effect the groups (Standard and Usual Care; Modular and Usual Care) exert on variation in the CBAY-A Model and CBAY-C Model subscale scores; “clinician” represents variation across clinicians in subscale scores; “youth” refers to variation across youths in subscale scores; “time” reflects the influence of weeks in treatment on subscale scores; and “coder” reflects differences across coders’ ratings. The variance estimate for each random effect was converted into a proportion of the total variance. A separate analysis was conducted for each subscale (CBAY-A Model, CBAY-C Model) for Standard (and Usual Care) and for Modular (and Usual Care). Condition was expected to account for the highest proportion of variance in the CBAY-A Model and CBAY-C Model subscale scores because the quantity and quality of the practices used in these conditions should differ from those delivered in Usual Care. The discriminative validity of the CBAY-A Model and CBAY-C Model subscales was evaluated by ascertaining whether scores on each subscale could detect known differences between each condition (Standard, Modular) and Usual Care.
Adjusted least-squares mean (LSM) scores for each subscale were generated in each condition from the mixed-model analysis used for the variance components, which takes the nesting within the data into account. LSMs produce subscale scores corrected for the influence of the other variables (i.e., condition, clinician, youth, time, coder). For these analyses, the CBAY-A Model and CBAY-C Model subscale scores were recalculated using the highest-scored item from each subscale. This procedure is intended to provide a more accurate estimate of the quantity and quality of CBT delivered to youth (McLeod et al., 2022; Smith et al., 2017). We hypothesized that the Model subscales would have significantly higher scores in each condition (Standard or Modular) than in Usual Care.
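The step from raw REML variance component estimates to the proportions reported below can be sketched as follows (the function name and the numeric estimates are hypothetical illustrations, not the study's values):

```python
def variance_proportions(components):
    """Convert raw variance component estimates into proportions of
    total variance, the form in which variance components analyses
    are typically reported.
    components: dict mapping variance source -> REML estimate."""
    total = sum(components.values())
    return {source: est / total for source, est in components.items()}

# Hypothetical estimates for illustration only
estimates = {"condition": 0.36, "clinician": 0.005, "youth": 0.01,
             "time": 0.085, "coder": 0.005, "residual": 0.05}
props = variance_proportions(estimates)
```

Each proportion answers the question "what share of the total score variation is attributable to this source?", so a large condition proportion indicates that the subscale distinguishes the treatment conditions, while a large coder proportion would instead signal rater-driven variation.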

Standard Condition

As seen in Table 6, condition accounted for most of the variance in the CBAY-A Model subscale scores (.72, or 72%), indicating that scores varied across the Standard and Usual Care conditions. Time also accounted for variation in the CBAY-A Model subscale scores (.17), suggesting that scores varied over the course of treatment. Clinician, youth, and coder each accounted for no more than 2% of the total variation in CBAY-A Model subscale scores. A similar pattern was observed for the CBAY-C Model subscale. Condition accounted for the most variance in the CBAY-C Model subscale scores (.46), and time accounted for the second highest proportion (.15). Clinician and youth accounted for 10% and 7% of the variation in CBAY-C Model subscale scores, whereas coder accounted for only 1%. Because condition accounted for a high proportion of the variance in the CBAY-A and CBAY-C Model subscales, follow-up contrasts were conducted. As hypothesized, scores on the CBAY-A Model subscale were significantly higher in Standard (M = 4.66) than in Usual Care (M = 1.65), t(37) = 36.60, p < .001. Likewise, scores on the CBAY-C Model subscale were significantly higher in Standard (M = 4.11) than in Usual Care (M = 2.50), t(37) = 13.03, p < .001. These findings suggest that condition and time account for the most variation in CBAY-A and CBAY-C Model subscale scores.

Table 6.

Variance Components for CBAY-A Model and CBAY-C Model Subscales in the Standard/Usual Care Conditions and Modular/Usual Care Conditions

Subscale Variance Components
Condition Clinician Youth Time Coder Residual
Standard and Usual Care
CBAY-A Model subscale .72 <.01 .02 .17 <.01 .10
CBAY-C Model subscale .46 .10 .07 .15 .01 .21
Modular and Usual Care
CBAY-A Model subscale .50 .05 .04 .30 <.01 .12
CBAY-C Model subscale .29 .08 .02 .16 .05 .40

Note. CBAY-A = The Cognitive-Behavioral Therapy Adherence Scale for Youth; CBAY-C = The Cognitive-Behavioral Therapy Competence Scale for Youth Anxiety.

Modular Condition

As seen in Table 6, condition accounted for variance in the CBAY-A Model subscale scores (.50, or 50%), suggesting that scores varied across the Modular and Usual Care conditions. Time also accounted for variation in the CBAY-A Model subscale scores (.30), suggesting that scores varied over the course of treatment. Clinician and youth accounted for 5% and 4% of the total variation in CBAY-A Model subscale scores, whereas coder did not account for any variation. A similar pattern was observed for the CBAY-C Model subscale. Condition and time accounted for the most variance in the CBAY-C Model subscale scores (.29 and .16, respectively); clinician, coder, and youth accounted for 8%, 5%, and 2% of the variation. Because condition accounted for a high proportion of the variance in the CBAY-A and CBAY-C Model subscales, follow-up contrasts were conducted. As hypothesized, scores on the CBAY-A Model subscale were significantly higher in Modular (M = 3.74) than in Usual Care (M = 1.65), t(31) = 21.78, p < .001. Scores on the CBAY-C Model subscale were also significantly higher in Modular (M = 3.46) than in Usual Care (M = 2.55), t(31) = 8.28, p < .001. These findings suggest that condition and time account for the most variation in the CBAY-A and CBAY-C scores.

Discussion

This study aimed to evaluate whether the CBAY-A and CBAY-C Model subscales show promise for assessing protocol fidelity across two distinct CBT programs for youth anxiety. Independent raters reliably coded the CBAY-A items across the two CBT conditions but did not reliably code all CBAY-C items in the Modular condition. Across the two conditions, the CBAY-A Model subscale scores showed evidence of construct validity, and the CBAY-A and CBAY-C Model subscales appear to measure distinct fidelity components. Discriminative validity was supported for the CBAY-A and CBAY-C Model subscale scores, indicating that both subscales can distinguish between clinicians who have and have not been trained to deliver CBT. However, validity evidence for the CBAY-C Model subscale was mixed across the two conditions. Overall, the CBAY-A shows promise in gauging protocol adherence across different programs, but our findings raise questions about the generalizability of the CBAY-C.

Within the conditions, the mean interrater reliability of the CBAY-A items was “excellent” in the Standard condition (M ICC = .84) and “good” in the Modular condition (M ICC = .71; Cicchetti, 1994). In contrast, the mean interrater reliability for the CBAY-C items was “good” (M ICC = .71) in the Standard condition and “fair” (M ICC = .50) in the Modular condition. Interrater reliability for quantity is consistently higher than for quality across treatment fidelity systems in the literature (see e.g., Carroll et al., 2000; Hogue et al., 2008; McLeod et al., 2018; Southam-Gerow et al., 2016), suggesting that interrater reliability may be more challenging to achieve for quality items. Compared to quantity ratings, greater inference is required to code quality, which has led some to suggest that experienced coders are needed to rate this fidelity component (see McLeod et al., 2013 for a discussion). Of course, it could be difficult even for expert coders to achieve interrater reliability, as their standards and implicit values could influence their ratings, though we are not aware of any research that directly speaks to this issue. Further, if more experienced coders, such as experienced clinicians, are necessary to rate quality, then these measures may become more costly. Interestingly, the interrater reliability for quantity and quality was higher in the Standard condition. The lower interrater reliability in the Modular condition was due to low variance for some items, which influences ICC estimates (McLeod et al., 2022). Because the Modular condition allowed for the flexible delivery of techniques, it is possible that clinicians did not consistently deliver some practices. The low ICCs in the Usual Care condition are likely due to low variance, as many items were not delivered. Taken together, interrater reliability is supported for the CBAY-A Model items across both the Standard and Modular conditions, whereas the interrater reliability of the CBAY-C Model items was more variable.

Evidence for the construct validity of the CBAY-A Model subscale was consistent across the two conditions. The largest correlations were observed between the CBAY-A Model subscale and the two measures designed to assess adherence. Smaller correlations were seen with the other aspects of treatment fidelity and the alliance. Moreover, these findings are consistent with previous research. The magnitude of the correlations between the CBAY-A Model subscale and the consultant-rated adherence measure across conditions is consistent with what has previously been found between these two informants on adherence measures (e.g., Dennhag et al., 2012; Ward et al., 2013). The strength of the correlations between the CBAY-A Model subscale and the alliance was also consistent with previous research (Hogue et al., 2008; McLeod et al., 2018; Southam-Gerow et al., 2016).

Evidence for the construct validity of the CBAY-C Model subscale was mixed across the two conditions. On the one hand, the magnitude of the correlations between the CBAY-C and CBAY-A Model subscales in the two conditions indicates that these measures assess distinct fidelity components. Our findings thus contribute to a growing body of literature suggesting that quantity and quality can be distinguished when independent coding teams assess each fidelity component (see Bjaastad et al., 2016; McLeod et al., 2018). On the other hand, the magnitude and pattern of correlations between the CBAY-C Model subscale and the fidelity and alliance measures varied across the conditions. We hypothesized that the correlation between the CBAY-C and the CBAY-A would be higher than the correlation between the CBAY-C and the alliance, but this pattern was not observed in either condition. Indeed, in the Standard condition the CBAY-C Model subscale evidenced a significantly higher correlation with scores on the alliance measure. Though this ran counter to our hypothesis, this pattern of correlations between competence, adherence, and alliance has been observed in other studies that assessed these constructs in multiple treatment programs (see Carroll et al., 2000; Hogue et al., 2008). Competence is believed to help promote the alliance (see Fjermestad et al., 2016), so variation in the correlation between competence and the alliance may reflect differences in how treatment programs focus on strengthening the client-clinician alliance.

Further support for the score validity of the CBAY-A and CBAY-C Model subscales is provided by the variance components analysis. Condition accounted for the highest proportion of variance in the CBAY-A and CBAY-C Model subscale scores in both the Standard and Modular analyses. Follow-up analyses revealed that scores on the CBAY-A and CBAY-C Model subscales were significantly higher in both conditions than in Usual Care. The clinicians in the Standard and Modular conditions received training and consultation, and the treatment fidelity ratings from the original Child STEPS Multisite Trial indicated little overlap in content across the conditions (see Weisz et al., 2012). Thus, quantity and quality scores were expected to be higher in the Standard and Modular conditions (Barber et al., 2004; McLeod et al., 2015, 2018). Interestingly, condition accounted for a higher proportion of the variance in both quantity and quality scores in the Standard condition. This is consistent with findings using another fidelity measure (see McLeod et al., 2022) and likely stems from clinicians’ ability to deliver content from anxiety, depression, and conduct modules in the Modular condition. Our findings are also consistent with previous findings in that condition accounted for a higher proportion of variance in adherence relative to competence scores (see McLeod et al., 2018). This suggests that adherence may vary more by condition, whereas competence may vary more at the clinician level (Hogue et al., 2008; McLeod et al., 2018). These findings support the discriminative validity of the CBAY-A and CBAY-C subscales.

Beyond condition, time in treatment accounted for the highest proportion of non-error variance in the CBAY-A and CBAY-C Model subscales. This indicates that these subscale scores changed systematically over time, consistent with findings suggesting that quantity and quality scores may vary over treatment (Barber et al., 2004; Hogue et al., 2008). The direction of change in subscale scores over time was not evaluated by the variance components analysis in the current study and is an important question for future research. As noted above, clinician effects were also observed for the quantity and quality measures, whereas youth and coder accounted for comparatively little systematic variance in the scores. These analyses support the assertion that the CBAY-A and CBAY-C assess distinct fidelity components, because different factors account for variation across the subscales within and across conditions.

Although not the central focus of this study, our findings support the convergent validity of the Standard and MATCH Consultation Records. The moderate to high correlations between scores on the consultation records and the CBAY-A subscales provide validity evidence that supplements the reliability and validity evidence from previous studies (see McLeod et al., 2022; Ward et al., 2013). Given how rarely different informants show convergence on fidelity measures (see McLeod et al., 2022), the magnitude and pattern of these correlations are notable. Because the Consultation Records rely on consultant report, they may represent a more pragmatic way to gauge adherence in community settings. Identifying cost-effective fidelity measures represents an important direction for future research (McLeod et al., 2022). Indeed, few pragmatic fidelity measures exist that can be incorporated into the workflow of community mental health centers. Determining whether the Consultation Records demonstrate similar score validity when used by supervisors in community mental health centers is thus an important direction for future research that may lead to the development of pragmatic fidelity measures.

Although the study has several strengths, some limitations warrant discussion. First, the small number of youth nested within clinicians makes estimating client effects in our variance components analysis challenging. Second, although the flexibility of fidelity measures based on practice elements has advantages, the CBAY-A and CBAY-C still rely on highly trained observational coders, which may make using these measures burdensome in many settings. Establishing the score reliability and validity of less burdensome measures, like clinician- or consultant-report, is thus an important direction for future research. Third, the current study provides initial evidence that the CBAY-A, and to a lesser degree, the CBAY-C, can be applied to two separate CBT protocols for youth anxiety. However, both measures should undergo psychometric evaluations across multiple studies, samples, and settings (Martinez et al., 2014).

Additionally, because the CBAY-A is not intended to be used only with Coping Cat and MATCH treatment sessions, future research should evaluate the psychometric properties of the CBAY-A when used with other CBT protocols for youth anxiety. Fourth, despite their strengths, the CBAY-A and CBAY-C assess only one dimension of treatment fidelity: which techniques are delivered and the extent and competence of their use (i.e., protocol fidelity; Regan et al., 2019). As a result, other important dimensions of treatment fidelity, such as fidelity to service coordination (e.g., how techniques are selected and ordered over the course of treatment), are not assessed by these measures. Such treatment fidelity dimensions may be especially important to assess when seeking to explain how fidelity relates to clinical outcomes in the Standard and Modular conditions (Weisz et al., 2012). Fifth, though both treatment protocols contained CBT techniques, the descriptive data reported in this study suggest that different content was delivered across the two conditions. It was beyond the scope of the present study to compare the techniques delivered in the two conditions, but future research investigating this issue may help illuminate whether the content varied across the conditions and whether this mattered for youth clinical outcomes. Finally, we could not access an independent measure of protocol quality and thus could not fully assess the convergent validity of the CBAY-C. An important direction for future research is therefore to establish whether independent quality measures converge (Cecilione et al., 2022).

Our findings indicate that the CBAY-A Model items and subscale can be used to estimate protocol adherence across different CBT programs for youth anxiety. Considered alongside previous research on the CBAY-A and CBAY-C (see McLeod et al., 2018), the evidence suggests the two measures assess distinct facets of treatment fidelity. However, further work is needed to bolster the interrater reliability of the individual CBAY-C items and the validity of the CBAY-C Model subscale.

Highlights.

  • A single measure can estimate adherence across two CBT programs for youth anxiety

  • A measure of competence had mixed psychometric evidence across two CBT programs

  • Observer- and consultant-rated adherence measures converged across two CBT programs

  • Findings offer new ways of assessing quantity and quality in youth anxiety treatment

Acknowledgments

Preparation of this article was supported in part by a grant from the National Institute of Mental Health (R01 MH086529; McLeod & Southam-Gerow).

Footnotes

Publisher's Disclaimer: This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

Contributor Information

Stephanie Violante, Department of Psychology, Virginia Commonwealth University.

Bryce D. McLeod, Department of Psychology, Virginia Commonwealth University

Michael A. Southam-Gerow, Department of Psychology, Virginia Commonwealth University

Bruce F. Chorpita, Department of Psychology, University of California, Los Angeles

John R. Weisz, Department of Psychology, Harvard University

References

  1. Achenbach TM, & Rescorla LA (2000). Manual for the ASEBA preschool forms and profiles (Vol. 30). University of Vermont, Research Center for Children, Youth, & Families.
  2. Barber JP, Foltz C, Crits-Christoph P, & Chittams J (2004). Therapists’ adherence and competence and treatment discrimination in the NIDA collaborative cocaine treatment study. Journal of Clinical Psychology, 60(1), 29–41.
  3. Barber JP, Gallop R, Crits-Christoph P, Frank A, Thase ME, Weiss RD, & Connolly Gibbons MB (2006). The role of therapist adherence, therapist competence, and alliance in predicting outcome of individual drug counseling: Results from the National Institute Drug Abuse Collaborative Cocaine Treatment Study. Psychotherapy Research, 16(2), 229–240.
  4. Barkley RA (2013). Defiant children: A clinician’s manual for assessment and parent training. Guilford Press.
  5. Bjaastad JF, Haugland BSM, Fjermestad KW, Torsheim T, Havik OE, Heiervang ER, & Öst LG (2016). Competence and Adherence Scale for Cognitive Behavioral Therapy (CAS-CBT) for anxiety disorders in youth: Psychometric properties. Psychological Assessment, 28(8), 908–916.
  6. Carroll KM, Nich C, Sifry RL, Nuro KF, Frankforter TL, Ball SA, Fenton L, & Rounsaville BJ (2000). A general system for evaluating therapist adherence and competence in psychotherapy research in the addictions. Drug and Alcohol Dependence, 57(3), 225–238.
  7. Cecilione JL, McLeod BD, Southam-Gerow MA, Weisz JR, & Chorpita BF (2021). Examining the relation between technical and global competence in two treatments for youth anxiety. Behavior Therapy, 52(6), 1395–1407.
  8. Chorpita BF, Becker KD, & Daleiden EL (2007). Understanding the common elements of evidence-based practice: Misconceptions and clinical examples. Journal of the American Academy of Child & Adolescent Psychiatry, 46(5), 647–652.
  9. Chorpita BF, Bernstein A, Daleiden EL, & the Research Network on Youth Mental Health (2008). Driving with roadmaps and dashboards: Using information resources to structure the decision models in service organizations. Administration and Policy in Mental Health and Mental Health Services Research, 35, 114–123.
  10. Chorpita BF, & Weisz JR (2005). Modular Approach to Therapy for Children with Anxiety, Depression, and Conduct Problems (MATCH-ADC). Unpublished treatment manual.
  11. Cicchetti DV (1994). Guidelines, criteria, and rules of thumb for evaluating normed and standardized assessment instruments in psychology. Psychological Assessment, 6(4), 284–290.
  12. Cox JR, Martinez RG, & Southam-Gerow MA (2019). Treatment integrity in psychotherapy research and implications for the delivery of quality mental health services. Journal of Consulting and Clinical Psychology, 87(3), 221–233.
  13. Dennhag I, Gibbons MBC, Barber JP, Gallop R, & Crits-Christoph P (2012). How many treatment sessions and patients are needed to create a stable score of adherence and competence in the treatment of cocaine dependence? Psychotherapy Research, 22(4), 475–488.
  14. Fjermestad KW, McLeod BD, Tully CB, & Liber JM (2016). Therapist characteristics and interventions: Enhancing alliance and involvement with youth. In Maltzman S (Ed.), The Oxford handbook of treatment processes and outcomes in psychology: A multidisciplinary, biopsychosocial approach (pp. 97–116). Oxford University Press.
  15. Hogue A, Dauber S, Chinchilla P, Fried A, Henderson C, Inclan J, Reiner RH, & Liddle HA (2008). Assessing fidelity in individual and family therapy for adolescent substance abuse. Journal of Substance Abuse Treatment, 35(2), 137–147.
  16. Kendall PC, & Hedtke K (2006). Cognitive-behavioral therapy for anxious children: Therapist manual (3rd ed.). Workbook Publishing.
  17. Liber JM, McLeod BD, Van Widenfelt BM, Goedhart AW, van der Leeden AJ, Utens EM, & Treffers PD (2010). Examining the relation between the therapeutic alliance, treatment adherence, and outcome of cognitive behavioral therapy for children with anxiety disorders. Behavior Therapy, 41(2), 172–186.
  18. Martinez RG, Lewis CC, & Weiner BJ (2014). Instrumentation issues in implementation science. Implementation Science, 9(1), 1–9.
  19. McLeod BD, Cecilione JL, Jensen-Doss A, Southam-Gerow MA, & Kendall PC (2021). Reliability, factor structure, and validity of an observer-rated alliance scale with youth. Psychological Assessment, 33(10), 1013–1023.
  20. McLeod BD, Martinez RG, Southam-Gerow MA, Weisz JR, & Chorpita BF (2022). Can a single measure estimate protocol adherence for two psychosocial treatments for youth anxiety delivered in community mental health settings? Behavior Therapy, 53(1), 119–136. 10.1016/j.beth.2021.06.008
  21. McLeod BD, Porter N, Hogue A, Becker-Haimes, & Jensen-Doss A (2022). What is the status of multi-informant treatment fidelity research? Journal of Clinical Child and Adolescent Psychology.
  22. McLeod BD, Smith MM, Southam-Gerow MA, Weisz JR, & Kendall PC (2015). Measuring treatment differentiation for implementation research: The Therapy Process Observational Coding System for Child Psychotherapy Revised Strategies Scale. Psychological Assessment, 27(1), 314–325.
  23. McLeod BD, Southam-Gerow MA, Rodríguez A, Quinoy AM, Arnold CC, Kendall PC, & Weisz JR (2018). Development and initial psychometrics for a therapist competence instrument for CBT for youth anxiety. Journal of Clinical Child & Adolescent Psychology, 1–14.
  24. McLeod BD, Southam-Gerow MA, Tully CB, Rodríguez A, & Smith MM (2013). Making a case for treatment integrity as a psychosocial treatment quality indicator for youth mental health care. Clinical Psychology: Science and Practice, 20(1), 14–32.
  25. McLeod BD, & Weisz JR (2005). The Therapy Process Observational Coding System-Alliance Scale: Measure characteristics and prediction of outcome in usual clinical practice. Journal of Consulting and Clinical Psychology, 73(2), 323–333.
  26. McLeod BD, & Weisz JR (2010). The Therapy Process Observational Coding System for Child Psychotherapy Strategies Scale. Journal of Clinical Child and Adolescent Psychology, 39(3), 436–443.
  27. Regan J, Park AL, & Chorpita BF (2019). Choices in treatment integrity: Considering the protocol and consultant recommendations in child and adolescent therapy. Journal of Clinical Child & Adolescent Psychology, 48(sup1), S79–S89.
  28. Rosenthal R, & Rosnow RL (1984). Essentials of behavioral research: Methods and data analysis. McGraw-Hill.
  29. Schoenwald SK, Garland AF, Chapman JE, Frazier SL, Sheidow AJ, & Southam-Gerow MA (2011). Toward the effective and efficient measurement of implementation fidelity. Administration and Policy in Mental Health and Mental Health Services Research, 38(1), 32–43.
  30. Shrout PE, & Fleiss JL (1979). Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin, 86(2), 420–428.
  31. Smith MM, McLeod BD, Southam-Gerow MA, Jensen-Doss A, Kendall PC, & Weisz JR (2017). Does the delivery of CBT for youth anxiety differ across research and practice settings? Behavior Therapy, 48(4), 501–516.
  32. Southam-Gerow MA, McLeod BD, Arnold CC, Rodríguez A, Cox JR, Reise SP, & Kendall PC (2016). Initial development of a treatment adherence measure for cognitive–behavioral therapy for child anxiety. Psychological Assessment, 28(1), 70–80.
  33. Southam-Gerow MA, Weisz JR, Chu BC, McLeod BD, Gordis EB, & Connor-Smith JK (2010). Does cognitive behavioral therapy for youth anxiety outperform usual care in community clinics? An initial effectiveness test. Journal of the American Academy of Child & Adolescent Psychiatry, 49(10), 1043–1052.
  34. Sutherland KS, McLeod BD, Conroy MA, Lyon AR, & Peterson N (2022). Implementation science in special education: Progress and promise. In Handbook of special education research, Volume I (pp. 204–216). Routledge.
  35. Sutherland KS, McLeod BD, & Conroy M (2022). Developing treatment integrity measures for teacher-delivered interventions: Progress, recommendations, and future directions. School Mental Health, 14(1), 7–19.
  36. Ward AM, Regan J, Chorpita BF, Starace N, Rodriguez A, Okamura K, Daleiden EL, Bearman SK, Weisz JR, & Research Network on Youth Mental Health (2013). Tracking evidence based practice with youth: Validity of the MATCH and Standard Manual Consultation Records. Journal of Clinical Child & Adolescent Psychology, 42(1), 44–55.
  37. Weisz JR, Chorpita BF, Palinkas LA, Schoenwald SK, Miranda J, Bearman SK, Daleiden EL, Ugueto AM, Ho A, Martin J, Gray J, Alleyne A, Langer DA, Southam-Gerow MA, Gibbons RD, & Research Network on Youth Mental Health (2012). Testing standard and modular designs for psychotherapy treating depression, anxiety, and conduct problems in youth: A randomized effectiveness trial. Archives of General Psychiatry, 69(3), 274–282.
  38. Weisz JR, Weersing VR, Valeri SM, & McCarty CA (1999). Therapist’s manual PASCET: Primary and secondary control enhancement training program. University of California, Los Angeles.
  39. Weller EB, Weller RA, Rooney MT, & Fristad MA (1999a). ChIPS–Children’s Interview for Psychiatric Syndromes. American Psychiatric Press.
  40. Weller EB, Weller RA, Rooney MT, & Fristad MA (1999b). Children’s Interview for Psychiatric Syndromes–Parent Version (P-ChIPS). American Psychiatric Association.
  41. Zucker DM (1990). An analysis of variance pitfall: The fixed effects analysis in a nested design. Educational and Psychological Measurement, 50(4), 731–738.
