The British Journal of Occupational Therapy. 2024 Nov 9;88(3):133–141. doi: 10.1177/03080226241283292

Achieving Inter-Rater Agreement and Inter-Rater Reliability to Assess Fidelity of an Occupation-Based Coaching (OBC) Clinical Trial Intervention

Amy Ann Abbott 1, Julia Shin 1, Kathryn Carlson 2, Marion Russell 1, Yongyue Qi 1, Hannah Storm 3, Vanessa Dawn Jewell 4
PMCID: PMC12033844  PMID: 40343155

Abstract

Introduction:

Establishing inter-rater agreement and reliability confirms that multiple raters evaluate observed interventions consistently, which helps ensure that clinical research interventions are delivered as intended by the trial protocol.

Purpose:

Using the Guidelines for Reporting Reliability and Agreement Studies, we (a) exemplified the steps to establish inter-rater reliability and inter-rater agreement on the occupation-based coaching Video Evaluation Tool and (b) evaluated best practices that promoted high inter-rater reliability and inter-rater agreement between blinded raters prior to starting a pilot randomized controlled trial. The randomized controlled trial examined the preliminary effectiveness of occupation-based coaching via telehealth for rural families with children living with type 1 diabetes to improve family quality of life, participation, self-efficacy, and child health outcomes.

Method:

We created a library of 13 occupation-based coaching videos portraying a range of evaluations, scores, and ratings. The inter-rater agreement and reliability on the occupation-based coaching Video Evaluation Tool were established through the iterations of (a) blinded rater training, (b) data collection using the tool, and (c) statistical analysis using Cohen’s kappa, Cronbach’s alpha, and the intraclass correlation coefficient (ICC).

Findings:

Occurrence and Non-Occurrence Checklist (κ = 0.881, p < 0.001); “Caregiver Talk” and “Interventionist Talk” Analysis (ICC = 0.991–0.999, p < 0.001); Evidence of Independent Capacity Rating (ICC = 0.867, p = 0.006).

Conclusion:

Strong inter-rater reliability and inter-rater agreement were established by engaging two blinded raters through multifaceted training, integrating real-life clients and contexts into the instrumentation and training, and using precisely defined rubric criteria. By employing such practices, high inter-rater reliability and agreement can be achieved in clinical research involving interventions and instruments that are highly subjective and individualized. To achieve greater scientific confidence in the intervention effect, it is necessary to develop a multidomain fidelity framework and to establish high inter-rater agreement and reliability in the instruments before implementing clinical trials.

Keywords: Clinical trial, fidelity, measurement, telehealth, chronic conditions, type 1 diabetes

Introduction

Inter-rater agreement is defined as the degree to which two or more raters provide identical absolute scores on items from a research instrument (Gisev et al., 2013; Kottner et al., 2011; Slaug et al., 2010). In contrast, inter-rater reliability is the level of consistency with which two or more raters detect and differentiate the variability inherent in measurements (Bajpai et al., 2015; Gisev et al., 2013; Kottner et al., 2011; Slaug et al., 2010). In other words, inter-rater agreement relates to the sameness of ratings produced by two or more raters, while inter-rater reliability relates to the consistency with which two or more raters produce responses in agreement. As an example, when two or more raters are using a highly reliable observational checklist to evaluate a participant’s performance, they are likely to (a) produce the same ratings and (b) consistently detect relative differences between low and high performances among the participants (Gisev et al., 2013). According to Kottner et al. (2011), inter-rater agreement and inter-rater reliability are products of interactions among the type of instrument, construct or subject being measured, rater characteristics, administration process, and statistical approach and are specific to the research context. Therefore, accurate interpretation and translation of inter-rater agreement and inter-rater reliability require a thorough evaluation of the context and steps of execution (Ibrahim and Souraya, 2016; Kottner et al., 2011; Slaug et al., 2010).

Nevertheless, the evaluation of procedures to achieve optimal inter-rater agreement and inter-rater reliability is rarely discussed in healthcare research, and methodology and reporting patterns are inconsistent (Barry et al., 2014; Farzin et al., 2017; Hallgren, 2012; Han et al., 2022; Hand et al., 2018; Slaug et al., 2010; Streiner and Kottner, 2014). When reported, partial or inadequate information is provided, especially pertaining to sample selection, study design, and statistical analysis methods (Barry et al., 2014; Kottner et al., 2011; Slaug et al., 2010). Because inter-rater agreement and inter-rater reliability are relative constructs, not fixed properties, missing information can significantly impede how healthcare educators, practitioners, and researchers evaluate and apply the research instrument (Barry et al., 2014; Hallgren, 2012; Hand et al., 2018; Souza et al., 2017). For instance, missing information on the sample selected to calculate inter-rater agreement and inter-rater reliability may mislead others into applying the instrument to inappropriate demographic groups, ultimately resulting in invalid data. Furthermore, inter-rater agreement and inter-rater reliability are often exclusively reported in sections of psychometric studies that seek to validate novel or existing assessments (Bottari et al., 2010; Bowyer and Tkach, 2019; May-Benson et al., 2021) rather than in clinical research contexts. In clinical research, documenting and reporting how inter-rater agreement and inter-rater reliability are established is especially crucial, because they promote assurance in the validity of the intervention outcomes (Kottner et al., 2011). Specifically, a clinical study that uses instruments with high inter-rater agreement and inter-rater reliability reduces the risk of measurement and intervention biases (Kottner et al., 2011).

In response to the call for more rigorous studies documenting the methodology for establishing inter-rater agreement and inter-rater reliability in clinical research contexts, the two-fold purpose of this methodological paper is to (a) exemplify the steps followed to establish inter-rater agreement and inter-rater reliability on a novel instrument intended to capture fidelity ratings of the occupation-based coaching (OBC) randomized controlled trial (RCT) interveners and (b) evaluate the strategies that promoted high inter-rater agreement and inter-rater reliability among blinded raters within the specific contexts of occupational therapy telehealth research. This paper is organized following the Guidelines for Reporting Reliability and Agreement Studies (GRRAS) to adhere to best practice in reporting outcomes using a comprehensive, standardized protocol (Kottner et al., 2011).

The Guidelines for Reporting Reliability and Agreement Studies

The inconsistencies and inadequacies observed in reliability and agreement studies in healthcare research can be partly attributed to the lack of reporting standards or frameworks (Gerke et al., 2018; Kottner et al., 2011; Streiner and Kottner, 2014). To fill this need, a team of experts in instrument development, reliability and agreement estimation, and systematic review of reliability studies developed the GRRAS (Kottner et al., 2011). Accepted into the Enhancing the Quality and Transparency of Health Research (EQUATOR) Network, the GRRAS 15-item standards and corresponding checklist (see Table 1) facilitate accurate, transparent, and standardized reporting of research contexts, procedures, and results in reliability and agreement studies (Kottner et al., 2011; Streiner and Kottner, 2014). Since their inception, the GRRAS criteria have been employed to guide reliability studies (Charidimou, 2017; Rasmussen et al., 2020; Zhu et al., 2023), systematic reviews of reliability and agreement studies and instruments (Farzin et al., 2017; Han et al., 2022), and research education and consultation (Gerke et al., 2018). Throughout this paper, the authors explicitly demonstrate the application of this framework within the unique contexts of occupational therapy research to inform current and future healthcare educators, practitioners, and researchers.

Table 1.

The Guidelines for Reporting Reliability and Agreement Studies checklist and evidence of adherence (Kottner et al., 2011).

| Section | Item # | Checklist item | Page # reported |
|---|---|---|---|
| Title/Abstract | 1 | Identify in title or abstract that inter-rater/intra-rater reliability or agreement was investigated. | 1 |
| Introduction | 2 | Name and describe the diagnostic or measurement device of interest explicitly. | 3–5 |
| | 3 | Specify the subject population of interest. | 2–3 |
| | 4 | Specify the rater population of interest (if applicable). | 5 |
| | 5 | Describe what is already known about reliability and agreement and provide a rationale for the study (if applicable). | 2–3 |
| Methods | 6 | Explain how the sample size was chosen. State the determined number of raters, subjects/objects, and replicate observations. | 5–6 |
| | 7 | Describe the sampling method. | 5 |
| | 8 | Describe the measurement/rating process (e.g., time interval between repeated measurements, availability of clinical information, blinding). | 5–6 |
| | 9 | State whether measurements/ratings were conducted independently. | 5–6 |
| | 10 | Describe the statistical analysis. | 6 |
| Results | 11 | State the actual number of raters and subjects/objects which were included and the number of replicate observations which were conducted. | 6 |
| | 12 | Describe the sample characteristics of raters and subjects (e.g. training, experience). | 5 |
| | 13 | Report estimates of reliability and agreement including measures of statistical uncertainty. | 6 |
| | 14 | Discuss the practical relevance of results. | 6–8 |
| Auxiliary Material | 15 | Provide detailed results if possible (e.g., online). | Available upon request |

Defining the contexts of occupational therapy telehealth research

Based on the GRRAS criteria, documenting the research contexts, including the subject population of interest and the measurement constructs, facilitates accurate evaluation of the instrument for the purposes of applicability and reproducibility (Kottner et al., 2011). An interdisciplinary research team including those living with and/or caring for someone with type 1 diabetes (T1D) developed a novel instrument, the OBC Video Evaluation Tool, to continuously monitor and evaluate fidelity throughout the implementation of a 12-week OBC telehealth intervention trial (Jewell et al., 2022; Jewell et al., in press; Shin et al., 2022). This pilot OBC telehealth trial involved occupational therapy interventionists working with rural-dwelling caregivers of a child diagnosed with T1D using a collaborative, family-centered coaching protocol grounded in strength-based practice and positive psychology (Jewell et al., 2022; Jewell et al., in press). The participants were required to reside in one of the following US states at the time of enrollment: Nebraska, Iowa, Minnesota, or Colorado, with a commute of at least one hour to a pediatric endocrinology office. Guided through the steps of benefit finding, reflection and feedback, guided discovery, joint plan, and summary, the participants (intervention group of eight caregiver-child dyads) developed goals and action plans to promote child health outcomes, family quality of life, and caregiver health management self-competence (Jewell et al., 2022; Jewell et al., in press; Shin et al., 2022). Due to the open-ended, client-centered nature of the OBC protocol, ensuring that each 1-hour session was delivered and received as intended across the interventionists, families, and sessions surfaced as a challenge.
As one of the orchestrated efforts to ascertain fidelity, the OBC Video Evaluation Tool was developed and implemented as part of a comprehensive fidelity framework based on the recommendations by the National Institutes of Health Behavior Change Consortium ([NIH BCC]; Bellg et al., 2004; Borelli, 2011; Shin et al., 2022). The process of developing the comprehensive fidelity framework across the domains of study design, provider training, treatment delivery, treatment receipt, and treatment enactment is discussed elsewhere (Shin et al., 2022).

OBC Video Evaluation Tool

The OBC Video Evaluation Tool was specifically developed to ascertain fidelity in the domains of treatment delivery, treatment receipt, and treatment enactment (Bellg et al., 2004; Borelli, 2011). The instrument comprises three parts: the Occurrence and Non-Occurrence Checklist, the “Caregiver Talk” and “Interventionist Talk” Analysis, and the Evidence of Independent Capacity Rating.

Occurrence and Non-Occurrence Checklist

The Occurrence and Non-Occurrence Checklist includes (a) five “Adherence” elements that correspond to the core principles of the OBC protocol and (b) two “Drift from Adherence” elements that indicate the interventionist deviating from the core principles of the OBC protocol. Table 2 provides an excerpt from the Occurrence and Non-Occurrence Checklist (Dunn et al., 2018; Little et al., 2018). In a high-fidelity session, the interventionist adheres to the “Adherence” elements while avoiding engaging in the “Drift from Adherence” elements. Using this checklist, a blinded rater can objectively evaluate and determine that an OBC session is delivered as intended with high fidelity.

Table 2.

Excerpt from the occurrence and non-occurrence checklist (Dunn et al., 2018; Little et al., 2018; Shin et al., 2022).

| Element type | Element (with example) | Occurrence: Yes, occurred | Occurrence: No, not occurred |
|---|---|---|---|
| Adherence | The interventionist starts the session by asking the client to recall and share something positive that occurred in the past week or since the last intervention session. EX) What was something good that happened this week? | | |
| Adherence | The interventionist uses open-ended questions to engage the client in formulating a joint plan. EX) What do you plan to do for the following week? What does this plan look like? | | |
| Drift from Adherence | The interventionist offers expert opinions. EX) Have you heard of this intervention? You should try doing this instead. | | |
| Drift from Adherence | The interventionist asks yes/no questions. EX) Do you agree that this strategy is helpful? | | |

EX: example.

“Caregiver talk” and “interventionist talk” analysis

Adapted from the seminal work by Dunn et al. (2018) and Little et al. (2018), the “Caregiver Talk” and “Interventionist Talk” analysis can ascertain that the OBC session is delivered and received as intended. This analysis yields numeric data as word counts and percentages. Specifically, the “Caregiver Talk” and “Interventionist Talk” data are obtained by counting the total number of words uttered in a single OBC session by the caregiver and by the interventionist, respectively. To prepare for analysis, the verbatim transcription of an OBC session is separated into two files, one for “Caregiver Talk” and one for “Interventionist Talk.” Then, the words in these two files are counted and converted into percentages of “Caregiver Talk” and “Interventionist Talk.” Because the OBC protocol emphasizes the caregiver becoming increasingly independent in setting goals and developing action plans with fewer prompts and less guidance from the interventionist, a high percentage of “Caregiver Talk” and a low percentage of “Interventionist Talk” are indicative of a high-fidelity session where the intervention is delivered and received as intended (Dunn et al., 2018; Little et al., 2018). In addition, the numbers of open-ended questions (i.e., indicating higher fidelity in treatment delivery) and of closed-ended questions and expert comments (i.e., indicating lower fidelity in treatment delivery) are recorded to yield additional data.
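The word-count step described above can be sketched in a few lines of Python. This is only an illustration, not the procedure the research team used: the function name `talk_percentages`, the whitespace-based definition of a "word," and the two sample utterances are all assumptions made for the example.

```python
def talk_percentages(caregiver_text: str, interventionist_text: str) -> dict:
    """Count words per speaker and convert the counts to percentages.

    Assumption: a "word" is any whitespace-delimited token. The source does
    not state the counting rule that was actually applied.
    """
    caregiver_words = len(caregiver_text.split())
    interventionist_words = len(interventionist_text.split())
    total = caregiver_words + interventionist_words
    if total == 0:
        raise ValueError("Both transcripts are empty; percentages are undefined.")
    return {
        "caregiver_words": caregiver_words,
        "interventionist_words": interventionist_words,
        "caregiver_pct": 100.0 * caregiver_words / total,
        "interventionist_pct": 100.0 * interventionist_words / total,
    }


# Hypothetical excerpts standing in for the two per-speaker transcript files.
caregiver = "We tried checking her glucose before soccer practice and it went well"
interventionist = "What did you notice"

result = talk_percentages(caregiver, interventionist)
print(result)  # 12 caregiver words and 4 interventionist words: 75% vs. 25%
```

In this toy excerpt the caregiver contributes three quarters of the words, which is the pattern the text associates with a high-fidelity session.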

Evidence of independent capacity rating

Finally, the Evidence of Independent Capacity Rating allows each recorded OBC session to be rated on an ordinal scale from “A [Independent from Interventionist with New Attempts]” to “C [Dependent on Interventionist with No Progress and No New Attempt]” based on the caregiver’s enactment of targeted knowledge and skills gained from the intervention as generalized to real-life contexts (Shin et al., 2022). As displayed in Figure 1, if the caregiver implements the action plan, evaluates the outcome, and develops an action plan outside the OBC session without the interventionist’s guidance, increasing caregiver self-efficacy and autonomy are demonstrated. This demonstration of independent problem-solving, empowerment, and self-determination by the caregiver is the outcome of cardinal importance in the 12-week OBC telehealth intervention trial, in which the caregiver obtains lasting knowledge and skills to assist in the child’s T1D management and outcomes (Jewell et al., 2022; Jewell et al., in press). Using the rating scale, a blinded rater can systematically observe the caregiver’s reflections and behaviors to assign the letter grade and determine whether the OBC session is enacted as intended.

Figure 1.

Evidence of independent capacity rating from the Occupation-Based Coaching Video Evaluation Tool.

Method

The inter-rater agreement and inter-rater reliability of the OBC Video Evaluation Tool were established following the three phases of (a) blinded rater training, (b) data collection using the OBC Video Evaluation Tool, and (c) statistical analysis. This project was part of a pilot RCT approved by the Creighton University Institutional Review Board (#2000257), and the research protocols were conducted in accordance with the Declaration of Helsinki and with strict adherence to ethical standards. Data were retrieved for training and instrument development purposes only, with no personally identifiable information collected at any time. The blinded raters and trainer were part of the research team conducting the RCT.

Participants

The two blinded raters were a nursing professor and an undergraduate nursing student at a Midwestern university whose specific roles on the research team were to serve as the blinded raters (Jewell et al., 2022; Jewell et al., in press). The raters completed extensive training in OBC principles, T1D technology, and T1D management alongside the research team to obtain baseline knowledge related to the intervention. However, the raters were excluded from the development of the OBC Video Evaluation Tool and blinded to the conditions of the RCT, including treatment assignment and interventionist debrief meetings, to reduce any bias in their judgment and evaluation.

The Library of Video Samples

To allow for calculation of inter-rater agreement and inter-rater reliability, the researchers created a library of sample OBC sessions on a continuum from low to high fidelity. These recordings exhibited maximal variation in client characteristics, goals, and action plans as well as adherence to the core principles of the OBC protocol. The videos met the following criteria: (a) the interventionist delivering the session was trained in the OBC protocol, (b) the client was living with T1D, was a caregiver of a child living with T1D, or was knowledgeable about T1D as evidenced by a documented record of diabetes training and education, and (c) the video captured an entire OBC session from beginning to end to allow for holistic evaluation. In addition, the video collection (a) included sessions delivered by multiple interventionists to eliminate bias toward one interventionist and (b) reflected maximal variation in the degree of fidelity, including ambiguous sessions that were difficult to judge as high fidelity or low fidelity. Overall, the library of video samples allowed for a wide range of possible evaluations, scores, and ratings on the OBC Video Evaluation Tool. The finalized collection included 13 full-length videos of OBC sessions delivered by four different interventionists with both real-life and simulated clients. While the sessions were not scripted, the interventionist and the simulated clients were provided with information on the degree of fidelity (i.e., low, moderate, high) and session contexts (e.g., for simulated clients, the child’s age, family dynamics, primary goals) as appropriate. To ensure consistency across the videos, one designated researcher communicated the degree of fidelity to the interventionists and simulated clients in advance by watching sample low- and high-fidelity videos with them and brainstorming the session content.
For example, in a low fidelity session, the researcher specifically asked the interventionist to engage in “Drift from Adherence” behavior while preparing the simulated client for an appropriate response.

Procedure

Phase 1: Blinded rater training

The two blinded raters (first and fifth authors) were trained in the principles of OBC and T1D technology and management alongside the interdisciplinary team. The raters completed a two-day, in-person OBC workshop provided by leading OBC practitioners and researchers. The raters attended a three-day T1D training series presented by an endocrinologist and a pharmacist (also a certified diabetes care and education specialist), as well as a diabetes continuing education conference and associated courses. The raters also completed all assigned reading of key/seminal publications and listened to podcasts relevant to occupational-therapy-specific diabetes health management. Then, the raters completed an additional 7 hours of training using the OBC Video Evaluation Tool under the guidance of a licensed occupational therapist who was extensively trained in instrument development and application. During this training, the raters were provided with a detailed handout consisting of instructions, definitions, and sample evaluation forms associated with the OBC Video Evaluation Tool. Next, the raters watched two 20-minute sample sessions and worked through the Occurrence and Non-Occurrence Checklist, “Caregiver Talk” and “Interventionist Talk” Analysis, and Evidence of Independent Capacity Rating. The raters and trainer (second author) analyzed the sample videos while actively interacting and engaging in discussions and feedback. This allowed for clarification and calibration of inter-rater agreement (e.g., occurrence vs. non-occurrence, caregiver vs. interventionist talk, A-rating vs. B-rating). The raters then watched and independently analyzed a third 20-minute video utilizing the OBC Video Evaluation Tool. The researchers analyzed the data yielded from this video using Cohen’s kappa (κ) and Cronbach’s alpha (α).
The inter-rater reliability was “almost perfect” for the Occurrence and Non-Occurrence Checklist (κ = 0.877; p < 0.001; Landis and Koch, 1977) and “excellent” for the “Caregiver Talk” and “Interventionist Talk” Analysis (α = 0.995; p < 0.001; George and Mallery, 2003); 100% absolute agreement between the raters was found on the Evidence of Independent Capacity Rating, and therefore, Cronbach’s alpha was not calculated. These findings affirmed that the next phase of implementing the OBC Video Evaluation Tool to establish inter-rater agreement and inter-rater reliability could commence. In preparation, the raters were provided with consultation sessions via video conferencing technology as needed for 2 weeks about completing the forms and scoring the items on the OBC Video Evaluation Tool. During this time, any discrepancies in rater observations and ratings were discussed and mediated for further calibration.
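For readers unfamiliar with Cronbach’s alpha as a rater-consistency index, the following minimal Python sketch shows the standard formula, treating each rater as an "item." It is not the authors’ analysis (which was run in SPSS); the function name `cronbach_alpha` and the toy scores are hypothetical.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha with raters treated as items.

    scores: one row per rated unit, one column per rater.
    alpha = k / (k - 1) * (1 - sum(per-rater variances) / variance(row totals))
    """
    k = len(scores[0])
    if len(scores) < 2 or k < 2 or any(len(row) != k for row in scores):
        raise ValueError("Need at least 2 rows and 2 raters, with equal row lengths.")
    rater_variances = [variance([row[j] for row in scores]) for j in range(k)]
    total_variance = variance([sum(row) for row in scores])
    if total_variance == 0:
        raise ValueError("Alpha is undefined when the row totals do not vary.")
    return k / (k - 1) * (1 - sum(rater_variances) / total_variance)


# Hypothetical counts from two raters; rater 2 is always exactly 1 higher.
toy_scores = [[1, 2], [2, 3], [3, 4]]
alpha = cronbach_alpha(toy_scores)
print(round(alpha, 3))
```

In this toy data the two raters never give identical scores, yet alpha equals 1 because the ratings rise and fall together. Alpha therefore speaks to consistency rather than to absolute agreement.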

Phase 2: Data collection using the OBC Video Evaluation Tool

Beginning the next phase, the trainer shared a folder with 10 randomly selected sample OBC videos from the library with the two blinded raters. For each video, the folder contained the video recording, verbatim transcription of the video, and session notes to aid holistic evaluations. The random selection of 10 videos ensured a good representation of the recorded videos while keeping the analysis of many transcriptions feasible. For each video, the raters were asked to independently complete the Occurrence and Non-Occurrence Checklist, “Caregiver Talk” and “Interventionist Talk” Analysis, and Evidence of Independent Capacity Rating. The raters were not allowed to interact when implementing the OBC Video Evaluation Tool during this phase, but they had unlimited access to the trainer and the resources related to the OBC protocol and T1D management. Each rater independently forwarded the completed OBC Video Evaluation Tool forms on the 10 videos to the trainer. The trainer coded the data across the Occurrence and Non-Occurrence Checklist, “Caregiver Talk” and “Interventionist Talk” Analysis, and Evidence of Independent Capacity Rating into a spreadsheet in preparation for the statistical analysis.

Phase 3: Statistical analysis

Cohen’s kappa (κ) was used to assess the inter-rater agreement and inter-rater reliability for the categorical data of the Occurrence and Non-Occurrence Checklist. Kappa is calculated from the observed and expected frequencies on the diagonal of a square contingency table and is commonly used to measure agreement between two raters who classify the same items into categories. Cohen’s kappa ranges from −1 to +1, where 0 represents the level of agreement expected from random chance and 1 represents perfect agreement between the two raters. According to Landis and Koch (1977), kappa (κ) = 0.81–1.00 indicates “almost perfect” inter-rater agreement and kappa (κ) = 0.61–0.80 indicates “substantial” strength of agreement.
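As an illustration of the calculation described above, the sketch below computes Cohen’s kappa for two raters directly from paired categorical ratings and maps the result onto the Landis and Koch (1977) bands. It is a minimal example under stated assumptions, not the SPSS procedure the authors ran; the function names and the 20 hypothetical checklist ratings are invented for the example.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: (observed - expected agreement) / (1 - expected agreement)."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must rate the same non-empty set of items.")
    n = len(rater_a)
    # Observed agreement: proportion of items on the diagonal of the table.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected chance agreement from each rater's marginal category counts.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    categories = set(counts_a) | set(counts_b)
    expected = sum(counts_a[c] * counts_b[c] for c in categories) / (n * n)
    if expected == 1.0:
        raise ValueError("Kappa is undefined when chance agreement equals 1.")
    return (observed - expected) / (1 - expected)


def landis_koch_label(kappa):
    """Strength-of-agreement bands from Landis and Koch (1977)."""
    if kappa < 0:
        return "poor"
    if kappa <= 0.20:
        return "slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"


# Hypothetical ratings on 20 checklist items; the raters disagree on one item.
rater_1 = ["yes"] * 10 + ["no"] * 10
rater_2 = ["yes"] * 9 + ["no"] * 11

kappa = cohens_kappa(rater_1, rater_2)
print(round(kappa, 3), landis_koch_label(kappa))
```

Here the observed agreement is 19/20 and the chance agreement is 0.5, so kappa is about 0.90. The correction for chance makes kappa stricter than simple percent agreement.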

The intraclass correlation coefficient (ICC) was used to measure the inter-rater reliability of ratings in “Caregiver Talk” and “Interventionist Talk,” including word counts and percentages from the transcriptions and the numbers of open-ended questions, closed-ended questions, and expert comments recorded by the two raters. ICC measures the degree of correlation and agreement between measurements and is a desirable measure of inter-rater reliability. ICC is used to evaluate the consistency or conformity of measurements between two or more raters measuring the same quantity (Shrout and Fleiss, 1979). ICC values typically range from 0 to 1, indicating no reliability to perfect reliability among raters. An ICC value greater than 0.9 represents excellent reliability, whereas ICC values from 0.75 to 0.9 are considered good reliability (Koo and Li, 2016). ICC was also used for the Evidence of Independent Capacity Rating to assess the accuracy and consistency of the letter grades applied by the raters on client independence with formulating action plans across the 10 sample videos. In collaboration with the research team, a statistician created the statistical analysis plan, conducted sample randomization for video selection, and performed statistical analyses. Additionally, the statistician contributed to drafting and editing the data collection and statistical analysis sections. The Statistical Package for the Social Sciences (SPSS) for Windows version 28 was used for all data analysis, and a p-value less than .05 was considered statistically significant.
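To make the ICC computation concrete, here is a minimal Python sketch of one common form, ICC(2,1) from Shrout and Fleiss (1979): two-way random effects, absolute agreement, single rater. The paper does not state which ICC model or type was selected in SPSS, so the choice of ICC(2,1) is an assumption, as are the function names and the toy data. The labels follow the Koo and Li (2016) bands.

```python
def icc_2_1(scores):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    scores: one row per rated unit (e.g., video), one column per rater.
    """
    n, k = len(scores), len(scores[0])
    if n < 2 or k < 2 or any(len(row) != k for row in scores):
        raise ValueError("Need at least 2 rows and 2 raters, with equal row lengths.")
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]
    # Two-way ANOVA sums of squares: rows (units), columns (raters), residual.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_error = ss_total - ss_rows - ss_cols
    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_error / ((n - 1) * (k - 1))
    denominator = msr + (k - 1) * mse + k * (msc - mse) / n
    if denominator == 0:
        raise ValueError("ICC is undefined when the scores do not vary.")
    return (msr - mse) / denominator


def koo_li_label(icc):
    """Reliability bands from Koo and Li (2016)."""
    if icc < 0.5:
        return "poor"
    if icc < 0.75:
        return "moderate"
    if icc <= 0.9:
        return "good"
    return "excellent"


# Hypothetical scores: identical ratings vs. a constant one-point offset.
identical = [[1, 1], [2, 2], [3, 3]]
offset = [[1, 2], [2, 3], [3, 4]]

print(round(icc_2_1(identical), 3), round(icc_2_1(offset), 3))
print(koo_li_label(0.867), koo_li_label(0.991))
```

Identical ratings yield an ICC of 1, while a constant one-point offset between raters lowers this absolute-agreement form to about 0.67. A consistency-type ICC would not be penalized in the same way, which is why reporting the ICC form matters.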

Findings

Across the 10 videos, the inter-rater reliability on the OBC Occurrence and Non-Occurrence Checklist yielded “almost perfect” agreement (κ = 0.881, p < 0.001). The ICC between the two raters was also significant for the “Caregiver Talk” and “Interventionist Talk” Analysis, indicating “excellent” agreement between the raters (ICC = 0.999, p < 0.001). The inter-rater reliability of the number of open-ended questions, closed-ended questions, and expert opinions was also statistically significant with “excellent” agreement (ICC = 0.991, p < 0.001). Finally, the inter-rater reliability on the letter grades assigned to the 10 videos using the OBC Evidence of Independent Capacity Rating indicated “good” agreement, which was statistically significant (ICC = 0.867, p = 0.006).

Discussion

This paper illustrated the steps taken to establish inter-rater agreement and inter-rater reliability on a novel instrument, the OBC Video Evaluation Tool, in the context of a 12-week OBC telehealth pilot study. Across the three parts of the tool (the Occurrence and Non-Occurrence Checklist, “Caregiver Talk” and “Interventionist Talk” Analysis, and Evidence of Independent Capacity Rating), “good” to “almost perfect” inter-rater reliability estimates were obtained. In evaluating the process by which such high degrees of inter-rater agreement and inter-rater reliability were achieved, the following strategies utilized by the research team were noteworthy: (a) engagement of blinded raters through didactic, practical, and applied training; (b) integration of real-life clients and contexts into the instrumentation and training; and (c) precisely defined rubric criteria that were based on observable and quantifiable qualities.

Actively involving the raters throughout the research process, rather than limiting their involvement to training on the instrument, is essential to ensure that the raters are knowledgeable about the study’s contexts, the core principles of the intervention and protocol, and the expectations of their role as blinded raters. According to Sadler et al. (2017), rater training modalities can be categorized into (a) didactic, where only instructional materials are provided with or without discussion, (b) practical, where rater evaluation occurs on stimuli (e.g., video, audio, photo) in comparison to a gold standard with feedback, and (c) applied, where evaluation occurs with live or remote observation with feedback. In their analysis of 29 articles on rater training and clinical outcomes, Sadler et al. (2017) found that more than half (n = 18) of the studies used a combination of didactic and practical training modalities. The researchers concluded that without comprehensive and multifaceted rater training, even experienced raters can commit errors in their evaluation, which may be influenced by their judgments, training history, and motivation (Sadler et al., 2017). When training the two blinded raters in the current study, a combination of didactic, practical, and applied training modalities was used. For instance, the raters completed the didactic and practical training series in the principles of OBC and T1D technology and management, not just within the scope of using the OBC Video Evaluation Tool. Then, they moved on to applied learning, where they used the OBC Video Evaluation Tool to analyze the OBC sessions while the trainer observed them and provided live feedback. The raters also engaged in practice sessions, group discussions, and consultations with the trainer to further calibrate their evaluation skills.
Such efforts are congruent with the best practice recommendations for rater training to embed opportunities for rationale and critical thinking development, repeated practice, and continuous calibration with expert feedback (Pufpaff et al., 2015). Because the extensive, multiphased training helped the raters understand what to expect from the perspectives of the client and the interventionist, they were able to rate the sample videos with higher intra- and inter-rater agreement and consistency.

Throughout the instrumentation and rater training, integration of real-life clients and the realities of the research contexts was prioritized. For instance, the OBC Video Evaluation Tool was developed and reviewed by a team of experts in OBC, T1D, telehealth, and the lived experience (i.e., a person and a caregiver of a child living with T1D). The library of video samples involved actual clients whose children were affected by T1D or simulated clients who were knowledgeable in T1D in relation to child health outcomes and family quality of life. Furthermore, the library included video samples that were ambiguous and therefore difficult to judge as high or low fidelity. McClellan (2010) and Pufpaff et al. (2015) recommended that raters should be exposed to samples of both clear, unambiguous responses and other varied types of responses. Due to the open-ended and reflective nature of the OBC where the caregiver and the interventionist engage in spontaneous conversations to arrive at strategies, solutions, and action plans, the recorded OBC sessions used in this study as well as from the RCT were highly ambiguous and variable. By integrating precisely aligned and realistic training materials that captured this complexity, the two raters in this study were well-prepared to analyze complex sessions that often fell between scoring criteria and letter grade ratings.

Finally, achieving the unusually high degree of inter-rater agreement and inter-rater reliability on the OBC Video Evaluation Tool was facilitated by using clearly defined scoring criteria and a rubric based on observable and quantifiable qualities. When developing the fidelity instruments for the OBC protocol, the researchers were perplexed by an irony: the highly subjective observational data collected from the OBC sessions had to be analyzed objectively. As a solution, the researchers specifically developed the scoring criteria and rubric to be based on observable data. For instance, the Occurrence and Non-Occurrence Checklist and the Evidence of Independent Capacity Rating clearly and simply defined the observable interventionist qualities to indicate “Adherence” and “Drift from Adherence” and caregivers’ behaviors and qualities of A, B, B-, and C performance, respectively. The “Caregiver Talk” and “Interventionist Talk” Analysis yielded absolute quantifiable data as measured by words uttered by the caregiver and the interventionist. Moreover, by employing a three-part instrument that yielded multiple data points across the fidelity domains of treatment delivery, receipt, and enactment, there was more than a single point of assurance in ascertaining the fidelity (Shin et al., 2022). Specifically, the pattern observed in high-fidelity OBC sessions (i.e., high “Adherence” and low “Drift from Adherence” of the interventionist, high “Caregiver Talk,” and A grade on caregiver independent capacity) can be contrasted with the pattern observed in low-fidelity OBC sessions (i.e., low “Adherence” and high “Drift from Adherence” of the interventionist, low “Caregiver Talk,” and B- or C rating on caregiver independent capacity). This unique feature of the OBC Video Evaluation Tool can further reinforce the fidelity of the sessions delivered.

In summary, by following the best practices in reliability and agreement studies and actively utilizing these strategies, the research team was able to achieve strong inter-rater agreement and inter-rater reliability on the OBC Video Evaluation Tool before the implementation of the 12-week OBC telehealth RCT. Entering the clinical phase of the study, the OBC Video Evaluation Tool served as a useful instrument to assist in continuous and retrospective monitoring and ascertainment of the intervention fidelity. Specifically, this work gave the research team ongoing assurance that the intervention was delivered as intended, as confirmed and reinforced by the data across the Occurrence and Non-Occurrence Checklist, “Caregiver Talk” and “Interventionist Talk” Analysis, and Evidence of Independent Capacity Rating. In addition, by establishing high inter-rater agreement and inter-rater reliability, rater bias was effectively controlled, increasing the scientific confidence that the changes observed in participants over time were indeed due to the intervention effect rather than to measurement error. Future research may include studies that examine the relationship between high-fidelity intervention delivery and study outcomes.
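
To make concrete how a word-count-based talk analysis produces quantifiable data, the sketch below tallies words per speaker and reports each speaker’s share of the total. It is an illustrative assumption, not the study’s procedure: the function name, the speaker labels, the whitespace-based word count, and the transcript turns are all hypothetical.

```python
def talk_share(turns):
    """Return each speaker's share of total words.

    `turns` is a list of (speaker, utterance) pairs; words are counted
    by splitting each utterance on whitespace (a simplifying assumption).
    """
    counts = {}
    for speaker, utterance in turns:
        counts[speaker] = counts.get(speaker, 0) + len(utterance.split())
    total = sum(counts.values())
    if total == 0:
        raise ValueError("transcript contains no words")
    return {speaker: count / total for speaker, count in counts.items()}


# Hypothetical excerpt from a coaching session transcript.
session = [
    ("caregiver", "We tried checking her glucose before soccer practice"),
    ("interventionist", "What did you notice"),
    ("caregiver", "She had more energy and fewer lows"),
    ("interventionist", "Great"),
]
shares = talk_share(session)
print(shares)
```

Because two raters applying the same counting rule to the same transcript arrive at the same numbers, a measure of this kind leaves little room for subjective disagreement.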

Conclusion

This study is one of the first in the occupational therapy literature to integrate the GRRAS, and the methodology, outcomes, and insights gained from implementing the OBC Video Evaluation Tool may encourage others to consider best practices when establishing inter-rater agreement and inter-rater reliability on a novel instrument. Ascertaining fidelity in health behavior change interventions is especially imperative in occupational therapy research contexts because the targeted constructs and intervention protocols, due to the nature of the profession, are often highly subjective and individualized. To obtain greater scientific confidence that the intervention yielded the intervention effect observed, it is necessary to develop a multidomain fidelity framework and to establish fidelity instruments with high inter-rater agreement and reliability prior to implementation of a clinical trial. Both the NIH BCC’s five-domain fidelity framework (Bellg et al., 2004; Borrelli, 2011; Shin et al., 2022) and the GRRAS framework (Kottner et al., 2011) can serve as strong foundations for current and future educators, practitioners, and researchers to increase rigor, transparency, adherence, and sophistication within and outside of occupational therapy research endeavors.

Key findings

  • Optimal inter-rater agreement and reliability were achieved using creative, systematic, purposeful training and a detailed rubric.

  • Using fidelity and reporting frameworks increased rigor and sophistication in this inter-rater agreement and reliability research.

  • Healthcare clinical trial researchers, including researchers in occupational therapy, should report methodologies and protocols of intervention trials to increase rigor and transparency of findings.

What the study has added

Engaging the blinded raters in multiphased training, integrating real-life clients and contexts, and creating precisely defined rubric criteria based on observable and quantifiable qualities can assist in achieving optimal inter-rater agreement and inter-rater reliability.

Acknowledgments

The authors would like to thank Drs. Lauren Little and Anna Wallisch for their knowledge and contributions regarding occupation-based coaching and Drs. Emily Knezevich and Andrea George regarding type 1 diabetes management. Finally, we would like to thank the following nursing and occupational therapy students for their contributions to video data collection and analysis: Allison Condon, Lisa Lorey, and Romy Luo.

Footnotes

Research ethics: Not applicable. This project was completed to ensure that our raters were reliable as part of a funded pilot randomized controlled trial (Creighton University IRB #2000257, approved 2020).

Consent: Not applicable. The raters were researchers involved with the funded pilot randomized controlled trial for which these analyses were completed.

Patient and public involvement data: Patient and Public Involvement was included at all stages of the submitted research, including its development, progress, and reporting.

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding: The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: Creighton University Health Sciences Strategic Health Fund (PI: Vanessa Jewell; IRB #2000257); Creighton University College of Nursing Dean’s grant (PI: Amy Abbott); Creighton University Undergraduate Research Center for Undergraduate Research and Scholarship (PI: Amy Abbott); Iota Tau Chapter of Sigma Theta Tau International (PI: Amy Abbott).

Contributorship: AA served as a blinded rater to yield data and contributed to manuscript writing. JS identified the frameworks used in the study, led the fidelity instrument development and processes used in this study and completed drafting and heavy editing of the manuscript. KC served as a trainer to blinded raters to collect and analyze data. HS served as the other blinded rater to yield data. YQ completed all statistical analyses and wrote the specific findings. MR researched literature and assisted in developing fidelity instruments and processes. VJ was the PI of the RCT, assisted in the fidelity tool development, and completed editing of the manuscript. All authors contributed to, reviewed, and edited the manuscript and approved the final version of the manuscript.

References

  1. Bajpai S, Bajpai R, Chaturvedi H. (2015) Evaluation of inter-rater agreement and inter-rater reliability for observational data: An overview of concepts and methods. Journal of the Indian Academy of Applied Psychology 41: 20–27.
  2. Barry AE, Chaney B, Piazza-Gardner AK, et al. (2014) Validity and reliability reporting practices in the field of health education and behavior: A review of seven journals. Health Education & Behavior 41: 12–18.
  3. Bellg AJ, Borrelli B, Resnick B. (2004) Enhancing treatment fidelity in health behavior change studies: Best practices and recommendations from the NIH behavior change consortium. Health Psychology 23: 443–451.
  4. Borrelli B. (2011) The assessment, monitoring, and enhancement of treatment fidelity in public health clinical trials. Journal of Public Health Dentistry 71: S52–S63.
  5. Bottari CL, Dassa C, Rainville CM, et al. (2010) The IADL profile: Development, content validity, intra- and interrater agreement. Canadian Journal of Occupational Therapy 77: 90–100.
  6. Bowyer P, Tkach MM. (2019) Treatment fidelity in Model of Human Occupation research. British Journal of Occupational Therapy 82: 263–271.
  7. Charidimou A, Schmitt A, Wilson D, et al. (2017) The Cerebral Haemorrhage Anatomical RaTing inStrument (CHARTS): Development and assessment of reliability. Journal of the Neurological Sciences 372: 178–183.
  8. Dunn W, Little LM, Pope E, et al. (2018) Establishing fidelity of occupational performance coaching. OTJR: Occupation, Participation and Health 38: 96–104.
  9. Farzin B, Gentric JC, Pham M, et al. (2017) Agreement studies in radiology research. Diagnostic and Interventional Imaging 98: 227–233.
  10. George D, Mallery P. (2003) SPSS for Windows Step by Step: A Simple Guide and Reference, 4th edn. Boston: Allyn and Bacon.
  11. Gerke O, Möller S, Debrabant B, et al. (2018) Experience applying the Guidelines for Reporting Reliability and Agreement Studies (GRRAS) indicated five questions should be addressed in the planning phase from a statistical point of view. Diagnostics 8: 69.
  12. Gisev N, Bell JS, Chen TF. (2013) Interrater agreement and interrater reliability: Key concepts, approaches, and applications. Research in Social and Administrative Pharmacy 9: 330–338.
  13. Hallgren KA. (2012) Computing inter-rater reliability for observational data: An overview and tutorial. Tutorials in Quantitative Methods for Psychology 8: 23–34.
  14. Han O, Tan HW, Julious S, et al. (2022) A descriptive study of samples sizes used in agreement studies published in the PubMed repository. BMC Medical Research Methodology 22: Article 242.
  15. Hand BN, Darragh AR, Persch AC. (2018) Thoroughness and psychometrics of fidelity measures in occupational therapy: A systematic review. American Journal of Occupational Therapy 72: 7205205050p1–7205205050p10.
  16. Ibrahim S, Sidani S. (2016) Intervention fidelity in interventions: An integrative literature review. Research and Theory for Nursing Practice 30: 258–271.
  17. Jewell V, Qi Y, Knezevich E, et al. (2022) Evaluation of a rural telehealth occupation-based coaching intervention for type 1 diabetes health management. American Journal of Occupational Therapy 76(Suppl. 1): 7610510018.
  18. Jewell VD, Russell M, Shin J, et al. (in press) Telehealth occupation-based coaching for rural parents of children with type 1 diabetes: A pilot randomized controlled trial. American Journal of Occupational Therapy 2025; 79(1).
  19. Koo TK, Li MY. (2016) A guideline of selecting and reporting intraclass correlation coefficients for reliability research. Journal of Chiropractic Medicine 15: 155–163.
  20. Kottner J, Audigé L, Brorson S, et al. (2011) Guidelines for Reporting Reliability and Agreement Studies (GRRAS) were proposed. Journal of Clinical Epidemiology 64: 96–106.
  21. Landis JR, Koch GG. (1977) The measurement of observer agreement for categorical data. Biometrics 33: 159–174.
  22. Little LM, Pope E, Wallisch A, et al. (2018) Occupation-based coaching by means of telehealth for families of young children with autism spectrum disorder. American Journal of Occupational Therapy 72: 7202205020p1–7202205020p7.
  23. May-Benson TA, Schoen SA, Teasdale A, et al. (2021) Inter-rater reliability of goal attainment scaling with children with sensory processing disorder. Open Journal of Occupational Therapy 9: 1–13.
  24. McClellan CA. (2010) Constructed-response scoring—Doing it right. R & D Connections 13: 1–7. https://www.ets.org/Media/Research/pdf/RD_Connections13.pdf
  25. Pufpaff LA, Clarke L, Jones RE. (2015) The effects of rater training on inter-rater agreement. Mid-Western Educational Researcher 27: 117–141.
  26. Rasmussen GHF, Kristiansen M, Arroyo-Morales M, et al. (2020) Absolute and relative reliability of pain sensitivity and functional outcomes of the affected shoulder among women with pain after breast cancer treatment. PloS One 15: e0234118.
  27. Sadler ME, Yammamoto RT, Khurana L, et al. (2017) The impact of rater training on clinical outcomes assessment: A literature review. International Journal of Clinical Trials 4: 101–110.
  28. Shin J, Jewell VD, Abbott AA, et al. (2022) Fidelity protocol for a telehealth type 1 diabetes occupation-based coaching intervention. Canadian Journal of Occupational Therapy 89: 159–169.
  29. Shrout PE, Fleiss JL. (1979) Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin 86: 420–428.
  30. Slaug B, Schilling O, Helle T, et al. (2012) Unfolding the phenomenon of interrater agreement: A multicomponent approach for in-depth examination was proposed. Journal of Clinical Epidemiology 65: 1016–1025.
  31. Souza ACD, Alexandre NMC, Guirardello EDB. (2017) Psychometric properties in instruments evaluation of reliability and validity. Epidemiologia e Servicos de Saúde 26: 649–659.
  32. Streiner DL, Kottner J. (2014) Recommendations for reporting the results of studies of instrument and scale development and testing. Journal of Advanced Nursing 70: 1970–1979.
  33. Zhu W, Wu J, Yang L, et al. (2023) Construction of nursing care quality evaluation indicators for post-anaesthesia care unit in China. Journal of Clinical Nursing 32: 137–146.

Articles from The British Journal of Occupational Therapy are provided here courtesy of SAGE Publications
