Inter-rater reliability data of classroom observation: Fidelity in large-scale randomized research in education

Fuhui Tong; Shifang Tang; Beverly J Irby; Rafael Lara-Alecio; Cindy Guerrero

2020 Feb 17;29:105303. doi: 10.1016/j.dib.2020.105303

PMCID: PMC7044508 PMID: 32140518

Abstract

This dataset comes from a large-scale randomized controlled trial (RCT) in educational research targeting English learners and their teachers' instructional capacity. The dataset includes ratings from classroom observations of 45-minute English as a second language (ESL) blocks. Each coder rated 60 recorded 20-second video segments for each teacher. For each 20-second segment, ratings were collected on seven domains of teachers' instruction (i.e., ESL Strategies, Group, Activity Structure, Mode, Language Content, Language of Teacher, Language of Student). The dataset is organized by teacher, by coder, and by domain, so that researchers can analyze inter-rater reliability among coders within and/or across domains. This data article is related to the research article by Tong et al. [3], “The determination of appropriate coefficient indices for inter-rater reliability: using classroom observation instruments as fidelity measures in large-scale randomized research”.

Keywords: Inter-rater reliability, Classroom observation, Fidelity of implementation, Bilingual/ESL education

Specifications Table

Subject	Social Sciences
Specific subject area	Education; bilingual/English as second language (ESL) education
Type of data	Table
How data were acquired	Virtually recorded classroom videos were coded via an online coding platform, http://tbop.teachbilingual.com/
Data format	Cleaned and labelled raw data
Parameters for data collection	Recordings were collected during a 45-minute block of ESL instruction. Research staff and graduate students of the project rated the recordings using a multi-dimension-multi-response (MDMR) instrument, the Transitional Bilingual Observation Protocol (TBOP [1,2]). The coding was based on the following dimensions: ESL Strategies, Group, Activity Structure, Mode, Language Content, Language of Teacher, and Language of Student. The coders were trained by the research team, who developed the observation instrument.
Description of data collection	Classroom instruction was recorded virtually. Videos were then coded by trained personnel. For each dimension, a value representing the observed category was recorded to indicate the presence of a particular pedagogical behaviour.
Data source location	College Station, Texas, the United States
Data accessibility	With the article or Mendeley Data link https://data.mendeley.com/datasets/479hwxdfwb/draft?a=10b48a38-dc39-46b3-8402-4ffbe51548f3
Related research article	Tong, F., Tang, S., Irby, B. J., Lara-Alecio, R., & Guerrero, C. (2020). The Determination of Appropriate Coefficient Indices for Inter-Rater Reliability: Using Classroom Observation Instruments as Fidelity Measures in Large-Scale Randomized Research. International Journal of Educational Research [3].

Value of the Data

• Classroom observation is recommended as an objective approach to measuring fidelity of implementation (FOI) in experimental research [4–6]. The establishment of inter-rater reliability of observation instruments is receiving more attention [7,8]. Our FOI data were acquired through classroom observations supported by extensive resources, personnel, and quality training, within the most rigorous design in educational research, as part of a U.S. federally funded project.
• The empirical dataset presents multiple coders' ratings on an MDMR protocol with nominal data (i.e., TBOP), which were used to measure FOI in an RCT on improving elementary bilingual teachers' instructional capacity and their English learners' (ELs') language achievement in English.
• The dataset can be used to conduct inter-rater analyses and calculate different indices of inter-rater agreement. It can also be used to compare agreement indices in order to test the levels of reliability of MDMR data in the bilingual/ESL context.
• The dataset demonstrates a scientific approach to recording data for rater agreement when multiple raters are involved in classroom observation. This is informative to practitioners, researchers, and administrators for future data analysis in educational research, where public classroom observation data are not readily available.
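
As an illustration of how two agreement indices can diverge on the same nominal ratings, the following minimal sketch computes simple percent agreement and Cohen's kappa for a pair of coders. It is not the analysis reported in the related article; the function names and the ten example codes are hypothetical and are not taken from the dataset.

```python
from collections import Counter


def percent_agreement(a, b):
    """Share of segments on which two coders chose the same nominal code."""
    if len(a) != len(b) or not a:
        raise ValueError("ratings must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a, b):
    """Cohen's kappa for two coders rating the same segments with nominal codes."""
    n = len(a)
    p_o = percent_agreement(a, b)
    freq_a, freq_b = Counter(a), Counter(b)
    # Chance agreement: product of the two coders' marginal proportions, summed over codes.
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a.keys() | freq_b.keys()) / (n * n)
    if p_e == 1.0:
        # Both coders used one identical code throughout; kappa is undefined,
        # and returning 1.0 here is only a convention chosen for this sketch.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Hypothetical codes for ten segments (illustrative only, not from the dataset).
coder_1 = [1, 1, 2, 2, 3, 1, 2, 3, 3, 1]
coder_2 = [1, 1, 2, 3, 3, 1, 2, 3, 2, 1]
print(percent_agreement(coder_1, coder_2))
print(round(cohens_kappa(coder_1, coder_2), 3))
```

Because kappa subtracts the agreement expected by chance from the observed agreement, it is lower than percent agreement whenever some chance agreement is expected and the coders do not agree perfectly, which is one reason the choice of index matters for nominal data.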

1. Data description

The dataset contains seven observation dimensions (ESL Strategies, Group, Activity Structure, Mode, Language Content, Language of Instruction/Teacher, and Language of Student) compiled in one workbook (see Table 1). Each sheet (tab) is named after the domain of observation. The top row contains the variable names: teacher (1–3), segment (1–60), and coder (1–6). The first column is the index of the three teachers whose lessons were recorded and coded. The second column is the index of the coded segments. Each video was coded in 60 segments, and each segment lasts 20 seconds. Columns 3–8 hold the coders' ratings of each segment. For example, column 3 contains Coder 1's ratings of the 60 20-second segments in each of the three videos. The value in each cell of columns 3–8 is a unique code identifying the instructional activity within that domain that the coder observed during that segment. The codes and the coding protocol are explained in Lara-Alecio and Parker [1] and Tong et al. [2]. These data are therefore nominal in nature and can be used to calculate different indices of inter-rater agreement, as reported in Tong et al. [3]. In future analyses, researchers need to select inter-rater indices that are appropriate for nominal data in educational research.

Table 1.

Descriptive statistics of the data used to calculate inter-rater reliability.

# of teachers	# of domains (worksheets)	# of coded 20-second segments	# of coders
3	7	180	6
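
Given the layout described above (one row per teacher and segment, with six coder columns), agreement among all six coders within a domain could be summarized with a multi-rater index such as Fleiss' kappa. The sketch below is one possible approach under that assumption, not the procedure used in the related article; the function name and the four example rows are hypothetical.

```python
from collections import Counter


def fleiss_kappa(rows):
    """Fleiss' kappa for nominal codes: one row per segment, one entry per coder."""
    n_items = len(rows)
    n_raters = len(rows[0])
    if any(len(r) != n_raters for r in rows):
        raise ValueError("every segment needs the same number of ratings")
    totals = Counter()
    p_bar = 0.0
    for row in rows:
        counts = Counter(row)
        totals.update(counts)
        # Proportion of agreeing coder pairs within this segment.
        p_bar += sum(c * (c - 1) for c in counts.values()) / (n_raters * (n_raters - 1))
    p_bar /= n_items
    # Chance agreement from the overall proportion of each code.
    p_e = sum((t / (n_items * n_raters)) ** 2 for t in totals.values())
    if p_e == 1.0:
        # Only one code was ever used; kappa is undefined, 1.0 is a convention here.
        return 1.0
    return (p_bar - p_e) / (1 - p_e)


# Hypothetical rows in the described layout: (teacher, segment, coder 1..6).
sheet = [
    (1, 1, 2, 2, 2, 2, 2, 2),
    (1, 2, 2, 2, 3, 2, 2, 2),
    (1, 3, 1, 1, 1, 1, 3, 1),
    (1, 4, 3, 3, 3, 3, 3, 3),
]
ratings = [row[2:] for row in sheet]  # drop the teacher and segment index columns
print(round(fleiss_kappa(ratings), 3))
```

Applied to the actual workbook, the same function would be called once per worksheet, on the 180 rows of coder ratings for that domain.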

2. Experimental design, materials, and methods

Data in this paper were part of a larger database of classroom observations from a randomized project. We randomly selected 10% of the data to ensure sample representativeness. The purpose of obtaining such a sample was to calculate inter-rater reliability among the coders' observations. We recommended 10% in consideration of the total number of raters involved in this process (between 5 and 8, see Tong et al. [2]). After receiving intensive training on the observation instrument, each coder was assigned three recorded lessons to code individually, with the aim of establishing inter-rater reliability. In the coding process, the coder watched the first five minutes of the recorded video to obtain a general sense of the lesson. In the next five minutes, the coder coded 15 segments, each lasting 20 seconds. The coder repeated this cycle, four cycles in all, until 60 20-second segments of a 45-minute ESL lesson had been coded. For each 20-second segment, the coder rated the above-mentioned seven domains. The data presented in this paper were cleaned and organized by domain, in the order of the 20-second segments that were coded for each teacher.
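
The coding schedule described above can be checked with a short sketch. It assumes that every one of the four cycles begins with a five-minute viewing period followed by 15 coded segments; the text states this explicitly only for the first cycle, so the constant names and the timing of later cycles are assumptions.

```python
SEGMENT_SECONDS = 20
SEGMENTS_PER_CYCLE = 15
CYCLES = 4
WATCH_SECONDS = 5 * 60  # viewing period before coding (assumed for every cycle)


def coding_windows():
    """Return (segment_index, start_second, end_second) for each coded segment."""
    windows = []
    clock = 0
    for _ in range(CYCLES):
        clock += WATCH_SECONDS  # watch without coding
        for _ in range(SEGMENTS_PER_CYCLE):
            windows.append((len(windows) + 1, clock, clock + SEGMENT_SECONDS))
            clock += SEGMENT_SECONDS
    return windows


windows = coding_windows()
print(len(windows))         # coded segments per lesson
print(windows[-1][2] / 60)  # minute at which the last coded segment ends
```

Under these assumptions the schedule yields 60 coded segments per lesson and fits within the 45-minute block; with three lessons, this corresponds to the 180 coded segments per worksheet listed in Table 1.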

Acknowledgments

This dataset was part of Project English Language and Literacy Acquisition-Validation (ELLA-V), supported by the Office of Innovation and Improvement, United States Department of Education, #U411B120047.

Footnotes

Appendix A. Supplementary data to this article can be found online at https://doi.org/10.1016/j.dib.2020.105303.

Conflict of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Appendix A. Supplementary data

The following are the Supplementary data to this article:

Multimedia component 1: mmc1.xlsx (53.1 KB, xlsx)

Multimedia component 2: mmc2.xml (375 B, xml)

References

1. Lara-Alecio R., Parker R.I. A pedagogical model for transitional English bilingual classrooms. Biling. Res. J. 1994;18 199-133.
2. Tong F., Tang S., Irby B.J., Lara-Alecio R., Guerrero C., Lopez T. A process for establishing and maintaining inter-rater reliability for two observation instruments as a fidelity of implementation measure: a large-scale randomized controlled trial perspective. Stud. Educ. Eval. 2019;62:18–29.
3. Tong F., Tang S., Irby B.J., Lara-Alecio R., Guerrero C. The determination of appropriate coefficient indices for inter-rater reliability: using classroom observation instruments as fidelity measures in large-scale randomized research. Int. J. Educ. Res. 2020;99.
4. Nelson M.C., Cordray D.S., Hulleman C.S., Darrow C.L., Sommer E.C. A procedure for assessing intervention fidelity in experiments testing educational and behavioral interventions. J. Behav. Health Serv. Res. 2012;39(4):374–396. doi: 10.1007/s11414-012-9295-x.
5. Noell G.H. Empirical and pragmatic issues in assessing and supporting intervention implementation in school. In: Peacock G.G., Ervin R.A., Daly E.J., Merrell K.W., editors. Practical Handbook in School Psychology. Guilford; New York, NY: 2010. pp. 513–530.
6. Smith S.W., Daunic A.P., Taylor G.G. Treatment fidelity in applied educational research: expanding the adoption and application of measures to ensure evidence-based practice. Educ. Treat. Child. 2007;30(4):121–134.
7. Lee O., Penfield R., Maerten-Rivera J. Effects of fidelity of implementation on science achievement gains among English language learners. J. Res. Sci. Teach. 2009;46(7):836–859.
8. Missett T.C., Foster L.H. Searching for evidence-based practice: a survey of empirical studies on curricular interventions measuring and reporting fidelity of implementation published during 2004-2013. J. Adv. Acad. 2015;26(2):96–111.
