CBE—Life Sciences Education. 2026 Spring;25(1):rm1. doi: 10.1187/cbe.25-03-0053

Toward Causal Inferences in Discipline-Based Education Research: Using Regression Discontinuity Design to Understand the Effect of Classroom Interventions

Shangmou Xu 1,*, Elli J Theobald 1,*
Editor: Brian A. Couch

Abstract

This paper overviews a quasi-experimental approach, the Regression Discontinuity (RD) design, as a viable tool to estimate the effects of classroom interventions in discipline-based education research (DBER). Classroom interventions have been widely used in undergraduate science, technology, engineering, and mathematics (STEM) instruction to improve student outcomes and promote educational equity. Yet two common approaches to assessing the impacts of these interventions on student outcomes, randomized controlled trials and covariate adjustment models, may not be optimal when (1) it is not feasible or ethical to conduct randomized experiments, and (2) the instructor has not collected sufficient student background characteristics to account for nonrandom assignment of students to the intervention. Fortunately, RD designs exploit a predetermined intervention threshold and, under testable assumptions, can estimate the impact of an intervention by comparing students who narrowly qualified for the intervention to students who narrowly did not. Using extended example data from a real-world classroom intervention, we demonstrate why and how to perform RD analysis with classroom intervention data. We also provide a step-by-step R Markdown file (https://github.com/TheobaldLab/RegressionDiscontinuity.git) to encourage the implementation of the RD design in DBER.

INTRODUCTION

Classroom interventions have been widely used in undergraduate science, technology, engineering, and mathematics (STEM) instruction to improve student outcomes and promote educational equity. Specifically, classroom interventions such as values affirmations, utility-value interventions, and curricular changes, among others, can help break down opportunity barriers in STEM education (Theobald et al., 2015; Jordt et al., 2017; Canning et al., 2018). In implementing and examining the effectiveness of an intervention, one may consider a randomized experimental design that establishes a comparable condition between treatment groups (i.e., students who receive an intervention) and control groups (i.e., students who do not receive an intervention); that is, the possibility of receiving the intervention is equal for all students regardless of their background, previous experience, or investigators' decisions. Randomized experimental designs enable researchers to produce unbiased estimates and causally examine the effectiveness of an intervention, and thus are considered the gold standard in establishing causation (e.g., What Works Clearinghouse, 2022). Yet, in real classroom interventions, such randomized experiments are oftentimes not feasible, as the enrollment process of undergraduate STEM courses is, in general, not manipulable by researchers or instructors (but refer to Stanich et al., 2018 for a study that did randomize student enrollment). Additionally, and in particular in single-classroom settings, it is sometimes undesirable or unethical to randomly provide a theoretically-beneficial intervention when it could have benefited all underperforming students.

At least since the late 20th century, educational researchers have sought to develop research designs and analytical strategies to causally understand the effects of a treatment (e.g., an intervention or a policy change) when a randomized experimental design is not feasible (Angrist and Lavy, 1999; Shadish et al., 2002). The term "causal" carries at least two basic requirements. First, the hypothesized cause (i.e., the intervention) has to precede the anticipated effect. Second, and more essentially, all other plausible explanations must be ruled out to establish the causal relationship between an intervention and its effect. In classroom interventions, one common practice is to target interventions at underperforming students (e.g., Meaders et al., 2025). But one weakness of this design is that, in some research settings, previous achievement and other unobserved factors may affect both selection into the treatment group and learning outcomes, and thus the presence of the intervention is not the sole explanation of observed effects.

To causally examine the effects of an intervention, one analytical strategy is to exploit a clearly-defined threshold at which the intervention is provided and compare treated and untreated students near the threshold; the key is that students around the threshold likely share similar backgrounds and exhibit similar abilities. This analytic strategy forgoes experimental assignment to treatment and instead leverages the values just above and just below the threshold to understand the effect. This design is called the Regression Discontinuity (RD) design. As its name suggests, the basic RD model describes a visual representation in which the effect of an intervention appears as a "jump" in the dependent variable that breaks the otherwise continuous regression relationship (e.g., linear) between the variable that determines the intervention and the learning outcome.

This paper aims to introduce the basics of the RD design and explain when and how to use the RD approach to understand the causal effect of classroom interventions. Specifically, this paper encourages and supports discipline-based education researchers and undergraduate STEM instructors to tailor research designs and analytical strategies for a better understanding of the effectiveness of classroom interventions. Procedurally, this paper first reviews common approaches to examining data from classroom interventions (sometimes referred to as pre–posttest designs in the DBER literature, e.g., Theobald and Freeman, 2014). Then, this paper presents an extended example to illustrate the use of the RD design in a real classroom intervention study. The extended example includes visualizations, parametric estimations, bandwidth selections, and nonparametric estimations that one might include in a research paper. We choose to introduce parametric estimations first, although nonparametric estimations are more commonly used, because they illustrate the basic concept of the RD design and help connect the research design to real classroom intervention practices. This paper also provides R code to help readers develop their own RD research design (available here: https://github.com/TheobaldLab/RegressionDiscontinuity.git).

COMMON ANALYTICAL STRATEGIES FOR CLASSROOM INTERVENTIONS

Figure 1 serves as a conceptual model that helps us understand problems with, and solutions to, analyzing the effects of classroom interventions. To understand the intervention effect, one common approach is to examine the effect of assignment status on observed learning outcomes; that is, to examine whether being assigned to the treatment group has positive effects on students' learning outcomes.1 Figure 1, Path 1 depicts this direct intervention effect on learning outcomes. Yet, in a real-world experimental design, a broad array of unobserved factors (e.g., students' previous learning experiences, background, etc.) may bias the estimation of the intervention effects in two basic ways. First, unobserved factors can affect the estimation by biasing the intervention group selection process (Path 2). For example, if high-performing students have higher chances of receiving an intervention, the estimated effects may capture smaller gains in performance, thus underestimating the effect of the intervention. Moreover, unobserved factors, such as students' previous learning experiences, can also directly affect the learning outcomes (Path 3), reducing the ability to estimate the causal effects of a classroom intervention.

FIGURE 1. Common approaches to understanding the intervention effect in Discipline-Based Education Research.

Existing DBER methods directly consider two general approaches to improving the causal estimation of intervention effects: random assignment and statistical controlling. First, randomly assigning students to an intervention can reduce the extent to which unobserved factors influence the chances of receiving the intervention condition (weakening or eliminating Path 2 in Figure 1). Random assignment is a simple but powerful approach to creating a comparable condition between treated and untreated students, and thus was adopted in some famous social experiments, such as the Tennessee Student/Teacher Achievement Ratio (STAR) experiment (Krueger, 1999) and the New York Scholarship Program (NYSP) lottery system (Howell and Peterson, 2006). Second, studies utilizing thoughtful statistical controls assume that the factors affecting students' learning outcomes can be measured and that, by considering and specifying these factors in a statistical model, their influence on the estimation can be ruled out. The statistical controlling approach has been widely adopted in examining intervention data (Theobald and Freeman, 2014), as it provides an effective tool for separating the direct intervention effect from the impact of student characteristics on learning outcomes (decreasing the influence of Path 3 in Figure 1).

Yet, although researchers have promoted these approaches to help improve the estimation of intervention effects, DBER researchers may not necessarily find them suitable for their intervention studies. For example, classroom-level randomization is often unachievable in undergraduate instruction. Even within a single-classroom research setting, it is oftentimes ethically questionable to randomly provide a theoretically-beneficial intervention when it could have benefited more students. Moreover, although statistical controlling may reduce the tension between optimal experimental research designs and instructional/administrative practices, doing so requires investigators' extensive knowledge of students' prior learning experiences and, more broadly, their demographic information to be able to adequately specify the structure of covariates. Furthermore, the administrative data from university registrar's or admissions offices do not always capture adequate information about students' academic and demographic backgrounds. With both of these considerations in mind, this paper aims to introduce RD as a viable tool for a general DBER audience, while relaxing some strict assumptions/prerequisites when estimating intervention effects. To help readers find clarifying notes, Box 1 summarizes a glossary of terms used throughout this paper.

Box 1. Glossary of terminology used throughout the paper.

Bandwidth: A bandwidth is a range of the running variable around the threshold that constrains the analytical sample. The bandwidth should be symmetrical and centered at the threshold. A bandwidth selection process determines the optimal analytical sample and creates a comparable condition near the threshold where the assignment to the treatment is seemingly random.

Baseline Specification: A baseline specification is a model specification that contains only the running variable and intervention status. The baseline specification is a simple yet efficient way to examine the intervention effect and display the visualization of the Regression Discontinuity (RD) design.

Confidence Interval: A confidence interval is the range within which the estimated intervention effect is expected to fall at a given level of confidence. The confidence interval serves as a useful visual tool in the bandwidth selection process.

Covariate Balance: Covariate balance is an analytical approach that examines whether the distributions of covariates are similar between treated and untreated groups. An RD design does not require a full specification of covariates but requires evidence of basic covariate balance. Common covariates are students' background characteristics.

Fixed Effects: A fixed effect describes the impact of a variable that remains fixed (or constant) across observations. Fixed effects are commonly used as statistical controls in statistical models.

Interaction Analysis: An interaction analysis examines the extent to which the magnitude of a main effect (i.e., the relationship between intervention status and learning outcomes) depends on other conditions.

Quasi-experimental Design: A quasi-experimental design focuses on establishing the relationship between the manipulation of an independent variable (e.g., an intervention) and an outcome variable without formally administering random assignment. Research that applies a quasi-experimental design aims to construct comparisons that are similar to a randomly assigned research setting.

Random Effects: A random effect assumes that the estimated effects can vary across groups. The random effect considers the non-independence embedded in a nested data structure and accounts for the variation coming from the multilevel data.

Running Variable: A running variable determines the eligibility of receiving an intervention. Some common examples are test scores, GPA, or SAT scores. In most RD studies, the running variable has to be continuous. The running variable is sometimes referred to as an eligibility index.

Threshold: A threshold is a clearly-defined cutoff value of the running variable at which the intervention is assigned. Participants on the opposite sides of the threshold should receive different treatment conditions.

EXTENDED EXAMPLE: USING REGRESSION DISCONTINUITY TO UNDERSTAND THE EFFECT OF A LIGHT-TOUCH TEACHING ASSISTANT REACH-OUT INTERVENTION

Overview of the intervention

For the remainder of this paper, we use an extended example from a real classroom intervention study to illustrate the use of the RD design. The dataset draws from a light-touch teaching assistant (TA) reach-out intervention study that aimed to encourage more frequent interactions between students and their graduate TA in undergraduate STEM classrooms. The rationale behind the intervention is that positive interactions with a TA can have a positive influence on undergraduate students' learning outcomes, such as self-efficacy (Stang and Roll, 2014), achievement (Wheeler et al., 2017), or STEM retention (O'Neal et al., 2007). Yet, minoritized students experience barriers to interacting with TAs (Solanki and Xu, 2018), and thus benefit disproportionately less from TAs in the STEM classroom (Perlmutter et al., 2023). This intervention study, by prompting light-touch TA interactions, aimed to improve student learning outcomes in the course.

In practice, this intervention study was administered in an introductory-series biology course at a large R1 public university. Participants were students enrolled in this large introductory biology course. The study was replicated in four sections throughout the 2022–2023 academic year (n = 3062 students). In each section, the instructor asked TAs to reach out to students who were falling behind during the first half of the quarter. Critical to this analysis, students whose cumulative score was below 71.5 points received the intervention, and students above this threshold did not. Students who received the intervention got an email from their TA encouraging them to interact with TAs. The instructor tried to make this easy and accessible to all TAs, so provided email templates as well as potential responses if students replied to the initial reach-out email (see Supplemental Material, Section A for the email template). The instructor and course coordinator provided TAs with a list of students who should receive a "reach out" email; the nature of the email was to inspire a growth mindset ("…you can do this…") and offer resources (e.g., office hours). The materials that were provided to TAs can be found in the Supplemental Materials.

The intervention design had several characteristics that made the RD approach appropriate and made other common analytical approaches less effective. First, the assignment of the treatment relied on a predetermined threshold of cumulative score, as opposed to a randomized experimental process. In other words, students below the threshold received the treatment whereas students above the threshold did not. Second, in many DBER studies, as with the current example, it is unlikely that researchers can gain access to enough statistical controls to fully account for unobservable differences between low-performing and high-performing students in a class. Stated another way, students far below and far above 71.5 points tend to have different backgrounds, whereas students just below and just above the threshold are extremely similar, except that one group is eligible for the intervention and the other is not. Thus, sophisticated statistical controls are not strictly required to understand the intervention effect among students whose cumulative scores are local to the threshold.

In subsequent sections, we provide step-by-step instructions for RD and explain why RD is appropriate for analyzing such intervention data by answering this overarching research question: Does a light-touch intervention from a TA have a positive effect on short-term learning outcomes?

Setting up an RD design: A predetermined threshold

An RD design exploits a predetermined, clearly-defined threshold that determines the eligibility for and assignment of the intervention. In the TA reach-out intervention example, we selected 71.5 cumulative points as the threshold below which the TA reach-out intervention was assigned, as a final cumulative score of 71.5 points was necessary for students to advance to subsequent introductory-series biology courses. Critical to a basic RD design, the running variable that determines the intervention eligibility (in this case, the cumulative grade before reaching out) must be continuous and should exhibit no gap, at least around the threshold.2 In practice, however, it is unrealistic to achieve a "perfect" continuity around the threshold, and thus it is necessary to make a couple of assumptions.

The first assumption of the RD design ensures that there is no drastic change in the running variable right above or below the threshold (i.e., there is continuity in the running variable). A drastic change in the running variable around the threshold may indicate 1) evidence of manipulation or strategies that influence positions around the threshold, or 2) other factors that determine the distribution of the running variable. Following McCrary (2008), we can plot the density distribution of the running variable and examine its smoothness around the threshold. As shown in Figure 2, the density distribution below and above the threshold does not exhibit a drastic change, and thus we can conclude that the first assumption has been met in this example dataset.

FIGURE 2. Visualization of smoothness of the distribution of the running variable around the threshold.
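As a minimal sketch of this density check in R (assuming a data frame `dat` with the running variable in a column `cum_grade`; both names are ours and may differ from the paper's repository):

```r
library(ggplot2)

threshold <- 71.5

# Density of the running variable; a smooth distribution across the
# dashed threshold line is consistent with Assumption 1.
ggplot(dat, aes(x = cum_grade)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 1) +
  geom_vline(xintercept = threshold, linetype = "dashed") +
  labs(x = "Cumulative grade before reach-out", y = "Density")

# A formal McCrary-style manipulation test is implemented in the
# rddensity package, e.g.:
# summary(rddensity::rddensity(X = dat$cum_grade, c = threshold))
```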

The second assumption examines whether there are noncompliers; in this case, students who were able to manipulate their eligibility or opt in or out of the intervention. In an RD design with few to no noncompliers, the Sharp RD discussed in this paper is safe to use. In an RD design with a substantial number of noncompliers (commonly referred to as a Fuzzy RD design), an additional analytical step is necessary: an extensive examination of eligibility versus actual receipt of the treatment, which explores the magnitude of the effect of the Treatment on the Treated (TOT) or the Local Average Treatment Effect (LATE). Students who received the treatment (treated group) might differ from students who should have received the treatment (intended treatment group) in a Fuzzy RD design (Page et al., 2019). As Fuzzy RD is not the focus of the current study, we forego this additional step.

To examine the sharpness around the threshold, we plot the relationship between reach-out status and students' cumulative grade before reaching out (i.e., the selection criterion), with the threshold as a vertical line. On either side of the threshold, data points that cross the vertical threshold line indicate noncompliers (e.g., students whose cumulative grade was above 71.5 but who received the intervention). As shown in Figure 3, there are virtually no noncompliers on either side of the threshold, suggesting that a Sharp RD should be used.

FIGURE 3. Visualization of intervention assignment along a continuous running variable and clearly-defined threshold.
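A sketch of this sharpness check, assuming `dat` also contains a hypothetical 0/1 indicator `reached_out` (1 = received the reach-out email):

```r
library(ggplot2)

# Reach-out status plotted against the running variable; points on the
# "wrong" side of the dashed threshold line would indicate noncompliers.
ggplot(dat, aes(x = cum_grade, y = reached_out)) +
  geom_jitter(height = 0.05, alpha = 0.3) +
  geom_vline(xintercept = 71.5, linetype = "dashed") +
  labs(x = "Cumulative grade before reach-out",
       y = "Received reach-out (1 = yes)")
```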

Examining the running variable and the noncompliance is critical in setting up an RD design. DBER researchers should check the related assumptions to confirm the continuity of the running variable and functionality of the threshold when designing an RD study. Box 2 summarizes all assumptions that ensure the appropriate use of RD. All annotated R code, including code to check these assumptions, is freely available in the GitHub repository (https://github.com/TheobaldLab/RegressionDiscontinuity.git).

Box 2. Summary of Regression Discontinuity assumptions.

Assumption 1: There is no drastic change in the running variable right above or below the threshold.

Assumption 2 (Sharp Regression Discontinuity [RD] only): There are few to no noncompliers when administering the intervention study.

Assumption 3: Students just above and just below the threshold tend to have very similar characteristics.

Establishing a comparable condition: Bandwidth matters

The next step in establishing a causal relationship between the assignment to the intervention and learning outcomes is creating a comparable condition between treated and untreated students. Recall that both the random assignment approach and the statistical controlling approach may not be suitable for all classroom-based instructional interventions, due to the lack of control over the enrollment process and of information about students' backgrounds. Instead, studies that apply an RD approach assume that students near the threshold tend to have very similar characteristics. In other words, if the threshold is 71.5 points, in practice there is likely very little difference between a student who earned 71 points (just below the cutoff) and one who earned 72 points (just above the cutoff).

At an extreme, treated and untreated students are only comparable if they are extremely close to the threshold (e.g., one point above or below the threshold). Yet, in practice, the estimation of the causal effects requires a substantial number of sampled students on both sides of the threshold. Therefore, it becomes an art to balance the considerations between creating a comparable condition and keeping an acceptable sample size; we call this art “bandwidth selection.”

A bandwidth captures how far selected students' running scores may be from the threshold. In the TA reach-out study, for example, a bandwidth of 30 indicates that the sample within the bandwidth only includes students whose cumulative grade was between 56.5 points and 86.5 points (i.e., plus and minus 15 points from the threshold). Figure 4 visualizes the changes in the sample size on both sides of the threshold during bandwidth selection, using four representative bandwidths as illustrative examples. As shown in Figure 4, in the process of creating a comparable subsample, the sample size on both sides of the threshold becomes smaller.

FIGURE 4. Creating comparable samples by narrowing samples around threshold. (a) All students, (b) Bandwidth = 40 points, (c) Bandwidth = 30 points, and (d) Bandwidth = 20 points.
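Constraining the sample to a bandwidth is a one-line operation in R; a minimal sketch, using the same hypothetical `dat` and column names as above:

```r
threshold <- 71.5
bw <- 30  # total bandwidth in points; keeps students within +/- 15

dat_bw <- subset(dat, abs(cum_grade - threshold) <= bw / 2)
nrow(dat_bw)  # analytical sample size within the bandwidth
```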

To further explain the way in which narrowing the bandwidth creates a comparable sample, we can follow Dee and Penner (2017) and examine covariate balance by regressing each key covariate on reach-out status. These estimations have the general form,

$$Cov_{si} = \alpha + \mu\,Reachout_i + \varepsilon_{si} \quad (1)$$

where Covsi indicates each of the key covariates, which includes gender and race/ethnicity identities, and parents’ education. The estimation of µ indicates whether sampled students above and below the threshold share similar background characteristics (e.g., a similar distribution of gender, racial/ethnicity, or parents’ education). In practice, a nonsignificant (in this case, p-value > 0.05) coefficient indicates a balanced covariance. Table 1, Panel A summarizes this covariates balance checking process for all three covariates by presenting the model estimation of µ, from equation 1, using the same four selected representative bandwidths as illustrative examples. As shown in Table 1, Panel A, at a wider bandwidth (e.g., 30 points or wider), students above and below the threshold exhibit different background characteristics (e.g., significantly different proportions of first generation, and racially minoritized students), whereas at a bandwidth of 20 points, students above and below the threshold share a similar set of background characteristics.3 In practice, depending on how much background information can be collected, DBER researchers should locate bandwidths where most relevant covariates are balanced across treated and untreated students. We further checked the covariate balance using nine different bandwidths from 20 to 30 points in Table 1, Panel B. Results indicate that students exhibit balanced background characteristics at bandwidths that are narrower than 25 points. Additionally, to understand how narrowing bandwidths reduces the magnitude of the imbalance, we calculated the effect sizes of the imbalance between treated and untreated students. The effect sizes are measured by standardized mean differences of group composition with a range from zero to one. In general practice, an effect size of 0.2 or smaller is considered as small effect size (i.e., balanced covariance). Supplemental Table S1 summarizes the magnitude of the imbalance between all group identities with different bandwidths. As shown in Supplemental Table S1, for example, our full sample was imbalanced between racial minority and majority groups (effect size = 0.38), whereas at a narrower bandwidth, the effect size became at a 0.1 level, indicating a more balanced student covariance between groups.
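A minimal sketch of the Equation 1 balance check in R, assuming hypothetical 0/1 covariate columns `first_gen`, `female`, and `racial_minority` in `dat`:

```r
# Regress each 0/1 covariate on reach-out status within a bandwidth;
# the coefficient on reached_out corresponds to mu in Equation 1.
balance_check <- function(data, bw, threshold = 71.5) {
  d <- subset(data, abs(cum_grade - threshold) <= bw / 2)
  covs <- c("first_gen", "female", "racial_minority")
  lapply(setNames(covs, covs), function(cv) {
    coef(summary(lm(reformulate("reached_out", response = cv), data = d)))
  })
}

balance_check(dat, bw = 20)  # compare to Table 1, Panel A
```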

TABLE 1.

Setting a comparable condition: Covariate balance checking.

Bandwidth First generation students Female Racial minority
Panel A. Four representative bandwidths
All students (n = 3062) −.065 (.025)*** −.038 (.024) .162 (.023)***
40 points (n = 2380) −.043 (.026) −.017 (.027) .135 (.025)***
30 points (n = 1596) .001 (.028) −.018 (.029) .089 (.028)**
20 points (n = 939) .018 (.032) −.019 (.033) .017 (.030)
Panel B. Bandwidths from 20 to 30 points
29 points (n = 1533) .003 (.028) −.035 (.029) .069 (.025)**
28 points (n = 1449) .007 (.028) −.037 (.030) .055 (.026)*
27 points (n = 1411) .016 (.028) −.036 (.030) .055 (.026)*
26 points (n = 1335) .024 (.029) −.043 (.030) .054 (.027)*
25 points (n = 1280) .018 (.029) −.038 (.031) .055 (.027)*
24 points (n = 1198) .023 (.030) −.038 (.031) .046 (.028)
23 points (n = 1139) .021 (.030) −.037 (.031) .040 (.029)
22 points (n = 1064) .020 (.031) −.051 (.032) .037 (.030)
21 points (n = 1017) .020 (.031) −.044 (.033) .022 (.029)

*** p<.001, ** p<.01, * p<.05, ∼ p<.1

The analytical approaches exhibited in both Figure 4 and Table 1 introduce the intuition that, at a larger bandwidth, the analysis has greater statistical power but treated and untreated students are less comparable, whereas at a smaller bandwidth, treated and untreated students become more comparable but statistical power is limited. Bandwidth selection is critical in setting up an RD analysis, as an appropriate bandwidth ensures a comparable condition between treated and untreated students with similar covariate attributes, while ensuring a considerable level of generalizability. For example, Table 1 suggests that the current RD sample becomes comparable at a bandwidth roughly within the range of 20 to 30 points, and thus the optimal bandwidth is likely within this range.4 Furthermore, common software can select the optimal bandwidth automatically. In this paper, we used the rdrobust package in R (Calonico et al., 2017) to perform automated bandwidth selection. Using multiple procedures with the mean squared error (MSE) automated algorithm, the estimated optimal bandwidth ranges from 21.88 to 26.21 points. The complete automated bandwidth selection results are shown in Supplemental Table S2.
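A sketch of automated, MSE-optimal bandwidth selection with the rdrobust package; the outcome column `next_exam` is a hypothetical name:

```r
library(rdrobust)

# MSE-optimal bandwidth selection (Calonico et al., 2017).
bw_sel <- rdbwselect(y = dat$next_exam, x = dat$cum_grade,
                     c = 71.5, bwselect = "mserd")
summary(bw_sel)  # reports the selected bandwidth on each side of the cutoff
```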

Bandwidth selection does not follow a one-size-fits-all approach. In empirical studies, it is generally preferable to present RD estimation results using multiple bandwidths to strengthen the robustness of the estimation of the treatment effects. For example, Page et al. (2019) conducted RD estimations using three different bandwidths: one that exceeded the optimal bandwidth, one that was below it, and one that fell within the range of optimal bandwidths. In the subsequent analyses in this paper, however, for illustrative purposes and simplicity, we purposefully picked a bandwidth of 26 points, a bandwidth that falls within the range of optimal bandwidths from the automated selection process, to conduct some of the visualizations and estimations.

Baseline RD model: The estimation of the “jump”

The three assumptions introduced earlier in the current study are critical in setting up an RD study. How do these assumptions translate into the estimation of the causal effects of the classroom intervention? In this section, we first introduce a visualization approach that allows researchers to visually examine the “jump” in the outcome at the threshold. Second, we estimate the causal effect of the intervention on learning outcomes using the baseline RD model. Combining the baseline RD model and visualization, finally, we introduce a generalized way to visually present estimation results at different bandwidths.

We begin the baseline estimation by visually examining the discontinuity at the threshold. In the TA reach-out example, we hypothesize that the intervention has a positive effect on students' learning outcomes, and thus we expect that students who received the intervention, on average, should perform better than their peers who did not receive the intervention. Recall that, by narrowing the sample near the threshold, we strictly assumed that students just above and below the threshold had similar characteristics (including background), and thus should exhibit similar learning patterns. Therefore, the difference in learning outcomes between treated and untreated students around the threshold approximates the effect of the intervention on students' learning outcomes. Figure 5 visualizes this analysis by plotting scores on the next exam (y-axis) against the cumulative grade before reaching out (x-axis), and generating two fitted lines for treated and untreated students separately. As shown in Figure 5, the intercept of the fitted line for treated students (golden line) is higher than the intercept of the fitted line for untreated students (light purple line) at the threshold, and this difference represents the effect of the intervention on students' scores on the next exam (the y-axis). Note that the intercept is at the threshold in this example for simplicity of interpretation. The extension of the fitted line for untreated students (the grey dotted line) further indicates the net gain in exam scores, comparing the actual fitted line for treated students (golden line) to the hypothetical fitted line for treated students had the instructor never administered the intervention.

FIGURE 5. Visualization of baseline Regression Discontinuity estimation (Bandwidth = 26 points, n = 1335). The y-axis is the score on the next exam. The golden plot represents students who received the intervention, whereas the light purple plot represents untreated students. The dotted line is the hypothetical fitted line for treated students if the instructor had never administered the intervention.

Next, we estimate the causal effects of the intervention in a baseline RD model specification. The baseline RD model has the general form,

$$Exam_i = \beta_0 + \beta_1\,CumGrade_i + \gamma\,Reachout_i + \varepsilon_i \quad (2)$$

where Exam_i is the next exam score for student i. The variable CumGrade_i is the running score, centered at the threshold for better interpretation. The variable Reachout_i is a binary indicator of reach-out status, with the value of 1 representing students who received the intervention and 0 representing untreated students.5 The estimation of the coefficient γ indicates the "jump" in the exam score at the threshold due to the intervention (refer to the annotated coefficient in Figure 5). To help readers understand the estimation from the equation, we unpack Equation 2 by plugging the values of Reachout_i into the equation. We get,

$$Exam_i = \beta_0 + \beta_1\,CumGrade_i + \varepsilon_i \quad (2\text{-}1)$$
$$Exam_i = (\beta_0 + \gamma) + \beta_1\,CumGrade_i + \varepsilon_i \quad (2\text{-}2)$$

Equations 2-1 and 2-2 demonstrate two parallel fitted lines with different intercepts at 71.5 points. For untreated students, the estimated intercept at 71.5 points is β0, whereas for students who received the intervention, the intercept at 71.5 points is (β0 + γ). Thus, the difference between the two intercepts, (β0 + γ) − β0 = γ, is the estimated "jump" at the threshold.

Equation 2 is useful for understanding the baseline estimation of the intervention effect. Moreover, running the baseline model with varying bandwidths enables researchers to strengthen the robustness of the estimation.

Table 2, Models 1–4 report the baseline RD model estimation results using the full sample and three different bandwidths. As shown in Models 2–4, the baseline RD model estimation with different bandwidths indicates that the TA reach-out intervention increases students' scores on their next exam by roughly four points. Additionally, Models 1–4 exhibit the way in which narrowing the bandwidth yields a better estimation of the intervention effects but decreases the level of generalizability because of the smaller sample size. It is worthwhile to note that robust standard errors are used to account for potential heteroskedasticity while shrinking the bandwidth (Bartalotti, 2019). The robust standard errors can be generated by specifying [vcov = 'robust'] when displaying the model estimation results in R (refer to the GitHub repository https://github.com/TheobaldLab/RegressionDiscontinuity.git).
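A minimal sketch of the baseline model (Equation 2) at the 26-point bandwidth, with heteroskedasticity-robust standard errors; column names (`next_exam`, `cum_grade`, `reached_out`) remain our hypothetical labels:

```r
library(sandwich)
library(lmtest)

# Restrict to a 26-point bandwidth (+/- 13 points) and center the
# running variable at the threshold.
d26 <- subset(dat, abs(cum_grade - 71.5) <= 13)
d26$cum_centered <- d26$cum_grade - 71.5

# Equation 2: the coefficient on reached_out is the "jump" (gamma).
fit <- lm(next_exam ~ cum_centered + reached_out, data = d26)
coeftest(fit, vcov = vcovHC(fit))  # heteroskedasticity-robust SEs
```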

TABLE 2.

Baseline Regression Discontinuity estimation of the effects of the teaching assistant (TA) reach-out intervention.

Model 1: Full sample Model 2: BW = 40 Model 3: BW = 30 Model 4: BW = 20 Model 5: BW = 26 (selected for illustration) Model 6: BW = 26
Reached out (γ) 1.93 (1.38)a 3.68 (1.59)* 3.88 (1.96)* 4.48 (2.52) 4.76 (2.13)* 5.81 (2.22)**
Centered Cum Grade (β1) 1.08 (.043)*** 1.23 (.068)*** 1.24 (.112)*** 1.29 (.218)*** 1.33 (.141)*** 1.20 (.160)***
Reached out × Centered Cum Grade (ϑ) .562 (.335)
Intercept (β0) 60.4 (.713)*** 58.5 (.908)*** 58.5 (1.13)*** 58.4 (1.47)*** 58.0 (1.24)*** 59.0 (1.38)***
N 3062 2380 1596 939 1336 1336
R2 .297 .226 .147 .071 .126 .128

aRobust standard errors in parentheses

*** p<.001, ** p<.01, * p<.05, ∼ p<.1

Recall that finding the optimal bandwidth is a matter of balancing the explanatory power of the intervention with the generalizability of the finding. Earlier in this paper, the automated bandwidth selection results indicated that the optimal bandwidth ranged from 21.88 to 26.21 points, and we selected a bandwidth of 26 points for illustration purposes. Table 2, Model 5 reports that the baseline intervention effect is 4.76 points, using the selected bandwidth that falls within the range of optimal bandwidths. With the selected optimal bandwidth, we can move on to checking the robustness of the estimation. The coefficient of interest in the baseline model, γ, indicates the change in the intercept at the threshold; thus, the estimation of the intervention effect is local to the threshold. Can this estimation be extended to the entire bandwidth? To examine the extent to which the estimated intervention effect changes when moving away from the threshold, following Page et al. (2019), we allow the slope of the eligibility index to vary flexibly by reach-out status by entering an interaction term between the running variable and reach-out status into the baseline model. The model has the form,

$$Exam_i = \beta_0 + \beta_1\,CumGrade_i + \gamma\,Reachout_i + \vartheta\,(CumGrade_i \times Reachout_i) + \varepsilon_i \quad (3)$$

where the estimation of ϑ indicates whether the two fitted lines (the golden line and the light purple line) exhibited in Figure 5 have significantly different slopes. To explain the meaning of the interaction analysis, we unpack Equation 3 by plugging reach-out status (0 or 1) into the equation,

$$Exam_i = \beta_0 + \beta_1\,CumGrade_i + \varepsilon_i \quad (3\text{-}1)$$
$$Exam_i = (\beta_0 + \gamma) + (\beta_1 + \vartheta)\,CumGrade_i + \varepsilon_i \quad (3\text{-}2)$$

where Equation 3-1 represents the fitted line for untreated students (the light purple line in Figure 5) and Equation 3-2 represents the fitted line for students who received the intervention (the golden line in Figure 5). Different from Equations 2-1 and 2-2, where the two fitted lines are specified to have the same slope, Equations 3-1 and 3-2 may have different slopes, β1 and (β1 + ϑ), respectively. To understand whether the estimated intervention effect changes when moving away from the threshold, we need to test whether the two fitted lines have significantly different slopes. This is equivalent to testing whether the difference between the two slopes is significantly different from zero. Deriving from Equations 3-1 and 3-2, the slope difference is (β1 + ϑ) − β1 = ϑ. This is exactly the estimation of the interaction term between reach-out status and the running score in Equation 3. Table 2, Model 6, using the bandwidth of 26 points, reports that the two slopes are not significantly different. Therefore, at least within the range of ±13 points around the threshold, the estimated intervention effect from the baseline RD model holds.
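In R, Equation 3 is a one-line extension of the baseline fit from the sketch above:

```r
# Equation 3: the cum_centered:reached_out term is theta, the slope
# difference between the treated and untreated fitted lines.
fit_int <- lm(next_exam ~ cum_centered * reached_out, data = d26)
summary(fit_int)
```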

Another common approach that helps strengthen the robustness of the estimation is to conduct RD model estimations at multiple bandwidths. Following Jacob et al. (2012), we run the baseline RD model (Equation 2) using bandwidths ranging from 2 to 60 points, increasing in increments of 1 point, and plot the estimated intervention effects against the corresponding bandwidths (Figure 6). This figure is useful for checking whether the RD estimation is sensitive to the selected bandwidth. For example, an RD study may be problematic if the estimation does not indicate a consistent intervention effect (Jacob et al., 2012). Later in this paper, we use the same visualization strategy as a sensitivity analysis to demonstrate the robustness of our results. Figure 6 summarizes this set of estimates and indicates the significance of the intervention effects using both 95% confidence intervals (gray dashed lines) and color-coding (significant effects are labeled in light purple). As shown in Figure 6, despite some fluctuations at bandwidths around 29 points, the baseline RD estimation indicates that the TA reach-out intervention increases students' exam scores by approximately four points, and the result remains stable from bandwidths of roughly 20 points to 50 points.
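A sketch of this bandwidth loop under the same hypothetical names; note that very narrow bandwidths may contain too few students for stable estimates:

```r
# Re-estimate Equation 2 at bandwidths from 2 to 60 points and collect
# the estimated jump with a 95% confidence interval at each bandwidth.
results <- do.call(rbind, lapply(2:60, function(bw) {
  d <- subset(dat, abs(cum_grade - 71.5) <= bw / 2)
  d$cum_centered <- d$cum_grade - 71.5
  fit <- lm(next_exam ~ cum_centered + reached_out, data = d)
  est <- coef(fit)["reached_out"]
  se  <- sqrt(diag(sandwich::vcovHC(fit)))["reached_out"]
  data.frame(bw = bw, estimate = est,
             lower = est - 1.96 * se, upper = est + 1.96 * se)
}))
head(results)  # plot estimate vs. bw to reproduce a Figure 6-style panel
```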

FIGURE 6. Robustness of estimation using baseline Regression Discontinuity (RD) models with significance levels and confidence intervals: Estimating the baseline RD model with bandwidths from 2 to 60 indicates that bandwidths between 26 and 48 result in an estimated treatment effect between 3 and 5 points on the next exam.

In fact, in many RD studies, researchers also consider nonparametric estimation with weighting (Calonico et al., 2014; Gelman and Zelizer, 2015). Nonparametric RD estimation relaxes some restrictive assumptions about specific link functions (e.g., the linearity used in this paper) on both sides of the threshold and allows algorithms to determine the relationship between outcomes and running variables. In this way, nonparametric estimation is flexible and less sensitive to model misspecification (Lee and Lemieux, 2010). Additionally, nonparametric RD improves the estimation of the local treatment effect by applying kernel weighting functions that assign higher weights to data points near the threshold while down-weighting data points away from the threshold.

In Supplemental Table S3, we summarize nonparametric RD estimation results using different bandwidths and various weighting algorithms (refer to the GitHub Markdown file for the detailed procedure). In general, as indicated in Supplemental Table S3, the estimated intervention effects from different nonparametric approaches range from 4.67 to 5.09 points, yielding a set of estimations similar to the parametric approach. Supplemental Figure S1 further visualizes the nonparametric estimation using simple linear weighting (triangular kernel) with a bandwidth of 26 points, indicating a similar intervention effect around the threshold. In practice, we suggest that DBER researchers explore different nonparametric approaches and report all estimates.
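A sketch of the nonparametric estimation with rdrobust; we assume here that the paper's 26-point bandwidth corresponds to a half-width of h = 13 in rdrobust's terms (this mapping, like the column names, is our assumption):

```r
library(rdrobust)

# Nonparametric local-linear RD with a triangular kernel. Note that
# rdrobust defines treatment as being above the cutoff, whereas the
# reach-out intervention targeted students below it, so the sign of
# the reported estimate may be flipped relative to gamma in the text.
rd_np <- rdrobust(y = dat$next_exam, x = dat$cum_grade,
                  c = 71.5, kernel = "triangular", h = 13)
summary(rd_np)
```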

It is worthwhile to note that nonparametric estimation focuses on hyper-local intervention effects by down-weighting data points away from the threshold. For this reason, however, the visualization of the nonparametric approach may not be as intuitive as the parametric approach introduced earlier in this paper. In practice, it is common to report estimations from both parametric and nonparametric RD, with support from multiple bandwidths, to increase the robustness of the RD estimations. Here, we chose to focus mostly on parametric estimations to highlight the logic of RD and help clarify the interpretation of coefficients. DBER researchers seeking to understand RD may focus on the basics of RD presented in this paper using parametric estimation and move to using both parametric and nonparametric estimations in their own work.

Beyond the baseline estimation: Fixed-effect model in RD framework

An RD design is able to address research questions beyond the main question, "does TA reaching out benefit students?" For example, in addition to understanding the impact of the intervention for students on average, we can also test the hypothesis that the TA reach-out intervention may disproportionately benefit minoritized groups compared with their majoritized peers. To achieve this analytical goal, we can add an interaction term between reach-out status and group identities, such as binary gender, racial/ethnic minority status, and first-generation status.6 The modified model has the general form,

$$Exam_i = \beta_0 + \beta_1\,CumGrade_i + \gamma\,Reachout_i + \lambda\,G_i + \delta\,(Reachout_i \times G_i) + \varepsilon_i \quad (4)$$

where G_i indicates one of the three group identities, with the value of 1 representing students from the minoritized group and 0 representing students from the majoritized group, and λ captures the main effect of group identity. To further unpack the estimation of the interaction analysis, similar to Equation 3, we plug the values of G_i into Equation 4 and obtain,

$$Exam_i = \beta_0 + \beta_1\,CumGrade_i + \gamma\,Reachout_i + \varepsilon_i \quad (4\text{-}1)$$
$$Exam_i = (\beta_0 + \lambda) + \beta_1\,CumGrade_i + (\gamma + \delta)\,Reachout_i + \varepsilon_i \quad (4\text{-}2)$$

We should note that both Equations 4-1 and 4-2 have specifications similar to Equation 2, where the regression coefficient of reach-out status, Reachout_i, estimates the effect of the classroom intervention. Therefore, the differential effect of the intervention on students from different identity groups (minoritized vs. majoritized) can be captured by the difference between the two regression coefficients of reach-out status. Deriving from Equations 4-1 and 4-2, the difference is (γ + δ) − γ = δ, which is the regression coefficient of the interaction term between reach-out status and group identity. Therefore, the estimation of the coefficient δ examines whether the intervention provides significantly more benefit to minoritized students.

The estimation of the interaction term is helpful in understanding whether the intervention reduces educational inequity between minoritized and majoritized groups. Considering a classroom with an overall positive intervention effect, if the intervention benefits students from different groups at a similar rate, any educational inequity that existed before the intervention will persist. If, however, the intervention instead benefits majoritized students at a higher rate than minoritized students, the inequity will be widened. Ideally, we expect that an intervention can benefit minoritized students at a higher rate than their advantaged peers, so as to reduce educational inequity. Table 3, Models 1 to 3 summarize the model estimation results using each of the group identities and their corresponding interaction terms. As shown in Table 3, none of the interaction terms is significant, indicating that, in general, the TA reach-out intervention has a similar benefit for all students. Importantly, Table 3, Models 1 and 3 reveal that both female students and racially/ethnically minoritized students, on average, scored three points lower than their male and racially majoritized peers on the next exam when they did not receive the reach-out intervention. Thus, the interpretation remains that, despite its overall positive effect for students on average, the reach-out intervention was not able to reduce pre-existing educational inequity by disproportionately supporting students from minoritized groups. We should note that the statistical power in RD designs is generally low, as the analytical sample is narrowed down near the threshold.7 This is particularly the case when more fixed effects are added to the model.
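A sketch of Equation 4 in R, reusing the `d26` sample constructed earlier and the hypothetical 0/1 indicator `female` as one example identity G:

```r
# Equation 4: the reached_out:female coefficient is delta, the
# differential intervention effect for the minoritized group.
fit_grp <- lm(next_exam ~ cum_centered + reached_out * female, data = d26)
summary(fit_grp)
```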

TABLE 3.

Addressing additional research questions using the fixed-effect Regression Discontinuity models.

Model 1: Female Model 2: First generation Model 3: Racial minority
Reached out (γ) 5.60 (2.72)a* 4.83 (2.26)* 3.92 (2.23)
Centered Cum Grade (β1) 1.32 (.140)*** 1.33 (.140)*** 1.30 (.141)***
Female −3.21 (1.22)**
Reached out × Female (δ) −1.52 (2.50)
First Gen 1.66 (1.28)
Reached out × First Gen (δ) −.365 (2.64)
Racial minority −3.52 (1.40)*
Reached out × Racial minority (δ) 2.56 (2.75)
Intercept (β0) 60.2 (1.51)*** 57.5 (1.29)*** 58.9 (1.29)***
N 1336 1336 1336
R2 .134 .127 .130

aRobust standard errors in parentheses

*** p<.001, ** p<.01, * p<.05, ∼ p<.1

Probing threats to internal validity: Sensitivity analyses

In interpreting RD results, researchers should be aware of potential threats to internal validity. Threats to internal validity refer to alternative explanations (i.e., alternatives to the intervention) for the observed model estimation results. While meeting all three RD assumptions may ensure a considerable level of internal validity, researchers should seek additional evidence that strengthens the internal validity of an RD study. In this section, we provide three analytical approaches to strengthen internal validity: 1) identifying additional discontinuities, 2) evaluating model specifications, and 3) assessing alternative outcomes.

Are there other discontinuities?

We first introduce the placebo test, which aims to examine naturally-existing discontinuities other than the intervention. In an RD study, it is possible that the observed discontinuity is not introduced by the external intervention but is inherently embedded in the sample per se. Following Lee (2008), we selected a hypothetical threshold at 85 points, where no intervention was introduced, and treated it as if it were a real threshold. We chose 85 because it is higher than the actual threshold but not so high that we are likely to approach the ceiling. It is expected that, if there are no pre-existing discontinuities, there should be no "jump" at the threshold of 85 points. Applying the same visualization strategy and bandwidth as in Figure 5, in Figure 7 we created two fitted lines for hypothetically treated students (72 to 84.5 points, the golden line) and hypothetically untreated students (85 to 98 points, the light purple line). As shown in Figure 7, there is no discontinuity at the hypothetical threshold of 85 points, indicating that the discontinuity originally observed in the main RD model is highly likely due to the external intervention.

FIGURE 7. Sensitivity analysis: placebo test using a hypothetical reach-out point at 85 points (Bandwidth = 26 points). There is no discontinuity in the regression line, thus the "jump" in Figure 5 is likely due to the intervention.

In practice, it is preferable to check multiple hypothetical thresholds to strengthen the internal validity. In the Supplemental Materials, we provide more visualizations of placebo tests using the same model specification; refer to Supplemental Figure S2 for more information.
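A sketch of the placebo test in R, under the same hypothetical names:

```r
# Placebo test at a hypothetical threshold of 85 points, where no
# intervention was actually assigned; a nonsignificant "jump" supports
# internal validity.
placebo <- 85
dp <- subset(dat, abs(cum_grade - placebo) <= 13)
dp$cum_centered  <- dp$cum_grade - placebo
dp$pseudo_treated <- as.integer(dp$cum_grade < placebo)

fit_placebo <- lm(next_exam ~ cum_centered + pseudo_treated, data = dp)
summary(fit_placebo)
```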

Considering structure of random effects

Second, the model estimation results from an RD study are also sensitive to model specifications (Lee and Lemieux, 2010). In this analysis, we specifically consider the structure of random effects, because student participants in DBER studies are oftentimes clustered (e.g., in classrooms or years) and thus are rarely independent (Theobald, 2018). In the TA reach-out example, we have data from two instructors (instructors 1 and 2) who taught the course in one or two of three quarters (Fall, Winter, or Spring). In the Fall, there were two sections of the course, so instructor 1 has two course sections (Fall A and Fall B) and instructor 2 has two course sections (Winter and Spring). Finally, the intervention was administered by TAs who led lab sections, where students met in small groups. As such, we consider three different specifications of random intercepts. The instructor random effects consider the most basic clustered structure, where students were only nested within instructors; the course section random effects consider the basic classroom setting; and the lab section random effects explore a more complex clustered structure, where students were nested within the smallest unit of the TA reach-out intervention.

Applying the visualization strategy exhibited in Figure 6, we created a side-by-side comparison of the three specifications of random effects in Figure 8. Overall, across alternative model specifications that involve random effects, the estimation of the intervention effect remains stable at most bandwidths, providing substantial evidence of internal validity. To select the most preferred random effect structure, DBER researchers should consider both the data structure and model-fit indices. First, the random effect structure should be based on clusters across which the outcome variable varies greatly and meaningfully. Second, when comparing all three model estimations, model-fit indices, such as AIC and BIC, may help researchers select the model with the best fit (Theobald, 2018). In the Supplemental Materials, we run the full models with these three different random effects structures and provide model selection details. Based on both the variation between groups and all model comparison metrics, Model b (course section random effect) is the preferable model; refer to Supplemental Table S4 for more details.
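A sketch of the three random-intercept specifications with lme4, assuming hypothetical grouping columns `instructor`, `course_section`, and `lab_section` in the `d26` sample:

```r
library(lme4)

# Three candidate random-intercept structures for the baseline model.
m_instructor <- lmer(next_exam ~ cum_centered + reached_out + (1 | instructor), data = d26)
m_course     <- lmer(next_exam ~ cum_centered + reached_out + (1 | course_section), data = d26)
m_lab        <- lmer(next_exam ~ cum_centered + reached_out + (1 | lab_section), data = d26)

# Compare fits (lower AIC/BIC indicates better fit).
AIC(m_instructor, m_course, m_lab)
BIC(m_instructor, m_course, m_lab)
```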

FIGURE 8. Sensitivity analysis: testing Regression Discontinuity model specifications with different random intercept structures. (a) Instructor random effect. (b) Course section random effect. (c) Lab section random effect.

Alternative learning outcomes

Finally, we can explore other outcome variables that might corroborate our assessment of the effectiveness of the intervention. We initially used RD to understand whether reaching out has an impact on the next exam score. It is possible that alternative learning outcomes might also be impacted by the TA reach-out intervention. In other words, if the RD model captures the "real" causal effects of the classroom intervention, we would expect to observe a similar effect of the intervention on related learning outcomes. For example, we can examine the impact of TA reaching out on the final exam score as an alternative learning outcome. As shown in Figure 9, in general, the TA reach-out intervention has a stable effect on final exam scores at bandwidths approximately ranging from 20 to 35 points. It is important to note that the TA reach-out intervention exhibits much smaller effects on students' final exam scores. This is not surprising, because in such short-term classroom intervention studies the effectiveness of the intervention usually fades with time. This finding may have implications for instructor and TA practice, but given that this is a paper describing the statistical methods of regression discontinuity, we leave the practice-level interpretation to the reader. Therefore, the sensitivity analysis with an alternative learning outcome can provide some evidence, despite the small effect size, to strengthen the internal validity.

FIGURE 9. Sensitivity analysis: testing Regression Discontinuity model specifications with alternative learning outcomes. Teaching assistant reaching out had an effect on final exam score (for similar bandwidths), but the effect was smaller than the effect of reaching out on the next exam.

CONCLUSION

The RD design is a viable tool for DBER researchers who administer an intervention on the basis of a running variable and are interested in understanding the causal effects of the intervention. The RD design is particularly useful whenever the random assignment approach or statistical controlling is not feasible, as it relaxes some strict assumptions/prerequisites by narrowing the sample down to students local to the threshold and comparing only students who are practically similar. Additionally, the RD design relies on interventions with clear cutoffs that are generally under the control of instructors or researchers, and thus provides a ready opportunity to evaluate the impact of the intervention. We hope that the extended example, the freely available code (https://github.com/TheobaldLab/RegressionDiscontinuity), and the explanation in this paper will help readers apply an RD design when evaluating the causal impacts of interventions.

Supporting information

cbe-25-rm1-s001.pdf (393.6KB, pdf)

ACKNOWLEDGMENTS

We thank Jon C. Herron, John W. Parks, and Mo S. Turner, who collected and provided the data used in the extended example. Additionally, this paper was greatly improved by friendly reviewers, including Jon C. Herron, Mo S. Turner, Roddy Theobald, and the members of the University of Washington Biology Education Research Group. This work was supported in part by the National Science Foundation under awards #2012792 and #2420369. Any opinions, findings, conclusions, or recommendations in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.

Footnotes

1In some studies, such an approach is often times referred to as Intention-to-Treat (ITT) analysis. Elsewhere in this paper, to understand the effects of classroom intervention design on students’ learning outcomes in a basic way, we will focus on ITT.

2A more complicated RD design can work for discrete running variables as well, but requires additional analytical process (e.g., see Kolesár & Rothe, 2018).

3We also tested whether each key background characteristic exhibited a “jump” at the threshold, following Dee and Penner (2017), to further check the assumption related to the balance of covariates. Results indicate that none of the key covariate exhibits discontinuity even at a wider bandwidth.

4The covariate balance analysis demonstrated in Table 1 can only provide a rough sense of the optimal bandwidth, as the covariate balance results greatly depend on available measures of students’ demographic information and the distribution of running/outcome variables across different student groups.

5In an RD study with substantial numbers of non-compliers, also called a fuzzy RD, a binary variable indicating whether individual student's running variable is above or below the threshold should be used regardless of whether they received the treatment or not. In a Sharp RD study, like the example presented here, these two binary indicators are identical.

6We understand that the binary measures of gender/racial identities do not capture all minoritized experiences (such as non-binary gender identities). Yet, for the purpose of explaining the application of the RD approach, we stick to the non-perfect definitions of minority groups.

7We acknowledge that the analytical sample in this extended example is uncommonly large, because the example aggregates data from four instructional sections.

REFERENCES

1. Angrist, J. D., & Lavy, V. (1999). Using Maimonides’ rule to estimate the effect of class size on scholastic achievement. The Quarterly Journal of Economics, 114(2), 533–575. 10.1162/003355399556061
2. Bartalotti, O. (2019). Regression discontinuity and heteroskedasticity robust standard errors: Evidence from a fixed-bandwidth approximation. Journal of Econometric Methods, 8(1), 20160007. 10.1515/jem-2016-0007
3. Calonico, S., Cattaneo, M. D., Farrell, M. H., & Titiunik, R. (2017). rdrobust: Software for regression-discontinuity designs. The Stata Journal, 17(2), 372–404. 10.1177/1536867X1701700208
4. Calonico, S., Cattaneo, M. D., & Titiunik, R. (2014). Robust nonparametric confidence intervals for regression-discontinuity designs. Econometrica, 82(6), 2295–2326. 10.3982/ECTA11757
5. Canning, E. A., Harackiewicz, J. M., Priniski, S. J., Hecht, C. A., Tibbetts, Y., & Hyde, J. S. (2018). Improving performance and retention in introductory biology with a utility-value intervention. Journal of Educational Psychology, 110(6), 834–849. 10.1037/edu0000244
6. Dee, T. S., & Penner, E. K. (2017). The causal effects of cultural relevance: Evidence from an ethnic studies curriculum. American Educational Research Journal, 54(1), 127–166. 10.3102/0002831216677002
7. Gelman, A., & Zelizer, A. (2015). Evidence on the deleterious impact of sustained use of polynomial regression on causal inference. Research & Politics, 2(1), 1–7. 10.1177/2053168015569830
8. Howell, W. G., & Peterson, P. E. (2006). The education gap: Vouchers and urban schools. Washington, DC: Brookings Institution Press.
9. Jacob, R., Zhu, P., Somers, M. A., & Bloom, H. (2012). A practical guide to regression discontinuity. New York, NY: MDRC. Retrieved from https://www.mdrc.org/work/publications/practical-guide-regression-discontinuity/file-full
10. Jordt, H., Eddy, S. L., Brazil, R., Lau, I., Mann, C., Brownell, S. E., King, K., & Freeman, S. (2017). Values affirmation intervention reduces achievement gap between underrepresented minority and white students in introductory biology classes. CBE—Life Sciences Education, 16(3), ar41. 10.1187/cbe.16-12-0351
11. Kolesár, M., & Rothe, C. (2018). Inference in regression discontinuity designs with a discrete running variable. American Economic Review, 108(8), 2277–2304. 10.1257/aer.20160945
12. Krueger, A. B. (1999). Experimental estimates of education production functions. The Quarterly Journal of Economics, 114(2), 497–532. 10.1162/003355399556052
13. Lee, D. S. (2008). Randomized experiments from non-random selection in U.S. House elections. Journal of Econometrics, 142(2), 675–697. 10.1016/j.jeconom.2007.05.004
14. Lee, D. S., & Lemieux, T. (2010). Regression discontinuity designs in economics. Journal of Economic Literature, 48(2), 281–355. 10.1257/jel.48.2.281
15. McCrary, J. (2008). Manipulation of the running variable in the regression discontinuity design: A density test. Journal of Econometrics, 142(2), 698–714. 10.1016/j.jeconom.2007.05.005
16. Meaders, C. L., Mendez, L., Aguilar, A. G., Rivera, A. T., Vasquez, I., Mueller, L. O., & Owens, M. T. (2025). An asynchronous chemistry-in-biology intervention improves student content knowledge and performance in introductory biology. CBE—Life Sciences Education, 24(1), ar2. 10.1187/cbe.24-05-0151
17. O'Neal, C., Wright, M., Cook, C., Perorazio, T., & Purkiss, J. (2007). The impact of teaching assistants on student retention in the sciences: Lessons for TA training. Journal of College Science Teaching, 36(5), 24–29.
18. Page, L. C., Kehoe, S. S., Castleman, B. L., & Sahadewo, G. A. (2019). More than dollars for scholars. Journal of Human Resources, 54(3), 683–725. 10.3368/jhr.54.3.0516.7935r1
19. Perlmutter, L., Salac, J., & Ko, A. J. (2023). “A field where you will be accepted”: Belonging in student and TA interactions in post-secondary CS education. Proceedings of the 2023 ACM Conference on International Computing Education Research V.1, 356–370. 10.1145/3568813.3600128
20. Shadish, W. R., Cook, T. D., & Campbell, D. T. (2002). Experimental and quasi-experimental designs for generalized causal inference (2nd ed.). Boston, MA: Houghton Mifflin.
21. Solanki, S. M., & Xu, D. (2018). Looking beyond academic performance: The influence of instructor gender on student motivation in STEM fields. American Educational Research Journal, 55(4), 801–835. 10.3102/0002831218759034
22. Stang, J. B., & Roll, I. (2014). Interactions between teaching assistants and students boost engagement in physics labs. Physical Review Special Topics – Physics Education Research, 10(2), 020117. 10.1103/PhysRevSTPER.10.020117
23. Stanich, C. A., Pelch, M. A., Theobald, E. J., & Freeman, S. (2018). A new approach to supplementary instruction narrows achievement and affect gaps for underrepresented minorities, first-generation students, and women. Chemistry Education Research and Practice, 19(3), 846–866. 10.1039/C8RP00044A
24. Theobald, E. (2018). Students are rarely independent: When, why, and how to use random effects in discipline-based education research. CBE—Life Sciences Education, 17(3), rm2. 10.1187/cbe.17-12-0280
25. Theobald, E. J., Crowe, A., HilleRisLambers, J., Wenderoth, M. P., & Freeman, S. (2015). Women learn more from local than global examples of the biological impacts of climate change. Frontiers in Ecology and the Environment, 13(3), 132–137. 10.1890/140261
26. Theobald, R., & Freeman, S. (2014). Is it the intervention or the students? Using linear regression to control for student characteristics in undergraduate STEM education research. CBE—Life Sciences Education, 13(1), 41–48. 10.1187/cbe-13-07-0136
27. What Works Clearinghouse. (2022). What Works Clearinghouse procedures and standards handbook, version 5.0.
28. Wheeler, L. B., Maeng, J. L., Chiu, J. L., & Bell, R. L. (2017). Do teaching assistants matter? Investigating relationships between teaching assistants and student outcomes in undergraduate science laboratory classes. Journal of Research in Science Teaching, 54(4), 463–492. 10.1002/tea.21373
