2024 Jul 24;78(1):141–166. doi: 10.1111/bmsp.12355

Average treatment effects on binary outcomes with stochastic covariates

Christoph Kiefer 1, Marcella L Woud 2, Simon E Blackwell 2, Axel Mayer 1
PMCID: PMC11701421  PMID: 39045798

Abstract

When evaluating the effect of psychological treatments on a dichotomous outcome variable in a randomized controlled trial (RCT), covariate adjustment using logistic regression models is often applied. In the presence of covariates, average marginal effects (AMEs) are often preferred over odds ratios, as AMEs yield a clearer substantive and causal interpretation. However, standard error computation of AMEs neglects sampling‐based uncertainty (i.e., covariate values are assumed to be fixed over repeated sampling), which leads to underestimation of AME standard errors in other generalized linear models (e.g., Poisson regression). In this paper, we present and compare approaches allowing for stochastic (i.e., randomly sampled) covariates in models for binary outcomes. In a simulation study, we investigated the quality of the AME and stochastic‐covariate approaches focusing on statistical inference in finite samples. Our results indicate that the fixed‐covariate approach provides reliable results only if there is no heterogeneity in interindividual treatment effects (i.e., no treatment–covariate interactions), while the stochastic‐covariate approaches are preferable in all other simulated conditions. We provide an illustrative example from clinical psychology: an RCT investigating the effect of a cognitive bias modification training on post‐traumatic stress disorder while accounting for patients' anxiety.

Keywords: average marginal effects, causal inference, logistic regression model, statistical inference

1. INTRODUCTION

In (clinical) psychology and social sciences, the effects of an intervention, prevention, or treatment on a dichotomous outcome variable are often investigated using randomized controlled trials (RCTs). Examples include the effect of psychological interventions in prison on criminal recidivism (Beaudry et al., 2021), the effect of a drinking prevention strategy for college students on abstinence and occurrence of heavy drinking episodes (Larimer et al., 2007), the effect of interventions on smoking cessation (Dijkstra et al., 1998; Osch et al., 2008), and the effect of cognitive behavioral therapy on panic disorder with agoraphobia (Gloster et al., 2011).

Covariate adjustment is common in many of these RCTs for dichotomous outcomes, typically based on logistic regression models. The reasons for covariate adjustment are twofold: it can help to examine moderating effects of covariates or heterogeneity of treatment effects conditional on the covariate (or even on an individual level; see, for example, Wester et al., 2022) or it can be used to obtain effect estimates taking into account a predictive covariate, which can increase the power and thus reduce sample size requirements (Hernández et al., 2004; Moore & van der Laan, 2009). While covariate adjustment in nonlinear models (including logistic regression models) can result in higher efficiency, it is often cautioned that it also introduces bias into the effect estimates in small samples and should be avoided in situations where precision is most important (Imbens & Rubin, 2015; Robinson & Jewell, 1991). Recently, Negi and Wooldridge (2021) put forward that covariate adjustment in logistic regression models can indeed improve both efficiency and precision if the model is correctly specified. In addition, simulation studies have repeatedly shown consistency of treatment effects even from slightly misspecified logistic regression models (e.g., Negi & Wooldridge, 2021; Rosenblum & van der Laan, 2010). In this context, Negi and Wooldridge (2021) recommend always including treatment–covariate interactions unless one is sure that the model is correctly specified without them.

Logistic regression models can be considered as a part of the family of generalized linear models (McCullagh & Nelder, 1998; Nelder & Wedderburn, 1972) with a logistic link function and binomial distributed random component. They can be used to model the probability of the outcome occurring (usually coded as 1) conditional on a treatment, covariates, and possible covariate–treatment interactions. In logistic regression models, it is common to inspect treatment effects defined as risk ratios, odds ratios, or log odds ratios. Recent research suggests that these ratio effects are often misinterpreted in applied research (Niu, 2020) and it is often cautioned (Greenland et al., 1999; Hanmer & Ozan Kalkan, 2013; Mood, 2010) that these ratio effects might not even have a causal interpretation, especially when adjusting for covariates in an RCT.

Average marginal effects (AMEs) are suggested as an alternative to the odds ratio (Greenland et al., 1999; Hanmer & Ozan Kalkan, 2013; Mood, 2010; Norton & Dowd, 2018). A marginal effect is the difference between the conditional probability of the outcome given treatment and given control for a given observation. The mean of such marginal effects is called the AME and can easily be interpreted on the probability scale of the dichotomous outcome. For example, an AME of 5% would indicate that the probability of the outcome occurring is, on average, 5 percentage points higher under treatment. Functions for estimating AMEs from logistic regression models are implemented in several statistical software packages, for example, the margins command in Stata (Williams, 2012) or the margins package in R (Leeper et al., 2021).

It is important to note that the estimation of AMEs does not account for sampling‐based uncertainty (Abadie et al., 2020). Sampling‐based uncertainty refers to sources of variability that are due to the sampling process. For example, if we want to gather a random sample of 100 participants for our RCT we could either predetermine a gender proportion and use stratified sampling (e.g., exactly 50 female participants) or we could randomly sample persons. In the latter case, the gender proportions are stochastic and may vary from sample to sample. While Wooldridge (2010) acknowledged that accounting for sampling‐based uncertainty is technically correct but noted that 'the adjustment may have a small effect' (p. 919), Mayer and Thoemmes (2019) showed that properly accounting for this sampling‐based uncertainty in group sizes of categorical covariates can improve statistical inferences on average treatment effects. The improvement of statistical inferences can also be observed when treating continuous covariates as stochastic, that is, when the observed values might change from sample to sample. If the sampling‐based uncertainty is neglected, standard errors tend to be underestimated, coverage of confidence intervals (CIs) can drop below the nominal level, and Type I errors can be inflated, especially if a strong treatment–covariate interaction (and, hence, heterogeneous treatment effects) is present. This has been shown in linear (Chen, 2006; Liu et al., 2017) and Poisson regression models (Kiefer & Mayer, 2019). However, a thorough investigation of this phenomenon for stochastic covariates in logistic regression models is lacking.

In this paper, we present and examine different approaches for estimating average treatment effects accounting for stochastic covariates using logistic regression models. First, we provide definitions of the average (causal) treatment effect on a dichotomous outcome as a difference in conditional probabilities in an RCT. In addition, we show that the variability of conditional effects is limited in this scenario. Second, we explain how these conditional probabilities can be estimated with and without accounting for a covariate, that is, we briefly recapitulate the statistics of contingency tables and logistic regression models. Third, we introduce three estimation approaches for the average treatment effect based on the estimated conditional probabilities. These are a simple difference‐in‐means estimator, approaches relying on the sample average of conditional effects (including the AME, but also a variant treating covariates as stochastic), and we propose a new moment‐based approach which accounts for sampling‐based uncertainty by definition. Fourth, we present a simulation study comparing the proposed estimators with a focus on differences between approaches with stochastic and fixed covariate. Fifth, we provide an empirical illustration of the estimators using data from clinical psychology (Woud et al., 2021) investigating the effect of cognitive bias modification training on symptoms of post‐traumatic stress disorder (PTSD) while accounting for patients' anxiety. Finally, we discuss the findings, implications, and limitations of this study.

2. DEFINITION AND TERMINOLOGY OF TREATMENT EFFECTS

Throughout this paper, we focus on the average effect of a randomized treatment on a dichotomous outcome variable. Our notation refers to the stochastic theory of causal effects (Steyer et al., 2022), which is similar to the Neyman–Rubin causal model (Rubin, 2005), but with a stronger emphasis on definitions and notation based on probability theory.

We consider a scenario with a binary outcome variable Y, a binary and randomized treatment X (with levels X=0 for control and X=1 for treatment) and a single continuous covariate Z. When considering a randomized experiment, the causal average treatment effect (ATE) of the treatment (X=1) compared to the control group (X=0) can be computed in different ways, two of which are:

ATE = P(Y=1 | X=1) − P(Y=1 | X=0) (1)
    = E[P(Y=1 | X=1, Z) − P(Y=1 | X=0, Z)] = E[CE(Z)]. (2)

Equation (1) shows a simple difference‐in‐means formulation, where we do not condition on any covariates. For example, an ATE of .05 could reflect a 5% difference between a probability of the outcome of 50% under control and 55% under treatment, but also between 80% and 85%. When accounting for a (continuous) covariate Z, the ATE can be computed as the expectation of the Z‐conditional effects, as shown in Equation (2). That is, the effect function CE(Z) represents treatment effects given specific values of Z. We will show below that this technique is used in the marginal effects approach. Note that Equations (1) and (2) hold because we assume randomized assignment to the treatment. Without randomized assignment, the right‐hand sides of these equations can be causally biased and would represent a prima facie effect (for more details, see, for example, Steyer et al., 2000).

Note that in clinical research the ATE is referred to as absolute risk reduction and a related measure is sometimes used, namely, the number needed to treat (NNT; Hutton, 2000). The NNT reflects the number of persons needed to be treated – on average – to yield one person more to benefit from the treatment compared to the control group. It can be computed as the inverse of the ATE, that is, NNT = 1/ATE. For example, if we have an ATE of .2 (i.e., the probability of success is 20% higher in treatment than in control), the corresponding NNT is 5, meaning that we need to treat five persons to get – on average – one beneficial outcome more than in the control group.
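The NNT computation is a one‐line transformation of the ATE. A minimal Python sketch (function name invented for illustration):

```python
def number_needed_to_treat(ate):
    """Number needed to treat: the inverse of the average treatment effect."""
    if ate == 0:
        raise ValueError("NNT is undefined for an ATE of zero")
    return 1.0 / ate

print(number_needed_to_treat(0.2))  # an ATE of .2 corresponds to an NNT of 5
```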

Both ways of computing the ATE – as presented in Equations (1) and (2) – are equivalent, but the two equations point to different possible ways to statistically model the dependencies among the outcome Y, treatment X, and possibly covariate Z and to estimate the ATE. This might lead to estimators with systematically different results for statistical inference. For example, the simple difference‐in‐means from Equation (1) does not contain any information about the variability of the individual treatment effects. That is, if a treatment works differently for different people, we have to condition on covariates in order to explore this variability, as is done via the effect function CE(Z) in Equation (2). This additional information can improve precision and efficiency in estimating the ATE, but the actual improvement depends on a number of factors, such as correct specification of the model and balance of the treatment groups (Negi & Wooldridge, 2021).

One factor, which is also important for statistical inferences on ATEs, is heterogeneity of the conditional treatment effects. If treatment effects conditional on a covariate vary a lot, accounting for this covariate will increase the efficiency of treatment effect point estimation (Hernández et al., 2004; Negi & Wooldridge, 2021), but it has also been shown to have an effect on coverage rates of CIs and the empirical detection rate (i.e., Type I and II errors; for example, for Poisson regressions, see Kiefer & Mayer, 2019). Thus, variability of conditional effects is an important concept that we will examine more closely in the next section.

2.1. Variance of conditional treatment effects

Obviously, the ATE for binary outcomes is always bounded between −1 and 1 (i.e., ATE ∈ [−1, 1]). Consequently, the variance of conditional treatment effects is bounded for dichotomous outcomes, too. That is, regardless of the predictive power of the covariate, the maximum variance of the conditional effects for dichotomous outcomes is limited due to the binary nature of the outcome. Generally, the variability of the conditional treatment effects is given by the variance of the conditional effect function,

Var[CE(Z)] = Var[P(Y=1 | X=1, Z) − P(Y=1 | X=0, Z)],

which is generally bounded between 0 and 1 (i.e., Var[CE(Z)] ∈ [0, 1]). The supremum (i.e., the least upper bound) of the variance of conditional effects given a specific ATE is

sup{Var[CE(Z)] : ATE ∈ [−1, 1]} = 1 − ATE². (3)

This means that the greater the absolute value of the ATE is, the less variance of conditional effects is possible. Conversely, an ATE of zero will yield the highest possible variance of conditional effects. A derivation of the supremum is given in Appendix A. This formula can help to decide what amount of variance is to be considered ‘small’ or ‘large’ in a certain scenario. For example, an actual effect variance of .4 might be ‘large’ if the maximum variance is also .4, but ‘small’ if the maximum variance is 1.0. In the simulation study below, we show that the proportion of effect variance relative to the maximum possible variance is an important measure in deciding whether effect variability affects point and standard error estimation.
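The bound in Equation (3) and the notion of relative effect variance can be sketched in a few lines of Python (function names are illustrative, not from the paper):

```python
def max_effect_variance(ate):
    """Supremum of Var[CE(Z)] for a given ATE on a binary outcome (Equation 3)."""
    return 1.0 - ate ** 2

def relative_effect_variance(var_ce, ate):
    """Observed effect variance as a share of the maximum possible variance."""
    return var_ce / max_effect_variance(ate)

print(max_effect_variance(0.5))            # 0.75: a larger |ATE| leaves less room
print(relative_effect_variance(0.4, 0.0))  # 0.4 relative to a maximum of 1.0
```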

3. ESTIMATION OF CONDITIONAL PROBABILITIES

In the previous section we provided nonparametric effect definitions based on the conditional probability P(Y=1|X) or P(Y=1|X,Z), respectively. In an applied scenario, we need a statistical model for estimating these conditional probabilities, for example, based on a contingency table or using a logistic regression model, respectively.

3.1. Contingency table

Given a sample of N independent and identically distributed (i.i.d.) observations of a binary outcome variable Y and a binary treatment variable X, we can compute the absolute frequencies Nyx of observations of each of the four possible combinations of X=x and Y=y:

N00 := Σ_{i=1}^{N} (1−Yi)·(1−Xi),  N10 := Σ_{i=1}^{N} Yi·(1−Xi),  N01 := Σ_{i=1}^{N} (1−Yi)·Xi,  N11 := Σ_{i=1}^{N} Yi·Xi.

These frequencies are typically illustrated with a contingency table as in Table 1.

TABLE 1.

Contingency table of a dichotomous outcome Y and a dichotomous treatment X.

                      Outcome
Treatment     Y = 0     Y = 1
X = 0         N00       N10       N·0
X = 1         N01       N11       N·1
              N0·       N1·       N

Note: The Nyx represent the absolute frequencies of observations within the combination of Y=y and X=x. Marginal frequencies of observations for Y=y are denoted by Ny·, or by N·x for X=x.

The marginal frequencies Ny· for Y=y and N·x for X=x are defined as column or row sums, respectively. Based on these absolute cell and marginal frequencies, we can estimate the conditional probabilities used in Equation (1) as relative frequencies:

P^(Y=1 | X=0) = N10/N·0,  P^(Y=1 | X=1) = N11/N·1. (4)

This procedure is equivalent to computing the group‐specific means of Y for the treatment and control group.
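The cell frequencies and relative frequencies above are straightforward to compute. The following sketch is in Python (rather than the R environment used elsewhere in the paper), with simulated data and success probabilities chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 1000
X = rng.integers(0, 2, N)                         # randomized binary treatment
Y = rng.binomial(1, np.where(X == 1, 0.6, 0.5))   # assumed success probabilities

# Cell frequencies N_yx as in Table 1
N00 = int(np.sum((1 - Y) * (1 - X)))
N10 = int(np.sum(Y * (1 - X)))
N01 = int(np.sum((1 - Y) * X))
N11 = int(np.sum(Y * X))

# Conditional probabilities as relative frequencies (Equation 4)
p_control = N10 / (N00 + N10)    # estimate of P(Y=1 | X=0)
p_treated = N11 / (N01 + N11)    # estimate of P(Y=1 | X=1)
```

As noted above, these relative frequencies coincide with the group‐specific means of Y.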

Before we move on to estimation and statistical inference for the ATE from these quantities, we consider how to model the conditional probabilities given in Equation (2) involving a continuous covariate.

3.2. Logistic regression model

In addition to Y and X, we now also consider an i.i.d. sampled continuous covariate Z. The conditional probability of Yi=1 of observation i, given the treatment and the covariate, is often modelled with a logistic regression model with treatment variable, covariate, and a treatment–covariate interaction as predictors:

πi := P(Yi=1 | Xi, Zi) = exp(γ00 + γ10·Xi + γ01·Zi + γ11·Xi·Zi) / [1 + exp(γ00 + γ10·Xi + γ01·Zi + γ11·Xi·Zi)]

with the vector of parameters γ=(γ00,γ10,γ01,γ11).

Commonly, the regression coefficients γ are estimated using a maximum likelihood approach based on the generalized linear model framework (McCullagh & Nelder, 1998; Nelder & Wedderburn, 1972). The log‐likelihood function for the logistic regression model is

logL(γ) = Σ_{i=1}^{N} [Yi·log(πi) + (1−Yi)·log(1−πi)],

which has to be maximized iteratively. For a semiparametric estimation alternative, see, for example, Basu and Rathouz (2005).

The standard errors can be obtained via the Hessian matrix HLL of the log‐likelihood function evaluated at the maximum likelihood estimate, that is,

HLL(γ^) = −Σ_{i=1}^{N} π^i·(1−π^i)·xi·xiᵀ

where xi=(1,Xi,Zi,Xi·Zi)T (Miller, 2021), and the covariance matrix of the estimator is given by

Var(γ^) = (1/N)·[−(1/N)·HLL(γ^)]⁻¹

where the square root of the diagonal elements of Var(γ^) gives the respective standard errors. Negi and Wooldridge (2021) recommend using a robust estimator for the covariance matrix if it is used for treatment effect estimation – for example, a sandwich estimator as described by Stefanski and Boos (2002) which is implemented in many statistical software packages (e.g., the sandwich package in R by Zeileis, 2006).
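As a sketch of this estimation machinery (not the authors' implementation, which relies on R), the following Python code maximizes the log‐likelihood by Newton–Raphson iterations and obtains standard errors from the inverse of the negative Hessian; all simulated data and parameter values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
N = 2000
X = rng.integers(0, 2, N).astype(float)               # randomized treatment
Z = rng.normal(0.0, 1.0, N)                           # i.i.d. sampled covariate
design = np.column_stack([np.ones(N), X, Z, X * Z])   # x_i = (1, X_i, Z_i, X_i*Z_i)

gamma_true = np.array([-0.5, 0.8, 0.4, 0.3])          # illustrative parameter values
Y = rng.binomial(1, 1.0 / (1.0 + np.exp(-design @ gamma_true)))

# Newton-Raphson iterations on the log-likelihood
gamma_hat = np.zeros(4)
for _ in range(25):
    pi_hat = 1.0 / (1.0 + np.exp(-design @ gamma_hat))
    score = design.T @ (Y - pi_hat)                   # gradient of log L
    fisher = design.T @ (design * (pi_hat * (1 - pi_hat))[:, None])  # -H_LL
    gamma_hat = gamma_hat + np.linalg.solve(fisher, score)

# Var(gamma_hat) = (-H_LL)^(-1); square roots of the diagonal are the SEs
cov_gamma = np.linalg.inv(fisher)
se_gamma = np.sqrt(np.diag(cov_gamma))
```

This uses the model‐based covariance matrix; following the recommendation above, a robust sandwich estimator could be substituted for `cov_gamma`.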

Notice that the direct interpretation of the regression coefficients γ can be challenging in logistic regressions in the presence of covariates and treatment–covariate interactions (Mood, 2010). Thus, it is often recommended to subsequently estimate the ATE (e.g., Hanmer & Ozan Kalkan, 2013). In the following section, we show how the ATE can be estimated from both the contingency table and the logistic regression model.

4. ESTIMATES AND STANDARD ERRORS OF TREATMENT EFFECTS

In this section, we present the estimation of and statistical inference on the ATE based on the estimated conditional probabilities from the previous section.

4.1. Simple difference‐in‐means

The simple difference‐in‐means estimator is straightforward to obtain by plugging the estimated probabilities from Equation (4) into the effect formula from Equation (1):

ATE^SDM = P^(Y=1 | X=1) − P^(Y=1 | X=0) = N11/N·1 − N10/N·0.

ATE^SDM is a simple and consistent estimator of the ATE. The corresponding standard error is

SESDM = √( N11·N01/N·1³ + N10·N00/N·0³ )

(see, for example, Fleiss et al., 2003, p. 60) and can, for example, be used to construct a 95% CI around ATE^SDM. In addition, one can test the hypothesis that both conditional probabilities are equal (which corresponds to an ATE of zero) using a two‐proportions test as described by Fleiss et al. (2003, p. 54). For an overview and comparison of further approaches to testing this hypothesis, see Newcombe (1998).
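A minimal Python sketch of this estimator and its standard error, using a hypothetical 2×2 table with 100 persons per group:

```python
import math

def ate_sdm(N00, N10, N01, N11):
    """Difference-in-means estimate of the ATE and its standard error."""
    n0 = N00 + N10          # group size N_.0 (control)
    n1 = N01 + N11          # group size N_.1 (treatment)
    ate = N11 / n1 - N10 / n0
    se = math.sqrt(N11 * N01 / n1 ** 3 + N10 * N00 / n0 ** 3)
    return ate, se

# Hypothetical cell counts: 55/100 successes in control, 65/100 under treatment
ate, se = ate_sdm(N00=45, N10=55, N01=35, N11=65)
ci = (ate - 1.96 * se, ate + 1.96 * se)   # approximate 95% confidence interval
```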

However, effect estimation is more challenging for the ATE when a continuous covariate within a logistic regression model is involved. In the following, we present two general approaches to estimating the ATE as the expectation of the conditional effect function CE(Z), namely, by means of computing a sample average or by integration over the covariate's domain.

4.2. Sample average over conditional effects

One possible way to translate the expectation over conditional effects from Equation (2) into a statistical procedure is by taking the sample average of the conditional effects (SACE) as an estimator for the ATE:

ATE^SACE = (1/N)·Σ_{i=1}^{N} [P^(Y=1 | X=1, Zi) − P^(Y=1 | X=0, Zi)]. (5)

ATE^SACE is a consistent estimator of the ATE. The derivation of the estimator for this case, but also for cases with multiple covariates, can be found, for example, in Wooldridge (2010, Ch. 21).

For ATE^SACE, the covariate Z is still treated as an i.i.d. sampled random variable, which we call a stochastic covariate. The standard error for ATE^SACE is given by

SESACE = √( ∇ATE^SACE(γ^)·Var(γ^)·∇ATE^SACE(γ^)ᵀ + (1/N)·Var^[CE(Zi)] ), (6)

where ∇ denotes the gradient of a function. This standard error formula has been derived and proven, for example, by Bartlett (2018), Basu and Rathouz (2005), Terza (2016), and Wooldridge (2010).

However, there exists a simplified version of ATE^SACE which is predominantly used in the econometrics and biostatistics literature and is more commonly implemented in statistical software – e.g., in the margins package in R (Leeper et al., 2021) or the margins command in Stata (Williams, 2012) – than ATE^SACE. In this approach, the AME is used as estimator for the ATE. The AME estimator is based on the estimated regression coefficients γ^ and observed values zi of Z in a specific sample:

ATE^AME(γ^) = (1/N)·Σ_{i=1}^{N} [ exp(γ^00 + γ^10 + γ^01·zi + γ^11·zi) / (1 + exp(γ^00 + γ^10 + γ^01·zi + γ^11·zi)) − exp(γ^00 + γ^01·zi) / (1 + exp(γ^00 + γ^01·zi)) ].

This estimator looks very similar to the one given in Equation (5) and, in fact, the point estimates from both estimators are identical. Thus, the AME is also a consistent estimator of the ATE (Greene, 2012). However, there are two important differences between ATE^SACE and ATE^AME. First, the lower‐case zi emphasizes that only a certain set of observed covariate values is considered and not a set of random variables. The covariate is treated as fixed by design. Therefore, the AME is not meant to generalize beyond the sample at hand. Second, it is possible to derive the standard error for ATE^AME by simply using the delta method (for an introduction, see Raykov & Marcoulides, 2004). However, this approach neglects sampling‐based uncertainty, because the observed zi are not considered to be stochastic (i.e., as randomly sampled):

SEAME = √( ∇ATE^AME(γ^)·Var(γ^)·∇ATE^AME(γ^)ᵀ ).

The terms under the root are identical to the first part in the standard error formula for the SACE given in Equation (6), but the second part is omitted here. Thus, one can immediately see that these standard errors provide different results if there is considerable variance in the conditional effects (i.e., if the treatment works differently for different persons). In these cases, neglecting sampling‐based uncertainty will yield underestimated standard errors for the AME and, in turn, results in flawed inferences, for instance, poor coverage rates and inflation of Type I errors (as previously shown for Poisson regressions by Kiefer & Mayer, 2019). Conversely, if conditional effects are homogeneous (i.e., there is no effect variance), both standard error formulas should provide similar estimates even if the covariate was randomly sampled.
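The difference between the two standard error formulas can be made concrete in a Python sketch. The coefficient estimates and the (diagonal) covariance matrix below are hypothetical values invented for illustration; the delta‐method part is obtained with a numerical gradient:

```python
import numpy as np

# Hypothetical quantities from a fitted logistic regression (illustrative only)
gamma_hat = np.array([-0.5, 0.8, 0.4, 0.3])        # (g00, g10, g01, g11)
cov_gamma = np.diag([0.020, 0.030, 0.010, 0.015])  # assumed Var(gamma_hat)

rng = np.random.default_rng(3)
z = rng.normal(0.0, 1.0, 500)                      # observed covariate values

def sigmoid(eta):
    return 1.0 / (1.0 + np.exp(-eta))

def conditional_effects(g):
    """CE(z_i): predicted probability under treatment minus under control."""
    return sigmoid(g[0] + g[1] + (g[2] + g[3]) * z) - sigmoid(g[0] + g[2] * z)

def ame(g):
    return conditional_effects(g).mean()

# Delta-method part shared by SE_AME and SE_SACE (numerical gradient w.r.t. gamma)
eps = 1e-6
grad = np.zeros(4)
for k in range(4):
    step = np.zeros(4)
    step[k] = eps
    grad[k] = (ame(gamma_hat + step) - ame(gamma_hat - step)) / (2 * eps)
se_ame = np.sqrt(grad @ cov_gamma @ grad)

# SE_SACE adds the sampling-based uncertainty of the covariate (Equation 6)
ce = conditional_effects(gamma_hat)
se_sace = np.sqrt(se_ame ** 2 + ce.var(ddof=1) / len(z))
```

Whenever the conditional effects vary, `se_sace` exceeds `se_ame`; with homogeneous effects the correction term vanishes and the two coincide.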

4.3. Integral over conditional effects

An alternative to the sample averages over conditional effects presented above is to estimate the ATE via an integral using moments of the covariate Z. In linear regression, this procedure simplifies to computing the conditional effect at the expectation of Z (Liu et al., 2017; Mayer et al., 2016). In this case, it suffices to estimate the mean μZ of Z and to evaluate the respective conditional effect function at this value. This procedure has the additional benefit that sampling‐based uncertainty in the covariate can be accounted for by including the standard error of μ^Z in the computation.

In nonlinear regression models, such moment‐based approaches usually require a distributional assumption for the covariate and potentially more than one moment (e.g., expectation and variance; for a moment‐based approach for Poisson regression models, see Kiefer & Mayer, 2019, 2021b). In addition, there is not always a convenient analytical solution as in the linear and Poisson regression case. To our knowledge, no moment‐based approach for logistic regressions has been proposed. Such approaches have been found to outperform sample average‐based estimators under specific conditions (e.g., Kiefer & Mayer, 2019), because the distributional assumption provides additional information unless it is violated. Thus, we derive such a moment‐based estimator for logistic regression models in the following.

For the logistic regression model, the ATE from Equation (2) based on the i.i.d. sampled variables can be rewritten to emphasize the meaning of the unconditional expectation as an integral:

ATE = E[CE(Z)] = ∫z CE(z)·pZ(z) dz, (7)

where pZ(z) denotes the density or probability density function of the covariate Z. Note that this approach treats the covariate Z as stochastic by construction, as we are not looking at observed conditional effects, but consider all conditional effects on the domain of the covariate weighted by its density function. The moment‐based estimator for the ATE can then be written as

ATE^MOM(γ^, θ^) = ∫z CE(z, γ^)·pZ(z, θ^) dz (8)
≈ Σ_{i=1}^{M} wi·CE(zi, γ^), (9)

where Equation (8) shows estimation of the ATE with the integral over the product of the conditional effect function and the density function of Z using the parameter estimates γ^ and θ^. This is a consistent estimator of the ATE (for a proof, see Appendix B). However, the computation of this estimator usually requires numerical integration as for most densities there is no closed‐form solution. For example, if Z is normally distributed with estimated parameters θ^=(μ^,σ^) the density is well known, but no analytical solution exists to compute the integral.

In Equation (9) a numerical approximation of ATE^MOM is shown using a finite sum over M integration points zi and weights wi. There exist various techniques to compute the integration points and weights, with several variants of Gaussian quadrature (e.g., Gauss–Kronrod, Gauss–Hermite) being readily implemented in most statistical software packages, for instance, the integrate function in R. However, the approximation of this estimator shown in Equation (9) can introduce bias if the numerical integration technique is inadequate. In Appendix C, we provide examples for unidimensional integration techniques of the proposed moment‐based approaches and also discuss settings with multiple covariates, where multidimensional integration is required.
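Assuming a normally distributed covariate, the approximation in Equation (9) can be computed with Gauss–Hermite quadrature. The following Python sketch uses hypothetical coefficient and moment estimates and checks the quadrature result against a large Monte Carlo sample average over the same distribution:

```python
import numpy as np
from numpy.polynomial.hermite import hermgauss

# Hypothetical estimates: logistic coefficients and normal covariate moments
gamma_hat = np.array([-0.5, 0.8, 0.4, 0.3])
mu_hat, sigma_hat = 0.2, 1.1                     # theta_hat = (mu, sigma)

def sigmoid(eta):
    return 1.0 / (1.0 + np.exp(-eta))

def ce(z, g):
    """Conditional effect function CE(z) of the logistic model."""
    return sigmoid(g[0] + g[1] + (g[2] + g[3]) * z) - sigmoid(g[0] + g[2] * z)

# Gauss-Hermite quadrature of CE(z) against the N(mu, sigma^2) density:
# substitute z = mu + sqrt(2)*sigma*x and rescale the weights by 1/sqrt(pi)
nodes, weights = hermgauss(40)                   # M = 40 integration points
z_points = mu_hat + np.sqrt(2.0) * sigma_hat * nodes
ate_mom = np.sum(weights * ce(z_points, gamma_hat)) / np.sqrt(np.pi)

# Sanity check against a large Monte Carlo average over the same distribution
z_mc = np.random.default_rng(4).normal(mu_hat, sigma_hat, 200_000)
ate_mc = ce(z_mc, gamma_hat).mean()
```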

Note that the integration points zi themselves are fixed values, determined by the integration technique used. Thus, ATE^MOM is computed as a function of estimated parameters only – similarly to the AME estimator. Consequently, the standard error for the moment‐based ATE estimate can be derived using the delta method and is

SEMOM = √( ∇ATE^MOM(γ^, θ^)·Var(γ^, θ^)·∇ATE^MOM(γ^, θ^)ᵀ ), (10)

but the sampling‐based uncertainty in Z is naturally included via the parameters θ^ and by integrating over all possible conditional effects given the distribution of Z (for a proof, see Appendix B).

5. SIMULATION STUDY

The aforementioned ATE estimators and their respective standard errors can all be shown to be consistent, that is, their bias vanishes as the sample size increases to infinity. In psychological RCTs, however, sample sizes tend to be rather small. Thus, we conducted a simulation study to investigate the finite‐sample properties of the different ATE estimators and corresponding standard errors under various possible scenarios. We were especially interested in the statistical inference for ATE estimators when incorporating or neglecting sampling‐based uncertainty. Thus, a central design factor of our simulation study is the variance of conditional treatment effects, as this term reflects the key difference between the standard error formulas of the AME and SACE. While effect heterogeneity might stem from the influence of multiple covariates in applied settings, we used only a single covariate accounting for the variance of conditional treatment effects for simplicity. As the focus of this study is on consequences of different variances of the conditional treatment effects, we would argue that it is (at least from a technical perspective) of minor importance whether the effect heterogeneity is generated from one (comprehensive) covariate or from multiple covariates.

As effect variance is limited by the ATE, as was shown in Equation (3), we simulated different proportions of variance relative to the maximum possible variance. For example, for an ATE of 0, the maximum possible effect variance is 1. We then simulated effects with 80% (i.e., .8) of this maximum variance, 50%, and so on. Thus, we do not investigate the absolute variance of treatment effects, but its size relative to the maximum variance possible given an ATE.

From the existing literature we derived two additional design factors. First, we vary the total sample size from a very small sample of N=25 to a very large sample (N=5000). While such large samples are uncommon for psychological RCTs, we try to investigate at what point the consistency of the estimators starts to show. Second, we differentiate between balanced designs (i.e., 50% probability of being treated) and unbalanced designs (i.e., 33% probability of being treated). Negi and Wooldridge (2021) found an effect of different degrees of balance on the root mean squared error of effect estimates, that is, the ATE is estimated less efficiently in unbalanced designs. Especially for small samples, we would expect unbalanced designs to have a negative effect on all estimators, as the precision and efficiency within the treatment group should be reduced. This should also affect statistical inferences. An overview of our simulation study design is given in Table 2. The design results in a total of 686 conditions and each condition was replicated R=5000 times.

TABLE 2.

Design of the simulation study.

Parameter                    Values
ATE                          −.5, −.3, −.1, 0, .1, .3, .5
Relative effect variance     .2, .3, .4, .5, .6, .7, .8
Balance (% in treatment)     balanced (50%), unbalanced (33%)
Sample size N                25, 50, 75, 100, 250, 1000, 5000

We refrained from investigating these factors for different distributions of the covariate, as this would be expected to cause biased point estimates under extreme conditions, but no additional information regarding the statistical inferences would be generated (see, for example, Kiefer & Mayer, 2019, examining different distributions for the moment‐based approach for Poisson regression models). In addition, the distributional assumption of the moment‐based approach always coincides with the true distribution of the covariate. Thus, the findings from this simulation study reflect the statistical inferences when this assumption is correct. In applied settings, it should be expected that both point estimates and standard errors can be biased if the distributional assumption is violated.

For point estimation, we computed ATE^SDM, ATE^AME, and ATE^MOM. The estimator ATE^SACE is not shown separately, as it is computationally identical to ATE^AME. We computed the absolute and relative bias of the estimators for each condition. For standard error estimation, we computed SE^SDM, SE^AME, SE^SACE, and SE^MOM, and computed the coverage of the 95% CIs as well as the empirical detection rate (EDR; i.e., Type I error for ATEs equal to zero and power otherwise) for each condition.

The commented R code of the data‐generating process and the simulation study can be found on OSF (https://osf.io/gu37s). The simulation study was carried out using R (R Core Team, 2021) with the SimDesign package (Chalmers & Adkins, 2020). Note that the SimDesign package generates “working” replications of each condition, that is, simulated data sets leading to non‐convergent models or related issues are automatically excluded from the analyses.

5.1. Results

An overview of the median results over all approaches and criteria is provided in Table 3.

TABLE 3.

Medians of criteria of simulation study.

                               ATE^SDM    ATE^AME / ATE^SACE    ATE^MOM
Median absolute bias           <.0001     <.0001                <.0001
Median relative bias (%)       .04        .03                   .02

                               ATE^SDM    ATE^AME    ATE^SACE    ATE^MOM
Median coverage of 95% CIs     .943       .821       .946        .947
Median Type I error rate       .052       .166       .052        .052

Note: Four methods for estimating the average treatment effect (ATE): simple mean in differences (SDM), average marginal effect (AME), sample average over conditional efects (SACE), and moment‐based approach (MOM). Bold values indicate results that are outside the acceptable range.

5.1.1. Bias and relative bias of ATE estimates

First, we looked at the bias B of each of the point estimators. Note that ATE^AME and ATE^SACE provide identical point estimates, so we do not report them separately here. In general, we can see that bias was largest in conditions with very small sample size (i.e., N = 25), large absolute ATE (i.e., |ATE| ≥ .3), and high relative effect variance. However, the bias quickly vanished with increasing sample size. This finding is in line with the derivations of Negi and Wooldridge (2021), that is, the estimators are consistent given that the model is correctly specified (as was the case in our simulation). Figure 1 provides an overview of the results.

FIGURE 1.


Bias for the estimators ATE^SDM, ATE^AME (which is identical to ATE^SACE), and ATE^MOM. White colour indicates zero bias in the respective conditions. Orange colour indicates an overestimation, purple indicates an underestimation of the ATE in the respective conditions. (Colour is available online.)

The median bias of ATE^SDM was |BM| < .0001 with 95% of the values lying in the interval [−.008, .008]. Thus, ATE^SDM was unbiased in most scenarios. The most extreme values of bias for ATE^SDM, Bmin = −.021 and Bmax = .020, were found with very small sample size (N = 25). The median bias of ATE^AME (and ATE^SACE) was also |BM| < .0001 with 95% of the values lying in the interval [−.004, .005]. Thus, ATE^AME and ATE^SACE were unbiased in most scenarios and slightly more precise than ATE^SDM. The most extreme values of bias for ATE^AME and ATE^SACE were Bmin = −.012 and Bmax = .011, again slightly better than the results of the SDM estimator and only observed for very small sample sizes. Finally, the median bias of ATE^MOM was also |BM| < .0001 with 95% of the values lying in the interval [−.004, .004]. Thus, ATE^MOM was also unbiased in most scenarios with precision comparable to ATE^AME. The most extreme values of bias for ATE^MOM were Bmin = −.010 and Bmax = .008, the smallest range of bias among the three estimators.

We also examined the relative bias RB, that is, the bias in relation to the true ATE. Note that a positive relative bias means that the absolute value of the ATE is overestimated, regardless of the direction of the effect. In general, we found that relative bias was within a range of ±5% for at least 95% of all conditions, and within a range of ±10% for all conditions and estimators. As for bias, the relative bias diminished quickly with increasing sample size, meaning that the highest and lowest values of relative bias were found for very small sample sizes (N = 25). In contrast to the bias, the extreme values for relative bias were found for small absolute ATE (i.e., |ATE| = .1). Figure 2 provides an overview of the results.

FIGURE 2.


Relative bias for the estimators ATE^SDM, ATE^AME (which is identical to ATE^SACE), and ATE^MOM. White colour indicates zero relative bias in the respective conditions. Orange colour indicates an overestimation, purple indicates an underestimation of the ATE in the respective conditions. Grey indicates an ATE of 0, where relative bias cannot be computed.

The median relative bias of ATE^SDM was RBM = .04% with 95% of the values lying in the interval [−4.13%, 1.77%]. The most extreme values of relative bias for ATE^SDM were RBmin = −9.48% and RBmax = 4.94%, the former clearly beyond and the latter slightly under the traditional cutoff of ±5%, and these were only found with very small sample size (N = 25). The median relative bias of ATE^AME and ATE^SACE was RBM = .03% with 95% of the values lying in the interval [−3.23%, 1.64%]. Thus, ATE^AME and ATE^SACE tended to be slightly more precise than ATE^SDM. The most extreme values of relative bias for ATE^AME and ATE^SACE were RBmin = −9.45% and RBmax = 5.39%, similar to the results from the SDM estimator. Finally, the median relative bias of ATE^MOM was RBM = .02% with 95% of the values lying in the interval [−2.55%, 2.04%]. The most extreme values of relative bias for ATE^MOM were RBmin = −8.44% and RBmax = 5.20%. Overall, these results are comparable to the results of the SDM and AME/SACE estimators.

Overall, these findings suggest that the bias and relative bias of all three estimators are acceptable across all conditions. The minimum values of relative bias were observed in the same condition for all three estimators. Given the low value of ATE = .1 in this condition, it seems plausible that this extreme value is due to chance, especially as such large relative bias was not systematically observed under similar conditions.

5.1.2. Coverage of 95% confidence intervals

In a next step, we computed the (symmetric) 95% CIs based on each ATE estimator and the respective standard errors and examined the coverage rates (i.e., how often the CIs actually include the true ATE at a nominal level of 95%). In general, we found that coverage rates C were acceptable for most conditions in the SDM, SACE, and MOM estimators, except for the combination of very small sample size (N=25) and an unbalanced design. In contrast, the AME estimator had too low coverage rates under all conditions. Figure 3 provides an overview of the results.

FIGURE 3.


Coverage of the 95% confidence intervals based on the respective standard error estimates SE^SDM (SDM), SE^AME (AME), SE^SACE (SACE), and SE^MOM (MOM) (i.e., the percentage of confidence intervals that include the true value of the average treatment effect). The dashed lines indicate .95 (i.e., 95% of the 95% confidence intervals include the true value).

The median coverage of the SDM estimator was CM=.943 with 95% of the values lying in the interval [.898,.954], which is acceptably close to the nominal level of 95%. The most extreme values of coverage for the SDM were Cmin=.880 and Cmax=.959. The lowest value was slightly under a traditional cutoff of 90%, but this value was only found with very small sample size (N=25) and an unbalanced design. The median coverage of the AME estimator was CM=.821 with 95% of the values lying in the interval [.617,.921] which is substantially below the nominal level of 95%. The most extreme values of coverage for the AME were Cmin=.583 and Cmax=.928. Thus, the AME did not meet the nominal level of 95% in any scenario, but produced CIs that were too narrow under all conditions. The median coverage of the SACE was CM=.946 with 95% of the values lying in the interval [.902,.961], which is acceptably close to the nominal level of 95%. The most extreme values of coverage for the SACE were Cmin=.866 and Cmax=.968. Again, the lowest coverage rate was found in conditions with unbalanced design and a very small sample size (N=25). Finally, the median coverage of the MOM estimator was CM=.947 with 95% of the values lying in the interval [.914,.976]. The most extreme values of coverage for the MOM estimator were Cmin=.884 and Cmax=.989. As for the SDM and SACE estimators, the lowest coverage rate was found in conditions with unbalanced design and a very small sample size (N=25). The highest coverage rate was found in conditions with balanced design, a very small sample size, and large relative effect variance.

5.1.3. Empirical detection rate

Finally, we looked at the empirical detection rate, which is the Type I error rate T at a nominal level of 5% for an ATE of zero and the power otherwise. In general, we found that Type I error rates were acceptable under most conditions for the SDM, SACE, and MOM estimators, with an exception of the MOM estimator in very small sample sizes (N=25) when large relative effect variance was present. The AME estimator showed an inflated Type I error rate under all conditions, with values up to almost 40%. Figure 4 provides an overview of the results.

FIGURE 4.


Empirical detection rate of the null hypothesis tests based on Fisher's test for ATE^SDM or the respective standard error estimates SE^AME (AME), SE^SACE (SACE), and SE^MOM (MOM) – that is, the percentage of significant results (i.e., p < .05) if ATE = 0 (i.e., Type I error rate) or if ATE ≠ 0 (i.e., power). The dashed lines indicate .05 (i.e., 5% of the hypothesis tests return a significant result – the nominal level for the Type I error rate) and .80 (i.e., 80% of the hypothesis tests return a significant result – the desired level of power in many cases).

The median Type I error rate of the SDM was TM = .052 with 95% of the values lying in the interval [.045, .052]. The most extreme values of Type I error rate for the SDM were Tmin = .042 and Tmax = .060. These results reflect an acceptable Type I error rate under all simulated conditions. The median Type I error rate of the AME was TM = .166 with 95% of the values lying in the interval [.081, .378]. The most extreme values of Type I error rate for the AME estimator were Tmin = .075 and Tmax = .392. Thus, the AME did not meet the nominal level of 5% Type I error rate in any scenario, but produced inflated Type I error rates under all conditions. The median Type I error rate of the SACE was TM = .052 with 95% of the values lying in the interval [.036, .070]. The most extreme values of Type I error rate for the SACE were Tmin = .030 and Tmax = .085. These results reflect an acceptable Type I error rate under all simulated conditions. Finally, the median Type I error rate of the MOM estimator was TM = .052 with 95% of the values lying in the interval [.023, .064]. The most extreme values of Type I error rate for the MOM estimator were Tmin = .010 and Tmax = .070. While these Type I error rates are acceptable in most scenarios, the MOM showed a tendency for deflated Type I error rates in scenarios with small sample sizes (N ≤ 75), balanced design, and large relative effect variance.

Our simulation study yielded varying results regarding the power of the different approaches. Broadly speaking, the SDM estimator showed substantially higher power than the SACE and MOM in scenarios with small sample sizes (N ≤ 100), large ATE (|ATE| = .5), and 50% or more relative effect variance. For smaller ATEs, large relative effect variance, and sample sizes ≥ 250, the MOM showed higher power than the SDM and SACE. However, in most scenarios the power of SACE and MOM was similar to that of the SDM, meaning that the increase in power due to accounting for a covariate was moderate at best. It is noteworthy that power was generally higher in conditions with balanced designs compared to their unbalanced counterparts. That is, the precision gained through a larger control group did not counterbalance the loss of precision for estimating the parameters in a smaller treatment group. We do not consider the AME here, as it did not meet the nominal level of 5% Type I error rate under any conditions and, therefore, the empirical detection rate cannot be interpreted as power.

6. EMPIRICAL ILLUSTRATION

In this section, we give an empirical illustration of how the above‐mentioned ATE estimators can be applied to real data from psychotherapy research using data from Woud et al. (2021), who examined the effects of cognitive bias modification (CBM) training on PTSD. They used an RCT and compared CBM to sham training (i.e., a control group). A total of N=80 participants were randomized, with N=65 providing the outcome data used in our analyses.

In our analysis, we investigated the ATE of the CBM training (X=1) on a categorical PTSD outcome (estimated PTSD diagnosis) at the six‐week follow‐up assessment. While the ATE is usually expected to be larger at the immediate post‐training assessment, the effect variability is expected to be larger with later assessments due to differences between patients in the extent to which they retain and implement the learning from the intervention. As diagnostic interviews were only administered pre‐treatment, we operationalized the outcome variable Y by means of reaching a certain cut‐off score on the German version of the PTSD checklist for DSM‐5 (PCL‐5; Krüger‐Gottschalk et al., 2017). We applied a cut‐off score of 33 as recommended by the test developers (see Krüger‐Gottschalk et al., 2017). That is, Y=1 indicates a high probability for a PTSD diagnosis and Y=0 indicates the opposite.

We used the patients' score on the Beck Anxiety Inventory (BAI; Beck et al., 2012) measured at pre‐training as our covariate. While this covariate was chosen largely for demonstration purposes, pre‐training anxiety levels could plausibly be expected both to be prognostic of treatment outcomes and to interact with treatment condition. High scores on the BAI reflect general (i.e., not PTSD‐specific) anxiety severity, which could interfere with patients' ability to engage with and benefit from treatment as usual, leading to worse outcomes; patients with higher levels of anxiety at baseline may therefore particularly benefit from an adjunctive treatment (such as CBM).

We estimated the ATE using all four aforementioned estimators: SDM, SACE, AME, and MOM. For the covariate‐based estimators, we estimated a logistic regression model with robust standard errors, to account for potential misspecification as recommended by Negi and Wooldridge (2021). The commented R code for our analyses can be found on OSF (https://osf.io/gu37s).

6.1. Results

First, we estimated ATE^SDM based on the contingency table given in Table 4. The estimate for ATE^SDM was −.164 (Δ = −.376), which means that the probability of a PTSD diagnosis six weeks after treatment was about 16.4 percentage points lower in the CBM training group than in the control group. That is, ATE^SDM reflects the difference between the probability of a PTSD diagnosis under control (75.8%) and under training (59.4%). The estimated ATE corresponds to a number needed to treat (NNT) of about 6.1, meaning that, on average, about 6.1 patients would need to receive the training (rather than the control) to prevent one additional PTSD diagnosis. A summary of the results is given in Table 5.
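As a quick check of the arithmetic, the NNT is simply the reciprocal of the absolute risk difference:

```python
# NNT as the reciprocal of the absolute risk difference (values from the text).
ate_sdm = 0.594 - 0.758   # P(Y=1 | training) - P(Y=1 | control) = -.164

nnt = 1 / abs(ate_sdm)
print(round(nnt, 1))  # about 6.1
```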

TABLE 4.

Contingency table for the sample of the illustrative example.

          Y=0   Y=1   Total
X=0         8    25      33
X=1        12    20      32
Total      20    45      65

Note: Similarly to Table 1, the numbers reflect absolute frequencies of observations within the combinations of the treatment variable (CBM training, X=1; control group, X=0) and the outcome variable (PTSD diagnosis, Y=1; no PTSD diagnosis, Y=0).

TABLE 5.

ATE estimates and statistical inferences for the illustrative example.

Approach   ATE^     Effect size Δ    SE      p-value    95% CI            CI range
SDM        −.164    −.376            .114    .158       [−.388, .061]     .449
AME        −.144    −.330            .107    .181       [−.354, .067]     .421
SACE       −.144    −.330            .112    .200       [−.364, .076]     .440
MOM        −.139    −.318            .123    .260       [−.380, .102]     .482

Abbreviations: AME, average marginal effect; MOM, moment‐based approach; SACE, sample average over conditional effects (point estimate identical to AME); SDM, simple difference‐in‐means.

Second, we estimated a logistic regression model with treatment variable (γ^10 = 4.034, p = .074), covariate (γ^01 = .155, p = .023), and treatment–covariate interaction (γ^11 = −.164, p = .026). Based on the logistic regression model, we estimated both ATE^AME (−.144, Δ = −.330; same values for ATE^SACE) and ATE^MOM (−.139, Δ = −.318). Both covariate‐based ATE estimates were slightly smaller in absolute value than the SDM estimate, which is also reflected in the corresponding effect sizes.
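To make the computation of the covariate-based point estimate concrete, the following Python sketch computes ATE^SACE (identical to ATE^AME) as the sample average of conditional effects from a logistic model with a treatment–covariate interaction. The coefficients and covariate values are hypothetical stand-ins, not the estimates reported above; the study's own analysis code is in R on OSF.

```python
import math
import random

def inv_logit(x):
    # Logistic response function.
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical group-specific coefficients (intercept, covariate slope);
# chosen for illustration only.
a00, a01 = -1.0, 0.10   # control group
a10, a11 = 0.5, -0.15   # treatment group

# Hypothetical sampled covariate values (e.g., anxiety scores).
random.seed(1)
z = [random.gauss(20, 5) for _ in range(500)]

# Conditional effect at each observed covariate value, then the sample average.
ce = [inv_logit(a10 + a11 * zi) - inv_logit(a00 + a01 * zi) for zi in z]
ate_sace = sum(ce) / len(ce)
print(round(ate_sace, 3))
```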

Third, we investigated the standard errors, 95% CIs, and p‐values of the respective ATE estimates. While all four p‐values would lead to retaining the null hypothesis of an ATE of zero, the approaches differ with regard to their statistical confidence in doing so. For AME and SACE, the standard errors are lower (SE^AME = .107; SE^SACE = .112) than for the SDM (SE^SDM = .114) and, therefore, the corresponding CIs are narrower, reflecting higher confidence in the statistical inferences. However, based on our simulation study we would conclude that the AME is overconfident in this case because it simply neglects sampling‐based uncertainty in the covariate. In contrast, the SACE takes this uncertainty into account and still improves on the SDM, but in a more moderate way.

In this application, the moment‐based approach results in both the highest standard error (SE^MOM=.123) and the widest 95% CI of all four approaches. Thus, we additionally used a Shapiro–Wilk test to examine whether the normality assumption for the covariate was reasonable and got a non‐significant result (W=.984, p=.574).

In sum, we compared the simple estimate of the ATE (i.e., not including any covariates, just comparing group means, called SDM) to several different ways (i.e., AME, SACE, MOM) of calculating the ATE based on inclusion of a covariate that is prognostic of treatment outcomes and interacts with the treatment condition. The comparison illustrates two important aspects. First, with regard to point estimates, all four estimators yield very similar results, which is to be expected in an RCT. Second, adjusting for a covariate and ignoring sampling‐based uncertainty (i.e., the AME) seemingly provides favourable standard errors, p‐values, and CIs, but these are due to an underestimation of uncertainty and would lead to inflated Type I error rates and undercoverage of CIs. Accounting for sampling‐based uncertainty in the covariate leads to statistical inferences closer to the simple ATE estimate (without covariates). While the overall implications and inferences would be similar for all four estimators in this application, the example also illustrates that substantial differences are possible even in a very simple scenario like this.

7. DISCUSSION

In this paper, we provided a thorough investigation of the adjustment for stochastic covariates in estimation of average treatment effects in logistic regression models in psychological RCTs. We presented four adjusted and unadjusted estimators of the ATE: a simple difference‐in‐means estimator, a sample average of conditional treatment effects treating the covariate as stochastic (i.e., SACE) and one treating it as fixed (i.e., AME), and a newly developed moment‐based approach, which also treats the covariate as stochastic. We discussed the conditions under which these estimators differ with regard to their standard errors and examined the finite‐sample properties of all four estimators in a simulation study. Finally, we offered an empirical example from clinical psychology illustrating the different results provided by the four estimators.

The most important finding from our simulation study is that statistical inferences from the AME estimator (i.e., treating the covariate as fixed) were problematic under almost all conditions. This is in contrast to the statement of Wooldridge (2010) that omitting the part accounting for sampling‐based uncertainty in the standard error formula might only have a small effect. We found that for the AME actual coverage rates did not meet the nominal level of 95%, and the actual Type I error rates did not meet the nominal level of 5% in any condition. These rates were close to their nominal levels when the conditional treatment effects were close to homogeneous. However, homogeneous treatment effects are only possible in logistic regression models if the covariate is not predictive of the outcome at all, and in this case covariate adjustment would only introduce noise into the analysis (Negi & Wooldridge, 2021).

Another important finding is that covariate adjustment did not automatically lead to increased power of an estimator, even though the logistic regression model was correctly specified in our simulation. The increased empirical detection rates of the AME estimator must not be misinterpreted as power – otherwise, one would wrongly conclude that covariate adjustment tremendously improves the power. This is generally not the case. We found that covariate adjustment (with SACE and MOM) was only beneficial in larger samples, when estimating a small to moderate ATE with large variance of conditional effects. In scenarios with larger ATEs, the SDM showed higher power than the covariate‐adjusted approaches. One possible reason for this finding is that the maximum likelihood estimation of the regression parameters may involve considerable estimation uncertainty in the presence of strong interactions in small samples, which in turn reduces the power of the effect estimate. However, in larger samples the power of the SDM is already quite high, so there is not much room for covariate adjustment to improve upon it. This finding is noteworthy especially because the SDM only relies on the assumption of i.i.d. observations, whereas the other estimators require additional assumptions, such as a correct model specification.

These findings emphasize the importance of accounting for sampling‐based uncertainty when estimating (average) treatment effects based on regression models. They are in line with findings from previous work, for example, on accounting for stochastic group sizes in an ANOVA‐like framework (Mayer & Thoemmes, 2019), stochastic covariates in linear regression models (Chen, 2006; Liu et al., 2017), and stochastic covariates in Poisson regression models (Kiefer & Mayer, 2019). We contribute to this line of research by illustrating the severity of neglecting sampling‐based uncertainty for logistic regression models.

In addition, we proposed a new moment‐based approach which also accounts for stochastic covariates. In a previous study, Kiefer and Mayer (2019) showed that a moment‐based approach can outperform the SACE in Poisson regression models with regard to bias of point estimates and accuracy of coverage rates in the presence of strong interaction effects. However, we could not find these advantages in logistic regression models. While the moment‐based approach yielded slightly better coverage rates in very small samples than the SACE, both estimators yielded similar results with regard to bias. One possible reason for these findings is that we only examined a normally distributed covariate, whereas Kiefer and Mayer (2019) found that a Poisson‐distributed covariate led to large performance differences between the two estimators. Nevertheless, the moment‐based approach can be useful in scenarios where the SACE is not directly applicable, for example, with latent covariates (Kiefer & Mayer, 2021a).

We have shown that statistical inferences on the ATE can be difficult when treatment effects are heterogeneous. In the same vein, it is noteworthy that the ATE itself might be substantively less interesting in cases with heterogeneous treatment effects, as the actual (conditional) treatment effects deviate from the ATE. In our illustrative example, CBM training was found to be more effective for persons with higher anxiety at baseline and less effective for persons with lower anxiety. This information can be useful for tailoring treatments to the persons for whom they are beneficial. For another example, see the study on differential effects of a classroom intervention by Flunger et al. (2019). There are also approaches that (approximately) estimate individual treatment effects based on covariate selection (Mayer et al., 2020; Wester et al., 2022).

In sum, we showed that accounting for sampling‐based uncertainty is important when doing covariate adjustment for a binary outcome in a psychological RCT. Neglecting this uncertainty can lead to severely inflated Type I error rates and overestimation of power. Instead of the AME we recommend using the SACE, which provides identical point estimates. If treatment effects are heterogeneous, the statistical inferences provided by the SACE are more accurate than those provided by the AME, and if treatment effects are homogeneous, both approaches would provide identical results. The moment‐based approach can be a viable alternative to the SACE, but might be inaccurate in the presence of strong interactions in small samples.

AUTHOR CONTRIBUTIONS

Christoph Kiefer: conceptualization; methodology; software; writing – original draft. Marcella L. Woud: resources; writing – review and editing. Simon E. Blackwell: resources; writing – review and editing. Axel Mayer: conceptualization; writing – review and editing; supervision.

Acknowledgement

Open Access funding enabled and organized by Projekt DEAL.

APPENDIX A. Derivation for supremum of conditional effects variance

We consider the random variables U1 := P(Y=1|X=1,Z) and U0 := P(Y=1|X=0,Z) with values u0, u1 ∈ [0, 1]. Then the conditional effect function is given by

CE(Z) = U1 − U0,

and its variance can be decomposed to

Var[CE(Z)] = Var(U1 − U0) = Var(U1) + Var(U0) − 2·Cov(U1, U0). (A1)

We are interested in the supremum of this variance given a particular ATE between −1 and 1, so we have to consider a restriction on our variables,

ATE = E(U1) − E(U0),

that is, we want to derive

sup{Var[CE(Z)] : ATE ∈ [−1, 1]}.

Our derivation of the supremum of the conditional effect variance is based on the observation that the variances of U0 and U1 are maximized if their values deviate as much as possible from the corresponding expectation, that is, if the values u0, u1 ∈ {0, 1}. Thus, we consider U1 ∼ B(p1) and U0 ∼ B(p0) as Bernoulli‐distributed random variables with E(U0) = p0 and E(U1) = p1.

We use the Cauchy–Schwarz inequality to derive a lower bound on the covariance of U0 and U1, that is, given the variances, the lowest possible covariance is

Cov(U1, U0) = −√(Var(U1) · Var(U0)) · ϕ. (A2)

Note that for Bernoulli‐distributed random variables U1 and U0, we added a multiplicative correction factor ϕ to the covariance, as the bounds derived by the Cauchy–Schwarz inequality can be too liberal for binary variables (Ferguson, 1941; Guilford, 1965). The factor ϕ is bounded between 0 and 1 and is computed by

ϕ = √( (p0 · p1) / ((1 − p0) · (1 − p1)) ).

Note that this formula only holds when considering the lower bound of the covariance. For the upper bound another formula has to be used (Guilford, 1965).

Let us now consider the relation between p0 and p1 of our random variables U0 and U1, which is

p1=p0+ATE.

When we additionally replace the covariance term in Equation (A1) by its lower bound from Equation (A2), we have

Var(U1) + Var(U0) + 2·ϕ·√(Var(U1) · Var(U0))
= p1·(1 − p1) + p0·(1 − p0) + 2·ϕ·√(p1·(1 − p1) · p0·(1 − p0))
= (p0 + ATE)·(1 − p0 − ATE) + p0·(1 − p0) + 2·ϕ·√((p0 + ATE)·(1 − p0 − ATE) · p0·(1 − p0)),

which is a function of p0, because the ATE is treated as a given value in the supremum we want to derive. The maximum of this function can be found for

p0 = (1 − ATE)/2,

and consequently

p1 = 1 − p0 = (1 + ATE)/2.

For these values of p0 and p1, the correction factor is ϕ=1 and can be neglected from the formula.

In conclusion, the first part of the supremum is given by

Var(U1) + Var(U0) = p1·(1 − p1) + p0·(1 − p0) = (1 − ATE²)/4 + (1 − ATE²)/4 = (1 − ATE²)/2,

and the second part of the supremum is given by

2·√(Var(U1)·Var(U0)) = 2·√((1 − ATE²)/4 · (1 − ATE²)/4) = (1 − ATE²)/2.

Thus, the supremum can be computed as

sup Var[CE(Z)] = (1 − ATE²)/2 + (1 − ATE²)/2 = 1 − ATE².
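The derivation can be checked numerically. The sketch below maximizes the variance bound over p0 on a grid, using the exact covariance lower bound for Bernoulli margins (which coincides with the ϕ-corrected Cauchy–Schwarz bound above); under this construction the maximum equals 1 − ATE², attained at p0 = (1 − ATE)/2:

```python
# Numeric check of the Appendix A supremum for a fixed ATE.
def ce_var_bound(p0, ate):
    p1 = p0 + ate
    var0, var1 = p0 * (1 - p0), p1 * (1 - p1)
    # Exact lower bound of Cov(U1, U0) for Bernoulli margins:
    # P(U1=1, U0=1) >= max(0, p0 + p1 - 1).
    cov_min = max(0.0, p0 + p1 - 1.0) - p0 * p1
    return var1 + var0 - 2 * cov_min

ate = 0.3
grid = [i / 1000 for i in range(0, 701)]  # feasible p0 so that p1 <= 1
best = max(ce_var_bound(p0, ate) for p0 in grid)

print(round(best, 4))  # 0.91, i.e., 1 - 0.3**2
```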

APPENDIX B. Consistency of moment‐based ATE estimator

In Equation (7), the ATE was denoted as a function g(γ, θ):

g(γ, θ) = ∫ CE(z, γ) · pZ(z, θ) dz = ATE,

where γ denotes the regression parameters and θ denotes parameters of the density function of the covariate.

Now, assuming that both the regression parameters γ^ and the density parameters θ^ are consistent estimates of γ and θ, respectively, that is,

(γ^, θ^) ∼ AN((γ, θ), Var(γ, θ)),

where AN(μ, Σ) denotes asymptotically normally distributed with asymptotic mean vector μ and asymptotic covariance matrix Σ, one can show that the corresponding estimator from Equation (21),

g(γ^, θ^) = ∫ CE(z, γ^) · pZ(z, θ^) dz = ATE^MOM(γ^, θ^),

as N → ∞, has the asymptotic distribution

g(γ^, θ^) ∼ AN( g(γ, θ), ∇g(γ, θ) · Var(γ, θ) · [∇g(γ, θ)]ᵀ ).

This is a direct application of the delta theorem (see, for example, Boos & Stefanski, 2013, Theorem 5.19). The asymptotic mean shows that the function g(γ, θ) is estimated consistently, and the asymptotic variance is the square of the standard error formula in Equation (10).
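The delta-method variance can be sketched numerically: approximate the gradient of g at the estimates by central differences and form the sandwich product with the covariance matrix of the estimates. The function g and the covariance matrix below are toy values for illustration, not quantities from the paper:

```python
import math

def g(theta):
    # Toy smooth function of two parameters (stand-in for the ATE functional).
    a, b = theta
    return math.exp(a) / (1 + math.exp(a)) - b

theta_hat = [0.2, 0.1]          # toy parameter estimates
vcov = [[0.04, 0.01],           # toy covariance matrix of the estimates
        [0.01, 0.09]]

# Central-difference approximation of the gradient of g at theta_hat.
eps = 1e-6
grad = []
for j in range(2):
    hi = list(theta_hat); hi[j] += eps
    lo = list(theta_hat); lo[j] -= eps
    grad.append((g(hi) - g(lo)) / (2 * eps))

# Delta method: Var[g(theta_hat)] ~ grad * vcov * grad^T.
var_g = sum(grad[i] * vcov[i][j] * grad[j] for i in range(2) for j in range(2))
se_g = math.sqrt(var_g)
print(round(se_g, 4))  # 0.2958
```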

APPENDIX C. Approximation of the moment‐based ATE estimator

Examples for univariate distributions

In the main text, we said that the computation of the moment‐based ATE estimator

ATE^MOM(γ^, θ^) = ∫ CE(z, γ^) · pZ(z, θ^) dz

usually depends on the distribution of the covariate Z and the available numerical or analytical integration techniques. In this appendix, we will provide three examples of how the integral can be solved numerically (for a normally and Poisson‐distributed covariate) and analytically (for a uniformly distributed covariate).

If Z is normally distributed with estimated parameters θ^=(μ^,σ^), the density is well known and the ATE can be approximated using Gaussian quadrature:

ATE^MOM(γ^, θ^) = ∫_{−∞}^{∞} CE(z, γ^) · (1/√(2πσ^2)) · exp(−(z − μ)^2 / (2σ^2)) dz ≈ Σ_{i=1}^{M} w_i · CE(z_i, γ^),

where the w_i and z_i can, for example, be derived using Gauss–Hermite quadrature rules. Note that Gaussian quadrature techniques approximate the effect function CE with polynomials up to degree 2M − 1. We expect these techniques to work well for most applications, but the polynomial approximation might be biased in the presence of very strong interaction effects, as the logistic function can only be approximated by a polynomial to a certain extent.
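As an illustration of the quadrature step, the sketch below approximates the expectation of a conditional effect function over a normal covariate using numpy's Gauss–Hermite rule, with the standard change of variables z = μ + √2·σ·x. The coefficients are hypothetical, chosen for illustration:

```python
import math
import numpy as np

def inv_logit(x):
    return 1.0 / (1.0 + math.exp(-x))

def ce(z):
    # Hypothetical conditional effect function from a logistic model
    # with treatment-covariate interaction (coefficients are illustrative).
    return inv_logit(0.5 - 0.15 * z) - inv_logit(-1.0 + 0.10 * z)

mu, sigma, M = 20.0, 5.0, 30

# Gauss-Hermite approximates  int exp(-x^2) f(x) dx  =~  sum_i w_i f(x_i).
# With z = mu + sqrt(2)*sigma*x, the N(mu, sigma^2) expectation of CE becomes
# (1/sqrt(pi)) * sum_i w_i * CE(mu + sqrt(2)*sigma*x_i).
nodes, weights = np.polynomial.hermite.hermgauss(M)
ate_mom = sum(w * ce(mu + math.sqrt(2.0) * sigma * x)
              for x, w in zip(nodes, weights)) / math.sqrt(math.pi)

print(round(ate_mom, 3))
```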

If Z is Poisson‐distributed with estimated parameter θ^=(λ^), the density is well known and the ATE can be expressed as a series:

ATE^MOM(γ^, θ^) = Σ_{z=0}^{∞} CE(z, γ^) · (λ^z / z!) · exp(−λ) ≈ Σ_{z=0}^{Zmax} CE(z, γ^) · (λ^z / z!) · exp(−λ),

where Zmax should be chosen reasonably high depending on λ^. As a rule of thumb, one can choose Zmax ≈ λ^ + 3·√λ^, that is, a value three standard deviations above the mean. Note that this approximation can be arbitrarily precise depending on the choice of Zmax, that is, with Zmax → ∞ the approximation error converges to 0.
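The truncated series is straightforward to implement. The sketch below uses a hypothetical conditional effect function and λ^ = 4, for which the rule of thumb gives Zmax = 10:

```python
import math

def inv_logit(x):
    return 1.0 / (1.0 + math.exp(-x))

def ce(z):
    # Hypothetical conditional effect function (coefficients are illustrative).
    return inv_logit(-0.5 + 0.30 * z) - inv_logit(-1.5 + 0.40 * z)

lam = 4.0                                 # estimated Poisson mean
z_max = int(lam + 3 * math.sqrt(lam))     # rule of thumb: mean + 3 SDs -> 10

# Truncated series: sum of CE(z) weighted by Poisson probabilities.
ate_mom = sum(ce(z) * lam ** z / math.factorial(z) * math.exp(-lam)
              for z in range(z_max + 1))

print(round(ate_mom, 4))
```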

If Z is continuously uniformly distributed on the interval [a, b] with estimated parameters θ^ = (â, b^), the density is well known and the ATE has an analytical solution:

ATE^MOM(γ^, θ^) = 1/(b^ − â) · [ (1/α^11)·( log(1 + exp(α^10 + α^11·b^)) − log(1 + exp(α^10 + α^11·â)) ) − (1/α^01)·( log(1 + exp(α^00 + α^01·b^)) − log(1 + exp(α^00 + α^01·â)) ) ],

where

α^00 = γ^00, α^01 = γ^01, α^10 = γ^00 + γ^10, α^11 = γ^01 + γ^11

are the group‐specific regression coefficients, which we used for a more parsimonious notation.
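The closed form for the uniform case rests on the antiderivative ∫ exp(c + d·z)/(1 + exp(c + d·z)) dz = log(1 + exp(c + d·z))/d. The sketch below checks this solution against brute-force numerical integration; the coefficients and interval bounds are hypothetical:

```python
import math

def inv_logit(x):
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical group-specific coefficients and estimated uniform bounds.
a00, a01 = -1.0, 0.10   # control: intercept, slope
a10, a11 = 0.5, -0.15   # treatment: intercept, slope
a, b = 10.0, 30.0

def integral(c, d, lo, hi):
    # Antiderivative of inv_logit(c + d*z) is log(1 + exp(c + d*z)) / d.
    return (math.log1p(math.exp(c + d * hi)) - math.log1p(math.exp(c + d * lo))) / d

ate_closed = (integral(a10, a11, a, b) - integral(a00, a01, a, b)) / (b - a)

# Brute-force midpoint rule for comparison.
n = 200_000
step = (b - a) / n
ate_num = sum(inv_logit(a10 + a11 * (a + (i + 0.5) * step))
              - inv_logit(a00 + a01 * (a + (i + 0.5) * step))
              for i in range(n)) * step / (b - a)

print(round(ate_closed, 4), round(ate_num, 4))
```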

Example with multivariate normal distribution

If we consider a multivariate case with m covariates, then the conditional effect function is given by

CE(z, α^0, α^1) = exp(α10 + α11·Z1 + ⋯ + α1m·Zm) / (1 + exp(α10 + α11·Z1 + ⋯ + α1m·Zm)) − exp(α00 + α01·Z1 + ⋯ + α0m·Zm) / (1 + exp(α00 + α01·Z1 + ⋯ + α0m·Zm)),

where z = (Z1, …, Zm), α^0 = (α00, α01, …, α0m), and α^1 = (α10, α11, …, α1m) are the group‐specific regression coefficients.

If the covariates are multivariate normally distributed with estimated parameters θ^ = (μ^, Σ^), that is, z ∼ 𝒩(μ^, Σ^), then the ATE can be computed by

ATE^MOM(α^0, α^1, θ^) = ∫ CE(z, α^0, α^1) · exp(−(1/2)·(z − μ^)ᵀ Σ^⁻¹ (z − μ^)) / √((2π)^m · |Σ^|) dz ≈ Σ_{i=1}^{M} w_i · CE(z_i, α^0, α^1),

where the zi are integration points on a multidimensional grid (e.g., derived from multidimensional Gauss–Hermite quadrature) and wi are integration weights, respectively.

Note that the computational demand increases with a growing number of covariates, as the number of integration points grows exponentially. However, we expect the computations to be reasonably manageable on modern computer systems for two reasons. First, the computation only has to be performed once, not repeatedly. This sets it apart from the use of numerical integration techniques for the computation of marginal maximum likelihoods (see, for example, Skrondal & Rabe‐Hesketh, 2004, Ch. 6.3), where the computation is repeated throughout the iterative optimization process. Second, the number of covariates is typically rather low in applied settings, given that sample sizes are typically small to moderate in psychological RCTs. In the main text we argued that applied researchers will often only account for a single covariate. However, if, for example, six covariates with 15 integration points per dimension were considered, then Gauss–Hermite quadrature requires 15^6 = 11,390,625 integration points in total. On our laptop, computing these integration points required about 600 MB of memory and 30 seconds of time, and the computation of the effect itself took about 0.3 seconds. The example can also be found in our OSF repository. For more complex scenarios, Monte Carlo integration might be a feasible alternative.
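A two-dimensional tensor-product version of the quadrature can be sketched in a few lines. For simplicity the two covariates are independent standard normal (Σ = I) and the coefficients are hypothetical; the point is only the grid construction and the change of variables:

```python
import math
import numpy as np

def inv_logit(x):
    return 1.0 / (1.0 + math.exp(-x))

def ce(z1, z2):
    # Hypothetical conditional effect with two covariates (illustrative only).
    return (inv_logit(0.4 - 0.30 * z1 + 0.20 * z2)
            - inv_logit(-0.2 + 0.10 * z1 - 0.10 * z2))

M = 15  # points per dimension -> 15**2 = 225 grid points in two dimensions
nodes, weights = np.polynomial.hermite.hermgauss(M)

# Tensor-product Gauss-Hermite grid. With z_j = sqrt(2)*x_j for standard-normal
# margins, the two-dimensional expectation carries an overall factor 1/pi.
ate_mom = sum(w1 * w2 * ce(math.sqrt(2.0) * x1, math.sqrt(2.0) * x2)
              for x1, w1 in zip(nodes, weights)
              for x2, w2 in zip(nodes, weights)) / math.pi

print(round(ate_mom, 3))
```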

In cases where the assumption of multivariate normality is violated, there might be alternative approaches. Kiefer and Mayer (2021b) provide three different ways of specifying or approximating multivariate distributions of z for the moment‐based approach in Poisson regression models. These techniques might be adapted for logistic regression models as well.

Kiefer, C. , Woud, M. L. , Blackwell, S. E. , & Mayer, A. (2025). Average treatment effects on binary outcomes with stochastic covariates. British Journal of Mathematical and Statistical Psychology, 78, 141–166. 10.1111/bmsp.12355

Footnotes

1

We consider a single covariate Z, because in psychological RCTs typically only a single covariate (e.g., the baseline measure of the outcome) is used for covariate adjustment (Bodner & Bliese, 2018). However, we will discuss below how the estimators for the ATE can be extended to multiple covariates.

2

The generalized linear model treats the covariates as fixed observations – this is analogous to the fixed‐covariate assumption in the effect estimation discussed later. However, at this stage of the estimation process, this treatment affects neither the regression coefficient estimates nor their standard errors.

DATA AVAILABILITY STATEMENT

The data that support the findings of this study are openly available in OSF at https://osf.io/gu37s/.

REFERENCES

  1. Abadie, A. , Athey, S. , Imbens, G. W. , & Wooldridge, J. M. (2020). Sampling‐based versus design‐based uncertainty in regression analysis. Econometrica, 88(1), 265–296. 10.3982/ECTA12675 [DOI] [Google Scholar]
  2. Bartlett, J. W. (2018). Covariate adjustment and estimation of mean response in randomised trials. Pharmaceutical Statistics, 17, 648–666. 10.1002/pst.1880 [DOI] [PubMed] [Google Scholar]
  3. Basu, A. , & Rathouz, P. J. (2005). Estimating marginal and incremental effects on health outcomes using flexible link and variance function models. Biostatistics, 6(1), 93–109. 10.1093/biostatistics/kxh020 [DOI] [PubMed] [Google Scholar]
  4. Beaudry, G. , Yu, R. , Perry, A. E. , & Fazel, S. (2021). Effectiveness of psychological interventions in prison to reduce recidivism: A systematic review and meta‐analysis of randomised controlled trials. The Lancet Psychiatry, 8(9), 759–773. 10.1016/S2215-0366(21)00170-X [DOI] [PMC free article] [PubMed] [Google Scholar]
  5. Beck, A. T. , Epstein, N. , Brown, G. , & Steer, R. (2012). Beck anxiety inventory. American Psychological Association. [Google Scholar]
  6. Bodner, T. E. , & Bliese, P. D. (2018). Detecting and differentiating the direction of change and intervention effects in randomized trials. Journal of Applied Psychology, 103(1), 37–53. 10.1037/apl0000251 [DOI] [PubMed] [Google Scholar]
  7. Boos, D. D. , & Stefanski, L. A. (2013). Essential statistical inference (Vol. 120). Springer. [Google Scholar]
  8. Chalmers, R. P. , & Adkins, M. C. (2020). Writing effective and reliable Monte Carlo simulations with the SimDesign package. The Quantitative Methods for Psychology, 16(4), 248–280. 10.20982/tqmp.16.4.p248 [DOI] [Google Scholar]
  9. Chen, X. (2006). The adjustment of random baseline measurements in treatment effect estimation. Journal of Statistical Planning and Inference, 136(12), 4161–4175. 10.1016/j.jspi.2005.08.046 [DOI] [Google Scholar]
  10. Dijkstra, A. , De Vries, H. , Roijackers, J. , & van Breukelen, G. (1998). Tailored interventions to communicate stage‐matched information to smokers in different motivational stages. Journal of Consulting and Clinical Psychology, 66(3), 549–557. 10.1037/0022-006X.66.3.549 [DOI] [PubMed] [Google Scholar]
  11. Ferguson, G. A. (1941). The factorial interpretation of test difficulty. Psychometrika, 6(5), 323–329. 10.1007/BF02288588 [DOI] [Google Scholar]
  12. Fleiss, J. L. , Levin, B. , & Paik, M. C. (2003). Statistical methods for rates and proportions (1st ed.). Wiley. [Google Scholar]
  13. Flunger, B. , Mayer, A. , & Umbach, N. (2019). Beneficial for some or for everyone? Exploring the effects of an autonomy‐supportive intervention in the real‐life classroom. Journal of Educational Psychology, 111(2), 210–234. 10.1037/edu0000284 [DOI] [Google Scholar]
  14. Gloster, A. T. , Wittchen, H.‐U. , Einsle, F. , Lang, T. , Helbig‐Lang, S. , Fydrich, T. , Fehm, L. , Hamm, A. O. , Richter, J. , Alpers, G. W. , Gerlach, A. L. , Ströhle, A. , Kircher, T. , Deckert, J. , Zwanzger, P. , Höller, M. , & Arolt, V. (2011). Psychological treatment for panic disorder with agoraphobia: A randomized controlled trial to examine the role of therapist‐guided exposure in situ in CBT. Journal of Consulting and Clinical Psychology, 79(3), 406–420. 10.1037/a0023584 [DOI] [PubMed] [Google Scholar]
  15. Greene, W. H. (2012). Econometric analysis (7th ed.). Pearson. [Google Scholar]
  16. Greenland, S. , Pearl, J. , & Robins, J. M. (1999). Confounding and collapsibility in causal inference. Statistical Science, 14(1), 29–46. 10.1214/ss/1009211805 [DOI] [Google Scholar]
  17. Guilford, J. P. (1965). The minimal phi coefficient and the maximal phi. Educational and Psychological Measurement, 25(1), 3–8. 10.1177/001316446502500101 [DOI] [Google Scholar]
  18. Hanmer, M. J. , & Ozan Kalkan, K. (2013). Behind the curve: Clarifying the best approach to calculating predicted probabilities and marginal effects from limited dependent variable models. American Journal of Political Science, 57(1), 263–277. 10.1111/j.1540-5907.2012.00602.x [DOI] [Google Scholar]
  19. Hernández, A. V. , Steyerberg, E. W. , & Habbema, J. F. (2004). Covariate adjustment in randomized controlled trials with dichotomous outcomes increases statistical power and reduces sample size requirements. Journal of Clinical Epidemiology, 57(5), 454–460. 10.1016/j.jclinepi.2003.09.014 [DOI] [PubMed] [Google Scholar]
  20. Hutton, J. L. (2000). Number needed to treat: Properties and problems. Journal of the Royal Statistical Society: Series A (Statistics in Society), 163(3), 381–402. 10.1111/1467-985X.00175 [DOI] [Google Scholar]
  21. Imbens, G. , & Rubin, D. B. (2015). Causal inference for statistics, social, and biomedical sciences: An introduction. Cambridge University Press. [Google Scholar]
  22. Kiefer, C. , & Mayer, A. (2019). Average effects based on regressions with a logarithmic link function: A new approach with stochastic covariates. Psychometrika, 84(2), 422–446. 10.1007/s11336-018-09654-1 [DOI] [PubMed] [Google Scholar]
  23. Kiefer, C. , & Mayer, A. (2021a). Accounting for latent covariates in average effects from count regressions. Multivariate Behavioral Research, 56(4), 579–594. 10.1080/00273171.2020.1751027 [DOI] [PubMed] [Google Scholar]
  24. Kiefer, C. , & Mayer, A. (2021b). Treatment effects on count outcomes with non‐normal covariates. British Journal of Mathematical and Statistical Psychology, 74(3), 513–540. 10.1111/bmsp.12237 [DOI] [PubMed] [Google Scholar]
  25. Krüger‐Gottschalk, A. , Knaevelsrud, C. , Rau, H. , Dyer, A. , Schäfer, I. , Schellong, J. , & Ehring, T. (2017). The German version of the Posttraumatic Stress Disorder Checklist for DSM‐5 (PCL‐5): Psychometric properties and diagnostic utility. BMC Psychiatry, 17(1), 379. 10.1186/s12888-017-1541-6 [DOI] [PMC free article] [PubMed] [Google Scholar]
  26. Larimer, M. E. , Lee, C. M. , Kilmer, J. R. , Fabiano, P. M. , Stark, C. B. , Geisner, I. M. , Mallett, K. A. , Lostutter, T. W. , Cronce, J. M. , Feeney, M. , & Neighbors, C. (2007). Personalized mailed feedback for college drinking prevention: A randomized clinical trial. Journal of Consulting and Clinical Psychology, 75(2), 285–293. 10.1037/0022-006X.75.2.285 [DOI] [PMC free article] [PubMed] [Google Scholar]
  27. Leeper, T. J. , Arnold, J. , Arel‐Bundock, V. , & Long, J. A. (2021). Margins: Marginal effects for model objects (Version 0.3.26). https://CRAN.R‐project.org/package=margins
  28. Liu, Y. , West, S. G. , Levy, R. , & Aiken, L. S. (2017). Tests of simple slopes in multiple regression models with an interaction: Comparison of four approaches. Multivariate Behavioral Research, 52(4), 445–464. 10.1080/00273171.2017.1309261 [DOI] [PubMed] [Google Scholar]
  29. Mayer, A. , Dietzfelbinger, L. , Rosseel, Y. , & Steyer, R. (2016). The EffectLiteR approach for analyzing average and conditional effects. Multivariate Behavioral Research, 51(2‐3), 374–391. 10.1080/00273171.2016.1151334 [DOI] [PubMed] [Google Scholar]
  30. Mayer, A. , & Thoemmes, F. (2019). Analysis of variance models with stochastic group weights. Multivariate Behavioral Research, 54(4), 542–554. 10.1080/00273171.2018.1548960 [DOI] [PubMed] [Google Scholar]
  31. Mayer, A. , Zimmermann, J. , Hoyer, J. , Salzer, S. , Wiltink, J. , Leibing, E. , & Leichsenring, F. (2020). Interindividual Differences in Treatment Effects Based on Structural Equation Models with Latent Variables: An EffectLiteR Tutorial. Structural Equation Modeling: A Multidisciplinary Journal, 27(5), 798–816. 10.1080/10705511.2019.1671196 [DOI] [Google Scholar]
  32. McCullagh, P. , & Nelder, J. A. (1998). Generalized linear models (2nd ed.). Chapman & Hall/CRC. [Google Scholar]
  33. Miller, F. (2021). Gradients and Hessians for log‐likelihood in logistic regression . http://gauss.stat.su.se/phd/oasi/OASII2021_gradients_Hessians.pdf
  34. Mood, C. (2010). Logistic regression: Why we cannot do what we think we can do, and what we can do about it. European Sociological Review, 26(1), 67–82. 10.1093/esr/jcp006 [DOI] [Google Scholar]
  35. Moore, K. L. , & van der Laan, M. J. (2009). Covariate adjustment in randomized trials with binary outcomes: Targeted maximum likelihood estimation. Statistics in Medicine, 28(1), 39–64. 10.1002/sim.3445 [DOI] [PMC free article] [PubMed] [Google Scholar]
  36. Negi, A. , & Wooldridge, J. M. (2021). Revisiting regression adjustment in experiments with heterogeneous treatment effects. Econometric Reviews, 40(5), 504–534. 10.1080/07474938.2020.1824732 [DOI] [Google Scholar]
  37. Nelder, J. A. , & Wedderburn, R. W. M. (1972). Generalized linear models. Journal of the Royal Statistical Society: Series A (General), 135(3), 370. 10.2307/2344614 [DOI] [Google Scholar]
  38. Newcombe, R. G. (1998). Interval estimation for the difference between independent proportions: Comparison of eleven methods. Statistics in Medicine, 17(8), 873–890. [DOI] [PubMed] [Google Scholar]
  39. Niu, L. (2020). A review of the application of logistic regression in educational research: Common issues, implications, and suggestions. Educational Review, 72(1), 41–67. 10.1080/00131911.2018.1483892 [DOI] [Google Scholar]
  40. Norton, E. C. , & Dowd, B. E. (2018). Log odds and the interpretation of logit models. Health Services Research, 53(2), 859–878. 10.1111/1475-6773.12712 [DOI] [PMC free article] [PubMed] [Google Scholar]
  41. Osch, L. , Lechner, L. , Reubsaet, A. , Wigger, S. , & Vries, H. (2008). Relapse prevention in a national smoking cessation contest: Effects of coping planning. British Journal of Health Psychology, 13(3), 525–535. 10.1348/135910707X224504 [DOI] [PubMed] [Google Scholar]
  42. R Core Team . (2021). R: A language and environment for statistical computing. R Foundation for Statistical Computing. https://www.R‐project.org/ [Google Scholar]
  43. Raykov, T. , & Marcoulides, G. A. (2004). Using the Delta method for approximate interval estimation of parameter functions in SEM. Structural Equation Modeling: A Multidisciplinary Journal, 11(4), 621–637. 10.1207/s15328007sem1104_7 [DOI] [Google Scholar]
  44. Robinson, L. D. , & Jewell, N. P. (1991). Some surprising results about covariate adjustment in logistic regression models. International Statistical Review/Revue Internationale de Statistique, 59(2), 227. 10.2307/1403444 [DOI] [Google Scholar]
  45. Rosenblum, M. , & van der Laan, M. J. (2010). Simple, efficient estimators of treatment effects in randomized trials using generalized linear models to leverage baseline variables. The International Journal of Biostatistics, 6(1), Article 13. 10.2202/1557-4679.1138 [DOI] [PMC free article] [PubMed] [Google Scholar]
  46. Rubin, D. B. (2005). Causal inference using potential outcomes: Design, modeling, decisions. Journal of the American Statistical Association, 100(469), 322–331. 10.1198/016214504000001880 [DOI] [Google Scholar]
  47. Skrondal, A. , & Rabe‐Hesketh, S. (2004). Generalized latent variable modeling: Multilevel, longitudinal, and structural equation models. Chapman & Hall/CRC. [Google Scholar]
  48. Stefanski, L. A. , & Boos, D. D. (2002). The calculus of M‐estimation. The American Statistician, 56(1), 29–38. 10.1198/000313002753631330 [DOI] [Google Scholar]
  49. Steyer, R. , Gabler, S. , von Davier, A. A. , Nachtigall, C. , & Buhl, T. (2000). Causal regression models I: Individual and average causal effects. Methods of Psychological Research Online, 5(2), 39–71. 10.23668/psycharchives.12750 [DOI] [Google Scholar]
  50. Steyer, R. , Mayer, A. , & Lossnitzer, C. (2022). Causal inference on total, direct, and indirect effects. In Maggino F. (Ed.), Encyclopedia of quality of life and well‐being research (pp. 1–26). Springer International Publishing. [Google Scholar]
  51. Terza, J. V. (2016). Inference using sample means of parametric nonlinear data transformations. Health Services Research, 51(3), 1109–1113. 10.1111/1475-6773.12494 [DOI] [PMC free article] [PubMed] [Google Scholar]
  52. Wester, R. A. , Rubel, J. , & Mayer, A. (2022). Covariate selection for estimating individual treatment effects in psychotherapy research: A simulation study and empirical example. Clinical Psychological Science, 10(5), 920–940. 10.1177/216770262110710 [DOI] [Google Scholar]
  53. Williams, R. (2012). Using the margins command to estimate and interpret adjusted predictions and marginal effects. The Stata Journal: Promoting communications on statistics and Stata, 12(2), 308–331. 10.1177/1536867X1201200209 [DOI] [Google Scholar]
  54. Wooldridge, J. M. (2010). Econometric analysis of cross section and panel data (2nd ed.). MIT Press. [Google Scholar]
  55. Woud, M. L. , Blackwell, S. E. , Shkreli, L. , Würtz, F. , Cwik, J. C. , Margraf, J. , Holmes, E. A. , Steudte‐Schmiedgen, S. , Herpertz, S. , & Kessler, H. (2021). The effects of modifying dysfunctional appraisals in posttraumatic stress disorder using a form of cognitive bias modification: Results of a randomized controlled trial in an inpatient setting. Psychotherapy and Psychosomatics, 90(6), 386–402. 10.1159/000514166 [DOI] [PubMed] [Google Scholar]
  56. Zeileis, A. (2006). Object‐oriented computation of sandwich estimators. Journal of Statistical Software, 16(9), 1–16. 10.18637/jss.v016.i09 [DOI] [Google Scholar]
