Published in final edited form as: J Am Stat Assoc. 2016 Jan 15;110(512):1412–1421. doi: 10.1080/01621459.2015.1079528

Rerandomization to Balance Tiers of Covariates

Kari Lock Morgan and Donald B. Rubin
PMCID: PMC5042467  NIHMSID: NIHMS771360  PMID: 27695149

Abstract

When conducting a randomized experiment, if an allocation yields treatment groups that differ meaningfully with respect to relevant covariates, groups should be rerandomized. The process involves specifying an explicit criterion for whether an allocation is acceptable, based on a measure of covariate balance, and rerandomizing units until an acceptable allocation is obtained. Here we illustrate how rerandomization could have improved the design of an already conducted randomized experiment on vocabulary and mathematics training programs, then provide a rerandomization procedure for covariates that vary in importance, and finally offer other extensions for rerandomization, including methods addressing computational efficiency. When covariates vary in a priori importance, better balance should be required for more important covariates. Rerandomization based on Mahalanobis distance preserves the joint distribution of covariates, but balances all covariates equally. Here we propose rerandomizing based on Mahalanobis distance within tiers of covariate importance. Because balancing covariates in one tier will in general also partially balance covariates in other tiers, for each subsequent tier we explicitly balance only the components orthogonal to covariates in more important tiers.

Keywords: randomization, experimental design, covariate balance, treatment allocation, causal inference, Mahalanobis distance

1 Introduction

Randomized experiments balance all covariates between treatment groups in expectation, yet chance imbalances often exist. If a randomization yields groups that are noticeably unbalanced before the experiment is conducted, units should be rerandomized, and rerandomization can continue until a pre-specified desired level of covariate balance is achieved. Morgan and Rubin (2012) provide theoretical results supporting rerandomization. The general procedure for rerandomization is as follows:

  1. Select units for the comparison of treatments, and collect covariate data on all units.

  2. Define an explicit criterion for covariate balance.

  3. Randomize units to treatment groups.

  4. Check covariate balance and return to Step 3 if the allocation is unacceptable according to the criterion specified in Step 2; continue until the balance is acceptable.

  5. Conduct the experiment.

  6. Perform inference (using a randomization test that follows exactly steps 2-4).

This framework as stated requires that covariate data be available for all units to be randomized before randomization begins, so is most applicable when all units are randomized simultaneously, although extensions to sequential assignment are discussed in another manuscript.
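To make Steps 2-5 concrete, here is a minimal R sketch of the accept/reject loop. The function names are ours, not from the paper's code; is_acceptable is a hypothetical placeholder for the balance criterion of Step 2 (for example, the Mahalanobis-distance rule of Section 3).

    # Minimal sketch of the rerandomization loop (Steps 3-4); `is_acceptable`
    # is a user-supplied function implementing the balance criterion of Step 2.
    rerandomize <- function(x, n_T, is_acceptable, max_tries = 1e6) {
      n <- nrow(x)
      for (i in seq_len(max_tries)) {
        W <- sample(rep(c(1, 0), c(n_T, n - n_T)))  # Step 3: pure randomization
        if (is_acceptable(x, W)) return(W)          # Step 4: accept or rerandomize
      }
      stop("no acceptable allocation found; criterion may be too stringent")
    }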

Step 2, defining acceptable balance, requires definitions for measuring balance and thresholds for acceptable balance. Morgan and Rubin (2012) recommend choosing an affinely invariant measure of covariate balance; that is, a criterion such that any affine transformation of the covariate matrix x, c+bx, yields the same rerandomization acceptance decisions. Affinely invariant measures of covariate balance are desirable because they (a) reflect the joint distribution of the covariates, (b) provide equal balance improvement for each covariate and for any linear combination of covariates, and (c) yield unbiased estimates for any linear combination of x if x is ellipsoidally symmetric (Morgan and Rubin, 2012). Mahalanobis distance, further explored in Section 3, is one common example of an affinely invariant criterion for balance. Although equal balance improvement for all covariates is theoretically appealing, in practice we may wish to impose more stringent criteria for covariates a priori thought to be more important, and be more lenient for less important covariates; Section 4 provides a new rerandomization criterion that does this. The threshold for acceptable balance for large samples involves a trade-off between better covariate balance and computational time; Section 5.1 provides a method that can often improve computational efficiency. Section 5.2 offers an alternative rerandomization procedure: generating a fixed number of allocations and choosing the best, according to a pre-defined measure of balance.

2 Example

Here we evaluate the benefits of rerandomization using data from an actual randomized experiment, conducted by Shadish et al. (2008)^1. The study was designed to examine whether observational studies can be analyzed to yield valid estimates of causal effects. Participants, undergraduate psychology students at a particular college, were first randomized to be in one of two arms: a randomized experiment (nr = 235) or an observational study (no = 210). Those in the randomized experiment were randomized to take either a vocabulary (nv = 116) or mathematics (nm = 119) training course, and those in the observational study were allowed to choose one of the identical courses. The causal effects being estimated in each arm are the effect that the vocabulary training course has on subsequent vocabulary test scores, and analogously for mathematics. The goal was to see whether the estimates of the causal effects obtained from the observational study (using methods such as propensity score subclassification) were close to the estimates found from the randomized experiment.

We start by considering the ten covariates we believe to be most relevant: scores on the vocabulary and mathematics pre-tests, number of math courses taken, how much the students like math and literature, whether they prefer math or literature, ACT score, college grade point average (GPA), age, and sex. The balance of these covariates resulting from the actual randomization to experiment or observational study is displayed in Figure 1.

Figure 1. Love plots for the standardized difference in covariate means for ten covariates in the Shadish et al. (2008) study: randomized experiment versus observational study.

The means for the number of math classes taken, how much they like math, and GPA are all significantly lower for the students randomized to participate in the randomized experiment than for those in the observational study. Because the two groups differ substantially on these important background covariates, differences in estimated causal effects between the experimental group and the observational study group could be because one was an observational study, or because the groups differed at baseline.

Rerandomization provides a way to prevent these chance imbalances, yielding more informative randomized experiments, a point made by Rubin (2008a) in his discussion. We use rerandomization with this example in two ways. In Section 3.2 we balance the ten primary covariates in Figure 1 by rerandomizing based on Mahalanobis distance. In Section 4.3, we consider a richer collection of covariates, twenty-six main effects, two quadratic terms, and sixteen interactions, and use rerandomization involving tiers of covariates of varying importance.

3 Mahalanobis Distance

3.1 Rerandomization based on Mahalanobis Distance: Review of Theory from Morgan and Rubin (2012)

Let x be the fixed n × k covariate matrix, with k covariates measured on n experimental units, and let W be the n-component treatment assignment column vector, where $W_i = 1$ indicates unit i is assigned the active treatment and $W_i = 0$ indicates unit i is assigned the control condition. Let pure randomization denote complete randomization into groups with no restrictions except for the total sample sizes, $n_T = \sum_{i=1}^n W_i$ and $n_C = \sum_{i=1}^n (1 - W_i)$. That is, the classic completely randomized experiment where n units are randomly partitioned into two groups of sizes $n_T$ and $n_C$. Let $\bar{X}_T - \bar{X}_C$ be the k-dimensional vector of the difference in covariate means between the treatment and control groups:

$\bar{X}_T - \bar{X}_C = \frac{x'W}{n_T} - \frac{x'(1 - W)}{n_C}.$ (1)

A common scalar measure of multivariate balance is Mahalanobis distance, which we present here standardized by a constant involving sample sizes:

$M \equiv \frac{n_T n_C}{n} \, (\bar{X}_T - \bar{X}_C)' \, \mathrm{cov}(x)^{-1} \, (\bar{X}_T - \bar{X}_C).$ (2)

Mahalanobis distance effectively calculates simple Euclidean distance after transforming all variables to an orthonormal canonical form; that is, where each transformed covariate has mean zero and variance one, and they are all uncorrelated. For rerandomization with a threshold a, we rerandomize if M > a. When $n_T = n_C$, $E(\bar{X}_T - \bar{X}_C \mid x, M \le a) = 0$. Also from Morgan and Rubin (2012), when (a) $n_T = n_C$, (b) covariate means are normally distributed, and (c) units are purely randomized and rerandomized when M > a,

$\mathrm{cov}(\bar{X}_T - \bar{X}_C \mid x, M \le a) = v_a \, \mathrm{cov}(\bar{X}_T - \bar{X}_C \mid x),$ (3)

where

$v_a \equiv \frac{2}{k} \times \frac{\gamma\left(\frac{k}{2} + 1, \frac{a}{2}\right)}{\gamma\left(\frac{k}{2}, \frac{a}{2}\right)} = \frac{P(\chi^2_{k+2} \le a)}{P(\chi^2_k \le a)},$ (4)

and $\gamma$ denotes the incomplete gamma function: $\gamma(b, c) \equiv \int_0^c y^{b-1} e^{-y} \, dy$. Thus, under these conditions, this rerandomization maintains the correlation matrix of $\bar{X}_T - \bar{X}_C$, and for each covariate leads to equal improvement in balance in the sense of variance reduction.
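As a concrete illustration, a short R sketch of (2) and (4) follows; the function names are ours, and the code assumes x is an n × k numeric matrix and W a 0/1 assignment vector.

    # Sketch of M in (2), standardized by the sample-size constant.
    mahalanobis_M <- function(x, W) {
      n_T <- sum(W); n_C <- sum(1 - W); n <- n_T + n_C
      d <- colMeans(x[W == 1, , drop = FALSE]) - colMeans(x[W == 0, , drop = FALSE])
      (n_T * n_C / n) * drop(t(d) %*% solve(cov(x)) %*% d)
    }

    # v_a in (4), via the chi-squared representation.
    v_a <- function(a, k) pchisq(a, df = k + 2) / pchisq(a, df = k)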

Percent Reduction in Variance (PRIV) is defined as the percentage by which rerandomization reduces the variance of the difference in means, for each covariate, xj, relative to pure randomization:

$\mathrm{PRIV} = 100 \left( \frac{\mathrm{var}(\bar{X}_{j,T} - \bar{X}_{j,C} \mid x) - \mathrm{var}(\bar{X}_{j,T} - \bar{X}_{j,C} \mid x, W \text{ acceptable})}{\mathrm{var}(\bar{X}_{j,T} - \bar{X}_{j,C} \mid x)} \right).$ (5)

The higher the PRIV, the more the actual differences in covariate means will be concentrated around 0. Under the above conditions, the PRIV for each covariate, and for any linear combination of these covariates, is

$1 - v_a.$ (6)

Fewer covariates and smaller thresholds, a, lead to larger PRIV. In particular, greater computational power (a lower proportion of accepted randomizations) is needed to achieve the same degree of balance for a larger number of covariates.

Let yi(Wi) denote the ith unit's potential outcome under treatment assignment Wi, and assume the Stable Unit Treatment Value Assumption (SUTVA) (Rubin, 1980): the potential outcomes are fixed and do not change with different W. Define τ to be the true average treatment effect in the sample,

$\tau \equiv \sum_{i=1}^n \frac{y_i(1) - y_i(0)}{n},$ (7)

and define τ^ to be the simple estimate of the average treatment effect:

$\hat{\tau} \equiv \frac{\sum_{i=1}^n W_i \, y_i(1)}{n_T} - \frac{\sum_{i=1}^n (1 - W_i) \, y_i(0)}{n_C}.$ (8)

Suppose (i) nT = nC, (ii) the covariate means are normally distributed, (iii) the outcome means are normally distributed given x, (iv) the treatment effect is additive (i.e. yi(1) – yi(0) is the same for all i), and (v) units are rerandomized when M > a, then the PRIV for τ^ is

(1va)R2, (9)

where $R^2$ is the squared multiple correlation between x and y(0). Expression (9) also gives the PRIV for any unobserved covariates. Figure 3 in Morgan and Rubin (2012) shows this PRIV as a function of k, the proportion of randomizations accepted, and $R^2$. By (9), when $v_a \to 0$, rerandomization has the potential to provide the same increase in precision that would be achieved by increasing the sample size by the factor $(1 - R^2)^{-1}$, which implies that rerandomization can be equivalent to doubling the sample size if $R^2 = 0.5$; for $R^2 = 0.9$, rerandomization can give the same power as pure randomization with only a tenth of the sample size! If cost or availability of experimental units is an issue, this could be hugely beneficial.
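As a quick numerical check of (6) and (9), a sketch in R under the normality assumptions above (variable names are ours):

    k   <- 10
    p_a <- 0.001
    a   <- qchisq(p_a, df = k)                        # threshold with P(M <= a) = p_a
    va  <- pchisq(a, df = k + 2) / pchisq(a, df = k)  # equation (4)
    1 - va                                            # PRIV per covariate, about 0.88
    R2  <- 0.5
    1 / (1 - (1 - va) * R2)   # precision gain for the estimate; -> (1 - R2)^-1 = 2 as va -> 0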

Figure 3. Empirical distributions for the standardized difference in means for each covariate, each based on 10,000 simulated randomizations, pure and rerandomized. Empirical PRIVs are shown on the right. The diamonds show the balance from the actual experiment.

Because rerandomization increases precision of estimators when covariates are correlated with the outcome(s), the traditional two sample t-test and corresponding interval will be too “conservative”, resulting in overly wide confidence intervals and less powerful tests (Morgan and Rubin, 2012). Inference using randomization tests, following the rerandomization procedure actually used, reflects this increase in precision, and maintains the Type I error rate and valid frequentist properties (Morgan and Rubin, 2012). For additive treatment effects, interval estimates can be generated as the sets of values that would not be rejected by a randomization test.

3.2 Experimental Data: Mahalanobis using Ten Covariates

We first focus on the initial randomization of the Shadish et al. (2008) data, randomizing 445 participants into the randomized experiment (nC = 235) and the observational study (nT = 210), matching the sample sizes obtained in the actual study. Because the sample sizes are not exactly equal in the two treatment groups, the theoretical results given in Section 3.1 are approximations, but we will see that nonetheless the empirical results closely match these theoretical values. In this section we balance only the ten primary covariates, displayed in Figure 1, and balance these covariates equally by using Mahalanobis distance to measure balance.

Let $p_a$ denote the proportion of all randomizations considered acceptable by the rerandomization criterion, so here, $p_a = P(M \le a)$. Often in practice $p_a$ will be chosen first, and a chosen correspondingly. If a is to be specified first, care should be taken to ensure that the corresponding $p_a$ is not too small, and in particular that some randomizations will yield $M \le a$. The number of randomizations required to obtain an acceptable allocation follows a Geometric distribution with parameter $p_a$, so equals $1/p_a$ in expectation. We assume the analysis will use a randomization test, so also relevant is the computational time needed to generate enough (here assumed to be 1000) acceptable randomizations to perform a randomization test. For this data set, generating and checking balance for 1000 pure randomizations takes about 1.1 seconds using programs in R (R Development Core Team, 2011) on a Lenovo X200s, so it should take about $1.1/p_a$ seconds to generate 1000 acceptable randomizations.
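A back-of-envelope version of this timing calculation in R, using the roughly 1.1 seconds per 1000 checked randomizations quoted above:

    p_a <- c(1e-3, 1e-4)
    seconds <- 1.1 / p_a    # expected time to collect 1000 acceptable randomizations
    round(seconds / 60, 1)  # about 18 minutes for p_a = 0.001; about 3 hours for p_a = 0.0001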

Based on the estimate of computational time, pa = 0.001 or even pa = 0.0001 seem reasonable. With k = 10 covariates, using (6) we calculate the PRIV to be about 93% for pa = 0.0001 and 88% for pa = 0.001. In this illustrative example, we decide that reducing the variance by an extra 5% is not worth the additional three hours of computational time, and so use pa = 0.001.

With a sample size of n = 445 and these covariates, we can safely assume normality of covariate means, so $M \sim \chi^2_{10}$ under pure randomization. If this were not the case, we could simulate an empirical distribution of M and find a from that empirical distribution. Thus we set the threshold a such that $P(\chi^2_{10} \le a) = p_a = 0.001$, yielding a = 1.48, so we rerandomize any allocation with M > 1.48. The value of M in the actual experiment was M = 21.0, which sits at the 98th percentile of M under pure randomization: an unusually unlucky allocation.
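Both numbers can be reproduced with standard chi-squared functions in R:

    a <- qchisq(0.001, df = 10)  # 1.479: rerandomize whenever M > 1.48
    pchisq(21.0, df = 10)        # about 0.98: the realized M sits at the 98th percentile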

To explore the effects of rerandomization, we simulate 10,000 pure randomizations and 10,000 acceptable randomizations with M ≤ 1.48. Density curves for the difference in means for the first covariate (score on the vocabulary pre-test) are given in Figure 2 under both pure randomization and rerandomization. As expected, rerandomization decreases the variance, thereby eliminating extreme differences in means and creating more allocations with differences in means that are close to zero.

Figure 2. Empirical densities for the difference in means between treatment groups for the vocabulary pre-test, each based on 10,000 simulated randomizations, pure and rerandomized.

Because rerandomizing using M theoretically yields the same variance reduction for each covariate, Figure 2 should look essentially identical for every covariate. Empirical distributions for the standardized difference in covariate means, the number of standard errors (based on pure randomization) between the treatment group mean and the control group mean, for each covariate are shown in Figure 3. We also include a linear combination of covariates at the bottom of Figure 3, to illustrate that the effects of rerandomization on any linear combination of covariates are also approximately the same as on any individual covariate. This linear combination was not included when calculating M. For each covariate, the empirical PRIV is displayed on the right of Figure 3; note these are all very close to the theoretical value of 88%, despite the fact that the 88% is calculated assuming exact normality of covariate means and equal sample sizes.

We also examine the effect of rerandomization on estimates of the treatment effect. Rather than considering methods for analyzing observational studies, we shift attention to the second stage of randomization in Shadish et al. (2008), where in the experimental arm students are randomized into the vocabulary or mathematics training course. With this randomization, we balance the same ten covariates, but restrict our sample to the 235 participants who were initially randomized to the experimental arm, of which 116 were randomized to vocabulary training and 119 to mathematics. In this study, we are estimating two causal effects: the effects of vocabulary and math training courses on the corresponding test scores. For the purpose of this illustrative example, we fill in the missing potential outcomes assuming the null hypothesis is true, that is, assuming no course effect for any subject. Therefore, in this simulation, we should see causal effect estimates centered around 0.

We again use rerandomization with an acceptance probability of $p_a = 0.001$, so a = 1.48 and the PRIV for each covariate is 88%. By (9), the extent to which rerandomization affects the estimates of the treatment effect depends on the $R^2$ between the outcome and the covariates, approximated by 0.11 for the vocabulary outcome and 0.32 for the mathematics outcome. These are rather low, so we do not expect rerandomization to have a large impact on the sampling variance of the estimated average treatment effects. We compute the theoretical PRIVs of the estimated treatment effects using (9): vocabulary: 88% × $R^2$ = 88% × 0.11 = 9.8%, and mathematics: 88% × $R^2$ = 88% × 0.32 = 28%. We can also calibrate using the increase in precision, or equivalently, in effective sample size. Theoretically, the precision of the estimated vocabulary treatment effect is increased by a factor of 1/(1 − 0.097) = 1.11 with rerandomization, and the estimated mathematics treatment effect by a factor of 1/(1 − 0.28) = 1.39. For mathematics, this means the effective sample size with rerandomization is 1.39 × 235 = 327, rather than the actual 235.

We again simulate 10,000 pure randomizations and rerandomizations, and for each calculate the estimated treatment effects. The empirical PRIVs are 4% for vocabulary and 27% for mathematics, close to the theoretical values, but not exactly the same, due to the approximations involved.

4 Covariates of Varying Importance

Measuring balance with M balances all covariates equally, but when some covariates are known to be substantively more important than others, that is, thought to be more strongly correlated with the outcomes, better balance improvement for them is preferred. This desire is particularly strong when the number of covariates is large, requiring more computational power to achieve a given degree of equal balance for each. Also, when the sample size is small, ensuring close to perfect balance for all covariates may be impossible, may yield a deterministic randomization, or may restrict the number of acceptable allocations to a number too small for a randomization test. With many covariates, or when balancing interactions and functions of main effects in addition to raw covariates, computational or sample size limits often prevent rerandomization from achieving identical joint distributions in the treatment and control groups. Rather than either compromising on balance for important main effects, or ignoring balance for less important covariates or interactions, we provide a more flexible framework that allows for differing levels of balance across covariates.

4.1 Individual Thresholds

When covariates vary in importance, a natural choice is to impose balance restrictions on each covariate individually, as suggested by Moulton (2004), Maclure et al. (2006), and Cox (2009). We illustrate the sub-optimal nature of this with two simple examples, the first with two uncorrelated covariates, and the second with two correlated covariates, both shown in Figure 4. Restricting each covariate individually forms a rectangular acceptance region, defined by the two calipers, whereas restricting Mahalanobis distance forms an ellipsoidally shaped acceptance region, or spherically shaped with orthogonal covariates. In both situations, the acceptance probability is set to pa = 1/2. Because Mahalanobis distance takes the joint distribution into account, the combinations of covariate difference in means accepted by Mahalanobis distance have greater average probability density than those accepted by restricting each covariate individually, as is clear from the figures. Restricting each covariate individually also fails to preserve the joint distribution of the covariate difference in means across rerandomizations, as pointed out in Rubin (1976).

Figure 4. Points displayed are $\bar{X}_T - \bar{X}_C$ for two covariates, under orthogonality (left) and correlation (right), under pure randomization of units into two equally sized treatment groups. Acceptance regions are shown for two different rerandomization methods, both with $p_a = 0.5$. The squares are the acceptance regions restricting each covariate individually and the ellipses are the acceptance regions based on M.

This problem becomes more substantial as k increases or pa decreases. Even with independent covariates, if pa is set to be the same for both methods, the proportion of accepted randomizations falling in the corners after restricting covariates individually (randomizations that would not be accepted by restricting M) goes to 1 as k increases and pa decreases. With 20 independent covariates restricted equally and pa = 0.01, more than half of the randomizations accepted by the intersection caliper method would be rejected using M. If these covariates were correlated, or there were more covariates, or a more stringent acceptance level were chosen, this problem would become even more severe.

4.2 Tiers of Covariates

We partition the covariates into T tiers of importance, with $k_t$ covariates in tier t. Tier 1 includes the most important covariates, and subsequent tiers include covariates of decreasing importance. The ordering of covariates within a tier is irrelevant. Let $x_t = (x_{t1}, \ldots, x_{tk_t})$ represent the covariates in tier t. If we were to balance only the covariates in tier 1, covariates in tier 2 would also be partially balanced to the extent that they are correlated with the covariates in tier 1, by (9). Rather than unnecessarily balancing all of the tier 2 covariates again, we balance only the portion of the tier 2 covariates that is orthogonal to the tier 1 covariates.

Let $e_1 \equiv x_1$, and for t > 1 define $e_{tj}$ to be the residuals for each covariate $x_{tj}$ after regressing on all covariates in lower numbered tiers:

$x_{tj} = \beta_0 + \sum_{i=1}^{t-1} \beta_i' x_i + e_{tj}, \qquad t = 2, \ldots, T.$ (10)

For $1 < t \le T$, define $e_t$ to be the matrix of residuals for covariates in tier t, so $e_t = (e_{t1}, \ldots, e_{tk_t})$. Let $M_t$ be the Mahalanobis distance between the treatment and control groups for $e_t$:

$M_t = \frac{n_T n_C}{n} \, (\bar{e}_{t,T} - \bar{e}_{t,C})' \, \mathrm{cov}(e_t)^{-1} \, (\bar{e}_{t,T} - \bar{e}_{t,C}).$ (11)

There are different ways to conduct rerandomization using $M_1, \ldots, M_T$, for example placing separate thresholds on each $M_t$, or placing a threshold on a weighted linear combination (e.g., see Lock (2011)). Here we explore only the former option, because it is more computationally efficient. The computationally intensive aspect of rerandomization is assessing balance for each randomization, and with a separate threshold for each tier, if tier 1 does not have adequate balance, there is no need to calculate $M_2, \ldots, M_T$. Because early tiers will typically have the more stringent acceptance probabilities, this may avoid many calculations.

Define $p_{a_t}$ to be the acceptance probability for tier t, set the corresponding threshold $a_t$ such that $P(M_t \le a_t) = p_{a_t}$, and rerandomize if any $M_t > a_t$; this again leads to covariate means with the same expectation in treatment and control groups. Also, we have the following result:
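A sketch of this tiered criterion in R (our naming; it reuses the hypothetical mahalanobis_M helper from the Section 3.1 sketch). Here tiers is a list of column-index vectors ordered from most to least important, and a is the vector of thresholds $a_t$:

    # Residuals e_t of (10): regress each tier's covariates on all covariates
    # in more important (lower numbered) tiers.
    tier_residuals <- function(x, tiers) {
      lapply(seq_along(tiers), function(t) {
        xt <- x[, tiers[[t]], drop = FALSE]
        if (t == 1) return(xt)                                   # e_1 = x_1
        prev <- x[, unlist(tiers[seq_len(t - 1)]), drop = FALSE]
        resid(lm(xt ~ prev))                                     # residuals in (10)
      })
    }

    # Accept an allocation only if M_t <= a_t for every tier, checking the
    # most important tier first so poor allocations are rejected cheaply.
    accept_tiered <- function(e_list, W, a) {
      for (t in seq_along(e_list)) {
        if (mahalanobis_M(e_list[[t]], W) > a[t]) return(FALSE)
      }
      TRUE
    }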

Theorem 4.1

If the covariate means are normally distributed, and if units are purely randomized with $n_T = n_C$ and rerandomized whenever $M_t > a_t$ for any $t = 1, \ldots, T$, then

$E(\bar{X}_T - \bar{X}_C \mid x, M_t \le a_t \, \forall t) = 0,$ (12)
$\mathrm{cov}(\bar{X}_{1,T} - \bar{X}_{1,C} \mid x, M_t \le a_t \, \forall t) = v_{a_1} \, \mathrm{cov}(\bar{X}_{1,T} - \bar{X}_{1,C} \mid x),$ (13)

and the PRIV for the jth covariate in tier t, xtj, is

$1 - v_{a_1} \quad \text{if } t = 1,$ (14)
$1 - \left[ v_{a_1} R^2_{1,tj} + v_{a_2} (1 - R^2_{1,tj}) \right] \quad \text{if } t = 2,$ (15)
$1 - \left[ v_{a_1} R^2_{1,tj} + \sum_{l=2}^{t-1} v_{a_l} (R^2_{l,tj} - R^2_{l-1,tj}) + v_{a_t} (1 - R^2_{t-1,tj}) \right] \quad \text{if } t > 2,$ (16)

where $R^2_{l,tj}$ denotes the squared multiple correlation between covariate $x_{tj}$ and all covariates in tiers 1 to l, and $v_{a_t}$ is defined by equation (4) with $k = k_t$ and $a = a_t$.

The proof of Theorem 4.1 is in Appendix A.

Theorem 4.1 states that the results for tier 1 covariates are equivalent to rerandomizing based on Mahalanobis distance for only these covariates: the correlation matrix for differences in means is retained and balance improvement is the same for all covariates in tier 1. For each tier, $1 - v_{a_t}$ is the PRIV for the orthogonalized covariates $e_t$; the PRIVs given in Theorem 4.1 apply to the original, unorthogonalized covariates, and for tiers t > 1 these can differ for covariates within the same tier, depending on how correlated each covariate is with covariates in more important tiers.

To parse the PRIVs given in Theorem 4.1, we first consider covariates in tier t = 2. If $R^2_{1,2j} = 1$, the PRIV would be $1 - v_{a_1}$, because the residuals would all be 0. If $R^2_{1,2j} = 0$, the PRIV would be $1 - v_{a_2}$, because in this case balancing covariates in tier 1 has no impact on the balance of $x_{2j}$. In general, the PRIV is 1 minus a weighted mean of $v_{a_1}$ and $v_{a_2}$, with the weights determined by the correlation between $x_{2j}$ and the covariates in tier 1.

Next, for illustrative purposes, consider PRIV for covariate x4j in tier 4. By Theorem 4.1, this is

$1 - \left[ v_{a_1} R^2_{1,4j} + v_{a_2} (R^2_{2,4j} - R^2_{1,4j}) + v_{a_3} (R^2_{3,4j} - R^2_{2,4j}) + v_{a_4} (1 - R^2_{3,4j}) \right].$ (17)

Each component in (17) arises from restricting the balance on a particular tier. For t < 4, the $v_{a_t}$ for tier t is weighted by the additive contribution of $e_t$ to $R^2_{t,4j}$. For example, if $R^2_{3,4j}$ is only slightly larger than $R^2_{2,4j}$, then the covariates in tier 3 contribute little to $R^2_{3,4j}$, so balancing $e_3$ will have little impact on the balance of $x_{4j}$, and the weight on $v_{a_3}$ is correspondingly small. In contrast, if $x_{4j}$ is strongly correlated with $e_3$ but not with $e_1$ or $e_2$, then the weight on $v_{a_3}$ will be larger.

Define $p_a = (p_{a_1}, p_{a_2}, \ldots, p_{a_T})$, $a = (a_1, \ldots, a_T)$, and $v_a = (v_{a_1}, v_{a_2}, \ldots, v_{a_T})$. Because covariates in earlier tiers are more important, we expect $v_{a_1} < v_{a_2} < \cdots < v_{a_T}$. A simple way to ensure this is to choose the tiers such that $k_1 < k_2 < \cdots < k_T$, and then set all the acceptance probabilities to be the same. However, if this is not possible or if more control is desired, each $p_{a_t}$ can be altered to create the desired level of balance for that tier. The overall acceptance probability, $p_a$, satisfies $p_a \ge \prod_{t=1}^T p_{a_t}$, where the inequality approaches equality as the number of covariates in each tier increases, because $\chi^2_{k_t}$ approaches a normal distribution as $k_t \to \infty$, and so orthogonality implies independence.

Theorem 4.2

If the covariate means are normally distributed, if units are purely randomized with $n_T = n_C$ and rerandomized whenever $M_t > a_t$ for any $t = 1, \ldots, T$, and if the treatment effect is additive, then

$E(\hat{\tau} \mid x, M_t \le a_t \, \forall t) = \tau,$ (18)

and the PRIV for the estimated treatment effect, τ^, is

$(1 - v_{a_1}) R^2_{1,y} \quad \text{if } T = 1,$ (19)
$(1 - v_{a_1}) R^2_{1,y} + \left[ \sum_{t=2}^{T} (1 - v_{a_t}) (R^2_{1:t,y} - R^2_{1:(t-1),y}) \right] \quad \text{if } T > 1,$ (20)

where $R^2_{t,y}$ is the squared multiple correlation between $x_t$ and y(0), and $R^2_{1:t,y}$ is the squared multiple correlation between $x_{1:t}$ and y(0).

The proof of Theorem 4.2 is in Appendix A.

4.3 Shadish Experimental Data: Tiers of Covariates

Here we return to the Shadish experimental data, but now utilize tiers of covariates. This section pertains only to the second stage of the Shadish et al. (2008) randomization, randomizing units to either the vocabulary or mathematics training class within the experimental arm of the study. Here we balance twenty-six main effects, two quadratic terms, and sixteen interactions. If we were to compute Mahalanobis distance on all forty-four of these covariates, maintaining the acceptance probability used in Section 3.2, $p_a = 0.001$, rerandomization would yield an expected 54% reduction in variance for each covariate. However, for this dataset, some covariates, such as vocabulary and mathematics pre-test scores, are clearly more important than others, and for these covariates, better balance is desired.

We place vocabulary and mathematics pre-test scores in the top tier. In the next tier, we place the number of mathematics courses taken, whether the person likes math, whether the person likes literature, ACT score, and college GPA, because these five covariates are likely to be related to the outcomes. The third tier contains nineteen covariates deemed less important, but still worth balancing. The fourth tier contains quadratic terms for both pre-test scores, to balance the variances of these distributions as well as their means, and interactions between covariates in the top two tiers, in case any of these are important. This gives T = 4 and k = (2, 5, 19, 18).

To maintain consistency with Section 3.2, we keep the overall acceptance probability at 0.001. Because the tiers have increasing numbers of covariates, one option is to set each $p_{a_t} = (0.001)^{1/4} = 0.178$. Calculating each $v_{a_t}$ by (4), this gives $v_a$ = (0.09, 0.29, 0.58, 0.61). If this were satisfactory, we would proceed with equal acceptance probabilities. However, here we decide to improve balance for the top two tiers, at the expense of the bottom two tiers. For various vectors $p_a$ we calculate $v_a$, and decide on $p_a$ = (0.1, 0.1, 0.2, 0.5), which means the expected PRIVs for the residuals (orthogonalized covariates) will be 95% for tier 1, 78% for tier 2, 41% for tier 3, and 23% for tier 4. Recall that the residuals within a tier receive the same variance reduction, although the balance improvement for the original covariates may differ, even within a tier. PRIVs for the original covariates can be calculated using Theorem 4.1, and will depend on correlations with covariates in lower numbered tiers.
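These tier-level quantities are straightforward to reproduce; a sketch in R, where each threshold $a_t$ is the $p_{a_t}$ quantile of $\chi^2_{k_t}$ and each $v_{a_t}$ follows from (4):

    k_t  <- c(2, 5, 19, 18)                      # covariates per tier
    p_at <- c(0.1, 0.1, 0.2, 0.5)                # chosen acceptance probabilities
    a_t  <- qchisq(p_at, df = k_t)               # tier thresholds
    v_at <- pchisq(a_t, df = k_t + 2) / pchisq(a_t, df = k_t)  # equation (4)
    1 - v_at      # expected PRIVs for the tier residuals (compare the percentages above)
    prod(p_at)    # overall acceptance probability: 0.001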

Simulating 1000 acceptable randomizations took 20.1 minutes. Balancing only the ten covariates in Section 3.2 took 18.3 minutes for the same overall acceptance probability; even though we are now balancing more covariates and potentially checking balance on multiple tiers for each randomization, the computational time is approximately the same because so many randomizations (about 95%) only require calculating Mahalanobis distance for tier 1, which has only two covariates.

Empirical PRIVs for the residuals and covariates in each tier are given in Figure 5. The theoretical values for the residuals of each tier are depicted with vertical dotted lines. The empirical values for the residuals are scattered randomly around these lines, lying close to the theoretical values. The covariates, however, have varying PRIVs, even within each tier. The covariates in tier 3 are only weakly correlated with the covariates in tiers 1 and 2, and thus have PRIVs close to the theoretical value for the residuals in tier 3, $1 - v_{a_3}$; because tier 4 is entirely composed of functions (quadratics and interactions) of covariates in tiers 1 and 2, its PRIVs are closer to the theoretical values for the residuals in earlier tiers.

Figure 5. Empirical PRIVs for residuals and covariates based on 1000 simulated rerandomizations and 10,000 simulated pure randomizations. Covariates are ordered by tier. Theoretical PRIV values for the residuals of each tier, $1 - v_{a_t}$, are displayed at the top and shown with vertical dotted lines.

We also explore the effects of this rerandomization scheme on the outcomes, vocabulary and mathematics test scores. The empirical PRIVs are 15% for the estimated treatment effect for vocabulary and 25% for the estimated treatment effect for mathematics. This should be contrasted with the 4% for vocabulary and 27% for mathematics achieved in Section 3.2 by balancing all ten covariates equally. Here we improve upon the estimation of the vocabulary treatment effect quite substantially, but gain essentially nothing for the mathematics outcome. The theoretical PRIVs, as calculated by Theorem 4.2, are 15.9% and 29.7% for vocabulary and mathematics, respectively. The empirical mean estimated treatment effects are −0.016 and −0.011 for vocabulary and mathematics, respectively. These values are very close to 0, as expected, because this simulation was conducted under the null hypothesis of no treatment effect.

5 Extensions

5.1 Computational Efficiency

When using Mahalanobis distance, either on all covariates or within tiers, in many situations computational time can be dramatically reduced by first transforming the covariates to an orthonormal basis. Let $e_1 = (x_1 - \bar{x}_1)/s_1$, where $\bar{x}_1$ and $s_1$ are the sample mean and standard deviation, respectively, of $x_1$, and for $j \ge 2$ define $e_j$ by regressing $x_j$ on $(x_1, \ldots, x_{j-1})$:

$x_j = \beta_0 + \sum_{i=1}^{j-1} \beta_i x_i + \epsilon_j,$ (21)

and standardizing so that $e_j = (\epsilon_j - \bar{\epsilon}_j)/s_{\epsilon_j}$. Define $e = (e_1, \ldots, e_k)$, and balance e rather than x. Because M is affinely invariant, M will be the same for e as for x, and the order of the covariates is irrelevant. Because e is an orthonormal basis, cov(e) = I, so

$M = \frac{n_T n_C}{n} \sum_{j=1}^k (\bar{E}_{j,T} - \bar{E}_{j,C})^2,$ (22)

where $\bar{E}_{j,T}$ and $\bar{E}_{j,C}$ are the treatment and control means, respectively, for $e_j$. This allows us to check balance sequentially, adding in one covariate at a time; when the acceptance probability $p_a$ is small, most randomizations will be rejected after checking only the first few covariates. For the experimental data considered in Section 3.2, with n = 445 and k = 10 covariates, use of this method resulted in a 60% reduction in computational time.
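A sketch of this sequential check in R (helper names are ours): orthonormalize x once via a QR decomposition, then accumulate the sum in (22), abandoning a randomization as soon as the running total exceeds the threshold.

    # Orthonormal basis e: including an intercept column makes every remaining
    # column of Q mean 0; rescaling gives each column sample variance 1.
    orthonormalize <- function(x) {
      n <- nrow(x)
      Q <- qr.Q(qr(cbind(1, x)))[, -1, drop = FALSE]
      Q * sqrt(n - 1)
    }

    # Sequential evaluation of M in (22): stop at the first covariate that
    # pushes the running sum past the threshold a.
    M_sequential <- function(e, W, a) {
      n_T <- sum(W); n_C <- sum(1 - W); n <- n_T + n_C
      M <- 0
      for (j in seq_len(ncol(e))) {
        d <- mean(e[W == 1, j]) - mean(e[W == 0, j])
        M <- M + (n_T * n_C / n) * d^2
        if (M > a) return(Inf)  # reject early; later covariates need not be checked
      }
      M
    }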

The standardization of the variables to have mean 0 can also be exploited to improve computational efficiency. We rewrite the difference in means as follows:

$\bar{E}_{j,T} - \bar{E}_{j,C} = \frac{e_j'W}{n_T} - \frac{e_j'(1 - W)}{n_C} = \frac{n}{n_T n_C} \, e_j'W - \frac{n_T}{n_T n_C} \, e_j'\mathbf{1}$ (23)
$= \frac{n}{n_T n_C} \, e_j'W,$ (24)

where (24) follows from (23) because each $e_j$ has mean 0. For the experimental data from Section 3.2, use of (24) when computing (22) reduced computational time by 75% in R, as opposed to using mean(e[W==1,j]) - mean(e[W==0,j]).
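In code, (24) amounts to a one-term replacement for the difference in means; a minimal sketch:

    # Valid only because each column of e has exactly mean 0, so the
    # control-group term in (23) is redundant.
    diff_means_fast <- function(e_j, W, n_T, n_C) {
      ((n_T + n_C) / (n_T * n_C)) * sum(e_j[W == 1])  # equation (24)
    }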

Although these methods improve computational efficiency for a variety of datasets, they will not always do so. Using (22) and checking covariates sequentially will be most advantageous with fewer covariates and lower pa, when it is more likely that only a small number of covariates will need to be checked before the threshold is exceeded. Using (24) is more important for larger sample sizes. In practice we recommend measuring time to check balance for a specified number, say 1000, randomizations and using the method and code that is most efficient for the situation being faced.

5.2 Choosing the Best

An attractive alternative to specifying a threshold for acceptable balance may be to generate a fixed number of allocations, and choose the “best” according to a pre-specified criterion, such as Mahalanobis distance. In practice we expect to see little difference between rerandomizing until an allocation with $M \le a$ is obtained, and choosing the allocation with the smallest M from $1/p_a$ randomizations, where $p_a = P(M \le a)$. However, with non-scalar measures of balance, such as Mahalanobis distance computed within tiers of covariates, the definition of “best” is not obvious. For choosing the best with tiers of covariates, rather than restricting each tier with a separate threshold, the balance measures for each tier can be combined into a scalar measure with a weighted linear combination: $\sum_{t=1}^T w_t M_t$.
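A sketch of this alternative in R, again with the hypothetical mahalanobis_M helper from Section 3.1 as the default balance measure:

    # Generate N allocations and keep the one with the smallest balance measure.
    best_of_N <- function(x, n_T, N, balance = mahalanobis_M) {
      n <- nrow(x)
      best_W <- NULL
      best_val <- Inf
      for (i in seq_len(N)) {
        W <- sample(rep(c(1, 0), c(n_T, n - n_T)))
        val <- balance(x, W)
        if (val < best_val) { best_val <- val; best_W <- W }
      }
      best_W
    }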

5.3 Covariance Adjustment

Although alternatives such as blocking or covariance adjustment may work in simple situations with few covariates, incorporating many covariates and functions of covariates is generally difficult. For example, a priori specifying covariance adjustment with many covariates and interactions in a protocol for a randomized experiment could be problematic. Rerandomization is intended to improve covariate balance by changing the design of the experiment, and so obviate the need to rely on model-based adjustment for imbalance in the analysis phase. There are many reasons for preventing problems by design rather than trying to fix problems by analysis, as stated in Rubin (2008b), with two of the primary reasons being the subjectivity that can be introduced when model-fitting using the outcome data and the sensitivity of estimates to the specification of the model imposed. In addition to being a design-based method, rerandomization has the potential to improve precision more than typical covariance adjustment (Morgan and Rubin, 2012). If covariance adjustment is going to be used, Cox (1982) proposed using rerandomization first, to further improve the precision of the covariance-adjusted estimator. Also, an experiment with better covariate balance, such as can be achieved through rerandomization, will yield estimators less sensitive to model misspecification. Moreover, the tiered approach for balancing covariates of varying importance introduced here allows for better balance of some covariates over others, whereas with standard covariance adjustment a variable must be either included in or excluded from the model.

6 Conclusion

When covariates vary in importance, a rerandomization criterion should be set so that balance is better for more important covariates. However, balancing covariates unequally comes at the price that fully affinely invariant measures of covariate balance can no longer be used. We compromise by using Mahalanobis distance within tiers of covariates, maintaining affine invariance where possible, while providing the flexibility to achieve better balance for more important covariates.

Acknowledgments

This work was partially supported by the National Science Foundation (NSF SES-0550887 and NSF IIS-1017967) and the National Institutes of Health (NIH-R01DA023879).

Appendix A

Proof of Theorem 4.1

Equation (12) follows from Corollary 2.2 in Morgan and Rubin (2012). Equation (13) follows from (3), because additionally balancing components orthogonal to $x_1$ does not affect the balance of $x_1$.

Define $x_{1:(t-1)} = (x_1, \ldots, x_{t-1})$, all covariates in tiers 1 to t − 1. By (10),

$\bar{X}_{tj,T} - \bar{X}_{tj,C} = \beta'(\bar{X}_{1:(t-1),T} - \bar{X}_{1:(t-1),C}) + (\bar{e}_{tj,T} - \bar{e}_{tj,C}).$ (25)

Because the two terms on the right are uncorrelated, we have

$\mathrm{var}(\bar{X}_{tj,T} - \bar{X}_{tj,C}) = \beta' \, \mathrm{cov}(\bar{X}_{1:(t-1),T} - \bar{X}_{1:(t-1),C}) \, \beta + \mathrm{var}(\bar{e}_{tj,T} - \bar{e}_{tj,C}).$ (26)

Let $\sigma^2_{x_{tj}}$ denote the variance of the covariate $x_{tj}$, and $\sigma^2_{e_{tj}}$ the variance of the residuals $e_{tj}$, so $\sigma^2_{e_{tj}} = \sigma^2_{x_{tj}}(1 - R^2_{t-1,tj})$. Therefore

$\mathrm{var}(\bar{e}_{tj,T} - \bar{e}_{tj,C} \mid x) = \frac{n}{n_T n_C} \, \sigma^2_{e_{tj}}$ (27)
$= \frac{n}{n_T n_C} \, \sigma^2_{x_{tj}} (1 - R^2_{t-1,tj}),$ (28)

and

$\beta' \, \mathrm{cov}(\bar{X}_{1:(t-1),T} - \bar{X}_{1:(t-1),C} \mid x) \, \beta = \mathrm{var}(\bar{X}_{tj,T} - \bar{X}_{tj,C} \mid x) - \mathrm{var}(\bar{e}_{tj,T} - \bar{e}_{tj,C} \mid x)$ (29)
$= \frac{n}{n_T n_C} (\sigma^2_{x_{tj}} - \sigma^2_{e_{tj}})$ (30)
$= \frac{n}{n_T n_C} \, \sigma^2_{x_{tj}} R^2_{t-1,tj}.$ (31)

Because $e_t$ and $e_{t'}$ are orthogonal for $t \ne t'$, by (3) we have

$\mathrm{cov}(\bar{e}_{t,T} - \bar{e}_{t,C} \mid x, M_t \le a_t \, \forall t) = v_{a_t} \, \mathrm{cov}(\bar{e}_{t,T} - \bar{e}_{t,C} \mid x).$ (32)

Moreover, the differences in residual means will remain orthogonal across tiers.

For each covariate x2j in tier 2, by (13), (32), (28), and (31), we have

$\mathrm{var}(\bar{X}_{2j,T} - \bar{X}_{2j,C} \mid x, M_1 \le a_1, M_2 \le a_2)$
$= \mathrm{var}\left[ \beta'(\bar{X}_{1,T} - \bar{X}_{1,C}) + (\bar{e}_{2j,T} - \bar{e}_{2j,C}) \mid x, M_1 \le a_1, M_2 \le a_2 \right]$
$= \beta' \, \mathrm{cov}(\bar{X}_{1,T} - \bar{X}_{1,C} \mid x, M_1 \le a_1, M_2 \le a_2) \, \beta + \mathrm{var}(\bar{e}_{2j,T} - \bar{e}_{2j,C} \mid x, M_1 \le a_1, M_2 \le a_2)$
$= v_{a_1} \, \beta' \, \mathrm{cov}(\bar{X}_{1,T} - \bar{X}_{1,C} \mid x) \, \beta + v_{a_2} \, \mathrm{var}(\bar{e}_{2j,T} - \bar{e}_{2j,C} \mid x)$
$= \frac{n}{n_T n_C} \left( v_{a_1} \sigma^2_{x_{2j}} R^2_{1,2j} + v_{a_2} \sigma^2_{x_{2j}} (1 - R^2_{1,2j}) \right)$
$= \left( v_{a_1} R^2_{1,2j} + v_{a_2} (1 - R^2_{1,2j}) \right) \mathrm{var}(\bar{X}_{2j,T} - \bar{X}_{2j,C} \mid x),$

establishing (15).

For tiers t > 2, by expressing (10) in terms of et–1 instead of xt–1, we have

$x_{tj} = \beta_0 + \beta_x' x_{1:(t-2)} + \beta_e' e_{t-1} + e_{tj}.$ (33)

Expressing xtj as either equation (10) or (33), the residuals etj remain the same, as does the coefficient of determination. Therefore we have

$\bar{X}_{tj,T} - \bar{X}_{tj,C} = \beta_x'(\bar{X}_{1:(t-2),T} - \bar{X}_{1:(t-2),C}) + \beta_e'(\bar{e}_{t-1,T} - \bar{e}_{t-1,C}) + (\bar{e}_{tj,T} - \bar{e}_{tj,C}).$ (34)

Thus

$\mathrm{var}(\beta_e'(\bar{e}_{t-1,T} - \bar{e}_{t-1,C})) = \mathrm{var}(\bar{X}_{tj,T} - \bar{X}_{tj,C}) - \mathrm{var}(\beta_x'(\bar{X}_{1:(t-2),T} - \bar{X}_{1:(t-2),C})) - \mathrm{var}(\bar{e}_{tj,T} - \bar{e}_{tj,C})$
$= \frac{n}{n_T n_C} \left( \sigma^2_{x_{tj}} - \sigma^2_{x_{tj}} R^2_{t-2,tj} - \sigma^2_{x_{tj}} (1 - R^2_{t-1,tj}) \right)$
$= \frac{n}{n_T n_C} \, \sigma^2_{x_{tj}} \left( R^2_{t-1,tj} - R^2_{t-2,tj} \right).$ (35)

Note that we can also rewrite (33) as

$x_{tj} = \beta_0 + \beta_1' x_1 + \sum_{l=2}^{t-1} \beta_l' e_l + e_{tj}.$ (36)

The coefficients will change, but again the residuals, etj, will remain the same. Therefore for covariate xtj in tier t > 2,

$\mathrm{var}(\bar{X}_{tj,T} - \bar{X}_{tj,C} \mid x, M_t \le a_t \, \forall t)$
$= \mathrm{var}\left[ \beta_1'(\bar{X}_{1,T} - \bar{X}_{1,C}) + \sum_{l=2}^{t-1} \beta_l'(\bar{e}_{l,T} - \bar{e}_{l,C}) + (\bar{e}_{tj,T} - \bar{e}_{tj,C}) \mid x, M_t \le a_t \, \forall t \right]$
$= \beta_1' \, \mathrm{cov}(\bar{X}_{1,T} - \bar{X}_{1,C} \mid x, M_t \le a_t \, \forall t) \, \beta_1 + \sum_{l=2}^{t-1} \beta_l' \, \mathrm{cov}(\bar{e}_{l,T} - \bar{e}_{l,C} \mid x, M_t \le a_t \, \forall t) \, \beta_l + \mathrm{var}(\bar{e}_{tj,T} - \bar{e}_{tj,C} \mid x, M_t \le a_t \, \forall t)$
$= v_{a_1} \, \beta_1' \, \mathrm{cov}(\bar{X}_{1,T} - \bar{X}_{1,C} \mid x) \, \beta_1 + \sum_{l=2}^{t-1} v_{a_l} \, \beta_l' \, \mathrm{cov}(\bar{e}_{l,T} - \bar{e}_{l,C} \mid x) \, \beta_l + v_{a_t} \, \mathrm{var}(\bar{e}_{tj,T} - \bar{e}_{tj,C} \mid x)$
$= \frac{n}{n_T n_C} \, \sigma^2_{x_{tj}} \left( v_{a_1} R^2_{1,tj} + \sum_{l=2}^{t-1} v_{a_l} (R^2_{l,tj} - R^2_{l-1,tj}) + v_{a_t} (1 - R^2_{t-1,tj}) \right)$
$= \left( v_{a_1} R^2_{1,tj} + \sum_{l=2}^{t-1} v_{a_l} (R^2_{l,tj} - R^2_{l-1,tj}) + v_{a_t} (1 - R^2_{t-1,tj}) \right) \mathrm{var}(\bar{X}_{tj,T} - \bar{X}_{tj,C} \mid x),$ (37)

where the variance decomposition in (37) follows from orthogonality. Thus, the PRIV for the difference in means for covariate $x_{tj}$ in tier t > 2 is as given in equation (16).

Proof of Theorem 4.2

Equation (18) follows directly from Theorem 2.1 of Morgan and Rubin (2012). Without loss of generality, under additivity we can write the outcomes as a linear function of the orthogonalized covariates with added noise:

$y_i(W_i) = \beta_0 + \beta' e_i + \tau W_i + \varepsilon_i,$ (38)

where $\beta_0 + \beta' e_i$ is the projection of $y_i$ onto the space spanned by $(1, e_i)$ and $\varepsilon_i$ is a residual that captures any deviations from the linear model. Separating the orthogonalized covariates by tier,

$y_i(W_i) = \beta_0 + \sum_{t=1}^T \beta_t' e_{t,i} + \tau W_i + \varepsilon_i.$ (39)

Then the estimated treatment effect can be calculated as

$\hat{\tau} = \sum_{t=1}^T \beta_t' (\bar{e}_{t,T} - \bar{e}_{t,C}) + \tau + (\bar{\varepsilon}_T - \bar{\varepsilon}_C).$ (40)

Because each of these components is orthogonal, we can compute the variance of $\hat{\tau}$ as

$\mathrm{var}(\hat{\tau} \mid x) = \sum_{t=1}^T \beta_t' \, \mathrm{cov}(\bar{e}_{t,T} - \bar{e}_{t,C} \mid x) \, \beta_t + \mathrm{var}(\bar{\varepsilon}_T - \bar{\varepsilon}_C \mid x).$ (41)

Define $\sigma_y^2 \equiv \mathrm{var}(y(1)) = \mathrm{var}(y(0))$ and $R^2_{e_t,y}$ to be the squared multiple correlation between $e_t$ and y(0) (equivalently y(1), under additivity). Rerandomization reduces each $\mathrm{cov}(\bar{e}_{t,T} - \bar{e}_{t,C} \mid x)$ by the factor $v_{a_t}$, so

$\mathrm{var}(\hat{\tau} \mid x, M_t \le a_t \, \forall t) = \sum_{t=1}^T v_{a_t} \, \beta_t' \, \mathrm{cov}(\bar{e}_{t,T} - \bar{e}_{t,C} \mid x) \, \beta_t + \mathrm{var}(\bar{\varepsilon}_T - \bar{\varepsilon}_C \mid x).$ (42)

Moreover, because the tiers are orthogonal, the squared multiple correlation between all of the covariates and y(0) is $\sum_{t=1}^T R^2_{e_t,y}$, so

$\beta_t' \, \mathrm{cov}(\bar{e}_{t,T} - \bar{e}_{t,C} \mid x) \, \beta_t = R^2_{e_t,y} \left( \sigma_y^2 \, \frac{n}{n_T n_C} \right).$ (43)

Therefore,

$\mathrm{var}(\bar{\varepsilon}_T - \bar{\varepsilon}_C \mid x) = \left( 1 - \sum_{t=1}^T R^2_{e_t,y} \right) \left( \sigma_y^2 \, \frac{n}{n_T n_C} \right).$ (44)

Substituting (43) and (44) into (42),

$\mathrm{var}(\hat{\tau} \mid x, M_t \le a_t \, \forall t) = \left( 1 - \sum_{t=1}^T (1 - v_{a_t}) R^2_{e_t,y} \right) \left( \sigma_y^2 \, \frac{n}{n_T n_C} \right).$ (45)

The PRIV is then

$\mathrm{PRIV} = \frac{\sigma_y^2 \frac{n}{n_T n_C} - \left( 1 - \sum_{t=1}^T (1 - v_{a_t}) R^2_{e_t,y} \right) \sigma_y^2 \frac{n}{n_T n_C}}{\sigma_y^2 \frac{n}{n_T n_C}}$ (46)
$= \sum_{t=1}^T (1 - v_{a_t}) R^2_{e_t,y}.$ (47)

Note that for t > 1, $R^2_{e_t,y} = R^2_{1:t,y} - R^2_{1:(t-1),y}$ because the tiers are orthogonal, completing the proof.

Footnotes

1. The second author of the current manuscript was provided the data as a discussant of Shadish et al. (2008).

Contributor Information

Kari Lock Morgan, Department of Statistics, Penn State University, University Park, PA 16802 (klm47@psu.edu).

Donald B. Rubin, Department of Statistics, Harvard University, Cambridge, MA 02138.

References

  1. Cox DR. Randomization and concomitant variables in the design of experiments. In: Statistics and Probability: Essays in Honor of C. R. Rao. 1982:197–202.
  2. Cox DR. Randomization in the design of experiments. International Statistical Review. 2009;77(3):415–429.
  3. Lock KF. Rerandomization to Improve Covariate Balance in Randomized Experiments. PhD thesis, Harvard University, Cambridge, MA; 2011.
  4. Maclure M, Nguyen A, Carney G, Dormuth C, Roelants H, Ho K, Schneeweiss S. Measuring prescribing improvements in pragmatic trials of educational tools for general practitioners. Basic & Clinical Pharmacology & Toxicology. 2006;98(3):243–252. doi: 10.1111/j.1742-7843.2006.pto_301.x.
  5. Morgan KL, Rubin DB. Rerandomization to improve covariate balance in experiments. Annals of Statistics. 2012;40(2):1263–1282. doi: 10.1214/12-AOS1008.
  6. Moulton LH. Covariate-based constrained randomization of group-randomized trials. Clinical Trials. 2004;1(3):297. doi: 10.1191/1740774504cn024oa.
  7. R Development Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria; 2011. URL http://www.R-project.org.
  8. Rubin DB. Multivariate matching methods that are equal percent bias reducing, I: Some examples. Biometrics. 1976:109–120.
  9. Rubin DB. Randomization analysis of experimental data: The Fisher randomization test comment. Journal of the American Statistical Association. 1980;75(371):591–593.
  10. Rubin DB. Comment: The design and analysis of gold standard randomized experiments. Journal of the American Statistical Association. 2008a;103(484):1350–1353.
  11. Rubin DB. For objective causal inference, design trumps analysis. Annals of Applied Statistics. 2008b;2(3):808–840.
  12. Shadish WR, Clark MH, Steiner PM. Can nonrandomized experiments yield accurate answers? A randomized experiment comparing random and nonrandom assignments. Journal of the American Statistical Association. 2008;103(484):1334–1344.
