Author manuscript; available in PMC: 2012 Jul 28.
Published in final edited form as: Biometrics. 2010 Sep;66(3):813–823. doi: 10.1111/j.1541-0420.2009.01364.x

Propensity Score Matching in Randomized Clinical Trials

Zhenzhen XU 1, John D Kalbfleisch 1
PMCID: PMC3407414  NIHMSID: NIHMS156706  PMID: 19995353

Summary

Cluster randomization trials with relatively few clusters have been widely used in recent years for the evaluation of health care strategies. On average, randomized treatment assignment achieves balance in both known and unknown confounding factors between treatment groups; however, in practice investigators can only introduce a small amount of stratification and cannot balance on all the important variables simultaneously. This limitation arises especially when there are many confounding variables and in small studies. Such is the case in the INSTINCT trial, designed to investigate the effectiveness of an education program in enhancing tPA use in stroke patients. In this paper, we introduce a new randomization design, the balance match weighted (BMW) design, which applies the optimal matching with constraints technique to a prospective randomized design and aims to minimize the mean squared error (MSE) of the treatment effect estimator. A simulation study shows that, under various confounding scenarios, the BMW design can yield substantial reductions in the MSE of the treatment effect estimator compared to a completely randomized or matched-pair design. The BMW design is also compared with a model-based approach adjusting for the estimated propensity score and with the Robins-Newey E-estimation procedure in terms of efficiency and robustness of the treatment effect estimator. These investigations suggest that the BMW design is more robust and usually, although not always, more efficient than either of these approaches. The design is also seen to be robust against heteroscedastic error. We illustrate these methods in proposing a design for the INSTINCT trial.

Keywords: Clustered randomized trial, Experimental design, Optimal full matching, Propensity score matching, Randomization study

1. Introduction and motivating example

Cluster randomized trials, in which intact social units are selected as the units of randomization, have been widely used in the past three decades for the evaluation of health care and educational strategies. On average, randomized treatment assignment avoids bias, achieves balance of both known and unknown confounding factors between intervention groups, and provides valid comparisons of competing intervention strategies. There is much literature discussing design methods for cluster randomization, such as the completely randomized design (Abdeljaber et al., 1991), the matched-pair design (COMMIT, 1995a), the stratified design (Graham et al., 1984) and the minimization design (Pocock et al., 1975). However, investigators can only introduce a small amount of stratification in practice, which does not ensure balance on all important variables, and post hoc adjustment for many confounders is also problematic. These limitations are particularly important when there are many confounding variables in a small study.

Tissue plasminogen activator (tPA) is a clot-busting drug that has been found to be effective in preventing post-stroke disability if administered within a three-hour window of the onset of an ischemic stroke (NINDS, 1995). However, the use of tPA has remained relatively low. A randomized clinical trial, the INSTINCT trial, was designed to investigate the effectiveness of an education program administered to hospital emergency departments in enhancing tPA therapy for stroke patients. Historical data were collected from 24 participating hospitals in Michigan regarding previous stroke volume and demographic variables. Hospitals were the units of randomization; those assigned to the treatment group received educational interventions designed to promote appropriate tPA use, whereas the other hospitals served as controls. The primary outcome is the frequency of appropriate tPA use in each hospital. Stroke volume at baseline (low vs. high), population density (urban vs. rural), and age and gender mix are cluster-level factors thought to be strongly associated with outcome. Among these, stroke volume, measured as the number of stroke discharges, and population density were classified as binary. The percentage of female (male) stroke patients who are older than 65 is used as a continuous measure. It is possible to create balance on stroke volume and population density through stratified randomization; however, it is not feasible to balance on all covariates at the same time. As a result, direct estimation of the treatment effect may be subject to bias due to possible imbalance on confounding factors. To resolve this problem, this paper describes and evaluates a new randomization design based on propensity score matching.

The method of propensity score matching has been widely used in observational studies to control for bias (Rosenbaum, 2002; Gu and Rosenbaum, 1993; Hansen, 2004; Ming and Rosenbaum, 2000). The propensity score is defined as the conditional probability of a subject being assigned to the treatment group given the observed covariates. Rosenbaum and Rubin (1983) showed that exact matching of treated and control subjects on the propensity score balances all the observed covariates. In non-randomized experiments, the propensity score function is unknown, but sample estimates of the propensity score can be used. In a randomized clinical trial, on the other hand, the true propensity score is a known function determined by the randomization scheme. For example, in the simplest randomized trial, subjects are assigned to treatment or control by the flip of a fair coin; the propensity score is equal to one half for all subjects, and the two treatment groups are perfectly matched on the true propensity score (Joffe, 1999). However, especially in small studies, substantial chance imbalances may still exist and yield some (conditional) bias in the direct treatment effect estimator. Although methods based on the estimated propensity score have not been widely used in randomized studies, they can have substantial advantages over methods using the true scores in certain scenarios. Robins (1992) has shown that there are even theoretical advantages to using estimated propensity scores.

We introduce a new randomization design, the balance match weighted (BMW) design, which applies the optimal full matching with constraints technique (Olsen, 1997) to a given randomization with the general aim of reducing the mean squared error of the treatment effect estimator. In this design, treated and control subjects are matched into subsets based on their estimated propensity scores, and an overall estimate is constructed as a weighted sum of the subset-specific estimates. In contrast to the existing stratified design, which first stratifies and then randomizes within strata, the BMW design first randomly assigns the units to treatments and then stratifies the randomized sample. In an implementation of the design, this randomization-stratification process is repeated M times in order to choose a randomization that gives good overall balance. In general, the BMW design has two advantages. First, it reduces the chance imbalance between the treatment groups in observed covariates through optimal matching, and hence decreases the (conditional) bias in the resulting estimator. Second, it controls the increase in variance due to matching by using the full matching with constraints technique (Olsen, 1997), in which the choice of the constraint, k, adjusts the trade-off between the potential gain in bias reduction and the possible loss in precision. We examine various strategies for selecting M and k, seeking choices that yield good results with respect to mean squared error. MSE performance also depends on the inherent degree of confounding, so we compare the BMW design with the completely randomized design and the matched-pair design under different confounding scenarios. If there is no confounding, the three design methods perform equally well. However, if there is considerable confounding, the BMW design can result in a substantial reduction in the MSE of the treatment effect estimator.

The design we propose is appropriate for the situation where all units are available for randomization at the outset; it cannot be applied to clinical trials with staggered entry. Pocock and Simon (1975) proposed a sequential strategy, the minimization design, which makes the assignment decisions one unit at a time, based on the covariate information of previously assigned subjects. Conversely, the minimization design is not well suited to trials where all observational units are available for randomization at the outset.

The rest of the paper is organized as follows. Notation and models are presented in Section 2. The BMW design is outlined in Section 3, and Section 4 gives the results of a simulation study comparing the performance of the BMW design with the completely randomized design, a matched-pair design, the model-based approach adjusting for the estimated propensity score, and the Robins-Newey E-estimation procedure. Its performance under heteroscedastic error is also investigated. Section 5 outlines a case study, and the paper concludes with discussion in Section 6.

2. Methods

In this section, we present the notation and problem formulation and introduce the optimal matching techniques employed in the proposed design.

2.1 Optimal matching

Consider a study with the aim of assessing the effect of treatment. Let N denote the number of subjects available for the study. We assume that N is even and N/2 subjects are randomized to each of the treatment and control groups, but the method we propose could allow imbalance in the randomized assignment. Thus, we suppose that a randomization process divides the N subjects into a set T of N/2 subjects to be treated and a set C of N/2 subjects to receive the control. We also assume that a vector of r covariates, X = (X1, X2, …, Xr)T, is observed for each individual.

Similarity of covariates is measured through an estimated propensity score. Writing Z=1 for the treated subjects, and Z=0 for the control subjects, the (estimated) propensity score distance between the treated unit i and control unit j is given by

$d_{i,j} = |\hat{\delta}_i - \hat{\delta}_j|$   (1)

where δ̂i is the estimate of the true propensity score, δi = Pr(Z = 1 ∣ Xi), and is obtained from a model such as the logistic regression model

$\delta_i = \Pr(Z = 1 \mid X_i; \alpha) = \exp\big(\alpha_0 + \sum_{j=1}^{r} \alpha_j X_{ij}\big) \big/ \big\{1 + \exp\big(\alpha_0 + \sum_{j=1}^{r} \alpha_j X_{ij}\big)\big\}$   (2)
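As an illustration of (1) and (2), the estimated propensity scores and the treated-by-control distance matrix might be computed as follows. This is a sketch, not the authors' code: the Newton-Raphson logistic fit and the toy data are our own, and a statistical package's logistic regression routine would normally be used instead.

```python
import numpy as np

def fit_propensity(X, z, n_iter=15):
    """Estimate delta_i = Pr(Z = 1 | X_i) by maximum-likelihood logistic
    regression (model (2)), fit with a few Newton-Raphson steps."""
    Xd = np.column_stack([np.ones(len(X)), X])        # intercept column
    alpha = np.zeros(Xd.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-Xd @ alpha))
        W = p * (1.0 - p)                             # IRLS weights
        H = Xd.T @ (Xd * W[:, None]) + 1e-8 * np.eye(Xd.shape[1])
        alpha += np.linalg.solve(H, Xd.T @ (z - p))   # Newton step
    return 1.0 / (1.0 + np.exp(-Xd @ alpha))

def distance_matrix(delta_hat, treated, controls):
    """d_ij = |delta_hat_i - delta_hat_j| for treated i, control j (equation (1))."""
    return np.abs(delta_hat[treated][:, None] - delta_hat[controls][None, :])

rng = np.random.default_rng(0)
X = rng.binomial(1, 0.5, size=(30, 4)).astype(float)  # 4 binary covariates
z = np.repeat([1, 0], 15)                             # a given randomization
delta_hat = fit_propensity(X, z)
D = distance_matrix(delta_hat, np.where(z == 1)[0], np.where(z == 0)[0])
```

The matrix `D` corresponds to the |T| × |C| distance matrix used by the matching step described next.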

In a randomized clinical trial, the true propensity score δi is typically determined by the randomization scheme and hence known. We consider the estimated propensity score δ̂i in defining the distances, with the aim of producing a design that reduces the actual observed imbalance between treated and control subjects. Matching assembles treated and control units that are as similar as possible into the same stratum using the estimated propensity score distance measure. Given T and C, we consider the collection $\mathcal{P}_{C,T}$ of all possible matchings, where a matching corresponds to a collection of S strata comprised of matched subsets {(C1, T1), (C2, T2), …, (CS, TS)}, in which C1, C2, …, CS is a partition of C, T1, T2, …, TS is a partition of T, and 1 ≤ S ≤ N/2. As is often done (e.g. Rosenbaum, 1991), we measure the quality of a particular matching as

$\Delta = \sum_{s=1}^{S} w(|T_s|, |C_s|)\, \overline{T_s \times C_s}$   (3)

where

$\overline{T_s \times C_s} = \sum_{(i,j) \in T_s \times C_s} |\hat{\delta}_i - \hat{\delta}_j| \,\big/\, |T_s \times C_s|$

is the average distance over the $|T_s \times C_s|$ possible pairs in the s-th stratum, and w(·, ·) is a weight function. Thus, Δ is a weighted sum of average distances, and an optimal matching minimizes Δ over $\mathcal{P}_{C,T}$.
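For a full matching, the criterion reduces to the total of all within-stratum treated-control distances (equation (4) below). A minimal sketch, with made-up scores and strata of our own invention:

```python
def total_distance(strata, delta_hat):
    """Total distance of equation (4): the sum over strata of all pairwise
    |delta_i - delta_j| between treated and control units. For a full
    matching this equals the weighted criterion (3) with
    w(|Ts|, |Cs|) = |Ts| + |Cs| - 1."""
    return sum(abs(delta_hat[i] - delta_hat[j])
               for treated, controls in strata
               for i in treated
               for j in controls)

# toy example: two strata, each one treated unit matched to one control
delta_hat = [0.50, 0.60, 0.40, 0.55]
strata = [([0], [2]), ([1], [3])]
delta = total_distance(strata, delta_hat)   # 0.10 + 0.05 = 0.15
```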

A full matching is one in which each stratum is comprised of one treated (or control) subject matched to one or more control (or treated) subjects, so that min(|Ts|, |Cs|) = 1 for s = 1, 2, …, S. Rosenbaum (1991, Lemma 2) showed that if the weight function in (3) is neutral or favors small subclasses, then there is always a full matching that is optimal. Among the class of full matchings, with the weight function w(|Ts|, |Cs|) = |Ts| + |Cs| − 1, equation (3) reduces to

$\Delta = \sum_{s=1}^{S} (|T_s| + |C_s| - 1)\, \overline{T_s \times C_s} = \sum_{s=1}^{S} \sum_{(i,j) \in T_s \times C_s} |\hat{\delta}_i - \hat{\delta}_j|.$   (4)

In this paper, we use this total distance measure to evaluate the quality of a matching. One potential drawback of optimal full matching is that some of its matched subsets can be very unbalanced, with many controls to one treated unit or vice versa. The imbalance among full matching subsets decreases the precision of the estimated treatment effect. One remedy is to constrain the full matching so that the ratio of the number of treated to the number of control units in each stratum is between a lower and an upper bound. To accomplish this, we choose an integer k ∈ {1, 2, …, N/2 − 1} and consider the optimization problem

Minimize $\Delta = \sum_{s=1}^{S} \sum_{(i,j) \in T_s \times C_s} |\hat{\delta}_i - \hat{\delta}_j|$   (5)

over the class of full matchings subject to $k^{-1} \le |T_s|/|C_s| \le k$. We refer to the solution of this optimization problem as the optimal full matching with constraint k. When k = 1, we obtain the best matched-pair design, with one treated unit and one control unit in each stratum. This assignment leads to a treatment effect estimator with minimum variance in the linear model discussed in the next section, but can result in relatively large bias. When k = N/2 − 1, there is no constraint on the relative numbers of treated and control units in any matched subset; the covariates are optimally balanced, so the bias of the treatment effect estimator tends, in this case, to be small, but the variance is larger. The BMW design we propose searches for the optimal full matching with constraint k. The choice of k represents a trade-off between bias and variance. In the next section, we examine the mean squared error (MSE) as a measure of this trade-off within a class of linear models. For a specific model in this class, we can choose k to generate a BMW design that achieves minimum MSE. It is observed that the choice of k does not depend much on the specific model.
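For k = 1 the optimization in (5) is a classical assignment problem. The following brute-force sketch, with hypothetical scores of our own, is feasible only for tiny examples; the general constrained full matching is solved with dedicated software (see Section 3).

```python
import itertools

def best_pair_matching(delta_t, delta_c):
    """Optimal full matching with constraint k = 1 (matched pairs):
    enumerate all pairings of treated with control units and keep the
    one minimizing the total propensity distance of equation (5)."""
    best_cost, best_perm = float("inf"), None
    for perm in itertools.permutations(range(len(delta_c))):
        cost = sum(abs(delta_t[i] - delta_c[j]) for i, j in enumerate(perm))
        if cost < best_cost:
            best_cost, best_perm = cost, perm
    return best_cost, best_perm

delta_t = [0.30, 0.45, 0.60, 0.75]   # estimated scores, treated units
delta_c = [0.32, 0.50, 0.58, 0.90]   # estimated scores, control units
cost, pairing = best_pair_matching(delta_t, delta_c)
# pairing in sorted-score order is optimal here: total distance 0.24
```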

2.2 Model

To appreciate the effect of treatment on response in the pooled and matched samples, respectively, consider the following model. Let Yi, i = 1, 2, …, N, represent the response of unit i; conditional on a given treatment assignment (T, C) and X,

$Y_i = \alpha + \beta\, I(i \in T) + \sum_{j=1}^{r} \gamma_j X_{ij} + \varepsilon_i;$   (6)

where I(·) is the indicator function, β denotes the true treatment effect, γ1, γ2, …, γr are the confounding effects, and ε = (ε1, ε2, …, εN) is the vector of measurement errors with E[ε | T, C, X] = 0 and Var[ε | T, C, X] = σ²I, where σ² < ∞ and I is the N × N identity matrix.

2.2.1 Pooled sample

Under model (6), the common treatment effect estimator based on the unstratified pooled sample is $\hat{\beta}_{pool} = \bar{y}_T - \bar{y}_C$, which has conditional expectation

$E[\hat{\beta}_{pool} \mid T, C, X] = \beta + \sum_{j=1}^{r} \gamma_j (\bar{X}_{jT} - \bar{X}_{jC})$   (7)

where the subscripts C and T mean that the averages are computed over the control and treatment groups, respectively. The mean squared error (conditional on T, C and X) is

$\mathrm{MSE}(\hat{\beta}_{pool} \mid T, C, X) = \big\{\sum_{j=1}^{r} \gamma_j (\bar{X}_{jT} - \bar{X}_{jC})\big\}^2 + 2\sigma^2/N$   (8)

2.2.2 Matched sample

Under model (6), estimating the treatment effect for the matched sample involves the computation of a weighted sum. In the s-th matched subset $(T_s, C_s)$, the treatment effect estimator is $\hat{\beta}_{strata,s} = \bar{y}_{T_s} - \bar{y}_{C_s}$, which has conditional expectation

$E[\hat{\beta}_{strata,s} \mid T, C, X] = \beta + \sum_{j=1}^{r} \gamma_j (\bar{X}_{jT_s} - \bar{X}_{jC_s})$   (9)

The overall estimate can be constructed using a weighted sum,

$\hat{\beta}_{strata} = \sum_{s=1}^{S} w_s \hat{\beta}_{strata,s}$   (10)

where $\sum_s w_s = 1$ and $w_s \ge 0$. It should be noted that this stratified estimator can accommodate different weighting methods. Two common choices are weighting in proportion to the number of subjects in each subset, $(|T_s| + |C_s|)/N$ (Cochran, 1968), or inverse variance weighting, $(1/|T_s| + 1/|C_s|)^{-1} \big/ \sum_{t=1}^{S} (1/|T_t| + 1/|C_t|)^{-1}$. For the purposes of this discussion, the former weighting method is considered, but the development is easily modified to handle the latter. It follows that the MSE of the stratified estimator (conditional on T, C and X) can be written as

$\mathrm{MSE}(\hat{\beta}_{strata} \mid T, C, X) = \big\{\sum_{s=1}^{S} \frac{|T_s| + |C_s|}{N} \sum_{j=1}^{r} \gamma_j (\bar{X}_{jT_s} - \bar{X}_{jC_s})\big\}^2 + \sum_{s=1}^{S} \frac{(|T_s| + |C_s|)^2}{N^2} \big(\frac{1}{|T_s|} + \frac{1}{|C_s|}\big) \sigma^2$   (11)

With no confounding effects or chance imbalance in covariates, the pooled estimator is an unbiased estimate of the treatment effect with minimum variance. In the presence of confounding, stratification reduces the bias but increases the variance. We use the mean squared error to measure the trade-off between bias and variance.
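The stratified estimator (10) with Cochran's size-proportional weights can be computed directly. A small sketch with invented strata and responses (our own, for illustration only):

```python
import numpy as np

def stratified_estimate(strata, y):
    """Equation (10): beta_hat_strata = sum_s w_s (ybar_Ts - ybar_Cs),
    with the size-proportional weights w_s = (|Ts| + |Cs|)/N (Cochran, 1968)."""
    N = sum(len(t) + len(c) for t, c in strata)
    return sum((len(t) + len(c)) / N * (np.mean(y[t]) - np.mean(y[c]))
               for t, c in strata)

y = np.array([3.0, 4.0, 1.0, 2.0, 5.0, 2.5])
strata = [([0, 1], [2]),    # two treated units matched to one control
          ([4], [3, 5])]    # one treated unit matched to two controls
beta_hat = stratified_estimate(strata, y)   # 0.5*2.5 + 0.5*2.75 = 2.625
```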

3. The BMW design

In a randomized trial with fixed small sample size N and many confounding covariates, it may be impossible to produce balance on all of the variables simultaneously. In order to reduce the actual observed imbalance as well as increase precision of the estimator, we propose the balance match weighted (BMW) design. The design with specified parameter k is defined algorithmically as follows:

  • Step 1. Randomize half of the subjects to the treatment group, and half to control to obtain sets T and C;

  • Step 2. Compute the estimated propensity scores and create the |T| × |C| matrix of estimated propensity score distances;

  • Step 3. Obtain the optimal full matching with constraint k and record the total distance Δk.

  • Step 4. Repeat Steps 1 to 3 M times and choose the randomized sample with minimum total distance, $\Delta^{k} = \min(\Delta^{k}_1, \Delta^{k}_2, \ldots, \Delta^{k}_M)$. The choice of M is discussed below.

It is clear that the choice of k represents a trade-off between bias and variance. We use the mean squared error (MSE) as a measure of this trade-off. The choice of k (k ∈ {1, 2, …, N/2 − 1}) that minimizes the MSE of the treatment effect estimator depends on the confounding effects γ = (γ1, γ2, …, γr). If γ were known and M were fixed, it would be possible to compute the MSE for each k based on the BMW design in Steps 1 to 4 above, and then to select the k that minimizes the mean squared error. In practice, of course, the true value of γ is unknown; therefore, in the next section we use a simulation study to evaluate the effect of k on reducing the MSE under a variety of assumptions about the size of the confounding effects. We find that k = 2 is a suitable choice under most of the confounding scenarios considered.

Clearly, the larger M is, the better the matching that the BMW design attains. Therefore, M = ∞ is the best choice in principle but is infeasible in practice. In the next section, we examine how the MSE depends on M and find that most of the gain is attained by relatively small M, of 10 or so, in the cases considered.
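Steps 1 to 4 can be sketched end to end. To stay self-contained, this illustration (ours, not the authors') uses a single binary covariate, for which the maximum-likelihood propensity estimate is simply the treated fraction at each covariate level, and k = 1 matching by sorting on the score; the general design would use model (2) and constrained full matching.

```python
import random

def bmw_design(X, M, rng):
    """BMW design, Steps 1-4 (k = 1 sketch): repeat the randomization M
    times and keep the one whose optimal matching has minimal Delta."""
    N = len(X)
    best = None
    for _ in range(M):
        # Step 1: randomize N/2 units to treatment and N/2 to control
        units = list(range(N))
        rng.shuffle(units)
        T, C = set(units[:N // 2]), set(units[N // 2:])
        # Step 2: estimated propensity scores; for one binary covariate the
        # ML estimate is the treated fraction within each covariate level
        delta_hat = {}
        for level in set(X):
            idx = [i for i in range(N) if X[i] == level]
            p = sum(i in T for i in idx) / len(idx)
            delta_hat.update((i, p) for i in idx)
        # Step 3: optimal matched pairs on a scalar score: sort both groups
        # by delta_hat and pair in order; record the total distance Delta
        Ts = sorted(T, key=delta_hat.get)
        Cs = sorted(C, key=delta_hat.get)
        dist = sum(abs(delta_hat[i] - delta_hat[j]) for i, j in zip(Ts, Cs))
        # Step 4: keep the randomization with minimal total distance
        if best is None or dist < best[0]:
            best = (dist, sorted(T), sorted(C))
    return best

rng = random.Random(1)
X = [0] * 10 + [1] * 10                  # one binary covariate, N = 20
delta, T, C = bmw_design(X, M=10, rng=rng)
```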

The implementation of Step 3, which searches for the optimal full matching with constraint k (Olsen, 1997), is conducted using the OPTMODEL procedure in SAS (Version 9.1.3.2). A similar program, optmatch in R, has also been developed (Hansen, 2004).

There are alternative ways to adjust for covariate imbalance resulting from randomization. Since small sample sizes do not allow control of all variables by model-based methods, one possible approach, suggested by an Associate Editor, is to adjust for the estimated propensity score in a regression model such as

$Y_i = \alpha + \beta\, I(i \in T) + \gamma \hat{\delta}_i + \varepsilon_i.$   (12)

Let β̂MB denote the ordinary least squares estimate of β from (12). Our simulations and investigations suggest that the model-based approach works well if the model for the propensity score is appropriately specified, where, by appropriately specified, we mean that the regression model for the propensity score includes the same regressors and is of the same form as the true model for the outcome variable Y. For example, if the true model is $Y_i = \alpha + \beta I(i \in T) + \gamma_1 X_i + \gamma_2 X_i^2 + \varepsilon_i$ and we specify $\mathrm{logit}(\delta_i) = \mathrm{logit}\{\Pr(Z = 1 \mid X_i; \alpha)\} = \alpha_1 + \alpha_2 X_i + \alpha_3 X_i^2$, then regression adjustment using δ̂i will tend to work well. In fact, if the confounding effects are large, β̂MB tends to be somewhat more efficient than the estimator obtained from the BMW approach. On the other hand, the BMW approach is more robust if the propensity score model is inappropriately specified as, for example, when the same true model for Y holds but we specify $\mathrm{logit}(\delta_i) = \mathrm{logit}\{\Pr(Z = 1 \mid X_i; \alpha)\} = \alpha_1 + \alpha_2 X_i$. This is examined further in the simulations of Section 4.
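The model-based comparator (12) is ordinary least squares of Y on the treatment indicator and δ̂. A sketch with noise-free toy data of our own, so that the fit recovers β exactly:

```python
import numpy as np

def beta_mb(y, z, delta_hat):
    """beta_hat_MB: the OLS coefficient of the treatment indicator in
    model (12), Y = alpha + beta*Z + gamma*delta_hat + eps."""
    design = np.column_stack([np.ones_like(y), z, delta_hat])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[1]

z = np.array([1.0, 1.0, 0.0, 0.0, 1.0, 0.0])
d = np.array([0.6, 0.5, 0.4, 0.5, 0.7, 0.3])    # hypothetical delta_hat
y = 1.0 + 0.7 * z + 2.0 * d                      # exact model (12), no error
beta = beta_mb(y, z, d)                          # recovers beta = 0.7
```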

Robins and Newey (1992) proposed another procedure based on the propensity score in observational studies. Their approach is designed to provide a consistent estimator, β̃E, when the model for the propensity score δi is correctly specified. This estimator is

$\tilde{\beta}_E = \sum_{i=1}^{n} Y_i (Z_i - \hat{\delta}_i) \Big/ \sum_{i=1}^{n} Z_i (Z_i - \hat{\delta}_i).$   (13)

At the suggestion of a reviewer, we also evaluate this approach in the simulations of the next section.
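Estimator (13) is a one-line ratio. A sketch with a constant score δ̂ = 1/2 (the true score under fair-coin randomization) and a pure treatment effect of our own invention, so that the ratio returns β:

```python
def beta_e(y, z, delta_hat):
    """Robins-Newey E-estimator of equation (13):
    sum_i Y_i (Z_i - delta_i) / sum_i Z_i (Z_i - delta_i)."""
    num = sum(yi * (zi - di) for yi, zi, di in zip(y, z, delta_hat))
    den = sum(zi * (zi - di) for yi, zi, di in zip(y, z, delta_hat))
    return num / den

y = [0.7, 0.0, 0.7, 0.0]         # Y = 0.7 * Z, no covariate effects
z = [1, 0, 1, 0]
delta_hat = [0.5] * 4            # true score under fair-coin randomization
beta_tilde = beta_e(y, z, delta_hat)   # recovers beta = 0.7
```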

4. Simulation Study

In order to assess the performance of the BMW design, we first carried out a simulation study to compare it with a completely randomized design and a matched-pair design. In doing so, we considered a wide variety of settings and, for each setting, estimated the mean squared error based on 1000 replications.

4.1 Structure of the simulation

For each of N subjects, we generated a set of r covariates X1, X2, …, Xr, where the covariates were drawn independently from various distributions as described below. Given a randomization of subjects to the two treatment groups, the responses were generated conditional on the treatment assignment (Zi = 0 or 1) and the covariates (Xij), where Pr(Zi = 1 ∣ Xij) = 0.5. Specifically, the response was obtained from:

$Y_i = \beta Z_i + \sum_{j=1}^{r} \gamma_j X_{ij} + \varepsilon_i$   (14)

where εi ~ i.i.d. N(0, 1) and i = 1, 2, …, N. In the simulations, we considered the following:

  • The true treatment effect was taken to be β = 0.7

  • The true confounding effects were γj = γ, j = 1, …, r, where γ = 0.5, 1.0, 1.5. Note that the results we obtain do not depend on the choice of β. When the covariates follow symmetric distributions, the results do not depend on the signs of the components of γ either.

  • For the first three settings, we considered r = 4 covariates selected from the following distributions: (i) X1, X2, X3, X4 ~ i.i.d. Bernoulli(0.5); (ii) X1, X2 ~ i.i.d. Bernoulli(0.5), X3, X4 ~ i.i.d. N(0, 0.25); (iii) X1, X2 ~ i.i.d. Bernoulli(0.5), X3, X4 ~ i.i.d. Bernoulli(0.66).

  • For the fourth case, we considered r = 8 covariates: X1, …, X8 ~ i.i.d. Bernoulli(0.5).

  • We considered sample sizes N = 30 and 60.
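The data-generating step of the simulation (model (14)) can be sketched as follows; the seed and the common confounding effect γj = γ are illustrative choices of our own.

```python
import random

def simulate_responses(Z, X, beta=0.7, gamma=0.5, rng=None):
    """Draw Y_i = beta*Z_i + sum_j gamma_j*X_ij + eps_i with eps_i ~ N(0, 1)
    (model (14)), here with a common confounding effect gamma_j = gamma."""
    rng = rng or random.Random()
    return [beta * z + gamma * sum(xrow) + rng.gauss(0.0, 1.0)
            for z, xrow in zip(Z, X)]

rng = random.Random(42)
N, r = 30, 4
X = [[rng.randint(0, 1) for _ in range(r)] for _ in range(N)]   # Bernoulli(0.5)
Z = [1] * (N // 2) + [0] * (N // 2)     # a balanced treatment assignment
Y = simulate_responses(Z, X, rng=rng)
```

Repeating this draw 1000 times per setting and averaging the conditional MSEs (8) and (11) reproduces the structure of the study, although the exact tabulated numbers depend on the authors' implementation.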

The completely randomized design assigns half of the units at random to each of the two treatment groups. For this design, the treatment effect estimator is $\hat{\beta}_{pool} = \bar{y}_T - \bar{y}_C$ and the corresponding mean squared error (conditional on T, C and X) is given in (8). We also consider a matched-pair design in which each unit is matched (as far as possible) to another unit based on the first covariate X1. One unit in each pair is then randomly assigned to treatment and one to control. The BMW design, as described in the preceding section, creates an optimally matched sample for each constraint k, where k = 1, 2, …, N/2 − 1, and for each choice of M; this leads to the weighted treatment effect estimator β̂strata in (10) along with its mean squared error (11). We further consider β̂MB, from the model-based approach adjusting for the estimated propensity score (12), and the Robins-Newey E-estimator β̃E (13). Finally, we examine the possible effects of heteroscedastic error on the BMW design by allowing the error variance to depend on the first covariate X1.

4.2 Results

The average mean squared errors based on 1000 replications are summarized in Table 1. From Cochran (1968), the true unconditional MSE of $\hat{\beta}_{pool}$ is $2\sigma_y^2/N$, where $\sigma_y^2$ denotes the overall variability in the outcome Y. One part of this, $\sum_j \gamma_j^2\, \mathrm{Var}(\bar{X}_{jT} - \bar{X}_{jC})$, is due to variability in the observed covariates X1, X2, …, Xr, and the other part to the conditional variation of Y given X1, X2, …, Xr. Formally, the unconditional MSE is (from (8))

$\mathrm{MSE}(\hat{\beta}_{pool}) = E\Big[\big\{\sum_{j=1}^{r} \gamma_j (\bar{X}_{jT} - \bar{X}_{jC})\big\}^2 + \frac{2}{N}\sigma^2\Big] = \sum_{j=1}^{r} \gamma_j^2\, \mathrm{Var}(\bar{X}_{jT} - \bar{X}_{jC}) + \frac{2}{N}\sigma^2 = \frac{2\sigma_y^2}{N}$   (15)

With pre-randomization matching or post-randomization stratification on covariates, the average MSE values are also obtained in the simulation. A formula similar to (15) can be obtained for the matched-pair design, but formulas for the BMW design are complicated. For the BMW design, the average MSE for each constraint k ∈ {1, 2, …, N/2 − 1} was examined in the simulations, but only those for k = 1, 2, 3 are displayed, since the MSE changes little as k increases beyond three. The percent reduction in MSE is $100 \times (\mathrm{MSE} - \mathrm{MSE}_{BMW})/\mathrm{MSE}$, where $\mathrm{MSE}_{BMW}$ corresponds to the minimal value of MSE for each k in the BMW design, and MSE refers to the value for the design to which BMW is being compared (e.g. the completely randomized or matched-pair design).

Table 1.

Percent reductions in the MSE of the treatment effect estimator for the BMW design compared to a completely randomized design (CR) and a matched-pair design (MP). Sample size N = 30 subjects. Number of replications = 1000.

γ | $\sum_j \gamma_j$ | M | MSE (CR) | Percent reduction (%), BMW vs. CR: k = 1, k = 2, k = 3 | MSE (MP) | Percent reduction (%), BMW vs. MP: k = 1, k = 2, k = 3
X1, X2, X3, X4 ~ i.i.d. Bernoulli(0.5)
(0.5,0.5,0.5,0.5) 2 5 0.166 12.21 10.30 6.87 0.158 7.96 5.96 2.37
10 14.43 11.77 7.14 10.29 7.50 2.64
20 17.45 13.54 8.81 13.46 9.36 4.40
(1.0,1.0,1.0,1.0) 4 5 0.280 35.61 43.58 39.67 0.239 24.57 33.90 29.33
10 40.37 44.45 41.74 30.15 34.92 31.75
20 50.39 48.66 46.21 41.87 39.86 36.99
(1.5,1.5,1.5,1.5) 6 5 0.450 45.39 61.58 57.94 0.374 34.29 53.77 49.39
10 52.19 62.26 59.02 42.47 54.59 50.69
20 58.43 63.52 60.64 49.97 56.10 52.64
X1, X2 ~ i.i.d. Bernoulli(0.5); X3, X4 ~ i.i.d. N(0, 0.25)
(0.5,0.5,0.5,0.5) 2 5 0.155 8.77 5.67 1.09 0.148 4.45 1.21 -3.59
10 9.46 5.85 1.66 5.17 1.40 -2.99
20 12.17 7.74 3.52 8.01 3.38 -1.05
(1.0,1.0,1.0,1.0) 4 5 0.218 24.37 30.79 27.29 0.190 13.20 20.58 16.56
10 28.89 32.40 29.18 18.39 22.42 18.73
20 32.85 33.09 30.13 22.94 23.22 19.82
(1.5,1.5,1.5,1.5) 6 5 0.316 35.91 50.61 47.45 0.252 19.56 38.01 34.04
10 42.98 52.08 48.45 28.43 39.85 35.29
20 48.35 51.58 48.30 35.17 39.22 35.10
X1, X2 ~ i.i.d. Bernoulli(0.5); X3, X4 ~ i.i.d. Bernoulli(0.66)
(0.5,0.5,0.5,0.5) 2 5 0.165 12.11 12.08 7.31 0.158 8.03 8.00 3.01
10 14.93 12.99 8.78 10.98 8.96 4.55
20 16.13 12.69 8.72 12.24 8.64 4.48
(1.0,1.0,1.0,1.0) 4 5 0.267 32.21 40.76 36.77 0.229 20.97 30.94 26.29
10 37.92 43.13 39.39 27.63 33.71 29.34
20 41.88 44.14 41.22 32.25 34.88 31.48
(1.5,1.5,1.5,1.5) 6 5 0.430 50.98 61.68 59.36 0.352 40.15 53.20 50.37
10 50.63 59.12 55.57 42.75 52.60 48.48
20 55.05 59.33 56.08 47.87 52.84 49.07
X1, X2, X3, X4, X5, X6, X7, X8 ~ i.i.d. Bernoulli(0.5)
(0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5) 4 5 0.204 17.35 23.93 18.68 0.187 10.06 17.21 11.49
10 18.63 24.30 19.63 11.44 17.62 12.53
20 22.65 25.22 19.42 15.82 18.62 12.30
(1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0) 8 5 0.390 28.74 52.41 52.21 0.363 23.39 48.83 48.62
10 35.80 56.12 53.11 30.97 52.82 49.58
20 43.23 57.60 54.22 38.96 54.41 50.78
(1.5,1.5,1.5,1.5,1.5,1.5,1.5,1.5) 12 5 0.725 35.07 66.86 68.47 0.664 29.12 63.83 65.58
10 46.71 71.55 69.76 41.83 68.94 66.99
20 52.71 73.14 70.29 48.38 70.68 67.57

It is interesting to examine how the MSE of the treatment effect estimator is affected by the various parameter settings. Overall, the BMW design shows substantial reductions in MSE compared to both the completely randomized and matched-pair designs.

4.2.1 Confounding effects γj

Table 1 reveals that as the confounding effects, measured by $\sum_j \gamma_j$, increase, the average mean squared errors generally increase. However, the MSE in the BMW design increases much more slowly than the MSE in the completely randomized or matched-pair design. This suggests that the BMW design becomes much more effective in reducing the mean squared error as confounding effects increase. Specifically, as we raise $\sum_{j=1}^{r} \gamma_j$ from 2.0 to 6.0 for Bernoulli distributed covariates (Table 1), the MSE reduction of the BMW design with k = 2 compared to the matched-pair design varies dramatically from 5.96% to 53.77% for M = 5, from 7.50% to 54.59% for M = 10, and from 9.36% to 56.10% for M = 20. An even larger reduction in MSE arises when comparing the BMW design with the completely randomized design.

4.2.2 The Choice of the Constraint k

We now examine the MSE as a function of k. When the model contains four covariates of various forms (Table 1) and there is relatively little confounding, such as $\sum_{j=1}^{r} \gamma_j = 2.0$, the MSEs corresponding to k = 1 are slightly smaller than those corresponding to k = 2. As $\sum_{j=1}^{r} \gamma_j$ increases, however, a greater reduction in MSE under the constraint k = 2 becomes apparent. Intuitively, for a small sample with strong confounding effects, bias reduction is more important than variance reduction, so the larger value of k (k = 2) is more efficient. However, when the number of covariates is r = 8, the constraint k = 2 minimizes the MSE for all confounding effects considered.

4.2.3 Number of Replications M

The MSE is obviously a decreasing function of M for given γ and k, and M = ∞ is theoretically the best choice but practically impossible. We therefore use the simulation study to search for a practical choice of M that attains most of the gain available as M → ∞. The results show that if there is little confounding ($\sum_{j=1}^{r} \gamma_j = 2.0$) and the covariates are independently Bernoulli distributed (Table 1), the percent reduction in MSE of the BMW design versus the matched-pair design increases from 7.96% to 10.29% to 13.46% as M goes from 5 to 10 to 20, with k = 1. If there is relatively more confounding ($\sum_{j=1}^{r} \gamma_j = 6.0$), the percent reduction in MSE increases more modestly with M, from 53.77% to 54.59% to 56.10%, using matching with constraint k = 2. Similar trends are seen in comparing the BMW design with the completely randomized design or using different covariate distributions. We conclude that, when confounding effects are relatively strong, the BMW design is very effective in reducing MSE even with relatively small M. A good compromise is M = 10 for the cases considered. One could, however, argue for much larger M in any application, since the additional computation is not great.

4.2.4 Covariate Settings

There are four covariate settings examined in the simulation studies. The results suggest that, in situations where existing designs often fail to produce balance across covariates, the BMW design provides a useful approach. Gains in efficiency are substantial when the covariates are Bernoulli variables, with important, though somewhat more modest, gains when the covariates include continuous variables. For given γ, the gains due to the BMW design are similar for symmetric and asymmetric Bernoulli distributions of the covariates. Finally, when the number of Bernoulli covariates increases from four to eight, the BMW design achieves a larger reduction in MSE.

4.2.5 Sample Size N

Sample size has an impact on the performance of the BMW design, and as sample size becomes very large, we would expect the relative gains to decrease as randomization itself guarantees substantial balance among the covariate values. Our simulation results reveal, however, that when the sample size increases from 30 to 60, the percent reduction in MSE from the BMW design decreases only very little. This suggests a possible value for this approach even in larger studies. Computational aspects are easily accommodated for the larger sample sizes; for example, the processing time for the simulations with N = 60 increases by about 40% over those for N = 30.

It is also of interest to compare the BMW design with the model-based approach adjusting for the estimated propensity score and with the Robins-Newey E-estimation procedure in terms of efficiency and robustness of the treatment effect estimator. We therefore evaluate the MSE properties of the three approaches under two scenarios, one where the propensity score model is appropriately specified and one where it is not.

4.2.6 Propensity score appropriately specified

Under this scenario, we specify the true model and propensity score model as follows:

Y_i = α + β I(i ∈ T) + Σ_{j=1}^{4} γ_j X_{j,i} + ε_i.  (16)
logit(δ_i) = logit{Pr(Z = 1 | X_i; α)} = α_0 + Σ_{j=1}^{4} α_j X_{j,i}.  (17)

From the results summarized in Table 2, we see that the MSE obtained by the model-based approach remains relatively constant as the confounding effects increase, provided the terms in the propensity score model mimic those in the true model for Y. If there is relatively little confounding (Σ_{j=1}^{r} γ_j < 6.0), the MSEs under the BMW design are slightly smaller than those from the model-based approach. As Σ_{j=1}^{r} γ_j increases, however, a somewhat greater reduction in MSE is obtained by the model-based approach. Both the BMW design and the model-based estimate perform much better than the E-estimation procedure in the context of these small randomized experiments.

Table 2.

Percent reductions in the MSE of the treatment effect estimator for the BMW design compared to the model-based approach adjusting for the estimated propensity score (MB) and to the E-estimation procedure (E-est), where the propensity score model is appropriately and inappropriately specified, respectively. Number of replications = 1000.

γ M MSE (MB) MSE Percent Reduction(%) (BMW vs. MB)
MSE (E – est) MSE Percent Reduction(%) (BMW vs. E – est)
k = 1 k = 2 k = 3 k = 1 k = 2 k = 3
Propensity score inappropriately specified: models (18) and (19)
X_i ~ i.i.d. Normal(0, 0.25)
(0.5, 0.5) 10 0.185 0.65 14.75 12.25 0.334 45.06 52.85 51.47
(1.0, 1.0) 10 0.365 -0.15 30.03 32.31 0.964 62.10 73.52 74.39
(1.5, 1.5) 10 0.665 5.80 41.88 46.12 2.013 68.90 80.81 82.21
Propensity score appropriately specified: models (16) and (17)
X1, X2, X3, X4 ~ i.i.d. Bernoulli(0.5)
(0.5,0.5,0.5,0.5) 10 0.165 15.01 15.74 6.79 0.211 33.41 33.98 26.97
(1.0,1.0,1.0,1.0) 10 0.166 -0.87 6.02 1.44 0.528 68.38 70.54 69.10
(1.5,1.5,1.5,1.5) 10 0.166 -29.84 -2.49 -11.31 0.971 77.85 82.52 81.01
X1, X2 ~ i.i.d. Bernoulli(0.5); X3, X4 ~ i.i.d. Bernoulli(0.66)
(0.5,0.5,0.5,0.5) 10 0.152 7.19 5.08 0.48 0.247 42.97 41.68 38.85
(1.0,1.0,1.0,1.0) 10 0.152 -8.99 0.16 -6.41 0.492 66.32 69.15 67.12
(1.5,1.5,1.5,1.5) 10 0.153 -32.00 -9.29 -18.78 0.916 77.99 81.78 80.19
X1, X2 ~ i.i.d. Bernoulli(0.5); X3, X4 ~ i.i.d. N(0, 0.25)
(0.5,0.5,0.5,0.5) 10 0.148 5.41 1.64 -2.74 0.203 30.71 27.95 24.74
(1.0,1.0,1.0,1.0) 10 0.148 -4.52 0.64 -4.09 0.387 59.89 61.88 60.06
(1.5,1.5,1.5,1.5) 10 0.148 -21.56 -2.15 -9.91 0.689 73.82 78.00 76.33

4.2.7 Propensity score inappropriately specified

In practice, the true model for the outcome Y is unknown and, with small sample sizes, it is difficult to determine which model is best; consequently, adjustment for many potential confounders may not work well. The simulation studies in Table 2 suggest that when the propensity score model does not mimic the correct regression terms in the true model, the BMW design provides a more robust approach than the model-based approach. For illustration purposes, we considered the following true model and propensity score model:

Y_i = α + β I(i ∈ T) + γ_1 X_i + γ_2 X_i^2 + ε_i.  (18)
logit(δ_i) = logit{Pr(Z = 1 | X_i; α)} = α_1 + α_2 X_i,  (19)

where X_i ~ i.i.d. Normal(0, 1). As the confounding effects γ_j increase from 0.5 to 1.5, the percent reduction in MSE of the BMW design compared to the model-based approach increases from 14.75% to 41.88% for M = 10. Again, the E-estimation procedure does not perform well in this context. This suggests that the BMW design is more robust than the model-based approach when the propensity score model is inappropriately specified, as would often be the situation in practice.

4.2.8 Heteroscedastic errors

In cluster randomized trials with few but relatively large clusters, the homoscedastic error assumption is unlikely to hold. To investigate the effects of this, we allowed the error distribution of the outcome to vary by the first covariate X1 in our simulation studies. In particular, in model (6), we specified ε_i ~ i.i.d. N(0, 1) if X1 = 1 and ε_i ~ i.i.d. N(0, 0.25) if X1 = 0, where X1, X2 ~ i.i.d. Bernoulli(0.5) and X3, X4 ~ i.i.d. N(0, 0.25). The results in Table 3 suggest that relaxing the homoscedasticity assumption has little impact on the performance of the BMW design. Such heteroscedasticity may well be present in the case study of the next section.
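The heteroscedastic error specification above is simple to generate. The sketch below is illustrative only: the function name and default parameter values are hypothetical, and it draws a single outcome under a linear model whose error standard deviation is keyed to X1 exactly as in the simulation setting.

```python
import random

def simulate_outcome(x, beta=1.0, gammas=(0.5, 0.5, 0.5, 0.5),
                     treated=False, rng=random):
    """Draw one outcome under the heteroscedastic error model:
    sd(eps) = 1 if X1 == 1, else 0.5 (variance 0.25).
    Names and defaults are illustrative, not from the paper's code."""
    sd = 1.0 if x[0] == 1 else 0.5          # Var = 1 vs 0.25, keyed to X1
    mean = (beta if treated else 0.0) + sum(g * xj for g, xj in zip(gammas, x))
    return mean + rng.gauss(0.0, sd)
```

Replicating such draws over many randomized samples is all the simulation in Table 3 requires beyond the matching step itself.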

Table 3.

Percent reductions in the MSE of the treatment effect estimator for the BMW design compared to a completely randomized design (CR) and a matched-pair design (MP) under the heteroscedastic and homoscedastic error assumptions, respectively. Number of replications = 1000. X1, X2 ~ i.i.d. Bernoulli(0.5); X3, X4 ~ i.i.d. N(0, 0.25).

γ
Σ_j γ_j
M MSE (CR) MSE Percent Reduction(%) (BMW vs. CR Design)
MSE (MP) MSE Percent Reduction(%) (BMW vs. MP Design)
k = 1 k = 2 k = 3 k = 1 k = 2 k = 3
ε_i ~ i.i.d. N(0, 1) if X1 = 1 and ε_i ~ i.i.d. N(0, 0.25) if X1 = 0
(0.5,0.5,0.5,0.5) 2 10 0.089 14.19 13.32 8.83 0.075 -0.90 -1.93 -7.20
(1.0,1.0,1.0,1.0) 4 10 0.152 41.40 49.28 44.78 0.121 26.63 36.50 30.87
(1.5,1.5,1.5,1.5) 6 10 0.258 52.57 65.36 65.05 0.166 26.48 46.31 45.83
ε_i ~ i.i.d. N(0, 1)
(0.5,0.5,0.5,0.5) 2 10 0.155 9.46 5.85 1.66 0.148 5.17 1.40 -2.99
(1.0,1.0,1.0,1.0) 4 10 0.218 28.89 32.40 29.18 0.190 18.39 22.42 18.73
(1.5,1.5,1.5,1.5) 6 10 0.316 42.98 52.08 48.45 0.252 28.43 39.85 35.29

5. Planning an Educational Study for tPA Usage in Stroke

In this section, we consider the use of the BMW design in planning an educational study to increase tPA therapy use for stroke patients, as described in the Introduction. As noted there, four covariates were measured on participating institutions, and it was impossible to obtain balance on all of them simultaneously in a matched-pair design. The simulation study in Section 4 suggests that the design parameter k = 2 and the number of replications M = 10 give results that are close to optimal over a broad class of covariate distributions and confounding effects. We therefore chose these parameters in proposing a design for the tPA study.

We randomly assigned the 24 hospitals to two treatment groups and estimated the sample-based propensity score for each hospital. The hospitals were then optimally matched into subsets with k = 2, which gave a minimum total distance of 2.5887. We then randomized the hospitals an additional 9 times, obtaining distance measures 2.05, 2.50, 0.20, 1.42, 0.49, 3.00, 1.14, 0.72 and 1.48; the fourth randomization overall (distance 0.20) produced the smallest distance. The corresponding BMW design is presented in Table 4, where there are 9 matched subsets, with treated hospital 1 matched to control 6, treated hospitals 2 and 3 jointly matched to control 8, and so on. For comparison, the data were also randomized using a matched-pair design, in which the twenty-four hospitals were matched into twelve pairs based on the two binary covariates, rural versus urban population density and low versus high stroke volume; one hospital in each pair was then randomized to treatment and one to control. Figure 1 illustrates the treatment-to-control imbalance in the two continuous covariates under the BMW and matched-pair designs.

Table 4.

Optimal matched sample produced by the BMW design with k = 2 and M = 10 for the case study. X1: percent of females greater than 65 years old among all females in the census tract (%); X2: percent of males greater than 65 years old among all males in the census tract (%); X3: stroke volume (low vs. high); X4: population density (urban vs. rural). The estimated propensity score (δ̂) is shown for each hospital, and the total propensity score distance over all strata is Δ = 0.202.

Strata Treatment Group Control Group
ID(δ̂) X1 X2 X3 X4 ID(δ̂) X1 X2 X3 X4
1 1 (0.33) 0.15 0.13 0 0 6 (0.35) 0.19 0.07 0 0
2 2 (0.38) 0.17 0.11 1 0 8 (0.35) 0.22 0.14 0 0
11 (0.40) 0.22 0.14 1 0
3 3 (0.63) 0.13 0.06 1 1 9 (0.63) 0.14 0.06 1 1
19 (0.67) 0.25 0.15 1 1
4 4 (0.58) 0.12 0.06 0 1 12 (0.60) 0.07 0.06 1 1
5 14 (0.32) 0.13 0.07 0 0 13 (0.32) 0.13 0.09 0 0
15 (0.31) 0.10 0.06 0 0
6 17 (0.41) 0.24 0.12 1 0 10 (0.41) 0.26 0.18 1 0
22 (0.43) 0.30 0.17 1 0
7 20 (0.60) 0.08 0.06 1 1 16 (0.61) 0.10 0.07 1 1
18 (0.61) 0.09 0.05 1 1
8 21 (0.60) 0.18 0.14 0 1 5 (0.61) 0.19 0.13 0 1
9 24 (0.62) 0.23 0.16 0 1 7 (0.62) 0.24 0.19 0 1
23 (0.62) 0.11 0.07 1 1

Figure 1.

Covariate imbalances from the matched-pair design (matching on the categorical covariates: population density and stroke volume) and the BMW design. The imbalance value in covariate X for unit i was computed as Imbalance(X_i) = Σ_{j∈T_s} X_j / |T_s| − Σ_{k∈C_s} X_k / |C_s| = X̄_{T_s} − X̄_{C_s}, where s is the stratum to which unit i belongs.
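The stratum-wise imbalance used in the figure is just a difference of within-stratum means. A minimal helper, with a hypothetical name and input layout, might look like this:

```python
def stratum_imbalance(strata):
    """Per-stratum covariate imbalance: mean of X over treated units minus
    mean of X over controls within each matched stratum.
    `strata` maps a stratum id to (treated_values, control_values)."""
    out = {}
    for s, (t_vals, c_vals) in strata.items():
        out[s] = sum(t_vals) / len(t_vals) - sum(c_vals) / len(c_vals)
    return out
```

For example, for stratum 1 of Table 4 the X1 imbalance is 0.15 − 0.19 = −0.04.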

When γ is known, we can determine the constraint k that minimizes the mean squared error under the BMW design. Preliminary data provided estimates of the regression parameters in a logit model for the proportion of stroke cases receiving tPA as -0.63 (stroke volume), 0.02 (population density), 4.33 (percent female older than 65) and -1.23 (percent male older than 65). Since there are 24 hospitals, k can take values from 1 to 11. For k = 1, M = 10 randomizations gave a minimum distance of 0.2936. We then repeated this process with the same randomized samples but with constraints k = 2, 3, …, 11, searching for the optimal sample with minimal distance at each k. Finally, based on the approximate value of γ above, we computed the MSE from (11) as 0.1076, 0.1045 and 0.1114 for the optimal samples with constraints k = 1, 2, 3, respectively. This suggests that pair matching and matching with constraint k = 2 achieve approximately the same level of optimality in terms of minimizing MSE. Compared with the matched-pair design described above, the BMW design reduced the MSE of the treatment effect estimator by 42%.

6. Discussion

The BMW design, in essence, applies the optimal full matching with constraints technique to randomization in order to achieve overall balance between treatment groups and to control the variance of the treatment comparison, and so yields good MSE properties. One of the virtues of this design is that it not only reduces chance imbalance in the observed covariates but also preserves the advantage of traditional randomized designs in balancing the unobserved covariates on average. Although only partial balance on the observed covariates is achieved by the BMW design, it is substantially better than the balance obtained by random assignment of treatments. When there is considerable confounding in small studies, this improvement in balance can result in a substantial decrease in the mean squared error of the treatment effect estimator.

The BMW design can be revised to allow the user to select other criteria besides MSE to compromise between bias and variance. If variance of the estimator is not a concern, one can modify this design to achieve optimal balance and so reduce conditional bias (i.e. set k = N/2 − 1). On the other hand, if the objective is to minimize variance, optimal pair matching with constraint k=1 is the best full matching choice.

We recommend use of a super-population model for analysis, and this is the basis of the simulation comparisons that we have made. It is worth noting, however, that the BMW design can also form the basis of a randomization test. Suppose, for example, that a sample has been collected using the BMW design with given k and M and that the value of a test statistic (e.g., a t statistic) has been computed. We now repeat the BMW design with the same k and M a large number B of times and each time compute the test statistic based on the fixed observed outcomes. This leads to a randomization test, and to confidence intervals following standard methods, and it would typically yield a reasonably large reference set as the basis of the test. For example, with continuous covariates, if M = ∞ (in practice, M large) and k = 1, then the BMW design will always lead to the same set of matched pairs, but the reference set would still contain the 2^n possible treatment assignments within the n pairs. With smaller M or discrete covariates, the reference set could be larger.
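As a simplified, hypothetical sketch of such a randomization test, the following re-randomizes complete assignments B times and compares a difference-in-means statistic against its observed value; a full version would re-run the BMW matching at each replicate rather than take simple random splits.

```python
import random
import statistics

def randomization_test(outcomes, observed_treated, B=2000, seed=0):
    """Monte Carlo two-sided randomization p-value for a
    difference-in-means statistic. Simplified stand-in for repeating
    the full BMW design (matching included) at each replicate."""
    n = len(outcomes)
    ids = list(range(n))

    def stat(treated):
        t = [outcomes[i] for i in treated]
        c = [outcomes[i] for i in ids if i not in treated]
        return statistics.mean(t) - statistics.mean(c)

    obs = stat(set(observed_treated))
    rng = random.Random(seed)
    count = 0
    for _ in range(B):
        treated = set(rng.sample(ids, n // 2))
        if abs(stat(treated)) >= abs(obs):
            count += 1
    # add-one correction so the p-value is never exactly zero
    return (count + 1) / (B + 1)
```

A strongly separated outcome pattern yields a small p-value, while an assignment indistinguishable from chance yields a large one.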

The model-based approach of adjusting for the estimated propensity score and the Robins-Newey E-estimation procedure could be considered as alternatives to the BMW design. Our simulation studies suggest that, when the propensity score model is appropriately specified, the BMW design is more efficient than the model-based approach when the confounding effects are relatively small; the model-based approach, however, becomes more efficient than the BMW design as the confounding effects increase. On the other hand, when the propensity score model is inappropriately specified, the BMW design achieves a substantial gain over the model-based approach. In the context considered in this paper, the E-estimation procedure is the least efficient and the least robust.

Greevy et al. (2004) proposed another multivariate matching design based on the Mahalanobis distance. This approach searches for the optimal multivariate non-bipartite matching, followed by randomization within pairs. We also investigated this in a simulation study, presented in Table 5. As the confounding effects increase, or as the number of covariates increases, the BMW design becomes much more effective in reducing MSE compared to Greevy's design. This may be because the Mahalanobis distance is inferior to the propensity score when there are many covariates.

Table 5.

Percent reductions in the MSE of the treatment effect estimator for the BMW design compared to the multivariate non-bipartite matching design (NB). Number of replications = 1000.

γ
Σ_j γ_j
M MSE (NB Design) MSE Percent Reduction(%) (BMW vs. NB Design)
k = 1 k = 2 k = 3
X1, X2, X3, X4 ~ i.i.d. Bernoulli(0.5)
(0.5,0.5,0.5,0.5) 2 5 0.146 -0.07 -2.27 -6.11
10 2.47 -0.55 -5.90
20 5.91 1.44 -3.91
(1.0,1.0,1.0,1.0) 4 5 0.185 2.42 14.49 8.53
10 9.62 15.79 11.68
20 24.78 22.18 18.44
(1.5,1.5,1.5,1.5) 6 5 0.250 1.77 30.92 24.36
10 14.01 32.12 26.28
20 25.24 34.40 29.20
X1, X2, X3, X4, X5, X6, X7, X8 ~ i.i.d. Bernoulli(0.5)
(0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5) 4 5 0.156 -8.15 0.51 -6.35
10 -6.41 0.96 -5.13
20 -1.15 2.18 -5.39
(1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0) 8 5 0.222 -25.19 16.39 16.07
10 -12.76 22.92 17.65
20 0.26 25.53 19.59
(1.5,1.5,1.5,1.5,1.5,1.5,1.5,1.5) 12 5 0.338 -39.10 29.01 32.47
10 -14.16 39.06 35.22
20 -1.31 42.46 36.37

In general terms, the BMW design appears to provide a viable approach in the context of small studies where adjustment for randomization imbalance may be important. Furthermore, the simplicity of this matching-based design allows researchers to perform simple stratified analyses that adjust for imbalance in the randomization, which is appealing.

Finally, simulation shows that the BMW design can substantially reduce the MSE of the treatment effect estimator compared to existing randomized designs in linear models. These investigations could be extended to other regression models, such as the class of generalized linear models. It should also be noted that the BMW design can be generalized to clinical trials with more than two treatment arms. A baseline-category logit model can be used to estimate the probability of a subject's assignment to each treatment arm, and the Euclidean distance can be used to measure the quality of a match.
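For the multi-arm extension, the suggested matching distance is the Euclidean distance between two units' estimated arm-assignment probability vectors (which would come from a baseline-category logit fit). A minimal helper, with a hypothetical name, might look like this:

```python
import math

def arm_prob_distance(p, q):
    """Euclidean distance between two units' estimated assignment-probability
    vectors over the treatment arms, as suggested for multi-arm BMW matching."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
```

Units whose probability vectors nearly coincide are close in this metric and would be candidates for the same matched stratum.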

Supplementary Material

Supp Data

Acknowledgments

The authors would like to thank Professors Mary Haan and Phillip Scott for helpful discussions. This work was supported in part by grant (RO1 NS050372) from the National Institute of Neurological Disorders and Stroke.

Footnotes

Supplementary Materials

The data for the case study and a SAS macro implementing the BMW design are available under the Paper Information link at the Biometrics website http://www.biometrics.tibs.org.

Contributor Information

Zhenzhen XU, Email: zzxu@umich.edu.

John D. Kalbfleisch, Email: jdkalbfl@umich.edu.

References

  1. Abdeljaber MH, Monto AS, Tilden RL, Schork MA, Tarwotjo I. The impact of vitamin A supplementation on morbidity: a randomized community intervention trial. American Journal of Public Health. 1991;81:1654–1656. doi: 10.2105/ajph.81.12.1654.
  2. Cochran WG. The effectiveness of adjustment by subclassification in removing bias in observational studies. Biometrics. 1968;24:295–313.
  3. COMMIT Research Group. Community Intervention Trial for Smoking Cessation (COMMIT): I. Cohort results from a four-year community intervention. American Journal of Public Health. 1995;85:183–192. doi: 10.2105/ajph.85.2.183.
  4. Cox DR. Regression models and life tables (with discussion). Journal of the Royal Statistical Society, Series B. 1972;34:187–200.
  5. Gu XS, Rosenbaum PR. Comparison of multivariate matching methods: structures, distances, and algorithms. Journal of Computational and Graphical Statistics. 1993;2:405–420.
  6. Graham JW, Flay BR, Johnson CA, Hansen WB, Collins LM. A multiattribute utility measurement approach to the use of random assignment with small numbers of aggregated units. Evaluation Review. 1984;8:247–260.
  7. Greevy R, Lu B, Silber JH, Rosenbaum P. Optimal multivariate matching before randomization. Biostatistics. 2004;5:263–275. doi: 10.1093/biostatistics/5.2.263.
  8. Hansen BB. Full matching in an observational study of coaching for the SAT. Journal of the American Statistical Association. 2004;99:609–618.
  9. Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. New York: Springer; 2001.
  10. Joffe MM. Propensity scores. American Journal of Epidemiology. 1999;150:327–333. doi: 10.1093/oxfordjournals.aje.a010011.
  11. Ming K, Rosenbaum PR. Substantial gains in bias reduction from matching with a variable number of controls. Biometrics. 2000;56:118–124. doi: 10.1111/j.0006-341x.2000.00118.x.
  12. The National Institute of Neurological Disorders and Stroke rt-PA Stroke Study Group. Tissue plasminogen activator for acute ischemic stroke. The New England Journal of Medicine. 1995;333:1581–1588. doi: 10.1056/NEJM199512143332401.
  13. Olsen SP. Multivariate matching with non-normal covariates in observational studies. University of Pennsylvania; 1997.
  14. Pocock SJ, Simon R. Sequential treatment assignment with balancing for prognostic factors in the controlled clinical trial. Biometrics. 1975;31:103–115.
  15. Robins JM, Mark SD, Newey WK. Estimating exposure effects by modeling the expectation of exposure conditional on confounders. Biometrics. 1992;48:479–495.
  16. Rosenbaum PR, Rubin DB. The central role of the propensity score in observational studies for causal effects. Biometrika. 1983;70:41–55.
  17. Rosenbaum PR. A characterization of optimal designs for observational studies. Journal of the Royal Statistical Society, Series B. 1991;53:597–610.
  18. Rosenbaum PR. Observational Studies. New York: Springer; 2002.
