2018 Dec 20;29(3):728–751. doi: 10.1177/0962280218817926

Covariate-adjusted survival analyses in propensity-score matched samples: Imputing potential time-to-event outcomes

Peter C Austin,1,2,3 Neal Thomas,4 Donald B Rubin5,6,7
Editor: Peter Austin
PMCID: PMC7082895  PMID: 30569832

Abstract

Matching on an estimated propensity score is frequently used to estimate the effects of treatments from observational data. Since the 1970s, different authors have proposed methods to combine matching at the design stage with regression adjustment at the analysis stage when estimating treatment effects for continuous outcomes. Previous work has consistently shown that the combination has generally superior statistical properties to either method alone. In biomedical and epidemiological research, survival or time-to-event outcomes are common. We propose a method that combines regression adjustment and propensity score matching to estimate survival curves and hazard ratios. It is based on imputing a potential outcome under control for each successfully matched treated subject, using either an accelerated failure time parametric survival model or a Cox proportional hazards model fit to the matched control subjects. The fitted model is then applied to the matched treated subjects to simulate the missing potential outcome under control for each treated subject. Conventional survival analyses (e.g., estimation of survival curves and hazard ratios) can then be conducted using the observed outcome under treatment and the imputed outcome under control. We evaluated the repeated-sampling bias of the proposed methods using simulations. When using nearest neighbour matching, the proposed method resulted in decreased bias compared to crude analyses in the matched sample. We illustrate the method using an example that examines the prescribing of beta-blockers at hospital discharge to patients hospitalized with heart failure.

Keywords: Propensity score, survival analysis, propensity score matching, matching, Monte Carlo simulations

1 Introduction

Observational studies are increasingly being used to estimate the effects of treatments, interventions, and exposures. These studies are complicated by confounding when the distribution of prognostically important covariates differs between treated and control subjects. Statistical methods should be used to minimize this confounding so that valid inferences on treatment effects can be drawn from such studies.

Statistical methods based on estimated propensity scores are increasingly being used to address this confounding. The true propensity score is the probability of treatment assignment conditional on measured baseline covariates.1 There are basically four ways of using the propensity score: matching, stratification, inverse probability of treatment weighting (IPTW), and model-based covariate adjustment,1–3 all of which are used in the biomedical literature.4,5

Survival or time-to-event outcomes occur frequently in the medical and epidemiological literature.6 Recent studies have examined the performance of different propensity score methods for estimating the effect of treatment on survival outcomes.7,8 Dorn suggested that when conducting an observational study one should ask “how would the study be conducted if it were possible to do it by controlled experimentation?”.9 Rubin suggests that this question defines the objective of an observational study10 (p.16). In a randomized controlled trial (RCT) with survival outcomes, there are two common methods for quantifying the effect of treatment on outcomes: estimating the survival function in each treatment group (often using the Kaplan–Meier method) and estimating the relative effect of treatment on the hazard of the outcome (often using a Cox proportional hazards model). The former approach allows for a comparison of survival between the treatment groups across the duration of follow-up, whereas the latter compares the rates at which outcomes occur in the two groups. In RCTs, an unadjusted, or crude, comparison of outcomes between treatment groups will, on average, result in an unbiased estimate of the effect of treatment versus control. However, investigators frequently report adjusted estimates in RCTs to adjust for conditional (on covariates) bias that may occur due to residual (after randomization) imbalance in baseline covariates between treatment groups.6

Propensity score methods incorporate some aspects of the design and analysis of an RCT. In particular, they compare outcomes between treated and control subjects with similar distributions of measured baseline covariates.1 Although matching on the estimated propensity score may, on average, minimize confounding due to measured covariates, one can expect residual confounding in any specific analysis. When using propensity-score matching with time-to-event outcomes, it is therefore desirable to have a method for estimating treatment effects that reduces the residual confounding persisting after matching on the propensity score.

The objective of the current study is to describe and evaluate the performance of methods to conduct covariate-adjusted analyses in propensity-score matched samples when the outcomes are times-to-events. In Section 2, we provide notation, background on propensity score matching, and describe three methods to reduce residual confounding when estimating the effect of treatment on survival outcomes in propensity-score matched samples. In Section 3, we describe simulations to evaluate the performance of these methods, and in Section 4, we report their results. In Section 5, we provide a brief case study illustrating the application of the proposed methods. In Section 6, we summarize our findings and place them in the context of the existing literature.

2 Statistical methods

2.1 The potential outcomes framework and average treatment effects

We introduce the potential outcomes framework in general terms; in the final three paragraphs of this section, we discuss its use with survival or time-to-event outcomes. Let Z be an indicator variable denoting treatment status (Z = 1 for the active treatment vs. Z = 0 for the control treatment). In settings such as this with two possible treatments, the potential outcomes framework (also known as Rubin’s Causal Model11) assumes that the ith subject has a pair of potential outcomes: $Y_i(0)$ and $Y_i(1)$, the outcomes under the control and active treatment, respectively, under identical circumstances (including at the same point in time).11 However, each subject receives only one of the two treatments (after one treatment has been applied, the other cannot be applied under identical circumstances). Therefore, only one outcome, denoted by $Y_i$, is observed for the ith subject: the outcome under the treatment received.

The effect of treatment for the ith subject is defined as $Y_i(1)-Y_i(0)$. The average treatment effect (ATE) is defined as $E[Y_i(1)-Y_i(0)]$, where $E[\cdot]$ denotes the expectation (average) over all subjects (note that although the mean is commonly used for summarizing causal effects, Rubin has long suggested that in some scenarios the use of other summary measures, such as the median, may be more appropriate11). The average treatment effect among the treated subjects (ATT) is $E[Y_i(1)-Y_i(0)\,|\,Z=1]$. The latter, which will be the target estimand of interest in the current paper, is the average effect of treatment in those subjects who received the active treatment: it is the effect of treatment versus control in those subjects who were ultimately treated.

When outcomes are time-to-event in nature, the above definitions can be used with Y(0) and Y(1) denoting the event times under control and treatment, respectively. The resultant metric for quantifying the effect of treatment is the mean change in survival time. Although the causal estimand is well defined, it is infrequently used in the biomedical literature. Instead, the effect of treatment on survival is more frequently quantified using the absolute differences in survival probabilities at specified durations of follow-up or using the hazard ratio.

The survival difference at a specified time t can be formulated as a special case of the ATT. Define $D_i(1)=I(Y_i(1)>t)$ and $D_i(0)=I(Y_i(0)>t)$, where $I(\cdot)$ denotes the indicator function, so that we have defined two binary potential outcomes corresponding to whether the respective time-to-event potential outcomes exceed the specified time t. Then the survival difference at time t among treated subjects is defined to be $E[D_i(1)-D_i(0)\,|\,Z=1]=E[D_i(1)\,|\,Z=1]-E[D_i(0)\,|\,Z=1]$. This equality displays a subtle distinction in causal inference. The left-hand side represents the mean causal difference for the same subject, whereas the right-hand side is a comparison of “marginal” distributions of potential outcomes. In this setting, “marginal” means the separate distributions of $Y(0)$ and $Y(1)$, not their joint distribution, nor distributions averaged over baseline covariates. Because means are linear functionals, the marginal and joint summaries coincide, which is not true for many other summaries (e.g., in general, $\mathrm{median}[Y_i(1)-Y_i(0)\,|\,Z=1]\neq\mathrm{median}[Y_i(1)\,|\,Z=1]-\mathrm{median}[Y_i(0)\,|\,Z=1]$).

The hazard ratio falls within the class of “marginal” summaries of causal outcomes. Let $h_0$ denote the hazard function for the distribution of $Y_i(0)$ when $Z_i=1$, and correspondingly $h_1$ for $Y_i(1)$. The hazard ratio that we target is $h_1(t)/h_0(t)$, which is assumed to be constant over all times t. This hazard ratio compares the marginal distributions and not the potential outcomes within each subject.

2.2 Notation and background

We let X denote a vector of measured baseline covariates. The propensity score is $e=\Pr(Z=1\,|\,X)$ when treatment assignment is unconfounded (given X); it is often estimated using a logistic regression model in which the binary indicator variable $Z_i$ is regressed on $X_i$. Alternatively, methods from the machine learning literature, such as random forests or generalized boosting methods, can be used.12–14 Variable selection for the propensity score model has been considered elsewhere.15

Propensity-score matching involves the formation of matched sets of treated and control subjects who share a similar value of the estimated propensity score. We restrict our attention to pair-matching without replacement, in which each matched set consists of one treated subject and one control subject. A wide variety of algorithms exist for matching subjects on the propensity score.16 We restrict our focus to two methods: nearest neighbour matching on the propensity score, and nearest neighbour matching on the logit of the propensity score using calipers of width equal to 0.2 of the standard deviation of the logit of the propensity score.17,18 Propensity-score matching can estimate the ATT using a simple approach: for each matched treated subject, the potential outcome under the active treatment is the subject’s observed outcome, and the potential outcome under the control treatment is estimated by the observed outcome of the matched control subject.

When outcomes are time-to-event in nature, Kaplan–Meier survival curves can be estimated separately in treated and control subjects in the matched sample,19 which allows for the comparison of the survival function between treated and control subjects across the duration of follow-up. Additionally, a Cox proportional hazards model can be fit in the matched sample in which the hazard of the occurrence of the outcome is regressed on an indicator variable denoting treatment status.8,19

2.3 Covariate-adjusted survival analyses in propensity-score matched samples

We describe three methods for conducting covariate-adjusted survival analyses in propensity-score matched samples. The first two methods are based on using regression models to estimate the potential outcomes under control for the matched treated subjects. Conventional survival analyses are then conducted using the estimated potential outcomes under control and the observed outcomes under treatment for the matched treated subjects. The third method is based on directly estimating covariate-adjusted survival curves under control for the matched treated subjects.

2.3.1 Imputing control potential outcomes for the matched treated subjects

We describe two methods for estimating the missing potential outcomes under control for each of the matched treated subjects. The first method is a fully parametric approach based on the use of Accelerated Failure Time (AFT) parametric survival models, whereas the second method is a semi-parametric approach based on the use of the Cox proportional hazards model. These methods are motivated by the method described by Imbens for use with continuous outcomes20 and by approaches described by Little et al. and by Zhao et al. for imputing survival outcomes.21,22

Imputing potential outcomes using AFT models

When considering treatment effects for continuous outcomes, different authors have suggested that a regression model could be used to impute the potential outcome under control for the treated subjects, thereby allowing estimation of the ATT.20,23–26 We modify this approach for use with time-to-event outcomes. Our approach is similar to that described in an unpublished thesis from 1998.27 Within the sample of matched control subjects, one can use an AFT parametric survival model to regress the time of the occurrence of the outcome on the measured baseline covariates: $\log(T_i)=\beta_0+\beta_1 x_{i1}+\cdots+\beta_k x_{ik}+\sigma\varepsilon_i$, where $T_i$ denotes the survival time of the ith control subject and $x_{i1},\ldots,x_{ik}$ denote the k covariates measured on this subject. The parameter $\sigma$ is referred to as the scale parameter, and the residual terms $\varepsilon_i$ are assumed to follow a specified distribution. The assumed distribution of the $\varepsilon_i$ implies event times that follow a specific distribution; for instance, if the error terms follow a two-parameter extreme value distribution, then the event times follow a Weibull distribution. The AFT model fit in the matched control subjects is then applied to the matched treated subjects to determine the conditional distribution of event times for each matched treated subject given the observed baseline covariates, and an event time is drawn from this conditional distribution. This simulated event time serves as the imputed potential outcome under control for the matched treated subject: it is an estimate of the event time the treated subject would have experienced had they not been treated. Note that it is important to draw a random event time from the conditional distribution of event times, rather than using the mean or median of that distribution, because the true event time is not known. For each treated subject, we impute a single potential outcome from the estimated control distribution, yielding a “complete matched data set” that can be analyzed using conventional methods for matched samples. The rationale for the use of single imputation is that it reflects what is done in conventional matching, in which the observed outcome for the matched control subject serves as the imputed potential outcome under control for the matched treated subject; thus, conventional matching implicitly imputes a single potential outcome under control for each matched treated subject. A single imputed data set suffices for the assessment of bias reduction presented here. We expect that a multiple imputation method28 that creates multiple completed matched data sets and then combines the resulting estimates would improve precision compared to single imputation and yield improved inferential procedures (see Appendix 1 for a brief description of how this would be implemented).
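To make the imputation step concrete, the following is a minimal Python sketch of the procedure under the Weibull/extreme value representation above. The toy data, dimensions, and variable names are illustrative assumptions standing in for the matched control and matched treated subjects; this is a sketch of the approach, not the authors’ implementation.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Toy stand-ins for the matched controls (Xc, time, event) and the matched
# treated subjects (Xt); in practice these come from the matched sample.
n, k = 200, 3
Xc = rng.standard_normal((n, k))
eps = np.log(-np.log(rng.uniform(size=n)))      # standard extreme value draws
t_latent = np.exp(0.5 + Xc @ np.array([0.5, -0.3, 0.2]) + 0.8 * eps)
cens = rng.exponential(10.0, size=n)
time = np.minimum(t_latent, cens)               # observed, possibly censored, times
event = (t_latent <= cens).astype(float)        # 1 = event observed, 0 = censored

# Fit the Weibull AFT model log(T) = b0 + x'b + sigma*eps by maximum likelihood,
# where eps follows the standard (minimum) extreme value distribution.
def negloglik(theta):
    b0, b, log_sigma = theta[0], theta[1:k + 1], theta[-1]
    z = (np.log(time) - b0 - Xc @ b) / np.exp(log_sigma)
    # Events contribute log f(t); censored subjects contribute log S(t) = -exp(z).
    return -np.sum(event * (z - log_sigma - np.log(time)) - np.exp(z))

fit = minimize(negloglik, x0=np.zeros(k + 2), method="BFGS")
b0_hat, b_hat, sigma_hat = fit.x[0], fit.x[1:k + 1], np.exp(fit.x[-1])

# Impute a potential outcome under control for each matched treated subject by
# drawing from the fitted conditional distribution (not its mean or median).
Xt = rng.standard_normal((150, k))
eps_new = np.log(-np.log(rng.uniform(size=len(Xt))))
t_imputed = np.exp(b0_hat + Xt @ b_hat + sigma_hat * eps_new)
```

In practice one would likely use a packaged AFT routine (e.g., survreg in R or lifelines in Python) rather than a hand-rolled likelihood; the sketch simply makes the draw from the conditional distribution explicit.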

Once the potential outcome under control has been imputed for each matched treated subject, the analyst can use these imputed potential outcomes under control, along with the observed outcomes under treatment for the matched treated subjects, to conduct conventional survival analyses. One can use the Kaplan–Meier method to obtain an estimate of the survival function under treatment (using the observed outcomes for the matched treated subjects) and an estimate of the survival function under control (using the imputed potential outcomes under control for the matched treated subjects).29 Similarly, one can use a Cox proportional hazards regression model to regress the hazard of the occurrence of the event on an indicator variable denoting treatment status.

Imputing potential outcomes using semi-parametric Cox regression models

The second approach to estimating the potential outcomes under control for the matched treated subjects is based on the use of the semi-parametric Cox proportional hazards model. This approach is motivated by a method described by Zhao et al. to impute event times in the presence of informative censoring.22 The method is similar in concept to the method described above. However, rather than fitting an AFT parametric survival model, one fits a Cox proportional hazards model in the sample of matched control subjects: $\log(h_i(t))=\log(h_0(t))+\beta_1 x_{i1}+\cdots+\beta_k x_{ik}$, where $h_0(t)$ denotes the baseline hazard function and $h_i(t)$ denotes the hazard function for the ith subject (as above, the treatment indicator variable is not included in the regression model). An estimate of the baseline survival function is obtained (e.g., using the Breslow estimator of the baseline hazard function). The fitted model is then applied to the sample of matched treated subjects. For each matched treated subject, one obtains an estimate of the survival function (which is equivalent to the distribution function for event times). Let $S_i^{\text{est}}(t)$ denote the estimated survival function (under control) for the ith matched treated subject. One then draws a random number from the standard uniform distribution, $u\sim U(0,1)$, and the simulated potential outcome under control for the ith matched treated subject is obtained by inverting the relationship $S_i^{\text{est}}(t)=u$.
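A sketch of this inversion, using the lifelines package for the Cox fit and hypothetical toy data standing in for the matched controls and matched treated subjects, might look as follows. Because the estimated survival function is a step function, the inverse is taken as the first event time at which the curve falls to or below u.

```python
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(1)

# Toy stand-ins for the matched controls and the matched treated subjects.
cols = ["x1", "x2", "x3"]
controls = pd.DataFrame(rng.standard_normal((200, 3)), columns=cols)
controls["time"] = rng.exponential(np.exp(-0.5 * controls["x1"]))
controls["event"] = rng.binomial(1, 0.9, size=200)
treated = pd.DataFrame(rng.standard_normal((150, 3)), columns=cols)

# Fit the Cox model to the matched controls (no treatment indicator included).
cph = CoxPHFitter()
cph.fit(controls, duration_col="time", event_col="event")

# Estimated survival function under control for each matched treated subject:
# rows are event times, one column per treated subject.
surv = cph.predict_survival_function(treated)
times = surv.index.to_numpy()

# Invert S_i(t) = u at a uniform draw u. If u lies below the smallest estimated
# survival probability, the inverse falls beyond the largest observed time, and
# the imputed outcome is treated as censored there.
u = rng.uniform(size=surv.shape[1])
imputed = np.empty(surv.shape[1])
for j in range(surv.shape[1]):
    s_j = surv.iloc[:, j].to_numpy()
    crossed = s_j <= u[j]
    imputed[j] = times[np.argmax(crossed)] if crossed.any() else times[-1]
```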

Once the potential outcomes under control have been imputed for each matched treated subject, these are combined with the observed outcomes (under treatment) for each matched treated subject. Conventional statistical methods for survival analysis are then conducted as described in the previous section.

2.3.2 Directly estimating covariate-adjusted survival functions under control

In the previous two sections, we described methods for imputing (i.e. estimating) the potential outcome under control for the matched treated subjects. In this section, we describe a method for directly estimating the covariate-adjusted survival function under control. The method is similar to the second one described above. As above, a covariate-adjusted survival curve, representing the survival function under control, is estimated for each matched treated subject. These survival curves are then averaged across all matched treated subjects. Let $S_i^{0,\text{est}}(t)$ denote the estimated survival function under control for the ith matched treated subject. Then the survival function under control for the matched treated subjects is estimated as $S^{0,\text{est}}(t)=\frac{1}{N}\sum_{i=1}^{N}S_i^{0,\text{est}}(t)$, where N denotes the number of matched treated subjects.
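In the Cox-based sketch above, this directly adjusted survival function is simply the pointwise average of the matrix of per-subject curves:

```python
# Pointwise average of the per-subject survival curves under control,
# indexed by event time (one column per matched treated subject).
S0_adjusted = surv.mean(axis=1)
```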

Once the survival function under control has been estimated for the matched treated subjects, it can be compared with the empirical survival function for the observed event times of the matched treated subjects (say, obtained using the Kaplan–Meier method). Absolute differences in survival probabilities at different times can then be reported. A limitation of this approach is that the effect of treatment can only be quantified using methods based on the comparison of survival functions. Direct estimation of hazard ratios using a Cox regression model is not possible with this approach, because an estimated or imputed potential outcome under control has not been obtained for each matched treated subject. Thus, while the first approach permits estimation of both absolute and relative effects of treatment on survival, the second approach only permits estimation of absolute effects.

3 Monte Carlo simulations to examine the performance of methods for conducting covariate-adjusted survival analyses within matched samples

Monte Carlo simulations were used to examine the performance of the three methods described above. Our primary focus is on estimating survival functions; our secondary focus is on estimating hazard ratios. We conducted a set of simulations in which one factor was allowed to vary: the conditional hazard ratio relating treatment to the hazard of the outcome. Other factors were fixed in the design of the simulation: the number of baseline covariates and their distribution, the prevalence of treatment, and the distribution of event times. Due to space constraints, we examined a small number of scenarios. A factor limiting the number of scenarios was the difficulty of summarizing differences between survival curves across estimation methods: because our primary focus was on estimating survival functions, results are best displayed graphically, with the survival functions displayed for the different estimation methods (requiring one figure panel per scenario for each of the two matching methods). Given the difficulty of summarizing the difference between two survival curves in a single numerical quantity, we were limited in the number of scenarios for which we could concisely report results.

3.1 Simulating baseline covariates and treatment status

We simulated baseline covariates, treatment status, and time-to-event outcomes for a super-population consisting of 1,000,000 subjects. Ten baseline covariates ($X_1,\ldots,X_{10}$) were simulated for each subject from independent standard normal distributions. For each subject, the logit of the probability of treatment was determined from the following logistic model:

$$\operatorname{logit}(p_i)=\log(0.2/0.8)+\log(1.1)x_{1i}+\log(1.25)x_{2i}+\log(1.5)x_{3i}+\log(1.75)x_{4i}+\log(2)x_{5i}+\log(2.5)x_{6i}+\log(1/1.25)x_{7i}+\log(1/1.5)x_{8i}+\log(1/2)x_{9i}+\log(1/2.5)x_{10i}$$

Thus, increasing values of six of the baseline variables were associated with an increasing likelihood of treatment selection, while increasing values of four of the variables were associated with a decreasing likelihood of treatment assignment. For each subject, treatment assignment was then simulated from a Bernoulli distribution with subject-specific parameter $p_i$. Because the coefficients in the above model are of the form $\log(a)$, they are log-odds ratios, with a the odds ratio of treatment for a one-unit increase in the given covariate. Thus, a one standard deviation increase in $X_1$ is associated with a 10% increase in the odds of treatment.
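This data-generating step translates directly into code; the following is a minimal numpy sketch (the generator seed is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2024)
n = 1_000_000

# Ten independent standard normal baseline covariates.
X = rng.standard_normal((n, 10))

# Odds ratios from the treatment-selection model above.
odds_ratios = np.array([1.1, 1.25, 1.5, 1.75, 2, 2.5,
                        1 / 1.25, 1 / 1.5, 1 / 2, 1 / 2.5])
logit_p = np.log(0.2 / 0.8) + X @ np.log(odds_ratios)
p = 1.0 / (1.0 + np.exp(-logit_p))

# Treatment status drawn from a Bernoulli distribution with parameter p_i.
Z = rng.binomial(1, p)
```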

3.2 Simulating a time-to-event outcome

For each subject, we simulated two potential time-to-event outcomes using a method described by Bender et al.30 First, we simulated a time-to-event outcome under the control treatment, and then we simulated a time-to-event outcome under the active treatment. Under the control treatment, the linear predictor was defined as

$$LP_i(0)=\log(1/2.5)x_{1i}+\log(1/2)x_{2i}+\log(1.1)x_{3i}+\log(1.25)x_{4i}+\log(1.5)x_{5i}+\log(1/1.5)x_{6i}+\log(1.75)x_{7i}+\log(2)x_{8i}+\log(1/1.25)x_{9i}+\log(2)x_{10i}$$

As with the treatment-selection model described above, the regression coefficients are log-hazard ratios. Thus, a one standard deviation increase in $X_1$ is associated with a 60% decrease in the hazard of the occurrence of the outcome. The linear predictor under the active treatment is $LP_i(1)=LP_i(0)+\beta_{\text{treat}}$, where $\beta_{\text{treat}}$ is the conditional log-hazard ratio for treatment. For each subject, we generated a random number from a standard uniform distribution, and an event time was generated from a Weibull distribution, $T=\left(\frac{-\log(u)}{\lambda e^{LP}}\right)^{1/\eta}$, where $u\sim U(0,1)$, LP denotes either $LP_i(0)$ or $LP_i(1)$, and $\lambda$ and $\eta$ denote the scale and shape parameters, respectively, of the Weibull distribution (using the parametrization of Bender et al.). We set $\lambda$ and $\eta$ equal to 0.00002 and 2, respectively, as has been done in previous studies.8,31 The observed time-to-event for each subject was the potential outcome under the treatment to which the subject was assigned. The median of the population distribution of event times if all subjects were untreated was 173 (25th and 75th percentiles: 84 and 350).
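Continuing the sketch above, the potential event times can be generated as follows. Note that using a single uniform draw per subject for both potential outcomes is an assumption; the text leaves the dependence between the two potential outcomes unspecified.

```python
# Log-hazard-ratio coefficients from the outcome model above.
hazard_ratios = np.array([1 / 2.5, 1 / 2, 1.1, 1.25, 1.5,
                          1 / 1.5, 1.75, 2, 1 / 1.25, 2])
lp0 = X @ np.log(hazard_ratios)      # linear predictor under control
beta_treat = np.log(2)               # e.g., a conditional hazard ratio of 2
lp1 = lp0 + beta_treat               # linear predictor under treatment

lam, eta = 0.00002, 2                # Weibull scale and shape (Bender et al.)
u = rng.uniform(size=n)              # one uniform draw per subject (assumption)
t0 = (-np.log(u) / (lam * np.exp(lp0))) ** (1 / eta)   # outcome under control
t1 = (-np.log(u) / (lam * np.exp(lp1))) ** (1 / eta)   # outcome under treatment

# The observed time is the potential outcome under the assigned treatment.
T = np.where(Z == 1, t1, t0)
```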

To determine the true population survival curves under treatment and control (which was necessary to permit determination of bias when estimating covariate-adjusted survival curves), we created a dataset consisting of both potential outcomes for each subject who was treated. In this dataset, we estimated Kaplan–Meier survival functions using the potential outcomes under treatment and the potential outcomes under control separately (as both of these were simulated for each subject). This was done in those subjects who were treated as our focus is matching without replacement, which estimates the effect of treatment in the treated (ATT). Similarly, we computed the true population marginal hazard ratio. Using both potential outcomes for each subject who was assigned to treatment, we regressed the time-to-event outcomes on an indicator variable denoting treatment status using a Cox proportional hazards regression model. Note that this analysis essentially used two records per treated subject: one record for each of the two potential outcomes.
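As a sketch, the “two records per treated subject” computation of the true marginal hazard ratio might look as follows (continuing the simulation code above and using the lifelines package for the Cox fit):

```python
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

# Two records per treated subject, one per potential outcome; there is no
# censoring in the super-population, so every record is an event.
m = int(Z.sum())
stacked = pd.DataFrame({
    "time": np.concatenate([t1[Z == 1], t0[Z == 1]]),
    "treat": np.concatenate([np.ones(m), np.zeros(m)]),
    "event": np.ones(2 * m),
})
cox = CoxPHFitter().fit(stacked, duration_col="time", event_col="event")
true_marginal_hr = cox.hazard_ratios_["treat"]
```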

3.3 Monte Carlo simulations

From the super-population of 1,000,000 subjects, we drew a random sample of 1000 subjects (without taking treatment status into account during the random sampling). In this random sample, we estimated the propensity score using logistic regression to regress observed treatment status on all 10 measured baseline covariates. Using the estimated propensity score, we created two matched samples. First, we used nearest neighbour matching (NNM) to match treated and control subjects on the propensity score: treated subjects were selected at random and matched to the control subject whose propensity score was closest to that of the selected treated subject. Matching was done without replacement, so that each control subject could be selected at most once. Second, a propensity-score matched sample was constructed using nearest neighbour matching on the logit of the estimated propensity score with a caliper of width equal to 0.2 standard deviations of the logit of the propensity score.17 Within each of the two matched samples, Kaplan–Meier estimates of the survival function were computed in treated and control subjects separately. These estimated survival curves represent the survival curves that are typically estimated when propensity-score matching is used. Similarly, in each of the two matched samples, the hazard of the outcome was regressed on an indicator variable denoting treatment status using a proportional hazards regression model.
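A sketch of the sampling, propensity score estimation, and greedy NNM step (continuing the simulation code above; statsmodels provides an unpenalized logistic fit) is shown below. Caliper matching differs only in measuring distance on the logit scale and discarding a treated subject whose closest unmatched control lies outside the caliper.

```python
import numpy as np
import statsmodels.api as sm

# Draw a random sample of 1000 subjects, ignoring treatment status.
idx = rng.choice(n, size=1000, replace=False)
Xs, Zs = X[idx], Z[idx]

# Estimate the propensity score by logistic regression of Z on all 10 covariates.
ps = sm.Logit(Zs, sm.add_constant(Xs)).fit(disp=0).predict(sm.add_constant(Xs))

# Caliper width: 0.2 SD of the logit of the propensity score (caliper matching only).
lps = np.log(ps / (1 - ps))
caliper = 0.2 * lps.std()

# Greedy nearest neighbour matching without replacement: treated subjects in
# random order, each matched to the closest still-unmatched control.
treated_ids = rng.permutation(np.where(Zs == 1)[0])
pool = list(np.where(Zs == 0)[0])
pairs = []
for t_id in treated_ids:
    if not pool:
        break
    dist = np.abs(ps[np.array(pool)] - ps[t_id])
    j = int(np.argmin(dist))
    # For caliper matching: compute dist on lps instead, and skip this treated
    # subject (without removing the control) whenever dist[j] > caliper.
    pairs.append((t_id, pool.pop(j)))
```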

We then used the three methods described above to estimate covariate-adjusted survival curves representing survival under control for the treated subjects in the matched sample. For the first approach, we used a Weibull AFT model. For both the Weibull AFT model and the Cox regression model, we included all 10 baseline covariates in the regression model. We also used the first two methods described above to estimate an adjusted hazard ratio: the imputed potential time-to-event outcomes under control for the matched treated subjects were combined with the observed time-to-event outcomes for the matched treated subjects, and a Cox model was used to regress the hazard of the outcome on an indicator variable denoting treatment status.

For comparative purposes, we examined an alternative method for estimating an adjusted hazard ratio in the propensity-score matched sample. In the original propensity-score matched sample (consisting of the observed outcomes for the matched control subjects and the observed outcomes for the matched treated subjects), we used a Cox proportional hazards model to regress the hazard of the outcome on an indicator variable denoting treatment status and the 10 baseline covariates. We refer to this approach as “fitting a fully-adjusted regression model in the matched sample.”

This process was repeated 1000 times. As our primary focus is on estimating survival functions, we summarized the estimated survival functions using the following procedure: for each of the two matching methods, the empirical Kaplan–Meier survival curves estimated in the control subjects were averaged over the 1000 repetitions of the simulations. For each of the two matching methods, the adjusted Kaplan–Meier survival curves under control were averaged over the 1000 repetitions of the simulations. Similarly, estimated log-hazard ratios were averaged over the 1000 repetitions of the simulations.

One factor was allowed to vary in the Monte Carlo simulations: the true conditional hazard ratio for the effect of treatment on the hazard of the outcome ($\exp(\beta_{\text{treat}})$). The true conditional hazard ratio took four values: 1, 2, 3, and 4. The magnitude of this parameter affects the degree of separation between the population survival curves for treated and control subjects. We thus constructed four super-populations of potential outcomes. From each super-population, we drew 1000 random samples and conducted the statistical analyses described above.

3.4 Sensitivity analysis

In the above simulations, an appropriately specified AFT model was used to model the distribution of event times when imputing control potential outcomes for the treated subjects. In particular, the time-to-event outcomes in the super-population were simulated to follow a Weibull distribution conditional on the baseline covariates, and during the adjustment process a Weibull AFT model was fit to the data. In this sensitivity analysis, we simulated time-to-event outcomes in the super-population to follow a Gompertz distribution with scale and shape parameters equal to 0.002 and 0.006, respectively, using methods described by Bender et al.30 The median of the population distribution of event times if all subjects were untreated was 172 (25th and 75th percentiles: 58 and 353). Then, as part of the adjustment process, a Weibull AFT model was fit to the simulated data. Thus, an AFT model that assumed an incorrect distribution of event times was used when imputing event times.
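Assuming the Bender et al. parametrization (cumulative baseline hazard $H_0(t)=(\lambda/\alpha)(e^{\alpha t}-1)$, with $\lambda$ the scale and $\alpha$ the shape), the Gompertz event times are again generated by inversion; a short extension of the earlier sketch:

```python
# Gompertz event times under control by inversion (Bender et al.):
# solve (lam_g/alpha_g) * (exp(alpha_g*t) - 1) * exp(lp0) = -log(u) for t.
lam_g, alpha_g = 0.002, 0.006
u = rng.uniform(size=n)
t0_gompertz = np.log(1 - alpha_g * np.log(u) / (lam_g * np.exp(lp0))) / alpha_g
```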

4 Results of the Monte Carlo simulations

4.1 Description of simulated datasets

Each of the 1000 simulated datasets consisted of 1000 subjects with X and Z. The median number of treated subjects was 290 across the 1000 iterations of the simulations (25th and 75th percentiles: 280 and 300). When using caliper matching, the mean number of matched treated subjects was 195. The mean percentage of treated subjects that were matched to a control subject using caliper matching was 67%. Thus, we are simulating data in which there is the risk of “bias due to incomplete matching.”18

The distribution of the propensity score in treated and control subjects in the first simulated dataset is displayed in Figure 1. The top panel displays non-parametric estimates of the density function of the logit of the propensity score, whereas the bottom panel presents side-by-side boxplots of the distribution of the propensity score in treated and control subjects. The distribution of the propensity score differed substantially between treated and control subjects.

Figure 1. Distribution of the propensity score in treated and control subjects.

Balance in baseline covariates between treated and control subjects was assessed using the absolute value of the standardized mean difference (or standardized difference);18,32 a minimal sketch of this computation follows Table 1. The mean of the absolute standardized mean difference across the 1000 replications is reported in Table 1. Substantial imbalance was observed for most of the 10 covariates in the crude (unmatched) samples. The use of NNM resulted in matched samples in which balance was substantially improved; however, residual imbalance persisted for some of the 10 baseline covariates. The use of caliper matching resulted in matched samples in which differences between matched treated and matched control subjects were minimal (mean absolute standardized mean difference < 0.05 for all 10 covariates).

Table 1.

Balance of baseline covariates in simulated datasets.

            Mean absolute standardized difference of the mean
Variable    Crude/unmatched    NNM      Caliper matching
X1          0.076              0.042    0.042
X2          0.151              0.063    0.042
X3          0.268              0.101    0.043
X4          0.373              0.146    0.040
X5          0.465              0.182    0.042
X6          0.630              0.251    0.041
X7          0.149              0.063    0.041
X8          0.273              0.106    0.042
X9          0.464              0.181    0.042
X10         0.631              0.250    0.040

Note: Each cell contains the mean of the absolute standardized mean difference across the 1000 iterations of the simulations.
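For a continuous covariate, the standardized difference compares group means in units of the pooled standard deviation; a minimal numpy sketch of the quantity tabulated above:

```python
import numpy as np

def abs_standardized_difference(x_treated, x_control):
    """Absolute standardized difference of means for one continuous covariate."""
    pooled_sd = np.sqrt((x_treated.var(ddof=1) + x_control.var(ddof=1)) / 2)
    return np.abs(x_treated.mean() - x_control.mean()) / pooled_sd
```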

These descriptive analyses illustrate that we have simulated data in which methods to estimate adjusted effects would be anticipated to improve estimates of treatment effect when using NNM, because simple crude estimates obtained in matched samples formed using NNM will be contaminated by residual confounding. In contrast, the use of caliper matching may result in bias due to incomplete matching, as a moderate proportion of the treated subjects was not successfully matched to a control subject.18 It is thus desirable to examine methods by which estimates obtained using NNM can be improved by reducing the bias due to residual imbalance in baseline covariates.

4.2 Estimation of survival functions

Figures 2 to 5 display the estimated survival functions, with two figures for each matching method (NNM in Figures 2 and 3; caliper matching in Figures 4 and 5). Each figure consists of two panels, representing two of the four values of the conditional hazard ratio for the effect of treatment on the hazard of the occurrence of the outcome. Each panel presents seven survival functions. One pair of survival functions shows survival under treatment and under control (using the appropriate potential outcome) in the treated subjects in the super-population; these denote the true survival curves in treated subjects (using both the observed outcome and the counterfactual outcome). A second pair shows the average crude or unadjusted Kaplan–Meier estimates of the survival functions obtained in treated and control subjects in the propensity-score matched samples; these represent the survival functions that would typically be estimated when using propensity-score matching. The fifth survival function shows the average adjusted survival curve under control in the matched treated subjects when an AFT model was used to impute the potential outcome under control. The sixth shows the average adjusted survival curve under control when a Cox model was used to impute the potential outcome under control. The fifth and sixth survival functions thus represent adjusted survival functions under control obtained when a regression model was used to impute the potential outcome under control for the matched treated subjects. The seventh survival function shows the average adjusted survival curve under control when a Cox model was used to directly estimate adjusted survival functions under control. For each matching method, differences between the three adjusted survival curves under control were minimal (indeed, the three survival curves are essentially indistinguishable).

Figure 2. Survival curves for NNM.

Figure 3. Survival curves for NNM.

Figure 4. Survival curves for caliper matching.

Figure 5. Survival curves for caliper matching.

In all four scenarios, when using NNM, the mean estimated survival curve in the matched treated subjects (i.e. the survival curve under treatment that would typically be obtained when using propensity-score matching) coincides with the treated population’s survival curve under treatment. This is to be expected as NNM results in the inclusion of all treated subjects in the matched sample. However, when using caliper matching, the mean estimated survival curve in the matched treated subjects (i.e. the survival curve under treatment that would typically be obtained when using propensity-score matching) differs from the treated population’s survival curve under treatment, which reveals bias due to incomplete matching. On average, only 67% of treated subjects were successfully matched to a control subject when using caliper matching. Thus, the exclusion of treated subjects from the matched sample resulted in biased estimation of the treated population’s survival curve under treatment.

In the top panel of Figure 2 (true hazard ratio of one – denoting a null treatment effect), we observe that the crude (or unadjusted) survival curve estimated in the matched control subjects (i.e. the survival function under control that would typically be estimated when using propensity-score matching) is noticeably different from the remaining curves when NNM was used. However, the three adjusted survival functions under control (i.e. the three adjusted survival functions under control estimated using the methods described above) coincide with the true population survival curves (which are coincident given the null treatment effect). In the lower panel of Figure 2, we observe that the crude survival curves estimated in the matched control subjects (i.e. the survival function under control that would typically be estimated when using propensity-score matching) are biased, whereas the three adjusted survival functions under the control (i.e. the three adjusted survival functions under control estimated using the methods described above) agree with the true population survival function under control. A similar phenomenon is observed in Figure 3 for the scenarios with true conditional hazard ratios of 3 and 4.

When examining the results for caliper matching, slightly different results were observed. In the top panel of Figure 4 (true hazard ratio of one – denoting a null treatment effect), we observe that the two population survival functions are coincident (due to the null treatment effect). However, the five estimated survival functions (either for treatment or control), while consistent with one another, are all different from the population survival functions, which indicates that biased estimation of the true survival function occurred, both for treated subjects and for control subjects. When the true conditional hazard ratio was 2 (lower panel of Figure 4), we observe that both crude estimators (i.e. the survival functions under treatment and control that would typically be estimated when using propensity-score matching) produce biased estimates of the corresponding true population survival function. Furthermore, the adjusted estimates of survival under control in the matched treated subjects (i.e. the adjusted survival functions under control for the matched treated subjects obtained using the methods described above) are essentially indistinguishable from the crude estimate of the survival function under control. Similar findings, with a modest amplification of bias, are displayed in Figure 5.

Estimation of survival functions permits the estimation of absolute differences in the probability of an event occurring within specified durations of follow-up;33 from this quantity one can compute the corresponding number needed to treat to avoid one outcome over the specified duration of follow-up. Within each of the four scenarios, we found the 25th, 50th, and 75th percentiles of event times in the large super-population. Then, using the mean survival curves reported in Figures 2 to 5, we estimated the probability of the occurrence of the outcome within each of these three times for each estimation method. The differences in these probabilities, which denote an absolute risk reduction within a specified duration of follow-up, are reported in Figure 6 (NNM) and Figure 7 (caliper matching). Each figure consists of a series of dot charts. For each combination of the true conditional hazard ratio (1, 2, 3, and 4) and the percentile of follow-up time (25th, 50th, and 75th), there is a horizontal line. On each horizontal line are five dots, representing: (i) the true population risk difference determined in the super-population; (ii) the risk difference estimated using Kaplan–Meier methods in the sample constructed using the given matching method (NNM or caliper matching) (i.e. the risk difference that would typically be estimated when using propensity-score matching); (iii) the risk difference estimated using the given matching method with AFT imputation of the potential outcome under control; (iv) the risk difference estimated using the given matching method with Cox regression-based imputation of the potential outcome under control; and (v) the risk difference estimated using the given matching method with Cox regression used to directly estimate survival functions under control.

Figure 6. Differences in survival (NNM).

Figure 7. Differences in survival (caliper matching).

When using NNM, the risk differences obtained using the crude (unadjusted) Kaplan–Meier survival curves in the matched samples (i.e. the risk differences that would typically be estimated when using propensity-score matching) generally displayed more bias than the adjusted estimates. The three methods for producing adjusted survival functions under control tended to produce very similar estimates of the risk difference, and these estimates tended to be close to the true population value of the risk difference. When using caliper matching, all of the estimation methods (including the crude or unadjusted Kaplan–Meier survival curves in the matched sample, i.e. the survival functions under treatment and control that would typically be estimated when using propensity-score matching) tended to produce comparable estimates of the risk difference, which displayed minimal bias.

4.3 Estimation of hazard ratios

The two adjustment methods based on imputing a potential outcome under control for the matched treated subjects permit estimation of hazard ratios, in addition to estimation of survival functions. Estimated hazard ratios are reported in the top quarter of Table 2, with each row reporting the results for a different scenario. The percent relative bias is reported in the second quarter of the table. Each row reports the true conditional hazard ratio and the true marginal hazard ratio for that scenario. When computing relative bias, the hazard ratios estimated using an unadjusted Cox model in the matched sample and those estimated using the methods based on imputing outcomes under control for the matched treated subjects are compared to the true marginal hazard ratio. In contrast, the hazard ratio obtained when a fully adjusted Cox model was fit in the matched sample is compared to the true conditional hazard ratio.

Table 2.

Estimated hazard ratios using the different methods.

Cond. HR  Marg. HR  NNM     NNM (AFT)  NNM (Cox)  NNM (Cox reg.)  Caliper  Cal. (AFT)  Cal. (Cox)  Cal. (Cox reg.)

Weibull distribution for event times
 Estimated hazard ratio
  1        1.00      0.73    1.00       1.01       1.01            1.00     1.00        1.00        1.01
  2        1.36      1.00    1.37       1.37       2.03            1.37     1.38        1.37        2.04
  3        1.63      1.20    1.64       1.64       3.07            1.65     1.66        1.65        3.08
  4        1.86      1.37    1.88       1.87       4.10            1.88     1.90        1.89        4.13
 Relative bias (%)
  1        1.00      −26.6   0.4        0.6        0.7             −0.3     0.1         0.0         0.5
  2        1.36      −26.4   0.6        0.4        1.7             0.4      1.0         0.7         1.9
  3        1.63      −26.5   0.7        0.4        2.2             0.8      1.5         1.2         2.6
  4        1.86      −26.6   0.8        0.5        2.6             1.1      1.9         1.6         3.2

Gompertz distribution for event times
 Estimated hazard ratio
  1        1.00      0.88    1.02       1.00       1.00            1.00     0.99        1.00        1.00
  2        1.36      1.20    1.33       1.36       2.02            1.37     1.31        1.37        2.03
  3        1.64      1.45    1.57       1.64       3.04            1.65     1.56        1.65        3.07
  4        1.87      1.65    1.76       1.87       4.07            1.89     1.77        1.89        4.12
 Relative bias (%)
  1        1.00      −12.3   1.9        −0.3       0.0             −0.3     −0.7        −0.4        0.0
  2        1.36      −11.9   −2.5       −0.1       0.8             0.4      −3.6        0.4         1.5
  3        1.64      −11.7   −4.4       0.0        1.4             0.9      −4.7        1.0         2.3
  4        1.87      −11.7   −5.6       0.2        1.7             1.2      −5.3        1.3         2.9

Note: “(AFT)” denotes AFT imputation of potential outcomes under control; “(Cox)” denotes Cox regression-based imputation; “(Cox reg.)” denotes a fully adjusted Cox regression model fit in the matched sample. The relative bias for NNM, NNM (AFT), NNM (Cox), Caliper, Cal. (AFT), and Cal. (Cox) compares the estimated hazard ratio with the true underlying marginal hazard ratio. The relative bias for NNM (Cox reg.) and Cal. (Cox reg.) compares the estimated hazard ratio with the true underlying conditional hazard ratio.

From Table 2, the use of a univariate Cox model in the matched sample resulted in biased estimation of the true underlying marginal hazard ratio when using NNM. However, imputing potential outcomes under control for the matched treated subjects in conjunction with NNM resulted in estimates of the true marginal hazard ratio with minimal bias (relative bias <1%). The use of caliper matching (either with the observed outcome under control or the adjusted potential outcome under control) resulted in estimates of the true marginal hazard ratio with minimal bias (relative bias <2%).

An important observation from the top half of Table 2 is that the use of a fully-adjusted Cox regression model in the matched sample (using either NNM or caliper matching) resulted in estimates of the conditional hazard ratio with minimal bias. Thus, although matching alters the target estimand, the use of a fully adjusted model in the matched sample results in a nearly unbiased estimate of a conditional effect.

4.4 Sensitivity analysis (Gompertz distribution of event times)

In the sensitivity analysis, we simulated event times using a conditional Gompertz distribution, but used a Weibull AFT model to impute control potential outcomes. Estimated survival curves are reported in Figure 8 (NNM) and Figure 9 (caliper matching) (legends for these figures are omitted due to space constraints – refer to the legends for Figures 2 through 5 for description of the curves). Estimated differences in survival are reported in Figure 10. Estimated hazard ratios and relative bias are reported in the bottom half of Table 2.

Figure 8. Survival curves for NNM (Gompertz distribution).

Figure 9. Survival curves for caliper matching (Gompertz distribution).

Figure 10. Differences in survival probabilities (Gompertz distribution).

When using NNM, the use of Cox regression to impute control potential outcomes for the matched treated subjects, or to directly estimate survival functions, produced estimated survival functions under control that coincided with the treated population’s survival function under control. However, use of a Weibull AFT model to impute control potential outcomes for the matched treated subjects resulted in survival functions under control that differed from those in the population. In some instances, the survival curve obtained by using a Weibull AFT model to impute control outcomes was no better than the crude survival curve estimated using the observed outcomes under control in the matched sample. Similarly, when using caliper matching, the survival function obtained when using a Weibull AFT model to impute control potential outcomes differed from the other two adjusted survival functions under control. Assuming an incorrect distribution for the event times resulted in biased estimation of risk differences (Figure 10), with bias often greater than when the observed outcomes for the control subjects were used.

When examining estimation of hazard ratios (bottom half of Table 2), we see that using a Weibull AFT model to impute potential outcomes resulted in estimates of the underlying marginal hazard ratio with minor bias (relative bias ranging from -5.6% to 1.9%) when used with NNM. However, the relative bias was larger than when the AFT imputation model was correctly specified. Using the Cox model to impute outcomes resulted in estimates of the marginal hazard ratio that were essentially unbiased.

Note that both the Weibull and the Gompertz distributions satisfy the proportional hazards assumption; indeed, apart from the exponential distribution (a special case of the Weibull), these are the only commonly used parametric survival distributions that do so.30 Thus, the good performance of the Cox model for imputing potential outcomes is likely due to the proportional hazards assumption being satisfied in both settings.

5 Case study

We provide an empirical example to illustrate the application of the proposed methods.

5.1 Data and analyses

We used data from a previously published study comparing the balancing properties of different propensity score methods.34 The sample consisted of 7059 patients discharged from hospital with a diagnosis of heart failure in Ontario, Canada between 1 April 1999 and 31 March 2001. Data on patient characteristics were obtained by retrospective chart review by trained cardiovascular research nurses. These data were collected as part of the Enhanced Feedback for Effective Cardiac Treatment Study, a study designed to improve the quality of care provided to patients with cardiovascular disease in Ontario.35

The exposure of interest was whether the patient received a prescription for a beta-blocker at hospital discharge. Of the 7059 patients, 1880 (26.6%) received a beta-blocker prescription at hospital discharge. Patients were followed for up to 15 years post-discharge. Patients who survived 15 years post-discharge had their survival times treated as censored: 132 subjects (1.9%) were censored after 15 years of follow-up, with the remaining 98.1% of subjects dying within 15 years of hospital discharge.

Receipt of a beta-blocker prescription was regressed on the set of 28 baseline covariates using a logistic regression model. The baseline covariates included demographic characteristics (age and sex), vital signs on admission (systolic blood pressure, respiratory rate, and heart rate), initial laboratory values (white blood count, hemoglobin, sodium, glucose, potassium, urea, and creatinine), comorbid conditions (diabetes, stroke or transient ischemic attack, previous AMI, atrial fibrillation, peripheral arterial disease, chronic obstructive pulmonary disease, dementia, cirrhosis, and cancer), presence of left bundle branch block on first electrocardiogram within 24 hours of admission, presenting signs and symptoms (neck vein distension, S3, S4, rales >50% of lung field), and findings on chest X-ray (pulmonary edema, cardiomegaly).

Statistical methods identical to those described above were used to estimate both crude and adjusted survival curves in propensity-score matched samples. When using caliper matching, 99.3% of the treated subjects were matched to a control subject. A Weibull AFT model was used for imputing outcomes under control for the matched treated subjects. Both crude and adjusted hazard ratios for beta-blocker use were estimated in the matched samples. Imputed potential event times under control that exceeded 15 years were censored at 15 years.

5.2 Results

Estimated survival curves are shown in Figure 11, which contains two panels: the left panel shows estimated survival curves when using NNM, whereas the right panel displays survival curves when using caliper matching. Each panel displays five survival curves. The first is the empirical survival curve fit in the matched treated subjects (the survival function under treatment that would typically be estimated when using propensity-score matching); the second is the empirical survival curve fit in the matched control subjects (the survival function under control that would typically be estimated when using propensity-score matching); the last three are survival curves adjusted for the covariates described above using the methods previously described. The adjusted survival curve obtained from the method that averages over the estimated survival curves for each subject was very similar to the empirical survival function fit in the matched control subjects. The two methods based on imputing potential outcomes under control resulted in survival functions that were similar to one another for the first 2000 days of follow-up; after this time, the two survival functions diverged, with the curve obtained using AFT imputation having a heavier tail. When using caliper matching, differences between the survival curves under control were smaller than those observed when using NNM. This attenuation of differences would be expected based on the results of the Monte Carlo simulations described in Section 4.

Figure 11. Survival in beta-blocker and control subjects in patients with CHF.

The estimated hazard ratio for beta-blocker use in the sample obtained using NNM was 0.84 (95% confidence interval: 0.79 – 0.90). The hazard ratios obtained using AFT imputation of the potential outcomes under control and Cox imputation of the potential outcomes under control were 0.83 (95% CI: 0.78 – 0.88) and 0.71 (95% CI: 0.67 – 0.75), respectively. The estimated hazard ratio for beta-blocker use in the sample obtained using caliper matching was 0.84 (95% confidence interval: 0.79 – 0.90). The hazard ratios obtained using AFT imputation of the potential outcomes under control and Cox imputation of the potential outcomes under control were 0.83 (95% CI: 0.78 – 0.88) and 0.72 (95% CI: 0.68 – 0.77), respectively.

6 Discussion

We proposed two methods to estimate survival curves under control in samples obtained using propensity score matching. The first method is based on estimating the potential outcomes under control after adjustment for baseline covariates, instead of using the observed outcomes for the matched control subjects. Either an AFT parametric survival model or a Cox proportional hazards model can be used to impute the control potential outcomes. The second method is based on fitting a Cox proportional hazards model and estimating a survival function under control for each matched treated subject. Both methods permit estimation of adjusted survival functions under control in the matched treated subjects. However, the first method also permits estimation of an adjusted marginal hazard ratio in the matched sample.

When the AFT model correctly specified the distribution of event times, the three approaches to estimating adjusted survival functions under control for the matched treated subjects had similar performance, producing estimated survival functions that were essentially indistinguishable from one another. When using NNM, the crude survival function estimated using the observed outcomes for the matched control subjects was a biased estimate of the population survival function under control, whereas the adjusted survival functions under control were essentially unbiased. When using caliper matching, the adjusted survival functions under control were similar to the crude or unadjusted survival functions estimated using the observed outcomes for the matched control subjects; however, all of these survival functions were biased estimates of the true population control survival function. Based on our simulations, we suggest that imputation of the potential outcomes under control be done using a Cox model rather than an AFT model, as the performance of the latter deteriorated when the distribution of event times was mis-specified. Being a semi-parametric method, the Cox model does not make as many assumptions about the distribution of event times. However, both of the survival distributions that we considered (Weibull and Gompertz) satisfy the proportional hazards assumption; in settings in which the proportional hazards assumption is violated, the use of a Cox model to impute potential outcomes may result in degraded performance.

We also examined the performance of an alternative method that involved fitting a fully adjusted Cox proportional hazards model in the matched sample where the hazard of the outcome was regressed on an indicator variable denoting treatment status in addition to a set of baseline covariates. Whereas the other methods produced estimates of the true underlying marginal hazard ratio, this method produced an estimate of the underlying conditional hazard ratio. Investigators should be aware that these approaches are not interchangeable as they have different implied target estimands. As noted by Gail et al., linear treatment effects (e.g. differences in means and risk differences) are collapsible, meaning that the conditional and marginal effects coincide. However, common epidemiological measures of effect, such as the odds ratio and the hazard ratio, are non-collapsible, meaning that the conditional and marginal effects do not coincide (except under a null effect).36

There are certain limitations to the current study. The primary limitation is its dependence on Monte Carlo simulations. We used simulations because analytic derivations would be difficult in the settings that we examined, and it is possible that different results would be obtained under different data-generating processes. As noted above, given the difficulty of summarizing the difference between two survival curves using a single numerical quantity, we were limited in the number of scenarios for which we could concisely report results. The current study illustrates that the method has merit, at least in certain situations, and that it can result in superior performance compared to conventional methods. A second limitation is that we only simulated settings in which the imputation regression model is correctly specified; we anticipate that the performance of the proposed method will deteriorate when the regression model is mis-specified. Due to space constraints we could not examine this issue here, but it merits examination in subsequent research. There is precedent for examining the performance of a proposed method under ideal circumstances and subsequently examining its performance under model mis-specification.37,38

Rosenbaum and Rubin used the term “bias due to incomplete matching” to refer to the bias that can arise when estimating the ATT when not all treated subjects are included in the matched sample.18 The choice between NNM and caliper matching thus represents a trade-off between two sources of bias. NNM includes all treated subjects in the matched sample and so avoids bias due to incomplete matching; however, this comes at the cost of possibly increased residual confounding, because no constraint is imposed on the quality of the matches. In contrast, caliper matching can exclude some treated subjects, and thus may be susceptible to bias due to incomplete matching; however, it is expected to reduce residual confounding due to differences in the distribution of baseline covariates, because it imposes constraints on the quality of acceptable matches. Bias due to incomplete matching can be thought of as a structural form of bias that may not be circumvented by subsequent covariate adjustment (because a non-representative sample is being used to estimate an effect in a larger population), whereas bias due to imperfect matching may be thought of as bias due to residual confounding that can be further minimized by subsequent covariate adjustment. Bias due to incomplete matching explains the biased estimation observed when using caliper matching: both the crude and the adjusted survival functions were biased estimates of the true population survival functions. However, the crude survival function under control did not differ meaningfully from the adjusted survival functions under control, indicating that bias due to residual confounding was minimal. Our motivation for developing a method to estimate adjusted survival curves under control in the sample obtained using NNM was to reduce bias due to residual confounding while avoiding bias due to incomplete matching, and the simulations indicate that the proposed methods achieved these objectives. Our data-generating process was designed to simulate data with substantial differences between treated and control subjects, a scenario in which both confounding bias and the potential for incomplete matching would be expected to be present.

In the context of RCTs, regression adjustment is frequently used when estimating differences in means, odds ratios, or hazard ratios.6 Such adjustment permits a reduction in bias due to residual imbalance in prognostically important baseline covariates that may exist despite random treatment allocation. This practice is not limited to RCTs: several authors have described methods to combine regression adjustment with matching-based methods for reducing the effects of confounding in observational data. Rubin examined regression adjustment in matched samples when estimating linear treatment effects for continuous outcomes (i.e. differences in means) and found that it reduced bias to a greater extent than regression alone or matching alone.25 Similarly, he found that nearest neighbor Mahalanobis metric matching combined with regression adjustment on matched-pair differences was an effective method to control for confounding when outcomes were continuous.24 Rubin and Thomas further examined combining matching on the propensity score with adjustment for a limited set of prognostically important covariates when estimating linear treatment effects.39 These studies restricted their focus to settings with continuous outcomes and the estimation of linear treatment effects. Although these studies examined the fitting of a linear regression model within the matched sample, other studies examined the use of regression models to impute potential outcomes. In a setting with continuous outcomes, it has been suggested that linear regression be combined with matching on the propensity score to achieve a greater reduction in bias than is possible with either method alone.20,24,25,40 Imbens suggested that a regression model be fit to the control subjects (either all control subjects or the matched control subjects) and that the resultant model be used to impute the missing potential outcomes under control for the treated subjects.20 This suggestion motivated our proposed use of AFT survival models to estimate the potential time-to-event outcome under control for the matched treated subjects. Similarly, Gutman and Rubin proposed a method for estimating the effect of treatment on binary outcomes that combined multiple imputation with the use of two regression splines to impute potential outcomes.41 A recently described method, double propensity-score adjustment, combines matching on the propensity score with covariate adjustment using the propensity score to impute potential outcomes under control;42 it is applicable in settings with continuous or binary outcomes. Although several studies have examined methods to combine matching with regression adjustment when estimating linear treatment effects with continuous outcomes, to the best of our knowledge, comparable methods for use with survival or time-to-event outcomes have only been described in an unpublished PhD thesis,27 and we are unaware of subsequent applications of the method. We anticipate that the method will increase in popularity once its description and evaluation have been published in the peer-reviewed literature.

When imputing missing control potential outcomes for the matched treated subjects, we estimated the imputation model using the matched control subjects. There are three other alternatives that could have been used: (i) fitting the imputation model in all control subjects (both matched and unmatched control subjects); (ii) fitting the imputation model in matched treated and control subjects; (iii) fitting the imputation model in all subjects (both treated and control subjects and both matched and unmatched subjects). The first alternative could have used control subjects who were substantially different from the matched treated subjects. Doing so could have involved a certain degree of extrapolation by allowing these subjects to contribute to the model that would then be applied to the matched treated subjects. The second alternative would have required that the imputation model include an indicator variable denoting treatment status, as the sample consisted of both treated and control subjects. Furthermore, it would require the assumption that the relationship between the covariates and the outcome was similar between treated and control subjects (or else it would require the inclusion of a large number of interaction terms). The third alternative would have the possible limitations of both the first and second alternatives. The relative performance of these different approaches merits examination in subsequent research.

The focus of the current study has been on estimating survival functions and hazard ratios. We have shown that, in combination with NNM, imputing missing control potential outcomes for the matched treated subjects allows these quantities to be estimated with minimal bias. Subsequent research is necessary to examine sampling variance estimation when this procedure is used. We hypothesize that imputing multiple control potential outcomes for each matched treated subject will permit estimation of hazard ratios with narrower confidence intervals than imputing a single control potential outcome for each matched treated subject. Rubin’s rules can be used to obtain a pooled estimate of the standard error of the pooled hazard ratio.28 However, it is not clear what the optimal number of potential outcomes under control to impute for each matched treated subject is; this also merits examination in subsequent research.
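
For reference, Rubin’s rules applied to $M$ log-hazard ratio estimates $\hat\beta_m$ with estimated standard errors $\widehat{se}_m$ take the standard form

$$\bar{\beta} = \frac{1}{M}\sum_{m=1}^{M} \hat\beta_m, \qquad \bar{W} = \frac{1}{M}\sum_{m=1}^{M} \widehat{se}_m^{\,2}, \qquad B = \frac{1}{M-1}\sum_{m=1}^{M} \left( \hat\beta_m - \bar{\beta} \right)^2,$$

with total variance $T = \bar{W} + (1 + 1/M)B$; the pooled hazard ratio is $\exp(\bar{\beta})$, with standard error $\sqrt{T}$ on the log scale.28 Appendix 1 describes the corresponding multiple-imputation procedure in detail.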

We have highlighted methods that combine matching with regression adjustment. An alternative approach to combining a design-based method for bias reduction with regression adjustment is given by the so-called “doubly robust” methods, which combine regression adjustment with inverse probability of treatment weighting using the propensity score.43,44 These methods require the specification of two regression models: one for the treatment-selection process and one for the outcome. They are said to be doubly robust because, as long as at least one of the two regression models is correctly specified, the estimate of the average treatment effect is consistent (although if both models are incorrect, the performance of the method can be very poor44). Lunceford and Davidian describe doubly robust estimators for linear treatment effects with continuous outcomes.43 An alternative approach is described by Joffe et al., who used a series of weighted logistic regression models to estimate treatment effects when outcomes are binary.45 The method that we proposed in the current paper shares some similarities with these methods and with those described above for combining matching and regression adjustment: both sets of methods combine a design-based method of bias reduction (matching or weighting) with regression adjustment, with the objective of obtaining a greater degree of bias reduction or a more robust estimate of the treatment effect.
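
As a point of comparison with the matching-plus-regression approach, the following is a minimal sketch of a doubly robust (augmented IPTW) estimator for a continuous outcome, in the spirit of Lunceford and Davidian;43 the variable names and the use of scikit-learn models are our illustrative assumptions:

import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

def doubly_robust_ate(X, treat, y):
    """Augmented IPTW estimate of the average treatment effect for a
    continuous outcome. The estimate is consistent if either the propensity
    model or the pair of outcome models is correctly specified."""
    X, treat, y = np.asarray(X), np.asarray(treat), np.asarray(y)
    # Treatment-selection model: estimated propensity scores e(x).
    ps = LogisticRegression(max_iter=1000).fit(X, treat).predict_proba(X)[:, 1]
    # Outcome models fit separately in treated and control subjects.
    m1 = LinearRegression().fit(X[treat == 1], y[treat == 1]).predict(X)
    m0 = LinearRegression().fit(X[treat == 0], y[treat == 0]).predict(X)
    # Augmented IPTW estimating equations, averaged over the sample.
    mu1 = np.mean(treat * y / ps - (treat - ps) / ps * m1)
    mu0 = np.mean((1 - treat) * y / (1 - ps) + (treat - ps) / (1 - ps) * m0)
    return mu1 - mu0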

The use of covariate-adjusted analyses in propensity-score matched samples is motivated by the use of regression adjustment in analyses of RCTs. In our simulations, we assumed that the sample size was adequate to allow adjustment for all of the measured baseline covariates. In some settings, such as those with small sample sizes or rare outcomes, it may not be possible to estimate a regression model that incorporates all of the baseline covariates. In the context of RCTs, it has been suggested that the variables included in the regression model used for adjustment be identified during the design stage of the study and include those that are strongly prognostic for the outcome.46 Based on this recommendation, in observational studies using propensity-score matching with either small sample sizes or rare outcomes, analysts should include a limited number of covariates in the regression model, prioritizing those variables that are strongly prognostic of the outcome. This issue merits examination in a separate study but is beyond the scope of the current one.

In summary, we have described methods for combining regression adjustment using parametric or semi-parametric survival models with propensity-score matching to estimate survival curves and hazard ratios. When used with NNM, these methods can estimate population survival functions with minimal bias. In particular, when used with NNM, these methods permit accurate estimation of population survival functions while avoiding the bias due to incomplete matching that can occur when using caliper matching.

Appendix 1. Generalization of the proposed method to multiple imputation of potential outcomes under control

The proposed method imputed a single potential outcome under control for each matched treated subject. This reflects what is done with conventional propensity-score matching, in which the observed outcome for the matched control subject serves as the imputed potential outcome under control for the matched treated subject. However, using our imputation model it is possible to impute multiple potential outcomes under control for each matched treated subject. Assume that M outcomes under control are imputed for each matched treated subject. It is important that, in the subsequent analyses, these M imputed outcomes under control not be averaged within each matched treated subject. Instead, the subsequent analyses need to account for the between-imputation variation in the imputed potential outcomes under control. To do so, M matched sets are constructed. In the mth matched set, the mth of the M imputed potential outcomes under control is used for each matched treated subject. A conventional analysis is then conducted in that matched set, and a measure of effect (e.g. a log-hazard ratio), along with its estimated standard error, is obtained. Repeating this for each of the M imputed potential outcomes under control yields M estimates of effect and their estimated standard errors, which are combined using Rubin’s Rules to obtain an overall measure of effect.28 The application of Rubin’s Rules is a “post-processing” step that combines the M point estimates and their standard errors into a pooled estimate whose standard error reflects both the sampling uncertainty of an individual estimate and the between-imputation variability in the estimate.
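
A minimal sketch of the pooling step in Python, assuming the M matched-set analyses have already been run and have produced log-hazard ratios and their estimated standard errors (the function and argument names are illustrative):

import numpy as np
from scipy import stats

def pool_log_hazard_ratios(log_hrs, ses):
    """Combine M matched-set analyses (log-hazard ratios and their standard
    errors) using Rubin's Rules; returns the pooled hazard ratio and its
    95% confidence limits. Assumes M > 1 and nonzero between-imputation
    variance."""
    log_hrs, ses = np.asarray(log_hrs), np.asarray(ses)
    M = len(log_hrs)
    q_bar = log_hrs.mean()              # pooled point estimate (log scale)
    w_bar = (ses ** 2).mean()           # within-imputation variance
    b = log_hrs.var(ddof=1)             # between-imputation variance
    t_var = w_bar + (1 + 1 / M) * b     # total variance
    # Rubin's degrees-of-freedom adjustment for the pooled estimate.
    df = (M - 1) * (1 + w_bar / ((1 + 1 / M) * b)) ** 2
    half_width = stats.t.ppf(0.975, df) * np.sqrt(t_var)
    return (np.exp(q_bar),
            np.exp(q_bar - half_width),
            np.exp(q_bar + half_width))

For example, pool_log_hazard_ratios([-0.17, -0.20, -0.15], [0.05, 0.05, 0.05]) returns a pooled hazard ratio of exp(-0.173), approximately 0.84, together with confidence limits whose width reflects both the within- and the between-imputation variance.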

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This study was supported by ICES, which is funded by an annual grant from the Ontario Ministry of Health and Long-Term Care (MOHLTC). The opinions, results and conclusions reported in this paper are those of the authors and are independent from the funding sources. No endorsement by ICES or the Ontario MOHLTC is intended or should be inferred. This research was supported by an operating grant from the Canadian Institutes of Health Research (CIHR) (MOP 86508). PCA was supported by a Mid-Career Investigator award from the Heart and Stroke Foundation. The Enhanced Feedback for Effective Cardiac Treatment (EFFECT) data used in the study was funded by a CIHR Team Grant in Cardiovascular Outcomes Research (grant nos CTP 79847 and CRT43823). These datasets were linked using unique, encoded identifiers and analyzed at ICES. DBR was supported in part by grants from the U.S. NIH, U.S. NSF and the U.S. ONR.

References

1. Rosenbaum PR, Rubin DB. The central role of the propensity score in observational studies for causal effects. Biometrika 1983; 70: 41–55.
2. Austin PC. A tutorial and case study in propensity score analysis: an application to estimating the effect of in-hospital smoking cessation counseling on mortality. Multivariate Behav Res 2011; 46: 119–151.
3. Austin PC. An introduction to propensity-score methods for reducing the effects of confounding in observational studies. Multivariate Behav Res 2011; 46: 399–424.
4. Weitzen S, Lapane KL, Toledano AY, et al. Principles for modeling propensity scores in medical research: a systematic literature review. Pharmacoepidemiol Drug Saf 2004; 13: 841–853.
5. Austin PC. A critical appraisal of propensity-score matching in the medical literature between 1996 and 2003. Stat Med 2008; 27: 2037–2049.
6. Austin PC, Manca A, Zwarenstein M, et al. A substantial and confusing variation exists in handling of baseline covariates in randomized controlled trials: a review of trials published in leading medical journals. J Clin Epidemiol 2010; 63: 142–153.
7. Gayat E, Resche-Rigon M, Mary JY, et al. Propensity score applied to survival data analysis through proportional hazards models: a Monte Carlo study. Pharmaceut Stat 2012; 11: 222–229.
8. Austin PC. The performance of different propensity score methods for estimating marginal hazard ratios. Stat Med 2013; 32: 2837–2849.
9. Dorn HF. Philosophy of inference from retrospective studies. Am J Public Health 1953; 43: 692–699.
10. Rubin DB. Matched sampling for causal effects. New York, NY: Cambridge University Press, 2006.
11. Rubin DB. Estimating causal effects of treatments in randomized and nonrandomized studies. J Educ Psychol 1974; 66: 688–701.
12. Lee BK, Lessler J, Stuart EA. Improving propensity score weighting using machine learning. Stat Med 2010; 29: 337–346.
13. Setoguchi S, Schneeweiss S, Brookhart MA, et al. Evaluating uses of data mining techniques in propensity score estimation: a simulation study. Pharmacoepidemiol Drug Saf 2008; 17: 546–555.
14. McCaffrey DF, Ridgeway G, Morral AR. Propensity score estimation with boosted regression for evaluating causal effects in observational studies. Psychol Meth 2004; 9: 403–425.
15. Austin PC, Grootendorst P, Anderson GM. A comparison of the ability of different propensity score models to balance measured variables between treated and untreated subjects: a Monte Carlo study. Stat Med 2007; 26: 734–753.
16. Austin PC. A comparison of 12 algorithms for matching on the propensity score. Stat Med 2014; 33: 1057–1069.
17. Austin PC. Optimal caliper widths for propensity-score matching when estimating differences in means and differences in proportions in observational studies. Pharmaceut Stat 2011; 10: 150–161.
18. Rosenbaum PR, Rubin DB. Constructing a control group using multivariate matched sampling methods that incorporate the propensity score. Am Stat 1985; 39: 33–38.
19. Austin PC. The use of propensity score methods with survival or time-to-event outcomes: reporting measures of effect similar to those used in randomized experiments. Stat Med 2014; 33: 1242–1258.
20. Imbens GW. Nonparametric estimation of average treatment effects under exogeneity: a review. Rev Econ Stat 2004; 86: 4–29.
21. Little RJ, Wang J, Sun X, et al. The treatment of missing data in a large cardiovascular clinical outcomes study. Clin Trials 2016; 13: 344–351.
22. Zhao Y, Herring AH, Zhou H, et al. A multiple imputation method for sensitivity analyses of time-to-event data with possibly informative censoring. J Biopharm Stat 2014; 24: 229–253.
23. Rubin DB. Bayesian inference for causal effects: the role of randomization. Ann Stat 1978; 6: 34–58.
24. Rubin DB. Using multivariate matched sampling and regression adjustment to control bias in observational studies. J Am Stat Assoc 1979; 74: 318–328.
25. Rubin DB. The use of matched sampling and regression adjustment to remove bias in observational studies. Biometrics 1973; 29: 185–203.
26. Rubin DB. Bayesian inference for causality: the importance of randomization. In: Proceedings of the Social Statistics Section of the American Statistical Association, 1975, pp. 233–239.
27. Roseman LD. Reducing bias in the estimate of the difference in survival in observational studies using subclassification on the propensity score. PhD thesis, Department of Statistics, Harvard University, Cambridge, MA, 1998.
28. Rubin DB. Multiple imputation for nonresponse in surveys. New York, NY: John Wiley & Sons, 1987.
29. Kaplan EL, Meier P. Nonparametric estimation from incomplete observations. J Am Stat Assoc 1958; 53: 457–481.
30. Bender R, Augustin T, Blettner M. Generating survival times to simulate Cox proportional hazards models. Stat Med 2005; 24: 1713–1723.
31. Austin PC, Schuster T. The performance of different propensity-score methods for estimating absolute effects of treatments on survival outcomes: a simulation study. Stat Meth Med Res 2016; 25: 2214–2237. DOI: 10.1177/0962280213519716.
32. Austin PC. Using the standardized difference to compare the prevalence of a binary variable between two groups in observational research. Commun Stat Simulat Computat 2009; 38: 1228–1234.
33. Altman DG, Andersen PK. Calculating the number needed to treat for trials where the outcome is time to an event. BMJ 1999; 319: 1492–1495.
34. Austin PC. The relative ability of different propensity score methods to balance measured covariates between treated and untreated subjects in observational studies. Med Decis Mak 2009; 29: 661–677.
35. Tu JV, Donovan LR, Lee DS, et al. Effectiveness of public report cards for improving the quality of cardiac care: the EFFECT study: a randomized trial. J Am Med Assoc 2009; 302: 2330–2337.
36. Gail MH, Wieand S, Piantadosi S. Biased estimates of treatment effect in randomized experiments with nonlinear regressions and omitted covariates. Biometrika 1984; 71: 431–444.
37. Austin PC, Stuart EA. Optimal full matching for survival outcomes: a method that merits more widespread use. Stat Med 2015; 34: 3949–3967.
38. Austin PC, Stuart EA. The performance of inverse probability of treatment weighting and full matching on the propensity score in the presence of model misspecification when estimating the effect of treatment on survival outcomes. Stat Meth Med Res 2017; 26: 1654–1670.
39. Rubin DB, Thomas N. Combining propensity score matching with additional adjustments for prognostic covariates. J Am Stat Assoc 2000; 95: 573–585.
40. Rubin DB. The use of matched sampling and regression adjustment in observational studies. PhD thesis, Department of Statistics, Harvard University, Cambridge, MA, 1970.
41. Gutman R, Rubin DB. Robust estimation of causal effects of binary treatments in unconfounded studies with dichotomous outcomes. Stat Med 2013; 32: 1795–1814.
42. Austin PC. Double propensity-score adjustment: a solution to design bias or bias due to incomplete matching. Stat Meth Med Res 2017; 26: 201–222.
43. Lunceford JK, Davidian M. Stratification and weighting via the propensity score in estimation of causal treatment effects: a comparative study. Stat Med 2004; 23: 2937–2960.
44. Kang J, Schafer J. Demystifying double robustness: a comparison of alternative strategies for estimating a population mean from incomplete data. Stat Sci 2007; 22: 523–580.
45. Joffe MM, Ten Have TR, Feldman HI, et al. Model selection, confounder control, and marginal structural models: review and new applications. Am Stat 2004; 58: 272–279.
46. Senn S. Testing for baseline balance in clinical trials. Stat Med 1994; 13: 1715–1726.
