Published in final edited form as: Stat Methods Med Res. 2019 Oct 1;29(3):695–708. doi: 10.1177/0962280219877908

Propensity Score Matching for Treatment Delay Effects with Observational Survival Data

Erinn M Hade 1,2,3, Giovanni Nattino 1, Heather A Frey 3, Bo Lu 1

Abstract

In observational studies with a survival outcome, treatment initiation may be time-dependent and is likely to be affected by both time-invariant and time-varying covariates. In situations where the treatment is necessary for the study population, all or most subjects may be exposed to the treatment sooner or later. In this scenario, the causal effect of interest is the effect of the delay in treatment reception. A simple comparison of those receiving treatment early versus those receiving treatment late may not be appropriate, as the timing of treatment reception is not randomized. Extending Lu's matching design with time-varying covariates (Lu, 2005), we propose a propensity score matching strategy to estimate the treatment delay effect. The goal is to balance the covariate distribution between on-time treatment and delayed treatment groups at each time point using risk set matching. Our simulation study shows that, in the presence of treatment delay effects, the matching-based analyses clearly outperform the conventional regression analysis using a naive Cox proportional hazards model. We apply this method to study the treatment delay effect of 17 alpha-hydroxyprogesterone caproate (17P) in women at risk of recurrent preterm birth.

Keywords: Treatment Delay Effect, Risk Set Matching, Time-varying Covariate, Covariate Balance, Cox Proportional Hazards Model

1. Introduction

In clinical or health related studies with a longitudinal cohort of patients, it is common for patients to receive treatment at different time points. In randomized controlled studies, where the administration of treatment follows a strict protocol, this may not be an issue even if the timing of treatment is slightly off. In observational studies, multiple factors may contribute to why the treatment is not administered at the desired time (referred to as treatment delay). Oftentimes, self-selection plays an important role in determining if and when treatment is received. Patients who feel as though they have more severe disease, those who have a worse history of disease, or those who are more informed about their condition may receive treatment sooner than those who do not have these perceptions or this education. These patients may seek medical care earlier and be more aggressive about timely treatment. Patients who are delayed in receiving treatment are often substantially different from those who receive timely treatment. Therefore, a simple comparison between patients who receive treatment early and those who receive treatment late may lead to biased estimation of treatment effects. The group with delayed treatment may consist of healthier subjects who can afford to wait longer for the treatment. It is more reasonable to compare subjects with similar covariate trajectories, rather than through a simple dichotomous classification of treatment groups (Li et al., 2001).

Weekly injections of 17 alpha-hydroxyprogesterone caproate (17P) at 250mg, initiated between 16 and 20 weeks gestation, have been found to reduce the risk of preterm birth and birth complications in a randomized trial of women with a history of preterm birth (birth before 37 weeks gestation) (Meis et al., 2003). In fact, this trial was stopped early due to the treatment benefit in those receiving 17P compared to placebo injections. Since 2004, 17P has been widely available from compounding pharmacies, as FDA manufacturing approval was withdrawn in 2000. In early 2011, the FDA granted orphan drug approval for the standardized manufacturing of Makena, 17P at 250mg weekly intramuscular injection, without formal trials demonstrating benefit of this formulation (Patel and Rumore, 2012). Although previous trials have shown positive results for clinical use of 17P (both manufactured and compounded), the mechanisms by which it works have not been fully determined and it remains unclear why prophylactic treatment is not more widely effective in high risk women. It has been suggested that the inconsistent findings for 17P in trials and observational studies could be due to variability in the timing of 17P initiation during pregnancy (How et al., 2007; Ning et al., 2017). While guidelines based on randomized trials recommend initiation between 16 and 24 weeks, clinical practice often extends this window to earlier and later initiation (Iams et al., 2012).

Inconsistent effectiveness of 17P by timing of initiation has been reported previously; however, no investigation has fully examined treatment delay beyond a simple 'early' versus 'late' two-group comparison. Recent work by Ning et al. (2017) reports improved outcomes with earlier initiation, before 17 weeks gestation, whereas other published results suggest no detrimental effect of late 17P initiation (How et al., 2007). We investigate the effect of delayed 17P initiation on the time to delivery in a cohort of women who received prenatal care at a single academic medical center between 2011 and 2016. All women received 17P for prevention of recurrent preterm delivery in a singleton gestation and had cervical length measurements prior to initiation of 17P.

There are several ways of adjusting for confounding bias in observational data, including matching, weighting, and structural modeling. We focus on matching methods because the treatment groups in the treatment delay problem are not pre-determined. We want to match the patients who look most similar to one another, in order to remove confounding, rather than simply label those who receive treatment late as 'unexposed' controls. Li et al. (2001) proposed risk set matching to pair patients together based on both time-invariant and time-varying covariates at each time point. A patient treated at time point $t_k$ can enter the matching in one of two ways: she can enter as a treated patient at time $t_k$, or as a not-yet-treated control for a patient treated earlier than $t_k$, depending on whom she resembles the most. Such matching methods are less model dependent than structural modeling approaches, are easily interpretable, and are less subjective, as they do not involve knowledge about the outcomes in the matching process.

Lu (2005) discussed matching designs with time-varying covariates for treatment delay. With a longitudinal cohort of interstitial cystitis (IC) patients, he proposed matching strategies to estimate a short-term effect of a surgical procedure on continuous/ordered outcomes. Patients received the intervention at different time points, at least partially related to their symptom history, so the time-varying covariates had an impact on the timing of treatment initiation. The treatment effect measure considered was continuous and fixed in time (e.g. pain score three months after the treatment). In this paper, we consider a similar time-varying treatment initiation process, but we extend the effect estimation to survival outcomes.

The paper is organized as follows. Section 2 describes the matching design with time varying covariates. Section 3 discusses the assumptions and various methods for evaluating survival outcomes in observational studies. Section 4 presents a simulation study that compares matching methods with naive regression modeling approaches. Section 5 analyzes the real data example of premature delivery prevention. Section 6 summarizes the paper and discusses potential limitations.

2. Matching Design for Treatment Delay Effect

2.1. Treatment Delay Effect

We extend the conventional potential outcomes framework (Imbens and Rubin, 2015) to define the treatment delay effect. Assume there are $T$ time points in the study, indexed by $t = 1, \dots, T$. Let $A_{it}$ denote the treatment reception status at time $t$ for subject $i$, equal to 1 if treated and 0 if not treated. For the question of interest, $A_i = \{A_{it}\}$ is a monotonically non-decreasing vector of 0's and 1's. In the preterm birth cohort, when 17P is initiated in a pregnant woman at time $t$, her treatment indicator changes to 1 and remains 1 thereafter. Let $A_i^{trt}$ denote the first time that subject $i$ receives the treatment:

$$A_i^{trt} = \begin{cases} \min\{t \mid A_{it} = 1,\ t = 1, \dots, T\}, & \text{if } A_{iT} = 1 \\ T + 1, & \text{otherwise.} \end{cases}$$

Then, the treatment status vector $A_i$ contains $(A_i^{trt} - 1)$ zeros, followed by $(T - A_i^{trt} + 1)$ ones. At each time point $t$, a subject has a pair of potential outcomes $(Y_{it}^1, Y_{it}^0)$. Depending on when the subject receives the treatment ($A_i^{trt}$), everyone has a vector of observed outcomes $Y_i(A_i^{trt})$. For example, if $T = 6$ and $A_i^{trt} = 3$, the treatment status vector is $A_i = \{0,0,1,1,1,1\}$ and $Y_i(3) = \{Y_{i1}^0, Y_{i2}^0, Y_{i3}^1, Y_{i4}^1, Y_{i5}^1, Y_{i6}^1\}$. For every patient, the potential outcomes form a double-indexed sequence (by treatment and time point), i.e. $Y_i = \{(Y_{i1}^0, Y_{i1}^1), (Y_{i2}^0, Y_{i2}^1), (Y_{i3}^0, Y_{i3}^1), (Y_{i4}^0, Y_{i4}^1), (Y_{i5}^0, Y_{i5}^1), (Y_{i6}^0, Y_{i6}^1)\}$. The observed outcome is a single-indexed sequence, like $Y_i(3)$.

Different treatment delay effects can be defined, depending on the delay period and the outcome of interest. For example, with a continuous outcome, the one-time-period treatment delay effect at time $t = k$ ($k < T$) on the outcome at the end of the study for subject $i$ is defined as

$$\Delta_i^1(k) = Y_{iT}(k) - Y_{iT}(k+1),$$

where $Y_{iT}(k)$ is the potential outcome at time $t = T$ under the treatment trajectory with $A_i^{trt} = k$. For time-to-event outcomes, one may replace $Y_{iT}(k)$ with $S_i(k)$, the potential survival time for subject $i$ given that the treatment is applied at time $k$.

2.2. Matching Design

In randomized experiments, one may randomly assign part of the sample to start treatment on time and the rest to delay the treatment for a pre-specified time period. Then, the difference in the outcome between the two groups reveals the effect of treatment delay. Observational studies present a challenge, as participants are treated at different time points without a pre-designated delay indicator. A naive analysis strategy is to label participants treated later in the process as delayed, and people treated earlier as not delayed. But those treated earlier may be different from those treated later in some important covariates due to the observational nature of the data. A more sensible way is to pair people with a similar covariate history, but with different treatment timing. Treatment initiation is considered as a time-to-event process and we can implement it in a risk set matching framework (Li et al., 2001; Lu, 2005).

Risk set matching follows from the concept of constructing the conditional likelihood in case-control survival data (Prentice and Breslow, 1978). Formally, suppose there are $N$ subjects and they receive the treatment at $s$ distinct time points, $0 \le t_1 < t_2 < \cdots < t_s$. Without loss of generality, for those who never receive the treatment, we may denote their treatment time as $t_s = t_{end}^+$, where $t_{end}$ is the end of the study period. Since $t_{end}^+$ is the last time point, those subjects will always be used as controls where possible. We also assume the numbers of subjects treated at each time point are $n_1, n_2, \dots, n_s$, where $n_1 + n_2 + \cdots + n_s = N$. To construct risk set matching, we may pair treated subjects with not-yet-treated subjects at each time point, sequentially in time order. With the understanding that the risk here refers to the probability of receiving treatment, the risk set at time point $t$ includes all subjects who have not received the treatment at $t - \epsilon$ for a tiny $\epsilon > 0$. Starting with $t_1$, the risk set at $t_1$ includes everyone and the size of the risk set is $N$. Then, we may create matches between the $n_1$ subjects who receive treatment at $t_1$ and the $N - n_1$ subjects who are not yet treated, based on a pre-specified matching distance (e.g. a propensity score distance). Using a 1:1 design, we get $n_1^m$ matched pairs, where $n_1^m \le n_1$ in case there are unmatchable treated subjects. Moving to the next treatment time point $t_2$, we exclude all subjects treated at $t_1$ and the corresponding matched controls (i.e., the $n_1^m$ not-yet-treated subjects matched at $t_1$). The risk set at $t_2$ includes $N - n_1 - n_1^m$ subjects, with $n_2$ treated and $N - n_1 - n_1^m - n_2$ not yet treated. Applying the same matching procedure between the two groups, we get $n_2^m$ matched pairs at time $t_2$, where $n_2^m \le n_2$. We continue the process for each subsequent time point whenever there are treated subjects available. Generally, for time point $t_k$, the risk set includes $N - (n_1 + n_2 + \cdots + n_{k-1} + n_1^m + n_2^m + \cdots + n_{k-1}^m)$ subjects, with $n_k$ treated and $N - (n_1 + \cdots + n_{k-1} + n_1^m + \cdots + n_{k-1}^m) - n_k$ not yet treated. We stop when there are no more treated subjects available or no more matches can be made under the pre-specified matching rule. We retain the terminology of risk set, as it is well accepted in survival analysis. To be clear, in our use of risk set matching, we are modeling and matching on the risk set for treatment, not the risk of an adverse event or outcome. Moreover, the risk set matching design resembles a sequential randomization process, where some subjects are randomized at one time point, and some of the remaining subjects are randomized at a later time point.

Figure 1 illustrates the key difference between risk set matching and a naive analysis of early versus late treatment. Four subjects A, B, C, and D receive treatment at different time points. A circle indicates the treatment initiation and a cross indicates the clinical outcome of interest (i.e. the event time). Given that treatment initiation is a time-varying process, subjects receiving treatment early are likely to be different from those doing so late (e.g. more severely ill patients tend to need treatment sooner). In this sense, A and B are more alike and C and D are more alike. With risk set matching, we can create matched pairs (A, B) and (C, D). Then, it is clear that the treatment delay has a harmful effect of shortening the survival time by 2 time units in both pairs. In an early-versus-late analysis, A and B would be in the early group, and C and D would be in the late group. If we match (A, C) and (B, D), the contrast shows a beneficial effect of 2 time units due to treatment delay. But this does not account for the fact that C and D are healthier patients with potentially better outcomes, so they can afford to initiate the treatment later.

Figure 1:

Illustration of Risk Set Matching

Unlike in a cross-sectional study, treatment initiation is a function of time and it is important to control for potential time-dependent confounding. Lu (2005) introduced a time-dependent propensity score based on the Cox proportional hazards model. It models the hazard of receiving treatment at each time point, conditional on both time-constant and time-varying covariates, and it is shown to balance the distribution of observed covariates between matched treated and control groups at every time point. With such propensity scores, we can balance covariate distributions at different time points through risk set matching.

2.3. Matching Implementation

Before conducting matching, we first need to estimate the time-dependent propensity score with the Cox proportional hazards model. It captures the hazard of being treated at time $t$ for subject $i$ as

$$h_i(t) = h_0(t)\exp[\beta^T X_i(t)],$$

where $X_i(t)$ includes both time-constant (baseline) and time-dependent covariates (Allison, 2010; Kalbfleisch and Prentice, 1980). For the purpose of balancing covariates, we utilize the linear component of the model, $\beta^T X_i(t)$, as the propensity score. This avoids the issue of estimating the propensity score with an unknown baseline hazard function for each individual. Therefore, our distance metric for matching is calculated from the linear propensity score $\hat{\beta}^T X_i(t)$, and a natural choice of distance between subjects is the squared Euclidean distance, $[\hat{\beta}^T X_i(t) - \hat{\beta}^T X_j(t)]^2$.
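In practice, this propensity score model can be fit with standard survival software on data in counting-process (long) format. The following is a minimal sketch, not the authors' code; the data frame `d` and the covariate names `x1`, `x2` and `z_t` are hypothetical placeholders.

```r
# Sketch: time-dependent propensity score via a Cox model on long-format data.
# Assumed structure of the hypothetical data frame `d`: one row per subject-week
# interval (tstart, tstop], with `treat` = 1 on the interval in which treatment
# starts and 0 otherwise, baseline covariates x1, x2 and a time-varying covariate z_t.
library(survival)
ps_fit <- coxph(Surv(tstart, tstop, treat) ~ x1 + x2 + z_t, data = d)
# Linear propensity score beta-hat' X_i(t) for every subject-interval row:
d$lin_ps <- predict(ps_fit, type = "lp")
# The matching distance between subjects i and j at the same risk-set time is then
# (d$lin_ps[i] - d$lin_ps[j])^2.
```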

Two algorithms for risk set matching have been discussed in Lu (2005). The sequential matching algorithm is conceptually easier to understand and implement. Risk sets of subjects possibly initiating the treatment are constructed chronologically at each time point when there is at least one treatment event. As previously described, matching is conducted sequentially for each risk set, from the first time point to the last one with both treated and not-yet-treated subjects available. Starting with the first risk set, we create matched pairs using existing bipartite matching algorithms. Then all treated and matched untreated subjects are removed from the dataset before we continue with the second risk set. The same procedure is repeated sequentially for each risk set until no more matches can be made. The second algorithm, a simultaneous algorithm, can run much faster than the sequential one. It takes advantage of a non-bipartite matching algorithm to match subjects at different time points all at once (Lu et al., 2001). However, its implementation is substantially more complicated. Since our premature delivery prevention dataset is small, containing a few hundred subjects, matching can be implemented quickly. Therefore, we focus on the sequential matching algorithm throughout the rest of the paper.

2.4. A Toy Example

This example illustrates how risk set matching is conducted with one time-varying covariate. The covariate may be an indicator of disease progression, with smaller values indicating more severe disease and a need for immediate treatment. Suppose there are twelve patients in this study, with treatment times at 16, 18, 20 or 22 weeks. The time-varying covariate is measured biweekly until the patient receives the treatment, with the initial measurement taken at the 14th week. A patient treated at the 16th week thus has two measurements; a patient treated at the 22nd week has five measurements. These measurements are reported in Table 1 (in parentheses following the patient's letter). For example, patient A's 14th week measurement is 4 and 16th week measurement is 3.

Table 1:

Treatment Time and Time-varying Covariate Measurements for All Twelve Patients

Trt Time Patients

16 A (4, 3), B (5, 3)
18 C (4, 3, 2), D (5, 3, 2), E(6, 5, 4), F(7, 5, 4)
20 G (6, 5, 4, 2), H (7, 5, 4, 3), I (7, 6, 5, 4), J (8, 7, 6, 5)
22 K (8, 6, 5, 4, 3), L (9, 7, 6, 5, 3)

To conduct sequential matching, we first construct the risk set at week 16, where the first treatments occur. This risk set includes all 12 patients (2 treated and 10 not-yet-treated). We want to match the treated patients, A and B, each with a not-yet-treated patient who has the closest value on the measurement at the 16th week. With this simple example, it is easy to see that the best matches are (A, C) and (B, D). We then remove the matched patients from the pool and continue the matching for the next risk set. The next risk set is constructed at week 18, and includes 2 treated and 6 not-yet-treated patients. Now, we want to match the treated patients, E and F, each with a not-yet-treated patient who has the closest value on the measurement at the 18th week. We match E with G and F with H, as they have the same measurement at week 18. After removing (E, G) and (F, H) from the pool, the risk set at week 20 is left with 4 patients, 2 treated (I, J) and 2 not-yet-treated (K, L). We match I with K and J with L based on the measurement at week 20. Table 2 summarizes the matching process.

As shown in Table 2, the sequential matching design matches patients with closer treatment times together. This is sensible because they tend to have similar covariate values at the time point of matching. If the conventional two-group matching design were applied, we would have to match patients treated at weeks 16 and 18 to those treated at weeks 20 and 22. Instead of matching A with C, both having a measurement of 3 at week 16, we would have to match A with G, giving a difference of two units in the covariate at the time when A is treated. As a result, the balance of covariates between treated and not-yet-treated patients is compromised.

Table 2:

Sequential Matching Process

Risk set week 16. Treated: A, B; Not-yet-treated: C, D, E, F, G, H, I, J, K, L; Matched pairs: (A, C), (B, D)

Risk set week 18. Treated: E, F; Not-yet-treated: G, H, I, J, K, L; Matched pairs: (E, G), (F, H)

Risk set week 20. Treated: I, J; Not-yet-treated: K, L; Matched pairs: (I, K), (J, L)
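The toy example can be reproduced with a few lines of R. The sketch below uses a simple greedy nearest-neighbour rule within each risk set and is only an illustrative, assumed implementation; the analyses in this paper use optimal matching via the nbpMatching package.

```r
# Greedy 1:1 sequential risk set matching on the toy data (illustration only).
trt_week <- c(A = 16, B = 16, C = 18, D = 18, E = 18, F = 18,
              G = 20, H = 20, I = 20, J = 20, K = 22, L = 22)
# Covariate measured at weeks 14, 16, 18, 20 (later values are not needed here).
x <- rbind(A = c(4, 3, NA, NA), B = c(5, 3, NA, NA),
           C = c(4, 3, 2, NA),  D = c(5, 3, 2, NA),
           E = c(6, 5, 4, NA),  F = c(7, 5, 4, NA),
           G = c(6, 5, 4, 2),   H = c(7, 5, 4, 3),
           I = c(7, 6, 5, 4),   J = c(8, 7, 6, 5),
           K = c(8, 6, 5, 4),   L = c(9, 7, 6, 5))
weeks <- c(14, 16, 18, 20)

available <- names(trt_week)
pairs <- NULL
for (tk in sort(unique(trt_week))) {
  treated  <- available[trt_week[available] == tk]
  controls <- available[trt_week[available] > tk]
  if (length(treated) == 0 || length(controls) == 0) next
  col <- match(tk, weeks)                 # covariate column at the matching week
  for (trt in treated) {
    if (length(controls) == 0) break
    ctrl <- controls[which.min(abs(x[controls, col] - x[trt, col]))]
    pairs <- rbind(pairs, data.frame(week = tk, treated = trt, control = ctrl))
    controls <- setdiff(controls, ctrl)
  }
  available <- setdiff(available, c(treated, pairs$control))
}
pairs  # reproduces (A,C), (B,D), (E,G), (F,H), (I,K), (J,L)
```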

3. Evaluating Causal Effects with Survival Outcomes

3.1. Assumptions for Estimating Causal Effects

To warrant valid causal interpretation of our matching design and the subsequent analysis, we extend the assumptions used in conventional cross-sectional studies as follows:

  • Stable Unit Treatment Value with Treatment Delay

    This is a generalization of the SUTVA assumption in Rubin (1980). It implies that patients receive the same version of treatment regardless of whether the treatment is delayed, and that the potential outcomes of any patient do not vary with the treatments assigned to other patients (delayed or not).

  • Conditional Independence for Treatment Delay

    It implies that, conditioning on the covariate history up to time $t$, the decision to receive the treatment at time $t$ or to delay it for a period of $\Delta t$ is effectively random. This ensures that there is no unmeasured confounding for the treatment delay decision.

  • Positivity
    $0 < P(A_i^{trt} \le t + \delta \mid A_i^{trt} \ge t) < 1$

    This assumption implies that treatment groups with or without delay have common support. It ensures that patients in two treatment groups are comparable.

  • Homogeneous Treatment Delay Effect

    This assumption ensures that the delay effect is the same whether the delay occurs early or late in the process, so we can pool the patients treated at different time points for inference. This is similar to the constant treatment effect assumption in the cross-sectional case. If researchers are more interested in heterogeneous effects, they can drop this assumption, but the analysis then has to be carried out separately by the time points of treatment.

3.2. Evaluating Survival Outcomes

Time-to-event data can be evaluated through a variety of survival measures, including the mean or median survival time, the survival function, or the hazard function. There is a rich literature in causal inference discussing the adjustment needed for valid treatment effect estimation with observational survival outcomes. For comparing survival functions, Xie and Liu (2005) proposed a weighted logrank test, weighted by the inverse of the propensity score. For propensity score matched treatments/exposures, Austin (2014) discussed the use of the stratified logrank test (SLR) and Lu et al. (2018) discussed the use of the paired Prentice-Wilcoxon test (PPW) (O'Brien and Fleming, 1987), where the matched sets are used as strata. For estimating hazard ratios, attention has focused primarily on the Cox proportional hazards (PH) model, as long as the proportional hazards assumption holds. A popular choice for inference with survival outcomes is the marginal structural Cox PH model with inverse propensity score weighting, which renders a causal interpretation for the regression coefficient (Hernan et al., 2000). If the outcome of interest is survival time, Uno et al. (2014) argued that the restricted mean survival time (RMST) is a good causal measure, and Chen and Tsiatis (2001) proposed a method to estimate the average causal effect on RMST, conditioning on confounder patterns.

Since we have devised a matching design to balance covariates and to reduce dependence on outcome modeling, we use analytic strategies compatible with matching. In our data analysis, we utilize SLR and PPW to test the null hypothesis of no difference in survival functions by treatment groups. To gauge the causal effect in terms of the hazard ratio, we apply the stratified Cox PH model to the matched data.
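The matched-data analyses used here rely on standard tools from the survival package in R. The sketch below illustrates the stratified Cox PH model and the stratified logrank test with hypothetical variable names; it is not the authors' script, and the PPW test is omitted for brevity.

```r
# Outcome analysis on the matched pairs (hypothetical data frame `m`): one row per
# matched subject with time (weeks to event), event (1 = observed, 0 = censored),
# delay (within-pair difference in treatment timing; 0 for the on-time member),
# delayed (1 = delayed member of the pair) and pair_id (matched-set indicator).
library(survival)
# Hazard ratio for a one-unit increase in treatment delay, stratified on matched pairs:
cox_fit <- coxph(Surv(time, event) ~ delay + strata(pair_id), data = m)
summary(cox_fit)
# Stratified logrank (SLR) test of no difference between on-time and delayed members:
survdiff(Surv(time, event) ~ delayed + strata(pair_id), data = m)
```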

4. Simulation Studies

4.1. Design

Monte Carlo simulations are conducted to assess the performance of the proposed methodology. We generate samples of size N = 1,000, where subjects are characterized by eight time-invariant covariates, $X_1, \dots, X_8$, and one time-varying covariate, $Z(t)$. For each subject $i$, the values $X_{1i}, \dots, X_{4i}$ are sampled from a standard normal distribution, while the values $X_{5i}, \dots, X_{8i}$ are sampled from a Bernoulli distribution with probability 0.5. The time-varying covariate has the form $Z_i(t) = c_i t$, with $c_i \in \{0.01, 0.2\}$, representing an indicator of disease progression. In particular, the two values of $c_i$ simulate two types of progression: slow for $c_i = 0.01$ and fast for $c_i = 0.2$.

To be consistent with our data example, the time unit is the week. The time to treatment $A_i^{trt}$ is sampled from an exponential distribution with hazard function

$$h_{A_i}(t) = \lambda_A \exp\left(\sum_{j=1}^{8} \beta_{Aj} X_{ji} + \beta_{AZ} Z_i(t)\right),$$

with $\lambda_A = 1/6$, $(\beta_{A1}, \dots, \beta_{A8}) = \log(1.05, 1.1, 1.15, 1.2, 1.1, 1.2, 1.3, 1.4)$, and $\beta_{AZ} = \log(1.3)$.

For subjects receiving treatment at week $k$, the potential outcome $Y_{Ti}(k)$, time to delivery, is defined as a constant value (20 weeks) plus a value sampled from an exponential distribution with hazard function

$$h_{Y_{ik}}(t) = \lambda_Y \exp\left(\beta_{Y1} X_{1i} + \beta_{Y2} X_{2i} + \beta_{Y5} X_{5i} + \beta_{Y6} X_{6i} + \beta_{YZ} Z_i(t) + (k-1)\delta\right),$$

with $\lambda_Y = 1/60$, $(\beta_{Y1}, \beta_{Y2}, \beta_{Y5}, \beta_{Y6}) = \log(1.05, 1.1, 1.1, 1.2)$ and $\beta_{YZ} = \log(1.2)$. Note that only four time-invariant covariates (two continuous and two binary variables) appear in the data-generating model of the outcome and, therefore, are true confounders. The parameter $\delta$ defines the treatment delay effect. We consider two treatment effect scenarios. First, a no-effect case, where $\delta = 0$ and delaying the treatment does not impact the time to the outcome event. Second, a case with a positive effect, where $\delta = \log(1.2)$; in this case, delaying the treatment by one week increases the outcome hazard by 20%.

Both the times $A_i^{trt}$ and $Y_{Ti}(k)$ are rounded up to the smallest integer larger than the sampled value, to mimic the data structure with integer time points of the motivating study. In particular, we used the methodology described by Austin (2012) to sample the time to treatment and the time to outcome (delivery), as both depend on time-varying covariates.
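Because $Z_i(t) = c_i t$ is linear in time, the cumulative hazard of the time-to-treatment model can be inverted in closed form. The following is a minimal sketch of that inversion under the stated parameter values; it is an assumed simplification for illustration, not the authors' simulation script (script_sim.r), which follows Austin (2012).

```r
# Draw times to treatment by inverting H_A(t) = lambda_A * exp(eta) *
# (exp(beta_AZ * c_i * t) - 1) / (beta_AZ * c_i), where eta = sum_j beta_Aj * X_ji.
set.seed(1)
n <- 1000
X <- cbind(matrix(rnorm(4 * n), n, 4), matrix(rbinom(4 * n, 1, 0.5), n, 4))
beta_A   <- log(c(1.05, 1.1, 1.15, 1.2, 1.1, 1.2, 1.3, 1.4))
beta_AZ  <- log(1.3)
lambda_A <- 1 / 6
c_i <- sample(c(0.01, 0.2), n, replace = TRUE)   # slow vs fast progression
eta <- drop(X %*% beta_A)
E <- rexp(n)                                     # unit-exponential draws, H_A(t) = E
t_trt <- log(1 + E * beta_AZ * c_i / (lambda_A * exp(eta))) / (beta_AZ * c_i)
t_trt <- ceiling(t_trt)                          # round up to integer weeks
```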

We considered a censoring mechanism independent of covariates and outcome times. Censoring times are defined as the same baseline constant value used for the outcome event times (20 weeks), plus a value sampled from an exponential distribution with rate $\lambda_Y/1.5$. This setup generates samples where about 20% of the outcome event times are censored.

For both scenarios, we generate 10,000 datasets and apply several methods to each generated sample. First, we apply a naïve approach, based on a Cox PH model including the covariates and an indicator for late versus early treatment. Late (versus early) treatment is defined by treatment timing at or above the median time to treatment, and this indicator is treated as a fixed covariate. The coefficient of the treatment indicator in the naïve model is used to estimate the treatment effect, and the Wald test on this coefficient is used to test whether the treatment delay effect is zero. We consider two naïve models: one including all covariates and one including only the true confounders ($X_1$, $X_2$, $X_5$, $X_6$ and $Z(t)$).

Second, we apply the sequential matching design as described in the previous sections, where subjects treated at time $t$ are matched to not-yet-treated subjects by time $t$. We consider matching on two propensity score models: a model including all the covariates and one including only the true confounders. In both matched samples, the treatment delay effect is estimated by a Cox PH model including a single continuous variable, measuring the delay effect of a one-unit increase in the treatment time. To account for the matching structure, the matched pair indicator is included as strata in the Cox PH model (Austin, 2014). The effect is estimated by the coefficient of the delay variable. To test the hypothesis of no effect, we compute the Wald test on the coefficient of the delay variable in the Cox model and also apply the nonparametric PPW and SLR tests to the matched sample.

Finally, we also consider a “restricted” matching design focusing on a one-week delay effect, where subjects treated at time t are only allowed to be matched to not-yet-treated subjects treated at time t + 1. Again, we match subjects on the propensity score model including all covariates and on the model including only the true confounders. We apply the same estimating and testing strategies used in the non-restricted matching design, by fitting stratified Cox PH models with a single treatment delay indicator and employing the PPW and SLR tests.

We compare the performance of various testing approaches by assessing type-I error (i.e., the rejection rate in the null scenario) and power (i.e., the rejection rate in the scenario with nonzero effect). We evaluate the treatment effect estimation in terms of bias, coverage of the 95% confidence intervals and root mean squared error (RMSE). To gauge the estimation efficiency, we compute the empirical standard deviation of the estimates across simulations.

To quantify the magnitude of the treatment effect on a scale that is more interpretable than a hazard ratio, we provide estimates of $S_1(40) - S_0(40)$, the difference in survival probability corresponding to a one-week delay at time $t = 40$, which is approximately the median time-to-event in the simulated data. When restricting the matching to consecutive time points, we use the Kaplan-Meier estimates of the survival functions $\hat{S}_0$ and $\hat{S}_1$ in the treatment and delayed-treatment groups, respectively. The same approach cannot be used in the matched sample generated without restrictions on the delay time, because the within-pair treatment delay may be much larger than one. Therefore, we fit a Cox model including only the treatment delay variable (as a continuous predictor) and use the model to estimate the difference of the survival probabilities at delays 1 and 0. Because we observe that the estimates of such a marginal model are overly sensitive to the presence of the few subjects with very long delay, we drop the 5% of pairs with the not-yet-treated subjects having the longest delay. Confidence intervals for the survival probability difference are computed using the bootstrap.
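For the restricted (one-week delay) design, the survival-probability difference and its bootstrap interval can be computed as in the sketch below, which resamples matched pairs; the data frame `mr` and its variable names are hypothetical placeholders, not the authors' code.

```r
# Kaplan-Meier estimate of S1(40) - S0(40) in the restricted matched sample, with a
# pair-level percentile bootstrap CI. `mr`: time, event, delayed (0/1), pair_id.
library(survival)
surv_at_40 <- function(dat, grp) {
  fit <- survfit(Surv(time, event) ~ 1, data = dat[dat$delayed == grp, ])
  summary(fit, times = 40, extend = TRUE)$surv
}
diff_40 <- function(dat) surv_at_40(dat, 1) - surv_at_40(dat, 0)
est <- diff_40(mr)
boot <- replicate(1000, {
  ids <- sample(unique(mr$pair_id), replace = TRUE)
  diff_40(do.call(rbind, lapply(ids, function(id) mr[mr$pair_id == id, ])))
})
c(estimate = est, quantile(boot, c(0.025, 0.975)))
```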

4.2. Results

Simulation results are summarized in Table 3. When the samples are generated with no treatment delay effect (top panel of the table), all estimators show negligible bias and the coverage of the 95% confidence intervals is close to the nominal level. Similarly, the type-I errors of all tests at the α = 0.05 level are approximately equal to the nominal level. Including both true confounders and predictors of the treatment in the naive model or in the propensity score model did not impact the type-I error, bias or coverage of the confidence intervals. However, considering only the true confounders in the models results in estimates with slightly smaller variability and, consequently, lower RMSE. Interestingly, restricting the matching steps to subjects treated at consecutive time points results in estimates of δ with considerably larger variability than the non-restricted matching: the empirical SD of the former method is about 6 times as large as the SD of the latter. This phenomenon could be partially due to the difference in sample size of the matched sets generated by the two approaches. In our 10,000 simulations, when the propensity score model includes only the true confounders, the mean size of the matched samples generated by the non-restricted matching procedure is 992.1 (SD=20.1, median=998, Q1-Q3=994-998). The restricted matching procedure results in smaller matched samples, with a mean size of 830.4 (SD=42.8, median=834, Q1-Q3=810-858). Because of the larger variability of the estimates, the restricted matching procedure attains a higher RMSE than the non-restricted approach.

Table 3:

Results of simulation study. For the estimation of δ, the table provides the bias (multiplied by 100), the coverage of the 95% confidence intervals, the root mean squared error (RMSE), the mean of the estimated standard deviation (MESD) and the Monte Carlo standard deviation of the estimate (MCSD).

Scenario; Measure; Covariates; Naïve Cox; Matching (unrestr.): Cox, PPW, SLR; Matching (restr.): Cox, PPW, SLR
No Effect (eδ = 1) Type-I error (%) All 4.4 4.7 4.7 5.1 4.5 5.6 4.8
True Conf. 4.2 4.7 4.5 4.9 4.6 5.2 4.9
Bias (×100) All 0.232 −0.180 - - 0.112 - -
True Conf. 0.218 −0.160 - - 0.195 - -
Coverage (%) All 95.6 95.3 - - 95.5 - -
True Conf. 95.8 95.3 - - 95.4 - -
RMSE All 0.078 0.021 - - 0.129 - -
True Conf. 0.076 0.019 - - 0.118 - -
MESD All 0.079 0.021 - - 0.130 - -
True Conf. 0.077 0.019 - - 0.118 - -
MCSD All 0.078 0.021 - - 0.129 - -
True Conf. 0.076 0.019 - - 0.118 - -

Delay Effect (eδ = 1.2) Power (%) All 100.0 100.0 95.5 100.0 29.1 24.4 31.2
True Conf. 100.0 100.0 97.0 100.0 34.5 24.8 36.6
Bias (×100) All 61.4 −0.900 - - −0.931 - -
True Conf. 63.1 −0.685 - - −0.730 - -
Coverage (%) All 0.0 92.5 - - 95.6 - -
True Conf. 0.0 93.0 - - 95.5 - -
RMSE All 0.619 0.028 - - 0.120 - -
True Conf. 0.636 0.026 - - 0.109 - -
MESD All 0.076 0.026 - - 0.123 - -
True Conf. 0.075 0.024 - - 0.111 - -
MCSD All 0.078 0.026 - - 0.120 - -
True Conf. 0.076 0.025 - - 0.109 - -

In the presence of a nonzero treatment delay effect, the simulations clearly show the inadequacy of the naïve estimator, which is severely biased. The bias of the estimator results in confidence intervals that never cover the true value. By comparing subjects who receive the treatment late to those receiving it early, this methodology overestimates the impact of a treatment delay. Conversely, both matching strategies result in small bias and appropriate coverage of the effect. As in the first scenario, restricting the matching steps to subjects treated at consecutive time points results in smaller samples than the non-restricted matching. This affects both the variability of the estimate of δ, as noted in the null scenario, and the power of the tests: all of the tests show lower power in the restricted matching design. Similarly, including covariates not related to the outcome in the propensity score model generates estimates with slightly larger variability, which results in a slight increase in the RMSE. Interestingly, the PPW test shows somewhat lower power than the other competing testing strategies in both matching designs, which is likely because it is a nonparametric test and does not take advantage of the proportional hazards structure in the data. The SLR test and the stratified Cox model show similar performance, with the latter having slightly better type-I error control.

Our simulation study focuses on the conditional treatment delay effect, since our data generating mechanism is designed with a pre-specified conditional delay effect. This design is easier to compare to conventional clinical research practice, where Cox PH models with all relevant predictors are used. To provide an alternative measure of the delay effect, we also compute the difference between the survival probabilities corresponding to a one-week delay at the fixed time t = 40. We compute the averages of such differences across the samples generated by matching on the propensity score model including the true confounders. When there is no delay effect (δ = 0), the averages of the estimated differences are −0.14% and 0.05% in the unrestricted and restricted matching procedures, respectively. Because the difference in survival probability for a one-week delay is expected to be 0 in this case, we expect 0 to lie outside of the bootstrap-based 95% confidence interval about 5% of the time. We observe a type-I error rate close to the nominal level in both matching procedures: 6.3% in the unrestricted and 5.5% in the restricted matching. The same average differences are 5.6% and 5.8% in the presence of the delay effect defined by δ = log(1.2). In this scenario, the value 0 lies outside of the confidence interval 100% and 38.4% of the time, respectively. Hence, the power to detect a difference in survival probabilities is higher in the unrestricted matching procedure.

In addition, to provide some intuition of the population level effect estimate, we calculate the marginal effect using the Cox PH model with robust variance estimator to account for the correlation within matched sets. In our no-effect scenario, where δ = 0, the marginal effect is zero as well. In the scenario where δ = log(1.2), the marginal 1-week delay effect in the simulated data is calculated to be approximately 1.152 (on the hazard ratio scale). Notably, this effect is close to the true conditional effect, set by design to be 1.2. It turns out that the marginal model with restricted matching design performs well with both type I error and 95% CI coverage at the nominal level. On the other hand, the marginal model with unrestricted matching has inflated type I error and poor 95% CI coverage. For brevity, results are not shown here.
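The marginal model referred to here replaces the matched-pair strata with a cluster-robust variance estimator. A minimal sketch follows, with the same hypothetical data frame `m` used in the earlier analysis sketch.

```r
# Marginal Cox model for the delay effect with a robust (sandwich) variance estimator
# clustered on the matched set, instead of stratifying on the pair.
library(survival)
marg_fit <- coxph(Surv(time, event) ~ delay + cluster(pair_id), data = m)
summary(marg_fit)
```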

5. Prevention of Recurrent Premature Delivery

5.1. Preterm clinic data

Table 4 describes the characteristics of the 421 women in our cohort. Women with singleton pregnancies and a history of spontaneous preterm birth who initiated 17P therapy between 2011 and 2016 were included. We use these data to investigate whether delayed 17P initiation had a detrimental effect (shorter time to delivery), as estimated by a naive fixed two-group optimal propensity score matching and by our sequential risk set matching method for delayed treatment. All patient characteristics were measured at the time of the initial visit to the prematurity clinic, which occurred one or more weeks prior to 17P initiation, with the exception of cervical length. Cervical length (CL) measurements were taken approximately every two weeks and were most often initiated at week 16. Initiation of 17P is recommended between 16 and 20 weeks gestation, and 17P generally will not be initiated prior to 14 weeks gestation or after 24 weeks.

Table 4:

Participant characteristics and balance by early and late treatment timing

Characteristic; Before Matching: Early (n=254), Late (n=167), ASD (%); After Matching, Fixed 2-group: Early (n=112), Late (n=112), ASD (%); After Matching, Sequential: Trt (n=126), Not Yet Trt (n=126), ASD (%)
Maternal age, mean (sd) 29.5 (5.7) 29.2 (5.5) 4.7 29.5 (5.6) 29.5 (5.7) 2.2 29.2 (5.7) 29.4 (5.5) 4.0
Race/Ethnicity, n (%)
Black, non-Hispanic 114 (45) 72 (43) 3.6 37 (44) 36 (43) 8.9 57 (45) 54 (43) 4.8
 White, non-Hispanic 120 (47) 81 (49) 2.5 41 (49) 40 (48) 12.5 60 (48) 60 (48) 0
 Other race/ethnicity 20 (8) 14 (8) 1.9 6 (7) 8 (10) 6.5 9 (7) 12 (10) 8.6
Insurance, n (%)
Private 90 (35) 42 (25) 22.5 39 (46) 31 (37) 14.6 32 (25) 37 (29) 8.9
Public 143 (56) 108 (65) 17.1 41 (49) 46 (55) 8.9 79 (63) 78 (62) 1.6
None 21 (8) 17 (10) 6.6 4 (5%) 7 (8) 9.6 15 (12) 11 (9) 10.4
BMI, n (%)
 <25 kg/m2 76 (30) 69 (41) 23.9 16 (19) 18 (21) 29.8 38 (30) 48 (38) 16.7
25-30 kg/m2 60 (24) 40 (24) 0.8 23 (27) 25 (30) 2.1 42 (33) 31 (25) 19.3
>30 kg/m2 118 (46) 58 (35) 24.0 45 (54) 41 (49) 23.3 46 (37) 47 (37) 1.6
Tobacco, n (%) 81 (32) 57 (34) 4.8 33 (39) 26 (31) 9.6 47 (37) 41 (33) 10.0
Number prior PTB, n (%)
1 167 (66) 103 (62) 8.5 56 (67) 58 (69) 3.8 75 (60) 74 (59) 1.6
2 61 (24) 38 (23) 3.0 23 (27) 22 (26) 8.3 32 (25) 32 (25) 0
>2 26 (10) 26 (16) 15.9 5 (6) 4 (5) 6.9 19 (15) 20 (16) 12.8
Earliest prior GA, n (%)
16-20 wks 31 (12) 14 (8) 12.6 15 (18) 12 (14) 17.6 13 (10) 13 (10) 0
 20-28 wks 98 (39) 45 (27) 24.9 34 (40) 37 (44) 28.9 44 (35) 41 (33) 5.0
 28-32 wks 34 (13) 23 (14) 1.1 13 (15) 15 (18) 2.6 11 (9) 16 (13) 12.8
 32-36 wks 91 (36) 85 (51) 30.7 22 (26) 20 (24) 45.3 58 (46) 56 (44) 3.2
Prior DC, n (%) 74 (29) 41 (25) 10.1 26 (31) 23 (27) 10.1 40 (32) 32 (25) 14.0

First Cervical length, mean (sd) 34.2 (7.8) 32.3 (9.0) 2.1 34.5 (6.9) 34.2 (8.9) 8.0 33.8 (7.7) 33.9 (8.2) 0.5

5.2. Propensity score model and balance assessment

Before matching, patient characteristics (measured prior to the time of treatment) are considerably imbalanced for insurance status, body mass index and the gestational age category of the earliest preterm birth. The percent absolute standardized difference (ASD) is used to assess balance between groups and is defined as 100 times the absolute difference in the mean values of the two groups, divided by the pooled standard deviation. We consider an ASD below 10% desirable, and below 20% acceptable, for all characteristics to achieve balance. All covariates in Table 4 are used to estimate the fixed two-group (early and late) propensity score and the time-dependent propensity score. The fixed two-group propensity score model is fit via logistic regression. The time-dependent propensity score is estimated via a Cox proportional hazards model with a time-varying CL covariate, to estimate the hazard of treatment at each gestational age week. The dataset containing the CL time-varying covariate is arranged in 'long' format, where subjects have multiple rows, one for each time point under study, with their corresponding fixed and time-varying covariates. Insurance status, earliest gestational age of prior preterm birth and the CL measures at baseline and at treatment were all significantly associated with the timing of treatment at the 0.05 level; however, all variables are retained in the propensity score models as they are relevant to our outcome of interest, time to delivery. After matching, both matching methods improve balance between treatment groups. However, we see a much larger improvement with the time-dependent propensity score, resulting in fewer covariates with ASDs over 10% and no covariates with an ASD over 20%. The fixed-time propensity score does not balance BMI and earliest prior GA.
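Under one common definition of this balance metric (an assumption here, since the exact pooled standard deviation formula is not spelled out in the text), the ASD for a single covariate can be computed as in the following sketch, with hypothetical variable names.

```r
# Percent absolute standardized difference for a covariate x and a 0/1 group indicator g.
asd <- function(x, g) {
  m1 <- mean(x[g == 1]); m0 <- mean(x[g == 0])
  sp <- sqrt((var(x[g == 1]) + var(x[g == 0])) / 2)  # pooled standard deviation
  100 * abs(m1 - m0) / sp
}
# e.g. asd(cohort$maternal_age, cohort$late_trt)
```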

Following Lu (2005), we further assess balance by estimating the effect of each covariate on the time to treatment (Table 5). For each pre-treatment covariate, this additional measure of balance, the HR and the associated Wald test p-value, is estimated through a Cox PH model for the time to treatment. As with the ASD, this measure is computed for the full cohort prior to matching and for only the matched sets following matching. After matching, the Cox PH model for the association between each covariate and the time to treatment is stratified by the matched set indicator. Ideally, if covariates are well balanced, we do not expect to see a significant impact of covariates on the hazard of treatment (i.e. an HR close to 1 and a p-value > 0.10). These summaries provide another measure of balance before and after matching for time-varying exposures. As identified before matching, insurance status and earliest gestational age of prior preterm birth were strongly associated with time to treatment (p-value < 0.10). After matching, all covariates are sufficiently balanced. As with the ASD, BMI is not as closely balanced as other covariates, but it does meet our metrics for adequate balance.

Table 5:

Covariate Impact on Treatment Hazard Before and After Matching

Characteristic; Before Matching (n=421): HR (SE), p-value; After Matching (n=252): HR (SE), p-value
Maternal age 1.00 (0.01) 0.88 0.99 (0.02) 0.74
Race/Ethnicity
Black, non-Hispanic - - - -
White, non-Hispanic 0.89 (0.09) 0.25 0.96 (0.26) 0.89
Other race/ethnicity 0.89 (0.17) 0.53 0.69 (0.35) 0.46
Insurance
Private - - - -
Public 0.80 (0.09) 0.04 1.24 (0.39) 0.50
None 0.74 (0.14) 0.10 1.68 (0.83) 0.29
BMI
 <25 kg/m2 - - - -
 25-30 kg/m2 1.01 (0.13) 0.91 1.65 (0.52) 0.11
 >30 kg/m2 1.20 (0.13) 0.11 1.19 (0.35) 0.56
Tobacco 0.90 (0.09) 0.33 1.33 (0.42) 0.36
Number prior PTB
1 - - - -
2 1.03 (0.12) 0.77 0.99 (0.30) 0.97
>2 0.87 (0.13) 0.34 0.93 (0.35) 0.85
Earliest prior GA
16-20 wks - - - -
20-28 wks 0.89 (0.15) 0.50 1.10 (0.51) 0.84
28-32 wks 0.64 (0.13) 0.03 0.62 (0.39) 0.45
32-36 wks 0.59 (0.10) 0.002 0.91 (0.49) 0.86
Prior DC 0.98 (0.11) 0.84 1.36 (0.38) 0.27

Cervical length, earliest 1.00 (0.01) 0.83 1.00 (0.02) 0.97
Cervical length, at treatment - - 1.01 (0.02) 0.41

5.3. Risk Set Matching

Through sequential risk set matching, pairs of treated and not-yet-treated patients with similar covariate histories were formed. Patients in our cohort initiated treatment with 17P between 14 and 22 weeks gestation, creating 9 risk sets. At 14 weeks gestation, 30 patients were treated with 17P (and 391 were not yet treated), and all 30 patients were paired with a not-yet-treated patient. Pairs were required to have estimated propensity scores within a caliper of 20% of the propensity score standard deviation. Moreover, to mimic a clinically meaningful delay pattern, treated and not-yet-treated patients were only allowed to be paired if their treatment times were at least two weeks apart. In our cohort, the majority (60%) of patients were treated before 17 weeks, with 184 treated at 16 weeks. A total of 126 matched pairs, 252 patients in total, were included in the delayed treatment risk set matched analyses. With a large number of patients treated at week 16, our sequential matching was similar to an early-vs-late matching (i.e. week 16 or earlier vs. after week 16). To examine the impact of so many patients receiving treatment in the early weeks on matching, we also conducted a small sensitivity analysis by sampling a portion of these patients. The intention was to distribute the number of treated patients more evenly across different time points. A 20% random sample of patients who initiated treatment at weeks 14, 15, and 16, plus all patients who initiated therapy in later weeks, made up our sampled cohort. Both risk set and fixed two-group optimal matching were performed in R software version 3.3, using the nbpMatching package (R Development Core Team, 2017; Lu et al., 2009, 2011).
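One simple way to encode the caliper and the minimum two-week separation before running the optimal matching step is to set the corresponding entries of the distance matrix to infinity. The sketch below is an assumed illustration with hypothetical objects (lin_ps for the linear propensity scores, week for the treatment initiation week, trt and ctl for the indices of treated and not-yet-treated subjects in the current risk set); it is not the code used for the analysis.

```r
# Squared linear propensity score distance between treated and candidate controls.
d_mat <- outer(lin_ps[trt], lin_ps[ctl], function(a, b) (a - b)^2)
caliper <- 0.2 * sd(lin_ps)
# Forbid pairs outside the caliper or with less than a two-week difference in timing.
d_mat[outer(lin_ps[trt], lin_ps[ctl], function(a, b) abs(a - b) > caliper)] <- Inf
d_mat[outer(week[trt],   week[ctl],   function(a, b) (b - a) < 2)]          <- Inf
```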

5.4. Delayed treatment effect

Table 6 summarizes the estimated effect of delayed treatment with 17P by the PPW test, the SLR test and the Cox PH model stratified on the matched pair, for our outcome of interest, time to delivery. This stratified Cox PH model is estimated to make inference on our outcome; it does not include any time-varying covariate, since such covariates are balanced through matching, and it is different from the Cox PH model used earlier to estimate the time-varying propensity score. Inference on time to delivery includes our causal effect of delayed treatment through sequential matching, and a fixed two-group optimal matching analysis of the early versus late treatment comparison. The outcome, time to delivery, is defined as the time from the first possible treatment (regarded as baseline), week 14, to the week of delivery. Few subjects (< 5%) are censored due to loss to follow-up. The risk set matched analyses do not show a significant treatment delay effect, with a 0.95 hazard ratio (SE: 0.18, p-value=0.79). All three tests (PPW, SLR, stratified Cox) fail to reject the null hypothesis of no difference in time to delivery. The fixed two-group matched analyses yield a similar conclusion of no strong influence of delay on time to delivery, with a 0.86 hazard ratio (SE: 0.19, p-value=0.45). The effects estimated in the sampled cohort also show no strong evidence of a treatment delay effect. As in our simulation studies, we further estimate the survival difference for delay. The difference in survival probabilities at 37 weeks gestation, for those who receive treatment early as compared to those with a two-week delay, is estimated from Kaplan-Meier estimates of the survival function, with a 95% bootstrap confidence interval. The estimated difference in survival probabilities at 37 weeks (for a two-week delay) is 0.002 (95% CI: −0.120, 0.123), indicating that the proportion of women yet to deliver at 37 weeks gestation is not substantially different in these two groups. These results are consistent with the previous results indicating no observed effect of treatment delay.

Table 6:

Treatment effect estimates

HR(SE) Test statistic p-value
Risk Set matching
Full Cohort
   PPW test −0.12 0.548
LR test, stratified 0.07 0.785
Cox PH, stratified 0.95 (0.18) −0.27 0.788
Sampled Cohort
   PPW test −0.62 0.731
LR test, stratified 1.00 0.317
Cox PH, stratified 0.81 (0.22) −0.97 0.330
Fixed Two-group matching: Late versus Early
Full Cohort
   PPW test 0.40 0.344
LR test, stratified 0.59 0.441
Cox PH, stratified 0.86 (0.19) −0.76 0.446
Sampled Cohort
   PPW test −0.50 0.691
LR test, stratified 0.17 0.680
Cox PH, stratified 1.11 (0.26) 0.39 0.696

6. Discussion

In longitudinal studies with time-varying covariates, conventional fixed two-group matching designs may not produce matched sets that incorporate time-related information well. Because these time-varying covariates may have a substantial impact on the treatment decision over time, it is important to balance them at the time of treatment reception. We adopt a risk set matching design for observational survival data, which is an extension of Lu's (2005) work on continuous outcomes. It models the propensity score as a time-to-event process, which can incorporate time-varying covariates. Risk sets are created at time points when a treatment occurs and matched sets are created sequentially based on the propensity score at each time point. Such a design is ideal for evaluating the treatment delay effect, where the majority of the cohort will receive the treatment eventually, but the treatment might be delayed due to individual characteristics. We illustrate the methodology with a real world study of premature delivery prevention. Our matching design is shown to improve covariate balance. Our findings are consistent with previous work suggesting no substantial impact of delaying 17P initiation (How et al., 2007) and do not confirm recent findings from a smaller observational study which investigated the fixed-time early versus late comparison (Ning et al., 2017).

When applying this design to real data, several practical issues may arise. First, when there are both treated and never-treated subjects in the cohort, our methods can still be utilized. The never-treated subjects will always serve as controls. They may not be good matches for early-treated subjects; however, they may mimic late-treated subjects well. The only caveat is that, when the never-treated group is large, many of its members are not likely to be included in the final matched sets. Second, to find the best matching results over time, the number of treated subjects in each risk set cannot vary dramatically. In our data example, nearly half of the women were treated at week 16. This limits the choice of matches, as everyone treated after week 16 will be used as a control to feed the large number of treated women at week 16. We have attempted to lessen the impact of so many treated subjects at one time point by sampling subjects at those treatment times. However, this may have an impact on inference if the sample size of the original cohort is not very large. Third, though our method can incorporate time-varying covariates, it cannot handle repeated treatment. An individual may change her status from not-treated to treated at most once over time. This scenario is reasonable for most surgical interventions, but may not be a reasonable framework for some drug studies in which individuals' drug taking behavior may change very often over time.

Simulation results suggest that this matching method controls confounding well when time-varying covariates are incorporated and the propensity score model is correctly specified with regard to all confounders. Future work will explore the impact of mis-specified propensity score models, in terms of incorrect functional forms of confounders or leaving out some confounders. It is also well known that matching only balances observed confounders. Unmeasured confounding is a major concern in observational studies due to the lack of randomization. Rosenbaum (2002) proposed a comprehensive framework to assess the impact of potential hidden bias on an observed significant association, based on matching design. Lu et al. (2018) devised a sensitivity analysis for survival outcomes based on the PPW test. If practitioners are concerned with potential unmeasured confounding, a sensitivity analysis can be implemented to assess the robustness of observed findings.

Supplementary Material

example.r
script_sim.r

7. Acknowledgments

This work was partially supported by grant 1R01 HS024263 from the Agency for Healthcare Research and Quality of the U.S. Department of Health and Human Services. Support for this project was also partially provided by the Ohio State University Institute for Population Research through a grant from the Eunice Kennedy Shriver National Institute for Child Health and Human Development of the National Institutes of Health, P2CHD058484, and by Award Number UL1TR001070 from the National Center For Advancing Translational Sciences.

The content is solely the responsibility of the authors and does not necessarily represent the official views of the Agency for Healthcare Research and Quality of the U.S. Department of Health and Human Services, the Eunice Kennedy Shriver National Institute for Child Health and Human Development, the National Center For Advancing Translational Sciences or the National Institutes of Health.

References

  1. Austin PC (2012) Generating survival times to simulate Cox proportional hazards models with time-varying covariates. Statistics in Medicine 31: 3946–3958.
  2. Austin PC (2014) The use of propensity score methods with survival or time-to-event outcomes: reporting measures of effect similar to those used in randomized experiments. Statistics in Medicine 33(7): 1242–1258. DOI: 10.1002/sim.5984.
  3. Chen PY and Tsiatis AA (2001) Causal inference on the difference of the restricted mean lifetime between two groups. Biometrics 57(4): 1030–1038. DOI: 10.1111/j.0006-341X.2001.01030.x.
  4. Hernan M, Brumback B and Robins J (2000) Marginal structural models to estimate the causal effect of zidovudine on the survival of HIV-positive men. Epidemiology 11: 561–570.
  5. How HY, Barton JR, Istwan NB, Rhea DJ and Stanziano GJ (2007) Prophylaxis with 17 alpha-hydroxyprogesterone caproate for prevention of recurrent preterm delivery: does gestational age at initiation of treatment matter? American Journal of Obstetrics and Gynecology 197(260): e1–e4.
  6. Iams J, Dildy G, Macones G and Silverman N (2012) Practice bulletin no. 130: prediction and prevention of preterm birth. Obstetrics and Gynecology 120: 964–973.
  7. Imbens GW and Rubin DB (2015) Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction. Cambridge University Press.
  8. Kalbfleisch JD and Prentice RL (1980) The Statistical Analysis of Failure Time Data. New York: Wiley.
  9. Li YP, Propert KJ and Rosenbaum PR (2001) Balanced risk set matching. Journal of the American Statistical Association 96(455): 870–882.
  10. Lu B (2005) Propensity score matching with time-dependent covariates. Biometrics 61: 721–728.
  11. Lu B, Cai D and Tong X (2018) Testing causal effects in observational survival data using propensity score matching design. Statistics in Medicine 37(11): 1846–1858. DOI: 10.1002/sim.7599.
  12. Lu B, Greevy R and Beck C (2009) nbpMatching: functions for non-bipartite optimal matching. R package version 1.0. URL http://CRAN.R-project.org/package=nbpMatching.
  13. Lu B, Greevy R, Xu X and Beck C (2011) Optimal nonbipartite matching and its statistical applications. The American Statistician 65: 21–30.
  14. Lu B, Zanutto E, Hornik R and Rosenbaum PR (2001) Matching with doses in an observational study of a media campaign against drug abuse. Journal of the American Statistical Association 96(456): 1245–1253.
  15. Meis PJ, Klebanoff M, Thom E, Dombrowski MP, Sibai B, Moawad AH, Spong CY, Hauth JC, Miodovnik M, Varner MW, Leveno KJ, Caritis SN, Iams JD, Wapner RJ, Conway D, O'Sullivan MJ, Carpenter M, Mercer B, Ramin SM, Thorp JM and Peaceman AM (2003) Prevention of recurrent preterm delivery by 17 alpha-hydroxyprogesterone caproate. New England Journal of Medicine 348(24): 2379–2385. DOI: 10.1056/NEJMoa035140.
  16. Ning A, Vladutiu CJ, Dotters-Katz SK and Goodnight WH (2017) Gestational age at initiation of 17-alpha hydroxyprogesterone caproate and recurrent preterm birth. American Journal of Obstetrics and Gynecology 217(371): e1–e7.
  17. O'Brien PC and Fleming T (1987) A paired Prentice-Wilcoxon test for censored paired data. Biometrics 43: 169–180.
  18. Patel Y and Rumore MM (2012) Hydroxyprogesterone caproate injection (Makena) one year later. Pharmacy and Therapeutics 37(47): 405–411.
  19. Allison PD (2010) Survival Analysis Using SAS: A Practical Guide, 2nd edition. Cary: SAS Publishing.
  20. Prentice RL and Breslow NE (1978) Retrospective studies and failure time models. Biometrika 65(1): 153–158.
  21. R Development Core Team (2017) R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. URL http://www.R-project.org/. ISBN 3-900051-07-0.
  22. Rosenbaum PR (2002) Observational Studies. New York: Springer.
  23. Rubin DB (1980) Randomization analysis of experimental data in the Fisher randomization test. Journal of the American Statistical Association 75: 591–593.
  24. Uno H, Claggett B, Tian L, Inoue E, Gallo P, Miyata T, Schrag D, Takeuchi M, Uyama Y, Zhao L, Skali H, Solomon S, Jacobus S, Hughes M, Packer M and Wei L (2014) Moving beyond the hazard ratio in quantifying the between-group difference in survival analysis. Journal of Clinical Oncology 32(22): 2380–2385.
  25. Xie J and Liu C (2005) Adjusted Kaplan-Meier estimator and log-rank test with inverse probability of treatment weighting for survival data. Statistics in Medicine 24: 3089–3110.
