Author manuscript; available in PMC: 2025 Oct 18.
Published in final edited form as: Biometrics. 2020 Aug 3;77(3):1101–1117. doi: 10.1111/biom.13332

Variance Estimation in Inverse Probability Weighted Cox Models

Di Shu 1,*, Jessica G Young 1, Sengwee Toh 1, Rui Wang 1,2
PMCID: PMC12534113  NIHMSID: NIHMS2114534  PMID: 32662087

Summary:

Inverse probability weighted Cox models can be used to estimate marginal hazard ratios under different point treatments in observational studies. To obtain variance estimates, the robust sandwich variance estimator is often recommended to account for the induced correlation among weighted observations. However, this estimator does not incorporate the uncertainty in estimating the weights and tends to overestimate the variance, leading to inefficient inference. Here we propose a new variance estimator that combines the estimation procedures for the hazard ratio and weights using stacked estimating equations, with additional adjustments for the fact that the Cox partial likelihood score equation is not a sum of independent and identically distributed terms. We prove analytically that the robust sandwich variance estimator is conservative and establish the asymptotic equivalence between the proposed variance estimator and one obtained through linearization by Hajage et al. (2018). In addition, we extend our proposed variance estimator to accommodate clustered data. We compare the finite sample performance of the proposed method with alternative methods through simulation studies. We illustrate these different variance methods in both independent and clustered data settings, using a bariatric surgery dataset and a multiple readmission dataset, respectively. To facilitate implementation of the proposed method, we have developed an R package ipwCoxCSV.

Keywords: clustered data, Cox model, inverse probability weighting, marginal hazard ratio, sandwich variance estimator

1. Introduction

Inverse probability weighting, a tool to address missing data or unequal selection probabilities, has been widely used in various fields such as causal inference (e.g., Rosenbaum, 1987; Lunceford and Davidian, 2004; Hernán and Robins, 2020) and survey sampling (e.g., Horvitz and Thompson, 1952; Pfeffermann, 1993; Höfler et al., 2005; Seaman and White, 2013; Miratrix et al., 2018). For time-to-event outcomes, the inverse probability weighted (IPW) Cox model is frequently used to estimate the marginal hazard ratio comparing hazard functions of counterfactual failure times under different point treatments in observational studies (Hernán and Robins, 2020). When the interest is in the comparison of binary point treatments (e.g., “treat” versus “do not treat”), the weights are a function of the estimated propensity score; i.e., the probability of receiving treatment conditional on the measured baseline covariates (Rosenbaum and Rubin, 1983). The consistency of the resulting estimator depends on several assumptions, including the assumptions of exchangeability between treated and untreated individuals given the baseline measured covariates, correct model specification, and consistent estimation of the propensity score.

The focus of this work is on variance estimation for the treatment effect estimators from IPW Cox models. Previous authors have discussed an efficiency paradox such that estimators constructed using the estimated nuisance parameters are more efficient than those constructed using the true values of these nuisance parameters (e.g., van der Laan and Robins, 2003; Henmi and Eguchi, 2004). Henmi and Eguchi (2004) gave a sufficient condition for this paradox based on the orthogonality of the components of the projected estimating functions (the projections of the score function onto a given set of estimating functions) corresponding to the parameters of interest and the nuisance parameters. In IPW estimation of average treatment effects for non-survival outcomes, it was found that estimating parameters in a propensity score model leads to a smaller asymptotic variance for the IPW estimator than using the true values (Lunceford and Davidian, 2004). Similarly, with survival outcomes, it has been noted that a robust sandwich variance estimator tends to be conservative in estimating the variance of an IPW estimator when ignoring the uncertainty in estimating the weights (e.g., Robins, 1997, 1999; Hernán et al., 2000). Given the convenient implementation of a robust sandwich variance estimator using off-the-shelf statistical software, this approach to variance estimation has become routine in practical applications of weighted analysis, including in IPW Cox estimation.

Austin (2016) confirmed in extensive simulations that the robust sandwich variance estimator tends to provide conservative estimates of the variance in the case of IPW estimators of Cox models. He suggested that, given the overestimation of the variance, which leads to wider confidence intervals and inefficient inference, bootstrap resampling (Efron and Tibshirani, 1993) should be used in place of the robust sandwich variance estimator. However, given the computational burden of the bootstrap method, an analytical formula for computing a consistent variance estimator is desirable. Analytical variance formulae for IPW estimators for non-survival outcomes have been proposed in various settings (Lunceford and Davidian, 2004; Williamson et al., 2014; Perez-Heydrich et al., 2014), using the standard M-estimation technique (Stefanski and Boos, 2002) based on stacked estimating equations for the treatment effect and the propensity score weights. However, the Cox partial likelihood score equation is not a sum of independently and identically distributed (i.i.d.) terms, making it challenging to apply the standard M-estimation technique to obtain a sandwich-type variance estimator for the hazard ratio. Mao et al. (2018) proposed to first poissonize the Cox model and then construct the stacked estimating equations, motivated by numerical findings that the poissonized likelihood gave point estimates nearly identical to those from the Cox model. This method involves penalized splines that require specification of the number and location of knots. Hajage et al. (2018) derived a closed-form variance formula using linearization (Deville, 1999). Their approach involved linearizing the Cox model and the propensity score weights to arrive at a variable whose dispersion can be used to approximate the variance.

In this paper, we take a different approach to derive an analytical variance formula, by directly correcting the available robust sandwich variance estimator (Lin and Wei, 1989; Binder, 1992) that ignores the uncertainty in weight estimation. Specifically, we combine the estimating equations for the propensity score weights with the estimating equation used for the robust sandwich variance estimation. In the "meat" part of the sandwich variance estimator, we approximate the original non-i.i.d. terms in the weighted partial likelihood score equation with the i.i.d. terms proposed by Lin and Wei (1989) and Binder (1992). We establish two properties of the proposed variance estimator. First, we show that it is asymptotically equivalent to the existing linearization estimator (Hajage et al., 2018). Second, we show that it is more efficient than the existing robust sandwich variance estimator through a direct comparison of the two formulae.

We further propose a new variance estimator for clustered data settings. Clustered data occur frequently in practice. For example, time to hospital-acquired infections may be correlated when patients have multiple hospital admissions. We extend the robust sandwich variance estimator proposed by Lee et al. (1992) to the IPW context, and use stacked estimating equations to account for the uncertainty in weight estimation.

The rest of the manuscript is organized as follows. In Section 2, we review the IPW estimation of Cox models. In Section 3, we review four existing variance estimation methods, and propose the corrected sandwich variance estimator for both independent and clustered data settings. We establish the relation between our corrected sandwich variance estimator and the linearization variance estimator and prove analytically that the robust sandwich variance estimator is conservative. In Section 4, we conduct simulation studies to evaluate the finite sample performance of the proposed method. For illustration, in Section 5 we perform IPW Cox analyses of two data applications representing independent and clustered data settings respectively. We conclude the paper with a discussion in Section 6.

2. Estimation of Marginal Hazard Ratios Using Inverse Probability Weighting

Observed Data Structure:

Consider a study in which the following are measured on each of $i = 1, \ldots, n$ individuals randomly sampled from a target population of interest (we initially assume individuals are i.i.d. and therefore suppress the $i$ subscript here): Let $X$ be a vector of measured baseline covariates, $A$ a binary treatment indicator ($A = 1$ if treated and $A = 0$ otherwise), and $T = \min(T^*, C)$, where $T^*$ is the event time and $C$ is the censoring time. We further define $\delta = I(T^* \le C)$, where $I(\cdot)$ is the indicator function. We assume that $C$ is independent of $(T^*, X)$ conditional on $A$.

Parameter of Interest:

We aim to estimate the log marginal hazard ratio θ of the model:

$$\lambda_a(t) = \lambda_0(t) \exp(\theta a), \tag{1}$$

where $\lambda_a(t)$ is the hazard function for $T_a^*$, the time to failure for a given individual in the study population that would have been observed had we set the treatment level $A = a$, for $a = 0$ or $1$.

IPW estimator of θ:

Inverse probability weighting effectively eliminates or reduces confounding bias such that the weighted data emulate data that would have been collected from a randomized controlled trial. We consider the IPW estimator θ^ that solves the weighted partial likelihood score equation (Cox, 1975; Lin and Wei, 1989; Binder, 1992) for θ

$$\sum_{i=1}^{n} \hat{w}_i \delta_i \left\{ A_i - \frac{\sum_{l \in R_i} \hat{w}_l \exp(A_l \theta) A_l}{\sum_{l \in R_i} \hat{w}_l \exp(A_l \theta)} \right\} = 0, \tag{2}$$

where $R_i = \{l : l = 1, \ldots, n,\ T_l \ge T_i\}$ is the risk set for individual $i$ who experiences an event at $T_i$ (i.e., $\delta_i = 1$), and $\hat{w}_i$ is an estimate of the weight $w_i$. Two types of weight are commonly used: the conventional inverse probability weight

$$w_i = w_{i,\mathrm{ipw}} = \frac{A_i}{e_i} + \frac{1 - A_i}{1 - e_i} \tag{3}$$

and the stabilized weight

$$w_i = w_{i,\mathrm{stab}} = \frac{P(A = 1) A_i}{e_i} + \frac{P(A = 0)(1 - A_i)}{1 - e_i}, \tag{4}$$

where $e_i = P(A_i = 1 \mid X_i)$ is the propensity score (Rosenbaum, 1987; Cole and Hernán, 2004, 2008). In an observational study, the propensity score $e_i$ and the treatment prevalence $P(A = 1)$ are unknown but may be estimated from the data. We consider the estimator $\hat{\theta}$ obtained from a logistic regression model for $e_i$ and, when stabilized weights are used, nonparametric estimation of the marginal treatment prevalence using the proportion treated in the sample.
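As a concrete illustration of weights (3) and (4), the following sketch fits the logistic propensity score model and forms both weight types. It is written in Python with NumPy rather than the authors' R package, and the function names are ours; the Newton-Raphson fit is a minimal stand-in for standard logistic regression software.

```python
import numpy as np

def fit_propensity(X, A, n_iter=25):
    """Fit a logistic propensity score model P(A=1|X) by Newton-Raphson
    and return the fitted propensity scores e_i (illustrative sketch)."""
    Z = np.column_stack([np.ones(len(A)), X])  # include 1 in the covariate vector
    gamma = np.zeros(Z.shape[1])
    for _ in range(n_iter):
        e = 1.0 / (1.0 + np.exp(-Z @ gamma))   # current propensity scores
        W = e * (1.0 - e)                      # logistic information weights
        gamma += np.linalg.solve(Z.T @ (W[:, None] * Z), Z.T @ (A - e))
    return 1.0 / (1.0 + np.exp(-Z @ gamma))

def ipw_weights(e, A, stabilized=False):
    """Conventional weights (3): A/e + (1-A)/(1-e); stabilized weights (4)
    additionally multiply each arm by the estimated treatment prevalence."""
    if stabilized:
        rho = A.mean()                         # proportion treated estimates P(A=1)
        return rho * A / e + (1.0 - rho) * (1.0 - A) / (1.0 - e)
    return A / e + (1.0 - A) / (1.0 - e)
```

Stabilized weights constructed this way average to roughly 1, which is one practical check that the weighting has been set up correctly.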

The consistency of $\hat{\theta}$ for the true value of $\theta$ relies on correct specification of the propensity score model. It also requires several identifying assumptions, including conditional exchangeability of treated and untreated individuals ($A$ independent of $T_a^*$ given $X$), positivity (individuals with $A = 1$ or $A = 0$ are possibly observed within all levels of $X$), and sufficiently well-defined counterfactual outcomes (Hernán and Robins, 2020).

3. Variance Estimation Methods for Marginal Hazard Ratios

In this section we describe five variance estimation methods for the IPW estimator θ^. In Section 3.1, we review four existing variance estimation methods. In Section 3.2, we propose the corrected sandwich variance estimator and establish its relation with the linearization estimator and the standard robust sandwich variance estimator. We also give an extension of our estimator that handles clustered data.

3.1. Review of four existing variance estimation methods

3.1.1. Naive likelihood-based variance estimator.

With the estimated weights in (2) treated as known constants, applying partial likelihood-based variance estimation (Andersen and Gill, 1982) leads to the naive likelihood-based variance estimator for $\hat{\theta}$:

$$\widehat{\operatorname{var}}_{\mathrm{NL}}(\hat{\theta}) = \left\{ -\sum_{i=1}^{n} \frac{\partial \psi_i^{*}(\hat{\theta})}{\partial \theta} \right\}^{-1}, \tag{5}$$

where

$$\psi_i^{*}(\theta) = \hat{w}_i \delta_i \left\{ A_i - \frac{\sum_{l \in R_i} \hat{w}_l \exp(A_l \theta) A_l}{\sum_{l \in R_i} \hat{w}_l \exp(A_l \theta)} \right\}.$$

In addition to ignoring the uncertainty in weight estimation, the naive likelihood-based variance estimator (5) incorrectly assumes independence among the weighted observations and thus is biased in general.

3.1.2. Robust sandwich variance estimator.

To help protect against model misspecification, Lin and Wei (1989) developed the robust sandwich variance estimator for partial likelihood estimates of Cox model parameters, and Binder (1992) extended their results to incorporate known constant weights.

The weighted robust sandwich variance estimator, which replaces the true weights $w_i$ with their estimates $\hat{w}_i$, $i = 1, \ldots, n$, is given by

$$\widehat{\operatorname{var}}_{\mathrm{RS}}(\hat{\theta}) = \left\{ -\sum_{i=1}^{n} \frac{\partial \psi_i^{*}(\hat{\theta})}{\partial \theta} \right\}^{-1} \left[ \sum_{i=1}^{n} \eta_i^{*}(\hat{\theta}) \{\eta_i^{*}(\hat{\theta})\}^{\mathrm{T}} \right] \left[ \left\{ -\sum_{i=1}^{n} \frac{\partial \psi_i^{*}(\hat{\theta})}{\partial \theta} \right\}^{-1} \right]^{\mathrm{T}}, \tag{6}$$

where

$$\eta_i^{*}(\hat{\theta}) = \hat{w}_i \delta_i \left\{ A_i - \frac{S_1(i)}{S_0(i)} \right\} - \hat{w}_i A_i \exp(A_i \hat{\theta}) \sum_{j=1}^{n} \frac{\delta_j \hat{w}_j I(T_j \le T_i)}{S_0(j)} + \hat{w}_i \exp(A_i \hat{\theta}) \sum_{j=1}^{n} \frac{\delta_j \hat{w}_j I(T_j \le T_i) S_1(j)}{S_0^2(j)},$$

$$S_0(i) = \sum_{l \in R_i} \hat{w}_l \exp(A_l \hat{\theta}) \quad \text{and} \quad S_1(i) = \sum_{l \in R_i} \hat{w}_l \exp(A_l \hat{\theta}) A_l.$$

Since the log marginal hazard ratio is a scalar, both $\psi_i^{*}(\hat{\theta})$ and $\eta_i^{*}(\hat{\theta})$ are scalars, and the robust sandwich variance estimator can be re-written as

$$\widehat{\operatorname{var}}_{\mathrm{RS}}(\hat{\theta}) = \left\{ -\sum_{i=1}^{n} \frac{\partial \psi_i^{*}(\hat{\theta})}{\partial \theta} \right\}^{-2} \sum_{i=1}^{n} \{\eta_i^{*}(\hat{\theta})\}^2. \tag{7}$$
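In scalar form, (7) is simple to evaluate once the per-subject quantities are available. The following minimal Python sketch (the arrays are illustrative stand-ins for the derivative terms and the Lin-Wei/Binder residuals, which would come from a fitted weighted Cox model):

```python
import numpy as np

def robust_sandwich_var(dpsi, eta):
    """Scalar robust sandwich variance (7): {-sum dpsi_i}^{-2} * sum eta_i^2.
    dpsi: per-subject derivatives of the weighted score with respect to theta;
    eta:  per-subject residual terms eta_i*(theta-hat)."""
    bread = -np.sum(dpsi)              # minus the summed score derivative
    return np.sum(eta ** 2) / bread ** 2
```

For example, with four subjects each contributing a derivative of $-2$ and a residual of $1$, the estimator is $4 / 8^2 = 0.0625$.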

Because the robust sandwich variance estimator (6) or (7) treats the estimated weights as known constants, it does not take into account the uncertainty in weight estimation and is generally a biased estimator of the true variance of θ^.

3.1.3. Bootstrap variance estimator.

The bootstrap method (Efron and Tibshirani, 1993) has been frequently used to obtain the variance of estimators. In the current context, one resamples data at the individual level with replacement $M$ times, for a user-specified $M$ (e.g., $M = 500$), to construct $M$ bootstrap samples, each containing the same number of observations as the original data. For each bootstrap sample $m = 1, \ldots, M$, the entire estimation algorithm is repeated, including estimation of the propensity score and the corresponding weights, to obtain an estimate of the log hazard ratio $\theta$ under model (1). Denote the estimate for sample $m$ by $\hat{\theta}_m$. The bootstrap variance estimator is then given by

$$\widehat{\operatorname{var}}_{\mathrm{BOOT}}(\hat{\theta}) = \frac{1}{M - 1} \sum_{m=1}^{M} \left( \hat{\theta}_m - \frac{1}{M} \sum_{m=1}^{M} \hat{\theta}_m \right)^2. \tag{8}$$

Note that, because the propensity score and the weights are re-estimated in each bootstrap sample, the bootstrap variance estimator (8) incorporates the uncertainty in weight estimation. Austin (2016) found that the performance of the bootstrap variance estimator was superior to the commonly used robust sandwich variance estimator in his simulations.
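The resampling scheme in (8) can be sketched generically as follows (Python; `estimator` is a hypothetical user-supplied function that re-runs the whole pipeline, including weight estimation, on a resampled dataset):

```python
import numpy as np

def bootstrap_variance(data, estimator, M=500, seed=0):
    """Nonparametric bootstrap variance (8): resample individuals with
    replacement M times, re-run the entire estimation on each resample,
    and take the sample variance of the M estimates."""
    rng = np.random.default_rng(seed)
    n = len(data)
    est = np.array([estimator(data[rng.integers(0, n, size=n)])
                    for _ in range(M)])
    return est.var(ddof=1)     # (1/(M-1)) * sum (theta_m - mean)^2
```

Crucially, `estimator` must re-estimate the propensity score inside each call; resampling around a fixed set of weights would not capture the weight-estimation uncertainty that the text emphasizes.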

3.1.4. Linearization variance estimator.

Hajage et al. (2018) derived an analytical variance formula for the IPW estimator θ^ that is the solution to (2) using an influence function technique (Deville, 1999). Specifically, they showed that the variance can be approximated by the dispersion of a linearized variable divided by sample size. In their derivation, linearization was conducted for both the Cox model and the propensity score weights.

Their proposed linearization variance estimator is

$$\widehat{\operatorname{var}}_{\mathrm{LIN}}(\hat{\theta}) = \frac{1}{n(n - 1)} \sum_{i=1}^{n} \left( \hat{L}_i - \frac{1}{n} \sum_{i=1}^{n} \hat{L}_i \right)^2, \tag{9}$$

where $\hat{L}_i$, $i = 1, \ldots, n$, are the linearized terms. Specifically, define

$$\hat{L}_{0i} = \delta_i \left\{ A_i - \frac{S_1(i)}{S_0(i)} \right\} - \exp(\hat{\theta} A_i) \sum_{j=1}^{n} \frac{\hat{w}_j \delta_j I(T_j \le T_i)}{S_0(j)} \left\{ A_i - \frac{S_1(j)}{S_0(j)} \right\},$$

$$\hat{U} = \frac{1}{n} \sum_{j=1}^{n} \hat{e}_j (1 - \hat{e}_j) X_j X_j^{\mathrm{T}}, \quad \text{and} \quad \hat{V} = \frac{1}{n} \sum_{j=1}^{n} \hat{w}_j \delta_j \frac{S_1(j)}{S_0(j)} \left\{ 1 - \frac{S_1(j)}{S_0(j)} \right\}.$$

For the conventional inverse probability weights (3), the linearized term $\hat{L}_i$ in (9) is

$$\hat{L}_{1i} = \hat{V}^{-1} \left\{ \hat{w}_i \hat{L}_{0i} + \hat{d}_1^{\mathrm{T}} (A_i - \hat{e}_i) X_i \right\},$$

where

$$\hat{d}_1 = \hat{U}^{-1} \frac{1}{n} \sum_{j=1}^{n} \left\{ -\frac{A_j (1 - \hat{e}_j)}{\hat{e}_j} + \frac{(1 - A_j) \hat{e}_j}{1 - \hat{e}_j} \right\} \hat{L}_{0j} X_j,$$

and $\hat{e}_i$ is the estimated propensity score for $i = 1, \ldots, n$. For the stabilized weights (4), the linearized term $\hat{L}_i$ in (9) is given by $\hat{L}_{2i} = \hat{V}^{-1} \{ \hat{w}_i \hat{L}_{0i} + \hat{d}_2 (A_i - \hat{\rho}) + \hat{d}_3^{\mathrm{T}} (A_i - \hat{e}_i) X_i \}$, where

$$\hat{d}_2 = \frac{1}{n} \sum_{j=1}^{n} \left( \frac{A_j}{\hat{e}_j} - \frac{1 - A_j}{1 - \hat{e}_j} \right) \hat{L}_{0j}$$

and

$$\hat{d}_3 = \hat{U}^{-1} \frac{1}{n} \sum_{j=1}^{n} \left\{ -\frac{A_j \hat{\rho} (1 - \hat{e}_j)}{\hat{e}_j} + \frac{(1 - A_j)(1 - \hat{\rho}) \hat{e}_j}{1 - \hat{e}_j} \right\} \hat{L}_{0j} X_j.$$

3.2. The proposed corrected sandwich variance estimator

3.2.1. Theoretical development.

We derive a new analytical variance estimator under either the conventional inverse probability weights (3) or stabilized weights (4). Our method extends the robust sandwich variance estimation to account for the uncertainty in estimating weights. We refer to the proposed estimator as the corrected sandwich variance estimator.

First, we develop the variance estimator with the conventional inverse probability weights (3). Let $\gamma$ denote the vector of parameters in the propensity score model, which is specified as a logistic regression model. The corresponding system of estimating equations for $\beta = (\theta, \gamma^{\mathrm{T}})^{\mathrm{T}}$ is given by

$$\sum_{i=1}^{n} \Phi_i(\theta, \gamma) = \begin{pmatrix} \sum_{i=1}^{n} \psi_i(\theta, \gamma) \\[1ex] \sum_{i=1}^{n} \pi_i(\gamma) \end{pmatrix} = \begin{pmatrix} \sum_{i=1}^{n} w_i \delta_i \left\{ A_i - \dfrac{\sum_{l \in R_i} w_l \exp(A_l \theta) A_l}{\sum_{l \in R_i} w_l \exp(A_l \theta)} \right\} \\[2ex] \sum_{i=1}^{n} \left[ A_i - 1/\{1 + \exp(-\gamma^{\mathrm{T}} X_i)\} \right] X_i \end{pmatrix} = \mathbf{0}, \tag{10}$$

where $w_i$ is individual $i$'s conventional weight defined by (3) and estimated using the score function $\pi_i(\gamma)$ of the logistic propensity score model (with 1 included in the vector of covariates). Solving (10) for $(\theta, \gamma^{\mathrm{T}})^{\mathrm{T}}$ gives $(\hat{\theta}, \hat{\gamma}^{\mathrm{T}})^{\mathrm{T}}$, where $\hat{\theta}$ is the estimated log hazard ratio and $\hat{\gamma}$ is the vector of estimated propensity score model parameters.

A standard application of M-estimation (e.g., Stefanski and Boos, 2002) to (10) is complicated by the fact that the partial likelihood score equation is not a sum of i.i.d. terms. We propose to estimate the variance of $\hat{\beta} = (\hat{\theta}, \hat{\gamma}^{\mathrm{T}})^{\mathrm{T}}$ by adapting the strategy of Lin and Wei (1989) and Binder (1992) to get around this non-i.i.d. problem.

In Web Appendix A, we prove that the variance of β^ can be consistently estimated by

$$\widehat{\operatorname{var}}_{\mathrm{CS}}(\hat{\beta}) = A(\hat{\beta})^{-1} B(\hat{\beta}) \left\{ A(\hat{\beta})^{-1} \right\}^{\mathrm{T}}, \tag{11}$$

where $A(\hat{\beta}) = -\sum_{i=1}^{n} \partial \Phi_i(\hat{\beta}) / \partial \beta^{\mathrm{T}}$ and $B(\hat{\beta}) = \sum_{i=1}^{n} \Omega_i(\hat{\beta}) \Omega_i(\hat{\beta})^{\mathrm{T}}$, with $\Omega_i(\hat{\beta}) = (\eta_i(\hat{\theta}, \hat{\gamma}), \pi_i(\hat{\gamma})^{\mathrm{T}})^{\mathrm{T}}$ and $\eta_i(\hat{\theta}, \hat{\gamma})$ given by

$$\hat{w}_i \delta_i \left\{ A_i - \frac{S_1(i)}{S_0(i)} \right\} - \hat{w}_i A_i \exp(A_i \hat{\theta}) \sum_{j=1}^{n} \frac{\delta_j \hat{w}_j I(T_j \le T_i)}{S_0(j)} + \hat{w}_i \exp(A_i \hat{\theta}) \sum_{j=1}^{n} \frac{\delta_j \hat{w}_j I(T_j \le T_i) S_1(j)}{S_0^2(j)}.$$

The element in the first row and first column of the matrix $\widehat{\operatorname{var}}_{\mathrm{CS}}(\hat{\beta})$, denoted by $\widehat{\operatorname{var}}_{\mathrm{CS}}(\hat{\theta})$, is the proposed variance estimator for $\hat{\theta}$.
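The assembly of (11) from the stacked pieces is mechanical once $A(\hat{\beta})$ and the per-subject terms $\Omega_i(\hat{\beta})$ are computed; a generic Python sketch (the array names are ours, and the inputs would come from the fitted weighted Cox and propensity score models):

```python
import numpy as np

def corrected_sandwich(A_mat, Omega):
    """Corrected sandwich variance (11): A^{-1} B A^{-T}, with
    B = sum_i Omega_i Omega_i^T. Returns the full matrix; its (0, 0)
    entry is the proposed variance estimate for theta-hat.
    A_mat: (p, p) minus the summed derivative matrix of the Phi_i;
    Omega: (n, p) per-subject stacked terms (eta_i, pi_i^T)."""
    B = Omega.T @ Omega                      # sum of outer products
    Ainv = np.linalg.inv(A_mat)
    return Ainv @ B @ Ainv.T

# var_CS(theta-hat) would be corrected_sandwich(A_mat, Omega)[0, 0]
```

Because $\Omega_i$ stacks the Cox residual term with the propensity score equations, the off-diagonal blocks of $B$ propagate the weight-estimation uncertainty into the variance of $\hat{\theta}$.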

Let $\rho$ denote the treatment prevalence. Under the stabilized weights (4), we define a system of estimating equations for $\beta = (\theta, \gamma^{\mathrm{T}}, \rho)^{\mathrm{T}}$:

$$\sum_{i=1}^{n} \Phi_i(\theta, \gamma, \rho) = \begin{pmatrix} \sum_{i=1}^{n} \psi_i(\theta, \gamma, \rho) \\[1ex] \sum_{i=1}^{n} \pi_i(\gamma) \\[1ex] \sum_{i=1}^{n} \sigma_i(\rho) \end{pmatrix} = \begin{pmatrix} \sum_{i=1}^{n} w_i \delta_i \left\{ A_i - \dfrac{\sum_{l \in R_i} w_l \exp(A_l \theta) A_l}{\sum_{l \in R_i} w_l \exp(A_l \theta)} \right\} \\[2ex] \sum_{i=1}^{n} \left[ A_i - 1/\{1 + \exp(-\gamma^{\mathrm{T}} X_i)\} \right] X_i \\[1ex] \sum_{i=1}^{n} (A_i - \rho) \end{pmatrix} = \mathbf{0}, \tag{12}$$

where $w_i$ is given by (4), and $\psi_i(\theta, \gamma, \rho)$, $\pi_i(\gamma)$, and $\sigma_i(\rho)$ are the partial likelihood score function for the weighted Cox model, the score function for the logistic propensity score model (with 1 included in the vector of covariates), and the estimating function for the treatment prevalence, respectively.

Solving (12) for $(\theta, \gamma^{\mathrm{T}}, \rho)^{\mathrm{T}}$ gives $(\hat{\theta}, \hat{\gamma}^{\mathrm{T}}, \hat{\rho})^{\mathrm{T}}$, where $\hat{\theta}$ is the estimated log hazard ratio, $\hat{\gamma}$ is the vector of estimated propensity score model parameters, and $\hat{\rho}$ is the estimated treatment prevalence. The variance estimator for $\hat{\beta} = (\hat{\theta}, \hat{\gamma}^{\mathrm{T}}, \hat{\rho})^{\mathrm{T}}$ under estimating equations (12) can be derived in the same way as the variance estimator (11) under estimating equations (10).

3.2.2. Comparison with the linearization and robust sandwich estimators.

Both the proposed variance estimator $\widehat{\operatorname{var}}_{\mathrm{CS}}(\hat{\theta})$ and the linearization variance estimator $\widehat{\operatorname{var}}_{\mathrm{LIN}}(\hat{\theta})$ developed by Hajage et al. (2018) incorporate the uncertainty in the estimation of the propensity score weights. In this section we establish the connection between these two analytical estimators.

Although derived from different approaches, the two estimators are asymptotically equivalent. This is justified by showing that $\widehat{\operatorname{var}}_{\mathrm{CS}}(\hat{\theta})$ can be re-written as the empirical second moment of the linearized variable divided by $n$. Below is a sketch of the proof with conventional weights; the detailed proof for both types of weights is available in Web Appendix B.

We re-write A(β^) and B(β^) in block matrix form as

$$A(\hat{\beta}) = \begin{pmatrix} A_{11} & A_{12} \\ 0 & A_{22} \end{pmatrix} \quad \text{and} \quad B(\hat{\beta}) = \begin{pmatrix} B_{11} & B_{12} \\ B_{12}^{\mathrm{T}} & B_{22} \end{pmatrix}.$$

It can be shown that $\widehat{\operatorname{var}}_{\mathrm{CS}}(\hat{\theta})$, the element in the first row and first column of $\widehat{\operatorname{var}}_{\mathrm{CS}}(\hat{\beta}) = A(\hat{\beta})^{-1} B(\hat{\beta}) \{A(\hat{\beta})^{-1}\}^{\mathrm{T}}$, is given by

$$\widehat{\operatorname{var}}_{\mathrm{CS}}(\hat{\theta}) = \frac{1}{A_{11}^2} B_{11} - \frac{2}{A_{11}^2} B_{12} A_{22}^{-1} A_{12}^{\mathrm{T}} + \frac{1}{A_{11}^2} A_{12} A_{22}^{-1} B_{22} (A_{22}^{-1})^{\mathrm{T}} A_{12}^{\mathrm{T}}.$$

On the other hand, it can be shown that

$$\sum_{i=1}^{n} \hat{L}_{1i}^2 / n^2 = \frac{1}{A_{11}^2} \left( B_{11} + 2 \hat{d}_1^{\mathrm{T}} B_{12}^{\mathrm{T}} + \hat{d}_1^{\mathrm{T}} B_{22} \hat{d}_1 \right).$$

By further showing that $\hat{d}_1 = -A_{22}^{-1} A_{12}^{\mathrm{T}}$, we obtain

$$\widehat{\operatorname{var}}_{\mathrm{CS}}(\hat{\theta}) = \sum_{i=1}^{n} \hat{L}_{1i}^2 / n^2, \tag{13}$$

where $\hat{L}_{1i}$ is the linearized term for $i = 1, \ldots, n$. By (9) and (13), $n \, \widehat{\operatorname{var}}_{\mathrm{CS}}(\hat{\theta})$ is the empirical second moment of the linearized variable, and $n \, \widehat{\operatorname{var}}_{\mathrm{LIN}}(\hat{\theta})$ is, up to the factor $n/(n-1)$, its sample variance. Because the variance coincides with the second moment for a mean-zero variable, $\widehat{\operatorname{var}}_{\mathrm{CS}}(\hat{\theta})$ and $\widehat{\operatorname{var}}_{\mathrm{LIN}}(\hat{\theta})$ are asymptotically equivalent.

While it is well-known that the standard robust sandwich variance estimator is conservative, the development of a correct variance formula allows explicit comparisons of the two formulae. In Web Appendix C, we derive the large sample difference matrix between the proposed and robust sandwich variance estimators, which is negative definite. In addition to providing an explicit proof that the robust sandwich variance estimator is conservative, examining the components of this difference matrix may provide insights into which settings result in large or negligible differences between the standard robust sandwich variance estimator and the proposed corrected estimator.

3.2.3. Extension to handle clustered data.

We again use the stacked estimating equations approach to develop a variance estimator for IPW Cox model with clustered data (e.g., time to hospital-acquired infection when there are multiple hospital admissions for the same patient). In unweighted situations, Lee et al. (1992) proposed a robust sandwich variance estimator for Cox regression when the data consists of a large number of independent small-size clusters of correlated failure time observations. We consider its extension to the IPW context and correct the corresponding robust sandwich variance estimator by further accounting for the estimating equations for the propensity score weights.

Suppose cluster $i$ contributes $K_i$ failure times, for $i = 1, \ldots, n$ and $k = 1, \ldots, K_i$, where $K_i$ is relatively small compared to $n$. For the $k$th failure time of cluster $i$, let $X_{ik}$ be the baseline covariates, $A_{ik}$ the treatment indicator, $T_{ik} = \min(T_{ik}^*, C_{ik})$, where $T_{ik}^*$ is the event time and $C_{ik}$ is the censoring time, and $\delta_{ik}$ the event indicator. Note that $(X_{ik}^{\mathrm{T}}, A_{ik})^{\mathrm{T}}$ may contain cluster-level factors, which are $k$-invariant.

We now develop the variance estimator under the conventional inverse probability weights (3). Let $\gamma$ denote the logistic propensity score model parameters. The corresponding system of estimating equations for $\beta = (\theta, \gamma^{\mathrm{T}})^{\mathrm{T}}$ is given by

$$\sum_{i=1}^{n} \sum_{k=1}^{K_i} \Phi_{i,k}(\theta, \gamma) = \begin{pmatrix} \sum_{i=1}^{n} \sum_{k=1}^{K_i} \psi_{i,k}(\theta, \gamma) \\[1ex] \sum_{i=1}^{n} \sum_{k=1}^{K_i} \pi_{i,k}(\gamma) \end{pmatrix} = \mathbf{0}, \tag{14}$$

where

$$\psi_{i,k}(\theta, \gamma) = w_{ik} \delta_{ik} \left\{ A_{ik} - \frac{\sum_{j=1}^{n} \sum_{l=1}^{K_j} I(T_{jl} \ge T_{ik}) w_{jl} \exp(A_{jl} \theta) A_{jl}}{\sum_{j=1}^{n} \sum_{l=1}^{K_j} I(T_{jl} \ge T_{ik}) w_{jl} \exp(A_{jl} \theta)} \right\}$$

is the partial likelihood score function for the weighted Cox model,

$$\pi_{i,k}(\gamma) = \left[ A_{ik} - 1/\{1 + \exp(-\gamma^{\mathrm{T}} X_{ik})\} \right] X_{ik}$$

is the score function for the logistic propensity score model (with 1 included in the vector of covariates), and $w_{ik}$ is given by (3). Solving (14) for $(\theta, \gamma^{\mathrm{T}})^{\mathrm{T}}$ gives $(\hat{\theta}, \hat{\gamma}^{\mathrm{T}})^{\mathrm{T}}$, denoted by $\hat{\beta}$.

Similar to the development in Section 3.2.1, we derive the corrected sandwich variance estimator of $\hat{\beta}$, given by $\widehat{\operatorname{var}}_{\mathrm{CS}}(\hat{\beta}) = A(\hat{\beta})^{-1} B(\hat{\beta}) \{A(\hat{\beta})^{-1}\}^{\mathrm{T}}$, where $A(\hat{\beta}) = -\sum_{i=1}^{n} \sum_{k=1}^{K_i} \partial \Phi_{i,k}(\hat{\beta}) / \partial \beta^{\mathrm{T}}$ and $B(\hat{\beta}) = \sum_{i=1}^{n} \Omega_i(\hat{\beta}) \Omega_i(\hat{\beta})^{\mathrm{T}}$, with $\Omega_i(\hat{\beta}) = \left( \sum_{k=1}^{K_i} \eta_{i,k}(\hat{\theta}, \hat{\gamma}), \sum_{k=1}^{K_i} \pi_{i,k}(\hat{\gamma})^{\mathrm{T}} \right)^{\mathrm{T}}$ and

$$\eta_{i,k}(\hat{\theta}, \hat{\gamma}) = \hat{w}_{ik} \delta_{ik} \left\{ A_{ik} - \frac{S_1(i,k)}{S_0(i,k)} \right\} - \hat{w}_{ik} A_{ik} \exp(A_{ik} \hat{\theta}) \sum_{j=1}^{n} \sum_{l=1}^{K_j} \frac{\delta_{jl} \hat{w}_{jl} I(T_{jl} \le T_{ik})}{S_0(j,l)} + \hat{w}_{ik} \exp(A_{ik} \hat{\theta}) \sum_{j=1}^{n} \sum_{l=1}^{K_j} \frac{\delta_{jl} \hat{w}_{jl} I(T_{jl} \le T_{ik}) S_1(j,l)}{S_0^2(j,l)},$$

where $S_1(i,k) = \sum_{j=1}^{n} \sum_{l=1}^{K_j} I(T_{jl} \ge T_{ik}) \hat{w}_{jl} \exp(A_{jl} \hat{\theta}) A_{jl}$ and $S_0(i,k) = \sum_{j=1}^{n} \sum_{l=1}^{K_j} I(T_{jl} \ge T_{ik}) \hat{w}_{jl} \exp(A_{jl} \hat{\theta})$. The element in the first row and first column of the matrix $\widehat{\operatorname{var}}_{\mathrm{CS}}(\hat{\beta})$, denoted by $\widehat{\operatorname{var}}_{\mathrm{CS}}(\hat{\theta})$, is the proposed variance estimator for $\hat{\theta}$. Similarly, the variance estimator under stabilized weights (4) can be obtained by further including the estimating equation for the treatment prevalence, i.e., $\sum_{i=1}^{n} \sum_{k=1}^{K_i} (A_{ik} - \rho) = 0$. Note that $\eta_{i,k}(\hat{\theta}, \hat{\gamma})$ is the key to addressing the non-i.i.d. issue: in the unweighted case it reduces to the result of Lee et al. (1992), and in the non-clustered case it reduces to the result of Binder (1992).

4. Simulation Studies

We conducted simulation studies to compare the finite sample performance of the proposed corrected sandwich variance estimation method with alternative methods in two settings: without clustering and with clustering. In Setting 1, we compared the proposed estimator with the naive likelihood-based variance estimator, the robust sandwich variance estimator, the bootstrap variance estimator, and the linearization variance estimator. In Setting 2, we compared the proposed estimator with the clustered-version robust sandwich variance estimator (Lee et al., 1992) and the cluster bootstrap variance estimator (Davison and Hinkley, 1997; Field and Welsh, 2007).

4.1. Setting 1: without clustering

4.1.1. Data generation and simulation scenarios.

To simulate data that exactly followed model (1), we adapted the simulation method of Young et al. (2008), initially designed for time-varying treatment settings, to our point-treatment setting. Specifically, for $i = 1, \ldots, n$ individuals, we simulated the following (we assumed individuals were i.i.d. and therefore suppressed the $i$ subscript):

Step 1: counterfactual event time under $A = 0$, $T_0^*$, generated from an exponential distribution with constant hazard rate $\lambda_0 = 0.01$.

Step 2: vector of covariates $X = (X^{(1)}, X^{(2)}, X^{(3)})^{\mathrm{T}}$, where $X^{(1)} = 0.5(T_0^* + 0.2)/(T_0^* + 1) + 0.3Z$ and $X^{(2)} = 1/\log(1.3 T_0^* + 3) - 0.3Z$, with $Z$ following the standard normal distribution, and $X^{(3)}$ a binary variable with $P(X^{(3)} = 1 \mid T_0^*) = 0.3 + 0.5/(T_0^* + 1)$.

Step 3: treatment indicator $A$ generated by setting the probability of being treated to $1/[1 + \exp\{\gamma_0 + X^{(1)} - X^{(2)} - X^{(3)}\}]$, where the parameter $\gamma_0$ was chosen such that the treatment prevalence was about 10%, 20%, 30%, 40%, or 50%.

Step 4: event time $T^* = T_0^* \exp(-\theta A)$, where $\theta$ was specified as $\log(0.8)$ so that the true marginal hazard ratio was 0.8.

Step 5: censoring time $C$ generated from an exponential distribution whose rate was chosen to yield a censoring rate of about 20%, 40%, 60%, or 80%, to feature different degrees of censoring; we then calculated $T = \min(T^*, C)$ and $\delta = I(T^* \le C)$.
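The five steps above can be sketched in NumPy as follows. This is an illustrative reimplementation, not the authors' code: the censoring rate used here is a fixed example value rather than the rate calibrated to the target censoring percentages, and $\gamma_0$ is left as an argument.

```python
import numpy as np

def simulate_setting1(n, gamma0, theta=np.log(0.8), c_rate=0.002, seed=0):
    """Sketch of simulation Steps 1-5 for Setting 1 (without clustering)."""
    rng = np.random.default_rng(seed)
    T0 = rng.exponential(scale=1 / 0.01, size=n)      # Step 1: hazard lambda0 = 0.01
    Z = rng.normal(size=n)
    X1 = 0.5 * (T0 + 0.2) / (T0 + 1) + 0.3 * Z        # Step 2: covariates
    X2 = 1 / np.log(1.3 * T0 + 3) - 0.3 * Z
    X3 = rng.binomial(1, 0.3 + 0.5 / (T0 + 1))
    pA = 1 / (1 + np.exp(gamma0 + X1 - X2 - X3))      # Step 3: P(treated)
    A = rng.binomial(1, pA)
    Tstar = T0 * np.exp(-theta * A)                   # Step 4: true marginal HR = 0.8
    C = rng.exponential(scale=1 / c_rate, size=n)     # Step 5: exponential censoring
    T = np.minimum(Tstar, C)
    delta = (Tstar <= C).astype(int)
    return X1, X2, X3, A, T, delta
```

Because the covariates are functions of $T_0^*$, they confound the treatment-outcome relationship, which is what makes the weighting necessary in the subsequent analysis.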

We considered sample sizes of 250 and 5000 and ran 1000 simulations for each parameter configuration. Five hundred bootstrap samples were used for the bootstrap variance estimator.

4.1.2. Results.

Similar to Austin (2016), to evaluate the accuracy of the proposed variance estimator in comparison with the other four methods, we examined the ratio of the average standard error (ASE) to the empirical standard error (ESE), where ASE was calculated as the average of the estimated standard errors for $\hat{\theta}$ (obtained using each variance estimation method) across 1000 simulation runs, and ESE was calculated as the empirical estimate of the standard error (i.e., the square root of the sample variance of $\hat{\theta}$ across 1000 simulation runs). ESE directly measures the uncertainty in estimating the log marginal hazard ratio and reflects the true variability. With an adequate sample size, a consistent variance estimator is expected to have an ASE to ESE ratio close to 1.
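Computationally, the ASE/ESE metric is just a ratio of two summaries across simulation runs; a minimal sketch (Python, with the function name ours):

```python
import numpy as np

def ase_ese_ratio(theta_hats, se_hats):
    """ASE/ESE across simulation runs: the average estimated standard error
    divided by the empirical SD of the point estimates. A well-calibrated
    variance estimator gives a ratio near 1; > 1 indicates overestimation."""
    ase = np.mean(se_hats)                 # average of the estimated SEs
    ese = np.std(theta_hats, ddof=1)       # empirical SD of the estimates
    return ase / ese
```

A ratio of 1.4, as reported below for the robust sandwich estimator in some scenarios, means the reported standard errors are on average 40% larger than the true variability.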

Figures 1 and 2 depict the ratios of ASE to ESE for the five variance estimation methods under various combinations of censoring rates and treatment prevalence. As expected, the proposed variance estimator generally produced ASE to ESE ratios fairly close to 1 with $n = 5000$ (censoring rates ranging from 20% to 80%, treatment prevalence ranging from 10% to 50%), indicating that it estimated the variance with high accuracy when the sample size was adequate. With a small sample size of $n = 250$, the ratios of ASE to ESE could be noticeably smaller than 1, especially when the treatment prevalence was far from 50% and the censoring rate was high, implying that the proposed variance estimator may underestimate the variance when the number of events is small within one or both treatment groups. The robust sandwich variance estimator tended to produce ratios greater than 1, suggesting a tendency to overestimate the truth; in some scenarios, it produced ratios as high as 1.4, implying a 40% overestimation of the variance. The naive likelihood-based variance method severely underestimated the variance under the conventional inverse probability weights; under stabilized weights, it could underestimate or overestimate the variance. The linearization method performed similarly to the proposed method, as seen from the overlapping lines in the figures. With a large sample size of $n = 5000$, the bootstrap method performed well; with a small sample size of $n = 250$, it severely overestimated the variance under high censoring rates and low treatment prevalence, likely due to extreme estimates in some bootstrap samples.

Figure 1:


Ratios of average standard error (ASE) to empirical standard error (ESE) with n=250. Total number of failure events is about 200, 150, 100, or 50. This figure appears in color in the electronic version of this article, and any mention of color refers to that version.

Figure 2:


Ratios of average standard error (ASE) to empirical standard error (ESE) with n=5000. Total number of failure events is about 4000, 3000, 2000, or 1000. This figure appears in color in the electronic version of this article, and any mention of color refers to that version.

We further examined the empirical coverage rates of the corresponding 95% confidence intervals obtained using the five variance estimation methods, where an empirical coverage rate was calculated as the percentage of 95% confidence intervals in 1000 simulation runs that covered the true log marginal hazard ratio. Results are summarized in Figures 3 and 4. As in Austin (2016), we drew three horizontal lines (at 93.65%, 95%, and 96.35%) to indicate a plausible range of coverage rates. Based on 1000 simulation runs, a consistent variance estimator is expected to have empirical coverage rates that fluctuate around 95% and are mostly contained within the interval (93.65%, 96.35%).
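The empirical coverage computation described above can be sketched as follows (Python; Wald intervals $\hat{\theta} \pm 1.96 \times \mathrm{SE}$ are assumed, and the function name is ours):

```python
import numpy as np

def coverage_rate(theta_hats, se_hats, theta_true, z=1.96):
    """Percent of nominal 95% Wald intervals theta_hat +/- z*SE that
    cover the true log marginal hazard ratio across simulation runs."""
    lo = theta_hats - z * se_hats
    hi = theta_hats + z * se_hats
    return 100.0 * np.mean((lo <= theta_true) & (theta_true <= hi))
```

For instance, if two of three intervals cover the truth, the function returns 66.67; across 1000 runs the binomial Monte Carlo error around a true 95% rate is what motivates the (93.65%, 96.35%) band.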

Figure 3:


Empirical coverage rates in percent with n=250. Total number of failure events is about 200 or 50. The right panel shows a zoom-in version of the left panel. This figure appears in color in the electronic version of this article, and any mention of color refers to that version.

Figure 4:


Empirical coverage rates in percent with n=5000. Total number of failure events is about 4000 or 1000. The right panel shows a zoom-in version of the left panel. This figure appears in color in the electronic version of this article, and any mention of color refers to that version.

In most scenarios, the proposed method produced empirical coverage rates close to 95% and within range, although it tended to result in undercoverage with high censoring rates and low treatment prevalence. As anticipated, the robust sandwich variance estimator tended to produce conservative confidence intervals with empirical coverage higher than 95% (and 96.35%), due to its overestimation of the variance; in some scenarios, its empirical coverage rates were nearly 100%. The naive likelihood-based variance method produced severe undercoverage under the conventional inverse probability weights due to its underestimation of the variance. Using stabilized weights, its coverage rates behaved much better than under the conventional weights, but could still fall outside the range (93.65%, 96.35%). Results from the linearization method were almost the same as those from the proposed method, as shown by the overlapping lines in the figures. With a large sample size of $n = 5000$, the bootstrap method produced reasonable empirical coverage rates within the interval (93.65%, 96.35%). With a small sample size of $n = 250$, the bootstrap method produced slight undercoverage under 20% censoring and overcoverage under 80% censoring.

Finally, we examined the average widths of the 95% confidence intervals (Figures S1 and S2, Web Appendix D). Under a large sample size of $n = 5000$, the three methods that account for uncertainty in the weights (the bootstrap method, the linearization method, and the proposed method) behaved similarly. Under a small sample size of $n = 250$, the bootstrap method produced the widest confidence intervals; this difference could be substantial, likely resulting from extreme estimates obtained in some bootstrap samples when the number of events was small. Results for additional settings with sample sizes of $n = 100, 500, 1000$, and 2000 are included in Figures S3 to S14 (Web Appendix D), where similar results were observed.

4.2. Setting 2: with clustering

We compared the finite sample performance of the proposed corrected sandwich variance estimator with the clustered-version robust sandwich variance estimator (Lee et al., 1992) and the cluster bootstrap variance estimator (Davison and Hinkley, 1997; Field and Welsh, 2007). When implementing the cluster bootstrap, we resampled with replacement from the n clusters and used all observations from each selected cluster to form the bootstrap samples.

We simulated n clusters of size K as follows. For the i-th cluster, i=1,…,n, we first generated K counterfactual failure times $\{T_0^*(i,1),\ldots,T_0^*(i,K)\}$ from Frank's copula family with unit exponential margins and Kendall's tau equal to 0.7. Then for k=1,…,K, we specified covariates $X_{ik}=(X_{ik}^{(1)},X_{ik}^{(2)},X_{ik}^{(3)})^T$, where $X_{ik}^{(1)}=\frac{1}{K}\sum_{k=1}^{K}\left[0.5\{T_0^*(i,k)+0.2\}/\{T_0^*(i,k)+1\}\right]$, $X_{ik}^{(2)}=\frac{1}{K}\sum_{k=1}^{K}\left[1/\log\{1.3\,T_0^*(i,k)+3\}\right]$, and $X_{ik}^{(3)}=0.3+0.5/\{T_0^*(i,k)+1\}$. Here $X_{ik}^{(1)}$ and $X_{ik}^{(2)}$ were k-invariant cluster-level factors. The treatment $A_{ik}$ was generated by setting the propensity score for member k of cluster i to $e_{ik}=1/[1+\exp\{-(\gamma_0+2X_{ik}^{(1)}+X_{ik}^{(2)}+X_{ik}^{(3)})\}]$, where $\gamma_0$ was chosen to achieve a treatment prevalence of approximately 10%, 20%, 30%, 40%, or 50%. We calculated $T_{ik}^*=T_0^*(i,k)\exp(-\theta A_{ik})$. Censoring times for each cluster were drawn independently from an exponential distribution whose rate was chosen to yield a censoring rate of about 20% or 60%. The true marginal hazard ratio was specified as 1.5. We considered 80 clusters with size K=3 or K=6, and ran 1000 simulations for each parameter configuration. Five hundred bootstrap samples were used when implementing the cluster bootstrap method.
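The data-generating steps above can be sketched as follows. This simplified Python sketch draws the counterfactual times independently within a cluster (whereas the simulation induces dependence through Frank's copula with Kendall's tau 0.7), and the values of gamma0 and the censoring rate here are arbitrary placeholders rather than the calibrated values used in the paper.

```python
import math
import random

def simulate_cluster(K, theta=math.log(1.5), gamma0=-1.0, cens_rate=0.2, rng=random):
    """Sketch of one cluster from the Setting 2 design (simplified).

    Counterfactual times have unit exponential margins but are drawn
    independently within the cluster, as a simplification of the Frank
    copula used in the paper.
    """
    t0 = [rng.expovariate(1.0) for _ in range(K)]
    # Cluster-level covariates: averages over members, hence k-invariant.
    x1 = sum(0.5 * (t + 0.2) / (t + 1) for t in t0) / K
    x2 = sum(1.0 / math.log(1.3 * t + 3) for t in t0) / K
    data = []
    for k in range(K):
        x3 = 0.3 + 0.5 / (t0[k] + 1)                             # member-level covariate
        e = 1.0 / (1.0 + math.exp(-(gamma0 + 2 * x1 + x2 + x3)))  # propensity score
        a = 1 if rng.random() < e else 0                          # treatment assignment
        t_star = t0[k] * math.exp(-theta * a)                     # failure time under treatment a
        c = rng.expovariate(cens_rate)                            # independent censoring time
        # (observed time, event indicator, treatment, covariates)
        data.append((min(t_star, c), int(t_star <= c), a, x1, x2, x3))
    return data
```

Dividing the treated failure times by exp(theta) multiplies their hazard by exp(theta), so the marginal hazard ratio is 1.5 when theta = log(1.5).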

Figure 5 reports the ratios of ASE to ESE for the three variance estimation methods under various combinations of censoring rate and treatment prevalence. The proposed variance estimator generally produced ASE to ESE ratios close to 1 and outperformed the robust sandwich variance estimator and the cluster bootstrap estimator. The robust sandwich variance estimator tended to overestimate the true variance, as in the settings without clustering; in some scenarios, the resulting estimates were nearly double the true variance.

Figure 5:

Ratios of average standard error (ASE) to empirical standard error (ESE) with n=80 clusters each of size K. This figure appears in color in the electronic version of this article, and any mention of color refers to that version.

We further reported the empirical coverage rates and the average 95% confidence interval widths (Figures S15–S16, Web Appendix E) and repeated the simulations with n=200 and n=800 clusters (Figures S17–S22, Web Appendix E). The simulation results showed that when the number of clusters was larger (n=200 and n=800), both the proposed method and the cluster bootstrap performed well in terms of ratios of ASE to ESE and coverage rates. In general, the proposed method produced narrower confidence intervals than did the cluster bootstrap.

5. Real-World Examples

5.1. Application to bariatric surgery data: an example without clustering

We considered a dataset from the IBM® MarketScan® Research Databases. This dataset included 6690 patients aged 18 to 79 years who received sleeve gastrectomy (SG) or Roux-en-Y gastric bypass (RYGB) surgery between 1/1/2015 and 9/30/2015. The treatment variable was set to 1 if the patient received SG and 0 if the patient received RYGB. The outcome was time to the first all-cause hospitalization during the 30-day follow-up after discharge from the index surgery hospitalization. As is common in safety studies of rare outcomes, the censoring rate was high (97%).

We conducted IPW Cox regression to estimate the marginal hazard ratio. A logistic propensity score model was specified; see the list of covariates and balance diagnostics in Figure S23, Web Appendix F. The estimated marginal hazard ratios under the conventional and stabilized weights were both 0.659. We used five variance estimation methods to obtain the standard error and 95% confidence interval (Table 1). The proposed corrected sandwich variance estimator and the linearization estimator produced almost the same results. The robust sandwich variance estimate was only slightly larger than the proposed and linearization variance estimates in this example. The likelihood-based variance method produced a remarkably smaller standard error than the other methods under the conventional weights, consistent with the findings in the simulation studies. All the variance methods examined produced 95% confidence intervals that excluded 1, suggesting a statistically significantly lower risk of post-surgery hospitalization at the nominal 5% level for SG compared with RYGB.
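For reference, the conventional and stabilized weights used in analyses like this one can be computed from fitted propensity scores as in the following minimal Python sketch; the function name and interface are illustrative and are not from the paper's R package.

```python
def ipw_weights(a, e, stabilized=False):
    """Inverse probability of treatment weights from propensity scores.

    a: list of 0/1 treatment indicators; e: fitted propensity scores.
    Conventional: w_i = A_i/e_i + (1 - A_i)/(1 - e_i).
    Stabilized:   scale each arm by its marginal treatment prevalence,
                  w_i = p*A_i/e_i + (1 - p)*(1 - A_i)/(1 - e_i), p = mean(a).
    """
    p = sum(a) / len(a)
    weights = []
    for ai, ei in zip(a, e):
        base = ai / ei + (1 - ai) / (1 - ei)  # conventional weight
        if stabilized:
            base *= p if ai == 1 else 1 - p   # arm-constant stabilizing factor
        weights.append(base)
    return weights
```

The stabilized weight rescales the conventional weight by an arm-constant factor (p in the treated arm, 1 - p in the untreated arm), which typically improves stability when some propensity scores are close to 0 or 1.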

Table 1:

Analysis results of bariatric surgery data using various variance estimation methods: estimated log marginal hazard ratio (log HR), estimated marginal hazard ratio (HR), standard error, and 95% confidence interval for marginal hazard ratio (95% CI of HR)

Weight log HR HR Variance Method Standard Error 95% CI of HR

Conventional −0.417 0.659 Naive likelihood 0.0960 (0.5458, 0.7953)
Robust sandwich 0.1440 (0.4968, 0.8738)
Bootstrap (500 times) 0.1450 (0.4958, 0.8755)
Linearization 0.1436 (0.4973, 0.8729)
Corrected sandwich 0.1435 (0.4973, 0.8729)

Stabilized −0.417 0.659 Naive likelihood 0.1426 (0.4982, 0.8714)
Robust sandwich 0.1440 (0.4969, 0.8738)
Bootstrap (500 times) 0.1450 (0.4959, 0.8755)
Linearization 0.1435 (0.4973, 0.8729)
Corrected sandwich 0.1435 (0.4973, 0.8729)

5.2. Application to multiple readmission data: an example with clustering

We considered a clustered dataset of hospital readmission times, available from the R package frailtypack (Rondeau et al., 2012). The dataset contained 861 individual rehospitalization observations from 403 patients (clusters) who were diagnosed with colorectal cancer. The treatment variable was set to 1 if the patient was initiated on chemotherapy and 0 otherwise. The outcome was time to readmission, and each patient could have multiple readmissions. There were a total of 458 individual failure events, corresponding to a censoring rate of 46.8%. A logistic propensity score model was specified; see the list of covariates and balance diagnostics in Figure S24, Web Appendix F. The IPW Cox analysis yielded a hazard ratio estimate of 0.780 under both the conventional and stabilized weights.

Figure 6 displays forest plots of 95% confidence intervals for hazard ratios using the various variance estimators. The clustered-version corrected sandwich variance method produced noticeably narrower confidence intervals than the clustered-version robust sandwich variance method. Indeed, replacing the robust sandwich variance method with the corrected sandwich variance method shifted the results from a p-value of 0.09 to a p-value of 0.05. Figure 6 also shows that failing to account for clustering led to remarkably narrower confidence intervals.

Figure 6:

Analysis results of multiple readmission data using various variance estimation methods: forest plots of hazard ratios and 95% confidence intervals. A1 and A2 represent the inverse probability weighted Cox analysis using conventional and stabilized weights, respectively. In each panel, the dotted vertical line represents the marginal hazard ratio point estimate.

6. Discussion

We considered variance estimation for the IPW Cox model and proposed the corrected sandwich variance estimator for both independent and clustered data settings. Our simulation studies demonstrated satisfactory performance of the proposed variance estimator and confirmed that the standard robust sandwich variance estimator with estimated weights is conservative. The performance of the linearization estimator and the proposed estimator was quite similar, and both tended to provide narrower confidence intervals than the bootstrap estimator. Although the robust sandwich variance estimator ignores the uncertainty in weight estimation, the impact of this uncertainty on the magnitude of the variance is expected to be small when the sample size is large. Based on findings from prior studies and ours, the proposed variance estimator, the linearization variance estimator, and the bootstrap variance estimator are generally recommended for practical use. To facilitate implementation of the proposed method, we developed an R package ipwCoxCSV.

The idea of correcting available robust sandwich variance estimators through stacked estimating equations or linearization is generally applicable to weighted Cox models in causal inference and survey sampling. For example, the results of Binder (1992) may be extended to multivariable-adjusted Cox models (targeting conditional hazard ratios) with sampling weights estimated from the data. As another example, in multi-site studies, it would be useful to develop a variance estimator for the IPW Cox model stratified on data-contributing sites (Shu et al., 2020). It is also possible to handle other weighting strategies; for example, Hajage et al. (2018) considered weights that target the average treatment effect among the treated (ATT).

As the performance of asymptotic methods relies on adequate sample size, in small samples with low treatment prevalence the proposed estimator may underestimate the variance, while the bootstrap estimator may overestimate or underestimate it. In such cases, analysts may calculate $\widehat{\mathrm{var}}_{CS}(\hat{\theta})$, $\widehat{\mathrm{var}}_{LIN}(\hat{\theta})$, $\widehat{\mathrm{var}}_{BOOT}(\hat{\theta})$, and $\widehat{\mathrm{var}}_{RS}(\hat{\theta})$ to see whether they are similar. The underestimation of the robust sandwich variance for estimators from generalized linear models with a small number of clusters has been extensively studied (see, for example, Kauermann and Carroll, 2001; Mancl and DeRouen, 2001), and small sample correction formulae have been proposed. It is not yet apparent, and would be useful to investigate, whether or how these corrections may be extended to survival data settings.

The robust sandwich variance method and the bootstrap method can be applied under any type of propensity score model. The proposed and linearization methods assume a logistic propensity score model, which is widely used in practice. To gain flexibility within the logistic model form, analysts may include additional terms, such as interactions between covariates or higher-order polynomial terms of certain covariates when the relationship might be non-linear, to help achieve covariate balance. In principle, the proposed method allows for any type of propensity score model, as long as it has a well-defined estimating equation that can be included in the stacked estimating equations. When propensity scores are derived through machine learning alternatives (e.g., neural networks) (Westreich et al., 2010), a corresponding estimating equation may not be readily available, and it is unclear how to develop an analytical variance estimator; it would be useful to investigate the statistical properties of the resulting estimators.
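As an illustration of this estimating-equation requirement, for a logistic propensity score model $e(X;\gamma)=1/[1+\exp\{-(\gamma_0+\gamma_1^T X)\}]$ with $\gamma=(\gamma_0,\gamma_1^T)^T$, the equation to be stacked with the weighted Cox score is the familiar logistic score equation:

```latex
% Logistic score equation for the propensity score parameters gamma,
% stacked with the weighted Cox partial likelihood score equation.
\sum_{i=1}^{n} \bigl\{ A_i - e(X_i;\gamma) \bigr\}
\begin{pmatrix} 1 \\ X_i \end{pmatrix} = \mathbf{0}.
```

Solving this score equation jointly with the weighted Cox score is what allows the stacked-equation sandwich formula to propagate the uncertainty in $\hat{\gamma}$ into the variance of $\hat{\theta}$.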

Supporting Information

Web Appendices A–F referenced in Sections 3–5, as well as code and example data, are available with this paper at the Biometrics website on Wiley Online Library. An R package ipwCoxCSV, which implements the proposed method, is available from the Comprehensive R Archive Network (CRAN) at https://cran.r-project.org/package=ipwCoxCSV.

Acknowledgements

The authors thank the Editor, Associate Editor, and two referees for providing thoughtful comments, which led to an improved version of the paper. The authors also thank Qoua Her at Harvard Pilgrim Health Care Institute for the help with the bariatric surgery dataset. Drs. Shu and Toh are partially supported by the National Institutes of Health (U01EB023683) and the Agency for Healthcare Research and Quality (R01HS026214). Drs. Shu and Wang are partially supported by R01 AI136947 from the National Institute of Allergy and Infectious Diseases. Drs. Toh and Wang are also supported by Harvard Pilgrim Health Care Institute Robert H. Ebert Career Development Awards.

References

  1. Andersen PK and Gill RD (1982) Cox’s regression model for counting processes: a large sample study. The Annals of Statistics, 10, 1100–1120.
  2. Austin PC (2016) Variance estimation when using inverse probability of treatment weighting (IPTW) with survival analysis. Statistics in Medicine, 35, 5642–5655.
  3. Binder DA (1992) Fitting Cox’s proportional hazards models from survey data. Biometrika, 79, 139–147.
  4. Cole SR and Hernán MA (2004) Adjusted survival curves with inverse probability weights. Computer Methods and Programs in Biomedicine, 75, 45–49.
  5. Cole SR and Hernán MA (2008) Constructing inverse probability weights for marginal structural models. American Journal of Epidemiology, 168, 656–664.
  6. Cox DR (1975) Partial likelihood. Biometrika, 62, 269–276.
  7. Davison AC and Hinkley DV (1997) Bootstrap Methods and Their Application. Cambridge: Cambridge University Press.
  8. Deville JC (1999) Variance estimation for complex statistics and estimators: linearization and residual techniques. Survey Methodology, 25, 193–204.
  9. Efron B and Tibshirani RJ (1993) An Introduction to the Bootstrap. New York: Chapman & Hall/CRC.
  10. Field CA and Welsh AH (2007) Bootstrapping clustered data. Journal of the Royal Statistical Society, Series B, 69, 369–390.
  11. Hajage D, Chauvet G, Belin L, Lafourcade A, Tubach F and De Rycke Y (2018) Closed-form variance estimator for weighted propensity score estimators with survival outcome. Biometrical Journal, 60, 1151–1163.
  12. Henmi M and Eguchi S (2004) A paradox concerning nuisance parameters and projected estimating functions. Biometrika, 91, 929–941.
  13. Hernán MA, Brumback B and Robins JM (2000) Marginal structural models to estimate the causal effect of zidovudine on the survival of HIV-positive men. Epidemiology, 11, 561–570.
  14. Hernán MA and Robins JM (2020) Causal Inference: What If. Boca Raton: Chapman & Hall/CRC.
  15. Höfler M, Pfister H, Lieb R and Wittchen H-U (2005) The use of weights to account for non-response and drop-out. Social Psychiatry and Psychiatric Epidemiology, 40, 291–299.
  16. Horvitz DG and Thompson DJ (1952) A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association, 47, 663–685.
  17. Kauermann G and Carroll RJ (2001) A note on the efficiency of sandwich covariance matrix estimation. Journal of the American Statistical Association, 96, 1387–1396.
  18. Lee EW, Wei L-J and Amato DA (1992) Cox-type regression analysis for large numbers of small groups of correlated failure time observations. In: Klein JP and Goel PK (Eds.) Survival Analysis: State of the Art. Dordrecht: Kluwer Academic Publishers, 237–247.
  19. Lin DY and Wei L-J (1989) The robust inference for the Cox proportional hazards model. Journal of the American Statistical Association, 84, 1074–1078.
  20. Lunceford JK and Davidian M (2004) Stratification and weighting via the propensity score in estimation of causal treatment effects: a comparative study. Statistics in Medicine, 23, 2937–2960.
  21. Mancl LA and DeRouen TA (2001) A covariance estimator for GEE with improved small-sample properties. Biometrics, 57, 126–134.
  22. Mao H, Li L, Yang W and Shen Y (2018) On the propensity score weighting analysis with survival outcome: estimands, estimation, and inference. Statistics in Medicine, 37, 3745–3763.
  23. Miratrix LW, Sekhon JS, Theodoridis AG and Campos LF (2018) Worth weighting? How to think about and use weights in survey experiments. Political Analysis, 26, 275–291.
  24. Perez-Heydrich C, Hudgens MG, Halloran ME, Clemens JD, Ali M and Emch ME (2014) Assessing effects of cholera vaccination in the presence of interference. Biometrics, 70, 734–744.
  25. Pfeffermann D (1993) The role of sampling weights when modeling survey data. International Statistical Review/Revue Internationale de Statistique, 61, 317–337.
  26. Robins JM (1997) Marginal structural models. In: 1997 Proceedings of the Section on Bayesian Statistical Science. Alexandria, VA: American Statistical Association, 1–10.
  27. Robins JM (1999) Marginal structural models versus structural nested models as tools for causal inference. In: Halloran E and Berry D (Eds.) Statistical Models in Epidemiology: The Environment and Clinical Trials. New York: Springer, 95–134.
  28. Rondeau V, Mazroui Y and Gonzalez JR (2012) frailtypack: an R package for the analysis of correlated survival data with frailty models using penalized likelihood estimation or parametrical estimation. Journal of Statistical Software, 47, 1–28.
  29. Rosenbaum PR (1987) Model-based direct adjustment. Journal of the American Statistical Association, 82, 387–394.
  30. Rosenbaum PR and Rubin DB (1983) The central role of the propensity score in observational studies for causal effects. Biometrika, 70, 41–55.
  31. Seaman SR and White IR (2013) Review of inverse probability weighting for dealing with missing data. Statistical Methods in Medical Research, 22, 278–295.
  32. Shu D, Yoshida K, Fireman BH and Toh S (2020) Inverse probability weighted Cox model in multi-site studies without sharing individual-level data. Statistical Methods in Medical Research, 29, 1668–1681.
  33. Stefanski LA and Boos DD (2002) The calculus of M-estimation. The American Statistician, 56, 29–38.
  34. van der Laan MJ and Robins JM (2003) Unified Methods for Censored Longitudinal Data and Causality. New York: Springer.
  35. Westreich D, Lessler J and Funk MJ (2010) Propensity score estimation: neural networks, support vector machines, decision trees (CART), and meta-classifiers as alternatives to logistic regression. Journal of Clinical Epidemiology, 63, 826–833.
  36. Williamson EJ, Forbes A and White IR (2014) Variance reduction in randomised trials by inverse probability weighting using the propensity score. Statistics in Medicine, 33, 721–737.
  37. Young JG, Hernán MA, Picciotto S and Robins JM (2008) Simulation from structural survival models under complex time-varying data structures. JSM Proceedings, Section on Statistics in Epidemiology, Denver, CO: American Statistical Association.
