Author manuscript; available in PMC: 2025 Dec 13.
Published in final edited form as: Health Serv Outcomes Res Methodol. 2025 Jun 7;25(4):382–409. doi: 10.1007/s10742-025-00344-x

Propensity score weighting analysis with complex survey data for estimating population-level treatment effects on survival: a simulation study

Lihua Li 1,2,3,4, Chen Yang 1,2, Wei Zhang 1,2, Yulei He 3, John R Pleis 3, Lauren M Rossen 3, Bian Liu 1,5, Morgan Earp 3, Madhu Mazumdar 1,2,6
PMCID: PMC12700616  NIHMSID: NIHMS2093345  PMID: 41394309

Abstract

Propensity score weighting (PSW) is a valuable tool for estimating treatment effects on survival outcomes in observational studies. However, there is no clear best practice for applying PSW to complex survey data with survival outcomes. This paper addresses this gap by exploring how to integrate PSW with complex survey design features (strata, clusters, sampling weights) to obtain unbiased population-level estimates. We evaluate three PSW methods: Method I, in which neither the propensity score (PS) model nor the outcome model accounts for the survey design; Method II, in which the PS model does not account for the survey design but the outcome model does; and Method III, in which both the PS model and the outcome model account for the survey design. Through extensive simulations, we compare their performance in estimating absolute treatment effects measured by population survival quantile effects and relative treatment effects measured by population marginal hazard ratios. Mean relative bias, mean absolute bias and coverage probability are estimated for model evaluation under various scenarios, including varying treatment effect magnitude, censoring type and rate, level of PS overlap, and presence of outliers and nonresponse. Findings reveal that the survey-weighted Methods II and III outperform the unweighted Method I under most scenarios for both measures of treatment effect, especially when there is a true treatment effect. Methods II and III perform similarly, including under informative censoring, influential outliers, and nonresponse. We recommend that when applying PSW to complex survey data to estimate population-level treatment effects on survival outcomes, both modeling stages incorporate the survey design, with incorporation at the outcome-modeling stage being the most critical.
For illustration, all methods are applied to the public-use 2000–2018 National Health Interview Survey (NHIS) Linked Mortality Files, with mortality follow-up through 2019, to estimate the effect of smoking cessation after a cancer diagnosis on subsequent overall survival.

Keywords: Propensity score weighting, Complex survey data, Population survival quantile effect (PSQE), Population marginal hazard ratio (PMHR), Population based research, Inverse probability of treatment weighting (IPTW)

1. Introduction

The concept of propensity score (PS) analysis was introduced by Rosenbaum and Rubin (1983) to remove bias due to observed confounders in non-randomized studies (Rosenbaum and Rubin 1983). The purpose of using the PS in observational studies is to create comparable treatment groups by ensuring that they share similar characteristics, mimicking some characteristics of randomized controlled trials (Austin 2011). Essentially, the PS is a balancing score, defined as the conditional probability of receiving treatment ($Z = 1$) given a set of observed covariates $X$: $e(X) = P(Z = 1 \mid X)$. PS analysis usually involves two steps. The first step is to estimate the PS using parametric modeling approaches such as the commonly used logistic regression and covariate balancing propensity scores (CBPS) (Imai and Ratkovic 2014), or flexible, newer approaches such as ensemble learning and other machine learning models (Lee et al. 2009; Watkins et al. 2013; Pirracchio et al. 2015; Cannas and Arpino 2019; Phillips et al. 2023). The second step is to estimate the treatment effect while incorporating the PS into the outcome modeling. In this step, the PS is typically used in one of four ways: matching on the PS, stratifying by the PS, including the PS as a covariate, and, most commonly, propensity score weighting (PSW) (Rosenbaum 1987; Austin and Mamdani 2006; Stuart 2010; Austin 2011).

The application of PSW methods to complex survey data for inferring causal effects has only recently gained traction (Dugoff et al. 2014; Ridgeway et al. 2015; Yang et al. 2023), despite their common usage in non-survey observational studies to mitigate confounding when estimating the effects of treatment on outcomes, including survival outcomes (Austin 2014; Austin and Schuster 2016; Mao et al. 2018). Unlike data from typical observational studies, complex population-based surveys often incorporate intricate design elements such as strata, clusters, and survey weights. These multilevel sampling designs efficiently ensure that the sample represents the target population (e.g., the US general population) when survey weights are used, a key advantage not easily achieved even in large-scale cohort studies. However, because survey data are observational, population-based survey studies remain susceptible to confounding, and therefore to biased estimates, if, for example, the distribution of a confounder differs between the groups being compared; sampling variability may also lead to such imbalances. Integrating PSW analysis into the complex survey setting offers an elegant solution: it not only addresses potential observed confounding but also has the capacity to incorporate survey design elements into the analysis. This enables unbiased estimation of population treatment effects, reflecting the true effect of treatments/exposures in the population from which the sample originates (Ridgeway et al. 2015; Austin et al. 2018; Dong et al. 2020; Yang et al. 2023).

While some studies have explored PSW methods for analyzing complex survey data, their scope has been limited to continuous or binary outcomes (Dugoff et al. 2014; Ridgeway et al. 2015; Yang et al. 2023). Notably, there is no guidance on conducting PSW analysis for complex survey data with survival outcomes, which are common outcome types. This paper aims to address this critical gap by comprehensively evaluating three PSW methods in estimating the treatment effect on survival outcomes in the context of complex survey data analysis, with a focus on whether and how survey design elements are incorporated into the analysis. This study serves as a first-of-its-kind guide, navigating users through a series of simulations and culminating in a real-world application, ultimately equipping researchers with the knowledge to properly perform PSW analysis for survival outcomes using complex survey data.

The paper is structured as follows: In Sect. 2, we review PSW methods in survival analysis with non-survey data, followed by their extension to survey settings. We then focus on setting up three different ways of incorporating survey designs into the two stages of the PSW analysis. In Sect. 3, we describe a series of simulations under various scenarios to evaluate the performance of the three PSW methods. In Sect. 4, the simulation results are presented and interpreted. In Sect. 5, we implement the methods in a real-world data study, examining the effect of smoking cessation after a cancer diagnosis on subsequent overall survival. In Sect. 6, we summarize the key findings and discuss the implications and potential future directions.

2. Methods

2.1. PSW and survival outcomes

Propensity score weighting (PSW) analysis follows a two-step approach: In the first step, the PS is estimated by regressing the binary treatment variable $Z$ on the observed baseline covariates $X$, often using a logistic regression model. Each participant, regardless of treatment status (treated or control), receives a PS. Treated individuals are assigned a weight equal to $1/\hat{e}(X)$, while control individuals are assigned a weight of $1/(1-\hat{e}(X))$, where $\hat{e}(X)$ is the estimated propensity score. This weighting is often called "inverse probability of treatment weighting" (IPTW), where the weight is the inverse of the conditional probability of having the subject's own exposure status (treated/control) (Sato and Matsuyama 2003; Lunceford and Davidian 2004). In the second step, the PS-based weights are incorporated into the outcome model to estimate the treatment effect, whereby participants are essentially 'reweighted' to account for their differences in observed characteristics, reducing selection bias and confounding (McCaffrey et al. 2004).
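The first step above can be sketched in Python. This is a minimal illustration, not the paper's implementation: the synthetic data, the two-covariate logistic PS model, and the Newton-Raphson fitter are all stand-ins chosen for self-containment.

```python
import numpy as np

def fit_logistic(X, z, n_iter=25):
    """Fit logistic regression P(Z=1|X) by Newton-Raphson; returns coefficients."""
    Xd = np.column_stack([np.ones(len(X)), X])   # add intercept column
    beta = np.zeros(Xd.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-Xd @ beta))
        W = p * (1.0 - p)
        # Newton step: beta += (X'WX)^{-1} X'(z - p)
        beta += np.linalg.solve(Xd.T @ (Xd * W[:, None]), Xd.T @ (z - p))
    return beta

def iptw_weights(X, z):
    """IPTW: 1/e(x) for treated subjects, 1/(1-e(x)) for controls."""
    beta = fit_logistic(X, z)
    Xd = np.column_stack([np.ones(len(X)), X])
    e = 1.0 / (1.0 + np.exp(-Xd @ beta))
    return np.where(z == 1, 1.0 / e, 1.0 / (1.0 - e))

# hypothetical confounded data: treatment depends on both covariates
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 2))
z = rng.binomial(1, 1.0 / (1.0 + np.exp(-(0.3 * X[:, 0] - 0.5 * X[:, 1]))))
w = iptw_weights(X, z)

# after weighting, the weighted covariate means should be close across groups
m1 = np.average(X[z == 1, 0], weights=w[z == 1])
m0 = np.average(X[z == 0, 0], weights=w[z == 0])
```

A balance check such as the comparison of `m1` and `m0` is a common diagnostic before fitting the weighted outcome model.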

When applying PSW methods to data with survival outcomes, the effect of treatment is often quantified by either the absolute effect or the relative effect of a treatment/exposure on survival (Uno et al. 2014; Mao et al. 2018). The absolute effect is usually measured by the survival quantile effect (SQE) at level $q \in (0, 1)$, which quantifies the difference in survival time between the treated and control groups at the $q$-quantile of the survival time distribution (Mao et al. 2018). Denoting the survival functions for the control and treated groups by $S_0$ and $S_1$, respectively, the SQE is defined as $\Delta(S_0, S_1; q) = S_1^{-1}(1-q) - S_0^{-1}(1-q)$, where $S_k^{-1}(1-q) = \sup\{t \ge 0 : S_k(t) \ge 1-q\}$ denotes the $q$-quantile of the survival time for treatment group $k = 0, 1$. It is often derived using an adjusted Kaplan–Meier estimator (AKME) (Xie and Liu 2005). The relative effect of treatment, measured by the marginal hazard ratio (MHR), captures the hazard of experiencing the outcome in the treatment group compared to the control group. It is often estimated using a univariate Cox proportional hazards (PH) model $\lambda_Z(t) = \lambda_0(t) \exp(\beta Z)$, weighted by the PS weights, where the hazard of the outcome is regressed on an indicator variable denoting treatment status $Z$, and $\lambda_0$ is the hazard function for the control group ($Z = 0$) (Binder 1992; Austin et al. 2018). In essence, the absolute effect compares survival times between the treated and control groups across the entire survival distribution, while the relative effect compares the instantaneous rates of experiencing the outcomes between the treated and control groups.

2.2. PSW with complex survey data

When analyzing a population-based, complex survey sample, the estimands of interest are usually population treatment effects, which refer to the treatment effects in the population from which the sample was drawn (e.g., the target population) (Pfeffermann 1993; Imbens 2004; Imai et al. 2008). Accordingly, when applying PSW to a sample from complex survey data with survival outcomes, the estimands of interest are the SQE and MHR at the population level. The population SQE (PSQE) at level $q$ is defined as $\mathrm{PSQE}(q) = \Delta(S_0^{\text{Population}}, S_1^{\text{Population}}; q)$, where $S_0^{\text{Population}}$ and $S_1^{\text{Population}}$ denote the population-level survival functions for the control and treated groups, respectively. They are estimated from the AKME accounting for the survey design

$$\hat{S}_k^{\text{Population}}(t) = \prod_{j : t_j \le t} \left[ 1 - \frac{d_{jk}^w}{Y_{jk}^w} \right] \qquad (1)$$

if $t_1 \le t$, and $\hat{S}_k^{\text{Population}}(t) = 1$ if $t_1 > t$, where $w = \{w_i\}$ represents the vector of survey weights for each subject $i$, and $k \in \{0, 1\}$ indexes the treatment groups;

$$d_{jk}^w = \sum_{i : T_i = t_j} \frac{w_i \, \delta_i \, I(Z_i = k)}{\hat{e}(x_i)^k \left[1 - \hat{e}(x_i)\right]^{1-k}} \qquad (2)$$

represents the weighted number of events occurring at $t_j \in [0, t]$, where $\delta_i \in \{0, 1\}$ is the event indicator and $I(\cdot)$ is the indicator function, and

$$Y_{jk}^w = \sum_{i : T_i \ge t_j} \frac{w_i \, I(Z_i = k)}{\hat{e}(x_i)^k \left[1 - \hat{e}(x_i)\right]^{1-k}} \qquad (3)$$

represents the weighted number of subjects at risk up to $t_j \in [0, t]$.
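Equations (1)–(3) can be sketched directly in Python. This is a minimal illustration assuming the propensity scores, survey weights, and toy data are already in hand; it is not the authors' code.

```python
import numpy as np

def weighted_akme(time, event, z, e_hat, w, k):
    """Survey-weighted adjusted Kaplan-Meier curve for treatment group k:
    each subject carries a composite weight = survey weight x IPTW weight."""
    # composite weight; the (z == k) factor zeroes out the other group
    cw = w * (z == k) / (e_hat**k * (1.0 - e_hat)**(1 - k))
    event_times = np.unique(time[(event == 1) & (z == k)])
    surv, curve = 1.0, []
    for tj in event_times:
        d_jk = cw[(time == tj) & (event == 1)].sum()  # weighted events at t_j, as in eq. (2)
        y_jk = cw[time >= tj].sum()                   # weighted number at risk, as in eq. (3)
        surv *= 1.0 - d_jk / y_jk                     # product-limit step, as in eq. (1)
        curve.append((tj, surv))
    return curve

# tiny hypothetical example: three treated subjects, all events, constant PS
time = np.array([1.0, 2.0, 3.0])
event = np.array([1, 1, 1])
z = np.array([1, 1, 1])
curve = weighted_akme(time, event, z, np.full(3, 0.5), np.ones(3), k=1)
```

With all survey weights equal to 1 and a constant PS, the curve reduces to the ordinary Kaplan–Meier estimator, which is a useful sanity check.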

For the estimation of the population MHR (PMHR), the univariate Cox PH model is weighted by the product of the inverse probability of treatment weights and the survey weights, often referred to as "the composite weight" (Cook et al. 2009). The model also accounts for other survey design elements such as strata and clusters. The estimated PMHR, the exponential of the regression coefficient $\beta$, is determined by solving the partial likelihood score equation with the weights incorporated (Binder 1992):

$$\sum_{i=1}^{n} \frac{w_i}{\hat{e}(x_i)^{z_i} \left[1 - \hat{e}(x_i)\right]^{1 - z_i}} \left\{ z_i - \frac{\hat{\Sigma}^{(1)}(t_i, \beta)}{\hat{\Sigma}^{(0)}(t_i, \beta)} \right\} = 0 \qquad (4)$$

where

$$\hat{\Sigma}^{(1)}(t, \beta) = \frac{1}{n} \sum_{i=1}^{n} \frac{w_i \, I(t_i \ge t) \, z_i \exp(\beta z_i)}{\hat{e}(x_i)^{z_i} \left[1 - \hat{e}(x_i)\right]^{1 - z_i}} \qquad (5)$$

and

$$\hat{\Sigma}^{(0)}(t, \beta) = \frac{1}{n} \sum_{i=1}^{n} \frac{w_i \, I(t_i \ge t) \exp(\beta z_i)}{\hat{e}(x_i)^{z_i} \left[1 - \hat{e}(x_i)\right]^{1 - z_i}}. \qquad (6)$$

It can be obtained using the svycoxph function in the R survey package (Lumley et al. 2024).
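For readers working outside R, the score equation (4) can also be solved directly for the special case of a single binary treatment indicator. The sketch below is an illustrative numpy/scipy root-finder on simulated data with composite weights set to 1; the counting-process form of the score means only observed events contribute terms to the sum.

```python
import numpy as np
from scipy.optimize import brentq

def weighted_cox_beta(time, event, z, cw):
    """Solve the weighted partial-likelihood score equation for one binary
    covariate; exp(beta) then estimates the marginal hazard ratio."""
    def score(beta):
        total = 0.0
        for i in np.flatnonzero(event == 1):        # only events contribute
            at_risk = time >= time[i]
            r = cw[at_risk] * np.exp(beta * z[at_risk])
            total += cw[i] * (z[i] - (r * z[at_risk]).sum() / r.sum())
        return total
    return brentq(score, -5.0, 5.0)                 # score is monotone in beta

# hypothetical data: true hazard ratio 2, no censoring, unit weights
rng = np.random.default_rng(1)
n = 1000
z = rng.binomial(1, 0.5, n)
T = rng.exponential(1.0 / np.exp(np.log(2.0) * z))
beta_hat = weighted_cox_beta(T, np.ones(n, dtype=int), z, np.ones(n))
```

Replacing the unit weights with the composite (survey × IPTW) weights yields the weighted estimator described above, although design-based variance estimation still requires stratum and cluster information.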

2.3. PSW methods considered

In our simulation study, we consider the following three methods for estimating the PSQE and PMHR, building on the previous work on integrating PS to survey data (Dugoff et al. 2014; Ridgeway et al. 2015; Yang et al. 2023):

Method I: neither the PS model nor the outcome models (i.e., the AKME and Cox regression) account for the survey design (strata, clusters and survey weights)

Method II: the PS model does not account for the survey design but the outcome models do

Method III: both the PS model and outcome models account for the survey design

A logistic regression model is used in the first step to estimate the PS. In the second step, the AKME method is used to estimate the PSQE, and a weighted Cox PH regression is employed to estimate the PMHR. For Method I, both the AKME and the Cox PH regression models are weighted by the PS only, i.e., assuming $w_i = 1$ for all $i$ in (2)–(6) as well as in the PS estimation therein. For Methods II and III, both the AKME and the Cox regression are weighted by the product of the survey weights and the inverse probability of treatment weights derived from the PS. Specifically, the AKME accounting for survey weights is obtained by plugging $d_{jk}^w$ and $Y_{jk}^w$ given by (2) and (3), respectively, into (1), while the PMHR is obtained by solving (4). In addition, both the AKME and the Cox regression incorporate the survey strata and clusters.
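The weight construction distinguishing the three methods can be summarized in a small helper. This is a sketch of the combination rule only; function and argument names are illustrative, and for Method III the propensity scores passed in would themselves come from a survey-weighted PS model.

```python
import numpy as np

def composite_weights(ps, z, survey_w, method):
    """Outcome-model weights for Methods I-III.
    ps: estimated propensity scores; z: treatment indicator;
    survey_w: survey weights; method: 'I', 'II' or 'III'."""
    iptw = np.where(z == 1, 1.0 / ps, 1.0 / (1.0 - ps))
    if method == "I":
        return iptw                 # survey design ignored (w_i = 1 for all i)
    return survey_w * iptw          # Methods II and III: composite weight

# a one-subject illustration: treated, PS = 0.5, survey weight = 3
w_I = composite_weights(np.array([0.5]), np.array([1]), np.array([3.0]), "I")
w_II = composite_weights(np.array([0.5]), np.array([1]), np.array([3.0]), "II")
```

The only difference between Methods II and III at this stage is upstream, in how `ps` was estimated.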

3. Simulation

3.1. Simulation design

3.1.1. Step 1. Population and covariate generation

Following the simulation study design of Austin et al. (2018), we generate a population data set containing 1,000,000 observations, consisting of 10 strata, each with 20 clusters. Each cluster contains 5000 observations.

We simulate six covariates using $A_i \sim N(\mu_{i,s,c}, 1)$ where $\mu_{i,s,c} = \mu_{i,s} + \mu_{i,c}$, $\mu_{i,s} \sim N(0, 0.103)$ and $\mu_{i,c} \sim N(0, 0.053)$ for $i = 1, \ldots, 6$, where $s = 1, \ldots, 10$ indexes the strata and $c = 1, \ldots, 20$ denotes the clusters. We then set $X_i = A_i$ for $i = 1, 2, 3$ to generate the three continuous variables. We generate two binary variables $X_4$ and $X_5$ by setting $X_i = 1$ if $A_i > 0$ and $X_i = 0$ otherwise, for $i = 4, 5$. We also generate a categorical variable $X_6$ with three levels corresponding to the cases $A_6 \le -0.25$, $A_6 \in (-0.25, 0.25]$, and $A_6 > 0.25$.

3.1.2. Step 2. Treatment assignment

The probability of being treated is a function of $X$:
$$P(Z = 1 \mid X = x) = \frac{\exp(\alpha_0 + \alpha_1 x_1 + \alpha_2 x_2 + \alpha_3 x_3 + \alpha_4 x_4 + \alpha_5 x_5 + \alpha_6 x_{6,1} + \alpha_7 x_{6,2})}{1 + \exp(\alpha_0 + \alpha_1 x_1 + \alpha_2 x_2 + \alpha_3 x_3 + \alpha_4 x_4 + \alpha_5 x_5 + \alpha_6 x_{6,1} + \alpha_7 x_{6,2})},$$
where $Z$ is the indicator of treatment and $X_{6,1}$ and $X_{6,2}$ are two dummy variables derived from the categorical variable $X_6$. We set the overall prevalence of the treated group to approximately 35%.

3.1.3. Step 3. Outcome generation

We simulate survival outcomes by treatment status using a method described by Bender et al. (2005). First, we simulate a time-to-event outcome under the control treatment, followed by simulating a time-to-event outcome under the active treatment. Under the control treatment simulation, the linear predictor is defined as $LP^{(0)} = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4 + \beta_5 X_5 + \beta_6 X_{6,1} + \beta_7 X_{6,2}$. The linear predictor under the treatment simulation is $LP^{(1)} = LP^{(0)} + \log(\beta_{treat})$. For each subject, we generate a random number $U$ from a standard Uniform distribution, $U \sim U(0, 1)$, and generate an event time from a Weibull distribution, $T^{(0)} = \left( \frac{-\log(U)}{\lambda e^{LP^{(0)}}} \right)^{1/\eta}$, where $\lambda$ and $\eta$ denote the scale and shape parameters, respectively, of the Weibull distribution. The counterfactual event time under the treatment is then given by $T^{(1)} = T^{(0)} (\beta_{treat})^{-1/\eta}$. We set $\lambda$ and $\eta$ equal to 0.00002 and 2, respectively. To determine the value of $\beta_{treat}$ that achieves a specified PMHR, we use an iterative process proposed by Austin (2013). In this process, the survival outcome is regressed on the treatment status indicator 1000 times to obtain 1000 estimated coefficients for the treatment status. The average of these coefficients is considered an estimate of the logarithm of the MHR associated with a specific value of $\beta_{treat}$ (Austin 2013). Using the bisection approach (Austin 2023b), we repeat this process until the value of $\beta_{treat}$ that results in the desired PMHR is obtained.
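The Weibull inversion at the heart of this step can be sketched briefly. The linear predictor `lp0` and the value of `beta_treat` below are hypothetical stand-ins; only `lam` and `eta` come from the text.

```python
import numpy as np

rng = np.random.default_rng(7)
lam, eta = 2e-5, 2.0            # Weibull scale and shape from the text
beta_treat = 2.0                # hypothetical conditional hazard ratio

n = 10_000
lp0 = rng.normal(0.0, 0.5, n)   # stand-in for the linear predictor LP(0)
U = rng.uniform(size=n)
T0 = (-np.log(U) / (lam * np.exp(lp0))) ** (1.0 / eta)   # event time under control
T1 = T0 * beta_treat ** (-1.0 / eta)                     # counterfactual under treatment
```

Because $\beta_{treat} > 1$ shortens every subject's counterfactual event time by the same multiplicative factor, the treated times are uniformly smaller than the control times here.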

3.1.4. Step 4. Stratified sampling and survey weights construction

We employ a stratified two-stage sampling approach to generate the sample data. First, we sample 5 of the 20 clusters from each of the 10 strata. We then draw a random sample of 5000 subjects from the sampled clusters, with the per-cluster sample sizes $n_{s,c}$ set to 150, 140, 130, 120, 110, 90, 80, 70, 60, and 50 for strata $1, 2, \ldots, 10$. In each stratum, we sample an equal number of subjects from the 5 selected clusters, so the sample sizes allocated to the 10 strata are 750, 700, 650, 600, 550, 450, 400, 350, 300, and 250. Subjects are sampled with their corresponding selection probability $\pi(x)$, defined as
$$\pi(x) = P(\delta = 1 \mid X = x) = \frac{\exp(\gamma_0 + \gamma_1 x_1 + \gamma_2 x_2 + \gamma_3 x_3 + \gamma_4 x_4 + \gamma_5 x_5 + \gamma_6 x_{6,1} + \gamma_7 x_{6,2})}{1 + \exp(\gamma_0 + \gamma_1 x_1 + \gamma_2 x_2 + \gamma_3 x_3 + \gamma_4 x_4 + \gamma_5 x_5 + \gamma_6 x_{6,1} + \gamma_7 x_{6,2})}$$
(Yang et al. 2023; Li et al. 2025), where $\delta$ is an indicator for being surveyed. The coefficient values are chosen so that the overall probability of being surveyed is 0.0005. The survey weights are constructed as the product of the inverse sampling fraction $\left(\frac{20}{5}\right)\left(\frac{5000}{n_{s,c}}\right)$ and the inverse probability of being sampled $\frac{1}{\pi(x)}$ (Dugoff et al. 2014; Ridgeway et al. 2015). This sampling is repeated 1000 times, generating 1000 datasets.
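The two-stage draw and the design part of the weight can be sketched for one stratum as follows. The function and its arguments are illustrative; the covariate-dependent $1/\pi(x)$ factor would be multiplied in separately.

```python
import numpy as np

rng = np.random.default_rng(3)
n_sc = [150, 140, 130, 120, 110, 90, 80, 70, 60, 50]  # per-cluster sizes by stratum

def sample_stratum(s, pop_per_cluster=5000, n_clusters=20, n_sampled=5):
    """Stage 1: sample 5 of 20 clusters; stage 2: n_sc[s] subjects per cluster.
    Returns sampled subject indices per cluster and the design part of the
    survey weight; the full weight also multiplies in 1/pi(x)."""
    clusters = rng.choice(n_clusters, size=n_sampled, replace=False)
    draws = {c: rng.choice(pop_per_cluster, size=n_sc[s], replace=False)
             for c in clusters}
    design_w = (n_clusters / n_sampled) * (pop_per_cluster / n_sc[s])
    return draws, design_w

draws, design_w = sample_stratum(0)   # stratum 1 draws 150 subjects per cluster
```

For stratum 1 this yields a design weight of $(20/5)(5000/150) \approx 133.3$ before the $1/\pi(x)$ adjustment.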

3.1.5. Step 5. Scenarios considered

To evaluate the performance of the three methods under different settings, we first simulate scenarios with various true PMHRs. We set the true PMHR to 1, 1.5, 2, and 3 by adjusting the value of $\beta_{treat}$ accordingly, with the true survival curves associated with each true PMHR illustrated in Fig. 1. We then consider various censoring rates under both independent and dependent censoring. Censoring time is generated from an exponential distribution, $C \sim \exp(\theta)$. Both independent censoring ($T \perp C$) and covariate-dependent censoring ($T \perp C \mid X$, with $C \sim \exp(\exp(\theta' X))$) are considered. By adjusting the values of $\theta$ and $\theta'$, we set the censoring rate at 0, 0.1, 0.2 and 0.3. For dependent censoring, we let censoring depend on one continuous variable, $X_1$, and one binary variable, $X_4$. We also incorporate scenarios with varying levels of propensity score overlap. The overlap in PS ("covariate overlap") is often used to check the positivity assumption, ensuring that each treatment group has individuals with comparable covariate profiles (Zhou et al. 2020; Li and Li 2021). The degree of overlap is influenced by the amount of confounding present in the data and the proportion of extreme PS values. Stronger overlap in PS indicates more similar covariate profiles between the treatment and control groups and less confounding. Using the method proposed by Hu et al. (2021), we multiply all values of the $\alpha$s in the treatment assignment model by a constant $\psi$, set to 0.4 and 2 to represent strong and weak overlap, respectively. This is depicted in the two panels of Fig. 2, showing the extent to which the distributions of PS for the treatment and control groups overlap. Evaluating a method across varying levels of overlap allows researchers to understand how robust the method is to deviations from ideal overlap (Zhou et al. 2020; Reifeis and Hudgens 2022; Yang et al. 2023; Li et al. 2025). Moreover, to assess the robustness of our methods, we introduce influential outliers and non-response.
We generate influential outliers by contaminating 5% of $X_2$ with a normal distribution $N(2, 1)$. For scenarios with nonresponse, we assume that approximately 20% of the sampled subjects do not respond, with the probability of nonresponse related to $X_2$, $X_4$, $X_{6,1}$ and $X_{6,2}$. Lastly, we consider scenarios with varying sample sizes of 2500, 5000 and 10,000, while maintaining the same sampling probability. Considering the non-collapsibility of the HR from the Cox PH model (Gail et al. 1984), we employ the bisection method in the iterative data-generating process described in Sect. 3.1.3 to induce a desired MHR from a conditional HR. Despite its adoption in previous studies (Austin 2013, 2023b; Austin et al. 2019), whether this indirect transformation introduces bias when a complex sampling scheme is involved has not been studied. Values of the $\alpha$s, $\beta$s, $\theta$s and $\gamma$s corresponding to the data generation and each scenario can be found in Supplementary Tables S1–S4.
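The censoring mechanisms described above might be sketched as follows; the event-time distribution, `theta`, and the dependent-censoring coefficients are all hypothetical placeholders, not the values used in the paper.

```python
import numpy as np

rng = np.random.default_rng(11)
n = 5000
T = rng.weibull(2.0, n) * 50.0        # stand-in event times
X1 = rng.normal(size=n)
X4 = rng.binomial(1, 0.5, n)

# independent censoring: C ~ Exp(theta), theta tuned toward a target rate
theta = 0.004
C_ind = rng.exponential(1.0 / theta, n)

# covariate-dependent censoring: rate exp(theta' X), hypothetical coefficients
rate = np.exp(-5.5 + 0.5 * X1 + 0.8 * X4)
C_dep = rng.exponential(1.0 / rate)

time_obs = np.minimum(T, C_ind)       # observed time under independent censoring
event = (T <= C_ind).astype(int)      # event indicator
cens_rate = 1.0 - event.mean()        # empirical censoring rate
```

In practice `theta` (or the coefficients in `rate`) would be iterated until `cens_rate` hits the target of 0.1, 0.2 or 0.3.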

Fig. 1.

Fig. 1

True survival curves associated with true population marginal hazard ratio of 1.0, 1.5, 2.0, and 3.0, when λ = 0.00002 and η = 2

Fig. 2.

Fig. 2

Levels of overlap in true propensity scores with a treatment prevalence of approximately 35%. Panels from left to right represent weak and strong overlap in propensity scores, obtained by setting the value of $\psi$ to 2 and 0.4, respectively

3.2. Model evaluation

Under each simulated scenario, we apply Methods I, II and III defined previously to estimate the treatment effect, specifically $\widehat{PSQE}(q)$ ($q = 0.25$, $0.50$ (median) and $0.75$) and $\widehat{PMHR}$. For model evaluation, we calculate the mean relative bias (MRB) and coverage probability (CP) of the effect estimates. Here, MRB is defined as $\frac{1}{1000}\sum_{i=1}^{1000}\left(\frac{\widehat{PSQE}_i(q)}{PSQE(q)} - 1\right)$ for the absolute treatment effect and $\frac{1}{1000}\sum_{i=1}^{1000}\left(\frac{\widehat{PMHR}_i}{PMHR} - 1\right)$ for the relative treatment effect, i.e., the relative difference between the estimated and true effects, averaged across the 1000 sampled datasets. In addition, we also report the mean absolute bias (MAB), defined as $\frac{1}{1000}\sum_{i=1}^{1000}\left|\widehat{PSQE}_i(q) - PSQE(q)\right|$ for the absolute treatment effect and $\frac{1}{1000}\sum_{i=1}^{1000}\left|\widehat{PMHR}_i - PMHR\right|$ for the relative treatment effect. To obtain the standard error of each estimated treatment effect for constructing the CP, the jackknife leave-cluster-out method (Rao et al. 1992; Rust and Rao 1996; Lee and Kim 2002; Kolenikov 2010) is used for estimating $se(\widehat{PSQE}(q))$, which is

$$se(\widehat{PSQE}(q)) = \left( \sum_{s=1}^{S} \frac{(n_s - 1)}{n_s} \sum_{j=1}^{n_s} \left( \widehat{PSQE}_j(q) - \widehat{PSQE}(q) \right)^2 \right)^{1/2}, \quad q = 0.25, 0.50, 0.75$$

where $S$ is the number of strata, $n_s$ is the number of sampled clusters in the $s$-th stratum, and $\widehat{PSQE}_j(q)$ is the estimate of $PSQE(q)$ with the $j$-th cluster removed from the $s$-th stratum. The standard error of $\widehat{PMHR}$, $se(\widehat{PMHR})$, is obtained empirically from the Cox regressions (Andersen and Gill 1982; Therneau and Grambsch 2000). The CP for each measure is defined as

$$\frac{1}{1000} \sum_{i=1}^{1000} I\left( PSQE(q) \in \widehat{PSQE}_i(q) \pm 1.96 \, se(\widehat{PSQE}(q)) \right)$$

for $q = 0.25, 0.50$ and $0.75$, and $\frac{1}{1000} \sum_{i=1}^{1000} I\left( PMHR \in \widehat{PMHR}_i \pm 1.96 \, se(\widehat{PMHR}) \right)$, respectively; that is, the percentage of the 1000 replicates in which the 95% CI contains the true treatment measure. We seek a method that balances accuracy (low MAB) with reliable estimation (CP close to 95%).
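These three evaluation metrics are straightforward to compute once the replicate estimates are in hand. The sketch below uses hypothetical normally distributed replicate estimates around a true value of 2.0 purely to exercise the functions.

```python
import numpy as np

def mrb(est, truth):
    """Mean relative bias across replicates."""
    return np.mean(est / truth - 1.0)

def mab(est, truth):
    """Mean absolute bias across replicates."""
    return np.mean(np.abs(est - truth))

def coverage(est, se, truth):
    """Share of replicates whose 95% CI covers the truth."""
    return np.mean((est - 1.96 * se <= truth) & (truth <= est + 1.96 * se))

# hypothetical replicate estimates around a true PMHR of 2.0
rng = np.random.default_rng(5)
truth = 2.0
est = rng.normal(truth, 0.1, 1000)
se = np.full(1000, 0.1)
```

With well-calibrated standard errors, `coverage` should land near the nominal 0.95, which is the benchmark used throughout the results below.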

4. Simulation results

4.1. True treatment effect

Figure 3 illustrates the MRBs, MABs and CPs of $\widehat{PSQE}(q)$ and $\widehat{PMHR}$ obtained using the three PSW methods described in Sect. 2.3 with varied MHRs, when the treatment prevalence was set to 35% and no censoring was applied. Visual comparisons of the true survival curves with those estimated using the three PSW methods for each scenario are provided in the supplemental material (Figures S1–S15). For the absolute treatment effect, as the true effect grew larger, the MRBs of $\widehat{PSQE}(q)$ decreased, although the MABs increased as expected. The CPs decreased for all three methods, but most drastically for Method I. In comparison with Method I, Methods II and III consistently produced lower MRBs for $\widehat{PSQE}(q)$ and consistently yielded higher CPs. In particular, Method III performed best, with the lowest MRBs and CPs closest to 95% across most situations. For the relative treatment effect, as the true treatment effect increased, the MABs of $\widehat{PMHR}$ increased while the MRBs remained relatively stable for all three methods. The CPs slightly decreased for Methods II and III but increased for Method I. Compared with Method I, Methods II and III yielded lower MRBs and CPs closer to 95% under most scenarios (Fig. 3).

Fig. 3.

Fig. 3

Mean relative biases (MRBs), mean absolute biases (MABs) and coverage probabilities (CPs) of the absolute treatment effect measured by the estimated survival quantile effect at level $q$ ($\widehat{PSQE}(q)$) and the relative treatment effect measured by the estimated population marginal hazard ratio ($\widehat{PMHR}$) using Method I, Method II and Method III with varied magnitudes of the true marginal hazard ratio, when the treatment prevalence = 35% and there is no censoring. Method I: neither the PS model nor the outcome model accounts for the survey design; Method II: the PS model does not account for the survey design, but the outcome model does; Method III: both the PS model and outcome model account for the survey design. MRB is unavailable for $\widehat{PSQE}(q)$ when the true PMHR = 1, because the true $PSQE(q)$ value of 0 is not a valid denominator

4.2. Censoring

4.2.1. Independent censoring

Figures 4, 5 and 6 illustrate the MRBs, MABs and CPs of $\widehat{PSQE}(q)$ and $\widehat{PMHR}$ obtained using the three PSW methods, when censoring was independent of the covariates with censoring rates of 10%, 20% and 30%, while keeping the treatment prevalence at 35%. Again, both Methods II and III yielded lower MRBs of $\widehat{PSQE}(q)$, and their CPs were closer to 95%, compared with Method I. However, unlike in the previous setting with no censoring, where Method III outperformed Method II under most scenarios, trade-offs in MRBs and CPs between Methods II and III were observed, particularly for $\widehat{PSQE}(q)$ at $q = 0.25$ and $q = 0.50$. For the relative treatment effect, regardless of the true treatment effect, the MRBs of $\widehat{PMHR}$ from both weighted Methods II and III were lower than those from the unweighted Method I, while the CPs from all three methods were comparable and stayed above 90%. Even though these patterns remained the same across all censoring rates, the MRBs from all three methods slightly increased for both $\widehat{PSQE}(q)$ and $\widehat{PMHR}$ as the censoring rate increased. However, the CPs decreased for $\widehat{PSQE}(q)$ but not for $\widehat{PMHR}$ (Figs. 4, 5, 6).

Fig. 4.

Fig. 4

Mean relative biases (MRBs), mean absolute biases (MABs) and coverage probabilities (CPs) of the absolute treatment effect measured by the estimated survival quantile effect at level $q$ ($\widehat{PSQE}(q)$) and the relative treatment effect measured by the estimated population marginal hazard ratio ($\widehat{PMHR}$) using Method I, Method II and Method III with varied magnitudes of the true marginal hazard ratio, when the treatment prevalence = 35% and censoring is independent with censoring rate = 0.1. Method I: neither the PS model nor the outcome model accounts for the survey design; Method II: the PS model does not account for the survey design, but the outcome model does; Method III: both the PS model and outcome model account for the survey design. MRB is unavailable for $\widehat{PSQE}(q)$ when the true PMHR = 1, because the true $PSQE(q)$ value of 0 is not a valid denominator

Fig. 5.

Fig. 5

Mean relative biases (MRBs), mean absolute biases (MABs) and coverage probabilities (CPs) of the absolute treatment effect measured by the estimated survival quantile effect at level $q$ ($\widehat{PSQE}(q)$) and the relative treatment effect measured by the estimated population marginal hazard ratio ($\widehat{PMHR}$) using Method I, Method II and Method III with varied magnitudes of the true marginal hazard ratio, when the treatment prevalence = 35% and censoring is independent with censoring rate = 0.2. Method I: neither the PS model nor the outcome model accounts for the survey design; Method II: the PS model does not account for the survey design, but the outcome model does; Method III: both the PS model and outcome model account for the survey design. MRB is unavailable for $\widehat{PSQE}(q)$ when the true PMHR = 1, because the true $PSQE(q)$ value of 0 is not a valid denominator

Fig. 6.

Fig. 6

Mean relative biases (MRBs), mean absolute biases (MABs) and coverage probabilities (CPs) of the absolute treatment effect measured by the estimated survival quantile effect at level $q$ ($\widehat{PSQE}(q)$) and the relative treatment effect measured by the estimated population marginal hazard ratio ($\widehat{PMHR}$) using Method I, Method II and Method III with varied magnitudes of the true marginal hazard ratio, when the treatment prevalence = 35% and censoring is independent with censoring rate = 0.3. Method I: neither the PS model nor the outcome model accounts for the survey design; Method II: the PS model does not account for the survey design, but the outcome model does; Method III: both the PS model and outcome model account for the survey design. MRB is unavailable for $\widehat{PSQE}(q)$ when the true PMHR = 1, because the true $PSQE(q)$ value of 0 is not a valid denominator

4.2.2. Dependent censoring

Figures 7, 8 and 9 illustrate the MRBs, MABs and CPs of $\widehat{PSQE}(q)$ and $\widehat{PMHR}$ obtained using the three PSW methods, when censoring was dependent on the covariates with censoring rates of 10%, 20% and 30%, while maintaining the treatment prevalence of 35%. The patterns were similar to those in the scenarios with independent censoring, but with correspondingly higher MRBs and lower CPs from all three methods for $\widehat{PSQE}(q)$, though not necessarily for $\widehat{PMHR}$. For the absolute treatment effect, Methods II and III again outperformed Method I in terms of both MRBs and CPs. Method II tended to outperform Method III more often for $\widehat{PSQE}(q)$, depending on the value of the PMHR and the quantile of the survival function, though neither had a dominant advantage. For the relative treatment effect, although the CPs of $\widehat{PMHR}$ from all three methods were reasonably close to 95%, both weighted Methods II and III yielded much lower MABs and MRBs than the unweighted Method I, and this was more evident as the true PMHR increased.

Fig. 7.

Fig. 7

Mean relative biases (MRBs), mean absolute biases (MABs) and coverage probabilities (CPs) of the absolute treatment effect measured by the estimated survival quantile effect at level $q$ ($\widehat{PSQE}(q)$) and the relative treatment effect measured by the estimated population marginal hazard ratio ($\widehat{PMHR}$) using Method I, Method II and Method III with varied magnitudes of the true marginal hazard ratio, when the treatment prevalence = 35% and censoring is dependent with censoring rate = 0.1. Method I: neither the PS model nor the outcome model accounts for the survey design; Method II: the PS model does not account for the survey design, but the outcome model does; Method III: both the PS model and outcome model account for the survey design. MRB is unavailable for $\widehat{PSQE}(q)$ when the true PMHR = 1, because the true $PSQE(q)$ value of 0 is not a valid denominator

Fig. 8.

Fig. 8

Mean relative biases (MRBs), mean absolute biases (MABs) and coverage probabilities (CPs) of the absolute treatment effect measured by the estimated survival quantile effect at level $q$ ($\widehat{PSQE}(q)$) and the relative treatment effect measured by the estimated population marginal hazard ratio ($\widehat{PMHR}$) using Method I, Method II and Method III with varied magnitudes of the true marginal hazard ratio, when the treatment prevalence = 35% and censoring is dependent with censoring rate = 0.2. Method I: neither the PS model nor the outcome model accounts for the survey design; Method II: the PS model does not account for the survey design, but the outcome model does; Method III: both the PS model and outcome model account for the survey design. MRB is unavailable for $\widehat{PSQE}(q)$ when the true PMHR = 1, because the true $PSQE(q)$ value of 0 is not a valid denominator

Fig. 9.

Fig. 9

Mean relative biases (MRBs), mean absolute biases (MABs) and coverage probabilities (CPs) of the absolute treatment effect measured by the estimated survival quantile effect at level $q$ ($\widehat{PSQE}(q)$) and the relative treatment effect measured by the estimated population marginal hazard ratio ($\widehat{PMHR}$) using Method I, Method II and Method III with varied magnitudes of the true marginal hazard ratio, when the treatment prevalence = 35% and censoring is dependent with censoring rate = 0.3. Method I: neither the PS model nor the outcome model accounts for the survey design; Method II: the PS model does not account for the survey design, but the outcome model does; Method III: both the PS model and outcome model account for the survey design. MRB is unavailable for $\widehat{PSQE}(q)$ when the true PMHR = 1, because the true $PSQE(q)$ value of 0 is not a valid denominator

4.3. Overlap in propensity score

Figures 10 and 11 show the MRBs and CPs of PSQE^(q) and PMHR^ obtained using the three PSW methods when the level of overlap in PS between the treatment and control groups was varied to be strong (ψ=0.4) or weak (ψ=2), while the prevalence of treatment was held constant at 35% and there was no censoring. With strong overlap in PS between the two groups, the three methods were comparable for the absolute treatment effect, with Method I attaining the lowest MABs for all PSQE^(q) when the true PMHR=1. However, when the true PMHR>1, Methods II and III consistently had lower MABs for all PSQE^(q) and CPs closer to 95% compared to Method I. For the relative treatment effect, Method I achieved the lowest MABs while maintaining reasonable CPs for PMHR^ across all true PMHR values. However, when the PS overlap was weak, both Methods II and III consistently outperformed Method I in terms of MABs and CPs for both treatment effect measures.

Fig. 10.


Mean Relative biases (MRBs), Mean absolute biases (MABs) and coverage probabilities (CPs) of the absolute treatment effect measured by the estimated survival quantile effect at level q(PSQE^(q)) and relative treatment effect measured by the estimated population marginal hazard ratio (PMHR^) using Method I, Method II and Method III with varied magnitude of true marginal hazard ratio when PS overlap is strong (ψ=0.4), while keeping the treatment prevalence = 35% and no censoring. Method I: neither the PS model nor the outcome model accounts for the survey design; Method II: the PS model does not account for the survey design, but the outcome model does; Method III: both the PS model and outcome model account for the survey design. MRB is unavailable for the PSQE^(q) when true PMHR=1, because the true PSQE(q) value of 0 is not a valid denominator

Fig. 11.


Mean Relative biases (MRBs), Mean absolute biases (MABs) and coverage probabilities (CPs) of the absolute treatment effect measured by the estimated survival quantile effect at level q(PSQE^(q)) and relative treatment effect measured by the estimated population marginal hazard ratio (PMHR^) using Method I, Method II and Method III with varied magnitude of true marginal hazard ratio when PS overlap is weak (ψ=2), while keeping the treatment prevalence = 35% and no censoring. Method I: neither the PS model nor the outcome model accounts for the survey design; Method II: the PS model does not account for the survey design, but the outcome model does; Method III: both the PS model and outcome model account for the survey design. MRB is unavailable for the PSQE^(q) when true PMHR=1, because the true PSQE(q) value of 0 is not a valid denominator

4.4. Presence of outliers

When 5% influential outliers were present under no censoring, independent censoring, and dependent censoring with a censoring rate of 20%, the performance of the three methods remained similar to the settings without outliers. Methods II and III continued to outperform Method I in estimating both measures of the treatment effect, with Method III yielding the lowest MABs and reasonable CPs in most scenarios, especially with no censoring and independent censoring (Supplementary Figures S16–S18).

4.5. Nonresponse

When approximately 20% of subjects in the sample were non-responders, the results—whether under no censoring, independent censoring, or dependent censoring with a 20% censoring rate—were similar to the pattern observed in the settings with no non-response. Methods II and III outperformed Method I when the true PMHR>1 for both measures of the treatment effect. Again, Method III performed slightly better than Method II under most scenarios with no censoring and independent censoring. We also noted a decrease in the performance of all three methods, particularly for the absolute treatment effect, as the complexity of the settings increased from no censoring to dependent censoring in conjunction with non-response (Figures S19–S21).

4.6. Sample size

The performance of the three methods remained similar across sample sizes, with Methods II and III outperforming Method I, particularly when the true PMHR>1. As expected, the MRBs for all three methods decreased as the sample size increased. The CPs of both PSQE^(q) and PMHR^ stayed relatively stable for Methods II and III, while the CPs of PSQE^(q) decreased drastically for Method I as the treatment effect size grew larger (Figures S22, S23).

5. Illustrative example

In this section, we provide a case study to illustrate the application of PSW to population-based survey data for examining the effect of smoking cessation subsequent to a cancer diagnosis on overall survival.

5.1. Data and study population

We utilized publicly available data from the 2000–2018 National Health Interview Survey (NHIS) Linked to Mortality Files (LMF), where the mortality data are through December 31, 2019 (the latest available LMF data). This ensured a minimum follow-up time of 1 year. NHIS is an ongoing cross-sectional population-based survey with a multistage probability design that provides nationally representative samples of the civilian, non-institutionalized population living in the 50 US states and the District of Columbia. The survey collects data on participants’ socioeconomic status, demographics, health conditions, and access to and utilization of health care services (National Center for Health Statistics 2016). The public-use LMF provides information on the vital status of NHIS participants, the quarter and year of death if deceased, as well as the leading cause of death.

To assess the impact of smoking cessation subsequent to a cancer diagnosis on overall survival of adult cancer survivors, we focused on the comparison between those who quit smoking within 3 years of their cancer diagnosis and those who continued smoking. We first identified 22,325 adult cancer survivors (age at cancer diagnosis ≥ 18) who were between 18 and 84 years old at the time of their NHIS interviews and had smoking history information. We did not include cancer survivors aged 85 years and above (n = 1562) because the NHIS does not report single year of age for this age group in public-use data. After applying additional exclusion criteria, our final study sample comprised 8,120 participants. Of these, 1,575 were former smokers who quit smoking within 3 years of a cancer diagnosis, and 6,545 continued smoking after a cancer diagnosis (Supplementary Figure S22).

5.2. Measures of outcome, exposure, and covariate variables

We defined the event of interest as all-cause mortality. Utilizing the linked mortality data, we derived follow-up time from age at cancer diagnosis to age at death or end of follow-up. Because information on the month of death and month of cancer diagnosis was unavailable, we set the follow-up time to 1 year for those who died in the same year as their cancer diagnosis (n = 23). The exposure variable was binary quitting status subsequent to cancer diagnosis. Only participants who quit smoking within 3 years of their cancer diagnosis were included in the “treatment” or “quit” group. If a participant had more than one cancer type (n = 852), the year of the first cancer diagnosis was considered the index date. The “control” or “non-quit” group was defined as cancer survivors who smoked daily or occasionally at the time of their NHIS interviews.

Based on existing literature and domain knowledge (Yang et al. 2015; Tabuchi et al. 2017; Wang et al. 2023), we considered the following individual-level demographic and clinical characteristics and health behaviors as potential confounders: age at first cancer diagnosis, sex, race/ethnicity, marital status, educational attainment, residential region, family income (reported by a family respondent) as a percentage of the federal poverty level (FPL), cancer type dichotomized as smoking-related cancer or other, years of smoking, alcohol consumption, and survey era (Table 1).

Table 1.

Baseline characteristics of adult cancer survivors (age 18–84) who quit smoking within 3 years of a cancer diagnosis and those who continued smoking, before and after propensity score weighting (PSW)

Variable | Before PSW: Quit (N = 1575) | Before PSW: Non-quit (N = 6545) | SMD | After PSW: Quit (N = 1575) | After PSW: Non-quit (N = 6545) | SMD
Age in years at a cancer diagnosis (%) 0.45 0.05
 18–30 9.1 21.8 20.4 19.2
 31–44 16.8 24.2 22.1 23.0
 45–64 55.4 40.9 42.7 43.8
 65–74 16.2 10.9 12.7 11.7
 75–84 2.5 2.2 2.1 2.3
Sex, female (%) 54.0 63.7 0.20 59.5 61.9 0.05
Race/Ethnicity (%) 0.06 0.04
 Non-Hispanic White 85.3 85.8 86.4 85.6
 Non-Hispanic Black 7.6 7.1 7.2 7.2
 Hispanic 3.9 3.1 3.2 3.3
 Othera 3.2 4.0 3.2 3.8
Marital status (%) 0.34 0.01
 Married/living with a partner 49.0 35.5 38.3 37.9
 Never married/divorced/separated 33.4 49.6 46.9 46.7
 Widowed 17.6 14.9 14.9 15.4
Education (%) 0.21 0.09
 < High school 15.6 19.7 16.0 18.9
 High school graduate (including GED) 30.0 34.6 33.5 33.8
 College or equivalentb 46.1 41.3 44.7 42.1
 Postgraduate degreec 8.3 4.5 5.8 5.2
Region (%) 0.10 0.07
 Northeast 18.4 15.5 16.9 15.8
 Midwest 24.0 27.2 24.2 27.1
 South 39.7 39.7 39.9 39.5
 West 17.8 17.7 18.9 17.6
Family income level as a percentage of federal poverty level (%)d 0.34 0.07
 < 200% 31.5 46.7 41.2 44.0
 200–399% 33.3 29.9 30.8 30.6
 ≥ 400% 35.2 23.4 28.0 25.4
Smoking-related cancere, yes (%) 40.7 37.0 0.08 34.1 37.5 0.07
Years of smoking, mean (SD) 34.06 (14.24) 28.33 (15.58) 0.38 29.24 (15.75) 29.43 (15.47) 0.01
Alcohol consumption, yes (%) 0.04 0.02
 Lifetime abstainerf 9.4 9.9 9.5 9.9
 Former drinkerg 28.0 26.3 27.2 26.7
 Current drinkerh 62.5 63.8 63.3 63.5
Survey era (%) 0.10 0.04
 2000–2004 25.7 25.1 24.8 25.0
 2005–2009 24.4 27.2 26.1 26.8
 2010–2014 25.8 27.1 26.5 27.1
 2015–2018 24.1 20.5 22.6 21.1

All estimates are weighted by the linkage-eligible sampling weight

SMD standardized mean difference, SD standard deviation

a Other includes Asian, American Indian or Alaska Native, Native Hawaiian or Other Pacific Islander, and other race/ethnicity (including multiple race groups)

b College or equivalent includes those with a Bachelor’s or Associate degree, as well as those who have attended college without the awarding of a degree

c Postgraduate degree includes those with a Master’s, Professional, or Doctoral degree

d Family income level as a percentage of the federal poverty level (using poverty thresholds as defined by the US Census Bureau) is based on the NHIS multiply-imputed family income and personal earnings files (Centers for Disease Control and Prevention 2018)

e Smoking-related cancer includes cancers of the lung, mouth and throat, voice box, esophagus, stomach, kidney, pancreas, liver, bladder, cervix, colon and rectum, and leukemia

f Lifetime abstainer is defined as having fewer than 12 drinks in the adult’s lifetime

g Former drinker is defined as having no drinks in the past year but one or more drinks in any one year

h Current drinker is defined as having one or more drinks in the past year

5.3. Statistical analyses

To examine whether smoking cessation after a cancer diagnosis is associated with overall survival, we applied the three PSW methods described earlier. In the first step, we estimated the PS by regressing the binary post-diagnosis quit status on the observed confounders listed in Table 1 using logistic regression models, with and without accounting for the survey design. We then derived the IPTW using the estimated PS to create a pseudo population of cancer survivors based on the distribution of covariates in the combined sample. In the second step, we incorporated the IPTW into the Kaplan–Meier estimation and Cox PH regression, with and without accounting for the survey design. Considering the potential immortal time bias resulting from attributing the time period between a cancer diagnosis and quitting smoking as “exposed”, we performed sensitivity analyses using the landmark method with 2 landmark times (i.e., 1 year and 2 years after a cancer diagnosis) (Mi et al. 2016). In addition, we also performed another sensitivity analysis by restricting the “quit” group to those who quit smoking within 1 year of a cancer diagnosis.
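The two-step procedure above can be sketched schematically. The following Python fragment is a minimal illustration with hypothetical function names (the actual analysis would use survey-aware software such as the R survey package); it shows how an ATE-targeting IPTW is formed from an estimated PS and, for the survey-weighted outcome stage of Methods II and III, combined with the sampling weight:

```python
def iptw(ps, treated):
    """ATE-targeting inverse probability of treatment weight:
    1/PS for treated subjects, 1/(1 - PS) for controls."""
    return 1.0 / ps if treated else 1.0 / (1.0 - ps)


def outcome_stage_weight(ps, treated, sampling_wt, survey_weighted_outcome):
    """Weight used in the Kaplan-Meier / Cox outcome stage.

    Method I uses the IPTW alone; Methods II and III multiply the IPTW
    by the survey sampling weight. (Whether the PS itself was estimated
    with or without the design distinguishes Methods II and III and is
    not shown here.)
    """
    w = iptw(ps, treated)
    return w * sampling_wt if survey_weighted_outcome else w
```

For example, a treated subject with estimated PS 0.25 and sampling weight 2.0 would carry an outcome-stage weight of 4.0 under Method I and 8.0 under Methods II/III.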

5.4. Results

Before applying PSW, the quit and non-quit groups differed on many baseline characteristics (Table 1). Compared with the non-quit group, the quit group tended to be older, male, more educated, married or living with a partner or widowed, and to have a higher family income. After PSW, the two groups had similar covariate distributions (Table 1), with standardized mean differences (SMDs) ≤ 0.10 indicating good balance (Austin 2011).
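As a concrete illustration of the balance metric, the weighted SMD for a single (binary or continuous) covariate can be computed as below. This is our own minimal sketch; function names are ours, and pooled-variance conventions vary slightly across software:

```python
import math


def weighted_mean_var(x, w):
    """Weighted mean and weighted variance of covariate values x."""
    wt = sum(w)
    m = sum(wi * xi for wi, xi in zip(w, x)) / wt
    v = sum(wi * (xi - m) ** 2 for wi, xi in zip(w, x)) / wt
    return m, v


def smd(x_treat, w_treat, x_ctrl, w_ctrl):
    """Weighted standardized mean difference, pooling the two group variances."""
    m1, v1 = weighted_mean_var(x_treat, w_treat)
    m0, v0 = weighted_mean_var(x_ctrl, w_ctrl)
    return abs(m1 - m0) / math.sqrt((v1 + v0) / 2.0)
```

Computing this for each covariate before PSW (weights set to 1, or to the sampling weights) and after PSW (IPTW, possibly combined with sampling weights) yields the two SMD columns of Table 1.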

Table 2 shows the estimated absolute and relative effects of quitting smoking within 3 years after a cancer diagnosis on subsequent overall survival for adult cancer survivors, followed up to 60 years since their cancer diagnosis, estimated by the three PSW methods. During the follow-up period, 564 of 1,575 (weighted 36.1%) cancer survivors in the quit group died, compared with 2,057 of 6,545 (weighted 31.4%) in the non-quit group. All three methods showed that, compared to the quit group, the non-quit group consistently had a shorter survival time (Fig. 12). The estimated median (q=0.50) survival time was 39 years for the quit group and 36 years for the non-quit group, with estimated PSQEs of 3 years (95% CI 2.76–3.24 years) and 3 years (95% CI 2.90–3.10 years) from Methods II and III, respectively. Similarly, Methods II and III yielded close or identical estimates for the PSQE at the 0.25-quantile (q=0.25) (2 years, 95% CI 2.00–2.00 years from both methods) and the 0.75-quantile (q=0.75) of the survival time (2 years, 95% CI 1.69–2.31 years from Method II vs 1 year, 95% CI 0.63–1.37 years from Method III) (Table 2). We also found similar patterns of worse survival in the non-quit group when using the relative treatment effect measure. Compared to the quit group, the non-quit group consistently had a significantly higher hazard of death, with estimated PMHRs of 1.18 (95% CI 1.06–1.32), 1.19 (95% CI 1.05–1.34) and 1.18 (95% CI 1.04–1.33) from Methods I, II and III, respectively (Table 2). Landmark analyses provided very similar estimated PMHRs, which ranged between 1.18–1.19 and 1.17–1.18 for landmark times set at 1 year and 2 years after a cancer diagnosis, respectively (Supplementary Tables S6 & S7).
Results from sensitivity analyses restricting the quit group to those who quit smoking within 1 year of a cancer diagnosis show that the effect of smoking cessation remained significant, with estimated PMHRs of 1.21 (95% CI 1.06–1.38), 1.23 (95% CI 1.06–1.43) and 1.18 (95% CI 1.02–1.37) from Methods I–III, respectively (Supplementary Table S6). We also compared the distribution of follow-up time between the analytic cohort and subjects excluded due to missing values (n = 220) and did not observe concerning differences (Supplementary Table S8).
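The survival quantiles underlying the PSQE estimates come from a weighted Kaplan–Meier curve. The sketch below (our own simplified implementation, ignoring the variance estimation and design features used in the actual analysis) shows how a weighted survival curve is built and how a quantile is read off; PSQE(q) is then the difference between the treated and control group q-quantiles:

```python
def weighted_km(times, events, weights):
    """Weighted Kaplan-Meier estimator: returns [(t, S(t))] at each event time."""
    data = sorted(zip(times, events, weights))  # order by follow-up time
    at_risk = sum(weights)
    surv, curve = 1.0, []
    i, n = 0, len(data)
    while i < n:
        t = data[i][0]
        deaths = leaving = 0.0
        # accumulate weighted deaths and all subjects leaving the risk set at t
        while i < n and data[i][0] == t:
            _, e, w = data[i]
            if e:
                deaths += w
            leaving += w
            i += 1
        if deaths > 0:
            surv *= 1.0 - deaths / at_risk
            curve.append((t, surv))
        at_risk -= leaving
    return curve


def survival_quantile(km_curve, q):
    """Smallest t with S(t) <= 1 - q (e.g., q = 0.5 gives the median)."""
    for t, s in km_curve:
        if s <= 1.0 - q:
            return t
    return None  # quantile not reached during follow-up
```

With IPTW × sampling weights supplied as `weights`, `survival_quantile` applied to each group's curve reproduces the quantile columns of Table 2.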

Table 2.

Estimated absolute treatment effect and relative treatment effect of quitting smoking within 3 years of a cancer diagnosis on overall survival (survival time was left truncated at 60 years after a cancer diagnosis)

Method | Exposure group | Number of deaths/Number of subjects | Survival time (q=0.25) | PSQE(q=0.25) (95% CI) | Survival time (q=0.50) | PSQE(q=0.50) (95% CI) | Survival time (q=0.75) | PSQE(q=0.75) (95% CI) | PMHR* (95% CI) — all survival times in years
Method I Quit 564/1575 20 2.00 (2.00, 2.00) 39 2.00 (2.00, 2.00) 50 0 (−0.27, 0.27) 1.18 (1.06, 1.32)
Non-quit 2057/6545 18 37 50
Method II Quit 564/1575 20 2.00 (2.00, 2.00) 39 3.00 (2.76, 3.24) 51 2.00 (1.69, 2.31) 1.19 (1.05, 1.34)
Non-quit 2057/6545 18 36 49
Method III Quit 564/1575 20 2.00 (2.00, 2.00) 39 3.00 (2.90, 3.10) 50 1.00 (0.63, 1.37) 1.18 (1.04, 1.33)
Non-quit 2057/6545 18 36 49

Note: PSQE, population survival quantile effect; PMHR, Population marginal hazard ratio; Method I: neither the PS model nor the outcome model accounts for the survey design; Method II: the PS model does not account for the survey design, but the outcome model does; Method III: both the PS model and outcome model account for the survey design

* Hazard ratio of death comparing subjects who did not quit smoking to those who quit smoking within 3 years of a cancer diagnosis

Fig. 12.


Estimated survival curves of adult cancer survivors who quit smoking within 3 years of a cancer diagnosis and those who continued smoking, estimated using Method I, Method II and Method III. Method I: neither the PS model nor the outcome model accounts for the survey design; Method II: the PS model does not account for the survey design, but the outcome model does; Method III: both the PS model and outcome model account for the survey design

Although the intent of this real-data example is to illustrate the application of the three methods rather than to evaluate the effectiveness of smoking cessation subsequent to a cancer diagnosis, we acknowledge several limitations, such as unmeasured confounders (e.g., disease stage, duration, and treatment history), health behaviors treated as time-invariant (e.g., smoking or drinking habits may change over time), and possible misclassification of quitting status for participants who quit smoking after their NHIS interview. In addition, there may be immortal time bias, as the cancer survivors in the study included only those who survived long enough to participate in the survey.

6. Discussion

In this study, we conducted a series of simulations to explore the appropriate application of PSW in the analysis of complex survey data with survival outcomes. Depending on whether and at which step the survey design elements were incorporated, three PSW methods for estimating the PSQE and PMHR were evaluated. The performance of the three methods was examined under a wide range of scenarios, including varying the true treatment effect size, censoring type and censoring rate, and level of overlap in PS, as well as the presence of influential outliers and non-response.

All three methods yielded biased estimates of both PSQE and PMHR, but the degree of bias differed across methods and across the PSQE and PMHR estimates. For example, the MRBs of the estimated PMHR from Methods II and III consistently ranged from 2 to 4% across most scenarios. This subtle bias could stem from the data-generating process for a desired MHR, which is derived from a conditional HR because of the non-collapsibility of the Cox PH model (Gail et al. 1984). To determine the value of the conditional HR exp(βtreat) that induced the desired MHR, we used an iterative bisection approach. Though this approach is well established (Austin 2013, 2023a; Austin et al. 2019), the indirect transformation from a conditional HR to a MHR may introduce some bias. In addition, sampling, especially with a complex probability scheme, may also introduce bias. We drew a sample of 5000 in total out of a population of 1,000,000, with unequal sample sizes across strata and corresponding selection probabilities. Even with 1000 repetitions, the PMHR estimated from the samples might still differ from the true MHR of the entire population. Furthermore, extreme PS weights from certain samples can cause inadequate covariate balance, potentially biasing the estimated MHR. This becomes more pronounced as the true MHR approaches 1 (Austin 2013).
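The bisection step can be illustrated generically. In the actual data-generating process, the bracketed function would be f(β) = (marginal HR simulated under conditional log-HR β) − (target MHR); the sketch below is our own generic implementation, demonstrated on a toy monotone function rather than the full simulation loop:

```python
def bisect_root(f, lo, hi, tol=1e-8, max_iter=200):
    """Find x with f(x) ~ 0 by bisection; f(lo) and f(hi) must bracket a sign change."""
    flo = f(lo)
    if flo == 0.0:
        return lo
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        fmid = f(mid)
        if abs(fmid) < tol or (hi - lo) < tol:
            return mid
        # keep the half-interval that still brackets the sign change
        if (flo < 0.0) == (fmid < 0.0):
            lo, flo = mid, fmid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

In the simulation context, each evaluation of f requires generating a large dataset and fitting a marginal Cox model, so the Monte Carlo noise in f is one plausible source of the small residual bias noted above.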

Previous studies have demonstrated that combining the survey design and PSW is a robust approach to achieve unbiased population-level treatment effects (e.g., the population average treatment effect (PATE) and the population average treatment effect on the treated (PATT)) for a continuous or binary outcome (Dugoff et al. 2014; Ridgeway et al. 2015; Yang et al. 2023). Our results also suggest that, for estimating population-level treatment effects on a survival outcome, both weighted methods accounting for the survey design were preferred over the unweighted method under most scenarios. Compared to the unweighted method, both weighted methods yielded estimates of absolute and relative treatment effects with lower MABs and reasonable CPs when PS overlap was not strong or a treatment effect existed. Although MABs from all three methods increased and CPs decreased as the true treatment effect grew larger, these metrics were relatively more stable for the weighted Methods II and III. The advantage of the two weighted methods over the unweighted method persisted even in the presence of outliers or non-response.

However, when there was strong overlap in PS and no treatment effect, the three PSW methods performed closely, with some trade-offs in MABs and CPs between Method I and the two weighted methods. We also note that Method I yielded the estimates with the lowest MABs and reasonable CPs for PMHR regardless of the treatment effect size. This finding is confirmed in the case study, where the three methods yielded close estimates, particularly for PMHR. As the overlap in PS decreased in the sensitivity analysis with further restrictions, the PMHR estimates from the three methods started to diverge. Therefore, we stress the importance of a preliminary analysis in practice to assess the overlap in PS, as well as consideration of the measure of treatment effect (absolute or relative), before selecting a suitable method.

We also found that the performance of the two weighted methods depended on the censoring rate, censoring type and quantile of the survival function—Method III excelled under most scenarios, particularly in the settings with no censoring. When there was independent censoring, Method II tended to outperform Method III at some time points of the SQE measures for non-zero treatment effects, and this was more apparent when the censoring was dependent. However, the performance of the two weighted methods was very close—the differences in MRBs were negligible and/or involved trade-offs of bias and coverage with no clear advantage of one over the other. Thus, there is no strong evidence for recommending Method II over Method III. More importantly, the type of censoring or its underlying mechanism is often unknown in reality. We also note that the bias from all three methods increased as the censoring rate increased, suggesting that embedding other approaches such as inverse probability of censoring weighting (IPCW) to account for the censoring distribution (Robins and Finkelstein 2000; Boyd et al. 2012) may be needed.

As our focus in this work is on exploring whether and at which modelling stage survey designs should be incorporated in PSW analysis, we employed the commonly used logistic regression to construct simple inverse probability of treatment weights. Future work could consider newer and more flexible methods for deriving the PS, such as ensemble learning approaches and other machine learning models (Lee et al. 2009; Watkins et al. 2013; Pirracchio et al. 2015; Gharibzadeh et al. 2018; Cannas and Arpino 2019), as well as alternative PS weights such as variance-stabilized weights (Austin and Stuart 2015; Li and Thomas 2018) and overlap weights for handling extreme PS weights (Li and Thomas 2018). Also, for a consistent comparison across all settings, we adopted the traditional Cox PH regression to estimate relative treatment effects. Building on this work, future research could consider extensions to other approaches, such as robust weighted Cox PH regression to handle influential outliers (Minder and Bednarski 1996; Sitlani et al. 2020), IPCW to correct bias resulting from dependent censoring (Robins and Finkelstein 2000; Boyd et al. 2012; Therneau 2015; Dunkler et al. 2018; Handorf et al. 2024), and time-dependent Cox regression when the PH assumption does not hold (Robins and Finkelstein 2000; Boyd et al. 2012; Therneau 2015; Dunkler et al. 2018; Handorf et al. 2024).
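The alternative weights mentioned above have simple closed forms. For a subject with estimated PS e and marginal treatment prevalence p, a minimal sketch (our notation; these are the standard textbook formulas, not code from this study) is:

```python
def stabilized_weight(ps, treated, p_treat):
    """Variance-stabilized IPTW: marginal treatment prevalence in the numerator,
    which keeps weights from exploding when the PS is near 0 or 1."""
    return p_treat / ps if treated else (1.0 - p_treat) / (1.0 - ps)


def overlap_weight(ps, treated):
    """Overlap weight: 1 - PS for treated subjects, PS for controls,
    smoothly downweighting subjects with extreme propensity scores."""
    return (1.0 - ps) if treated else ps
```

For instance, a treated subject with PS 0.9 receives a plain IPTW of about 1.11 but an overlap weight of only 0.1, illustrating how overlap weighting mutes the influence of near-deterministic treatment assignments.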

In this paper, we sought to provide much-needed guidance on how to properly account for survey designs in PSW analysis of survival outcomes using complex survey data and to shed light on expanding PSW research. This will help maximize the value of population-based surveys with linked administrative data, such as the NHIS and the National Health and Nutrition Examination Survey (NHANES) linked to mortality files (National Center for Health Statistics Division of Analysis and Epidemiology 2022a; 2022b), as well as longitudinal surveys such as the National Health and Aging Trends Study (NHATS) and the Medicare Current Beneficiary Survey (MCBS) (Centers for Medicare and Medicaid Services 2018; Freedman and Kasper 2019). Our simulations showed that both survey-weighted methods are preferred under most scenarios, including when there exists informative censoring, influential outliers, or non-response. Therefore, we recommend that when considering PSW with complex survey data for estimating population-level treatment effects on survival, both modelling stages should incorporate survey designs, but it is most critical for the outcome modelling.

Supplementary Material

Supplementary Materials

Supplementary Information The online version contains supplementary material available at https://doi.org/10.1007/s10742-025-00344-x.

Funding

This work was supported by ASA & NCHS Fellowship awarded to Dr. Lihua Li. Drs. Madhu Mazumdar and Lihua Li were additionally supported by funding from National Cancer Institute Center Support Grant (P30 CA196521). MM was also supported by the National Center for Advancing Translational Sciences (NCATS) cooperative agreement (1U01 TR002997-01 A1).

Disclosures

The authors have declared no disclosure on any previous presentation of the research, manuscript, or abstract. The findings and conclusions in this paper are those of the author(s) and do not necessarily represent the official position of the Centers for Disease Control and Prevention.

Footnotes

Conflict of interest The authors declare no competing interests.

Data availability

The data generation in the simulation was described in the manuscript. The National Health Interview Survey (NHIS) data and Linked Mortality Files (LMF) are publicly available and can be found at the websites of Centers for Disease Control and Prevention (https://www.cdc.gov/nchs/nhis/data-questionnaires-documentation.htm; https://www.cdc.gov/nchs/data-linkage/mortality-public.htm).

References

  1. National Center for Health Statistics: 2015 National Health Interview Survey (NHIS) Public Use Data Release Survey Description (2016)
  2. Andersen PK, Gill RD: Cox’s regression model for counting processes: a large sample study. Ann. Stat. (1982). 10.1214/aos/1176345976
  3. Austin PC: An introduction to propensity score methods for reducing the effects of confounding in observational studies. Multivariate Behav. Res. 46(3), 399–424 (2011). 10.1080/00273171.2011.568786
  4. Austin PC: The performance of different propensity score methods for estimating marginal hazard ratios. Stat. Med. 32(16), 2837–2849 (2013). 10.1002/sim.5705
  5. Austin PC: The use of propensity score methods with survival or time-to-event outcomes: reporting measures of effect similar to those used in randomized experiments. Stat. Med. 33(7), 1242–1258 (2014). 10.1002/sim.5984
  6. Austin PC: The iterative bisection procedure: a useful tool for determining parameter values in data-generating processes in Monte Carlo simulations. BMC Med. Res. Methodol. (2023a). 10.1186/s12874-023-01836-5
  7. Austin PC: The iterative bisection procedure: a useful tool for determining parameter values in data-generating processes in Monte Carlo simulations. BMC Med. Res. Methodol. 23(1), 45 (2023b). 10.1186/s12874-023-01836-5
  8. Austin PC, Mamdani MM: A comparison of propensity score methods: a case-study estimating the effectiveness of post-AMI statin use. Stat. Med. 25(12), 2084–2106 (2006). 10.1002/sim.2328
  9. Austin PC, Schuster T: The performance of different propensity score methods for estimating absolute effects of treatments on survival outcomes: a simulation study. Stat. Methods Med. Res. 25(5), 2214–2237 (2016). 10.1177/0962280213519716
  10. Austin PC, Stuart EA: Moving towards best practice when using inverse probability of treatment weighting (IPTW) using the propensity score to estimate causal treatment effects in observational studies. Stat. Med. 34(28), 3661–3679 (2015). 10.1002/sim.6607
  11. Austin PC, Jembere N, Chiu M: Propensity score matching and complex surveys. Stat. Methods Med. Res. 27(4), 1240–1257 (2018). 10.1177/0962280216658920
  12. Austin PC, Ceyisakar IE, Steyerberg EW, Lingsma HF, Marang-van de Mheen PJ: Ranking hospital performance based on individual indicators: can we increase reliability by creating composite indicators? BMC Med. Res. Methodol. (2019). 10.1186/s12874-019-0769-x
  13. Bender R, Augustin T, Blettner M: Generating survival times to simulate Cox proportional hazards models. Stat. Med. 24(11), 1713–1723 (2005). 10.1002/sim.2059
  14. Binder DA: Fitting Cox’s proportional hazards models from survey data. Biometrika 79(1), 139–147 (1992). 10.1093/biomet/79.1.139
  15. Boyd AP, Kittelson JM, Gillen DL: Estimation of treatment effect under non-proportional hazards and conditionally independent censoring. Stat. Med. 31(28), 3504–3515 (2012)
  16. Cannas M, Arpino B: A comparison of machine learning algorithms and covariate balance measures for propensity score matching and weighting. Biom. J. 61(4), 1049–1072 (2019). 10.1002/bimj.201800132
  17. Centers for Disease Control and Prevention [cited Feb. 25, 2025], “National Health Interview Survey 2018 Data Release,” [online]. Available at: https://archive.cdc.gov/#/details?url=https://www.cdc.gov/nchs/nhis/nhis_2018_data_release.htm
  18. Centers for Medicare and Medicaid Services [cited Jun. 26, 2024], “Medicare Current Beneficiary Survey (MCBS).” [online]. Available at: https://www.cms.gov/data-research/research/medicare-current-beneficiary-survey
  19. Cook BL, McGuire TG, Meara E, Zaslavsky AM: Adjusting for health status in non-linear models of health care disparities. Health Serv. Outcomes Res. Method. 9(1), 1–21 (2009). 10.1007/s10742-008-0039-6
  20. Dong N, Stuart EA, Lenis D, Quynh Nguyen T: Using propensity score analysis of survey data to estimate population average treatment effects: a case study comparing different methods. Eval. Rev. 44(1), 84–108 (2020). 10.1177/0193841X20938497
  21. Dugoff EH, Schuler M, Stuart EA: Generalizing observational study results: applying propensity score methods to complex surveys. Health Serv. Res. 49(1), 284–303 (2014). 10.1111/1475-6773.12090
  22. Dunkler D, Ploner M, Schemper M, Heinze G: Weighted Cox regression using the R package coxphw. J. Stat. Softw. (2018). 10.18637/jss.v084.i02
  23. Freedman VA, Kasper JD: Cohort profile: the National Health and Aging Trends Study (NHATS). Int. J. Epidemiol. 48(4), 1044–1045g (2019)
  24. Gail MH, Wieand S, Piantadosi S: Biased estimates of treatment effect in randomized experiments with nonlinear regressions and omitted covariates. Biometrika 71(3), 431–444 (1984). 10.1093/biomet/71.3.431
  25. Gharibzadeh S, Mansournia MA, Rahimiforoushani A, Alizadeh A, Amouzegar A, Mehrabani-Zeinabad K, Mohammad K: Comparing different propensity score estimation methods for estimating the marginal causal effect through standardization to propensity scores. Commun. Stat. Simul. Comput. 47(4), 964–976 (2018). 10.1080/03610918.2017.1300267
  26. Handorf EA, Smaldone MC, Movva S, Mitra N: Analysis of survival data with nonproportional hazards: a comparison of propensity-score-weighted methods. Biom. J. 66(1), e202200099 (2024). 10.1002/bimj.202200099
  27. Hu L, Ji J, Li F: Estimating heterogeneous survival treatment effect in observational data using machine learning. Stat. Med. 40(21), 4691–4713 (2021). 10.1002/sim.9090
  28. Imai K, Ratkovic M: Covariate balancing propensity score. J. R. Stat. Soc. Ser. B Stat. Methodol. 76(1), 243–263 (2014)
  29. Imai K, King G, Stuart EA: Misunderstandings between experimentalists and observationalists about causal inference. J. R. Stat. Soc. Ser. A Stat. Soc. 171(2), 481–502 (2008). 10.1111/j.1467-985X.2007.00527.x
  30. Imbens GW: Nonparametric estimation of average treatment effects under exogeneity: a review. Rev. Econ. Stat. 86(1), 4–29 (2004). 10.1162/003465304323023651
  31. Kolenikov S: Resampling variance estimation for complex survey data. Stata J. 10(2), 165–199 (2010). 10.1177/1536867X1001000201
  32. Lee BK, Lessler J, Stuart EA: Improving propensity score weighting using machine learning. Stat. Med. 29(3), 337–346 (2009). 10.1002/sim.3782
  33. Lee H, Kim J: Jackknife variance estimation for two-phase samples with high sampling fractions. In: Proceedings of the Survey Research Methods Section, American Statistical Association (2002)
  34. Li Y, Li L: Propensity score analysis methods with balancing constraints: a Monte Carlo study. Stat. Methods Med. Res. 30(4), 1119–1142 (2021). 10.1177/0962280220983512
  35. Li F, Thomas LE: Addressing extreme propensity scores via the overlap weights. Am. J. Epidemiol (2018). 10.1093/aje/kwy201 [DOI] [Google Scholar]
  36. Li L, Yang C, Hu L, Zhang W, Aldridge M, Liu B, Mazumdar M: Comparative effectiveness of propensity score estimation methods for inverse probability of treatment weighting analysis with complex survey data: a simulation study. J. Surv. Stat. Methodol (2025). 10.1093/jssam/smaf003 [DOI] [Google Scholar]
  37. National Center for Health Statistics Division of Analysis and Epidemiology: [cited Jul. 9, 2024], “NHIS Public-use Linked Mortality Files (2022b) [online]. Available at: https://www.cdc.gov/nchs/data-linka ge/mortality-public.htm
  38. Lumley T, Gao P, Schneider B: [cited Feb 24, 2025], “Package ‘survey’,” [online] (2024). Available at: https://cran.r-project.org/web/packages/survey/survey.pdf [Google Scholar]
  39. Lunceford JK, Davidian M: Stratification and weighting via the propensity score in estimation of causal treatment effects: a comparative study. Stat. Med 23(19), 2937–2960 (2004). 10.1002/sim.1903 [DOI] [PubMed] [Google Scholar]
  40. Mao H, Li L, Yang W, Shen Y: On the propensity score weighting analysis with survival outcome: estimands, estimation, and inference. Stat. Med 37(26), 3745–3763 (2018). 10.1002/sim.7839 [DOI] [PubMed] [Google Scholar]
  41. McCaffrey DF, Ridgeway G, Morral AR: Propensity score estimation with boosted regression for evaluating causal effects in observational studies. Psychol. Methods 9(4), 403–425 (2004). 10.1037/1082-989X.9.4.403 [DOI] [PubMed] [Google Scholar]
  42. Mi X, Hammill BG, Curtis LH, Lai ECC, Setoguchi S: Use of the landmark method to address immortal person-time bias in comparative effectiveness research: a simulation study. Stat. Med 35(26), 4824–4836 (2016) [DOI] [PubMed] [Google Scholar]
  43. Minder CE, Bednarski T: A robust method for proportional hazards regression. Stat. Med 15(10), 1033–1047 (1996). [DOI] [PubMed] [Google Scholar]
  44. National Center for Health Statistics Division of Analysis and Epidemiology:, [cited Jul. 9, 2024], “NHANES III Public-use Linked Mortality Files, 2019, (2022a) [online]. Available at: https://www.cdc.gov/nchs/data-linkage/mortality-public.htm
  45. Pfeffermann D.: The role of sampling weights when modeling survey data. Int. Stat. Rev (1993). 10.2307/1403631 [DOI] [Google Scholar]
  46. Phillips RV, Van Der Laan MJ, Lee H, Gruber S: Practical considerations for specifying a super learner. Int. J. Epidemiol 52(4), 1276–1285 (2023) [DOI] [PubMed] [Google Scholar]
  47. Pirracchio R, Petersen ML, van der Laan M: Improving propensity score estimators’ robustness to model misspecification using super learner. Am. J. Epidemiol 181(2), 108–119 (2015). 10.1093/aje/kwu253 [DOI] [PMC free article] [PubMed] [Google Scholar]
  48. Rao J, Wu C, Yue K: Some recent work on resampling methods for complex surveys. Surv. Methodol 18(2), 209–217 (1992) [Google Scholar]
  49. Reifeis SA, Hudgens MG: On variance of the treatment effect in the treated when estimated by inverse probability weighting. Am. J. Epidemiol 191(6), 1092–1097 (2022). 10.1093/aje/kwac014 [DOI] [PMC free article] [PubMed] [Google Scholar]
  50. Ridgeway G, Kovalchik SA, Griffin BA, Kabeto MU: Propensity score analysis with survey weighted data. J Causal Inference 3(2), 237–249 (2015). 10.1515/jci-2014-0039 [DOI] [PMC free article] [PubMed] [Google Scholar]
  51. Robins JM, Finkelstein DM: Correcting for noncompliance and dependent censoring in an AIDS clinical trial with inverse probability of censoring weighted (IPCW) log-rank tests. Biometrics 56(3), 779–788 (2000) [DOI] [PubMed] [Google Scholar]
  52. Rosenbaum PR: Model-based direct adjustment. J. Am. Stat. Assoc (1987). 10.1080/01621459.1987.10478441 [DOI] [Google Scholar]
  53. Rosenbaum PR, Rubin DB: The central role of the propensity score in observational studies for causal effects. Biometrika 70(1), 41–55 (1983). 10.1093/biomet/70.1.41 [DOI] [Google Scholar]
  54. Rust KF, Rao JN: Variance estimation for complex surveys using replication techniques. Stat. Methods Med. Res 5(3), 283–310 (1996). 10.1177/096228029600500305 [DOI] [PubMed] [Google Scholar]
  55. Sato T, Matsuyama Y: Marginal structural models as a tool for standardization. Epidemiology 14(6), 680–686 (2003). 10.1097/01.EDE.0000081989.82616.7d [DOI] [PubMed] [Google Scholar]
  56. Sitlani CM, Lumley T, McKnight B, Rice KM, Olson NC, Doyle MF, Huber SA, Tracy RP, Psaty BM, Delaney JAC: Incorporating sampling weights into robust estimation of Cox proportional hazards regression model, with illustration in the Multi-Ethnic Study of Atherosclerosis. BMC Med. Res. Methodol (2020). 10.1186/s12874-020-00945-9 [DOI] [Google Scholar]
  57. Stuart EA: Matching methods for causal inference: a review and a look forward. Stat. Sci (2010). 10.1214/09-sts313 [DOI] [Google Scholar]
  58. Tabuchi T, Goto A, Ito Y, Fukui K, Miyashiro I, Shinozaki T: Smoking at the time of diagnosis and mortality in cancer patients: What benefit does the quitter gain? Int. J. Cancer 140(8), 1789–1795 (2017). 10.1002/ijc.30601 [DOI] [PubMed] [Google Scholar]
  59. Therneau T.: A package for survival analysis in S. R Package Version. 2(7), 2014 (2015) [Google Scholar]
  60. Therneau TM, Grambsch PM: Modeling Survival Data: Extending the Cox Model. Springer; (2000) [Google Scholar]
  61. Uno H, Claggett B, Tian L, Inoue E, Gallo P, Miyata T, Schrag D, Takeuchi M, Uyama Y, Zhao L, Skali H, Solomon S, Jacobus S, Hughes M, Packer M, Wei LJ: Moving beyond the hazard ratio in quantifying the between-group difference in survival analysis. J. Clin. Oncol 32(22), 2380–2385 (2014). 10.1200/JCO.2014.55.2208 [DOI] [PMC free article] [PubMed] [Google Scholar]
  62. Wang X, Romero-Gutierrez CW, Kothari J, Shafer A, Li Y, Christiani DC: Prediagnosis smoking cessation and overall survival among patients with non-small cell lung cancer. JAMA Netw. Open 6(5), e2311966 (2023). 10.1001/jamanetworkopen.2023.11966 [DOI] [PMC free article] [PubMed] [Google Scholar]
  63. Watkins S, Jonsson-Funk M, Brookhart MA, Rosenberg SA, O’Shea TM, Daniels J: An empirical comparison of tree-based methods for propensity score estimation. Health Serv. Res 48(5), 1798–1817 (2013). 10.1111/1475-6773.12068 [DOI] [PMC free article] [PubMed] [Google Scholar]
  64. Xie J, Liu C: Adjusted Kaplan-Meier estimator and log-rank test with inverse probability of treatment weighting for survival data. Stat. Med 24(20), 3089–3110 (2005). 10.1002/sim.2174 [DOI] [PubMed] [Google Scholar]
  65. Yang B, Jacobs EJ, Gapstur SM, Stevens V, Campbell PT: Active smoking and mortality among colorectal cancer survivors: the Cancer Prevention Study II nutrition cohort. J. Clin. Oncol 33(8), 885–893 (2015). 10.1200/JCO.2014.58.3831 [DOI] [PubMed] [Google Scholar]
  66. Yang C, Cuerden MS, Zhang W, Aldridge M, Li L: Propensity score weighting with survey weighted data when outcomes are binary: a simulation study. Health Serv. Outcomes Res. Method (2023). 10.1007/s10742-023-00317-y [DOI] [Google Scholar]
  67. Zhou Y, Matsouaka RA, Thomas L: Propensity score weighting under limited overlap and model misspecification. Stat. Methods Med. Res 29(12), 3721–3756 (2020). 10.1177/0962280220940334 [DOI] [PubMed] [Google Scholar]

Associated Data


Supplementary Materials

Data Availability Statement

The data generation process for the simulation study is described in the manuscript. The National Health Interview Survey (NHIS) data and Linked Mortality Files (LMF) are publicly available from the Centers for Disease Control and Prevention (https://www.cdc.gov/nchs/nhis/data-questionnaires-documentation.htm; https://www.cdc.gov/nchs/data-linkage/mortality-public.htm).
