Author manuscript; available in PMC: 2025 Jan 1.
Published in final edited form as: Biom J. 2022 Dec 21;66(1):e202200099. doi: 10.1002/bimj.202200099

Analysis of survival data with non-proportional hazards: A comparison of propensity score weighted methods

Elizabeth A Handorf 1,*, Marc Smaldone 2, Sujana Movva 3, Nandita Mitra 4
PMCID: PMC10282107  NIHMSID: NIHMS1858480  PMID: 36541715

Abstract

One of the most common ways researchers compare cancer survival outcomes across treatments from observational data is using Cox regression. This model depends on its underlying assumption of proportional hazards, but in some real-world cases, such as when comparing different classes of cancer therapies, substantial violations may occur. In this situation, researchers have several alternative methods to choose from, including Cox models with time-varying hazard ratios; parametric accelerated failure time models; Kaplan-Meier curves; and pseudo-observations. It is unclear which of these models are likely to perform best in practice. To fill this gap in the literature, we perform a neutral comparison study of candidate approaches. We examine clinically meaningful outcome measures that can be computed and directly compared across each method, namely, survival probability at time T , median survival, and restricted mean survival. To adjust for differences between treatment groups, we use Inverse Probability of Treatment Weighting based on the propensity score. We conduct simulation studies under a range of scenarios, and determine the biases, coverages, and standard errors of the Average Treatment Effects for each method. We then demonstrate the use of these approaches using two published observational studies of survival after cancer treatment. The first examines chemotherapy in sarcoma, which has a late treatment effect (i.e. similar survival initially, but after two years the chemotherapy group shows a benefit). The other study is a comparison of surgical techniques for kidney cancer, where survival differences are attenuated over time.

Keywords: non-proportional hazards, observational studies, inverse probability of treatment weighting

1. Introduction

Many cancer outcomes studies use observational databases, such as SEER-Medicare or the National Cancer Database,(Enewold et al., 2020; Boffa et al., 2017) to compare survival outcomes between different treatment or exposure groups. A common approach to model survival data with censoring and potentially confounding covariates is the Cox proportional hazards regression model. (D. R. Cox, 1972) However, the standard Cox model makes a strong assumption that the hazards are proportional between groups over time (a constant hazard ratio). (Klein & Moeschberger, 2003) This assumption may not hold in real-world studies. In recent years, this issue has arisen more frequently in the oncology literature, as non-proportionality is a common feature in studies of immunotherapies. (Jiménez, 2022; Rahman et al., 2019)

There are a variety of models for censored outcomes which do not assume proportional hazards. Non-parametric methods, including the ubiquitous Kaplan-Meier method, do not make assumptions about the functional form of the survival curves, and therefore accommodate non-proportionality. (Kaplan & Meier, 1958; Collett, 2015) A more recent approach based on pseudo-observations can also be used without parametric assumptions. Working within a missing data framework, this method replaces each observation with a pseudo-observation derived from a consistent estimator computed with and without that observation. (Andersen, Hansen, & Klein, 2004; Andersen & Pohar Perme, 2010; Andersen, Syriopoulou, & Parner, 2017; Sachs & Gabriel, 2022; Ambrogi, Iacobelli, & Andersen, 2022) There are also parametric and semi-parametric approaches which accommodate non-proportionality. The Cox model can be modified to include time-dependent hazards, relaxing the assumption of proportionality. (Collett, 2015) Accelerated Failure Time (AFT) models are a different class of regression methods for censored data, which assume a linear model for log-time. They can be modeled fully parametrically, and for many choices of error distributions, the models are non-proportional. (Collett, 2015)

Given the number of choices, a natural question is which method performs best in real-world conditions, particularly when confounding is present. As standard implementations of non-parametric approaches do not readily incorporate covariates, in this paper we focus on approaches which can use propensity score weighting to improve balance and address confounding (including parametric AFT models, Cox models, pseudo-observations, and Kaplan-Meier curves). The propensity score is commonly used for observational studies and has many advantages, as we discuss in Section 2. We then describe several methods for analysis of survival data which do not rely on proportional hazards, focusing on clinically relevant estimands, and compare the performance of several parametric, semi-parametric, and non-parametric approaches. We restrict attention to methods which 1) can be used with Inverse Probability of Treatment Weighting (IPTW), 2) identify the estimands of interest, and 3) are readily implemented by applied researchers using currently available software. In this paper, we use realistic simulated scenarios in a neutral comparison study of possible methods. Our results provide insight on the performance of these methods in clinical applications.

This work is motivated by two recently published clinical studies in which the treatment effect exhibited non-proportional hazards. In a study of chemotherapy for soft-tissue sarcoma, there was a negligible survival benefit until about two years after the start of treatment. (Movva, von Mehren, Ross, & Handorf, 2015) In another study comparing two different surgical techniques in early-stage renal cancer, one approach had a strong benefit immediately after surgery, but the benefit decreased over time. (Ristau et al., 2018) These two forms of deviation from proportionality served as the motivation for this study. The data for both of these studies came from an observational source, the National Cancer Database. (Boffa et al., 2017) Therefore, both were likely subject to confounding by indication; that is, the choice of treatment was informed by factors that are also associated with survival.

2. Propensity Score weighting

The propensity score, first proposed by Rosenbaum and Rubin, (Rosenbaum & Rubin, 1983, 1984) is defined as the probability of receiving treatment given a set of covariates,

e = P(Z = 1 \mid X),

where e is the propensity score, Z is a binary indicator for treatment and X is a vector of (measured) covariates.

Rosenbaum and Rubin showed that the propensity score is a balancing score, that is, conditional on the propensity score, the treatment is independent of the covariates. This has important implications for control of confounding in observational studies; namely, if the propensity score is defined based on all confounding covariates, controlling for the propensity score will control for all sources of confounding.

Propensity score methods have a number of benefits over traditional regression adjustment. First, they provide an easy way to check for sufficient covariate overlap and balance between the treated and control groups. (Austin, 2011) Second, one can use flexible and modern approaches, such as ensemble learning approaches and other machine learning models, to estimate the propensity scores. (van der Laan, Mark J, Polley, & Hubbard, 2007; Pirracchio, Petersen, & van der Laan, 2015; Cannas & Arpino, 2019; Zou, Mi, Tighe, Koch, & Zou, 2021) This can make the estimates less vulnerable to bias from an incorrect functional form. Propensity scores also allow for the use of doubly robust methods. (Funk et al., 2011) Third, propensity score weighting approaches provide marginal, not conditional, estimates, so the interpretation of a treatment effect does not depend on a subject's covariate values. Marginal estimates are therefore more interpretable and useful to clinicians and policymakers. (Austin & Stuart, 2015; Lunceford & Davidian, 2004; Austin & Stuart, 2017; Mao, Li, Yang, & Shen, 2018) There are drawbacks to using propensity score methods as well. First, and most importantly, propensity score methods cannot overcome biases from unmeasured confounders. Further, they may be less familiar and less readily understood by clinical readership. The propensity score can also be used in several different ways (e.g. stratification, matching, weighting), and changing the method may change the estimand, producing results with different interpretations. Propensity score methods therefore must be conducted by analysts with expertise in the area, and clearly described when publishing results. (Deb et al., 2016)

There are several ways to use propensity scores in analyses of survival data. Here we focus on weighting, specifically Inverse Probability of Treatment Weighting (IPTW). IPTW creates a synthetic sample where covariates are unrelated to treatment assignment, so that treatment effects in the weighted samples can be summarized and compared directly.

To implement an IPTW analysis, one must first obtain the estimated propensity score \hat{e}_i for each subject. This can be accomplished via a number of approaches, including logistic regression, random forests, or ensemble methods. The IPTW for subject i is then defined as

\hat{w}_i = \frac{Z_i}{\hat{e}_i} + \frac{1 - Z_i}{1 - \hat{e}_i}.

As Zi is binary, if subject i is treated, their weight is one over their probability of being treated (based on their covariates). If the subject did not receive the treatment, their weight is one over their probability of receiving the control. Therefore treated subjects who were less likely to be treated have larger weights, as do control subjects who were less likely to receive the control, thus creating a balanced pseudo-population. In subsequent analyses, these weights are used to estimate the expected effects under the treatment and control conditions. It is important to note that one should check for covariate balance between treatment arms in the pseudo-population. Often, a standardized mean difference of 0.1 or less between arms is taken to indicate good balance. (Austin, 2011)
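As a concrete illustration, the weight formula and the balance check described above can be sketched in a few lines of Python (an illustrative sketch with hypothetical function names, not code from the study):

```python
import numpy as np

def iptw_weights(z, e_hat):
    """ATE weights: 1/e-hat for treated subjects, 1/(1 - e-hat) for controls."""
    z = np.asarray(z, dtype=float)
    e_hat = np.asarray(e_hat, dtype=float)
    return z / e_hat + (1.0 - z) / (1.0 - e_hat)

def weighted_smd(x, z, w):
    """Standardized mean difference of covariate x between arms in the weighted sample."""
    x, z, w = (np.asarray(a, dtype=float) for a in (x, z, w))
    m1 = np.average(x[z == 1], weights=w[z == 1])
    m0 = np.average(x[z == 0], weights=w[z == 0])
    v1 = np.average((x[z == 1] - m1) ** 2, weights=w[z == 1])
    v0 = np.average((x[z == 0] - m0) ** 2, weights=w[z == 0])
    return (m1 - m0) / np.sqrt((v1 + v0) / 2.0)
```

For example, a treated subject with \hat{e}_i = 0.25 receives weight 4; after weighting, covariates with absolute SMD below 0.1 would typically be considered well balanced.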

3. Survival methods under non proportional hazards

3.1. Quantifying differences between groups with non-proportionality

In the setting of survival outcomes with non-proportionality, we must first choose an estimand. It should be readily interpretable and meaningful given the observed data features. If we fit a proportional hazards model to non-proportional data, we would obtain a single Hazard Ratio (HR), but this is actually a weighted average of the hazard ratios at each observed failure time. The weighted average is problematic because the weights depend on the censoring distribution. This means that factors unrelated to the treatment effect, such as the accrual times and length of follow-up, will affect the estimated HR when hazards are non-proportional. (Kalbfleisch & Prentice, 1981; Horiguchi, Hassett, & Uno, 2019; Uno et al., 2014) Alternative estimands include the average hazard ratio, an average weighted by the number of patients at risk. (Schemper, Wakounig, & Heinze, 2009)

The difficulties inherent in reporting hazard ratios in the setting of non-proportionality can be avoided by focusing on measures that can be defined directly from the survival functions for the treated and control arms, S1(t) and S0(t), respectively.

The simplest approach compares differences in survival probabilities at a particular time t = T .

\Delta S_T = S_1(t = T) - S_0(t = T)

Quantiles of the survival functions can also be compared. For instance, if the survival curves are defined for S(t) = 0.5, median survival times can be compared.

\Delta T_{50} = \{T : S_1(t = T) = 0.5\} - \{T : S_0(t = T) = 0.5\}

These measures are readily understood by a clinical audience, but they do not comprehensively measure survival effects over the length of follow-up. Another measure is restricted mean survival (RMS), the mean survival up through some time Tmax where S(t) is defined. (Conner et al., 2019) Often, Tmax can be chosen as the largest observed time, although this does rely on the assumption that the censoring distribution meets certain (mild) conditions. (Tian et al., 2020) In this context, one should choose Tmax to be no more than the largest time which is observed in both arms. The RMS for each arm can be found using the area under the survival curve, and the difference in RMS between the treated and control conditions is

\Delta RMS_{T_{max}} = \int_0^{T_{max}} S_1(t)\,dt - \int_0^{T_{max}} S_0(t)\,dt.

RMS (and related extensions) have become increasingly popular. There are several specialized methods for estimating RMS, including doubly robust and flexible parametric models. (Royston & Parmar, 2011; Zhang & Schaubel, 2012)

Each of the measures described above is always interpretable, regardless of whether proportionality holds. In the subsequent sections, we focus on methods which 1) accommodate IPTW, 2) allow the estimation of the measures of interest (∆ST, ∆T50, and ∆RMSTmax), and 3) are readily implemented using published software. We note that some useful methods, such as those which only support estimation of RMS, are not included in the paper, as they do not meet all the inclusion criteria listed above.
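All three estimands can be read directly off an estimated survival curve. The following Python sketch (our own helper functions, assuming a right-continuous step function given by sorted event times and survival probabilities) shows the computations:

```python
import numpy as np

def surv_at(times, surv, t):
    """Step-function survival probability S(t); S = 1 before the first event time."""
    idx = np.searchsorted(times, t, side="right") - 1
    return 1.0 if idx < 0 else surv[idx]

def median_survival(times, surv):
    """Smallest observed time with S(t) <= 0.5 (None if the curve never reaches 0.5)."""
    below = np.nonzero(np.asarray(surv) <= 0.5)[0]
    return times[below[0]] if below.size else None

def rmst(times, surv, t_max):
    """Restricted mean survival: area under the step survival curve up to t_max."""
    grid = np.concatenate(([0.0], np.asarray(times, dtype=float), [t_max]))
    grid = np.clip(grid, 0.0, t_max)            # ignore anything beyond t_max
    s = np.concatenate(([1.0], np.asarray(surv, dtype=float), [surv[-1]]))
    # S(t) is constant between event times, so the integral is a sum of rectangles
    return float(np.sum(np.diff(grid) * s[:-1]))
```

∆ST is then `surv_at(t1, s1, T) - surv_at(t0, s0, T)`, and the other two contrasts are formed the same way from the two arms' curves.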

3.2. Cox model

Although the Cox model uses the assumption of proportionality, it is often robust to small deviations from this assumption. However, if n is large, as is common in registry studies, small deviations from proportionality may be statistically significant when assessed using standard diagnostic tests (e.g. Cox-Snell residuals, Schoenfeld residuals). (Grambsch & Therneau, 1994) Therefore, the Cox model may still be useful in some cases where deviations from proportionality are demonstrated. The Cox model is defined as follows:

h_i(t \mid Z_i) = h_0(t)\exp(\beta Z_i).

IPTW can be readily accommodated in the Cox model, by weighting the contribution of each observation to the partial likelihood function:

L(\beta) = \prod_{i=1}^{N} \left[ \frac{\exp(\beta Z_i)}{\sum_{j \in R(t_i)} \hat{w}_j \exp(\beta Z_j)} \right]^{\hat{w}_i},

where R(t_i) is the risk set at time t_i. The parameter β is the average of the log-hazard ratios at each failure time. Importantly, one can find \hat{S}(t) based on the estimates \hat{h}_0(t) and \hat{\beta}.

We can also extend the standard Cox model, relaxing the assumption of proportionality, by allowing the hazard ratio to vary as a function of time.

h_i(t \mid Z_i) = h_0(t)\exp\left(\beta Z_i + f(Z_i, t)\right).

In our simulations, we consider two particular cases, 1) allowing the hazard to vary by log-time, where the parameter κ defines the strength of the non-proportionality of the treatment effect.

f(Z_i, t) = \kappa Z_i \log(t),

and 2) using a piecewise constant treatment effect

f(Z_i, t) = \begin{cases} 0 & 0 \le t < C_1 \\ \kappa_1 Z_i & C_1 \le t < C_2 \\ \kappa_2 Z_i & C_2 \le t, \end{cases}

where C_1 and C_2 are the pre-specified times at which the constant hazards are allowed to change. These models can all be implemented with the survival package in R. (Therneau, Terry M., 2020)
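In practice, the piecewise-constant model is fit by splitting each subject's follow-up at the cutpoints (this is what survival::survSplit does in R), after which the treatment effect in each interval becomes an ordinary interaction term. A minimal Python sketch of the splitting step (our own illustrative function, not library code):

```python
def split_episodes(start, stop, event, cuts):
    """Split one subject's follow-up interval (start, stop] at each cutpoint.

    After splitting, a piecewise-constant treatment effect can be fit as an
    ordinary interaction between treatment and an interval indicator.
    """
    rows, lo = [], start
    for c in sorted(cuts):
        if lo < c < stop:
            rows.append((lo, c, 0))   # interior segments carry no event
            lo = c
    rows.append((lo, stop, event))    # the final segment keeps the event status
    return rows
```

For example, a subject followed from 0 to 4.5 years with an event, using cutpoints C1 = 2 and C2 = 5, contributes two rows: (0, 2, 0) and (2, 4.5, 1).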

3.3. Parametric AFT models

Accelerated Failure Time (AFT) models assume a linear model for log(t). (Buckley & James, 1979; Collett, 2015; Jin, Lin, & Ying, 2006; Wei, 1992)

\log T_i = \mu + \beta Z_i + \sigma \epsilon_i,

where Ti is the realization of time t for subject i. Here, we consider parametric models, where the error term ϵ is assumed to take on a given distribution. A common choice is the Gumbel distribution, which yields a Weibull hazard model; however, the Weibull model is also a proportional hazards model when the shape parameter is constant. We can accommodate non-proportionality by allowing the hazard ratio to change over time.

h_i(t \mid Z) = \gamma \lambda^{\gamma} t^{\gamma - 1} \exp\left(\beta Z_i + f(Z_i, t)\right), \quad (1)

where λ is the scale and γ is the shape parameter. If we let

f(Z_i, t) = \kappa Z_i \log(t),

then (1) can be re-written as

\log T_i = \frac{1}{\gamma + \kappa Z_i}\left( \epsilon_i - \gamma \log(\lambda) - \beta Z_i \right), \quad \epsilon_i \sim \text{Gumbel}.

If we allow the shape to vary by treatment, this allows the hazard ratio to change over log-time.

Another flexible alternative is the generalized gamma model (Marshall & Olkin, 1997), a three-parameter gamma model. In the AFT specification of Cox et al. (C. Cox, Chu, Schneider, & Muñoz, 2007), T_i has a generalized gamma distribution with parameters (\mu + \beta Z), \sigma (\sigma > 0), and \lambda, such that

\log T_i = \mu + \beta Z_i + \sigma \epsilon_i, \quad f(\epsilon_i) = \frac{|\lambda| \left( \lambda^{-2} e^{\lambda \epsilon_i} \right)^{\lambda^{-2}} \exp\left( -\lambda^{-2} e^{\lambda \epsilon_i} \right)}{\Gamma(\lambda^{-2})}.

This model encompasses other distributions as special cases; when λ = 1 we have a Weibull AFT model, if λ = 0 it results in a log-normal model, and when λ = σ we have the standard Gamma model.

For these AFT models, the weights \hat{w}_i can be applied during estimation of parameters as frequency weights (i.e. the weights which create the pseudo-population). Each subject's weight therefore determines their contribution to the likelihood function. Here, we focus on the 3-parameter Weibull and generalized gamma AFT models, but a wide range of AFT models can be fit in R using the flexsurv package. Other parametric models, including spline-based models such as the Royston-Parmar model, can also be fit using this package. (Royston & Parmar, 2002) This model can be used with a proportional hazards or a proportional odds framework, and time-varying covariate effects may also be included. Recent papers have used this model with IPTW; although these methods mainly focus on RMS, this is an interesting area for future development. (Jackson, Christopher, 2016; Mozumder, Rutherford, & Lambert, 2021)

We also note that there are a variety of semi-parametric AFT models available which relax the parametric assumptions for the distribution of ϵi. However, many of these methods were developed for clinical trials, and may not allow for weighting. (Pang, Platt, Schuster, & Abrahamowicz, 2021) The well-known AFT models of Hernán and Robins use IPTW to account for time-varying treatments in a trial, not treatments with time-varying effects with covariate imbalance at baseline. (Hernán, Cole, Margolick, Cohen, & Robins, 2005; Robins & Tsiatis, 1992) Other semi-parametric AFT models developed for clinical trials cannot readily implement weights. (Burzykowski, 2022) We found that currently available software for estimation of semi-parametric AFT models either did not allow for weights (Komárek, Lesaffre, & Hilton, 2005) or were limited in their available estimands. (S. Chiou, Kang, & Yan, 2015; S. H. Chiou, Kang, & Yan, 2014) Therefore, none of the semi-parametric AFT models we evaluated met all inclusion criteria for our study, where we focus on methods which are readily accessible to analysts conducting applied studies. However, this is an area of active development, so new methods and software implementations may be available soon.

3.4. Kaplan-Meier method

Another strategy is to avoid parametric assumptions entirely, and use Kaplan-Meier curves with IPTW to estimate the survival functions. (Cole & Hernán, 2004; Xie & Liu, 2005)

\hat{S}(t) = \prod_{t_j \le t} \left( 1 - \frac{d_j^{\hat{w}}}{n_j^{\hat{w}}} \right),

where d_j^{\hat{w}} = \sum_{i: T_i = t_j} \hat{w}_i \delta_i is the weighted number of events and n_j^{\hat{w}} = \sum_{i: T_i \ge t_j} \hat{w}_i is the weighted number of people at risk at time t_j. Weighted curves are easily estimated in R via the survival package.
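A direct Python transcription of the weighted product-limit estimator above (illustrative only; in practice the R survival package handles this):

```python
import numpy as np

def weighted_km(time, event, w):
    """IPTW Kaplan-Meier: product over event times of (1 - d_j^w / n_j^w)."""
    time, event, w = (np.asarray(a, dtype=float) for a in (time, event, w))
    order = np.argsort(time)
    time, event, w = time[order], event[order], w[order]
    event_times = np.unique(time[event == 1])
    surv, s = [], 1.0
    for tj in event_times:
        d = w[(time == tj) & (event == 1)].sum()   # weighted events at t_j
        n = w[time >= tj].sum()                    # weighted number at risk
        s *= 1.0 - d / n
        surv.append(s)
    return event_times, np.array(surv)
```

With all weights equal to one, this reduces to the ordinary Kaplan-Meier estimator; rescaling all weights by a constant leaves the curve unchanged.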

3.5. Pseudo-observations

Another non-parametric method which can estimate the survival curve is the recently-developed pseudo-observations approach. (Andersen et al., 2004; Andersen & Pohar Perme, 2010; Andersen et al., 2017) If no missing data (censoring) or confounders are present, the treatment effect can be found by averaging over outcomes for each subject.

\theta = E(f(Y)) = \frac{1}{n_Z} \sum_i f(Y_i),

where nZ is the total number of subjects receiving treatment Z. When there is censoring, we can define the pseudo-observation for subject i as

\hat{\theta}_i = n_Z \hat{\theta} - (n_Z - 1) \hat{\theta}_{-i},

where \hat{\theta} is a consistent estimate for \theta, and \hat{\theta}_{-i} is the same estimator applied to all observations excluding subject i. The pseudo-observations are then used in place of all observations (censored and non-censored). In a survival context, \theta is the probability of survival at time t, and \theta_i is a binary indicator for the status (alive/dead) of subject i at time t. We can estimate S_Z(t) using the Kaplan-Meier method, from which we obtain the pseudo-observations (\hat{\theta}_{Zi}) for all observations in arm Z.

\hat{\theta}_{Zi} = n_Z \hat{S}_Z(t) - (n_Z - 1) \hat{S}_Z(t)_{-i}.

Heuristically, n_Z \hat{S}_Z(t) is the expected number of patients alive at time t, and (n_Z - 1) \hat{S}_Z(t)_{-i} is the expected number of patients other than patient i alive at time t. \hat{\theta}_{Zi} is used in place of the (possibly unobserved) survival status of subject i at time t. Therefore, the probability of being alive at time t in arm Z is

\hat{P}(Y_{Z,t}) = \frac{1}{n_Z} \sum_i \hat{\theta}_{Zi}.

Incorporating propensity-score weights to account for confounding is straightforward.

\hat{P}(Y_{Z,t}) = \frac{1}{n_Z} \sum_i \hat{w}_i \hat{\theta}_{Zi}.

There are several benefits of the pseudo-observations approach. It can be used in a causal framework, and requires weaker assumptions than regression-based models. It also readily supports estimation of many outcomes of interest. We can estimate P^(YZ,t) at each failure time, defining the survival curve, which in turn allows us to estimate the measures of interest described in section 3.1. Pseudo-observations can also be used to estimate RMS. This method can be implemented using the R package pseudo. (Pohar Perme, Maha & Gerster, Mette, 2017)
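The jackknife construction can be sketched directly in Python (an illustrative, unweighted sketch for a single arm, using our own helper names; note that for subjects not censored before t, the pseudo-observations reduce to the alive/dead indicators):

```python
import numpy as np

def km_at(time, event, t):
    """Kaplan-Meier estimate of S(t) from one arm's data."""
    time, event = np.asarray(time, dtype=float), np.asarray(event, dtype=int)
    s = 1.0
    for tj in np.unique(time[event == 1]):
        if tj > t:
            break
        d = ((time == tj) & (event == 1)).sum()   # events at t_j
        n = (time >= tj).sum()                    # number at risk
        s *= 1.0 - d / n
    return s

def pseudo_obs(time, event, t):
    """Jackknife pseudo-observation for each subject:
    theta_i = n * S_hat(t) - (n - 1) * S_hat_{-i}(t)."""
    n = len(time)
    full = km_at(time, event, t)
    out = []
    for i in range(n):
        keep = [j for j in range(n) if j != i]
        loo = km_at([time[j] for j in keep], [event[j] for j in keep], t)
        out.append(n * full - (n - 1) * loo)
    return out
```

With no censoring, the pseudo-observations at time t are exactly the indicators I(T_i > t), and their mean recovers the Kaplan-Meier estimate of S(t).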

3.6. Variance estimation

To conduct inference for the outcomes of interest, we need to quantify uncertainty. Some methods have closed-form variance estimators for our estimands of interest, but others do not. This motivates the use of a non-parametric bootstrap to estimate variances and find confidence intervals. (Efron & Tibshirani, 1993) Bootstrap-based estimates are available for every combination of model and outcome we discuss above. Further, to correctly account for the variance of \hat{e}, the estimated propensity scores, we can re-estimate the propensity score within each bootstrap iteration. (Austin, 2016) One consideration when using the bootstrap is the number of observations which remain present in the risk set at the longest follow-up times of interest; there must be sufficient numbers so that the bootstrap replicates can obtain valid estimates.
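The percentile bootstrap described here can be sketched as follows (function and argument names are ours; the statistic passed in should internally re-fit the propensity model on each resample, per Austin, 2016):

```python
import random

def bootstrap_ci(data, statistic, n_boot=500, alpha=0.05, seed=1):
    """Non-parametric bootstrap percentile confidence interval.

    `statistic` maps a resampled dataset to a scalar estimate; to propagate
    propensity-score uncertainty it should re-estimate the weights internally.
    """
    rng = random.Random(seed)
    n = len(data)
    stats = sorted(
        statistic([data[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_boot)
    )
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi
```

The bootstrap standard error is simply the standard deviation of the replicate statistics; the percentile interval avoids any normality assumption.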

4. Simulation Studies

4.1. Simulation Methods

We tested the performance of the methods described above using simulation studies. These studies follow the framework of Morris et al. (Morris, White, & Crowther, 2019) for best practices in conducting simulation studies for statistical methods.

Aims:

The aim of this study was to neutrally assess the performance of IPTW survival methods in observational studies when non-proportional hazards are present.

Data generating mechanisms:

We considered 7 data generating mechanisms. We based our covariate distributions, outcome distributions, and effect sizes on the NCDB cancer studies to mimic real world scenarios as best as possible.

First, we simulated a set of covariates based on the observed multivariate distribution of the NCDB covariates in the renal dataset. We used the R package GenOrd (Barbiero, Alessandro & Alda Ferrari, Pier, 2015) to draw variables representing gender, age, stage, histology, tumor grade, Charlson comorbidity score, race, ethnicity, insurance status, facility type, income, and education. As our goal here is to evaluate the survival analysis methods, and not the propensity score estimation methods, we held this set of covariates constant for each simulation; we employ simulated covariates instead of the actual covariates so that our data and simulation code can be shared. (Available online at https://github.com/BethHandorf/NonPH_IPTW)

Next, we used a propensity score model to determine each subject's probability of being assigned to the treatment condition (versus control). The effect size (log-scale) for each covariate ranged from 0.8 (stage) to 0.03 (one year of age). (See Appendix Table 1 for full specification.) Effect sizes were chosen to emulate realistic scenarios grounded in the NCDB studies. We set the intercept such that, on average, half of the population received treatment. In each simulation, observed treatment status was drawn from a Bernoulli distribution based on each individual's treatment probability, as defined by our propensity score model.

In our base case, survival times were drawn based on a Weibull model with a time-varying treatment effect.

h(t) = \gamma \lambda t^{\gamma - 1} \exp\left(\beta_1 Z + f(Z, t) + X\psi\right) \quad (2)

In our base case, the hazard ratio varied by log-time, f(Z, t) = β2 Z log(t); we set β1 = −0.69 and β2 = 0.25, with a sample size of 5000 total cases. For each individual, survival times under Z = 1 and Z = 0 were both drawn using the R package simsurv. (Brilleman, Sam, 2019) In additional scenarios, we generated survival times using the generalized gamma AFT model, with times drawn using the R package flexsurv:

\log T_i = \mu + \beta_1 Z_i + X\psi + \sigma \epsilon_i, \quad f(\epsilon_i) = \frac{|\lambda| \left( \lambda^{-2} e^{\lambda \epsilon_i} \right)^{\lambda^{-2}} \exp\left( -\lambda^{-2} e^{\lambda \epsilon_i} \right)}{\Gamma(\lambda^{-2})} \quad (3)

with β1 =−0.69 and σ=1.2. We varied the values of λ and μ in different scenarios.

In each survival model, covariate effect sizes ranged from 0 to 0.8 (see Appendix Table 1). Censoring times were drawn from the uniform (with a range of 8–15), to mimic administrative censoring. Observations with a censoring time less than their true event time were considered censored at that time.
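To see why the base-case model is straightforward to simulate from: the log-time interaction in equation (2) simply shifts the Weibull shape parameter in the treated arm, so event times can be drawn in closed form by inverting the cumulative hazard. A Python sketch of this mechanism (the study itself used the R package simsurv; β1 and β2 match the base case, while the shape and scale defaults below are purely illustrative):

```python
import math
import random

def sim_time(z, eta, gam=1.2, lam=0.1, beta1=-0.69, beta2=0.25, rng=random):
    """Draw one event time from h(t) = gam*lam*t^(gam-1) * exp(beta1*z + beta2*z*log(t) + eta).

    Because exp(beta2*z*log(t)) = t^(beta2*z), the treated arm is still Weibull,
    just with effective shape gam + beta2*z, so inverse-CDF sampling applies.
    """
    shape = gam + beta2 * z                          # effective Weibull shape
    scale_term = gam * lam * math.exp(beta1 * z + eta)
    e = -math.log(1.0 - rng.random())                # unit exponential draw
    # Solve H(T) = scale_term * T^shape / shape = e for T
    return (shape * e / scale_term) ** (1.0 / shape)
```

Here `eta` stands in for the linear predictor Xψ; with β1 = −0.69 the treated arm survives longer at typical follow-up times, and the log-time term makes the hazard ratio drift toward one.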

We simulated data using 7 parametric survival models, varying equations (2) and (3):

  1. Base case: Weibull model with β1 = −0.69 ; f(Z, t) = 0.25Zlog(t)

  2. Weibull model with piecewise constant (PWC) hazard ratio: β1 = 0; f(Z, t) = −0.25ZI(t ≥ 2), where I(·) is the indicator function

  3. Base case with modest non-proportionality β1 = −0.69 ; f(Z, t) = 0.125Zlog(t)

  4. Base case with modest treatment effect β1 = −0.41 ; f(Z, t) = 0.25Zlog(t)

  5. Base case, with a smaller sample size (N=500 per arm)

  6. Generalized gamma survival model with μ = 2 and λ=−0.5

  7. Generalized gamma model with μ = 3.5 and λ=2.5

Scenario 1 (base case) had a large treatment effect with substantial non-proportionality. Scenarios 2–4 approximated the parameters and non-proportionality from the clinical examples, and scenario 5 was used to assess how the methods perform with a more modest sample size. The generalized gamma models used in scenarios 6–7 both result in survival curves with non-proportional hazards, but they have different forms of non-proportionality from scenarios 1–5, and from each other.

Estimands:

Outcomes of interest for this study were median survival, restricted mean survival, and survival probability at time T , as described in Section 3.1

Methods:

The 7 data generating mechanisms were each used to generate 500 simulated datasets. For each patient, we generated 3 simulated survival times: 1) Their survival time when Z = 1, 2) their survival time if Z = 0, and 3) Their observed survival time, subject to their treatment assignment and their censoring time. The first 2 outcomes were used to calculate the true treatment effects for each patient, which in turn were aggregated to calculate the treatment effects for the population. The final outcome (along with a censoring indicator) was used to fit the survival models.

For each set of simulated outcomes, we first estimated the propensity scores using a logistic regression model, and then calculated the IPTWs. Using the estimated weights, we fit the models described in Section 3.2-3.5. Variance estimates, along with 95% confidence intervals were calculated via a non-parametric bootstrap, as described in Section 3.6, based on 500 bootstrap replicates.

Performance Measures:

We assessed the performance of each model based on the bias, coverage, and bootstrap standard error. Bias was defined as the difference between the estimated treatment effect and the true treatment effect, and we calculated the mean bias over the 500 simulated samples. For clarity, we present the absolute value of the mean bias in the results below. A bias close to zero is desirable for each method/estimand combination. Coverage was defined as the proportion of simulations where the true treatment effect falls within the estimated 95% confidence interval. Coverage close to the nominal 0.95 level is desirable. Standard error is the average bootstrap-based standard error over the 500 simulations. Smaller standard errors are desirable, as they indicate more efficient estimates.
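The bias and coverage computations can be expressed as a small helper (hypothetical code, not from the study's repository):

```python
def performance(estimates, ci_los, ci_his, truth):
    """Mean bias and empirical coverage over a set of simulated datasets.

    `estimates` holds one point estimate per simulation; `ci_los`/`ci_his`
    hold the corresponding bootstrap confidence limits.
    """
    n = len(estimates)
    bias = sum(e - truth for e in estimates) / n
    # Coverage: fraction of intervals that contain the true treatment effect
    coverage = sum(lo <= truth <= hi for lo, hi in zip(ci_los, ci_his)) / n
    return bias, coverage
```

The reported standard error is simply the mean of the per-simulation bootstrap standard errors, so no extra machinery is needed for it.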

4.2. Simulation Results

For the base case, the bias was often largest for the standard Cox model, as expected given the substantial degree of non-proportionality in the simulation. For example, considering median survival, bias for the Cox model was −0.347 years, while all other methods had biases less than −0.069. The generalized gamma AFT model also had large bias relative to other methods; at 10 years, its bias was actually greater than that of the Cox model (2.5% vs 2.0%, respectively). Even though this three-parameter model is flexible and allows for non-proportionality, fitting this mis-specified parametric model still introduced substantial biases into the results. In the base case, the Cox model with a treatment effect varying by log-time was correctly specified, as was the fully parametric AFT model with variable location and scale parameters. These models had lower biases than the standard Cox model (e.g. for median survival, biases were −0.042 and −0.056, respectively). However, the non-parametric Kaplan-Meier and pseudo-observation methods generally had the lowest biases (e.g. biases for median survival were −0.021 and −0.020, respectively). Finally, we note that the RMS was less sensitive to the choice of method than the other outcomes of interest; the relative differences in the sizes of the biases were smallest for this survival outcome. (see Figure 1)

Figure 1:

Mean bias* in estimates from simulation studies

CTV=Cox Time-Varying; LT=Log-Time; PWC=Piece-Wise Constant; AFT=Accelerated Failure Time; GG=Generalized Gamma; WBL LS= Weibull Location-Scale; Pseudo=Pseudo-Observations; Wtd KM=Weighted Kaplan-Meier; Mod NPH= Modest Non-Proportional Hazards; Mod TE= Modest Treatment effect; Sm SS=Small Sample Size

*For ease of comparison, we present the absolute value of the mean bias

As shown in Table 1, coverage was worst for the standard Cox model, followed by the generalized gamma AFT model. The time-varying Cox models did better, although coverage was surprisingly low for the log-time model at the 2-year outcome (0.82). Coverage for the non-parametric methods was close to the nominal level for all outcomes. Standard Errors (SEs) of the differences in treatment effects were usually slightly larger for the non-parametric methods than for the parametric and semi-parametric methods. (See Appendix Figure 1) Comparing the SEs for the Kaplan-Meier versus the Weibull model, the increases ranged from 0.3–24.4%, and were lowest for RMS and highest for 2-year survival.

Table 1:

Empirical coverage probability of the 95% bootstrap percentile confidence interval

Method 2y 5y 10y Median RMS
Cox 0.00 0.07 0.61 0.81 0.71
CTV LT 0.82 0.91 0.93 0.94 0.93
CTV PWC 0.96 0.96 0.95 0.95 0.93
AFT GG 0.00 0.37 0.44 0.94 0.90
AFT WBL LS 0.89 0.93 0.93 0.95 0.93
Pseudo 0.97 0.97 0.96 0.95 0.94
Wtd KM 0.97 0.96 0.96 0.95 0.94

For piecewise constant hazards (scenario 2), biases were typically lowest for the non-parametric methods and when using a PWC time-varying Cox model. An exception to this was for RMS, where the naïve Cox model and Weibull AFT model had the lowest bias. This likely occurred because biases from different timepoints were in different directions (positive at 2 and 5 years, negative at 10 years), and may have cancelled each other out when calculating the area under the survival curve. (Note that Figure 1 shows the absolute value of the average biases). Coverages (See Appendix Table 2) were close to the nominal level when biases were low. Standard errors were generally similar to the base case, except for median survival, where they were somewhat larger.

When the effect of non-proportionality was modest (scenario 3), biases tended to be smaller than those of the base case, especially for estimates of survival probabilities at a given timepoint. When the treatment effect was modest (scenario 4), results were generally similar to those of the base case. In the small sample size case (scenario 5), the bias was similar to that of the base case, with the exception of RMS, where across methods, biases were larger than those of the base case. As expected, SEs were higher when the sample size was smaller.

Scenarios 6 and 7, which used the generalized gamma distribution, had low biases when the generalized gamma parametric models were used. This is expected, as those models are correctly specified (unlike in scenarios 1–5). Biases were also low for the non-parametric models and for the Cox model with a piecewise constant time-varying hazard. In scenario 6, with λ = −0.5, the standard Cox model and the Weibull AFT model performed particularly poorly, while with λ = 2.5 the Cox model with a treatment effect varying by log-time generally performed the worst. Models with large biases also had the worst coverage. Notably, many methods applied to scenarios 6 and 7 achieved high coverage, in some cases above the nominal level (especially the non-parametric methods), indicating that the bootstrap procedure may overestimate the standard errors. As in the base case, standard errors were lowest when the correctly specified parametric models were used, but the improvements over the non-parametric methods were small.

5. Comparative Effectiveness of Cancer Therapies

We applied each method to estimate the effect of treatments on survival in two clinical studies using the National Cancer Database (NCDB). The NCDB is a large registry encompassing approximately 70% of newly diagnosed cancer cases in the United States, containing patient, tumor, and treatment measures (Boffa et al., 2017). The studies discussed here featured two different forms of non-proportional hazards. In the first, the treatment groups had similar survival initially, but differences arose in longer-term outcomes. In the second example, the benefit of the treatment was attenuated over time. Both studies used de-identified NCDB data, which our Institutional Review Board determined not to be human subjects research.

5.1. Sarcoma

Soft-tissue sarcoma is a rare cancer of the connective tissue (e.g., fat, muscle, blood vessels), usually arising in the extremities. Stage III disease (high grade, large, or deep tumor) is typically treated surgically or with radiation therapy. It remains uncertain whether chemotherapy provides an overall survival benefit for these patients, and its use is optional according to national guidelines (von Mehren, 2020). This question was studied in an NCDB cohort of Stage III sarcoma patients treated with definitive surgical resection of the primary tumor, with or without pre- or postoperative chemotherapy. Of the 5,337 cases, 28% were treated with chemotherapy (Movva et al., 2015).

Due to the non-randomized nature of the data, there were substantial differences between treatment groups. For example, patients treated with chemotherapy were younger and had fewer recorded co-morbid conditions. Furthermore, the hazards for the treatment groups exhibited substantial non-proportionality, as is clearly evident upon examining the complementary log-log survival curves (see Figure 2). Testing the Schoenfeld residuals provides further evidence of non-proportionality (P=0.047). We used IPTW to account for covariate imbalance between the treatment groups, estimating the propensity score via logistic regression using all available covariates (age, gender, insurance status, income, comorbidity score, tumor histology, grade, size, anatomic site, treating facility type, and travel distance). Upon assessing balance, the maximum standardized difference dropped from 0.64 (age) to 0.11 (histology); all other variables had standardized differences <0.1, with most being <0.05, indicating good balance. The IPTW Kaplan-Meier curves are shown in Figure 2. There was little effect of chemotherapy during the first two years, but patients treated with chemotherapy did better in the long term. These results motivated simulation scenario 2 (PWC).
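The weighting and balance check described above can be sketched as follows. This is an illustrative, stdlib-only Python fragment (the authors used R); the propensity scores are taken as given (the logistic regression fit is assumed done elsewhere), and the toy covariate values are hypothetical:

```python
import math

def iptw_weights(treated, ps):
    """ATE inverse-probability-of-treatment weights:
    1/ps for treated subjects, 1/(1-ps) for controls."""
    return [1.0 / p if z == 1 else 1.0 / (1.0 - p)
            for z, p in zip(treated, ps)]

def wtd_std_diff(x, treated, w):
    """Weighted standardized mean difference for covariate x,
    using weighted means and variances within each arm."""
    def wstats(vals, wts):
        sw = sum(wts)
        m = sum(v * wt for v, wt in zip(vals, wts)) / sw
        var = sum(wt * (v - m) ** 2 for v, wt in zip(vals, wts)) / sw
        return m, var
    x1 = [(v, wt) for v, z, wt in zip(x, treated, w) if z == 1]
    x0 = [(v, wt) for v, z, wt in zip(x, treated, w) if z == 0]
    m1, v1 = wstats([v for v, _ in x1], [wt for _, wt in x1])
    m0, v0 = wstats([v for v, _ in x0], [wt for _, wt in x0])
    return (m1 - m0) / math.sqrt((v1 + v0) / 2)

treated = [1, 1, 0, 0, 0]
ps = [0.6, 0.5, 0.4, 0.3, 0.4]        # toy propensity scores
w = iptw_weights(treated, ps)
x = [60, 55, 70, 65, 72]              # toy covariate (e.g., age)
d = wtd_std_diff(x, treated, w)
```

In practice, a weighted standardized difference below 0.1 for every covariate (as reported above) is the usual criterion for adequate balance.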

Figure 2:

Survival in stage III sarcoma by chemotherapy

We applied each method to the sarcoma data. Results for RMS, 2-year survival, and 5-year survival are shown in Table 2 (sample sizes at the end of follow-up were not sufficient to obtain bootstrap confidence intervals for median and 10-year survival). The different methods yielded different effect size estimates; however, these differences were often modest and in some cases would not change statistical inferences. For example, ∆RMS varied from a minimum of 0.409 (AFT with a generalized gamma distribution) to a maximum of 0.468 (Weibull AFT with time-varying shape), but all models showed an improvement in RMS for patients treated with chemotherapy. In other cases the inferences would differ based on the model chosen. At 2 years, the standard Cox model showed a statistically significant difference of 3.6% between arms. Both AFT models also found significantly higher 2-year survival in the chemotherapy group. However, the weighted Kaplan-Meier method found a non-significant difference of −0.7%. Both time-varying Cox models and the pseudo-observations method also produced non-significant differences.

Table 2:

Chemotherapy in sarcoma: results from each model

Method ∆(RMS) LCL UCL ∆(2 years) LCL UCL ∆(5 years) LCL UCL
Cox 0.452 0.146 0.757 0.036 0.013 0.060 0.052 0.018 0.088
TV Cox: log-T 0.457 0.132 0.752 0.018 −0.011 0.043 0.052 0.019 0.087
TV Cox: PWC 0.442 0.117 0.743 −0.003 −0.038 0.028 0.044 0.006 0.081
AFT Gen Gamma 0.409 0.114 0.680 0.039 0.011 0.064 0.047 0.014 0.079
AFT Wbl TV shape 0.468 0.160 0.773 0.026 0.002 0.048 0.054 0.020 0.090
Weighted K-M 0.439 0.109 0.746 −0.007 −0.043 0.026 0.041 0.003 0.078
Pseudo-obs* 0.439 0.110 0.745 −0.007 −0.043 0.026 0.041 0.003 0.078
*

Centered weights

∆ = difference in treatment effect, chemotherapy vs. no chemotherapy; boldface denotes statistical significance

LCL = Lower confidence limit, UCL = Upper confidence limit

We note that in this data analysis, the pseudo-observations method was sensitive to the mean of the estimated weights. While the mean of the weights is 1 in expectation, in practice there may be small deviations. The pseudo-observations method was more affected by these deviations than the other methods we evaluated. This sensitivity can be explained by how the weights are applied when estimating the survival function: each patient’s (binary) contribution to the survival function is weighted, and the weighted indicators are then averaged. Therefore, if the mean of the weights is not exactly 1, the survival function will not start at 1 at baseline. To address this issue, we centered the weights so their mean was exactly 1. We note, however, that this fix is heuristic rather than theoretically driven, and further exploration of this issue is warranted.
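The centering step described above amounts to rescaling the weights by their sample mean, which leaves relative weights unchanged. A minimal sketch (illustrative Python; the function name and toy weights are hypothetical):

```python
def center_weights(w):
    """Rescale IPTW weights so their mean is exactly 1, preserving
    relative weights.  This forces the weighted average of the
    (binary) survival indicators to equal 1 at baseline."""
    mean_w = sum(w) / len(w)
    return [wi / mean_w for wi in w]

w = [0.8, 1.1, 1.3, 0.9]   # toy weights; mean is 1.025, not 1
wc = center_weights(w)
```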

5.2. Renal Cancer

The second motivating clinical example for this work was a study of surgical options for early stage renal cancer. These small tumors can be treated with Radical Nephrectomy (RN) or Partial Nephrectomy (PN).

PN preserves more kidney tissue, but the surgery and recovery are more difficult, and it is more likely to be given to healthier patients (Motzer, 2020; Ristau et al., 2018). Here, we focus on a subgroup: patients aged 51–60 with T1a tumors, giving a sample size of 28,973 patients (61.1% treated with PN). In unadjusted analysis, patients given PN had improved survival over those given RN. However, these results are subject to confounding by indication. Examination of the complementary log-log plots shows a notable deviation from proportionality, which is confirmed by testing the Schoenfeld residuals (P<0.0001) (Grambsch & Therneau, 1994). Again, we estimated propensity scores via logistic regression. Here, the maximum standardized difference dropped from 0.30 (facility type) to 0.01 (geographic location), indicating excellent balance.

When we fit an IPTW Cox model to these data, allowing the hazard to vary as a function of log-time, such that h(t) = λ0(t) exp(β1Z + β2Z log(t)), the maximum likelihood estimates are β̂1 = 0.468 and β̂2 = 0.136. Therefore, the non-proportionality observed here is close to our modest time-varying hazard simulation scenario 3, and the treatment effect is modest as well, similar to what we used in scenario 4.
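Under this specification the hazard ratio at time t is HR(t) = exp(β1 + β2 log t), so it changes smoothly with follow-up time. A small sketch (illustrative Python; the function name is hypothetical, and the coefficient values are simply the estimates printed above, taken as written):

```python
import math

def hazard_ratio(t, b1, b2):
    """HR(t) = exp(b1 + b2*log(t)) under the log-time time-varying
    Cox specification h(t) = h0(t) * exp(b1*Z + b2*Z*log(t))."""
    return math.exp(b1 + b2 * math.log(t))

b1, b2 = 0.468, 0.136        # estimates reported for the renal-cancer fit
hr_1y = hazard_ratio(1.0, b1, b2)   # at t = 1, log(t) = 0, so HR = exp(b1)
```

At t = 1 the log-time term vanishes and the hazard ratio reduces to exp(β1); the β2 term then moves the hazard ratio up or down as follow-up accumulates.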

Our results were largely similar regardless of which model was used (see Table 3). This was particularly true for RMS and five-year survival. Two-year survival showed the largest range in results, with estimated differences ranging from 0.9% (generalized gamma AFT model) to 1.6% (Cox model with piecewise time-varying effects). The standard Cox model gave an estimated difference of 1%, compared to 1.5% from the Kaplan-Meier method. Even with these small effects, none of the 95% confidence limits crossed zero, so all models would lead to the same inference: a small benefit for PN over RN. The congruous inferences were partially driven by the very large sample size and the resulting narrow confidence intervals; however, given the clear evidence of non-proportionality, it is notable that the point estimates for the standard Cox model were close to those of the time-varying and non-parametric methods.

Table 3:

Effect of PN vs RN in renal cancer: Results from each model

Method ∆(RMS) LCL UCL ∆(2 years) LCL UCL ∆(5 years) LCL UCL
Cox 0.268 0.199 0.331 0.010 0.007 0.012 0.027 0.020 0.034
TV Cox: log-T 0.267 0.199 0.330 0.014 0.010 0.018 0.029 0.022 0.036
TV Cox: PWC 0.264 0.196 0.327 0.016 0.012 0.020 0.030 0.021 0.037
AFT Gen Gamma 0.263 0.194 0.326 0.009 0.007 0.012 0.026 0.020 0.033
AFT Wbl TV shape 0.264 0.196 0.329 0.019 0.014 0.023 0.031 0.024 0.038
Weighted K-M 0.268 0.200 0.333 0.015 0.011 0.020 0.029 0.020 0.036
Pseudo-obs* 0.269 0.200 0.334 0.015 0.011 0.020 0.029 0.020 0.036
*

Centered weights

∆ = difference in treatment effect, PN vs. RN; boldface denotes statistical significance

LCL = Lower confidence limit, UCL = Upper confidence limit

6. Discussion

When analyzing survival outcomes, we often encounter non-proportional hazards. Based on the findings of our neutral comparison study, non-proportionality is readily addressable in practice with weighted Kaplan-Meier curves, the simplest method we assessed, which performed well across a variety of scenarios and outcome measures of interest. The IPTW Kaplan-Meier method does not require specialized software or methods. There is an efficiency penalty, but the increase in the standard errors (compared to parametric methods) was small in practice. We believe that the larger standard errors are a minor drawback compared to the benefits of fewer assumptions and reduced bias.
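The weighted Kaplan-Meier estimator (in the spirit of Xie & Liu, 2005) simply lets each subject contribute its IPTW weight to the risk set and to the event count at its failure time. A self-contained sketch (illustrative Python, not a production implementation; in R this is available via weighted `survfit`):

```python
def weighted_km(times, events, weights):
    """Weighted Kaplan-Meier: at each distinct event time t, multiply
    the survival estimate by (1 - weighted events / weighted at-risk)."""
    data = sorted(zip(times, events, weights))   # sort by time
    at_risk = sum(weights)                       # weighted risk-set size
    surv, curve = 1.0, []
    i, n = 0, len(data)
    while i < n:
        t = data[i][0]
        r_w = at_risk        # weighted at-risk just before t
        d_w = 0.0            # weighted events at t
        while i < n and data[i][0] == t:
            if data[i][1]:               # event (1) vs censored (0)
                d_w += data[i][2]
            at_risk -= data[i][2]        # leaves the risk set either way
            i += 1
        if d_w > 0:
            surv *= 1.0 - d_w / r_w
            curve.append((t, surv))
    return curve

# unit weights reproduce the ordinary Kaplan-Meier estimate
curve = weighted_km([1, 2, 3, 4], [1, 0, 1, 1], [1.0, 1.0, 1.0, 1.0])
```

With unit weights this reduces to the usual product-limit estimator; with IPTW weights it estimates the survival curve in the weighted pseudo-population.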

This work has several limitations. Here, we used simple IPTW weights. Alternative propensity score weights have also been proposed (Austin & Stuart, 2015); notably, variance-stabilized weights have good properties, especially when one treatment is given to a small proportion of patients. In the current paper, we used a correctly specified logistic regression model to estimate the propensity score. The impact of model misspecification was outside the scope of this paper, and this issue has been studied elsewhere (Austin & Stuart, 2017). In modern practice, propensity score models can be estimated using flexible approaches that do not rely on a specific parametric form (van der Laan et al., 2007; Pirracchio et al., 2015; Cannas & Arpino, 2019; Zou et al., 2021). Such methods have been shown to improve prediction and may help to achieve balance between treatment arms. A further limitation of our paper is that we did not consider the potential effects of unmeasured confounders; propensity score approaches will not balance unmeasured confounders (Austin, 2011). Additionally, the non-proportionality we studied was limited to four functional forms. Our simulation studies were directly informed by the two cancer studies discussed in Section 5. We extended our simulations by examining survival data generated by two very different generalized gamma models and found consistent results. However, it is possible that some forms of non-proportionality may behave differently.

In this paper, we sought to better understand how the choice of method may affect the results of these real-world analyses, and to learn broader lessons about how best to accommodate non-proportionality in large observational studies. Although both clinical examples had large sample sizes, our simulations showed that the results held up even when sample sizes were small. Taken together, our simulation results and clinical examples show that one can easily protect against incorrect inferences by using IPTW Kaplan-Meier curves to estimate treatment effects.

Supplementary Material

Figure 3:

Survival in early stage renal cancer by surgery type

Acknowledgement

Elizabeth Handorf’s research in this publication was supported by the National Cancer Institute of the National Institutes of Health under Award Numbers P30CA006927 and U54CA221705. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. The data used in the study are derived from a de-identified NCDB file. The American College of Surgeons and the Commission on Cancer have not verified and are not responsible for the analytic or statistical methodology employed, or the conclusions drawn from these data by the investigator.

Footnotes

Declaration of Conflicting Interests: The Authors declare that there are no conflicts of interest.

IRB Statement: As this study used only de-identified data, it was determined by the Fox Chase Cancer Center Institutional Review Board to be not human subjects research.

Data Availability:

All code to generate and analyze data for the simulation studies is available at https://github.com/BethHandorf/NonPH_IPTW

All code to analyze the data for the clinical examples is also available on GitHub. Applications to obtain the data used here can be made to the American College of Surgeons at https://www.facs.org/quality-programs/cancer/ncdb/puf

References

  1. Ambrogi F, Iacobelli S, & Andersen PK (2022). Analyzing differences between restricted mean survival time curves using pseudo-values. BMC Medical Research Methodology, 22(1), 71. doi: 10.1186/s12874-022-01559-z
  2. Andersen PK, Hansen MG, & Klein JP (2004). Regression analysis of restricted mean survival time based on pseudo-observations. Lifetime Data Analysis, 10(4), 335–350. doi: 10.1007/s10985-004-4771-0
  3. Andersen PK, & Pohar Perme M (2010). Pseudo-observations in survival analysis. Statistical Methods in Medical Research, 19(1), 71–99. doi: 10.1177/0962280209105020
  4. Andersen PK, Syriopoulou E, & Parner ET (2017). Causal inference in survival analysis using pseudo-observations. Statistics in Medicine, 36(17), 2669–2681. doi: 10.1002/sim.7297
  5. Austin PC (2011). An introduction to propensity score methods for reducing the effects of confounding in observational studies. Multivariate Behavioral Research, 46(3), 399–424. doi: 10.1080/00273171.2011.568786
  6. Austin PC (2016). Variance estimation when using inverse probability of treatment weighting (IPTW) with survival analysis. Statistics in Medicine, 35(30), 5642–5655. doi: 10.1002/sim.7084
  7. Austin PC, & Stuart EA (2015). Moving towards best practice when using inverse probability of treatment weighting (IPTW) using the propensity score to estimate causal treatment effects in observational studies. Statistics in Medicine, 34(28), 3661–3679. doi: 10.1002/sim.6607
  8. Austin PC, & Stuart EA (2017). The performance of inverse probability of treatment weighting and full matching on the propensity score in the presence of model misspecification when estimating the effect of treatment on survival outcomes. Statistical Methods in Medical Research, 26(4), 1654–1670. doi: 10.1177/0962280215584401
  9. Barbiero A, & Ferrari PA (2015). GenOrd: Simulation of Discrete Random Variables with Given Correlation Matrix and Marginal Distributions.
  10. Boffa DJ, Rosen JE, Mallin K, Loomis A, Gay G, Palis B, … Winchester DP (2017). Using the National Cancer Database for outcomes research: A review. JAMA Oncology, 3(12), 1722–1728. doi: 10.1001/jamaoncol.2016.6905
  11. Brilleman S (2019). simsurv: Simulate Survival Data.
  12. Buckley J, & James I (1979). Linear regression with censored data. Biometrika, 66(3), 429–436. doi: 10.1093/biomet/66.3.429
  13. Burzykowski T (2022). Semi-parametric accelerated failure-time model: A useful alternative to the proportional-hazards model in cancer clinical trials. Pharmaceutical Statistics, 21(2), 292–308. doi: 10.1002/pst.2169
  14. Cannas M, & Arpino B (2019). A comparison of machine learning algorithms and covariate balance measures for propensity score matching and weighting. Biometrical Journal, 61(4), 1049–1072. doi: 10.1002/bimj.201800132
  15. Chiou SH, Kang S, & Yan J (2015). Rank-based estimating equations with general weight for accelerated failure time models: An induced smoothing approach. Statistics in Medicine, 34(9), 1495–1510. doi: 10.1002/sim.6415
  16. Chiou SH, Kang S, & Yan J (2014). Fitting accelerated failure time models in routine survival analysis with R package aftgee. Journal of Statistical Software, 61, 1–23. doi: 10.18637/jss.v061.i11
  17. Cole SR, & Hernán MA (2004). Adjusted survival curves with inverse probability weights. Computer Methods and Programs in Biomedicine, 75(1), 45–49. doi: 10.1016/j.cmpb.2003.10.004
  18. Collett D (2015). Modelling Survival Data in Medical Research. Chapman and Hall/CRC.
  19. Conner SC, Sullivan LM, Benjamin EJ, LaValley MP, Galea S, & Trinquart L (2019). Adjusted restricted mean survival times in observational studies. Statistics in Medicine, 38(20), 3832–3860. doi: 10.1002/sim.8206
  20. Cox C, Chu H, Schneider MF, & Muñoz A (2007). Parametric survival analysis and taxonomy of hazard functions for the generalized gamma distribution. Statistics in Medicine, 26(23), 4352–4374. doi: 10.1002/sim.2836
  21. Cox DR (1972). Regression models and life-tables. Journal of the Royal Statistical Society: Series B (Methodological), 34(2), 187–202. doi: 10.1111/j.2517-6161.1972.tb00899.x
  22. Deb S, Austin PC, Tu JV, Ko DT, Mazer CD, Kiss A, & Fremes SE (2016). A review of propensity-score methods and their use in cardiovascular research. Canadian Journal of Cardiology, 32(2), 259–265. doi: 10.1016/j.cjca.2015.05.015
  23. Efron B, & Tibshirani RJ (1993). An Introduction to the Bootstrap. Chapman and Hall/CRC.
  24. Enewold L, Parsons H, Zhao L, Bott D, Rivera DR, Barrett MJ, … Warren JL (2020). Updated overview of the SEER-Medicare data: Enhanced content and applications. Journal of the National Cancer Institute. Monographs, 2020(55), 3–13. doi: 10.1093/jncimonographs/lgz029
  25. Funk MJ, Westreich D, Wiesen C, Stürmer T, Brookhart MA, & Davidian M (2011). Doubly robust estimation of causal effects. American Journal of Epidemiology, 173(7), 761–767. doi: 10.1093/aje/kwq439
  26. Grambsch PM, & Therneau TM (1994). Proportional hazards tests and diagnostics based on weighted residuals. Biometrika, 81(3), 515–526. doi: 10.2307/2337123
  27. Hernán MA, Cole SR, Margolick J, Cohen M, & Robins JM (2005). Structural accelerated failure time models for survival analysis in studies with time-varying treatments. Pharmacoepidemiology and Drug Safety, 14(7), 477–491. doi: 10.1002/pds.1064
  28. Horiguchi M, Hassett MJ, & Uno H (2019). How do the accrual pattern and follow-up duration affect the hazard ratio estimate when the proportional hazards assumption is violated? The Oncologist, 24(7), 867–871. doi: 10.1634/theoncologist.2018-0141
  29. Jackson C (2016). flexsurv: A platform for parametric survival modeling in R. Journal of Statistical Software, 70(8), 1–33. doi: 10.18637/jss.v070.i08
  30. Jiménez JL (2022). Quantifying treatment differences in confirmatory trials under non-proportional hazards. Journal of Applied Statistics, 49(2), 466–484. doi: 10.1080/02664763.2020.1815673
  31. Jin Z, Lin DY, & Ying Z (2006). On least-squares regression with censored data. Biometrika, 93(1), 147–161.
  32. Kalbfleisch JD, & Prentice RL (1981). Estimation of the average hazard ratio. Biometrika, 68(1), 105–112. doi: 10.1093/biomet/68.1.105
  33. Kaplan EL, & Meier P (1958). Nonparametric estimation from incomplete observations. Journal of the American Statistical Association, 53(282), 457–481. doi: 10.1080/01621459.1958.10501452
  34. Klein JP, & Moeschberger ML (2003). Survival Analysis: Techniques for Censored and Truncated Data (2nd ed.). Springer.
  35. Komárek A, Lesaffre E, & Hilton JF (2005). Accelerated failure time model for arbitrarily censored data with smoothed error distribution. Journal of Computational and Graphical Statistics, 14(3), 726–745. doi: 10.1198/106186005X63734
  36. Lunceford JK, & Davidian M (2004). Stratification and weighting via the propensity score in estimation of causal treatment effects: A comparative study. Statistics in Medicine, 23(19), 2937–2960. doi: 10.1002/sim.1903
  37. Mao H, Li L, Yang W, & Shen Y (2018). On the propensity score weighting analysis with survival outcome: Estimands, estimation, and inference. Statistics in Medicine, 37(26), 3745–3763. doi: 10.1002/sim.7839
  38. Marshall AW, & Olkin I (1997). A new method for adding a parameter to a family of distributions with application to the exponential and Weibull families. Biometrika, 84(3), 641–652. doi: 10.1093/biomet/84.3.641
  39. Morris TP, White IR, & Crowther MJ (2019). Using simulation studies to evaluate statistical methods. Statistics in Medicine, 38(11), 2074–2102. doi: 10.1002/sim.8086
  40. Motzer RJ (2020). NCCN Clinical Practice Guidelines in Oncology (NCCN Guidelines®): Kidney Cancer, version 1.2021. National Comprehensive Cancer Network.
  41. Movva S, von Mehren M, Ross EA, & Handorf E (2015). Patterns of chemotherapy administration in high-risk soft tissue sarcoma and impact on overall survival. Journal of the National Comprehensive Cancer Network, 13(11), 1366–1374. doi: 10.6004/jnccn.2015.0165
  42. Mozumder SI, Rutherford MJ, & Lambert PC (2021). Estimating restricted mean survival time and expected life-years lost in the presence of competing risks within flexible parametric survival models. BMC Medical Research Methodology, 21(1), 52. doi: 10.1186/s12874-021-01213-0
  43. Pang M, Platt RW, Schuster T, & Abrahamowicz M (2021). Flexible extension of the accelerated failure time model to account for nonlinear and time-dependent effects of covariates on the hazard. Statistical Methods in Medical Research, 30(11), 2526–2542. doi: 10.1177/09622802211041759
  44. Pirracchio R, Petersen ML, & van der Laan M (2015). Improving propensity score estimators’ robustness to model misspecification using Super Learner. American Journal of Epidemiology, 181(2), 108–119. doi: 10.1093/aje/kwu253
  45. Pohar Perme M, & Gerster M (2017). pseudo: Computes Pseudo-Observations for Modeling.
  46. Rahman R, Fell G, Ventz S, Arfé A, Vanderbeek AM, Trippa L, & Alexander BM (2019). Deviation from the proportional hazards assumption in randomized phase 3 clinical trials in oncology: Prevalence, associated factors, and implications. Clinical Cancer Research, 25(21), 6339–6345. doi: 10.1158/1078-0432.CCR-18-3999
  47. Ristau BT, Handorf EA, Cahn DB, Kutikov A, Uzzo RG, & Smaldone MC (2018). Partial nephrectomy is not associated with an overall survival advantage over radical nephrectomy in elderly patients with stage Ib-II renal masses: An analysis of the National Cancer Data Base. Cancer, 124(19), 3839–3848. doi: 10.1002/cncr.31582
  48. Robins J, & Tsiatis AA (1992). Semiparametric estimation of an accelerated failure time model with time-dependent covariates. Biometrika, 79(2), 311–319. doi: 10.1093/biomet/79.2.311
  49. Rosenbaum PR, & Rubin DB (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1), 41–55. doi: 10.2307/2335942
  50. Rosenbaum PR, & Rubin DB (1984). Reducing bias in observational studies using subclassification on the propensity score. Journal of the American Statistical Association, 79, 516–524.
  51. Royston P, & Parmar MKB (2002). Flexible parametric proportional-hazards and proportional-odds models for censored survival data, with application to prognostic modelling and estimation of treatment effects. Statistics in Medicine, 21(15), 2175–2197. doi: 10.1002/sim.1203
  52. Royston P, & Parmar MKB (2011). The use of restricted mean survival time to estimate the treatment effect in randomized clinical trials when the proportional hazards assumption is in doubt. Statistics in Medicine, 30(19), 2409–2421. doi: 10.1002/sim.4274
  53. Sachs MC, & Gabriel EE (2022). Event history regression with pseudo-observations: Computational approaches and an implementation in R. Journal of Statistical Software, 102, 1–34. doi: 10.18637/jss.v102.i09
  54. Schemper M, Wakounig S, & Heinze G (2009). The estimation of average hazard ratios by weighted Cox regression. Statistics in Medicine, 28(19), 2473–2489. doi: 10.1002/sim.3623
  55. Therneau TM (2020). A Package for Survival Analysis in R.
  56. Tian L, Jin H, Uno H, Lu Y, Huang B, Anderson KM, & Wei L (2020). On the empirical choice of the time window for restricted mean survival time. Biometrics, 76(4), 1157–1166. doi: 10.1111/biom.13237
  57. Uno H, Claggett B, Tian L, Inoue E, Gallo P, Miyata T, … Wei LJ (2014). Moving beyond the hazard ratio in quantifying the between-group difference in survival analysis. Journal of Clinical Oncology, 32(22), 2380–2385. doi: 10.1200/JCO.2014.55.2208
  58. van der Laan MJ, Polley EC, & Hubbard AE (2007). Super Learner. Statistical Applications in Genetics and Molecular Biology, 6(1).
  59. von Mehren M (2020). NCCN Clinical Practice Guidelines in Oncology (NCCN Guidelines®): Soft Tissue Sarcoma, version 2.2020. National Comprehensive Cancer Network.
  60. Wei LJ (1992). The accelerated failure time model: A useful alternative to the Cox regression model in survival analysis. Statistics in Medicine, 11(14–15), 1871–1879.
  61. Xie J, & Liu C (2005). Adjusted Kaplan–Meier estimator and log-rank test with inverse probability of treatment weighting for survival data. Statistics in Medicine, 24(20), 3089–3110. doi: 10.1002/sim.2174
  62. Zhang M, & Schaubel DE (2012). Double-robust semiparametric estimator for differences in restricted mean lifetimes in observational studies. Biometrics, 68(4), 999–1009.
  63. Zou B, Mi X, Tighe PJ, Koch GG, & Zou F (2021). On kernel machine learning for propensity score estimation under complex confounding structures. Pharmaceutical Statistics, 20(4), 752–764. doi: 10.1002/pst.2105
