Author manuscript; available in PMC: 2020 Apr 1.
Published in final edited form as: Lifetime Data Anal. 2018 Mar 30;25(2):361–379. doi: 10.1007/s10985-018-9429-4

Bayes Factors For Choosing Among Six Common Survival Models

Jiajia Zhang 1, Timothy Hanson 2, Haiming Zhou 3
PMCID: PMC6165714  NIHMSID: NIHMS956248  PMID: 29603046

Abstract

A super model that includes proportional hazards, proportional odds, accelerated failure time, accelerated hazards, and extended hazards models, as well as the model proposed in Diao et al. (2013) accounting for crossed survival as special cases is proposed for the purpose of testing and choosing among these popular semiparametric models. Efficient methods for fitting and computing fast, approximate Bayes factors are developed using a nonparametric baseline survival function based on a transformed Bernstein polynomial. All manner of censoring is accommodated including right, left, and interval censoring, as well as data that are observed exactly and mixtures of all of these; current status data are included as a special case. The method is tested on simulated data and two real data examples. The approach is easily carried out via a new function in the spBayesSurv R package.

Keywords: Interval censoring, Model choice, Bernstein polynomial, Bayes factor

1 Introduction

One of the central interests in health sciences research is to identify and quantify the association between the mortality/incidence of a certain disease and its potential risk factors, so that risk factors can be used in disease prevention and control. Survival models serve as the major statistical tools in analyzing mortality/incidence data. Among them, the Cox proportional hazards model (PH) (Cox, 1992) is unquestionably the most popular one in practice, where the risk factor has a multiplicative association with hazard risk. Let Hx(t) be the cumulative hazard function for a subject with covariates x = (x1, x2, …, xp)′ and H0(t) be the baseline cumulative hazard function for those with x = 0. The proportional hazards model can be written as

$$H_x(t) = e^{\beta' x} H_0(t),$$

where β = (β1, β2, …, βp)′ is a vector of unknown coefficients, and $e^{\beta_j}$ represents the hazard ratio corresponding to a one-unit increase of the jth covariate. When the proportional hazards assumption is invalid, the accelerated failure time (AFT) model (Kalbfleisch and Prentice, 2011) and the proportional odds (PO) model can be considered as alternative models. Let Sx(t) and S0(t) denote the survival function and baseline survival function, and hx(t), h0(t) denote the hazard and baseline hazard function corresponding to Hx(t), H0(t). The accelerated failure time model can be written as

$$S_x(t) = S_0\{e^{\beta' x} t\},$$

where $e^{\beta_j}$ represents the time scale change due to the jth covariate in survival probability. The proportional odds model can be written as

$$\frac{1 - S_x(t)}{S_x(t)} = e^{\beta' x}\,\frac{1 - S_0(t)}{S_0(t)},$$

where $e^{\beta_j}$ represents the change in failure odds by time t due to the jth covariate.

None of the models mentioned above allows crossing survival curves for different covariate combinations. Failure to capture crossing survival may incorrectly characterize the association between risk factors and mortality/incidence. To address this potentially unrealistic constraint, Chen and Wang (2000) and Chen et al. (2014) consider the accelerated hazards model (AH), which is

$$h_x(t) = h_0\{e^{\beta' x} t\},$$

where $e^{\beta_j}$ represents the time scale change in hazard risk due to the jth covariate. Zhang and Peng (2009) discuss properties of the hazard function under PH, AH and AFT models. Etezadi-Amoli and Ciampi (1987), Chen and Jewell (2001), and Li et al. (2015) consider the extended hazards model (EH)

$$h_x(t) = \exp(\beta' x)\, h_0\{\exp(\gamma' x)\, t\}. \tag{1}$$

Here, β = γ gives AFT, γ = 0 gives PH, and β = 0 gives the AH model.

Quantin et al. (1996) consider a generalization of PH that allows for crossing survival curves, $H_x(t) = e^{\beta' x} H_0(t)^{\exp(x'\gamma)}$; the PH model is formally nested within it when γ = 0. Also see Devarajan and Ebrahimi (2011) and references therein. Diao et al. (2013) add covariates to the model of Yang and Prentice (2005) (YP), yielding the hazard model

$$h_x(t) = \frac{\exp(\beta' x + \gamma' x)\, h_0(t)}{\exp(\beta' x) F_0(t) + \exp(\gamma' x) S_0(t)}. \tag{2}$$

They point out that

$$\lim_{t \to 0^+} \frac{h_{x_1}(t)}{h_{x_2}(t)} = e^{\beta'(x_1 - x_2)}, \qquad \lim_{t \to \infty} \frac{h_{x_1}(t)}{h_{x_2}(t)} = e^{\gamma'(x_1 - x_2)},$$

thus β gives a short-term relative risk interpretation whereas γ gives a long-term relative risk interpretation. Note that β = γ and γ = 0 give PH and PO as formally nested special cases, respectively.

Each model listed above can capture different characteristics of survival data. However, choosing which model is the most appropriate and accurate in reflecting the association between the potential risk factors and mortality/incidence is a challenge and an important question that needs to be addressed. The YP model (2) and EH model (1) augment β with an entirely new set of p regression effects, say γ, to formally nest simpler models within a larger super model. Such augmentations allow simpler models to arise as special cases from standard linear constraints on the parameters, so that likelihood ratio tests for frequentist models, or efficient computation of Bayes factors for Bayesian models, can be used.

Another complication arising from survival data is that survival times can be censored in myriad ways, including right, left, and interval censoring (Sun, 2006), as well as data that are observed exactly and mixtures of all of these; current status data are included as a special case. It is challenging to handle all types of censoring simultaneously using frequentist approaches.

This paper develops a super model that includes the PH, PO, AFT, AH, YP and EH models as formally nested special cases. As such, model choice among these models can be carried out by computing approximate Bayes factors based on the Savage-Dickey ratio (Verdinelli and Wasserman, 1995). A transformed Bernstein polynomial prior proposed by Chen et al. (2014) is used to model baseline survival S0, and a multivariate normal g-prior for regression coefficients is developed. All manner of censoring is accommodated and the approach is implemented via a new function in the spBayesSurv R package. Once a model is chosen, any of PH, PO, or AFT can be fitted through many existing R packages, including the spBayesSurv R package. The remainder of the paper is organized as follows: Section 2 describes the proposed super model; Section 3 lists details about the Bayesian estimation procedure, including prior development, posterior sampling, and Bayes factor computation; Section 4 presents a simulation and two real data analyses with software implementation. Conclusions are given in Section 5.

2 Model

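The proposed super model has the following closed form

$$S_x(t) = \left[1 + e^{(\beta_o - \beta_h + \beta_q)' x}\, \frac{F_0\{e^{\beta_q' x} t\}}{S_0\{e^{\beta_q' x} t\}}\right]^{-\exp\{(\beta_h - \beta_q)' x\}}, \tag{3}$$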

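where the baseline cumulative distribution and survival functions are F0(·) and S0(·). Differentiating −log Sx(t) in (3) with respect to t, the hazard is computed to be

$$h_x(t) = \frac{e^{(\beta_o + \beta_h + \beta_q)' x}\, h_0\{e^{\beta_q' x} t\}}{e^{(\beta_o + \beta_q)' x} F_0\{e^{\beta_q' x} t\} + e^{\beta_h' x} S_0\{e^{\beta_q' x} t\}}. \tag{4}$$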

Then fx (t) = hx (t)Sx (t) through (3) and (4).
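As an informal check on the relationship between the survival function (3) and its hazard, the following sketch compares a closed-form hazard, obtained by differentiating −log Sx(t), with a finite-difference derivative. It assumes a scalar covariate and a hypothetical log-logistic baseline S0(t) = 1/(1 + t²); the function names and parameter values are illustrative only and are not part of spBayesSurv.

```python
import math

def baseline_S(t):
    """Hypothetical baseline survival S0(t) = 1 / (1 + t^2) (illustration only)."""
    return 1.0 / (1.0 + t * t)

def baseline_h(t):
    """Hazard of the baseline above: h0(t) = 2t / (1 + t^2)."""
    return 2.0 * t / (1.0 + t * t)

def super_S(t, x, bh, bo, bq):
    """Survival function of the super model for a scalar covariate x."""
    u = math.exp(bq * x) * t
    odds = (1.0 - baseline_S(u)) / baseline_S(u)
    return (1.0 + math.exp((bo - bh + bq) * x) * odds) ** (-math.exp((bh - bq) * x))

def super_h(t, x, bh, bo, bq):
    """Closed-form hazard obtained by differentiating -log S_x(t)."""
    u = math.exp(bq * x) * t
    s0 = baseline_S(u)
    num = math.exp((bo + bh + bq) * x) * baseline_h(u)
    den = math.exp((bo + bq) * x) * (1.0 - s0) + math.exp(bh * x) * s0
    return num / den

def numeric_h(t, x, bh, bo, bq, eps=1e-6):
    """Central-difference approximation to -d/dt log S_x(t)."""
    lo = math.log(super_S(t - eps, x, bh, bo, bq))
    hi = math.log(super_S(t + eps, x, bh, bo, bq))
    return -(hi - lo) / (2.0 * eps)

# Arbitrary illustrative values of (t, x, beta_h, beta_o, beta_q).
args = (1.3, 0.7, 0.4, -0.2, 0.9)
print(super_h(*args), numeric_h(*args))
```

The two printed values should agree to several decimal places; the density then follows as the product of the hazard and the survival function.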

The super model includes PH, AFT, PO, AH, EH and YP models as special cases. One can show that if Hh : βq = 0, βo = βh is true, then

$$S_x(t) = S_0(t)^{\exp(\beta_h' x)},$$

and the PH model obtains. Similarly, assuming Ho : βq = βh = 0 implies

$$\frac{F_x(t)}{S_x(t)} = e^{\beta_o' x}\, \frac{F_0(t)}{S_0(t)},$$

the PO model. Assuming Hq : βo = 0, βh = βq implies

$$S_x(t) = S_0\{e^{\beta_q' x} t\},$$

the AFT model (proportional quantiles). Assuming Ha : βh = 0, βq + βo = 0 implies

$$h_x(t) = h_0\{e^{\beta_q' x} t\},$$

the AH model obtains. YP model (2) occurs as a special case when Hy : βq = 0; EH model (1) is a special case when He : βh = βq + βo.
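The nesting claims above can be checked numerically. The sketch below evaluates the super model survival function (3) under the constraints Hh, Ho, Hq and Ha and compares each with the corresponding simpler model evaluated directly. It assumes a scalar covariate and a hypothetical log-logistic baseline; all names and numerical values are illustrative.

```python
import math

def S0(t):
    """Hypothetical baseline survival, chosen only for illustration."""
    return 1.0 / (1.0 + t ** 1.5)

def super_S(t, x, bh, bo, bq):
    """Super model survival function (3) for a scalar covariate x."""
    u = math.exp(bq * x) * t
    odds = (1.0 - S0(u)) / S0(u)
    return (1.0 + math.exp((bo - bh + bq) * x) * odds) ** (-math.exp((bh - bq) * x))

t, x, b = 2.0, 0.8, 0.6

# H_h: beta_q = 0, beta_o = beta_h  ->  PH, S_x = S0^exp(beta_h x)
ph_gap = super_S(t, x, bh=b, bo=b, bq=0.0) - S0(t) ** math.exp(b * x)

# H_o: beta_q = beta_h = 0  ->  PO, odds of failure scale by exp(beta_o x)
po = super_S(t, x, bh=0.0, bo=b, bq=0.0)
po_gap = (1.0 - po) / po - math.exp(b * x) * (1.0 - S0(t)) / S0(t)

# H_q: beta_o = 0, beta_h = beta_q  ->  AFT, S_x = S0(exp(beta_q x) t)
aft_gap = super_S(t, x, bh=b, bo=0.0, bq=b) - S0(math.exp(b * x) * t)

# H_a: beta_h = 0, beta_q + beta_o = 0  ->  AH, H_x = exp(-beta_q x) H0(exp(beta_q x) t)
ah_gap = (super_S(t, x, bh=0.0, bo=-b, bq=b)
          - S0(math.exp(b * x) * t) ** math.exp(-b * x))

print(ph_gap, po_gap, aft_gap, ah_gap)
```

All four gaps should be zero up to floating-point rounding.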

We seek to fit model (3) assuming a transformed Bernstein polynomial prior on S0(·), and test the adequacy of the formally nested hypotheses Hh, Ho, Hq, Ha, Hy and He via Bayes factors.

3 Priors and Bayes factors

3.1 Transformed Bernstein polynomial prior on baseline survival S0

For a given positive integer J, the Bernstein polynomial of degree J − 1 is defined by

$$b(x \mid J, \xi_J) = \sum_{j=1}^{J} \xi_{Jj}\, \beta(x \mid j, J - j + 1), \tag{5}$$

where ξJ = (ξJ1, …, ξJJ)′ is a vector of positive weights satisfying $\sum_{j=1}^{J} \xi_{Jj} = 1$ and β(·|a, b) denotes a beta density with parameters (a, b). Clearly b(x|J, ξJ) is a density function and is very flexible; in fact, any smooth density with support (0, 1) can be well approximated by a Bernstein polynomial (Ghosal, 2001). More precisely, if f(x) is any continuously differentiable density with support (0, 1) and bounded second derivative, it can be shown that, with a suitable choice of ξJ,

$$\sup_{0 < x < 1} \left| f(x) - b(x \mid J, \xi_J) \right| = O(J^{-1}).$$

Integrating (5) gives the corresponding cumulative distribution function (cdf)

$$B(x \mid J, \xi_J) = \sum_{j=1}^{J} \xi_{Jj}\, I_x(j, J - j + 1), \tag{6}$$

where Ix(a, b) is the cdf associated with β(x|a, b). Note that one can calculate (6) without too much computational cost through the recursive relation

$$I_x(j + 1, J - j) = I_x(j, J - j + 1) - \frac{\Gamma(J + 1)}{\Gamma(j + 1)\,\Gamma(J - j + 1)}\, x^{j} (1 - x)^{J - j}.$$
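A minimal sketch of the recursion follows, assuming integer beta parameters so that the gamma-function ratio reduces to a binomial coefficient. The function name `bernstein_cdf` is hypothetical and not taken from spBayesSurv.

```python
import math

def bernstein_cdf(x, weights):
    """Evaluate B(x | J, xi) = sum_j xi_j * I_x(j, J - j + 1) for 0 <= x <= 1.

    The recursion starts from I_x(1, J) = 1 - (1 - x)^J and subtracts the
    binomial term C(J, j) x^j (1 - x)^(J - j) at each step.
    """
    J = len(weights)
    tail = 1.0 - (1.0 - x) ** J  # I_x(1, J)
    total = 0.0
    for j in range(1, J + 1):
        total += weights[j - 1] * tail
        # I_x(j + 1, J - j) = I_x(j, J - j + 1) - C(J, j) x^j (1 - x)^(J - j)
        tail -= math.comb(J, j) * x ** j * (1.0 - x) ** (J - j)
    return total

# With equal weights 1/J the cdf reduces to the uniform cdf, B(x) = x.
J = 15
uniform_weights = [1.0 / J] * J
print(bernstein_cdf(0.37, uniform_weights))
```

Each evaluation costs O(J) operations, since every step reuses the previous incomplete beta value.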

A referee has brought up the issue of consistency and the choice of J. Note at the outset that none of the semiparametric models under consideration support the “truth,” as they are all first-order approximations to reality formulated to provide readily interpretable regression coefficients. However, some assurance that the Bernstein polynomial supports a wide range of density shapes and is consistent over this range is comforting. Petrone and Wasserman (2002) show that under mild conditions on the true underlying density and suitable priors on J ∈ ℕ+ and ξJ, the posterior predictive density (i.e. Bayes estimate of the density with respect to quadratic loss) is Hellinger consistent. Fitting such a model is complicated and typically done via reversible jump MCMC (Green, 1995). As such, the vast majority of authors simply fix J at some “reasonable” value, truncating the estimate; Chen et al. (2014) suggest J = 15 based on simulations involving the random L1 distance between the prior and the truth. Petrone and Wasserman (2002) further argue that a truncated Bernstein polynomial will converge to a Bayes estimate that minimizes the Kullback-Leibler distance between the truncated Bayesian estimate and the truth. Certainly larger values J > 15 can be chosen to achieve more flexible estimates of the baseline survival density; the supplemental material in Chen et al. (2014) can provide a guide in terms of L1. However, there is a “law of diminishing returns”, also observed by Hanson (2006), in that the log pseudo marginal likelihood (LPML) tends to level off and not increase after some K for J ≥ K. Restated, the cross-validated predictive ability of the model does not increase after some K. In this spirit, and similar to the use of AIC in choosing the number of mixands in finite mixture models, one could choose J = K based on when the LPML levels off. However, each computation of the LPML requires a separate MCMC run.

A remarkably useful result is that any Bernstein polynomial can be written in terms of Bernstein polynomials of higher degree through the relation

$$\beta(x \mid j, J - j) = \frac{J - j}{J}\, \beta(x \mid j, J - j + 1) + \frac{j}{J}\, \beta(x \mid j + 1, J - j).$$

It follows that b(x|J − 1, ξJ−1) can be written as b(x|J, ξJ) with a suitable choice of ξJ. Since every lower-order Bernstein polynomial with J < K is included as a special case of J = K, one need only pick one reasonable J = K; a prior on 1 ≤ J ≤ K is superfluous.

Regarding the prior for ξJ, we consider a Dirichlet distribution,

$$\xi_J \mid J \sim \text{Dirichlet}(\alpha, \dots, \alpha), \tag{7}$$

where α > 0 is a parameter. An attractive property of the BP prior specified above is that $E[b(x \mid J, \xi_J)] = \sum_{j=1}^{J} \beta(x \mid j, J - j + 1)/J = 1$ and E[B(x|J, ξJ)] = x for x ∈ (0, 1); i.e. the BP prior is centered at a uniform distribution over (0, 1).

We next describe how we define a random survival function S0 based on (5). Let {Sθ : θ ∈ Θ} denote a parametric family of survival functions with support on positive reals ℝ+. For example, a log-logistic family is defined as $S_\theta(t) = \{1 + (e^{\theta_1} t)^{\exp(\theta_2)}\}^{-1}$ in our R function, where θ = (θ1, θ2)′. Weibull and log-normal families are also implemented in the function. In our experience all three parametric distribution families yield similar results across many data sets. Note that Sθ(t) always lies in the interval (0, 1) for 0 < t < ∞, so a natural prior on S0, termed the transformed Bernstein polynomial (TBP) prior, is

$$S_0(t) = B(S_\theta(t) \mid J, \xi_J), \tag{8}$$

with density

$$f_0(t) = b(S_\theta(t) \mid J, \xi_J)\, f_\theta(t), \tag{9}$$

where fθ is the density associated with Sθ. Clearly, the random distribution S0 is centered at Sθ, that is, E[S0(t)] = Sθ(t) and E[f0(t)] = fθ(t). The weight parameters ξJ “adjust” the shape of the baseline survival S0 relative to the prior guess Sθ. If all the ξJj are equal to 1/J then S0 ≡ Sθ. This adaptability makes the TBP prior attractive in its flexibility, but also anchors the random S0 firmly about Sθ. Moreover, unlike a mixture of Polya trees prior, the TBP prior always selects smooth densities, leading to more efficient posterior sampling.
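To illustrate the centering property, the sketch below builds a TBP baseline survival function around a log-logistic family and confirms that equal weights recover the centering distribution. It is a simplified illustration under assumed parameter values, not the spBayesSurv implementation; `loglogistic_S`, `beta_cdf_int` and `tbp_S0` are hypothetical names.

```python
import math

def loglogistic_S(t, theta1, theta2):
    """Centering survival function S_theta(t) = 1 / (1 + (e^theta1 t)^exp(theta2))."""
    return 1.0 / (1.0 + (math.exp(theta1) * t) ** math.exp(theta2))

def beta_cdf_int(x, j, J):
    """I_x(j, J - j + 1) for integer j, written as a binomial tail probability."""
    return sum(math.comb(J, k) * x ** k * (1.0 - x) ** (J - k)
               for k in range(j, J + 1))

def tbp_S0(t, weights, theta1, theta2):
    """TBP baseline survival: S0(t) = B(S_theta(t) | J, xi_J)."""
    s = loglogistic_S(t, theta1, theta2)
    J = len(weights)
    return sum(w * beta_cdf_int(s, j, J) for j, w in enumerate(weights, start=1))

J = 15
equal = [1.0 / J] * J
# Unequal weights (illustrative) move S0 away from the centering family.
tilted = [2.0 * j / (J * (J + 1)) for j in range(1, J + 1)]

for t in (0.5, 1.0, 2.0):
    print(t, loglogistic_S(t, 0.2, 0.3), tbp_S0(t, equal, 0.2, 0.3),
          tbp_S0(t, tilted, 0.2, 0.3))
```

The second and third printed columns coincide, while the tilted weights produce a different but still valid survival curve.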

The TBP parameter α acts much like the precision in a Dirichlet process (Ferguson, 1973), controlling how stochastically “pliable” S0 is relative to Sθ. Large values of α indicate a strong belief that S0 can be modeled using Sθ, since as α tends to infinity, the random S0 is Sθ with probability one. On the other hand, smaller values of α allow more pronounced deviations of S0 from Sθ. The choice of α = 1 has been advocated by many authors, e.g. recently Chen et al. (2014). Similar to Dirichlet processes we consider a gamma prior on α, say, α ~ Γ(a0, b0), where a0 is the shape parameter and b0 is the rate parameter. Through L1 considerations, Chen et al. (2014) provide some guidance on choosing an informative prior for α, but this is not pursued here; in our experience different priors for α lead to very similar posterior inference in reasonably large samples.

3.2 Prior on regression coefficients

The g-prior (Zellner, 1983) has been widely considered for model selection in Bayesian regression models. Hanson et al. (2014) develop an informative g-prior for logistic regression; we consider their approach adapted for use in the semiparametric survival models considered here. The prior is

$$\beta \sim N_p\left(0,\; g\, n\, (X'X)^{-1}\right), \tag{10}$$

where n is the sample size and X is the usual n × p design matrix, only with mean-centered predictors, i.e. $1_n' X = 0_p'$. Derivations in Hanson et al. (2014) imply that for covariates x generated from some distribution H with support on 𝒳 ⊂ ℝp and β assigned in (10),

$$e^{(x - \mu)'\beta} \sim \log N(0,\; g p),$$

where μ = ∫𝒳 xH (dx). Thus, a priori, the relative risks (PH), acceleration factors (AFT), and odds factors (PO) of random individuals x relative to their sample mean approximately follow a log-normal distribution in reasonably large samples. A simple method for choosing g is to pick a number M such that any random quantity $e^{(x - \mu)'\beta}$ is less than M with probability r. A simple calculation reveals that

$$g = \left[\frac{\log M}{\Phi^{-1}(r)}\right]^{2} \frac{1}{p}.$$

For example, choosing M = 10 and r = 0.9 yields g = 3.228/p; these are the values considered here. Concisely,

$$\beta_h, \beta_o, \beta_q \overset{iid}{\sim} N_p(0, S_0), \quad \text{where } S_0 = \frac{3.228\, n}{p}\, (X'X)^{-1}.$$
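The arithmetic behind the value 3.228 can be reproduced directly. The sketch below assumes the log-normal approximation with variance gp described above, so that the constraint P(e^{(x−μ)′β} < M) = r gives log M = Φ⁻¹(r)√(gp); the function name `g_from_bound` is hypothetical.

```python
import math
from statistics import NormalDist

def g_from_bound(M, r, p):
    """Solve log(M) = Phi^{-1}(r) * sqrt(g * p) for g."""
    z = NormalDist().inv_cdf(r)  # standard normal quantile
    return (math.log(M) / z) ** 2 / p

# M = 10 and r = 0.9 with a single covariate give g * p of about 3.228.
print(round(g_from_bound(10, 0.9, 1), 3))
# With p = 2 covariates the same bound gives half that value.
print(round(g_from_bound(10, 0.9, 2), 3))
```

Larger M or smaller r loosens the prior, since both increase g.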

3.3 Likelihood construction and MCMC

Let ti be a random survival time for the ith individual and xi be a related p-dimensional vector of covariates, i = 1, …, n. Assume the survival time ti lies in the interval (ai, bi), 0 ≤ ai ≤ bi ≤ ∞. Here left-censored data are of the form (0, bi), right-censored (ai, ∞), interval-censored (ai, bi) and uncensored values simply have ai = bi, i.e., we define (x, x) = {x}.

Denote by 𝒟 = {(xi, ai, bi); i = 1, …, n} the set of observed data. Assume ti ~ Sxi (·), where Sxi (t) is given by (3) with the TBP prior on S0(t) and f0(t) defined in (8) and (9). Set β = (βh′, βo′, βq′)′. The likelihood for (ξJ, θ, β) is given by

$$\mathcal{L}(\xi_J, \theta, \beta) = \prod_{i=1}^{n} \left[S_{x_i}(a_i) - S_{x_i}(b_i)\right]^{I\{a_i < b_i\}} f_{x_i}(a_i)^{I\{a_i = b_i\}}. \tag{11}$$
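The structure of the likelihood (11) can be sketched generically: each observation contributes either a survival difference over its censoring interval or a density value when observed exactly. The example below assumes user-supplied functions for Sx(t) and fx(t) and uses a simple exponential PH model purely as a stand-in for the super model; `log_likelihood`, `ph_S` and `ph_f` are hypothetical names.

```python
import math

def log_likelihood(data, S, f):
    """Log of (11) for data given as (x, a, b) triples with 0 <= a <= b <= inf."""
    total = 0.0
    for x, a, b in data:
        if a == b:
            # Exactly observed time: density contribution.
            total += math.log(f(a, x))
        else:
            # Censored: S_x(a) - S_x(b), with S_x(inf) = 0 and S_x(0) = 1.
            s_b = 0.0 if math.isinf(b) else S(b, x)
            total += math.log(S(a, x) - s_b)
    return total

# Stand-in model (assumption): exponential PH with rate exp(0.5 * x).
def ph_S(t, x):
    return math.exp(-math.exp(0.5 * x) * t)

def ph_f(t, x):
    rate = math.exp(0.5 * x)
    return rate * math.exp(-rate * t)

data = [
    (0.0, 0.0, 1.0),        # left censored on (0, 1)
    (1.0, 2.0, math.inf),   # right censored at 2
    (0.0, 0.5, 1.5),        # interval censored on (0.5, 1.5)
    (1.0, 0.8, 0.8),        # observed exactly at 0.8
]
print(log_likelihood(data, ph_S, ph_f))
```

Because every censoring type is encoded through the pair (a, b), mixtures of censoring schemes need no special handling.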

Markov chain Monte Carlo (MCMC) is carried out through an empirical Bayes approach coupled with adaptive Metropolis-Hastings updating (Haario et al., 2001). The posterior density given the data 𝒟 is

$$p(\xi_J, \theta, \beta, \alpha \mid \mathcal{D}) \propto \mathcal{L}(\xi_J, \theta, \beta)\, p(\xi_J \mid \alpha)\, p(\alpha)\, p(\theta)\, p(\beta_q)\, p(\beta_o)\, p(\beta_h),$$

where p(ξJ|α) is the density of the Dirichlet distribution in (7) and the remaining terms are prior densities for α, θ, βh, βo, and βq. Here we assume α ~ Γ(a0, b0), θ ~ N2(θ0, V0) and $\beta_h, \beta_o, \beta_q \overset{iid}{\sim} N_p(0, S_0)$.

Note that when ξJj = 1/J the underlying parametric model with S0(t) = Sθ(t) is obtained and ℒ(ξJ, θ, β) is equal to the corresponding parametric likelihood function. Thus, a fit from a standard parametric survival model can provide starting values for the TBP survival model. Consider a standard fit $\log t_i = \tau_0 + \tau' x_i + \sigma \varepsilon_i$ using the survreg function in the survival package for R, where $\varepsilon_1, \dots, \varepsilon_n \overset{iid}{\sim} F(\varepsilon)$. For log-logistic data $F(\varepsilon) = e^{\varepsilon}/(1 + e^{\varepsilon})$ (standard logistic), for Weibull $F(\varepsilon) = 1 - \exp(-e^{\varepsilon})$ (extreme value), and for log-normal F(ε) is the standard normal cdf. This model has a scale σ, intercept τ0, and regression coefficients τ′ = (τ1, …, τp). We parametrize Sθ(t) so that θ1 = −τ0 and θ2 = −log σ. Let θ̂ and V̂ be the point and asymptotic variance estimates for θ via the survreg fit. To choose starting values for β, we fit both the Weibull and log-logistic survreg models. Noting that the Weibull model has both PH and AFT representations and the log-logistic model has both PO and AFT representations, the survreg fits with Weibull and log-logistic provide coefficient estimates under each of the PH, PO and AFT models, denoted by βh0, βo0 and βq0, respectively; let Sh0, So0 and Sq0 be their covariance estimates. If the Weibull model has smaller AIC, we set $\hat\beta = (\beta_{q0}', 0', \beta_{q0}')'$ and Ŝ = diag(Sq0, So0, Sq0); otherwise, we set $\hat\beta = (0', \beta_{o0}', 0')'$ and Ŝ = diag(Sh0, So0, Sq0).

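For ease of posterior sampling, we work with z = (z1, …, zJ−1)′ through the relation $\xi_{Jj} = e^{z_j} / \sum_{k=1}^{J} e^{z_k}$ for j = 1, …, J, where zJ = 0. Under the Dirichlet prior (7), the induced prior on z is:

$$p(z \mid \alpha) = \frac{\Gamma(\alpha J)}{\Gamma(\alpha)^{J}} \prod_{j=1}^{J} \left[\frac{e^{z_j}}{\sum_{k=1}^{J} e^{z_k}}\right]^{\alpha}.$$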
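The reparametrization and its induced log prior are simple to compute. The following sketch assumes the convention zJ = 0 stated above; the function names are hypothetical, and subtracting the maximum before exponentiating is a numerical-stability choice made here rather than something described in the paper.

```python
import math

def z_to_weights(z):
    """Map z = (z_1, ..., z_{J-1}) to the weights xi_J, fixing z_J = 0."""
    full = list(z) + [0.0]
    m = max(full)  # subtract the maximum for numerical stability
    expz = [math.exp(v - m) for v in full]
    s = sum(expz)
    return [e / s for e in expz]

def log_prior_z(z, alpha):
    """Log of the induced prior p(z | alpha) under the Dirichlet(alpha, ..., alpha) prior."""
    xi = z_to_weights(z)
    J = len(xi)
    return (math.lgamma(alpha * J) - J * math.lgamma(alpha)
            + alpha * sum(math.log(w) for w in xi))

# z = 0 corresponds to equal weights, i.e. the centering parametric model.
weights = z_to_weights([0.0] * 14)
print(weights[0], sum(weights))
print(log_prior_z([0.0, 0.0], 1.0))
```

Working on the unconstrained z scale lets a multivariate normal proposal be used without worrying about the simplex constraint on the weights.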

The vector z can be updated using adaptive Metropolis-Hastings. Suppose we are currently in iteration l and have sampled the states $z^{(0)}, z^{(1)}, \dots, z^{(l-1)}$. We select an index l0 (e.g., l0 = 5000) for the length of an initial period and define

$$\Sigma_l = \begin{cases} \Sigma_0, & l \le l_0, \\ \dfrac{(2.4)^2}{J - 1}\left(\mathcal{C}_l + 10^{-10} I_{J-1}\right), & l > l_0. \end{cases}$$

Here 𝒞l is the sample variance of $z^{(0)}, z^{(1)}, \dots, z^{(l-1)}$, and Σ0 is an initial diagonal covariance matrix of z, defined so that the variance of zj is 0.16. The choice of 0.16 is based on extensive simulation studies; other choices (as long as they are not too small or too large) have little impact on posterior inferences. We generate $z^* = (z_1^*, \dots, z_{J-1}^*)'$ from $N_{J-1}(z^{(l-1)}, \Sigma_l)$ and accept it with probability

$$\min\left\{1,\; \frac{\mathcal{L}(\xi_J^*, \theta, \beta) \prod_{j=1}^{J} (\xi_{Jj}^*)^{\alpha}}{\mathcal{L}(\xi_J^{(l-1)}, \theta, \beta) \prod_{j=1}^{J} (\xi_{Jj}^{(l-1)})^{\alpha}}\right\},$$

where $\xi_J^*$ and $\xi_J^{(l-1)}$ are defined corresponding to $z^*$ and $z^{(l-1)}$, respectively.

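The centering distribution parameters θ are updated via adaptive Metropolis-Hastings. At iteration l, each candidate is sampled as $\theta^* \sim N_2(\theta^{(l-1)}, \Sigma_l)$ and accepted with probability

$$\min\left\{1,\; \frac{\mathcal{L}(\xi_J, \theta^*, \beta)\, \phi_2(\theta^* \mid \theta_0, V_0)}{\mathcal{L}(\xi_J, \theta^{(l-1)}, \beta)\, \phi_2(\theta^{(l-1)} \mid \theta_0, V_0)}\right\},$$

where ϕ2(·|θ0, V0) denotes the density of N2(θ0, V0), and Σl is defined similarly as above, but with Σ0 set to be V̂, the asymptotic variance estimate of θ̂ from the survreg fit.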

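The survival model coefficients β ∈ {βo, βh, βq} are updated via adaptive Metropolis-Hastings as well, with proposal $\beta^* \sim N_p(\beta^{(l-1)}, \Sigma_l)$ and acceptance probability

$$\min\left\{1,\; \frac{\mathcal{L}(\xi_J, \theta, \beta^*)\, \phi_p(\beta^* \mid \beta_0, S_0)}{\mathcal{L}(\xi_J, \theta, \beta^{(l-1)})\, \phi_p(\beta^{(l-1)} \mid \beta_0, S_0)}\right\},$$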

where Σl is defined similarly as above with Σ0 = Ŝ.

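Finally, the precision parameter α is updated via adaptive Metropolis-Hastings with normal proposal $\alpha^* \sim N_1(\alpha^{(l-1)}, \Sigma_l)$ with Σl defined as above but taking Σ0 = 0.16, and the acceptance probability is

$$\min\left\{1,\; \frac{(\alpha^*)^{a_0 - 1} e^{-b_0 \alpha^*}\, \Gamma(\alpha^* J)\, \Gamma(\alpha^{(l-1)})^{J} \prod_{j=1}^{J} (\xi_{Jj})^{\alpha^* - 1}}{(\alpha^{(l-1)})^{a_0 - 1} e^{-b_0 \alpha^{(l-1)}}\, \Gamma(\alpha^*)^{J}\, \Gamma(\alpha^{(l-1)} J) \prod_{j=1}^{J} (\xi_{Jj})^{\alpha^{(l-1)} - 1}}\right\}.$$

Regarding default choices for hyperparameters, we set a0 = b0 = 1, θ0 = θ̂, and V0 = 10V̂, where θ̂ and V̂ are the point and asymptotic variance estimates of θ from the survreg fit. Note that here we assume a relatively informative prior on θ to avoid potential instability of the MCMC and to obviate confounding between Sθ and the Bernstein polynomial.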

3.4 Approximate Bayes factors for model selection

Once the model is fitted via MCMC, the triples $\{(\beta_q^{(m)}, \beta_o^{(m)}, \beta_h^{(m)})\}_{m=1}^{M}$ are obtained after burn-in and thinning. Let BFq, BFo, BFh, BFa, BFy, BFe be the Bayes factors for testing the AFT, PO, PH, AH, YP and EH assumptions relative to the full model, respectively. A large-sample approximation to the Savage-Dickey ratio based on approximate normality is proposed to compute these Bayes factors (Li et al., 2015, Zhou et al., 2017); see Appendix A of the online material for details.

The BF for PH relative to the super model is

$$BF_h \approx \frac{N_{2p}(0;\, m_h, V_h)}{N_p(0;\, 0, S_0)\, N_p(0;\, 0, 2S_0)},$$

where mh and Vh are the posterior sample mean and variance of $(\beta_q', (\beta_o - \beta_h)')'$, respectively. The BF for AFT relative to the super model is

$$BF_q \approx \frac{N_{2p}(0;\, m_q, V_q)}{N_p(0;\, 0, S_0)\, N_p(0;\, 0, 2S_0)},$$

where mq and Vq are the posterior sample mean and variance of $(\beta_o', (\beta_h - \beta_q)')'$, respectively. The BF for PO relative to the super model is

$$BF_o \approx \frac{N_{2p}(0;\, m_o, V_o)}{N_p(0;\, 0, S_0)\, N_p(0;\, 0, S_0)},$$

where mo and Vo are the posterior sample mean and variance of $(\beta_q', \beta_h')'$, respectively. The BF for AH relative to the super model is

$$BF_a \approx \frac{N_{2p}(0;\, m_a, V_a)}{N_p(0;\, 0, S_0)\, N_p(0;\, 0, 2S_0)},$$

where ma and Va are the posterior sample mean and variance of $(\beta_h', (\beta_q + \beta_o)')'$, respectively. The BF for YP relative to the super model is

$$BF_y \approx \frac{N_p(0;\, m_y, V_y)}{N_p(0;\, 0, S_0)},$$

where my and Vy are the posterior sample mean and variance of βq, respectively. The BF for EH relative to the super model is

$$BF_e \approx \frac{N_p(0;\, m_e, V_e)}{N_p(0;\, 0, 3S_0)},$$

where me and Ve are the posterior sample mean and variance of $\beta_h - \beta_q - \beta_o$, respectively.
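For a single covariate (p = 1), the normal approximation to the Savage-Dickey ratio reduces to a ratio of two univariate normal densities evaluated at zero. The sketch below covers only this scalar case, as it applies to the YP and EH tests; the function name, the toy draws and the prior variance are illustrative assumptions rather than output from an actual fit.

```python
import math
from statistics import NormalDist, mean, variance

def savage_dickey_bf(draws, prior_var):
    """Approximate BF = N(0; m, v) / N(0; 0, prior_var) from posterior draws.

    `draws` are posterior samples of the scalar constrained to be zero under
    the nested model, and `prior_var` is its prior variance.
    """
    m, v = mean(draws), variance(draws)
    posterior_at_zero = NormalDist(m, math.sqrt(v)).pdf(0.0)
    prior_at_zero = NormalDist(0.0, math.sqrt(prior_var)).pdf(0.0)
    return posterior_at_zero / prior_at_zero

s0 = 2.0  # hypothetical prior variance S0 for p = 1

# YP test: draws of beta_q, prior variance S0.
beta_q_draws = [0.05, -0.10, 0.02, 0.08, -0.04, 0.01]
print(savage_dickey_bf(beta_q_draws, s0))

# EH test: draws of beta_h - beta_q - beta_o, prior variance 3 * S0.
eh_draws = [1.9, 2.1, 2.0, 2.2, 1.8, 2.0]
print(savage_dickey_bf(eh_draws, 3.0 * s0))
```

A posterior concentrated near zero yields a Bayes factor above one in favor of the nested model, whereas a posterior concentrated away from zero drives the ratio toward zero.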

4 Illustrations

4.1 Simulated data

To show that the method picks the correct model most of the time, we generate 500 data sets for each of the sample sizes n = 200, 500, and 1000 from the super model under six scenarios: (1) βq = 0, βo = βh = 1, i.e. the PH, (2) βq = βh = 0, βo = 1, i.e. the PO, (3) βo = 0, βh = βq = 1, i.e. the AFT, (4) βh = 0, βo = −βq = 1, i.e. the AH, (5) βq = 0, βo = −βh = 1, i.e. the YP, and (6) βh = 1, βo = βq = (0.5, 0.5)′, i.e. the EH. In each case, we consider three baseline survival functions: lognormal S0(t) = 1 − Φ(log t), mixture of two lognormals S0(t) = 1 − [0.5Φ((log t + 1)/0.5) + 0.5Φ((log t − 1)/0.5)], and Weibull $S_0(t) = \exp\{-(0.5t)^{0.8}\}$. The covariate vector is chosen as xi = (xi1, xi2) with $x_{i1} \overset{iid}{\sim} \text{Bernoulli}(0.5)$ and $x_{i2} \overset{iid}{\sim} N(0, 1)$. Finally, a non-informative censoring scheme is used, where we apply right censoring to half of the sample data and interval censoring to the other half. Here the right censoring times are independently simulated from a Uniform(2, 6) distribution. For interval censoring, each subject is assumed to have N observation times, say, O1, O2, …, ON, where (N − 1) ~ Poisson(2) and $(O_k - O_{k-1}) \mid N \overset{iid}{\sim} \text{Exp}(1)$ with O0 = 0, k = 1, …, N. A censoring interval has endpoints which are the two adjacent observation times (possibly 0 or ∞) that include the true survival time. The final data yield around 20% right censored, 40% uncensored, 25% left censored and 15% interval censored under all settings. Models were fit with J = 15, a loglogistic TBP and the default priors introduced in Section 3. For each MCMC run, 5,000 scans were thinned from 50,000 after a burn-in period of 10,000 iterations. Table 1 reports the proportion (out of 500 replicated data sets) of times each model is picked. The model picked is the one with the largest value among BFh, BFo, BFq, BFa, BFy and BFe relative to the super model.
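The interval-censoring mechanism described above can be sketched as follows. This is an illustration of the observation scheme only, with hypothetical function names; it uses Knuth's multiplication method for the Poisson draw, which is a choice made here, and it does not reproduce the full simulation design.

```python
import math
import random

def poisson(rng, lam):
    """Draw from Poisson(lam) using Knuth's multiplication method."""
    limit = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

def censoring_interval(t, rng):
    """Return the interval (a, b] between adjacent observation times containing t."""
    n_obs = 1 + poisson(rng, 2.0)        # (N - 1) ~ Poisson(2)
    left, obs = 0.0, 0.0
    for _ in range(n_obs):
        obs += rng.expovariate(1.0)      # gaps O_k - O_{k-1} ~ Exp(1)
        if t <= obs:
            return (left, obs)           # left censored when left == 0
        left = obs
    return (left, math.inf)              # event after the last visit: right censored

rng = random.Random(2018)
true_times = [0.1, 0.5, 1.0, 3.0, 10.0]
intervals = [censoring_interval(t, rng) for t in true_times]
for t, (a, b) in zip(true_times, intervals):
    print(t, a, b)
```

Under this scheme a subject is left censored if the event precedes the first visit, right censored if it follows the last visit, and interval censored otherwise.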

Table 1.

Proportion of times Bayes factor selects each model when truth is known out of 500 replicated data sets.

                      Model picked
Baseline    n       AFT    PH     PO     AH     EH     YP

True AFT model
Lognormal   200     0.918  0.034  0.024  0.000  0.024  0.000
Lognormal   500     0.956  0.000  0.030  0.000  0.014  0.000
Lognormal   1000    0.964  0.000  0.030  0.000  0.004  0.000
Mixture     200     0.970  0.004  0.000  0.000  0.026  0.000
Mixture     500     0.966  0.000  0.000  0.000  0.034  0.000
Mixture     1000    0.972  0.000  0.000  0.000  0.028  0.000
Weibull     200     0.432  0.552  0.000  0.000  0.016  0.000
Weibull     500     0.356  0.618  0.000  0.000  0.024  0.000
Weibull     1000    0.310  0.640  0.000  0.000  0.005  0.000

True PH model
Lognormal   200     0.030  0.950  0.002  0.000  0.018  0.000
Lognormal   500     0.000  0.982  0.000  0.000  0.018  0.000
Lognormal   1000    0.000  0.980  0.000  0.000  0.020  0.000
Mixture     200     0.000  0.948  0.040  0.000  0.012  0.000
Mixture     500     0.000  0.986  0.012  0.000  0.002  0.000
Mixture     1000    0.000  0.992  0.002  0.000  0.002  0.004
Weibull     200     0.414  0.558  0.014  0.000  0.014  0.000
Weibull     500     0.396  0.524  0.000  0.000  0.080  0.000
Weibull     1000    0.324  0.526  0.000  0.000  0.150  0.000

True PO model
Lognormal   200     0.878  0.068  0.044  0.000  0.010  0.000
Lognormal   500     0.748  0.006  0.240  0.000  0.006  0.000
Lognormal   1000    0.418  0.000  0.578  0.000  0.004  0.000
Mixture     200     0.002  0.150  0.842  0.000  0.000  0.006
Mixture     500     0.000  0.012  0.980  0.000  0.000  0.008
Mixture     1000    0.000  0.000  0.998  0.000  0.000  0.002
Weibull     200     0.816  0.024  0.146  0.000  0.014  0.000
Weibull     500     0.000  0.012  0.980  0.000  0.000  0.006
Weibull     1000    0.062  0.002  0.930  0.000  0.000  0.006

True AH model
Lognormal   200     0.008  0.000  0.000  0.982  0.008  0.002
Lognormal   500     0.000  0.000  0.000  0.982  0.014  0.004
Lognormal   1000    0.000  0.000  0.000  0.974  0.020  0.006
Mixture     200     0.000  0.000  0.000  0.968  0.032  0.000
Mixture     500     0.000  0.000  0.000  0.946  0.054  0.000
Mixture     1000    0.000  0.000  0.000  0.860  0.140  0.000
Weibull     200     0.388  0.062  0.000  0.544  0.006  0.000
Weibull     500     0.546  0.176  0.004  0.272  0.002  0.000
Weibull     1000    0.500  0.376  0.008  0.114  0.002  0.000

True EH model
Lognormal   200     0.522  0.358  0.026  0.000  0.094  0.000
Lognormal   500     0.288  0.134  0.016  0.000  0.562  0.000
Lognormal   1000    0.040  0.004  0.008  0.000  0.940  0.006
Mixture     200     0.092  0.026  0.030  0.000  0.852  0.000
Mixture     500     0.000  0.000  0.000  0.000  1.000  0.000
Mixture     1000    0.000  0.000  0.000  0.000  1.000  0.000
Weibull     200     0.390  0.582  0.008  0.002  0.018  0.000
Weibull     500     0.338  0.624  0.002  0.000  0.036  0.000
Weibull     1000    0.356  0.550  0.000  0.000  0.092  0.002

True YP model
Lognormal   200     0.000  0.000  0.000  0.972  0.024  0.004
Lognormal   500     0.000  0.000  0.000  0.848  0.076  0.076
Lognormal   1000    0.000  0.000  0.000  0.534  0.182  0.284
Mixture     200     0.000  0.000  0.000  0.024  0.004  0.972
Mixture     500     0.000  0.000  0.000  0.000  0.002  0.998
Mixture     1000    0.000  0.000  0.000  0.000  0.000  1.000
Weibull     200     0.000  0.000  0.000  0.046  0.700  0.254
Weibull     500     0.000  0.000  0.000  0.000  0.280  0.720
Weibull     1000    0.000  0.000  0.000  0.000  0.052  0.948

When the baseline is the mixture of lognormal distributions, our method works very well even for the smallest sample size n = 200; for larger sample sizes n = 500 and n = 1000 the correct classification rates are all approaching one except for the AH model. When AH is the truth, the proportion picking AH decreases (from 97% to 86%) as n increases while the proportion choosing EH increases. To confirm this observation, we also tried the size of n = 2000, and the proportions of choosing AH and EH are 57% and 43%, respectively. In other words, as the sample size increases, our method tends to favor the more complex EH model against the special case of AH. Since EH includes AH as a special case the choice is not incorrect, but is more complex than necessary.

When the baseline is lognormal, our method also works well for most cases except when the true model is PO or YP. For instance, when PO is the truth with n = 1000 the method has a 58% chance of picking PO and a 42% chance of choosing AFT. However, picking AFT does not mean that a wrong model is picked if one notices that lognormal can be well approximated by loglogistic and loglogistic AFT is also a PO model. When lognormal YP is the truth with n = 1000, our method only has a 28% chance of picking YP, with the remaining 72% allocated to AH or EH. In this case, we also tried the size of n = 2000, resulting in AH, EH or YP being picked with proportions 12%, 35% and 53%, respectively. One reason for such a low correct classification rate is that the lognormal YP model considered here could be very close to an AH and/or an EH model; the baseline distribution plays a large role in how “close” competing semiparametric models actually are, and several models may predict equally well. In addition, when lognormal EH is the truth, we need a sample size of n = 1000 or larger to identify the correct model. Otherwise, our method tends to select the simpler models AFT or PH, both special cases of EH.

To see how our method performs when the baseline is Weibull, first note that the Weibull AFT, Weibull PH and Weibull AH are all equivalent models, and they are also special cases of EH. Keeping that in mind, we can see that our method has overall low misclassification proportions across most scenarios with the following several exceptions. First, when EH is the truth, both AFT and PH have high chance to be picked; this can be explained by the fact that simulation scenario (6) is not only an EH but also an AFT and a PH. Second, when PO or YP is the truth, we need sample size n = 1000 or larger to identify the correct model. When sample size is small like n = 200 under true PO (or YP), our method picks AFT (or EH) with 82% (or 70%) chance. This may be because the estimated baseline function hardly deviates from the TBP’s centering loglogistic distribution with the small sample size leading to the fitted model close to an AFT (or EH).

To study the impact of the informative g-prior, we also compared two cases M = 10 and M = 50 for part of the simulation scenarios in Appendix C.1 of the online material, and the two different values yielded almost identical results.

The proposed super model can also be used for survival function estimates when all six Bayes factors are less than 1, i.e., none of the six models fit the data better than the super model. We next demonstrate its finite sample performance. We generate 500 data sets of size n = 500 from the super model with βh = βo = βq = 1 which is none of the six models. All other simulation settings are the same as before. Table 2 shows the posterior inference results for the regression coefficients. We can see that all coefficient estimates are nearly unbiased with the coverage probabilities around the nominal level 95% when the true baseline is mixture of two lognormals. However, these encouraging results do not hold when the baseline survival function is lognormal or Weibull. This is not surprising, since the super model with Weibull baseline becomes non-identifiable if one notices that the AFT, PH and AH models with Weibull baselines are all equivalent with appropriate reparametrizations. The same argument also applies to the lognormal baseline, since lognormal can be well approximated with a scaled loglogistic and loglogistic AFT is equivalent to loglogistic PO. Figure 1 presents the average, across the 500 MC replicates, of fitted (posterior means over a grid of time points) survival functions when x = (0, 0)′ and x = (0, 1)′; the super model capably estimates complex (here bimodal) survival curves very accurately even for the lognormal and Weibull baselines. Therefore, the super model can still be used for survival/density estimates, even though interpretation of the three sets of regression coefficients is challenging.

Table 2.

Simulated data when the super model is the truth and sample size is n = 500. Averaged bias (BIAS) and posterior standard deviation (PSD) of each point estimate, standard deviation (across 500 MC replicates) of the point estimate (SD-Est), and coverage probability (CP) for the 95% credible interval.

Parameter BIAS PSD SD-Est CP
Lognormal baseline
βh,1 = 1 −0.021 0.882 0.397 0.996
βh,2 = 1 0.014 0.447 0.238 0.990
βo,1 = 1 0.079 1.504 0.624 0.990
βo,2 = 1 0.059 0.753 0.391 0.990
βq,1 = 1 −0.056 0.843 0.357 0.988
βq,2 = 1 −0.042 0.417 0.222 0.988

Mixture baseline
βh,1 = 1 0.062 0.305 0.259 0.972
βh,2 = 1 0.079 0.189 0.171 0.954
βo,1 = 1 −0.027 0.315 0.302 0.962
βo,2 = 1 −0.020 0.194 0.189 0.952
βq,1 = 1 0.003 0.109 0.099 0.974
βq,2 = 1 −0.003 0.066 0.065 0.948

Weibull baseline
βh,1 = 1 −0.093 0.780 0.465 0.990
βh,2 = 1 −0.098 0.444 0.275 0.988
βo,1 = 1 0.156 0.921 0.531 0.990
βo,2 = 1 0.175 0.491 0.304 0.980
βq,1 = 1 −0.160 0.943 0.541 0.990
βq,2 = 1 −0.193 0.504 0.322 0.976

Fig. 1.

Simulated data when the true model is none of the six models and the sample size is n = 500. Mean, across the 500 MC replicates, of the posterior mean of the survival functions when x = (0, 0)′ (upper lines) and x = (0, 1)′ (lower lines). The true curves are represented by continuous lines and the fitted curves are represented by dashed lines.

4.2 Veterans Administration Lung Cancer Trial

The data considered are from the well-known Veterans Administration lung cancer trial (Prentice, 1973), which has been incorporated into the MASS package in R. As in Cheng et al. (1995), Murphy et al. (1997), Yang and Prentice (1999), and Hanson (2006), we consider a subgroup of n = 97 patients with no prior therapy. The two covariates considered are performance status, a measure that is a multiple of 10 and ranges from 0 to 100, and tumor type, a factor with four levels (large=1, adeno=2, small=3, squamous=4). Six of the 97 survival times are censored. Cheng et al. (1995) used the transformation model; Murphy et al. (1997) and Yang and Prentice (1999) considered the PO model; Hanson (2006) considered the AFT, PH and PO models.

The proposed super model is fit with J = 15, a Weibull TBP, and the hyperparameter settings in Section 3.3; see Appendix B.1 of the online material for R commands. The Bayes factors for testing AFT, PH, PO, AH, EH and YP vs. the super model are 115, 27, 97, 0.25, 123, and 11, respectively.

The AFT, PH, PO, EH, and YP models fit better than the super model; the EH, AFT and PO models fit about the same and are about four times better than PH. The LPML for the super model compares favorably to those observed in Hanson and Yang (2007); most notably, the log-logistic regression model had the best LPML of about −509. Since the parametric log-logistic model has both PO and AFT properties, it makes sense that these semiparametric models are favored about equally. Other centering distributions gave roughly the same results: log-logistic gave −509.6 and lognormal gave −511.5.
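Because each Bayes factor above is computed against the same super model, the Bayes factor between any two submodels follows by division. A minimal Python sketch of that arithmetic, using the rounded values reported above (the dictionary and function names are illustrative and not part of spBayesSurv):

```python
# Bayes factors of each submodel versus the super model, as reported
# (rounded) for the Veterans Administration data.
bf_vs_super = {"AFT": 115, "PH": 27, "PO": 97, "AH": 0.25, "EH": 123, "YP": 11}


def pairwise_bf(model_a, model_b, table=bf_vs_super):
    """Bayes factor of model_a versus model_b via the common reference model."""
    return table[model_a] / table[model_b]


if __name__ == "__main__":
    # Compare the three best-supported models with PH.
    for name in ("EH", "AFT", "PO"):
        print(f"{name} vs PH: {pairwise_bf(name, 'PH'):.2f}")
    print(f"EH vs AFT: {pairwise_bf('EH', 'AFT'):.2f}")
```

With rounded inputs the ratios are only approximate, so small differences between models with similar support should not be over-interpreted.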

Since the EH model has the highest Bayes factor, the super model can be used as a model on its own for prediction. Figure 2 presents the predictive survival densities for squamous with score equal to 40, 60 and 80; the code is available in Appendix B.1 of the online material. These plots can be compared to Figure 1 in Hanson and Yang (2007), which shows much rougher densities. The Polya tree encourages spikiness in densities, whereas the transformed Bernstein polynomial allows multimodality but tends to smooth over spurious spikes.

Fig. 2.

Predictive densities for squamous, score=40, 60, 80.

Notice that the BF comparing EH to AFT is 123.0/115.1 ≈ 1.07. Thus the AFT model may be considered adequate and can be fitted parametrically via survreg or semiparametrically by the lss package in R. Other R packages for fitting semiparametric AFT models are reviewed in Zhou and Hanson (2015) including spBayesSurv.

4.3 Breast Cancer Study

Beadle et al. (1984) reported a retrospective study comparing the cosmetic effects of radiotherapy alone versus radiotherapy plus adjuvant chemotherapy on 94 women with early breast cancer. There are 46 patients in the radiation-only group and 48 patients in the radiation plus chemotherapy group. Patients were observed initially every 4–6 months, but, as their recovery progressed, the interval between visits lengthened. The event of interest was the time to first appearance of moderate or severe breast retraction. Of the women, 5.3% were left censored, 55.8% were interval censored and 38.9% were right censored. The dataset is available in the R package KMsurv.

The proposed super model is fit with J = 15, a Weibull TBP and the hyperparameter settings in Section 3.3; see Appendix B.2 of the online material for R commands. The Bayes factors for testing AFT, PH, PO, AH, EH and YP vs. the super model are 18, 32, 4, 24, 8 and 8, respectively. All models fit better than the super model; the PH and AH models fit about the same and are about seven times better than PO. In choosing between PH and AH, log-log survival plots can help. Figure 3(a) shows crossing lines based on Turnbull’s estimator (Turnbull, 1976), suggesting that the AH may be more appropriate for these data. In fact, the estimated survival curves in Figure 3(b) from the super model show crossing survival, which is disallowed under PH.
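Why crossing survival curves rule out PH but not AH can be checked numerically. The sketch below assumes the usual cumulative-hazard forms, H_x(t) = e^{βx} H_0(t) for PH and H_x(t) = e^{−βx} H_0(e^{βx} t) for AH, together with a hypothetical log-logistic baseline H_0(t) = log(1 + t²) and β = log 2. None of these values come from the breast cancer fit, and the sign convention for β may differ from the one used in the paper.

```python
import math


def baseline_cumhaz(t):
    """Hypothetical log-logistic baseline cumulative hazard H_0(t) = log(1 + t^2)."""
    return math.log1p(t * t)


def surv_ph(t, x, beta):
    """PH survival: H_x(t) = exp(beta * x) * H_0(t)."""
    return math.exp(-math.exp(beta * x) * baseline_cumhaz(t))


def surv_ah(t, x, beta):
    """AH survival: H_x(t) = exp(-beta * x) * H_0(exp(beta * x) * t)."""
    c = math.exp(beta * x)
    return math.exp(-baseline_cumhaz(c * t) / c)


def sign_changes(values):
    """Count sign changes in a sequence, ignoring exact zeros."""
    signs = [v > 0 for v in values if v != 0]
    return sum(a != b for a, b in zip(signs, signs[1:]))


beta = math.log(2.0)                      # assumed coefficient
grid = [0.05 * k for k in range(1, 101)]  # t in (0, 5]

# Difference S_1(t) - S_0(t) between the two covariate groups on the grid.
diff_ph = [surv_ph(t, 1, beta) - surv_ph(t, 0, beta) for t in grid]
diff_ah = [surv_ah(t, 1, beta) - surv_ah(t, 0, beta) for t in grid]

print("PH sign changes:", sign_changes(diff_ph))
print("AH sign changes:", sign_changes(diff_ah))
```

Under these assumptions the PH difference never changes sign, because S_1(t) = S_0(t)² lies below S_0(t) for every t > 0, whereas the AH curves cross once at t = √2, where log(1 + 4t²)/2 = log(1 + t²).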

Fig. 3.

Breast Cancer Data

5 Discussion

We proposed a new super model which includes the PH, PO, AFT, AH, EH and YP models as special cases. Bayes factors have been developed under the transformed Bernstein polynomial prior. Simulation studies demonstrate that the appropriate model can be selected based on this approach; the proposed model appears to work especially well for choosing among the most widely used PH, PO, and AFT models. The R package spBayesSurv implements the proposed method directly, as demonstrated via two real data analyses.

Note that the AFT, PH and AH models are equivalent under the Weibull distribution. The AFT and PO models are equivalent under the loglogistic distribution. The EH model includes PH and AH as special cases, and the YP model includes PH and PO as special cases. In practice, a small sample size can introduce substantial uncertainty. Looking closely at Table 1 for the sample size of n = 200, the proportion of times the “correct” model is picked (including all equivalent models) is nearly 95% or above when the true model is AFT, PH, PO or AH. When EH (or YP) is the truth with a small sample size, our method tends to select a simpler model (one of AFT, PH, PO or AH) that is closest to EH (or YP). Therefore, for small n, we recommend choosing a model only among AFT, PH, PO and AH; the EH or YP models may be poorly identified in such cases depending on the true baseline survival function. Additionally, in smaller sample sizes several models may fit similarly; in such cases a final model can be chosen based on the most suitable assumption for answering clinical questions of interest (e.g. proportional hazards), interpretability (e.g. hazard ratios) and simplicity.
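The equivalences stated at the start of this paragraph can be verified directly. The derivation below is a sketch that assumes the parameterizations H_x(t) = e^{x'β} H_0(t) for PH, H_x(t) = H_0(e^{x'β} t) for AFT and H_x(t) = e^{−x'β} H_0(e^{x'β} t) for AH, with scale λ and shape α as illustrative symbols; the sign conventions may differ from those used earlier in the paper.

```latex
\begin{align*}
\text{Weibull baseline: } H_0(t) &= (\lambda t)^{\alpha}, \\
\text{AFT: } H_x(t) &= H_0(e^{x'\beta} t) = e^{\alpha x'\beta} H_0(t), \\
\text{AH: } H_x(t) &= e^{-x'\beta} H_0(e^{x'\beta} t) = e^{(\alpha - 1) x'\beta} H_0(t), \\[4pt]
\text{Log-logistic baseline: } \frac{1 - S_0(t)}{S_0(t)} &= (\lambda t)^{\alpha}, \\
\text{AFT: } \frac{1 - S_x(t)}{S_x(t)} &= (\lambda e^{x'\beta} t)^{\alpha}
  = e^{\alpha x'\beta}\, \frac{1 - S_0(t)}{S_0(t)}.
\end{align*}
```

Both Weibull lines have the PH form, with coefficients αβ and (α − 1)β respectively, and the log-logistic AFT line has the PO form with coefficient αβ. In the exponential case α = 1 the AH covariate effect vanishes under this parameterization.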

When none of the six simpler models is picked, the proposed super model can be used for accurate survival estimates, although the regression coefficients do not have a useful interpretation. Other alternatives are to consider a general linear transformation model (Zeng and Lin, 2007) or a Bayesian nonparametric model, e.g. De Iorio et al. (2009); however, just as in the proposed super model, there is no easy interpretation of model coefficients. The latter model can be fit using the function anovaDDP in spBayesSurv.

The approach we have taken is to formally nest commonly used semiparametric models into a large, encompassing ‘super model.’ An alternative approach is parametric transformations. In terms of cumulative hazards Hx(·) and baseline cumulative hazard H0(·), semiparametric linear transformation models can be written as

$$H_x(t) = G\{e^{\beta' x} H_0(t)\}.$$

Zeng and Lin (2007) note that $G(x) = \frac{1}{\rho}[(1+x)^{\rho} - 1]$ gives PH when ρ = 1 or PO as ρ → 0+; also $G(x) = \frac{1}{\rho}\log(1+\rho x)$ gives PO when ρ = 1 or PH as ρ → 0+. The latter model is equivalent to the generalized odds rate model of Scharfstein et al. (1998). Yin and Ibrahim (2005) instead consider

$$\frac{1}{\rho}[h_x(t)^{\rho} - 1] = \frac{1}{\rho}[h_0(t)^{\rho} - 1] + x'\beta.$$

Here, ρ = 1 gives the additive hazards model whereas ρ → 0+ gives PH. It has been generally noted by these authors that estimation of ρ is problematic and inference proceeds typically by fitting several values of ρ and choosing the value closest to one or zero that maximizes a likelihood or posterior density. In all of these models one of two common models is obtained on the boundary of the parameter space, i.e. ρ → 0+, which presents unique challenges to model selection and estimation.
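The boundary behaviour of the two transformation families attributed to Zeng and Lin (2007) can be illustrated numerically. The following Python sketch evaluates G(x) = [(1 + x)^ρ − 1]/ρ and G(x) = log(1 + ρx)/ρ at ρ = 1 and for ρ approaching 0; the function names and the test value x = 1.5 are illustrative assumptions.

```python
import math


def g_boxcox(x, rho):
    """G(x) = [(1 + x)^rho - 1] / rho, with the rho -> 0+ limit log(1 + x)."""
    if rho == 0.0:
        return math.log1p(x)
    return ((1.0 + x) ** rho - 1.0) / rho


def g_log(x, rho):
    """G(x) = log(1 + rho * x) / rho, with the rho -> 0+ limit x."""
    if rho == 0.0:
        return x
    return math.log1p(rho * x) / rho


x = 1.5  # an arbitrary value standing in for exp(beta'x) * H_0(t)
for rho in (1.0, 0.1, 0.001, 0.0):
    print(rho, g_boxcox(x, rho), g_log(x, rho))
```

The identity G(x) = x corresponds to PH and G(x) = log(1 + x) to PO, so each family reaches one of the two models only in the limit ρ → 0+, which is the boundary issue described above.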

Supplementary Material

10985_2018_9429_MOESM1_ESM

Contributor Information

Jiajia Zhang, Department of Epidemiology and Biostatistics, University of South Carolina, Columbia, SC 29208, USA.

Timothy Hanson, Senior Principal Statistician, Medtronic Inc., Minneapolis, Minnesota, USA.

Haiming Zhou, Division of Statistics, Northern Illinois University, DeKalb, IL 60115, USA.

References

  1. Beadle GF, Come S, Henderson IC, Silver B, Hellman S, Harris JR. The effect of adjuvant chemotherapy on the cosmetic results after primary radiation treatment for early stage breast cancer. International Journal of Radiation Oncology Biology Physics. 1984;10(11):2131–2137. doi: 10.1016/0360-3016(84)90213-x.
  2. Chen Y, Hanson T, Zhang J. Accelerated hazards model based on parametric families generalized with Bernstein polynomials. Biometrics. 2014;70(1):192–201. doi: 10.1111/biom.12104.
  3. Chen YQ, Jewell NP. On a general class of semiparametric hazards regression models. Biometrika. 2001;88(3):687–702.
  4. Chen YQ, Wang M-C. Analysis of accelerated hazards models. Journal of the American Statistical Association. 2000;95(450):608–618.
  5. Cheng S, Wei L, Ying Z. Analysis of transformation models with censored data. Biometrika. 1995;82(4):835–845.
  6. Cox DR. Breakthroughs in Statistics. Springer; 1992. Regression models and life-tables; pp. 527–541.
  7. De Iorio M, Johnson WO, Müller P, Rosner GL. Bayesian nonparametric nonproportional hazards survival modeling. Biometrics. 2009;65(3):762–771. doi: 10.1111/j.1541-0420.2008.01166.x.
  8. Devarajan K, Ebrahimi N. A semi-parametric generalization of the Cox proportional hazards regression model: Inference and applications. Computational Statistics & Data Analysis. 2011;55(1):667–676. doi: 10.1016/j.csda.2010.06.010.
  9. Diao G, Zeng D, Yang S. Efficient semiparametric estimation of short-term and long-term hazard ratios with right-censored data. Biometrics. 2013;69(4):840–849. doi: 10.1111/biom.12097.
  10. Etezadi-Amoli J, Ciampi A. Extended hazard regression for censored survival data with covariates: A spline approximation for the baseline hazard function. Biometrics. 1987;43(2):181–192.
  11. Ferguson TS. A Bayesian analysis of some nonparametric problems. The Annals of Statistics. 1973;1(2):209–230.
  12. Ghosal S. Convergence rates for density estimation with Bernstein polynomials. Annals of Statistics. 2001;29(5):1264–1280.
  13. Green PJ. Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika. 1995;82(4):711–732.
  14. Haario H, Saksman E, Tamminen J. An adaptive Metropolis algorithm. Bernoulli. 2001;7(2):223–242.
  15. Hanson T, Yang M. Bayesian semiparametric proportional odds models. Biometrics. 2007;63(1):88–95. doi: 10.1111/j.1541-0420.2006.00671.x.
  16. Hanson TE. Inference for mixtures of finite Polya tree models. Journal of the American Statistical Association. 2006;101(476):1548–1565.
  17. Hanson TE, Branscum AJ, Johnson WO, et al. Informative g-priors for logistic regression. Bayesian Analysis. 2014;9(3):597–612.
  18. Kalbfleisch JD, Prentice RL. The Statistical Analysis of Failure Time Data. John Wiley & Sons; 2011.
  19. Li L, Hanson T, Zhang J. Spatial extended hazard model with application to prostate cancer survival. Biometrics. 2015;71(2):313–322. doi: 10.1111/biom.12268.
  20. Murphy S, Rossini A, van der Vaart AW. Maximum likelihood estimation in the proportional odds model. Journal of the American Statistical Association. 1997;92(439):968–976.
  21. Petrone S, Wasserman L. Consistency of Bernstein polynomial posteriors. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 2002;64(1):79–100.
  22. Prentice RL. Exponential survivals with censoring and explanatory variables. Biometrika. 1973;60(2):279–288.
  23. Quantin C, Moreau T, Asselain B, Maccario J, Lellouch J. A regression survival model for testing the proportional hazards hypothesis. Biometrics. 1996;52(3):874–885.
  24. Scharfstein DO, Tsiatis AA, Gilbert PB. Semiparametric efficient estimation in the generalized odds-rate class of regression models for right-censored time-to-event data. Lifetime Data Analysis. 1998;4(4):355–391. doi: 10.1023/a:1009634103154.
  25. Sun J. The Statistical Analysis of Interval-Censored Failure Time Data. Springer; 2006.
  26. Turnbull BW. The empirical distribution function with arbitrarily grouped, censored and truncated data. Journal of the Royal Statistical Society. Series B (Methodological) 1976;38(3):290–295.
  27. Verdinelli I, Wasserman L. Computing Bayes factors using a generalization of the Savage-Dickey density ratio. Journal of the American Statistical Association. 1995;90(430):614–618.
  28. Yang S, Prentice R. Semiparametric analysis of short-term and long-term hazard ratios with two-sample survival data. Biometrika. 2005;92(1):1–17.
  29. Yang S, Prentice RL. Semiparametric inference in the proportional odds regression model. Journal of the American Statistical Association. 1999;94(445):125–136.
  30. Yin G, Ibrahim JG. Bayesian frailty models based on Box-Cox transformed hazards. Statistica Sinica. 2005;15(3):781–794.
  31. Zellner A. Applications of Bayesian analysis in econometrics. Journal of the Royal Statistical Society. Series D (The Statistician) 1983;32:23–34.
  32. Zeng D, Lin D. Semiparametric transformation models with random effects for recurrent events. Journal of the American Statistical Association. 2007;102(477):167–180.
  33. Zhang J, Peng Y. Crossing hazard functions in common survival models. Statistics & Probability Letters. 2009;79(20):2124–2130. doi: 10.1016/j.spl.2009.07.002.
  34. Zhou H, Hanson T. Non-parametric Bayesian Inference in Biostatistics. Springer; 2015. Bayesian spatial survival models; pp. 215–246.
  35. Zhou H, Hanson T, Zhang J. Generalized accelerated failure time spatial frailty model for arbitrarily censored data. Lifetime Data Analysis. 2017;23(3):495–515. doi: 10.1007/s10985-016-9361-4.
