
Variational Learning of Individual Survival Distributions

Zidi Xiu 1, Chenyang Tao 2, Ricardo Henao 3

Abstract

The abundance of modern health data provides many opportunities for the use of machine learning techniques to build better statistical models to improve clinical decision making. Predicting time-to-event distributions, also known as survival analysis, plays a key role in many clinical applications. We introduce a variational time-to-event prediction model, named Variational Survival Inference (VSI), which builds upon recent advances in distribution learning techniques and deep neural networks. VSI addresses the challenges of non-parametric distribution estimation by (i) relaxing the restrictive modeling assumptions made in classical models, and (ii) efficiently handling the censored observations, i.e., events that occur outside the observation window, all within the variational framework. To validate the effectiveness of our approach, an extensive set of experiments on both synthetic and real-world datasets is carried out, showing improved performance relative to competing solutions.

Keywords: Variational Inference, Survival Analysis, Neural Networks, Individual Personal Distribution, Time-to-event modeling, Black-box inference, Latent Variable Models

1. INTRODUCTION

Prediction of event times, also known as survival analysis in the clinical context, is one of the most extensively studied topics in the statistical literature, largely due to its significance in a wide range of clinical and population health applications. It provides a fundamental set of tools to statistically analyze the future behavior of a system, or an individual. In the classical setup, the primary goal of time-to-event modeling is to either characterize the distribution of the occurrence of an event of interest on a population level [20, 21], or more specifically, to estimate a risk score on a subject level [11]. In recent years, there has been a surge of interest in the prediction of individualized event time distributions [46].

A characteristic feature in the study of time-to-event distributions is the presence of censored instances, which refer to events that are not reported during the follow-up period of a subject. This can happen, for instance, when a subject drops out during the study (right censoring) or when the study terminates before the event happens (administrative censoring). Unlike many conventional predictive models, where incomplete observations can usually be safely ignored, censored observations contain crucial information that should be adequately considered. To efficiently leverage the censored observations, together with the complete observations, a classical treatment is to work with the notion of a hazard function, formally defined as the instantaneous event risk at time t, which can be computed by contrasting the event population with the population at risk at a specific time. Estimates can be derived, for instance, by optimizing the partial likelihood defined by the relative hazards, as in the Cox Proportional Hazards model (CoxPH) [11]. Alternatively, other work follows the standard Maximum Likelihood Estimation (MLE) framework, where the individual event distribution is a deformed version of some baseline distribution. For example, in the Accelerated Failure Time model (AFT) [20], covariate effects are assumed to rescale the temporal index of event-time distributions, i.e., they either accelerate or delay event progression. For censored events, the likelihood is given by the cumulative density beyond the censoring time [1].

While vastly popular among practitioners, these models have been criticized for a number of reasons, in particular for the assumptions they make, which consequently render them unfit for many modern applications [45]. For instance, most survival models, including CoxPH and the proportional odds model [31], work under the premise of fixed covariate effects, overlooking individual uncertainty. However, it has been widely recognized that individual heterogeneity and other sources of variation are common and often time-dependent [2]. In real-world scenarios, these random factors are typically costly to measure, if not impossible to observe. Unfortunately, many models are known to be sensitive to violations of this fixed-effect assumption, raising serious concerns when deployed in actual practice [18].

Alternatively, machine learning techniques have been leveraged to overcome the limitations of standard statistical survival modeling schemes, especially in terms of the model flexibility needed to address the complexity of the data. For example, survival trees employ special node-splitting strategies to stratify the population and derive covariate-based survival curves [6]; support vector machines [24] and neural networks [13] have been used to build more expressive predictors; and LASSO-type variants [47] simultaneously perform variable selection to boost statistical efficiency. Bayesian statistics has also been explored in the context of model selection [28], averaging [33] and imposing prior beliefs [14]. Recent advances in modern machine learning bring extra traction to the concept of data-driven survival models, an important step toward precision medicine. Prominent examples include direct deep learning extensions of CoxPH [22, 27], accelerated failure time [8] and Bayesian exponential family models [37]. Other efforts include the use of Gaussian Processes to capture complex interactions between covariates in relation to event times [15] and competing risks [3]. It has been argued that direct modeling of the event distribution might be beneficial [46], and more recently, adversarial distribution matching has also been considered for survival applications [8] with promising results reported.

In this work we present a principled approach to address the challenges of nonparametric modeling of time-to-event distributions in the presence of censored instances. Our approach, named Variational Survival Inference (VSI), builds upon recent developments in black-box variational inference [36]. It directly targets the estimation of individualized event-time distributions, rather than a risk score that correlates with event ordering. By explicitly accounting for latent variables in its formulation, VSI better accommodates individual uncertainty. The proposed VSI is a highly scalable and flexible framework without strong assumptions, featuring easy implementation and stable learning; importantly, it does not rely on ad-hoc regularizers. Our key contributions include: (i) a variational formulation of nonparametric time-to-event distribution modeling conditioned on explanatory variables; (ii) a cost-effective treatment of censored observations; (iii) a thorough discussion of how our modeling choices impact VSI performance; and (iv) an empirical validation confirming that the proposed VSI compares favorably to its counterparts on an extensive set of tasks, covering representative synthetic and real-world datasets.

2. BACKGROUND

A dataset for survival analysis is typically composed of a collection of $N$ triplets $\mathcal{D} = \{Y_i = (t_i, \delta_i, X_i)\}_{i=1}^N$, where $i$ indexes the subjects involved in the study. For each triplet, $X_i \in \mathbb{R}^p$ denotes the set of explanatory variables, $t_i$ is the observation time and $\delta_i$ is the event indicator. To simplify our discussion, we only consider the standard survival setup. This means $\delta_i$ is binary, with $\delta_i = 1$ indicating the event of interest happened at $t_i$; otherwise $\delta_i = 0$ corresponds to a censoring event, i.e., no event occurs until $t_i$ and the subject is unobserved thereafter. This distinction creates a natural partition of the dataset $\mathcal{D} = \mathcal{D}_c \cup \mathcal{D}_e$, with $\mathcal{D}_c = \{Y_i : \delta_i = 0\}$ and $\mathcal{D}_e = \{Y_i : \delta_i = 1\}$ representing the censored and event groups, respectively.

2.1. Statistical survival analysis

In survival analysis, one is interested in characterizing the survival function $S(t)$, defined as the probability that any given subject survives until time $t$. The basic descriptors involved in the discussion of survival analysis are: the cumulative survival density $F(t) = 1 - S(t)$, the survival density $f(t) = \partial_t F(t)$, the hazard function $h(t) = \lim_{\Delta t \to 0} P(t \le T < t + \Delta t \mid T \ge t)/\Delta t$, and the cumulative hazard function $H(t) = \int_0^t h(s)\,ds$. The following expressions are fundamental to survival analysis [1]: $S(t) = \exp(-H(t))$ and $f(t) = h(t)S(t)$. Further, we use $S(t|x)$, $f(t|x)$, $F(t|x)$, $h(t|x)$, $H(t|x)$ to denote their individualized (subject-level) counterparts given explanatory variables $x$. All survival models leverage these definitions to derive population-level estimators or subject-level predictive functions, e.g., of risk, $S(t|x)$, or event time, $f(t|x)$.
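As a quick numerical sanity check of these identities (a sketch of our own, not part of the paper), the following Python snippet verifies $S(t) = \exp(-H(t))$ and $f(t) = h(t)S(t)$ for an arbitrary Weibull hazard:

```python
import numpy as np

# Illustrative Weibull hazard h(t) = (k/lam) * (t/lam)^(k-1); parameters are arbitrary.
k, lam = 1.5, 2.0
t = np.linspace(1e-6, 10.0, 20001)
h = (k / lam) * (t / lam) ** (k - 1)

# Cumulative hazard H(t) by trapezoidal integration, then the two identities.
H = np.concatenate([[0.0], np.cumsum(0.5 * (h[1:] + h[:-1]) * np.diff(t))])
S = np.exp(-H)          # S(t) = exp(-H(t))
f = h * S               # f(t) = h(t) S(t)

# Compare against the closed-form Weibull survival and density.
S_true = np.exp(-(t / lam) ** k)
f_true = h * S_true
print(np.max(np.abs(S - S_true)), np.max(np.abs(f - f_true)))  # both should be near zero
```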

2.2. Variational inference

For a latent variable model, $p_\theta(x, z) = p_\theta(x|z)p(z)$, we consider $x \in \mathbb{R}^p$ as an observation, i.e., data, and $z \in \mathbb{R}^m$ as a latent variable. The marginal likelihood, given by $p_\theta(x) = \int p_\theta(x, z)\,dz$, typically does not enjoy a closed-form expression. To avoid direct numerical estimation of $p_\theta(x)$, Variational Inference (VI) optimizes a variational bound to the marginal log-likelihood. The most popular choice is known as the Evidence Lower Bound (ELBO) [44], given by

$$\mathrm{ELBO}(x) \triangleq \mathbb{E}_{Z \sim q_\phi(z|x)}\left[\log \frac{p_\theta(x, Z)}{q_\phi(Z|x)}\right] \le \log p_\theta(x), \tag{1}$$

where $q_\phi(z|x)$ is an approximation to the true (unknown) posterior $p_\theta(z|x)$, and the inequality is a direct result of Jensen's inequality. The variational gap between the ELBO and the true log-likelihood is the KL-divergence between posteriors, i.e., $\mathrm{KL}(q_\phi(z|x)\,\|\,p_\theta(z|x)) = \mathbb{E}_{q_\phi(z|x)}[\log q_\phi(z|x) - \log p_\theta(z|x)]$, which implies the ELBO tightens as $q_\phi(z|x)$ approaches the true posterior $p_\theta(z|x)$. For estimation, we seek parameters $\theta$ and $\phi$ that maximize the ELBO. At test time, $q_\phi(z|x)$ is used for subsequent inference tasks on new data. Given a set of observations $\{x_i\}_{i=1}^N$ sampled from the data distribution $x \sim p_d$, maximizing the expected ELBO is also equivalent to minimizing the KL-divergence $\mathrm{KL}(p_d\,\|\,p_\theta)$ between the empirical and model distributions. When $p_\theta(x|z)$ and $q_\phi(z|x)$ are specified as neural networks, the resulting architecture is more commonly known as the Variational Auto-Encoder (VAE) [25] in the context of computer vision and natural language processing.
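To make the bound concrete, here is a minimal NumPy sketch of our own (not from the paper) that estimates the ELBO by Monte Carlo for a toy Gaussian latent-variable model in which $\log p_\theta(x)$ is available in closed form, so the inequality in (1) can be checked directly; the variational approximation is deliberately crude:

```python
import numpy as np
rng = np.random.default_rng(0)

def log_normal(x, mu, var):
    return -0.5 * (np.log(2 * np.pi * var) + (x - mu) ** 2 / var)

# Toy model: p(z) = N(0,1), p(x|z) = N(z,1), hence p(x) = N(0, 2) exactly.
x = 1.3
log_px = log_normal(x, 0.0, 2.0)

# A deliberately mismatched variational approximation q(z|x) = N(mu_q, var_q).
mu_q, var_q = 0.9 * x, 0.8
z = rng.normal(mu_q, np.sqrt(var_q), size=100_000)

# Monte Carlo estimate of ELBO(x) = E_q[ log p(x,z) - log q(z|x) ].
elbo = np.mean(log_normal(x, z, 1.0) + log_normal(z, 0.0, 1.0) - log_normal(z, mu_q, var_q))
print(f"log p(x) = {log_px:.4f}, ELBO estimate = {elbo:.4f}  (ELBO <= log p(x))")
```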

3. VARIATIONAL SURVIVAL INFERENCE

Below we detail the construction of the Variational Survival Inference (VSI) model, which yields predictions of the time-to-event distribution $p_\theta(t|x)$ given attributes $x$, with individual uncertainty accounted for in the form of a latent variable $z$ whose distribution is estimated under the VI framework. Unlike classical survival models, we do not need to specify a parametric form for the baseline distribution, e.g., the base hazard $h_0(t)$ in CoxPH [11] or the base density $p_0(t)$ in AFT [20]. Instead, we leverage the power of deep neural networks to amortize the learning of the event time and survival distributions, allowing arbitrary (high-order) interactions between the predictors and survival time to be captured. This overcomes the limitations caused by the restrictive assumptions made in classical statistical survival analysis frameworks, thus allowing flexible inference of time-to-event distributions.

3.1. Variational bound of observed events

We start the discussion with the simplest scenario, that for which there are no censoring events. Our goal is to maximize the expected log-likelihood $\frac{1}{N}\sum_i \log p_\theta(t_i|X_i)$. To model the conditional likelihood, we consider a latent variable model of the form $p_\theta(t, z|x)$. The unconditional formulation of the ELBO in (1) can be readily generalized to the case conditioned on event times as

$$\mathrm{ELBO}(t|x) = \mathbb{E}_{Z \sim q_\phi(z|x,t)}\left[\log \frac{p_\theta(t, Z|x)}{q_\phi(Z|x,t)}\right], \tag{2}$$

where qϕ (z|x, t) denotes the conditional posterior approximation to the true (unknown) pθ (z|x, t).

In particular, we assume a model distribution with the following decomposition

$$p_\theta(t, z|x) = p_\theta(t|z, x)\,p_\theta(z|x) = p_\theta(t|z)\,p_\theta(z|x), \tag{3}$$

which posits that $z$ is a sufficient statistic of $x$ w.r.t. the survival time $t$. Another key assumption we make is that, unlike in standard variational inference models, we use a learnable inhomogeneous prior $p_\theta(z|x)$ for the latent $z$ in place of the standard fixed homogeneous prior $p(z)$. Such a covariate-dependent prior formulation allows the model to account for individual variation, thus further helping to close the variational gap [43]. Replacing (3) into the ELBO expression in (2) results in the usual likelihood and KL decomposition pair

$$\mathrm{ELBO}(t|x) = \mathbb{E}_{Z \sim q_\phi(z|x,t)}\left[\log p_\theta(t|Z)\right] - \mathrm{KL}\left(q_\phi(z|x,t)\,\|\,p_\theta(z|x)\right), \tag{4}$$

from which we can see that maximizing the ELBO is equivalent to estimating the parameters of a probabilistic time-to-event model $p_\theta(t|z)p_\theta(z|x)$ by maximum likelihood, such that the inhomogeneous prior $p_\theta(z|x)$ matches as closely as possible a conditional posterior that explicitly accounts for event times, $q_\phi(z|x, t)$. At test time, only $p_\theta(z|x)$ is used to make predictions, since $t$ is not available during inference.

More specifically, pθ (t|z), pθ (z|x) and qϕ (z|x, t) are defined as neural networks

$$p_\theta(t|z) = \mathrm{softmax}(g(z;\theta)), \quad p_\theta(z|x) = \mathcal{N}\left(\mu_p(x;\theta), \Sigma_p(x;\theta)\right), \quad q_\phi(z|x,t) = \mathcal{N}\left(\mu_q(x,t;\phi), \Sigma_q(x,t;\phi)\right), \tag{5}$$

where $p_\theta(t|z)$ is represented on a discretized time line (see below for details); $g(z; \theta)$, $\mu_p(x; \theta)$, $\Sigma_p(x; \theta)$ and $\mu_q(x, t; \phi)$, $\Sigma_q(x, t; \phi)$ are deep neural nets parameterized by model parameters $\theta$ and variational parameters $\phi$; and $\mathcal{N}(\mu, \Sigma)$ denotes the multivariate Gaussian with mean $\mu$ and (diagonal) covariance $\Sigma$. For standard tabular data, we use Multi-Layer Perceptrons (MLPs) to specify these functions.
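For concreteness, the following is a minimal NumPy sketch of the three components in (5). The single-hidden-layer MLPs, layer sizes and random initialization are illustrative assumptions of ours, not the architecture used in the paper:

```python
import numpy as np
rng = np.random.default_rng(0)

p_x, m, M, hidden = 10, 8, 20, 32   # covariate dim, latent dim, time bins, hidden units (illustrative)

def mlp_params(d_in, d_out):
    return {"W1": rng.normal(0, 0.1, (d_in, hidden)), "b1": np.zeros(hidden),
            "W2": rng.normal(0, 0.1, (hidden, d_out)), "b2": np.zeros(d_out)}

def mlp(params, inp):
    h = np.tanh(inp @ params["W1"] + params["b1"])
    return h @ params["W2"] + params["b2"]

def softmax(a):
    e = np.exp(a - a.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def gaussian_params(raw):
    mu, logvar = np.split(raw, 2, axis=-1)
    return mu, np.exp(logvar)

prior = mlp_params(p_x, 2 * m)          # p_theta(z|x): covariate-dependent Gaussian prior
encoder = mlp_params(p_x + M, 2 * m)    # q_phi(z|x,t): conditioned on covariates and one-hot time bin
decoder = mlp_params(m, M)              # p_theta(t|z): categorical over M discretized time bins

x = rng.normal(size=(5, p_x))                       # a mini-batch of covariates
t_onehot = np.eye(M)[rng.integers(0, M, size=5)]    # observed time bins (one-hot)

mu_p, var_p = gaussian_params(mlp(prior, x))
mu_q, var_q = gaussian_params(mlp(encoder, np.concatenate([x, t_onehot], axis=1)))
z = mu_q + np.sqrt(var_q) * rng.normal(size=mu_q.shape)   # reparameterized posterior sample
p_t_given_z = softmax(mlp(decoder, z))                    # p_theta(t|z), shape (5, M)
print(p_t_given_z.shape, p_t_given_z.sum(axis=1))         # rows sum to 1
```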

3.2. Variational bound of censored events

Addressing censoring in the formulation is more challenging, as this type of partial observation is not subsumed in the conventional VI framework. To address this difficulty, we recall that in likelihood-based survival analysis, the log-likelihood of a censored observation is given by $\log S_\theta(t|x)$, where $S_\theta(t|x)$ is the survival function and $t$ is the censoring time. For censored observations $Y$ with $\delta = 0$, we do not have the exact event time; we only have the partial information that the event happens after the censoring time $t$.

To derive a tractable objective for censored observations, we first expand $\mathcal{L}_c(x, t) = \log S_\theta(t|x)$ based on its definition and an application of Fubini's theorem [38] and Jensen's inequality, i.e.,

$$\mathcal{L}_c(x, t) = \log S_\theta(t|x) = \log \int_t^\infty p_\theta(t'|x)\,dt' \ge \mathbb{E}_{q_\phi(z|t,x)}\left[\log \frac{p_\theta(z|x)}{q_\phi(z|t,x)} + \log S_\theta(t|z)\right] = \mathbb{E}_{q_\phi(z|t,x)}\left[\log S_\theta(t|z)\right] - \mathrm{KL}\left(q_\phi(z|t,x)\,\|\,p_\theta(z|x)\right) \triangleq \mathrm{ELBO}_c(t|x),$$

where the censored log-likelihood bound ELBOc (t|x) is only evaluated on Dc, i.e., the subset of censored observations. See Supplementary Materials for the full derivation of ELBOc (t|x).

3.3. Implementing VSI

In the current instantiation of the model, we discretize time into M bins spanning the time horizon of the (training) data. This means that (at inference) t is only known up to the time bin it falls into. We note this is not a restrictive assumption, as much survival data is only recorded up to a certain temporal resolution. That said, generalization to continuous observations is fairly straightforward. For datasets that do have a natural discretization, we leave the choice to the user. In this study, we partition the temporal index based on the percentiles of the observed event times, while also allowing for an artificial (M + 1)-th bin to account for event times beyond the full observation window, i.e., events happening after the end-of-study as observed in the training cohort.
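A possible implementation of the percentile-based discretization is sketched below; the bin count and helper names are our own choices, not those of the released code:

```python
import numpy as np

def make_time_bins(event_times, M=20):
    """Percentile-based edges: M bins over the observed horizon plus an overflow bin."""
    edges = np.quantile(event_times, np.linspace(0.0, 1.0, M + 1))
    return np.append(edges, np.inf)   # the artificial (M+1)-th bin catches times past the horizon

def to_bin(t, edges):
    """Map a time (event or censoring) to the index of the bin it falls into."""
    return np.clip(np.searchsorted(edges, t, side="right") - 1, 0, len(edges) - 2)

# Example with synthetic training event times.
rng = np.random.default_rng(0)
train_event_times = rng.exponential(scale=5.0, size=1000)
edges = make_time_bins(train_event_times, M=20)
print(to_bin(np.array([0.1, 5.0, 1e6]), edges))   # the last value lands in the overflow bin
```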

Since both pθ (z|x) and qϕ (z|x, t) are assumed to be Gaussian, the following closed-form expression can be used in the computation of the KL terms above

$$\mathrm{KL}\left(q_\phi(z|x,t)\,\|\,p_\theta(z|x)\right) = \frac{1}{2}\left[\mathrm{tr}\left(\Sigma_p^{-1}\Sigma_q\right) + (\mu_p - \mu_q)^T \Sigma_p^{-1}(\mu_p - \mu_q) - m + \log\frac{\det \Sigma_p}{\det \Sigma_q}\right]. \tag{6}$$

Following Ranganath et al. [36], we use diagonal covariance matrices and apply the reparameterization trick to facilitate stable differentiable learning.
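The snippet below, a sketch with our own parameter choices, verifies the closed-form KL in (6), specialized to diagonal covariances, against a Monte Carlo estimate built from reparameterized samples $z = \mu_q + \Sigma_q^{1/2}\epsilon$:

```python
import numpy as np
rng = np.random.default_rng(0)

m = 8
mu_q, logvar_q = rng.normal(size=m), rng.normal(scale=0.3, size=m)
mu_p, logvar_p = rng.normal(size=m), rng.normal(scale=0.3, size=m)
var_q, var_p = np.exp(logvar_q), np.exp(logvar_p)

# Closed-form KL( q || p ) for diagonal Gaussians, i.e., (6) with Sigma = diag(var).
kl_closed = 0.5 * np.sum(var_q / var_p + (mu_p - mu_q) ** 2 / var_p - 1.0
                         + logvar_p - logvar_q)

# Monte Carlo check using the reparameterization trick: z = mu_q + sqrt(var_q) * eps.
eps = rng.normal(size=(200_000, m))
z = mu_q + np.sqrt(var_q) * eps
log_q = -0.5 * np.sum(np.log(2 * np.pi * var_q) + (z - mu_q) ** 2 / var_q, axis=1)
log_p = -0.5 * np.sum(np.log(2 * np.pi * var_p) + (z - mu_p) ** 2 / var_p, axis=1)
print(kl_closed, np.mean(log_q - log_p))   # the two values should closely agree
```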

In order to compute the term $S_\theta(t|x)$, we use the discretized time scheme previously described and sum all predicted probabilities subsequent to bin $t$. Note that this can be readily generalized to continuous-time models: so long as the cumulative distribution of $p_\theta(t|z)$ enjoys a closed-form expression, a numerical integration scheme is not necessary to implement VSI.
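Putting the pieces together, the sketch below assembles a per-subject negative ELBO under the discretized-time scheme: event subjects contribute the log-probability of their bin, censored subjects contribute $\log S_\theta(t|z)$ as the log of the mass in bins after the censoring bin (cf. $\mathrm{ELBO}_c$), and both subtract the Gaussian KL. All names and shapes are our illustrative assumptions, not the released implementation:

```python
import numpy as np

def neg_elbo(p_bins, t_bin, delta, mu_q, var_q, mu_p, var_p, eps=1e-12):
    """p_bins: (B, M) probabilities p_theta(.|z) with z ~ q_phi(z|x,t);
       t_bin: (B,) observed/censoring bin index; delta: (B,) event indicator."""
    B, M = p_bins.shape
    # Event term: log p_theta(t|z) at the observed bin.
    log_p_event = np.log(p_bins[np.arange(B), t_bin] + eps)
    # Censoring term: log S_theta(t|z), the mass in bins strictly after the censoring bin.
    after = np.arange(M)[None, :] > t_bin[:, None]
    log_S = np.log(np.sum(p_bins * after, axis=1) + eps)
    recon = delta * log_p_event + (1.0 - delta) * log_S
    # KL( q_phi(z|x,t) || p_theta(z|x) ) for diagonal Gaussians, as in (6).
    kl = 0.5 * np.sum(var_q / var_p + (mu_p - mu_q) ** 2 / var_p - 1.0
                      + np.log(var_p) - np.log(var_q), axis=1)
    return -(recon - kl)   # minimize the batch average of this quantity

# Tiny smoke test with random inputs.
rng = np.random.default_rng(0)
B, M, m = 6, 15, 4
loss = neg_elbo(rng.dirichlet(np.ones(M), size=B), rng.integers(0, M, size=B),
                rng.integers(0, 2, size=B).astype(float),
                rng.normal(size=(B, m)), np.full((B, m), 0.5),
                rng.normal(size=(B, m)), np.ones((B, m)))
print(loss.shape, loss.mean())
```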

3.4. Importance-Weighted estimator for likelihood evaluation

For evaluation purposes, we need to be able to compute the model’s log-likelihood for an observation Y = (xi,ti, δi), i.e.,

$$\mathcal{L}_{\mathrm{VSI}}(x_i, t_i; \theta) = \delta_i \log p_\theta(t_i|x_i) + (1 - \delta_i)\log S_\theta(t_i|x_i). \tag{7}$$

In this study, we use the importance-weighted (IW) estimator [7], which provides a tighter bound on the log-likelihood. While more sophisticated alternatives might provide sharper estimates [32], we deem the IW estimator sufficient for the scope of this study. Additionally, while the tighter bound can be repurposed for training, it does not necessarily result in improved performance [35], which we also find to be the case in this study.

To obtain a more accurate value of the likelihood, we use the approximate posterior as our proposal distribution and form the following finite-sample estimate

$$\hat{p}_\theta(t_i|x_i) = \int \frac{p_\theta(t_i|z)\,p_\theta(z|x_i)}{q_\phi(z|t_i,x_i)}\,q_\phi(z|t_i,x_i)\,dz \approx \frac{1}{L}\sum_{l=1}^L \frac{p_\theta(t_i|z_l)\,p_\theta(z_l|x_i)}{q_\phi(z_l|t_i,x_i)}, \quad z_l \sim q_\phi(z|t_i, x_i),$$

where $L$ is the number of samples. The corresponding estimate of the conditional survival function is

$$\hat{S}_\theta(t_i|x_i) = \int\!\!\int_{t > t_i} \frac{p_\theta(t|z)\,p_\theta(z|x_i)}{q_\phi(z|t_i,x_i)}\,q_\phi(z|t_i,x_i)\,dt\,dz \approx \frac{1}{L}\sum_{l=1}^L \frac{\int_{t>t_i} p_\theta(t, z_l|x_i)\,dt}{q_\phi(z_l|t_i,x_i)}, \quad z_l \sim q_\phi(z|t_i, x_i).$$

Note that, by Jensen's inequality, the resulting estimate is an underestimate of the true log-likelihood. As $L$ goes to infinity, the approximate lower bound converges to the true log-likelihood.
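The IW evaluation can be sketched as follows for a single subject under the discretized-time scheme; the toy prior, posterior and decoder are stand-ins of our own, used only to make the snippet runnable:

```python
import numpy as np
rng = np.random.default_rng(0)

m, M, L = 4, 12, 5000
t_bin = 3                                     # bin containing t_i

# Toy conditional prior p_theta(z|x_i) and approximate posterior q_phi(z|t_i,x_i).
mu_p, var_p = np.zeros(m), np.ones(m)
mu_q, var_q = rng.normal(scale=0.5, size=m), np.full(m, 0.6)
W = rng.normal(scale=0.5, size=(m, M))        # stand-in decoder weights for p_theta(t|z)

def log_gauss(z, mu, var):
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (z - mu) ** 2 / var, axis=-1)

def softmax(a):
    e = np.exp(a - a.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Draw L proposal samples z_l ~ q_phi(z|t_i,x_i) and form importance weights.
z = mu_q + np.sqrt(var_q) * rng.normal(size=(L, m))
log_w = log_gauss(z, mu_p, var_p) - log_gauss(z, mu_q, var_q)   # log p(z|x) - log q(z|t,x)
p_bins = softmax(z @ W)                                         # p_theta(.|z_l), shape (L, M)

# IW estimates of p_theta(t_i|x_i) and S_theta(t_i|x_i); plug into (7) via the event indicator.
w = np.exp(log_w)
p_hat = np.mean(w * p_bins[:, t_bin])
S_hat = np.mean(w * p_bins[:, t_bin + 1:].sum(axis=1))
print(np.log(p_hat), np.log(S_hat))
```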

3.5. Making Predictions

Predictive time-to-event distribution

During inference, given a new data point with covariates $x^*$, the predictive distribution follows from the generative model as $p_\theta(t|x^*) = \int p_\theta(t, z|x^*)\,dz = \int p_\theta(t|z)\,p_\theta(z|x^*)\,dz$, where the integration is carried out numerically via Monte Carlo sampling.
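A minimal sketch of this Monte Carlo step, with a toy prior and decoder of our own, is:

```python
import numpy as np
rng = np.random.default_rng(0)

m, M, L = 4, 12, 2000
mu_p, var_p = rng.normal(scale=0.3, size=m), np.full(m, 0.8)   # p_theta(z|x*) for one subject
W = rng.normal(scale=0.5, size=(m, M))                         # stand-in decoder for p_theta(t|z)

def softmax(a):
    e = np.exp(a - a.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# p_theta(t|x*) ~= (1/L) sum_l p_theta(t|z_l), with z_l ~ p_theta(z|x*).
z = mu_p + np.sqrt(var_p) * rng.normal(size=(L, m))
p_t_given_xstar = softmax(z @ W).mean(axis=0)     # predictive distribution over the M bins
print(p_t_given_xstar.round(3), p_t_given_xstar.sum())
```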

Point estimation of time-to-event

To better exploit the learned approximate posterior $q_\phi(z|x, t)$, we generalize the importance sampling idea and provide a weighted average as the time-to-event summary, rather than, for instance, a summary statistic such as the median or mean. Specifically, consider multiple samples $t_l \sim p_\theta(t|x)$, then calculate the weighted average

$$t^* = \frac{\sum_{l=1}^L w_l t_l}{\sum_{l=1}^L w_l}, \quad w_l = \frac{p_\theta(z_l|x)}{q_\phi(z_l|t_l, x)}, \quad t_l \sim p_\theta(t|x), \quad z_l \sim q_\phi(z|t_l, x). \tag{8}$$

In the Supplementary Materials we show that (8) gives better model performance for point-estimate-based evaluation metrics, the Concordance Index in particular, compared to other popular summary statistics such as the median of $t^* \sim p_\theta(t|x^*)$ computed from $L$ empirical samples.
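A sketch of the weighted-average point estimate in (8) is given below; the toy encoder/decoder and the use of bin midpoints as representative times are our own illustrative assumptions:

```python
import numpy as np
rng = np.random.default_rng(0)

m, M, L = 4, 12, 500
bin_mid = np.linspace(0.5, 11.5, M)                 # representative time for each bin (toy)
mu_p, var_p = np.zeros(m), np.ones(m)               # toy p_theta(z|x*)
W = rng.normal(scale=0.5, size=(m, M))              # stand-in decoder p_theta(t|z)
A = rng.normal(scale=0.1, size=(M, m))              # toy encoder mean map for q_phi(z|t,x)

def log_gauss(z, mu, var):
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (z - mu) ** 2 / var, axis=-1)

def softmax(a):
    e = np.exp(a - a.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Step 1: t_l ~ p_theta(t|x*) via ancestral sampling z ~ p_theta(z|x*), then t ~ p_theta(t|z).
z_prior = mu_p + np.sqrt(var_p) * rng.normal(size=(L, m))
t_bins = np.array([rng.choice(M, p=p) for p in softmax(z_prior @ W)])

# Step 2: z_l ~ q_phi(z|t_l, x*) (toy Gaussian whose mean depends on the sampled bin).
mu_q, var_q = np.eye(M)[t_bins] @ A, np.full(m, 0.6)
z_post = mu_q + np.sqrt(var_q) * rng.normal(size=(L, m))

# Step 3: weighted average (8) with w_l = p_theta(z_l|x*) / q_phi(z_l|t_l, x*).
w = np.exp(log_gauss(z_post, mu_p, var_p) - log_gauss(z_post, mu_q, var_q))
t_star = np.sum(w * bin_mid[t_bins]) / np.sum(w)
print(t_star)
```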

4. DISSECTING VSI

In the experiments, we show the effectiveness of the proposed VSI model in recovering underlying time-to-event distributions. To provide additional insight into the differentiating components of the VSI model, we consider two baseline models that partially adopt a VSI design, as detailed below.

VSI without a qϕ arm (VSI-NoQ)

In VSI, we use the variational lower bound to maximize the likelihood in survival studies, implicitly forcing the covariate-dependent prior pθ(z|x) to be close to the tractable posterior approximation qϕ(z|x, t). Via this KL divergence minimization, the interactions between covariates x and event times t captured by qϕ(z|x, t) better inform the construction of the latent representation z, isolating the individual uncertainty encoded by pθ(z|x). If we exclude the interaction term (x, t) in qϕ and only make the prediction with x, i.e., with the approximate posterior given by qϕ(z|x), through the same stochastic latent representation z, then naturally the optimal solution is to equate qϕ(z|x) with the prior pθ(z|x)1. This effectively eliminates qϕ from our formulation, and therefore we call this variant VSI-NoQ.

More specifically, without a qϕ arm, the model described in Section 3 essentially becomes a feed-forward model with a special stochastic hidden layer z. In this case, the model likelihood is given by $p_\theta(t|x) = \int p_\theta(t, z|x)\,dz = \int p_\theta(t|z)\,p_\theta(z|x)\,dz$, where $p_\theta(t|z)$ and $p_\theta(z|x)$ are defined as in (3). Note that the only difference with VSI is the lack of the KL divergence term matching pθ(z|x) to qϕ(z|x, t). This baseline model (VSI-NoQ) is considered to dissect the impact of excluding the complex interactions between covariates and event time when constructing the individualized priors.

Deterministic feed-forward model (MLP)

To understand the importance of the stochastic latent representation z, we consider a straightforward baseline which directly predicts the event time distribution from the input x, i.e., pθ(·|x) = Softmax(gθ(x)), which is essentially a standard multinomial regression with censored observations. In our study, we use an MLP to implement gθ(x), and as such we hereafter refer to this model as MLP. Additionally, we also considered standard randomization schemes, such as dropout [42], in the construction of a stochastic neural net, which promises to improve performance. Such a strategy also incorporates randomness; however, it differs fundamentally from the modeled uncertainty exploited by our VSI scheme. In our experiment section, we report the best results from MLP with or without dropout.

These baseline approaches use feed-forward deep learning networks to learn pθ(t|x) without invoking the machinery of variational inference. In the experiments we will show that variational inference is crucial to accurately learning time-to-event distributions, resulting in better performance relative to these baselines, especially when the proportion of censored events is high.

5. RELATED WORK

Machine learning and survival analysis

Early attempts at combining machine learning techniques with statistical survival analysis, such as the Faraggi-Simon network (FS-network) [13], often failed to demonstrate a clear advantage over classical baselines [40]. Recent progress in machine learning allows researchers to overcome the difficulties suffered by prior studies. For example, Katzman et al. [23] showed that weight decay, batch normalization and dropout significantly improve the performance of the FS-network. Li et al. [27] analyzed survival curves based on clinical images using deep convolutional neural networks (CNNs). In addition to deep nets, Fernández et al. [15] showed that Gaussian Processes can be used to effectively capture the non-linear variations in CoxPH models, and Alaa and van der Schaar [3] further proposed a variant that handles competing risks. Similar to these works, our VSI also draws power from recent advances in machine learning to define a flexible learner.

Bayesian survival analysis

Bayesian treatment of survival models has a long history. Raftery et al. [34] first considered modeling uncertainties for survival data, and Zupan et al. [49] reported probabilistic analyses under a Bayesian setup. More recently, Fard et al. [14] exploited the Bayesian framework to extrapolate priors, and Zhang and Zhou [48] described a Bayesian treatment of competing risks. Closest to VSI is the deep exponential family (DEF) survival model [37], where the authors introduced a Bayesian latent variable model for both the predictors x and the survival time t. Unlike our VSI, DEF still imposes strong parametric assumptions on the survival distribution, and it is not clear how censored observations are handled in DEF's actual implementation. Another key difference between DEF and VSI is the factorization of the joint likelihood: the VSI encoder only seeks to capture the latent components that are predictive of the survival time distribution, while the DEF encoder also needs to summarize the information required to reconstruct the covariates x. We argue that our VSI factorization of the joint probability is more sensible for survival time modeling, because modeling x not only adds model complexity but also introduces nuisance to the prediction of the survival time t. For datasets with large covariate dimensions and noisy observations, the DEF features can be dominated by those predictive of x rather than t, compromising the main goal of modeling the survival distribution.

Individual uncertainties and randomization

The seminal work of Aalen [2] first identified the importance of accounting for individual uncertainty, a main culprit for the failure of classical survival models, which can be remedied by explicitly modeling the random effects [18]. Alternatively, Ishwaran et al. [19] presented the Random Survival Forest (RSF) to predict cumulative hazards using a tree ensemble, demonstrating the effectiveness of a randomization scheme for statistical survival models. Our approach differs from the above schemes by systematically accounting for individual uncertainty through the randomness of latent variables.

Direct modeling of survival distribution

The pioneering work of Yu et al. [46] advocated the prediction of individual survival distributions, learned using a generalized logistic regression scheme. This idea was further generalized in the works of Luck et al. [29] and Fotso [16]. Recently, Chapfuwa et al. [8] explored the use of deep Generative Adversarial Networks (GANs) to capture individual survival distributions, which is closest to our goal. Compared to the proposed VSI, the adversarial learning of survival distributions is largely unstable, and its success crucially relies on the use of ad-hoc regularizers.

6. EXPERIMENTS

To validate the effectiveness of the proposed VSI, we benchmarked its performance against the following representative examples from both statistical and machine learning survival analysis schemes: AFT-Weibull, CoxPH, LASSO-based CoxNet [41], Random Survival Forest (RSF) [19] and deep learning based DeepSurv [23]. To fully appreciate the gains from using a variational setup, we further compared the results with the baselines discussed in Section 4, namely, the feed-forward model (MLP) and VSI model without the backward encoding arm qϕ (z|t, x) (VSI-NoQ).

For data preparation, we randomly partition the data into three non-overlapping sets for training (60%), validation (20%) and evaluation (20%). All models are trained on the training set, and we tune the model hyper-parameters with respect to out-of-sample performance on the validation set. The results reported in the paper are based on the evaluation set, using the best-performing hyper-parameters determined on the validation set. We use the Adam optimizer with a learning rate of 5 × 10−4 during training, with mini-batches of size 100. An early stopping criterion based on lack of improvement on the validation set is enforced.

To ensure fair comparisons, all deep-learning-based solutions are matched in the number of parameters and use similar model architectures and hyper-parameter settings. TensorFlow code to replicate our experiments can be found at https://github.com/ZidiXiu/VSI/. The details of the VSI model setups are relegated to the Supplementary Materials (SM).

6.1. Evaluation Metrics

To objectively evaluate these competing survival models, we report a comprehensive set of distribution-based and point-estimate based scores to assess model performance, as detailed below.

Concordance Index (C-Index) is commonly used to evaluate the consistency between the model predicted risk scores and observed event rankings [17]. Formally, it is defined as

$$\text{C-Index} = \frac{1}{|\mathcal{E}|}\sum_{(i,j)\in\mathcal{E}} \mathbb{1}\left[f(x_i) > f(x_j)\right],$$

where $\mathcal{E} = \{(i, j) : t_i < t_j,\ \delta_i = 1\}$ is the set of all valid ordered pairs (event $i$ before event $j$) and $f(x)$ is a scalar prediction made by the model. Higher is better.
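For reference, a straightforward quadratic-time implementation of this empirical C-Index might look as follows; it is our own sketch of the standard definition, not the evaluation code used in the paper:

```python
import numpy as np

def c_index(risk, time, event):
    """Harrell's C-Index over valid ordered pairs.
    risk: higher value means the model expects an earlier event."""
    num, den = 0, 0
    n = len(time)
    for i in range(n):
        if not event[i]:
            continue                      # a pair is only valid if the earlier subject had an event
        for j in range(n):
            if time[i] < time[j]:
                den += 1
                num += risk[i] > risk[j]
    return num / den

rng = np.random.default_rng(0)
t = rng.exponential(5.0, size=200)
d = rng.integers(0, 2, size=200)
print(c_index(-t + rng.normal(0, 1, 200), t, d))   # noisy "risk" anti-correlated with time
```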

Time-dependent Concordance Index is a generalization of the scalar-risk-score-based C-Index to distributions [4], computed from the predicted survival distribution. Formally it is given by

$$C^{td} = P\left(\hat{F}(t_i|x_i) > \hat{F}(t_i|x_j) \mid t_i < t_j\right),$$

where $\hat{F}$ denotes the model-predicted cumulative density. We report results using the following empirical estimator

$$\hat{C}^{td} = \frac{1}{|\mathcal{E}|}\sum_{(i,j)\in\mathcal{E}} \mathbb{1}\left[\hat{F}(t_i|x_i) > \hat{F}(t_i|x_j)\right].$$

Kolmogorov-Smirnov (KS) distance For synthetic datasets, we also report the KS distance [30] between the predicted distribution and the ground truth. KS computes the maximal discrepancy between two cumulative densities, i.e.,

$$\mathrm{KS} = \sup_t \left|F_1(t) - F_2(t)\right|,$$

and a lower KS indicates a better match between the two distributions.

Test log-likelihood We also report the average log-likelihood on the held-out test set. A higher score indicates the model is better aligned with the ground-truth distribution in the sense of KL-divergence. Additionally, we evaluate the spread of the empirical likelihoods under each model. In the case of an expected log-likelihood tie, models with more concentrated log-likelihoods are considered better under the maximal entropy principle [9] (i.e., when observed instances receive more uniform/similar likelihoods, better generalization of the model is implied).

Coverage Rate To quantify the proportion of observed times covered by the predicted personalized time-to-event distributions, we calculate the coverage rate for different percentile ranges. For subjects with event observations, the coverage rate is defined as the proportion of observations that fall within the percentile range $[l, u]$ of the predicted distributions, where $l$ and $u$ respectively denote the lower and upper quantile levels, i.e.,

$$\mathrm{CoverRate}_{\text{events}}(l, u) = \frac{1}{n_e}\sum_{Y_i \in \mathcal{D}_e} \mathbb{1}\left[q_i(l) < t_i < q_i(u)\right],$$

where $q_i(\cdot)$ denotes the quantile function of the predicted distribution for subject $i$.

In our experiments, we report coverage rates of events at percentile ranges $[l, u] \in \{[0.05, 0.95], [0.1, 0.9], [0.15, 0.85], [0.2, 0.8], [0.25, 0.75], [0.3, 0.7], [0.35, 0.65], [0.4, 0.6], [0.45, 0.55]\}$ of the predicted personalized distributions. For censored subjects, we calculate the proportion of censoring times that occur before a given percentile of the predicted distribution, since the true time-to-event occurs after the censoring time,

$$\mathrm{CoverRate}_{\text{censor}}(l) = \frac{1}{n_c}\sum_{Y_i \in \mathcal{D}_c} \mathbb{1}\left[t_i \le q_i(l)\right].$$

We evaluated the coverage rate for censoring at l ∈ {0.1, 0.2, · · · , 0.9} percentiles.

For all coverage rates, a higher score implies better performance. Coverage rates for events and censoring should be considered together to evaluate model performance.
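A sketch of how both coverage rates could be computed from per-subject bin probabilities is shown below; the helper names and the use of representative bin times are our own assumptions:

```python
import numpy as np

def predicted_quantile(p_bins, bin_times, q):
    """q-th quantile of each subject's predicted discrete time distribution."""
    cdf = np.cumsum(p_bins, axis=1)
    idx = np.argmax(cdf >= q, axis=1)
    return bin_times[idx]

def coverage_events(p_bins, bin_times, t, l=0.1, u=0.9):
    lo = predicted_quantile(p_bins, bin_times, l)
    hi = predicted_quantile(p_bins, bin_times, u)
    return np.mean((t > lo) & (t < hi))

def coverage_censored(p_bins, bin_times, t, l=0.5):
    # Proportion of censoring times that fall before the l-th predicted quantile.
    return np.mean(t <= predicted_quantile(p_bins, bin_times, l))

rng = np.random.default_rng(0)
n, M = 500, 20
bin_times = np.linspace(1, 100, M)
p_bins = rng.dirichlet(np.ones(M), size=n)
t_event, t_cens = rng.uniform(1, 100, n), rng.uniform(1, 60, n)
print(coverage_events(p_bins, bin_times, t_event), coverage_censored(p_bins, bin_times, t_cens))
```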

6.2. Synthetic datasets

Following Bender et al. [5], we simulate realistic survival data based on the German Uranium Miners Cohort Study in accordance with the Cox-Gompertz model

$$T = \frac{1}{\alpha}\log\left(1 - \frac{\alpha \log U}{\lambda \exp\left(\beta_{\mathrm{age}}\,\mathrm{AGE} + \beta_{\mathrm{radon}}\,\mathrm{RADON}\right)}\right),$$

with $U \sim \mathrm{Unif}[0, 1]$. This model simulates the cancer mortality associated with radon exposure and age. Model parameters are derived from real data: $\alpha = 0.2138$, $\lambda = 7 \times 10^{-8}$, $\beta_{\mathrm{age}} = 0.15$ and $\beta_{\mathrm{radon}} = 0.001$. Covariates are generated according to

$$\mathrm{AGE} \sim \mathcal{N}(24.3, 8.4^2), \qquad \mathrm{RADON} \sim \mathcal{N}(266.8, 507.8^2),$$

where $\mathcal{N}(\mu, \sigma^2)$ denotes a normal distribution with mean $\mu$ and variance $\sigma^2$. We simulate uniform censoring within a fixed time horizon $c$, i.e., we let $C_i \sim \mathrm{Unif}(0, c)$, then $\delta_i = \mathbb{1}[T_i < C_i]$ and the observed time is $t_i = \min(T_i, C_i)$. By setting different upper bounds $c$ for censoring, we achieve different observed event rates: 100% ($c = \infty$), 50% ($c = 100$) and 30% ($c = 70$). For each simulation we randomly draw $N = 50\mathrm{k}$ i.i.d. samples.
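For reproducibility of the setup described above, a sketch of the simulator (our own implementation of the stated Cox-Gompertz recipe, with hypothetical variable names) is:

```python
import numpy as np
rng = np.random.default_rng(0)

N = 50_000
alpha, lam = 0.2138, 7e-8
b_age, b_radon = 0.15, 0.001

age = rng.normal(24.3, 8.4, N)
radon = rng.normal(266.8, 507.8, N)
U = rng.uniform(0.0, 1.0, N)

# Cox-Gompertz event times: T = (1/alpha) * log(1 - alpha*log(U) / (lam*exp(x'beta))).
lin_pred = np.exp(b_age * age + b_radon * radon)
T = np.log(1.0 - alpha * np.log(U) / (lam * lin_pred)) / alpha

# Uniform censoring with horizon c; per the paper, c = 100 gives roughly a 50% event rate.
c = 100.0
C = rng.uniform(0.0, c, N)
delta = (T < C).astype(int)
t_obs = np.minimum(T, C)
print(f"observed event rate: {delta.mean():.2f}")
```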

Prediction of subject-level distribution

In practice, for each subject we only observe one t from its underlying distribution. Our goal is to accurately predict the underlying distribution from the covariates x alone (since t and δ are not observed at test time), by learning from the observed instances. Figure 1 compares the VSI prediction with the ground truth for two random subjects, showing that VSI accurately recovers the individual survival distribution for both observed (Figure 1(a)) and censored cases (Figure 1(b)).

Figure 1: Two simulated time-to-event distributions with a 30% event rate, showing that VSI successfully predicts the underlying distributions from covariates (left: events, right: censoring).

To systematically evaluate the consistency between the predicted and true distributions, we compare the average KS distance from models trained with various event rates in Table 1. Since the underlying generative process is based on a CoxPH model, we consider the results from CoxPH as the oracle reference, since there is no model mis-specification. At a 100% event rate (i.e., complete observation), apart from the oracle CoxPH, all models perform similarly. The VSI variants give slightly better results compared with MLP and AFT-Weibull. As the proportion of observed events decreases, VSI remains the best performing model, closely followed by the parametric AFT-Weibull. Note that neither MLP nor VSI-NoQ matches the performance of VSI, which suggests that the full VSI design better accommodates censored observations.

Table 1:

KS statistic for simulation study.

Event Rate 100% 50% 30%
CoxPH (Oracle) 0.027 0.032 0.027
AFT-Weibull 0.057 0.058 0.068
MLP 0.047 0.063 0.064
VSI-NoQ 0.049 0.068 0.066
VSI 0.044 0.052 0.059

Average log-likelihood and C-Index

To validate the effectiveness of VSI, we also provide a comprehensive summary of model performance against other popular or state-of-the-art alternatives in Table 2, under various simulation setups with different evaluation metrics. VSI consistently outperforms its counterparts in terms of the average log-likelihood and time-dependent C-Index. Together with the observation that VSI also yields better KS distance (see Table 1), converging evidence suggests our VSI better predicts the individual survival distributions relative to other competing solutions.

Table 2:

Model performance summary for the simulation study based on Ctd, raw C-Index and average test log-likelihood. Confidence intervals for the C-Index are provided in the SM. For NA entries, the corresponding evaluation metric cannot be applied.

Models | Ctd: 100%, 50%, 30% | Raw C-Index: 100%, 50%, 30% | log-likelihood: 100%, 50%, 30%
CoxPH 0.757 0.755 0.761 0.773 0.781 0.793 NA NA NA
Coxnet NA NA NA 0.776 0.784 0.760 NA NA NA
AFT-Weibull 0.742 0.750 0.768 0.773 0.781 0.793 −4.43 −2.29 −1.47
RSF 0.631 0.638 0.608 0.701 0.718 0.712 −14.12 −8.02 −5.35
DeepSurv NA NA NA 0.772 0.781 0.793 NA NA NA
MLP 0.744 0.751 0.770 0.772 0.781 0.793 −4.15 −2.22 −1.41
VSI-NoQ 0.748 0.749 0.763 0.772 0.781 0.793 −4.16 −2.22 −1.41
VSI 0.748 0.756 0.772 0.773 0.781 0.793 −4.15 −2.22 −1.40

We also compared the raw C-Index and the corresponding confidence intervals, using the weighted average of the model-predicted survival time (defined in Sec 3.5) as the risk score, and did not find significant differences between the alternative methods, as shown in Table 2 and the Supplementary Materials. Thus VSI delivers comparable performance relative to models that are compatible with the data-generating mechanism. The raw C-Index quantifies concordant pairs without considering the time horizon, thus the distinctions among well-performing models are not significant.

To provide a more informative summary, we plot the test log-likelihood distributions for selected models in Figure 2. We can see that the VSI log-likelihood estimates are tighter and higher for both observed and censored observations, especially at low event rates. For the (0.10, 0.90) percentile ranges of the simulation studies, please refer to the SM.

Figure 2: Test log-likelihood distributions for the 50% event rate simulation dataset (left: events, right: censoring).

Coverage Plots

In Figure 3, VSI achieves relatively high coverage for both event (Figure 3(a)) and censored observations (Figure 3(b)), compared to the oracle method CoxPH in this synthetic example. Note that while RSF performs better for the observed events, its performance on censored cases falls well below the other solutions.

Figure 3: Test coverage rate for the 50% event rate simulation dataset (left: events, right: censoring).

We refer the readers to our Supplementary Materials for additional simulations and analyses based on toy models.

6.3. Real-World datasets

Moving beyond toy simulations, we further compare VSI to competing solutions on the following three real-world datasets: i) FLCHAIN [12], a public dataset used to determine whether elevation in the free light chain assay provides prognostic information on general-population survival; ii) SUPPORT [26], a public dataset from a prospective cohort study estimating 180-day survival of seriously ill hospitalized adults; and iii) SEER [39], a public dataset aimed at studying cancer survival among adults, covering 1988 to 2001 and provided by the U.S. Surveillance, Epidemiology, and End Results (SEER) Program. In these experiments, we use the 10-year follow-up breast cancer subcohort of the SEER dataset. We follow the data pre-processing steps outlined in Chapfuwa et al. [8]. To handle missing values, we adopt the common practice of median imputation for continuous variables and mode imputation for discrete variables.

Summary statistics of the datasets are shown in Table 3, where N is the total number of observations, p denotes the total number of variables after one-hot encoding, NaN(%) stands for the proportion of missingness in the covariates, and loss of information stands for the proportion of censored observations occurring after the maximum event time t.

Table 3:

Summary Statistics for Real Datasets.

FLCHAIN SUPPORT SEER
N 7,894 9,105 68,082
Event rate(%) 27.5 68.1 51.0
p(cat) 26(21) 59(31) 789(771)
NaN(%) 2.1 12.6 23.4
Max event t 4998 days 1944 days 120 months
Loss of Info(%) 10.45 1.57 0.0

In Table 4 we compare the C-Indices and average log-likelihoods. The advantage of VSI is more evident on the more challenging real datasets, especially in cases with low observed event rates. For example, on the SUPPORT dataset, the VSI confidence interval for the raw C-Index is (0.809, 0.846), while that of the standard CoxNet is only (0.763, 0.805) and that of AFT is (0.782, 0.813), i.e., the overlaps with the VSI interval are very small. Similar results were observed for the other datasets and baseline solutions. VSI shows remarkable robustness against data incompleteness in real-world scenarios, achieving the best results according to all three metrics. For VSI, the raw C-Index is computed from the weighted average of the VSI-predicted distribution; please refer to the SM for more details. In Figure 4, the VSI log-likelihood distribution is more concentrated, in addition to having a higher mean. To quantitatively evaluate the concentration, we report the difference between the 10% and 90% quantiles of the log-likelihood in Table 5. The quantile ranges of VSI are considerably smaller than those of the alternative solutions under most experimental settings. This verifies that VSI enjoys better robustness compared to other popular alternatives, especially in the case of high censoring rates.

Table 4:

Summary for real datasets based on C-Index and average log-likelihood. Confidence intervals for the C-Index are provided in the SM. NA implies the corresponding evaluation metric cannot be evaluated.

Models | Ctd: FLCHAIN, SUPPORT, SEER | Raw C-Index: FLCHAIN, SUPPORT, SEER | log-likelihood: FLCHAIN, SUPPORT, SEER
Coxnet NA NA NA 0.790 0.797 0.819 NA NA NA
AFT-Weibull 0.777 0.752 NA 0.792 0.797 NA −3.09 −4.39 NA
RSF NA NA NA 0.771 0.751 0.796 NA NA NA
DeepSurv NA NA NA 0.785 0.678 NA NA NA NA
MLP 0.775 0.768 0.821 0.751 0.811 0.811 −1.91 −2.86 −2.50
VSI-NoQ 0.745 0.772 0.820 0.745 0.824 0.809 −2.45 −2.79 −2.50
VSI 0.787 0.775 0.824 0.792 0.827 0.826 −1.85 −2.74 −2.49

Figure 4: Log-likelihood distributions for the SUPPORT dataset (left: events, right: censoring).

Table 5:

Quantile ranges for the log-likelihood on real datasets. Note that AFT did not converge to reasonable solutions for SEER.

Models | Observed: FLCHAIN, SUPPORT, SEER | Censored: FLCHAIN, SUPPORT, SEER
AFT 2.491 4.706 NA 0.468 1.850 NA
MLP 2.970 4.273 1.780 0.518 1.540 0.623
VSI-NoQ 7.34 4.744 1.801 0.559 1.634 0.529
VSI 2.213 4.143 1.718 0.537 1.354 0.508

Together with the coverage plots in Figure 5, VSI has relatively high coverage for both event and censoring cases, which indicates better performance in capturing the true event times in challenging real-world datasets. The consistency of these results has been verified through repeated runs on the three datasets. For more detailed results, please refer to the SM.

Figure 5: Coverage rate for the SUPPORT dataset (left: events, right: censoring).

7. CONCLUSIONS

We presented an approach for learning time-to-event distributions conditioned on covariates in a nonparametric fashion by leveraging a principled variational inference formulation. The proposed approach, VSI, extends the variational inference framework to survival data with censored observations. On synthetic and diverse real-world datasets, we demonstrated the ability of VSI to recover the underlying unobserved time-to-event distribution, as well as to provide point estimates of time-to-event that yield excellent performance metrics, consistently outperforming feed-forward deep learning models and traditional statistical models.

As future work, we plan to extend our VSI framework to longitudinal studies, where we can employ a recurrent neural net (RNN) to account for temporal dependencies. For datasets with observations made at irregular intervals, the Neural-ODE model [10], for instance, can be applied. Our work can also be adapted to make dynamic predictions of event times to serve the needs of modern clinical practice.

Supplementary Material

VSI supplementary material

CCS CONCEPTS.

Applied computing → Health informatics; Computing methodologies → Modeling methodologies.

Acknowledgements.

The authors would like to thank the anonymous reviewers for their insightful comments. This research was supported in part by NIH/NIBIB R01-EB025020.

Footnotes

ACM Reference Format:

Zidi Xiu, Chenyang Tao, and Ricardo Henao. 2020. Variational Learning of Individual Survival Distributions. In ACM Conference on Health, Inference, and Learning (ACM CHIL ‘20), April 2–4, 2020, Toronto, ON, Canada. ACM, New York, NY, USA, 9 pages. https://doi.org/10.1145/3368555.3384454

1. Based on a KL-vanishing argument.

Contributor Information

Zidi Xiu, Department of Biostatistics & Bioinformatics, Duke University.

Chenyang Tao, Electrical & Computer Engineering, Duke University.

Ricardo Henao, Department of Biostatistics & Bioinformatics, Duke University.

REFERENCES

  • [1].Aalen Odd, Borgan Ornulf, and Gjessing Hakon. 2008. Survival and event history analysis: a process point of view Springer Science & Business Media. [Google Scholar]
  • [2].Aalen Odd O. 1994. Effects of frailty in survival analysis. Statistical Methods in Medical Research 3, 3 (1994), 227–243. [DOI] [PubMed] [Google Scholar]
  • [3].Alaa Ahmed M and van der Schaar Mihaela. 2017. Deep multi-task gaussian processes for survival analysis with competing risks In Proceedings of the 31st International Conference on Neural Information Processing Systems. Curran Associates Inc., 2326–2334. [Google Scholar]
  • [4].Antolini Laura, Boracchi Patrizia, and Biganzoli Elia. 2005. A time-dependent discrimination index for survival data. Statistics in Medicine 24, 24 (2005), 3927–3944. [DOI] [PubMed] [Google Scholar]
  • [5].Bender Ralf, Augustin Thomas, and Blettner Maria. 2005. Generating survival times to simulate Cox proportional hazards models. Statistics in medicine 24, 11(2005), 1713–1723. [DOI] [PubMed] [Google Scholar]
  • [6].Bou-Hamad Imad, Larocque Denis, Ben-Ameur Hatem, et al. 2011. A review of survival trees. Statistics Surveys 5 (2011), 44–71. [Google Scholar]
  • [7].Burda Yuri, Grosse Roger, and Salakhutdinov Ruslan. 2015. Importance weighted autoencoders. arXiv preprint arXiv:1509.00519 (2015). [Google Scholar]
  • [8].Chapfuwa Paidamoyo, Tao Chenyang, Li Chunyuan, Page Courtney, Goldstein Benjamin, Carin Lawrence, and Henao Ricardo. 2018. Adversarial time-to-event modeling. arXiv preprint arXiv:1804.03184 (2018). [PMC free article] [PubMed] [Google Scholar]
  • [9].Chen Liqun, Tao Chenyang, Zhang Ruiyi, Henao Ricardo, and Carin Duke Lawrence. 2018. Variational inference and model selection with generalized evidence bounds In International Conference on Machine Learning. 892–901. [Google Scholar]
  • [10].Qi Chen Tian, Rubanova Yulia, Bettencourt Jesse, and Duvenaud David K. 2018. Neural ordinary differential equations. In Advances in neural information processing systems 6571–6583. [Google Scholar]
  • [11].Cox David R. 1972. Regression models and life-tables. Journal of the Royal Statistical Society: Series B (Methodological) 34, 2 (1972), 187–202. [Google Scholar]
  • [12].Dispenzieri Angela, Katzmann Jerry A, Kyle Robert A, Larson Dirk R, Therneau Terry M, Colby Colin L, Clark Raynell J, Mead Graham P, Kumar Shaji, Melton L Joseph III, et al. 2012. Use of nonclonal serum immunoglobulin free light chains to predict overall survival in the general population. In Mayo Clinic Proceedings, Vol. 87. Elsevier, 517–523. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [13].Faraggi David and Simon Richard. 1995. A neural network model for survival data. Statistics in medicine 14, 1 (1995), 73–82. [DOI] [PubMed] [Google Scholar]
  • [14].Jahanbani Fard Mahtab, Wang Ping, Chawla Sanjay, and Reddy Chandan K. 2016. A bayesian perspective on early stage event prediction in longitudinal data. IEEE Transactions on Knowledge and Data Engineering 28, 12 (2016), 3126–3139. [Google Scholar]
  • [15].Fernández Tamara, Rivera Nicolás, and Whye Teh Yee. 2016. Gaussian processes for survival analysis. In Advances in Neural Information Processing Systems 5021–5029. [Google Scholar]
  • [16].Fotso Stephane. 2018. Deep Neural Networks for Survival Analysis Based on a Multi-Task Framework. arXiv preprint arXiv:1801.05512 (2018). [Google Scholar]
  • [17].Harrell Frank E, Califf Robert M, Pryor David B, Lee Kerry L, and Rosati Robert A. 1982. Evaluating the yield of medical tests. Jama 247, 18 (1982), 2543–2546. [PubMed] [Google Scholar]
  • [18].Hougaard Philip. 1995. Frailty models for survival data. Lifetime data analysis 1, 3 (1995), 255–273. [DOI] [PubMed] [Google Scholar]
  • [19].Ishwaran Hemant, Kogalur Udaya B, Blackstone Eugene H, Lauer Michael S, et al. 2008. Random survival forests. The annals of applied statistics 2, 3 (2008), 841–860. [Google Scholar]
  • [20].Kalbfleisch John D and Prentice Ross L. 2011. The statistical analysis of failure time data Vol. 360. John Wiley & Sons. [Google Scholar]
  • [21].Kaplan Edward L and Meier Paul. 1958. Nonparametric estimation from incomplete observations. Journal of the American statistical association 53, 282 (1958), 457–481. [Google Scholar]
  • [22].Katzman Jared L, Shaham Uri, Cloninger Alexander, Bates Jonathan, Jiang Tingting, and Kluger Yuval. 2016. Deep survival: A deep cox proportional hazards network. stat 1050 (2016), 2. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [23].Katzman Jared L, Shaham Uri, Cloninger Alexander, Bates Jonathan, Jiang Tingting, and Kluger Yuval. 2018. DeepSurv: personalized treatment recommender system using a Cox proportional hazards deep neural network. BMC medical research methodology 18, 1 (2018), 24. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [24].Khan Faisal M and Bayer Zubek Valentina. 2008. Support vector regression for censored data (SVRc): a novel tool for survival analysis In 2008 Eighth IEEE International Conference on Data Mining. IEEE, 863–868. [Google Scholar]
  • [25].Kingma Diederik P and Welling Max. 2013. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114 (2013). [Google Scholar]
  • [26].Knaus William A, Harrell Frank E, Lynn Joanne, Goldman Lee, Phillips Russell S, Connors Alfred F, Dawson Neal V, Fulkerson William J, Califf Robert M, Desbiens Norman, et al. 1995. The SUPPORT prognostic model: objective estimates of survival for seriously ill hospitalized adults. Annals of internal medicine 122, 3(1995), 191–203. [DOI] [PubMed] [Google Scholar]
  • [27].Li Hongming, Boimel Pamela, Janopaul-Naylor James, Zhong Haoyu, Xiao Ying, Ben-Josef Edgar, and Fan Yong. 2019. Deep Convolutional Neural Networks for Imaging Data Based Survival Analysis of Rectal Cancer. arXiv preprint arXiv:1901.01449 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [28].Lisboa Paulo JG, Wong H, Harris P, and Swindell Ric. 2003. A Bayesian neural network approach for modelling censored data with an application to prognosis after surgery for breast cancer. Artificial intelligence in medicine 28, 1 (2003), 1–25. [DOI] [PubMed] [Google Scholar]
  • [29].Luck Margaux, Sylvain Tristan, Cardinal Héloïse, Lodi Andrea, and Bengio Yoshua. 2017. Deep learning for patient-specific kidney graft survival analysis. arXiv preprint arXiv:1705.10245 (2017). [Google Scholar]
  • [30].Massey Frank J Jr. 1951. The Kolmogorov-Smirnov test for goodness of fit. Journal of the American statistical Association 46, 253 (1951), 68–78. [Google Scholar]
  • [31].Murphy SA, Rossini AJ, and van der Vaart Aad W. 1997. Maximum likelihood estimation in the proportional odds model. J. Amer. Statist. Assoc 92, 439 (1997), 968–976. [Google Scholar]
  • [32].Neal Radford M. 2001. Annealed importance sampling. Statistics and computing 11, 2 (2001), 125–139. [Google Scholar]
  • [33].Raftery Adrian E. 1995. Bayesian model selection in social research. Sociological methodology 25 (1995), 111–164. [Google Scholar]
  • [34].Raftery Adrian E, Madigan David, and Volinsky Chris T. 1996. Accounting for model uncertainty in survival analysis improves predictive performance. Bayesian statistics 5 (1996), 323–349. [Google Scholar]
  • [35].Rainforth Tom, Kosiorek Adam R, Anh Le Tuan, Maddison Chris J, Igl Maximilian, Wood Frank, and Whye Teh Yee. 2018. Tighter variational bounds are not necessarily better. arXiv preprint arXiv:1802.04537 (2018). [Google Scholar]
  • [36].Ranganath Rajesh, Gerrish Sean, and Blei David. 2014. Black box variational inference. In Artificial Intelligence and Statistics 814–822. [Google Scholar]
  • [37].Ranganath Rajesh, Perotte Adler, Elhadad Noémie, and Blei David. 2016. Deep survival analysis. arXiv preprint arXiv:1608.02158 (2016). [Google Scholar]
  • [38].Resnick Sidney. 2003. A probability path Birkhauser Verlag AG. [Google Scholar]
  • [39].Gloeckler Ries LA, Young JL, Keel GE, Eisner MP, Lin YD, Horner MJ, et al. 2007. SEER survival monograph: cancer survival among adults: US SEER program, 1988–2001, patient and tumor characteristics. National Cancer Institute, SEER Program, NIH Pub 07–6215 (2007), 193–202. [Google Scholar]
  • [40].Schwarzer Guido, Vach Werner, and Schumacher Martin. 2000. On the misuses of artificial neural networks for prognostic and diagnostic classification in oncology. Statistics in medicine 19, 4 (2000), 541–561. [DOI] [PubMed] [Google Scholar]
  • [41].Simon Noah, Friedman Jerome, Hastie Trevor, and Tibshirani Rob. 2011. Regularization paths for Cox’s proportional hazards model via coordinate descent. Journal of statistical software 39, 5 (2011), 1. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [42].Srivastava Nitish, Hinton Geoffrey, Krizhevsky Alex, Sutskever Ilya, and Salakhutdinov Ruslan. 2014. Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15, 1 (2014), 1929–1958. [Google Scholar]
  • [43].Tomczak Jakub M and Welling Max. 2017. VAE with a VampPrior. arXiv preprint arXiv:1705.07120 (2017). [Google Scholar]
  • [44].Wainwright Martin J. and Jordan Michael I.. 2008. Graphical Models, Exponential Families, and Variational Inference. Foundations and Trends in Machine Learning 1 (2008), 1–305. [Google Scholar]
  • [45].Wang Ping, Li Yan, and Reddy Chandan K. 2019. Machine learning for survival analysis: A survey. ACM Computing Surveys (CSUR) 51, 6 (2019), 110. [Google Scholar]
  • [46].Yu Chun-Nam, Greiner Russell, Lin Hsiu-Chin, and Baracos Vickie. 2011. Learning patient-specific cancer survival distributions as a sequence of dependent regressors. In Advances in Neural Information Processing Systems 1845–1853. [Google Scholar]
  • [47].Helen Zhang Hao and Lu Wenbin. 2007. Adaptive Lasso for Cox’s proportional hazards model. Biometrika 94, 3 (2007), 691–703. [Google Scholar]
  • [48].Zhang Quan and Zhou Mingyuan. 2018. Nonparametric Bayesian Lomax delegate racing for survival analysis with competing risks. In Advances in Neural Information Processing Systems 5002–5013. [Google Scholar]
  • [49].Zupan Blaž, Demšar Janez, Kattan Michael W, Beck J Robert, and Bratko Ivan. 1999. Machine learning for survival analysis: a case study on recurrence of prostate cancer. In Joint European Conference on Artificial Intelligence in Medicine and Medical Decision Making Springer, 346–355. [DOI] [PubMed] [Google Scholar]
