Abstract
Competing risks occur in survival analysis when an individual is at risk of more than one type of event and the occurrence of one event precludes the occurrence of any other event. A measure of interest with competing risks data is the cause-specific cumulative incidence function (CIF), which gives the absolute (or crude) risk of having the event by time t, accounting for the fact that it is impossible to have the event if a competing event is experienced first. The user-written command stcompet calculates non-parametric estimates of the cause-specific CIF and the official Stata command stcrreg fits the Fine and Gray model for competing risks data. Geskus (2011) has recently shown that some of the key measures in competing risks can be estimated in standard software by restructuring the data and incorporating weights. This has a number of advantages, as any tools developed for standard survival analysis can then be used for the analysis of competing risks data. This paper describes the stcrprep command, which restructures the data and calculates the appropriate weights. After using stcrprep, a number of standard Stata survival analysis commands can be used for the analysis of competing risks. For example, sts graph, failure will plot the cause-specific CIF and stcox will fit the Fine and Gray proportional subhazards model. Using stcrprep together with stcox is computationally much more efficient than using stcrreg. In addition, the use of stcrprep opens up new opportunities for competing risks models. This is illustrated by fitting flexible parametric survival models to the expanded data to directly model the cause-specific CIF.
Keywords: st0001, Survival Analysis, Competing Risks, Time-Dependent Effects
1. Introduction
Competing risks occur in survival analysis when a subject is at risk of more than one type of event. A classic example is when there is consideration of different causes of death. If a subject dies of one particular cause they are no longer at risk of death from any other cause. Interest may lie in the cause-specific hazard rate, which can be estimated using standard survival techniques by treating any competing events as censored observations at the time of occurrence of the competing event. An alternative measure is the cause-specific cumulative incidence function (CIF), which gives an estimate of the absolute or crude risk of death accounting for the possibility that individuals may die of other causes (Putter et al. 2007).
A non-parametric estimate of the cause-specific CIF (Kalbfleisch and Prentice 1980) can be obtained within Stata using the user-written command stcompet, available from SSC (Coviello and Boggess 2004). Regression models of the cause-specific CIF are usually based around the subdistribution hazard, which is the hazard function corresponding to the cause-specific CIF. The approach of Fine and Gray (1999) is implemented within Stata via the stcrreg command. This uses a weighting procedure where individuals with a competing event remain at risk, with a weight that depends on the censoring distribution for the study population as a whole.
Geskus (2011) has recently proposed an alternative way to estimate the cause-specific CIF that uses weighted versions of standard estimators. A corresponding R command, crprep, which is part of the mstate package, has been developed that restructures the data and calculates the appropriate weights (de Wreede et al. 2011). This paper describes a Stata command, stcrprep, that has similar functionality to crprep with some further extensions to allow parametric models for the cause-specific CIF to be fitted. After using stcrprep, the cause-specific CIF can be plotted using sts graph and a Fine and Gray model can be fitted using stcox. An advantage of this approach is that some of the methods developed for the Cox model can be used for models on the subdistribution hazard scale. For example, testing and visualization of the proportional subdistribution hazards assumption can be carried out using Schoenfeld residuals via estat phtest.
Parametric models have a number of advantages over Cox models when modelling cause-specific hazard functions, provided the parametric form is appropriate. Flexible parametric survival models use restricted cubic splines to estimate the underlying hazard and survival functions, which enables virtually any shaped hazard to be captured (Royston and Parmar 2002). These models are implemented in Stata using the stpm2 command available from SSC (Lambert and Royston 2009; Royston and Lambert 2011). This paper will show that, by restructuring the data and calculating appropriate weights, these models can be used to directly estimate and model cumulative incidence functions.
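Since the spline basis is central to these models, a small illustration may help. The Python sketch below (illustrative only; the knot values are made up and this is not the stpm2 implementation) constructs the standard restricted cubic spline basis: cubic between the knots and constrained to be linear beyond the boundary knots.

```python
def rcs_basis(x, knots):
    """Restricted cubic spline basis: a linear term plus one basis
    function per interior knot, constrained to be linear beyond the
    boundary knots."""
    kmin, kmax = knots[0], knots[-1]
    plus3 = lambda u: u ** 3 if u > 0 else 0.0   # truncated cubic (u)_+^3
    basis = [x]                                  # linear term
    for kj in knots[1:-1]:                       # one term per interior knot
        lam = (kmax - kj) / (kmax - kmin)
        basis.append(plus3(x - kj)
                     - lam * plus3(x - kmin)
                     - (1 - lam) * plus3(x - kmax))
    return basis
```

The constraint to linearity outside the boundary knots is what gives the fitted log cumulative hazard sensible tail behaviour compared with an unrestricted cubic.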
The remainder of this paper is laid out as follows. Section 2 describes some competing risks theory and the Fine and Gray model for the cause-specific CIF. Section 3 describes the syntax of the stcrprep command. Section 4 gives some examples of using stcrprep for non-parametric estimation of the cause-specific CIF and fitting the Fine and Gray model. Section 5 describes how the approach can be extended to parametric models, with section 6 giving some examples. Finally, section 7 discusses the approach and briefly describes some possible extensions.
2. Methods
In competing risks a patient is at risk from K different causes. The cause-specific hazard function, $h_k(t)$, for cause k is defined as,

$$h_k(t) = \lim_{\Delta t \to 0} \frac{P(t \le T < t + \Delta t,\, D = k \mid T \ge t)}{\Delta t}$$

where T is the time to event and D denotes the type of event.
When the competing events are deaths from different causes, this can be thought of as the mortality rate for cause k conditional on survival to time t. To be at risk at time t, an individual cannot have died of the cause of interest or any of the competing causes of death.
It is possible to transform the cause-specific hazard function to a survival function, $S_k(t)$, as follows,

$$S_k(t) = \exp\left(-\int_0^t h_k(u)\, \mathrm{d}u\right) \qquad (1)$$
If the competing events are assumed to be independent (conditional on covariates), then $1 - S_k(t)$ can be interpreted as the probability of having event k by time t in the hypothetical world where it is not possible to have any of the competing events. However, in many situations the assumption of independence will not be reasonable and the estimates obtained through equation 1 are not interpretable as probabilities. Even if independence is reasonable, it may be of more interest to calculate probabilities in the real world rather than in this hypothetical world. In such situations it is of more interest to estimate the cause-specific cumulative incidence function.
The cause-specific cumulative incidence function, $F_k(t)$, is defined as

$$F_k(t) = \int_0^t h_k(u)\, S(u)\, \mathrm{d}u, \qquad S(u) = \exp\left(-\sum_{l=1}^{K} \int_0^u h_l(v)\, \mathrm{d}v\right) \qquad (2)$$

where $S(u)$ is the all-cause survival function.
This gives the probability of having event k as a function of time, accounting for the fact that subjects have a chance of having one of the competing events first. It depends on the underlying hazard rates for all K causes. When estimating cause-specific CIFs the competing events do not have to be independent.
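As a numerical check of equation 2, the Python sketch below (purely illustrative, with made-up constant cause-specific hazards; not part of the Stata workflow) integrates $h_1(u)S(u)$ by the midpoint rule and compares the result with the closed form available for constant hazards. It also confirms the point above: the crude risk is below the "net" probability $1 - S_1(t)$ of equation 1.

```python
import math

def cif_constant_hazards(h1, h2, t, n_steps=100_000):
    """Equation (2) by midpoint-rule integration: F_1(t) = int_0^t h_1 S(u) du,
    where S(u) = exp(-(h1 + h2) u) is the all-cause survival for two
    constant cause-specific hazards h1 and h2."""
    dt = t / n_steps
    return sum(h1 * math.exp(-(h1 + h2) * ((i + 0.5) * dt)) * dt
               for i in range(n_steps))
```

For constant hazards the closed form is $F_1(t) = \frac{h_1}{h_1+h_2}\left(1 - e^{-(h_1+h_2)t}\right)$, which the numerical integral reproduces.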
A non-parametric estimate of the cause-specific cumulative incidence function for cause k can be obtained as follows (Kalbfleisch and Prentice 1980),

$$\hat F_k(t) = \sum_{j:\, t_j \le t} \frac{d_{kj}}{n_j}\, \hat S(t_{j-1}) \qquad (3)$$

where $\hat S(t_{j-1})$ is the Kaplan-Meier estimate of the all-cause survival function at time $t_{j-1}$, $d_{kj}$ is the number of deaths due to cause k at time $t_j$ and $n_j$ is the number at risk.
A mathematically equivalent estimate of the CIF is the product-limit estimate,

$$\hat F_k(t) = 1 - \prod_{j:\, t_j \le t} \left(1 - \frac{d_{kj}}{n_j^*}\right) \qquad (4)$$

where $n_j^*$ is the observed number at risk, $n_j$, at time $t_j$, augmented by a weighted sum of the individuals who had a competing event. The weights for each individual with a competing event are obtained through estimation of the censoring distribution, $S_c(t)$. The weight at time $t_j$ for individual i, who had a competing event at time $t_l \le t_j$, is

$$w_{ij} = \frac{\hat S_c(t_j)}{\hat S_c(t_l)} \qquad (5)$$

giving $n_j^* = n_j + \sum_{i \in D_j} w_{ij}$, where $D_j$ denotes the set of individuals with a competing event before time $t_j$. Note that the weight for those that experience a competing event is the probability of not being censored by time $t_j$ given that they had a competing event at time $t_l$. The censoring distribution, $S_c(t)$, is usually calculated using the Kaplan-Meier method. The weights are time-dependent and need to be calculated at each event time subsequent to the competing event.
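The equivalence of estimators 3 and 4 can be checked numerically. The Python sketch below (a tiny toy dataset with no tied times, and simplified tie conventions; not the stcrprep implementation) computes both forms, estimating the censoring distribution by Kaplan-Meier and weighting competing-event subjects as in equation 5.

```python
# Toy data: (time, cause); cause 0 = censored, 1 = event of interest,
# 2 = competing event. Times are distinct so tie conventions do not matter.
data = [(1, 2), (2, 1), (3, 0), (4, 1), (5, 0)]

def n_at_risk(t):
    return sum(1 for ti, _ in data if ti >= t)

def sc_minus(t):
    """Kaplan-Meier estimate of the censoring distribution just before t."""
    s = 1.0
    for tj, c in sorted(data):
        if tj < t and c == 0:
            s *= 1 - 1 / n_at_risk(tj)
    return s

def cif_kp(cause):
    """Equation (3): Kalbfleisch-Prentice form."""
    s_all, f = 1.0, 0.0
    for tj, c in sorted(data):
        if c == cause:
            f += s_all / n_at_risk(tj)       # s_all is S(t_{j-1}) here
        if c != 0:
            s_all *= 1 - 1 / n_at_risk(tj)
    return f

def cif_pl(cause):
    """Equation (4): product-limit form with the augmented risk set n_j*."""
    surv = 1.0
    for tj, c in sorted(data):
        if c != cause:
            continue
        n_star = n_at_risk(tj) + sum(
            sc_minus(tj) / sc_minus(ti)      # equation (5) weights
            for ti, ci in data if ci not in (0, cause) and ti < tj)
        surv *= 1 - 1 / n_star
    return 1 - surv
```

On this toy dataset both estimators give $\hat F_1(5) = 0.5$, illustrating the mathematical equivalence.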
2.1. Models for the cumulative incidence function
Although models that estimate the cause-specific hazards for each of the competing causes can be used to estimate the cause-specific cumulative incidence function using equation 2, they provide no direct estimate of covariate effects on the cause-specific cumulative incidence function. An alternative approach was proposed by Fine and Gray (1999), who model the subhazard function, defined as

$$\bar h_k(t) = \lim_{\Delta t \to 0} \frac{P(t \le T < t + \Delta t,\, D = k \mid T \ge t \,\cup\, (T < t \,\cap\, D \ne k))}{\Delta t}$$
The subhazard rate has quite an awkward interpretation since it is the event rate at time t for those who have not had the event of interest, but may have had a competing event. If the events are different causes of death then it is the event rate where people who have died from competing causes are still considered at risk of the event of interest.
An advantage of the subhazard function is that it is possible to derive the cause-specific CIF for cause k by the usual transformation from hazard to survival function,

$$F_k(t) = 1 - \exp\left(-\int_0^t \bar h_k(u)\, \mathrm{d}u\right) \qquad (6)$$
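The transformation in equation 6 can be verified numerically. In the Python sketch below (constant, made-up cause-specific hazards, so the CIF has a closed form), the subhazard is recovered as $F_1'(t)/(1 - F_1(t))$, integrated, and transformed back; the original CIF is recovered.

```python
import math

# Closed-form CIF for two constant cause-specific hazards (hypothetical values)
h1, h2 = 0.2, 0.5
F1 = lambda t: h1 / (h1 + h2) * (1 - math.exp(-(h1 + h2) * t))

def subhazard(t, eps=1e-6):
    """bar h_1(t) = F_1'(t) / (1 - F_1(t)), via a central difference."""
    deriv = (F1(t + eps) - F1(t - eps)) / (2 * eps)
    return deriv / (1 - F1(t))

def cum_subhazard(t, n=20_000):
    """Midpoint-rule integral of the subhazard over (0, t]."""
    dt = t / n
    return sum(subhazard((i + 0.5) * dt) * dt for i in range(n))
```

Applying equation 6 to the integrated subhazard, $1 - \exp\{-\bar H_1(t)\}$, reproduces $F_1(t)$ to numerical accuracy.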
Fine and Gray defined the proportional subhazards model as

$$\bar h_k(t \mid \mathbf{x}_i) = \bar h_{k,0}(t) \exp(\mathbf{x}_i \boldsymbol\beta) \qquad (7)$$
This is of the same form as a Cox proportional hazards model, but where the hazard function has been replaced by the subhazard function. The β coefficients give log subhazard ratios. The Fine and Gray model is similar to the Cox model in that the baseline subhazard is not directly estimated. Estimation of parameters for a proportional subhazards model is more complex than for the Cox model since those that experience competing events are kept in the risk set, but have time-dependent weights after their event time incorporated into the partial likelihood, with the weights being a function of the censoring distribution (Fine and Gray 1999). The weights are needed since they account for the fact that the probability that these observations could have been censored increases with follow-up time. Estimation is by maximizing the weighted partial likelihood,
$$L = \prod_{i=1}^{N} \left[\frac{\exp(\mathbf{x}_i \boldsymbol\beta)}{\sum_{j \in R_i} w_{ij} \exp(\mathbf{x}_j \boldsymbol\beta)}\right]^{d_i} \qquad (8)$$
where di is the event indicator and Ri is the set of observations, j, that are at risk at time ti. Note that equation 8 is the same as the partial likelihood for the Cox model with the addition of the weights, wij. These weights are calculated internally when using the stcrreg command to fit the Fine and Gray model. When the stcrprep command is used the data is expanded and the relevant weights are calculated. These weights are the same as those given in equation 5, so that the model can be estimated using stcox.
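A minimal Python sketch of equation 8 is given below (the risk-set weights are invented numbers standing in for censoring-based weights, and the data are a toy example). One defining property of a partial likelihood, weighted or not, is that it depends on covariates only through differences, so shifting every covariate by a constant leaves it unchanged.

```python
import math

# Toy data: each tuple is (event/censoring time, covariate value).
# Subject 0 has a competing event at t = 1.5 and stays in later risk sets
# with hypothetical, decaying censoring-based weights.
subjects = [(1.5, 0.5), (2.0, 0.0), (4.0, 1.0), (5.0, 1.0)]
event_times = [2.0, 4.0, 5.0]        # failure times for the event of interest
weight = {(2.0, 0): 1.0, (4.0, 0): 0.8, (5.0, 0): 0.6}   # w_ij for subject 0

def log_partial_likelihood(beta, xs):
    """Log of the weighted partial likelihood in equation (8)."""
    ll = 0.0
    for ti in event_times:
        i = next(j for j, (tj, _) in enumerate(subjects) if tj == ti)
        # risk set: naturally at risk, plus weighted-in competing events
        denom = sum(weight.get((ti, j), 1.0) * math.exp(beta * xs[j])
                    for j, (tj, _) in enumerate(subjects)
                    if tj >= ti or (ti, j) in weight)
        ll += beta * xs[i] - math.log(denom)
    return ll
```

Note that subject 0 contributes to every denominator with weight at most 1, exactly the role the $w_{ij}$ play in equation 8.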
3. Syntax
stcrprep [if] [in], events(varname) [byg(varlist) byh(varlist) censvalue(#) epsilon(#) keep(varlist) noshorten trans(numlist) wtstpm2 censcov(varlist) censdf(#) censtvc(varlist) censtvcdf(#)]
stcrprep is an st command and the data must be stset before using it. All events must be defined in the failure option of stset. The id option of stset is compulsory. The censoring distribution is estimated using the Kaplan-Meier method unless the wtstpm2 option is used.
3.1. Options
events(varname) specifies the variable that defines the different events. By default censored observations have the value 0. This can be changed using the censvalue() option.
byg(varlist) calculates the censoring weights separately by varlist.
byh(varlist) calculates the left-truncation weights separately by varlist.
censvalue(#) specifies the value that denotes a censored observation in the variable given in the events() option. By default this is zero.
epsilon(#) specifies the value added to the survival time when calculating the probability of censoring, to ensure that events occur before censoring. The default value is 0.000001.
keep(varlist) names of variables to keep in expanded dataset. This is generally a list of the variables that you want to include in the analysis.
noshorten specifies that rows with equal weights are not collapsed.
trans(numlist) lists the transitions of interest, i.e. the events contained in the events() option that you want to have as an event of interest in the analysis. By default all events are included. If only one event is of interest and the data is large then use this option to only create a data set for the specific event of interest.
The following options relate to fitting a parametric model for the censoring distribution using stpm2.
wtstpm2 requests that the censoring distribution is estimated by fitting a flexible parametric survival model using stpm2.
censcov(varlist) lists covariates to include in the model for the censoring distribution.
censdf(#) gives the degrees of freedom used for the baseline when using stpm2 to obtain the censoring distribution. The default is 5 df.
censtvc(varlist) gives any variables to be included as time-dependent effects when using stpm2 to estimate the censoring distribution.
censtvcdf(#) gives the degrees of freedom used for any time-dependent effects when using stpm2 to obtain the censoring distribution. The default is 3 df.
4. Examples 1
4.1. European Blood and Marrow Transplantation Data
In order to demonstrate the methods we use data for 1977 patients from the European Blood and Marrow Transplantation (EBMT) registry who received an allogeneic bone marrow transplantation. Time is measured in days from transplantation to either relapse or death. There is only one covariate of interest, the EBMT risk score, which has been categorized into 3 groups (low, medium and high risk). The data are available as part of the mstate R package (de Wreede et al. 2011).
The estimated cause-specific cumulative incidence functions for relapse and for death by risk group have been estimated using stcompet and are shown in Figure 1. The more severe the risk group the higher the probability of both relapse and death.
Figure 1.
Non-parametric estimates of the cause-specific CIF for relapse and death using stcompet
4.2. Using stcrreg
The Fine and Gray model, implemented in stcrreg, will first be applied so that subsequent model results can be compared. The output below shows the model where relapse is the outcome of interest.
. stset time, failure(status==1) scale(365.25) id(patid) noshow
id: patid
failure event: status == 1
obs. time interval: (time[_n-1], time]
exit on or before: failure
t for analysis: time/365.25
       1977  total observations
          0  exclusions
       1977  observations remaining, representing
       1977  subjects
        456  failures in single-failure-per-subject data
   3796.057  total analysis time at risk and under observation
                                          at risk from t =         0
                               earliest observed entry t =         0
                                    last observed exit t =  8.454483
. stcrreg i.score, compete(status==2) nolog noshow
Competing-risks regression                      No. of obs       =      1,977
                                                No. of subjects  =      1,977
Failure event:   status == 1                    No. failed       =        456
Competing event: status == 2                    No. competing    =        685
                                                No. censored     =        836

                                                Wald chi2(2)     =       9.87
Log pseudolikelihood = -3333.3217               Prob > chi2      =     0.0072

                              (Std. Err. adjusted for 1,977 clusters in patid)
------------------------------------------------------------------------------
             |               Robust
          _t |      SHR   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       score |
Medium risk  | 1.271221   .1554323     1.96   0.050     1.000333    1.615465
  High risk  | 1.769853   .3238535     3.12   0.002     1.236465    2.533337
------------------------------------------------------------------------------
The subhazard ratios (with low risk as the reference group) are estimated to be 1.27 (95% confidence interval, 1.00 to 1.62) and 1.77 (1.24 to 2.53) for the medium and high risk groups respectively. The fact that these subhazard ratios are greater than 1 is to be expected given the non-parametric estimates of the cause-specific CIF having higher values for the higher risk groups (Figure 1). The corresponding estimated subhazard ratios for death are 1.77 (1.42 to 2.21) and 2.67 (1.97 to 3.62); (output not shown).
4.3. Using stcrprep
The data will now be expanded using stcrprep and the appropriate weights calculated so that ‘standard’ survival methods can be used to estimate and model the cause-specific CIF.
The data first needs to be stset with all events defined as failures.
. stset time, failure(status==1,2) scale(365.25) id(patid)
id: patid
failure event: status == 1 2
obs. time interval: (time[_n-1], time]
exit on or before: failure
t for analysis: time/365.25
       1977  total observations
          0  exclusions
       1977  observations remaining, representing
       1977  subjects
       1141  failures in single-failure-per-subject data
   3796.057  total analysis time at risk and under observation
                                          at risk from t =         0
                               earliest observed entry t =         0
                                    last observed exit t =  8.454483
The data for subject 17 (patid==17) will be compared before and after using stcrprep. This subject died after 2.29 years and in the original dataset has just one row of data, as follows.
. list patid status _t0 _t _d if patid==17, noobs
  patid   status   _t0          _t   _d
     17     died     0   2.2888433    1
stcrprep is now run to expand the data.
. stcrprep, events(status) keep(score) trans(1 2) byg(score)
The events() option gives the variable defining all possible events, with censored observations taking the value specified in censvalue() (zero by default). The trans() option gives the transitions of the events of interest; here we are interested in the transitions to both relapse (status==1) and death (status==2); this is actually the default, but is shown here for clarity. The keep() option is used to list variables to retain in the expanded data; usually any covariates that will later be analyzed are included here. The byg() option requests that the censoring distribution used to generate the weights be estimated separately in the given groups; this is done here because we are first going to obtain a separate non-parametric estimate of the cause-specific CIF in each group. After using stcrprep the number of rows has increased from 1977 to 70262.
The expanded data for subject 17 is shown below.
. list failcode patid status tstart tstop weight_c weight_t if patid==17, ///
>        sepby(failcode) noobs
  failcode   patid   status    tstart     tstop   weight_c   weight_t
  -------------------------------------------------------------------
   relapse      17     died   0.00000   2.28884    1.00000          1
   relapse      17     died   2.28884   2.31622    0.99000          1
   relapse      17     died   2.31622   2.32717    0.98497          1
   relapse      17     died   2.32717   2.36003    0.97992          1
   relapse      17     died   2.36003   2.55441    0.91392          1
   relapse      17     died   2.55441   2.65845    0.89843          1
   relapse      17     died   2.65845   2.89938    0.85142          1
   relapse      17     died   2.89938   3.02806    0.80937          1
   relapse      17     died   3.02806   3.18960    0.76176          1
   relapse      17     died   3.18960   3.26626    0.74578          1
   relapse      17     died   3.26626   3.62765    0.63847          1
   relapse      17     died   3.62765   3.89870    0.59519          1
   relapse      17     died   3.89870   3.97536    0.57881          1
   relapse      17     died   3.97536   4.10951    0.55124          1
   relapse      17     died   4.10951   4.39425    0.51163          1
   relapse      17     died   4.39425   4.50103    0.47714          1
   relapse      17     died   4.50103   4.69815    0.45968          1
   relapse      17     died   4.69815   5.08419    0.37101          1
   relapse      17     died   5.08419   5.22656    0.32235          1
   relapse      17     died   5.22656   5.33607    0.30995          1
   relapse      17     died   5.33607   5.97673    0.22772          1
   relapse      17     died   5.97673   6.27515    0.20170          1
  -------------------------------------------------------------------
      died      17     died   0.00000   2.28884    1.00000          1
The rows have been divided based on the value of the newly created variable failcode. This variable will be used to fit different models depending on the event of interest. The variables patid and status are the same as in the non-expanded data. The variables tstart and tstop give the times an individual starts and stops being at risk; they change within an individual when their weight, stored in the variable weight_c, changes value. The variable weight_t gives the weights when there is left truncation. As there is no left truncation in these data, it takes the value 1 for all subjects at all times.
When failcode==1 this corresponds to relapse being the event of interest. As the subject with patid==17 died after 2.29 years (i.e. had a competing event), they are initially at risk until this time and receive a weight of 1 in the analysis. After their death they are still kept in the risk set, but their weight decreases. The decrease is based on the conditional probability of not being censored, which is estimated using a non-parametric (Kaplan-Meier) estimate of the censoring distribution. The weights only change at times when there is a failure for the event of interest and the value of the censoring distribution has changed.
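The row expansion for a subject like this can be sketched in a few lines of Python (illustrative only; the censoring distribution here is an invented exponential, and the exact tie conventions of stcrprep are glossed over): weight 1 up to the competing event, then one row per later failure time of the event of interest, weighted by the conditional probability of remaining uncensored.

```python
import math

def expand_subject(t_comp, later_times, sc):
    """Rows (tstart, tstop, weight_c) for a subject with a competing event
    at t_comp. After t_comp, each row covering (tstart, tstop] carries the
    conditional censoring weight sc(tstop) / sc(t_comp)."""
    rows = [(0.0, t_comp, 1.0)]          # fully weighted until the competing event
    start = t_comp
    for t in later_times:
        rows.append((start, t, sc(t) / sc(t_comp)))
        start = t
    return rows

sc = lambda t: math.exp(-0.1 * t)        # hypothetical censoring distribution

rows = expand_subject(2.29, [2.32, 2.36, 2.55], sc)
```

The rows abut (each tstart equals the previous tstop) and the weights are non-increasing, mirroring the listing for subject 17 above.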
When failcode==2 this corresponds to when death is the event of interest. Since this patient experienced the event of interest, they only require one row of data.
4.4. Plotting the cause-specific CIF using sts graph
We can use sts graph to plot the cause-specific CIF. We first need to stset the data, utilizing the information on the weights contained in the variable weight_c by specifying iweights.
. gen event = status == failcode
. stset tstop [iw=weight_c], failure(event) enter(tstart) noshow
failure event: event != 0 & event < .
obs. time interval: (0, tstop]
enter on or after: time tstart
exit on or before: failure
weight: [iweight=weight_c]
      70262  total observations
          0  exclusions
      70262  observations remaining, representing
       1141  failures in single-record/single-failure data
  13820.402  total analysis time at risk and under observation
                                          at risk from t =         0
                               earliest observed entry t =         0
                                    last observed exit t =  8.454483
We first create the variable event, defined as 1 if the event of interest occurs and zero otherwise. As the data have been split, we need to give information on the start time (tstart) and stop time (tstop) of each row of data.
We use sts graph in the usual way, but use the failure option as we are interested in the probability of relapse as opposed to the probability of not having a relapse (which includes the probability of death). For example, the cause-specific CIF for relapse can be plotted as follows,
. sts graph if failcode==1, by(score) failure ///
> ytitle("Probability of Relapse") ///
> xtitle("Years since transplantation") ///
> ylabel(0(0.1)0.5, angle(h) format(%3.1f)) ///
> legend(order(1 "Low Risk" 2 "Medium Risk" 3 "High Risk") ///
> cols(1) ring(0) pos(5)) ///
> scheme(sj) name(cif_relapse, replace)
The corresponding graph can be seen in Figure 2.
Figure 2.
Estimated CIFs after using stcrprep
This graph is the same as the left-hand panel of Figure 1, with the exception that the lines are extended to the maximum censoring time in each group. Alternatively, sts gen can be used to generate the cause-specific CIF, which can then be plotted with appropriate if statements to control the maximum follow-up time for each line.
4.5. Testing for differences in the CIF using sts test
Gray (1988) developed a modified log-rank test for testing differences between cause-specific CIFs. A similar, but slightly different, approach can be performed by applying a weighted log-rank test with weights defined in equation 5. This can be performed using sts test. Further work is required to evaluate the performance of this test under a range of different assumptions, but some simulation studies show that it appears to have good statistical properties (Mertsching 2013).
. sts test score if failcode==1
Log-rank test for equality of survivor functions

             |   Events     Events
score        |  observed   expected
-------------+---------------------
Low risk     |       79       99.64
Medium risk  |      328      324.33
High risk    |       49       32.04
-------------+---------------------
Total        |      456      456.00

                 chi2(2) =     13.37
                 Pr>chi2 =    0.0012
There is strong evidence of a difference in the probability of relapse between the groups.
4.6. Proportional subhazards model using stcox
When we previously used stcrprep the weights were calculated separately in each risk group as we were interested in calculating the cause-specific CIF separately for each of these groups. The original data is now reloaded and stcrprep is run again, but this time the censoring distribution used to generate the weights is calculated for the cohort as a whole, i.e. the byg() option is not used. This is because we want to fit the same model as that fitted by stcrreg, which does not calculate weights separately by subgroup.
. stset time, failure(status==1,2) scale(365.25) id(patid) noshow
id: patid
failure event: status == 1 2
obs. time interval: (time[_n-1], time]
exit on or before: failure
t for analysis: time/365.25
       1977  total observations
          0  exclusions
       1977  observations remaining, representing
       1977  subjects
       1141  failures in single-failure-per-subject data
   3796.057  total analysis time at risk and under observation
                                          at risk from t =         0
                               earliest observed entry t =         0
                                    last observed exit t =  8.454483
. stcrprep, events(status) keep(score) trans(1 2)
After defining the event indicator and using stset, a proportional subhazards model for relapse can be fitted as follows,
. gen event = status == failcode
. stset tstop [iw=weight_c], failure(event) enter(tstart) noshow
failure event: event != 0 & event < .
obs. time interval: (0, tstop]
enter on or after: time tstart
exit on or before: failure
weight: [iweight=weight_c]
     127730  total observations
          0  exclusions
     127730  observations remaining, representing
       1141  failures in single-record/single-failure data
  14418.735  total analysis time at risk and under observation
                                          at risk from t =         0
                               earliest observed entry t =         0
                                    last observed exit t =  8.454483
. stcox i.score if failcode == 1, nolog
Cox regression -- Breslow method for ties

No. of subjects =       72,880                  Number of obs    =     72,880
No. of failures =          456
Time at risk    =   6026.27434
                                                LR chi2(2)       =       9.63
Log likelihood  =   -3333.3112                  Prob > chi2      =     0.0081

------------------------------------------------------------------------------
          _t | Haz. Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       score |
Medium risk  |   1.271235   .1593392     1.91   0.056     .9943389    1.625238
  High risk  |   1.769899   .3219273     3.14   0.002     1.239148     2.52798
------------------------------------------------------------------------------
The output gives the subhazard ratios for the medium and high risk groups. The parameter estimates are the same as those produced by stcrreg in section 4.2 to 4 decimal places. The standard errors are slightly different because a clustered sandwich estimator has not been used: Geskus (2011) showed that the sandwich estimator is asymptotically unbiased, but less efficient than standard errors derived from the observed information matrix. If one uses pweights rather than iweights when using stset, together with the vce(cluster patid) option of stcox, then the standard errors are the same as those from stcrreg to 4 decimal places.
4.7. Time gains for large datasets
Fitting Fine and Gray models to large datasets is computationally intensive because each subject with a competing event has time-dependent weights. One of the advantages of the approach using stcrprep is that, after the data is expanded and the weights applied, it can be much quicker to fit the models. As an example, the EBMT data was increased in size by a factor of 10 (i.e. N = 19,770) and small random numbers were added to the event times to ensure there were no ties. Fitting the same model as above using stcrreg took just under 47 minutes on an Intel i7 @ 2.00GHz running Stata 13 MP2 on 64-bit Windows 7. Running stcrprep took just over 15 minutes and fitting the model using stcox took 1.5 seconds. The key point here is that stcrprep only needs to be run once, and thus the process of model building and comparison becomes much quicker.
4.8. Testing the proportional subhazards assumption
One of the advantages of fitting proportional subhazards models after using stcrprep is that some of the standard tools developed for the Cox model can be used. For example, we can test the assumption of proportional subhazards using the test based on Schoenfeld residuals.
. estat phtest, detail km
Test of proportional-hazards assumption

Time:  Kaplan-Meier
----------------------------------------------------------------
            |       rho            chi2       df       Prob>chi2
------------+---------------------------------------------------
1b.score    |         .               .        1             .
2.score     |  -0.16806           12.81        1        0.0003
3.score     |  -0.18701           15.80        1        0.0001
------------+---------------------------------------------------
global test |                     18.52        2        0.0001
----------------------------------------------------------------
It can be seen that there is strong evidence of non-proportional subhazards for both the medium and high risk groups. This is confirmed by inspection of the Schoenfeld residual plots produced using the estat phtest command with the plot() option. The graph for the medium risk group is shown in Figure 3.
Figure 3.
Assessing the proportional subhazards assumption using Schoenfeld residuals
See section 6.2 for how non-proportional subhazards can be modeled parametrically.
4.9. Estimation within one model
If interest lies in the cause-specific CIF for both events then two separate models can be fitted. Alternatively, the same parameters can be estimated from just fitting one model, stratified by event type. The code for this is as follows.
. stcox ibn.failcode#i(2 3).score, strata(failcode) nolog
Stratified Cox regr. -- Breslow method for ties

No. of subjects =      100,732                  Number of obs    =    100,732
No. of failures =        1,141
Time at risk    =  11077.05194
                                                LR chi2(4)       =      55.30
Log likelihood  =   -8337.8823                  Prob > chi2      =     0.0000

------------------------------------------------------------------------------
          _t | Haz. Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    failcode#|
       score |
        1 2  |   1.271235   .1593392     1.91   0.056     .9943389    1.625238
        1 3  |   1.769899   .3219273     3.14   0.002     1.239148     2.52798
        2 2  |   1.768168   .2002336     5.03   0.000      1.41622    2.207581
        2 3  |   2.671037   .4085735     6.42   0.000      1.97914    3.604819
------------------------------------------------------------------------------
                                                        Stratified by failcode
In this case the estimates are identical to fitting two separate models. The reason for modeling in this way is that in more complex models the effect of some covariates can be forced to be the same for the different event types. This may be beneficial when one of the events is rare.
4.10. Fitting cause-specific hazards models
It is possible to fit a cause-specific hazards model after using stcrprep, rather than reloading the original data, by restricting the analysis to the first row of data for each individual. For example, a cause-specific Cox proportional hazards model can be fitted for relapse as follows.
. bysort failcode patid (tstop): gen firstrow = _n == 1
. stcox i.score if failcode == 1 & firstrow, nolog
Cox regression -- Breslow method for ties

No. of subjects =        1,977                  Number of obs    =      1,977
No. of failures =          456
Time at risk    =  3796.057495
                                                LR chi2(2)       =      33.00
Log likelihood  =   -3161.9912                  Prob > chi2      =     0.0000

------------------------------------------------------------------------------
          _t | Haz. Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       score |
          2  |   1.539886    .193302     3.44   0.001      1.20403    1.969428
          3  |   2.959512   .5413436     5.93   0.000      2.06786    4.235638
------------------------------------------------------------------------------
Stacked cause-specific hazards models, similar to those in section 4.9, can also be fitted by stratifying by failcode.
5. Parametric Models
There are a number of situations where fitting a parametric model for the cause-specific CIF can be beneficial. Here we show that the general ideas of Geskus regarding data expansion and weighting can also be applied to parametric models. Previous parametric models for the cause-specific CIF have required the CIFs for all K causes to be modeled simultaneously and have used more standard survival distributions such as the Weibull and Gompertz distributions (Jeong and Fine 2006, 2007). Here we extend the ideas of the non-parametric models to allow modelling of a cause-specific CIF by incorporating the censoring weights. For the non-parametric estimates and the semi-parametric Fine and Gray model, the censoring distribution has to be evaluated at the event times of the event of interest. When fitting a parametric model for the CIF using weighted maximum likelihood, the censoring distribution is a continuous function of time. The contribution of the ith subject to the log-likelihood is defined as follows,
$$\ell_i = d_{1i} \ln \bar h_1(t_i) - \bar H_1(t_i) - d_{2i} \int_{t_i}^{t_{\max}} w_i(u)\, \bar h_1(u)\, \mathrm{d}u \qquad (9)$$
where $\bar h_1(t)$ is the subhazard function and $\bar H_1(t)$ the cumulative subhazard function of the event of interest. To implement this in practice, the integral in equation 9 can be approximated by splitting the time-scale after a competing event into a finite number of intervals and assuming that the weight is constant within each interval. The number of intervals for subject i, $N_i$, will vary as the length of follow-up depends on the time of their competing event. The likelihood becomes,
\ell_i = d_{1i} \ln \bar{h}(t_i) - \bar{H}(t_i) - d_{2i} \sum_{j=1}^{N_i} w_{ij} \left[ \bar{H}(t_{ij}) - \bar{H}(t_{i,j-1}) \right]    (10)

with t_{i0} = t_i.
where t_i denotes the time to an event or censoring with binary event indicators, d_{1i} and d_{2i}, for the primary (event 1) and competing (event 2) events, and t_{ij} gives the jth follow-up time after individual i has had a competing event. The weights, w_{ij}, give the conditional probability of remaining uncensored at the jth follow-up time for individual i. It is possible to use the Kaplan-Meier estimate of the censoring distribution, but instead a flexible parametric survival model is used. This makes evaluation of the censoring distribution relatively simple as there is an analytic expression for the conditional censoring distribution. For further details of the parametric approach see Lambert et al. (2016, submitted).
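As a purely illustrative sketch (not part of stcrprep), the following Python code shows how the piecewise-constant approximation of equation (10) mimics the integral term of equation (9) for a single subject with a competing event. The Weibull subhazard, the exponential censoring distribution and the 0.25-unit interval length are all hypothetical choices.

```python
import math

# Hypothetical subhazard: Weibull with shape p and scale lam
p, lam = 1.3, 0.15
subhaz = lambda t: lam * p * t ** (p - 1)   # h-bar(t)
cumsubhaz = lambda t: lam * t ** p          # H-bar(t)

# Hypothetical censoring survival function G(t): exponential, rate 0.1
G = lambda t: math.exp(-0.1 * t)

# A subject has a competing event at t_i = 2 and maximum follow-up 8
t_i, t_max = 2.0, 8.0

# After the competing event the weight is w_i(t) = G(t) / G(t_i)
w = lambda t: G(t) / G(t_i)

# "Exact" integral of w(u) * h-bar(u) over (t_i, t_max], fine midpoint rule
n = 100_000
step = (t_max - t_i) / n
integral = step * sum(
    w(t_i + (k + 0.5) * step) * subhaz(t_i + (k + 0.5) * step) for k in range(n)
)

# Piecewise-constant approximation as in equation (10): hold the weight
# fixed at the start of each 0.25-unit interval and sum the weighted
# increments of the cumulative subhazard.
every, approx, left = 0.25, 0.0, t_i
while left < t_max - 1e-12:
    right = min(left + every, t_max)
    approx += w(left) * (cumsubhaz(right) - cumsubhaz(left))
    left = right

print(f"integral={integral:.4f}  interval approximation={approx:.4f}")
```

The two numbers agree closely; shortening the interval length (the every() option below) tightens the approximation at the cost of more rows in the expanded data.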
The advantage of this approach is that after restructuring the data by creating extra rows for those with a competing event and calculating the weights we can use standard parametric survival models to estimate the cause-specific CIF as long as the software allows weighted likelihood estimates and delayed entry. In the following section we use flexible parametric survival models using the stpm2 command.
6. Examples 2
6.1. Parametric proportional subhazards model
In the following code, the wtstpm2 option of stcrprep requests that the censoring distribution be estimated parametrically using a flexible parametric survival model; the censdf(4) option requests 4 degrees of freedom (5 knots) for the spline used to estimate the survival function of the censoring distribution. The every(0.25) option means that the weights for those with a competing event will be estimated every 0.25 years after their competing event.
. stset time, failure(status==1,2) scale(365.25) id(patid) noshow

                id:  patid
     failure event:  status == 1 2
obs. time interval:  (time[_n-1], time]
 exit on or before:  failure
    t for analysis:  time/365.25

       1977  total observations
          0  exclusions
       1977  observations remaining, representing
       1977  subjects
       1141  failures in single-failure-per-subject data
   3796.057  total analysis time at risk and under observation
                                          at risk from t =         0
                               earliest observed entry t =         0
                                    last observed exit t =  8.454483

. stcrprep, events(status) keep(score) trans(1 2) wtstpm2 censdf(4) every(0.25)
After using stset in the same way as in section 4.6 a parametric model can be fitted to model the cause-specific CIF for relapse. Here a flexible parametric survival model is fitted using the stpm2 command. This means that the underlying subhazard function is estimated using a restricted cubic spline function. Other parametric models, such as those incorporated into streg, could also be used.
. stpm2 i.score if failcode == 1, df(4) scale(hazard) nolog eform
note: delayed entry models are being fitted

Log likelihood = -1678.9025                     Number of obs   =      23,673

------------------------------------------------------------------------------
             |     exp(b)   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
xb           |
       score |
          2  |   1.270361   .1592233     1.91   0.056     .9936652    1.624105
          3  |   1.769961   .3219309     3.14   0.002     1.239203    2.528048
       _rcs1 |   1.427308     .02829    17.95   0.000     1.372923    1.483846
       _rcs2 |   1.124042   .0158208     8.31   0.000     1.093458    1.155482
       _rcs3 |   1.038305   .0135277     2.89   0.004     1.012127     1.06516
       _rcs4 |   .9690918   .0078049    -3.90   0.000     .9539146    .9845105
       _cons |   .2090643   .0235494   -13.89   0.000     .1676481     .260712
------------------------------------------------------------------------------
The subhazard ratios are very similar to those obtained when using a weighted Cox model in section 4.6, as are the associated standard errors. Note that 5 additional parameters have been estimated that give a parametric form for the cause-specific CIF. Predictions of the cause-specific CIF can be obtained using the failure option of the predict command for stpm2.
. range temptime 0 7.2 200
(39,027 missing values generated)
. predict CIF1, failure at(score 1) timevar(temptime) ci
. predict CIF2, failure at(score 2) timevar(temptime) ci
. predict CIF3, failure at(score 3) timevar(temptime) ci
As the number of observations is now fairly large, the timevar() option of the predict command has been used. First a new variable, temptime, is created using range, with 200 observations equally spaced between 0 and 7.2. Using timevar() in conjunction with the at() option enables plotting predictions for 200 observations rather than the 39,227 values of _t, which would be the default. The ci option requests that confidence intervals be calculated, stored in the variables newvarname_lci and newvarname_uci. The baseline cause-specific CIF (score = 1) with a 95% confidence interval is plotted in the left panel of Figure 4, with the cause-specific CIFs for all three groups plotted in the right-hand panel.
Figure 4.
Prediction of cause-specific CIFs for relapse after using stpm2. Panel (a) shows the baseline cause-specific CIF with a 95% confidence interval and panel (b) shows the cause-specific CIFs for the three risk groups.
Various other predictions are available, for example the following code calculates the difference in cause-specific CIFs for relapse between the high and the low risk groups with a 95% confidence interval.
. predict CIF_diff, sdiff1(score 1) sdiff2(score 3) timevar(temptime) ci
These predictions are plotted in Figure 5. This gives an estimate of the difference in absolute risk of relapse between the two risk groups, accounting for the competing event of death. The reciprocal of the difference in CIFs estimates the Number Needed to Treat (NNT) accounting for competing risks (?).
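The NNT calculation is simply the reciprocal of the CIF difference; a minimal sketch, using a hypothetical difference of 0.125 at some time t:

```python
# Hypothetical difference in cause-specific CIFs at a given time t
cif_diff = 0.125

# Number needed to treat, accounting for competing risks
nnt = 1 / cif_diff
print(nnt)  # 8.0
```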
Figure 5.
Prediction of difference in cause-specific CIFs for relapse between high and low risk groups after using stpm2.
6.2. Non-proportional subhazards
One of the advantages of the flexible parametric survival approach is the ability to model non-proportional hazards. These advantages carry over to the models fitted after using stcrprep to allow for non-proportional subhazards. Time-dependent effects are fitted by adding interactions between the covariate of interest and some restricted cubic splines to the model. For more details of fitting time-dependent effects for these types of models, see Lambert and Royston (2009) and Royston and Lambert (2011). The time-dependent effects can be fitted through use of the tvc() and dftvc() options. Below is the code to fit a non-proportional subhazards model for relapse (proportional subhazards for death was a reasonable assumption).
. qui tab score, gen(score)
. stpm2 score2 score3 if failcode == 1, df(4) scale(hazard) nolog ///
>         tvc(score2 score3) dftvc(1)
note: delayed entry models are being fitted

Log likelihood = -1669.8042                     Number of obs   =      23,673

------------------------------------------------------------------------------
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
xb           |
      score2 |    .228696   .1254022     1.82   0.068    -.0170877    .4744797
      score3 |   .5271293   .1827853     2.88   0.004     .1688766     .885382
       _rcs1 |   .5427155   .0601134     9.03   0.000     .4248953    .6605356
       _rcs2 |   .1243894   .0141427     8.80   0.000     .0966702    .1521086
       _rcs3 |   .0414002   .0131365     3.15   0.002     .0156531    .0671473
       _rcs4 |  -.0286467   .0081218    -3.53   0.000    -.0445651   -.0127283
_rcs_score21 |  -.2132078   .0640181    -3.33   0.001     -.338681   -.0877347
_rcs_score31 |  -.2983195   .0758822    -3.93   0.000     -.447046    -.149593
       _cons |  -1.562087   .1126074   -13.87   0.000    -1.782793    -1.34138
------------------------------------------------------------------------------
The tvc() option of stpm2 does not allow factor variables, so dummy variables have been created using the tab command. The dftvc(1) option requests that the time-dependent effects are a simple linear function of log(time). Using more degrees of freedom would model the time-dependence using restricted cubic splines in the same way as for the baseline, but 1 df is sufficient here. After fitting the model, predictions of the cause-specific CIF can be obtained in the same way as above.
. predict CIF1_tvc, failure zeros timevar(temptime) ci
. predict CIF2_tvc, failure at(score2 1) zeros timevar(temptime) ci
. predict CIF3_tvc, failure at(score3 1) zeros timevar(temptime) ci
Note the use of the zeros option which forces all covariates to be zero except those listed in the at() option. The predictions can be plotted as before. The predictions from both the proportional subhazards and non-proportional subhazards model are shown in Figure 6 with the empirical cause-specific CIF shown for comparison (obtained using stcompet). The predicted values from the non-proportional subhazards model give a much closer fit to the empirical cause-specific CIFs.
Figure 6.
Comparison of (a) proportional and (b) non-proportional subhazards models for relapse. The dashed lines are the empirical estimates of the cause-specific CIFs obtained using stcompet.
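With dftvc(1) the log subhazard ratio is linear in log time, i.e. roughly log SHR(t) = beta + gamma*log(t), up to the internal scaling stpm2 applies to its spline basis. The sketch below plugs the score3 and _rcs_score31 estimates from the output above into this simplified form, so the numbers only indicate the shape of the time-dependence, not exact stpm2 predictions:

```python
import math

# Estimates from the non-proportional subhazards model above
beta = 0.5271293    # score3 main effect
gamma = -0.2983195  # _rcs_score31 interaction with log(time)

def shr(t):
    """Simplified time-dependent subhazard ratio, score 3 vs score 1."""
    return math.exp(beta + gamma * math.log(t))

# The subhazard ratio is largest early and declines over follow-up
for t in (0.5, 1.0, 5.0):
    print(f"t={t}: SHR={shr(t):.3f}")
```

This declining pattern is what produces the closer fit of the non-proportional model to the empirical cause-specific CIFs in Figure 6.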
6.3. Models on other scales
The models fitted using stpm2 above use the scale(hazard) option, which requests that models be fitted on the log cumulative subhazard scale. When modeling cause-specific hazards, other options when using stpm2 include proportional odds models (scale(odds)), probit models (scale(normal)) and the Aranda-Ordaz family of link functions (scale(theta)). For more details of these models see Royston and Lambert (2011). It is possible to use all these link functions for modelling of the cause-specific CIF. This is illustrated using a proportional odds model. The idea of using a logit link for the cause-specific CIF is similar to ideas published by Fine (1999), who used semiparametric transformation models, and by Klein and Andersen (2005), who used generalized estimating equations applied to pseudovalues to model the cause-specific CIF. The latter is implemented via the Stata command stpci (Parner and Andersen 2010).
The use of a logit link gives a proportional odds model for the CIF and makes the assumption that the odds of an event due to cause k occurring by time t (for any t) are proportional across covariate patterns. The model is as follows,

\ln\left( \frac{F_k(t \mid \mathbf{x}_i)}{1 - F_k(t \mid \mathbf{x}_i)} \right) = s(\ln t; \boldsymbol{\gamma}) + \mathbf{x}_i \boldsymbol{\beta}

where s(\ln t; \boldsymbol{\gamma}) is a restricted cubic spline function of log time.
An example of fitting a proportional odds model for the cause-specific CIF for relapse using stpm2 with scale(odds) is shown below,
. stpm2 i.score, scale(odds) df(4) nolog eform
note: delayed entry models are being fitted

Log likelihood = -4032.1604                     Number of obs   =      39,227

------------------------------------------------------------------------------
             |     exp(b)   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
xb           |
       score |
          2  |   1.657222   .1566403     5.34   0.000     1.376972    1.994509
          3  |   2.674231   .3717455     7.08   0.000     2.036447    3.511759
       _rcs1 |   1.371692   .0155809    27.82   0.000     1.341491    1.402572
       _rcs2 |   1.174037   .0109122    17.26   0.000     1.152843     1.19562
       _rcs3 |   1.031093   .0079927     3.95   0.000     1.015546    1.046878
       _rcs4 |   .9829622   .0047848    -3.53   0.000     .9736288    .9923851
       _cons |   .2555708   .0218471   -15.96   0.000     .2161462    .3021865
------------------------------------------------------------------------------
The odds of recurrence up to time t are 2.67 times higher in the high risk group when compared to the low risk group. This is a proportional odds model, so the odds ratio is assumed to apply for any t. This assumption could be relaxed through use of the tvc() and dftvc() options in the same way as in section 6.2.
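Under the proportional odds model the estimated odds ratio shifts a baseline CIF on the logit scale. The sketch below uses the 2.674 odds ratio from the output above together with a hypothetical baseline (score 1) CIF of 0.20 at some time t:

```python
import math

logit = lambda p: math.log(p / (1 - p))
expit = lambda x: 1 / (1 + math.exp(-x))

OR = 2.674231   # score 3 vs score 1, from the output above
F0 = 0.20       # hypothetical baseline (score 1) CIF at time t

# Proportional odds: add log(OR) to the baseline CIF on the logit scale
F3 = expit(logit(F0) + math.log(OR))
print(round(F3, 3))  # 0.401
```

Note that, unlike the log link discussed next, the logit link guarantees the resulting CIF stays between 0 and 1.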
A new link function has been added to stpm2 using scale(log) which uses a log link function, i.e. log(1 – S(t)). When used to model the CIF, the parameter estimates will be log relative risks of the cause-specific CIF, which has the advantage of a simple interpretation. This link function has previously been applied when using the direct binomial approach to model the cause-specific CIF (Gerds et al. 2012). The link function should be used with a degree of caution since it does not constrain the CIF to be below 1. However, for many data sets this will not lead to a problem in estimation.
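The caution about the log link can be seen with two lines of arithmetic; the relative risk used below is the value quoted in the text, and both baseline CIF values are hypothetical:

```python
RR = 1.87    # relative risk, score 3 vs score 1, as quoted in the text

# With a log link the covariate effect multiplies the baseline CIF:
#   F_k(t | x) = F_k0(t) * exp(x * beta)
F3 = round(0.20 * RR, 3)
print(F3)       # 0.374 -- a valid probability

# But nothing bounds the product at 1:
F_bad = round(0.60 * RR, 3)
print(F_bad)    # 1.122 -- exceeds 1, an invalid probability
```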
The following code uses a log link function (scale(log)),
. stpm2 i.score, scale(log) df(4) nolog eform
note: delayed entry models are being fitted

Log likelihood = -4038.4053                     Number of obs   =      39,227

------------------------------------------------------------------------------
             |     exp(b)   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
xb           |
       score |
          2  |   1.421112   .1060141     4.71   0.000     1.227804    1.644854
          3  |   1.870384   .1798566     6.51   0.000     1.549098    2.258305
       _rcs1 |   1.284626   .0111808    28.78   0.000     1.262898    1.306728
       _rcs2 |   1.169755   .0094023    19.51   0.000     1.151471    1.188329
       _rcs3 |   1.018619   .0071389     2.63   0.008     1.004722    1.032707
       _rcs4 |   .9817383   .0040202    -4.50   0.000     .9738903    .9896494
       _cons |   .2067648   .0142718   -22.84   0.000     .1806021    .2367175
------------------------------------------------------------------------------
The relative risk is 1.87 when comparing the high and low risk groups, i.e. the probability of relapse is 87% higher in the high risk group when compared to the low risk group. As for the proportional odds model, it is assumed that this relative effect is the same throughout time, i.e. the 87% increase is assumed to apply to the cumulative probability of an event for any t. We now relax this assumption through use of the tvc() and dftvc() options.
. stpm2 score2 score3, scale(log) df(4) nolog ///
>         tvc(score2 score3) dftvc(1)
note: delayed entry models are being fitted

Log likelihood = -4027.9917                     Number of obs   =      39,227

------------------------------------------------------------------------------
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
xb           |
      score2 |   .3558905   .0732233     4.86   0.000     .2123755    .4994055
      score3 |   .6257657    .096561     6.48   0.000     .4365097    .8150217
       _rcs1 |    .336392   .0270512    12.44   0.000     .2833726    .3894114
       _rcs2 |   .1591962   .0084767    18.78   0.000     .1425821    .1758103
       _rcs3 |   .0192598   .0072978     2.64   0.008     .0049564    .0335632
       _rcs4 |  -.0174127   .0040958    -4.25   0.000    -.0254403   -.0093851
_rcs_score21 |  -.0932552   .0287435    -3.24   0.001    -.1495914   -.0369191
_rcs_score31 |  -.1515066    .033426    -4.53   0.000    -.2170203   -.0859929
       _cons |   -1.58155   .0675385   -23.42   0.000    -1.713923   -1.449177
------------------------------------------------------------------------------
. partpred rr if score3==1 & failcode==1, ///
> for(score3 _rcs_score3*) ci(rr_lci rr_uci) eform
(36,492 missing values generated)
note: confidence intervals calculated using Z critical values
The time-dependent relative risk is plotted in Figure 7.
Figure 7.
Relative absolute risk predicted from flexible parametric survival model with log link function.
Given that this is a ratio of a cumulative probability, the time-dependent relative risk is unlikely to change rapidly as t increases.
It is still possible to estimate differences in cause-specific CIFs from models on other scales using the sdiff1() and sdiff2() options in the same way as in section 6.1, as well as to obtain time-dependent subhazard ratios using hrnumerator() and hrdenominator() and subhazard differences using hdiff1() and hdiff2().
7. Conclusion
I have described the use of stcrprep to restructure survival data so that standard survival analysis tools can be used to estimate and model competing risks data. The command has similar functionality to crprep in R, but has some additional options that are useful for fitting parametric survival models and when working with larger data sets.
One important advantage of using stcrprep followed by stcox, rather than stcrreg, to fit a Fine and Gray model is that it is faster. This is the case for fitting a single model, but the big advantage comes when fitting a number of models to the same dataset, for example during a model selection process, as the data expansion and generation of the weights only needs to be done once. A second advantage is that some of the standard tools and theory developed for the Cox model can be used for the competing risks model, for example the use of Schoenfeld residuals to assess the proportional hazards assumption.
It is also possible to model the censoring distribution and thus allow the weights to depend on covariates. When the weights are based on Kaplan-Meier estimates this is done through the byg() option, and when modelling the censoring distribution parametrically through the censcov(), censtvc() and censtvcdf() options. When using stcrreg it is not possible to allow the censoring distribution to depend on covariates. The issue of modelling the censoring distribution is unresolved and often a common censoring distribution is assumed. Gerds et al. (2012) and Scheike et al. (2008) have advocated modelling the censoring distribution, but further research is required to understand in which situations it is important to do this. If one jointly models the cause-specific CIFs of all causes under consideration then it is not necessary to model the censoring distribution (Jeong and Fine 2007).
The stcrprep command allows a parametric model for the cause-specific CIF to be fitted. Using the flexible parametric modelling framework of stpm2 enables a flexible parametric version of the Fine and Gray model, with the ability to directly obtain predictions of the cause-specific CIF both in and out of sample. These models also allow the proportional subhazards assumption to be relaxed by fitting interactions with time, or alternatively allow modelling on other scales by using different link functions.
In this article I have not covered situations where there is left truncation/delayed entry. With left truncation, subjects with a competing event need to be upweighted to represent those that could potentially enter the study after the competing event. Geskus used weights based on the distribution of entry times (Geskus 2011). Currently this is implemented when using a Kaplan-Meier based estimate of the distribution of entry times, but not for parametric models. This will hopefully be updated in a future release.
About the authors
Paul Lambert is a Professor of Biostatistics at the University of Leicester, UK. He also works part-time at the Department of Medical Epidemiology and Biostatistics, Karolinska Institutet, Stockholm, Sweden. His main interest is in the development and application of methods in population based cancer research.
References
- Coviello V, Boggess M. Cumulative incidence estimation in the presence of competing risks. The Stata Journal. 2004;4:103–112.
- de Wreede L, Fiocco M, Putter H. mstate: An R package for the analysis of competing risks and multi-state models. Journal of Statistical Software. 2011;38.
- Fine JP. Analysing competing risks data with transformation models. Journal of the Royal Statistical Society (B). 1999;61:817–830.
- Fine JP, Gray RJ. A proportional hazards model for the subdistribution of a competing risk. Journal of the American Statistical Association. 1999;94(446):496–509.
- Gerds TA, Scheike TH, Andersen PK. Absolute risk regression for competing risks: interpretation, link functions, and prediction. Statistics in Medicine. 2012;31(29):3921–3930. doi:10.1002/sim.5459.
- Geskus RB. Cause-specific cumulative incidence estimation and the Fine and Gray model under both left truncation and right censoring. Biometrics. 2011;67(1):39–49. doi:10.1111/j.1541-0420.2010.01420.x.
- Gray R. A class of K-sample tests for comparing the cumulative incidence of a competing risk. The Annals of Statistics. 1988;16:1141–1154.
- Jeong J-H, Fine JP. Direct parametric inference for the cumulative incidence function. Applied Statistics. 2006;55:187–200.
- Jeong J-H, Fine JP. Parametric regression on cumulative incidence function. Biostatistics. 2007;8(2):184–196. doi:10.1093/biostatistics/kxj040.
- Kalbfleisch JD, Prentice RL. The Statistical Analysis of Failure Time Data. New York: John Wiley and Sons; 1980.
- Klein JP, Andersen PK. Regression modeling of competing risks data based on pseudovalues of the cumulative incidence function. Biometrics. 2005;61(1):223–229. doi:10.1111/j.0006-341X.2005.031209.x.
- Lambert P, Wilkes SR, Crowther M. Flexible parametric modelling of the cause-specific cumulative incidence function. Statistics in Medicine. 2016 (submitted). doi:10.1002/sim.7208.
- Lambert PC, Royston P. Further development of flexible parametric models for survival analysis. The Stata Journal. 2009;9:265–290.
- Mertsching J. A comparison of different approaches to nonparametric inference for subdistributions. Master's thesis in mathematics, University of Amsterdam; 2013. https://esc.fnwi.uva.nl/thesis/centraal/files/f191267511.pdf.
- Parner ET, Andersen PK. Regression analysis of censored data using pseudo-observations. The Stata Journal. 2010;10:408–422.
- Putter H, Fiocco M, Geskus RB. Tutorial in biostatistics: competing risks and multi-state models. Statistics in Medicine. 2007;26(11):2389–2430. doi:10.1002/sim.2712.
- Royston P, Lambert PC. Flexible Parametric Survival Analysis in Stata: Beyond the Cox Model. College Station, TX: Stata Press; 2011.
- Royston P, Parmar MKB. Flexible parametric proportional-hazards and proportional-odds models for censored survival data, with application to prognostic modelling and estimation of treatment effects. Statistics in Medicine. 2002;21(15):2175–2197. doi:10.1002/sim.1203.
- Scheike TH, Zhang M-J, Gerds TA. Predicting cumulative incidence probability by direct binomial regression. Biometrika. 2008;95:205–220.