Author manuscript; available in PMC 2019 May 20.
Published in final edited form as: Stat Med. 2019 Jan 6;38(11):2030–2046. doi: 10.1002/sim.8085

Estimation of the Distribution of Longitudinal Biomarker Trajectories Prior to Disease Progression

Xuelin Huang 1,*, Lei Liu 2, Jing Ning 1, Liang Li 1, Yu Shen 1
PMCID: PMC6501595  NIHMSID: NIHMS1014542  PMID: 30614014

Abstract

Most studies characterize longitudinal biomarker trajectories by looking forward at them from a commonly used time origin, such as the initial treatment time. To better understand the relationship between biomarkers and disease progression, we propose to align all subjects by using their disease progression time as the origin and then looking backwards at the biomarker distributions prior to that event. We demonstrate that such backward-looking plots are much more informative than forward-looking plots when the research goal is to understand the shape of the trajectory leading up to the event of interest. Such backward-looking plotting is an easy task if disease progression is observed for all the subjects. However, when these events are censored for a significant proportion of subjects in the study cohort, their time origins cannot be identified and the task of aligning them cannot be performed. We propose a new method to tackle this problem by considering the distributions of longitudinal biomarker data conditional on the failure time. We use landmark analysis models to estimate these distributions. Compared to a naïve method, our new method greatly reduces estimation bias. We apply our method to a study for chronic myeloid leukemia patients whose BCR-ABL transcript expression levels after treatment are good indicators of residual disease. Our proposed method provides a good visualization tool for longitudinal biomarker studies for the early detection of disease.

Keywords: Biomarker, Disease recurrence, Landmark analysis, Survival analysis

1 |. INTRODUCTION

Early detection of disease and of disease progression or recurrence is critical for the treatment of cancer and many other diseases. Using biomarker measurements observed over time to predict a forthcoming failure event is thus of great interest. In cancer research, many biomarkers have been discovered for screening and prognosis. Having accurate estimations of the biomarker distributions over time can greatly improve decision making for cancer diagnosis and treatment. This is applicable not only to the initial diagnosis and treatment, but also to disease recurrence and subsequent salvage treatments.

Our motivating example is a study of chronic myeloid leukemia (CML), for which the majority (around 90%) of cases are caused by the oncogene BCR-ABL1. Tyrosine kinase inhibitors (TKIs) such as imatinib and dasatinib are very effective in blocking the expression of the BCR-ABL protein. Therefore, treatment with these TKIs usually results in a substantial decrease in the BCR-ABL expression level, which brings the disease under control. However, this is not a cure, for two reasons. First, patients need to take one of these TKI drugs every day without stopping. Second, the disease may become resistant to the drug, at which point the BCR-ABL level increases again. Figure 1 displays some examples of these down-and-up BCR-ABL trajectories.

FIGURE 1. BCR-ABL expression level trajectories showing down-and-up patterns for CML patients receiving treatment.

The transcript level of the BCR-ABL oncogene is an important indicator of the CML disease burden. After the initial treatment, patients are followed regularly to have their BCR-ABL levels measured. The current practice is to wait until clinical symptoms of disease progression are observed to consider new treatments, which may be a new TKI drug, potentially in combination with chemotherapy. However, if we can characterize the trend of BCR-ABL expression levels over time, this may help physicians better predict disease progression so as to initiate intervention earlier to better combat the disease. Avoiding disease progression is very important because a new episode of acute disease may shorten the patient’s survival time. To do that, it is important to understand the changing patterns of BCR-ABL expression levels in the one or two years prior to disease progression. This prompts us to develop methods to estimate the distribution of longitudinal biomarker trajectories by looking backwards from a failure event of interest, using that as the time origin for aligning the longitudinal data prior to that event, as opposed to the commonly used approach of “looking forward”. The differences between looking forward and looking backward are illustrated by Figure 2. In the top panel, we see the distribution of BCR-ABL expression levels by “looking forward”, using the initial treatment time as the origin. Viewing the BCR-ABL scatterplot in this way, it is difficult to identify any pattern of BCR-ABL expression level changes prior to disease progression. A locally weighted scatterplot smoothing (LOWESS) curve shows a decreasing pattern, which is hard to interpret. In the bottom panel of Figure 2, we “look backwards” at patients’ BCR-ABL expression level measurements from the time of disease progression, using these times as the time origin to align all the BCR-ABL measurements. This figure includes only patients for whom the time of disease progression was observed. It shows a clear increasing pattern of BCR-ABL expression values prior to disease progression.
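The backward alignment described above can be sketched in a few lines. This is a minimal illustration on hypothetical toy data (all values invented): each subject's measurement times are shifted so that the observed progression time becomes 0, and censored subjects are dropped because their origin is unknown.

```python
import numpy as np

# Hypothetical toy data: per-subject measurement times on the treatment
# clock, biomarker values, follow-up end time, and whether disease
# progression was actually observed (False means censored).
subjects = [
    {"times": np.array([0.0, 6.0, 12.0]), "marker": np.array([2.0, 1.0, 3.0]),
     "event_time": 14.0, "observed": True},
    {"times": np.array([0.0, 6.0]), "marker": np.array([2.5, 1.5]),
     "event_time": 9.0, "observed": False},  # censored: no event-time origin
]

def backward_align(subjects):
    """Re-index each subject's measurement times relative to the event time,
    so progression occurs at time 0 and earlier measurements get negative
    times. Censored subjects must be dropped: their origin is unknown."""
    aligned = []
    for s in subjects:
        if s["observed"]:
            aligned.append((s["times"] - s["event_time"], s["marker"]))
    return aligned

aligned = backward_align(subjects)
# the first (and only) aligned subject now has times [-14, -8, -2]
```

Pooling the aligned points across subjects and smoothing (e.g., with LOWESS) produces the bottom panel of Figure 2; the discarded censored subject illustrates exactly the information loss the proposed method is designed to recover.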

FIGURE 2. Top: Distribution of BCR-ABL expression levels over time for all subjects, using the beginning of treatment as the time origin. Bottom: Distribution of BCR-ABL expression levels over time for all subjects for whom the time of disease progression was observed, using disease progression as the time origin.

When the event of interest and the longitudinal marker data are observed for all subjects, the above tasks of “backwards looking and estimation” are relatively straightforward. However, when the event of interest is subject to right censoring, these estimations need special handling. The techniques commonly used in survival analysis for censored data may not be directly applicable due to the use of “looking backwards.” Ignoring data from subjects for whom the event times are censored may lead to estimation bias. The simulation studies in Section 3 demonstrate this bias and show that our proposed method greatly reduces bias.

Most statistical methods for longitudinal data are developed to “look forward”, i.e., to estimate the distributions and changes of biomarkers for patients who have received a specific treatment. Methods for “looking backwards” are rare, even though such methods are important for describing disease prognosis, early prevention and treatment, as elaborated above. Liu et al. proposed an ad hoc “turn back time” method to capture the end-of-life costs before death2. Chan and Wang used an inverse-probability-of-censoring-weighted (IPCW) method for backwards estimation of stochastic processes with failure events as the time origins3. Their nonparametric method uses longitudinal data from only individuals whose failure events are observed and ignores the longitudinal data from censored subjects. This may result in loss of efficiency. It is desirable to also use the observed portion of the longitudinal biomarker data for censored subjects. However, the difficulty is that, for a patient whose time to disease progression is censored, his/her time origin in the “looking backwards” plot is missing. We do not know where to put his/her biomarker measurements at post-treatment time t in the “looking backwards” plot.

In this article, we provide a solution for the problems described above. We use the observed longitudinal marker data from all subjects, no matter whether the event times are observed or right-censored. We use an infinite series of Cox proportional hazards models4 to conduct sequential landmark analyses, and impose smoothness conditions on these series of models. These sequential landmark analyses account for the association between the failure time and longitudinal biomarkers, conditioning on baseline covariates such as demographics and other potential risk factors. By conditioning on these factors, we can look at the biomarker distributions of any specific subpopulation. This usually gives a clearer picture than a mixture of distributions for a heterogeneous population. We do not assume any model for the biomarker trajectory distributions. In this sense, the biomarker distributions are estimated nonparametrically. As a first step for robust estimation and visualization of biomarker distributions of different subpopulations, this may serve to generate a biomedical hypothesis and check the goodness of fit for models with more biomarker distribution assumptions.

This article is organized as follows. The proposed “backwards looking and estimation” method is described in Section 2. It achieves nearly unbiased estimation for the biomarker trajectory distributions prior to disease progression, as demonstrated by the simulation studies in Section 3, whereas a naïve method yields severely biased results. In Section 4, the proposed method is applied to characterize the trajectory distribution of BCR-ABL transcript levels for CML patients prior to disease progression. A summary is provided and relevant issues are discussed in Section 5. Theoretical considerations, parameter estimation and simulation study details are provided in the Appendices.

2 |. METHOD

We first consider a simple situation in which the biomarker data are fixed baseline covariates, a (time-independent) m×1 vector Z. After obtaining an estimate of the distribution of Z conditional on the failure time T, we extend it to situations with longitudinal biomarker data Z(t), 0 ≤ t ≤ T. In this situation, denote the (time-independent) baseline covariates by Y. In all situations, denote the censoring time by C, and let X = min(T, C), Δ = I(T ≤ C), an indicator of whether the event time is observed, and the at-risk indicator R(t) = I(X ≥ t). For any t, let Z̄(t) = {Z(s): 0 ≤ s ≤ t}.

Regarding censoring events, we assume coarsening at random. This concept was formulated by Heitjan and Rubin5 and can be explicitly expressed in our situation as follows. Denote by h(⋅) the hazard function of the censoring time C. Let D̄(t) represent the information about all the data (except the censoring time) accumulated up to time t, i.e., D̄(t) = {I(T ≤ t), T·I(T ≤ t), Y, Z̄(t)}. The assumption is then that the censoring hazard at time t may depend on all the information observed up to time t, but not any further, namely, h{t | D̄(τ)} = h{t | D̄(t)}, where τ is the end of the follow-up time of the study.

For the simple situation with time-independent biomarker data Z_i for each subject i = 1, ⋯, n, we specify the hazard function of T_i as

λ_i(t | Z_i) = λ_0(t) exp(β′Z_i),  (1)

where λ_0(t) ≥ 0 is an arbitrary baseline hazard function, β is an m×1 unknown parameter vector, and β′ is its transpose (similar notation is used hereafter for all vectors). Based on this model, under the independent censoring assumption, the baseline covariate distribution conditional on the event time T = t, denoted as F_Z(z | t, β) = Pr(Z ≤ z | T = t, β), can be estimated by a weighted empirical distribution as

F̂_Z(z | t, β) = Σ_{i=1}^n I(Z_i ≤ z) π_i(t, β),  (2)

with the weight for the i-th subject being

π_i(t, β) = R_i(t) exp(β′Z_i) / Σ_{j=1}^n R_j(t) exp(β′Z_j).

This π_i(t, β) can be interpreted as the probability that the i-th subject fails at time t, given that one of the at-risk subjects fails at time t. The above estimator was provided by Theorem 1 of Xu and O’Quigley6. Note that 0 ≤ π_i(t, β) ≤ 1, and Σ_{i=1}^n π_i(t, β) = 1. The theorem shows that the distribution of covariate Z conditional on failure time T = t is equal to a weighted empirical distribution of Z, with weights π_i(t, β), i = 1, ⋯, n. When only the distribution of a subset of Z is of interest, e.g., Z = (Z_1, Z_2) and we are interested in Pr(Z_2 ≤ z_2 | T = t), we may simply apply the above results with z = (z_1, z_2) and z_1 = ∞. The estimator in (2) then becomes Σ_{i=1}^n I(Z_{2i} ≤ z_2) π_i(t, β).
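The weights π_i(t, β) and the weighted empirical distribution in (2) are straightforward to compute once β is given. The following sketch (toy values, arbitrary β) illustrates the construction; note that subjects no longer at risk at time t automatically receive weight zero.

```python
import numpy as np

def risk_weights(t, beta, Z, X):
    """pi_i(t, beta): the conditional probability that subject i is the one
    failing at time t among those still at risk (X_i >= t), under the Cox
    model lambda_i(t) = lambda_0(t) exp(beta' Z_i)."""
    R = (X >= t).astype(float)          # at-risk indicators R_i(t)
    w = R * np.exp(Z @ beta)
    return w / w.sum()

def cond_cdf(z, t, beta, Z, X):
    """F_hat_Z(z | t, beta) from (2): weighted empirical CDF of Z given T = t."""
    pi = risk_weights(t, beta, Z, X)
    return np.sum(np.all(Z <= z, axis=1) * pi)

# Toy illustration (all values arbitrary)
rng = np.random.default_rng(0)
Z = rng.normal(size=(5, 2))                      # 5 subjects, m = 2 covariates
X = np.array([3.0, 5.0, 2.0, 7.0, 4.0])         # observed times
beta = np.array([0.5, -0.2])
pi = risk_weights(3.0, beta, Z, X)
# weights are nonnegative and sum to 1; the subject with X = 2 gets weight 0
```

In practice β would be replaced by the partial-likelihood estimate from the fitted Cox model; the formula itself is unchanged.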

We may interpret the above theorem as a method for “looking backwards” to estimate the distribution of Z(0) conditional on the failure time being T = t. In a similar fashion, for a longitudinal biomarker trajectory Z(t), 0 ≤ t ≤ T, it is plausible to assume that it may have some changing pattern prior to failure time T. Knowing this pattern can help predict an imminent failure event. Therefore, we would like to estimate the biomarker distribution at all the time points in [T − a, T], where a specifies the length of the time period of interest, such as 12 months. This boils down to estimating the distribution of Z(T − u) for any 0 ≤ u ≤ a. Ideally, to obtain robust estimation results, this task should be done without assuming a parametric model for Z(t), t ≥ 0. Certainly we also need to account for baseline covariates that do not change over time (denoted by a p×1 vector Y). With these considerations, we assume a series of landmark analysis models7, by first specifying that, at baseline, conditional on baseline covariates Y_i, the survival time T_i has hazard function

λ_i(t | Y_i) = λ_0(t) exp(α′Y_i),  (3)

where λ_0(t) is an unspecified hazard function. Later, at time s > 0, conditional on T_i ≥ s and a q×1 biomarker vector Z_i(s), we assume the following working model for the hazard function of T_i at a future time s + u,

λ_i{s + u | T_i ≥ s, Y_i, Z_i(s)} = λ_{s0}(u) exp{α′Y_i + β(s)′Z_i(s)},  u > 0,  (4)

where λs0(u) ≥ 0 is an unspecified hazard function. This equation specifies a series of models as in (1), with a moving baseline at time s. It does not specify the full joint distribution of biomarker data and survival time. It is a working model for data analyses. It can be fitted by the two-stage approach provided by Huang et al7. The coarsening at random assumption specified earlier ensures that the independent censoring assumption holds for the above model (4) for any s ≥ 0.
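Fitting the sequence of landmark models in (4) starts from a stacked "landmark dataset": at each landmark time s, one keeps the subjects still at risk, resets the clock to u = X_i − s, and attaches the current biomarker value Z_i(s). The sketch below builds that data structure only (toy values, hypothetical linear trajectory); the actual estimation of α and the smoothed β(s) follows the two-stage approach of Huang et al7 and is not reproduced here.

```python
import numpy as np

def landmark_dataset(landmarks, X, delta, Y, Z_func):
    """For each landmark time s, keep subjects still at risk (X_i >= s),
    reset the clock to u = X_i - s, and attach the current biomarker
    Z_i(s). Each row feeds one Cox fit in the sequence of models (4)."""
    rows = []
    for s in landmarks:
        for i in range(len(X)):
            if X[i] >= s:
                rows.append((s, X[i] - s, delta[i], Y[i], Z_func(i, s)))
    return rows

# Toy data; Z_i(s) is a hypothetical linear trajectory
X = np.array([4.0, 8.0, 2.0])       # observed times
delta = np.array([1, 0, 1])         # event indicators
Y = np.array([0.0, 1.0, 0.0])       # a baseline covariate
Z = lambda i, s: 1.0 + 0.5 * s + 0.1 * i
rows = landmark_dataset([0.0, 3.0], X, delta, Y, Z)
# at s = 3, only the two subjects with X >= 3 contribute rows
```

Each landmark's rows would then be fitted with a standard Cox routine, with smoothness constraints tying the β(s) estimates together across landmarks as described in Appendix B.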

Then, conditional on T = t and considering the biomarker distribution at an earlier time s = t − u, we may view time s as a new “baseline” and apply Theorem 1 of Xu and O’Quigley6 to Z(s) (writing it as Z(t − u) since s = t − u). We denote its distribution conditional on survival time T = t as F_Z{z | t, u, α, β(t − u)} = Pr{Z(t − u) ≤ z | T = t, α, β(t − u)}. Then we can estimate it using a weighted empirical distribution as

F̂_Z{z | t, u, α, β(t − u)} = Σ_{i=1}^n I{Z_i(t − u) ≤ z} π_i{t, u, α, β(t − u)},  0 ≤ u < t,  (5)

with

π_i{t, u, α, β(t − u)} = R_i(t) exp{α′Y_i + β(t − u)′Z_i(t − u)} / Σ_{j=1}^n R_j(t) exp{α′Y_j + β(t − u)′Z_j(t − u)}.

The above equation provides an estimate of the distribution of the longitudinal biomarker Z(t − u) given that the failure time is T = t.

The above formula (5) is conditional on a fixed event time T = t = s + u to “look backwards” and estimate the distribution of Z(t − u) = Z(s). Next, we compute the distribution of Z(T − u) over all possible values of T. This involves an integration over the distribution of T. As mentioned, we consider a specific time window 0 ≤ u ≤ a. For this reason, we consider only the subset of subjects with T ≥ a. Moreover, the distribution of T is not identifiable beyond the maximum follow-up time τ. Consequently, we consider the distribution of Z(T − u) conditional on a ≤ T ≤ τ, and derive it as

G(z | u) = Pr{Z(T − u) ≤ z | a ≤ T ≤ τ} = ∫_a^τ Pr{Z(t − u) ≤ z | T = t} f(t) / {S(a) − S(τ)} dt,  (6)

where f(t) and S(t) are, respectively, the probability density function and the survival function of T. Using the corresponding notation with a hat to indicate estimators, denoting the observed survival times (subject to censoring) as x_i, i = 1, ⋯, n (with abuse of notation, assuming they are already sorted in ascending order), and applying the results in (5), the above integral can be estimated by

Ĝ(z | u) = Σ_{i=1}^n I(x_i ≥ a) F̂_Z{z | x_i, u, α̂, β̂(x_i − u)} [Ŝ(x_i−) − Ŝ(x_i)] / [Ŝ(a) − Ŝ(τ)]
= Σ_{i=1}^n ( I(x_i ≥ a) [Ŝ(x_i−) − Ŝ(x_i)] / [Ŝ(a) − Ŝ(τ)] × Σ_{j=1}^n I{Z_j(x_i − u) ≤ z} R_j(x_i) exp{α̂′Y_j + β̂(x_i − u)′Z_j(x_i − u)} / Σ_{k=1}^n R_k(x_i) exp{α̂′Y_k + β̂(x_i − u)′Z_k(x_i − u)} ).  (7)

Details for obtaining the estimators α̂ and β̂(t) (with smoothness constraints on β(t)) are provided in Appendix B, and Ŝ(t) is the Kaplan-Meier estimator of S(t).
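Once α̂, β̂(⋅) and the Kaplan-Meier increments are in hand, estimator (7) is a double sum. The sketch below implements that sum for scalar Y and Z; in the toy check the regression effects are set to zero, so the landmark weights reduce to a uniform distribution over each risk set, and the KM increments are supplied as hypothetical values rather than computed from the data.

```python
import numpy as np

def backward_cdf(z, u, a, x, delta, Y, Zfun, alpha, beta, km_jump):
    """Sketch of G_hat(z | u) in (7): average the landmark conditional CDFs
    F_hat_Z(z | x_i, u, ...) over observed event times x_i >= a, weighting
    each by its Kaplan-Meier jump S(x_i-) - S(x_i), renormalized over the
    window [a, tau]."""
    n = len(x)
    total, mass = 0.0, 0.0
    for i in range(n):
        if delta[i] == 1 and x[i] >= a and x[i] > u:
            t = x[i]
            R = (x >= t).astype(float)                  # at risk at t
            zvals = np.array([Zfun(j, t - u) for j in range(n)])
            w = R * np.exp(alpha * Y + beta(t - u) * zvals)
            pi = w / w.sum()                            # landmark weights
            F = np.sum(pi * (zvals <= z))               # F_hat_Z(z | t, u)
            total += km_jump[i] * F
            mass += km_jump[i]
    return total / mass if mass > 0 else float("nan")

# Toy check: alpha = 0 and beta(.) = 0 give uniform risk-set weights;
# km_jump holds hypothetical KM increments at each observed time.
x = np.array([2.0, 3.0, 4.0, 5.0])
delta = np.array([1, 1, 0, 1])
Y = np.zeros(4)
Zfun = lambda j, s: float(j)        # subject j's marker is constantly j
G = backward_cdf(1.0, 1.0, 2.0, x, delta, Y, Zfun,
                 alpha=0.0, beta=lambda s: 0.0,
                 km_jump=np.array([0.25, 0.25, 0.0, 0.5]))
```

The same loop with the indicator I{Y_j ≤ y, Z_j(x_i − u) ≤ z} yields the joint estimator (8).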

With the above estimated distribution, we can set the event time T as the origin and plot the mean of Z(T − u) against time −u (u > 0) (using negative values to indicate times prior to the event of interest), for a series of values of u such as 3, 6, 9 and 12 months. The distribution of the longitudinal biomarker trajectories Z(⋅) is then aligned by the failure time T as the origin. This makes it feasible and convenient to visually identify any informative patterns of biomarker change prior to failure. For example, if we observe a pattern of increasing biomarker values that is linear, quadratic, or exponential in the one year prior to disease progression, we can build a model for such a pattern and use it to predict the risk of imminent disease progression. We emphasize that such a pattern is very likely to be hidden if all the biomarkers are plotted forwards starting from the time of treatment initiation. This is because patients have different times of disease progression, and at any time point only a small fraction of patients present with increased biomarker values. This dilutes or masks the pattern of increasing biomarker values across patients in the year prior to disease progression.

The above equation (7) gives an estimator for the distribution of Z(T − u) marginalized over the baseline covariates Y. To estimate their joint distribution, we may simply use the estimator below for G(y, z | u) = Pr{Y ≤ y, Z(T − u) ≤ z} (with some altered notation).

Ĝ(y, z | u) = Σ_{i=1}^n ( I(x_i ≥ a) [Ŝ(x_i−) − Ŝ(x_i)] / [Ŝ(a) − Ŝ(τ)] × Σ_{j=1}^n I{Y_j ≤ y, Z_j(x_i − u) ≤ z} R_j(x_i) exp{α̂′Y_j + β̂(x_i − u)′Z_j(x_i − u)} / Σ_{k=1}^n R_k(x_i) exp{α̂′Y_k + β̂(x_i − u)′Z_k(x_i − u)} ).  (8)

For Y with a discrete distribution, we may estimate Pr{Y = y, Z(T − u) ≤ z} and then estimate Pr{Z(T − u) ≤ z | Y = y} = Pr{Y = y, Z(T − u) ≤ z} ∕ Pr(Y = y). Although in principle this can also be applied to Y with a continuous distribution, it may yield unstable estimation results. It is more reasonable to consider Pr{Z(T − u) ≤ z | y_1 ≤ Y ≤ y_2} for a continuous random variable Y. A sketch of the asymptotic properties for the proposed estimators, including consistency, normality and variance estimation, is provided in Appendix A. Other inferential methods, such as hypothesis testing, need further development and are deferred to future work.

3 |. SIMULATION

We assume a linear model for longitudinal biomarker trajectories, Z_i(t) = Z_i(0) + B_i t, i = 1, ⋯, n, and Cox models for failure risks, λ_i{t | Z̄_i(t)} = λ_0 exp[β(t)V{Z̄_i(t)}]. We use two scenarios for β(t). In the first scenario, we set β(t) ≡ β, a constant, and let V{Z̄_i(t)} be the current biomarker value Z_i(t). In the second scenario, we let β(t) = β log(t) and V{Z̄_i(t)} be the changing rate Z_i(t)∕t. Details are provided in Appendix C. The sample size is n = 200, and 1,000 replicates are conducted for each simulation setting.

Taking advantage of the data simulation, if we pretend that the above censoring mechanism does not exist, then all the values of T_i and the measurements of Z_i(t) prior to T_i are observed. Consequently, we can easily compute the true distribution of Z_i([T_i] − u) for integers u = 1, 2, 3, ⋯, up to a maximum time window of interest, such as 12 months. This Z_i([T_i] − u) is a close approximation of Z_i(T_i − u). The approximation is necessary because we generated Z_i(t) at integer times t only, and it is sufficient for practical use. We use the results obtained this way as the true biomarker distribution, and apply them to evaluate the estimation results when censoring is present.
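The "true" backward distribution used as the benchmark can be sketched as follows. This is a simplified stand-in, not the Appendix C design: the event times below are drawn from an arbitrary uniform distribution rather than from the Cox model, purely to show how the uncensored data yield the mean of Z_i(T_i − u) directly.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
Z0 = rng.normal(5.0, 1.0, n)     # intercepts Z_i(0)
B = rng.uniform(0.5, 1.5, n)     # positive slopes B_i
T = rng.uniform(6.0, 24.0, n)    # stand-in event times (NOT from the Cox model)

def true_backward_mean(u):
    """Mean of Z_i(T_i - u) with linear trajectories: since every T_i is
    'observed' in the uncensored simulated data, this is just an
    empirical average, no survival machinery needed."""
    return np.mean(Z0 + B * (T - u))

means = [true_backward_mean(u) for u in (1, 2, 3)]
# with positive slopes, the mean rises as u shrinks (closer to the event)
```

Comparing such benchmark curves against the proposed estimator and against the naïve estimator that drops censored subjects produces the bias comparisons shown in Figure 3.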

We apply the method proposed in Section 2 to each simulated data set. Figure 3 shows the estimated mean biomarker levels prior to disease progression, respectively by the proposed method and by a naïve method that ignores censored subjects, for each of the two scenarios described above. We see that, for both scenarios, by the proposed method, the estimated biomarker mean levels over time prior to disease progression are very close to the true mean levels; whereas the estimated biomarker mean levels obtained by the naïve method (ignoring censored subjects) are much lower than the true values. Checking the results from a single data set (in Figure 6), we also see that the naïve method substantially underestimates the mean biomarker levels prior to disease progression. If a patient uses these naïve results for risk assessment, he/she will overestimate his/her risk of disease progression. For example, if a patient’s biomarker level is 11, the solid or long dashed lines (true values and estimates obtained by the proposed method) indicate that his disease progression is about 8 months away. However, if he uses the dotted line (obtained by the naïve method), he would think his disease progression is only about 2 months away. Certainly, the degree of bias depends on the censoring rate. With a censoring rate of 50% (the two panels on the left-hand side), the bias is very severe in this case. When the censoring rate is 15% (the two panels on the right-hand side), the bias associated with the naïve method is reduced, but still substantial. In comparison, the estimates obtained by the proposed method are close to being unbiased.

FIGURE 3. Estimated mean biomarker levels prior to disease progression, by the proposed method (dashed line) and a naïve method that ignores censored subjects (dotted line), compared with the true values (solid line). Top panels: scenario 1 (constant regression coefficient β); bottom panels: scenario 2 (time-varying coefficient β(t) = β log(t)). Left panels: 50% censoring rate; right panels: 15% censoring rate.

FIGURE 6. Estimated mean biomarker levels prior to disease progression for a single simulated data set, by the proposed method (dashed line) and a naïve method that ignores censored subjects (dotted line), compared with the true values (solid line).

Chan and Wang3 proposed to use the inverse-probability-of-censoring-weighted (IPCW) method for “backward estimation”. The validity of their method rests on the important assumption that censoring is independent of the biomarker trajectory and the failure time. When this assumption is violated, their method may give substantially biased results, as shown in Appendix D. In Appendix E, we show that the asymptotic confidence intervals of our estimators match well with their empirical counterparts, and the coverage probabilities of our estimated 95% confidence intervals are all close to their nominal level.

To further illustrate the biomarker distributions prior to disease progression, Figure 4 displays their quantile distributions in the first scenario with constant regression coefficients. In this 2 × 3 matrix of plots, the left, middle and right columns are, respectively, the true values, the estimates by the proposed method, and the estimates by the naïve method ignoring censored subjects. The top and bottom rows are for the settings of 50% and 15% censoring rates, respectively. In each setting, our proposed method results in much smaller bias than the naïve method. The quantile distribution estimation results for the second scenario (with time-varying regression coefficients) lead to similar conclusions (not shown). These plots can help us understand the biomarker distributions over time and thus help in making decisions. For example, the biomarker measurements for a patient may indicate that his level is higher than the median at six months prior to disease progression, which would mean that a new intervention should be initiated for that patient. Whether to use the median or other quantiles can be determined by the treating physician based on his/her experience and/or results from further statistical reports on risk analysis.
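The quantile curves in Figure 4 are read off the estimated CDF Ĝ(z | u). For a weighted empirical distribution, this amounts to inverting the cumulative weights, sketched below with toy uniform weights; in the actual estimator the weights are the combined KM-increment and landmark weights from (7).

```python
import numpy as np

def weighted_quantile(values, weights, q):
    """q-quantile of a weighted empirical distribution: the smallest value
    whose cumulative (normalized) weight reaches q. The weights stand in
    for the pi_i / KM increments of the estimated backward CDF."""
    order = np.argsort(values)
    v = np.asarray(values, dtype=float)[order]
    w = np.asarray(weights, dtype=float)[order]
    cum = np.cumsum(w) / np.sum(w)
    return v[np.searchsorted(cum, q)]

# toy check with uniform weights over {1, 2, 3, 4}
med = weighted_quantile([1, 2, 3, 4], [1, 1, 1, 1], 0.5)   # 0.5 quantile
q75 = weighted_quantile([1, 2, 3, 4], [1, 1, 1, 1], 0.75)  # 0.75 quantile
```

Evaluating this at each u and at several q values traces out the quantile bands plotted in Figure 4.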

FIGURE 4. Quantile distributions of biomarker levels prior to disease progression for scenario 1 (constant regression coefficient β over time). Left column: truth; middle column: proposed method; right column: naïve method ignoring censored subjects. Top row: 50% censoring rate; bottom row: 15% censoring rate.

4 |. DISTRIBUTION OF BCR-ABL TRANSCRIPT LEVEL PRIOR TO DISEASE PROGRESSION

The expression level of the BCR-ABL oncogene is a good indicator of disease burden for patients with CML8. After treatment with imatinib or dasatinib, patients’ BCR-ABL expression levels usually drop significantly. During regular follow-up visits, BCR-ABL expression levels are measured. Although BCR-ABL expression levels fluctuate over time, a significant increase may indicate an elevated risk of disease progression. Therefore, it is helpful to use BCR-ABL expression levels to monitor disease progression, and initiate preventive treatments early. The first step is to estimate the trajectories of BCR-ABL expression levels prior to disease progression, which is the purpose of this article.

We consider a study of 670 CML patients for whom front-line therapy using imatinib had failed, and who were then enrolled in a trial to receive dasatinib8. Only 567 patients had BCR-ABL gene expression measurements assessed both before and after the dasatinib treatment. Our data analysis is based on these 567 patients. In total, there were 5,708 BCR-ABL measurements made from these 567 patients, ranging from 2 to 23 measurements for individual patients (mean = 9.3, median = 9.0 measurements). All the values of BCR-ABL are normalized to be between 0 and 100. A scatterplot of all these BCR-ABL measurements is shown in the top panel of Figure 2. From this plot, we can see that, on average, BCR-ABL levels decreased in the few months after the beginning of treatment, which is the time origin of this plot. However, it is difficult to visualize any pattern of change for BCR-ABL prior to disease progression. There are two reasons for this difficulty. First, data from all patients with and without disease progression are mixed together in this plot. Second, data from the different patients are not lined up using disease progression as the time origin. Consequently, the changing pattern of BCR-ABL before disease progression is “diluted” and thus is not easily seen.

To solve the above problem, we first use a crude way to help visualize the changing pattern of BCR-ABL before disease progression. We include only the 238 patients whose disease progression times are observed. For each of these patients, we denote his/her time of disease progression as zero, and the time before disease progression then has negative values. We draw a scatterplot of their BCR-ABL expression levels in this new time scale, and add a LOWESS curve to show the changing pattern (the bottom panel of Figure 2). This plot reveals an increasing trend of BCR-ABL levels prior to disease progression. Although it is revealing, it ignores censored patients and thus may lead to bias and loss of efficiency. This has motivated us to propose the method described in Section 2 to solve this problem.

We apply our method to compute the distribution of BCR-ABL expression levels prior to disease progression. Patients are lined up at their disease progression time, i.e., using disease progression as the time origin, and the number of months before disease progression is denoted by a negative number. We account for censored observations by assuming a model, as in equation (4).

Based on discussions with our clinical collaborators, we include two baseline variables and one time-dependent covariate: (1) Age60, a binary indicator of whether the baseline age is 60 years or older; (2) Highdose, a binary indicator of the high-dose level (140 mg vs. 100 mg); and (3) Bcrcurrt(t), the BCR-ABL level at time t > 0. Time is measured in months.

With the longitudinal measurements Bcrcurrti(t), we assume that

λ_i{t + u | T_i ≥ t, Age60_i, Highdose_i, Bcrcurrt_i(t)} = λ_{t0}(u) exp{α_1·Age60_i + α_2·Highdose_i + β(t)·Bcrcurrt_i(t)},  (9)

with the format of β(t) specified in Appendix B.

With the above model (9) serving as model (4) in Section 2, the distributions of BCR-ABL expression levels prior to disease progression can be computed through the formula in equation (7). We consider four groups of patients based on age and treatment dose level: ages younger than 60 years or not, crossed with high-dose or low-dose treatment. Note that although some patients had initial BCR-ABL values as high as 80, most likely those values dropped after treatment and came back up before disease progression was observed. The time window we choose is not meant to characterize the initial drop in BCR-ABL expression values, but to capture the changing trend prior to disease progression. With this in mind, we estimate and plot the mean and the 75% quantile of expression levels during the 18 months prior to disease progression for each of the above groups (Figure 5, black lines). We present the 75% quantiles rather than the medians because the medians are close to zero and thus not very informative: many patients had very low BCR-ABL transcript levels even prior to disease progression, so the increasing trend is not clearly reflected in the mean values, but it is clearly shown in the 75% quantiles, which are less affected by those low levels. The results in Figure 5 also indicate that the group with ages 60 years or older receiving high-dose dasatinib had a much clearer increasing pattern than the group with ages younger than 60 receiving low-dose dasatinib, while the other two groups had only slightly increasing patterns. For practical use, we may show quantile distributions such as the 80%, 90% and 95% quantiles. Using these distributions, physicians can help patients make treatment decisions. For example, they can tell whether a patient’s biomarker level is increasing or decreasing in terms of the quantile distribution (just as a pediatrician uses a child’s growth curve to tell parents that their child’s height is moving from the median to the 75% quantile). Moreover, such changes across quantiles can trigger the use of a new treatment. The mean values and the slopes of these biomarkers can also provide valuable information for patients to compare against their own values.

FIGURE 5. Estimated mean (solid lines) and 75% quantile (dashed lines) of BCR-ABL transcript levels of CML patients prior to disease progression, grouped by age (< 60 or ≥ 60 years) and treatment dose level (high or low), by the proposed method (black lines) and the IPCW method (gray lines), respectively.

Chan and Wang3 proposed the IPCW method for backward estimation of a stochastic process with failure events as the time origin. They used cumulative medical costs before death as a motivating example and formulated the problem differently from our treatment of the biomarker trajectory distribution prior to a failure event. We adapt their estimator to our setting to analyze the CML data set and present the estimated distributions of BCR-ABL expression levels prior to disease progression in Figure 5 (gray lines). We applied the IPCW method to each of the four groups defined by age and dose as described above: we estimated the within-group probability of not being censored (i.e., of disease progression being observed) and used its inverse as the weight to compute the mean and quantile distributions. Censored patients receive a weight of zero, so their longitudinal biomarker data are not used. This loss of information, together with the separate analyses for subgroups, may severely reduce effective sample sizes and produce unstable results, as indicated by, for example, a striking spike in the estimated 75% quantile for the high-dose group with ages younger than 60 years. In contrast, our proposed estimator is based on a regression model; it uses the longitudinal data from all subjects, including the censored ones, and does not require separate analyses for subgroups. Consequently, it is more likely to produce stable estimation results.
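The IPCW comparator can be sketched as follows; this is a simplified illustration in the spirit of Chan and Wang's estimator, not their exact formulation, with K_C denoting the censoring-time survival function (here supplied as a function argument rather than estimated).

```python
def ipcw_backward_mean(u, x, delta, Zfun, censor_surv):
    """IPCW sketch: only subjects with observed events contribute, each
    weighted by 1 / K_C(x_i), the probability of remaining uncensored at
    the event time. Censored subjects get weight zero, so their
    longitudinal data are discarded."""
    num, den = 0.0, 0.0
    for i in range(len(x)):
        if delta[i] == 1 and x[i] > u:
            w = 1.0 / censor_surv(x[i])
            num += w * Zfun(i, x[i] - u)   # marker u time units before event
            den += w
    return num / den if den > 0 else float("nan")

# toy check: with no censoring weight (K_C = 1) this reduces to the naive
# mean over observed events; the marker here is just elapsed time
m = ipcw_backward_mean(1.0, [2.0, 4.0, 6.0], [1, 0, 1],
                       lambda i, s: s, lambda t: 1.0)
```

The zero weight given to the censored subject (delta = 0) makes concrete the information loss discussed above, which our regression-based estimator avoids.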

5 |. DISCUSSION

In this article, we propose a method to look backwards from a failure event such as disease progression to observe the prior biomarker distributions. This can reveal interesting biomarker patterns that are not obvious if we look forward from the initiation of a treatment. Although this kind of statistical tool for visualization is very helpful for biomedical discovery, there is not much in the literature about such methods. We propose a new method that includes two features. The first is to incorporate the information from subjects with censored disease progression times. Failure to account for these subjects may result in substantial bias in the estimation, as demonstrated by the simulation studies. The second feature is to use regression models to provide estimators for biomarker trajectories conditional on baseline covariates, such as age and other demographic variables, without making assumptions on the shape or distribution of the biomarker trajectories.

The proposed estimation procedure provides consistent estimation when model (4) is correctly specified. In our simulations, however, model (4) was only a working model; the simulated data were not generated exactly from it, which led to a small bias. In the analysis of a real dataset, one can use model checking to guide the choice of flexible models and minimize the possibility of misspecification. The model we use for estimation does not specify a data-generating mechanism for the biomarkers or for the survival times conditional on the biomarkers. Our simulation studies show that, with this working model, some bias remains, but it is much smaller than that of a naïve method that ignores censored subjects. We regard this as a good property of the proposed method because, in practice, the data-generating mechanisms are unknown and usually do not satisfy the commonly used Cox proportional hazards assumption. With a potentially misspecified working model, our proposed method can still reduce bias, although it may not eliminate it completely. In practice, biomarkers have trajectories of different shapes: some spike shortly before a clinical event, while others increase slowly throughout. The estimation results depend on model (4) in the Method section, which is essentially a proportional hazards model with time-dependent functions incorporated to accommodate time-varying covariate effects. Commonly used techniques for checking the goodness-of-fit of the proportional hazards assumption, such as examining the relationship between martingale residuals (or Schoenfeld residuals9) and time, can be applied; as long as the working model (4) fits the data adequately, the proposed estimator is consistent. The proposed method can be applied to biomarker evaluation studies.
We can plot the trajectory distributions prior to disease onset for the biomarkers of interest, to observe their changing patterns and then build models accordingly. The R code to implement the proposed method is available upon request.

ACKNOWLEDGMENTS

This research was supported in part by the U.S. National Science Foundation through grant DMS 1612965 and by the U.S. National Institutes of Health through grants U54 CA096300, U01 CA152958, 5P50 CA100632 and R01 HS020263.

Appendices

A. DEVELOPMENT OF ASYMPTOTIC PROPERTIES

By expressing the smooth regression coefficient β(t) as a linear combination of basis functions such as fractional polynomials, β(t) can be represented by a finite-dimensional parameter, which we simply denote by β. Denote (5) as $F_Z(z \mid t, u, y, \alpha, \beta)$ (or simply F when its meaning is clear from the context, and similarly for other notations). Denote (7) as $\hat G(z \mid u, y, \hat\alpha, \hat\beta, \hat\Lambda_0(\cdot))$, let $S(t) = \Pr(T \ge t)$, and let $A(t) = \{1 - S(t)\}/\{S(a) - S(\tau)\}$. Then $\hat G$ can be written as $\hat G = \int_a^\tau \hat F \, d\hat A$. The estimated cumulative distribution function $\hat F_Z\{z \mid x_i, u, y, \hat\alpha, \hat\beta(x_i - u)\}$, the indicator function $I(t \ge a)$, and the Kaplan-Meier estimator $\hat S(t)$ are all bounded and monotonic, and thus belong to the Glivenko-Cantelli class by Theorem 2.7.5 of van der Vaart and Wellner10. By the permanence properties of the Glivenko-Cantelli class10 (Section 2.6.5), the function $1 - \hat S(t)$ is in the Glivenko-Cantelli class. Since the difference $\hat S(a) - \hat S(\tau)$ is bounded away from zero by the regularity conditions, the ratio $\{1 - \hat S(t)\}/\{\hat S(a) - \hat S(\tau)\}$ is also in the Glivenko-Cantelli class. By these results and the permanence properties of the Glivenko-Cantelli class, the function $\hat G(\cdot)$ is in the Glivenko-Cantelli class. The consistency of $\hat F$ for F can be established as done by Xu and O'Quigley6. The consistency of $\hat A$ for A follows from the properties of the Breslow estimator of the cumulative hazard function in the Cox proportional hazards model. Note that $\hat G \le 1$, i.e., it is bounded. The consistency of $\hat G$ for G can then be proven by applying the Glivenko-Cantelli theorem and the bounded convergence theorem11. As for the asymptotic variance of $\hat G$, there are two sources of variation: one caused by the estimation of the conditional probability F given the true values of the other parameters such as Λ(⋅), and the other caused by the estimation of Λ(⋅) and the other parameters. It can be shown (with a brief derivation given later) that

$$\sqrt{n}\,(\hat F - F) \rightsquigarrow \mathbb{G}_F, \qquad \sqrt{n}\,(\hat A - A) \rightsquigarrow \mathbb{G}_A,$$

where $\rightsquigarrow$ denotes weak convergence to the Gaussian processes $\mathbb{G}_F$ and $\mathbb{G}_A$, respectively. With these results, the functional delta method11 indicates that

$$\sqrt{n}\,(\hat G - G) \rightsquigarrow \int_a^\tau \mathbb{G}_A \, dF + \int_a^\tau \mathbb{G}_F \, dA.$$

Suppose that, with some algebra (shown below) and applying the standard asymptotic results of Andersen and Gill (1982), $\sqrt{n}(\hat F - F)$ can be represented as a sum of i.i.d. influence functions $n^{-1/2}\sum_{i=1}^n f_i$, and $\sqrt{n}(\hat A - A)$ similarly as $n^{-1/2}\sum_{i=1}^n g_i$. Then the asymptotic distribution of $\sqrt{n}(\hat G - G)$ is normal with mean zero and variance estimated by

$$\frac{1}{n}\sum_{i=1}^n\left(\int_a^\tau \hat g_i \, d\hat F + \int_a^\tau \hat f_i \, d\hat A\right)^2.$$

Denote the true values of α and β by α0 and β0, respectively. The derivation of the influence functions $f_i$ for $\hat F$ then proceeds as follows:

$$
\begin{aligned}
&\sqrt{n}\{\hat F(z \mid t,u,\hat\alpha,\hat\beta) - F(z \mid t,u,\alpha_0,\beta_0)\} \\
&\quad= \sqrt{n}\{\hat F(z \mid t,u,\hat\alpha,\hat\beta) - \hat F(z \mid t,u,\alpha_0,\hat\beta)\}
 + \sqrt{n}\{\hat F(z \mid t,u,\alpha_0,\hat\beta) - \hat F(z \mid t,u,\alpha_0,\beta_0)\} \\
&\qquad+ \sqrt{n}\{\hat F(z \mid t,u,\alpha_0,\beta_0) - F(z \mid t,u,\alpha_0,\beta_0)\} \\
&\quad= \sqrt{n}\,\frac{\partial \hat F(z \mid t,u,\alpha,\beta)}{\partial\alpha}\bigg|_{\alpha=\alpha_0,\beta=\hat\beta}(\hat\alpha-\alpha_0)
 + \sqrt{n}\,\frac{\partial \hat F(z \mid t,u,\alpha,\beta)}{\partial\beta}\bigg|_{\alpha=\alpha_0,\beta=\beta_0}(\hat\beta-\beta_0) \\
&\qquad+ \sqrt{n}\sum_{i=1}^n \pi_i(t,u,\alpha_0,\beta_0)\{I(Z_i(t-u)\le z) - F(z \mid t,u,\alpha_0,\beta_0)\} + o_p(1).
\end{aligned}
$$

In the above derivation, $\sqrt{n}(\hat\alpha - \alpha_0)$ and $\sqrt{n}(\hat\beta - \beta_0)$ have asymptotically normal distributions with mean zero and variances estimated by summations of i.i.d. influence functions given by12,7. The term $\pi_i(t, u, \alpha_0, \beta_0)$ involves the at-risk indicators $R_j(t) = I(T_j \ge t)I(C_j \ge t)$, $j = 1, \cdots, n$. Writing them simply as $R_j$ and their expectations as $e_j$, and denoting $\omega_j = \exp\{\alpha_0 y + \beta_0 Z_j(t-u)\}$ and $Z_i = Z_i(t-u)$, the third term can be written as a sum of i.i.d. terms (omitting some negligible terms):

$$\sqrt{n}\left(\sum_{j=1}^n e_j\omega_j\right)^{-2}\sum_{i=1}^n\Big[I(Z_i\le z)\,(1-R_i\omega_i)(R_i\omega_i-e_i\omega_i)+\{I(Z_i\le z)-F\}\,e_i\omega_i\Big].$$

Summarizing all of the above, we can write n(F^F) as a summation of i.i.d. influence functions. The derivation for the influence functions for A^ can be done using techniques similar to those shown above.

B. TIME-VARYING COEFFICIENT β(T) AND PARAMETER ESTIMATION

We rewrite model (4) as follows:

$$\lambda_i\{s+u \mid T_i \ge s, Y_i, Z_i(s)\} = \lambda_0(u)\exp\big[\alpha Y_i + \{\beta_0(s), \beta(s)\}\{1, Z_i(s)\}^{\top}\big], \quad u > 0.$$

Here, for ease of notation, we assume Z(t) is one-dimensional. Denoting ν = t + 1, we assume

$$\{\beta_0(t), \beta(t)\}\{1, Z(t)\}^{\top} = \beta_0(t) + \beta_1(t)Z(t) = \beta_{00} + \beta_{01}\ln(\nu) + \beta_{02}\nu + \beta_{03}\frac{1}{\nu} + \left\{\beta_{10} + \beta_{11}\ln(\nu) + \beta_{12}\nu + \beta_{13}\frac{1}{\nu}\right\}Z(t),$$

where ν = t + 1 is used instead of the original t to avoid log(0) and 1/0. In this way, the functions {β0(t), β1(t)} are replaced by an unknown coefficient vector denoted as β. The different baseline hazard functions $\lambda_{s0}(u)$ in model (4) are rewritten as $\lambda_0(u)\exp\{\beta_0(s)\}$.
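To illustrate, the fractional-polynomial expansion above can be sketched as follows; the function names and coefficient layout are ours, not from the paper.

```python
import numpy as np

def fp_basis(t):
    """Fractional-polynomial basis {1, ln(v), v, 1/v} with v = t + 1,
    shifted by 1 so that log(0) and 1/0 never occur at t = 0."""
    v = np.asarray(t, dtype=float) + 1.0
    return np.column_stack([np.ones_like(v), np.log(v), v, 1.0 / v])

def beta_linear_predictor(t, beta0_coefs, beta1_coefs, z):
    """Evaluate beta_0(t) + beta_1(t) * Z(t) for coefficient vectors
    (beta00, beta01, beta02, beta03) and (beta10, beta11, beta12, beta13)."""
    B = fp_basis(t)
    return B @ np.asarray(beta0_coefs, float) + (B @ np.asarray(beta1_coefs, float)) * np.asarray(z, float)
```

At t = 0 the basis evaluates to (1, 0, 1, 1), so the intercept part reduces to β00 + β02 + β03, which makes the shift by 1 easy to sanity-check.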

Huang et al.7 made explicit an inherent property of hazard functions, described below. Suppose the hazard function of a general survival time T is $\lambda_T(t)$. At time s > 0, given $T \ge s$, denote the hazard function of $T - s$ by $\lambda_{T-s}(u)$. Then it can be shown that $\lambda_{T-s}(u) = \lambda_T(s + u)$. Applying this property, we have

$$\lambda_{T_i - s}(u \mid T_i \ge s, Y_i) = \lambda_0(s+u)\exp(\alpha Y_i), \quad u > 0.$$

After incorporating the above property of hazard functions, Huang et al.7 used a two-stage approach for model specification and parameter estimation. The first stage considers the baseline information only, through the model $\lambda_i(t \mid Y_i) = \lambda_0(t)\exp(\alpha Y_i)$. The Cox partial likelihood is maximized to estimate α, and $\lambda_0(t)$ (viewed as a function with point masses at the event times) is estimated by the Breslow estimator13, using the baseline covariates $Y_i$ and the observed survival information $X_i$ and $\Delta_i$, $i = 1, \cdots, n$. The second stage specifies the hazard function conditional on the observed biomarker information, as in equation (4), and estimates β as shown below. Suppose patient i has biomarker measurements taken at time points $t_{ij}$, $j = 1, \cdots, n_i$, with $0 < t_{i1} < \cdots < t_{i,n_i} < x_i$. At each time point $t_{ij}$, the likelihood function connecting the biomarker value $z_i(t_{ij})$ and the final survival outcome $(x_i, \delta_i)$ is

$$
\begin{aligned}
L_{ij}\{\beta(t)\} &= L\{X_i = x_i, \Delta_i = \delta_i \mid T_i > t_{ij}, y_i, z_i(t_{ij}), \alpha, \lambda_0(\cdot)\} \\
&= \big[\lambda_0(x_i)\exp\{\alpha y_i + \beta(t_{ij})z_i(t_{ij})\}\big]^{\delta_i}
\exp\Big[-\sum_{k:\, t_{ij} < x_k \le x_i}\lambda_0(x_k)\exp\{\alpha y_i + \beta(t_{ij})z_i(t_{ij})\}\Big].
\end{aligned}
$$

We denote lij{β(t)} = log[Lij{β(t)}]. Then, using a working independence assumption among the different time points tij, j = 1,⋅⋅⋅,ni, we have the following “working” log-likelihood function:

$$
l\{\beta(t)\} = \sum_{i=1}^n l_i\{\beta(t)\} = \sum_{i=1}^n\sum_{j=1}^{n_i} l_{ij}\{\beta(t)\}
= \sum_{i=1}^n\sum_{j=1}^{n_i}\Big(\delta_i\big[\log\{\lambda_0(x_i)\} + \alpha y_i + \beta(t_{ij})z_i(t_{ij})\big]
- \exp\{\alpha y_i + \beta(t_{ij})z_i(t_{ij})\}\sum_{k:\, t_{ij} < x_k \le x_i}\lambda_0(x_k)\Big).
$$

Then we obtain estimators for β by maximizing the above log-likelihood function, with estimators for α and λ0(t) being plugged into it.
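As a sketch of the second stage, the working log-likelihood above can be computed as follows, with $\hat\alpha$ and the Breslow jumps of $\hat\lambda_0$ plugged in. The `records` data layout and function names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def working_loglik(beta, records, alpha_hat, event_times, lam0_jumps):
    """Working log-likelihood of the second stage, treating the biomarker
    measurements of one subject at different times as independent.

    beta        : callable giving the time-varying coefficient beta(t).
    records     : list of (x_i, delta_i, y_i, t_ij, z_ij), one per measurement.
    event_times : distinct observed event times.
    lam0_jumps  : Breslow jumps of the baseline hazard at event_times (plugged in).
    """
    ll = 0.0
    for (x, delta, y, t, z) in records:
        eta = alpha_hat * y + beta(t) * z
        # cumulative plugged-in baseline hazard over the interval (t_ij, x_i]
        cum = lam0_jumps[(event_times > t) & (event_times <= x)].sum()
        # delta multiplies the log-hazard term, so it vanishes for censored rows
        jump = lam0_jumps[event_times == x].sum()
        ll += delta * (np.log(jump + 1e-300) + eta) - np.exp(eta) * cum
    return ll
```

Maximizing this function over the fractional-polynomial coefficients of β(t) (e.g., with `scipy.optimize.minimize` on its negative) yields the second-stage estimator; the `1e-300` guard only protects the log for censored rows, where the term is multiplied by zero anyway.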

Fractional polynomials have basis functions $\{\ln(s), s^{\pm k}, k = \tfrac{1}{2}, 1, 2, \cdots\}$. Binder et al.14 compared the performance of cubic splines and fractional polynomials by simulation studies and concluded that cubic splines are good for situations with many local curvatures, while fractional polynomials are better for global smoothing without much local fitting. We believe the time-varying coefficient β(t) is a globally smooth function without many local curvatures, and therefore use fractional polynomials for the smoothing. We include an intercept term β0(t) in β(t) for the model in (4) to help make the hazard functions for different landmark time points s compatible; that is, we assume the terms exp{β0(s1)} and exp{β0(s2)} can at least partially absorb the incompatibility between the hazard functions of the landmark analyses at s1 and s2.

C. DATA GENERATING MECHANISMS FOR SIMULATION STUDIES

C.1. Constant regression coefficient

To generate data for the simulations, we use a linear random effects model for the biomarker trajectories {Zi(t), t ≥ 0}, decomposing the observed biomarker Zi(t) into the true value Mi(t) plus random variation εi(t), and assume that the hazard function of the failure time Ti satisfies the proportional hazards model with the true biomarker value as a time-dependent covariate, as follows.

$$Z_i(t) = M_i(t) + \varepsilon_i(t), \quad M_i(t) = B_i t, \quad \lambda_i(t \mid \bar M_i(t)) = \lambda_0\exp\{\beta M_i(t)\}, \quad i = 1, \cdots, n, \tag{10}$$

where Bi ~ uniform(b1, b2), and εi(t) ~ Normal(0, 1), independently across t’s and i = 1, ⋅⋅⋅, n. Note that although the above hazard function is conditioned on the entire history of the true values of Z(t), we assume that only the current true value of Z(t) is relevant. Then we have

$$\Lambda_i(t) = \int_0^t \lambda_0\exp(\beta B_i s)\,ds = \frac{\lambda_0}{\beta B_i}\{\exp(\beta B_i t) - 1\}.$$

By Si(t) = exp{−Λi(t)}, we obtain that

$$S_i(t) = \exp\left[-\frac{\lambda_0\{\exp(\beta B_i t) - 1\}}{\beta B_i}\right]. \tag{11}$$

Since Si(Ti) ~ uniform(0, 1), we generate Ui ~ uniform(0, 1) and let Si(Ti) = Ui. Plugging this into (11) and solving the resulting equation, we obtain Ti as

$$T_i = \frac{1}{\beta B_i}\log\left\{1 - \frac{\beta B_i\log(U_i)}{\lambda_0}\right\}.$$

In practice, the biomarker data Zi(t), 0 ≤ tTi can only be observed at discrete time points. We assume they are observed at t = 0, 1, 2, ⋅⋅⋅, min(100, [Ti]), where [Ti] denotes the integer part of Ti.
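The data-generating steps of C.1 can be sketched as follows; the function name and argument layout are illustrative, not the authors' R code.

```python
import numpy as np

def simulate_subject(b1, b2, lam0, beta, rng):
    """Generate one subject under (10): M_i(t) = B_i * t with B_i ~ U(b1, b2),
    then invert S_i(T_i) = U_i to obtain the failure time T_i."""
    B = rng.uniform(b1, b2)
    U = rng.uniform()
    # T_i = (1 / (beta * B)) * log(1 - beta * B * log(U) / lam0)
    T = np.log(1.0 - beta * B * np.log(U) / lam0) / (beta * B)
    # biomarker observed with N(0, 1) error at t = 0, 1, ..., min(100, [T_i])
    t_obs = np.arange(0, min(100, int(T)) + 1)
    Z = B * t_obs + rng.normal(0.0, 1.0, size=t_obs.shape)
    return T, t_obs, Z
```

Because log(U) < 0 and βB > 0, the argument of the outer log exceeds 1, so the generated T is always positive; plugging T back into (11) recovers U exactly, which is a convenient check of the inversion.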

C.2. Time-varying regression coefficient

To consider more general settings of data generation, we modify the constant regression coefficient β in the Cox model in equation (10) to have the time-varying coefficient as shown below,

$$\lambda_i(t \mid \bar M_i(t)) = \lambda_0\exp\{\beta\log(t)M_i(t)/t\}.$$

Then we can obtain that

$$S_i(t) = \exp\left(-\frac{\lambda_0\, t^{\beta B_i + 1}}{\beta B_i + 1}\right). \tag{12}$$

Again, using that Si(Ti) ~ uniform(0, 1), we generate Ui ~ uniform(0, 1) and let Si(Ti) = Ui. Plugging this into (12) and solving the resulting equation, we obtain Ti as

$$T_i = \left\{-\frac{\log(U_i)(\beta B_i + 1)}{\lambda_0}\right\}^{\frac{1}{\beta B_i + 1}}.$$
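The inverse-transform step for this time-varying setting can be sketched in the same illustrative style as before:

```python
import numpy as np

def failure_time_tv(B, U, lam0, beta):
    """Failure time under the time-varying model of C.2, inverting
    S_i(t) = exp{-lam0 * t^(beta*B + 1) / (beta*B + 1)} at S_i(T_i) = U."""
    p = beta * B + 1.0
    return (-np.log(U) * p / lam0) ** (1.0 / p)
```

As in C.1, substituting the returned time back into (12) recovers U exactly, which verifies the inversion.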

C.3. Censoring mechanisms and data analysis results

To generate the censoring times, assume that the hazard function for censoring time Ci depends on the biomarker true value Mi(t) as

$$h_i(t \mid \bar M_i(t)) = h_0\exp\{\gamma M_i(t)\}. \tag{13}$$

Then Ci can be generated similarly to the above, using another uniform(0, 1) random number Vi independent of Ui. This data-generating mechanism satisfies the assumption that, at any time point t, Ci is independent of Ti given $\bar M_i(t)$, Ti > t, and Ci > t, as indicated by (10) and (13).
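Since the censoring hazard (13) has the same functional form as the failure hazard of C.1 once M_i(t) = B_i t is substituted, the censoring time uses the same inversion with (h0, γ) in place of (λ0, β); this sketch and its names are illustrative.

```python
import numpy as np

def censoring_time(B, V, h0, gamma):
    """Censoring time under (13) with M_i(t) = B_i * t: hazard
    h0 * exp{gamma * B * t}, inverted at S_C(C_i) = V_i as in C.1."""
    return np.log(1.0 - gamma * B * np.log(V) / h0) / (gamma * B)
```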

The data analysis results of an example data set (with constant regression coefficient and 50% censoring rate) are shown in Figure 6.

C.4. The inverse-probability-of-censoring-weighting (IPCW) method

Chan and Wang3 proposed to use the inverse-probability-of-censoring-weighting (IPCW) method to estimate the cumulative medical cost from T − u to T, where T is the death time of a subject. They called this “backward estimation of stochastic processes with failure events as time origins.” Although different from our setting of estimating the mean biomarker value E[Z(T − u)], it can be adapted for our purpose. However, the validity of Chan and Wang’s method rests on the important assumption that censoring is independent of the biomarker trajectory and the failure time, and it does not account for the effects of baseline covariates. When censoring is independent of baseline covariates, longitudinal biomarkers, and the failure time, their approach provides consistent estimates. In other settings, however, such as our simulation settings in which the censoring time depends on the longitudinal biomarkers, the method of Chan and Wang may give substantially biased results. For example, for the simulation setting with a 50% censoring rate reported in Section 3 of the main text, the results of their method are shown in Figure 7 below. In contrast, our proposed method gives nearly unbiased estimates (see Figure 3 in the main body). Moreover, our approach does not need to estimate the distribution of the censoring time, which is a nuisance in most applications; this makes our approach more convenient to implement.

FIGURE 7

The IPCW method by Chan and Wang gives biased results. Legend: true values shown by solid line, IPCW estimates by dashed line, 95% confidence intervals by dotted lines.

We have shown in Section A the large-sample properties of our proposed estimator; in particular, it converges at a $\sqrt{n}$ rate. This ensures that its variance can be estimated by the bootstrap method, which is easy to implement. Theoretical variance formulas are available but are more tedious to program. Simulation results (for the first scenario in Section 3 of the main body, with constant regression coefficients) are shown in Figure 8. The figure also shows that the bootstrap confidence intervals of our estimators match well with their empirical counterparts, and the empirical coverage probabilities of the estimated 95% confidence intervals are all close to the nominal level. This validates the inference properties of our proposed method.
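For instance, a percentile bootstrap can be sketched as follows; this is a generic illustration (names and layout are ours), not the paper's implementation.

```python
import numpy as np

def bootstrap_ci(data, statistic, n_boot=1000, level=0.95, seed=0):
    """Nonparametric percentile bootstrap CI: resample units with replacement
    and take percentile limits of the resampled statistic."""
    rng = np.random.default_rng(seed)
    n = len(data)
    stats = np.array([statistic(data[rng.integers(0, n, n)]) for _ in range(n_boot)])
    lo, hi = np.percentile(stats, [100 * (1 - level) / 2, 100 * (1 + level) / 2])
    return lo, hi
```

In the survival setting the resampling should be done at the subject level, keeping each subject's entire longitudinal record and survival outcome together, so that the within-subject dependence is preserved.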

FIGURE 8

Top panels: The asymptotic 95% confidence intervals (CIs) and the empirical 95% CIs by the proposed method (all averaged over 1,000 simulations). Bottom: The coverage probabilities of the 95% CIs by the proposed method (dotted line: 95%). Legend for CIs: true values shown by solid line, estimates by dashed line, 95% CIs by dotted lines.

References

  • 1.Shet AS, Jahagirdar BN, Verfaillie CM. Chronic myelogenous leukemia: mechanisms underlying disease progression. Leukemia. 2002;16:1402–1411.
  • 2.Liu L, Wolfe RA, Kalbfleisch JD. A shared random effects model for censored medical costs and mortality. Stat Med. 2007;26(1):139–155.
  • 3.Chan KCG, Wang MC. Backward estimation of stochastic processes with failure events as time origins. Ann Appl Stat. 2010;4(3):1602–1620.
  • 4.Cox DR. Regression models and life tables (with discussion). J R Stat Soc Series B Stat Methodol. 1972;34:187–220.
  • 5.Heitjan DF, Rubin DB. Ignorability and coarse data. Ann Stat. 1991;19:2244–2253.
  • 6.Xu R, O’Quigley J. Proportional hazards estimate of the conditional survival function. J R Stat Soc Series B Stat Methodol. 2000;62:667–680.
  • 7.Huang X, Yan F, Ning J, Feng Z, Choi S, Cortes J. A two-stage approach for dynamic prediction of time-to-event distributions. Stat Med. 2016;35:2113–2296.
  • 8.Quintas-Cardama A, Choi S, Kantarjian H, Jabbour E, Huang X, Cortes J. Predicting outcomes in patients with chronic myeloid leukemia at any time during tyrosine kinase inhibitor therapy. Clin Lymphoma Myeloma Leuk. 2014;14(4):327–334.
  • 9.Therneau TM, Grambsch PM. Modeling Survival Data: Extending the Cox Model. New York: Springer; 2000.
  • 10.van der Vaart AW, Wellner JA. Weak Convergence and Empirical Processes. New York: Springer; 1996.
  • 11.van der Vaart AW. Asymptotic Statistics. Cambridge, UK: Cambridge University Press; 1998.
  • 12.Lin DY, Wei LJ. The robust inference for the Cox proportional hazards model. J Am Stat Assoc. 1989;84:1074–1078.
  • 13.Breslow NE. Discussion of the paper by D. R. Cox. J R Stat Soc Series B Stat Methodol. 1972;34:216–217.
  • 14.Binder H, Sauerbrei W, Royston P. Comparison between splines and fractional polynomials for multivariable model building with continuous covariates: a simulation study with continuous response. Stat Med. 2013;32(13):2262–2277.
