Biometrika. 2010 Jun 30;97(3):683–698. doi: 10.1093/biomet/asq036

Analysis of cohort studies with multivariate and partially observed disease classification data

Nilanjan Chatterjee 1, Samiran Sinha 2, W Ryan Diver 3, Heather Spencer Feigelson 3
PMCID: PMC3372244  PMID: 22822252

Summary

Complex diseases like cancers can often be classified into subtypes using various pathological and molecular traits of the disease. In this article, we develop methods for analysis of disease incidence in cohort studies incorporating data on multiple disease traits using a two-stage semiparametric Cox proportional hazards regression model that allows one to examine the heterogeneity in the effect of the covariates by the levels of the different disease traits. For inference in the presence of missing disease traits, we propose a generalization of an estimating equation approach for handling missing cause of failure in competing-risk data. We prove asymptotic unbiasedness of the estimating equation method under a general missing-at-random assumption and propose a novel influence-function-based sandwich variance estimator. The methods are illustrated using simulation studies and a real data application involving the Cancer Prevention Study II nutrition cohort.

Some key words: Competing-risk, Etiologic heterogeneity, Influence function, Missing cause of failure, Partial likelihood, Proportional hazard regression, Two-stage model

1. Introduction

Epidemiological researchers commonly use prospective cohort studies to investigate risk factors associated with the incidence of chronic diseases such as heart disease, diabetes and cancers. The proportional hazards model (Cox, 1972) is widely used to analyze data from cohort studies for the purpose of making inferences about covariate relative-risk/hazard-ratio parameters. In the standard Cox model, the disease of interest is treated as a single event and the time to incidence of the event is treated as the outcome of interest. In modern epidemiological studies, however, the disease of interest can often be classified into finer subtypes based on various pathologic and molecular traits of the disease. Although there has recently been tremendous progress in methods for using such disease classification data for the study of survival and prognosis after disease incidence, much less attention has been given to methods for incorporating such disease trait data into an etiologic investigation of a disease.

In this article, we develop methods for incorporating disease trait information into the analysis of cohort data with the aim of studying etiological heterogeneity, that is, whether the effects of the underlying risk factors vary for the different subtypes of a disease. The basic idea involves using a competing-risk framework to model the hazards of different disease subtypes separately. There are, however, two major analytic difficulties. First, a combination of multiple disease traits, some of which can be ordinal or even continuous, may define a very large number of disease subtypes. Treating each disease subtype as a separate entity, without imposing any further structure, will require a large number of parameters in the underlying Cox models, potentially causing problems of interpretation, inefficiency and power loss. Second, data on one or more disease traits are often missing for some of the cases. A complete-case analysis, using only those diseased subjects who have complete trait information, can result in both bias and inefficiency.

We propose dealing with these problems using a two-stage regression model coupled with an estimating equation inferential method. There are several novel aspects to our proposal. First, motivated by our earlier work on polytomous logistic regression models (Chatterjee, 2004), we propose to reduce the number of parameters in the competing-risk proportional hazards model by imposing a natural structure on the relative-risk parameters through the underlying disease traits. The parameters of the reduced model themselves are of scientific interest and are useful for testing etiologic heterogeneity in terms of the underlying disease traits. Second, for the purpose of inference, we propose a general extension of an estimating equation method of Goetghebeur & Ryan (1995) that was originally designed to deal with missing information on causes of failure in an unstructured competing-risk problem. The proposed extension allows one to incorporate the underlying structure of the competing events through a second-stage trait design-matrix, with the missing data being dealt with by taking suitable expectations of the design-matrix, given the observed covariate and trait data. Third, we prove unbiasedness of our estimating equation method under a more general missing-at-random assumption than has been considered before. Finally, we use empirical process theory to develop the asymptotic theory of our estimating equation method and an associated influence-function-based robust sandwich variance estimator under the general missing-at-random assumption. The finite sample properties of the proposed estimator are studied via simulation studies involving small and large numbers of disease subtypes. Moreover, we apply the proposed method to a dataset on breast cancer incidence and various histopathologic traits of breast tumours from the well-known Cancer Prevention Study II of the American Cancer Society.

2. Model and assumption

2.1. Notation and models for the cause-specific hazard

Suppose that in a cohort study of size n, each subject is followed until the first occurrence of the disease of interest or censoring. Following standard convention, let T denote the underlying time-to-event for the disease and C denote time to censoring. For standard cohort analysis, the outcome is represented by (Δ, V), where Δ = I(T < C) denotes the indicator of whether or not the disease occurred before censoring and V = min(C, T) denotes the time-to-censoring or time-to-disease, whichever occurs first. Let us assume that, if disease occurs (Δ = 1), the study collects data on K disease traits, Y = (Y1, . . ., YK), which could be, for example, various tumour characteristics. If the kth trait defines Mk categories for the disease, then one can define potentially a total of M = M1 × M2 × · · · × MK subtypes, based on all possible combinations of the various characteristics. Suppose that 𝒟Y denotes the set of all possible values of Y. Breast cancer patients, for example, are often classified into four categories based on the presence or absence of estrogen and progesterone receptors. Given that follow-up ends at the occurrence of any type of disease, the M subtypes can be treated as competing events. If X denotes a vector of covariates of interest, assumed without loss of generality to be time independent, one can use the proportional hazards model to specify the cause-specific hazard for each subtype of the disease as

$$h_y(t \mid X) = \lim_{h \to 0} h^{-1}\,\mathrm{pr}(t \le T < t + h,\ Y = y \mid T \ge t,\ X) = \lambda_y(t)\exp(X^{\mathrm T}\beta_y),$$

where λy(·) is the baseline hazard function and βy is the log hazard ratio parameter associated with the disease subtype y. A complication is that modern molecular epidemiologic studies collect data on an array of different traits, which can be represented by a mixture of categorical, ordinal and continuous variable(s). In the above approach, even with few covariates and disease traits, the total number of regression coefficients can become very large. In the following section, we consider reducing the number of these parameters by using a second-stage model, following an idea introduced by Chatterjee (2004) in the context of polytomous logistic regressions.

2.2. A second-stage regression model for the subtype-specific regression parameter

First, we focus on modelling the regression coefficients associated with a single covariate. The indexing of the different disease subtypes by the K underlying disease traits immediately suggests the following log-linear representation for the hazard-ratio parameters:

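$$\beta_{(y_1,\ldots,y_K)} = \theta^{(0)} + \sum_{k=1}^{K}\theta^{(1)}_{k(y_k)} + \sum_{k=1}^{K}\sum_{k' > k}^{K}\theta^{(2)}_{kk'(y_k,y_{k'})} + \cdots + \theta^{(K)}_{12\cdots K(y_1,\ldots,y_K)}, \tag{1}$$

where $\theta^{(0)}$ represents the regression coefficient corresponding to a referent subtype of the disease, the coefficients $\theta^{(1)}_{k(y_k)}$ represent the first-order parameter contrasts, $\theta^{(2)}_{kk'(y_k,y_{k'})}$ denote the second-order parameter contrasts, and so on.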

The above representation of the hazard ratio parameters suggests a natural and hierarchical way of reducing the number of parameters by constraining suitable contrasts to be zero. For instance, if we assume that the second and all higher-order contrasts are equal to zero, then (1) reduces to

$$\beta_{(y_1,\ldots,y_K)} = \theta^{(0)} + \sum_{k=1}^{K}\theta^{(1)}_{k(y_k)}, \tag{2}$$

and in this case the heterogeneity between two subtype-specific regression coefficients is $\beta_{(y_1,\ldots,y_{k-1},y_k,y_{k+1},\ldots,y_K)} - \beta_{(y_1,\ldots,y_{k-1},y_k^{*},y_{k+1},\ldots,y_K)} = \theta^{(1)}_{k(y_k)} - \theta^{(1)}_{k(y_k^{*})}$. Thus, in this model, the contrasts of the form $\theta^{(1)}_{k(y_k)} - \theta^{(1)}_{k(y_k^{*})}$ give a measure of the degree of etiologic heterogeneity among the disease subtypes with respect to the kth trait, holding the levels of the other traits constant. Further, for ordinal or continuous disease traits, one can impose ordering constraints on the contrast parameters by using models of the form

$$\theta^{(1)}_{k(y_k)} = \theta^{(1)}_{k}\, s^{(k)}_{y_k} \quad (y_k = 1,\ldots,M_k),$$

where $\{0 = s^{(k)}_{1} \le s^{(k)}_{2} \le \cdots \le s^{(k)}_{M_k}\}$ are a set of prespecified scores assigned to the $M_k$ levels of the kth trait. This model summarizes the degree of etiologic heterogeneity with respect to the kth characteristic in terms of a single regression coefficient $\theta^{(1)}_{k}$, with $\theta^{(1)}_{k} = 0$ implying no heterogeneity.

The additive model (2) could be extended to include interaction terms between pairs of disease traits, thus potentially allowing the degree of etiologic heterogeneity for one trait to vary with the level of the other trait, and vice versa. For ordinal traits, the interaction parameters can also be constrained to maintain their ordering by modelling them in terms of underlying continuous scores. Details of modelling may be found in Chatterjee (2004).

The second-stage model described above can be represented as $\beta = \mathcal{B}\theta$, where the design matrix $\mathcal{B}$ relates the coefficients β of the unstructured competing-risk Cox model to the parameters θ of the log-linear model. When there are multiple covariates of interest, say X = (X1, . . ., XP), then one can consider a separate two-stage model of the form $\beta_p = \mathcal{B}^{(p)}\theta_p$ for each of the covariates $X_p$. In the following exposition, we will assume for notational convenience that $\mathcal{B}^{(p)} = \mathcal{B}$ for all p.
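As an illustration of the mapping β = ℬθ, the following Python sketch builds the design matrix for the additive model (2) with level 1 of each trait as the referent. The function name `additive_design_matrix` and the column ordering are assumptions made for this example, not part of the original method description; the θ values are those used later in Section 4.2.

```python
import itertools
import numpy as np

def additive_design_matrix(levels):
    """Design matrix for the additive second-stage model (2).

    levels: number of categories M_k for each trait.
    Rows index subtypes (all combinations of trait levels, coded 1..M_k).
    Columns: theta^(0), then one first-order contrast for each
    non-referent level of each trait (level 1 is the referent).
    """
    subtypes = list(itertools.product(*[range(1, m + 1) for m in levels]))
    n_cols = 1 + sum(m - 1 for m in levels)
    B = np.zeros((len(subtypes), n_cols))
    for row, y in enumerate(subtypes):
        B[row, 0] = 1.0                      # referent coefficient theta^(0)
        offset = 1
        for k, m in enumerate(levels):
            if y[k] > 1:                     # non-referent level of trait k
                B[row, offset + y[k] - 2] = 1.0
            offset += m - 1
    return subtypes, B

# Two binary traits, as in the four-subtype breast cancer example.
subtypes, B = additive_design_matrix([2, 2])
theta = np.array([0.35, 0.0, 0.25])  # theta^(0), theta^(1)_{1(2)}, theta^(1)_{2(2)}
beta = B @ theta                     # subtype-specific log hazard ratios
```

With these inputs, four subtype-specific coefficients are described by three second-stage parameters; the saving grows quickly as traits and levels are added.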

3. Inference

3.1. Estimation methodology

If there are no missing data on any of the disease traits, inference on the covariate regression parameters can be performed based on a partial likelihood of the form

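$$L_n = \prod_{i:\Delta_i = 1}\left\{\frac{h_{Y_i}(V_i \mid X_i)}{\sum_{j=1}^{n} I(V_j \ge V_i)\, h_{Y_i}(V_i \mid X_j)}\right\} = \prod_{i:\Delta_i = 1}\left\{\frac{\exp\left(\sum_{p=1}^{P}\mathcal{B}_{Y_i}^{\mathrm T}\theta_p X_{ip}\right)}{\sum_{j=1}^{n} I(V_j \ge V_i)\exp\left(\sum_{p=1}^{P}\mathcal{B}_{Y_i}^{\mathrm T}\theta_p X_{jp}\right)}\right\}, \tag{3}$$

where Yi = (Yi1, . . ., YiK) denotes the observed disease traits for the ith subject given Δi = 1, i.e., for a case, and $\mathcal{B}_{Y_i}^{\mathrm T}$ represents the row of the design matrix $\mathcal{B}$ that corresponds to the trait values Yi. The maximum partial likelihood estimate of θ can be obtained by maximizing (3) without requiring any assumption about the subtype-specific baseline hazard functions λy(t). The score function associated with (3) can be written as

$$S_\theta^{*} = \sum_{\Delta_i = 1}\left\{\mathcal{B}_{Y_i} \otimes X_i - \frac{\sum_{j=1}^{n} I(V_j \ge V_i)\,\mathcal{B}_{Y_i} \otimes X_j \exp\left(\sum_{p=1}^{P}\mathcal{B}_{Y_i}^{\mathrm T}\theta_p X_{jp}\right)}{\sum_{j=1}^{n} I(V_j \ge V_i)\exp\left(\sum_{p=1}^{P}\mathcal{B}_{Y_i}^{\mathrm T}\theta_p X_{jp}\right)}\right\},$$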

where ⊗ denotes the Kronecker product. If data on some or all of the disease traits are missing for some study subjects, one can use (3) to perform a complete-case analysis involving only those cases that have complete data on all the disease traits. Such complete-case analyses, however, may potentially require discarding large numbers of cases, resulting in inevitable loss of efficiency and possibly bias due to nonrandom missingness.
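To make the structure of the partial likelihood (3) concrete, here is a minimal Python sketch of its logarithm for the complete-data case with a single scalar covariate (P = 1). The function name, the argument layout and the toy data are assumptions for illustration only.

```python
import numpy as np

def log_partial_likelihood(theta, V, delta, X, B_rows):
    """Log of the partial likelihood (3) for one scalar covariate.

    theta : (q,) second-stage parameters
    V     : (n,) observed times min(C, T)
    delta : (n,) case indicators
    X     : (n,) covariate values
    B_rows: (n, q) design row for the observed subtype of each case
            (rows for non-cases are never used)
    """
    total = 0.0
    for i in np.flatnonzero(delta == 1):
        beta_i = B_rows[i] @ theta       # log hazard ratio for subtype Y_i
        at_risk = V >= V[i]              # risk-set indicator I(V_j >= V_i)
        total += beta_i * X[i] - np.log(np.sum(np.exp(beta_i * X[at_risk])))
    return total

# Toy data: three subjects, two cases, a single second-stage parameter.
V = np.array([1.0, 2.0, 3.0])
delta = np.array([1, 0, 1])
X = np.array([0.5, -1.0, 2.0])
B_rows = np.ones((3, 1))
ll_null = log_partial_likelihood(np.array([0.0]), V, delta, X, B_rows)
ll_one = log_partial_likelihood(np.array([1.0]), V, delta, X, B_rows)
```

At θ = 0 each case contributes minus the log of its risk-set size, so `ll_null` equals −log 3 for the toy data. Note that the risk set is the same for every subtype; only the coefficient applied to the covariate changes with the subtype of the case.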

Several researchers have studied the problem of missing cause-of-failure with two underlying competing events. Dewanji (1992) proposed a partial-likelihood-based solution by assuming that the baseline hazards for the two causes of failure are proportional. Goetghebeur & Ryan (1995) exploited the same assumption but proposed a more robust estimating equation approach that is less sensitive to misspecification of the constant hazard-ratio assumption. Craiu & Duchesne (2004) proposed a full-likelihood-based solution allowing for flexible piecewise-constant modelling of the baseline hazard functions. Lu & Tsiatis (2005) considered a multiple-imputation method that estimates the odds of one failure type against the other as a function of failure time and covariates by parametric logistic regression analysis of the complete-case data. Gao & Tsiatis (2005) proposed an inverse probability-weighted complete case estimator and a doubly robust estimator in general linear transformation models. Although each of these methods has its own advantages, that of Goetghebeur & Ryan (1995) is particularly apt for extension to our setting, as it is quite efficient and yet relies minimally on the assumptions about the failure-type-specific baseline hazard functions. In particular, when there are no missing data on cause of failure, the method simplifies to the standard partial likelihood method. When there are missing data, though some assumptions about the inter-relationship among the event-specific baseline hazard functions are needed, the method remains relatively robust, as it relies on those assumptions only when they are needed for dealing with events with missing cause of failure. In the following, we propose a general extension of the approach of Goetghebeur & Ryan (1995) for a structured competing-risk problem where the competing events are defined by underlying related disease traits.

We first introduce some additional notation. Let Rk denote the indicator of whether the kth disease trait is observed, Rk = 1, or not, Rk = 0, for a diseased subject. Define R = (R1, . . ., RK) and let r = (r1, . . ., rK) denote a realization of R. We observe that r can take on $2^K$ possible values, each corresponding to a particular pattern of missing data for the disease traits. Suppose that 𝒟R denotes the set of all possible values of R. We will assume that the data are missing-at-random, in the sense that

$$\mathrm{pr}(R = r \mid T, X, Y, \Delta = 1) = \mathrm{pr}(R = r \mid T, X, \Delta = 1) = \pi^{(r)}(T, X), \tag{4}$$

i.e. the probabilities for the different patterns of missing data do not depend on the trait values Y. The model (4) is more general than that considered by Goetghebeur & Ryan, as they allowed the missingness probability to depend only on T, but not on X. For any given pattern of missingness r, partition the disease traits as $Y = (Y^{o_r}, Y^{m_r})$, where $Y^{o_r}$ and $Y^{m_r}$ denote the vectors of traits that are observed and missing, respectively. Now, we can write the hazard of the disease with a missingness pattern R = r and disease trait $Y^{o_r} = y^{o_r}$ as

$$h_{y^{o_r}}(t \mid X) = \int_{y^{m_r}} \lambda_{(y^{m_r},y^{o_r})}(t)\exp\{X^{\mathrm T}\beta_{(y^{m_r},y^{o_r})}\}\,d\mu(y^{m_r}),$$

where μ(·) denotes Lebesgue measure on the sample space of Ymr.

Define

$$Q_{y^{o_r}}^{y^{m_r}}(t, X) = \frac{h_{(y^{m_r},y^{o_r})}(t \mid X)}{h_{y^{o_r}}(t \mid X)}$$

and observe that $Q_{y^{o_r}}^{y^{m_r}}(t, X)$ can be interpreted as the conditional probability that a diseased subject with the incidence time t and the covariate value X has trait value $y = (y^{m_r}, y^{o_r})$, given the observed disease traits $y^{o_r}$. Now define

$$\mathcal{E}_{y^{o_r}}(\mathcal{B}_Y \mid T = t, X) = \int_{y^{m_r}} \mathcal{B}_{(y^{o_r},y^{m_r})}\, Q_{y^{o_r}}^{y^{m_r}}(t, X)\,d\mu(y^{m_r})$$

to be the conditional expectation of the design vector $\mathcal{B}_Y$ given the observed traits $y^{o_r}$, the time to first disease occurrence T = t and the covariate data X. Now, we propose to estimate θ based on

$$S_\theta \equiv \sum_{r \in \mathcal{D}_R} S_\theta^{(r)} \equiv \sum_{r \in \mathcal{D}_R}\ \sum_{\Delta_i = 1,\, R_i = r}\left\{\mathcal{E}_{Y_i^{o_r}}(\mathcal{B}_Y \mid T_i, X_i) \otimes X_i - \frac{\sum_{j=1}^{n} I(V_j \ge V_i)\,\mathcal{E}_{Y_i^{o_r}}(\mathcal{B}_Y \mid T_i, X_j) \otimes X_j\, h_{Y_i^{o_r}}(T_i \mid X_j)}{\sum_{j=1}^{n} I(V_j \ge V_i)\, h_{Y_i^{o_r}}(T_i \mid X_j)}\right\} = 0. \tag{5}$$

The unbiasedness of the left-hand side of (5) under the general missing-at-random model (4) is proven in the Appendix. In the absence of any missing data on disease traits, $\mathcal{E}_{Y_i^{o_r}}(\mathcal{B}_Y \mid T_i, X_i)$ corresponds to the row of the second-stage design matrix associated with the observed disease traits $Y_i$ and (5) is equivalent to the score equation associated with the partial likelihood (3). If we assumed discrete disease subtypes and did not impose any structure through the second-stage model, then $\mathcal{E}_{Y_i^{o_r}}(\mathcal{B}_Y \mid T_i, X_i)$ would simply correspond to the conditional probability vector for observing the different disease subtypes, given the observed traits, and (5) would essentially be the equation proposed by Goetghebeur & Ryan for dealing with two failure types.

The estimating equation (5) cannot be used by itself, as it involves the nuisance baseline hazard function λy(t). A complete nonparametric estimation of λy(t) may not be feasible because when the number of disease subtypes becomes large, the number of cases for the individual subtypes becomes sparse. In the following, we consider a semiparametric estimation approach for the baseline hazard functions. First, we express

$$\lambda_y(t) = \lambda_{(1,\ldots,1)}(t)\exp\{\alpha_y(t)\},$$

where λ(1,...,1)(t) denotes the baseline hazard associated with a reference disease subtype and exp{αy(t)} denotes the baseline hazard for the disease subtype y expressed as a multiple of that for the reference subtype. Similar to Goetghebeur & Ryan, we now invoke an assumption of a time-independent hazard-ratio, that is, αy(t) ≡ αy for all values of y. In addition, to overcome the potential sparsity problem when there are a large number of disease subtypes, we propose to specify exp(αy) using log-linear models analogous to those we used to specify the covariate hazard-ratio parameters exp(βy). We could, for example, consider an additive model of the form

$$\alpha_{(y_1,\ldots,y_K)} = \xi^{(0)} + \sum_{k=1}^{K}\xi^{(1)}_{k(y_k)}.$$

Let α = 𝒜ξ represent such a model, and let 𝒜Y denote the row of the design matrix 𝒜 that corresponds to the trait Y. Define

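$$h_u(t \mid X) = \int_{y \in \mathcal{D}_Y} \lambda_y(t)\exp(X^{\mathrm T}\beta_y)\,d\mu(y), \quad Q_u^{y}(X) = \frac{\exp(\alpha_y + X^{\mathrm T}\beta_y)}{\int_{y' \in \mathcal{D}_Y}\exp(\alpha_{y'} + X^{\mathrm T}\beta_{y'})\,d\mu(y')}, \quad \omega_u(X) = \int_{y \in \mathcal{D}_Y}\exp(\alpha_y + X^{\mathrm T}\beta_y)\,d\mu(y)$$

and observe that $h_u(t \mid X)$ denotes the marginal hazard for the disease integrated over all different values of the disease traits and $Q_u^{y}(X)$ is the probability that a subject has disease trait y, given simply that it is a case (Δ = 1) but ignoring all the other disease trait information.

Under the constant-hazard-ratio assumption αy(t) ≡ αy, the quantities $Q_{y^{o_r}}^{y^{m_r}}(t, X)$ and $h_{y^{o_r}}(t \mid X)$ in the definition of (5) can be replaced by

$$Q_{y^{o_r}}^{y^{m_r}}(X) = \frac{\exp\{\alpha_{(y^{m_r},y^{o_r})} + X^{\mathrm T}\beta_{(y^{m_r},y^{o_r})}\}}{\int_{y^{m_r}}\exp\{\alpha_{(y^{o_r},y^{m_r})} + X^{\mathrm T}\beta_{(y^{o_r},y^{m_r})}\}\,d\mu(y^{m_r})}$$

and

$$\omega_{y^{o_r}}(X) = \int_{y^{m_r}}\exp\{\alpha_{(y^{o_r},y^{m_r})} + X^{\mathrm T}\beta_{(y^{o_r},y^{m_r})}\}\,d\mu(y^{m_r}),$$

respectively. We further let

$$\mathcal{E}_{y^{o_r}}(\mathcal{A}_Y \mid X) = \int_{y^{m_r}} \mathcal{A}_{(y^{o_r},y^{m_r})}\, Q_{y^{o_r}}^{y^{m_r}}(X)\,d\mu(y^{m_r}) \quad\text{and}\quad \mathcal{E}_u(\mathcal{A}_Y \mid X) = \int_{y \in \mathcal{D}_Y} \mathcal{A}_y\, Q_u^{y}(X)\,d\mu(y).$$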
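For discrete traits, where μ can be taken to be counting measure, the conditional probabilities and conditional expectations above reduce to a normalized weighted sum over the subtypes compatible with the observed traits. The Python sketch below illustrates this under the constant-hazard-ratio assumption with a scalar covariate; the function name, the use of `None` to mark a missing trait and the dictionary layout for α and β are all assumptions made for the example.

```python
import itertools
import numpy as np

def conditional_design_expectation(x, observed, levels, design_row, alpha, beta):
    """Conditional subtype probabilities and expected design row.

    x         : scalar covariate value
    observed  : tuple of trait levels, with None for a missing trait
    levels    : number of categories M_k for each trait
    design_row: function mapping a subtype tuple to its design row
    alpha,beta: dicts keyed by subtype tuple (log baseline hazard ratio
                and covariate log hazard ratio)
    """
    # Subtypes compatible with the observed part of the trait vector.
    candidates = [
        y for y in itertools.product(*[range(1, m + 1) for m in levels])
        if all(o is None or o == yk for o, yk in zip(observed, y))
    ]
    log_w = np.array([alpha[y] + x * beta[y] for y in candidates])
    w = np.exp(log_w - log_w.max())   # shift for stability; cancels on normalizing
    q = w / w.sum()                   # conditional subtype probabilities
    rows = np.array([design_row(y) for y in candidates])
    return candidates, q, q @ rows    # last item: expected design row

levels = [2, 2]
all_subtypes = list(itertools.product([1, 2], [1, 2]))
alpha = {y: 0.0 for y in all_subtypes}   # hypothetical: equal baseline hazards
beta = {(1, 1): 0.35, (1, 2): 0.60, (2, 1): 0.35, (2, 2): 0.60}
row = lambda y: np.array([1.0, float(y[0] == 2), float(y[1] == 2)])

# First trait observed at level 2, second trait missing.
cand0, q0, e0 = conditional_design_expectation(0.0, (2, None), levels, row, alpha, beta)
cand1, q1, e1 = conditional_design_expectation(1.0, (2, None), levels, row, alpha, beta)
# Fully observed traits: the expectation collapses to the design row itself.
candf, qf, ef = conditional_design_expectation(1.0, (2, 2), levels, row, alpha, beta)
```

When every trait is observed only one subtype is compatible, so the probability vector is degenerate and the estimating equation reduces to the complete-data score, as noted in the text.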

Now, following Goetghebeur and Ryan (1995), we propose estimating ξ based on the partial likelihood

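$$L_n^{*} = \prod_{r \in \mathcal{D}_R}\ \prod_{R_i = r,\, \Delta_i = 1}\left\{\frac{h_{y_i^{o_r}}(V_i \mid X_i)}{\sum_{j=1}^{n} I(V_j \ge V_i)\, h_u(V_i \mid X_j)}\right\}.$$

The associated score function can be conveniently expressed as

$$S_\xi \equiv \sum_{r \in \mathcal{D}_R} S_\xi^{(r)} \equiv \sum_{r \in \mathcal{D}_R}\ \sum_{\Delta_i = 1,\, R_i = r}\left\{\mathcal{E}_{y_i^{o_r}}(\mathcal{A}_Y \mid X_i) - \frac{\sum_{j=1}^{n} I(V_j \ge V_i)\,\mathcal{E}_u(\mathcal{A}_Y \mid X_j)\,\omega_u(X_j)}{\sum_{j=1}^{n} I(V_j \ge V_i)\,\omega_u(X_j)}\right\}.$$

In our applications, we solved both sets of estimating equations $S_\theta = 0$ and $S_\xi = 0$ using the Newton–Raphson method. In principle, these equations can also be solved by an EM-type algorithm where the expectation steps will involve computing the conditional expectations $\mathcal{E}_{Y_i^{o_r}}(\mathcal{B}_Y \mid X_i, T_i)$ and $\mathcal{E}_{Y_i^{o_r}}(\mathcal{A}_Y \mid X_i)$, and then the M-step will involve solving partial-likelihood-type score equations of the form (5).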
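A generic Newton–Raphson iteration for a system of estimating equations can be sketched as follows. This is only an illustrative solver, not the authors' implementation: it approximates the derivative matrix by central finite differences, whereas an analytic derivative would normally be preferred, and the function name, tolerances and toy score function are assumptions.

```python
import numpy as np

def newton_raphson(score, start, tol=1e-8, max_iter=50, step=1e-6):
    """Solve score(eta) = 0 by Newton-Raphson with a numerical Jacobian.

    score: function mapping a (q,) parameter vector to a (q,) score vector
    start: initial parameter value
    """
    eta = np.asarray(start, dtype=float)
    for _ in range(max_iter):
        s = score(eta)
        if np.max(np.abs(s)) < tol:          # converged: score is numerically zero
            break
        jac = np.empty((s.size, eta.size))
        for k in range(eta.size):            # central-difference derivative, column k
            bump = np.zeros_like(eta)
            bump[k] = step
            jac[:, k] = (score(eta + bump) - score(eta - bump)) / (2.0 * step)
        eta = eta - np.linalg.solve(jac, s)  # Newton update
    return eta

# Toy system standing in for the stacked score (S_theta, S_xi).
toy_score = lambda e: np.array([e[0] ** 2 - 4.0, e[1] - 1.0])
root = newton_raphson(toy_score, [3.0, 0.0])
```

In the setting of this section, `score` would return the stacked vector of the two score functions evaluated at the stacked parameter vector.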

3.2. Asymptotic theory and variance estimation

In this section, we study the asymptotic theory for the proposed estimator. Unlike a martingale-based approach considered by Goetghebeur & Ryan, here we consider an empirical process representation of the score functions to derive the influence function of the proposed estimator and an associated robust sandwich estimator for the asymptotic variance. Application of the robust sandwich estimator, as opposed to a model-based estimator, in this setting is particularly appealing because of the possibility of the misspecification of the models for the baseline hazard functions as a function of time t and the disease trait y. In the following lemma, we first provide a result about asymptotic unbiasedness of the estimating functions.

Lemma 1. Under the regularity conditions listed in the Appendix and the missing-at-random mechanism (4), as n → ∞, the convergence results

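$$n^{-1}S_\theta \equiv n^{-1}\sum_{r \in \mathcal{D}_R} S_\theta^{(r)} \to 0, \qquad n^{-1}S_\xi \equiv n^{-1}\sum_{r \in \mathcal{D}_R} S_\xi^{(r)} \to 0$$

hold in probability at the true parameter values. Moreover, if $\pi^{(r)}(T, X) = \pi^{(r)}(T)$, then for each specific r,

$$n^{-1}S_\theta^{(r)} \to 0 \quad\text{and}\quad n^{-1}S_\xi^{(r)} \to 0,$$

in probability.

An interesting corollary of the above lemma is that the complete-case analysis, which corresponds to r = (1, . . ., 1), is unbiased when the missingness probability depends only on T, but not on X. Lemma 1 also shows that the martingale representation that Goetghebeur & Ryan used for their estimating function for each specific type of missing data pattern is invalid under the more general missing data model we consider. Now, we define some further notation. Let

$$S_\theta^{(1)}(V_i, Y_i^{o_r}) = n^{-1}\sum_{j=1}^{n} I(V_j \ge V_i)\,\mathcal{E}_{Y_i^{o_r}}(\mathcal{B}_Y \mid X_j) \otimes X_j\,\omega_{Y_i^{o_r}}(X_j), \qquad S_\xi^{(1)}(V_i, u) = n^{-1}\sum_{j=1}^{n} I(V_j \ge V_i)\,\mathcal{E}_u(\mathcal{A}_Y \mid X_j)\,\omega_u(X_j),$$

$$S^{(0)}(V_i, Y_i^{o_r}) = n^{-1}\sum_{j=1}^{n} I(V_j \ge V_i)\,\omega_{Y_i^{o_r}}(X_j), \qquad S^{(0)}(V_i, u) = n^{-1}\sum_{j=1}^{n} I(V_j \ge V_i)\,\omega_u(X_j),$$

and denote by $s^{(1)}(V_i, Y_i^{o_r})$, $s^{(0)}(V_i, Y_i^{o_r})$, $s^{(1)}(V_i, u)$ and $s^{(0)}(V_i, u)$ the corresponding population expectations. The parameter vector $\eta = (\theta^{\mathrm T}, \xi^{\mathrm T})^{\mathrm T}$ is estimated by solving

$$T_n = \begin{pmatrix} S_\theta \\ S_\xi \end{pmatrix} = 0.$$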

Define $\mathcal{I} = -\lim_{n \to \infty}(1/n)\,\partial T_n/\partial\eta$. The formulae for the various components of $\mathcal{I}$ are given in the Appendix. The following theorem summarizes the asymptotic property of the estimators.

Theorem 1. Under the regularity conditions listed in the Appendix, the estimating equation Tn =0 has a unique consistent sequence of solutions (θ̂n, ξ̂n) that is asymptotically normally distributed, with the influence function representation

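$$n^{1/2}\begin{pmatrix}\hat\theta_n - \theta_0 \\ \hat\xi_n - \xi_0\end{pmatrix} = \mathcal{I}^{-1} n^{-1/2}\sum_{i=1}^{n}\begin{pmatrix} J_{ni}^{\theta} \\ J_{ni}^{\xi}\end{pmatrix} + o_p(1),$$

where

$$J_{ni}^{\theta} = \Delta_i\sum_{r \in \mathcal{D}_R} I(R_i = r)\left\{\mathcal{E}_{Y_i^{o_r}}(\mathcal{B}_Y \mid X_i) \otimes X_i - \frac{s^{(1)}(V_i, Y_i^{o_r})}{s^{(0)}(V_i, Y_i^{o_r})}\right\} + \sum_{r \in \mathcal{D}_R} E_{\Delta,R,V,Y^{o_r}}\left[\Delta I(R = r) I(V \le V_i)\frac{\omega_{Y^{o_r}}(X_i)}{s^{(0)}(V, Y^{o_r})}\left\{\mathcal{E}_{Y^{o_r}}(\mathcal{B}_Y \mid X_i) \otimes X_i - \frac{s^{(1)}(V, Y^{o_r})}{s^{(0)}(V, Y^{o_r})}\right\}\right]$$

and

$$J_{ni}^{\xi} = \Delta_i\sum_{r \in \mathcal{D}_R} I(R_i = r)\left\{\mathcal{E}_{Y_i^{o_r}}(\mathcal{A}_Y \mid X_i) - \frac{s^{(1)}(V_i, u)}{s^{(0)}(V_i, u)}\right\} + E_{\Delta,V}\left[\Delta I(V \le V_i)\frac{\omega_u(X_i)}{s^{(0)}(V, u)}\left\{\mathcal{E}_u(\mathcal{A}_Y \mid X_i) - \frac{s^{(1)}(V, u)}{s^{(0)}(V, u)}\right\}\right].$$

To obtain an empirical sandwich variance estimator, we can estimate $J_{ni}^{\theta}$ and $J_{ni}^{\xi}$ by

$$\hat J_{ni}^{\theta} = \Delta_i\sum_{r \in \mathcal{D}_R} I(R_i = r)\left\{\mathcal{E}_{Y_i^{o_r}}(\mathcal{B}_Y \mid X_i) \otimes X_i - \frac{S^{(1)}(V_i, Y_i^{o_r})}{S^{(0)}(V_i, Y_i^{o_r})}\right\} + \sum_{r \in \mathcal{D}_R}\sum_{l=1}^{n}\Delta_l I(R_l = r) I(V_l \le V_i)\frac{\omega_{Y_l^{o_r}}(X_i)}{S^{(0)}(V_l, Y_l^{o_r})}\left\{\mathcal{E}_{Y_l^{o_r}}(\mathcal{B}_Y \mid X_i) \otimes X_i - \frac{S^{(1)}(V_l, Y_l^{o_r})}{S^{(0)}(V_l, Y_l^{o_r})}\right\}$$

and

$$\hat J_{ni}^{\xi} = \Delta_i\sum_{r \in \mathcal{D}_R} I(R_i = r)\left\{\mathcal{E}_{Y_i^{o_r}}(\mathcal{A}_Y \mid X_i) - \frac{S^{(1)}(V_i, u)}{S^{(0)}(V_i, u)}\right\} + \sum_{l=1}^{n}\Delta_l I(V_l \le V_i)\frac{\omega_u(X_i)}{S^{(0)}(V_l, u)}\left\{\mathcal{E}_u(\mathcal{A}_Y \mid X_i) - \frac{S^{(1)}(V_l, u)}{S^{(0)}(V_l, u)}\right\}.$$

The sandwich variance estimator for $\hat\eta = (\hat\theta^{\mathrm T}, \hat\xi^{\mathrm T})^{\mathrm T}$ can now be obtained as $\hat{\mathcal{I}}^{-1}\hat V\hat{\mathcal{I}}^{-\mathrm T}$, where $\hat{\mathcal{I}} = \partial T_n/\partial\eta$ and $\hat V = \sum_{i=1}^{n}\hat J_{ni}\hat J_{ni}^{\mathrm T}$. See the Supplementary Material for the formula for $\hat{\mathcal{I}}$.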
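Once the derivative matrix and the estimated influence contributions are available, assembling the sandwich estimator is a short matrix computation. The sketch below assumes they have already been computed and stacked as arrays; the function name and the toy inputs are hypothetical.

```python
import numpy as np

def sandwich_variance(I_hat, J_hat):
    """Sandwich variance I^{-1} V I^{-T} with V = sum_i J_i J_i^T.

    I_hat: (q, q) derivative matrix of the stacked estimating function
    J_hat: (n, q) array whose ith row is the estimated influence
           contribution of subject i
    """
    V_hat = J_hat.T @ J_hat          # equals the sum of outer products J_i J_i^T
    I_inv = np.linalg.inv(I_hat)
    return I_inv @ V_hat @ I_inv.T

# Toy inputs: two parameters, three subjects.
I_hat = 2.0 * np.eye(2)
J_hat = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
cov = sandwich_variance(I_hat, J_hat)
std_err = np.sqrt(np.diag(cov))
```

Because the derivative matrix enters on both sides, its overall sign does not affect the result.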

4. Simulation study

4.1. Background information

In this section we study the finite sample performance of the proposed estimator under correct and misspecified models for the baseline hazard functions. We consider two different simulation scenarios. In the first, we assume the total number of distinct disease traits to be small and the number of cases observed for each trait value to be reasonably large. We consider trait values to be missing-completely-at-random and missing-at-random. In the second setting, we allow the number of distinct disease traits to be large and the number of cases observed for the different trait values to be potentially sparse.

4.2. Moderate number of disease subtypes

In this setting, we simulated data by mimicking the incidence pattern of four subtypes of breast cancer defined by the presence or absence of estrogen and progesterone receptors. We denote the four disease subtypes as (2, 2), (2, 1), (1, 2) and (1, 1). We first generated a scalar covariate X from the N (0, σ = 1) distribution for a cohort of size n, with n = 10 000. Next we generated the age-at-onset for the four different subtypes of the disease using the proportional hazards model

$$\lambda_{(y_1,y_2)}(t \mid X) = \lambda_{(y_1,y_2)}(t)\exp(\beta_{(y_1,y_2)}X),$$

with a Weibull specification for the baseline hazard functions,

$$\lambda_{(y_1,y_2)}(t) = \gamma_{(y_1,y_2)}\,\lambda_{(y_1,y_2)}^{\gamma_{(y_1,y_2)}}\,t^{\gamma_{(y_1,y_2)} - 1} \quad (y_1, y_2 = 1, 2). \tag{6}$$

We assumed that $\beta_{(y_1,y_2)} = \theta^{(0)} + \theta^{(1)}_{1(y_1)} + \theta^{(1)}_{2(y_2)}$ with $\theta^{(1)}_{1(1)} = \theta^{(1)}_{2(1)} = 0$ for identifiability. We set the value of $\theta^{(0)}$, a common parameter across all subtypes, to be 0.35, indicating an overall positive association between the covariate and the hazard of the disease. In addition, we set $\theta^{(1)}_{2(2)} = 0.25$, implying a stronger effect of the covariate on the risk of progesterone receptor positive tumours compared with progesterone receptor negative tumours. We assumed $\theta^{(1)}_{1(2)} = 0$, that is, no difference in the effect of the covariate between estrogen-receptor positive and negative tumours. We assumed $\gamma_{(y_1,y_2)} = \gamma = 4.355$ for all $(y_1, y_2)$, which guaranteed constant hazard ratios between the different subtypes. We further specified $\lambda_{(y_1,y_2)}$ in such a way that the hazard ratios $\alpha_{(y_1,y_2)} = \{\lambda_{(y_1,y_2)}/\lambda_{(1,1)}\}^{\gamma}$ satisfied the constraint

$$\log\{\alpha_{(y_1,y_2)}\} = \xi^{(0)} + \xi^{(1)}_{1(y_1)} + \xi^{(1)}_{2(y_2)},$$

with $\xi^{(0)} = 0$, $\xi^{(1)}_{1(1)} = \xi^{(1)}_{2(1)} = 0$, $\xi^{(1)}_{1(2)} = 0.0295$ and $\xi^{(1)}_{2(2)} = 2.1779$.

We generated the censoring time C for the subjects using a $N(75, 5^2)$ distribution. In our simulation setting, we observed, on average, 3.91%, 3.86%, 0.56% and 0.51% cases of subtypes (2, 2), (2, 1), (1, 2) and (1, 1), respectively. After generating data for the entire cohort, we simulated missing trait values under missing-completely-at-random and missing-at-random models. Under missing-completely-at-random, we deleted estrogen receptor and progesterone receptor status randomly, independently of each other, using Bernoulli sampling with the probability of missing, p = 1 − π, for each being 0.20 or 0.50. Under missing-at-random, we assumed the trait values to be completely observed for cases whose X and T values are greater than the 80th percentile values for the respective distributions. Among the remaining cases, we simulated missing traits by deleting estrogen receptor and progesterone receptor status randomly, independently of each other, using Bernoulli sampling with the probability of missing data, p, for each being 0.31 or 0.78, so that the overall missingness probability for each trait is, approximately, 0.20 or 0.50, respectively.
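The data-generating steps described above can be sketched in Python as follows. This is a rough illustration, not the authors' simulation code: latent subtype-specific onset times are drawn by inverting the Weibull cumulative hazard, the earliest one defines the event, and traits are then deleted completely at random. The reference scale `lam_ref = 0.0026`, the reading of the censoring distribution as normal with mean 75 and standard deviation 5, the seed and all function and variable names are assumptions, so the resulting subtype frequencies need not match those reported in the text.

```python
import numpy as np

def simulate_cohort(n, beta, lam, gamma, rng):
    """Simulate competing-risk cohort data with Weibull subtype hazards.

    beta, lam: dicts keyed by subtype tuple (log hazard ratio, Weibull scale)
    gamma    : common Weibull shape, giving constant hazard ratios
    Returns covariate, observed time, case indicator, index of the
    subtype with the earliest latent time, and the subtype list.
    """
    x = rng.normal(0.0, 1.0, size=n)
    subtypes = list(beta)
    lam_arr = np.array([lam[y] for y in subtypes])
    beta_arr = np.array([beta[y] for y in subtypes])
    # Invert the cumulative hazard (lam * t)**gamma * exp(beta * x) at a
    # unit-exponential draw to obtain each latent subtype-specific time.
    e = rng.exponential(1.0, size=(n, len(subtypes)))
    latent = (e / np.exp(np.outer(x, beta_arr))) ** (1.0 / gamma) / lam_arr
    t = latent.min(axis=1)
    which = latent.argmin(axis=1)
    c = rng.normal(75.0, 5.0, size=n)     # censoring time (assumed sd of 5)
    delta = (t < c).astype(int)
    v = np.minimum(t, c)
    return x, v, delta, which, subtypes

rng = np.random.default_rng(2010)
gamma = 4.355
xi = {(1, 1): 0.0, (1, 2): 2.1779, (2, 1): 0.0295, (2, 2): 0.0295 + 2.1779}
lam_ref = 0.0026                          # hypothetical scale for subtype (1, 1)
lam = {y: lam_ref * np.exp(xi[y] / gamma) for y in xi}
beta = {(1, 1): 0.35, (1, 2): 0.60, (2, 1): 0.35, (2, 2): 0.60}
x, v, delta, which, subtypes = simulate_cohort(10_000, beta, lam, gamma, rng)

# Missing completely at random: each trait deleted independently with p = 0.20.
observed = rng.random((10_000, 2)) >= 0.20
```

A missing-at-random variant would replace the last line with a deletion probability that depends on the covariate and the event time, as described in the text.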

Under each scenario, we simulated 500 datasets and analyzed each of them using three different methods. In the first, subsequently referred to as full cohort analysis, we applied the partial likelihood (3) to the entire cohort, assuming that data on both of the disease traits had been observed for all the cases. In the second, subsequently referred to as complete-case analysis, we applied the partial likelihood (3) to the cohort after deleting all the cases that had missing data in any of the traits. In the third, subsequently referred to as the estimating equation analysis, we applied the proposed method to analyze data on all the cases, including those with one or both of the traits missing.

Table 1 reveals that the proposed estimating equation method had negligible bias for estimation of both the null and nonnull values of the θ-parameters. Moreover, the proposed robust sandwich variance estimator also performed very well for estimation of the true variance of the parameter estimates. The coverage probabilities for the associated 95% confidence intervals were also close to the nominal level. Under the missingness-completely-at-random mechanism, the complete-case analysis also produced unbiased parameter estimates. The estimating equation method, however, was clearly more efficient, as it incorporated data on all cases irrespective of whether they had missing traits or not. In fact, when the proportion of missing data was modest, e.g. 20%, the estimating equation method lost only 4–15% efficiency compared with the full-cohort analysis. In contrast, even with a modest amount of missing data, the loss of efficiency for the complete-case analysis ranged between 30% and 40%. Under the missing-at-random mechanism, the complete-case analysis yielded a biased estimate of the parameter θ(0), with the magnitude of bias being quite high when the missingness probability was 0.50. Moreover, under the missing-at-random mechanism, the loss of efficiency for the complete case analysis was more dramatic than that observed under the missing-completely-at-random mechanism.

Table 1.

Results of the simulation study, where the disease has four subtypes based on two disease traits, each with two levels. The true values are $\theta^{(0)} = 0.35$, $\theta^{(1)}_{1(2)} = 0$ and $\theta^{(1)}_{2(2)} = 0.25$. Each of the disease traits is missing via a missing-completely-at-random (MCAR) or missing-at-random (MAR) mechanism. The cohort size for the simulation is n = 10 000

| Missing | Method | Measure | MCAR $\theta^{(0)}$ | MCAR $\theta^{(1)}_{1(2)}$ | MCAR $\theta^{(1)}_{2(2)}$ | MAR $\theta^{(0)}$ | MAR $\theta^{(1)}_{1(2)}$ | MAR $\theta^{(1)}_{2(2)}$ |
|---|---|---|---|---|---|---|---|---|
| None | Full cohort | Bias (×10²) | −0.06 | −0.10 | 0.42 | −0.06 | −0.10 | 0.42 |
| | | Var (×10²) | 0.25 | 0.45 | 1.15 | 0.25 | 0.44 | 1.15 |
| | | Evar (×10²) | 0.24 | 0.46 | 1.15 | 0.24 | 0.46 | 1.15 |
| | | 95% CP | 95.6 | 94.4 | 94.2 | 95.6 | 94.4 | 94.2 |
| 20% | Complete case | Bias (×10²) | 0.69 | −0.13 | 0.93 | 2.49 | −0.19 | 0.47 |
| | | Var (×10²) | 0.36 | 0.65 | 1.79 | 0.48 | 0.98 | 2.39 |
| | | Evar (×10²) | 0.38 | 0.72 | 1.81 | 0.51 | 0.97 | 2.41 |
| | | 95% CP | 96.2 | 96.4 | 95.8 | 94.0 | 94.6 | 93.8 |
| 20% | Estimating equation | Bias (×10²) | −0.08 | −0.11 | 0.70 | 0.00 | −0.18 | 0.17 |
| | | Var (×10²) | 0.26 | 0.51 | 1.35 | 0.28 | 0.59 | 1.54 |
| | | Evar (×10²) | 0.27 | 0.57 | 1.40 | 0.29 | 0.65 | 1.61 |
| | | 95% CP | 95.6 | 96.0 | 96.0 | 95.8 | 96.0 | 96.2 |
| 50% | Complete case | Bias (×10²) | 1.24 | −0.27 | 0.03 | 24.30 | 1.00 | 2.90 |
| | | Var (×10²) | 1.06 | 1.96 | 5.03 | 4.48 | 9.96 | 22.52 |
| | | Evar (×10²) | 0.98 | 1.87 | 4.81 | 4.29 | 8.25 | 24.17 |
| | | 95% CP | 93.4 | 94.2 | 94.6 | 76.8 | 92.2 | 95.2 |
| 50% | Estimating equation | Bias (×10²) | −0.10 | −0.15 | 0.65 | 0.10 | 0.51 | −2.66 |
| | | Var (×10²) | 0.43 | 1.02 | 2.57 | 0.67 | 1.84 | 4.55 |
| | | Evar (×10²) | 0.37 | 0.90 | 2.18 | 0.63 | 1.86 | 4.47 |
| | | 95% CP | 93.2 | 93.2 | 92.0 | 94.4 | 93.8 | 91.6 |

Evar, means of estimated variances; 95% CP, 95% coverage probability.
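The efficiency comparisons quoted in the text can be checked directly from the empirical variances in Table 1. The short Python calculation below uses the missing-completely-at-random columns at 20% missingness and expresses the efficiency loss relative to the full-cohort analysis as one minus the variance ratio; the helper name is an assumption made for the example.

```python
# Empirical variances (x 10^2) from Table 1, MCAR columns, 20% missing,
# in the order theta^(0), theta^(1)_{1(2)}, theta^(1)_{2(2)}.
var_full = [0.25, 0.45, 1.15]
var_complete_case = [0.36, 0.65, 1.79]
var_estimating_eq = [0.26, 0.51, 1.35]

def efficiency_loss(var_reference, var_method):
    """Percentage efficiency loss: 100 * (1 - var_reference / var_method)."""
    return [100.0 * (1.0 - r / m) for r, m in zip(var_reference, var_method)]

loss_ee = efficiency_loss(var_full, var_estimating_eq)
loss_cc = efficiency_loss(var_full, var_complete_case)
print([round(l, 1) for l in loss_ee])   # [3.8, 11.8, 14.8]
print([round(l, 1) for l in loss_cc])   # [30.6, 30.8, 35.8]
```

These figures agree with the ranges of roughly 4–15% for the estimating equation method and 30–40% for the complete-case analysis.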

Next we investigated the robustness of the estimating equation method against misspecification of the semiparametric model for the trait-specific baseline hazard functions. The simulation design was the same as above, except that in model (6) we allowed the parameters $\gamma_{(y_1,y_2)}$ and $\lambda_{(y_1,y_2)}$ to vary freely. In particular, we set $\gamma_{(1,1)} = 2.692$, $\gamma_{(1,2)} = 4.058$, $\gamma_{(2,1)} = 5.433$, $\gamma_{(2,2)} = 4.355$ and $\lambda_{(1,1)} = 0.0026$, $\lambda_{(1,2)} = 0.0038$, $\lambda_{(2,1)} = 0.0063$ and $\lambda_{(2,2)} = 0.0072$, resulting in approximately 7.06%, 0.73%, 1.64% and 1.34% of subtypes (2, 2), (1, 2), (2, 1) and (1, 1), respectively, in the underlying cohort. Figure 1 of the Supplementary Material shows how $\log\{\lambda_{(y_1,y_2)}(t)/\lambda_{(1,1)}(t)\}$ changes over time. From the results shown in Table 2, we observe that in this setting, under the missing-completely-at-random mechanism all of the methods produced nearly unbiased estimates for all the parameters, but some noticeable bias was observed for the estimating equation method for the estimation of the parameter $\theta^{(1)}_{1(2)}$ under the setting of 50% missing data. In contrast, under the missing-at-random mechanism, the complete-case analysis produced severe bias in estimating the parameter $\theta^{(0)}$ and the corresponding 95% coverage probability was unacceptably low. Under the missing-at-random mechanism, the bias of the estimating equation method also increased, but still remained small in absolute terms, and the corresponding 95% coverage probabilities were reasonable.

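Table 2.

Results of the simulation study with a misspecified model for the baseline hazard functions. Here, the disease has four subtypes based on two disease traits, each with two levels. The true values are θ(0) = 0.35, θ1(2)(1) = 0 and θ2(2)(1) = 0.25. Each of the disease traits is missing, under either the missing-completely-at-random or the missing-at-random mechanism, with probability 0.2 or 0.5. The cohort size for the simulation is n = 10 000.

| Method | Missing | Statistic | MCAR: θ(0) | MCAR: θ1(2)(1) | MCAR: θ2(2)(1) | MAR: θ(0) | MAR: θ1(2)(1) | MAR: θ2(2)(1) |
|---|---|---|---|---|---|---|---|---|
| Full cohort | none | Bias (×10²) | 0.09 | −0.15 | 0.41 | −0.04 | 0.43 | −0.13 |
| Full cohort | none | Var (×10²) | 0.14 | 0.56 | 0.70 | 0.15 | 0.58 | 0.75 |
| Full cohort | none | Evar (×10²) | 0.14 | 0.55 | 0.72 | 0.14 | 0.55 | 0.72 |
| Full cohort | none | 95% CP | 94.8 | 96.2 | 95.6 | 96.0 | 95.4 | 94.8 |
| Complete case | 20% | Bias (×10²) | 1.15 | 0.01 | 0.34 | 3.11 | 1.07 | −1.37 |
| Complete case | 20% | Var (×10²) | 0.20 | 0.88 | 1.09 | 0.30 | 1.17 | 1.46 |
| Complete case | 20% | Evar (×10²) | 0.22 | 0.87 | 1.13 | 0.29 | 1.17 | 1.53 |
| Complete case | 20% | 95% CP | 95.4 | 94.4 | 94.2 | 90.8 | 95.4 | 95.8 |
| Estimating equation | 20% | Bias (×10²) | −0.37 | 2.02 | −0.35 | −0.68 | 3.60 | −1.46 |
| Estimating equation | 20% | Var (×10²) | 0.16 | 0.65 | 0.88 | 0.17 | 0.74 | 0.95 |
| Estimating equation | 20% | Evar (×10²) | 0.15 | 0.64 | 0.83 | 0.16 | 0.72 | 0.93 |
| Estimating equation | 20% | 95% CP | 93.6 | 95.0 | 94.2 | 94.2 | 91.4 | 94.2 |
| Complete case | 50% | Bias (×10²) | 2.21 | −0.45 | 0.92 | 25.84 | 3.75 | −8.37 |
| Complete case | 50% | Var (×10²) | 0.59 | 2.32 | 2.68 | 2.66 | 9.73 | 14.91 |
| Complete case | 50% | Evar (×10²) | 0.56 | 2.25 | 2.94 | 2.35 | 9.70 | 13.67 |
| Complete case | 50% | 95% CP | 93.6 | 95.2 | 96.0 | 60.6 | 95.4 | 92.8 |
| Estimating equation | 50% | Bias (×10²) | −0.14 | 4.49 | −0.79 | −1.99 | 7.23 | −4.15 |
| Estimating equation | 50% | Var (×10²) | 0.19 | 0.97 | 1.16 | 1.59 | 8.82 | 3.09 |
| Estimating equation | 50% | Evar (×10²) | 0.20 | 0.96 | 1.23 | 1.66 | 7.23 | 2.46 |
| Estimating equation | 50% | 95% CP | 94.2 | 93.4 | 95.2 | 91.9 | 88.2 | 93.3 |

MCAR, missing-completely-at-random; MAR, missing-at-random; Evar, means of estimated variances; 95% CP, 95% coverage probability.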
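The four summaries reported in Table 2 can be computed from simulation replicates along the following lines. This is a minimal sketch rather than the authors' code: the function name `summarize_replicates` is hypothetical, the coverage is assumed to come from Wald-type intervals (estimate ± 1.96 × estimated standard error), and the empirical variance is assumed to use the n − 1 divisor, neither of which is stated in the text. The ×10² scaling used in the table is not applied.

```python
import numpy as np

def summarize_replicates(estimates, est_variances, true_value, z=1.96):
    """Summarize Monte Carlo replicates of a single parameter estimate.

    estimates      -- one point estimate per simulated dataset
    est_variances  -- the matching estimated variances (e.g. sandwich estimates)
    true_value     -- the parameter value used to generate the data
    """
    estimates = np.asarray(estimates, dtype=float)
    est_variances = np.asarray(est_variances, dtype=float)

    # Assumed Wald-type interval: estimate +/- z * estimated standard error.
    half_width = z * np.sqrt(est_variances)
    covered = np.abs(estimates - true_value) <= half_width

    return {
        "bias": estimates.mean() - true_value,   # "Bias"
        "var": estimates.var(ddof=1),            # "Var": empirical variance
        "evar": est_variances.mean(),            # "Evar": mean estimated variance
        "cp95": 100.0 * covered.mean(),          # "95% CP", in percent
    }

# Toy usage with two made-up replicates and true value 0.35.
summary = summarize_replicates([0.3, 0.4], [0.01, 0.01], 0.35)
```

Comparing `var` with `evar` indicates whether the variance estimator is calibrated, which is how the Var and Evar rows of the table are meant to be read side by side.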

4.3. Large number of disease subtypes

In the third setting, we considered three disease traits, each with four levels, say yj = 1, 2, 3, and 4, with the total number of disease subtypes equal to 4 × 4 × 4 = 64. As earlier, we generated the failure times for different disease subtypes from a trait-specific Cox proportional hazards model, where the covariate log hazard ratio parameters satisfied the constraint

$$\beta_{(y_1,y_2,y_3)} = \theta^{(0)} + \theta_1^{(1)} s_1(y_1) + \theta_2^{(1)} s_2(y_2) + \theta_3^{(1)} s_3(y_3),$$

with sj(yj) = (yj − 1)^0.3 (Chatterjee, 2004) and the baseline hazard functions followed a Weibull distribution of the form (6). We chose θ(0) = 0.35, θ1(1) = 0.15, θ2(1) = 0.0 and θ3(1) = 0.5. We allowed 64 unrestricted values for the γ- and λ-parameters of the baseline Weibull distribution by randomly drawing their values from the Un(3.5, 4) and Un(0.0021, 0.0024) distributions, respectively. As before, X ~ N(0, 1) and the censoring time was generated from an N(75, 5²) distribution. In this setting, the fraction of the subjects in the cohort who developed the disease was approximately 11%, with the subtype-specific disease occurrence rates ranging between 0.076% and 0.314%. We considered two different sample sizes, n = 5000 and 10 000.
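The second-stage constraint for this setting can be sketched as follows, to show how four parameters determine all 64 subtype-specific log hazard ratios. The code assumes that the score is the power function sj(yj) = (yj − 1)^0.3; if the scores are instead linear in the level, a different `score` function should be passed. The names `score` and `subtype_log_hazard_ratios` are hypothetical.

```python
import itertools

# Parameter values quoted in the text for the third simulation setting.
THETA0 = 0.35
THETA1 = (0.15, 0.0, 0.5)   # first-order contrasts for traits 1, 2 and 3
LEVELS = (1, 2, 3, 4)       # each trait has four levels

def score(level, power=0.3):
    """Assumed trait score s_j(y_j) = (y_j - 1) ** 0.3; level 1 scores 0."""
    return float(level - 1) ** power

def subtype_log_hazard_ratios(theta0=THETA0, theta1=THETA1, score_fn=score):
    """Return {(y1, y2, y3): beta} under the additive second-stage model."""
    beta = {}
    for y in itertools.product(LEVELS, repeat=len(theta1)):
        beta[y] = theta0 + sum(t * score_fn(yj) for t, yj in zip(theta1, y))
    return beta

beta = subtype_log_hazard_ratios()
```

Because θ2(1) = 0, the second trait induces no heterogeneity: any two subtypes that differ only in y2 share the same log hazard ratio.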

As before, we analyzed each dataset using three methods: full-cohort, complete-case and estimating equation. For the estimating equation method, we assumed constant hazard ratios across subtypes and a working model of the form log{α(y1,y2,y3)}=ξ(0)+ξ1(y1)(1)+ξ2(y2)(2)+ξ3(y3)(3). The results from this simulation study, shown in Table 1 of the Supplementary Material, reveal that all of the different methods produced valid inferences in this setting of highly stratified disease subtypes. The bias of the estimating equation method, along with that for the complete-case and the full-cohort analyses, was small even though the working model for the baseline hazard functions was incorrectly specified for the first method. Furthermore, the estimating equation method often gained remarkable efficiency compared with the complete-case analysis.

5. Analysis of the Cancer Prevention Study II, Nutrition Cohort

The Cancer Prevention Study II, Nutrition Cohort, is a prospective study of cancer incidence and mortality among men and women in the United States that was established in 1992 and ended on 30 June 2005.

In brief, the study participants completed a mailed, self-administered questionnaire in 1992 or 1993 that included a food frequency diet assessment and information on demographic, medical, behavioural, environmental and occupational factors. Beginning in 1997, follow-up questionnaires were sent to cohort members every 2 years to update exposure information and to ascertain newly diagnosed cancers; response rates for all follow-up questionnaires were at least 90% (Feigelson et al., 2006). For the purpose of illustration, we considered weight gain from age 18 to the year 1992 as the main covariate of interest, as it has previously been shown to be related to the risk of breast cancer in the Cancer Prevention Study II cohort. After excluding those women who were lost to follow-up, had unknown weights, had extreme values of weight or reported prevalent breast or other cancer at baseline, except nonmelanoma skin cancer, we were left with 44 172 women who were postmenopausal at baseline in 1992.

Among the 44 172 women, we found that 1516 had some form of breast cancer. The cancer cases were verified either by obtaining medical records or through linkage with state cancer registries when complete medical records could not be obtained. We analyzed available data on five tumour traits: grade, with three categories, well/moderately/poorly differentiated; stage, with two categories, localized/distant; histologic type, with three categories, ductal/lobular/other; estrogen receptor status, with two categories, positive/negative; and progesterone receptor status, with two categories, positive/negative. The aim of our analysis was to study how the association between weight gain and risk of breast cancer varied by these tumour traits.

The five traits yielded a total of up to 3 × 2 × 3 × 2 × 2 = 72 subtypes. Out of the 1516 cancer patients, 782 subjects had information on all of the disease traits, while the remaining 734 subjects had information missing on at least one of the traits. Let y1, y2, y3, y4 and y5 denote the level of grade, stage, histology, estrogen receptor status, and progesterone receptor status, respectively. We modelled the hazard of the various cancer subtypes as

$$h_{(y_1,y_2,y_3,y_4,y_5)}(t \mid X) = \lambda_{(y_1,y_2,y_3,y_4,y_5)}(t)\exp\bigl(X\beta_{(y_1,y_2,y_3,y_4,y_5)}\bigr),$$

where, further, we assumed β(y1,y2,y3,y4,y5)=θ(0)+θ1(y1)(1)+θ2(y2)(1)+θ3(y3)(1)+θ4(y4)(1)+θ5(y5)(1). We set well-differentiated, localized, ductal, estrogen receptor positive and progesterone receptor positive as the referent levels for the associated tumour traits. Thus, in our model, θ(0) represents the log hazard ratio of weight gain associated with this referent breast cancer subtype and the parameters θk(yk)(1), k = 1, 2, 3, 4, 5, yield a measure of heterogeneity in the effect of weight gain by the levels of the corresponding disease trait. We performed both the complete-case and the estimating equation analysis of the data. For our estimating equation method, we assumed the constraint

$$\log\{\lambda_{(y_1,y_2,y_3,y_4,y_5)}(t)/\lambda_{(1,1,1,1,1)}(t)\} = \xi_{1(y_1)}^{(1)} + \xi_{2(y_2)}^{(1)} + \xi_{3(y_3)}^{(1)} + \xi_{4(y_4)}^{(1)} + \xi_{5(y_5)}^{(1)}.$$
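To illustrate how the additive second-stage model reduces 72 subtype-specific log hazard ratios to eight parameters, the sketch below builds a 72 × 8 indicator design matrix with level 1 of every trait as the referent. The function name `second_stage_design` and the column ordering are assumptions made for illustration; the parameter vector used in the usage lines is the complete-case estimate reported in Table 3.

```python
import itertools
import numpy as np

# Number of levels per trait: grade, stage, histology, ER status, PR status.
N_LEVELS = [3, 2, 3, 2, 2]

def second_stage_design(n_levels):
    """Return (subtypes, design) for the additive first-order contrast model.

    Column 0 is the intercept theta^(0); each trait with k levels then
    contributes k - 1 indicator columns for its non-referent levels.
    """
    subtypes = list(itertools.product(*[range(1, k + 1) for k in n_levels]))
    n_params = 1 + sum(k - 1 for k in n_levels)
    design = np.zeros((len(subtypes), n_params))
    design[:, 0] = 1.0
    # First column index belonging to each trait.
    offsets = np.cumsum([1] + [k - 1 for k in n_levels[:-1]])
    for row, y in enumerate(subtypes):
        for offset, level in zip(offsets, y):
            if level > 1:  # the referent level gets no contrast
                design[row, offset + level - 2] = 1.0
    return subtypes, design

subtypes, design = second_stage_design(N_LEVELS)

# Complete-case estimates from Table 3, in the column order used above.
theta_cc = np.array([1.082, -0.210, -0.059, 0.809, -0.257, 0.376, 0.375, -0.921])
beta_cc = design @ theta_cc  # one log hazard ratio per subtype
```

For example, the entry of `beta_cc` for subtype (1, 2, 1, 1, 1), that is, distant stage with all other traits at their referent levels, is the sum of the intercept and the stage contrast.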

From the results in Table 3, it is evident that both methods produced estimates of θ(0) that are positive and highly significant, indicating an overall positive association of weight gain with the risk of breast cancer. Moreover, the significance of the estimate of θ2(2)(1) in both methods indicated that the association between weight gain and risk of breast cancer is stronger for distant compared with localized tumours. The significance of the estimate of θ5(2)(1) in both methods also suggested that the association between weight gain and risk of breast cancer is stronger for progesterone receptor positive compared with progesterone receptor negative tumours. For all parameters, the estimating equation method produced substantially smaller standard errors than did the complete-case analysis.

Table 3.

Results of the data analysis. Here we consider five disease traits: grade, stage, histology, estrogen receptor status and progesterone receptor status. Grade and histology each have three levels, and stage, estrogen receptor status and progesterone receptor status each have two levels.

| Parameter | Trait (referent) | Level | % missing | CC EST | CC SE | CC p-value | EE EST | EE SE | EE p-value |
|---|---|---|---|---|---|---|---|---|---|
| θ(0) | Referent subtype | | | 1.082 | 0.305 | <0.001 | 0.877 | 0.246 | <0.001 |
| θ1(2)(1) | Grade (Well) | Moderate | 24.5 | −0.210 | 0.361 | 0.559 | 0.075 | 0.289 | 0.796 |
| θ1(3)(1) | Grade (Well) | Poor | 24.5 | −0.059 | 0.403 | 0.883 | 0.168 | 0.301 | 0.576 |
| θ2(2)(1) | Stage (Localized) | Distant | 2.0 | 0.809 | 0.312 | 0.009 | 0.693 | 0.209 | 0.001 |
| θ3(2)(1) | Histology (Ductal) | Lobular | 0.0 | −0.257 | 0.414 | 0.535 | −0.519 | 0.274 | 0.058 |
| θ3(3)(1) | Histology (Ductal) | Other | 0.0 | 0.376 | 0.472 | 0.426 | 0.327 | 0.287 | 0.254 |
| θ4(2)(1) | ER status (ER+) | ER− | 32.8 | 0.375 | 0.506 | 0.459 | 0.760 | 0.408 | 0.062 |
| θ5(2)(1) | PR status (PR+) | PR− | 36.6 | −0.921 | 0.398 | 0.021 | −1.148 | 0.341 | 0.001 |

CC, complete-case analysis; EE, estimating equation method; EST, estimate; SE, standard error; ER, estrogen receptor; PR, progesterone receptor; % missing, percentage of cases with the trait missing.
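The p-values in Table 3 can be related to the estimates and standard errors through a two-sided Wald test, sketched below. That the tabulated p-values are Wald p-values based on the normal approximation is an assumption, not something the text states, and `wald_p_value` is a hypothetical helper. Because the inputs are rounded to three decimals, recomputed values can differ slightly from the table in the last digit.

```python
import math

def wald_p_value(estimate, std_error):
    """Two-sided p-value for H0: parameter = 0 under a normal approximation."""
    z = estimate / std_error
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)) for the standard normal.
    return math.erfc(abs(z) / math.sqrt(2.0))

# (estimate, standard error) pairs for the PR-negative contrast in Table 3.
p_complete_case = wald_p_value(-0.921, 0.398)
p_estimating_eq = wald_p_value(-1.148, 0.341)

# Hazard ratio per unit of weight gain for the referent subtype
# (estimating equation method), obtained by exponentiating theta^(0).
hazard_ratio_referent = math.exp(0.877)
```

The same helper makes the efficiency gain visible: the estimating equation estimate for the lobular contrast has a standard error of 0.274 against 0.414 for the complete-case analysis, which moves its p-value from clearly non-significant to borderline.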

6. Discussion

The proposed method is semiparametric in the sense that it involves an unspecified baseline hazard function for a baseline disease subtype. The method requires an assumption of missing-at-random but it does not require any modelling assumption for π(T, X), the missingness probability, for the disease traits. The asymptotic unbiasedness of the method, however, does require correct parametric specification for the inter-relationships of the nuisance baseline hazard functions for the different disease subtypes.

Our simulation study suggests that the bias generated by the proposed method for estimation of θ, the parameters of interest, is generally small, even when the model for λy(t)/λ(1,...,1)(t) is grossly misspecified. Our theory is valid for more general parametric specifications of λy(t)/λ(1,...,1)(t) that do not necessarily assume constancy over t. Thus the proposed methods can be used to conduct sensitivity analyses against alternative models for the baseline hazards. In principle, one could also use the constant hazard-ratio and log-linear modelling assumption to construct a doubly robust estimator for θ (Gao & Tsiatis, 2005) that would be unbiased for a correctly specified model either for π(T, X) or for λy(t)/λ(1,...,1)(t). It would be interesting to develop such a method and to examine how its gain in robustness compensates for its potential loss of efficiency compared with the estimating equation method in practical settings. The computer code is available from the authors upon request.

Acknowledgments

The research of Dr. Chatterjee, and partially that of Dr. Sinha, was supported by the intramural program of the National Cancer Institute, National Institutes of Health, USA.

Appendix

Regularity conditions. In the following, let |θ|p denote the sum of the absolute values of the θ-parameters associated with the vector of covariates Xp.

  • (C1) The missingness probability π(r)(T, S) > 0 almost surely in (T, S) for r = (1, . . ., 1).

  • (C2) The elements of the second-stage design matrices ℬ and 𝒜 remain uniformly bounded in absolute value by constants, say cB and cA, respectively.

  • (C3) Assume that the function $X^{3}\exp\bigl(c_B\sum_{p=1}^{P}|\theta|_p X_p\bigr)$ can be bounded by an integrable function of X uniformly in an open neighbourhood of θ0.

  • (C4) The functions $E_{V,X}\,I(V \ge \upsilon)\exp\bigl(c_A c_B\sum_{p=1}^{P}|\theta|_p X_p\bigr)$ are bounded away from zero uniformly in υ and $\eta = (\theta^{T}, \xi^{T})^{T}$ in an open neighbourhood of η0.

  • (C5) The matrix is positive definite.

Proof of the asymptotic unbiasedness of the estimating equation under the general missing-at-random assumption. We provide an outline of the proof of the asymptotic unbiasedness of the estimating equations Sθ = 0 and Sξ = 0 under the general missing-at-random mechanism specified by (4). Further details of the proof can be found in the Supplementary Material.

First, it is easy to see that the asymptotic limit of (1/n)Sθ(r) can be written in general form as

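$$E_{R,V,\Delta,Y^{o_r}}\left( I(R=r)\,\Delta\left[ \frac{\partial}{\partial\theta}\log\{h_{Y^{o_r}}(V \mid X)\} - \frac{s^{(1)}(V,Y^{o_r})}{s^{(0)}(V,Y^{o_r})} \right] \right),$$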

where

s(1)(V,Yor)=EV,XI(VV)[log{hYor(V|X)}/θ]hYor(V|X),s(0)(V,Yor)=EV,XI(VV)hYor(V|X),and log{hYor(V|X)}/θ=Yor(y|V,X)X.

Now, under the missing-at-random assumption, we can show

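$$C^{(r)} \equiv E\bigl[\Delta\,I(R=r)\,\partial\log\{h_{Y^{o_r}}(V \mid X)\}/\partial\theta\bigr] = \int_{\upsilon,\,y^{o_r}} E_{V,X}\Bigl( I(V \ge \upsilon)\,\pi^{(r)}(\upsilon,X)\bigl[\partial\log\{h_{Y^{o_r}}(\upsilon \mid X)\}/\partial\theta\bigr]\,h(\upsilon,y^{o_r} \mid X) \Bigr)\,d\upsilon\,d\mu(y^{o_r}).$$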

Similarly, we can show,

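$$D^{(r)} \equiv E\left\{\Delta\,I(R=r)\,\frac{s^{(1)}(V,Y^{o_r})}{s^{(0)}(V,Y^{o_r})}\right\} = \int \frac{s^{(1)}(\upsilon,y^{o_r})}{s^{(0)}(\upsilon,y^{o_r})}\,E_{V,X}\bigl\{I(V \ge \upsilon)\,\pi^{(r)}(\upsilon,X)\,h_{y^{o_r}}(\upsilon \mid X)\bigr\}\,d\upsilon\,d\mu(y^{o_r}).$$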

Now, we note that, if π(r)(T, X) = π(r)(T), then we have

$$C^{(r)} = D^{(r)} = \int \pi^{(r)}(\upsilon)\,s^{(1)}(\upsilon,y^{o_r})\,d\upsilon\,d\mu(y^{o_r}),$$

implying the asymptotic unbiasedness of Sθ(r) for each specific r.
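The cancellation behind this identity can be written out as follows. This is a sketch that reads s(0)(υ, yor) as the expectation of I(V ≥ υ)hyor(υ|X) over (V, X), and s(1)(υ, yor) as the same expectation with the integrand multiplied by the θ-gradient of log hyor(υ|X); this reading of the definitions is an assumption. When π(r) does not depend on X, it factors out of the inner expectation:

```latex
\begin{aligned}
D^{(r)} &= \int \frac{s^{(1)}(\upsilon,y^{o_r})}{s^{(0)}(\upsilon,y^{o_r})}\,
           \pi^{(r)}(\upsilon)\,
           E_{V,X}\{I(V \ge \upsilon)\,h_{y^{o_r}}(\upsilon \mid X)\}\,
           d\upsilon\,d\mu(y^{o_r}) \\
        &= \int \frac{s^{(1)}(\upsilon,y^{o_r})}{s^{(0)}(\upsilon,y^{o_r})}\,
           \pi^{(r)}(\upsilon)\,s^{(0)}(\upsilon,y^{o_r})\,
           d\upsilon\,d\mu(y^{o_r})
         = \int \pi^{(r)}(\upsilon)\,s^{(1)}(\upsilon,y^{o_r})\,
           d\upsilon\,d\mu(y^{o_r}), \\
C^{(r)} &= \int \pi^{(r)}(\upsilon)\,
           E_{V,X}\Bigl( I(V \ge \upsilon)\,
           \bigl[\partial \log\{h_{y^{o_r}}(\upsilon \mid X)\}/\partial\theta\bigr]\,
           h_{y^{o_r}}(\upsilon \mid X) \Bigr)\,
           d\upsilon\,d\mu(y^{o_r})
         = \int \pi^{(r)}(\upsilon)\,s^{(1)}(\upsilon,y^{o_r})\,
           d\upsilon\,d\mu(y^{o_r}).
\end{aligned}
```

When π(r) depends on X, it can no longer be pulled outside the expectation, so the factor s(0) does not cancel pattern by pattern.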

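When π(r)(T, X) depends on X, in general C(r) ≠ D(r), but we can show that

$$\sum_{r \in \mathcal{D}_R} C^{(r)} = \sum_{r \in \mathcal{D}_R} D^{(r)} = \int_{\upsilon,\,y} s^{(1)}(\upsilon,y)\,d\upsilon\,d\mu(y)$$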

by rearrangements of integrals and sums.

Derivation of the influence function and asymptotic normality. The asymptotic unbiasedness of the estimating function, together with the fact that, under the given regularity conditions, −(1/n)∂Tn/∂η converges uniformly in an open neighbourhood of η0 to the positive definite matrix in (C5), proves the local consistency of the proposed estimator. In the following lemma, we state a key step that is needed for the proof of Theorem 1.

Lemma A1. Under the regularity conditions described in (C1)–(C5),

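  (i) $n^{1/2}\left\{\dfrac{S^{(1)}(\upsilon,y)}{S^{(0)}(\upsilon,y)} - \dfrac{s^{(1)}(\upsilon,y)}{s^{(0)}(\upsilon,y)}\right\} = \dfrac{1}{n^{1/2}}\,\dfrac{1}{s^{(0)}(\upsilon,y)}\sum_{j=1}^{n} I(V_j \ge \upsilon)\,h_y(V_j \mid X_j)\left[\dfrac{\partial}{\partial\theta}\log\{h(V_j,y \mid X_j)\} - \dfrac{s^{(1)}(\upsilon,y)}{s^{(0)}(\upsilon,y)}\right] + o_p(1).$

  (ii) The above result holds uniformly for all (υ, y).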

The proof of part (i) follows by a standard application of the functional δ-theorem (Theorem 20.8 of van der Vaart, 1998) by viewing the quantities S(k)(υ, y), k = 0, 1, as functions of the underlying empirical process defined by {Vj, Xj}, j = 1, ..., n. Since both s(1)(υ, y) and s(0)(υ, y) are linear functionals of the underlying distribution function of V and X, the required condition for their Hadamard differentiability follows easily under the regularity conditions (C2)–(C3). Moreover, under (C4), we can apply the chain rule to show that the ratio s(1)(υ, y)/s(0)(υ, y) is Hadamard differentiable. Part (ii) of Lemma A1, that is, uniform convergence, follows by the uniform boundedness conditions (C2)–(C4).
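To make the objects in Lemma A1 concrete, the sketch below computes empirical risk-set averages for a deliberately simplified case: a single disease subtype, a scalar covariate and hazard proportional to exp(βX), so that the gradient of the log hazard with respect to β is X. This simplification and the function name `risk_set_averages` are assumptions for illustration; the paper's S(k)(υ, y) are subtype-specific and involve the full two-stage model.

```python
import numpy as np

def risk_set_averages(times, covariate, beta, v):
    """Empirical S0 and S1 at time v for a one-covariate Cox-type model.

    S0(v) = (1/n) * sum_j I(V_j >= v) * exp(beta * X_j)
    S1(v) = (1/n) * sum_j I(V_j >= v) * X_j * exp(beta * X_j)
    The baseline hazard is omitted because it cancels in the ratio S1 / S0.
    """
    times = np.asarray(times, dtype=float)
    covariate = np.asarray(covariate, dtype=float)
    at_risk = times >= v                        # subjects still under observation at v
    weights = at_risk * np.exp(beta * covariate)
    s0 = weights.mean()
    s1 = (weights * covariate).mean()
    return s0, s1

# Toy data: three subjects, evaluated at v = 2 with beta = 0.
s0, s1 = risk_set_averages([1.0, 2.0, 3.0], [0.0, 1.0, 2.0], beta=0.0, v=2.0)
ratio = s1 / s0  # with beta = 0 this is the mean covariate among those at risk
```

The ratio S1/S0 is a smooth function of two sample averages, which is why a delta-method argument yields the linear expansion stated in part (i).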

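Now, to derive an asymptotic representation of $n^{-1/2}S_\theta$, we write

$$\frac{1}{n^{1/2}}S_\theta = \frac{1}{n^{1/2}}\sum_{r \in \mathcal{D}_R}\sum_{i=1}^{n}\Delta_i I(R_i=r)\left[\frac{\partial}{\partial\theta}\log\{h_{Y_i^{o_r}}(V_i \mid X_i)\} - \frac{s^{(1)}(V_i,Y_i^{o_r})}{s^{(0)}(V_i,Y_i^{o_r})}\right] - \frac{1}{n^{1/2}}\sum_{r \in \mathcal{D}_R}\sum_{i=1}^{n}\Delta_i I(R_i=r)\left\{\frac{S^{(1)}(V_i,Y_i^{o_r})}{S^{(0)}(V_i,Y_i^{o_r})} - \frac{s^{(1)}(V_i,Y_i^{o_r})}{s^{(0)}(V_i,Y_i^{o_r})}\right\},$$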

where the second term, using Lemma A1, can be written as

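$$\sum_{r \in \mathcal{D}_R}\sum_{i=1}^{n}\frac{1}{n^{3/2}}\,\frac{\Delta_i I(R_i=r)}{s^{(0)}(V_i,Y_i^{o_r})}\sum_{j=1}^{n} I(V_j \ge V_i)\,h_{Y_i^{o_r}}(V_j \mid X_j)\left[\frac{\partial}{\partial\theta}\log\{h_{Y_i^{o_r}}(V_j \mid X_j)\} - \frac{s^{(1)}(V_i,Y_i^{o_r})}{s^{(0)}(V_i,Y_i^{o_r})}\right] + o_p(1),$$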

which, in turn, after rearrangement of the sums, can be written as

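$$\frac{1}{n^{1/2}}\sum_{j=1}^{n}\sum_{r \in \mathcal{D}_R}\left(\frac{1}{n}\sum_{i=1}^{n}\frac{\Delta_i I(R_i=r)}{s^{(0)}(V_i,Y_i^{o_r})}\,I(V_i \le V_j)\,h_{Y_i^{o_r}}(V_j \mid X_j) \times \left[\frac{\partial}{\partial\theta}\log\{h_{Y_i^{o_r}}(V_j \mid X_j)\} - \frac{s^{(1)}(V_i,Y_i^{o_r})}{s^{(0)}(V_i,Y_i^{o_r})}\right]\right) + o_p(1).$$ (A1)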

Now, by the law of large numbers, it is easy to see that the expression within the square brackets of expression (A1) converges to the second term in the expression of Jniθ given in Theorem 2. The derivation of the asymptotic representation of Sξ as a sum of independently and identically distributed random variables requires a similar step. The asymptotic normality of (θ̂n, ξ̂n) now follows by the application of the central limit theorem.
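Once the estimating function has been written as a sum of independent and identically distributed terms, a sandwich variance estimate has the generic form sketched below. This is not the authors' estimator: the per-subject influence terms and the derivative matrix are taken as given inputs, and the names `sandwich_variance`, `influence_terms` and `derivative_matrix` are hypothetical.

```python
import numpy as np

def sandwich_variance(influence_terms, derivative_matrix):
    """Generic sandwich variance A^{-1} B A^{-T} / n.

    influence_terms   -- array of shape (n, k); row i is the contribution of
                         subject i to the linearized estimating function
    derivative_matrix -- array of shape (k, k); minus the average derivative
                         of the estimating function with respect to the parameter
    """
    terms = np.asarray(influence_terms, dtype=float)
    bread = np.atleast_2d(np.asarray(derivative_matrix, dtype=float))
    n = terms.shape[0]
    meat = terms.T @ terms / n            # empirical second moment of the terms
    bread_inv = np.linalg.inv(bread)
    return bread_inv @ meat @ bread_inv.T / n

# Sanity check on the sample mean, where the influence term is x_i - mean(x)
# and the derivative matrix is the identity.
x = np.array([1.0, 2.0, 3.0, 4.0])
var_mean = sandwich_variance((x - x.mean()).reshape(-1, 1), [[1.0]])
```

For the sample mean this reduces to the familiar variance of the mean with divisor n, which is a quick way to check that the scaling by n is handled consistently.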

Supplementary material

Supplementary material is available at Biometrika online.

References

  1. Chatterjee N. A two-stage regression model for epidemiological studies with multivariate disease classification data. J Am Statist Assoc. 2004;99:127–38.
  2. Cox DR. Regression models and life-tables. J R Statist Soc B. 1972;34:187–220.
  3. Craiu RV, Duchesne T. Inference based on the EM algorithm for the competing risks model with masked causes of failure. Biometrika. 2004;91:543–58.
  4. Dewanji A. A note on a test for competing risks with missing failure type. Biometrika. 1992;79:855–7.
  5. Feigelson HS, Patel AV, Teras LR, Gansler T, Thun MJ, Calle EE. Adult weight gain and histopathologic characteristics of breast cancer among postmenopausal women. Cancer. 2006;107:12–21. doi: 10.1002/cncr.21965.
  6. Gao G, Tsiatis AA. Semiparametric estimators for the regression coefficients in the linear transformation competing risks model with missing cause of failure. Biometrika. 2005;92:875–91.
  7. Goetghebeur E, Ryan L. Analysis of competing risks survival data when some failure types are missing. Biometrika. 1995;82:821–33.
  8. Lu K, Tsiatis AA. Comparison between two partial likelihood approaches for the competing risks model with missing cause of failure. Lifetime Data Anal. 2005;11:29–40. doi: 10.1007/s10985-004-5638-0.
  9. van der Vaart AW. Asymptotic Statistics. Cambridge, UK: Cambridge University Press; 1998.

Articles from Biometrika are provided here courtesy of Oxford University Press
