Author manuscript; available in PMC: 2014 May 27.
Published in final edited form as: Stat Med. 2008 May 20;27(11):1911–1933. doi: 10.1002/sim.3159

Corrected score estimation in the proportional hazards model with misclassified discrete covariates

David M Zucker1,*, Donna Spiegelman2
PMCID: PMC4035127  NIHMSID: NIHMS534417  PMID: 18219700

SUMMARY

We consider Cox proportional hazards regression when the covariate vector includes error-prone discrete covariates along with error-free covariates, which may be discrete or continuous. The misclassification in the discrete error-prone covariates is allowed to be of any specified form. Building on the work of Nakamura and his colleagues, we present a corrected score method for this setting. The method can handle all three major study designs (internal validation design, external validation design, and replicate measures design), both functional and structural error models, and time-dependent covariates satisfying a certain ‘localized error’ condition. We derive the asymptotic properties of the method and indicate how to adjust the covariance matrix of the regression coefficient estimates to account for estimation of the misclassification matrix. We present the results of a finite-sample simulation study under Weibull survival with a single binary covariate having known misclassification rates. The performance of the method described here was similar to that of related methods we have examined in previous works. Specifically, our new estimator performed as well as or, in a few cases, better than the full Weibull maximum likelihood estimator. We also present simulation results for our method for the case where the misclassification probabilities are estimated from an external replicate measures study. Our method generally performed well in these simulations. The new estimator has a broader range of applicability than many other estimators proposed in the literature, including those described in our own earlier work, in that it can handle time-dependent covariates with an arbitrary misclassification structure. We illustrate the method on data from a study of the relationship between dietary calcium intake and distal colon cancer.

Keywords: errors in variables, nonlinear models, proportional hazards

1. INTRODUCTION

Many regression analyses involve explanatory variables that are measured with error. It is well known that failing to account for covariate error can lead to biased estimates of the regression coefficients. For linear models, theory for handling covariate error has been developed over the past 50 or more years; Fuller [1] provides an authoritative exposition. For nonlinear models, theory has been developing over the past 25 or so years. Carroll et al. [2] provide a comprehensive summary of the development to date; currently, the covariate error problem for nonlinear models remains an active research area. In particular, beginning with Prentice [3], a growing literature has developed on the Cox [4] proportional hazards survival regression model when some covariates are measured with error.

Three basic design setups are of interest. In all three designs, we have a main survival cohort for which surrogate covariate measurements and survival time data are available on all individuals. The three designs are as follows: (1) the internal validation design, where the true covariate values are available on a subset of the main survival cohort, (2) the external validation design, where the measurement error distribution is estimated from data outside the main survival study, and (3) the replicate measurements design, where replicate surrogate covariate measurements are available, either on a subset of the survival study cohort or on individuals outside the main survival study. Also, two types of models for the measurement error are of interest (see [1, p. 2; 2, Section 2]): structural models, where the true covariates are random variables, and functional models, where the true covariates are fixed values. Structural model methods generally involve estimation of some aspects of the distribution of the true covariate values; in functional model methods, this process is avoided.

We focus here on discrete covariates subject to misclassification, which are of interest in many epidemiological studies. In the case of a binary event outcome, there is extensive literature on the effects of misclassification on estimation of, and inference about, the relative risk and related parameters. Bross [5] is an early seminal paper. Detailed reviews have been given by Chen [6], Kuha and Skinner [7], and Walter and Irwig [8], while Kuha et al. [9] provide a concise summary of much of the development. Correction for misclassification entails estimating the classification probabilities through one of the designs listed in the preceding paragraph. Given appropriate estimates of the classification probabilities, consistent estimates of the relative risk and related inferences can proceed using ‘matrix methods’ [5, 10, 11] or maximum likelihood methods [12, 13].

The Cox survival regression model with covariate errors has been examined in a number of settings. Much of the existing work focuses on the independent additive error model, which assumes that the observed covariate value is equal to the true value plus a random error whose distribution is independent of the true value. In the case of discrete covariates subject to misclassification, this model practically never holds, and hence, the methods built upon it do not apply. Other methods are available in the literature that do apply to misclassification problems, but they are subject to substantial limitations, as we now describe.

The work of Zhou and Pepe [14] and of Zhou and Wang [15] deals with the internal validation design. Their approach involves empirically estimating the conditional mean of a certain function of the true covariate vector conditional on the observed covariate vector. This process entails stratification or smoothing with respect to the observed covariate vector. When the covariate vector is of moderate to high dimension, the ‘curse of dimensionality’ causes this approach to break down, even if only one of the covariates is error prone. Chen [16] presents an alternate method for the internal validation design. His method combines the regression coefficient estimate based on the validation sample only with information gleaned from the rest of the main study cohort. Chen’s approach assumes that it is possible to form a satisfactory initial estimate of the regression coefficient vector based on the validation sample alone. This is not the case, however, for studies where the event rate is low to moderate, the main study sample size is in the thousands, and the validation study sample size is in the low hundreds. Thus, in such cases, which often arise in practice, Chen’s approach breaks down. In addition, the methods of Zhou and Pepe, Zhou and Wang, and Chen do not cover the external validation or replicate measures setups.

Spiegelman et al. [17] and Wang et al. [18] discuss the simple and well-known regression calibration method, which applies to all three design setups. This method, however, is only an approximate method and does not yield a consistent estimator. Zucker and Spiegelman [19] and Zucker [20] present general methods suitable for all three designs, but their methods cover only time-independent covariates and their approaches do not seem generalizable to time-dependent covariates. Hu et al. [21] present a method that can handle time-dependent covariates under all three designs in a more general setting, but their approach is complex and its asymptotic properties were not formally examined. Most of the methods cited above apply only to structural models.

Another option is to apply the SIMEX approach, which is a general approach for covariate error problems [2, Chapter 5]. Recently, Küchenhoff et al. [22] developed a SIMEX method for the misclassification setting. Küchenhoff et al. treat generalized linear models; their method could be extended to survival models. The SIMEX approach, however, has some disadvantages. It requires multiple runs of the model-fitting process, and it relies on an extrapolation scheme that is uncertain and does not necessarily yield a consistent estimator.

Finally, it is possible in principle to apply methods that have been developed in the missing covariate literature, such as that of Herring and Ibrahim [23]. Most of this work deals with the internal validation setup, though the approach could be extended to the other two design setups. These methods, however, are highly complex, and they cannot effectively handle the functional model setting or time-dependent covariates.

Thus, the currently existing methods are subject to substantial limitations, even for the internal validation design, and all the more for the external validation and replicate measures designs. There is a need for a new method that overcomes these limitations. In particular, there is a need for a convenient method for all three study designs that can handle general measurement error structures, both functional and structural models, and time-dependent covariates.

The aim of this paper is to present such a method for the case where the error-prone covariates are discrete. The misclassification may take any specified form desired, including an unstructured form. The error-free covariates are allowed to be either discrete or continuous. The time-dependent covariates are required to satisfy a certain ‘localized error’ condition, which we describe later. We present basic asymptotic properties of the method and examine its finite-sample performance in a simulation study.

In the case where the classification probabilities are estimated from replicate measurements data, it is necessary in this estimation process to regard the true covariate value as a random variable, as in a structural model (see the end of Section 4). However, when external replicate data are used, the marginal distribution of the true covariate need not be the same in the replicate sample as in the main study; we need only portability of the conditional distribution of the observed covariate given the true covariate.

Our proposed method follows the corrected score approach. Nakamura [24, 25] described the basic idea behind the approach. He then developed it in some detail under the independent additive error model. However, as noted above, this error model is not appropriate for discrete covariates. Accordingly, we build instead on the work of Akazawa et al. [26], which dealt with logistic regression with discrete covariates subject to misclassification. We extend the work of Akazawa et al. in a number of directions. First, we extend their approach from logistic regression to the Cox survival regression model; this extension involves substantial new technical development. Second, we allow for the case where, in addition to the error-prone discrete covariates, there may be a large number of other covariates measured without error; these other covariates can be either discrete or continuous. Finally, while Akazawa et al. assume the classification probabilities to be known, we allow them to be estimated, and we derive corrections to the estimated covariance matrix of the parameter estimates that reflect the error in estimating the classification probabilities. In the absence of misclassification, our method reduces to the classical Cox partial likelihood method.

The paper is organized as follows. Section 2 presents our proposed method for the case where the misclassification probabilities are assumed known. Section 3 discusses the case where the misclassification probabilities are estimated. Section 4 presents a simulation study of the method in the case of a single binary error-prone covariate. In the simulations, we consider both the case where the misclassification probabilities are known and the case where these probabilities are estimated from external replicate measurement data. Section 5 presents an application to data from the Nurses Health Study on the relationship between dietary calcium intake and distal colon cancer [27]. Section 6 provides a summary and discussion.

2. THE PROPOSED METHOD

2.1. The setup

We assume a standard survival analysis setup. We have observations on n independent individuals. We denote, for individual i, the survival time by $T_i^0$ and the time of right censoring by $C_i$. The observed survival data consist of the observed follow-up time $T_i = \min(T_i^0, C_i)$ and the event indicator $\delta_i = I(T_i^0 \leq C_i)$. We let $Y_i(t) = I(T_i \geq t)$ denote the at-risk indicator. As usual, we assume that the covariate processes are left continuous with right limits, and that the failure process and the censoring process are conditionally independent given the covariate process in the sense described by Kalbfleisch and Prentice [28, Section 6.3.2]. Left truncation can be handled by setting $Y_i(t)$ to zero until the time at which individual i comes under observation.

The covariate structure is as follows. We denote the true covariate vector by $X_i(t)$, and its dimension by p. We partition the vector $X_i(t)$ into subvectors $W_i(t)$ and $Z_i(t)$, where $W_i(t)$ is a $p_1$-vector of error-prone covariates and $Z_i(t)$ is a $p_2$-vector of error-free covariates. We denote the observed value of $W_i(t)$ by $\tilde W_i(t)$. The vectors $W_i(t)$ and $\tilde W_i(t)$ are assumed to be discrete. The possible values of $W_i(t)$ (each one a $p_1$-vector) are denoted by $w_1, \ldots, w_K$. The range of values of $\tilde W_i(t)$ is assumed to be the same as that for $W_i(t)$. For example, we could have a scalar binary covariate representing the presence or the absence of a given condition. Or the covariate might be the number of servings of a certain food that a person consumes per day. We denote by $k(i,t)$ the value of k such that $\tilde W_i(t) = w_k$. The vector $Z_i(t)$ of error-free covariates is allowed to be either discrete or continuous. The case where the model involves interaction terms between the error-prone and error-free covariates can be accommodated with suitable minor notational changes.

We assume that the measurement error process is 'localized' in the sense that it depends only on the current true covariate value. More precisely, the assumption is that, conditional on the value of $X_i(t)$, the value of $\tilde W_i(t)$ is independent of the survival and censoring processes and of the values of $X_i(s)$ for $s \neq t$. This assumption is plausible in many circumstances, such as situations in which the main source of error is a technical or laboratory error, or a reading/coding error, as with diagnostic X-rays and dietary intake assessments. The assumption will not be directly satisfied for covariates that represent cumulative exposure, though it may be possible to adapt our approach to cumulative exposure variables by working with the successive increments in observed exposure. For time-independent covariates, the assumption reduces to an assumption that the measurement error is independent of the survival and censoring processes. Under the localized error assumption, $Y_i(t)$ and $\tilde W_i(t)$ are conditionally independent given $X_i(t)$.

We denote $A^{(i,t)}_{kl} = \Pr(\tilde W_i(t) = w_l \mid W_i(t) = w_k, Z_i(t))$, which defines a square matrix $A^{(i,t)}$ of classification probabilities. Note that the formulation here differs from that of Zucker and Spiegelman [19]. In addition, we allow here for the possibility that the classification probabilities depend on individual-specific factors, including the error-free covariates. Note also that the classification probabilities $A^{(i,t)}_{kl}$ are allowed to depend on t, either directly or through time-dependent individual-specific factors. Under this formulation, we can account for improvements in measurement techniques over time. In addition, if internal validation data are available, we can dispense with the localized error assumption. The assumption can be avoided by using classification rate estimates based only on the internal validation sample units that are still at risk at each given point in time. For now, we assume that $A^{(i,t)}$ is known. In Section 3, we will consider the case where $A^{(i,t)}$ is estimated.

We work under the Cox proportional hazards model, where the hazard function is of the form

$$\lambda(t \mid x) = \lambda_0(t)\, e^{\beta^T x} \tag{1}$$

with $\lambda_0(t)$ being a baseline hazard function of unspecified form and β being a p-vector of unknown regression parameters that we wish to estimate. It is possible to extend the methodology to the case where $e^{\beta^T x}$ is replaced by a general relative risk function ψ(x; β), as in Thomas [29] and Breslow and Day [30, Section 5.1(c)]. This extension is described in a version of this paper available at the following website: http://www.hsph.harvard.edu/faculty/spiegelman/manuscripts.html

2.2. The key idea

The key idea behind our method is as follows. The Cox partial likelihood score function involves terms of the form $G(X_i(t)) = G(W_i(t), Z_i(t))$, where G is some function. Since $W_i(t)$ is not directly observed, $G(W_i(t), Z_i(t))$ cannot be directly evaluated. Instead, we seek an observable function $G_i^*(\tilde W_i(t), Z_i(t))$ such that

$$E[G_i^*(\tilde W_i(t), Z_i(t)) \mid W_i(t), Z_i(t)] = G(W_i(t), Z_i(t)) \tag{2}$$

In the case of discrete error-prone covariates, a $G_i^*$ satisfying (2) may be constructed by a simple device. We define

$$G_i^*(\tilde W_i(t), Z_i(t)) = \sum_{l=1}^{K} B^{(i,t)}_{k(i,t)\,l}\, G(w_l, Z_i(t)) \tag{3}$$

where $k(i,t)$ is as defined previously and $B^{(i,t)}_{kl}$ is the $(k,l)$ element of the matrix $B^{(i,t)} = [A^{(i,t)}]^{-1}$. We then have

$$E[G_i^*(\tilde W_i(t), Z_i(t)) \mid W_i(t) = w_m, Z_i(t)] = \sum_{k=1}^{K} A^{(i,t)}_{mk} \sum_{l=1}^{K} B^{(i,t)}_{kl}\, G(w_l, Z_i(t)) = \sum_{l=1}^{K}\left(\sum_{k=1}^{K} A^{(i,t)}_{mk} B^{(i,t)}_{kl}\right) G(w_l, Z_i(t)) = G(w_m, Z_i(t)) \tag{4}$$

so that (2) is indeed satisfied. Akazawa et al. [26] introduced this device for the case of logistic regression with covariate error.
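To make the device concrete, here is a minimal numerical check (a Python sketch with made-up values of A and G, not data from the paper) that the construction (3) satisfies condition (2) for a single binary covariate:

```python
import numpy as np

# Classification matrix for a binary covariate (K = 2):
# A[k, l] = Pr(observed category l | true category k); B = A^{-1}.
A = np.array([[0.9, 0.1],
              [0.2, 0.8]])
B = np.linalg.inv(A)

G = np.array([1.0, 3.5])   # arbitrary function values G(w_1), G(w_2)

# (3): G*(w_l) = sum_{l'} B[l, l'] G(w_{l'}), one value per observed category
G_star = B @ G

# (4): E[G*(W-tilde) | W = w_k] = sum_l A[k, l] G*(w_l) recovers G(w_k),
# since A B = I
print(A @ G_star)   # -> [1.0, 3.5], equal to G
```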

2.3. The method

We now present our method. The classical Cox [4, 31] partial likelihood score function in the case with no measurement error is given by

$$U_r(\beta) = \frac{1}{n}\sum_{i=1}^n \delta_i\left(X_{ir}(T_i) - \frac{e_{1r}(T_i)}{e_0(T_i)}\right) \tag{5}$$

where

$$e_0(t) = \frac{1}{n}\sum_{j=1}^n Y_j(t)\, e^{\beta^T X_j(t)}, \qquad e_{1r}(t) = \frac{1}{n}\sum_{j=1}^n Y_j(t)\, X_{jr}(t)\, e^{\beta^T X_j(t)}$$

Now define

$$\xi_{ir}(t) = \sum_{l=1}^{K} B^{(i,t)}_{k(i,t)\,l}\, x_r(w_l, Z_i(t)) \tag{6}$$
$$\psi_i(t,\beta) = \sum_{l=1}^{K} B^{(i,t)}_{k(i,t)\,l}\, \exp(\beta^T x(w_l, Z_i(t))) \tag{7}$$
$$\eta_{ir}(t,\beta) = \sum_{l=1}^{K} B^{(i,t)}_{k(i,t)\,l}\, x_r(w_l, Z_i(t))\, \exp(\beta^T x(w_l, Z_i(t))) \tag{8}$$
$$e_0^*(t) = \frac{1}{n}\sum_{j=1}^n Y_j(t)\, \psi_j(t,\beta) \tag{9}$$
$$e_{1r}^*(t) = \frac{1}{n}\sum_{j=1}^n Y_j(t)\, \eta_{jr}(t,\beta) \tag{10}$$

where x(w, z) denotes the x vector formed by the subvectors w and z. Our proposed corrected score function is then given by the following obvious analogue of (5):

$$U_r^*(\beta) = \frac{1}{n}\sum_{i=1}^n \delta_i\left(\xi_{ir}(T_i) - \frac{e_{1r}^*(T_i)}{e_0^*(T_i)}\right) \tag{11}$$

The proposed corrected score estimator is the solution to U* (β)=0, where U* denotes the vector whose components are Ur*.
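As an illustration, the following Python sketch (our code, with hypothetical inputs T, delta, and k_obs; it is not taken from the paper) evaluates (11) for a single time-independent binary error-prone covariate with no error-free covariates, and obtains β̂ by a bracketing root search:

```python
import numpy as np
from scipy.optimize import brentq

def corrected_score(beta, T, delta, k_obs, A, w=np.array([0.0, 1.0])):
    """Corrected score U*(beta) of (11): a minimal sketch for one
    time-independent binary error-prone covariate, no error-free ones."""
    B = np.linalg.inv(A)
    expbw = np.exp(beta * w)           # exp(beta * w_l) for l = 1, 2
    xi  = B[k_obs] @ w                 # (6); k_obs[i] in {0, 1}
    psi = B[k_obs] @ expbw             # (7)
    eta = B[k_obs] @ (w * expbw)       # (8)
    U = 0.0
    for i in np.flatnonzero(delta):    # sum over observed events
        risk = T >= T[i]               # at-risk indicator Y_j(T_i)
        # the 1/n factors in (9) and (10) cancel in the ratio
        U += xi[i] - eta[risk].sum() / psi[risk].sum()
    return U / len(T)

# beta_hat solves U*(beta) = 0, e.g.
# beta_hat = brentq(corrected_score, -5.0, 5.0, args=(T, delta, k_obs, A))
```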

We have

$$E[Y_i(t)\,\xi_{ir}(t) \mid X_i(t)] = E\big[Y_i(t)\, E[\xi_{ir}(t) \mid X_i(t), Y_i(t)] \,\big|\, X_i(t)\big] = E\big[Y_i(t)\, E[\xi_{ir}(t) \mid X_i(t)] \,\big|\, X_i(t)\big] = E[Y_i(t)\, X_{ir}(t) \mid X_i(t)] \tag{12}$$

where the second equality follows from the conditional independence of $Y_i(t)$ and $\tilde W_i(t)$ given $X_i(t)$, and the third equality follows from the argument of Section 2.2. Similarly,

$$E[Y_i(t)\,\psi_i(t,\beta) \mid X_i(t)] = E[Y_i(t)\, e^{\beta^T X_i(t)} \mid X_i(t)] \tag{13}$$
$$E[Y_i(t)\,\eta_{ir}(t,\beta) \mid X_i(t)] = E[Y_i(t)\, X_{ir}(t)\, e^{\beta^T X_i(t)} \mid X_i(t)] \tag{14}$$

Thus, referring to the quantity in parentheses in (11), the first term and the numerator and denominator of the second term all have the correct expectation. Now, U* (β) is not an exactly unbiased estimating function, because the expectation of a ratio is not equal to the ratio of the expectations. However, as indicated in Appendix A.1, it follows from the law of large numbers that U* (β) is an asymptotically unbiased score function.

Accordingly, under standard conditions like those of Andersen and Gill [32], our corrected score estimator will be consistent and asymptotically normal. Appendix A.1 presents an outline of the asymptotic arguments. See Huang and Wang [33] for a related discussion in a similar context. Denoting the true value of β by $\beta_0$, the asymptotic covariance matrix of $n^{1/2}(\hat\beta - \beta_0)$ may be estimated by the sandwich formula

$$\hat V = D(\hat\beta)^{-1} H(\hat\beta)\, D(\hat\beta)^{-1} \tag{15}$$

Here $D(\beta)$ is −1 times the matrix of derivatives of $U^*(\beta)$ with respect to the components of β and $H(\beta)$ is an empirical estimate of the covariance matrix of $n^{1/2}U^*(\beta)$. To define these matrices, some additional notation is needed. We define

$$\hat\Upsilon_{ir}(\beta) = \delta_i\left[\xi_{ir}(T_i) - \frac{e_{1r}^*(T_i)}{e_0^*(T_i)}\right] - \frac{1}{n}\sum_{j:\, T_j \leq T_i} \delta_j\left[\frac{\eta_{ir}(T_j,\beta)}{e_0^*(T_j)} - \frac{e_{1r}^*(T_j)}{e_0^*(T_j)}\cdot\frac{\psi_i(T_j,\beta)}{e_0^*(T_j)}\right] \tag{16}$$

In this definition, the first term tends to be the dominant term, especially if the event is rare. We further define

$$e_{2rs}^*(t) = \frac{1}{n}\sum_{j=1}^n Y_j(t) \sum_{l=1}^{K} B^{(j,t)}_{k(j,t)\,l}\, x_r(w_l, Z_j(t))\, x_s(w_l, Z_j(t))\, \exp(\beta^T x(w_l, Z_j(t)))$$

With these definitions, we have

$$H_{rs}(\beta) = \frac{1}{n}\sum_{i=1}^n \hat\Upsilon_{ir}(\beta)\,\hat\Upsilon_{is}(\beta) \tag{17}$$
$$D_{rs}(\beta) = \frac{1}{n}\sum_{i=1}^n \delta_i\left[\frac{e_{2rs}^*(T_i)}{e_0^*(T_i)} - \left(\frac{e_{1r}^*(T_i)}{e_0^*(T_i)}\right)\left(\frac{e_{1s}^*(T_i)}{e_0^*(T_i)}\right)\right] \tag{18}$$

The expression for Drs (β) is derived by straightforward differentiation. The derivation of the expression for Hrs (β) is given in Appendix A.1.
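A sketch of the variance computation (15)–(18), continuing the scalar setting of the score sketch above (all names are ours):

```python
import numpy as np

def sandwich_variance(beta, T, delta, k_obs, A, w=np.array([0.0, 1.0])):
    # Sketch of (15)-(18) for scalar beta; returns V-hat of (15) divided
    # by n, i.e. an estimate of var(beta_hat).
    n = len(T)
    B = np.linalg.inv(A)
    expbw = np.exp(beta * w)
    xi, psi = B[k_obs] @ w, B[k_obs] @ expbw
    eta = B[k_obs] @ (w * expbw)
    et2 = B[k_obs] @ (w ** 2 * expbw)        # summand of e*_2rr
    ev = np.flatnonzero(delta)               # indices of events
    Te = T[ev]                               # event times
    Y = (T[:, None] >= Te[None, :]).astype(float)   # Y[j, m] = Y_j(Te_m)
    e0, e1, e2 = psi @ Y / n, eta @ Y / n, et2 @ Y / n
    D = np.sum(e2 / e0 - (e1 / e0) ** 2) / n             # (18)
    # Upsilon-hat of (16): event term minus a sum over earlier event times;
    # note Y[i, m] = 1 exactly when Te_m <= T_i
    ups = np.zeros(n)
    ups[ev] = xi[ev] - e1 / e0
    ups -= (eta * (Y @ (1.0 / e0)) - psi * (Y @ (e1 / e0 ** 2))) / n
    H = np.mean(ups ** 2)                                # (17)
    return (H / D ** 2) / n                              # (15), scaled by 1/n
```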

In the internal validation design, for each individual i in the internal validation sample we can carry out the estimation with $\tilde W_i$ replaced by $W_i$ and $A^{(i,t)}$ replaced by the identity matrix. Alternatively, we can employ the hybrid scheme of Zucker and Spiegelman [19, Section 5], where a separate estimator of β is computed for the validation sample and for the main study excluding the validation sample, and the two estimators are then combined. The hybrid scheme is likely to be more efficient when the validation sample is sizable.

The case where there are replicate measurements $\tilde W_{ij}$ of $W_i$ on the individuals in the main study can be handled in various ways. A simple approach is to redefine the quantities given in (6)–(8) by replacing $B^{(i,t)}_{k(i,t)\,l}$ with the mean of $B^{(i,t)}_{k(i,j,t)\,l}$ over the replicates for individual i, with $k(i,j,t)$ defined as the value of k such that $\tilde W_{ij}(t) = w_k$. The development then proceeds as before.

2.4. Estimation of the cumulative hazard

The cumulative hazard $\Lambda_0(t) = \int_0^t \lambda_0(u)\,du$ can be estimated using the Breslow-type estimator

$$\hat\Lambda_0(t) = \frac{1}{n}\sum_{i=1}^n \frac{\delta_i\, I(T_i \leq t)}{e_0^*(T_i, \hat\beta)} \tag{19}$$

As discussed at the end of Appendix A.1, the quantity $n^{1/2}(\hat\Lambda_0(t) - \Lambda_0(t))$, for a given t, is asymptotically mean-zero normal. In Appendix A.1, we present an estimator of the variance of this estimator.
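A sketch of (19) in the same scalar setting (the 1/n factor matches the definition of $e_0^*$ in (9)):

```python
import numpy as np

def breslow_cumhaz(t, beta_hat, T, delta, k_obs, A, w=np.array([0.0, 1.0])):
    # Corrected Breslow estimator (19), continuing the sketches above.
    n = len(T)
    psi = np.linalg.inv(A)[k_obs] @ np.exp(beta_hat * w)   # (7) at beta_hat
    Lam = 0.0
    for i in np.flatnonzero((delta == 1) & (T <= t)):
        e0 = psi[T >= T[i]].sum() / n                      # e0*(T_i, beta_hat)
        Lam += 1.0 / e0
    return Lam / n
```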

3. ESTIMATED CLASSIFICATION PROBABILITIES

We now indicate the changes needed to handle the case where the $A^{(i,t)}_{kl}$ are estimated. The relevant estimates may be obtained in several ways. In some cases, estimates are obtained from an external validation study, that is, a separate study with measurements of both $W_i$ and $\tilde W_i$. Alternately, an internal validation design is used, with some individuals in the main survival study having measurements on both $W_i$ and $\tilde W_i$. Another possibility is to base the estimates on internal or external replicate measures data; we discuss this in more detail at the end of the section. The theory developed in this section represents a step beyond the work of Akazawa et al. [26], who considered only the case where the classification probabilities are known. This theory is applied in the example presented in Section 5.

The main issue is how to adjust the covariance matrix of the estimates to account for the estimation error in $A^{(i,t)}_{kl}$. Following Zucker and Spiegelman [19], we express $A^{(i,t)}$ as $A^{(i,t)}(\omega)$ for some $q'$-vector of parameters ω. The nature of the function $A^{(i,t)}(\omega)$ is dictated by the measurement error model employed. As an illustration, consider the simplest case: a single binary covariate with a common classification matrix $A(\omega)$ for all individuals. In this case, with the false-positive and false-negative rates allowed to be different, $A(\omega)$ takes the following form (where we assume that the sum of the off-diagonal elements is less than 1):

$$A(\omega) = \begin{bmatrix} \omega_1 & 1-\omega_1 \\ 1-\omega_2 & \omega_2 \end{bmatrix} \tag{20}$$
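A minimal sketch of this parametrization, together with the inverse matrix B and its derivatives (the rule $\dot B_\nu = -B\dot A_\nu B$, quoted later in this section, is applied at the end; the numerical values of ω are hypothetical):

```python
import numpy as np

# Parametrization (20) for a binary covariate:
# omega = (omega_1, omega_2) = (specificity, sensitivity).
def A_of_omega(omega):
    w1, w2 = omega
    return np.array([[w1, 1.0 - w1],
                     [1.0 - w2, w2]])

# Partial derivatives of A with respect to omega_1 and omega_2
A_dot = [np.array([[1.0, -1.0], [0.0, 0.0]]),
         np.array([[0.0, 0.0], [-1.0, 1.0]])]

omega = np.array([0.9, 0.8])        # hypothetical parameter values
Amat = A_of_omega(omega)
Bmat = np.linalg.inv(Amat)

# Rule for differentiating an inverse matrix: B-dot_nu = -B A-dot_nu B
B_dot = [-Bmat @ Ad @ Bmat for Ad in A_dot]
```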

We presume that the parameter vector ω is estimated from a study of one of the types described above, with m independent units. We presume further that the study yields an estimator ω̂ having an approximate normal distribution with mean ω and covariance matrix $m^{-1}\Gamma$, along with an estimator Γ̂ of the matrix Γ. This setup is a typical one in practical applications. For example, for the case of a single 0–1 binary covariate with internal or external validation data, the estimates of $\omega_k = \Pr(\tilde W = k-1 \mid W = k-1)$, k = 1, 2, are given by the obvious sample proportions, and Γ is a 2×2 diagonal matrix with $\Gamma_{kk} = \omega_k(1-\omega_k)/\vartheta_k$, where $\vartheta_k$ is the fraction of individuals with $W = k-1$ in the validation study. The procedure for a replicate measures study is discussed at the end of this section. For the asymptotics we assume that m and n are of the same order of magnitude, i.e. $m/n \to \zeta$ for some constant ζ as $n \to \infty$. Otherwise, the error in $A^{(i,t)}(\omega)$ will either be dominated by or will dominate the error in β̂ due to the variation in the survival data. Typically, ζ will be between 0 and 1.

Let us now write the corrected score function as U* (β, ω) to indicate explicitly the dependence on ω. Also, let us denote the true value of β by β0 (as before) and the true value of ω by ω0. Since we are now estimating ω0 by ω̂, our estimating equation for β is now U* (β, ω̂) = 0. Using Taylor’s theorem, we can write

$$0 = U^*(\hat\beta, \hat\omega) \approx U^*(\beta_0, \omega_0) - D(\beta_0)(\hat\beta - \beta_0) + \dot U^*(\beta_0, \omega_0)(\hat\omega - \omega_0)$$

where $-D_{rs}$ is the partial derivative of $U_r^*(\beta, \omega)$ with respect to $\beta_s$ evaluated at $\omega_0$, and $\dot U^*(\beta, \omega)$ is a matrix whose $(r, \nu)$ element is the partial derivative of $U_r^*(\beta, \omega)$ with respect to $\omega_\nu$. Hence, we have

$$\hat\beta - \beta_0 \approx D(\beta_0)^{-1}\left[U^*(\beta_0, \omega_0) + \dot U^*(\beta_0, \omega_0)(\hat\omega - \omega_0)\right] \tag{21}$$

If ω is estimated from external data, then U*(β0, ω0) and ω̂ are obviously independent. When ω is estimated from an internal validation sample, the following argument can be put forth to show that these two quantities are asymptotically independent. The contribution to U*(β0, ω0) made by each individual in the study asymptotically has expectation zero conditional on the true covariate values. This is true in particular of the individuals in the internal validation sample. It hence follows from an iterated expectation argument that U*(β0, ω0) and ω̂ are asymptotically uncorrelated (cf. [17, Appendix A]). Since we have asymptotic normality as well, this implies asymptotic independence. As a result, by Slutsky's theorem, the two terms in brackets in (21) are asymptotically independent, since $\dot U^*(\beta_0, \omega_0)$ converges to a deterministic limit.

It therefore follows that, when external data or an internal validation sample is used to estimate ω, the necessary covariance adjustment may be accomplished by replacing the matrix H(β) with the following corrected version:

$$H^{(\mathrm{corr})}(\beta, \omega) = H(\beta, \omega) + \zeta^{-1}\,\dot U^*(\beta, \omega)\,\hat\Gamma\,\dot U^*(\beta, \omega)^T \tag{22}$$

To present the formulas for $\dot U^*_{r\nu}(\beta)$, we define $\dot A_\nu^{(i,t)}$ and $\dot B_\nu^{(i,t)}$ to be, respectively, the partial derivatives of the matrices $A^{(i,t)}$ and $B^{(i,t)}$ with respect to $\omega_\nu$. By the rule for differentiating an inverse matrix, we have $\dot B_\nu^{(i,t)} = -B^{(i,t)}\dot A_\nu^{(i,t)}B^{(i,t)}$. We then obtain

$$\dot U^*_{r\nu}(\beta) = \frac{1}{n}\sum_{i=1}^n \delta_i\left[\dot\xi_{ir:\nu}(T_i) - \frac{\dot e^*_{1r:\nu}(T_i)}{e_0^*(T_i)} + \left(\frac{e_{1r}^*(T_i)}{e_0^*(T_i)}\right)\left(\frac{\dot e^*_{0:\nu}(T_i)}{e_0^*(T_i)}\right)\right] \tag{23}$$

where $\dot\xi_{ir:\nu}$, $\dot e^*_{0:\nu}$, and $\dot e^*_{1r:\nu}$ are defined analogously to $\xi_{ir}$, $e^*_0$, and $e^*_{1r}$, with $B^{(i,t)}_{k(i,t)\,l}$ replaced by $\dot B^{(i,t)}_{k(i,t)\,l:\nu}$.
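Continuing the scalar sketch, the following computes (23) for a single component $\omega_\nu$, reusing B_dot from the sketch after (20); the comment at the end indicates how (22) is then applied (again our code, not the paper's):

```python
import numpy as np

def score_dot(beta, T, delta, k_obs, A, Bd, w=np.array([0.0, 1.0])):
    # Sketch of (23) for the binary setting; Bd is B-dot_nu = -B A-dot_nu B.
    n = len(T)
    B = np.linalg.inv(A)
    expbw = np.exp(beta * w)
    psi, eta = B[k_obs] @ expbw, B[k_obs] @ (w * expbw)
    xi_d = Bd[k_obs] @ w                       # xi with B replaced by B-dot
    psi_d, eta_d = Bd[k_obs] @ expbw, Bd[k_obs] @ (w * expbw)
    Ud = 0.0
    for i in np.flatnonzero(delta):
        r = T >= T[i]
        e0, e1 = psi[r].sum() / n, eta[r].sum() / n
        e0d, e1d = psi_d[r].sum() / n, eta_d[r].sum() / n
        Ud += xi_d[i] - e1d / e0 + (e1 / e0) * (e0d / e0)
    return Ud / n

# Correction (22), scalar omega: H_corr = H + (1 / zeta) * Ud * Gamma_hat * Ud,
# with Gamma_hat the estimated variance of sqrt(m)(omega_hat - omega).
```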

When ω is estimated from an internal replicate measures sample, U* (β0, ω0) and ω̂ are no longer asymptotically independent, and hence we must work out the covariance between them. A typical scenario is the case of an i.i.d. setup where ω is estimated by maximum likelihood. In Appendix A.2, we present the appropriate corrected version of H for this setting.

The estimate for var(Λ̂0(t)) can be corrected in a similar manner. Appendix A.3 provides the details.

We now discuss in more depth the estimation of classification probabilities from a replicate measures study. There is a large literature on this problem (and extended versions thereof), especially for the case of a binary error-prone covariate, where the problem is generally known as the problem of evaluating a diagnostic marker without a gold standard. See [35] for a review of this area. The simplest case is that of a single binary (0–1) error-prone covariate W, with subject i in the replicate measures study having $R_i$ replicate measurements $\tilde W_{ij}$ of the surrogate measure $\tilde W$, where the replicates are conditionally i.i.d. given W. This case is relevant to many risk factors of interest in epidemiological studies, such as high blood pressure and high estradiol level, where the risk factor is assessed using direct physiological measurements. In this setting, the subject totals $\tilde W_i^{(\mathrm{tot})} = \sum_j \tilde W_{ij}$ are sufficient statistics. Denote $\alpha_1 = \Pr(\tilde W = 1 \mid W = 0)$ and $\alpha_2 = \Pr(\tilde W = 1 \mid W = 1)$. Then, conditional on $W = k-1$ (k = 1, 2), $\tilde W_i^{(\mathrm{tot})}$ has a $\mathrm{Bin}(R_i, \alpha_k)$ distribution. Defining π to be $\Pr(W = 1)$ within the replicate measures sample, the marginal distribution of $\tilde W_i^{(\mathrm{tot})}$ is a mixture of the $\mathrm{Bin}(R_i, \alpha_1)$ and $\mathrm{Bin}(R_i, \alpha_2)$ distributions, with respective mixture probabilities $1-\pi$ and π:

$$\Pr(\tilde W_i^{(\mathrm{tot})} = j) = (1-\pi)\binom{R_i}{j}\alpha_1^{\,j}(1-\alpha_1)^{R_i - j} + \pi\binom{R_i}{j}\alpha_2^{\,j}(1-\alpha_2)^{R_i - j} \tag{24}$$

The model is identifiable provided that some positive proportion of subjects have $R_i \geq 3$ and the correlation between W and $\tilde W$ is positive. The latter condition is equivalent to the condition $\Pr(\tilde W = 1 \mid W = 0) + \Pr(\tilde W = 0 \mid W = 1) < 1$. The likelihood function can be expressed directly from (24), and the parameters π, $\alpha_1$, and $\alpha_2$ can then be estimated by maximum likelihood.
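A sketch of this maximum likelihood step (w_tot and R are hypothetical arrays of subject totals and replicate counts; binom.pmf supplies the two binomial components of (24)):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import binom

def neg_loglik(theta, w_tot, R):
    # theta = (pi, alpha_1, alpha_2); mixture-of-binomials likelihood (24)
    pi, a1, a2 = theta
    mix = (1 - pi) * binom.pmf(w_tot, R, a1) + pi * binom.pmf(w_tot, R, a2)
    return -np.sum(np.log(mix))

# res = minimize(neg_loglik, x0=np.array([0.3, 0.1, 0.9]),
#                args=(w_tot, R), bounds=[(1e-4, 1 - 1e-4)] * 3)
# pi_hat, a1_hat, a2_hat = res.x
# Identifiability requires R_i >= 3 for a positive proportion of subjects
# and Pr(W~=1|W=0) + Pr(W~=0|W=1) < 1, as noted above.
```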

In more general settings, the basic ideas are similar, but of course the details are more complex. Several papers have discussed methods for using replicate data to estimate classification probabilities for more general settings, including parsimonious models for polychotomous W and models that allow the conditional distribution of the replicates given W to involve some dependence [8, 3639]. The case of correlated replicates is of particular relevance to risk factors that are measured through self-report, such as dietary or physical activity variables.

4. SIMULATION STUDY

To investigate how our method performs, we carried out a simulation study in the setting of a single 0–1 binary covariate. The design of our simulation study was patterned after Zucker and Spiegelman [19, Section 6] and Zucker [20, Section 3.1]. The assumed study duration was 5 years. The baseline survival distribution was taken to be Weibull, with baseline hazard function $\lambda_0(t) = \alpha\mu(\mu t)^{\alpha-1}$. The power parameter α was taken equal to 5, which is typical of many types of cancer [30, Section 6.3; 40]. The scale parameter μ was chosen so as to yield a 25 per cent 5-year cumulative incidence rate for the unexposed population. Censoring was taken to be exponential with a rate of 1 per cent per year. For brevity of presentation, the false-positive rate $\Pr(\tilde W = 1 \mid W = 0)$ and the false-negative rate $\Pr(\tilde W = 0 \mid W = 1)$ were taken to be equal to a common classification error rate. A range of values was explored for the prevalence of the risk factor (5, 25, 40 per cent), the classification error rate (1, 5, 10, 20 per cent), and the true relative risk (1.5, 2.0). The number of simulation replications was 5000.
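For concreteness, one way to generate data under this design (our sketch; the inverse-transform step uses the survival function $S(t \mid w) = \exp(-(\mu t)^\alpha e^{\beta w})$ implied by (1) and the Weibull baseline):

```python
import numpy as np

rng = np.random.default_rng(12345)

def simulate(n=2000, beta=np.log(2.0), prev=0.25, err=0.05,
             alpha=5.0, tau=5.0):
    # Weibull baseline lambda_0(t) = alpha*mu*(mu*t)^(alpha-1); mu chosen so
    # the unexposed 5-year cumulative incidence is 25 per cent.
    mu = (-np.log(0.75)) ** (1.0 / alpha) / tau
    W = rng.binomial(1, prev, n)                      # true binary exposure
    U = rng.uniform(size=n)
    T0 = (-np.log(U) * np.exp(-beta * W)) ** (1.0 / alpha) / mu
    C = np.minimum(rng.exponential(100.0, n), tau)    # 1%/year + study end
    T = np.minimum(T0, C)
    delta = (T0 <= C).astype(int)
    flip = rng.uniform(size=n) < err                  # common error rate
    W_obs = np.where(flip, 1 - W, W)                  # misclassified surrogate
    return T, delta, W, W_obs
```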

In our first simulation scenario, we assumed that the classification probabilities are known. We took the sample size to be 2000, leading to approximately 500 events in total. Table I shows the results. For comparison, we also present the simulation results given in [19] for the naive Cox partial likelihood estimator ignoring the measurement error, and for the parametric log relative risk estimator obtained by maximizing the full Weibull log likelihood under the relevant measurement error model.

Table I.

Simulation results for a single binary covariate with known classification rates.

Columns (left to right): per cent exposed; error rate; true RR; per cent bias in estimated log RR for the naive Cox, CSCORE, and FWMLE estimators; empirical standard deviation of CSCORE; mean of the CSCORE standard deviation estimates; MSE ratio; 95 per cent CI coverage.
5 1 1.5 −15.89 −1.93 −1.92 0.196 0.194 1.00 95.12
5 1 2.0 −14.00 −0.46 −0.43 0.176 0.173 1.00 95.10
5 5 1.5 −48.45 −4.73 −4.77 0.264 0.255 1.00 94.90
5 5 2.0 −46.08 −2.19 −2.16 0.231 0.218 1.00 93.90
5 10 1.5 −66.02 −6.64 −6.99 0.349 0.338 1.03 94.76
5 10 2.0 −64.16 −2.87 −2.90 0.313 0.281 0.99 93.62
5 20 1.5 −82.24 −17.22 −23.81 0.591 0.620 1.22 93.07
5 20 2.0 −80.92 −7.20 −9.87 0.531 0.482 1.04 91.87
25 1 1.5 −3.03 −0.06 −0.02 0.096 0.096 1.00 94.92
25 1 2.0 −3.01 −0.15 −0.10 0.090 0.089 1.00 94.46
25 5 1.5 −14.09 0.02 0.06 0.106 0.105 1.00 94.56
25 5 2.0 −13.89 −0.28 −0.22 0.100 0.097 1.00 94.62
25 10 1.5 −26.61 −0.01 0.04 0.123 0.120 1.00 94.90
25 10 2.0 −26.14 −0.33 −0.30 0.112 0.110 1.00 94.60
25 20 1.5 −48.41 −0.54 −0.49 0.167 0.164 1.00 94.96
25 20 2.0 −47.42 −0.18 −0.15 0.156 0.149 0.99 94.10
40 1 1.5 −2.43 −0.36 −0.32 0.088 0.087 1.00 94.62
40 1 2.0 −2.11 −0.03 0.01 0.081 0.082 1.00 95.22
40 5 1.5 −10.03 0.39 0.43 0.095 0.094 1.00 94.58
40 5 2.0 −10.35 0.02 0.07 0.091 0.089 1.00 94.90
40 10 1.5 −20.63 0.01 0.05 0.108 0.106 1.00 94.74
40 10 2.0 −20.40 0.28 0.30 0.102 0.101 1.00 95.04
40 20 1.5 −41.04 −0.34 −0.33 0.140 0.142 1.00 95.58
40 20 2.0 −40.66 0.29 0.34 0.135 0.135 1.00 94.90

Note Sample size=2000, unexposed cumulative incidence=25 per cent.

RR, relative risk; CSCORE, corrected score estimator; FWMLE, full Weibull maximum likelihood estimator; MSE ratio, ratio of mean square error of FWMLE to that of CSCORE.

The naive Cox estimator was typically badly biased except under 1 per cent misclassification with exposure prevalence of 25 or 40 per cent. By contrast, our method exhibited excellent performance, comparable to that of the fully parametric Weibull estimator. Under an exposure prevalence of 25 or 40 per cent, our method yielded nearly zero bias in the estimated log relative risk, nearly unbiased standard deviation estimates, and accurate confidence interval coverage. With an exposure prevalence of 5 per cent, the performance of all three estimators under consideration was degraded. This finding is not surprising, because the 5 per cent exposure situation presents two difficulties: (1) the expected number of events in the exposed group is only of the order of 25–50, and (2) with a misclassification rate of 5 per cent or more, the predictive value of an observed positive exposure is low. The naive Cox estimator was drastically biased. Our estimator and the Weibull estimator were dramatically less biased, but still exhibited some bias. This bias was due in part to outlying values; for both our estimator and the Weibull estimator, the deviation between the median value of the estimates and the true log relative risk was noticeably lower than the deviation between the mean estimated value and the true value. Overall, in terms of mean square error, the performance of our estimator was found to be nearly identical to that of the Weibull estimator. In a few cases, our estimator was better; this reflects the fact that, for a given finite sample size, the asymptotically optimal parametric maximum likelihood estimator can be outperformed by an alternate estimator. The performance of the method proposed here essentially matches that of the methods of Zucker and Spiegelman [19] and Zucker [20], except that the method here was better for the problematic cases with 5 per cent exposure prevalence. The same pattern is seen under a sample size of 10 000 with a cumulative incidence rate of 5 per cent for the unexposed (results not shown).

In our second simulation scenario, we assumed that the classification probabilities are estimated from an external replicate measures study with 250 subjects and three replicate measurements per subject. The procedure for estimating the classification probabilities is described at the end of the preceding section. To get around low cell counts when the misclassification rate is very small (viz. 0.01), we added 1/2 to all the cell counts. We ran two sets of simulations, one for a sample size of 2000 (about 500 events in total) and the other for a sample size of 1000 (about 250 events in total). Table II presents the results.

Table II.

Simulation results for a single binary covariate with estimated classification rates.

Columns (left to right): sample size; per cent exposed; error rate; true RR; per cent bias in β̂; empirical standard deviation; mean of the standard deviation estimates; 95 per cent CI coverage.
2000 5 1 1.5 −1.10 0.206 0.204 93.14
5 1 2.0 0.22 0.185 0.185 93.72
5 5 1.5 −1.96 0.293 0.292 93.30
5 5 2.0 2.54 0.278 0.270 94.17
5 10 1.5 −1.31 0.441 0.491 94.30
5 10 2.0 4.48 0.412 0.439 94.87
5 20 1.5 13.13 0.670 1.663 96.83
5 20 2.0 11.30 0.637 1.758 95.37
25 1 1.5 0.36 0.097 0.097 94.90
25 1 2.0 0.23 0.091 0.090 94.70
25 5 1.5 0.34 0.109 0.107 94.54
25 5 2.0 0.47 0.101 0.100 95.02
25 10 1.5 0.57 0.123 0.124 94.90
25 10 2.0 0.83 0.117 0.117 95.44
25 20 1.5 1.51 0.183 0.181 95.24
25 20 2.0 3.39 0.187 0.179 95.56
40 1 1.5 1.03 0.087 0.087 95.30
40 1 2.0 0.59 0.083 0.083 94.84
40 5 1.5 0.36 0.095 0.095 95.12
40 5 2.0 0.54 0.091 0.091 94.78
40 10 1.5 0.63 0.106 0.108 95.24
40 10 2.0 0.66 0.106 0.104 94.98
40 20 1.5 2.08 0.150 0.149 94.96
40 20 2.0 1.79 0.148 0.148 95.72
1000 5 1 1.5 −4.07 0.296 0.297 91.58
5 1 2.0 0.68 0.263 0.263 92.42
5 5 1.5 −7.94 0.426 0.442 91.78
5 5 2.0 0.56 0.382 0.383 92.30
5 10 1.5 −12.20 0.600 0.790 91.48
5 10 2.0 −1.38 0.546 0.641 92.79
5 20 1.5 −15.98 2.805 2.130 96.76
5 20 2.0 −12.77 1.776 1.913 95.04
25 1 1.5 −0.47 0.137 0.137 94.56
25 1 2.0 0.55 0.126 0.127 95.46
25 5 1.5 0.25 0.153 0.151 94.76
25 5 2.0 0.89 0.141 0.141 95.16
25 10 1.5 0.20 0.176 0.175 94.80
25 10 2.0 0.33 0.165 0.163 94.24
25 20 1.5 0.34 0.253 0.255 94.78
25 20 2.0 2.91 0.241 0.239 95.18
40 1 1.5 0.33 0.121 0.124 95.32
40 1 2.0 0.22 0.116 0.117 95.58
40 5 1.5 0.80 0.135 0.135 94.90
40 5 2.0 0.68 0.130 0.129 94.86
40 10 1.5 0.50 0.154 0.153 95.20
40 10 2.0 0.60 0.147 0.146 94.88
40 20 1.5 2.92 0.210 0.211 95.12
40 20 2.0 1.93 0.202 0.204 95.84

Note Classification rates estimated from an external replicate measures sample of size 250. Unexposed cumulative incidence=25 per cent. RR, relative risk.

Our method performed very well when the exposure prevalence was 25 or 40 per cent. Across the board, for both n=2000 and 1000, the bias in the estimated log relative risk was minimal, the standard deviation estimate was on target, and the confidence interval coverage was accurate. With n=2000, in most cases there was minimal change in the standard deviation of the log relative risk estimate due to the estimation of the misclassification rates, as compared with the standard deviation under known misclassification rates (shown in Table I). The one exception to this was the case of 20 per cent misclassification, where there was a 5–20 per cent increase in the standard deviation due to the estimation of the misclassification rates. The standard deviations for n=1000 were greater than those for n=2000 by about the expected factor of √2.

The method performed somewhat less well when the exposure prevalence was 5 per cent. The bias was higher in this situation, in some cases reaching the 10–20 per cent level. Still, this is much better than the bias of the naive Cox estimate (shown for n=2000 in Table I). The standard deviation estimates and confidence interval coverage were noticeably inaccurate in some cases. Also, under true misclassification rates of 20 per cent, the misclassification rates could not be successfully estimated in around 10–15 per cent of the simulation replications.

In summary, our method generally performed very well. Good performance was maintained even with estimated misclassification probabilities, except for some problems when the exposure prevalence rate was very low.

5. EXAMPLE

We illustrate our method on data from the Nurses Health Study concerning the relationship between dietary calcium (Ca) intake and cancer of the distal colon (i.e. the furthermost segment of the large intestine) [27, Table 4]. The data consist of observations on female nurses whose calcium intake was assessed through a food frequency questionnaire (FFQ) in 1984 and who were followed up to 31 May 1996 for distal colon cancer occurrence. Our analysis includes data on 60 575 nurses who reported in 1984 that they had never taken calcium supplements. In this cohort, there were 150 cases of distal colon cancer during the follow-up period. The analysis focuses on the effect of baseline calcium intake after adjustment for baseline body mass index (BMI) and baseline aspirin use. BMI is defined as the person's weight in kilograms divided by the square of the person's height in meters, and is a standard measure of a person's build (low BMI means thin, high BMI means fat). As in Wu et al.'s Table 4, we work with a binary 'high Ca' risk factor defined as 1 if the calcium intake was greater than 700 mg/day and 0 otherwise. Note that one glass of milk contains approximately 300 mg of calcium. BMI is expressed in terms of the following categories: <22, 22 to <25, 25 to <30, and 30 kg/m2 or greater. Aspirin use is coded as yes (1) or no (0). Thus, our model has five explanatory variables, one for the binary risk factor (W), three dummy variables for BMI (Z1, Z2, Z3), and one for aspirin use (Z4). BMI and aspirin use status are assumed to be measured without error.

It is well known that the FFQ measures dietary intake with some degree of error and more reliable information can be obtained from a diet record (DR) [41, Chapter 6]. We thus take W to be the Ca risk factor indicator based on the DR and $\tilde W$ to be the Ca risk factor indicator based on the FFQ. The classification probabilities are estimated using data from the Nurses Health Study validation study [41, pp. 122–126]. The estimated specificity was $\widehat{\Pr}(\tilde W = 0 \mid W = 0) = 0.78$, with an estimated standard error of 0.042. The estimated sensitivity was $\widehat{\Pr}(\tilde W = 1 \mid W = 1) = 0.72$, with an estimated standard error of 0.046.

Table III presents the results of the following analyses: (1) a naive classical Cox regression analysis ignoring measurement error, corresponding to an assumption that there is no measurement error, (2) our method with A assumed known and set according to the foregoing estimated classification probabilities, ignoring the estimation error in these probabilities, and (3) our method with A estimated as above with the estimation error in the probabilities taken into account (main study/external validation study design). The last of these analyses makes use of the theory developed in Section 3.

Table III.

Estimated coefficients and standard errors for the Nurses Health Study of the relationship between dietary calcium intake and distal colon cancer incidence.

Columns (left to right): method, followed by the estimate and standard error for each of: high calcium, BMI of 22 to <25, BMI of 25 to <30, BMI of 30+, and aspirin use.
Cox −0.3448 0.1694 0.6837 0.2240 0.5352 0.2395 0.5729 0.2876 −0.4941 0.1954
CS0 −0.7121 0.3690 0.7124 0.2247 0.5776 0.2419 0.6157 0.2892 −0.4994 0.1955
CS1 −0.7121 0.3832 0.7124 0.2249 0.5776 0.2423 0.6157 0.2896 −0.4994 0.1955

Note Cox, classical Cox regression analysis; CS0, corrected score method, observed classification matrix taken as known; CS1, corrected score method, accounting for uncertainty in the classification matrix.

The results followed the expected pattern. Adjusting for the misclassification in calcium intake had a marked effect on the estimated relative risk for high calcium intake. Accounting for the error in estimating the classification probabilities increased (modestly) the standard error of the log relative risk estimate. The relative risk estimates for high calcium intake and the corresponding 95 per cent confidence intervals obtained in the three analyses were as follows:

Method Estimate 95 per cent CI
Naive Cox 0.71 [0.51, 0.99]
A Known 0.49 [0.24, 1.01]
A Estimated 0.49 [0.23, 1.04]

In general, in the multivariate setting, measurement error (including misclassification) can lead to either attenuation or magnification of covariate effects. In our example, the misclassification led to attenuation of the high calcium effect, so that the corrected relative risk was further from the null value of 1 than the naive relative risk. The misclassification correction had a small effect on the estimated regression coefficients for the BMI dummy variables and essentially no effect on the estimated regression coefficient for aspirin use.

6. SUMMARY AND DISCUSSION

We have considered the Cox [4] proportional hazards model with a set of covariates that includes error-prone discrete covariates along with error-free covariates, which may be discrete or continuous. The misclassification in the discrete error-prone covariates is allowed to be of any specified form. Building on the work of Nakamura [24, 25] and Akazawa et al. [26], we have developed an easily implemented corrected score method for this setting. The method can handle all three major study designs (internal validation design, external validation design, and replicate measures design), both functional and structural error models, and time-dependent covariates satisfying the ‘localized error’ condition described in Section 2. Also, for the internal validation design, the ‘localized error’ condition can be eliminated by using time-dependent classification rate estimates. The method thus represents a significant advance relative to other methods in the literature for this problem. The method performed well in a simulation study, both with misclassification probabilities known and with misclassification probabilities estimated from an external replicate measures study.

In most applications, the new method developed in this paper will be easier to apply than, and preferable to, our previous method based on weighted transformed Kaplan–Meier curves [19]. Our previous method requires defining strata for every possible configuration of the entire covariate vector (both the error-prone and error-free parts). Except when the number of configurations is small, this leads to cumbersome implementation and loss of data for strata having no events. Our current method avoids this problem. Also, our previous method cannot handle continuous error-free covariates, while the current method can. Additionally, our previous method cannot handle time-dependent covariates (nor can the method of Zucker [20]), whereas our current method can handle such covariates if the 'localized error' condition applies or internal validation data are available. In some applications, our previous method might be preferred on account of reduced computational burden. Also, our previous method may be more convenient when it is desired to apply a measurement error correction based on published Kaplan–Meier curves for various risk groups as presented in medical and other subject-matter journals. This point is of particular relevance to meta-analysis applications.

This work focuses on the case where the error-prone covariates are discrete. This is admittedly a limitation. However, much of the existing work on the Cox model with covariate error focuses on continuous covariates with independent additive error, and as such does not apply or generalize easily to discrete covariates with misclassification. In many epidemiological studies, the error-prone covariates of interest are in fact discrete. Thus, the method presented here fills a definite need.

Still, there are cases where it is of interest to investigate continuous error-prone risk factors. In the case of a single error-prone continuous risk factor W, the basic equation (2) for classical likelihood models takes the form

$$G(w) = \int a(\tilde w \mid w)\, G^*(\tilde w)\, d\tilde w \tag{25}$$

where $a(\tilde w \mid w)$ is the conditional density of $\tilde W$ given W, the integral is over the entire range of $\tilde W$, and the arguments $Z_i$ and β are suppressed. Analogous equations are obtained for the case of multiple error-prone continuous covariates. Equation (25) is a Fredholm integral equation of the first kind. Such equations are discussed, for example, in Delves and Mohamed [42, Chapter 12], where numerical solution techniques are presented. These techniques could be applied to the measurement error problem in suitable cases. However, as Delves and Mohamed indicate, such equations can sometimes be ill conditioned and do not always have a solution. Thus, for example, for the logistic regression model with additive normal covariate error, Stefanski [43] showed that a corrected score function satisfying (2) does not exist.
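To illustrate, a quadrature discretization of (25) reduces it to a linear system (a sketch under an assumed additive normal error density, chosen purely for illustration; as just noted, the resulting system can be badly conditioned):

```python
import numpy as np
from scipy.stats import norm

grid = np.linspace(-4.0, 4.0, 81)     # common grid for w and w-tilde
h = grid[1] - grid[0]
G = np.exp(0.5 * grid)                # target function G(w) on the grid

# Kernel matrix K[i, j] ~ a(w~_j | w_i) * h, so (25) reads G = K @ G_star
K = norm.pdf(grid[None, :], loc=grid[:, None], scale=0.3) * h
G_star = np.linalg.solve(K, G)

# Residual of the forward map; np.linalg.cond(K) reveals the ill conditioning
print(np.max(np.abs(K @ G_star - G)))
```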

One way around the problem is to carry out a mild discretization of the error-prone covariate, fine enough to reduce the bias satisfactorily but not so fine as to lead to numerical problems. This approach will not produce a strictly consistent estimator, but it is reasonable to expect that the bias will be small. This supposition is supported by Cochran’s [44] classic work on subclassification, which indicates that the bulk of the information in a continuous variable can often be captured in a discretized version with four to six categories. We are currently exploring this discretization approach in more depth.

Alternatively, an attempt can be made to modify the corrected score approach so that it will work for the model under consideration. Huang and Wang [45] developed such a modification for logistic regression. In their work, the terms in the likelihood score function were re-weighted to yield a new likelihood score function for which a corrected score satisfying (2) can be derived. These authors dealt only with independent additive error, which for the Cox model is already covered by existing corrected score methods [33, 46, 47]. Modification of the corrected score approach under other measurement error structures is an open problem.

ACKNOWLEDGEMENTS

This work was supported in part by a grant from the U.S. National Cancer Institute. We thank Els Goetghebeur and Malka Gorfine for their helpful comments and Ruifeng Li for assistance in the data analysis for the example. We also thank the Associate Editor and referees for comments that led to a substantially improved presentation.

APPENDIX A: TECHNICAL DETAILS

A.1. Outline proof of consistency and asymptotic normality

We give here an outline derivation of the asymptotic properties of our estimator. Our goal is to indicate the main steps of the argument without dwelling on the technical details. For simplicity, we focus on the i.i.d. structural model setting; our development will go through for the functional model setting as well provided that {Xi (t) : i = 1,2, . . .} exhibits ergodic behavior suitably similar to that of an i.i.d. sequence. We assume the parameter space is a compact set ℬ of β values of which the true value β0 is an interior point. We further assume that regularity conditions along the lines of Andersen and Gill [32] (AG) are in force over ℬ.

We take up first the issue of consistency. We are operating under AG’s ‘asymptotic stability’ conditions, which in the i.i.d. case follow from the functional law of large numbers in Andersen and Gill’s Appendix III. Define

$$s_0(t,\beta) = E[Y_i(t)\exp(\beta^T X_i(t))], \qquad s_{1r}(t,\beta) = E[Y_i(t)\, X_{ir}(t)\exp(\beta^T X_i(t))]$$

Using AG’s arguments, including appeal to the asymptotic stability conditions, we find that Ur (β) converges uniformly over ℬ to

$$u_r(\beta) = \int\left[s_{1r}(t,\beta_0) - \frac{s_{1r}(t,\beta)}{s_0(t,\beta)}\, s_0(t,\beta_0)\right]\lambda_0(t)\,dt$$

In view of equations (12)–(14) of our Section 2.3, the same arguments yield the result that $U_r^*(\beta)$ converges uniformly over ℬ to $u_r(\beta)$. We thus have the following:

  1. The function U* (β), being continuous over ℬ, is therefore uniformly continuous over ℬ.

  2. The function U* (β) converges uniformly to u(β) over ℬ.

  3. As can be seen by inspection, u(β0) = 0.

Moreover, as AG show, β0 is the only zero point of u(β) in ℬ. As a result, given the compactness of the parameter space, convergence of β̂ to β0 follows by standard subsequence arguments.

We now discuss the asymptotic normality of β̂. By Taylor expansion we may write

$$n^{1/2}(\hat\beta - \beta_0) = D(\tilde\beta)^{-1}\, n^{1/2}\, U^*(\beta_0)$$

where β̃ lies between $\beta_0$ and β̂, and hence converges to $\beta_0$. Hence, as in AG, $D(\tilde\beta)$ converges to the limiting value of $D(\beta_0)$, which exists by virtue of the asymptotic stability conditions. It now remains only to show that $n^{1/2}U^*(\beta_0)$ is asymptotically normal.

We first recall expression (11) for the corrected score function:

$$U_r^*(\beta) = \frac{1}{n}\sum_{i=1}^n \delta_i\left(\xi_{ir}(T_i) - \frac{e_{1r}^*(T_i)}{e_0^*(T_i)}\right)$$

Since $U_r^*(\beta_0)$ does not have expectation exactly zero, the martingale approach of AG cannot be applied to derive the asymptotic distribution of $U^*(\beta_0)$. Instead, we follow the approach of Lin and Wei [48]. From this point forward, all quantities involving β (including those in which the dependence is suppressed from the notation) are evaluated at the true value $\beta_0$, except in (A1), which presents a definition for general β. We use counting process notation, based on the definition $N_i(t) = I(T_i \leq t,\, \delta_i = 1)$. We define

$$\bar N(t) = \frac{1}{n}\sum_{i=1}^n N_i(t), \qquad \mathcal N(t) = E[N_i(t)]$$

We have, from the law of large numbers along with (12) and (13), that $\bar N(t) \to \mathcal N(t)$, $e_0^*(t) \to s_0(t)$, and $e_{1r}^*(t) \to s_{1r}(t)$ as $n \to \infty$. Here, the dependence on β is suppressed from the notation, and, as mentioned above, evaluation is at $\beta = \beta_0$. As seen from Andersen and Gill [32, Appendix III], the convergence is uniform in t. In the development below, the symbol ≐ will denote equality up to negligible terms.

We can write

$$\begin{aligned} U_r^*(\beta_0) &= \frac{1}{n}\sum_{i=1}^n \int \xi_{ir}(t)\,dN_i(t) - \int \frac{e_{1r}^*(t)}{e_0^*(t)}\,d\bar N(t) \\ &= \frac{1}{n}\sum_{i=1}^n \int \xi_{ir}(t)\,dN_i(t) - \int \frac{e_{1r}^*(t)}{e_0^*(t)}\,d\mathcal N(t) - \int \frac{s_{1r}(t)}{s_0(t)}\,d(\bar N - \mathcal N)(t) - \int \left[\frac{e_{1r}^*(t)}{e_0^*(t)} - \frac{s_{1r}(t)}{s_0(t)}\right] d(\bar N - \mathcal N)(t) \\ &\doteq \frac{1}{n}\sum_{i=1}^n \int \xi_{ir}(t)\,dN_i(t) - \int \frac{e_{1r}^*(t)}{e_0^*(t)}\,d\mathcal N(t) - \int \frac{s_{1r}(t)}{s_0(t)}\,d(\bar N - \mathcal N)(t) \end{aligned}$$

In addition,

$$\begin{aligned} \frac{e_{1r}^*(t)}{e_0^*(t)} &= \frac{e_{1r}^*(t)}{s_0(t)} + e_{1r}^*(t)\left[\frac{1}{e_0^*(t)} - \frac{1}{s_0(t)}\right] = \frac{e_{1r}^*(t)}{s_0(t)} - \left[\frac{e_{1r}^*(t)}{e_0^*(t)\, s_0(t)}\right](e_0^*(t) - s_0(t)) \\ &\doteq \frac{1}{s_0(t)}\left[e_{1r}^*(t) - \frac{s_{1r}(t)}{s_0(t)}\,(e_0^*(t) - s_0(t))\right] \end{aligned}$$

Thus, substituting and re-arranging, we obtain

$$\begin{aligned} U_r^*(\beta_0) &\doteq \frac{1}{n}\sum_{i=1}^n \int \xi_{ir}(t)\,dN_i(t) - \int \frac{s_{1r}(t)}{s_0(t)}\,d(\bar N - \mathcal N)(t) - \int \frac{1}{s_0(t)}\left[e_{1r}^*(t) - \frac{s_{1r}(t)}{s_0(t)}\,(e_0^*(t) - s_0(t))\right] d\mathcal N(t) \\ &= \frac{1}{n}\sum_{i=1}^n \int \left(\xi_{ir}(t) - \frac{s_{1r}(t)}{s_0(t)}\right) dN_i(t) - \int \left(\frac{e_{1r}^*(t)}{s_0(t)} - \frac{s_{1r}(t)}{s_0(t)}\cdot\frac{e_0^*(t)}{s_0(t)}\right) d\mathcal N(t) \\ &= \frac{1}{n}\sum_{i=1}^n \int \left(\xi_{ir}(t) - \frac{s_{1r}(t)}{s_0(t)}\right) dN_i(t) - \frac{1}{n}\sum_{i=1}^n \int \left(\frac{Y_i(t)\,\eta_{ir}(t,\beta_0)}{s_0(t)} - \frac{s_{1r}(t)}{s_0(t)}\cdot\frac{Y_i(t)\,\psi_i(t,\beta_0)}{s_0(t)}\right) d\mathcal N(t) \\ &= \frac{1}{n}\sum_{i=1}^n \Upsilon_{ir}(\beta_0) \end{aligned}$$

where

$$\Upsilon_{ir}(\beta) = \delta_i\left[\xi_{ir}(T_i) - \frac{e_{1r}^*(T_i)}{e_0^*(T_i)}\right] - \int Y_i(t)\left[\frac{\eta_{ir}(t,\beta)}{s_0(t)} - \frac{s_{1r}(t)}{s_0(t)}\cdot\frac{\psi_i(t,\beta)}{s_0(t)}\right] d\mathcal N(t) \tag{A1}$$

From this representation, it follows by the classical central limit theorem for i.i.d. random vectors that $n^{1/2}U^*(\beta_0)$ is asymptotically mean-zero multivariate normal. It is straightforward to see that the asymptotic covariance matrix of $n^{1/2}U^*(\beta_0)$ can be estimated consistently by expression (17) evaluated at β̂.

Finally, we turn to the cumulative hazard estimator (19). This estimator can be expressed as

$$\hat\Lambda_0(t) = \int_0^t \frac{d\bar N(u)}{e_0^*(u, \hat\beta)}$$

Using arguments similar to the above, we find that

$$\hat\Lambda_0(t) - \Lambda_0(t) \doteq -a^T(\hat\beta - \beta_0) + \frac{1}{n}\sum_{i=1}^n \int_0^t \frac{dN_i(u)}{s_0(u,\beta_0)} - \frac{1}{n}\sum_{i=1}^n \int_0^t \frac{Y_i(u)\,\psi_i(u,\beta_0)}{s_0(u,\beta_0)^2}\,d\mathcal N(u)$$

with the rth element of a given by

$$a_r = \int_0^t \frac{s_{1r}(u,\beta_0)}{s_0(u,\beta_0)^2}\,d\mathcal N(u)$$

Using the approximation $\hat\beta - \beta_0 \doteq D(\beta_0)^{-1}U^*(\beta_0)$, and defining $c(\beta) = -D(\beta)^{-1}a$, we obtain

$$\hat\Lambda_0(t) - \Lambda_0(t) \doteq \frac{1}{n}\sum_{i=1}^n \Upsilon_i^*(\beta_0) \tag{A2}$$

where

$$\Upsilon_i^*(\beta) = \sum_{r=1}^p c_r(\beta)\,\Upsilon_{ir}(\beta) + \int_0^t \frac{dN_i(u)}{s_0(u,\beta)} - \int_0^t \frac{Y_i(u)\,\psi_i(u,\beta)}{s_0(u,\beta)^2}\,d\mathcal N(u)$$

As before, it is apparent from this representation that the estimator is asymptotically mean-zero normal with variance that can be estimated by

$$\widehat{\mathrm{var}}(\hat\Lambda_0(t)) = \frac{1}{n}\sum_{i=1}^n \hat\Upsilon_i^*(\hat\beta)^2 \tag{A3}$$

where

$$\hat\Upsilon_i^*(\beta) = \sum_{r=1}^p \hat c_r(\beta)\,\hat\Upsilon_{ir}(\beta) + \frac{\delta_i\, I(T_i \leq t)}{e_0^*(T_i,\beta)} - \frac{1}{n}\sum_{j=1}^n \frac{\delta_j\, I(T_j \leq \min\{T_i, t\})\,\psi_i(T_j,\beta)}{e_0^*(T_j,\beta)^2}$$

with ĉ(β) = −D(β)−1 â(β) and

$$\hat a_r(\beta) = \frac{1}{n}\sum_{i=1}^n \frac{\delta_i\, I(T_i \leq t)\, e_{1r}^*(T_i,\beta)}{e_0^*(T_i,\beta)^2}$$

A.2. Correction to $\widehat{\mathrm{var}}(\hat\beta)$ with $A^{(i,t)}$ estimated from internal replicate data

We work in the setting where the replicate data are i.i.d. across individuals, and ω is estimated by maximum likelihood. Let $R_i$ denote the number of replicates on individual i, and let $g_i(\omega)$ denote the log likelihood function for $(\tilde W_{i1}, \ldots, \tilde W_{iR_i})$. The overall normalized log likelihood is then $g(\omega) = m^{-1}\sum_{i\in\mathcal R} g_i(\omega)$, where ℛ denotes the set of individuals in the internal replicate measures sample. Let $g'(\omega)$ and $g''(\omega)$ denote the gradient vector and Hessian matrix, respectively, of $g(\omega)$, and let $g_i'(\omega)$ denote the gradient of $g_i(\omega)$. We can then express ω̂ in terms of the classic asymptotic approximation $\hat\omega - \omega_0 \doteq -g''(\omega_0)^{-1} g'(\omega_0)$. Let $\Upsilon_i(\omega)$ denote the vector comprising the quantities $\Upsilon_{ir}(\beta, \omega)$, as defined in (A1), and let $\hat\Upsilon_i(\hat\beta, \hat\omega)$ be the vector of corresponding estimated values $\hat\Upsilon_{ir}(\hat\beta, \hat\omega)$ as defined by (16). Define

$$\Phi = \mathrm{cov}\left(m^{-1/2}\sum_{i\in\mathcal R} \Upsilon_i(\omega_0),\; m^{1/2}\, g'(\omega_0)\right)$$

The limiting value of Φ can then be estimated empirically by

$$\hat\Phi = \frac{1}{m}\sum_{i\in\mathcal R} \hat\Upsilon_i(\hat\beta, \hat\omega)\, g_i'(\hat\omega)^T \tag{A4}$$

The appropriate corrected version of H is then

$$H^{(\mathrm{corr})} = H + \zeta^{-1}\,\dot U^*(\hat\beta,\hat\omega)\,\hat\Gamma\,\dot U^*(\hat\beta,\hat\omega)^T - \hat\Phi\, g''(\hat\omega)^{-1}\,\dot U^*(\hat\beta,\hat\omega)^T \tag{A5}$$

where for the present setup we have $\hat\Gamma = -g''(\hat\omega)^{-1}$.

A.3. Correction to $\widehat{\mathrm{var}}(\hat\Lambda_0(t))$ with $A^{(i,t)}$ estimated

When, as in Section 3, we estimate the parameters ω that determine the classification probabilities, representation (A2) becomes

$$\hat\Lambda_0(t) - \Lambda_0(t) \doteq \frac{1}{n}\sum_{i=1}^n \Upsilon_i^*(\beta_0) + h^T(\hat\omega - \omega_0)$$

where the νth element of h is given by

$$h_\nu = [\dot U^*(\beta,\omega)^T c(\beta)]_\nu - \frac{1}{n}\sum_{i=1}^n \int_0^t \frac{Y_i(u)}{s_0(u,\beta)^2} \sum_{l=1}^K [\dot B_\nu^{(i,u)}]_{k(i,u)\,l}\, \exp(\beta^T x(w_l, Z_i(u)))\,d\mathcal N(u)$$

The estimate (A3) for var(Λ̂0(t)) can be corrected as follows. Define ĥ = ĥ(β ,ω) by

$$\hat h_\nu = [\dot U^*(\beta,\omega)^T \hat c(\beta)]_\nu - \frac{1}{n^2}\sum_{i=1}^n\sum_{j=1}^n \frac{\delta_j\, I(T_j \leq \min\{T_i, t\})}{e_0^*(T_j,\beta)^2} \sum_{l=1}^K [\dot B_\nu^{(i,T_j)}]_{k(i,T_j)\,l}\, \exp(\beta^T x(w_l, Z_i(T_j)))$$

With external data or an internal validation study, it is necessary merely to add to (A3) the term $\zeta^{-1}\hat h^T\hat\Gamma\hat h$. For the case of internal replicate data, the additional term is

$$\zeta^{-1}\hat h^T\hat\Gamma\hat h + \left[\frac{1}{m}\sum_{i\in\mathcal R} \hat\Upsilon_i^*(\hat\beta)\, g_i'(\hat\omega)\right]^T g''(\hat\omega)^{-1}\,\hat h$$

REFERENCES

  • 1. Fuller WA. Measurement Error Models. New York: Wiley; 1987.
  • 2. Carroll RJ, Ruppert D, Stefanski LA, Crainiceanu CM. Measurement Error in Nonlinear Models: A Modern Perspective (2nd edn). Boca Raton, FL/London: Chapman & Hall/CRC; 2006.
  • 3. Prentice R. Covariate measurement errors and parameter estimation in a failure time regression model. Biometrika 1982; 69:331–342.
  • 4. Cox DR. Regression models and life-tables (with discussion). Journal of the Royal Statistical Society, Series B 1972; 34:187–220.
  • 5. Bross IDJ. Misclassification in 2×2 tables. Biometrics 1954; 10:478–486.
  • 6. Chen TT. A review of methods for misclassified categorical data in epidemiology. Statistics in Medicine 1989; 8:1095–1106. DOI: 10.1002/sim.4780080908.
  • 7. Kuha J, Skinner C. Categorical data analysis and misclassification. In Survey Measurement and Process Quality, Lyberg L, Biemer P, Collins M, DeLeeuw E, Dippo C, Schwarz N, Trewin D (eds). New York: Wiley; 1997; 633–670.
  • 8. Walter SD, Irwig LM. Estimation of test error rates, disease prevalence, and relative risk from misclassified data: a review. Journal of Clinical Epidemiology 1988; 41:923–937. DOI: 10.1016/0895-4356(88)90110-2.
  • 9. Kuha J, Skinner C, Palmgren J. Misclassification error. In Encyclopedia of Biostatistics (1st edn), vol. 1, Armitage P, Colton T (eds). New York: Wiley; 1998; 2615–2621.
  • 10. Marshall JR. Validation study methods for estimating exposure proportions and odds ratios with misclassified data. Journal of Clinical Epidemiology 1990; 43:941–947. DOI: 10.1016/0895-4356(90)90077-3.
  • 11. Morrissey MJ, Spiegelman D. Matrix methods for estimating odds ratios with misclassified exposure data: extensions and comparisons. Biometrics 1999; 55:338–344. DOI: 10.1111/j.0006-341x.1999.00338.x.
  • 12. Espeland MA, Hui SL. A general approach to analyzing epidemiologic data that contain misclassification errors. Biometrics 1987; 43:1001–1012.
  • 13. Spiegelman D, Rosner B, Logan R. Estimation and inference for logistic regression with covariate misclassification and measurement error in main study/validation study designs. Journal of the American Statistical Association 2000; 95:51–61.
  • 14. Zhou H, Pepe M. Auxiliary covariate data in failure time regression. Biometrika 1995; 82:139–149.
  • 15. Zhou H, Wang CY. Failure time regression with continuous covariates measured with error. Journal of the Royal Statistical Society, Series B 2000; 62:657–665.
  • 16. Chen YH. Cox regression in cohort studies with validation sampling. Journal of the Royal Statistical Society, Series B 2002; 64:51–62.
  • 17. Spiegelman D, McDermott A, Rosner B. The regression calibration method for correcting measurement error bias in nutritional epidemiology. American Journal of Clinical Nutrition 1997; 65(Suppl.):1179S–1186S. DOI: 10.1093/ajcn/65.4.1179S.
  • 18. Wang CY, Hsu L, Feng ZD, Prentice RL. Regression calibration in failure time regression. Biometrics 1997; 53:131–145.
  • 19. Zucker DM, Spiegelman D. Inference for the proportional hazards model with misclassified discrete-valued covariates. Biometrics 2004; 60:324–334. DOI: 10.1111/j.0006-341X.2004.00176.x.
  • 20. Zucker DM. A pseudo partial likelihood method for semi-parametric survival regression with covariate errors. Journal of the American Statistical Association 2005; 100:1264–1277.
  • 21. Hu P, Tsiatis AA, Davidian M. Estimating the parameters in the Cox model when covariate variables are measured with error. Biometrics 1998; 54:1407–1419.
  • 22.Küchenhoff H, Mwalili SM, Lesaffre E. A general method for dealing with misclassification in regression: the misclassification SIMEX. Biometrics. 2006;62:85–96. doi: 10.1111/j.1541-0420.2005.00396.x. [DOI] [PubMed] [Google Scholar]
  • 23.Herring AH, Ibrahim JG. Likelihood-based methods for missing covariates in the Cox proportional hazards model. Journal of the American Statistical Association. 2001;96:292–302. [Google Scholar]
  • 24.Nakamura T. Corrected score function of errors-in-variables models: methodology and application to generalized linear models. Biometrika. 1990;77:127–137. [Google Scholar]
  • 25.Nakamura T. Proportional hazards model with covariates subject to measurement error. Biometrics. 1992;48:829–838. [PubMed] [Google Scholar]
  • 26.Akazawa K, Kinukawa N, Nakamura T. A note on the corrected score function corrected for misclassification. Journal of the Japan Statistical Society. 1998;28:115–123. [Google Scholar]
  • 27.Wu K, Willett WC, Fuchs CS, Colditz GA, Giovannucci EL. Calcium intake and risk of colon cancer in women and men. Journal of the National Cancer Institute. 2002;94:437–446. doi: 10.1093/jnci/94.6.437. [DOI] [PubMed] [Google Scholar]
  • 28.Kalbfleisch JD, Prentice RL. The Statistical Analysis of Failure Time Data. 2nd edn. New York: Wiley; 2002. [Google Scholar]
  • 29.Thomas DC. General relative-risk models for survival time and matched case–control analysis. Biometrics. 1981;37:673–686. [Google Scholar]
  • 30.Breslow N, Day NE. Statistical Methods in Cancer Research, Volume 2: The Design and Analysis of Cohort Studies. Oxford: Oxford University Press; 1993. [PubMed] [Google Scholar]
  • 31.Cox DR. Partial likelihood. Biometrika. 1975;62:269–276. [Google Scholar]
  • 32.Andersen PK, Gill RD. Cox’s regression model for counting processes: a large sample study. Annals of Statistics. 1982;10:1100–1120. [Google Scholar]
  • 33.Huang Y, Wang CY. Cox regression with accurate covariates unascertainable: a nonparametric-correction approach. Journal of the American Statistical Association. 2000;95:1209–1219. [Google Scholar]
  • 34.Spiegelman D, Carroll RJ, Kipnis V. Efficient regression calibration in main study/internal validation study designs with an imperfect reference instrument. Statistics in Medicine. 2001;20:139–160. doi: 10.1002/1097-0258(20010115)20:1<139::aid-sim644>3.0.co;2-k. [DOI] [PubMed] [Google Scholar]
  • 35.Hui SL, Zhou XH. Evaluation of diagnostic tests without gold standards. Statistical Methods in Medical Research. 1998;7:354–370. doi: 10.1177/096228029800700404. [DOI] [PubMed] [Google Scholar]
  • 36.Qu Y, Tan M, Kutner MH. Random effects models in latent class analysis for evaluating accuracy of diagnostic tests. Biometrics. 1996;52:797–810. [PubMed] [Google Scholar]
  • 37.Torrance-Rynard VL, Walter SD. Effects of dependent errors in the assessment of diagnostic test performance. Statistics in Medicine. 1997;6:2157–2175. doi: 10.1002/(sici)1097-0258(19971015)16:19<2157::aid-sim653>3.0.co;2-x. [DOI] [PubMed] [Google Scholar]
  • 38.Formann AK, Kohlmann T. Latent class analysis in medical research. Statistical Methods in Medical Research. 1996;5:179–211. doi: 10.1177/096228029600500205. [DOI] [PubMed] [Google Scholar]
  • 39.Albert PS, McShane LM, Shih JHUS. National Cancer Institute Bladder Tumor Marker Network. Latent class modeling approaches for assessing diagnostic error without a gold standard: with applications to p53 immunohistochemical assays in bladder tumors. Biometrics. 2001;57:610–619. doi: 10.1111/j.0006-341x.2001.00610.x. [DOI] [PubMed] [Google Scholar]
  • 40.Armitage P, Doll R. Proceedings of the 4th Berkeley Symposium on Mathematical Statistics and Probability. Volume 4: Contributions to Biology and Problems of Medicine. Berkeley, California: University of California Press; 1961. Stochastic models for carcinogenesis; pp. 19–38. [Google Scholar]
  • 41.Willett WC. Nutritional Epidemiology. 2nd edn. New York: Oxford University Press; 1998. [Google Scholar]
  • 42.Delves LM, Mohamed JL. Computational Methods for Integral Equations. Cambridge: Cambridge University Press; 1985. [Google Scholar]
  • 43.Stefanski L. Unbiased estimation of a nonlinear function of a normal mean with application to measurement-error models. Communications in Statistics, Theory and Methods. 1989;18:4335–4358. [Google Scholar]
  • 44.Cochran WG. The effectiveness of adjustment by subclassification in removing bias in observational studies. Biometrics. 1968;24:295–313. [PubMed] [Google Scholar]
  • 45.Huang Y, Wang CY. Consistent function methods for logistic regression with errors in covariates. Journal of the American Statistical Association. 2001;95:1209–1219. [Google Scholar]
  • 46.Kong FH, Gu M. Consistent estimation in Cox’s proportional hazards model with covariate measurement errors. Statistica Sinica. 1999;9:953–969. [Google Scholar]
  • 47.Hu C, Lin DY. Cox regression with covariate measurement error. Scandinavian Journal of Statistics. 2002;29:637–655. [Google Scholar]
  • 48.Lin DY, Wei LJ. The robust inference for the Cox proportional hazards model. Journal of the American Statistical Association. 1989;84:1074–1078. [Google Scholar]
  • 49.Liang KY, Zeger S. Longitudinal data analysis using generalized linear models. Biometrika. 1986;73:13–22. [Google Scholar]

RESOURCES