Published in final edited form as: J R Stat Soc Series B Stat Methodol. 2012 Dec 4;75(1):185–206. doi: 10.1111/j.1467-9868.2012.01052.x

Robust estimation for homoscedastic regression in the secondary analysis of case–control data

Jiawei Wei 1, Raymond J Carroll 2, Ursula U Müller 3, Ingrid Van Keilegom 4, Nilanjan Chatterjee 5
PMCID: PMC3639015  NIHMSID: NIHMS449968  PMID: 23637568

Summary

Primary analysis of case–control studies focuses on the relationship between disease D and a set of covariates of interest (Y, X). A secondary application of the case–control study, which is often invoked in modern genetic epidemiologic association studies, is to investigate the interrelationship between the covariates themselves. The task is complicated owing to the case–control sampling, where the regression of Y on X is different from what it is in the population. Previous work has assumed a parametric distribution for Y given X and derived semiparametric efficient estimation and inference without any distributional assumptions about X. We take up the issue of estimation of a regression function when Y given X follows a homoscedastic regression model, but otherwise the distribution of Y is unspecified. The semiparametric efficient approaches can be used to construct semiparametric efficient estimates, but they suffer from a lack of robustness to the assumed model for Y given X. We take an entirely different approach. We show how to estimate the regression parameters consistently even if the assumed model for Y given X is incorrect, and thus the estimates are model robust. For this we make the assumption that the disease rate is known or well estimated. The assumption can be dropped when the disease is rare, which is typically so for most case–control studies, and the estimation algorithm simplifies. Simulations and empirical examples are used to illustrate the approach.

Keywords: Biased samples, Homoscedastic regression, Secondary data, Secondary phenotypes, Semiparametric inference, Two-stage samples

1. Introduction

Case–control designs are popularly used for studying risk factors for rare diseases, such as cancers. Under this design, a fixed number of ‘cases’ and ‘controls’, i.e. subjects with and without the disease of interest, are sampled from an underlying base population. Data on various covariates on the subjects are then collected in a retrospective fashion so that they reflect history before the disease. The standard method for primary analysis of case–control data involves logistic regression modelling of the disease outcome as a function of the covariates of interest. It is well known that prospective logistic regression analysis for case–control data is efficient under a semiparametric framework that allows the ‘nuisance’ distribution of the underlying covariates to be unspecified (Prentice and Pyke, 1979).

Epidemiologic researchers popularly use controls from case–control studies to examine the interrelationship between certain covariates themselves. Such secondary analysis of case–control studies has received increasing attention in genetic epidemiologic studies, where it is often of interest to investigate the effect of genetic susceptibility, such as single-nucleotide polymorphism (SNP) genotypes, not only on the primary disease outcome, but also on various secondary factors, such as smoking habits, that may themselves be associated with the disease of interest. For such secondary analysis, use of only controls is generally considered a model robust approach since, when the disease is rare, the relationship between covariates in the controls should reflect that of the underlying population without any further model assumptions. It is, however, recognized that inclusion of cases in such analysis can increase efficiency, provided that appropriate adjustment can be made to account for non-random ascertainment in case–control sampling. Li et al. (2010), for example, reported that, if two binary covariates have no interaction with the risk of the disease on a logistic scale, then the association between the factors in the cases remains the same as that for the underlying population. Therefore in such a setting inclusion of cases can increase the efficiency of the secondary analysis.

In this paper, our goal is to develop an approach to secondary association analysis for a continuous covariate, say Y, in a case–control study setting so that both cases and controls can be used to increase efficiency and yet the resulting inference is model robust to distributional assumptions about the covariates. Suppose that data are originally collected from a case–control study of a relatively rare disease. Let D be disease status, with D = 1 denoting a case and D = 0 denoting a control. Suppose also that D is to be modelled by a vector of random covariates (Y, X), where Y is univariate and X is potentially multivariate, by using a standard logistic regression formulation. Consider here the homoscedastic regression model

Y=αtrue+μ(X,βtrue)+ε, (1)

where αtrue is an intercept and μ(·) is a known function, and where ε has mean 0 and is independent of X, but its distribution is otherwise not specified.

To estimate (αtrue, βtrue), we cannot simply ignore the case–control sampling scheme and use the data as they are, because, if Y is a predictor of disease status D, the sampling is biased and in the case–control sample model (1) will not hold.
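To make the sampling bias concrete, here is a minimal simulation sketch (not part of the paper; the population size, parameter values and sampling routine are illustrative assumptions). Because Y predicts D, the pooled case–control sample over-represents cases and distorts the naive least squares slope, whereas a controls-only fit is nearly unbiased for a rare disease.

```python
import numpy as np

rng = np.random.default_rng(1)

# Population obeying model (1): Y = alpha_true + beta_true*X + eps, eps ~ N(0,1)
N = 1_000_000
alpha_true, beta_true = 0.0, 1.0
X = rng.uniform(0.0, 1.0, N)
Y = alpha_true + beta_true * X + rng.normal(size=N)

# Logistic disease model: pr(D=1|Y,X) = H(theta0 + theta_y*Y + theta_x*X), a rare disease
theta0, theta_y, theta_x = -5.5, 0.5, 1.0
D = rng.binomial(1, 1.0 / (1.0 + np.exp(-(theta0 + theta_y * Y + theta_x * X))))

# Case-control sample: all cases plus an equal number of randomly chosen controls
cases = np.flatnonzero(D == 1)
controls = rng.choice(np.flatnonzero(D == 0), size=cases.size, replace=False)
cc = np.concatenate([cases, controls])

def ols_slope(x, y):
    """Least squares slope of y on x, with an intercept."""
    coef, *_ = np.linalg.lstsq(np.column_stack([np.ones_like(x), x]), y, rcond=None)
    return coef[1]

print("population slope    :", round(ols_slope(X, Y), 3))                     # close to 1
print("case-control slope  :", round(ols_slope(X[cc], Y[cc]), 3))             # biased when theta_y != 0
print("controls-only slope :", round(ols_slope(X[controls], Y[controls]), 3)) # close to 1
```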

This paper is organized as follows. In Section 2, we describe recent work on case–control studies that allows efficient estimation if the distribution of Y given X is specified up to parameters. Although the solution is elegant, it suffers from the fact that the resulting estimate may be biased if the hypothesized distribution for Y given X is misspecified.

Section 3 takes an entirely different approach to the basic general problem and describes a simple method that is robust to misspecification of the distribution of Y given X, under the assumption that the disease rate in the population is known or well estimated from a disease registry or as part of an on-going cohort. In Section 4 we describe extensions to the case in which the disease rate is not known but the disease is rare, and to the case of stratified or frequency-matched studies. Section 5 presents a series of simulation studies, whereas Section 6 presents analysis of an epidemiological data set. Concluding remarks are in Section 7. Technical details are given in Appendix A and Appendix B.

2. Efficient parametric estimation and robustness

2.1. Framework

In this section we outline recent work on efficient estimation for case–control studies when the distribution of Y given X is specified up to a finite dimensional parameter vector. We start with a logistic regression model underlying the case–control analysis, so that pr(D = 1|Y, X) = H{θ0 + m(Y, X, θ1)}, where H(·) is the logistic distribution function and m(·) is an arbitrary known function with unknown parameter vector θ1. For d = 0, 1, let πd = pr(D = d), the probability that D = d in the population, and suppose that there are n1 cases with D = 1 and n0 controls with D = 0. We write n = n0 + n1 and introduce the parameter κ = θ0 + log(n1/n0) − log(π1/π0). This reparameterization has the advantage that we can identify κ and θ1 from a logistic regression analysis of D on (Y, X), although we cannot identify θ0 (Prentice and Pyke, 1979; Chatterjee and Carroll, 2005) from such logistic regression alone.
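As a quick numerical illustration (the numbers are chosen only for illustration and are not taken from the paper), suppose θ0 = −5.5, the population disease rate is π1 = 0.01 and the design recruits equal numbers of cases and controls, n1 = n0. Then

\[ \kappa = \theta_0 + \log(n_1/n_0) - \log(\pi_1/\pi_0) = -5.5 + \log(1) - \log(0.01/0.99) \approx -5.5 + 0 + 4.60 = -0.90, \]

so the intercept identified by prospective logistic regression of D on (Y, X) in the case–control sample is κ ≈ −0.90 rather than θ0 = −5.5, whereas the coefficient vector θ1 is unaffected.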

In the parametric framework the conditional distribution of Y given X is modelled as fε{y − α − μ(x, β), ζ}, where ζ is a finite dimensional nuisance parameter. If in the population Y given X is normally distributed, then ζ = var(ε).

2.2. Population-based case–control studies and notation

Our explicit theoretical and asymptotic results are based on population-based case–control studies, i.e. studies in which random samples of (Y, X) are taken separately for D = 1 and D = 0. We shall refer to these simply as case–control studies. Some case–control studies use a form of stratification, which is sometimes called frequency matching, e.g. a population-based case–control study for each of a number of age ranges and the same number of cases and controls in each age group. With some notation and the inclusion of these strata in the logistic risk model and in the model for Y given X, our results are easily extended to such sampling; see Section 4.

We assume a logistic model for pr(D = 1|Y, X) as

\[ \mathrm{pr}(D=1\mid Y,X) = H\{\theta_0 + m(Y,X,\theta_1)\} = \frac{\exp\{\theta_0 + m(Y,X,\theta_1)\}}{1 + \exp\{\theta_0 + m(Y,X,\theta_1)\}}. \qquad (2) \]

Our technical assumptions are assumptions 1–4 in Appendix B.1.

We also mention two important calculations. The density fX of X in the population can be written as

fX(x)=π1fcase(x)+π0fcont(x), (3)

with (π0, π1) defined in Section 2.1, and where fcont(x) and fcase(x) represent the density of X given D = 0 and D = 1 respectively. Since this is a case–control sampling scheme, all expectations are conditional on D1, …, Dn. Define R(β) = Y − μ(X, β) and Ri(β) = Yi − μ(Xi, β). For an arbitrary function G,

\[ E\Bigl[n^{-1}\sum_{i=1}^{n} G\{R_i(\beta),X_i,D_i\}\Bigr] = E\Bigl(E\Bigl[n^{-1}\sum_{i=1}^{n} G\{R_i(\beta),X_i,D_i\}\Bigm| D_1,\dots,D_n\Bigr]\Bigr) = n^{-1}\sum_{i=1}^{n} E\bigl(E[G\{R_i(\beta),X_i,D_i\}\mid D_i]\bigr) = \sum_{d=0}^{1} (n_d/n)\, E[G\{R(\beta),X,d\}\mid D=d], \qquad (4) \]

the second and last steps following because (Y, X) are independent and identically distributed given D in the case–control sampling scheme.

2.3. Prior results and robustness

For the case–control studies that were described above, Jiang et al. (2006), Chen et al. (2008) and Lin and Zeng (2009) derived the efficient profile likelihood (in the sense that its score for β is an efficient score function), Lin and Zeng (2009) noting importantly that it can be used in our context. See also Monsees et al. (2009). Write Ω = (κ, θ1, θ0). The joint density of (D, Y, X) is

\[ f_X(x)\, f_\varepsilon\{y - \alpha - \mu(x,\beta),\zeta\}\, \frac{\exp[d\{\theta_0 + m(y,x,\theta_1)\}]}{1 + \exp\{\theta_0 + m(y,x,\theta_1)\}}. \]

Let

\[ g(d,y,x,\Omega,\alpha,\beta,\zeta) = f_\varepsilon\{y - \alpha - \mu(x,\beta),\zeta\}\, \exp[d\{\kappa + m(y,x,\theta_1)\}]\, [1 + \exp\{\theta_0 + m(y,x,\theta_1)\}]^{-1}. \]

The semiparametric efficient retrospective profile likelihood for β that makes no assumptions about the distribution of X when the distribution of Y given X is specified is

\[ \mathcal{L}_{\rm par}(D,Y,X,\Omega,\alpha,\beta,\zeta) = \frac{g(D,Y,X,\Omega,\alpha,\beta,\zeta)}{\sum_{d=0}^{1}\int g(d,t,X,\Omega,\alpha,\beta,\zeta)\, dt}. \]

Taking logarithms, summing over the observed data and then maximizing in the parameters yields semiparametric efficient inference.

A difficulty arises, however, if the density fε(·) of ε is not specified properly. To see what happens, consider the score for β. Define Lpar(y, x, α, β, ζ) = ∂log[fε{y − α − μ(x, β), ζ}]/∂β. Then the score for β is

\[ \mathcal{K}_{\rm par}(D,Y,X,\Omega,\alpha,\beta,\zeta) = \frac{\partial\log\{\mathcal{L}_{\rm par}(D,Y,X,\Omega,\alpha,\beta,\zeta)\}}{\partial\beta} = L_{\rm par}(Y,X,\alpha,\beta,\zeta) - \frac{\sum_{d=0}^{1}\int L_{\rm par}(t,X,\alpha,\beta,\zeta)\, g(d,t,X,\Omega,\alpha,\beta,\zeta)\, dt}{\sum_{d=0}^{1}\int g(d,t,X,\Omega,\alpha,\beta,\zeta)\, dt}. \qquad (5) \]

Because ℒpar(·) is a legitimate semiparametric profile likelihood, score (5), summed over the case–control data and evaluated at the true parameters, has mean 0 when the density fε(·) of ε is specified correctly. If fε(·) is misspecified, however, score (5) need not have mean 0 even at the true parameter values, i.e. the approach is not model robust; see Section 5 for numerical evidence. This motivates our search for a robust estimation method, which is a topic that we take up in the next section.

3. Model robust estimation

3.1. Preliminaries

In this section we assume the same framework as in the previous section, with the exception that fε is now unknown. We pursue a sequential approach to derive an estimating equation for the parameters that determine the regression function.

  1. Estimate the true logistic regression parameters κ and θ1 by ordinary logistic regression of D on (Y, X). This can be done legitimately because it is known that ordinary logistic regression in a case–control study consistently estimates κ and θ1 (Prentice and Pyke, 1979; Chatterjee and Carroll, 2005). Denote the estimators by κ̂ and θ̂1. We also suppose that we have a consistent estimator of θ0. This estimator can, for example, be the solution of the equation
    \[ \pi_1 = \pi_1 n_1^{-1}\sum_{i=1}^{n} D_i\, H\{\theta_0 + m(Y_i,X_i,\hat\theta_1)\} + \pi_0 n_0^{-1}\sum_{i=1}^{n} (1-D_i)\, H\{\theta_0 + m(Y_i,X_i,\hat\theta_1)\}, \qquad (6) \]
    when the disease rate π1 in the population is known or well estimated, either from a disease registry or from an underlying cohort from which the cases and controls are sampled (a numerical sketch of this step is given after this list). Equation (6) leads to a consistent estimator of θ0, since for any function g(y, x) we can estimate ∫g(y, x) fYX(y, x) dydx unbiasedly by
    \[ \sum_{d=0}^{1}\sum_{i=1}^{n} (\pi_d/n_d)\, I(D_i=d)\, g(Y_i,X_i). \]
    Call the resulting estimator θ̂0 and denote Ω̂ = (κ̂, θ̂1, θ̂0).
  2. Use a score function for β that would be an appropriate score function if the (Y, X) data arose from random sampling. Define R(β) = Y − μ(X, β). Then the simplest such score function is that from ordinary least squares, which is obtained by differentiating {Y − α − μ(X, β)}2 with respect to β. This yields the score function
     \[ L\{R(\beta),X,\alpha,\beta\} = \mu_\beta(X,\beta)\,\{R(\beta) - \alpha\}, \qquad (7) \]
    where the subscript means differentiation with respect to β.
  3. Score (7) will not have mean 0 in the case–control sampling scheme, so we adjust it so that it has mean 0 in general.

  4. For technical reasons that are described later, estimation of αtrue must be done via an auxiliary equation that depends on the current parameter values; we generically call the result α̂(β, Ω), and it replaces α in score (7). See Section 3.2 for the definition.

  5. Solve the adjusted score equation to estimate βtrue and hence αtrue. Good starting values for β can be obtained by least squares regression among the controls.
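To make step 1 concrete, here is a minimal numerical sketch of solving equation (6) for θ0 (illustrative code, not the authors' implementation; it assumes a linear risk function m(y, x, θ1) = θy y + θx′x, a known rate π1, and placeholder array names Y, X, D and theta1_hat for the data and the logistic regression estimates).

```python
import numpy as np
from scipy.optimize import brentq

def H(u):
    """Logistic distribution function."""
    return 1.0 / (1.0 + np.exp(-u))

def estimate_theta0(Y, X, D, theta1_hat, pi1):
    """Solve equation (6) for theta0, assuming m(y, x, theta1) = theta_y*y + theta_x'x.

    Y: (n,) secondary outcome; X: (n, p) covariates; D: (n,) 0-1 disease indicator;
    theta1_hat = (theta_y_hat, theta_x_hat) stacked as a vector of length p + 1.
    """
    pi0 = 1.0 - pi1
    n1, n0 = D.sum(), (1 - D).sum()
    m = theta1_hat[0] * Y + X @ theta1_hat[1:]        # m(Y_i, X_i, theta1_hat)

    def eq6(theta0):
        # right-hand side of equation (6) minus pi1; increasing in theta0
        h = H(theta0 + m)
        return pi1 / n1 * np.sum(D * h) + pi0 / n0 * np.sum((1 - D) * h) - pi1

    return brentq(eq6, -30.0, 10.0)                   # root search over a wide bracket
```

The bracket (−30, 10) is arbitrary; any interval wide enough to contain the root works, since the left-hand side of eq6 is monotone in θ0.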

Remark 1. The score function (7) is not the only one possible; for example, we could instead allow for robustness against outliers by replacing function (7) by the estimating equation of an M-estimator (Huber, 1981; Anderson, 2008).

3.2. Estimation algorithm

The development of our methodology is somewhat involved. Here we simply state our proposal, with its development given in Sections 3.3–3.5. As before, define R(β) = Y − μ(X, β). Remember that estimation of αtrue must be done by using an auxiliary equation; see equation (8) directly below. Define

\[ \mathcal{K}\{R_i(\beta),x,\beta,\Omega\} = \frac{1 + \exp[\kappa + m\{R_i(\beta)+\mu(x,\beta),x,\theta_1\}]}{1 + \exp[\theta_0 + m\{R_i(\beta)+\mu(x,\beta),x,\theta_1\}]}. \]

For given (β, Ω), the estimator of αtrue is justified in Section 3.5 and given by

\[ \hat\alpha(\beta,\Omega) = \frac{n^{-1}\sum_{i=1}^{n} R_i(\beta)\,\bigl[\sum_{d=0}^{1}(\pi_d/n_d)\sum_{j=1}^{n} I(D_j=d)\,\mathcal{K}\{R_i(\beta),X_j,\beta,\Omega\}\bigr]^{-1}}{n^{-1}\sum_{i=1}^{n}\bigl[\sum_{d=0}^{1}(\pi_d/n_d)\sum_{j=1}^{n} I(D_j=d)\,\mathcal{K}\{R_i(\beta),X_j,\beta,\Omega\}\bigr]^{-1}} = \frac{n^{-1}\sum_{i=1}^{n} R_i(\beta)\,\bigl[n^{-1}\sum_{j=1}^{n}\tilde{\mathcal{K}}\{R_i(\beta),X_j,\beta,\Omega,D_j\}\bigr]^{-1}}{n^{-1}\sum_{i=1}^{n}\bigl[n^{-1}\sum_{j=1}^{n}\tilde{\mathcal{K}}\{R_i(\beta),X_j,\beta,\Omega,D_j\}\bigr]^{-1}}, \qquad (8) \]

where

\[ \tilde{\mathcal{K}}\{R_i(\beta),X_j,\beta,\Omega,D_j\} = \sum_{d=0}^{1} (n\pi_d/n_d)\, I(D_j=d)\,\mathcal{K}\{R_i(\beta),X_j,\beta,\Omega\}. \]

Let μβ(x, β) = ∂μ(x, β)/∂β and let L{R(β), X, α, β} be as in equation (7). Then define

\[ \mathcal{H}_{n,{\rm est}}(\beta,\Omega) = n^{-1/2}\sum_{i=1}^{n}\biggl[L\{R_i(\beta),X_i,\hat\alpha(\beta,\Omega),\beta\} - \frac{n^{-1}\sum_{j=1}^{n} L\{R_i(\beta),X_j,\hat\alpha(\beta,\Omega),\beta\}\,\tilde{\mathcal{K}}\{R_i(\beta),X_j,\beta,\Omega,D_j\}}{n^{-1}\sum_{j=1}^{n}\tilde{\mathcal{K}}\{R_i(\beta),X_j,\beta,\Omega,D_j\}}\biggr]. \qquad (9) \]

Our algorithm then is as follows.

  1. Estimate (κ, θ1)T by (κ̂, θ̂1)T, the logistic regression estimates of D on (Y, X). As described previously, this is known to produce consistent estimates of (κtrue, θ1,true)T. Estimate θ0 as explained in Section 3.1. This leads to an estimator Ω̂ of Ωtrue.

  2. Solve 0 = ℋn,est(β, Ω̂) in β to obtain the estimate β̂.
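For readers who want to see how equations (8) and (9) fit together in practice, here is a schematic implementation (my own sketch, not the authors' code; it assumes a linear regression function μ(x, β) = x′β with the intercept handled through α̂, a linear risk function m(y, x, θ1) = θy y + θx′x, a known rate π1, and illustrative argument names).

```python
import numpy as np
from scipy.optimize import root

def robust_secondary_fit(Y, X, D, kappa_hat, theta1_hat, theta0_hat, pi1, beta_init):
    """Solve 0 = H_{n,est}(beta, Omega_hat) of equation (9), with alpha from equation (8).

    Y: (n,), X: (n, p), D: (n,) in {0, 1}; theta1_hat = (theta_y, theta_x) stacked.
    """
    n = len(Y)
    n1 = D.sum()
    n0 = n - n1
    pi0 = 1.0 - pi1
    w = np.where(D == 1, n * pi1 / n1, n * pi0 / n0)        # n*pi_d/n_d in K-tilde
    ty, tx = theta1_hat[0], theta1_hat[1:]

    def Ktilde(R, beta):
        # K-tilde{R_i(beta), X_j, beta, Omega, D_j} as an (n, n) matrix over (i, j)
        m = ty * (R[:, None] + (X @ beta)[None, :]) + (X @ tx)[None, :]
        K = (1.0 + np.exp(kappa_hat + m)) / (1.0 + np.exp(theta0_hat + m))
        return w[None, :] * K

    def score(beta):
        R = Y - X @ beta                                    # R_i(beta)
        Kt = Ktilde(R, beta)
        denom = Kt.mean(axis=1)                             # n^{-1} sum_j K-tilde, per i
        alpha = np.mean(R / denom) / np.mean(1.0 / denom)   # equation (8)
        resid = R - alpha
        num = (Kt @ X) / n                                  # n^{-1} sum_j X_j K-tilde_{ij}
        # i-th summand of (9): (R_i - alpha) * {X_i - num_i / denom_i}
        return (resid[:, None] * (X - num / denom[:, None])).sum(axis=0) / np.sqrt(n)

    return root(score, beta_init).x
```

As suggested in step 5 of Section 3.1, beta_init can be taken from a least squares fit among the controls.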

In the next few subsections, we describe how we obtained equation (9), and at the end we describe the asymptotic distribution theory.

3.3. Development of the score when fX and αtrue are known

3.3.1. Adjusting score (7)

We first describe how to proceed when the intercept αtrue, the density fX(·) of X in the population, and fε(t − αtrue), the density of Y − μ(X, βtrue) in the population, are all known; they are not and we shall show how to remove these restrictions in subsequent sections.

The approach is to start with the estimating function (7), which, when summed over the data, does not have mean 0 at the true parameters because of the case–control sampling scheme, i.e. E[Σ_{i=1}^{n} L{R_i(βtrue), Xi, αtrue, βtrue}|Di] ≠ 0, in general. Thus, we need to correct n^{-1}Σ_{i=1}^{n} L{R_i(β), Xi, α, β} so that it does have mean 0 in the case–control sampling scheme, where expectations are computed as in equation (4). In the on-line supplemental material, we show how to follow the approach of Chen et al. (2009), section 2.3.3, to develop the adjusted estimating function

\[ L\{R(\beta),X,\alpha_{\rm true},\beta\} - \frac{\int\!\!\int L(t,x,\alpha_{\rm true},\beta)\,\mathcal{K}(t,x,\beta,\Omega)\, f_\varepsilon(t-\alpha_{\rm true})\, f_X(x)\, dt\, dx}{\int\!\!\int \mathcal{K}(t,x,\beta,\Omega)\, f_\varepsilon(t-\alpha_{\rm true})\, f_X(x)\, dt\, dx}. \qquad (10) \]

This is not of much help, since none of fε(·), fX(·) or αtrue are known. In subsequent sections we show how to replace these terms by data-estimated quantities, and thus arrive at equation (9).

3.3.2. Replacing the unknown error density

The problem with expression (10) is that we do not know the form of fε(·), so score (10) cannot be implemented. Similarly to Chatterjee and Carroll (2005) and Spinka et al. (2005), we therefore replace fε(·) by a non-parametric maximum likelihood estimator. The idea is to take the observed Ri(β) = Yi − μ(Xi, β) as the support, and to maximize the log-likelihood with respect to γi = pr{R(β) = Ri(β)}, i = 1, …, n, subject to Σ_{i=1}^{n} γi = 1. By Chatterjee and Carroll (2005) and Spinka et al. (2005), the resulting estimator for pr{R(β) = Ri(β)} is

\[ p_{\rm est}\{R_i(\beta),\Omega\} = \frac{\pi_0}{n_0}\Bigl[\int f_X(x)\,\mathcal{K}\{R_i(\beta),x,\beta,\Omega\}\, dx\Bigr]^{-1}. \qquad (11) \]

The derivation of equation (11) is given in Appendix A.1. When we make this substitution in expression (10) and sum over the data, the score becomes

\[ \sum_{i=1}^{n} L\{R_i(\beta),X_i,\alpha_{\rm true},\beta\} - \frac{\sum_{i=1}^{n}\int L\{R_i(\beta),x,\alpha_{\rm true},\beta\}\,\mathcal{K}\{R_i(\beta),x,\beta,\Omega\}\, p_{\rm est}\{R_i(\beta),\Omega\}\, f_X(x)\, dx}{n^{-1}\sum_{i=1}^{n}\int \mathcal{K}\{R_i(\beta),x,\beta,\Omega\}\, p_{\rm est}\{R_i(\beta),\Omega\}\, f_X(x)\, dx}. \]

Because the denominator of this expression is π0/n0, by simple algebra it is readily seen that the normalized score function for estimating β can be defined as

\[ 0 = Q_n(\alpha_{\rm true},\beta,\Omega) = n^{-1/2}\sum_{i=1}^{n}\biggl[L\{R_i(\beta),X_i,\alpha_{\rm true},\beta\} - \frac{\int L\{R_i(\beta),x,\alpha_{\rm true},\beta\}\,\mathcal{K}\{R_i(\beta),x,\beta,\Omega\}\, f_X(x)\, dx}{\int \mathcal{K}\{R_i(\beta),x,\beta,\Omega\}\, f_X(x)\, dx}\biggr]. \qquad (12) \]

In Appendix A.2 we show that the expectation of Qn(αtrue, β, Ω) in the case–control sampling scheme is equal to 0 when evaluated at (αtrue, βtrue, Ωtrue), but not for arbitrary (β, Ω). This implies that equation (12) is indeed an unbiased estimating equation in the case–control sampling scheme.

3.4. Implementation when fX is unknown but αtrue is known

The density or mass function fX(·) is not known. We estimate the integrals in expression (12) unbiasedly by their sample average over all the observations, so our estimating equation is

\[ 0 = \mathcal{H}_n(\alpha_{\rm true},\beta,\Omega) = n^{-1/2}\sum_{i=1}^{n}\biggl[L\{R_i(\beta),X_i,\alpha_{\rm true},\beta\} - \frac{n^{-1}\sum_{j=1}^{n} L\{R_i(\beta),X_j,\alpha_{\rm true},\beta\}\,\tilde{\mathcal{K}}\{R_i(\beta),X_j,\beta,\Omega,D_j\}}{n^{-1}\sum_{j=1}^{n}\tilde{\mathcal{K}}\{R_i(\beta),X_j,\beta,\Omega,D_j\}}\biggr]. \qquad (13) \]

3.5. Implementation when the intercept αtrue is unknown

One might reasonably think that estimating the intercept is easy; for example, simply supplement the score with the ordinary least squares score for the intercept, so that L{R(β), X, α, β} = {1, μβ(X, β)T}T {R(β) − α}. The problem with this is that the first component of the estimating equation (13) would then be identically 0 and thus will not produce an estimate of the intercept. The reason for this is that the solution (11) was calculated non-parametrically under the assumption that R(βtrue) and X are independent in the population. Since Y − αtrue − μ(X, βtrue) and Y − μ(X, βtrue) are both independent of X in the population, this means that equation (11) cannot lead to an estimate of the intercept. Hence, an alternative approach is required.

To overcome this problem, we estimate the intercept of R(β) by using equation (11), i.e., if fX(·) were known, then αtrue could be estimated by

\[ \tilde\alpha(\beta,\Omega) = \frac{n^{-1}\sum_{i=1}^{n} R_i(\beta)\, p_{\rm est}\{R_i(\beta),\Omega\}}{n^{-1}\sum_{i=1}^{n} p_{\rm est}\{R_i(\beta),\Omega\}}, \qquad (14) \]

a quantity that is free of the π0 that shows up in equation (11). If we then replace the integral in the definition of pest(·) by its average n^{-1}Σ_{j=1}^{n} 𝒦̃{Ri(β), Xj, β, Ω, Dj}, we obtain exactly expression (8). Making this substitution in equation (13), we obtain equation (9). This completes the derivation of our methodology.

3.6. Distribution theory

The asymptotic distribution of our estimator is given in the following result. We refer to Appendix B.1 for the definition of the functions and matrices that are mentioned below, and for the assumptions 1–4 there under which this result is valid. The proof of this theorem is given in Appendix B.2.

Theorem 1. Let (β, Ω) = Θ, and let Θtrue denote its true value. Assume that assumptions 1–4 in Appendix B.1 are valid. Then there is an invertible matrix ℳβ and a function Λ(Y, X, D, Θtrue) with the properties that E{Λ(Y, X, D, Θtrue)|D} = 0 and

\[ n^{1/2}(\hat\beta - \beta_{\rm true}) = -n^{-1/2}\mathcal{M}_\beta^{-1}\sum_{i=1}^{n}\Lambda(Y_i,X_i,D_i,\Theta_{\rm true}) + o_p(1). \]

Therefore, there is a matrix Σ, defined in Appendix B.1, such that

\[ n^{1/2}(\hat\beta - \beta_{\rm true}) \to {\rm Normal}(0,\Sigma) \quad \text{in distribution}. \qquad (15) \]

Estimating the covariance matrix Σ in expression (15) can be accomplished by a plug-in method or by the bootstrap appropriate for case–control sampling (Wang et al., 1997; Buonaccorsi, 2010).

3.7. Inference via bootstrap resampling

In principle, estimating the covariance matrix Σ in expression (15) can be accomplished by a plug-in method, although the particular form of the function Q1(·) that is defined in Appendix B.1 makes the computation slow. We have thus chosen to use bootstrap ideas to estimate Σ. Below we explain in detail how this can be done, but the basic idea is that we have random samples from two independent populations, i.e. the cases and the controls, and an estimator that is asymptotically normally distributed.

3.7.1. Bootstrap procedure

Let (Y1*,X1*),,(Yn0*,Xn0*) be drawn randomly with replacement from {(Yi, Xi) : Di = 0}, and similarly let (Yn0+1*,Xn0+1*),,(Yn*,Xn*) be drawn randomly with replacement from {(Yi, Xi) : Di = 1}. This is the method of bootstrap sampling that was suggested by Wang et al. (1997) and Buonaccorsi (2010), page 225, and, since the data consist of samples from two independent populations, is the same as in Babu and Singh (1983); see also Lele (1991).

Let Di* = I(i > n0) and Ri*(β) = Yi* − μ(Xi*, β), and define Ω̂*, α̂*(β, Ω) and ℋ*n,est(β, Ω) in the same way as Ω̂, α̂(β, Ω) in equation (8) and ℋn,est(β, Ω) in equation (9), but based on (Yi*, Xi*, Di*) instead of (Yi, Xi, Di), i = 1, …, n.

The bootstrapped estimator β̂* of β is then defined as a solution of

\[ 0 = \mathcal{H}^*_{n,{\rm est}}(\beta,\hat\Omega^*) - \mathcal{H}_{n,{\rm est}}(\hat\beta,\hat\Omega) = \mathcal{H}^*_{n,{\rm est}}(\beta,\hat\Omega^*) \]

with respect to β. See also Hall and Horowitz (1996), page 897, and Chen et al. (2003), where bootstrapping is used and justified in similar contexts.
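The scheme lends itself to a short generic implementation; the sketch below is illustrative only, with `fit` standing for any routine that recomputes Ω̂* and β̂* from a resampled data set (for instance the robust_secondary_fit sketch of Section 3.2, preceded by the logistic regression and θ0 steps). It resamples controls and cases separately with replacement and returns the bootstrap covariance matrix.

```python
import numpy as np

def case_control_bootstrap(Y, X, D, fit, B=500, seed=0):
    """Stratified bootstrap for case-control data (Wang et al., 1997): controls and
    cases are resampled separately with replacement, the estimator is recomputed on
    each bootstrap sample, and the sample covariance of the replicates is returned.

    `fit` is a placeholder callable (Y, X, D) -> beta_hat for the full estimation
    procedure of Section 3; it is not part of the paper.
    """
    rng = np.random.default_rng(seed)
    controls = np.flatnonzero(D == 0)
    cases = np.flatnonzero(D == 1)
    replicates = []
    for _ in range(B):
        idx = np.concatenate([rng.choice(controls, size=controls.size, replace=True),
                              rng.choice(cases, size=cases.size, replace=True)])
        replicates.append(fit(Y[idx], X[idx], D[idx]))
    return np.cov(np.asarray(replicates), rowvar=False)
```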

3.7.2. Bootstrap consistency

To show the consistency of the above bootstrap procedure, we need to show that n^{1/2}(β̂* − β̂) converges to the same normal limit as the original centred estimator n^{1/2}(β̂ − βtrue). For this we use the same techniques as in the proof of theorem B in Chen et al. (2003), combined with the proof of theorem 1 in Appendix B.2. More precisely, it can be shown that, under certain regularity conditions, we have that

\[ n^{1/2}(\hat\beta^* - \hat\beta) = -\mathcal{M}_\beta^{-1} n^{-1/2}\sum_{i=1}^{n}\{\Lambda(Y_i^*,X_i^*,D_i^*,\Theta_{\rm true}) - \Lambda(Y_i,X_i,D_i,\Theta_{\rm true})\} + o_{p^*}(1), \]

where op*(1) has the same meaning as op(1), except that the probability is computed under the bootstrap distribution conditional on the original data (Yi, Xi, Di), i = 1, …, n. From this together with the central limit theorem and theorem 1 the result follows.

4. Extensions

4.1. Rare disease approximations

The method that was defined in Section 3 assumes that π1 = pr(D = 1) is known. This is typically not the case, so many researchers adopt rare disease approximations (see below for references), where the word ‘rare’ has no precise definition but is certainly 1% or less. There are at least two ways to proceed in our context. The first is to use the literature, to choose a nominal π1 ≤ 1% and to apply the method in Section 3. In results that are not reported here, this works well in the simulation setting of Section 5. In the literature, most researchers use a different approximation, which is described next and implemented in Section 5. We have not investigated in any detail which approach is preferable.

Let ‘≐’ denote ‘approximately equal’. The estimation procedure simplifies if the disease can be assumed to be rare, i.e. if

\[ \mathrm{pr}(D=1\mid Y,X) = \frac{\exp\{\theta_0 + m(Y,X,\theta_1)\}}{1 + \exp\{\theta_0 + m(Y,X,\theta_1)\}} \doteq \exp\{\theta_0 + m(Y,X,\theta_1)\}, \]

or, equivalently, if pr(D = 0|Y, X) = [1 + exp{θ0 + m(Y, X, θ1)}]−1 ≐ 1. This approximation allows us to replace 𝒦 in the estimating function (12) by

𝒦*{Ri(β),x,β,Ω*}=1+exp[κ+m{Ri(β)+μ(x,β),x,θ1}]. (16)

In addition, Ω = (κ, θ1, θ0) in 𝒦 is replaced by Ω* = (κ, θ1), which does not depend on θ0 any more, and assumption 4 is no longer required since θ0 is no longer estimated. The proof in Appendix A.2, where we show that the estimating function (12) is unbiased, adapts to the rare disease case in a straightforward way, now using the approximation

\[ f_{YX\mid D=d}(y,x) = \frac{\exp[d\{\theta_0+m(y,x,\theta_1)\}]\, f_{YX}(y,x)}{[1+\exp\{\theta_0+m(y,x,\theta_1)\}]\,\pi_d} \doteq \frac{\exp[d\{\theta_0+m(y,x,\theta_1)\}]\, f_{YX}(y,x)}{\pi_d}. \]

Hence the modified estimating function based on 𝒦* is approximately unbiased in the rare disease case.

As in the general case, the rare disease version of the estimating function (12) depends on unknown quantities which must be estimated. The estimation algorithm for the rare disease model is as follows and is explained below. Set

\[ \hat\alpha^*(\beta,\Omega^*) = \frac{n^{-1}\sum_{i=1}^{n} R_i(\beta)\,\bigl[n_0^{-1}\sum_{j=1}^{n}(1-D_j)\,\mathcal{K}^*\{R_i(\beta),X_j,\beta,\Omega^*\}\bigr]^{-1}}{n^{-1}\sum_{i=1}^{n}\bigl[n_0^{-1}\sum_{j=1}^{n}(1-D_j)\,\mathcal{K}^*\{R_i(\beta),X_j,\beta,\Omega^*\}\bigr]^{-1}}, \]
\[ \mathcal{H}^*_{n,{\rm est}}(\beta,\Omega^*) = n^{-1/2}\sum_{i=1}^{n}\biggl[L\{R_i(\beta),X_i,\hat\alpha^*(\beta,\Omega^*),\beta\} - \frac{n_0^{-1}\sum_{j=1}^{n}(1-D_j)\, L\{R_i(\beta),X_j,\hat\alpha^*(\beta,\Omega^*),\beta\}\,\mathcal{K}^*\{R_i(\beta),X_j,\beta,\Omega^*\}}{n_0^{-1}\sum_{j=1}^{n}(1-D_j)\,\mathcal{K}^*\{R_i(\beta),X_j,\beta,\Omega^*\}}\biggr]. \]

As before, estimate Ω* = (κ, θ1) by the logistic regression estimates of D on (Y, X); then solve ℋ*n,est(β, Ω̂*) = 0 with respect to β to obtain β̂.

The formulae for α̂* and ℋ*n,est do not contain an average 𝒦̃*, which could be introduced analogously to the general case where both formulae involve 𝒦̃, and which depends on π1 = P(D = 1). This is explained as follows: both the estimating function (12) and the estimator pest, which is used to estimate αtrue, depend on the unknown density fX. As already explained in Section 2 at equation (3), under the rare disease approximation, fX can be approximated by fcont, i.e. we can estimate fX empirically by using only the controls. This has the advantage that we do not need prior knowledge about the typically unknown disease rate π1. This is in contrast with the general model where we need to know π1 not only to be able to work with 𝒦̃, but also to obtain a consistent estimator of θ0.
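In code, under the same simplifying assumptions as the sketch in Section 3.2 (linear μ and m; all names illustrative), the rare disease version amounts to dropping the denominator of 𝒦 and averaging over controls only, so that neither θ0 nor π1 is needed:

```python
import numpy as np

def Kstar(R, X, beta, kappa_hat, theta1_hat):
    """K*{R_i(beta), X_j, beta, Omega*} of equation (16), assuming
    m(y, x, theta1) = theta_y*y + theta_x'x; returns an (n_i, n_j) matrix."""
    ty, tx = theta1_hat[0], theta1_hat[1:]
    m = ty * (R[:, None] + (X @ beta)[None, :]) + (X @ tx)[None, :]
    return 1.0 + np.exp(kappa_hat + m)

# In the score of the Section 3.2 sketch, replace the K-tilde averages over all j by
#   denom_i = mean over controls j of Kstar[i, j]
#   num_i   = mean over controls j of X_j * Kstar[i, j]
# and keep the structure of equations (8) and (9) otherwise unchanged.
```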

Because case–control studies are almost inevitably conducted for rare outcomes, the rare disease approximation is natural in most applications. It is also widely used, a very non-exhaustive list of which includes Piegorsch et al. (1994), Epstein and Satten (2003), Lin and Zeng (2006), Modan et al. (2001), Zhao et al. (2003), Kwee et al. (2007), Lin and Zeng (2009) and Hu et al. (2010).

4.2. Case–control studies with frequency matching

In frequency-matched case–control studies, a few strata are formed based on covariates such as age, and then a population-based case–control study is performed within each stratum. A straightforward approach is to include these matching variables as part of X, to form the estimating function (9) for each stratum and to form a new estimating function as the possibly weighted sum of the estimating functions across the strata. The weights might for example be based on estimates of the size of each stratum in the population. The resulting estimates of (αtrue, βtrue) will be asymptotically normally distributed.
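To fix ideas, if stratum s = 1, …, S contributes the estimating function ℋ(s)ns,est of equation (9), computed from its own ns cases and controls with stratum-specific Ω̂s, and ws denotes its weight (this notation is mine, for illustration only), the combined estimating equation just described is

\[ 0 = \sum_{s=1}^{S} w_s\,\mathcal{H}^{(s)}_{n_s,{\rm est}}(\beta,\hat\Omega_s), \]

with the ws chosen, for example, proportional to estimated stratum sizes in the population.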

5. Simulations

We performed simulation studies both at and away from the Gaussian model. Our simulations indicate that our proposed estimator has small bias and nearly nominal coverage probability in the cases that we examined, whereas an implementation of the parametric approach (see Section 2.3) may suffer from bias and lower coverage probability (Tables 1 and 2). We also show that our method often achieves significant gains in efficiency when compared with the estimator that uses only the controls. The approach that uses all the data but ignores the case–control sampling design suffers from bias and low coverage; see below.

Table 1.

Results of the simulation study with n1 = 500 cases and n0 = 500 controls, and a disease rate of approximately 1%

Results for normal model Results for gamma model


Controls SPMLE Robust All Controls SPMLE Robust All
θy = 0.00
Mean 0.992 0.991 1.001 0.992 1.002 1.005 1.003 1.003
sd 0.148 0.107 0.119 0.105 0.156 0.111 0.120 0.111
Est. sd 0.154 0.110 0.121 0.109 0.154 0.110 0.121 0.109
90% 0.917 0.911 0.918 0.912 0.892 0.897 0.899 0.901
95% 0.956 0.955 0.965 0.955 0.944 0.943 0.944 0.941
MSE Eff 1.898 1.537 1.965 1.963 1.665 1.957
θy = 0.25
Mean 0.999 1.001 0.990 1.078 1.001 0.997 0.993 1.120
sd 0.154 0.110 0.117 0.109 0.155 0.144 0.120 0.144
Est. sd 0.154 0.111 0.119 0.110 0.153 0.149 0.123 0.148
90% 0.911 0.905 0.908 0.818 0.900 0.924 0.901 0.797
95% 0.955 0.954 0.958 0.889 0.945 0.961 0.947 0.881
MSE Eff 1.951 1.720 1.303 1.148 1.643 0.680
θy = 0.50
Mean 0.995 0.994 0.989 1.177 0.986 0.848 1.024 1.297
sd 0.154 0.114 0.117 0.114 0.144 0.205 0.147 0.208
Est. sd 0.154 0.113 0.120 0.113 0.148 0.208 0.149 0.215
90% 0.903 0.898 0.904 0.525 0.906 0.818 0.905 0.587
95% 0.957 0.947 0.948 0.641 0.953 0.884 0.957 0.719
MSE Eff 1.822 1.704 0.531 0.323 0.938 0.159

‘Normal’ means that ε ~ N(0, 1), and ‘gamma’ means that ε is a centred and scaled gamma random variable with shape parameter 0.4. The analyses performed are ‘controls’ (using only controls), the semiparametric efficient method that assumes normality (‘SPMLE’), our new estimator (‘robust’), and ‘all’, which is the method that uses all the data while ignoring the case–control study. Over 1000 simulations, we computed the mean estimated β (‘mean’), its standard deviation (‘sd’), the mean estimated standard deviation (‘Est. sd’), the coverage for a nominal 90% confidence interval (‘90%’), the coverage for a nominal 95% confidence interval (‘95%’) and the mean-squared error efficiency (‘MSE Eff’) compared with using only the controls.

Table 2.

Results of the simulation study described in Table 1, now with n1 = 150 cases and n0 = 150 controls

Results for normal model Results for gamma model


Controls SPMLE Robust All Controls SPMLE Robust All
θy = 0.00
Mean 0.991 0.993 1.005 0.992 0.998 0.991 1.019 0.990
sd 0.287 0.204 0.233 0.200 0.292 0.201 0.236 0.199
Est. sd 0.282 0.202 0.230 0.200 0.281 0.201 0.230 0.199
90% 0.891 0.908 0.910 0.905 0.892 0.900 0.916 0.902
95% 0.942 0.951 0.965 0.952 0.948 0.950 0.959 0.950
MSE Eff 1.973 1.509 2.043 2.103 1.526 2.151
θy = 0.25
Mean 1.008 1.016 0.983 1.092 1.007 0.994 0.974 1.118
sd 0.301 0.204 0.220 0.202 0.280 0.268 0.223 0.267
Est. sd 0.283 0.204 0.227 0.202 0.273 0.269 0.232 0.268
90% 0.874 0.893 0.933 0.867 0.903 0.900 0.928 0.864
95% 0.933 0.950 0.968 0.930 0.943 0.947 0.968 0.928
MSE Eff 2.156 1.856 1.834 1.088 1.551 0.921
θy = 0.50
Mean 0.986 0.987 0.974 1.173 0.985 0.837 1.006 1.292
sd 0.283 0.199 0.222 0.200 0.265 0.393 0.295 0.400
Est. sd 0.282 0.206 0.235 0.207 0.266 0.381 0.311 0.393
90% 0.903 0.918 0.936 0.798 0.900 0.864 0.938 0.808
95% 0.948 0.958 0.973 0.871 0.943 0.923 0.969 0.888
MSE Eff 2.003 1.597 1.143 0.388 0.806 0.287

The disease rate is approximately 1%.

We generated X from a uniform distribution on (0, 1). The logistic regression model is pr(D = 1|Y, X) = H(θ0 + θyY + θxX), with θ0 = −5.5, θy = 0.00, 0.25, 0.50 and θx = 1. The model for Y given X is a linear regression model, Y = αtrue + βtrueX + ε, with αtrue = 0 and βtrue = 1. We considered two distributions for ε: the standard normal distribution, for which the parametric approach attains the semiparametric efficiency bound, and, for comparison, a standardized gamma distribution with shape parameter 0.4. By equation (2), for θy = 0.00, 0.25, 0.50 the rates of disease are approximately 0.007, 0.008 and 0.010. In the first scenario the case–control study has n1 = 500 cases and n0 = 500 controls. In the second scenario we chose n0 = n1 = 150. We generated 1000 simulated data sets in each setting.
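For concreteness, the following sketch (illustrative code, not the authors'; the batch-sampling mechanism used to realize the case–control design is an assumption) generates one simulated data set under this design.

```python
import numpy as np

def simulate_case_control(n1, n0, theta_y, error="normal", seed=0):
    """One case-control data set: X ~ U(0,1), Y = X + eps (alpha=0, beta=1),
    pr(D=1|Y,X) = H(-5.5 + theta_y*Y + X); sample until n1 cases and n0 controls."""
    rng = np.random.default_rng(seed)
    Ys, Xs, Ds = [], [], []
    need = {0: n0, 1: n1}
    while need[0] > 0 or need[1] > 0:
        X = rng.uniform(0.0, 1.0, 100_000)
        if error == "normal":
            eps = rng.normal(size=X.size)
        else:                                  # centred and scaled gamma, shape 0.4
            eps = (rng.gamma(shape=0.4, scale=1.0, size=X.size) - 0.4) / np.sqrt(0.4)
        Y = X + eps
        D = rng.binomial(1, 1.0 / (1.0 + np.exp(-(-5.5 + theta_y * Y + X))))
        for d in (0, 1):
            take = np.flatnonzero(D == d)[:need[d]]
            Ys.append(Y[take]); Xs.append(X[take]); Ds.append(np.full(take.size, d))
            need[d] -= take.size
    return np.concatenate(Ys), np.concatenate(Xs), np.concatenate(Ds)
```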

We contrasted four methods. The first uses ordinary linear regression based only on the controls. The second method uses the same approach but is expected to be significantly biased since it is based on the entire data set. The third method is the parametric (‘semiparametric efficient’) method that assumes normal errors, with standard errors obtained by inverting the Hessian of the log-likelihood. The fourth method is our proposed method, with standard errors estimated by using asymptotic formulae. The third and the fourth method were computed by making the rare disease approximation.

The case θy = 0.00 is interesting, because here Y is independent of D given X. Hence all methods should achieve nominal coverage probabilities for estimating βtrue, which is indeed seen in Table 1. Since, with θy = 0.00, all methods are asymptotically valid, the only possibility of seeing a bias is when θy is sufficiently ‘large’. For this reason, we experimented with the cases θy = 0.25 and θy = 0.50. Consider θy = 0.25 first. Here the approach that uses all the data yields a biased estimator of βtrue = 1, with low coverage probabilities. The ‘semiparametric efficient’ method that assumes normality still maintains its nominal coverage probabilities. As expected, since it is efficient if the errors are normal, it indeed outperforms the other approaches in this case. For example, for any two methods, say A and B, with estimates β̂A and β̂B, the mean-squared error efficiency of method A with respect to method B is E{(β̂B − βtrue)2}/E{(β̂A − βtrue)2}, and its estimated version is computed by replacing expectations by averages across the simulations. The semiparametric efficient method has 13% greater mean-squared error efficiency than our method in the normal case. However, in the gamma case, our method has 43% greater mean-squared error efficiency. It also outperforms the approach that uses only the controls, for both normal and gamma errors: in both cases the mean-squared error efficiency is roughly 70% larger.
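In symbols, writing β̂A(s) and β̂B(s) for the estimates from methods A and B in simulated data set s = 1, …, 1000 and recalling that βtrue = 1, the tabulated quantity is

\[ \widehat{\rm MSE\ Eff}(A\ {\rm versus}\ B) = \frac{\sum_{s=1}^{1000}\{\hat\beta_B^{(s)} - 1\}^2}{\sum_{s=1}^{1000}\{\hat\beta_A^{(s)} - 1\}^2}, \]

so values greater than 1 favour method A; in Tables 1 and 2 the reference method B is the controls-only analysis.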

Finally, in the case θy = 0.50 with normal regression errors, the semiparametric efficient method that assumes normality maintains its nominal coverage probabilities and has 7% greater mean-squared error efficiency than our method and 82% greater efficiency than using only controls. However, when the errors have a gamma distribution, it suffers from bias, increased variance and loss of coverage, with nominal 90% and 95% coverage actually being 81.8% and 88.4% respectively. Our method retains nominal coverage. The controls-only analysis and our method have roughly equal mean-squared error efficiency which is, in particular, much greater than the mean-squared error efficiency of the semiparametric efficient approach for regression models with normal errors.

6. Empirical example

In this section, we illustrate the methodology in a case–control study of prostate cancer, which was originally designed to investigate the association of prostate cancer risk with vitamin D biomarkers and genetic variation in vitamin D metabolism pathways (Ahn et al., 2009). The goal of the current analysis, which includes 749 prostate cancer cases and 781 controls, is to examine whether the genetic variations in the vitamin D receptor influence [25(OH)D], which is a serum level biomarker of vitamin D. In the notation of this paper, D is the prostate cancer case–control status and Y is the level of [25(OH)D]. We investigated three SNPs, rs2238136, rs2254210 and rs2239186, each of which represents an ordinal categorical variable coded as 0, 1 or 2 depending on how many copies of the variant allele a subject carries. In our analysis, X consists of three dummy variables for age groups, along with one of the genetic markers.

The results are given in Table 3. We see in Table 3 that none of the coefficients for the SNPs are statistically significant. Thus, neither the traditional controls-only analysis nor the proposed method detected any association between the vitamin D receptor gene and the [25(OH)D] level. These results are consistent with Chen et al. (2009), who noted that, given the downstream role of the vitamin D receptor gene in the vitamin D pathway, it is unlikely that vitamin D receptor polymorphisms could actually influence the level of [25(OH)D]. In spite of the lack of association, it is interesting to observe that the 95% confidence intervals from our method are much shorter than those based on the control data only. In terms of mean-squared error efficiency, here estimated as the square of the ratio of the lengths of the confidence intervals, the results for the three SNPs suggest gains in efficiency of 68%, 136% and 125% compared with using only the controls; a worked example of this calculation for SNP 1 is given after Table 3.

Table 3.

Results of the vitamin D receptor data example in Section 6

X        Results for our method                  Results for controls only               Efficiency
         Estimate   Lower limit   Upper limit    Estimate   Lower limit   Upper limit
SNP 1    0.015      −0.165        0.195          −0.029     −0.262        0.204          1.68
SNP 2    0.023      −0.047        0.093          0.039      −0.069        0.146          2.36
SNP 3    0.015      −0.062        0.092          −0.045     −0.161        0.070          2.25

Three analyses are displayed, one each when X is SNP 1, SNP 2 and SNP 3. Displayed are the parameter estimates of the slope for X (‘estimate’), and lower (‘lower’) and upper (‘upper’) 95% confidence intervals. Our method is contrasted with using linear regression among the controls only. Also displayed is the ‘efficiency’, which is defined as the square of the ratio of the lengths of the confidence intervals.
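To illustrate the efficiency column, take the SNP 1 row of Table 3: the controls-only interval has length 0.204 − (−0.262) = 0.466 and the interval from our method has length 0.195 − (−0.165) = 0.360, so

\[ \Bigl(\frac{0.466}{0.360}\Bigr)^2 \approx 1.68, \]

which is the tabulated efficiency.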

7. Discussion

If the disease probability pr(D = 1) is known, there are simpler methods for our particular setting that allow estimation of βtrue, based on weighting via equation (3). However, in the common case that pr(D = 1) is not known, the development in Section 3 leads to two natural rare disease approximations that use all the data and not just the data on the controls; see Section 4.1. It would be interesting to investigate which of these two approximate approaches is preferable in general.

Our simulation results are specific to rare diseases, by which we mean that pr(D = 1) ≤ 1%. Biases will arise as the disease probability increases. In addition, since rare disease approximations do not lead to fully consistent estimation, coverage probability in large samples will suffer, since the bias is fixed whereas the variance decreases with sample size. Finally, the methods are likely to suffer when the X-distribution has relatively rare values that are not within the centre of the support of X.

Acknowledgements

This paper represents part of the first author’s doctoral dissertation at Texas A&M University. Wei and Carroll’s research was supported by a grant from the National Cancer Institute (R37-CA057030). Carroll was also supported by award KUS-CI-016-04, made by King Abdullah University of Science and Technology. Chatterjee’s research was supported by a gene–environment initiative grant from the National Heart, Lung and Blood Institute (RO1-HL091172-01) and by the Intramural Research Program of the National Cancer Institute. Müller was supported by a National Science Foundation grant (DMS-0907014). Van Keilegom gratefully acknowledges financial support from Interuniversity Attraction Pole research network P6/03 of the Belgian Government (Belgian science policy), and from the European Research Council under the European Community’s seventh framework programme (FP7/2007-2013), European Research Council grant agreement 203650.

Appendix A

Some derivations

A.1. Derivation of the error density estimator (11)

The key idea of the approach is to introduce discrete probabilities γi = pr{R(β) = Ri(β)}, i = 1, …, n, which yields

\[ \mathrm{pr}(D=d) = \sum_{i=1}^{n} \mathrm{pr}\{D=d\mid R(\beta)=R_i(\beta)\}\,\gamma_i, \]

and to work with the maximum likelihood estimates, i.e. with those γi that maximize the retrospective log-likelihood

\[ \sum_{i=1}^{n}\log[\mathrm{pr}\{R(\beta)=R_i(\beta)\mid D=D_i\}] = \sum_{i=1}^{n}\log\biggl[\frac{\mathrm{pr}\{R(\beta)=R_i(\beta)\}\,\mathrm{pr}\{D=D_i\mid R(\beta)=R_i(\beta)\}}{\mathrm{pr}(D=D_i)}\biggr] = \sum_{i=1}^{n}\log\Bigl[\sum_{k=1}^{n}\gamma_k\, 1\{R_i(\beta)=R_k(\beta)\}\Bigr] + \sum_{i=1}^{n}\log\biggl[\frac{\mathrm{pr}\{D=D_i\mid R(\beta)=R_i(\beta)\}}{\sum_{k=1}^{n}\mathrm{pr}\{D=D_i\mid R(\beta)=R_k(\beta)\}\,\gamma_k}\biggr]. \]

Taking the derivative with respect to γk, k = 1, …, n, gives

\[ \sum_{i=1}^{n}\frac{I\{R_i(\beta)=R_k(\beta)\}}{\gamma_k} - \sum_{i=1}^{n}\frac{\mathrm{pr}\{D=D_i\mid R(\beta)=R_k(\beta)\}}{\sum_{k'=1}^{n}\mathrm{pr}\{D=D_i\mid R(\beta)=R_{k'}(\beta)\}\,\gamma_{k'}} = \gamma_k^{-1} - \sum_{i=1}^{n}\frac{\mathrm{pr}\{D=D_i\mid R(\beta)=R_k(\beta)\}}{\mathrm{pr}(D=D_i)} = \gamma_k^{-1} - \sum_{d=0}^{1}\mathrm{pr}\{D=d\mid R(\beta)=R_k(\beta)\}\,\frac{n_d}{\pi_d}. \]

Now set this equal to 0 to obtain

\[ \gamma_k = \Bigl[\sum_{d=0}^{1}\mathrm{pr}\{D=d\mid R(\beta)=R_k(\beta)\}\,\frac{n_d}{\pi_d}\Bigr]^{-1} = \Bigl[\sum_{d=0}^{1}\frac{n_d}{\pi_d}\int \mathrm{pr}\{D=d\mid R(\beta)=R_k(\beta),X=x\}\, f_X(x)\, dx\Bigr]^{-1}. \]

By definition of 𝒦, using that

\[ \frac{n_0}{\pi_0} + \frac{n_1}{\pi_1}\exp\{\theta_0+m(y,x,\theta_1)\} = \frac{n_0}{\pi_0}\,[1+\exp\{\kappa+m(y,x,\theta_1)\}], \]

this is the desired formula (11).

A.2. Unbiasedness of estimation function (12)

All calculations of expectations here will be based on the precise definition of expectations in a case–control sampling scheme; see equation (4). Let (βtrue, Ωtrue) be the true parameter, β an arbitrary value and τ(x, β, βtrue) = μ(x, βtrue) − μ(x, β). To derive the conditional density given the disease state we use the fact that we assume a logistic model, pr(D = 1|Y, X) = H{θ0 + m(Y, X, θ1)}, with H(x) the logistic distribution function, for which

\[ H\{\theta_0+m(Y,X,\theta_1)\} = [1 - H\{\theta_0+m(Y,X,\theta_1)\}]\exp\{\theta_0+m(Y,X,\theta_1)\}. \]

Now write fYX(·) as the joint density function of (Y, X) in the population. Then, with θ0 and θ1 denoting the true parameters,

\[ \pi_d = \mathrm{pr}(D=d) = \int\!\!\int H\{\theta_0+m(y,x,\theta_1)\}^{d}\,[1-H\{\theta_0+m(y,x,\theta_1)\}]^{1-d}\, f_{YX}(y,x)\, dy\, dx = \int\!\!\int [1-H\{\theta_0+m(y,x,\theta_1)\}]\exp[d\{\theta_0+m(y,x,\theta_1)\}]\, f_{YX}(y,x)\, dy\, dx. \]

It then follows that the density of (Y, X) given D is

\[ f_{YX\mid D=d}(y,x) = \frac{\exp[d\{\theta_0+m(y,x,\theta_1)\}]\, f_{YX}(y,x)}{[1+\exp\{\theta_0+m(y,x,\theta_1)\}]\,\pi_d}. \]

Recall that κ = θ0 + log(n1/n0) − log(π1/π0). Then equation (4) can now be computed as

\[ n^{-1}\sum_{i=1}^{n} E[G\{R_i(\beta),X_i\}\mid D_i] = \sum_{d=0}^{1}\frac{n_d}{n\pi_d}\int\!\!\int G\{y-\mu(x,\beta),x\}\,\frac{\exp[d\{\theta_0+m(y,x,\theta_1)\}]}{1+\exp\{\theta_0+m(y,x,\theta_1)\}}\, f_{YX}(y,x)\, dy\, dx = \frac{n_0}{n\pi_0}\sum_{d=0}^{1}\int\!\!\int G\{y-\mu(x,\beta),x\}\,\frac{n_d/n_0}{\pi_d/\pi_0}\,\frac{\exp[d\{\theta_0+m(y,x,\theta_1)\}]}{1+\exp\{\theta_0+m(y,x,\theta_1)\}}\, f_{YX}(y,x)\, dy\, dx = \frac{n_0}{n\pi_0}\int\!\!\int G(r,x)\,\frac{1+\exp[\kappa+m\{r+\mu(x,\beta),x,\theta_1\}]}{1+\exp[\theta_0+m\{r+\mu(x,\beta),x,\theta_1\}]}\, f_{YX}\{r+\mu(x,\beta),x\}\, dr\, dx. \]

The joint density of (Y, X) in the population is fYX(y, x) = fε{y − αtrue − μ(x, βtrue)} fX(x). Hence, fYX{r + μ(x, β), x} = fε{r − αtrue − τ(x, β, βtrue)} fX(x). Thus,

\[ n^{-1}\sum_{i=1}^{n} E[G\{R_i(\beta),X_i\}\mid D_i] = \frac{n_0}{n\pi_0}\int\!\!\int G(r,x)\,\frac{1+\exp[\kappa+m\{r+\mu(x,\beta_{\rm true})-\tau(x,\beta,\beta_{\rm true}),x,\theta_1\}]}{1+\exp[\theta_0+m\{r+\mu(x,\beta_{\rm true})-\tau(x,\beta,\beta_{\rm true}),x,\theta_1\}]}\, f_\varepsilon\{r-\alpha_{\rm true}-\tau(x,\beta,\beta_{\rm true})\}\, f_X(x)\, dr\, dx = \frac{n_0}{n\pi_0}\int\!\!\int G\{r+\tau(x,\beta,\beta_{\rm true}),x\}\,\frac{1+\exp[\kappa+m\{r+\mu(x,\beta_{\rm true}),x,\theta_1\}]}{1+\exp[\theta_0+m\{r+\mu(x,\beta_{\rm true}),x,\theta_1\}]}\, f_\varepsilon(r-\alpha_{\rm true})\, f_X(x)\, dr\, dx. \]

Now, since

\[ \mathcal{K}(r,x,\beta_{\rm true},\Omega_{\rm true}) = \bigl(1+\exp[\kappa+m\{r+\mu(x,\beta_{\rm true}),x,\theta_1\}]\bigr)\bigl(1+\exp[\theta_0+m\{r+\mu(x,\beta_{\rm true}),x,\theta_1\}]\bigr)^{-1}, \]

we have that

\[ n^{-1}\sum_{i=1}^{n} E[G\{R_i(\beta),X_i\}\mid D_i] = \frac{n_0}{n\pi_0}\int\!\!\int f_\varepsilon(r-\alpha_{\rm true})\, f_X(x)\,\mathcal{K}(r,x,\beta_{\rm true},\Omega_{\rm true})\, G\{r+\tau(x,\beta,\beta_{\rm true}),x\}\, dr\, dx. \qquad (17) \]

It follows from the convention in equation (4) and equation (17) that

\[ \frac{n\pi_0}{n_0}\, E\{Q_n(\alpha_{\rm true},\beta,\Omega_{\rm true})\} = \frac{n\pi_0}{n_0}\, E\{Q_n(\alpha_{\rm true},\beta,\Omega_{\rm true})\mid D_1,\dots,D_n\} = n^{1/2}\int\!\!\int f_\varepsilon(r-\alpha_{\rm true})\, f_X(x)\,\mathcal{K}(r,x,\beta_{\rm true},\Omega_{\rm true})\,\biggl[L\{r+\tau(x,\beta,\beta_{\rm true}),x,\alpha(\beta,\Omega_{\rm true}),\beta\} - \frac{\int L\{r+\tau(x,\beta,\beta_{\rm true}),\upsilon,\alpha(\beta,\Omega_{\rm true}),\beta\}\,\mathcal{K}\{r+\tau(x,\beta,\beta_{\rm true}),\upsilon,\beta,\Omega_{\rm true}\}\, f_X(\upsilon)\, d\upsilon}{\int \mathcal{K}\{r+\tau(x,\beta,\beta_{\rm true}),s,\beta,\Omega_{\rm true}\}\, f_X(s)\, ds}\biggr]\, dx\, dr. \]

If β = βtrue, since τ(x, βtrue, βtrue) = 0, it follows directly that the last term is 0, and therefore 0 = E{Qn(αtrue, βtrue, Ωtrue)|D1, …, Dn}. Hence Qn(αtrue, β, Ωtrue) = 0 is an unbiased estimating equation. If β ≠ βtrue, then in general we shall have 0 ≠ E{Qn(αtrue, β, Ωtrue)|D1, …, Dn}.

Appendix B

Asymptotic theory

B.1. Notation and assumptions

In this section we introduce notation that is needed for our main theorem in Section 3.6, and we also state the formal assumptions under which this result will be valid.

Let (β, Ω) = Θ, and let Θtrue denote its true value. Recall equation (4), and define

\[ c_* = \lim_{n\to\infty}(n_0/n), \]
\[ \alpha(\beta,\Omega) = \frac{\sum_{d=0}^{1}(n_d/n)\, E\bigl(R(\beta)\,\bigl[\int f_X(x)\,\mathcal{K}\{R(\beta),x,\beta,\Omega\}\, dx\bigr]^{-1}\bigm| D=d\bigr)}{\sum_{d=0}^{1}(n_d/n)\, E\bigl(\bigl[\int f_X(x)\,\mathcal{K}\{R(\beta),x,\beta,\Omega\}\, dx\bigr]^{-1}\bigm| D=d\bigr)}, \]
\[ \mathcal{T}\{R(\beta),X,\Theta,f_X\} = L\{R(\beta),X,\alpha(\beta,\Omega),\beta\} - \frac{\int L\{R(\beta),x,\alpha(\beta,\Omega),\beta\}\,\mathcal{K}\{R(\beta),x,\Theta\}\, f_X(x)\, dx}{\int \mathcal{K}\{R(\beta),x,\Theta\}\, f_X(x)\, dx}, \]
\[ \mathcal{M}_\Omega = \sum_{d=0}^{1} c_*^{1-d}(1-c_*)^{d}\, E\Bigl[\frac{\partial\mathcal{T}\{R(\beta_{\rm true}),X,\Theta,f_X\}}{\partial\Omega^{T}}\Bigm| D=d\Bigr]\Bigm|_{\Theta=\Theta_{\rm true}}, \]
\[ \mathcal{M}_\beta = \sum_{d=0}^{1} c_*^{1-d}(1-c_*)^{d}\, E\Bigl[\frac{\partial\mathcal{T}\{R(\beta),X,\Theta_{\rm true},f_X\}}{\partial\beta^{T}}\Bigm| D=d\Bigr]\Bigm|_{\beta=\beta_{\rm true}}. \]

Define

\[ G_{\rm num}(r,x,d,\Theta) = L\{r,x,\alpha(\beta,\Omega),\beta\}\,\tilde{\mathcal{K}}(r,x,d,\Theta), \]
\[ G_{\rm den}(r,x,d,\Theta) = \tilde{\mathcal{K}}(r,x,d,\Theta), \]
\[ \mathcal{A}_{\rm num}(r,\Theta) = \sum_{d=0}^{1}(n_d/n)\, E\{G_{\rm num}(r,X,D,\Theta)\mid D=d\}, \]
\[ \mathcal{A}_{\rm den}(r,\Theta) = \sum_{d=0}^{1}(n_d/n)\, E\{G_{\rm den}(r,X,D,\Theta)\mid D=d\}. \]

Write

\[ \mathcal{H}_n(\beta,\Theta) = n^{-1/2}\sum_{i=1}^{n}\biggl[\frac{n^{-1}\sum_{j=1}^{n} G_{\rm num}\{R_i(\beta),X_j,D_j,\Theta\}}{n^{-1}\sum_{j=1}^{n} G_{\rm den}\{R_i(\beta),X_j,D_j,\Theta\}} - \frac{\mathcal{A}_{\rm num}\{R_i(\beta),\Theta\}}{\mathcal{A}_{\rm den}\{R_i(\beta),\Theta\}}\biggr] \]

and

\[ W\{R_i(\beta),X_j,D_j,\Theta\} = \frac{G_{\rm num}\{R_i(\beta),X_j,D_j,\Theta\} - \mathcal{A}_{\rm num}\{R_i(\beta),\Theta\}}{\mathcal{A}_{\rm den}\{R_i(\beta),\Theta\}} - \frac{\mathcal{A}_{\rm num}\{R_i(\beta),\Theta\}\,\bigl[G_{\rm den}\{R_i(\beta),X_j,D_j,\Theta\} - \mathcal{A}_{\rm den}\{R_i(\beta),\Theta\}\bigr]}{\mathcal{A}^2_{\rm den}\{R_i(\beta),\Theta\}}. \]

Also define

\[ \mathcal{R}_i(\beta) = \{R_i(\beta),X_i,D_i\}, \qquad \mathcal{R} = (r,x,d), \]
\[ Q_1\{\mathcal{R}_i(\beta),\mathcal{R}_j(\beta),\Theta\} = W\{R_i(\beta),X_j,D_j,\Theta\} + W\{R_j(\beta),X_i,D_i,\Theta\}, \]
\[ Q_{2j}(\mathcal{R},\beta,\Theta) = E[W\{R(\beta),x,d,\Theta\}\mid D=j], \]
\[ h_{1j}(\mathcal{R},\beta,\Theta) = E[Q_1\{\mathcal{R},\mathcal{R}(\beta),\Theta\}\mid D=j] \quad (j=0,1), \]
\[ h_2\{R_i(\beta),X_i,D_i,\Theta\} = \frac{n_0}{n}(1-D_i)\, h_{10}\{\mathcal{R}_i(\beta),\beta,\Theta\} + \frac{n_1}{n} D_i\, h_{11}\{\mathcal{R}_i(\beta),\beta,\Theta\} + \frac{n_0}{n} D_i\, Q_{20}\{\mathcal{R}_i(\beta),\beta,\Theta\} + \frac{n_1}{n}(1-D_i)\, Q_{21}\{\mathcal{R}_i(\beta),\beta,\Theta\}, \]
\[ m_{\theta_1}(y,x,\theta_1) = \frac{\partial m(y,x,\theta_1)}{\partial\theta_1}, \]
\[ \Phi(y,x,d,\Omega) = \{1, m_{\theta_1}(y,x,\theta_1)\}^{T}\,[d - H\{\kappa+m(y,x,\theta_1)\}], \]
\[ \mathcal{N}_\Omega = \sum_{d=0}^{1} c_*^{1-d}(1-c_*)^{d}\,\bigl[E\{\partial\Phi(Y,X,D,\Omega)/\partial\Omega\mid D=d\}\bigm|_{\Omega=\Omega_{\rm true}}\bigr]^{-1}, \]
\[ \Lambda(Y_i,X_i,D_i,\Theta_{\rm true}) = \mathcal{M}_\Omega\,\{\mathcal{N}_\Omega\Phi(Y_i,X_i,D_i,\Omega_{\rm true}),\Psi(Y_i,X_i,D_i,\Omega_{\rm true})\}^{T} - h_2\{R_i(\beta_{\rm true}),X_i,D_i,\Theta_{\rm true}\} + \mathcal{T}\{R_i(\beta_{\rm true}),X_i,\Theta_{\rm true},f_X\}, \]

where the function Ψ(Yi, Xi, Di, Ωtrue) is defined in assumption 4 below. Finally, let

\[ \Sigma = \sum_{d=0}^{1} c_*^{1-d}(1-c_*)^{d}\,\mathcal{M}_\beta^{-1}\,\mathrm{cov}\{\Lambda(Y,X,D,\Theta_{\rm true})\mid D=d\}\,(\mathcal{M}_\beta^{-1})^{T}. \]

Next, introduce the following assumptions, under which the main result in Section 3.6 is valid.

Assumption 1. The error ε is independent of X. The error distribution Fε is twice continuously differentiable, and the distribution FX of X is once continuously differentiable. The corresponding densities are denoted by fε and fX.

Assumption 2. There exists some 0 < c* < 1 such that n0/n → c*.

Assumption 3. The function μ(x, β) is three times continuously differentiable with respect to β, m(y, x, θ1) is twice continuously differentiable with respect to y and θ1, and Φ(y, x, d, Ω) is continuously differentiable with respect to Ω. Also, the matrices ℳβ and E{∂Φ(Y, X, D, Ω)/∂Ω|D = d}|Ω=Ωtrue are invertible.

Assumption 4. The estimator θ̂0 satisfies

\[ \hat\theta_0 - \theta_{0,{\rm true}} = n^{-1}\sum_{i=1}^{n}\Psi(Y_i,X_i,D_i,\Omega_{\rm true}) + o_p(n^{-1/2}), \]

for some function Ψ that satisfies E{Ψ(Y, X, D, Ωtrue)|D} = 0.

B.2. Proofs

We are now ready to give the proof of our main asymptotic result. Before giving a formal proof, let us first highlight the main steps of the proof. First, it follows from Appendix A.2 that ℋn(α, β, Ω) is an unbiased estimating function. Plugging in an estimator of αtrue, we use a Taylor expansion of ℋn,est(β̂, Ω̂) = 0 around the true β and Ω, which gives a regular asymptotically linear expansion of n^{1/2}(β̂ − βtrue). Finally we apply the central limit theorem to obtain the required asymptotic normality result. Along the way, we must show an asymptotic expansion for ℋn(β, Θ), which is given in lemma 1. The notation in the statement of this lemma was introduced in the previous section.

Lemma 1. Assume that assumptions 1–3 are valid. Then, for each β and Θ,

\[ \mathcal{H}_n(\beta,\Theta) = n^{-1/2}\sum_{i=1}^{n} h_2\{R_i(\beta),X_i,D_i,\Theta\} + o_p(1), \]

where E[h2{R(β), X, D, Θ}|D] = 0.

Proof. Define

\[ Z_{\rm num}\{R(\beta),\Theta\} = n^{-1/2}\sum_{j=1}^{n}\bigl[G_{\rm num}\{R(\beta),X_j,D_j,\Theta\} - \mathcal{A}_{\rm num}\{R(\beta),\Theta\}\bigr], \]
\[ Z_{\rm den}\{R(\beta),\Theta\} = n^{-1/2}\sum_{j=1}^{n}\bigl[G_{\rm den}\{R(\beta),X_j,D_j,\Theta\} - \mathcal{A}_{\rm den}\{R(\beta),\Theta\}\bigr]. \]

Since by assumption 2 we have that n1/n0 → c for some 0 < c < ∞, it follows that Znum{R(β), Θ} = Op(1) and Zden{R(β), Θ} = Op(1), for each β and Θ. Hence, by a Taylor series expansion and assumption 3,

\[ \frac{n^{-1}\sum_{j=1}^{n} G_{\rm num}\{R(\beta),X_j,D_j,\Theta\}}{n^{-1}\sum_{j=1}^{n} G_{\rm den}\{R(\beta),X_j,D_j,\Theta\}} = \frac{\mathcal{A}_{\rm num}\{R(\beta),\Theta\} + n^{-1/2}Z_{\rm num}\{R(\beta),\Theta\}}{\mathcal{A}_{\rm den}\{R(\beta),\Theta\} + n^{-1/2}Z_{\rm den}\{R(\beta),\Theta\}} = \frac{\mathcal{A}_{\rm num}\{R(\beta),\Theta\}}{\mathcal{A}_{\rm den}\{R(\beta),\Theta\}} + \frac{n^{-1/2}Z_{\rm num}\{R(\beta),\Theta\}}{\mathcal{A}_{\rm den}\{R(\beta),\Theta\}} - \frac{\mathcal{A}_{\rm num}\{R(\beta),\Theta\}}{\mathcal{A}^2_{\rm den}\{R(\beta),\Theta\}}\, n^{-1/2}Z_{\rm den}\{R(\beta),\Theta\} + o_p(n^{-1/2}). \]

Thus,

\[ \mathcal{H}_n(\beta,\Theta) = n^{-3/2}\biggl(\sum_{i=1}^{n}\sum_{j=1}^{n}\frac{G_{\rm num}\{R_i(\beta),X_j,D_j,\Theta\} - \mathcal{A}_{\rm num}\{R_i(\beta),\Theta\}}{\mathcal{A}_{\rm den}\{R_i(\beta),\Theta\}} - \sum_{i=1}^{n}\sum_{j=1}^{n}\frac{\mathcal{A}_{\rm num}\{R_i(\beta),\Theta\}}{\mathcal{A}^2_{\rm den}\{R_i(\beta),\Theta\}}\bigl[G_{\rm den}\{R_i(\beta),X_j,D_j,\Theta\} - \mathcal{A}_{\rm den}\{R_i(\beta),\Theta\}\bigr]\biggr) + o_p(1) = \mathcal{B}_n(\beta,\Theta) + o_p(1). \]

By definition, E{ℬn(β, Θ)|D1, …, Dn} = 0. By the definition of W{Ri(β), Xj, Dj, Θ},

\[ \mathcal{B}_n(\beta,\Theta) = n^{-3/2}\sum_{i=1}^{n}\sum_{j=1}^{n} W\{R_i(\beta),X_j,D_j,\Theta\}. \]

Without loss of generality, we can make the first n0 observations be the controls, and the last nn0 observations be the cases. Then,

\[ \mathcal{B}_n(\beta,\Theta) = n^{-3/2}\sum_{i=1}^{n_0}\sum_{j=1}^{n_0} W\{R_i(\beta),X_j,D_j,\Theta\} + n^{-3/2}\sum_{i=n_0+1}^{n}\sum_{j=n_0+1}^{n} W\{R_i(\beta),X_j,D_j,\Theta\} + n^{-3/2}\sum_{i=n_0+1}^{n}\sum_{j=1}^{n_0} W\{R_i(\beta),X_j,D_j,\Theta\} + n^{-3/2}\sum_{i=1}^{n_0}\sum_{j=n_0+1}^{n} W\{R_i(\beta),X_j,D_j,\Theta\} = n^{-3/2}\sum_{i=1}^{n_0}\sum_{j=1}^{i-1} Q_1\{\mathcal{R}_i(\beta),\mathcal{R}_j(\beta),\Theta\} + n^{-3/2}\sum_{i=n_0+1}^{n}\sum_{j=n_0+1}^{i-1} Q_1\{\mathcal{R}_i(\beta),\mathcal{R}_j(\beta),\Theta\} + n^{-3/2}\sum_{i=n_0+1}^{n}\sum_{j=1}^{n_0} W\{R_i(\beta),X_j,D_j,\Theta\} + n^{-3/2}\sum_{i=1}^{n_0}\sum_{j=n_0+1}^{n} W\{R_i(\beta),X_j,D_j,\Theta\} + o_p(1). \]

An easy calculation shows that

\[ \mathrm{var}\Bigl[n^{-3/2}\sum_{i=n_0+1}^{n}\sum_{j=1}^{n_0} W\{R_i(\beta),X_j,D_j,\Theta\} - n_1 n^{-3/2}\sum_{j=1}^{n_0} Q_{21}\{\mathcal{R}_j(\beta),\beta,\Theta\}\Bigr] \to 0, \]

and similarly

\[ \mathrm{var}\Bigl[n^{-3/2}\sum_{i=1}^{n_0}\sum_{j=n_0+1}^{n} W\{R_i(\beta),X_j,D_j,\Theta\} - n_0 n^{-3/2}\sum_{j=n_0+1}^{n} Q_{20}\{\mathcal{R}_j(\beta),\beta,\Theta\}\Bigr] \to 0. \]

Hence we have shown that

\[ \mathcal{B}_n(\beta,\Theta) = \Bigl(\frac{n_0}{n}\Bigr)^{3/2} n_0^{-3/2}\sum_{i=1}^{n_0}\sum_{j=1}^{i-1} Q_1\{\mathcal{R}_i(\beta),\mathcal{R}_j(\beta),\Theta\} + \Bigl(\frac{n_1}{n}\Bigr)^{3/2} n_1^{-3/2}\sum_{i=n_0+1}^{n}\sum_{j=n_0+1}^{i-1} Q_1\{\mathcal{R}_i(\beta),\mathcal{R}_j(\beta),\Theta\} + n_1 n^{-3/2}\sum_{i=1}^{n_0} Q_{21}\{\mathcal{R}_i(\beta),\beta,\Theta\} + n_0 n^{-3/2}\sum_{i=n_0+1}^{n} Q_{20}\{\mathcal{R}_i(\beta),\beta,\Theta\} + o_p(1). \]

Except for the factor (n0/n)3/2, the first term above is a classical symmetric U-statistic of order 2 applied to independent and identically distributed observations, since by convention the first n0 observations are the controls. It then follows from standard U-statistic theory that (see, for example, Van der Vaart (1998))

\[ \mathcal{B}_n(\beta,\Theta) = \Bigl(\frac{n_0}{n}\Bigr)^{3/2} n_0^{-1/2}\sum_{i=1}^{n_0} h_{10}\{\mathcal{R}_i(\beta),\beta,\Theta\} + \Bigl(\frac{n_1}{n}\Bigr)^{3/2} n_1^{-1/2}\sum_{i=n_0+1}^{n} h_{11}\{\mathcal{R}_i(\beta),\beta,\Theta\} + n_1 n^{-3/2}\sum_{i=1}^{n_0} Q_{21}\{\mathcal{R}_i(\beta),\beta,\Theta\} + n_0 n^{-3/2}\sum_{i=n_0+1}^{n} Q_{20}\{\mathcal{R}_i(\beta),\beta,\Theta\} + o_p(1) = n^{-1/2}\sum_{i=1}^{n} h_2\{R_i(\beta),X_i,D_i,\Theta\} + o_p(1). \]

This completes the proof.

B.2.1. Proof of theorem 1

Because of the unbiasedness of the estimating function (13) and the fact that expression (14) is consistent and asymptotically normally distributed for αtrue when evaluated at (βtrue, Ωtrue), the estimate is consistent for βtrue, and α(βtrue, Ωtrue) = αtrue. Set

\[ \mathcal{J}(R(\beta),X,\beta,\Omega) = \mu_\beta(X,\beta) - \frac{\int \mu_\beta(x,\beta)\,\mathcal{K}\{R(\beta),x,\beta,\Omega\}\, f_X(x)\, dx}{\int \mathcal{K}\{R(\beta),x,\beta,\Omega\}\, f_X(x)\, dx}, \]
\[ c_{1n}(\beta,\Omega) = n^{-1}\sum_{i=1}^{n}\mathcal{J}\{R_i(\beta),X_i,\beta,\Omega\}, \]
\[ c_1(\beta,\Omega) = \sum_{d=0}^{1}\frac{n_d}{n}\, E[\mathcal{J}\{R(\beta),X,\beta,\Omega\}\mid D=d]. \]

We use the fact that 0 = ℋn,est(β, Ω̂)|β=β̂. By a Taylor series expansion and assumption 3,

\[ 0 = \mathcal{H}_{n,{\rm est}}(\beta_{\rm true},\Omega_{\rm true}) + \frac{\partial}{\partial\beta^{T}}\{n^{-1/2}\mathcal{H}_{n,{\rm est}}(\beta_{\rm true},\Omega_{\rm true})\}\, n^{1/2}(\hat\beta-\beta_{\rm true}) + \frac{\partial}{\partial\Omega^{T}}\{n^{-1/2}\mathcal{H}_{n,{\rm est}}(\beta_{\rm true},\Omega_{\rm true})\}\, n^{1/2}(\hat\Omega-\Omega_{\rm true}) + o_p(1). \]

However, since α̂(βtrue, Ωtrue) is a consistent estimator for αtrue, it is clear that we have that

\[ n^{-1/2}\{\partial\mathcal{H}_{n,{\rm est}}(\beta,\Omega_{\rm true})/\partial\beta^{T}\}\bigm|_{\beta=\beta_{\rm true}} = \mathcal{M}_\beta + o_p(1) \]

and

\[ n^{-1/2}\{\partial\mathcal{H}_{n,{\rm est}}(\beta_{\rm true},\Omega)/\partial\Omega^{T}\}\bigm|_{\Omega=\Omega_{\rm true}} = \mathcal{M}_\Omega + o_p(1). \]

Hence it follows that

\[ 0 = \mathcal{H}_{n,{\rm est}}(\beta_{\rm true},\Omega_{\rm true}) + \mathcal{M}_\beta\, n^{1/2}(\hat\beta-\beta_{\rm true}) + \mathcal{M}_\Omega\, n^{1/2}(\hat\Omega-\Omega_{\rm true}) + o_p(1). \]

Because of its form, another Taylor series expansion and under assumption 3,

\[ \mathcal{H}_{n,{\rm est}}(\beta_{\rm true},\Omega_{\rm true}) = \mathcal{H}_n(\alpha_{\rm true},\beta_{\rm true},\Omega_{\rm true}) + c_1(\beta_{\rm true},\Omega_{\rm true})\, n^{1/2}\{\hat\alpha(\beta_{\rm true},\Omega_{\rm true}) - \alpha(\beta_{\rm true},\Omega_{\rm true})\} + o_p(1). \]

However, we can obtain by the same argument as in Appendix A.2 that c1(βtrue, Ωtrue) = 0. In addition, using the same tools as in lemma 1, n^{1/2}{α̂(βtrue, Ωtrue) − α(βtrue, Ωtrue)} = Op(1). We have thus shown that

\[ n^{1/2}(\hat\beta-\beta_{\rm true}) = -\mathcal{M}_\beta^{-1}\{\mathcal{H}_n(\alpha_{\rm true},\beta_{\rm true},\Omega_{\rm true}) + \mathcal{M}_\Omega\, n^{1/2}(\hat\Omega-\Omega_{\rm true})\} + o_p(1). \qquad (18) \]

Because (κ, θ1) is estimated by ordinary logistic regression, and assumption 4 gives a representation for θ̂0 − θ0,true, it follows from standard theory that

\[ n^{1/2}(\hat\Omega-\Omega_{\rm true}) = n^{-1/2}\sum_{i=1}^{n}\{\mathcal{N}_\Omega\Phi(Y_i,X_i,D_i,\Omega_{\rm true}),\Psi(Y_i,X_i,D_i,\Omega_{\rm true})\}^{T} + o_p(1). \]

We thus have from equation (18) that

\[ n^{1/2}(\hat\beta-\beta_{\rm true}) = -\mathcal{M}_\beta^{-1}\Bigl\{\mathcal{H}_n(\alpha_{\rm true},\beta_{\rm true},\Omega_{\rm true}) + \mathcal{M}_\Omega\, n^{-1/2}\sum_{i=1}^{n}\{\mathcal{N}_\Omega\Phi(Y_i,X_i,D_i,\Omega_{\rm true}),\Psi(Y_i,X_i,D_i,\Omega_{\rm true})\}^{T}\Bigr\} + o_p(1). \]

We can now apply lemma 1 to ntrue, βtrue, Ωtrue) with Gnum(r, x, d, Θ) = L{r, x, α(β, Ω), β} 𝒦̃(r, x, d, Θ) and Gden(r, x, d, Θ) = 𝒦̃(r, x, d, Θ). Invoking lemma 1, it follows that

\[ \mathcal{H}_n(\alpha_{\rm true},\beta_{\rm true},\Omega_{\rm true}) = n^{-1/2}\sum_{i=1}^{n}\mathcal{T}\{R_i(\beta_{\rm true}),X_i,\Theta_{\rm true},f_X\} - n^{-1/2}\sum_{i=1}^{n} h_2\{R_i(\beta_{\rm true}),X_i,D_i,\Theta_{\rm true}\} + o_p(1). \]

We have shown in Appendix A.2 that the first term has mean 0. Remember from lemma 1 that E[h2{R(βtrue), X, D, Θtrue}|D] = 0. Moreover, the estimating equation for logistic regression is unbiased and assumption 4 ensures that E[Ψ(Y, X, D, Ωtrue)|D] = 0. Summarizing, we have shown that

\[ n^{1/2}(\hat\beta-\beta_{\rm true}) = -\mathcal{M}_\beta^{-1} n^{-1/2}\sum_{i=1}^{n}\Lambda(Y_i,X_i,D_i,\Theta_{\rm true}) + o_p(1), \]
\[ \Lambda(Y_i,X_i,D_i,\Theta_{\rm true}) = \mathcal{M}_\Omega\,\{\mathcal{N}_\Omega\Phi(Y_i,X_i,D_i,\Omega_{\rm true}),\Psi(Y_i,X_i,D_i,\Omega_{\rm true})\}^{T} - h_2\{R_i(\beta_{\rm true}),X_i,D_i,\Theta_{\rm true}\} + \mathcal{T}\{R_i(\beta_{\rm true}),X_i,\Theta_{\rm true},f_X\}, \]
\[ 0 = E\{\Lambda(Y,X,D,\Theta_{\rm true})\mid D\}, \]

as claimed.

Footnotes

Supporting information

Additional ‘supporting information’ may be found in the on-line version of this article:

‘Supplemental material for Robust estimation for homoscedastic regression in the secondary analysis of case-control data’.

Contributor Information

Jiawei Wei, Texas A&M University, College Station, USA.

Raymond J. Carroll, Texas A&M University, College Station, USA

Ursula U. Müller, Texas A&M University, College Station, USA

Ingrid Van Keilegom, Université catholique de Louvain, Louvain-la-Neuve, Belgium, and Tilburg University, The Netherlands.

Nilanjan Chatterjee, National Cancer Institute, Rockville, USA.

References

  1. Ahn J, Albanes D, Berndt SI, Peters U, Chatterjee N, Freedman ND, Abnet CC, Huang WY, Kibel AS, Crawford ED, Weinstein SJ, Chanock SJ, Schatzkin A, Hayes RB the Prostate, Lung, Colorectal and Ovarian Trial Project Team. Vitamin D-related genes, serum vitamin D concentrations and prostate cancer risk. Carcinogenesis. 2009;30:769–776. doi: 10.1093/carcin/bgp055. [DOI] [PMC free article] [PubMed] [Google Scholar]
  2. Anderson R. Modern Methods for Robust Regression. New York: Sage; 2008. [Google Scholar]
  3. Babu GJ, Singh K. Inference on means using the bootstrap. Ann. Statist. 1983;11:999–1003. [Google Scholar]
  4. Buonaccorsi JP. Measurement Error: Models, Methods and Applications. Boca Raton: Chapman and Hall; 2010. [Google Scholar]
  5. Chatterjee N, Carroll RJ. Semiparametric maximum likelihood estimation in case-control studies of gene-environment interactions. Biometrika. 2005;92:399–418. [Google Scholar]
  6. Chen Y-H, Carroll RJ, Chatterjee N. Retrospective analysis of haplotype-based case-control studies under a flexible model for gene-environment association. Biostatistics. 2008;9:81–99. doi: 10.1093/biostatistics/kxm011. [DOI] [PMC free article] [PubMed] [Google Scholar]
  7. Chen Y-H, Chatterjee N, Carroll RJ. Shrinkage estimators for robust and efficient inference in haplotype-based case-control studies. J. Am. Statist. Ass. 2009;104:220–233. doi: 10.1198/jasa.2009.0104. [DOI] [PMC free article] [PubMed] [Google Scholar]
  8. Chen X, Linton O, Van Keilegom I. Estimation of semiparametric models when the criterion function is not smooth. Econometrica. 2003;71:1591–1608. [Google Scholar]
  9. Epstein M, Satten GA. Inference on haplotype effects in case-control studies using unphased genotype data. Am. J. Hum. Genet. 2003;73:1316–1329. doi: 10.1086/380204. [DOI] [PMC free article] [PubMed] [Google Scholar]
  10. Hall P, Horowitz J. Bootstrap critical values for tests based on generalized-method-of-moments estimators. Econometrica. 1996;64:891–916. [Google Scholar]
  11. Hu YJ, Lin DY, Zeng D. A general framework for studying genetic effects and gene–environment interactions with missing data. Biostatistics. 2010;11:583–598. doi: 10.1093/biostatistics/kxq015. [DOI] [PMC free article] [PubMed] [Google Scholar]
  12. Huber PJ. Robust Statistics. New York: Wiley; 1981. [Google Scholar]
  13. Jiang Y, Scott AJ, Wild CJ. Secondary analysis of case-control data. Statist. Med. 2006;25:1323–1339. doi: 10.1002/sim.2283. [DOI] [PubMed] [Google Scholar]
  14. Kwee LC, Epstein MP, Manatunga AK, Duncan R, Allen AS, Satten GA. Simple methods for assessing haplotype-environment interactions in case-only and case-control studies. Genet. Epidem. 2007;31:75–90. doi: 10.1002/gepi.20192. [DOI] [PMC free article] [PubMed] [Google Scholar]
  15. Lele S. Resampling using estimating equations. In: Godambe UP, editor. Estimating Functions. New York: Oxford University Press; 1991. pp. 295–304. [Google Scholar]
  16. Li H, Gail MH, Berndt S, Chatterjee N. Using cases to strengthen inference on the association between single nucleotide polymorphisms and a secondary phenotype in genome-wide association studies. Genet. Epidem. 2010;34:427–433. doi: 10.1002/gepi.20495. [DOI] [PMC free article] [PubMed] [Google Scholar]
  17. Lin DY, Zeng D. Likelihood-based inference on haplotype effects in genetic association studies (with discussion) J. Am. Statist. Ass. 2006;101:89–118. [Google Scholar]
  18. Lin DY, Zeng D. Proper analysis of secondary phenotype data in case-control association studies. Genet. Epidem. 2009;33:256–265. doi: 10.1002/gepi.20377. [DOI] [PMC free article] [PubMed] [Google Scholar]
  19. Modan MD, Hartge P, Hirsh-Yechezkel G, Chetrit A, Lubin F, Beller U, Ben-Baruch G, Fishman A, Menczer J, Struewing JP, Tucker MA, Wacholder S for the National Israel Ovarian Cancer Study Group. Parity, oral contraceptives and the risk of ovarian cancer among carriers and noncarriers of a BRCA1 or BRCA2 mutation. New Engl. J. Med. 2001;345:235–240. doi: 10.1056/NEJM200107263450401. [DOI] [PubMed] [Google Scholar]
  20. Monsees G, Tamimi R, Kraft P. Genomewide association scans for secondary traits using case-control samples. Genet. Epidem. 2009;33:717–728. doi: 10.1002/gepi.20424. [DOI] [PMC free article] [PubMed] [Google Scholar]
  21. Piegorsch WW, Weinberg CR, Taylor JA. Non-hierarchical logistic models and case-only designs for assessing susceptibility in population based case-control studies. Statist. Med. 1994;13:153–162. doi: 10.1002/sim.4780130206. [DOI] [PubMed] [Google Scholar]
  22. Prentice RL, Pyke R. Logistic disease incidence models and case-control studies. Biometrika. 1979;66:403–411. [Google Scholar]
  23. Spinka C, Carroll RJ, Chatterjee N. Analysis of case-control studies of genetic and environmental factors with missing genetic information and haplotype-phase ambiguity. Genet. Epidem. 2005;29:108–127. doi: 10.1002/gepi.20085. [DOI] [PMC free article] [PubMed] [Google Scholar]
  24. Van der Vaart AW. Asymptotic Statistics. Cambridge: Cambridge University Press; 1998. [Google Scholar]
  25. Wang CY, Wang S, Carroll RJ. Estimation in choice-based sampling with measurement error and bootstrap analysis. J. Econometr. 1997;77:65–86. [Google Scholar]
  26. Zhao LP, Li SS, Khalid N. A method for the assessment of disease associations with single-nucleotide polymorphism haplotypes and environmental variables in case-control studies. Am. J. Hum. Genet. 2003;72:1231–1250. doi: 10.1086/375140. [DOI] [PMC free article] [PubMed] [Google Scholar]
