Summary
Primary analysis of case–control studies focuses on the relationship between disease D and a set of covariates of interest (Y, X). A secondary application of the case–control study, which is often invoked in modern genetic epidemiologic association studies, is to investigate the interrelationship between the covariates themselves. The task is complicated owing to the case–control sampling, where the regression of Y on X is different from what it is in the population. Previous work has assumed a parametric distribution for Y given X and derived semiparametric efficient estimation and inference without any distributional assumptions about X. We take up the issue of estimation of a regression function when Y given X follows a homoscedastic regression model, but otherwise the distribution of Y is unspecified. The semiparametric efficient approaches can be used to construct semiparametric efficient estimates, but they suffer from a lack of robustness to the assumed model for Y given X. We take an entirely different approach. We show how to estimate the regression parameters consistently even if the assumed model for Y given X is incorrect, and thus the estimates are model robust. For this we make the assumption that the disease rate is known or well estimated. The assumption can be dropped when the disease is rare, which is typically so for most case–control studies, and the estimation algorithm simplifies. Simulations and empirical examples are used to illustrate the approach.
Keywords: Biased samples, Homoscedastic regression, Secondary data, Secondary phenotypes, Semiparametric inference, Two-stage samples
1. Introduction
Case–control designs are popularly used for studying risk factors for rare diseases, such as cancers. Under this design, a fixed number of ‘cases’ and ‘controls’, i.e. subjects with and without the disease of interest, are sampled from an underlying base population. Data on various covariates on the subjects are then collected in a retrospective fashion so that they reflect history before the disease. The standard method for primary analysis of case–control data involves logistic regression modelling of the disease outcome as a function of the covariates of interest. It is well known that prospective logistic regression analysis for case–control data is efficient under a semiparametric framework that allows the ‘nuisance’ distribution of the underlying covariates to be unspecified (Prentice and Pyke, 1979).
Epidemiologic researchers popularly use controls from case–control studies to examine the interrelationship between certain covariates themselves. Such secondary analysis of case–control studies has received increasing attention in genetic epidemiologic studies, where it is often of interest to investigate the effect of genetic susceptibility, such as single-nucleotide polymorphism (SNP) genotypes, not only on the primary disease outcome, but also on various secondary factors, such as smoking habits, that may themselves be associated with the disease of interest. For such secondary analysis, use of only controls is generally considered a model robust approach since, when the disease is rare, the relationship between covariates in the controls should reflect that of the underlying population without any further model assumptions. It is, however, recognized that inclusion of cases in such analysis can increase efficiency, provided that appropriate adjustment can be made to account for non-random ascertainment in case–control sampling. Li et al. (2010), for example, reported that, if two binary covariates have no interaction with the risk of the disease on a logistic scale, then the association between the factors in the cases remains the same as that for the underlying population. Therefore in such a setting inclusion of cases can increase the efficiency of the secondary analysis.
In this paper, our goal is to develop an approach to secondary association analysis for a continuous covariate, say Y, in a case–control study setting so that both cases and controls can be used to increase efficiency and yet the resulting inference is model robust to distributional assumptions about the covariates. Suppose that data are originally collected from a case–control study of a relatively rare disease. Let D be disease status, with D = 1 denoting a case and D = 0 denoting a control. Suppose also that D is to be modelled by a vector of random covariates (Y, X), where Y is univariate and X is potentially multivariate, by using a standard logistic regression formulation. Consider here the homoscedastic regression model
$$Y = \alpha_{\mathrm{true}} + \mu(X, \beta_{\mathrm{true}}) + \varepsilon, \tag{1}$$
where αtrue is an intercept and μ(·) is a known function, and where ε has mean 0 and is independent of X, but its distribution is otherwise not specified.
To estimate (αtrue, βtrue), we cannot simply ignore the case–control sampling scheme and use the data as they are: if Y is a predictor of disease status D, the sampling is biased and model (1) will not hold in the case–control sample.
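The bias described above can be seen in a small simulation. The sketch below is illustrative only: it assumes a linear μ(X, β) = βX with αtrue = 0 and βtrue = 1, standard normal errors and the logistic parameter values of the simulation design in Section 5; the function names (`draw_case_control`, `ols_slope`) and the sample sizes are hypothetical choices, not part of the paper.

```python
import math
import random

def logistic(t):
    """Logistic distribution function H(t)."""
    return 1.0 / (1.0 + math.exp(-t))

def draw_case_control(n1, n0, theta0, theta_y, theta_x, rng):
    """Collect n1 cases and n0 controls by drawing subjects from the population.

    Assumed population model: X ~ U(0, 1), Y = X + eps with eps ~ N(0, 1),
    and pr(D = 1 | Y, X) = H(theta0 + theta_y * Y + theta_x * X).
    """
    cases, controls = [], []
    while len(cases) < n1 or len(controls) < n0:
        x = rng.random()
        y = x + rng.gauss(0.0, 1.0)
        diseased = rng.random() < logistic(theta0 + theta_y * y + theta_x * x)
        if diseased and len(cases) < n1:
            cases.append((x, y))
        elif not diseased and len(controls) < n0:
            controls.append((x, y))
    return cases, controls

def ols_slope(pairs):
    """Least squares slope of y on x for a list of (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    return sxy / sxx

rng = random.Random(2013)
cases, controls = draw_case_control(4000, 4000, -5.5, 0.5, 1.0, rng)
slope_controls = ols_slope(controls)       # close to the population slope 1
slope_all = ols_slope(cases + controls)    # ignores the sampling design
print(f"controls only: {slope_controls:.3f}, all data: {slope_all:.3f}")
```

With θy = 0.5 the pooled slope should come out noticeably above 1, whereas the controls-only slope stays near 1 because the disease is rare in this design.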
This paper is organized as follows. In Section 2, we describe recent work on case–control studies that allows efficient estimation if the distribution of Y given X is specified up to parameters. Although the solution is elegant, it suffers from the fact that the resulting estimate may be biased if the hypothesized distribution for Y given X is misspecified.
Section 3 takes an entirely different approach to the basic general problem and describes a simple method that is robust to misspecification of the distribution of Y given X, assuming that the disease rate in the population is known or well estimated from a disease registry or as part of an on-going cohort. In Section 4 we describe extensions to the rare disease case, where this assumption can be dropped, and to the case of stratified or frequency-matched studies. Section 5 presents a series of simulation studies, whereas Section 6 presents analysis of an epidemiological data set. Concluding remarks are in Section 7. Technical details are given in Appendix A and Appendix B.
2. Efficient parametric estimation and robustness
2.1. Framework
In this section we outline recent work on efficient estimation for case–control studies when the distribution of Y given X is specified up to a finite dimensional parameter vector. We start with a logistic regression model underlying the case–control analysis, so that pr(D = 1|Y, X) = H{θ0 + m(Y, X, θ1)}, where H(·) is the logistic distribution function and m(·) is an arbitrary known function with unknown parameter vector θ1. For d = 0, 1, let πd = pr(D = d), the probability that D = d in the population, and suppose that there are n1 cases with D = 1 and n0 controls with D = 0. We write n = n0 + n1 and introduce the parameter κ = θ0 + log(n1/n0) − log(π1/π0). This reparameterization has the advantage that we can identify κ and θ1 from a logistic regression analysis of D on (Y, X), although we cannot identify θ0 (Prentice and Pyke, 1979; Chatterjee and Carroll, 2005) from such logistic regression alone.
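As a small worked illustration of the reparameterization, the sketch below computes κ from θ0 and inverts the relationship when π1 is available. The function names are hypothetical; only the formula κ = θ0 + log(n1/n0) − log(π1/π0) is taken from the text.

```python
import math

def kappa(theta0, n1, n0, pi1):
    """Case-control intercept: kappa = theta0 + log(n1/n0) - log(pi1/pi0)."""
    pi0 = 1.0 - pi1
    return theta0 + math.log(n1 / n0) - math.log(pi1 / pi0)

def theta0_from_kappa(kappa_value, n1, n0, pi1):
    """Invert the reparameterization; this needs the disease rate pi1."""
    pi0 = 1.0 - pi1
    return kappa_value - math.log(n1 / n0) + math.log(pi1 / pi0)

# Example with equal numbers of cases and controls and a rare disease.
k = kappa(theta0=-5.5, n1=500, n0=500, pi1=0.007)
print(k, theta0_from_kappa(k, 500, 500, 0.007))
```

The inversion makes the identifiability point concrete: logistic regression on the case–control sample yields κ, and θ0 can be recovered from it only with outside information on π1.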
In the parametric framework the conditional distribution of Y given X is modelled as fε{y − α − μ(x, β), ζ}, where ζ is a finite dimensional nuisance parameter. If in the population Y given X is normally distributed, then ζ = var(ε).
2.2. Population-based case–control studies and notation
Our explicit theoretical and asymptotic results are based on population-based case–control studies, i.e. studies in which random samples of (Y, X) are taken separately for D = 1 and D = 0. We shall refer to these simply as case–control studies. Some case–control studies use a form of stratification, which is sometimes called frequency matching, e.g. a population-based case–control study for each of a number of age ranges and the same number of cases and controls in each age group. With some notation and the inclusion of these strata in the logistic risk model and in the model for Y given X, our results are easily extended to such sampling; see Section 4.
We assume a logistic model for pr(D = 1|Y, X) as
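$$\mathrm{pr}(D = 1 \mid Y, X) = H\{\theta_0 + m(Y, X, \theta_1)\} = \frac{\exp\{\theta_0 + m(Y, X, \theta_1)\}}{1 + \exp\{\theta_0 + m(Y, X, \theta_1)\}}. \tag{2}$$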
Our technical assumptions are assumptions 1–4 in Appendix B.1.
We also mention two important calculations. The density fX of X in the population can be written as
$$f_X(x) = \pi_0 f_{\mathrm{cont}}(x) + \pi_1 f_{\mathrm{case}}(x), \tag{3}$$
with (π0, π1) defined in Section 2.1, and where fcont(x) and fcase(x) represent the density of X given D = 0 and D = 1 respectively. Since this is a case–control sampling scheme, all expectations are conditional on D1, …, Dn. Define R(β) = Y − μ(X, β) and Ri(β) = Yi − μ(Xi, β). For an arbitrary function G,
(4) |
the second and last steps following because (Y, X) are independent and identically distributed given D in the case–control sampling scheme.
2.3. Prior results and robustness
For the case–control studies that were described above, Jiang et al. (2006), Chen et al. (2008) and Lin and Zeng (2009) derived the efficient profile likelihood (in the sense that its score for β is an efficient score function), Lin and Zeng (2009) noting importantly that it can be used in our context. See also Monsees et al. (2009). Write Ω = (κ, θ1, θ0). The joint density of (D, Y, X) is
Let
The semiparametric efficient retrospective profile likelihood for β that makes no assumptions about the distribution of X when the distribution of Y given X is specified is
Taking logarithms, summing over the observed data and then maximizing in the parameters yields semiparametric efficient inference.
A difficulty arises, however, if the density fε(·) of ε is not specified properly. To see what happens, consider the score for β. Define Lpar(y, x, α, β, ζ) = ∂log[fε{y − α − μ(x, β), ζ}]/∂β. Then the score for β is
(5) |
Because ℒpar(·) is a legitimate semiparametric profile likelihood, when summed over the case–control data and evaluated at the true parameters, score (5) has mean 0. However, score (5), when evaluated at the true parameter values, only has mean 0 in general if the density fε(·) of ε is specified properly, i.e. the approach is not always model robust; see Section 5 for numerical evidence. This motivates our search for a robust estimation method, which is a topic that we take up in the next section.
3. Model robust estimation
3.1. Preliminaries
In this section we assume the same framework as in the previous section, with the exception that fε is now unknown. We pursue a sequential approach to derive an estimating equation for the parameters that determine the regression function.
- Estimate the true logistic regression parameters κ and θ1 by ordinary logistic regression of D on (Y, X). This can be done legitimately because it is known that ordinary logistic regression in a case–control study consistently estimates κ and θ1 (Prentice and Pyke, 1979; Chatterjee and Carroll, 2005). Denote the estimators by κ̂ and θ̂1. We also suppose that we have a consistent estimator of θ0. This estimator can, for example, be the solution of the equation

$$\pi_1 = \sum_{d=0}^{1} \frac{\pi_d}{n_d} \sum_{i=1}^{n} I(D_i = d)\, H\{\theta_0 + m(Y_i, X_i, \hat\theta_1)\} \tag{6}$$

when the disease rate π1 in the population is known or well estimated, either from a disease registry or from an underlying cohort from which the cases and controls are sampled. Equation (6) leads to a consistent estimator of θ0, since for any function g(y, x) we can estimate ∫g(y, x) fYX(y, x) dydx unbiasedly by

$$\sum_{d=0}^{1} \frac{\pi_d}{n_d} \sum_{i=1}^{n} I(D_i = d)\, g(Y_i, X_i).$$

Call the resulting estimator θ̂0 and denote Ω̂ = (κ̂, θ̂1, θ̂0).
- Use a score function for β that would be an appropriate score function if the (Y, X) data arose from random sampling. Define R(β) = Y − μ(X, β). Then the simplest such score function is that from ordinary least squares, which is obtained by differentiating {Y − α − μ(X, β)}² with respect to β. This yields the score function

$$L\{R(\beta), X, \alpha, \beta\} = \mu_\beta(X, \beta)\{R(\beta) - \alpha\}, \tag{7}$$

where the subscript means differentiation with respect to β. Score (7) will not have mean 0 in the case–control sampling scheme, so we adjust it so that it has mean 0 in general.
- For technical reasons that are described later, estimation of αtrue must be done via an auxiliary equation depending on the current values, which we generically call α̂(β, Ω), which replaces α in score (7); see below for the definition.
- Solve the adjusted score equation to estimate βtrue and hence αtrue. Good starting values for β can be obtained by least squares regression among the controls.
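The first step of the list above can be sketched numerically. The code below is a minimal illustration under an assumption: it takes the equation for θ0 to equate π1 with the π_d/n_d-weighted average of the fitted disease probabilities H{θ0 + m(Yi, Xi, θ̂1)}, which is what the weighting argument in the text suggests. The function names and the use of bisection are hypothetical choices.

```python
import math

def logistic(t):
    """Logistic distribution function H(t)."""
    return 1.0 / (1.0 + math.exp(-t))

def weighted_mean(values, d, pi1):
    """Estimate a population mean from case-control data.

    Each subject with D = d receives weight pi_d / n_d, so that cases and
    controls are reweighted to their population shares.
    """
    n1 = sum(d)
    n0 = len(d) - n1
    weight = {0: (1.0 - pi1) / n0, 1: pi1 / n1}
    return sum(weight[di] * v for v, di in zip(values, d))

def solve_theta0(m_values, d, pi1, lo=-30.0, hi=30.0, tol=1e-10):
    """Bisection for theta0 such that the weighted mean of H(theta0 + m_i) equals pi1.

    m_values holds the fitted linear predictors m(Y_i, X_i, theta1_hat);
    the left-hand side is increasing in theta0, so bisection applies.
    """
    def gap(theta0):
        fitted = [logistic(theta0 + m) for m in m_values]
        return weighted_mean(fitted, d, pi1) - pi1
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if gap(mid) > 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)
```

A convenient sanity check is the null case m ≡ 0, where the solution must be the population log-odds log(π1/π0).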
Remark 1. The score function (7) is not the only one possible; for example, we could instead allow for robustness against outliers by replacing function (7) by the estimating equation of an M-estimator (Huber, 1981; Anderson, 2008).
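To make remark 1 concrete, the sketch below shows one possible M-estimator variant in which the residual in a least squares type score is passed through Huber's ψ-function. This is an illustration only: the tuning constant c = 1.345 is a conventional choice rather than one made in the paper, and the callables `mu` and `mu_beta` are hypothetical stand-ins for μ(x, β) and its derivative with respect to β.

```python
def huber_psi(r, c=1.345):
    """Huber's psi function: the identity on [-c, c], clipped outside."""
    return max(-c, min(c, r))

def robust_score(y, x, alpha, beta, mu, mu_beta, c=1.345):
    """Huber-type analogue of the least squares score for scalar beta.

    Returns mu_beta(x, beta) * psi(y - alpha - mu(x, beta)); with c very
    large this reduces to the ordinary least squares score.
    """
    return mu_beta(x, beta) * huber_psi(y - alpha - mu(x, beta), c)

# Linear example: mu(x, beta) = beta * x, so mu_beta(x, beta) = x.
linear_mu = lambda x, beta: beta * x
linear_mu_beta = lambda x, beta: x
print(robust_score(5.0, 2.0, 0.0, 1.0, linear_mu, linear_mu_beta))
```

Any such replacement score would still need the case–control adjustment developed in the following sections.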
3.2. Estimation algorithm
The development of our methodology is somewhat involved. Here we simply state our proposal, with its development given in Sections 3.3–3.5. As before, define R(β) = Y − μ(X, β). Remember that estimation of αtrue must be done by using an auxiliary equation; see equation (8) directly below. Define
For given (β, Ω), the estimator of αtrue is justified in Section 3.5 and given by
(8) |
where
Let μβ(x, β) = ∂μ(x, β)/∂β and let L{R(β), X, α, β} be as in equation (7). Then define
(9) |
Our algorithm then is as follows.
Estimate (κ, θ1)T by (κ̂, θ̂1)T, the logistic regression estimates of D on (Y, X). As described previously, this is known to produce consistent estimates of (κtrue, θ1,true)T. Estimate θ0 as explained in Section 3.1. This leads to an estimator Ω̂ of Ωtrue.
Solve 0 = Q̂n,est(β, Ω̂) in β to obtain the estimate β̂.
In the next few subsections, we describe how we obtained equation (9), and at the end we describe the asymptotic distribution theory.
3.3. Development of the score when fX and αtrue are known
3.3.1. Adjusting score (7)
We first describe how to proceed when the intercept αtrue, the density fX(·) of X in the population, and fε(t − αtrue), the density of Y − μ(X, βtrue) in the population, are all known; they are not and we shall show how to remove these restrictions in subsequent sections.
The approach is to start with the estimating function (7), which, when summed over the data, does not have mean 0 at the true parameters because of the case–control sampling scheme, i.e. E[Σi L{Ri(βtrue), Xi, αtrue, βtrue}] ≠ 0 in general. Thus, we need to correct the score so that it does have mean 0 in the case–control sampling scheme, where expectations are computed as in equation (4). In the on-line supplemental material, we show how to follow the approach of Chen et al. (2009), section 2.3.3, to develop the adjusted estimating function
(10) |
This is not of much help, since none of fε(·), fX(·) or αtrue are known. In subsequent sections we show how to replace these terms by data-estimated quantities, and thus arrive at equation (9).
3.3.2. Replacing the unknown error density
The problem with expression (10) is that we do not know the form of fε(·), so score (10) cannot be implemented. Similarly to Chatterjee and Carroll (2005) and Spinka et al. (2005), we therefore replace fε(·) by a non-parametric maximum likelihood estimator. The idea is to take the observed Ri(β) = Yi − μ(Xi, β) as the support, and to maximize the log-likelihood with respect to γi = pr{R(β) = Ri(β)}, i = 1, …, n, subject to γ1 + … + γn = 1. By Chatterjee and Carroll (2005) and Spinka et al. (2005), the resulting estimator for pr{R(β) = Ri(β)} is
(11) |
The derivation of equation (11) is given in Appendix A.1. When we make this substitution in expression (10) and sum over the data, the score becomes
Because the denominator of this expression is π0/n0, by simple algebra it is readily seen that the normalized score function for estimating β can be defined as
(12) |
In Appendix A.2 we show that the expectation of Qn(αtrue, β, Ω) in the case–control sampling scheme is equal to 0 when evaluated at (αtrue, βtrue, Ωtrue), but not for arbitrary (β, Ω). This implies that equation (12) is indeed an unbiased estimating equation in the case–control sampling scheme.
3.4. Implementation when fX is unknown but αtrue is known
The density or mass function fX(·) is not known. We estimate the integrals in expression (12) unbiasedly by their sample average over all the observations, so our estimating equation is
(13) |
3.5. Implementation when the intercept αtrue is unknown
One might reasonably think that estimating the intercept is easy; for example, simply supplement the score (7) with the ordinary least squares score for the intercept, R(β) − α, as its first component. The problem with this is that the first component of the estimating equation (13) would then be identically 0 and thus will not produce an estimate of the intercept. The reason for this is that the solution (11) was calculated non-parametrically under the assumption that R(βtrue) and X are independent in the population. Since Y − αtrue − μ(X, βtrue) and Y − μ(X, βtrue) are both independent of X in the population, this means that equation (11) cannot lead to an estimate of the intercept. Hence, an alternative approach is required.
To overcome this problem, we estimate the intercept of R(β) by using equation (11), i.e., if fX(·) were known, then αtrue could be estimated by
(14) |
a quantity that is free of the π0 that shows up in equation (11). If we then replace the integral in the definition of pest(·) by its sample average, we obtain exactly expression (8). Making this substitution in equation (13), we obtain equation (9). This completes the derivation of our methodology.
3.6. Distribution theory
The asymptotic distribution of our estimator is given in the following result. We refer to Appendix B.1 for the definition of the functions and matrices that are mentioned below, and for the assumptions 1–4 there under which this result is valid. The proof of this theorem is given in Appendix B.2.
Theorem 1. Let (β, Ω) = Θ, and let Θtrue denote its true value. Assume that assumptions 1–4 in Appendix B.1 are valid. Then there is an invertible matrix ℳβ and a function Λ(Y, X, D, Θtrue) with the properties that E{Λ(Y, X, D, Θtrue)|D} = 0 and
Therefore, there is a matrix Σ, defined in Appendix B.1, such that
(15) |
Estimating the covariance matrix Σ in expression (15) can be accomplished by a plug-in method or by the bootstrap appropriate for case–control sampling (Wang et al., 1997; Buonaccorsi, 2010).
3.7. Inference via bootstrap resampling
In principle, estimating the covariance matrix Σ in expression (15) can be accomplished by a plug-in method, although the particular form of the function Q1(·) that is defined in Appendix B.1 makes the computation slow. We have thus chosen to use bootstrap ideas to estimate Σ. Below we explain in detail how this can be done, but the basic idea is that we have random samples from two independent populations, i.e. the cases and the controls, and an estimator that is asymptotically normally distributed.
3.7.1. Bootstrap procedure
Let (Yi*, Xi*), i = 1, …, n0, be drawn randomly with replacement from {(Yi, Xi) : Di = 0}, and similarly let (Yi*, Xi*), i = n0 + 1, …, n, be drawn randomly with replacement from {(Yi, Xi) : Di = 1}. This is the method of bootstrap sampling that was suggested by Wang et al. (1997) and Buonaccorsi (2010), page 225, and, since the data consist of samples from two independent populations, is the same as in Babu and Singh (1983); see also Lele (1991).
Let Di* = 0 for i = 1, …, n0 and Di* = 1 for i = n0 + 1, …, n, and define Ω̂*, α̂*(β, Ω) and Q̂*n,est(β, Ω) in the same way as Ω̂, α̂(β, Ω) in equation (8) and Q̂n,est(β, Ω) in equation (9), but based on (Yi*, Xi*, Di*) instead of (Yi, Xi, Di), i = 1, …, n.
The bootstrapped estimator β̂* of β is then defined as a solution of
0 = Q̂*n,est(β, Ω̂*)
with respect to β. See also Hall and Horowitz (1996), page 897, and Chen et al. (2003), where bootstrapping is used and justified in similar contexts.
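The resampling scheme just described, in which cases and controls are resampled separately so that n1 and n0 stay fixed, can be sketched as follows. This is a generic illustration, not the authors' implementation: `estimator` is a hypothetical placeholder for any scalar estimator computed from a case–control sample, and the data layout as (y, x, d) tuples is an assumption.

```python
import random

def case_control_bootstrap(data, rng):
    """Resample with replacement separately within controls and cases.

    data is a list of (y, x, d) tuples with d in {0, 1}; the bootstrap
    sample keeps the original numbers of controls and cases.
    """
    controls = [row for row in data if row[2] == 0]
    cases = [row for row in data if row[2] == 1]
    sample = [rng.choice(controls) for _ in controls]
    sample += [rng.choice(cases) for _ in cases]
    return sample

def bootstrap_sd(data, estimator, n_boot, rng):
    """Bootstrap standard deviation of a scalar estimator."""
    reps = [estimator(case_control_bootstrap(data, rng)) for _ in range(n_boot)]
    mean = sum(reps) / n_boot
    return (sum((r - mean) ** 2 for r in reps) / (n_boot - 1)) ** 0.5

# Toy usage: bootstrap the sample mean of y.
rng = random.Random(0)
toy = [(0.1 * i, 0.2 * i, 1 if i % 3 == 0 else 0) for i in range(9)]
mean_y = lambda sample: sum(row[0] for row in sample) / len(sample)
print(bootstrap_sd(toy, mean_y, 50, rng))
```

In the setting of this section the placeholder estimator would be the full procedure, i.e. refitting the logistic regression and solving the estimating equation on each bootstrap sample.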
3.7.2. Bootstrap consistency
To show the consistency of the above bootstrap procedure, we need to show that n^{1/2}(β̂* − β̂) converges to the same normal limit as the original centred estimator n^{1/2}(β̂ − βtrue). For this we use the same techniques as in the proof of theorem B in Chen et al. (2003), combined with the proof of theorem 1 in Appendix B.2. More precisely, it can be shown that, under certain regularity conditions, we have that
where op*(1) has the same meaning as op(1), except that the probability is computed under the bootstrap distribution conditional on the original data (Yi, Xi, Di), i = 1, …, n. From this together with the central limit theorem and theorem 1 the result follows.
4. Extensions
4.1. Rare disease approximations
The method that was defined in Section 3 assumes that π1 = pr(D = 1) is known. This is typically not the case, so many researchers adopt rare disease approximations (see below for references), where the word ‘rare’ has no precise definition but is certainly 1% or less. There are at least two ways to proceed in our context. The first is to use the literature, to choose a nominal π1 ≤ 1% and to apply the method in Section 3. In results that are not reported here, this works well in the simulation setting of Section 5. In the literature, most researchers use a different approximation, which is described next and implemented in Section 5. We have not investigated in any detail which approach is preferable.
Let ‘≐’ denote ‘approximately equal’. The estimation procedure simplifies if the disease can be assumed to be rare, i.e. if

$$\mathrm{pr}(D = 1 \mid Y, X) \doteq \exp\{\theta_0 + m(Y, X, \theta_1)\}$$

or, equivalently, if pr(D = 0|Y, X) = [1 + exp{θ0 + m(Y, X, θ1)}]^{−1} ≐ 1. This approximation allows us to replace 𝒦 in the estimating function (12) by
(16) |
In addition, Ω = (κ, θ1, θ0) in 𝒦 is replaced by Ω* = (κ, θ1), which does not depend on θ0 any more, and assumption 4 is no longer required since θ0 is no longer estimated. The proof in Appendix A.2, where we show that the estimating function (12) is unbiased, adapts to the rare disease case in a straightforward way, now using the approximation
Hence the modified estimating function based on 𝒦* is approximately unbiased in the rare disease case.
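The quality of the rare disease approximation is easy to quantify: replacing H(η) by exp(η) incurs a relative error of exactly exp(η), since exp(η)/H(η) = 1 + exp(η). The sketch below checks this numerically; the function names are hypothetical and the value η = −5.5 is taken from the intercept used in the simulations of Section 5.

```python
import math

def logistic(t):
    """Logistic distribution function H(t)."""
    return 1.0 / (1.0 + math.exp(-t))

def rare_disease_relative_error(eta):
    """Relative error of approximating H(eta) by exp(eta)."""
    exact = logistic(eta)
    return (math.exp(eta) - exact) / exact

for eta in (-5.5, -4.0, -2.0):
    print(eta, logistic(eta), rare_disease_relative_error(eta))
```

For linear predictors around −5.5 the relative error is below half a per cent, whereas for η near −2 it already exceeds 10%, which is one way to see why the approximation is reserved for rare outcomes.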
As in the general case, the rare disease version of the estimating function (12) depends on unknown quantities which must be estimated. The estimation algorithm for the rare disease model is as follows and is explained below. Set
As before, estimate Ω* = (κ, θ1) by the logistic regression estimates of D on (Y, X); then solve the resulting estimating equation with respect to β to obtain β̂.
The formulae for α̂* and the estimating function do not contain an average 𝒦̃*, which could be introduced analogously to the general case where both formulae involve 𝒦̃, and which depends on π1 = pr(D = 1). This is explained as follows: both the estimating function (12) and the estimator pest, which is used to estimate αtrue, depend on the unknown density fX. As already explained in Section 2 at equation (3), under the rare disease approximation, fX can be approximated by fcont, i.e. we can estimate fX empirically using only the controls. This has the advantage that we do not need prior knowledge about the typically unknown disease rate π1. This is in contrast with the general model where we need to know π1 not only to be able to work with 𝒦̃, but also to obtain a consistent estimator of θ0.
Because case–control studies are almost inevitably conducted for rare outcomes, the rare disease approximation is natural in most applications. It is also widely used: a far from exhaustive list of examples includes Piegorsch et al. (1994), Epstein and Satten (2003), Lin and Zeng (2006), Modan et al. (2001), Zhao et al. (2003), Kwee et al. (2007), Lin and Zeng (2009) and Hu et al. (2010).
4.2. Case–control studies with frequency matching
In frequency-matched case–control studies, a few strata are formed based on covariates such as age, and then a population-based case–control study is performed within each stratum. A straightforward approach is to include these matching variables as part of X, to form the estimating function (9) for each stratum and to form a new estimating function as the possibly weighted sum of the estimating functions across the strata. The weights might for example be based on estimates of the size of each stratum in the population. The resulting estimates of (αtrue, βtrue) will be asymptotically normally distributed.
5. Simulations
We performed simulation studies both at and away from the Gaussian model. Our simulations indicate that our proposed estimator has small bias and nearly nominal coverage probability in the cases that we examined, whereas an implementation of the parametric approach (see Section 2.3) may suffer from bias and lower coverage probability (Tables 1 and 2). We also show that our method often achieves significant gains in efficiency when compared with the estimator that uses only the controls. The approach that uses all the data but ignores the case–control sampling design suffers from bias and low coverage; see below.
Table 1.
Results of the simulation study with n1 = 500 cases and n0 = 500 controls, and a disease rate of approximately 1%†
| | Normal: Controls | Normal: SPMLE | Normal: Robust | Normal: All | Gamma: Controls | Gamma: SPMLE | Gamma: Robust | Gamma: All |
|---|---|---|---|---|---|---|---|---|
| θy = 0.00 | | | | | | | | |
| Mean | 0.992 | 0.991 | 1.001 | 0.992 | 1.002 | 1.005 | 1.003 | 1.003 |
| sd | 0.148 | 0.107 | 0.119 | 0.105 | 0.156 | 0.111 | 0.120 | 0.111 |
| Est. sd | 0.154 | 0.110 | 0.121 | 0.109 | 0.154 | 0.110 | 0.121 | 0.109 |
| 90% | 0.917 | 0.911 | 0.918 | 0.912 | 0.892 | 0.897 | 0.899 | 0.901 |
| 95% | 0.956 | 0.955 | 0.965 | 0.955 | 0.944 | 0.943 | 0.944 | 0.941 |
| MSE Eff | | 1.898 | 1.537 | 1.965 | | 1.963 | 1.665 | 1.957 |
| θy = 0.25 | | | | | | | | |
| Mean | 0.999 | 1.001 | 0.990 | 1.078 | 1.001 | 0.997 | 0.993 | 1.120 |
| sd | 0.154 | 0.110 | 0.117 | 0.109 | 0.155 | 0.144 | 0.120 | 0.144 |
| Est. sd | 0.154 | 0.111 | 0.119 | 0.110 | 0.153 | 0.149 | 0.123 | 0.148 |
| 90% | 0.911 | 0.905 | 0.908 | 0.818 | 0.900 | 0.924 | 0.901 | 0.797 |
| 95% | 0.955 | 0.954 | 0.958 | 0.889 | 0.945 | 0.961 | 0.947 | 0.881 |
| MSE Eff | | 1.951 | 1.720 | 1.303 | | 1.148 | 1.643 | 0.680 |
| θy = 0.50 | | | | | | | | |
| Mean | 0.995 | 0.994 | 0.989 | 1.177 | 0.986 | 0.848 | 1.024 | 1.297 |
| sd | 0.154 | 0.114 | 0.117 | 0.114 | 0.144 | 0.205 | 0.147 | 0.208 |
| Est. sd | 0.154 | 0.113 | 0.120 | 0.113 | 0.148 | 0.208 | 0.149 | 0.215 |
| 90% | 0.903 | 0.898 | 0.904 | 0.525 | 0.906 | 0.818 | 0.905 | 0.587 |
| 95% | 0.957 | 0.947 | 0.948 | 0.641 | 0.953 | 0.884 | 0.957 | 0.719 |
| MSE Eff | | 1.822 | 1.704 | 0.531 | | 0.323 | 0.938 | 0.159 |
‘Normal’ means that ε ~ N(0, 1), and ‘gamma’ means that ε is a centred and scaled gamma random variable with shape parameter 0.4. The analyses performed are ‘controls’ (using only controls), the semiparametric efficient method that assumes normality (‘SPMLE’), our new estimator (‘robust’), and ‘all’, which is the method that uses all the data while ignoring the case–control study. Over 1000 simulations, we computed the mean estimated β (‘mean’), its standard deviation (‘sd’), the mean estimated standard deviation (‘Est. sd’), the coverage for a nominal 90% confidence interval (‘90%’), the coverage for a nominal 95% confidence interval (‘95%’) and the mean-squared error efficiency (‘MSE Eff’) compared with using only the controls.
Table 2.
Results of the simulation study described in Table 1, now with n1 = 150 cases and n0 = 150 controls†
| | Normal: Controls | Normal: SPMLE | Normal: Robust | Normal: All | Gamma: Controls | Gamma: SPMLE | Gamma: Robust | Gamma: All |
|---|---|---|---|---|---|---|---|---|
| θy = 0.00 | | | | | | | | |
| Mean | 0.991 | 0.993 | 1.005 | 0.992 | 0.998 | 0.991 | 1.019 | 0.990 |
| sd | 0.287 | 0.204 | 0.233 | 0.200 | 0.292 | 0.201 | 0.236 | 0.199 |
| Est. sd | 0.282 | 0.202 | 0.230 | 0.200 | 0.281 | 0.201 | 0.230 | 0.199 |
| 90% | 0.891 | 0.908 | 0.910 | 0.905 | 0.892 | 0.900 | 0.916 | 0.902 |
| 95% | 0.942 | 0.951 | 0.965 | 0.952 | 0.948 | 0.950 | 0.959 | 0.950 |
| MSE Eff | | 1.973 | 1.509 | 2.043 | | 2.103 | 1.526 | 2.151 |
| θy = 0.25 | | | | | | | | |
| Mean | 1.008 | 1.016 | 0.983 | 1.092 | 1.007 | 0.994 | 0.974 | 1.118 |
| sd | 0.301 | 0.204 | 0.220 | 0.202 | 0.280 | 0.268 | 0.223 | 0.267 |
| Est. sd | 0.283 | 0.204 | 0.227 | 0.202 | 0.273 | 0.269 | 0.232 | 0.268 |
| 90% | 0.874 | 0.893 | 0.933 | 0.867 | 0.903 | 0.900 | 0.928 | 0.864 |
| 95% | 0.933 | 0.950 | 0.968 | 0.930 | 0.943 | 0.947 | 0.968 | 0.928 |
| MSE Eff | | 2.156 | 1.856 | 1.834 | | 1.088 | 1.551 | 0.921 |
| θy = 0.50 | | | | | | | | |
| Mean | 0.986 | 0.987 | 0.974 | 1.173 | 0.985 | 0.837 | 1.006 | 1.292 |
| sd | 0.283 | 0.199 | 0.222 | 0.200 | 0.265 | 0.393 | 0.295 | 0.400 |
| Est. sd | 0.282 | 0.206 | 0.235 | 0.207 | 0.266 | 0.381 | 0.311 | 0.393 |
| 90% | 0.903 | 0.918 | 0.936 | 0.798 | 0.900 | 0.864 | 0.938 | 0.808 |
| 95% | 0.948 | 0.958 | 0.973 | 0.871 | 0.943 | 0.923 | 0.969 | 0.888 |
| MSE Eff | | 2.003 | 1.597 | 1.143 | | 0.388 | 0.806 | 0.287 |
The disease rate is approximately 1%.
We generated X from a uniform distribution on (0, 1). The logistic regression model is pr(D = 1|Y, X) = H(θ0 + θyY + θxX), with θ0 = −5.5, θy = 0.00, 0.25, 0.50 and θx = 1. The model for Y given X is a linear regression model, Y = αtrue + βtrueX + ε, with αtrue = 0 and βtrue = 1. We considered two distributions for ε: the standard normal distribution, for which the parametric approach attains the semiparametric efficiency bound, and, for comparison, a standardized gamma distribution with shape parameter 0.4. By equation (2), for θy = 0.00, 0.25, 0.50 the rates of disease are approximately 0.007, 0.008 and 0.010. In the first scenario the case–control study has n1 = 500 cases and n0 = 500 controls. In the second scenario we chose n0 = n1 = 150. We generated 1000 simulated data sets in each setting.
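The quoted disease rates can be checked approximately by integrating the logistic model over the distribution of (Y, X). The sketch below does this for the normal error case only, using a simple midpoint rule; the function name, grid sizes and truncation of ε to (−8, 8) are hypothetical numerical choices.

```python
import math

def logistic(t):
    """Logistic distribution function H(t)."""
    return 1.0 / (1.0 + math.exp(-t))

def disease_rate(theta_y, theta0=-5.5, theta_x=1.0, alpha=0.0, beta=1.0,
                 nx=100, ne=400, e_max=8.0):
    """Approximate pr(D = 1) for X ~ U(0, 1), Y = alpha + beta * X + eps.

    eps is standard normal; the double integral over (x, eps) is computed
    with the midpoint rule on an nx-by-ne grid.
    """
    step_e = 2.0 * e_max / ne
    total = 0.0
    for i in range(nx):
        x = (i + 0.5) / nx
        for j in range(ne):
            e = -e_max + (j + 0.5) * step_e
            y = alpha + beta * x + e
            density = math.exp(-0.5 * e * e) / math.sqrt(2.0 * math.pi)
            total += logistic(theta0 + theta_y * y + theta_x * x) * density * step_e / nx
    return total

for theta_y in (0.0, 0.25, 0.5):
    print(theta_y, disease_rate(theta_y))
```

All three settings give a disease rate of roughly 1% or less, consistent with the rare disease approximation used for the third and fourth methods.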
We contrasted four methods. The first uses ordinary linear regression based only on the controls. The second method uses the same approach but is expected to be significantly biased since it is based on the entire data set. The third method is the parametric (‘semiparametric efficient’) method that assumes normal errors, with standard errors obtained by inverting the Hessian of the log-likelihood. The fourth method is our proposed method, with standard errors estimated by using the bootstrap of Section 3.7. The third and the fourth method were computed by making the rare disease approximation.
The case θy = 0.00 is interesting, because here Y is independent of D given X. Hence all methods should achieve nominal coverage probabilities for estimating βtrue, which is indeed seen in Table 1. Since, with θy = 0.00, all methods are asymptotically valid, the only possibility of seeing a bias is when θy is sufficiently ‘large’. For this reason, we experimented with the cases θy = 0.25 and θy = 0.50. Consider θy = 0.25 first. Here the approach that uses all the data yields a biased estimator of βtrue = 1, with low coverage probabilities. The ‘semiparametric efficient’ method that assumes normality still maintains its nominal coverage probabilities. As expected, since it is efficient if the errors are normal, it indeed outperforms the other approaches in this case. To quantify this, for any two methods, say A and B, with estimates β̂A and β̂B, the mean-squared error efficiency of method A with respect to method B is E{(β̂B − βtrue)²}/E{(β̂A − βtrue)²}, and its estimated version is computed by replacing expectations by averages across the simulations. The semiparametric efficient method has 13% greater mean-squared error efficiency than our method in the normal case. However, in the gamma case, our method has 43% greater mean-squared error efficiency. It also outperforms the approach that uses only the controls, for both normal and gamma errors: in both cases the mean-squared error efficiency is roughly 70% larger.
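The efficiency measure defined above is straightforward to compute from simulation output. The sketch below uses hypothetical function names and toy inputs; it also shows how the pairwise comparison quoted in the text follows from the tabulated efficiencies, which are all expressed relative to the controls-only analysis.

```python
def mse(estimates, truth):
    """Average squared error of a list of simulated estimates."""
    return sum((b - truth) ** 2 for b in estimates) / len(estimates)

def mse_efficiency(estimates_a, estimates_b, truth):
    """Estimated MSE efficiency of method A with respect to method B."""
    return mse(estimates_b, truth) / mse(estimates_a, truth)

# Toy example: method A has a quarter of the mean-squared error of method B.
print(mse_efficiency([0.9, 1.1], [0.8, 1.2], truth=1.0))

# Efficiencies relative to a common reference can be divided to compare two
# methods directly, e.g. SPMLE versus robust for theta_y = 0.25, normal errors.
spmle_vs_robust = 1.951 / 1.720
print(spmle_vs_robust)
```

The second calculation gives a ratio of about 1.13, i.e. the 13% advantage of the semiparametric efficient method under normal errors.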
Finally, in the case θy = 0.50 with normal regression errors, the semiparametric efficient method that assumes normality maintains its nominal coverage probabilities and has 7% greater mean-squared error efficiency than our method and 82% greater efficiency than using only controls. However, when the errors have a gamma distribution, it suffers from bias, increased variance and loss of coverage, with nominal 90% and 95% coverage actually being 81.8% and 88.4% respectively. Our method retains nominal coverage. The controls-only analysis and our method have roughly equal mean-squared error efficiency, which is much greater than that of the semiparametric efficient approach that assumes normal errors.
6. Empirical example
In this section, we illustrate the methodology in a case–control study of prostate cancer, which was originally designed to investigate the risk of prostate cancer associated with vitamin D biomarkers and genetic variations in vitamin D metabolism pathways (Ahn et al., 2009). The goal of the current analysis, which includes 749 prostate cancer cases and 781 controls, is to examine whether the genetic variations in the vitamin D receptor influence [25(OH)D], which is a serum level biomarker of vitamin D. In the notation of this paper, D is the prostate cancer case–control status and Y is the level of [25(OH)D]. We investigated three SNPs, rs2238136, rs2254210 and rs2239186, each of which represents an ordinal categorical variable coded as 0, 1 or 2 depending on how many copies of the variant allele a subject carries. In our analysis, X consists of three dummy variables for age groups, along with one of the genetic markers.
The results are given in Table 3. We see in Table 3 that none of the coefficients for the SNP are statistically significant. Thus, neither the traditional control-only nor the proposed method detected any association between the vitamin D receptor gene and [25(OH)D] level. These results are consistent with Chen et al. (2009) who noted that, given the downstream role of the vitamin D receptor gene in the vitamin D pathway, it is unlikely that vitamin D receptor polymorphisms could actually influence the level of [25(OH)D]. In spite of a lack of association, it is interesting to observe that the 95% confidence intervals by using our method are much shorter than by using those from the control data only. In terms of mean-squared error efficiency, here estimated as the square of the ratio of the lengths of the confidence intervals, the results for the three SNPs suggest gains in efficiency of 68%, 136% and 125% compared with using only the controls.
Table 3.
Results of the vitamin D receptor data example in Section 6†
X | Estimate (our method) | Lower limit (our method) | Upper limit (our method) | Estimate (controls only) | Lower limit (controls only) | Upper limit (controls only) | Efficiency |
---|---|---|---|---|---|---|---|
SNP 1 | 0.015 | −0.165 | 0.195 | −0.029 | −0.262 | 0.204 | 1.68 |
SNP 2 | 0.023 | −0.047 | 0.093 | 0.039 | −0.069 | 0.146 | 2.36 |
SNP 3 | 0.015 | −0.062 | 0.092 | −0.045 | −0.161 | 0.070 | 2.25 |
†Three analyses are displayed, one each when X is SNP 1, SNP 2 and SNP 3. Displayed are the parameter estimates of the slope for X (‘estimate’), and the lower and upper limits of the 95% confidence intervals (‘lower limit’ and ‘upper limit’). Our method is contrasted with using linear regression among the controls only. Also displayed is the ‘efficiency’, which is defined as the square of the ratio of the length of the controls-only confidence interval to the length of the confidence interval from our method.
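As a check on the ‘efficiency’ column, the following sketch recomputes it from the confidence limits reported in Table 3. The helper name `ci_length_efficiency` is an illustrative assumption; the numbers are those of the table.

```python
def ci_length_efficiency(ours, controls):
    """Square of the ratio of confidence interval lengths (controls / ours)."""
    ours_length = ours[1] - ours[0]
    controls_length = controls[1] - controls[0]
    return (controls_length / ours_length) ** 2

# (lower, upper) 95% limits from Table 3: our method, then controls only.
table3 = {
    "SNP 1": ((-0.165, 0.195), (-0.262, 0.204)),
    "SNP 2": ((-0.047, 0.093), (-0.069, 0.146)),
    "SNP 3": ((-0.062, 0.092), (-0.161, 0.070)),
}

efficiencies = {
    snp: round(ci_length_efficiency(ours, controls), 2)
    for snp, (ours, controls) in table3.items()
}
print(efficiencies)
```

The rounded values agree with the reported efficiencies of 1.68, 2.36 and 2.25, i.e. gains of 68%, 136% and 125%.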
7. Discussion
If the disease probability pr(D = 1) is known, there are simpler methods for our particular setting that allow estimation of βtrue, based on weighting via equation (3). However, in the common case that pr(D = 1) is not known, the development in Section 3 leads to two natural rare disease approximations that use all the data and not just the data on the controls; see Section 4.1. It would be interesting to investigate which of these two approximate approaches is preferable in general.
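To illustrate what a weighting approach can look like when pr(D = 1) is known, here is a minimal sketch of generic inverse-probability-weighted least squares. It is an assumption that this matches the spirit of equation (3), which is not reproduced here; the function name `ipw_least_squares` and the toy data are hypothetical.

```python
import numpy as np

def ipw_least_squares(y, x, d, disease_rate):
    """Weighted least squares of y on (1, x) for case-control data.

    Each subject is weighted by (population share) / (sample share) of its
    disease stratum, so the weighted sample mimics the base population.
    """
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    d = np.asarray(d, dtype=int)
    n = d.size
    n1 = d.sum()          # number of cases
    n0 = n - n1           # number of controls
    weights = np.where(d == 1,
                       disease_rate / (n1 / n),
                       (1.0 - disease_rate) / (n0 / n))
    design = np.column_stack([np.ones(n), x])
    root_w = np.sqrt(weights)
    # Solve the weighted normal equations via ordinary least squares
    # on rows rescaled by the square root of the weights.
    coef, *_ = np.linalg.lstsq(design * root_w[:, None], y * root_w, rcond=None)
    return coef  # (intercept, slope(s))

# Toy data (assumed): an exact linear relationship is recovered for any weights.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = 1.0 + 2.0 * x
d = np.array([0, 0, 0, 1, 1, 1])
coef = ipw_least_squares(y, x, d, disease_rate=0.01)
```

With a rare disease the cases receive very small weights, which is one reason why such weighting can be inefficient compared with methods that use the cases more fully.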
Our simulation results are specific to rare diseases, by which we mean certainly that pr(D = 1) ≤ 1%. Biases will arise as the disease probability increases. In addition, since rare disease approximations do not lead to fully consistent estimation, coverage probability in large samples will suffer, since the bias is fixed whereas the variance decreases with sample size. Finally, the methods are likely to suffer in cases where the X-distribution has relatively rare values that are not within the centre of the support of X.
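The effect of a fixed bias on coverage can be made concrete with a small calculation. The sketch assumes, purely for illustration, that the estimator is normally distributed with mean βtrue + bias and standard deviation sd/n^{1/2}, and that the interval is the estimate plus or minus z times that standard deviation; the bias and sd values are hypothetical.

```python
from math import erf, sqrt

def normal_cdf(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def coverage_with_bias(bias, sd, n, z=1.959964):
    """Coverage of a nominal 95% interval when the estimator has a fixed bias.

    The interval covers the truth when |Z + shift| <= z, where Z is standard
    normal and shift = bias * sqrt(n) / sd grows with the sample size.
    """
    shift = bias * sqrt(n) / sd
    return normal_cdf(z - shift) - normal_cdf(-z - shift)

# Hypothetical bias of 0.05 with unit-scale noise, for growing sample sizes.
for n in (100, 1000, 10000):
    print(n, round(coverage_with_bias(bias=0.05, sd=1.0, n=n), 3))
```

Under these assumptions the coverage is close to nominal for small n and deteriorates as n grows, because the interval shrinks around a slightly wrong target.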
Acknowledgements
This paper represents part of the first author’s doctoral dissertation at Texas A&M University. Wei and Carroll’s research was supported by a grant from the National Cancer Institute (R37-CA057030). Carroll was also supported by award KUS-CI-016-04, made by King Abdullah University of Science and Technology. Chatterjee’s research was supported by a gene–environment initiative grant from the National Heart, Lung and Blood Institute (RO1-HL091172-01) and by the Intramural Research Program of the National Cancer Institute. Muller was supported by a National Science Foundation grant (DMS-0907014). Van Keilegom gratefully acknowledges financial support from Interuniversity Attraction Pole research network P6/03 of the Belgian Government (Belgian science policy), and from the European Research Council under the European Community’s seventh framework programme (FP7/2007-2013), European Research Council grant agreement 203650.
Appendix A
Some derivations
A.1. Derivation of the error density estimator (11)
The key idea of the approach is to introduce discrete probabilities γi = pr{R(β) = Ri(β)}, i = 1, …, n, which yields
and to work with the maximum likelihood estimates, i.e. with those γi that maximize the retrospective log-likelihood
Taking the derivative with respect to γk, k = 1, …, n, gives
Now set this equal to 0 to obtain
By definition of 𝒦, using that
this is the desired formula (11).
A.2. Unbiasedness of estimation function (12)
All calculations of expectations here will be based on the precise definition of expectations in a case–control sampling scheme; see equation (4). Let (βtrue, Ωtrue) be the true parameter, β an arbitrary value and τ(x, β, βtrue) = μ(x, βtrue) − μ(x, β). To derive the conditional density given the disease state we use the fact that we assume a logistic model, pr(D = 1|Y, X) = H{θ0 + m(Y, X, θ1)}, with H(x) the logistic distribution function, for which
Now write fYX(·) as the joint density function of (Y, X) in the population. Then, with θ0 and θ1 denoting the true parameters,
It then follows that the density of (Y, X) given D is
Recall that κ = θ0 + log(n1/n0) − log(π1/π0). Then equation (4) can now be computed as
The joint density of (Y, X) in the population is fYX(y, x) = fε{y − αtrue − μ(x, βtrue)} fX(x). Hence, fYX{r + μ(x, β), x} = fε{r − αtrue − τ(x, β, βtrue)} fX(x). Thus,
Now, since
we have that
(17) |
It follows from the convention in equation (4) and equation (17) that
If β = βtrue, since τ(x, βtrue, βtrue) = 0, it follows directly that the last term is 0, and therefore 0 = E{Qn(αtrue, βtrue, Ωtrue)|D1, …, Dn}. Hence Qn(αtrue, β, Ωtrue) = 0 is an unbiased estimating equation. If β ≠ βtrue, then in general we shall have 0 ≠ E{Qn(αtrue, β, Ωtrue)|D1, …, Dn}.
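The relationship κ = θ0 + log(n1/n0) − log(π1/π0) recalled in this appendix links the intercept fitted by ordinary logistic regression on case–control data to the population intercept θ0. A minimal sketch of the conversion follows, assuming π1 = pr(D = 1) and π0 = 1 − π1. The function names, the value of κ and the disease rate of 1% are illustrative assumptions; the sample sizes are those of Section 6.

```python
from math import log

def theta0_from_kappa(kappa, n1, n0, pi1):
    """Population intercept implied by the case-control intercept kappa."""
    pi0 = 1.0 - pi1
    return kappa - log(n1 / n0) + log(pi1 / pi0)

def kappa_from_theta0(theta0, n1, n0, pi1):
    """Case-control intercept implied by the population intercept theta0."""
    pi0 = 1.0 - pi1
    return theta0 + log(n1 / n0) - log(pi1 / pi0)

# 749 cases and 781 controls as in Section 6; kappa and pi1 are hypothetical.
theta0 = theta0_from_kappa(kappa=-0.04, n1=749, n0=781, pi1=0.01)
print(round(theta0, 3))
```

When the disease is rare, log(π1/π0) is strongly negative, so θ0 lies far below κ even though the slope parameters θ1 are unchanged.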
Appendix B
Asymptotic theory
B.1. Notation and assumptions
In this section we introduce notation that is needed for our main theorem in Section 3.6, and we also state the formal assumptions under which this result will be valid.
Let (β, Ω) = Θ, and let Θtrue denote its true value. Recall equation (4), and define
Define
Write
and
Also define
where the function Ψ(Yi, Xi, Di, Ωtrue) is defined in assumption 4 below. Finally, let
Next, introduce the following assumptions, under which the main result in Section 3.6 is valid.
Assumption 1. The error ε is independent of X. The error distribution Fε is twice continuously differentiable, and the distribution FX of X is once continuously differentiable. The corresponding densities are denoted by fε and fX.
Assumption 2. There exists some 0 < c* < 1 such that n0/n → c*.
Assumption 3. The function μ(x, β) is three times continuously differentiable with respect to β, m(y, x, θ1) is twice continuously differentiable with respect to y and θ1, and Φ(y, x, d, Ω) is continuously differentiable with respect to Ω. Also, the matrices ℳβ and E{∂Φ(Y, X, D, Ω)/∂Ω|D = d}|Ω=Ωtrue are invertible.
Assumption 4. The estimator θ̂0 satisfies
for some function Ψ that satisfies E{Ψ(Y, X, D, Ωtrue)|D} = 0.
B.2. Proofs
We are now ready to give the proof of our main asymptotic result. Before giving a formal proof, let us first highlight the main steps of the proof. First, it follows from Appendix A.2 that Q̂n(α, β, Ω) is an unbiased estimating function. Plugging in an estimator of αtrue, we use a Taylor expansion of Q̂n,est(β̂, Ω̂) = 0 around the true β and Ω, which gives a regular asymptotically linear expansion of n^{1/2}(β̂ − βtrue). Finally we apply the central limit theorem to obtain the required asymptotic normality result. Along the way, we must show an asymptotic expansion for ℋn(β, Θ), which is given in lemma 1. The notation in the statement of this lemma was introduced in the previous section.
Lemma 1. Assume that assumptions 1–3 are valid. Then, for each β and Θ,
where E[h2{R(β), X, D, Θ}|D] = 0.
Proof. Define
Since by assumption 2 we have that n1/n0 → c, 0 < c < ∞, it follows that Znum{R(β),Θ} = Op(1) and Zden{R(β), Θ} = Op(1), for each β and Θ. Hence, by a Taylor series expansion and assumption 3,
Thus,
By definition, E{ℬn(β, Θ)|D1, …, Dn} = 0. By the definition of W{Ri(β), Xj, Dj, Θ},
Without loss of generality, we can make the first n0 observations be the controls, and the last n − n0 observations be the cases. Then,
An easy calculation shows that
and similarly
Hence we have shown that
Except for the factor (n0/n)^{3/2}, the first term above is a classical symmetric U-statistic of order 2 applied to independent and identically distributed observations, since by convention the first n0 observations are the controls. It then follows from standard U-statistic theory that (see, for example, Van der Vaart (1998))
This completes the proof.
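For readers less familiar with the U-statistics used in the proof of lemma 1, the following sketch computes a symmetric U-statistic of order 2 for a generic kernel. The kernel chosen here, half the squared difference, is a textbook illustration and not the kernel of the lemma; it reproduces the unbiased sample variance. All names and data are hypothetical.

```python
from itertools import combinations
from statistics import variance

def u_statistic_order2(sample, kernel):
    """Average of a symmetric kernel over all unordered pairs of observations."""
    pairs = list(combinations(sample, 2))
    return sum(kernel(a, b) for a, b in pairs) / len(pairs)

def half_squared_difference(a, b):
    """Symmetric kernel whose U-statistic is the unbiased sample variance."""
    return 0.5 * (a - b) ** 2

sample = [2.0, 4.0, 4.0, 5.0, 7.0]  # hypothetical observations
u_value = u_statistic_order2(sample, half_squared_difference)
print(u_value, variance(sample))
```

Both printed numbers coincide, which illustrates how an average over pairs behaves like an ordinary sample average to first order.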
B.2.1. Proof of theorem 1
Because of the unbiasedness of the estimating function (13) and the fact that expression (14) is consistent and asymptotically normally distributed for αtrue when evaluated at (βtrue, Ωtrue), the estimate is consistent for βtrue, and α(βtrue, Ωtrue) = αtrue. Set
We use the fact that 0 = Q̂n,est(β, Ω̂)|β=β̂. By a Taylor series expansion and assumption 3,
However, since α̂(βtrue, Ωtrue) is a consistent estimator for αtrue, it is clear that we have that
and
Hence it follows that
Because of its form, another Taylor series expansion and under assumption 3,
However, we can obtain by the same argument as in Appendix A.2 that c1(βtrue, Ωtrue) = 0. In addition, using the same tools as in lemma 1, n^{1/2}{α̂(βtrue, Ωtrue) − α(βtrue, Ωtrue)} = Op(1). We have thus shown that
(18) |
Because (κ, θ1) is estimated by ordinary logistic regression, and assumption 4 gives a representation for θ̂0 − θ0,true, it follows from standard theory that
We thus have from equation (18) that
We can now apply lemma 1 to Q̂n(αtrue, βtrue, Ωtrue) with Gnum(r, x, d, Θ) = L{r, x, α(β, Ω), β} 𝒦̃(r, x, d, Θ) and Gden(r, x, d, Θ) = 𝒦̃(r, x, d, Θ). Invoking lemma 1, it follows that
We have shown in Appendix A.2 that the first term has mean 0. Remember from lemma 1 that E[h2{R(βtrue), X, D, Θtrue}|D] = 0. Moreover, the estimating equation for logistic regression is unbiased and assumption 4 ensures that E[Ψ(Y, X, D, Ωtrue)|D] = 0. Summarizing, we have shown that
as claimed.
Footnotes
Supporting information
Additional ‘supporting information’ may be found in the on-line version of this article:
‘Supplemental material for Robust estimation for homoscedastic regression in the secondary analysis of case-control data’.
Contributor Information
Jiawei Wei, Texas A&M University, College Station, USA.
Raymond J. Carroll, Texas A&M University, College Station, USA.
Ursula U. Müller, Texas A&M University, College Station, USA.
Ingrid Van Keilegom, Université catholique de Louvain, Louvain-la-Neuve, Belgium, and Tilburg University, The Netherlands.
Nilanjan Chatterjee, National Cancer Institute, Rockville, USA.
References
- Ahn J, Albanes D, Berndt SI, Peters U, Chatterjee N, Freedman ND, Abnet CC, Huang WY, Kibel AS, Crawford ED, Weinstein SJ, Chanock SJ, Schatzkin A, Hayes RB; the Prostate, Lung, Colorectal and Ovarian Trial Project Team. Vitamin D-related genes, serum vitamin D concentrations and prostate cancer risk. Carcinogenesis. 2009;30:769–776. doi: 10.1093/carcin/bgp055.
- Anderson R. Modern Methods for Robust Regression. New York: Sage; 2008.
- Babu GJ, Singh K. Inference on means using the bootstrap. Ann. Statist. 1983;11:999–1003.
- Buonaccorsi JP. Measurement Error: Models, Methods and Applications. Boca Raton: Chapman and Hall; 2010.
- Chatterjee N, Carroll RJ. Semiparametric maximum likelihood estimation in case-control studies of gene-environment interactions. Biometrika. 2005;92:399–418.
- Chen Y-H, Carroll RJ, Chatterjee N. Retrospective analysis of haplotype-based case-control studies under a flexible model for gene-environment association. Biostatistics. 2008;9:81–99. doi: 10.1093/biostatistics/kxm011.
- Chen Y-H, Chatterjee N, Carroll RJ. Shrinkage estimators for robust and efficient inference in haplotype-based case-control studies. J. Am. Statist. Ass. 2009;104:220–233. doi: 10.1198/jasa.2009.0104.
- Chen X, Linton O, Van Keilegom I. Estimation of semiparametric models when the criterion function is not smooth. Econometrica. 2003;71:1591–1608.
- Epstein M, Satten GA. Inference on haplotype effects in case-control studies using unphased genotype data. Am. J. Hum. Genet. 2003;73:1316–1329. doi: 10.1086/380204.
- Hall P, Horowitz J. Bootstrap critical values for tests based on generalized-method-of-moments estimators. Econometrica. 1996;64:891–916.
- Hu YJ, Lin DY, Zeng D. A general framework for studying genetic effects and gene–environment interactions with missing data. Biostatistics. 2010;11:583–598. doi: 10.1093/biostatistics/kxq015.
- Huber PJ. Robust Statistics. New York: Wiley; 1981.
- Jiang Y, Scott AJ, Wild CJ. Secondary analysis of case-control data. Statist. Med. 2006;25:1323–1339. doi: 10.1002/sim.2283.
- Kwee LC, Epstein MP, Manatunga AK, Duncan R, Allen AS, Satten GA. Simple methods for assessing haplotype-environment interactions in case-only and case-control studies. Genet. Epidem. 2007;31:75–90. doi: 10.1002/gepi.20192.
- Lele S. Resampling using estimating equations. In: Godambe VP, editor. Estimating Functions. New York: Oxford University Press; 1991. pp. 295–304.
- Li H, Gail MH, Berndt S, Chatterjee N. Using cases to strengthen inference on the association between single nucleotide polymorphisms and a secondary phenotype in genome-wide association studies. Genet. Epidem. 2010;34:427–433. doi: 10.1002/gepi.20495.
- Lin DY, Zeng D. Likelihood-based inference on haplotype effects in genetic association studies (with discussion). J. Am. Statist. Ass. 2006;101:89–118.
- Lin DY, Zeng D. Proper analysis of secondary phenotype data in case-control association studies. Genet. Epidem. 2009;33:256–265. doi: 10.1002/gepi.20377.
- Modan MD, Hartge P, Hirsh-Yechezkel G, Chetrit A, Lubin F, Beller U, Ben-Baruch G, Fishman A, Menczer J, Struewing JP, Tucker MA, Wacholder S, for the National Israel Ovarian Cancer Study Group. Parity, oral contraceptives and the risk of ovarian cancer among carriers and noncarriers of a BRCA1 or BRCA2 mutation. New Engl. J. Med. 2001;345:235–240. doi: 10.1056/NEJM200107263450401.
- Monsees G, Tamimi R, Kraft P. Genomewide association scans for secondary traits using case-control samples. Genet. Epidem. 2009;33:717–728. doi: 10.1002/gepi.20424.
- Piegorsch WW, Weinberg CR, Taylor JA. Non-hierarchical logistic models and case-only designs for assessing susceptibility in population based case-control studies. Statist. Med. 1994;13:153–162. doi: 10.1002/sim.4780130206.
- Prentice RL, Pyke R. Logistic disease incidence models and case-control studies. Biometrika. 1979;66:403–411.
- Spinka C, Carroll RJ, Chatterjee N. Analysis of case-control studies of genetic and environmental factors with missing genetic information and haplotype-phase ambiguity. Genet. Epidem. 2005;29:108–127. doi: 10.1002/gepi.20085.
- Van der Vaart AW. Asymptotic Statistics. Cambridge: Cambridge University Press; 1998.
- Wang CY, Wang S, Carroll RJ. Estimation in choice-based sampling with measurement error and bootstrap analysis. J. Econometr. 1997;77:65–86.
- Zhao LP, Li SS, Khalid N. A method for the assessment of disease associations with single-nucleotide polymorphism haplotypes and environmental variables in case-control studies. Am. J. Hum. Genet. 2003;72:1231–1250. doi: 10.1086/375140.