Robust estimation for homoscedastic regression in the secondary analysis of case–control data

Jiawei Wei; Raymond J Carroll; Ursula U Müller; Ingrid Van Keilegom; Nilanjan Chatterjee

doi:10.1111/j.1467-9868.2012.01052.x

Author manuscript; available in PMC: 2014 Jan 1.

Published in final edited form as: J R Stat Soc Series B Stat Methodol. 2012 Dec 4;75(1):185–206. doi: 10.1111/j.1467-9868.2012.01052.x


PMCID: PMC3639015 NIHMSID: NIHMS449968 PMID: 23637568

Summary

Primary analysis of case–control studies focuses on the relationship between disease D and a set of covariates of interest (Y, X). A secondary application of the case–control study, which is often invoked in modern genetic epidemiologic association studies, is to investigate the interrelationship between the covariates themselves. The task is complicated owing to the case–control sampling, where the regression of Y on X is different from what it is in the population. Previous work has assumed a parametric distribution for Y given X and derived semiparametric efficient estimation and inference without any distributional assumptions about X. We take up the issue of estimation of a regression function when Y given X follows a homoscedastic regression model, but otherwise the distribution of Y is unspecified. The semiparametric efficient approaches can be used to construct semiparametric efficient estimates, but they suffer from a lack of robustness to the assumed model for Y given X. We take an entirely different approach. We show how to estimate the regression parameters consistently even if the assumed model for Y given X is incorrect, and thus the estimates are model robust. For this we make the assumption that the disease rate is known or well estimated. The assumption can be dropped when the disease is rare, which is typically so for most case–control studies, and the estimation algorithm simplifies. Simulations and empirical examples are used to illustrate the approach.

Keywords: Biased samples, Homoscedastic regression, Secondary data, Secondary phenotypes, Semiparametric inference, Two-stage samples

1. Introduction

Case–control designs are popularly used for studying risk factors for rare diseases, such as cancers. Under this design, a fixed number of ‘cases’ and ‘controls’, i.e. subjects with and without the disease of interest, are sampled from an underlying base population. Data on various covariates on the subjects are then collected in a retrospective fashion so that they reflect history before the disease. The standard method for primary analysis of case–control data involves logistic regression modelling of the disease outcome as a function of the covariates of interest. It is well known that prospective logistic regression analysis for case–control data is efficient under a semiparametric framework that allows the ‘nuisance’ distribution of the underlying covariates to be unspecified (Prentice and Pyke, 1979).

Epidemiologic researchers popularly use controls from case–control studies to examine the interrelationship between certain covariates themselves. Such secondary analysis of case–control studies has received increasing attention in genetic epidemiologic studies, where it is often of interest to investigate the effect of genetic susceptibility, such as single-nucleotide polymorphism (SNP) genotypes, not only on the primary disease outcome, but also on various secondary factors, such as smoking habits, that may themselves be associated with the disease of interest. For such secondary analysis, use of only controls is generally considered a model robust approach since, when the disease is rare, the relationship between covariates in the controls should reflect that of the underlying population without any further model assumptions. It is, however, recognized that inclusion of cases in such analysis can increase efficiency, provided that appropriate adjustment can be made to account for non-random ascertainment in case–control sampling. Li et al. (2010), for example, reported that, if two binary covariates have no interaction with the risk of the disease on a logistic scale, then the association between the factors in the cases remains the same as that for the underlying population. Therefore in such a setting inclusion of cases can increase the efficiency of the secondary analysis.

In this paper, our goal is to develop an approach to secondary association analysis for a continuous covariate, say Y, in a case–control study setting so that both cases and controls can be used to increase efficiency and yet the resulting inference is model robust to distributional assumptions about the covariates. Suppose that data are originally collected from a case–control study of a relatively rare disease. Let D be disease status, with D = 1 denoting a case and D = 0 denoting a control. Suppose also that D is to be modelled by a vector of random covariates (Y, X), where Y is univariate and X is potentially multivariate, by using a standard logistic regression formulation. Consider here the homoscedastic regression model

Y = α_{true} + μ (X, β_{true}) + ε,

(1)

where α_true is an intercept and μ(·) is a known function, and where ε has mean 0 and is independent of X, but its distribution is otherwise not specified.

To estimate (α_true, β_true), we cannot simply ignore the case–control sampling scheme and use the data as they are, because, if Y is a predictor of disease status D, the sampling is biased and in the case–control sample model (1) will not hold.

This paper is organized as follows. In Section 2, we describe recent work on case–control studies that allows efficient estimation if the distribution of Y given X is specified up to parameters. Although the solution is elegant, it suffers from the fact that the resulting estimate may be biased if the hypothesized distribution for Y given X is misspecified.

Section 3 takes an entirely different approach to the basic general problem and describes a simple method that is robust to misspecification of the distribution of Y given X, under the assumption that the disease rate in the population is known or well estimated from a disease registry or as part of an on-going cohort. In Section 4 we describe extensions to the rare disease case, where this assumption can be dropped, and to the case of stratified or frequency-matched studies. Section 5 presents a series of simulation studies, whereas Section 6 presents analysis of an epidemiological data set. Concluding remarks are in Section 7. Technical details are given in Appendix A and Appendix B.

2. Efficient parametric estimation and robustness

2.1. Framework

In this section we outline recent work on efficient estimation for case–control studies when the distribution of Y given X is specified up to a finite dimensional parameter vector. We start with a logistic regression model underlying the case–control analysis, so that pr(D = 1|Y, X) = H{θ₀ + m(Y, X, θ₁)}, where H(·) is the logistic distribution function and m(·) is an arbitrary known function with unknown parameter vector θ₁. For d = 0, 1, let π_d = pr(D = d), the probability that D = d in the population, and suppose that there are n₁ cases with D = 1 and n₀ controls with D = 0. We write n = n₀ + n₁ and introduce the parameter κ = θ₀ + log(n₁/n₀) − log(π₁/π₀). This reparameterization has the advantage that we can identify κ and θ₁ from a logistic regression analysis of D on (Y, X), although we cannot identify θ₀ (Prentice and Pyke, 1979; Chatterjee and Carroll, 2005) from such logistic regression alone.
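As a small numerical illustration of the reparameterization, the sketch below converts between θ₀ and κ. The function names and the value π₁ = 0.01 are illustrative assumptions and are not part of the original text.

```python
import math

def kappa_from_theta0(theta0, n1, n0, pi1):
    """kappa = theta0 + log(n1/n0) - log(pi1/pi0), with pi0 = 1 - pi1."""
    pi0 = 1.0 - pi1
    return theta0 + math.log(n1 / n0) - math.log(pi1 / pi0)

def theta0_from_kappa(kappa, n1, n0, pi1):
    """Inverse map; it needs the population disease rate pi1."""
    pi0 = 1.0 - pi1
    return kappa - math.log(n1 / n0) + math.log(pi1 / pi0)

# Illustrative inputs: equal numbers of cases and controls, assumed 1% disease rate.
kappa = kappa_from_theta0(theta0=-5.5, n1=500, n0=500, pi1=0.01)
theta0_back = theta0_from_kappa(kappa, n1=500, n0=500, pi1=0.01)
```

The inverse map makes the identification issue concrete: κ can be estimated from the case–control sample, but θ₀ can be recovered from it only when π₁ is supplied.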

In the parametric framework the conditional distribution of Y given X is modelled as f_ε{y − α − μ(x, β), ζ}, where ζ is a finite dimensional nuisance parameter. If in the population Y given X is normally distributed, then ζ = var(ε).

2.2. Population-based case–control studies and notation

Our explicit theoretical and asymptotic results are based on population-based case–control studies, i.e. studies in which random samples of (Y, X) are taken separately for D = 1 and D = 0. We shall refer to these simply as case–control studies. Some case–control studies use a form of stratification, which is sometimes called frequency matching, e.g. a population-based case–control study for each of a number of age ranges and the same number of cases and controls in each age group. With some notation and the inclusion of these strata in the logistic risk model and in the model for Y given X, our results are easily extended to such sampling; see Section 4.

We assume a logistic model for pr(D = 1|Y, X) as

pr (D = 1 | Y, X) = H {θ_{0} + m (Y, X, θ_{1})} = \frac{exp {θ_{0} + m (Y, X, θ_{1})}}{1 + exp {θ_{0} + m (Y, X, θ_{1})}} .

(2)

Our technical assumptions are assumptions 1–4 in Appendix B.1.

We also mention two important calculations. The density f_X of X in the population can be written as

f_{X} (x) = π_{1} f_{case} (x) + π_{0} f_{cont} (x),

(3)

with (π₀, π₁) defined in Section 2.1, and where f_cont(x) and f_case(x) represent the density of X given D = 0 and D = 1 respectively. Since this is a case–control sampling scheme, all expectations are conditional on D₁, …, D_n. Define R(β) = Y − μ(X, β) and R_i(β) = Y_i − μ(X_i, β). For an arbitrary function G,

E [n^{- 1} \sum_{i = 1}^{n} G {R_{i} (β), X_{i}, D_{i}}] = E (E [n^{- 1} \sum_{i = 1}^{n} G {R_{i} (β), X_{i}, D_{i}} | D_{1}, \dots, D_{n}]) = n^{- 1} \sum_{i = 1}^{n} E (E [G {R_{i} (β), X_{i}, D_{i}} | D_{i}]) = \sum_{d = 0}^{1} (n_{d} / n) E [G {R (β), X, d} | D = d],

(4)

the second and last steps following because (Y, X) are independent and identically distributed given D in the case–control sampling scheme.

2.3. Prior results and robustness

For the case–control studies that were described above, Jiang et al. (2006), Chen et al. (2008) and Lin and Zeng (2009) derived the efficient profile likelihood (in the sense that its score for β is an efficient score function), Lin and Zeng (2009) noting importantly that it can be used in our context. See also Monsees et al. (2009). Write Ω = (κ, θ₁, θ₀). The joint density of (D, Y, X) is

f_{X} (x) f_{ε} {y - α - μ (x, β), ζ} \frac{exp [d {θ_{0} + m (y, x, θ_{1})}]}{1 + exp {θ_{0} + m (y, x, θ_{1})}} .

Let

g (d, y, x, Ω, α, β, ζ) = f_{ε} {y - α - μ (x, β), ζ} exp [d {κ + m (y, x, θ_{1})}] {[1 + exp {θ_{0} + m (y, x, θ_{1})}]}^{- 1} .

The semiparametric efficient retrospective profile likelihood for β that makes no assumptions about the distribution of X when the distribution of Y given X is specified is

ℒ_{par} (D, Y, X, Ω, α, β, ζ) = \frac{g (D, Y, X, Ω, α, β, ζ)}{\sum_{d = 0}^{1} \int g (d, t, X, Ω, α, β, ζ) d t} .

Taking logarithms, summing over the observed data and then maximizing in the parameters yields semiparametric efficient inference.

A difficulty arises, however, if the density f_ε(·) of ε is not specified properly. To see what happens, consider the score for β. Define L_par(y, x, α, β, ζ) = ∂log[f_ε{y − α − μ(x, β), ζ}]/∂β. Then the score for β is

𝒦_{par} (D, Y, X, Ω, α, β, ζ) = \frac{\partial log {ℒ_{par} (D, Y, X, Ω, α, β, ζ)}}{\partial β} = L_{par} (Y, X, α, β, ζ) - \frac{\int \sum_{d = 0}^{1} L_{par} (t, X, α, β, ζ) g (d, t, X, Ω, α, β, ζ) d t}{\int \sum_{d = 0}^{1} g (d, t, X, Ω, α, β, ζ) d t} .

(5)

Because ℒ_par(·) is a legitimate semiparametric profile likelihood, when summed over the case–control data and evaluated at the true parameters, score (5) has mean 0. However, score (5), when evaluated at the true parameter values, only has mean 0 in general if the density f_ε(·) of ε is specified properly, i.e. the approach is not always model robust; see Section 5 for numerical evidence. This motivates our search for a robust estimation method, which is a topic that we take up in the next section.

3. Model robust estimation

3.1. Preliminaries

In this section we assume the same framework as in the previous section, with the exception that f_ε is now unknown. We pursue a sequential approach to derive an estimating equation for the parameters that determine the regression function.

(a) Estimate the true logistic regression parameters κ and θ₁ by ordinary logistic regression of D on (Y, X). This can be done legitimately because it is known that ordinary logistic regression in a case–control study consistently estimates κ and θ₁ (Prentice and Pyke, 1979; Chatterjee and Carroll, 2005). Denote the estimators by κ̂ and θ̂₁. We also suppose that we have a consistent estimator of θ₀. This estimator can, for example, be the solution of the equation

π_{1} = π_{1} n_{1}^{- 1} \sum_{i = 1}^{n} D_{i} H {θ_{0} + m (Y_{i}, X_{i}, {θ̂}_{1})} + π_{0} n_{0}^{- 1} \sum_{i = 1}^{n} (1 - D_{i}) H {θ_{0} + m (Y_{i}, X_{i}, {θ̂}_{1})},

(6)

when the disease rate π₁ in the population is known or well estimated, either from a disease registry or from an underlying cohort from which the cases and controls are sampled. Equation (6) leads to a consistent estimator of θ₀, since for any function g(y, x) we can estimate ∫g(y, x) f_YX(y, x) dydx unbiasedly by

\sum_{d = 0}^{1} \sum_{i = 1}^{n} (π_{d} / n_{d}) I (D_{i} = d) g (Y_{i}, X_{i}) .

Call the resulting estimator θ̂₀ and denote Ω̂ = (κ̂, θ̂₁, θ̂₀).

(b) Use a score function for β that would be an appropriate score function if the (Y, X) data arose from random sampling. Define R(β) = Y − μ(X, β). Then the simplest such score function is that from ordinary least squares, which is obtained by differentiating {Y − α − μ(X, β)}² with respect to β. This yields the score function

L {R (β), X, α, β} = μ_{β} (X, β) {R (β) - α},

(7)

where the subscript means differentiation with respect to β.

(c) Score (7) will not have mean 0 in the case–control sampling scheme, so we adjust it so that it has mean 0 in general.

(d) For technical reasons that are described later, estimation of α_true must be done via an auxiliary equation depending on the current values, which we generically call α̂(β, Ω), which replaces α in score (7); see below for the definition.

(e) Solve the adjusted score equation to estimate β_true and hence α_true. Good starting values for β can be obtained by least squares regression among the controls.
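The first step above can be sketched numerically. The code below is a minimal illustration, not the authors' implementation: it assumes that the index values m(Y_i, X_i, θ̂₁) have already been computed, and it solves equation (6) for θ₀ by bisection. The function names, the toy data and the bracketing interval are all hypothetical.

```python
import math

def H(u):
    """Logistic distribution function."""
    return 1.0 / (1.0 + math.exp(-u))

def weighted_population_mean(g_values, d, pi1):
    """Estimate a population mean from case-control data as
    sum_d (pi_d / n_d) * sum_i I(D_i = d) * g_i."""
    n1 = sum(d)
    n0 = len(d) - n1
    w = {1: pi1 / n1, 0: (1.0 - pi1) / n0}
    return sum(w[di] * gi for gi, di in zip(g_values, d))

def solve_theta0(m_values, d, pi1, lo=-30.0, hi=30.0, tol=1e-10):
    """Bisection for equation (6): the weighted mean of H(theta0 + m_i)
    is increasing in theta0, so it crosses pi1 exactly once."""
    def gap(theta0):
        vals = [H(theta0 + m) for m in m_values]
        return weighted_population_mean(vals, d, pi1) - pi1
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if gap(mid) > 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

# Toy inputs (hypothetical): three controls followed by three cases.
m_values = [0.2, 1.1, 0.7, 1.5, 0.9, 0.4]
d = [0, 0, 0, 1, 1, 1]
pi1 = 0.01
theta0_hat = solve_theta0(m_values, d, pi1)
```

Bisection is used here only because the left-hand side is monotone in θ₀; any one-dimensional root finder would serve.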

Remark 1. The score function (7) is not the only one possible; for example, we could instead allow for robustness against outliers by replacing function (7) by the estimating equation of an M-estimator (Huber, 1981; Anderson, 2008).

3.2. Estimation algorithm

The development of our methodology is somewhat involved. Here we simply state our proposal, with its development given in Sections 3.3–3.5. As before, define R(β) = Y − μ(X, β). Remember that estimation of α_true must be done by using an auxiliary equation; see equation (8) directly below. Define

𝒦 (R_{i} (β), x, β, Ω) = \frac{1 + exp [κ + m {R_{i} (β) + μ (x, β), x, θ_{1}}]}{1 + exp [θ_{0} + m {R_{i} (β) + μ (x, β), x, θ_{1}}]} .

For given (β, Ω), the estimator of α_true is justified in Section 3.5 and given by

α̂ (β, Ω) = \frac{n^{- 1} \sum_{i = 1}^{n} R_{i} (β) {[\sum_{d = 0}^{1} (π_{d} / n_{d}) \sum_{j = 1}^{n} I (D_{j} = d) 𝒦 {R_{i} (β), X_{j}, β, Ω}]}^{- 1}}{n^{- 1} \sum_{i = 1}^{n} {[\sum_{d = 0}^{1} (π_{d} / n_{d}) \sum_{j = 1}^{n} I (D_{j} = d) 𝒦 {R_{i} (β), X_{j}, β, Ω}]}^{- 1}} = \frac{n^{- 1} \sum_{i = 1}^{n} R_{i} (β) {[n^{- 1} \sum_{j = 1}^{n} 𝒦̃ {R_{i} (β), X_{j}, β, Ω, D_{j}}]}^{- 1}}{n^{- 1} \sum_{i = 1}^{n} {[n^{- 1} \sum_{j = 1}^{n} 𝒦̃ {R_{i} (β), X_{j}, β, Ω, D_{j}}]}^{- 1}},

(8)

where

𝒦̃ {R_{i} (β), X_{j}, β, Ω, D_{j}} = \sum_{d = 0}^{1} (n π_{d} / n_{d}) I (D_{j} = d) 𝒦 {R_{i} (β), X_{j}, β, Ω} .

Let μ_β(x, β) = ∂μ(x, β)/∂β and let L{R(β), X, α, β} be as in equation (7). Then define

{Q̂}_{n, est} (β, Ω) = n^{- 1 / 2} \sum_{i = 1}^{n} [L {R_{i} (β), X_{i}, α̂ (β, Ω), β} - \frac{n^{- 1} \sum_{j = 1}^{n} L {R_{i} (β), X_{j}, α̂ (β, Ω), β} 𝒦̃ {R_{i} (β), X_{j}, β, Ω, D_{j}}}{n^{- 1} \sum_{j = 1}^{n} 𝒦̃ {R_{i} (β), X_{j}, β, Ω, D_{j}}}] .

(9)

Our algorithm then is as follows.

(a) Estimate (κ, θ₁)^T by (κ̂, θ̂₁)^T, the logistic regression estimates of D on (Y, X). As described previously, this is known to produce consistent estimates of (κ_true, θ_1,true)^T. Estimate θ₀ as explained in Section 3.1. This leads to an estimator Ω̂ of Ω_true.
(b) Solve 0 = Q̂_n,est(β, Ω̂) in β to obtain the estimate β̂.
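The algorithm above can be sketched for a simple special case. The code below is an illustration under stated assumptions rather than the authors' implementation: X is scalar, μ(x, β) = βx, m(y, x, θ₁) = θ_y y + θ_x x, and Ω̂ and π₁ are taken as given. All function names, the toy data and the secant solver are hypothetical choices.

```python
import math

def K(r, x, beta, kappa, theta0, theta_y, theta_x):
    """Ratio K of Section 3.2 under the assumed linear forms of m and mu."""
    m = theta_y * (r + beta * x) + theta_x * x
    return (1.0 + math.exp(kappa + m)) / (1.0 + math.exp(theta0 + m))

def q_n_est(beta, y, x, d, pi1, kappa, theta0, theta_y, theta_x):
    """Return (alpha_hat, Q_hat) following equations (8) and (9)."""
    n = len(y)
    n1 = sum(d)
    n0 = n - n1
    # Weights n * pi_d / n_d that turn K into K-tilde.
    w = [n * pi1 / n1 if dj == 1 else n * (1.0 - pi1) / n0 for dj in d]
    r = [yi - beta * xi for yi, xi in zip(y, x)]
    kt = [[w[j] * K(r[i], x[j], beta, kappa, theta0, theta_y, theta_x)
           for j in range(n)] for i in range(n)]
    denom = [sum(row) / n for row in kt]
    # Equation (8); the common factor 1/n cancels in the ratio.
    alpha_hat = (sum(ri / dn for ri, dn in zip(r, denom))
                 / sum(1.0 / dn for dn in denom))
    total = 0.0
    for i in range(n):
        resid = r[i] - alpha_hat
        # L{R_i, X_j, alpha, beta} = mu_beta(X_j, beta) * (R_i - alpha) = X_j * resid.
        num = sum(x[j] * resid * kt[i][j] for j in range(n)) / n
        total += x[i] * resid - num / denom[i]
    return alpha_hat, total / math.sqrt(n)

def solve_beta(b0, b1, args, tol=1e-8, max_iter=50):
    """Secant iteration for Q_hat(beta) = 0 (one possible solver)."""
    f0 = q_n_est(b0, *args)[1]
    f1 = q_n_est(b1, *args)[1]
    for _ in range(max_iter):
        if abs(f1) < tol or f1 == f0:
            break
        b0, b1, f0 = b1, b1 - f1 * (b1 - b0) / (f1 - f0), f1
        f1 = q_n_est(b1, *args)[1]
    return b1

# Toy data (hypothetical): three controls followed by three cases.
y = [1.0, 2.0, 0.5, 3.0, 2.5, 1.5]
x = [0.1, 0.9, 0.3, 0.8, 0.6, 0.2]
d = [0, 0, 0, 1, 1, 1]
args = (y, x, d, 0.01, -0.9, -5.5, 0.25, 1.0)
beta_hat = solve_beta(0.5, 1.5, args)
alpha_hat = q_n_est(beta_hat, *args)[0]
```

The double loop costs O(n²) evaluations of 𝒦 per value of β, which is consistent with the remark in Section 3.7 that plug-in variance formulae are slow to compute. In practice the two starting values could come from least squares among the controls.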

In the next few subsections, we describe how we obtained equation (9), and at the end we describe the asymptotic distribution theory.

3.3. Development of the score when f_X and α_true are known

3.3.1. Adjusting score (7)

We first describe how to proceed when the intercept α_true, the density f_X(·) of X in the population, and f_ε(t − α_true), the density of Y − μ(X, β_true) in the population, are all known; they are not and we shall show how to remove these restrictions in subsequent sections.

The approach is to start with the estimating function (7), which, when summed over the data, does not have mean 0 at the true parameters because of the case–control sampling scheme, i.e. E[Σ_{i=1}^{n} L{R_i(β_true), X_i, α_true, β_true} | D_i] ≠ 0, in general. Thus, we need to correct n⁻¹ Σ_{i=1}^{n} L{R_i(β), X_i, α, β} so that it does have mean 0 in the case–control sampling scheme, where expectations are computed as in equation (4). In the on-line supplemental material, we show how to follow the approach of Chen et al. (2009), section 2.3.3, to develop the adjusted estimating function

L {R (β), X, α_{true}, β} - \frac{\int L (t, x, α_{true}, β) 𝒦 (t, x, β, Ω) f_{ε} (t - α_{true}) f_{X} (x) d t d x}{\int 𝒦 (t, x, β, Ω) f_{ε} (t - α_{true}) f_{X} (x) d t d x} .

(10)

This is not of much help, since none of f_ε(·), f_X(·) or α_true are known. In subsequent sections we show how to replace these terms by data-estimated quantities, and thus arrive at equation (9).

3.3.2. Replacing the unknown error density

The problem with expression (10) is that we do not know the form of f_ε(·), so score (10) cannot be implemented. Similarly to Chatterjee and Carroll (2005) and Spinka et al. (2005), we therefore replace f_ε(·) by a non-parametric maximum likelihood estimator. The idea is to take the observed R_i(β) = Y_i − μ(X_i, β) as the support, and to maximize the log-likelihood with respect to γ_i = pr{R(β) = R_i(β)}, i = 1, …, n, subject to Σ_{i=1}^{n} γ_i = 1. By Chatterjee and Carroll (2005) and Spinka et al. (2005), the resulting estimator for pr{R(β) = R_i(β)} is

p_{est} {R_{i} (β), Ω} = \frac{π_{0}}{n_{0}} {[\int f_{X} (x) 𝒦 {R_{i} (β), x, β, Ω} d x]}^{- 1} .

(11)

The derivation of equation (11) is given in Appendix A.1. When we make this substitution in expression (10) and sum over the data, the score becomes

\sum_{i = 1}^{n} L {R_{i} (β), X_{i}, α_{true}, β} - \frac{\sum_{i = 1}^{n} \int L {R_{i} (β), x, α_{true}, β} 𝒦 {R_{i} (β), x, β, Ω} p_{est} {R_{i} (β), Ω} f_{X} (x) d x}{n^{- 1} \sum_{i = 1}^{n} \int 𝒦 {R_{i} (β), x, β, Ω} p_{est} {R_{i} (β), Ω} f_{X} (x) d x} .

Because the denominator of this expression is π₀/n₀, by simple algebra it is readily seen that the normalized score function for estimating β can be defined as

0 = Q_{n} (α_{true}, β, Ω) = n^{- 1 / 2} \sum_{i = 1}^{n} [L {R_{i} (β), X_{i}, α_{true}, β} - \frac{\int L {R_{i} (β), x, α_{true}, β} 𝒦 {R_{i} (β), x, β, Ω} f_{X} (x) d x}{\int 𝒦 {R_{i} (β), x, β, Ω} f_{X} (x) d x}] .

(12)

In Appendix A.2 we show that the expectation of Q_n(α_true, β, Ω) in the case–control sampling scheme is equal to 0 when evaluated at (α_true, β_true, Ω_true), but not for arbitrary (β, Ω). This implies that equation (12) is indeed an unbiased estimating equation in the case–control sampling scheme.
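The 'simple algebra' that leads from the substituted score to equation (12) can be written out as follows. This is a worked restatement using only equation (11) and the symbols already defined; the arguments of 𝒦, L and p_est are abbreviated for readability.

```latex
% By equation (11), for every i,
p_{est}\{R_i(\beta), \Omega\} \int \mathcal{K}\{R_i(\beta), x, \beta, \Omega\} f_X(x)\, dx = \frac{\pi_0}{n_0},
% so the denominator of the substituted score is
n^{-1} \sum_{i=1}^{n} \int \mathcal{K}\{R_i(\beta), x, \beta, \Omega\}\, p_{est}\{R_i(\beta), \Omega\} f_X(x)\, dx = \frac{\pi_0}{n_0},
% and the i-th term of its numerator is
\int L\{R_i(\beta), x, \alpha_{true}, \beta\}\, \mathcal{K}\{R_i(\beta), x, \beta, \Omega\}\, p_{est}\{R_i(\beta), \Omega\} f_X(x)\, dx
  = \frac{\pi_0}{n_0}\,
    \frac{\int L\{R_i(\beta), x, \alpha_{true}, \beta\}\, \mathcal{K}\{R_i(\beta), x, \beta, \Omega\} f_X(x)\, dx}
         {\int \mathcal{K}\{R_i(\beta), x, \beta, \Omega\} f_X(x)\, dx}.
% Dividing the numerator by the denominator cancels \pi_0 / n_0 and leaves the
% sum over i of the ratios of integrals that appears in equation (12).
```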

3.4. Implementation when f_X is unknown but α_true is known

The density or mass function f_X(·) is not known. We estimate the integrals in expression (12) unbiasedly by their sample average over all the observations, so our estimating equation is

0 = {Q̂}_{n} (α_{true}, β, Ω) = n^{- 1 / 2} \sum_{i = 1}^{n} [L {R_{i} (β), X_{i}, α_{true}, β} - \frac{n^{- 1} \sum_{j = 1}^{n} L {R_{i} (β), X_{j}, α_{true}, β} 𝒦̃ {R_{i} (β), X_{j}, β, Ω, D_{j}}}{n^{- 1} \sum_{j = 1}^{n} 𝒦̃ {R_{i} (β), X_{j}, β, Ω, D_{j}}}] .

(13)

3.5. Implementation when the intercept α_true is unknown

One might reasonably think that estimating the intercept is easy; for example, simply supplement the score with the ordinary least squares score for the intercept, so that L{R(β), X, α, β} = (1, μ_β^T(X, β))^T {R(β) − α}. The problem with this is that the first component of the estimating equation (13) would then be identically 0 and thus will not produce an estimate of the intercept. The reason for this is that the solution (11) was calculated non-parametrically under the assumption that R(β_true) and X are independent in the population. Since Y − α_true − μ(X, β_true) and Y − μ(X, β_true) are both independent of X in the population, this means that equation (11) cannot lead to an estimate of the intercept. Hence, an alternative approach is required.
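To see why the intercept component vanishes, note that the first component of the augmented score does not depend on X. The short calculation below, written with L₁ as a hypothetical label for that first component, applies the correction term of equation (13) to it.

```latex
% First component of the augmented score: L_1\{R_i(\beta), X_j, \alpha, \beta\} = R_i(\beta) - \alpha,
% which is free of X_j. Its correction term in equation (13) is therefore
\frac{n^{-1} \sum_{j=1}^{n} \{R_i(\beta) - \alpha\}\, \tilde{\mathcal{K}}\{R_i(\beta), X_j, \beta, \Omega, D_j\}}
     {n^{-1} \sum_{j=1}^{n} \tilde{\mathcal{K}}\{R_i(\beta), X_j, \beta, \Omega, D_j\}}
  = R_i(\beta) - \alpha ,
% so the i-th summand is \{R_i(\beta) - \alpha\} - \{R_i(\beta) - \alpha\} = 0
% for every i and every value of \alpha.
```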

To overcome this problem, we estimate the intercept of R(β) by using equation (11), i.e., if f_X(·) were known, then α_true could be estimated by

α̃ (β, Ω) = \frac{n^{- 1} \sum_{i = 1}^{n} R_{i} (β) p_{est} {R_{i} (β), Ω}}{n^{- 1} \sum_{i = 1}^{n} p_{est} {R_{i} (β), Ω}} :

(14)

a quantity that is free of the π₀ that shows up in equation (11). If we then replace the integral in the definition of p_est(·) by its average n⁻¹ Σ_{j=1}^{n} 𝒦̃{R_i(β), X_j, β, Ω, D_j}, we obtain exactly expression (8). Making this substitution in equation (13), we obtain equation (9). This completes the derivation of our methodology.

3.6. Distribution theory

The asymptotic distribution of our estimator is given in the following result. We refer to Appendix B.1 for the definition of the functions and matrices that are mentioned below, and for the assumptions 1–4 there under which this result is valid. The proof of this theorem is given in Appendix B.2.

Theorem 1. Let (β, Ω) = Θ, and let Θ_true denote its true value. Assume that assumptions 1–4 in Appendix B.1 are valid. Then there is an invertible matrix ℳ_β and a function Λ(Y, X, D, Θ_true) with the properties that E{Λ(Y, X, D, Θ_true)|D} = 0 and

n^{1 / 2} (β̂ - β_{true}) = - n^{- 1 / 2} ℳ_{β}^{- 1} \sum_{i = 1}^{n} Λ (Y_{i}, X_{i}, D_{i}, Θ_{true}) + o_{p} (1) .

Therefore, there is a matrix Σ, defined in Appendix B.1, such that

n^{1 / 2} (β̂ - β_{true}) \to N (0, Σ) .

(15)

Estimating the covariance matrix Σ in expression (15) can be accomplished by a plug-in method or by the bootstrap appropriate for case–control sampling (Wang et al., 1997; Buonaccorsi, 2010).

3.7. Inference via bootstrap resampling

In principle, estimating the covariance matrix Σ in expression (15) can be accomplished by a plug-in method, although the particular form of the function Q₁(·) that is defined in Appendix B.1 makes the computation slow. We have thus chosen to use bootstrap ideas to estimate Σ. Below we explain in detail how this can be done, but the basic idea is that we have random samples from two independent populations, i.e. the cases and the controls, and an estimator that is asymptotically normally distributed.

3.7.1. Bootstrap procedure

Let (Y_1^*, X_1^*), …, (Y_{n₀}^*, X_{n₀}^*) be drawn randomly with replacement from {(Y_i, X_i) : D_i = 0}, and similarly let (Y_{n₀+1}^*, X_{n₀+1}^*), …, (Y_n^*, X_n^*) be drawn randomly with replacement from {(Y_i, X_i) : D_i = 1}. This is the method of bootstrap sampling that was suggested by Wang et al. (1997) and Buonaccorsi (2010), page 225, and, since the data consist of samples from two independent populations, is the same as in Babu and Singh (1983); see also Lele (1991).

Let D_i^* = I(i > n₀) and R_i^*(β) = Y_i^* − μ(X_i^*, β), and define Ω̂*, α̂*(β, Ω) and Q̂*_n,est(β, Ω) in the same way as Ω̂, α̂(β, Ω) in equation (8) and Q̂_n,est(β, Ω) in equation (9), but based on (Y_i^*, X_i^*, D_i^*) instead of (Y_i, X_i, D_i), i = 1, …, n.

The bootstrapped estimator β̂* of β is then defined as a solution of

0 = {Q̂}_{n, est}^{*} (β, {Ω̂}^{*}) - {Q̂}_{n, est} (β̂, Ω̂) = {Q̂}_{n, est}^{*} (β, {Ω̂}^{*})

with respect to β. See also Hall and Horowitz (1996), page 897, and Chen et al. (2003), where bootstrapping is used and justified in similar contexts.
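The resampling scheme can be sketched as follows. This is a minimal illustration with hypothetical names: `estimator` stands for any function that maps a resampled data set to a scalar estimate of β, and the ordinary least squares slope used in the example is only a placeholder for the solution of the bootstrapped estimating equation.

```python
import math
import random

def case_control_bootstrap(y, x, d, rng):
    """Resample (Y, X) pairs with replacement, separately within controls
    and within cases; controls are listed first so that D*_i = I(i > n0)."""
    controls = [i for i, di in enumerate(d) if di == 0]
    cases = [i for i, di in enumerate(d) if di == 1]
    idx = ([rng.choice(controls) for _ in controls]
           + [rng.choice(cases) for _ in cases])
    y_star = [y[i] for i in idx]
    x_star = [x[i] for i in idx]
    d_star = [0] * len(controls) + [1] * len(cases)
    return y_star, x_star, d_star

def bootstrap_sd(estimator, y, x, d, n_boot=200, seed=0):
    """Standard deviation of the estimator over bootstrap replicates."""
    rng = random.Random(seed)
    est = [estimator(*case_control_bootstrap(y, x, d, rng))
           for _ in range(n_boot)]
    mean = sum(est) / n_boot
    return math.sqrt(sum((e - mean) ** 2 for e in est) / (n_boot - 1))

def ols_slope(y, x, d):
    """Placeholder estimator (assumption): least squares slope, ignoring d."""
    xbar = sum(x) / len(x)
    ybar = sum(y) / len(y)
    sxx = sum((xi - xbar) ** 2 for xi in x)
    return sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx

# Toy data (hypothetical): five controls followed by five cases, distinct x values.
y = [1.0, 2.0, 0.5, 1.2, 0.8, 3.0, 2.5, 1.5, 2.2, 2.8]
x = [0.10, 0.90, 0.30, 0.45, 0.25, 0.80, 0.60, 0.20, 0.70, 0.95]
d = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
sd_hat = bootstrap_sd(ols_slope, y, x, d, n_boot=50, seed=2)
```

Resampling within each disease group keeps n₀ and n₁ fixed in every replicate, which mirrors the conditioning on D₁, …, D_n used throughout.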

3.7.2. Bootstrap consistency

To show the consistency of the above bootstrap procedure, we need to show that n^1/2(β̂* − β̂) converges to the same normal limit as the original centred estimator n^1/2(β̂ − β_true). For this we use the same techniques as in the proof of theorem B in Chen et al. (2003), combined with the proof of theorem 1 in Appendix B.2. More precisely, it can be shown that, under certain regularity conditions, we have that

n^{1 / 2} ({β̂}^{*} - β̂) = - ℳ_{β}^{- 1} n^{- 1 / 2} \sum_{i = 1}^{n} {Λ (Y_{i}^{*}, X_{i}^{*}, D_{i}^{*}, Θ_{true}) - Λ (Y_{i}, X_{i}, D_{i}, Θ_{true})} + o_{p^{*}} (1),

where o_p*(1) has the same meaning as o_p(1), except that the probability is computed under the bootstrap distribution conditional on the original data (Y_i, X_i, D_i), i = 1, …, n. From this together with the central limit theorem and theorem 1 the result follows.

4. Extensions

4.1. Rare disease approximations

The method that was defined in Section 3 assumes that π₁ = pr(D = 1) is known. This is typically not the case, so many researchers adopt rare disease approximations (see below for references), where the word ‘rare’ has no precise definition but is certainly 1% or less. There are at least two ways to proceed in our context. The first is to use the literature, to choose a nominal π₁ ≤ 1% and to apply the method in Section 3. In results that are not reported here, this works well in the simulation setting of Section 5. In the literature, most researchers use a different approximation, which is described next and implemented in Section 5. We have not investigated in any detail which approach is preferable.

Let ‘≐’ denote ‘approximately equal’. The estimation procedure simplifies if the disease can be assumed to be rare, i.e. if

pr (D = 1 | Y, X) = \frac{exp {θ_{0} + m (Y, X, θ_{1})}}{1 + exp {θ_{0} + m (Y, X, θ_{1})}} ≐ exp {θ_{0} + m (Y, X, θ_{1})},

or, equivalently, if pr(D = 0|Y, X) = [1 + exp{θ₀ + m(Y, X, θ₁)}]⁻¹ ≐ 1. This approximation allows us to replace 𝒦 in the estimating function (12) by

𝒦^{*} {R_{i} (β), x, β, Ω^{*}} = 1 + exp [κ + m {R_{i} (β) + μ (x, β), x, θ_{1}}] .

(16)

In addition, Ω = (κ, θ₁, θ₀) in 𝒦 is replaced by Ω* = (κ, θ₁), which does not depend on θ₀ any-more, and assumption 4 is no longer required since θ₀ is no longer estimated. The proof in Appendix A.2, where we show that the estimating function (12) is unbiased, adapts to the rare disease case in a straightforward way, now using the approximation

f_{Y X | D = d} (y, x) = \frac{exp [d {θ_{0} + m (y, x, θ_{1})}] f_{Y X} (y, x)}{[1 + exp {θ_{0} + m (y, x, θ_{1})}] π_{d}} ≐ \frac{exp [d {θ_{0} + m (y, x, θ_{1})}] f_{Y X} (y, x)}{π_{d}} .

Hence the modified estimating function based on 𝒦* is approximately unbiased in the rare disease case.

As in the general case, the rare disease version of the estimating function (12) depends on unknown quantities which must be estimated. The estimation algorithm for the rare disease model is as follows and is explained below. Set

{α̂}^{*} (β, Ω^{*}) = \frac{n^{- 1} \sum_{i = 1}^{n} R_{i} (β) {[n_{0}^{- 1} \sum_{j = 1}^{n} (1 - D_{j}) 𝒦^{*} {R_{i} (β), X_{j}, β, Ω^{*}}]}^{- 1}}{n^{- 1} \sum_{i = 1}^{n} {[n_{0}^{- 1} \sum_{j = 1}^{n} (1 - D_{j}) 𝒦^{*} {R_{i} (β), X_{j}, β, Ω^{*}}]}^{- 1}},

{Q̂}_{n, est}^{*} (β, Ω^{*}) = n^{- 1 / 2} \sum_{i = 1}^{n} [L {R_{i} (β), X_{i}, {α̂}^{*} (β, Ω^{*}), β} - \frac{n_{0}^{- 1} \sum_{j = 1}^{n} (1 - D_{j}) L {R_{i} (β), X_{j}, {α̂}^{*} (β, Ω^{*}), β} 𝒦^{*} {R_{i} (β), X_{j}, β, Ω^{*}}}{n_{0}^{- 1} \sum_{j = 1}^{n} (1 - D_{j}) 𝒦^{*} {R_{i} (β), X_{j}, β, Ω^{*}}}] .

As before, estimate Ω* = (κ, θ₁) by the logistic regression estimates of D on (Y, X); then solve Q̂*_n,est(β, Ω̂*) = 0 with respect to β to obtain β̂.
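A sketch of the rare disease version follows, under the same illustrative assumptions as before: scalar X, μ(x, β) = βx and m(y, x, θ₁) = θ_y y + θ_x x. The function names and toy data are hypothetical. Note that neither π₁ nor θ₀ appears, and that the inner averages run over the controls only.

```python
import math

def k_star(r, x, beta, kappa, theta_y, theta_x):
    """K* of equation (16) under the assumed linear forms of m and mu."""
    return 1.0 + math.exp(kappa + theta_y * (r + beta * x) + theta_x * x)

def q_star(beta, y, x, d, kappa, theta_y, theta_x):
    """Return (alpha_star, Q_star) for the rare disease estimating function."""
    n = len(y)
    ctrl = [j for j, dj in enumerate(d) if dj == 0]
    n0 = len(ctrl)
    r = [yi - beta * xi for yi, xi in zip(y, x)]
    # Control-only averages of K* replace the pi-weighted averages of K-tilde.
    denom = [sum(k_star(ri, x[j], beta, kappa, theta_y, theta_x) for j in ctrl) / n0
             for ri in r]
    alpha = (sum(ri / dn for ri, dn in zip(r, denom))
             / sum(1.0 / dn for dn in denom))
    total = 0.0
    for i in range(n):
        resid = r[i] - alpha
        num = sum(x[j] * resid * k_star(r[i], x[j], beta, kappa, theta_y, theta_x)
                  for j in ctrl) / n0
        total += x[i] * resid - num / denom[i]
    return alpha, total / math.sqrt(n)

# Toy data (hypothetical): three controls followed by three cases.
y = [1.0, 2.0, 0.5, 3.0, 2.5, 1.5]
x = [0.1, 0.9, 0.3, 0.8, 0.6, 0.2]
d = [0, 0, 0, 1, 1, 1]
alpha_star, q_value = q_star(1.0, y, x, d, kappa=-0.9, theta_y=0.25, theta_x=1.0)
```

The outer sum still runs over all n subjects, so the cases continue to contribute residuals even though the covariate distribution is estimated from the controls alone.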

The formulae for α̂* and Q̂*_n,est do not contain an average 𝒦̃*, which could be introduced analogously to the general case where both formulae involve 𝒦̃, and which depends on π₁ = pr(D = 1). This is explained as follows: both the estimating function (12) and the estimator p_est, which is used to estimate α_true, depend on the unknown density f_X. By equation (3) in Section 2, under the rare disease approximation π₁ ≐ 0, f_X can be approximated by f_cont, i.e. we can estimate f_X empirically by using only the controls. This has the advantage that we do not need prior knowledge about the typically unknown disease rate π₁. This is in contrast with the general model where we need to know π₁ not only to be able to work with 𝒦̃, but also to obtain a consistent estimator of θ₀.

Because case–control studies are almost inevitably conducted for rare outcomes, the rare disease approximation is natural in most applications. It is also widely used; a far from exhaustive list of examples includes Piegorsch et al. (1994), Epstein and Satten (2003), Lin and Zeng (2006), Modan et al. (2001), Zhao et al. (2003), Kwee et al. (2007), Lin and Zeng (2009) and Hu et al. (2010).

4.2. Case–control studies with frequency matching

In frequency-matched case–control studies, a few strata are formed based on covariates such as age, and then a population-based case–control study is performed within each stratum. A straightforward approach is to include these matching variables as part of X, to form the estimating function (9) for each stratum and to form a new estimating function as the possibly weighted sum of the estimating functions across the strata. The weights might for example be based on estimates of the size of each stratum in the population. The resulting estimates of (α_true, β_true) will be asymptotically normally distributed.
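One way to write the combined estimating function is sketched below. The stratum index s, the number of strata S, the weights w_s and the superscript notation are assumptions introduced only for this illustration; the text above does not fix a particular form.

```latex
% Hypothetical notation: \hat{Q}^{(s)}_{n_s, est} is the estimating function (9)
% computed from the n_s subjects in stratum s, with its own estimate \hat{\Omega}^{(s)}.
\hat{Q}_{strat}(\beta) = \sum_{s=1}^{S} w_s\, \hat{Q}^{(s)}_{n_s, est}\{\beta, \hat{\Omega}^{(s)}\},
\qquad w_s \ge 0 ,
% and the estimate of \beta solves \hat{Q}_{strat}(\beta) = 0.
% Taking w_s = 1 gives the unweighted sum; w_s proportional to an estimate of the
% population size of stratum s is the weighted choice mentioned above.
```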

5. Simulations

We performed simulation studies both at and away from the Gaussian model. Our simulations indicate that our proposed estimator has small bias and nearly nominal coverage probability in the cases that we examined, whereas an implementation of the parametric approach (see Section 2.3) may suffer from bias and lower coverage probability (Tables 1 and 2). We also show that our method often achieves significant gains in efficiency when compared with the estimator that uses only the controls. The approach that uses all the data but ignores the case–control sampling design suffers from bias and low coverage; see below.

Table 1.

Results of the simulation study with n₁ = 500 cases and n₀ = 500 controls, and a disease rate of approximately 1%^†

	Results for normal model				Results for gamma model

	Controls	SPMLE	Robust	All	Controls	SPMLE	Robust	All
θ_y = 0.00
Mean	0.992	0.991	1.001	0.992	1.002	1.005	1.003	1.003
sd	0.148	0.107	0.119	0.105	0.156	0.111	0.120	0.111
Est. sd	0.154	0.110	0.121	0.109	0.154	0.110	0.121	0.109
90%	0.917	0.911	0.918	0.912	0.892	0.897	0.899	0.901
95%	0.956	0.955	0.965	0.955	0.944	0.943	0.944	0.941
MSE Eff		1.898	1.537	1.965		1.963	1.665	1.957
θ_y = 0.25
Mean	0.999	1.001	0.990	1.078	1.001	0.997	0.993	1.120
sd	0.154	0.110	0.117	0.109	0.155	0.144	0.120	0.144
Est. sd	0.154	0.111	0.119	0.110	0.153	0.149	0.123	0.148
90%	0.911	0.905	0.908	0.818	0.900	0.924	0.901	0.797
95%	0.955	0.954	0.958	0.889	0.945	0.961	0.947	0.881
MSE Eff		1.951	1.720	1.303		1.148	1.643	0.680
θ_y = 0.50
Mean	0.995	0.994	0.989	1.177	0.986	0.848	1.024	1.297
sd	0.154	0.114	0.117	0.114	0.144	0.205	0.147	0.208
Est. sd	0.154	0.113	0.120	0.113	0.148	0.208	0.149	0.215
90%	0.903	0.898	0.904	0.525	0.906	0.818	0.905	0.587
95%	0.957	0.947	0.948	0.641	0.953	0.884	0.957	0.719
MSE Eff		1.822	1.704	0.531		0.323	0.938	0.159


^† ‘Normal’ means that ε ~ N(0, 1), and ‘gamma’ means that ε is a centred and scaled gamma random variable with shape parameter 0.4. The analyses performed are ‘controls’ (using only controls), the semiparametric efficient method that assumes normality (‘SPMLE’), our new estimator (‘robust’), and ‘all’, which is the method that uses all the data while ignoring the case–control sampling design. Over 1000 simulations, we computed the mean estimated β (‘mean’), its standard deviation (‘sd’), the mean estimated standard deviation (‘Est. sd’), the coverage for a nominal 90% confidence interval (‘90%’), the coverage for a nominal 95% confidence interval (‘95%’) and the mean-squared error efficiency (‘MSE Eff’) compared with using only the controls.

Table 2.

Results of the simulation study described in Table 1, now with n₁ = 150 cases and n₀ = 150 controls^†

	Results for normal model				Results for gamma model

	Controls	SPMLE	Robust	All	Controls	SPMLE	Robust	All
θ_y = 0.00
Mean	0.991	0.993	1.005	0.992	0.998	0.991	1.019	0.990
sd	0.287	0.204	0.233	0.200	0.292	0.201	0.236	0.199
Est. sd	0.282	0.202	0.230	0.200	0.281	0.201	0.230	0.199
90%	0.891	0.908	0.910	0.905	0.892	0.900	0.916	0.902
95%	0.942	0.951	0.965	0.952	0.948	0.950	0.959	0.950
MSE Eff		1.973	1.509	2.043		2.103	1.526	2.151
θ_y = 0.25
Mean	1.008	1.016	0.983	1.092	1.007	0.994	0.974	1.118
sd	0.301	0.204	0.220	0.202	0.280	0.268	0.223	0.267
Est. sd	0.283	0.204	0.227	0.202	0.273	0.269	0.232	0.268
90%	0.874	0.893	0.933	0.867	0.903	0.900	0.928	0.864
95%	0.933	0.950	0.968	0.930	0.943	0.947	0.968	0.928
MSE Eff		2.156	1.856	1.834		1.088	1.551	0.921
θ_y = 0.50
Mean	0.986	0.987	0.974	1.173	0.985	0.837	1.006	1.292
sd	0.283	0.199	0.222	0.200	0.265	0.393	0.295	0.400
Est. sd	0.282	0.206	0.235	0.207	0.266	0.381	0.311	0.393
90%	0.903	0.918	0.936	0.798	0.900	0.864	0.938	0.808
95%	0.948	0.958	0.973	0.871	0.943	0.923	0.969	0.888
MSE Eff		2.003	1.597	1.143		0.388	0.806	0.287

^†

The disease rate is approximately 1%.

We generated X from a uniform distribution on (0, 1). The logistic regression model is pr(D = 1|Y, X) = H(θ₀ + θ_yY + θ_xX), with θ₀ = −5.5, θ_y = 0.00, 0.25 or 0.50, and θ_x = 1. The model for Y given X is a linear regression model, Y = α_true + β_trueX + ε, with α_true = 0 and β_true = 1. We considered two distributions for ε: the standard normal distribution, for which the parametric approach attains the semiparametric efficiency bound, and, for comparison, a standardized (centred and scaled) gamma distribution with shape parameter 0.4. By equation (2), for θ_y = 0.00, 0.25, 0.50 the rates of disease are approximately 0.007, 0.008 and 0.010. In the first scenario the case–control study has n₁ = 500 cases and n₀ = 500 controls. In the second scenario we chose n₀ = n₁ = 150. We generated 1000 simulated data sets in each setting.
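As a rough numerical check of the quoted disease rates under normal errors, pr(D = 1) can be approximated by integrating the logistic model over the population distribution of (X, ε). The sketch below is not the authors' code: the function name `disease_rate`, the quadrature rules and the number of nodes are illustrative assumptions, and only the simulation settings stated above are taken from the text.

```python
import numpy as np


def disease_rate(theta_0, theta_y, theta_x, alpha=0.0, beta=1.0, n_nodes=40):
    """Approximate pr(D = 1) for X ~ U(0, 1), Y = alpha + beta*X + eps, eps ~ N(0, 1)."""
    # Gauss-Legendre nodes on [-1, 1], mapped to (0, 1) for the uniform X.
    u, w_u = np.polynomial.legendre.leggauss(n_nodes)
    x, w_x = 0.5 * (u + 1.0), 0.5 * w_u
    # Probabilists' Gauss-Hermite nodes; rescaled weights integrate against N(0, 1).
    eps, w_eps = np.polynomial.hermite_e.hermegauss(n_nodes)
    w_eps = w_eps / np.sqrt(2.0 * np.pi)
    xx, ee = np.meshgrid(x, eps, indexing="ij")
    yy = alpha + beta * xx + ee
    prob = 1.0 / (1.0 + np.exp(-(theta_0 + theta_y * yy + theta_x * xx)))
    # Double quadrature: sum over x nodes and eps nodes.
    return float(w_x @ prob @ w_eps)


rates = {theta_y: disease_rate(-5.5, theta_y, 1.0) for theta_y in (0.0, 0.25, 0.5)}
for theta_y, rate in rates.items():
    print(f"theta_y = {theta_y:.2f}: pr(D = 1) is about {rate:.4f}")
```

The printed values should lie close to the rates quoted above; the gamma error case would need a different quadrature rule or Monte Carlo integration and is not covered by this sketch.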

We contrasted four methods. The first uses ordinary linear regression based only on the controls. The second method uses the same approach but is expected to be significantly biased since it is based on the entire data set. The third method is the parametric (‘semiparametric efficient’) method that assumes normal errors, with standard errors obtained by inverting the Hessian of the log-likelihood. The fourth method is our proposed method, with standard errors estimated by using asymptotic formulae. The third and the fourth method were computed by making the rare disease approximation.
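To illustrate why the first two methods behave differently, the following sketch draws case–control samples under the normal-error design with θ_y = 0.50 and compares the ordinary least squares slope from the controls only with the slope from all the data. It is a minimal illustration under stated assumptions, not the authors' simulation code: the helper names `draw_case_control` and `ols_slope`, the batch size, the seed and the use of 100 replications are all hypothetical choices.

```python
import numpy as np


def draw_case_control(rng, n1, n0, theta_0=-5.5, theta_y=0.5, theta_x=1.0,
                      alpha=0.0, beta=1.0, batch=100_000):
    """Sample n1 cases and n0 controls by drawing from the population in batches."""
    xs, ys, ds = [], [], []
    need = {0: n0, 1: n1}
    while need[0] > 0 or need[1] > 0:
        x = rng.uniform(0.0, 1.0, batch)
        y = alpha + beta * x + rng.standard_normal(batch)
        p = 1.0 / (1.0 + np.exp(-(theta_0 + theta_y * y + theta_x * x)))
        d = (rng.uniform(size=batch) < p).astype(int)
        for status in (0, 1):
            # Keep only as many subjects of this disease status as are still needed.
            idx = np.flatnonzero(d == status)[: need[status]]
            xs.append(x[idx])
            ys.append(y[idx])
            ds.append(d[idx])
            need[status] -= idx.size
    return np.concatenate(xs), np.concatenate(ys), np.concatenate(ds)


def ols_slope(x, y):
    """Slope from a least squares fit of y on (1, x)."""
    design = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[1]


rng = np.random.default_rng(0)
slopes_controls, slopes_all = [], []
for _ in range(100):
    x, y, d = draw_case_control(rng, n1=500, n0=500)
    slopes_controls.append(ols_slope(x[d == 0], y[d == 0]))
    slopes_all.append(ols_slope(x, y))

mean_controls = float(np.mean(slopes_controls))
mean_all = float(np.mean(slopes_all))
print(f"controls only: {mean_controls:.3f}, all data: {mean_all:.3f}")
```

With β_true = 1, the controls-only average should stay near 1 whereas the all-data average should be visibly larger, in line with the ‘all’ columns of Tables 1 and 2.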

The case θ_y = 0.00 is interesting, because here Y is independent of D given X. Hence all methods should achieve nominal coverage probabilities for estimating β_true, which is indeed seen in Table 1. Since all methods are asymptotically valid when θ_y = 0.00, bias can be expected only when θ_y is sufficiently ‘large’. For this reason, we experimented with the cases θ_y = 0.25 and θ_y = 0.50. Consider θ_y = 0.25 first. Here the approach that uses all the data yields a biased estimator of β_true = 1, with low coverage probabilities. The ‘semiparametric efficient’ method that assumes normality still maintains its nominal coverage probabilities. As expected, since it is efficient if the errors are normal, it outperforms the other approaches in this case. To quantify this, for any two methods, say A and B, with estimates β̂_A and β̂_B, the mean-squared error efficiency of method A with respect to method B is E{(β̂_B − β_true)²}/E{(β̂_A − β_true)²}, and its estimated version is computed by replacing expectations by averages across the simulations. The semiparametric efficient method has 13% greater mean-squared error efficiency than our method in the normal case. However, in the gamma case, our method has 43% greater mean-squared error efficiency than the semiparametric efficient method. Our method also outperforms the approach that uses only the controls, for both normal and gamma errors: in both cases the mean-squared error efficiency is roughly 70% larger.
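The estimated mean-squared error efficiency defined above is simple to compute from simulation output. The sketch below is illustrative only: the function name `mse_efficiency` and the toy estimates are made up for the example. The last two lines use the fact that the ‘MSE Eff’ rows of Table 1 are efficiencies relative to the controls-only analysis, so the ratio of two entries is the efficiency of one method relative to the other.

```python
import numpy as np


def mse_efficiency(est_a, est_b, beta_true):
    """Estimated MSE efficiency of method A with respect to method B."""
    est_a = np.asarray(est_a, dtype=float)
    est_b = np.asarray(est_b, dtype=float)
    mse_a = np.mean((est_a - beta_true) ** 2)
    mse_b = np.mean((est_b - beta_true) ** 2)
    return float(mse_b / mse_a)


# Toy, made-up estimates (not simulation output): method A is twice as close to
# beta_true = 1 as method B, so its MSE efficiency relative to B is 4.
toy_eff = mse_efficiency([0.9, 1.1], [0.8, 1.2], beta_true=1.0)

# Relative efficiencies from the Table 1 entries for theta_y = 0.25.
spmle_vs_robust_normal = 1.951 / 1.720  # SPMLE relative to robust, normal errors
robust_vs_spmle_gamma = 1.643 / 1.148   # robust relative to SPMLE, gamma errors
print(toy_eff, spmle_vs_robust_normal, robust_vs_spmle_gamma)
```

The two ratios reproduce the 13% and 43% figures quoted in the text.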

Finally, in the case θ_y = 0.50 with normal regression errors, the semiparametric efficient method that assumes normality maintains its nominal coverage probabilities and has 7% greater mean-squared error efficiency than our method and 82% greater efficiency than using only controls. However, when the errors have a gamma distribution, it suffers from bias, increased variance and loss of coverage, with nominal 90% and 95% coverage actually being 81.8% and 88.4% respectively. Our method retains nominal coverage. In this gamma case, the controls-only analysis and our method have roughly equal mean-squared error efficiency, which is much greater than that of the semiparametric efficient approach that assumes normal errors.

6. Empirical example

In this section, we illustrate the methodology in a case–control study of prostate cancer, which was originally designed to investigate the risk of prostate cancer associated with vitamin D biomarkers and genetic variations in vitamin D metabolism pathways (Ahn et al., 2009). The goal of the current analysis, which includes 749 prostate cancer cases and 781 controls, is to examine whether the genetic variations in the vitamin D receptor influence [25(OH)D], which is a serum level biomarker of vitamin D. In the notation of this paper, D is the prostate cancer case–control status and Y is the level of [25(OH)D]. We investigated three SNPs, rs2238136, rs2254210 and rs2239186, each of which represents an ordinal categorical variable coded as 0, 1 or 2 depending on how many copies of the variant allele a subject carries. In our analysis, X consists of three dummy variables for age groups, along with one of the genetic markers.

The results are given in Table 3. None of the coefficients for the SNPs is statistically significant. Thus, neither the traditional controls-only method nor the proposed method detected any association between the vitamin D receptor gene and [25(OH)D] level. These results are consistent with Chen et al. (2009) who noted that, given the downstream role of the vitamin D receptor gene in the vitamin D pathway, it is unlikely that vitamin D receptor polymorphisms could actually influence the level of [25(OH)D]. In spite of the lack of association, it is interesting to observe that the 95% confidence intervals obtained with our method are much shorter than those obtained from the control data only. In terms of mean-squared error efficiency, here estimated as the square of the ratio of the lengths of the confidence intervals, the results for the three SNPs suggest gains in efficiency of 68%, 136% and 125% compared with using only the controls.

Table 3.

Results of the vitamin D receptor data example in Section 6^†

X	Results for our method			Results for controls only			Efficiency

	Estimate	Lower limit	Upper limit	Estimate	Lower limit	Upper limit
SNP 1	0.015	−0.165	0.195	−0.029	−0.262	0.204	1.68
SNP 2	0.023	−0.047	0.093	0.039	−0.069	0.146	2.36
SNP 3	0.015	−0.062	0.092	−0.045	−0.161	0.070	2.25

^†

Three analyses are displayed, one each when X is SNP 1, SNP 2 and SNP 3. Displayed are the parameter estimates of the slope for X (‘estimate’), and the lower (‘lower limit’) and upper (‘upper limit’) limits of the 95% confidence intervals. Our method is contrasted with using linear regression among the controls only. Also displayed is the ‘efficiency’, which is defined as the square of the ratio of the lengths of the confidence intervals (controls only relative to our method).
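The ‘efficiency’ column can be recomputed directly from the confidence limits reported in Table 3. The short sketch below does only that arithmetic; the dictionary layout and the function name `ci_efficiency` are illustrative choices, not part of the original analysis.

```python
# 95% confidence limits (lower, upper) copied from Table 3.
table3 = {
    "SNP 1": {"ours": (-0.165, 0.195), "controls": (-0.262, 0.204)},
    "SNP 2": {"ours": (-0.047, 0.093), "controls": (-0.069, 0.146)},
    "SNP 3": {"ours": (-0.062, 0.092), "controls": (-0.161, 0.070)},
}


def ci_efficiency(ours, controls):
    """Squared ratio of interval lengths, controls-only relative to our method."""
    length_ours = ours[1] - ours[0]
    length_controls = controls[1] - controls[0]
    return (length_controls / length_ours) ** 2


efficiency = {snp: round(ci_efficiency(**limits), 2) for snp, limits in table3.items()}
print(efficiency)
```

Rounded to two decimals, these values agree with the last column of Table 3 and hence with the quoted gains of 68%, 136% and 125%.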

7. Discussion

If the disease probability pr(D = 1) is known, there are simpler methods for our particular setting that allow estimation of β_true, based on weighting via equation (3). However, in the common case that pr(D = 1) is not known, the development in Section 3 leads to two natural rare disease approximations that use all the data and not just the data on the controls; see Section 4.1. It would be interesting to investigate which of these two approximate approaches is preferable in general.

Our simulation results are specific to rare diseases, by which we mean that pr(D = 1) is at most about 1%. Biases will arise as the disease probability increases. In addition, since rare disease approximations do not lead to fully consistent estimation, coverage probability in large samples will suffer, since the bias is fixed whereas the variance decreases with sample size. Finally, the methods are likely to suffer in cases where the X-distribution has relatively rare values that are not within the centre of the support of X.
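The effect of a fixed bias on coverage can be made concrete with a normal approximation: if an estimator is approximately N(β_true + b, σ²/n), a nominal 95% interval of the form estimate ± 1.96 σ/√n covers β_true with probability Φ(1.96 − b√n/σ) − Φ(−1.96 − b√n/σ). The sketch below evaluates this formula; the bias b = 0.02 and σ = 3 are purely hypothetical numbers chosen for illustration and are not estimates from the simulations.

```python
from math import erf, sqrt


def normal_cdf(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


def coverage(bias, sigma, n, z_crit=1.959964):
    """Coverage of estimate +/- z_crit * sigma / sqrt(n) when the estimate is
    N(beta_true + bias, sigma^2 / n)."""
    shift = bias * sqrt(n) / sigma
    return normal_cdf(z_crit - shift) - normal_cdf(-z_crit - shift)


# Hypothetical bias and standard deviation scale, for illustration only.
coverages = {n: coverage(bias=0.02, sigma=3.0, n=n) for n in (300, 1000, 10_000, 100_000)}
for n, cov in coverages.items():
    print(f"n = {n:>6}: approximate coverage {cov:.3f}")
```

Under these assumed values the coverage is close to nominal for moderate n and deteriorates as n grows, which is the large-sample behaviour described above.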

Acknowledgements

This paper represents part of the first author’s doctoral dissertation at Texas A&M University. Wei and Carroll’s research was supported by a grant from the National Cancer Institute (R37-CA057030). Carroll was also supported by award KUS-CI-016-04, made by King Abdullah University of Science and Technology. Chatterjee’s research was supported by a gene–environment initiative grant from the National Heart, Lung and Blood Institute (RO1-HL091172-01) and by the Intramural Research Program of the National Cancer Institute. Müller was supported by a National Science Foundation grant (DMS-0907014). Van Keilegom gratefully acknowledges financial support from Interuniversity Attraction Pole research network P6/03 of the Belgian Government (Belgian science policy), and from the European Research Council under the European Community’s seventh framework programme (FP7/2007-2013), European Research Council grant agreement 203650.

Appendix A

Some derivations

A.1. Derivation of the error density estimator (11)

The key idea of the approach is to introduce discrete probabilities γ_i = pr{R(β) = R_i(β)}, i = 1, …, n, which yields

pr (D = d) = \sum_{i = 1}^{n} pr {D = d | R (β) = R_{i} (β)} γ_{i},

and to work with the maximum likelihood estimates, i.e. with those γ_i that maximize the retrospective log-likelihood

\sum_{i = 1}^{n} log [pr {R (β) = R_{i} (β) | D = D_{i}}] = \sum_{i = 1}^{n} log [\frac{pr {R (β) = R_{i} (β)} pr {D = D_{i} | R (β) = R_{i} (β)}}{pr (D = D_{i})}] = \sum_{i = 1}^{n} log [\sum_{k = 1}^{n} γ_{k} 1 {R_{i} (β) = R_{k} (β)}] + \sum_{i = 1}^{n} log [\frac{pr {D = D_{i} | R (β) = R_{i} (β)}}{\sum_{k = 1}^{n} pr {D = D_{i} | R (β) = R_{k} (β)} γ_{k}}] .

Taking the derivative with respect to γ_k, k = 1, …, n, gives

\frac{\sum_{i = 1}^{n} 1 {R_{i} (β) = R_{k} (β)}}{γ_{k}} - \sum_{i = 1}^{n} \frac{pr {D = D_{i} | R (β) = R_{k} (β)}}{\sum_{j = 1}^{n} pr {D = D_{i} | R (β) = R_{j} (β)} γ_{j}} = γ_{k}^{- 1} - \sum_{i = 1}^{n} \frac{pr {D = D_{i} | R (β) = R_{k} (β)}}{pr (D = D_{i})} = γ_{k}^{- 1} - \sum_{d = 0}^{1} pr {D = d | R (β) = R_{k} (β)} \frac{n_{d}}{π_{d}} .

Now set this equal to 0 to obtain

γ_{k} = {[\sum_{d = 0}^{1} pr {D = d | R (β) = R_{k} (β)} \frac{n_{d}}{π_{d}}]}^{- 1} = {[\int \sum_{d = 0}^{1} pr {D = d | R (β) = R_{k} (β), X = x} f_{X} (x) d x \frac{n_{d}}{π_{d}}]}^{- 1} .

By definition of 𝒦, using that

\frac{n_{0}}{π_{0}} + \frac{n_{1}}{π_{1}} exp {θ_{0} + m (y, x, θ_{1})} = \frac{n_{0}}{π_{0}} [1 + exp {κ + m (y, x, θ_{1})}],

this is the desired formula (11).
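The identity used in the last step follows from the definition κ = θ₀ + log(n₁/n₀) − log(π₁/π₀), which is recalled in Appendix A.2. The lines below only spell out that algebra and introduce no new quantities.

```latex
\exp(\kappa) = \exp(\theta_0)\,\frac{n_1}{n_0}\,\frac{\pi_0}{\pi_1}
\quad\Longrightarrow\quad
\frac{n_0}{\pi_0}\exp\{\kappa + m(y, x, \theta_1)\}
  = \frac{n_1}{\pi_1}\exp\{\theta_0 + m(y, x, \theta_1)\},
\\[4pt]
\text{so that}\quad
\frac{n_0}{\pi_0} + \frac{n_1}{\pi_1}\exp\{\theta_0 + m(y, x, \theta_1)\}
  = \frac{n_0}{\pi_0}\bigl[1 + \exp\{\kappa + m(y, x, \theta_1)\}\bigr].
```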

A.2. Unbiasedness of estimation function (12)

All calculations of expectations here will be based on the precise definition of expectations in a case–control sampling scheme; see equation (4). Let (β_true, Ω_true) be the true parameter, β an arbitrary value and τ(x, β, β_true) = μ(x, β_true) − μ(x, β). To derive the conditional density given the disease state we use the fact that we assume a logistic model, pr(D = 1|Y, X) = H{θ₀ + m(Y, X, θ₁)}, with H(x) the logistic distribution function, for which

H {θ_{0} + m (Y, X, θ_{1})} = [1 - H {θ_{0} + m (Y, X, θ_{1})}] exp {θ_{0} + m (Y, X, θ_{1})} .

Now write f_YX(·) as the joint density function of (Y, X) in the population. Then, with θ₀ and θ₁ denoting the true parameters,

π_{d} = pr (D = d) = \int H {θ_{0} + m (y, x, θ_{1})}^{d} {[1 - H {θ_{0} + m (y, x, θ_{1})}]}^{1 - d} f_{Y X} (y, x) d y d x = \int [1 - H {θ_{0} + m (y, x, θ_{1})}] exp [d {θ_{0} + m (y, x, θ_{1})}] f_{Y X} (y, x) d y d x .

It then follows that the density of (Y, X) given D is

f_{Y X | D = d} (y, x) = \frac{exp [d {θ_{0} + m (y, x, θ_{1})}] f_{Y X} (y, x)}{[1 + exp {θ_{0} + m (y, x, θ_{1})}] π_{d}} .

Recall that κ = θ₀ + log(n₁/n₀) − log(π₁/π₀). Then equation (4) can now be computed as

n^{- 1} \sum_{i = 1}^{n} E [G {R_{i} (β), X_{i}} | D_{i}] = \sum_{d = 0}^{1} \frac{n_{d}}{n π_{d}} \int G {y - μ (x, β), x} \frac{exp [d {θ_{0} + m (y, x, θ_{1})}]}{1 + exp {θ_{0} + m (y, x, θ_{1})}} f_{Y X} (y, x) d y d x = \frac{n_{0}}{n π_{0}} \int \sum_{d = 0}^{1} G {y - μ (x, β), x} \frac{n_{d} / n_{0}}{π_{d} / π_{0}} \frac{exp [d {θ_{0} + m (y, x, θ_{1})}]}{1 + exp {θ_{0} + m (y, x, θ_{1})}} f_{Y X} (y, x) d y d x = \frac{n_{0}}{n π_{0}} \int G (r, x) \frac{1 + exp [κ + m {r + μ (x, β), x, θ_{1}}]}{1 + exp [θ_{0} + m {r + μ (x, β), x, θ_{1}}]} f_{Y X} {r + μ (x, β), x} d r d x .

The joint density of (Y, X) in the population is f_YX(y, x) = f_ε{y − α_true − μ(x, β_true)} f_X(x). Hence, f_YX{r + μ(x, β), x} = f_ε{r − α_true − τ(x, β, β_true)} f_X(x). Thus,

n^{- 1} \sum_{i = 1}^{n} E [G {R_{i} (β), X_{i}} | D_{i}] = \frac{n_{0}}{n π_{0}} \int G (r, x) \frac{1 + exp [κ + m {r + μ (x, β_{true}) - τ (x, β, β_{true}), x, θ_{1}}]}{1 + exp [θ_{0} + m {r + μ (x, β_{true}) - τ (x, β, β_{true}), x, θ_{1}}]} \times f_{ε} {r - α_{true} - τ (x, β, β_{true})} f_{X} (x) d r d x = \frac{n_{0}}{n π_{0}} \int G {r + τ (x, β, β_{true}), x} \frac{1 + exp [κ + m {r + μ (x, β_{true}), x, θ_{1}}]}{1 + exp [θ_{0} + m {r + μ (x, β_{true}), x, θ_{1}}]} \times f_{ε} (r - α_{true}) f_{X} (x) d r d x .

Now, since

𝒦 (r, x, β_{true}, Ω_{true}) = (1 + exp [κ + m {r + μ (x, β_{true}), x, θ_{1}}]) {(1 + exp [θ_{0} + m {r + μ (x, β_{true}), x, θ_{1}}])}^{- 1},

we have that

n^{- 1} \sum_{i = 1}^{n} E [G {R_{i} (β), X_{i}} | D_{i}] = \frac{n_{0}}{n π_{0}} \int f_{ε} (r - α_{true}) f_{X} (x) 𝒦 (r, x, β_{true}, Ω_{true}) G {r + τ (x, β, β_{true}), x} d r d x .

(17)

It follows from the convention in equation (4) and equation (17) that

\frac{n π_{0}}{n_{0}} E {Q_{n} (α_{true}, β, Ω_{true})} = E {Q_{n} (α_{true}, β, Ω_{true}) | D_{1}, \dots, D_{n}} = n^{1 / 2} \int f_{ε} (r - α_{true}) f_{X} (x) 𝒦 (r, x, β_{true}, Ω_{true}) [L {r + τ (x, β, β_{true}), x, α (β, Ω_{true}), β} - \frac{\int L {r + τ (x, β, β_{true}), υ, α (β, Ω_{true}), β} 𝒦 {r + τ (x, β, β_{true}), υ, β, Ω_{true}} f_{X} (υ) d υ}{\int 𝒦 {r + τ (x, β, β_{true}), s, β, Ω_{true}} f_{X} (s) d s}] d x d r .

If β = β_true, since τ(x, β_true, β_true) = 0, it follows directly that the last term is 0, and therefore 0 = E{Q_n(α_true, β_true, Ω_true)|D₁, …, D_n}. Hence Q_n(α_true, β, Ω_true) = 0 is an unbiased estimating equation. If β ≠ β_true, then in general we shall have 0 ≠ E{Q_n(α_true, β, Ω_true)|D₁, …, D_n}.

Appendix B

Asymptotic theory

B.1. Notation and assumptions

In this section we introduce notation that is needed for our main theorem in Section 3.6, and we also state the formal assumptions under which this result will be valid.

Let (β, Ω) = Θ, and let Θ_true denote its true value. Recall equation (4), and define

c_{*} = lim_{n \to \infty} (n_{0} / n),

α (β, Ω) = \frac{\sum_{d = 0}^{1} (n_{d} / n) E (R (β) {[\int f_{X} (x) 𝒦 {R (β), x, β, Ω} d x]}^{- 1} | D = d)}{\sum_{d = 0}^{1} (n_{d} / n) E ({[\int f_{X} (x) 𝒦 {R (β), x, β, Ω} d x]}^{- 1} | D = d)},

𝒯 {R (β), X, Θ, f_{X}} = L {R (β), X, α (β, Ω), β} - \frac{\int L {R (β), x, α (β, Ω), β} 𝒦 {R (β), x, Θ} f_{X} (x) d x}{\int 𝒦 {R (β), x, Θ} f_{X} (x) d x},

ℳ_{Ω} = \sum_{d = 0}^{1} c_{*}^{1 - d} {(1 - c_{*})}^{d} {E [\frac{\partial 𝒯 {R (β_{true}), X, Θ, f_{X}}}{\partial Ω^{T}} | D = d] |}_{Θ = Θ_{true}},

ℳ_{β} = \sum_{d = 0}^{1} c_{*}^{1 - d} {(1 - c_{*})}^{d} {E [\frac{\partial 𝒯 {R (β), X, Θ_{true}, f_{X}}}{\partial β^{T}} | D = d] |}_{β = β_{true}} .

Define

G_{num} (r, x, d, Θ) = L {r, x, α (β, Ω), β} 𝒦̃ (r, x, d, Θ),

G_{den} (r, x, d, Θ) = 𝒦̃ (r, x, d, Θ),

𝒜_{num} (r, Θ) = \sum_{d = 0}^{1} (n_{d} / n) E {G_{num} (r, X, D, Θ) | D = d},

𝒜_{den} (r, Θ) = \sum_{d = 0}^{1} (n_{d} / n) E {G_{den} (r, X, D, Θ) | D = d} .

Write

ℋ_{n} (β, Θ) = n^{- 1 / 2} \sum_{i = 1}^{n} [\frac{n^{- 1} \sum_{j = 1}^{n} G_{num} {R_{i} (β), X_{j}, D_{j}, Θ}}{n^{- 1} \sum_{j = 1}^{n} G_{den} {R_{i} (β), X_{j}, D_{j}, Θ}} - \frac{𝒜_{num} {R_{i} (β), Θ}}{𝒜_{den} {R_{i} (β), Θ}}]

and

W {R_{i} (β), X_{j}, D_{j}, Θ} = \frac{G_{num} {R_{i} (β), X_{j}, D_{j}, Θ} - 𝒜_{num} {R_{i} (β), Θ}}{𝒜_{den} {R_{i} (β), Θ}} - \frac{𝒜_{num} {R_{i} (β), Θ} [G_{den} {R_{i} (β), X_{j}, D_{j}, Θ} - 𝒜_{den} {R_{i} (β), Θ}]}{𝒜_{den}^{2} {R_{i} (β), Θ}} .

Also define

{Z̃}_{i} (β) = {R_{i} (β), X_{i}, D_{i}},

z̃ = (r, x, d),

Q_{1} {{Z̃}_{i} (β), {Z̃}_{j} (β), Θ} = W {R_{i} (β), X_{j}, D_{j}, Θ} + W {R_{j} (β), X_{i}, D_{i}, Θ},

Q_{2 j} (z̃, β, Θ) = E [W {R (β), x, d, Θ} | D = j],

h_{1 j} (z̃, β, Θ) = E [Q_{1} {z̃, Z̃ (β), Θ} | D = j] (j = 0, 1),

h_{2} {R_{i} (β), X_{i}, D_{i}, Θ} = \frac{n_{0}}{n} (1 - D_{i}) h_{10} {{Z̃}_{i} (β), β, Θ} + \frac{n_{1}}{n} D_{i} h_{11} {{Z̃}_{i} (β), β, Θ} + \frac{n_{0}}{n} D_{i} Q_{20} {{Z̃}_{i} (β), β, Θ} + \frac{n_{1}}{n} (1 - D_{i}) Q_{21} {{Z̃}_{i} (β), β, Θ},

m_{θ_{1}} (y, x, θ_{1}) = \frac{\partial m (y, x, θ_{1})}{\partial θ_{1}},

Φ (y, x, d, Ω) = {(1, m_{θ_{1}} (y, x, θ_{1}))}^{T} [d - H {κ + m (y, x, θ_{1})}],

𝒩_{Ω} = - \sum_{d = 0}^{1} c_{*}^{1 - d} {(1 - c_{*})}^{d} {[{E {\partial Φ (Y, X, D, Ω) / \partial Ω | D = d} |}_{Ω = Ω_{true}}]}^{- 1},

Λ (Y_{i}, X_{i}, D_{i}, Θ_{true}) = ℳ_{Ω} {(𝒩_{Ω} Φ (Y_{i}, X_{i}, D_{i}, Ω_{true}), Ψ (Y_{i}, X_{i}, D_{i}, Ω_{true}))}^{T} - h_{2} {R_{i} (β_{true}), X_{i}, D_{i}, Θ_{true}} + 𝒯 {R_{i} (β_{true}), X_{i}, Θ_{true}, f_{X}},

where the function Ψ(Y_i, X_i, D_i, Ω_true) is defined in assumption 4 below. Finally, let

Σ = \sum_{d = 0}^{1} c_{*}^{1 - d} {(1 - c_{*})}^{d} ℳ_{β}^{- 1} cov {Λ (Y, X, D, Θ_{true}) | D = d} {(ℳ_{β}^{- 1})}^{T} .

Next, introduce the following assumptions, under which the main result in Section 3.6 is valid.

Assumption 1. The error ε is independent of X. The error distribution F_ε is twice continuously differentiable, and the distribution F_X of X is once continuously differentiable. The corresponding densities are denoted by f_ε and f_X.

Assumption 2. There exists some 0 < c_* < 1 such that n₀/n → c_*.

Assumption 3. The function μ(x, β) is three times continuously differentiable with respect to β, m(y, x, θ₁) is twice continuously differentiable with respect to y and θ₁, and Φ(y, x, d, Ω) is continuously differentiable with respect to Ω. Also, the matrices ℳ_β and E{∂Φ(Y, X, D, Ω)/∂Ω|D = d}|_{Ω=Ω_true} are invertible.

Assumption 4. The estimator θ̂₀ satisfies

{θ̂}_{0} - θ_{0, true} = n^{- 1} \sum_{i = 1}^{n} Ψ (Y_{i}, X_{i}, D_{i}, Ω_{true}) + o_{p} (n^{- 1 / 2}),

for some function Ψ that satisfies E{Ψ(Y, X, D, Ω_true)|D} = 0.

B.2. Proofs

We are now ready to give the proof of our main asymptotic result. Before giving a formal proof, let us first highlight the main steps of the proof. First, it follows from Appendix A.2 that Q̂_n(α, β, Ω) is an unbiased estimating function. Plugging in an estimator of α_true, we use a Taylor expansion of Q̂_n,est(β̂, Ω̂) = 0 around the true β and Ω, which gives a regular asymptotically linear expansion of n^{1/2}(β̂ − β_true). Finally we apply the central limit theorem to obtain the required asymptotic normality result. Along the way, we must show an asymptotic expansion for ℋ_n(β, Θ), which is given in lemma 1. The notation in the statement of this lemma was introduced in the previous section.

Lemma 1. Assume that assumptions 1–3 are valid. Then, for each β and Θ,

ℋ_{n} (β, Θ) = n^{- 1 / 2} \sum_{i = 1}^{n} h_{2} {R_{i} (β), X_{i}, D_{i}, Θ} + o_{p} (1),

where E[h₂{R(β), X, D, Θ}|D] = 0.

Proof. Define

Z_{num} {R (β), Θ} = n^{- 1 / 2} \sum_{j = 1}^{n} [G_{num} {R (β), X_{j}, D_{j}, Θ} - 𝒜_{num} {R (β), Θ}],

Z_{den} {R (β), Θ} = n^{- 1 / 2} \sum_{j = 1}^{n} [G_{den} {R (β), X_{j}, D_{j}, Θ} - 𝒜_{den} {R (β), Θ}] .

Since by assumption 2 we have that n₁/n₀ → c, 0 < c < ∞, it follows that Z_num{R(β),Θ} = O_p(1) and Z_den{R(β), Θ} = O_p(1), for each β and Θ. Hence, by a Taylor series expansion and assumption 3,

\frac{n^{- 1} \sum_{j = 1}^{n} G_{num} {R (β), X_{j}, D_{j}, Θ}}{n^{- 1} \sum_{j = 1}^{n} G_{den} {R (β), X_{j}, D_{j}, Θ}} - \frac{𝒜_{num} {R (β), Θ}}{𝒜_{den} {R (β), Θ}} = \frac{𝒜_{num} {R (β), Θ} + n^{- 1 / 2} Z_{num} {R (β), Θ}}{𝒜_{den} {R (β), Θ} + n^{- 1 / 2} Z_{den} {R (β), Θ}} - \frac{𝒜_{num} {R (β), Θ}}{𝒜_{den} {R (β), Θ}} = \frac{n^{- 1 / 2} Z_{num} {R (β), Θ}}{𝒜_{den} {R (β), Θ}} - \frac{𝒜_{num} {R (β), Θ}}{𝒜_{den}^{2} {R (β), Θ}} n^{- 1 / 2} Z_{den} {R (β), Θ} + o_{p} (n^{- 1 / 2}) .
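The last equality in the display above is the usual first-order expansion of a ratio. Written for generic quantities a and b ≠ 0 with small perturbations δ_a and δ_b (symbols introduced here only for illustration), it reads as follows.

```latex
\frac{a + \delta_a}{b + \delta_b} - \frac{a}{b}
  = \frac{b\,\delta_a - a\,\delta_b}{b\,(b + \delta_b)}
  = \frac{\delta_a}{b} - \frac{a}{b^{2}}\,\delta_b
    + O\bigl(\delta_a \delta_b + \delta_b^{2}\bigr).
```

Taking a = 𝒜_num, b = 𝒜_den, δ_a = n^{-1/2} Z_num and δ_b = n^{-1/2} Z_den, with Z_num and Z_den both O_p(1), makes the remainder O_p(n^{-1}) and hence o_p(n^{-1/2}).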

Thus,

ℋ_{n} (β, Θ) = n^{- 3 / 2} (\sum_{i = 1}^{n} \sum_{j = 1}^{n} \frac{G_{num} {R_{i} (β), X_{j}, D_{j}, Θ} - 𝒜_{num} {R_{i} (β), Θ}}{𝒜_{den} {R_{i} (β), Θ}} - \sum_{i = 1}^{n} \sum_{j = 1}^{n} \frac{𝒜_{num} {R_{i} (β), Θ}}{𝒜_{den}^{2} {R_{i} (β), Θ}} [G_{den} {R_{i} (β), X_{j}, D_{j}, Θ} - 𝒜_{den} {R_{i} (β), Θ}]) + o_{p} (1) = ℬ_{n} (β, Θ) + o_{p} (1) .

By definition, E{ℬ_n(β, Θ)|D₁, …, D_n} = 0. By the definition of W{R_i(β), X_j, D_j, Θ},

ℬ_{n} (β, Θ) = n^{- 3 / 2} \sum_{i = 1}^{n} \sum_{j = 1}^{n} W {R_{i} (β), X_{j}, D_{j}, Θ} .

Without loss of generality, we can make the first n₀ observations be the controls, and the last n − n₀ observations be the cases. Then,

ℬ_{n} (β, Θ) = n^{- 3 / 2} \sum_{i = 1}^{n_{0}} \sum_{j = 1}^{n_{0}} W {R_{i} (β), X_{j}, D_{j}, Θ} + n^{- 3 / 2} \sum_{i = n_{0} + 1}^{n} \sum_{j = n_{0} + 1}^{n} W {R_{i} (β), X_{j}, D_{j}, Θ} + n^{- 3 / 2} \sum_{i = n_{0} + 1}^{n} \sum_{j = 1}^{n_{0}} W {R_{i} (β), X_{j}, D_{j}, Θ} + n^{- 3 / 2} \sum_{i = 1}^{n_{0}} \sum_{j = n_{0} + 1}^{n} W {R_{i} (β), X_{j}, D_{j}, Θ} = n^{- 3 / 2} \sum_{i = 1}^{n_{0}} \sum_{j = 1}^{i - 1} Q_{1} {{Z̃}_{i} (β), {Z̃}_{j} (β), Θ} + n^{- 3 / 2} \sum_{i = n_{0} + 1}^{n} \sum_{j = n_{0} + 1}^{i - 1} Q_{1} {{Z̃}_{i} (β), {Z̃}_{j} (β), Θ} + n^{- 3 / 2} \sum_{i = n_{0} + 1}^{n} \sum_{j = 1}^{n_{0}} W {R_{i} (β), X_{j}, D_{j}, Θ} + n^{- 3 / 2} \sum_{i = 1}^{n_{0}} \sum_{j = n_{0} + 1}^{n} W {R_{i} (β), X_{j}, D_{j}, Θ} + o_{p} (1) .

An easy calculation shows that

var [n^{- 3 / 2} \sum_{i = n_{0} + 1}^{n} \sum_{j = 1}^{n_{0}} W {R_{i} (β), X_{j}, D_{j}, Θ} - n_{1} n^{- 3 / 2} \sum_{j = 1}^{n_{0}} Q_{21} {{Z̃}_{j} (β), β, Θ}] \to 0,

and similarly

var [n^{- 3 / 2} \sum_{i = 1}^{n_{0}} \sum_{j = n_{0} + 1}^{n} W {R_{i} (β), X_{j}, D_{j}, Θ} - n_{0} n^{- 3 / 2} \sum_{j = n_{0} + 1}^{n} Q_{20} {{Z̃}_{j} (β), β, Θ}] \to 0 .

Hence we have shown that

ℬ_{n} (β, Θ) = {(\frac{n_{0}}{n})}^{3 / 2} n_{0}^{- 3 / 2} \sum_{i = 1}^{n_{0}} \sum_{j = 1}^{i - 1} Q_{1} {{Z̃}_{i} (β), {Z̃}_{j} (β), Θ} + {(\frac{n_{1}}{n})}^{3 / 2} n_{1}^{- 3 / 2} \sum_{i = n_{0} + 1}^{n} \sum_{j = n_{0} + 1}^{i - 1} Q_{1} {{Z̃}_{i} (β), {Z̃}_{j} (β), Θ} + n_{1} n^{- 3 / 2} \sum_{i = 1}^{n_{0}} Q_{21} {{Z̃}_{i} (β), β, Θ} + n_{0} n^{- 3 / 2} \sum_{i = n_{0} + 1}^{n} Q_{20} {{Z̃}_{i} (β), β, Θ} + o_{p} (1) .

Except for the factor (n₀/n)^{3/2}, the first term above is a classical symmetric U-statistic of order 2 applied to independent and identically distributed observations, since by convention the first n₀ observations are the controls. It then follows from standard U-statistic theory that (see, for example, Van der Vaart (1998))

ℬ_{n} (β, Θ) = {(\frac{n_{0}}{n})}^{3 / 2} n_{0}^{- 1 / 2} \sum_{i = 1}^{n_{0}} h_{10} {{Z̃}_{i} (β), β, Θ} + {(\frac{n_{1}}{n})}^{3 / 2} n_{1}^{- 1 / 2} \sum_{i = n_{0} + 1}^{n} h_{11} {{Z̃}_{i} (β), β, Θ} + n_{1} n^{- 3 / 2} \sum_{i = 1}^{n_{0}} Q_{21} {{Z̃}_{i} (β), β, Θ} + n_{0} n^{- 3 / 2} \sum_{i = n_{0} + 1}^{n} Q_{20} {{Z̃}_{i} (β), β, Θ} + o_{p} (1) = n^{- 1 / 2} \sum_{i = 1}^{n} h_{2} {R_{i} (β), X_{i}, D_{i}, Θ} + o_{p} (1) .

This completes the proof.

B.2.1. Proof of theorem 1

Because of the unbiasedness of the estimating function (13) and the fact that expression (14) is consistent and asymptotically normally distributed for α_true when evaluated at (β_true, Ω_true), the estimate is consistent for β_true, and α(β_true, Ω_true) = α_true. Set

𝒥 (R (β), X, β, Ω) = μ_{β} (X, β) - \frac{\int μ_{β} (x, β) 𝒦 {R (β), x, β, Ω} f_{X} (x) d x}{\int 𝒦 {R (β), x, β, Ω} f_{X} (x) d x},

c_{1 n} (β, Ω) = n^{- 1} \sum_{i = 1}^{n} 𝒥 {R_{i} (β), X_{i}, β, Ω},

c_{1} (β, Ω) = \sum_{d = 0}^{1} \frac{n_{d}}{n} E [𝒥 {R (β), X, β, Ω} | D = d] .

We use the fact that 0 = Q̂_n,est(β, Ω̂)|_β=β̂. By a Taylor series expansion and assumption 3,

0 = {Q̂}_{n, est} (β_{true}, Ω_{true}) + \frac{\partial}{\partial β^{T}} {n^{- 1 / 2} {Q̂}_{n, est} (β_{true}, Ω_{true})} n^{1 / 2} (β̂ - β_{true}) + \frac{\partial}{\partial Ω^{T}} {n^{- 1 / 2} {Q̂}_{n, est} (β_{true}, Ω_{true})} n^{1 / 2} (Ω̂ - Ω_{true}) + o_{p} (1) .

However, since α̂(β_true, Ω_true) is a consistent estimator for α_true, it is clear that we have that

n^{- 1 / 2} {\partial {Q̂}_{n, est} (β, Ω_{true}) / \partial β^{T}}_{β = β_{true}} = ℳ_{β} + o_{p} (1)

and

n^{- 1 / 2} {\partial {Q̂}_{n, est} (β_{true}, Ω) / \partial Ω^{T}}_{Ω = Ω_{true}} = ℳ_{Ω} + o_{p} (1) .

Hence it follows that

0 = {Q̂}_{n, est} (β_{true}, Ω_{true}) + ℳ_{β} n^{1 / 2} (β̂ - β_{true}) + ℳ_{Ω} n^{1 / 2} (Ω̂ - Ω_{true}) + o_{p} (1) .

Because of its form, another Taylor series expansion and under assumption 3,

{Q̂}_{n, est} (β_{true}, Ω_{true}) = {Q̂}_{n} (α_{true}, β_{true}, Ω_{true}) + c_{1} (β_{true}, Ω_{true}) n^{1 / 2} {α̂ (β_{true}, Ω_{true}) - α (β_{true}, Ω_{true})} + o_{p} (1) .

However, we can obtain by the same argument as in Appendix A.2 that c₁(β_true, Ω_true) = 0. In addition, using the same tools as in lemma 1, n^{1/2}{α̂(β_true, Ω_true) − α(β_true, Ω_true)} = O_p(1). We have thus shown that

n^{1 / 2} (β̂ - β_{true}) = - ℳ_{β}^{- 1} {{Q̂}_{n} (α_{true}, β_{true}, Ω_{true}) + ℳ_{Ω} n^{1 / 2} (Ω̂ - Ω_{true})} + o_{p} (1) .

(18)

Because (κ, θ₁) is estimated by ordinary logistic regression, and assumption 4 gives a representation for θ̂₀ − θ_0,true, it follows from standard theory that

n^{1 / 2} (Ω̂ - Ω_{true}) = n^{- 1 / 2} \sum_{i = 1}^{n} {(𝒩_{Ω} Φ (Y_{i}, X_{i}, D_{i}, Ω_{true}), Ψ (Y_{i}, X_{i}, D_{i}, Ω_{true}))}^{T} + o_{p} (1) .

We thus have from equation (18) that

n^{1 / 2} (β̂ - β_{true}) = - ℳ_{β}^{- 1} {{Q̂}_{n} (α_{true}, β_{true}, Ω_{true}) + ℳ_{Ω} n^{- 1 / 2} \sum_{i = 1}^{n} {(𝒩_{Ω} Φ (Y_{i}, X_{i}, D_{i}, Ω_{true}), Ψ (Y_{i}, X_{i}, D_{i}, Ω_{true}))}^{T}} + o_{p} (1) .

We can now apply lemma 1 to Q̂_n(α_true, β_true, Ω_true) with G_num(r, x, d, Θ) = L{r, x, α(β, Ω), β} 𝒦̃(r, x, d, Θ) and G_den(r, x, d, Θ) = 𝒦̃(r, x, d, Θ). Invoking lemma 1, it follows that

{Q̂}_{n} (α_{true}, β_{true}, Ω_{true}) = n^{- 1 / 2} \sum_{i = 1}^{n} 𝒯 {R_{i} (β_{true}), X_{i}, Θ_{true}, f_{X}} - n^{- 1 / 2} \sum_{i = 1}^{n} h_{2} {R_{i} (β_{true}), X_{i}, D_{i}, Θ_{true}} + o_{p} (1) .

We have shown in Appendix A.2 that the first term has mean 0. Remember from lemma 1 that E[h₂{R(β_true), X, D, Θ_true}|D] = 0. Moreover, the estimating equation for logistic regression is unbiased and assumption 4 ensures that E[Ψ(Y, X, D, Ω_true)|D] = 0. Summarizing, we have shown that

n^{1 / 2} (β̂ - β_{true}) = - ℳ_{β}^{- 1} n^{- 1 / 2} \sum_{i = 1}^{n} Λ (Y_{i}, X_{i}, D_{i}, Θ_{true}) + o_{p} (1), Λ (Y_{i}, X_{i}, D_{i}, Θ_{true}) = ℳ_{Ω} {(𝒩_{Ω} Φ (Y_{i}, X_{i}, D_{i}, Ω_{true}), Ψ (Y_{i}, X_{i}, D_{i}, Ω_{true}))}^{T} - h_{2} {R_{i} (β_{true}), X_{i}, D_{i}, Θ_{true}} + 𝒯 {R_{i} (β_{true}), X_{i}, Θ_{true}, f_{X}},

0 = E {Λ (Y, X, D, Θ_{true}) | D},

as claimed.

Footnotes

Supporting information

Additional ‘supporting information’ may be found in the on-line version of this article:

‘Supplemental material for Robust estimation for homoscedastic regression in the secondary analysis of case-control data’.

Contributor Information

Jiawei Wei, Texas A&M University, College Station, USA.

Raymond J. Carroll, Texas A&M University, College Station, USA.

Ursula U. Müller, Texas A&M University, College Station, USA.

Ingrid Van Keilegom, Université catholique de Louvain, Louvain-la-Neuve, Belgium, and Tilburg University, The Netherlands.

Nilanjan Chatterjee, National Cancer Institute, Rockville, USA.

References

Ahn J, Albanes D, Berndt SI, Peters U, Chatterjee N, Freedman ND, Abnet CC, Huang WY, Kibel AS, Crawford ED, Weinstein SJ, Chanock SJ, Schatzkin A, Hayes RB the Prostate, Lung, Colorectal and Ovarian Trial Project Team. Vitamin D-related genes, serum vitamin D concentrations and prostate cancer risk. Carcinogenesis. 2009;30:769–776. doi: 10.1093/carcin/bgp055. [DOI] [PMC free article] [PubMed] [Google Scholar]
Anderson R. Modern Methods for Robust Regression. New York: Sage; 2008. [Google Scholar]
Babu GJ, Singh K. Inference on means using the bootstrap. Ann. Statist. 1983;11:999–1003. [Google Scholar]
Buonaccorsi JP. Measurement Eror: Models, Methods and Applications. Boca Raton: Chapman and Hall; 2010. [Google Scholar]
Chatterjee N, Carroll RJ. Semiparametric maximum likelihood estimation in case-control studies of gene-environment interactions. Biometrika. 2005;92:399–418. [Google Scholar]
Chen Y-H, Carroll RJ, Chatterjee N. Retrospective analysis of haplotype-based case-control studies under a flexible model for gene-environment association. Biostatistics. 2008;9:81–99. doi: 10.1093/biostatistics/kxm011. [DOI] [PMC free article] [PubMed] [Google Scholar]
Chen Y-H, Chatterjee N, Carroll RJ. Shrinkage estimators for robust and efficient inference in haplotype-based case-control studies. J. Am. Statist. Ass. 2009;104:220–233. doi: 10.1198/jasa.2009.0104. [DOI] [PMC free article] [PubMed] [Google Scholar]
Chen X, Linton O, Van Keilegom I. Estimation of semiparametric models when the criterion function is not smooth. Econometrica. 2003;71:1591–1608. [Google Scholar]
Epstein M, Satten GA. Inference on haplotype effects in case-control studies using unphased genotype data. Am. J. Hum. Genet. 2003;73:1316–1329. doi: 10.1086/380204. [DOI] [PMC free article] [PubMed] [Google Scholar]
Hall P, Horowitz J. Bootstrap critical values for tests based on generalized-method-of-moments estimators. Econometrica. 1996;6:891–916. [Google Scholar]
Hu YJ, Lin DY, Zeng D. A general framework for studying genetic effects and gene–environment interactions with missing data. Biostatistics. 2010;11:583–598. doi: 10.1093/biostatistics/kxq015. [DOI] [PMC free article] [PubMed] [Google Scholar]
Huber PJ. Robust Statistics. New York: Wiley; 1981. [Google Scholar]
Jiang Y, Scott AJ, Wild CJ. Secondary analysis of case-control data. Statist. Med. 2006;25:1323–1339. doi: 10.1002/sim.2283. [DOI] [PubMed] [Google Scholar]
Kwee LC, Epstein MP, Manatunga AK, Duncan R, Allen AS, Satten GA. Simple methods for assessing haplotype-environment interactions in case-only and case-control studies. Genet. Epidem. 2007;31:75–90. doi: 10.1002/gepi.20192. [DOI] [PMC free article] [PubMed] [Google Scholar]
Lele S. Resampling using estimating equations. In: Godambe UP, editor. Estimating Functions. New York: Oxford University Press; 1991. pp. 295–304. [Google Scholar]
Li H, Gail MH, Berndt S, Chatterjee N. Using cases to strengthen inference on the association between single nucleotide polymorphisms and a secondary phenotype in genome-wide association studies. Genet. Epidem. 2010;34:427–433. doi: 10.1002/gepi.20495.
Lin DY, Zeng D. Likelihood-based inference on haplotype effects in genetic association studies (with discussion). J. Am. Statist. Ass. 2006;101:89–118.
Lin DY, Zeng D. Proper analysis of secondary phenotype data in case-control association studies. Genet. Epidem. 2009;33:256–265. doi: 10.1002/gepi.20377.
Modan MD, Hartge P, Hirsh-Yechezkel G, Chetrit A, Lubin F, Beller U, Ben-Baruch G, Fishman A, Menczer J, Struewing JP, Tucker MA, Wacholder S, for the National Israel Ovarian Cancer Study Group. Parity, oral contraceptives and the risk of ovarian cancer among carriers and noncarriers of a BRCA1 or BRCA2 mutation. New Engl. J. Med. 2001;345:235–240. doi: 10.1056/NEJM200107263450401.
Monsees G, Tamimi R, Kraft P. Genomewide association scans for secondary traits using case-control samples. Genet. Epidem. 2009;33:717–728. doi: 10.1002/gepi.20424.
Piegorsch WW, Weinberg CR, Taylor JA. Non-hierarchical logistic models and case-only designs for assessing susceptibility in population based case-control studies. Statist. Med. 1994;13:153–162. doi: 10.1002/sim.4780130206.
Prentice RL, Pyke R. Logistic disease incidence models and case-control studies. Biometrika. 1979;66:403–411.
Spinka C, Carroll RJ, Chatterjee N. Analysis of case-control studies of genetic and environmental factors with missing genetic information and haplotype-phase ambiguity. Genet. Epidem. 2005;29:108–127. doi: 10.1002/gepi.20085.
Van der Vaart AW. Asymptotic Statistics. Cambridge: Cambridge University Press; 1998.
Wang CY, Wang S, Carroll RJ. Estimation in choice-based sampling with measurement error and bootstrap analysis. J. Econometr. 1997;77:65–86.
Zhao LP, Li SS, Khalid N. A method for the assessment of disease associations with single-nucleotide polymorphism haplotypes and environmental variables in case-control studies. Am. J. Hum. Genet. 2003;72:1231–1250. doi: 10.1086/375140.

