A semiparametric efficient estimator in case-control studies for gene–environment independent models

Liang Liang; Yanyuan Ma; Raymond J Carroll

doi:10.1016/j.jmva.2019.01.006

Author manuscript; available in PMC: 2019 Nov 1.

Published in final edited form as: J Multivar Anal. 2019 Feb 8;173:38–50. doi: 10.1016/j.jmva.2019.01.006


PMCID: PMC6824552 NIHMSID: NIHMS1014120 PMID: 31680705

Abstract

Case-control studies are popular epidemiological designs for detecting gene–environment interactions in the etiology of complex diseases, where the genetic susceptibility and environmental exposures may often be reasonably assumed independent in the source population. Various papers have presented analytical methods exploiting gene–environment independence to achieve better efficiency, all of which require either a rare disease assumption or a distributional assumption on the genetic variables. We relax both assumptions. We construct a semiparametric estimator in case-control studies exploiting gene–environment independence, while the distributions of genetic susceptibility and environmental exposures are both unspecified and the disease rate is assumed unknown and is not required to be close to zero. The resulting estimator is semiparametric efficient and its superiority over prospective logistic regression, the usual analysis in case-control studies, is demonstrated in various numerical illustrations.

Keywords: Biased samples, Case-control study, Gene–environment independence, Gene–environment interaction, Semiparametric estimation

1. Introduction

The etiology of most complex diseases, such as cancers and cardiovascular diseases, is the joint effect of genetic susceptibility and environmental or non-genetic exposures, as well as their interactions. Even subtle differences in genetic factors between people, when exposed to the same environmental factors, can lead to dramatically different responses. In other words, people with certain genes may have a low risk of developing a disease whereas others may be more vulnerable when exposed to an identical environmental agent. One common example is that sunlight exposure results in higher risk of developing skin cancer among fair-skinned individuals than people with dark skin [17,22]. Studying gene–environment interactions is thus of great importance to understand disease mechanisms and develop new treatments and prevention strategies.

The case-control study design is commonly used to investigate the intricate interplay of genetic susceptibility and environment effects. It is cost-efficient and convenient to implement compared to a cohort study, especially when dealing with relatively rare diseases [6]. Instead of taking a random sample from the underlying source population, the case-control design randomly draws a fixed number of cases (diseased subjects) and a comparable number of controls (non-diseased subjects) from the respective case and control subpopulations. Genetic and environmental factors are measured and recorded for these sampled subjects. The standard approach for the analysis of such a case-control study is prospective logistic regression, which ignores the underlying retrospective nature of the case-control design. Cornfield [10] showed the equivalence of prospective and retrospective odds ratios, which validates the prospective approach. Prentice and Pyke [24] further showed that prospective logistic regression analysis gives an efficient estimator, in the sense that it yields the maximum likelihood estimates of the odds ratio parameters under a semiparametric model that allows an arbitrary covariate distribution.

Despite this, prospective logistic regression in a case-control study can still require a large sample size to obtain adequate statistical power for detecting gene–environment interactions or testing other hypotheses of interest. As a consequence, epidemiological researchers often exploit the potential efficiency gain from further assuming certain parametric or semiparametric structures for the covariate distribution. For example, in practice, a common assumption is that genetic susceptibility and environmental exposure are independent in the underlying source population [23], possibly given strata. Under such a model, prospective logistic regression analysis is still valid but may not be efficient because it ignores gene–environment independence.

A growing number of articles have been published in the last two decades, proposing analytical methods that exploit the gene–environment independence assumption [5,14,15,20,21,23]. Piegorsch et al. [23] showed that under gene–environment independence and a rare disease assumption, the multiplicative interaction odds-ratio parameter can be estimated by cases alone and the resulting estimator is more precise than the estimator from traditional prospective logistic regression analysis using both cases and controls. However, the misuse of a rare disease assumption in analyzing diseases with moderate prevalence or diseases with small marginal probability in the source population but high risk for certain combinations of genetic and environmental exposures can lead to considerable bias in the estimation. Noting this fact, Chatterjee and Carroll [5] developed a semiparametric maximum likelihood estimator employing the gene–environment independence assumption but not requiring any rare-disease assumption. Their approach leaves the distribution of the environmental exposures totally unspecified but restricts genetic susceptibility to have a discrete distribution that takes values in a finite and fixed set. Ma [20] proposed a semiparametric efficient estimator in the same setting as Chatterjee and Carroll [5] except the distribution of genetic susceptibility is allowed to be either discrete or continuous with a finite-dimensional parameter. The key ingredient of this approach is to construct a hypothetical population with infinite population size and a disease to non-disease ratio of n₁/n₀, where n₁ and n₀ are the numbers of cases and controls in the case-control sample. Section 2 of Ma [20] showed that the case-control sample can be viewed as a random sample of size n = n₀ + n₁ of independent and identically distributed observations from this hypothetical population, and hence classical semiparametric analysis is applicable. The validity and usefulness of such a hypothetical population was established in Ma [20]. Instead of assuming independence of gene and environment, there is a literature based on parametric modeling of the relationship between them [8,9,18,19]: we make no such parametric assumptions.

In this paper, we consider a more general setting which keeps the gene–environment independence assumption, while further allowing an unknown disease rate and completely nonparametric distributions for both the genetic susceptibility and the environmental exposure. Under such a model setting, we adopt the hypothetical population framework of Ma [20] and derive the semiparametric efficient estimator by employing a semiparametric approach, which links the efficient estimator with the efficient score function. Throughout our work, the underlying source population is referred to as the true population to emphasize the difference between the underlying source population and the hypothetical population. The inherent connection between the two populations allows us to transport parameter estimation and inference results derived in the hypothetical population directly to those in the true population, see Theorem 1. Although general semiparametric theory applies in the hypothetical population framework, computing the efficient estimator in this context is technically challenging because the efficient score does not have an explicit form and must be solved from an integral equation. We adopt a simple numerical approach to solve the integral equation by discretizing the distribution of the genetic susceptibility when it is continuous. The resulting estimator, when properly implemented, is asymptotically linear with optimal efficiency.

The rest of the paper is organized as follows. The specific model and the hypothetical population framework are presented in Section 2, with the corresponding identifiability conditions provided in Appendix A.1. In Section 3, we formulate the problem by using a conventional semiparametric approach. The analytic expression of our semiparametric efficient estimator as well as its detailed implementation are discussed in this section. Section 4 illustrates the asymptotic properties of the resulting estimator. Several simulation studies are conducted in Section 5 to demonstrate the numerical performance of our semiparametric efficient estimator compared with prospective logistic regression. A real data analysis is provided in Section 6, followed by a brief discussion in Section 7. Technical details and proofs are given in an Appendix and in the Online Supplement.

2. Model and framework

2.1. Background

It is useful to describe how the methods, referenced in Section 1, for exploiting a genetic-environmental relationship in an underlying source population have evolved from the earlier work, a relatively simple case in Chatterjee and Carroll [5], which includes the following key ingredients:

An underlying logistic regression for disease D as a function of genetic variables G and environmental exposures X.
A parametric distribution assumption for G in the source population when G and X are independent.
Writing out the retrospective likelihood of the observed case-control data.
A profile likelihood argument that estimates the distribution of X in the source population using a Lagrange multiplier argument that places probability mass at each observed value of X. This leads to a pseudolikelihood that involves the distribution of G but not the distribution of X.
The main technical difficulty is carrying out the algebra of the Lagrange multiplier argument and getting an explicit pseudolikelihood, where by explicit we mean that the resulting formula requires no numerical solutions to nonlinear equations.

In our case, however, we are not making the assumption of a parametric distribution for G in the source population. A profile likelihood method to remove the distribution of G and get a new, explicit, profile likelihood based on a Lagrange multiplier argument does not appear to be possible, or at least it seems to be very difficult, because of the form of the pseudolikelihood.

To overcome these difficulties, there have been two main alternatives, and they are both based on the idea of relating the case-control study to some version of a prospective random sampling framework to derive a methodology, and to then show that this methodology is valid in the case-control study. Recall that n₀ is the number of controls in the sample and n₁ the number of cases. Define $π_{d} = \Pr (D = d)$ .

I. In Section 2.3.3 of [9], Chen et al. treat the case-control study as if it were a random sample from the source population but with data missing at random. They propose a prospective sampling scenario where each subject from the source population is observed with probability $1 / {1 + (n_{1 - d} π_{d}) / (n_{d} π_{1 - d})}$ , where d = 1 for cases and d = 0 for controls, respectively. They show that performing a missing data analysis for the distribution of (D, G) given X and the probability that the subject is observed yields the same pseudolikelihood as other papers have computed, but without having to do the Lagrange argument, and in a much easier way.
II. Ma [20] takes an entirely different approach, also without having to do the Lagrange argument. This approach, which she calls a hypothetical population approach, differs from that of Chen et al. [9] in that she aims to create a likelihood that (a) is equivalent to that of the case-control sample; and (b) is that of a simple random sample of size $n = n_{0} + n_{1}$ from a hypothetical population. Because it is a random sample, rather than a sample with missing data, when we use it this allows us to rely on the classic machinery of semiparametric methods as exemplified by Bickel et al. [4] and Tsiatis [27].
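The sampling probability quoted in approach I can be checked numerically. The sketch below is only an illustration: the function name `observation_prob` and the numeric inputs (717 controls, 690 cases, a 4.5% disease rate) are assumptions chosen for the example, not values used by Chen et al. [9]. It verifies that, when each source-population subject with D = d is observed with the stated probability, the expected ratio of observed cases to observed controls equals n₁/n₀.

```python
def observation_prob(d, n0, n1, pi1):
    """Probability that a source-population subject with D = d is observed:
    1 / {1 + (n_{1-d} * pi_d) / (n_d * pi_{1-d})}."""
    n = {0: n0, 1: n1}
    pi = {0: 1.0 - pi1, 1: pi1}
    return 1.0 / (1.0 + (n[1 - d] * pi[d]) / (n[d] * pi[1 - d]))

# Hypothetical inputs, chosen only for illustration.
n0, n1, pi1 = 717, 690, 0.045
p_case = observation_prob(1, n0, n1, pi1)
p_ctrl = observation_prob(0, n0, n1, pi1)

# Expected ratio of observed cases to observed controls.
ratio = (pi1 * p_case) / ((1.0 - pi1) * p_ctrl)
print(p_case, p_ctrl, ratio, n1 / n0)
```

For a rare disease, cases are observed with probability close to one and controls with a much smaller probability, which is the missing-data reading of case-control sampling.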

2.2. Basic calculations and likelihood

Assume that the prospective risk given the covariates (G, X) follows a logistic model, viz.

\Pr (D = d | G = g, X = x) = f_{D | G, X}^{true} (d, g, x) = H (d, g, x, θ) = \frac{\exp [d {α + m (g, x, β)}]}{1 + \exp {α + m (g, x, β)}},

(1)

where $θ = {(β^{⊤}, α)}^{⊤}$ and m is a function known up to the parameter β. Here and throughout the text, the superscript “true” is used to emphasize that those quantities are related to the true source population. In addition, in the true population, G and X are assumed to be independent so that the joint probability density/mass function of G, X can be written as $f_{G, X}^{true} (g, x) = f_{G}^{true} (g) f_{X}^{true} (x) = η_{1} (g) η_{2} (x)$ . Here, for notational simplicity, we write ${f_{G}^{true} (g), f_{X}^{true} (x)}$ as ${η_{1} (g), η_{2} (x)}$ . The problem stated above is identifiable in the case-control study under mild conditions, which are given in Appendix A.1, along with the proof of identifiability.

The hypothetical population study joint density/mass function of (D, G, X) is

f_{D, G, X} (d, g, x, θ, η_{1}, η_{2}) = (n_{d} / n) f_{G, X | D} (d, g, x) = (n_{d} / n) f_{G, X | D}^{true} (d, g, x) = \frac{n_{d}}{n} \frac{f_{G}^{true} (g) f_{X}^{true} (x) f_{D | G, X}^{true} (d, g, x, θ)}{\int f_{G}^{true} (g) f_{X}^{true} (x) f_{D | G, X}^{true} (d, g, x, θ) d μ (x) d μ (g)} = \frac{n_{d} η_{1} (g) η_{2} (x) H (d, g, x, θ)}{n \int η_{1} (g) η_{2} (x) H (d, g, x, θ) d μ (x) d μ (g)} = \frac{n_{d}}{n π_{d}} η_{1} (g) η_{2} (x) H (d, g, x, θ),

(2)

where

π_{d} = \int η_{1} (g) η_{2} (x) H (d, g, x, θ) d μ (x) d μ (g) .

(3)
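As a concrete check of (2) and (3), the following sketch builds the hypothetical-population mass function for a small discrete example. All numeric values, the grids for G and X, and the interaction form chosen for m(g, x, β) are assumptions made for illustration. The point is that the distorted density integrates to one and that the disease proportion in the hypothetical population is n₁/n, whatever the true π_d.

```python
import numpy as np

def H(d, g, x, alpha, beta):
    """Logistic model (1), assuming m(g, x, beta) = b1*g + b2*x + b3*g*x."""
    lin = alpha + beta[0] * g + beta[1] * x + beta[2] * g * x
    p1 = 1.0 / (1.0 + np.exp(-lin))
    return p1 if d == 1 else 1.0 - p1

# Illustrative discrete distributions eta1 (for G) and eta2 (for X).
g_vals, eta1 = np.array([0.0, 1.0]), np.array([0.4, 0.6])
x_vals, eta2 = np.array([-1.0, 0.0, 1.0]), np.array([0.25, 0.5, 0.25])
alpha, beta = -3.0, (0.76, 0.36, -0.63)
n0, n1 = 1000, 1000
n = n0 + n1

G, X = np.meshgrid(g_vals, x_vals, indexing="ij")
weights = eta1[:, None] * eta2[None, :]          # eta1(g) * eta2(x)

# Equation (3): pi_d is the sum of eta1 * eta2 * H over (g, x).
pi = np.array([(weights * H(d, G, X, alpha, beta)).sum() for d in (0, 1)])

# Equation (2): f(d, g, x) = n_d / (n * pi_d) * eta1(g) * eta2(x) * H(d, g, x).
f_hyp = np.stack([
    n_d / (n * pi[d]) * weights * H(d, G, X, alpha, beta)
    for d, n_d in enumerate((n0, n1))
])

print("pi_d:", pi)
print("total mass:", f_hyp.sum())
print("P(D = 1) in the hypothetical population:", f_hyp[1].sum())
```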

We consider η = {η₁, η₂} as the infinite-dimensional nuisance parameter. The approach of Ma [20] views this as a semiparametric problem, to be solved using techniques explained in Bickel et al. [4] and Tsiatis [27]. Here, the concept of hypothetical population and the corresponding distorted likelihood is used as a vehicle that allows us to apply the semiparametric tools directly. It enables us to construct consistent estimators without having to worry about the non-random sampling issue in a case-control study. Because the non-random sampling issue is already taken into account when we formulate the distorted likelihood, the resulting estimator is automatically consistent under the original case-control sampling framework, that is, if the case-control sample size grows to infinity while retaining the relative sample proportion of n₁/n₀, the estimator will converge to the true parameter value. We formally write out this result in Theorem 1.

Theorem 1. Assume (d₁, g₁, x₁), …, (d_n, g_n, x_n) is a case-control sample with n₁ cases, n₀ controls, and with disease model (1) and independence of X and G. Assume $({\tilde{d}}_{1}, {\tilde{g}}_{1}, {\tilde{x}}_{1}), \dots, ({\tilde{d}}_{n}, {\tilde{g}}_{n}, {\tilde{x}}_{n})$ is a random sample of independent and identically distributed observations with size n from model (2). Then, if $\hat{θ} {({\tilde{d}}_{1}, {\tilde{g}}_{1}, {\tilde{x}}_{1}), \dots, ({\tilde{d}}_{n}, {\tilde{g}}_{n}, {\tilde{x}}_{n})}$ is a $\sqrt{n}$ -consistent regular asymptotically linear estimator of θ and satisfies $E [\hat{θ} {({\tilde{d}}_{1}, {\tilde{g}}_{1}, {\tilde{x}}_{1}), \dots, ({\tilde{d}}_{n}, {\tilde{g}}_{n}, {\tilde{x}}_{n})} | D] - θ = o_{p} (n^{- 1 / 2})$ , then so is $\hat{θ} {(d_{1}, g_{1}, x_{1}), \dots, (d_{n}, g_{n}, x_{n})}$ .

Theorem 1 essentially says that if we can develop a $\sqrt{n}$ -consistent estimator based on a random sample from model (2), then we can simply apply this estimation procedure to the case-control sample and we will still get a $\sqrt{n}$ -consistent estimator. The proof of Theorem 1 is the entire content of Section 2 of Ma [20]. We take advantage of this property to generate an estimation procedure, which we will then show consistently estimates the parameters when using the case-control data. In particular, the procedure is not dependent on the hypothetical population study formalism.

3. Analytic derivations: Efficient score and algorithm

The outline of the semiparametric approach is to first construct a Hilbert space $H$ , consisting of all measurable functions with mean zero and finite variance. We next decompose $H$ into nuisance tangent space $Λ$ and its orthogonal complement $Λ^{⊥}$ . The efficient estimator can then be obtained by solving

\sum_{i = 1}^{n} S_{eff} (D_{i}, G_{i}, X_{i}; θ) = 0,

where S_eff is the projection of the score function S_θ onto $Λ^{⊥}$ , and thus S_eff is called efficient score function.

Careful calculation shows that the score function under the hypothetical population (2) takes the form $S_{θ} (d, g, x) = S (d, g, x) - E (S | d)$ where $S = {d - H (1, g, x, θ)} {m_{β}^{'} {(g, x, β)}^{⊤}, 1}^{⊤}$ and $m_{β}^{'} (g, x, β) \equiv \partial m (g, x, β) / \partial β$ . Let p denote the dimension of θ. The final forms of the spaces $Λ$ and $Λ^{⊥}$ are listed below, with the detailed derivation provided in Appendix A.2. Specifically,

Λ = {a_{1} (G) + a_{2} (X) - E {a_{1} (G) + a_{2} (X) | D} for all a_{1} (G), a_{2} (X)},

Λ^{⊥} = {f (D, G, X) : E (f | G) = E {E (f | D) | G}, E (f | X) = E {E (f | D) | X}, E (f) = 0} .

Define $S_{x} (x) = E (S_{θ} | x) = E (S | x) - E {E (S | D) | x}$ and $S_{g} (g) = E (S_{θ} | g) = E (S | g) - E {E (S | D) | g}$ . Projecting the score function onto Λ^⊥ shows that

S_{eff} (d, g, x) = S (d, g, x) - a (g) - b (x) - E {S (d, G, X) | d} + E {a (G) + b (X) | d},

where

E {a (G) | x} + b (x) - E {E (a + b | D) | x} = S_{x} (x),

(4)

a (g) + E {b (X) | g} - E {E (a + b | D) | g} = S_{g} (g) .

(5)

It is easy to check that $E {S_{eff} (d, G_{i}, X_{i}) | d} = 0$ .

In order to obtain the efficient score function, we need to solve a and b from the integral equations (4) and (5). The existence of the solution is automatically guaranteed by the identifiability of the problem, whereas the uniqueness is not. However, it is shown in Appendix A.3 that a and b are unique up to constant shifts. Thus, (4) and (5) have a unique solution under the constraints $E (a) = E (b) = 0$ . It is further proved in Appendix A.4 that, under the mean zero constraint, (4) and (5) have an equivalent expression, which is given by Eqs. (A.1)–(A.3), in the Appendix. Such an equivalent expression allows us to separate a and b by introducing an intermediate variable $u_{0} = E (a + b | D = 0)$ . However, there is no explicit expression for a and b. We still need to solve the integral equation (A.1). In Appendix A.5, we propose an approximation to its solution in the spirit of Tsiatis and Ma [28], by discretizing X if X is continuous.
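When both G and X are discrete, equations (4) and (5) reduce to a finite linear system, which gives a simple way to see the structure of the problem. The sketch below is a toy illustration under stated assumptions and is not the approximation of Appendix A.5: the grids, probabilities, parameter values and the interaction form of m are all hypothetical, only the intercept component of the score is used, and the population quantities are treated as known rather than estimated. Because a and b are unique only up to constant shifts, a minimum-norm least-squares solution is taken; the resulting S_eff does not depend on that choice.

```python
import numpy as np

# Toy setup: every numeric value below is an illustrative assumption.
g_vals, eta1 = np.array([0.0, 1.0]), np.array([0.4, 0.6])
x_vals, eta2 = np.array([-1.0, 0.0, 1.0]), np.array([0.3, 0.4, 0.3])
alpha, beta = -2.0, (0.8, 0.4, -0.6)
n_d = np.array([1000.0, 1000.0])                 # (n0, n1)
nG, nX = len(g_vals), len(x_vals)

G, X = np.meshgrid(g_vals, x_vals, indexing="ij")
p1 = 1.0 / (1.0 + np.exp(-(alpha + beta[0] * G + beta[1] * X + beta[2] * G * X)))
Hd = np.stack([1.0 - p1, p1])                    # H(d, g, x, theta), indexed [d, g, x]
w = eta1[:, None] * eta2[None, :]
pi = (Hd * w).sum(axis=(1, 2))                   # equation (3)
f = (n_d / (n_d.sum() * pi))[:, None, None] * w * Hd   # equation (2)

# Marginals and conditional-probability matrices under f.
fD, fG, fX = f.sum(axis=(1, 2)), f.sum(axis=(0, 2)), f.sum(axis=(0, 1))
P_g_x = (f.sum(axis=0) / fX[None, :]).T          # [x, g] = P(g | x)
P_x_g = f.sum(axis=0) / fG[:, None]              # [g, x] = P(x | g)
P_g_d = f.sum(axis=2) / fD[:, None]              # [d, g] = P(g | d)
P_x_d = f.sum(axis=1) / fD[:, None]              # [d, x] = P(x | d)
P_d_x = (f.sum(axis=1) / fX[None, :]).T          # [x, d] = P(d | x)
P_d_g = (f.sum(axis=2) / fG[None, :]).T          # [g, d] = P(d | g)

# Intercept component: S = d - H(1, g, x, theta), S_theta = S - E(S | d).
S = np.stack([-p1, 1.0 - p1])
S_theta = S - ((f * S).sum(axis=(1, 2)) / fD)[:, None, None]
S_x = (f * S_theta).sum(axis=(0, 1)) / fX
S_g = (f * S_theta).sum(axis=(0, 2)) / fG

# Equations (4) and (5) as one linear system in z = (a, b).
A = np.block([
    [P_g_x - P_d_x @ P_g_d, np.eye(nX) - P_d_x @ P_x_d],   # equation (4)
    [np.eye(nG) - P_d_g @ P_g_d, P_x_g - P_d_g @ P_x_d],   # equation (5)
])
rhs = np.concatenate([S_x, S_g])
z, *_ = np.linalg.lstsq(A, rhs, rcond=None)      # minimum-norm solution
a, b = z[:nG], z[nG:]

E_ab_d = P_g_d @ a + P_x_d @ b                   # E(a + b | D = d)
S_eff = S_theta - a[None, :, None] - b[None, None, :] + E_ab_d[:, None, None]

print("max residual of (4)-(5):", np.abs(A @ z - rhs).max())
print("E(S_eff | g):", (f * S_eff).sum(axis=(0, 2)) / fG)
print("E(S_eff | x):", (f * S_eff).sum(axis=(0, 1)) / fX)
```

The printed conditional means of S_eff given G and given X should be numerically zero, which, together with E(S_eff | D) = 0, is the membership condition for Λ^⊥.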

The detailed algorithm for constructing the efficient score function and computing the efficient estimator for θ is given in Algorithm 1, where the disease rate is estimated during the procedure. Usually, the disease prevalence is not identifiable from a case-control sample [24]. However, the additional assumption we make on the relationship between G and X in the source population, i.e., gene–environment independence, leads to the technical identifiability [5,20].

Algorithm 1

1. Estimate $f_{X | D = d}$ , the conditional density/mass function of X given disease status D = d, by nonparametric kernel density estimation among the data with D_i = d for d ∈ {0, 1}:
${\hat{f}}_{X | D = d} (x) = \frac{1}{n_{d} h} \sum_{i, D_{i} = d} K {(X_{i} - x) / h},$
for continuous X, and
${\hat{f}}_{X | D = d} (x) = \frac{1}{n_{d}} \sum_{i, D_{i} = d} 1 (X_{i} = x),$
for discrete X, where K is a univariate kernel function.
2. Estimate $f_{G | D = d}$ , the conditional density/mass function of G given disease status D = d, in the same way as for X, using the data with D_i = d for d ∈ {0, 1}. Denote the result by ${\hat{f}}_{G | D}$ .
3. Define ${\hat{η}}_{1} (g, π_{0}) = π_{0} {\hat{f}}_{G | D = 0} (g) + (1 - π_{0}) {\hat{f}}_{G | D = 1} (g)$ and ${\hat{η}}_{2} (x, π_{0}) = π_{0} {\hat{f}}_{X | D = 0} (x) + (1 - π_{0}) {\hat{f}}_{X | D = 1} (x)$ , which we call weighted nonparametric density/mass function estimates, the weights being the (estimated) population probabilities.
4. When (π₀, π₁) is unknown, estimate it by solving the integral equation
$π_{0} = \int H (0, g, x, θ) {\hat{η}}_{1} (g, π_{0}) {\hat{η}}_{2} (x, π_{0}) d μ (g) d μ (x),$
and setting ${\hat{π}}_{1} = 1 - {\hat{π}}_{0}, {\hat{η}}_{1} (g) = {\hat{η}}_{1} (g, {\hat{π}}_{0}), {\hat{η}}_{2} (x) = {\hat{η}}_{2} (x, {\hat{π}}_{0}) .$
5. Follow the method described in Appendix A.5 to obtain the solution of the integral equations (4) and (5), with result $\hat{a}, \hat{b}$ , and approximate $E (\hat{a} + \hat{b} | D)$ using the nonparametric density estimates ${\hat{f}}_{X | D}$ and ${\hat{f}}_{G | D}$ , with result $\hat{E} (\hat{a} + \hat{b} | D)$ .
6. Form ${\hat{S}}_{eff} (D_{i}, G_{i}, X_{i}, θ) = {\hat{S}}_{θ} (D_{i}, G_{i}, X_{i}) - \hat{a} (G_{i}) - \hat{b} (X_{i}) + \hat{E} {\hat{a} (G_{i}) + \hat{b} (X_{i}) | D_{i}}$ , and estimate θ by solving the estimating equation

\sum_{i = 1}^{n} {\hat{S}}_{eff} (D_{i}, G_{i}, X_{i}, θ) = 0 .

(6)
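The first four steps of Algorithm 1 can be sketched in a few lines for binary G and continuous X. This is a minimal illustration under several assumptions that are not part of the paper: θ is held at the data-generating value (in the algorithm it is the current value of the parameter), the source population is simulated directly, an Epanechnikov kernel is used because it has support (–1, 1), the bandwidth is n_d^{-1/5}, and the integral over x is approximated by a Riemann sum on a fixed grid.

```python
import numpy as np
from scipy.optimize import brentq

rng = np.random.default_rng(0)
alpha, beta = -3.61, (0.76, 0.36, -0.63)   # binary G, normal X, about 4.5% disease rate

def H1(g, x):
    """Pr(D = 1 | g, x) under the logistic model with interaction."""
    return 1.0 / (1.0 + np.exp(-(alpha + beta[0] * g + beta[1] * x + beta[2] * g * x)))

# Simulate a source population, then draw 1000 cases and 1000 controls.
N = 200_000
G = rng.binomial(1, 0.6, N)
X = rng.standard_normal(N)
D = rng.binomial(1, H1(G, X))
idx = {d: rng.choice(np.flatnonzero(D == d), 1000, replace=False) for d in (0, 1)}

def epanechnikov(u):
    """Kernel with support (-1, 1)."""
    return 0.75 * np.clip(1.0 - u**2, 0.0, None)

grid = np.linspace(-6.0, 6.0, 1201)
dx = grid[1] - grid[0]

def kde(sample, h):
    """Step 1: kernel estimate of f_{X|D=d} evaluated on the grid."""
    return epanechnikov((sample[None, :] - grid[:, None]) / h).mean(axis=1) / h

fX = {d: kde(X[idx[d]], h=1000 ** (-0.2)) for d in (0, 1)}                       # Step 1
fG = {d: np.array([np.mean(G[idx[d]] == g) for g in (0, 1)]) for d in (0, 1)}    # Step 2

def gap(pi0):
    """Step 4: pi0 minus the integral of H(0, g, x) * eta1_hat * eta2_hat."""
    eta1 = pi0 * fG[0] + (1.0 - pi0) * fG[1]      # Step 3
    eta2 = pi0 * fX[0] + (1.0 - pi0) * fX[1]
    integral = sum(eta1[g] * np.sum((1.0 - H1(g, grid)) * eta2) * dx for g in (0, 1))
    return pi0 - integral

pi0_hat = brentq(gap, 0.0, 1.0)
print("estimated disease rate:", 1.0 - pi0_hat)
```

The function `gap` is negative at π₀ = 0 and positive at π₀ = 1, so a bracketing root finder on [0, 1] is a natural choice here; the paper does not say which solver was used.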

It is critical that we estimate $E {\hat{a} (G_{i}) + \hat{b} (X_{i}) | D_{i}}$ and $E (S | D_{i})$ involved in Steps 5 and 6 using ${\hat{f}}_{X | D}$ and ${\hat{f}}_{G | D}$ described in Steps 1 and 2 of the above algorithm, instead of simply taking a sample version of the expectations. This ensures that all the conditional expectations are computed using the same kind of approximation and the gene–environment independence assumption is fully employed.

4. Distribution theory

It is not surprising that the semiparametric estimator described in Algorithm 1 is asymptotically normal with a parametric convergence rate and optimal efficiency as it is formed by estimating all conditional expectations in the efficient score nonparametrically. The asymptotic properties of our estimator are described in Theorem 2 under regularity conditions C1–C2 listed below. The proof is provided in the Online Supplement.

C1 The univariate kernel function K has support (–1, 1) and satisfies $\int K (u) u d u = 0$ , $\int K (u) u^{2} d u < \infty$ . The bandwidth h satisfies $n h^{2} \to \infty$ and $n h^{8} \to 0$ .

C2 Any discrete covariate has finitely many levels. Any continuous covariate has compact support and its density function is twice continuously differentiable.

Theorem 2. Under the regularity conditions C1 and C2, the estimator $\hat{θ}$ obtained by solving the estimating equation (6) is asymptotically normal with optimal efficiency, i.e., $\sqrt{n} (\hat{θ} - θ) \to N [0, {\mathrm{var} (S_{eff})}^{- 1}]$ in distribution, and is semiparametric efficient.
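Theorem 2 suggests a plug-in standard error: replace var(S_eff) by the sample second moment of the estimated efficient scores, invert, and divide by n. The helper below is a sketch of that idea; the function name `se_from_scores` is hypothetical, and the text does not state that the reported "est se" values were computed exactly this way.

```python
import numpy as np

def se_from_scores(scores):
    """Plug-in standard errors: sqrt of the diagonal of {mean(S S^T)}^{-1} / n.

    `scores` is an (n, p) array whose i-th row is the estimated efficient
    score for observation i, evaluated at the fitted parameter.
    """
    scores = np.asarray(scores, dtype=float)
    n = scores.shape[0]
    info = scores.T @ scores / n          # sample estimate of var(S_eff)
    cov = np.linalg.inv(info) / n         # estimated covariance of theta_hat
    return np.sqrt(np.diag(cov))

# Tiny made-up score matrix, used only to exercise the function.
toy = np.array([[2.0, 0.0], [-2.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
print(se_from_scores(toy))
```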

5. Simulation study

We performed simulations to understand the finite sample performance of the semiparametric efficient estimator described in Section 3 and to demonstrate its superiority to the prospective logistic regression method under the gene–environment independent model. Two scenarios are considered: (a) $\Pr (D = 1) = 0.045$ and (b) $\Pr (D = 1) = 0.10$ , corresponding to a relatively rare disease and a common disease, respectively. In each scenario, we generated X from the standard normal distribution $N (0, 1)$ or the Gamma distribution with mean 20 and variance 20, $G (20, 1)$ , while the distribution of G is one of the following: (i) Bernoulli with success probability 0.6, $B (0.6)$ , where, for example, G = 1 or G = 0 corresponds to the presence or absence of a genetic mutation, and (ii) $N (0, 1)$ , which can be used to model gene expression levels or continuous traits, such as height and skin color, that are controlled by several genes. Given G and X, we generated disease status D from the logistic regression model

\Pr (D = 1 | G, X) = 1 / [1 + \exp {- (α + β_{1} G + β_{2} X + β_{3} G X)}],

where $β = {(β_{1}, β_{2}, β_{3})}^{⊤} = {(0.76, 0.36, - 0.63)}^{⊤}$ for both settings with normal X, and $β = {(β_{1}, β_{2}, β_{3})}^{⊤} = {(3.577, 0.080, - 0.141)}^{⊤}$ for both settings with Gamma X. We varied the intercept α in different simulations to get the desired disease rate. Specifically, in the case of $X ∼ N (0, 1)$ , we set α = −3.61 and −3.465 for binary G and normal G respectively to achieve a disease rate of 4.5%, and we set α = −2.74 and −2.538 for binary G and normal G respectively to achieve a disease rate of 10%. In the case of $X ∼ G (20, 1)$ , we set α = −5.220 and −5.086 for binary G and normal G respectively to achieve a disease rate of 4.5%, and we set α = −4.352 and −4.158 for binary G and normal G respectively to achieve a disease rate of 10%. For each setting, we simulated 1000 data sets, each with n₁ = 1000 cases and n₀ = 1000 controls. The details of simulating the case-control data are provided in the Online Supplement. In the computation of the weighted nonparametric density/mass function estimates defined in Algorithm 1, we used the asymptotically justified bandwidth $h = c n^{- 1 / 5}$ , where $c \in [0.4, 1.2]$ , and the results were insensitive to the choice of c.
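The intercept values above can be reproduced by solving Pr(D = 1) = target for α. The following sketch does this for the binary G, normal X setting with a 4.5% target. The use of Gauss-Hermite quadrature and a bracketing root finder, and the helper name `disease_rate`, are choices made here for illustration; the paper does not describe how the intercepts were obtained.

```python
import numpy as np
from scipy.optimize import brentq

beta = (0.76, 0.36, -0.63)

# Gauss-Hermite nodes for the weight exp(-x^2 / 2); normalizing the weights
# turns the quadrature into an expectation over X ~ N(0, 1).
nodes, weights = np.polynomial.hermite_e.hermegauss(60)
weights = weights / weights.sum()

def disease_rate(alpha, p_g=0.6):
    """Pr(D = 1) when G ~ Bernoulli(p_g) and X ~ N(0, 1) are independent."""
    rate = 0.0
    for g, pg in ((0, 1.0 - p_g), (1, p_g)):
        lin = alpha + beta[0] * g + (beta[1] + beta[2] * g) * nodes
        rate += pg * np.sum(weights / (1.0 + np.exp(-lin)))
    return rate

alpha_hat = brentq(lambda a: disease_rate(a) - 0.045, -10.0, 0.0)
print("calibrated intercept:", alpha_hat)
print("disease rate at alpha = -3.61:", disease_rate(-3.61))
```

The same function with a different target, or with the quadrature replaced by one suited to a Gamma-distributed X, would cover the other settings.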

The results are summarized in Tables 1–4. For 4.5% disease prevalence and normally distributed X (Table 1), it is clear that prospective logistic regression and our semiparametric efficient estimator are both consistent, while the semiparametric estimator has smaller variance. Specifically, the semiparametric efficient estimator has a mean squared error efficiency gain as large as 57% (the interaction term between G and X) for binary G, and 46% (the interaction term between G and X) for normal G. For 4.5% disease prevalence and Gamma X (Table 3), when G follows a Bernoulli distribution, our semiparametric efficient estimator has a mean squared error efficiency gain between 31% (the main effect of X) and 56% (the interaction term between G and X); when G is normal, the corresponding efficiency gain of the interaction term is 44%.

Table 1.

Simulation results from 1000 simulated case-control samples taken from a population with a disease rate of approximately 4.5%, and independent genetic and environmental variables, under the logistic model with gene–environment interaction. The results for G ∼ $B (0.6)$ and X ∼ $N (0, 1)$ are displayed on the left, whereas the results for G ∼ $N (0, 1)$ and X ∼ $N (0, 1)$ are on the right. Each replicate contains n₁ = 1000 cases and n₀ = 1000 controls, and is analyzed through two approaches: (1) “Logistic” is ordinary logistic regression, and (2) “Semi” is our semiparametric efficient estimator. Here, we list the sample mean (“Mean”), the sample standard error (“se”), the mean estimated standard error (“est se”) and the coverage of the nominal 95% confidence intervals (“95%”) for both methods. In addition, we computed the mean squared error efficiency (“MSE Eff”) of the “Semi” method compared to the “Logistic” approach.

		Binary G, Normal X			Normal G, Normal X
	β	0.76	0.36	−0.63	0.76	0.36	−0.63
Logistic	Mean	0.761	0.363	−0.635	0.762	0.363	−0.634
	se	0.101	0.088	0.103	0.055	0.053	0.056
	est se	0.101	0.084	0.101	0.056	0.054	0.055
	95%	0.952	0.939	0.942	0.950	0.954	0.942
Semi	Mean	0.761	0.360	−0.630	0.761	0.362	−0.627
	se	0.101	0.077	0.082	0.054	0.051	0.046
	est se	0.100	0.073	0.079	0.053	0.051	0.041
	95%	0.953	0.939	0.941	0.949	0.953	0.921
	MSE Eff	1.003	1.325	1.566	1.068	1.112	1.457


Table 4.

Simulation results from 1000 simulated case-control samples taken from a population with a disease rate of approximately 10%, and independent genetic and environmental variables, under the logistic model with gene–environment interaction. The results for G ∼ $B (0.6)$ and X ∼ $G (20, 1)$ are displayed on the left, whereas the results for G ∼ $N (0, 1)$ and X ∼ $G (20, 1)$ are on the right. Each replicate contains n₁ = 1000 cases and n₀ = 1000 controls, and is analyzed through two approaches: (1) “Logistic” is ordinary logistic regression, and (2) “Semi” is our semiparametric efficient estimator. Here, we list the sample mean (“Mean”), the sample standard error (“se”), the mean estimated standard error (“est se”) and the coverage of the nominal 95% confidence intervals (“95%”) for both methods. In addition, we computed the mean squared error efficiency (“MSE Eff”) of the “Semi” method compared to the “Logistic” approach.

		Binary G, Gamma X			Normal G, Gamma X
	β	3.577	0.080	−0.141	3.577	0.080	−0.141
Logistic	Mean	3.589	0.081	−0.141	3.600	0.081	−0.142
	se	0.459	0.018	0.022	0.274	0.012	0.013
	est se	0.460	0.018	0.022	0.269	0.012	0.012
	95%	0.949	0.950	0.947	0.950	0.934	0.944
Semi	Mean	3.565	0.080	−0.140	3.590	0.081	−0.142
	se	0.394	0.016	0.019	0.268	0.012	0.012
	est se	0.381	0.016	0.018	0.247	0.011	0.011
	95%	0.945	0.953	0.938	0.934	0.937	0.930
	MSE Eff	1.360	1.240	1.406	1.048	1.031	1.061


Table 3.

Simulation results from 1000 simulated case-control samples taken from a population with a disease rate of approximately 4.5%, and independent genetic and environmental variables, under the logistic model with gene–environment interaction. The results for G ∼ $B (0.6)$ and X ∼ $G (20, 1)$ are displayed on the left, whereas the results for G ∼ $N (0, 1)$ and X ∼ $G (20, 1)$ are on the right. Each replicate contains n₁ = 1000 cases and n₀ = 1000 controls, and is analyzed through two approaches: (1) “Logistic” is ordinary logistic regression, and (2) “Semi” is our semiparametric efficient estimator. Here, we list the sample mean (“Mean”), the sample standard error (“se”), the mean estimated standard error (“est se”) and the coverage of the nominal 95% confidence intervals (“95%”) for both methods. In addition, we computed the mean squared error efficiency (“MSE Eff”) of the “Semi” method compared to the “Logistic” approach.

		Binary G, Gamma X			Normal G, Gamma X
	β	3.577	0.080	−0.141	3.577	0.080	−0.141
Logistic	Mean	3.599	0.081	−0.142	3.592	0.080	−0.141
	se	0.456	0.018	0.022	0.269	0.012	0.012
	est se	0.462	0.018	0.022	0.259	0.012	0.012
	95%	0.957	0.953	0.949	0.937	0.950	0.942
Semi	Mean	3.586	0.080	−0.141	3.569	0.080	−0.140
	se	0.375	0.016	0.018	0.230	0.011	0.010
	est se	0.369	0.016	0.017	0.202	0.011	0.009
	95%	0.950	0.949	0.942	0.914	0.940	0.919
	MSE Eff	1.484	1.305	1.559	1.372	1.059	1.437


The results for the 10% disease rate case (Tables 2 and 4) are similar. Both approaches are asymptotically valid, with our approach being superior to prospective logistic regression in the sense that our semiparametric efficient estimator has smaller mean squared error.

Table 2.

Simulation results from 1000 simulated case-control samples taken from a population with a disease rate of approximately 10%, and independent genetic and environmental variables, under the logistic model with gene–environment interaction. The results for G ∼ $B (0.6)$ and X ∼ $N (0, 1)$ are displayed on the left, whereas the results for G ∼ $N (0, 1)$ and X ∼ $N (0, 1)$ are on the right. Each replicate contains n₁ = 1000 cases and n₀ = 1000 controls, and is analyzed through two approaches: (1) “Logistic” is ordinary logistic regression, and (2) “Semi” is our semiparametric efficient estimator. Here, we list the sample mean (“Mean”), the sample standard error (“se”), the mean estimated standard error (“est se”) and the coverage of the nominal 95% confidence intervals (“95%”) for both methods. In addition, we computed the mean squared error efficiency (“MSE Eff”) of the “Semi” method compared to the “Logistic” approach.

		Binary G, Normal X			Normal G, Normal X
	β	0.76	0.36	−0.63	0.76	0.36	−0.63
Logistic	Mean	0.762	0.363	−0.638	0.762	0.363	−0.633
	se	0.102	0.084	0.100	0.056	0.051	0.057
	est se	0.100	0.083	0.100	0.056	0.053	0.057
	95%	0.943	0.952	0.955	0.957	0.960	0.952
Semi	Mean	0.762	0.359	−0.628	0.761	0.363	−0.629
	se	0.102	0.077	0.087	0.055	0.050	0.053
	est se	0.100	0.074	0.081	0.055	0.052	0.050
	95%	0.944	0.932	0.936	0.953	0.960	0.934
	MSE Eff	1.004	1.180	1.325	1.032	1.065	1.145

6. Example

Prostate cancer is a heterogeneous disease resulting from the complex interplay of genetic susceptibility and environmental exposures. It is the second leading cause of cancer death among men in the USA [1]. Prostate cells (both primary and cancer cells) have been demonstrated to have 1α-OHase activity. 1α-OHase is the enzyme responsible for converting [25(OH)D], the major circulating form of vitamin D that reflects both dietary and sunlight exposures, into 1,25-dihydroxy-vitamin D [1,25(OH)2D], the most active form of this vitamin, which can induce cell-cycle regulation, apoptosis and differentiation in prostate cancer cells via the vitamin D receptor (VDR). Thus, (a) [25(OH)D] is hypothesized to have an anticancer effect, and (b) an important question is whether its relationship with the risk of developing prostate cancer is modified by genetic polymorphisms in the VDR gene.

In this section, we apply our methodology to a case-control study of prostate cancer, using the same data set that Chen et al. [9] analyzed in a different context; see that reference for details about the study. Specifically, our analysis is based on a polygenic risk score, a single risk factor incorporating information from susceptibility SNPs, whereas Chen et al. [9] focused on haplotypes. The data consist of n₁ = 690 cases and n₀ = 717 controls randomly selected from the screening arm of a large population-based cohort study, the Prostate, Lung, Colorectal, and Ovarian Cancer Screening Trial (PLCO) at the National Cancer Institute. The PLCO cohort study recruited a total of 76,685 men aged 55–74 at 10 screening centers between November 1993 and July 2001, then randomly assigned 38,340 of them to the screening arm and the rest to the non-screening arm. In a 10-year follow-up period, in the study population, the cumulative incidence rate for prostate cancer in the screening arm was 108.4 per 10,000 person-years [3]. Apart from case-control status, [25(OH)D] level (nmol/L) and genotype data on 19 single-nucleotide polymorphisms (SNPs) are available for each subject involved in the case-control study. According to Chen et al. [9], these polymorphisms, our G, are unlikely to affect the [25(OH)D] level, our X, as the VDR gene plays a “downstream” role in the vitamin-D pathway. In other words, the gene–environment independence assumption is likely to be valid in this application. Detailed information about the design can be found in Andriole et al. [3], Hayes et al. [16], and Prorok et al. [25].

One difficulty in investigating the genetic modification of the VDR gene to [25(OH)D] on the risk of prostate cancer is that the VDR gene contains multiple underlying susceptibility SNPs, where each individual SNP may only confer a small component of overall risk. In fact, running a logistic regression of case-control status on each of the 19 SNPs shows only three SNPs have p-values ≤ 0.10. Recently, it has been recognized that the polygenic risk score has the potential of improving risk prediction for some common diseases [2,7,11–13,26]. Therefore, we created a polygenic risk score for the prostate cancer data by weighting those 19 SNPs, where the weights are the effect sizes of separate logistic regressions applied to each SNP.
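
The construction of the polygenic risk score described above can be sketched as follows. This is a minimal illustration on simulated data, not the code used for the analysis: the genotypes, disease status, sample size, number of SNPs, and the helper `fit_logistic_slope` are all hypothetical, and the final standardization step is an assumption based on the caption of Table 5, which refers to a standardized score. Each SNP receives as its weight the slope from a separate single-SNP logistic regression, fitted here by Newton-Raphson.

```python
import math
import random

def fit_logistic_slope(x, y, iters=25):
    """Newton-Raphson fit of logit P(y = 1) = b0 + b1 * x; returns the slope b1."""
    b0 = b1 = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = p * (1.0 - p)
            g0 += yi - p          # score for the intercept
            g1 += (yi - p) * xi   # score for the slope
            h00 += w              # entries of the information matrix
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2 x 2 system (information) * step = score.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b1

random.seed(1)
n_subjects, n_snps = 300, 4  # hypothetical sizes (the study has 19 SNPs)
# Hypothetical genotypes: minor-allele counts 0, 1 or 2 for each SNP.
genotypes = [[sum(random.random() < 0.3 for _ in range(2)) for _ in range(n_snps)]
             for _ in range(n_subjects)]
# Hypothetical disease status, generated only so that the example runs.
status = [int(random.random() < 1.0 / (1.0 + math.exp(0.5 - 0.4 * row[0])))
          for row in genotypes]

# One weight per SNP: the effect size from its own logistic regression.
weights = [fit_logistic_slope([row[j] for row in genotypes], status)
           for j in range(n_snps)]
raw_score = [sum(w * g for w, g in zip(weights, row)) for row in genotypes]

# Standardize the score to mean 0 and standard deviation 1.
mean = sum(raw_score) / n_subjects
sd = math.sqrt(sum((s - mean) ** 2 for s in raw_score) / n_subjects)
prs = [(s - mean) / sd for s in raw_score]
```

The resulting `prs` plays the role of the single genetic covariate G in the analysis that follows.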

The results of prospective logistic regression and our semiparametric approach based on 1000 bootstrap samples are given in Table 5. The two sets of estimates are fairly consistent, as expected. However, our semiparametric efficient estimator has smaller standard errors than does the prospective logistic regression, in accordance with theory and our simulations. This leads to a substantial difference in inference for the interaction between the polygenic risk score and the [25(OH)D] level. Specifically, both prospective logistic regression and our semiparametric efficient method show that the main effects of both the polygenic risk score and the [25(OH)D] level are statistically significant and positive. That is, ignoring the interaction, men with higher polygenic risk scores and/or higher [25(OH)D] levels tend to have a higher risk of developing prostate cancer.

Table 5.

Analysis of the case-control study on prostate cancer, containing n₁ = 690 cases and n₀ = 717 controls. Two approaches were implemented: (1) “Logistic” is ordinary logistic regression, and (2) “Semi” is our semiparametric efficient estimator. Displayed are the estimates, bootstrap standard error (“se, bootstrap”), mean estimated asymptotic standard error (“est se, asymptotic”), bootstrap p-value (“p-value, bootstrap”), and asymptotic p-value (“p-value, asymptotic”) of the coefficients for the standardized polygenic risk score (G), [25(OH)D] level (X), and the interaction between them (GX).

		β_G	β_X	β_GX
Logistic	Estimates	0.169	0.123	−0.101
	se, bootstrap	0.056	0.056	0.054
	est se, asymptotic	0.055	0.055	0.055
	p-value, bootstrap	0.002	0.028	0.064
	p-value, asymptotic	0.002	0.024	0.066
Semi	Estimates	0.168	0.124	−0.110
	se, bootstrap	0.056	0.056	0.049
	est se, asymptotic	0.055	0.054	0.042
	p-value, bootstrap	0.003	0.027	0.026
	p-value, asymptotic	0.002	0.021	0.009

Importantly, the estimate of the interaction parameter from the prospective logistic regression is not significant at the 5% level. However, our approach shows significant evidence of interaction, i.e., the effect of [25(OH)D] level on prostate cancer risk differs depending on the polygenic risk score.

In addition, our approach provides an estimated disease rate in the population of 10.6%. For comparison, the incidence reported for the screening arm of the PLCO cohort study, 108.4 per 10,000 person-years, corresponds to roughly 10.8% over the 10-year follow-up period. This agreement supports our methodology and suggests an additional use to which it can be applied.

7. Discussion

We have developed a semiparametric efficient estimator in case-control studies for the gene–environment independent model, where the distributions of genetic susceptibility and environmental exposure are allowed to be arbitrary and the disease rate is assumed completely unknown. We showed that in spite of these weak assumptions, the problem is identifiable in most cases. The proposed estimator is derived under the so-called hypothetical population framework, which enables us to view the case-control sample as a random sample from a hypothetical distribution and thus facilitates the application of a conventional semiparametric approach. Such an estimator is semiparametric efficient and its superiority over the prospective logistic regression was demonstrated in various simulations. The general methodology of our approach can be extended to parametric models other than the logistic model, such as the probit model, and it can be used to consider assumptions other than gene–environment independence, such as Hardy–Weinberg equilibrium, as long as the resulting model is identifiable.

The method hinges on the assumption of gene–environment independence. When G and X are in fact dependent, blindly applying this method will not lead to a consistent estimator. It is possible to further apply the empirical Bayes shrinkage method of Chen et al. [9] to improve robustness to the model assumptions. This shrinkage approach effectively uses our method when the independence assumption holds, and effectively uses logistic regression when it fails.

To handle the nuisance parameters in the estimation procedure, nonparametric density/mass function estimation is used. When the dimensions of genetic susceptibility or environmental exposures increase, such nonparametric estimation suffers from the curse of dimensionality. In such cases, dimension reduction techniques might be needed to maintain model flexibility and ensure computational feasibility. This will be pursued in future work.

Acknowledgments

Ma’s research was partially supported by the National Science Foundation, USA (DMS-1608540). Carroll and Liang’s research was supported by a grant from the National Cancer Institute, USA (U01-CA057030). We thank Nilanjan Chatterjee and Alex Asher for many helpful comments.

Appendix. Sketch of technical arguments

A.1. Identifiability

A1 There exists c_x so that when $x \to c_{x}$ , $m (g, x, β) \to \infty$ or $m (g, x, β) \to - \infty$ for any g.

A2 There exist g₁ and x₁, x₂ such that $m (g_{1}, x_{1}, β) \neq m (g_{1}, x_{2}, β)$ .

A3 There exists c_g so that when $g \to c_{g}$ , $m (g, x, β) \to \infty$ or $m (g, x, β) \to - \infty$ for any x.

A4 There exist x₁ and g₁, g₂ such that $m (g_{1}, x_{1}, β) \neq m (g_{2}, x_{1}, β)$ .

Proposition 1. The problem stated in (2) is identifiable if either of the following holds:

(i) condition A1 holds, and at least one of the conditions A3 and A4 holds;
(ii) condition A3 holds, and at least one of the conditions A1 and A2 holds.

Remark 1. In practice, a widely used model is the one including main effects and two-way interaction, i.e., $α + β_{1} g + β_{2} x + β_{3} x g$ . It can be easily verified that if g and x both have the support on ℝ then this model satisfies conditions A1 and A3 described above and hence is identifiable.

Remark 2. Proposition 1 applies in the case where at most one of G and X is discrete. In the case where both G and X are discrete with levels ℓ_G and ℓ_X respectively, identifiability requires $ℓ_{G} ℓ_{X} \geq 2 ℓ_{G} + 2 ℓ_{X} - 2$ as a necessary condition. Additional conditions may be needed. Although for a specific model with known ℓ_G and ℓ_X it can be easy to derive sufficient conditions for identifiability, such a result is difficult to describe in general.
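
To make the necessary condition in Remark 2 concrete, the short sketch below enumerates which small pairs of level counts satisfy the stated inequality. The function name `meets_necessary_condition` is hypothetical, and the check covers only this necessary condition, not sufficiency. Rearranging the inequality gives the equivalent form $(ℓ_{G} - 2) (ℓ_{X} - 2) \geq 2$, which the code also uses as a cross-check.

```python
def meets_necessary_condition(l_g, l_x):
    """Necessary condition of Remark 2: l_g * l_x >= 2 * l_g + 2 * l_x - 2."""
    return l_g * l_x >= 2 * l_g + 2 * l_x - 2

# Enumerate level counts between 2 and 5 for G and X.
admissible = [(l_g, l_x)
              for l_g in range(2, 6)
              for l_x in range(2, 6)
              if meets_necessary_condition(l_g, l_x)]
print(admissible)

# Cross-check against the rearranged form (l_g - 2) * (l_x - 2) >= 2.
assert all(((l_g - 2) * (l_x - 2) >= 2) == meets_necessary_condition(l_g, l_x)
           for l_g in range(2, 20) for l_x in range(2, 20))
```

According to the inequality, a binary G (ℓ_G = 2) can never satisfy the condition when X is also discrete, whatever the number of levels of X, and the smallest admissible pairs are (3, 4) and (4, 3).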

Proof of Proposition 1. From [24], β is identifiable. Thus, we aim at establishing the identifiability of η₁, η₂ and α. We first prove the result under A1 and A3. Assume there are α, η₁, η₂ and $α^{*}, η_{1}^{*}, η_{2}^{*}$ so that

\frac{n_{d}}{n π_{d}} η_{1} (g) η_{2} (x) H (d, g, x, β, α) = \frac{n_{d}}{n π_{d}^{*}} η_{1}^{*} (g) η_{2}^{*} (x) H (d, g, x, β, α^{*}) .

This yields

\frac{1}{π_{1}} η_{1} (g) η_{2} (x) H (1, g, x, β, α) = \frac{1}{π_{1}^{*}} η_{1}^{*} (g) η_{2}^{*} (x) H (1, g, x, β, α^{*}),

\frac{1}{π_{0}} η_{1} (g) η_{2} (x) H (0, g, x, β, α) = \frac{1}{π_{0}^{*}} η_{1}^{*} (g) η_{2}^{*} (x) H (0, g, x, β, α^{*}) .

Taking the ratio of the above two and solving, we obtain $\exp (α^{*}) = \exp (α) π_{0} π_{1}^{*} / (π_{1} π_{0}^{*})$ . This leads to

\frac{η_{2}^{*} (x)}{η_{2} (x)} \frac{η_{1}^{*} (g)}{η_{1} (g)} = \frac{π_{0}^{*} / π_{0} + \exp {α + m (g, x, β)} π_{1}^{*} / π_{1}}{1 + \exp {α + m (g, x, β)}} .

Under condition A1, letting $x \to c_{x}$ , we obtain $η_{1}^{*} (g) = η_{1} (g)$ . Similarly, under condition A3, letting $g \to c_{g}$ , we obtain $η_{2}^{*} (x) = η_{2} (x)$ . This in turn leads to $π_{0}^{*} = π_{0}, π_{1}^{*} = π_{1}$ . Finally, these results lead to $α^{*} = α$ .

We now prove the result under A1 and A4. Under condition A1 alone, the same derivation as before leads to

\frac{η_{2}^{*} (x)}{η_{2} (x)} = \frac{π_{0}^{*} / π_{0} + \exp {α + m (g, x, β)} π_{1}^{*} / π_{1}}{1 + \exp {α + m (g, x, β)}} .

Thus A4 further implies

\frac{π_{0}^{*} / π_{0} + \exp {α + m (g_{1}, x_{1}, β)} π_{1}^{*} / π_{1}}{1 + \exp {α + m (g_{1}, x_{1}, β)}} = \frac{π_{0}^{*} / π_{0} + \exp {α + m (g_{2}, x_{1}, β)} π_{1}^{*} / π_{1}}{1 + \exp {α + m (g_{2}, x_{1}, β)}},

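or equivalently, $(π_{0}^{*} / π_{0} - π_{1}^{*} / π_{1}) [\exp {α + m (g_{1}, x_{1}, β)} - \exp {α + m (g_{2}, x_{1}, β)}] = 0$. By A4 the bracketed factor is nonzero, so $π_{0}^{*} / π_{0} = π_{1}^{*} / π_{1}$; since $π_{0} + π_{1} = π_{0}^{*} + π_{1}^{*} = 1$, it follows that $π_{d}^{*} = π_{d}$ for d = 0, 1. As a result, $α^{*} = α$ and $η_{2}^{*} (x) = η_{2} (x)$.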

The result under A2 and A3 is symmetric to the one under A1 and A4 hence is omitted. □

The requirements in A1 and A3 are appropriate in the case where G and X are both continuous. The requirements in A1 and A4 are suitable in the case where G is discrete and X is continuous. The requirements in A2 and A3 are suitable in the case where X is discrete and G is continuous.

A.2. Nuisance tangent space Λ and its orthogonal complement $Λ^{⊥}$

The nuisance tangent space Λ is computed in two steps. First, we replace the nuisance parameter $η = (η_{1}, η_{2})$ with a finite-dimensional parameter, say $γ = {(γ_{1}^{⊤}, γ_{2}^{⊤})}^{⊤}$, and take the derivative of $\ln f_{D, G, X} (d, g, x; β, γ)$ with respect to γ to get $S_{γ} = {(S_{γ 1}^{⊤}, S_{γ 2}^{⊤})}^{⊤}$. Second, we find the mean squared closure that contains all such $S_{γ}$, which is Λ.

For any finite-dimensional parameter $γ = {(γ_{1}^{⊤}, γ_{2}^{⊤})}^{⊤}$ , we have $S_{γ} = {(S_{γ 1}^{⊤}, S_{γ 2}^{⊤})}^{⊤}$ , where

S_{γ 1} = η_{1} {(g, γ_{1})}^{- 1} \partial η_{1} (g, γ_{1}) / \partial γ_{1} - π_{d}^{- 1} \int \partial η_{1} (g, γ_{1}) / \partial γ_{1} η_{2} (x) H (d, g, x, θ) d μ (x) d μ (g) = η_{1} {(g, γ_{1})}^{- 1} \partial η_{1} (g, γ_{1}) / \partial γ_{1} - E {η_{1} {(G, γ_{1})}^{- 1} \partial η_{1} (G, γ_{1}) / \partial γ_{1} | D},

S_{γ 2} = η_{2} {(x, γ_{2})}^{- 1} \partial η_{2} (x, γ_{2}) / \partial γ_{2} - π_{d}^{- 1} \int η_{1} (g) \partial η_{2} (x, γ_{2}) / \partial γ_{2} H (d, g, x, θ) d μ (x) d μ (g) = η_{2} {(x, γ_{2})}^{- 1} \partial η_{2} (x, γ_{2}) / \partial γ_{2} - E {η_{2} {(X, γ_{2})}^{- 1} \partial η_{2} (X, γ_{2}) / \partial γ_{2} | D} .

It is easy to show the nuisance tangent spaces associated with η₁ and η₂ are respectively

Λ_{1} = {a (g) - π_{d}^{- 1} \int a (g) η_{1} (g) η_{2} (x) H (d, g, x, θ) d μ (x) d μ (g) : E^{true} {a (G)} = 0} = {a (g) - E {a (G) | d} for all a (g)},

Λ_{2} = {a (x) - π_{d}^{- 1} \int a (x) η_{1} (g) η_{2} (x) H (d, g, x, θ) d μ (x) d μ (g) : E^{true} {a (X)} = 0} = {a (x) - E {a (X) | d} for all a (x)} .

Then

Λ = Λ_{1} + Λ_{2} = {a_{1} (g) + a_{2} (x) - E {a_{1} (G) + a_{2} (X) | d} for all a_{1} (g), a_{2} (x)} .

Define $Λ_{1}^{⊥, conj} = {f (d, g, x) : E (f) = 0, E (f | G) = E {E (f | D) | G}}$ . Now consider $f ⊥ Λ_{1}$ . Then for any $a (g) - E {a (G) | d} \in Λ_{1}$ ,

0 = E (f^{⊤} [a (G) - E {a (G) | D}]) = E [f^{⊤} a (G) - f^{⊤} E {a (G) | D}] = E [f^{⊤} a (G) - E (f^{⊤} | D) E {a (G) | D}] = E {f^{⊤} a (G) - E (f^{⊤} | D) a (G)} = E [E {f^{⊤} - E (f^{⊤} | D) | G} a (G)] .

Hence, $E {f - E (f | D) | G} = 0$ almost surely. Besides, $Λ_{1}^{⊥}$ needs to be a subspace of the Hilbert space $H$, hence $E (f) = 0$. Thus, we have shown $Λ_{1}^{⊥} \subset Λ_{1}^{⊥, conj}$. Furthermore, for any $f \in Λ_{1}^{⊥, conj}$,

E [f^{⊤} a (G) - f^{⊤} E {a (G) | D}] = E {f^{⊤} a (G) - E (f^{⊤} | D) a (G)} = E [E {f^{⊤} - E (f^{⊤} | D) | G} a (G)] = 0,

hence $Λ_{1}^{⊥, conj} \subset Λ_{1}^{⊥}$ . Thus, we have obtained $Λ_{1}^{⊥} = Λ_{1}^{⊥, conj}$ . Similarly, we can prove

Λ_{2}^{⊥} = {f (d, g, x) : E (f) = 0, E (f | X) = E {E (f | D) | X}}

Hence,

Λ^{⊥} = {f (d, g, x) : E (f | G) = E {E (f | D) | G}, E (f | X) = E {E (f | D) | X}, E (f) = 0} .

A.3. Uniqueness of a and b up to constants

To prove that a and b defined in Eqs. (4) and (5) are unique up to constant shifts, we consider the following. If there exist a₁, a₂, b₁, b₂ such that

S_{eff} (d, g, x) = S (d, g, x) - a_{1} (g) - b_{1} (x) - E {S (d, G, X) | d} + E {a_{1} (G) + b_{1} (X) | d} = S (d, g, x) - a_{2} (g) - b_{2} (x) - E {S (d, G, X) | d} + E {a_{2} (G) + b_{2} (X) | d},

then

a_{2} (g) - a_{1} (g) = b_{1} (x) - b_{2} (x) - E {a_{1} (G) + b_{1} (X) | d} + E {a_{2} (G) + b_{2} (X) | d} .

The left-hand side is a function of g while the right-hand side is a function of x and d. Hence $a_{1} (g) - a_{2} (g)$ is a constant. Similarly, $b_{1} (x) - b_{2} (x)$ is also a constant. □

A.4. Equivalent expression of Eqs. (4) and (5) and the proof under the condition $E (a) = E (b) = 0$

We claim under the mean zero constraint $E (a) = E (b) = 0$ , (4) and (5) are equivalent to (A.1)–(A.3), below, namely

S_{g} (g) - E {S_{x} (X) | g} = a (g) + u_{0} c_{g} (g) - E {E (a | X) | g} - u_{0} E {c_{x} (X) | g},

(A.1)

S_{x} (x) = E (a | x) + b (x) + u_{0} c_{x} (x),

(A.2)

u_{0} = E (a + b | D = 0),

(A.3)

where $c_{x} (x) = E [{n_{0} - n I (D = 0)} / n_{1} | x]$ , $c_{g} (g) = E [{n_{0} - n I (D = 0)} / n_{1} | g]$ .

Proof. Suppose a and b are the solution of Eqs. (4) and (5). Let $E (a + b | D = 0) = u_{0}$ , $E (a + b | D = 1) = u_{1}$ . Then (A.3) automatically holds. It is easy to verify that $u_{0} n_{0} + u_{1} n_{1} = n E (a + b) = 0$ . Hence (4) and (5) become

E (a | x) + b (x) + u_{0} {(n_{0} / n_{1}) f_{D | X} (1, x) - f_{D | X} (0, x)} = S_{x} (x),

a (g) + E (b | g) + u_{0} {(n_{0} / n_{1}) f_{D | G} (1, g) - f_{D | G} (0, g)} = S_{g} (g) .

Further write

c_{x} (x) = (n_{0} / n_{1}) f_{D | X} (1, x) - f_{D | X} (0, x) = {n_{0} - n f_{D | X} (0, x)} / n_{1} = E [{n_{0} - n I (D = 0)} / n_{1} | x] = E [{n_{0} / n - I (D = 0)} / (n_{1} / n) | x],

c_{g} (g) = (n_{0} / n_{1}) f_{D | G} (1, g) - f_{D | G} (0, g) = {n_{0} - n f_{D | G} (0, g)} / n_{1} = E [{n_{0} - n I (D = 0)} / n_{1} | g] = E [{n_{0} / n - I (D = 0)} / (n_{1} / n) | g] .

Then

E (a | x) + b (x) + u_{0} c_{x} (x) = S_{x} (x),

(A.4)

a (g) + E (b | g) + u_{0} c_{g} (g) = S_{g} (g) .

(A.5)

Note that (A.4) above is exactly (A.2) defined in Section 3. Taking conditional expectation of (A.4) given G = g, we obtain

E {E (a | X) | g} + E (b | g) + u_{0} E {c_{x} (X) | g} = E {S_{x} (X) | g} .

Subtracting the above from (A.5), we obtain (A.1), namely

a (g) + u_{0} c_{g} (g) - E {E (a | X) | g} - u_{0} E {c_{x} (X) | g} = S_{g} (g) - E {S_{x} (X) | g} .

From the above derivation, it is clear that any mean zero functions a(g), b(x) that solve (4) and (5) also satisfy (A.1)–(A.3). We now prove the other way around, that is any mean zero functions a(g), b(x) that satisfy (A.1)–(A.3) also satisfy (4) and (5).

Taking the expectation of (A.2) conditionally on G = g and adding the resulting equation to (A.1), we obtain exactly (A.5).

Hence Eqs. (A.1) and (A.2) lead to Eqs. (A.2) and (A.5).

For preparation, note also that $c_{g} (g) = (n_{0} / n_{1}) f_{D | G} (1, g) - f_{D | G} (0, g)$ . Hence under (A.3) and the condition $n_{1} E (a + b | D = 1) + n_{0} E (a + b | D = 0) = n E (a + b) = 0$ , we can further write

u_{0} c_{g} (g) = E (a + b | D = 0) {(n_{0} / n_{1}) f_{D | G} (1, g) - f_{D | G} (0, g)} = E (a + b | D = 0) (n_{0} / n_{1}) f_{D | G} (1, g) - E (a + b | D = 0) f_{D | G} (0, g) = - E (a + b | D = 1) f_{D | G} (1, g) - E (a + b | D = 0) f_{D | G} (0, g) = - E {E (a + b | D) | g} .

Similarly, $u_{0} c_{x} (x) = - E {E (a + b | D) | x}$ . From (A.2), we obtain

S_{x} (x) = E (a | x) + b (x) + u_{0} c_{x} (x) = E (a | x) + b (x) - E {E (a + b | D) | x},

which is exactly (4). Similarly, from (A.5), we obtain (5). □

Eq. (A.1) allows us to solve for a(g) as a function of u₀ and other known quantities, say $a (g) = F_{a} (g, u_{0}) - E {F_{a} (G, u_{0})}$ , where F_a is a function that solves (A.1) which does not need to have mean 0. Then we can solve b from (A.2) as a function of u₀ to obtain

b (x) = S_{x} (x) - u_{0} c_{x} (x) - E {F_{a} (G, u_{0}) | x} + E {F_{a} (G, u_{0})} .

Now

u_{0} = E {a (G) + b (X) | D = 0} = E [F_{a} (G, u_{0}) + S_{x} (X) - u_{0} c_{x} (X) - E {F_{a} (G, u_{0}) | X} | D = 0],

which allows us to solve for u₀. Having obtained u₀, we can then solve for all other quantities easily. Unfortunately, the integral equation (A.1) does not have an explicit solution. We propose an approximation to its solution in the spirit of Tsiatis and Ma [28], which is provided in Appendix A.5, by discretizing X if X is continuous.

The efficient score S_eff, especially the procedure of solving for a and b, contains several expectations conditional on D, G, or X. To estimate these conditional expectations, we need density estimators of the nuisance parameter η = (η₁, η₂).

If the disease rate π₁ or the non-disease rate π₀ = 1 − π₁ is known, then η can be approximated by

{\hat{η}}_{1} = π_{0} {\hat{f}}_{G | D = 0} + (1 - π_{0}) {\hat{f}}_{G | D = 1}, {\hat{η}}_{2} = π_{0} {\hat{f}}_{X | D = 0} + (1 - π_{0}) {\hat{f}}_{X | D = 1},

where ${\hat{f}}_{G | D = d}$ and ${\hat{f}}_{X | D = d}$ are the nonparametric estimators of the conditional density/mass functions $f_{G | D = d}$ and $f_{X | D = d}$ respectively for d ∈ {0, 1}. Of course, in practice, π₀ is typically unknown. However, we can get an estimate of π₀ through (3).
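
The weighted estimator of η₂ above can be sketched as a mixture of two kernel density estimates, one from the controls and one from the cases. This is a minimal sketch under stated assumptions: the Gaussian kernel, the fixed bandwidth, the data values, and the value of π₀ are illustrative choices that are not taken from the paper, and the function names `gaussian_kde` and `mixture_density` are hypothetical.

```python
import math

def gaussian_kde(sample, bandwidth):
    """Return x -> kernel density estimate of `sample` with a Gaussian kernel."""
    n = len(sample)
    scale = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))
    def density(x):
        return scale * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in sample)
    return density

def mixture_density(controls, cases, pi0, bandwidth):
    """Population density estimate pi0 * f(. | D = 0) + (1 - pi0) * f(. | D = 1)."""
    f0 = gaussian_kde(controls, bandwidth)
    f1 = gaussian_kde(cases, bandwidth)
    def eta_hat(x):
        return pi0 * f0(x) + (1.0 - pi0) * f1(x)
    return eta_hat

# Hypothetical exposure values for controls and cases.
x_controls = [-1.2, -0.4, 0.0, 0.3, 0.9, 1.5]
x_cases = [0.2, 0.8, 1.1, 1.9, 2.4]
eta2_hat = mixture_density(x_controls, x_cases, pi0=0.9, bandwidth=0.5)

# Sanity check: the estimate should integrate to about 1 (Riemann sum on [-10, 10]).
step = 0.01
total_mass = sum(eta2_hat(-10.0 + i * step) for i in range(2001)) * step
print(round(total_mass, 4))
```

The same construction applies to η₁ when G is continuous; for a discrete G, the kernel estimates would be replaced by the empirical frequencies among controls and among cases.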

A.5. Solving the integral equation (A.1)

Define $Z = S - E (S | D) - u_{0} {n_{0} - n I (D = 0)} / n_{1}$. An equivalent expression of (A.1) is

a (G) - E [E {a (G) | X} | G] = E (Z | G) - E {E (Z | X) | G} .

(A.6)

For fixed u₀, all the quantities in Z are known or have explicit form except E(S | D). With the weighted kernel density estimates ${\hat{η}}_{1}, {\hat{η}}_{2}$, the estimated non-disease rate ${\hat{π}}_{0}$ and the estimated disease rate ${\hat{π}}_{1}$, we can estimate it by

\hat{E} (S | D = d) = {\hat{π}}_{d}^{- 1} \int S (d, g, x) {\hat{η}}_{1} (g) {\hat{η}}_{2} (x) d μ (g) d μ (x) .

A.5.1. Discrete G with finite number of levels

Assume G is discrete with mass at m_g points $g_{1}, …, g_{m_{g}}$. We compute each term in (A.6) under the weighted nonparametric densities ${\hat{η}}_{1}, {\hat{η}}_{2}$:

\hat{E} {a (G) | x} = \frac{\sum_{j = 1}^{m_{g}} a (g_{j}) k (g_{j}, x) {\hat{η}}_{1} (g_{j})}{\sum_{j = 1}^{m_{g}} k (g_{j}, x) {\hat{η}}_{1} (g_{j})}, \hat{E} [\hat{E} {a (G) | X} | g_{k}] = \int {\frac{\sum_{j = 1}^{m_{g}} a (g_{j}) k (g_{j}, x) {\hat{η}}_{1} (g_{j})}{\sum_{j = 1}^{m_{g}} k (g_{j}, x) {\hat{η}}_{1} (g_{j})}} \frac{k (g_{k}, x) {\hat{η}}_{2} (x)}{\int k (g_{k}, x) {\hat{η}}_{2} (x) d μ (x)} d μ (x) .

Similarly, we have

\hat{E} {Z (D, G, X) | X} = \frac{\sum_{j = 1}^{m_{g}} \sum_{d = 0}^{1} n_{d} / (n π_{d}) Z (d, g_{j}, x) H (d, g_{j}, x) {\hat{η}}_{1} (g_{j})}{\sum_{j = 1}^{m_{g}} k (g_{j}, x) {\hat{η}}_{1} (g_{j})}, \hat{E} [\hat{E} {Z (D, G, X) | X} | g_{k}] = \int {\frac{\sum_{j = 1}^{m_{g}} \sum_{d = 0}^{1} n_{d} / (n π_{d}) Z (d, g_{j}, x) H (d, g_{j}, x) {\hat{η}}_{1} (g_{j})}{\sum_{j = 1}^{m_{g}} k (g_{j}, x) {\hat{η}}_{1} (g_{j})}} \times \frac{k (g_{k}, x) {\hat{η}}_{2} (x)}{\int k (g_{k}, x) {\hat{η}}_{2} (x) d μ (x)} d μ (x),

(A.7)

and

\hat{E} {Z (D, G, X) | g_{k}} = \sum_{d = 0}^{1} \int Z (d, g_{k}, x) \frac{n_{d} / (n π_{d}) H (d, g_{k}, x) {\hat{η}}_{2} (x)}{\int k (g_{k}, x) {\hat{η}}_{2} (x) d μ (x)} d μ (x) .

(A.8)

Consequently, the integral equation (A.6) reduces to the linear equations $(I - B) A^{⊤} = C^{⊤}$, where A is the (p + 1) × m_g matrix $\{a (g_{1}), …, a (g_{m_{g}})\}$ corresponding to the solution of the integral equation, I is the m_g × m_g identity matrix, and B is an m_g × m_g matrix whose (i, j)th element is given by

B_{i j} = \int {\frac{k (g_{j}, x) {\hat{η}}_{1} (g_{j})}{\sum_{j' = 1}^{m_{g}} k (g_{j'}, x) {\hat{η}}_{1} (g_{j'})}} \frac{k (g_{i}, x) {\hat{η}}_{2} (x)}{\int k (g_{i}, x) {\hat{η}}_{2} (x) d μ (x)} d μ (x),

and C is a (p + 1) × m_g matrix whose kth column is $\hat{E} {Z (D, G, X) | g_{k}} - \hat{E} [\hat{E} {Z (D, G, X) | X} | g_{k}]$ defined in (A.7) and (A.8). After obtaining a, we set

b (x) = \hat{E} (Z - a | x) = \frac{\sum_{j = 1}^{m_{g}} \sum_{d = 0}^{1} n_{d} / (n π_{d}) Z (d, g_{j}, x) H (d, g_{j}, x) {\hat{η}}_{1} (g_{j})}{\sum_{j = 1}^{m_{g}} k (g_{j}, x) {\hat{η}}_{1} (g_{j})} - \frac{\sum_{j = 1}^{m_{g}} a (g_{j}) k (g_{j}, x) {\hat{η}}_{1} (g_{j})}{\sum_{j = 1}^{m_{g}} k (g_{j}, x) {\hat{η}}_{1} (g_{j})} .

Then we compute $u_{0} = \hat{E} (a + b | D = 0)$ , where

\hat{E} (a | D = 0) = \frac{\sum_{j = 1}^{m_{g}} a (g_{j}) {\hat{η}}_{1} (g_{j}) \int H (0, g_{j}, x) {\hat{η}}_{2} (x) d μ (x)}{\int \sum_{j = 1}^{m_{g}} H (0, g_{j}, x) {\hat{η}}_{1} (g_{j}) {\hat{η}}_{2} (x) d μ (x)}, \hat{E} (b | D = 0) = \int b (x) \frac{\sum_{j = 1}^{m_{g}} H (0, g_{j}, x) {\hat{η}}_{1} (g_{j}) {\hat{η}}_{2} (x)}{\int \sum_{j = 1}^{m_{g}} H (0, g_{j}, x) {\hat{η}}_{1} (g_{j}) {\hat{η}}_{2} (x) d μ (x)} d μ (x) .
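
The reduction of (A.6) to a linear system can be illustrated on a toy example. The sketch below is not the authors' implementation: it assumes a small, made-up joint probability table `p` for (G, X) with both variables discrete, a scalar function `Z` of (g, x) instead of the (p + 1)-dimensional Z, and a hypothetical helper `solve_linear`. As defined above, each row of B sums to one, so I − B annihilates constant vectors; the sketch therefore replaces one equation with the mean-zero constraint E(a) = 0 of Appendix A.4, which is one possible normalization and an assumption on our part.

```python
def solve_linear(M, rhs):
    """Gaussian elimination with partial pivoting for a square system M y = rhs."""
    n = len(M)
    aug = [row[:] + [r] for row, r in zip(M, rhs)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= f * aug[col][c]
    y = [0.0] * n
    for r in reversed(range(n)):
        y[r] = (aug[r][n] - sum(aug[r][c] * y[c] for c in range(r + 1, n))) / aug[r][r]
    return y

# Hypothetical joint pmf p[i][l] = P(G = g_i, X = x_l) and scalar Z[i][l] = Z(g_i, x_l).
p = [[0.10, 0.05, 0.05, 0.10],
     [0.05, 0.15, 0.10, 0.05],
     [0.05, 0.05, 0.15, 0.10]]
Z = [[1.0, 2.0, 0.5, -1.0],
     [0.0, 1.5, 2.5, 0.5],
     [-0.5, 1.0, 3.0, 2.0]]
m_g, m_x = len(p), len(p[0])
pg = [sum(row) for row in p]                                   # marginal of G
px = [sum(p[i][l] for i in range(m_g)) for l in range(m_x)]    # marginal of X

# B[i][j] = sum over x of P(G = g_j | x) * P(X = x | g_i).
B = [[sum(p[j][l] / px[l] * p[i][l] / pg[i] for l in range(m_x)) for j in range(m_g)]
     for i in range(m_g)]

# c[i] = E(Z | g_i) - E{E(Z | X) | g_i}, the right-hand side of (A.6).
ez_given_x = [sum(Z[j][l] * p[j][l] for j in range(m_g)) / px[l] for l in range(m_x)]
c = [sum((Z[i][l] - ez_given_x[l]) * p[i][l] for l in range(m_x)) / pg[i]
     for i in range(m_g)]

# (I - B) a = c is singular, so replace the last equation by E{a(G)} = 0.
M = [[(1.0 if i == j else 0.0) - B[i][j] for j in range(m_g)] for i in range(m_g)]
rhs = c[:]
M[-1] = pg[:]
rhs[-1] = 0.0
a = solve_linear(M, rhs)

# All original equations, including the replaced one, should hold.
residual = max(abs(a[i] - sum(B[i][j] * a[j] for j in range(m_g)) - c[i])
               for i in range(m_g))
print(residual < 1e-10)
```

Replacing an equation is harmless here because the marginal of G is a left null vector of I − B and is orthogonal to c, so the dropped equation is implied by the remaining ones.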

A.5.2. Continuous G or discrete G with infinite number of levels

When G is a continuous variable, we discretize it at a finite number of equally spaced points, say, $g_{1} ≤ ⋯ ≤ g_{m_{g}}$ with $g_{i + 1} - g_{i} \equiv Δ_{g}$ for all i ∈ {1, …, m_g − 1}, such that

\sum_{i = 1}^{m_{g}} f_{G | D} (g_{i}) Δ_{g} \approx 1 .

Similarly, when G is discrete with an infinite number of levels, we simply choose a sufficient number of points from its support to get an overall probability close to 1. The subsequent steps are then exactly the same as those described for the case where G is discrete with a finite number of levels.
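
The discretization criterion for a continuous G can be checked numerically. The sketch below is only an illustration: it uses a standard normal density as a stand-in for $f_{G | D}$, and the grid limits, the number of points, and the function names `normal_pdf` and `equally_spaced_grid` are hypothetical choices. It builds an equally spaced grid and verifies that the Riemann sum of the density over the grid is close to 1.

```python
import math

def normal_pdf(g):
    """Standard normal density, used here as a stand-in for f_{G|D}."""
    return math.exp(-0.5 * g * g) / math.sqrt(2.0 * math.pi)

def equally_spaced_grid(lo, hi, m):
    """Return m equally spaced points on [lo, hi] and the spacing between them."""
    delta = (hi - lo) / (m - 1)
    return [lo + i * delta for i in range(m)], delta

grid, delta_g = equally_spaced_grid(-4.0, 4.0, 41)

# Criterion from the text: sum_i f(g_i) * delta_g should be approximately 1.
mass = sum(normal_pdf(g) for g in grid) * delta_g
print(round(mass, 3))
```

If the sum falls noticeably short of 1, the grid range can be widened or the number of points increased before the linear system of Appendix A.5.1 is formed.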

Footnotes

Appendix B. Supplementary data

Supplementary material related to this article can be found online at https://doi.org/10.1016/j.jmva.2019.01.006.

Contributor Information

Yanyuan Ma, Email: yzm63@psu.edu.

Raymond J. Carroll, Email: carroll@stat.tamu.edu.

References

[1] American Cancer Society, Cancer Facts & Figures 2015, American Cancer Society, Atlanta, GA, 2015.
[2] Aly M, Wiklund F, Xu J, Isaacs WB, Eklund M, D’Amato M, Adolfsson J, Grönberg H, Polygenic risk score improves prostate cancer risk prediction: Results from the Stockholm-1 cohort study, Eur. Urol 60 (2011) 21–28.
[3] Andriole GL, Crawford ED, Grubb RL, Buys SS, Chia D, Church TR, Fouad MN, Isaacs C, Kvale PA, Reding DJ, Weissfeld JL, Yokochi LA, O’Brien B, Ragard LR, Clapp JD, Rathmell JM, Riley TL, Hsing AW, Izmirlian G, Pinsky PF, Kramer BS, Miller AB, Gohagan JK, Prorok PC, PLCO Project Team, Prostate cancer screening in the randomized prostate, lung, colorectal, and ovarian cancer screening trial: Mortality results after 13 years of follow-up, J. Natl. Cancer Inst 104 (2012) 125–132.
[4] Bickel PJ, Klaassen CA, Ritov Y, Wellner JA, Efficient and Adaptive Estimation for Semiparametric Models, The Johns Hopkins University Press, Baltimore, MD, 1993.
[5] Chatterjee N, Carroll RJ, Semiparametric maximum likelihood estimation exploiting gene-environment independence in case-control studies, Biometrika 92 (2005) 399–418.
[6] Chatterjee N, Chen Y-H, Luo S, Carroll RJ, Analysis of case-control association studies: SNPs, imputation and haplotypes, Statist. Sci 24 (2009) 489–502.
[7] Chatterjee N, Shi J, García-Closas M, Developing and evaluating polygenic risk prediction models for stratified disease prevention, Nature Rev. Genet 17 (2016) 392–406.
[8] Chen YH, Chatterjee N, Carroll RJ, Retrospective analysis of haplotype-based case-control studies under a flexible model for gene-environment association, Biostatistics 9 (2008) 81–99.
[9] Chen YH, Chatterjee N, Carroll RJ, Shrinkage estimators for robust and efficient inference in haplotype-based case-control studies, J. Amer. Statist. Assoc 104 (2009) 220–233.
[10] Cornfield J, A statistical problem arising from retrospective studies, in: Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, vol. 4, pp. 135–148.
[11] Dudbridge F, Power and predictive accuracy of polygenic risk scores, PLoS Genet 9 (2013) e1003348.
[12] Evans DM, Visscher PM, Wray NR, Harnessing the information contained within genome-wide association studies to improve individual prediction of complex disease risk, Hum. Mol. Gen 18 (2009) 3525–3531.
[13] Fuchsberger C, Flannick J, Teslovich TM, Mahajan A, Agarwala V, Gaulton KJ, Ma C, Fontanillas P, Moutsianas L, McCarthy DJ, et al., The genetic architecture of type 2 diabetes, Nature 536 (2016) 41–47.
[14] Gauderman WJ, Zhang P, Morrison JL, Lewinger JP, Finding novel genes by testing G × E interactions in a genome-wide association study, Genetic Epidemiol 37 (2013) 603–613.
[15] Han SS, Rosenberg PS, Ghosh A, Landi MT, Caporaso NE, Chatterjee N, An exposure-weighted score test for genetic associations integrating environmental risk factors, Biometrics 71 (2015) 596–605.
[16] Hayes RB, Reding D, Kopp W, Subar AF, Bhat N, Rothman N, Caporaso N, Ziegler RG, Johnson CC, Weissfeld JL, Hoover RN, Hartge P, Palace C, Gohagan JK, et al., Etiologic and early marker studies in the Prostate, Lung, Colorectal and Ovarian, PLCO cancer screening trial, Controlled Clin. Trials 21 (2000) 349S–355S.
[17] Hunter DJ, Gene–environment interactions in human diseases, Nature Rev. Genet 6 (2005) 287–298.
[18] Jiang Y, Scott AJ, Wild CJ, Secondary analysis of case-control data, Stat. Med 25 (2006) 1323–1339.
[19] Lin D, Zeng D, Proper analysis of secondary phenotype data in case-control association studies, Genet. Epidemiol 33 (2009) 256–265.
[20] Ma Y, A semiparametric efficient estimator in case-control studies, Bernoulli 16 (2010) 585–603.
[21] Murcray CE, Lewinger JP, Gauderman WJ, Gene-environment interaction in genome-wide association studies, Am. J. Epidemiol 169 (2009) 219–226.
[22] Ottman R, Gene-environment interaction: Definitions and study designs, Prev. Med 25 (1996) 764.
[23] Piegorsch WW, Weinberg CR, Taylor JA, Non-hierarchical logistic models and case-only designs for assessing susceptibility in population based case-control studies, Stat. Med 13 (1994) 153–162.
[24] Prentice RL, Pyke R, Logistic disease incidence models and case-control studies, Biometrika 66 (1979) 403–411.
[25] Prorok PC, Andriole GL, Bresalier RS, Buys SS, Chia D, Crawford ED, Fogel R, Gelmann EP, Gilbert F, Hasson MA, Hayes RB, Johnson CC, Mandel JS, Oberman A, O’Brien B, Oken MM, Rafla S, Reding D, Rutt W, Weissfeld JL, Yokochi L, Gohagan JK, et al., Design of the Prostate, Lung, Colorectal and Ovarian, PLCO cancer screening trial, Controlled Clin. Trials 21 (2000) 273S–309S.
[26] Purcell SM, Wray NR, Stone JL, Visscher PM, O’Donovan MC, Sullivan PF, Sklar P, Ruderfer DM, McQuillin A, Morris DW, et al., Common polygenic variation contributes to risk of schizophrenia and bipolar disorder, Nature 460 (2009) 748–752.
[27] Tsiatis AA, Semiparametric Theory and Missing Data, Springer, New York, 2007.
[28] Tsiatis AA, Ma Y, Locally efficient semiparametric estimators for functional measurement error models, Biometrika 91 (2004) 835–848.

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Supp

NIHMS1014120-supplement-Supp.pdf^{(173.5KB, pdf)}

