Author manuscript; available in PMC: 2009 Oct 15.
Published in final edited form as: J R Stat Soc Series B Stat Methodol. 2009 Jan 1;71(1):75–96. doi: 10.1111/j.1467-9868.2008.00671.x

Testing in semiparametric models with interaction, with applications to gene-environment interactions

Arnab Maity 1, Raymond J Carroll 2, Enno Mammen 3, Nilanjan Chatterjee 4
PMCID: PMC2762226  NIHMSID: NIHMS113982  PMID: 19838317

Summary

Motivated by the problem of testing for genetic effects on complex traits in the presence of gene-environment interaction, we develop score tests in general semiparametric regression problems that involve a Tukey style 1 degree-of-freedom form of interaction between parametrically and non-parametrically modelled covariates. We find that the score test in this type of model, as recently developed by Chatterjee and co-workers in the fully parametric setting, is biased and requires undersmoothing to be valid in the presence of non-parametric components. Moreover, in the presence of repeated outcomes, the asymptotic distribution of the score test depends on the estimation of functions which are defined as solutions of integral equations, making implementation difficult and computationally taxing. We develop profiled score statistics which are unbiased and asymptotically efficient and can be performed by using standard bandwidth selection methods. In addition, to overcome the difficulty of solving functional equations, we give easy interpretations of the target functions, which in turn allow us to develop estimation procedures that can be easily implemented by using standard computational methods. We present simulation studies to evaluate type I error and power of the method proposed compared with a naive test that does not consider interaction. Finally, we illustrate our methodology by analysing data from a case-control study of colorectal adenoma that was designed to investigate the association between colorectal adenoma and the candidate gene NAT2 in relation to smoking history.

Keywords: Additive models, Diplotypes, Function estimation, Non-parametric regression, Omnibus hypothesis testing, Partially linear model, Repeated measures, Score test, Semiparametric models, Smooth backfitting, Tukey’s 1 degree-of-freedom model

1. Introduction

Modern genetic association studies often focus on discovery of susceptibility loci, i.e. identification of genetic variants that are associated with the trait under study. The risks of multifactorial traits, such as cancer, however, are determined by complex interactions between genetic and environmental exposures and the chance for discovery of the underlying susceptibility genes can be substantially reduced if the possibility of heterogeneity in genetic effects due to interactions is ignored. Thus, in recent years, there has been increasing attention in omnibus testing of genetic main effects and gene-environment or gene-gene interactions for detection of susceptibility genes for complex traits. Clearly, tests of association incorporating interactions require larger degrees of freedom than those which are based only on main effects. When the number of extra degrees of freedom required is relatively small, recent studies have shown that the omnibus tests can be a robust and powerful approach for detecting genetic association irrespectively of whether certain specific forms of interactions are present or not (Chatterjee et al., 2006; Kraft et al., 2007). However, if the required number of degrees of freedom is large, then the omnibus tests can have poor power. Thus parsimonious modelling of gene-gene and gene-environment interactions should be considered for construction of powerful omnibus tests.

Chatterjee et al. (2006) proposed the use of a Tukey style 1 degree-of-freedom model for interaction for testing the genetic association of a disease with a set of genetic variants, such as tagging single nucleotide polymorphisms (SNPs) in a candidate gene, that may potentially interact with another set of genetic variants and/or with one or more environmental exposures. SNPs represent a natural genetic variability at high density in the human genome. A genetic locus corresponding to an SNP has two possible alleles (states), namely the normal and the variant. The SNP-genotype data for a subject can have three possible values and are often coded numerically as the number of variant alleles that the subject carries on the pair of homologous chromosomes that are inherited from his or her parents.

In this paper, we shall consider extending the work of Chatterjee et al. (2006) focusing on the problem of gene-environment interaction. Thus, for example, if D denotes the binary indicator of a disease outcome, X denotes a ‘design matrix’ that is associated with a set of genetic variants G, Z denotes the design matrix that is associated with an environmental exposure of interest and S denotes a set of additional cofactors, such as age and sex, then the risk of the disease can be modelled by using Tukey’s form of gene-environment interaction as

$$\mathrm{pr}(D=1\mid X,S,Z,\gamma)=H(X^{T}\beta_0+S^{T}\eta_0+Z^{T}\theta_0+\gamma X^{T}\beta_0\,Z^{T}\theta_0), \tag{1}$$

where H(·) is the logistic distribution function. Unlike in the standard logistic regression model where potentially a separate interaction parameter is allowed between each pair of design elements of the genetic and environmental factors, in model (1), a single parameter (γ) is used to capture interactions. Moreover, in model (1), the omnibus null hypothesis of interest can be simply stated as β0=0 under which both genetic main effects and gene-environment interactions disappear from the model. A complication, however, is that, under β0=0, the parameter γ also disappears from the model and hence is not identifiable from the data. Nevertheless, Chatterjee et al. (2006) noticed that, for each fixed value of γ, model (1) can be used to construct a valid score test for β0=0. They proposed to use maxima of such score statistics over a range of the parameter γ as the final test statistics for testing β0=0. They observed that the score test has particular computational advantages, because under the null hypothesis model (1) reduces to a standard logistic regression model involving only main effects of Z and S.
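As a minimal illustration of model (1), the following sketch evaluates the risk for given parameter values. The function name `tukey_logistic_risk` and the array layout (rows are subjects) are assumptions made here for illustration and do not come from the paper.

```python
import numpy as np

def tukey_logistic_risk(X, S, Z, beta, eta, theta, gamma):
    """pr(D=1 | X, S, Z) under the Tukey-type interaction model (1).

    X, S, Z are (n, p), (n, q), (n, r) arrays; beta, eta, theta are
    coefficient vectors; gamma is the single interaction parameter.
    (Hypothetical helper, for illustration only.)
    """
    g = X @ beta          # genetic main effect X^T beta
    e = Z @ theta         # environmental main effect Z^T theta
    lin = g + S @ eta + e + gamma * g * e
    return 1.0 / (1.0 + np.exp(-lin))   # logistic distribution function H
```

With `beta` set to zero the value of `gamma` has no effect on the returned risks, which is the non-identifiability under the null hypothesis that the text describes.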

In this paper, we extend the work by Chatterjee et al. (2006) in two novel ways. First, we consider modelling complex effects of continuous environmental exposures by using non-parametric regression models. The problem is particularly motivated by the fact that modern molecular epidemiologic studies often involve measurement of environmental exposures through continuous biomarkers, the relationships of which with the disease can be highly complex and non-linear. Thus, for example, in the logistic context, we might consider the model

$$\mathrm{pr}(D=1\mid X,S,Z,\gamma)=H\{X^{T}\beta_0+S^{T}\eta_0+\theta_0(Z)+\gamma X^{T}\beta_0\,\theta_0(Z)\}, \tag{2}$$

where θ0(·) is an unknown function. Second, we consider general semiparametric models with possible repeated measures (Lin and Carroll, 2006), where the effects are given through terms roughly of the form on the right-hand side of model (2). In particular, we assume that, for each subject or cluster i, there are j=1, . . . , J observations (Yij, Xij, Sij, Zij). We write $\tilde{Y}_i=(Y_{i1},\ldots,Y_{iJ})$ and work with a criterion function

$$L\{\tilde{Y},\nu_1,\ldots,\nu_J,\zeta_0\},\quad\text{with}\quad \nu_j=X_j^{T}\beta_0\{1+\gamma\theta_0(Z_j)\}+\theta_0(Z_j)+S_j^{T}\eta_0, \tag{3}$$

where a criterion function could mean either an actual likelihood function, a composite likelihood function, i.e. one that is a likelihood function for a reduced set of data, or a working independence likelihood function. In particular, criterion functions have scores in the parameters (β0, η0, ζ0, θ0) that have mean 0 given appropriate subcomponents of $(X_j,S_j,Z_j)_{j=1}^{J}$. The case of no repeated measures as in model (1) occurs when J=1.

Our interest is in testing hypotheses of the form H0 : β0=0. As in Chatterjee et al. (2006), it is natural to use a score testing approach to this problem to avoid the numerical difficulty that is associated with parameter estimation under general models of the form (1) and (2). In particular, we note that estimation of γ in these models can be numerically unstable because of lack of identifiability of this parameter under β0=0. This also means that γ cannot be consistently estimated at contiguous alternatives. In practice, even in fully parametric models, this lack of identifiability means that estimating γ is numerically unstable, leading to non-convergence if its range is not restricted.

Following Chatterjee et al. (2006), we propose to perform score-type tests for each value of γ and then to maximize these tests over an interval of γ-values, and to use numerical devices to create levels of significance. It is possible to create the score statistic directly, and to apply the asymptotic expansions that were developed by Lin and Carroll (2006) to analyse these statistics. However, two problems arise.

  1. The first problem is that the direct score statistic requires undersmoothing for the non-parametric estimation of θ0(·) in expression (3). By modifying the directly calculated score statistic in a suitable manner using a profile argument, we shall show how to create test statistics that lose no local power yet allow regular smoothing, such as cross-validation.

  2. The second problem to overcome is that, in the repeated measures case that J>1, the distribution of the profile score statistic depends on random variables that are formed as solutions to integral equations. Rather than go about this problem directly and solve the integral equation, which would be extremely difficult, we show that the crucial terms can be estimated by using nothing more than the Gaussian repeated measures algorithm of Wang (2003); see also Lin et al. (2004) for a non-iterative solution and Huggins (2006) for another simple computational device.

Thus, we shall develop a test statistic that is straightforward to compute and does not require undersmoothing, and the method also allows a simple implementation when the score test is maximized over a range for γ.

Our methodology is easiest to understand in the non-repeated measures case that J=1, and we take this up in Section 2. The repeated measures case is described in Section 3. Section 4 gives the results of a simulation study. Here we find that our maximized tests lose little power when there is no interaction and can gain great power advantages over a main effects test when there are interactions. Section 5 illustrates an application of the proposed method for omnibus testing of the effects of genetic variants in the NAT2 gene and their interactions with the number of years since stopping smoking on the risk of colorectal adenoma by using a case-control study that was conducted within the prostate, lung, colorectal and ovarian cancer screening trial (Hayes et al., 2000).

We close this section with a few remarks about identifiability. The models that we study are examples of a problem where γ is a nuisance parameter and, under the null hypothesis (5) that β0=0, the nuisance parameter is unidentified. Model (1) is of course reminiscent of Tukey’s 1 degree-of-freedom test for interaction (Tukey, 1949). However, unlike in that context, in our problem the parameter γ is a nuisance parameter and is not of primary interest. The method of Chatterjee et al. (2006) is more closely akin to the basic suggestion in Davies (1987), namely to fix the nuisance parameter, to compute an appropriate test statistic and then to maximize that test statistic over a range of values for the nuisance parameter. Thus, one way to think about our testing procedure is as the appropriate, efficient (both computationally and in terms of power) way of implementing the basic approach of Davies (1987) in our context, while taking care to eliminate the concerns of undersmoothing and solution of integral equations that arise from a less targeted approach.

2. Testing without repeated measures

2.1. Data and notation

The data consist of a response Y, parametrically modelled covariates S and X, the latter possibly interacting with a non-parametrically modelled covariate Z. We consider a general log-likelihood or criterion function

$$L[Y,\,S^{T}\eta_0+\theta_0(Z)+X^{T}\beta_0\{1+\gamma\theta_0(Z)\},\,\zeta_0], \tag{4}$$

where β0 and η0 are the main effects, θ0(·) is an unknown function, γ is the interaction effect and ζ0 are nuisance parameters. In this section, we are interested in testing the parametric hypothesis

$$H_0:\beta_0=0. \tag{5}$$

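As described in Section 1, Chatterjee et al. (2006) addressed a similar problem for a fully parametric model where Z is also modelled parametrically. They used a score-based testing procedure to test H0. We generalize their idea for the general semiparametric model that is given in expression (4). We describe below the major steps to derive the test statistic for testing hypothesis (5).

In what follows, we use a simple subscripting convention for derivatives of the log-likelihood. Thus, with (•) standing for the argument $\{Y,\,S^{T}\eta+\theta(Z)+X^{T}\beta\{1+\gamma\theta(Z)\},\,\zeta\}$, we set

$$\begin{aligned}
L_{\theta}(\bullet)&=\frac{\partial}{\partial\nu}L\{Y,S^{T}\eta+\nu+X^{T}\beta(1+\gamma\nu),\zeta\}\Big|_{\nu=\theta(Z)},\\
L_{\theta\theta}(\bullet)&=\frac{\partial^{2}}{\partial\nu^{2}}L\{Y,S^{T}\eta+\nu+X^{T}\beta(1+\gamma\nu),\zeta\}\Big|_{\nu=\theta(Z)},\\
L_{\zeta}(\bullet)&=\frac{\partial}{\partial\zeta}L\{Y,S^{T}\eta+\nu+X^{T}\beta(1+\gamma\nu),\zeta\}\Big|_{\nu=\theta(Z)},\\
L_{\theta\zeta}(\bullet)&=\frac{\partial}{\partial\zeta}L_{\theta}\{Y,S^{T}\eta+\nu+X^{T}\beta(1+\gamma\nu),\zeta\}\Big|_{\nu=\theta(Z)},
\end{aligned}$$

etc. Thus, in an abuse of notation, we do not indicate that these partial derivatives depend on the parameters and covariates not only via $S^{T}\eta+\theta(Z)+X^{T}\beta\{1+\gamma\theta(Z)\}$.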

2.2. Estimation of parameters under the null hypothesis

Here we show how to estimate the parameters and the function at the null hypothesis.

The strength of score tests is that we fit the model under the null hypothesis. Under the null hypothesis, the log-likelihood or criterion function for the model is written as $L\{Y,S^{T}\eta_0+\theta_0(Z),\zeta_0\}$, a standard form that is easy to handle. The log-likelihood under the alternative is much more difficult to deal with numerically because of the interaction.

By definition of a log-likelihood or criterion function, at the null hypothesis,

$$0=E[L_{\theta}\{Y,S^{T}\eta_0+\theta_0(Z),\zeta_0\}\mid X,S,Z]. \tag{6}$$

The first step of the process is to estimate the function θ0(·) for any fixed value of δ=δ*=(η*, ζ*). We shall use kernel methods because of their convenient theory, but this step can be modified in practice by using any smoother. The resulting estimate is denoted as $\hat\theta(\cdot,\delta^*)$. Let K(·) be a smooth symmetric density function with bounded support, let h be a bandwidth and let $K_h(z)=h^{-1}K(z/h)$. Define $\phi_k=\int z^{k}K(z)\,dz$ and $G_h(z)=(1,z/h)^{T}$. We follow Lin and Carroll (2006) to estimate the parameters under hypothesis H0: for any fixed value of δ=δ*, estimate θ0(z) by solving the local likelihood equation

$$0=n^{-1}\sum_{i=1}^{n}K_h(Z_i-z)\,G_h(Z_i-z)\,L_{\theta}\{Y_i,\,S_i^{T}\eta+\alpha_0+\alpha_1(Z_i-z),\,\zeta\}$$

for $(\hat\alpha_0,\hat\alpha_1)$ and set $\hat\theta(z,\delta^*)=\hat\alpha_0$.
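To make the first step concrete, here is a minimal sketch for the special case of a Gaussian criterion function, where the local likelihood equation reduces to kernel-weighted least squares on the partial residuals R = Y − Sᵀη*. The function names, the choice of the Epanechnikov kernel at this step and the array layout are assumptions for illustration; for a general criterion function the same equation would instead be solved iteratively.

```python
import numpy as np

def epanechnikov(u):
    """Epanechnikov kernel, a symmetric density with support [-1, 1]."""
    return 0.75 * np.clip(1.0 - u**2, 0.0, None)

def local_linear(z0, Z, R, h):
    """Local linear estimate of theta(z) = E(R | Z = z) at each point of z0.

    Z, R are 1-D arrays of covariates and partial residuals; h is the
    bandwidth. Assumes every window [z - h, z + h] contains at least two
    distinct Z values, so the 2x2 system below is non-singular.
    """
    z0 = np.atleast_1d(np.asarray(z0, dtype=float))
    out = np.empty(z0.shape[0])
    for m, z in enumerate(z0):
        d = Z - z
        w = epanechnikov(d / h) / h                 # K_h(Z_i - z)
        G = np.column_stack([np.ones_like(d), d])   # local design (1, Z_i - z)
        A = G.T @ (w[:, None] * G)
        b = G.T @ (w * R)
        alpha = np.linalg.solve(A, b)               # (alpha_0, alpha_1)
        out[m] = alpha[0]                           # theta_hat(z) = alpha_0
    return out
```

For example, `local_linear([0.0], Z, Y - S @ eta_star, h)` would return the estimate of θ0(0) at the fixed value η*, under the Gaussian assumption stated above.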

The second step in the process is now smoothing method independent. To estimate δ0=(η0,ζ0) maximize in δ the function

$$n^{-1}\sum_{i=1}^{n}L\{Y_i,\,S_i^{T}\eta+\hat\theta(Z_i,\delta),\,\zeta\},$$

the so-called profile method, which solves

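$$\begin{aligned}
0&=n^{-1}\sum_{i=1}^{n}\{S_i+\hat\theta_{\eta}(Z_i,\delta)\}\,L_{\theta}\{Y_i,S_i^{T}\eta+\hat\theta(Z_i,\delta),\zeta\},\\
0&=n^{-1}\sum_{i=1}^{n}\big[L_{\zeta}\{Y_i,S_i^{T}\eta+\hat\theta(Z_i,\delta),\zeta\}+\hat\theta_{\zeta}(Z_i,\delta)\,L_{\theta}\{Y_i,S_i^{T}\eta+\hat\theta(Z_i,\delta),\zeta\}\big],
\end{aligned}$$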

where θ^η(Zi, δ) and θ^ζ(Zi, δ) are the derivatives of θ^(Zi, δ) with respect to η or ζ respectively. Call the resulting estimate δ^.
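The profile step can be sketched in the same Gaussian special case, with no nuisance parameter ζ. There the smoother is linear, θ̂(·, η) = H(Y − Sη) for a smoother matrix H, so the profile estimating equation becomes least squares of (I − H)Y on (I − H)S. The function names and the use of a local linear Epanechnikov smoother are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def smoother_matrix(Z, h):
    """Local linear smoother matrix H: (H @ R)[i] estimates E(R | Z = Z[i])."""
    n = Z.shape[0]
    H = np.empty((n, n))
    for i in range(n):
        d = Z - Z[i]
        w = 0.75 * np.clip(1.0 - (d / h) ** 2, 0.0, None) / h   # K_h(Z_k - Z_i)
        G = np.column_stack([np.ones(n), d])
        A = G.T @ (w[:, None] * G)
        # first row of (G'WG)^{-1} G'W holds the weights that produce alpha_0
        H[i] = np.linalg.solve(A, G.T * w)[0]
    return H

def profile_fit(Y, S, Z, h):
    """Profile estimates (eta_hat, theta_hat at the data) under the null model."""
    H = smoother_matrix(Z, h)
    I_minus_H = np.eye(len(Y)) - H
    Y_res = I_minus_H @ Y            # Y minus its smooth on Z
    S_res = I_minus_H @ S            # S + theta_hat_eta, since theta_hat_eta = -H S
    eta_hat, *_ = np.linalg.lstsq(S_res, Y_res, rcond=None)
    theta_hat = H @ (Y - S @ eta_hat)
    return eta_hat, theta_hat
```

Here `S` is an (n, q) array. For non-Gaussian criterion functions the two steps would be iterated, as in the profiling algorithm of Lin and Carroll (2006) that the text cites.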

2.3. The score function and asymptotic theory

2.3.1. Derivation

One approach to developing a score statistic is to fix the function θ(·), to derive the score statistic and then to plug in estimates of nuisance parameters and the function θ(·). This does not work well because the function estimate itself needs profiling, and indeed this approach requires undersmoothing for its validity.

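In contrast, our test statistic is a particular implementation of the profiled log-likelihood or criterion function, which is derived as follows. In general, the log-likelihood function for an observation is $L\{Y,\,S^{T}\eta+X^{T}\beta+\theta(Z)+\gamma X^{T}\beta\,\theta(Z),\,\zeta\}$. Recall that δ=(η, ζ). For given (β, δ), let θ(Z, β, δ) be the profile function that solves

$$E[L_{\theta}\{Y,\,S^{T}\eta+X^{T}\beta+\theta(Z,\beta,\delta)+\gamma X^{T}\beta\,\theta(Z,\beta,\delta),\,\zeta\}\mid Z]=0. \tag{7}$$

Define $\tilde X_{\mathrm{pro}}=X\{1+\gamma\theta(Z,0,\delta)\}+\theta_{\beta}(Z,0,\delta)$, where $\theta_{\beta}(Z,\beta,\delta)=(\partial/\partial\beta)\,\theta(Z,\beta,\delta)$. The profiled log-likelihood is $L\{Y,\,S^{T}\eta+X^{T}\beta+\theta(Z,\beta,\delta)+\gamma X^{T}\beta\,\theta(Z,\beta,\delta),\,\zeta\}$. Differentiating it with respect to β and evaluating at the null hypothesis β=0, the profiled (efficient) score is easily seen to be $\tilde X_{\mathrm{pro}}\,L_{\theta}\{Y,S^{T}\eta+\theta(Z,0,\delta),\zeta\}$.

In addition, differentiating equation (7) with respect to β and evaluating it at β=0 and δ=δ0 shows that $\tilde X_{\mathrm{pro}}=\{1+\gamma\theta_0(Z)\}\tilde X$, where

$$\tilde X=X-\frac{E[X\,L_{\theta\theta}\{Y,S^{T}\eta_0+\theta_0(Z),\zeta_0\}\mid Z]}{E[L_{\theta\theta}\{Y,S^{T}\eta_0+\theta_0(Z),\zeta_0\}\mid Z]}.$$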
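The differentiation step can be written out as follows. This is our reading of the argument, with $L_\theta$ and $L_{\theta\theta}$ evaluated at the null model $\{Y,S^{T}\eta_0+\theta_0(Z),\zeta_0\}$; it is a sketch rather than a reproduction of the authors' proof. Differentiating equation (7) with respect to β at β=0 gives

$$0=E\big[L_{\theta\theta}\,\{X\{1+\gamma\theta_0(Z)\}+\theta_{\beta}(Z,0,\delta_0)\}\mid Z\big]+\gamma\,E[X\,L_{\theta}\mid Z].$$

The last term vanishes because, by equation (6), $E(L_\theta\mid X,S,Z)=0$. Solving for $\theta_\beta$ then yields

$$\theta_{\beta}(Z,0,\delta_0)=-\{1+\gamma\theta_0(Z)\}\frac{E[X\,L_{\theta\theta}\mid Z]}{E[L_{\theta\theta}\mid Z]},\qquad
X\{1+\gamma\theta_0(Z)\}+\theta_{\beta}(Z,0,\delta_0)=\{1+\gamma\theta_0(Z)\}\Big\{X-\frac{E[X\,L_{\theta\theta}\mid Z]}{E[L_{\theta\theta}\mid Z]}\Big\}.$$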

We thus propose the following profiled score statistic for β0:

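$$T_{n,\mathrm{pro}}(\gamma)=n^{-1/2}\sum_{i=1}^{n}\{1+\gamma\hat\theta(Z_i,\hat\delta)\}\,\tilde X_{i,\mathrm{est}}\,L_{\theta}\{Y_i,\,S_i^{T}\hat\eta+\hat\theta(Z_i,\hat\delta),\,\hat\zeta\}, \tag{8}$$

where $\tilde X_{i,\mathrm{est}}$ is an estimated version of $\tilde X_i$, with the terms to be estimated in $\tilde X$ obtained by separate non-parametric regressions in the numerator and denominator. The normalization by $n^{-1/2}$ is convenient for the asymptotic theory.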
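A minimal sketch of statistic (8) for the logistic case with no nuisance parameter ζ is given below. It assumes that the null fit is already available as inputs (`theta_hat` and the fitted linear predictor `lin_hat`, both evaluated at the observed data), and it uses Nadaraya-Watson regressions with an Epanechnikov kernel for the numerator and denominator of X̃. All names are hypothetical.

```python
import numpy as np

def nw_smooth(Z, V, h):
    """Nadaraya-Watson estimates of E(V | Z = Z[i]) at the observed Z values."""
    V = np.asarray(V, dtype=float).reshape(len(Z), -1)
    U = (Z[:, None] - Z[None, :]) / h
    K = 0.75 * np.clip(1.0 - U**2, 0.0, None)      # Epanechnikov weights
    return (K @ V) / K.sum(axis=1, keepdims=True)

def profiled_score_logistic(Y, X, Z, theta_hat, lin_hat, h, gamma):
    """Profiled score T_{n,pro}(gamma) of equation (8) for logistic regression.

    lin_hat is the fitted null linear predictor S^T eta_hat + theta_hat(Z).
    """
    p = 1.0 / (1.0 + np.exp(-lin_hat))
    L_theta = Y - p                  # first derivative of the log-likelihood
    w = p * (1.0 - p)                # minus the second derivative
    # the sign of L_thetatheta cancels in the ratio defining X-tilde
    X_tilde = X - nw_smooth(Z, w[:, None] * X, h) / nw_smooth(Z, w, h)
    terms = (1.0 + gamma * theta_hat)[:, None] * X_tilde * L_theta[:, None]
    return terms.sum(axis=0) / np.sqrt(len(Y))
```

A component of X that is a function of Z alone contributes nothing to the statistic, because its weighted smooth on Z reproduces it and the corresponding column of X̃ vanishes.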

2.3.2. Theoretical results

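Let $\delta_0=(\eta_0^{T},\zeta_0^{T})^{T}$ and make the definitions

$$\begin{aligned}
\theta_{\delta}(z_0,\delta_0)&=-\frac{E[L_{\theta\delta}\{Y,S^{T}\eta_0+\theta_0(Z),\zeta_0\}\mid Z=z_0]}{E[L_{\theta\theta}\{Y,S^{T}\eta_0+\theta_0(Z),\zeta_0\}\mid Z=z_0]},\\
\mathcal{E}&=L_{\delta}\{Y,S^{T}\eta_0+\theta_0(Z),\zeta_0\}+\theta_{\delta}(Z,\delta_0)\,L_{\theta}\{Y,S^{T}\eta_0+\theta_0(Z),\zeta_0\},\\
M&=E(\mathcal{E}\mathcal{E}^{T}),\\
N&=E\big(X\{1+\gamma\theta_0(Z)\}\,[L_{\theta\delta}\{Y,S^{T}\eta_0+\theta_0(Z),\zeta_0\}+L_{\theta\theta}\{Y,S^{T}\eta_0+\theta_0(Z),\zeta_0\}\,\theta_{\delta}(Z,\delta_0)]^{T}\big),\\
\Psi(\gamma)&=\{1+\gamma\theta_0(Z)\}\,\tilde X\,L_{\theta}\{Y,S^{T}\eta_0+\theta_0(Z),\zeta_0\}-NM^{-1}\mathcal{E}.
\end{aligned}$$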

The main result of this section justifying our methodology is stated below. Technically, a precise argument requires little more than that the linear expansions for the parametric and non-parametric parts that are given in Lin and Carroll (2006) hold to order op(n-1/2), the latter uniformly.

Result 1

Suppose that we are testing for H0:β0=0. Assume that $h\propto n^{-\alpha}$ with 1/5 ≤ α ≤ 1/3. Then, for any fixed γ, the score function for β0 can be written as

$$T_{n,\mathrm{pro}}(\gamma)=n^{-1/2}\sum_{i=1}^{n}\Psi_i(\gamma)+o_p(1).$$

In addition, assume that, for any γ1 and γ2, $V(\gamma_1,\gamma_2)=E\{\Psi(\gamma_1)\Psi^{T}(\gamma_2)\}$ is finite. Then, under the hypothesis that β0=0, $T_{n,\mathrm{pro}}(\gamma)$ as a function of γ ∈ [L, R] converges weakly to a Gaussian process W(γ) with mean 0 and covariance function V(γ1, γ2).

Remark 1

There are two methods that can be used to estimate the covariance matrix of the estimated score.

  1. First, suppose as in logistic regression that there are no nuisance parameters ζ0, and that L(•) is a log-likelihood function and not a general criterion function. Then we can write $\Psi_i(\gamma)=\Psi_i^{*}(\gamma)\,L_{\theta}\{Y_i,S_i^{T}\eta_0+\theta_0(Z_i)\}$ with $\Psi_i^{*}(\gamma)=\{1+\gamma\theta_0(Z_i)\}\tilde X_i-NM^{-1}\tilde S_i$, where
    $$\tilde S=S-\frac{E[S\,L_{\theta\theta}\{Y,S^{T}\eta_0+\theta_0(Z)\}\mid Z]}{E[L_{\theta\theta}\{Y,S^{T}\eta_0+\theta_0(Z)\}\mid Z]}.$$
    Let $\hat\Psi_i^{*}(\gamma)$ be the estimated version of $\Psi_i^{*}(\gamma)$. This estimated version requires the definition of $\tilde X_i$, $\tilde S_i$ and additional non-parametric regressions, which are easily accomplished via kernel or spline methods. Further, let $I_{\theta,\mathrm{null}}\{S_i^{T}\eta_0+\theta_0(Z_i)\}$ be the conditional information matrix for θ under the null model. Then we estimate the covariance matrix of $T_{n,\mathrm{pro}}(\gamma)$ by
    $$I_{\beta_0,n}(\gamma)=n^{-1}\sum_{i=1}^{n}I_{\theta,\mathrm{null}}\{S_i^{T}\hat\eta+\hat\theta(Z_i,\hat\eta)\}\,\hat\Psi_i^{*}(\gamma)\,\hat\Psi_i^{*}(\gamma)^{T}.$$
  2. In general, $I_{\beta_0,n}(\gamma)$ can be estimated as the sample covariance matrix of the terms $\hat\Psi_i(\gamma)$, the estimated version of $\Psi_i(\gamma)$. In likelihood problems, simplifications arise because we can compute the covariance matrix of Ψ(·) given (X, Z, S) by using Fisher information calculations.

Remark 2

The validity and unbiasedness of the profiled score statistic primarily depend on the use of $\tilde X$. In simpler models, such as the Gaussian model, $\tilde X=X-E(X\mid Z)$ is simply the residual of a non-parametric Gaussian regression of each component of X on Z. In general, $\tilde X$ can be thought of as the residual of a weighted non-parametric Gaussian regression of each component of X on Z, where the error variance for weighting is taken to be $-1/L_{\theta\theta}(\bullet)$. This interpretation enables us to construct estimates of $\tilde X$ with considerable ease in many cases, especially in the presence of repeated measurements; see Section 3 for details.

2.4. The test statistic and its implementation

Here we define our test statistic and show how to implement it in practice to compute critical values.

The score test statistic, for a fixed value of γ, is then given by $T_{n,\mathrm{pro}}(\gamma)^{T}\,I_{\beta_0,n}^{-1}(\gamma)\,T_{n,\mathrm{pro}}(\gamma)$. We compute the final test statistic as

$$T_n=\max_{L\le\gamma\le R}\{T_{n,\mathrm{pro}}^{T}(\gamma)\,I_{\beta_0,n}^{-1}(\gamma)\,T_{n,\mathrm{pro}}(\gamma)\},$$

where L and R are prespecified lower and upper bounds of γ. Our approach is also related to adaptive tests that have been developed for non-parametric alternatives of functions with unknown smoothness; see for example Horowitz and Spokoiny (2001).

To implement the test, we need to simulate the null distribution of Tn and to obtain the desired critical values. Our method avoids the need to determine analytically the critical values for the maximum of a function of a Gaussian process. Using result 1 we can generate realizations from the limiting distribution of the score statistic as

$$T_0(\gamma)=n^{-1/2}\sum_{i=1}^{n}\hat\Psi_i(\gamma)\,\mathcal{Z}_i,$$

where $\hat\Psi(\gamma)$ is Ψ(γ) evaluated at $\hat\delta$ and $\hat\theta(z,\hat\delta)$, and $\mathcal{Z}_1,\ldots,\mathcal{Z}_n$ are standard normal random variates which are drawn independently of the data. The null distribution of Tn is then simulated by generating $T_0=\max_{L\le\gamma\le R}\{T_0(\gamma)^{T}I_{\beta_0,n}^{-1}(\gamma)T_0(\gamma)\}$ repeatedly. This method is the semiparametric version of a method that was discussed by Lin and Zou (2004) and Chatterjee et al. (2006).
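The resampling scheme can be sketched as follows, assuming that the estimated terms Ψ̂i(γ), the observed scores and the covariance estimates have already been computed on a finite grid of γ values. The function name, argument layout and default number of replications are illustrative assumptions.

```python
import numpy as np

def max_score_test(T_obs, Psi_hat, info, n_rep=1500, seed=0):
    """Maximized score test with a multiplier-resampled null distribution.

    T_obs   : (m, p) observed scores T_{n,pro}(gamma_k) on a grid of m values
    Psi_hat : (m, n, p) estimated terms Psi_hat_i(gamma_k)
    info    : (m, p, p) covariance estimates I_{beta0,n}(gamma_k)
    Returns the observed maximum, its resampling p-value and the null draws.
    """
    rng = np.random.default_rng(seed)
    m, n, p = Psi_hat.shape
    info_inv = np.linalg.inv(info)                 # inverts each p x p block

    def max_quadratic_form(T):
        # T(gamma)^T I^{-1}(gamma) T(gamma) for every gamma, then the maximum
        return np.einsum('kp,kpq,kq->k', T, info_inv, T).max()

    stat_obs = max_quadratic_form(T_obs)
    null = np.empty(n_rep)
    for b in range(n_rep):
        xi = rng.standard_normal(n)                # one multiplier per subject
        T0 = np.einsum('knp,n->kp', Psi_hat, xi) / np.sqrt(n)
        null[b] = max_quadratic_form(T0)
    p_value = np.mean(null >= stat_obs)
    return stat_obs, p_value, null
```

The same multipliers are reused across all γ values within a replication, which is what preserves the dependence of the process T0(γ) over γ.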

3. General interaction model with repeated measures

3.1. Data and notation

In this section we generalize the ideas that were presented earlier to the case when repeated measures are present in the data. Repeated measures models can arise from various fields of research, e.g. matched case-control studies, finance and epidemiology. The key feature of these models is that the non-parametric function is evaluated for each of the repeated measurements. Lin and Carroll (2006) developed kernel-based estimation procedures and investigated asymptotic properties of the estimators in general semiparametric regression problems. We shall use their results and methodology in our context.

In this section we set out the notation to be used.

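For simplicity only, we suppose that there are J repeated measurements for each individual. Only obvious notational changes are required for the more general case. Specifically, we consider a log-likelihood or criterion function

$$L\{\tilde Y,\,\nu_1(\beta_0,\theta_0,\eta_0),\ldots,\nu_J(\beta_0,\theta_0,\eta_0),\,\zeta_0\},$$

where $\nu_j(\beta_0,\theta_0,\eta_0)=X_j^{T}\beta_0\{1+\gamma\theta_0(Z_j)\}+\theta_0(Z_j)+S_j^{T}\eta_0$, γ is the common interaction parameter for each of the repeated measurements and ζ0 is the collection of all the nuisance parameters. Then, with a slight abuse of notation in the first formula below,

$$\begin{aligned}
E\Big[\frac{\partial L\{\tilde Y,\nu_1(\beta_0,\theta_0,\eta_0),\ldots,\nu_J(\beta_0,\theta_0,\eta_0),\zeta_0\}}{\partial\theta_0(Z_k)}\,\Big|\,(X_j,Z_j,S_j)_{j=1}^{J}\Big]&=0,\\
E\Big[\frac{\partial L\{\tilde Y,\nu_1(\beta_0,\theta_0,\eta_0),\ldots,\nu_J(\beta_0,\theta_0,\eta_0),\zeta_0\}}{\partial(\beta,\eta,\zeta)}\,\Big|\,(X_j,Z_j,S_j)_{j=1}^{J}\Big]&=0;
\end{aligned}$$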

see Lin and Carroll (2006) for more discussion. In Section 3.6, we describe methods for the partially linear model when working independence among the errors is used, and hence weaker conditioning assumptions are required.

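Letting $(\bullet)=\{\tilde Y,\nu_1(\beta,\theta,\eta),\ldots,\nu_J(\beta,\theta,\eta),\zeta\}$, we define terms $L_{j\theta}(\bullet)$, $L_{jk\theta}(\bullet)$, $L_{\zeta}(\bullet)$ and $L_{j\theta\zeta}(\bullet)$ in the same way as described in Section 2.1. Thus, for example,

$$\begin{aligned}
L_{j\theta}(\bullet)&=\frac{\partial}{\partial\nu_j}L[\tilde Y,\,S_1^{T}\eta+\theta(Z_1)+X_1^{T}\beta\{1+\gamma\theta(Z_1)\},\ldots,S_j^{T}\eta+\nu_j+X_j^{T}\beta(1+\gamma\nu_j),\ldots,S_J^{T}\eta+\theta(Z_J)+X_J^{T}\beta\{1+\gamma\theta(Z_J)\},\,\zeta]\Big|_{\nu_j=\theta(Z_j)},\\
L_{jk\theta}(\bullet)&=\frac{\partial^{2}}{\partial\nu_j\,\partial\nu_k}L[\tilde Y,\,S_1^{T}\eta+\theta(Z_1)+X_1^{T}\beta\{1+\gamma\theta(Z_1)\},\ldots,S_j^{T}\eta+\nu_j+X_j^{T}\beta(1+\gamma\nu_j),\ldots,S_k^{T}\eta+\nu_k+X_k^{T}\beta(1+\gamma\nu_k),\ldots,S_J^{T}\eta+\theta(Z_J)+X_J^{T}\beta\{1+\gamma\theta(Z_J)\},\,\zeta]\Big|_{\nu_j=\theta(Z_j),\,\nu_k=\theta(Z_k)}.
\end{aligned}$$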

3.2. Estimation under the null model

In this section, we display the method for estimation of parameters and the function θ(·), at the null hypothesis.

Under the null hypothesis, the criterion function is given by

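$$L\{\tilde Y,\,\theta_0(Z_1)+S_1^{T}\eta_0,\ldots,\theta_0(Z_J)+S_J^{T}\eta_0,\,\zeta_0\}.$$

Let δ=(η, ζ). We estimate θ0(·) and δ0 under the null model by using methodology that was proposed in Lin and Carroll (2006): for any fixed δ=δ*=(η*, ζ*), estimate θ0(z) by solving for (α0, α1)

$$0=\sum_{i=1}^{n}\sum_{j=1}^{J}K_h(Z_{ij}-z)\,G_h(Z_{ij}-z)\,L_{j\theta}\{\tilde Y_i,\,\hat\theta(Z_{i1},\delta)+S_{i1}^{T}\eta,\ldots,\alpha_0+\alpha_1(Z_{ij}-z)/h+S_{ij}^{T}\eta,\ldots,\hat\theta(Z_{iJ},\delta)+S_{iJ}^{T}\eta,\,\zeta\},$$

and setting $\hat\theta(z,\delta^*)=\hat\alpha_0$. Next, estimate δ by maximizing

$$\sum_{i=1}^{n}L\{\tilde Y_i,\,\hat\theta(Z_{i1},\delta)+S_{i1}^{T}\eta,\ldots,\hat\theta(Z_{iJ},\delta)+S_{iJ}^{T}\eta,\,\zeta\}$$

with respect to δ. This can be accomplished by implementing a profiling algorithm as in Lin and Carroll (2006).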

3.3. The score function and asymptotic theory

3.3.1. Derivation of the profile score

As we have seen in Section 2.3, our test statistic will be based on the score function of a profiled log-likelihood. In this section, we derive the profiled log-likelihood and the score function, but here the repeated measures aspect makes the calculations less transparent and indeed leads to real issues of implementation. Let fj(z) be the marginal density of Zj. Again, for any (β, δ), we define θ(z, β, δ) by the repeated measures version of equation (7), namely the solution to the equation

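$$0=\sum_{j=1}^{J}f_j(z)\,E[L_{j\theta}\{\tilde Y,\,X_1^{T}\beta\{1+\gamma\theta(Z_1,\beta,\delta)\}+\theta(Z_1,\beta,\delta)+S_1^{T}\eta,\ldots,X_J^{T}\beta\{1+\gamma\theta(Z_J,\beta,\delta)\}+\theta(Z_J,\beta,\delta)+S_J^{T}\eta,\,\zeta\}\mid Z_j=z]. \tag{9}$$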

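Defining $\omega_j(\beta,\theta,\delta)=X_j^{T}\beta\{1+\gamma\theta(Z_j,\beta,\delta)\}+\theta(Z_j,\beta,\delta)+S_j^{T}\eta$, the profiled log-likelihood function is $L\{\tilde Y,\omega_1(\beta,\theta,\delta),\ldots,\omega_J(\beta,\theta,\delta),\zeta\}$. Let $L_{j\theta\beta}\{\tilde Y,\omega_1(\beta,\theta,\delta),\ldots,\omega_J(\beta,\theta,\delta),\zeta\}$ and $L_{jk\theta}\{\tilde Y,\omega_1(\beta,\theta,\delta),\ldots,\omega_J(\beta,\theta,\delta),\zeta\}$ be the derivatives of $L_{j\theta}\{\tilde Y,\omega_1(\beta,\theta,\delta),\ldots,\omega_J(\beta,\theta,\delta),\zeta\}$ with respect to β and θ(Zk, β, δ) respectively. Differentiating and setting β=0, the profiled score becomes

$$\sum_{j=1}^{J}[\{1+\gamma\theta(Z_j,0,\delta)\}X_j+\theta_{\beta}(Z_j,0,\delta,\gamma)]\,L_{j\theta}\{\tilde Y,\omega_1(0,\theta,\delta),\ldots,\omega_J(0,\theta,\delta),\zeta\},$$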

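where, by differentiating equation (9) with respect to β, $\theta_{\beta}(z,\beta,\delta,\gamma)$ is the solution of the functional integral equation

$$0=\sum_{j=1}^{J}f_j(z)\,E\Big[L_{j\theta\beta}\{\tilde Y,\omega_1(\beta,\theta,\delta),\ldots,\omega_J(\beta,\theta,\delta),\zeta\}+\sum_{k=1}^{J}L_{jk\theta}\{\tilde Y,\omega_1(\beta,\theta,\delta),\ldots,\omega_J(\beta,\theta,\delta),\zeta\}\,\theta_{\beta}(Z_k,\beta,\delta,\gamma)\,\Big|\,Z_j=z\Big]. \tag{10}$$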

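Then, for any fixed value of γ, the profiled score function for β0 evaluated at β0=0, $\delta_0=\hat\delta$ and $\theta(z)=\hat\theta(z,\hat\delta)$ is given by

$$T_{n,\mathrm{pro}}(\gamma)=n^{-1/2}\sum_{i=1}^{n}\sum_{j=1}^{J}[\{1+\gamma\hat\theta(Z_{ij},\hat\delta)\}X_{ij}+\hat\theta_{\beta}(Z_{ij},0,\hat\delta,\gamma)]\,L_{j\theta}\{\tilde Y_i,\,\hat\theta(Z_{i1},\hat\delta)+S_{i1}^{T}\hat\eta,\ldots,\hat\theta(Z_{iJ},\hat\delta)+S_{iJ}^{T}\hat\eta,\,\hat\zeta\}.$$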

3.3.2. Asymptotic theory

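Denote $(\bullet)=\{\tilde Y,\omega_1(\beta_0,\theta_0,\delta_0),\ldots,\omega_J(\beta_0,\theta_0,\delta_0),\zeta_0\}$ and denote $(\bullet_i)$ to be (•) evaluated at the ith observation. Do all calculations at the null model β0=0. Define $\theta_{\delta}(z,\delta_0)$ such that

$$0=\sum_{j=1}^{J}f_j(z)\,E\Big\{L_{j\theta\delta}(\bullet)+\sum_{k=1}^{J}\theta_{\delta}(Z_k,\delta_0)\,L_{jk\theta}(\bullet)\,\Big|\,Z_j=z\Big\}.$$

Further define

$$\begin{aligned}
M_1&=\mathrm{cov}\Big\{L_{\delta}(\bullet)+\sum_{j=1}^{J}L_{j\theta}(\bullet)\,\theta_{\delta}(Z_j,\delta_0)\Big\},\\
M_2&=E\Big[\sum_{j=1}^{J}\{1+\gamma\theta_0(Z_j)\}X_j\Big\{L_{j\theta\delta}(\bullet)+\sum_{k=1}^{J}\theta_{\delta}(Z_k,\delta_0)\,L_{jk\theta}(\bullet)\Big\}^{T}\Big],\\
\Psi_i(\gamma)&=\sum_{j=1}^{J}[X_{ij}\{1+\gamma\theta_0(Z_{ij})\}+\theta_{\beta}(Z_{ij},0,\delta_0,\gamma)]\,L_{j\theta}(\bullet_i)-M_2M_1^{-1}\Big\{L_{\delta}(\bullet_i)+\sum_{j=1}^{J}L_{j\theta}(\bullet_i)\,\theta_{\delta}(Z_{ij},\delta_0)\Big\}.
\end{aligned}$$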

Then we have the following result.

Result 2

Suppose that we are interested in testing H0 : β0=0. Assume that $h\propto n^{-\alpha}$ where 1/5 ≤ α ≤ 1/3. Then, for any fixed γ, the score function for β0 can be written as

$$T_{n,\mathrm{pro}}(\gamma)=n^{-1/2}\sum_{i=1}^{n}\Psi_i(\gamma)+o_p(1).$$

In addition, assume that, for any γ1 and γ2, $V(\gamma_1,\gamma_2)=E\{\Psi(\gamma_1)\Psi(\gamma_2)^{T}\}$ is finite. Then, under the hypothesis that β0=0, $T_{n,\mathrm{pro}}(\gamma)$ as a function of γ ∈ [L, R] converges weakly to a Gaussian process W(γ) with mean 0 and covariance function V(γ1, γ2).

Using result 2, we construct the test statistic and the critical values in the obvious analogy with Sections 2.3 and 2.4. To implement this in practice though, we must solve the integral equations for θβ(·) and θδ(·), which is very difficult to do. In the next section, we show how to estimate these quantities without directly solving the integral equations.

3.4. Computation of θβ(·) and θδ(·)

The main difficulty in performing the score test is that, for each γ, we must compute estimates of θβ(z, 0, δ0, γ) and θδ(z, δ0), the former of which is the solution of integral equation (10), making implementation difficult. In this section we show that θβ(z, 0, δ0, γ) can be viewed as a regression function and hence can be computed via a non-parametric Gaussian repeated measures regression, which is easily computed and for which the exact solution is known; see Huggins (2006) and Lin et al. (2004). The result can be stated as follows: details are in Appendix A.

Result 3

Define $Q_{ij}=-X_{ij}\{1+\gamma\theta_0(Z_{ij})\}$. Let $V_i$ be the J × J matrix with elements $v_{ijk}=-L_{jk\theta}(\bullet_i)$. Then θβ(z, 0, δ0, γ) is identified as the formal solution of the Gaussian repeated measures problem that was solved by Wang (2003) and Huggins (2006) with ‘responses’ being the components of $Q_{ij}$ and the inverse of the covariance matrix being $V_i$.

The algorithm for estimating θβ(·) now is quite simple. Define $\hat Q_{ij}=-X_{ij}\{1+\gamma\hat\theta(Z_{ij},\hat\delta)\}$. Then we construct each component of $\hat\theta_{\beta}(z,0,\hat\delta,\gamma)$ by performing a non-parametric repeated measures regression under the null model with β=0, with the response being the appropriate component of $\hat Q_{ij}$ and the inverse of the covariance matrix being $\hat V_i=(\hat v_{ijk})$, where $\hat v_{ijk}=-L_{jk\theta}\{\tilde Y_i,\,\hat\theta(Z_{i1},\hat\delta)+S_{i1}^{T}\hat\eta,\ldots,\hat\theta(Z_{iJ},\hat\delta)+S_{iJ}^{T}\hat\eta,\,\hat\zeta\}$ and $\hat\theta(z,\hat\delta)$ is computed under the null model with β0=0.

We can estimate θδ(·) in a similar manner. We do this componentwise. Let $L_{j\theta\delta,l}(\bullet)$ denote the lth component of $L_{j\theta\delta}(\bullet)$, and similarly for $\theta_{\delta,l}(\cdot)$. Define $(R_{i1}^{l},\ldots,R_{iJ}^{l})^{T}=V_i^{-1}\{L_{1\theta\delta,l}(\bullet_i),\ldots,L_{J\theta\delta,l}(\bullet_i)\}^{T}$. Then $\theta_{\delta,l}(\cdot)$ can be thought of as the Gaussian repeated measures regression of $R_{ij}^{l}$ on $Z_{ij}$ pretending that the inverse of the covariance matrix for the ith cluster is $V_i$. In practice, we construct $\hat\theta_{\delta,l}(\cdot)$ by using $\hat R_{ij}^{l}$ and $\hat V_i^{-1}$.

3.5. Special case: partially linear repeated measurement model

In this section we consider the partially linear Gaussian model as an example to demonstrate our methodology. Specifically, we consider the model

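$$Y_{ij}=X_{ij}^{T}\beta_0\{1+\gamma\theta_0(Z_{ij})\}+\theta_0(Z_{ij})+S_{ij}^{T}\eta_0+\varepsilon_{ij},$$

where $\tilde\varepsilon_i=(\varepsilon_{i1},\ldots,\varepsilon_{iJ})^{T}$ has an N(0, Σ) distribution. We want to test for H0: β0=0. The asymptotic theory is not affected by estimation of Σ, so here we assume that it is known.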

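Let $\Sigma=(\sigma_{jk})_{j,k=1,\ldots,J}$ and $\Sigma^{-1}=V=(v_{jk})$. Then the log-likelihood function is given by

$$L=-\frac{1}{2}\sum_{q=1}^{J}\sum_{l=1}^{J}v_{ql}\,(Y_q-\mu_q)(Y_l-\mu_l),$$

where $\mu_j=X_j^{T}\beta_0\{1+\gamma\theta_0(Z_j)\}+\theta_0(Z_j)+S_j^{T}\eta_0$. Now we observe that, when β0=0,

$$\begin{aligned}
L_{j\theta}(\bullet)&=\sum_{l=1}^{J}v_{jl}(Y_l-\mu_l),\\
L_{j\theta\beta}(\bullet)&=\gamma X_j\sum_{l=1}^{J}v_{jl}(Y_l-\mu_l)-\sum_{l=1}^{J}v_{jl}X_l\{1+\gamma\theta_0(Z_l)\},\\
L_{jk\theta}(\bullet)&=-v_{jk}.
\end{aligned}$$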

For β0=0, θβ(z, 0, η0, γ) solves

$$0=\sum_{j=1}^{J}f_j(z)\,E\Big(\sum_{k=1}^{J}v_{jk}\,[X_k\{1+\gamma\theta_0(Z_k)\}+\theta_{\beta}(Z_k,0,\eta_0,\gamma)]\,\Big|\,Z_j=z\Big). \tag{11}$$

Hence the profiled score function is given by

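$$T_{n,\mathrm{pro}}(\gamma)=n^{-1/2}\sum_{i=1}^{n}\sum_{j=1}^{J}\sum_{k=1}^{J}v_{jk}\,[\{1+\gamma\hat\theta(Z_{ij},\hat\eta)\}X_{ij}+\hat\theta_{\beta}(Z_{ij},0,\hat\eta,\gamma)]\times\{Y_{ik}-\hat\theta(Z_{ik},\hat\eta)-S_{ik}^{T}\hat\eta\}.$$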

Now we can construct the score test by using result 2.

Remark 3

Referring to Section 3.4, we observe that estimation of θβ(·) becomes much simpler in this case. Using the fact that $L_{jk\theta}(\bullet)=-v_{jk}$, we can construct $\hat\theta_{\beta}(\cdot)$ by performing a non-parametric componentwise Gaussian repeated measures regression of $\hat Q_k=-X_k\{1+\gamma\hat\theta(Z_k,\hat\eta)\}$ on $Z_k$ pretending that the error covariance matrix is Σ, where $\hat\theta(\cdot,\hat\eta)$ is computed under the null model with β0=0. Similarly, we can estimate θη(·) by performing a non-parametric Gaussian repeated measures regression of $-S_{ij}$ on $Z_{ij}$ by using Σ as the error covariance matrix.

3.6. Testing under working independence

In practice, working independence is often used to simplify the computations in the presence of repeated measures. In this set-up, we pretend that there is no correlation among the data. In our context, this leads to the assumption that $\sigma_{jk}=0$ for j ≠ k, and we work with the criterion function

$$L_{\mathrm{WI}}=-\frac{1}{2}\sum_{j=1}^{J}\sigma_{jj}^{-1}(Y_j-\mu_j)^{2},$$

where $\mu_j=X_j^{T}\beta_0\{1+\gamma\theta_0(Z_j)\}+\theta_0(Z_j)+S_j^{T}\eta_0$. The use of this criterion function simplifies the calculations to a great extent. For any generic random variable W, define $\tilde W_j=W_j-m_{ZW}(Z_j)$ with

$$m_{ZW}(z)=\frac{\sum_{j=1}^{J}\sigma_{jj}^{-1}f_j(z)\,E(W_j\mid Z_j=z)}{\sum_{j=1}^{J}\sigma_{jj}^{-1}f_j(z)}.$$

Under the hypothesis H0: β0=0, we then observe that θβ(·) and θη(·) have closed form expressions:

$$\theta_{\beta}(z,0,\eta_0,\gamma)=-\{1+\gamma\theta_0(z)\}\,m_{ZX}(z),\qquad \theta_{\eta}(z,\eta_0)=-m_{ZS}(z).$$

The profiled score statistic is given by

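$$T_{n,\mathrm{pro}}^{\mathrm{WI}}(\gamma)=n^{-1/2}\sum_{i=1}^{n}\sum_{j=1}^{J}\sigma_{jj}^{-1}\{1+\gamma\hat\theta(Z_{ij},\hat\eta)\}\,\tilde X_{ij,\mathrm{est}}\,\{Y_{ij}-\hat\theta(Z_{ij},\hat\eta)-S_{ij}^{T}\hat\eta\},$$

where $\tilde X_{ij,\mathrm{est}}=X_{ij}-\hat m_{ZX}(Z_{ij})$. We can compute $\hat m_{ZX}(z)$ by running a componentwise Gaussian repeated measures regression of $X_{ij}$ on $Z_{ij}$ by using the working independence set-up.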
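One simple way to estimate m_ZW(z) is a pooled kernel regression in which the observation (i, j) receives weight σjj⁻¹ K_h(Zij − z); the ratio of such sums converges to the ratio that defines m_ZW. The sketch below uses a local constant (Nadaraya-Watson) smoother with an Epanechnikov kernel, which is an assumption made here for brevity and not necessarily the authors' implementation; the function name is hypothetical.

```python
import numpy as np

def m_hat_wi(z0, Z, W, sigma_jj, h):
    """Pooled kernel estimate of m_ZW(z) under working independence.

    Z, W     : (n, J) arrays of covariates Z_ij and of one component W_ij
    sigma_jj : (J,) working variances
    Returns the estimate at each point of z0. Assumes every window
    [z - h, z + h] contains at least one observed Z_ij.
    """
    z0 = np.atleast_1d(np.asarray(z0, dtype=float))
    U = (Z[None, :, :] - z0[:, None, None]) / h            # (m, n, J)
    # kernel weights scaled by 1 / sigma_jj along the last (j) axis
    K = 0.75 * np.clip(1.0 - U**2, 0.0, None) / sigma_jj
    return (K * W[None]).sum(axis=(1, 2)) / K.sum(axis=(1, 2))
```

The residuals X̃ij,est are then obtained componentwise, for example as `X[:, :, l] - m_hat_wi(Z.ravel(), Z, X[:, :, l], sigma_jj, h).reshape(Z.shape)` for an (n, J, p) array `X`.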

Further define

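$$M_1=\mathrm{cov}\Big[\sum_{j=1}^{J}\sigma_{jj}^{-1}\tilde S_j\{Y_j-\theta_0(Z_j)-S_j^{T}\eta_0\}\Big],\qquad M_2=E\Big[\sum_{j=1}^{J}\sigma_{jj}^{-1}\{1+\gamma\theta_0(Z_j)\}X_j\tilde S_j^{T}\Big].$$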

Result 2 then translates to the following result.

Result 4

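Assume that $h\propto n^{-\alpha}$ where 1/5 ≤ α ≤ 1/3. Then, under the assumption of working independence,

$$T_{n,\mathrm{pro}}^{\mathrm{WI}}(\gamma)=n^{-1/2}\sum_{i=1}^{n}\sum_{j=1}^{J}\sigma_{jj}^{-1}[\{1+\gamma\theta_0(Z_{ij})\}\tilde X_{ij}+M_2M_1^{-1}\tilde S_{ij}]\,\{Y_{ij}-\theta_0(Z_{ij})-S_{ij}^{T}\eta_0\}+o_p(1).$$

Define $\Psi_{ij}(\gamma)=\{1+\gamma\theta_0(Z_{ij})\}\tilde X_{ij}+M_2M_1^{-1}\tilde S_{ij}$ and let $\hat\Psi_{ij}(\gamma)$ be the sample version. Under the null hypothesis, we estimate the covariance matrix of $T_{n,\mathrm{pro}}^{\mathrm{WI}}$ by

$$I_{\beta_0,n}^{\mathrm{WI}}=n^{-1}\sum_{i=1}^{n}\sum_{j=1}^{J}\sigma_{jj}^{-1}\,\hat\Psi_{ij}(\gamma)\,\hat\Psi_{ij}(\gamma)^{T}.$$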

The score statistic, maximized over γ, is then given by

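$$T_n=\max_{\gamma\in[L,R]}\{T_{n,\mathrm{pro}}^{\mathrm{WI}}(\gamma)^{T}\,(I_{\beta_0,n}^{\mathrm{WI}})^{-1}\,T_{n,\mathrm{pro}}^{\mathrm{WI}}(\gamma)\}.$$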

Using lemma 4 in Appendix A, we can now implement the score test by using the technique that was described in Section 2.4. We start by generating

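$$T_0^{\mathrm{WI}}(\gamma)=n^{-1/2}\sum_{i=1}^{n}\sum_{j=1}^{J}\sigma_{jj}^{-1}\,\hat\Psi_{ij}(\gamma)\,\mathcal{Z}_{ij},$$

where $\mathcal{Z}_i=(\mathcal{Z}_{i1},\ldots,\mathcal{Z}_{iJ})^{T}$, i=1, . . . , n, are independent random vectors that are generated from an $N(0,\hat\Sigma)$ distribution. We can form $\hat\Sigma$ as the sample covariance matrix of the residuals $\{Y_{ij}-\hat\theta(Z_{ij},\hat\eta)-S_{ij}^{T}\hat\eta\}$. The null distribution of Tn is then simulated by repeatedly generating

$$T_0=\max_{\gamma\in[L,R]}\{T_0^{\mathrm{WI}}(\gamma)^{T}\,(I_{\beta_0,n}^{\mathrm{WI}})^{-1}\,T_0^{\mathrm{WI}}(\gamma)\}.$$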

Remark 4

We reiterate that one needs to estimate m^ZX(Zij) and m^ZS(Zij) to implement the score test. These quantities can be easily estimated by performing componentwise Gaussian repeated measures regressions of Xij and Sij on Zij by using the working independence set-up.

4. Simulations

4.1. Testing without repeated measures

For the simulation for the test for $\beta_0=0$, we used the following conventions. We used 31 values of $\gamma$ in the range [-3, 3]. The variable Z was generated from the Uniform[-2, 2] distribution, and the function $\theta_0(z)=\sin(2z)$ is distinctly non-linear. In keeping with our data example, the sample size was n=1400.

We generated X in three ways:

  1. as a bivariate standard normal random variable;

  2. X=(X1, X2) where X1=Bernoulli(0.6) and X2=N(0, 1);

  3. as two dummy variables. Thus, we first generated a standard normal random variable r, and X1=I(r < -0.4) and X2=I(r>0.4).

We set $\beta_0=c(1,1)^{T}$, where we set c=0.0, 0.01, . . . , 0.15 for power calculations. The true value of $\gamma$ was varied: $\gamma_{\rm true}=0,1,2$. We ran simulations both with and without additional covariates S: in the former case, we set S to be generated from a univariate N(0, 1) distribution and used $\eta_0=1$.
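The covariate designs above can be sketched in a few lines. The outcome model for this simulation is not restated in this section, so the sketch stops at the linear predictor $X^{T}\beta_0\{1+\gamma\theta_0(Z)\}+\theta_0(Z)+S\eta_0$; using that form here, along with the function and field names, is an assumption made for illustration.

```python
import math
import random


def draw_x(design, rng):
    """Draw one bivariate covariate X under design 1, 2 or 3."""
    if design == 1:   # bivariate standard normal
        return (rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0))
    if design == 2:   # X1 ~ Bernoulli(0.6), X2 ~ N(0, 1)
        return (1.0 if rng.random() < 0.6 else 0.0, rng.gauss(0.0, 1.0))
    if design == 3:   # two dummy variables cut from one standard normal r
        r = rng.gauss(0.0, 1.0)
        return (1.0 if r < -0.4 else 0.0, 1.0 if r > 0.4 else 0.0)
    raise ValueError("design must be 1, 2 or 3")


def simulate_covariates(n, c, gamma_true, design, with_s, seed=0):
    """Generate (X, Z, S) and the linear predictor for n subjects."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = draw_x(design, rng)
        z = rng.uniform(-2.0, 2.0)
        s = rng.gauss(0.0, 1.0) if with_s else 0.0   # eta_0 = 1 when S is used
        theta = math.sin(2.0 * z)                    # theta_0(z) = sin(2z)
        xb = c * (x[0] + x[1])                       # X^T beta_0, beta_0 = c(1, 1)^T
        lin_pred = xb * (1.0 + gamma_true * theta) + theta + s
        rows.append({"x": x, "z": z, "s": s, "lin_pred": lin_pred})
    return rows
```

For example, `simulate_covariates(1400, 0.05, 1.0, design=3, with_s=True)` would correspond to one data set with the dummy-variable design, c=0.05 and $\gamma_{\rm true}=1$.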

For each scenario, we ran 1000 simulated data sets. To estimate the level of significance, we applied the method in Section 2.4 with 1500 replications. The Epanechnikov kernel was used to carry out the computation. We used different bandwidths of the form $h=\kappa\,\mathrm{std}(Z)\,n^{-1/5}$ with various values of $\kappa$ ranging from 0.5 to 2. The results are very similar in each of those cases and hence we report the results for $\kappa=1$ only. The results are displayed in Figs 1-3. Three main conclusions are clear.

Fig. 1.

Results of the simulation for testing whether $\beta=0$ as described in Section 4.1 by using kernel-based calculations (here X is a bivariate standard normal random variable; solid line, our method; - - -, naive test, which assumes $\gamma=0$; the true value that was used was $\beta=c(1,1)^{T}$; the horizontal axis plots the value of c and the vertical axis plots the corresponding power): (a) $\gamma_{\rm true}=0$, without S; (b) $\gamma_{\rm true}=1$, without S; (c) $\gamma_{\rm true}=2$, without S; (d) $\gamma_{\rm true}=0$, with S; (e) $\gamma_{\rm true}=1$, with S; (f) $\gamma_{\rm true}=2$, with S

Fig. 3.

Results of the simulation for testing whether $\beta=0$ as described in Section 4.1 by using kernel-based calculations (here X=(X1, X2) is two dummy variables; thus, we first generated a standard normal random variable r, and X1=I(r < -0.4) and X2=I(r < 0.4); solid line, our method; - - -, naive test, which assumes $\gamma=0$; the true value that was used was $\beta=c(1,1)^{T}$; the horizontal axis plots the value of c and the vertical axis plots the corresponding power): (a) $\gamma_{\rm true}=0$, without S; (b) $\gamma_{\rm true}=1$, without S; (c) $\gamma_{\rm true}=2$, without S; (d) $\gamma_{\rm true}=0$, with S; (e) $\gamma_{\rm true}=1$, with S; (f) $\gamma_{\rm true}=2$, with S

  1. The test level of our method is near nominal, being 0.051 without S and 0.057 with S in the model.

  2. For the main effects model with γtrue=0, our maximized score-type test loses only modest power compared with the efficient (in this case) main effects score test.

  3. When there are interactions, our methods greatly dominate the main effects score test as γtrue increases.

For comparison, we repeated the simulation by using penalized B-spline regression, using a second-order B-spline with 10 basis functions and with a second-order difference penalty. The smoothing parameter was chosen by generalized cross-validation. The results were very similar to those obtained for kernel methods. The near equivalence of kernel and spline methods here is no surprise, since there is evidence in Gaussian cases that smoothing splines are equivalent to kernel methods (Silverman, 1984; Lin et al., 2004). Recently, Li and Ruppert (2008) showed that penalized B-spline regression is also asymptotically equivalent to kernel regression methods in the Gaussian case.

4.2. Testing with repeated measures

We use the following set-up for our simulations for testing $\beta_0=0$. We generate samples from the partially linear Gaussian repeated measures model: for i=1, . . . , n and j=1, . . . , J,

$$Y_{ij} = X_{ij}^{T}\beta_0 + \theta_0(Z_{ij})(1+\gamma X_{ij}^{T}\beta_0) + \varepsilon_{ij},$$

with n=200 and J=3, where we take the true value of the parameter to be $\beta_0=c(1,-1)^{T}$ and set c=0.001, . . . , 0.06 for power calculation. We set $\theta_0(z)=\sin(2z)$ to be the true function. We generated X from the standard bivariate normal distribution and Z from the Uniform[-2, 2] distribution. The error vectors $(\varepsilon_1,\ldots,\varepsilon_J)^{T}$ are generated from a multivariate normal distribution with covariance matrix $\Sigma = I + 0.6\,(11^{T} - I)$.
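A data set of this form can be generated as in the sketch below. The covariance matrix $I+0.6(11^{T}-I)$ has unit variances and common correlation 0.6, so the errors can be built from one subject-level normal draw shared by all J measurements plus independent noise. Drawing X and Z independently for every (i, j), and the function name, are assumptions made here, since the dependence of the covariates within a subject is not specified in the text.

```python
import math
import random


def simulate_repeated(n=200, J=3, c=0.03, gamma_true=1.0, rho=0.6, seed=0):
    """Return n subjects, each a list of J tuples (y, x, z)."""
    rng = random.Random(seed)
    beta0 = (c, -c)                                   # beta_0 = c(1, -1)^T
    data = []
    for _ in range(n):
        shared = rng.gauss(0.0, 1.0)                  # common to one subject
        subject = []
        for _ in range(J):
            x = (rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0))
            z = rng.uniform(-2.0, 2.0)
            # var(eps) = rho + (1 - rho) = 1 and cov(eps_j, eps_k) = rho
            eps = math.sqrt(rho) * shared + math.sqrt(1.0 - rho) * rng.gauss(0.0, 1.0)
            theta = math.sin(2.0 * z)                 # theta_0(z) = sin(2z)
            xb = beta0[0] * x[0] + beta0[1] * x[1]
            y = xb + theta * (1.0 + gamma_true * xb) + eps
            subject.append((y, x, z))
        data.append(subject)
    return data
```

Setting `c=0.0` gives data under the null hypothesis, which is the case used to check the level of the test.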

We use 11 values of $\gamma$ in [0, 2] to compute the test statistic. The true values of $\gamma$ that are used to generate the data are taken to be $\gamma_{\rm true}=0, 1, 2$. As in the previous simulation, we use the Epanechnikov kernel with bandwidth $h=\kappa\,\mathrm{std}(Z)\,n^{-1/5}$ where the value of $\kappa$ ranged from 0.5 to 2. In this case also, we observe that the results are very similar for each of the choices of bandwidth and hence we report the results for $\kappa=1$. We generate 1000 data sets for each case and for each data set we apply our method by using 1000 replications. The results are given in Fig. 4. The level of our test is 0.051, which is very close to the nominal level of 0.05. It is evident that, although our test loses very little power when $\gamma_{\rm true}=0$, it achieves great power gain in the presence of interaction as seen in cases where $\gamma_{\rm true}=1, 2$.
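The bandwidth rule and the kernel are simple enough to write down directly. In the sketch below, taking std(Z) to be the sample standard deviation of the pooled $Z_{ij}$ and n to be the number of subjects are assumptions, as are the function names. The last helper approximates the kernel second moment $\phi_2=\int z^{2}K(z)\,dz$ that enters the bias term of the kernel expansions; for the Epanechnikov kernel it equals 1/5.

```python
import statistics


def epanechnikov(u):
    """K(u) = 0.75 (1 - u^2) for |u| <= 1 and 0 otherwise."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0


def bandwidth(z_values, n_subjects, kappa=1.0):
    """h = kappa * std(Z) * n^(-1/5), with std the sample standard deviation."""
    return kappa * statistics.stdev(z_values) * n_subjects ** (-0.2)


def kernel_second_moment(kernel, steps=2000):
    """Midpoint-rule approximation of the integral of u^2 K(u) over [-1, 1]."""
    width = 2.0 / steps
    total = 0.0
    for s in range(steps):
        u = -1.0 + (s + 0.5) * width
        total += u * u * kernel(u) * width
    return total
```

A sensitivity check of the kind reported above would loop over, say, `kappa in (0.5, 1.0, 2.0)` and rerun the test with each `bandwidth(z_values, n_subjects, kappa)`.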

Fig. 4.

Results of the simulation for testing whether $\beta_0=0$, as described in Section 4.2 by using kernel-based calculations (solid line, our method; - - -, naive test, which assumes $\gamma=0$; the true value that was used was $\beta=c(1,-1)^{T}$; the horizontal axis plots the value of c and the vertical axis plots the corresponding power): (a) $\gamma_{\rm true}=0$; (b) $\gamma_{\rm true}=1$; (c) $\gamma_{\rm true}=2$

We redid the simulation by using B-splines with 10 basis functions where the penalty parameter is estimated at the null model by using generalized cross-validation. The results are nearly identical to those in Fig. 4, as we would expect in the Gaussian case.

5. Data analysis

Chatterjee et al. (2006) illustrated application of their methodology by using a case-control study for investigation of association between colorectal adenoma, a precursor of colorectal cancer, and NAT2, a candidate gene that is known to play an important role in detoxification of certain aromatic carcinogens in cigarette smoke. The study involved about 700 cases and 700 controls who were genotyped for six known functional polymorphisms related to NAT2 acetylation activity. The genotype data were used to construct diplotype information, i.e. the pair of haplotypes that the subjects carried along their pair of homologous chromosomes. The frequency distribution of these diplotypes and associated acetylation phenotypes are shown in Table 4 of Chatterjee et al. (2006). In principle, the diplotypes are not observed directly and we can only assign diplotypes on the basis of the unphased genotype data. However, in many instances such as this example, when we have tightly linked SNPs, the phase ambiguity is often minimal, i.e. we can assign a very large proportion (greater than 95%) of the subjects a specific diplotype with a very high probability (greater than 0.95). In such cases, it is easier just to remove those few people for whom the diplotypes are more uncertain and to assume that for the rest of the people the diplotypes are known. In our data set, we removed a small number of people whose haplotypes were quite uncertain.

Chatterjee et al. (2006) considered an omnibus test that can account for interaction of NAT2 with smoking history, defined as ever, former or never smokers. We consider a similar application involving NAT2 diplotypes but model the effect of CIG_STOP (years since stopping smoking) in a continuous fashion with non-parametric regression among smokers. Because of a few high leverage values, we censored CIG_STOP at 45. In our analysis, the cofactor S included gender and three indicator dummy variables for age level: between 60 and 65 years, between 65 and 70 years and more than 70 years. For modelling the effect of NAT2 diplotypes, we considered a series of 14 different analyses where in the kth analysis we compare the risk that is associated with the k (k=1, . . . , 14) most common diplotypes in reference to the rest, with the associated design matrix $X_k$ being defined by k corresponding dummy variables. To account for non-smokers in this analysis, we defined $\delta$ to be the indicator of smoking (ever versus never) and considered the following model:

$$\mathrm{pr}(D=1\mid X,S,Z) = H\{(1-\delta)\beta_0 + S^{T}\beta_1 + X^{T}\beta_2 + \delta\,\theta(Z) + \gamma\,\delta\,X^{T}\beta_2\,\theta(Z)\}. \qquad (12)$$

Modifying our methods to handle this slightly more complex model is straightforward: details are available from the authors.
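To make the structure of model (12) concrete, the sketch below evaluates the case probability for one subject. It assumes that H is the logistic distribution function, which is the usual choice for case-control data but is not restated in this section, and it takes $\theta$ as a user-supplied function; all names are hypothetical.

```python
import math


def logistic(x):
    """Logistic distribution function H(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def prob_case(x, s, z, delta, beta0, beta1, beta2, gamma, theta):
    """pr(D = 1 | X, S, Z) under model (12).

    delta is 1 for ever smokers and 0 for never smokers; theta(z) is only
    evaluated for smokers, for whom years since stopping smoking is defined.
    """
    sb = sum(si * bi for si, bi in zip(s, beta1))     # S^T beta_1
    xb = sum(xi * bi for xi, bi in zip(x, beta2))     # X^T beta_2
    th = theta(z) if delta == 1 else 0.0
    eta = (1 - delta) * beta0 + sb + xb + delta * th + gamma * delta * xb * th
    return logistic(eta)
```

For never smokers the model reduces to an ordinary logistic regression with intercept $\beta_0$, and for smokers with $\gamma=0$ it reduces to a partially linear logistic model without interaction, which is the naive main effects model.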

Table 1 compares results of the proposed method for testing β2=0 on the basis of model (12) with those for a test for only the corresponding main effects of the diplotypes, ignoring NAT2-smoking interaction, i.e. assuming γ=0. We observe that, in each analysis, stronger evidence of association is seen in our new test. For example, when the 12 most common diplotypes were used, our method had a level of significance of 0.036 versus a level of significance of 0.214 for the main-effect-based test. Interestingly, when all 14 common diplotypes are used, the level of significance of the test proposed was 0.066, which is quite close to that for the test that was used by Chatterjee et al. (2006), also using all the 14 diplotypes, but accounting for interaction with the categorical smoking history variable defined as never, former or current smoker.

Table 1.

Levels of significance (p-values) of the test for genetic effects in a regression model in which Z is years since stopping smoking

| Diplotype | Our method: test | Our method: p-value | γ=0: test | γ=0: p-value |
|---|---|---|---|---|
| 1 | 11.4 | 0.001 | 3.3 | 0.066 |
| 2 | 13.9 | 0.003 | 5.7 | 0.055 |
| 3 | 16.6 | 0.002 | 9.8 | 0.016 |
| 4 | 16.7 | 0.007 | 9.8 | 0.041 |
| 5 | 19.5 | 0.007 | 11.3 | 0.045 |
| 6 | 19.7 | 0.017 | 11.4 | 0.087 |
| 7 | 20.0 | 0.021 | 12.3 | 0.098 |
| 8 | 21.3 | 0.025 | 13.1 | 0.111 |
| 9 | 24.1 | 0.015 | 14.2 | 0.116 |
| 10 | 25.2 | 0.016 | 15.3 | 0.120 |
| 11 | 25.2 | 0.027 | 15.4 | 0.180 |
| 12 | 25.6 | 0.036 | 15.4 | 0.214 |
| 13 | 25.9 | 0.055 | 15.8 | 0.262 |
| 14 | 26.7 | 0.066 | 16.6 | 0.279 |

Age category and gender were modelled additively and parametrically. The analysis is done for the most common diplotype, the most common two diplotypes, and so on. The non-parametric regression was done by using penalized order 2 B-splines with 10 segments, with penalization done via generalized cross-validation.

6. Discussion

We have developed methodology for an efficient score test for genetic effect in general semiparametric models that can account for gene-environment interaction with non-parametrically specified environmental effects. The procedure proposed allows for repeated measurements.

We proposed a profiled score statistic which can be performed by using standard bandwidth selection procedures. We also found that these profiled score tests are efficient.

The main difficulty of performing the score test is that one must estimate a function which itself is a solution of an integral equation that is difficult to solve. In the case of repeatedly measured data, the solution generally does not have a closed form expression and hence some sort of numerical procedure is required for estimation. In this paper, we overcome this problem by developing an easily implementable estimation procedure which does not involve solving integral equations and can be performed easily via standard software. The key idea lies in the fact that the target functions, based on their estimating equations, can be interpreted as Gaussian repeated measures regressions.

Simulations that were presented in the paper show that the score tests proposed maintain the desired type I error level, indicating that the asymptotic approximations work well for studies such as ours. Moreover, both simulation studies and the data example indicate that the score test proposed taking account of the interaction can achieve higher statistical power than naive tests which ignore interaction altogether. Future research areas of interest include extension of the score test to account for the interaction of the genetic factors with several different, but biologically related, environmental factors, such as different biomarkers for a nutrient, simultaneously. In principle, the score test can be extended by using generalized additive models to account for the effect of several different continuous exposures. Further theoretical development, however, is needed to establish the asymptotic theory for such procedures.

Fig. 2.

Results of the simulation for testing whether $\beta=0$ as described in Section 4.1 by using kernel-based calculations (here X=(X1, X2) where X1=Bernoulli(0.6) and X2=N(0, 1); solid line, our method; - - -, naive test, which assumes $\gamma=0$; the true value that was used was $\beta=c(1,1)^{T}$; the horizontal axis plots the value of c and the vertical axis plots the corresponding power): (a) $\gamma_{\rm true}=0$, without S; (b) $\gamma_{\rm true}=1$, without S; (c) $\gamma_{\rm true}=2$, without S; (d) $\gamma_{\rm true}=0$, with S; (e) $\gamma_{\rm true}=1$, with S; (f) $\gamma_{\rm true}=2$, with S

Acknowledgements

Maity and Carroll’s research was supported by grants from the National Cancer Institute (CA-57030 and CA104620). Mammen’s research was supported by the Deutsche Forschungsgemeinschaft project MA 1026/7-3. Chatterjee’s research was supported by a ‘Gene-environment initiative’ grant from the National Heart, Lung and Blood Institute and by the intramural research programme of the National Cancer Institute.

Appendix A: Argument for result 1

For simplicity of notation, here we consider only the case that there are no nuisance parameters ζ0. The more general case is a simple extension.

To prove the results, we rely on several technical conditions that for brevity we do not state here explicitly. These conditions are well known and standard in smoothing theory. Refer to Claeskens and Van Keilegom (2003), Claeskens and Carroll (2007) and Lin and Carroll (2006) among many others for the details of these assumptions. As stated just before result 1, we require that the linear expansions for the parametric and non-parametric parts that were given in Lin and Carroll (2006) hold to order $o_p(n^{-1/2})$; the latter uniformly.

A.1. Expansion of Tn(γ)

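Let $\theta^{(j)}(\cdot)$ be the jth derivative of $\theta(\cdot)$ with respect to $z_0$. Let $f_Z(z_0)$ be the density function of Z. Make the definitions

$$\Omega(z_0) = E[L_{\theta\theta}\{Y, S^{T}\eta_0+\theta_0(Z)\}\mid Z=z_0], \qquad \theta_\eta(z_0,\eta_0) = -\frac{E[S\,L_{\theta\theta}\{Y, S^{T}\eta_0+\theta_0(Z)\}\mid Z=z_0]}{\Omega(z_0)}.$$

Note that $S_i+\theta_\eta(Z_i,\eta_0)=\tilde{S}_i$, and recall that

$$M = \mathrm{cov}[\{S+\theta_\eta(Z,\eta_0)\}\,L_\theta\{Y, S^{T}\eta_0+\theta_0(Z)\}].$$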

Then using Lin and Carroll (2006) we have that, uniformly in z0,

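$$\hat\theta(z_0,\eta_0)-\theta_0(z_0) = -n^{-1}\sum_{i=1}^{n}\frac{K_h(Z_i-z_0)\,L_\theta\{Y_i, S_i^{T}\eta_0+\theta_0(Z_i)\}}{f_Z(z_0)\,\Omega(z_0)} + \frac{\phi_2h^{2}}{2}\,\theta_0^{(2)}(z_0) + O_p\Big\{h^{4}+\frac{\log(n)}{nh}\Big\}, \qquad (13)$$

$$\hat\eta-\eta_0 = M^{-1}n^{-1}\sum_{i=1}^{n}\tilde{S}_i\,L_\theta\{Y_i, S_i^{T}\eta_0+\theta_0(Z_i)\} + o_p(n^{-1/2}). \qquad (14)$$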

The score statistic for β is, via Taylor series,

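$$T_{n,\mathrm{pro}}(\gamma) = n^{-1/2}\sum_{i=1}^{n}\{1+\gamma\theta_0(Z_i)\}\tilde{X}_i\,L_\theta\{Y_i, S_i^{T}\eta_0+\theta_0(Z_i)\} + n^{-1/2}\sum_{i=1}^{n}S_{1i}(\gamma)\{\hat\theta(Z_i,\eta_0)-\theta_0(Z_i)\} + n^{-1/2}\sum_{i=1}^{n}S_{2i}(\gamma)(\hat\eta-\eta_0) + o_p(1) = A_{1n}+A_{2n}+A_{3n}+o_p(1),$$

where

$$S_{1i}(\gamma) = \tilde{X}_i\big[\gamma L_\theta\{Y_i, S_i^{T}\eta_0+\theta_0(Z_i)\} + \{1+\gamma\theta_0(Z_i)\}L_{\theta\theta}\{Y_i, S_i^{T}\eta_0+\theta_0(Z_i)\}\big],$$

$$S_{2i}(\gamma) = \gamma\,\theta_\eta(Z_i)\,\tilde{X}_i\,L_\theta\{Y_i, S_i^{T}\eta_0+\theta_0(Z_i)\} + \{1+\gamma\theta_0(Z_i)\}\tilde{X}_i\tilde{S}_i^{T}L_{\theta\theta}\{Y_i, S_i^{T}\eta_0+\theta_0(Z_i)\}.$$

By definition of $\tilde{X}$, it is easy to see that, to order $o_p(1)$,

$$A_{2n} = -n^{-1/2}\sum_{i=1}^{n}L_\theta\{Y_i, S_i^{T}\eta_0+\theta_0(Z_i)\}\,\frac{E\{S_{1i}(\gamma)\mid Z_i\}}{\Omega(Z_i)} = 0,$$

where we have used equations (6) and (13). Also, using equation (14) and the definition of N we obtain

$$A_{3n} = NM^{-1}n^{-1/2}\sum_{i=1}^{n}\tilde{S}_i\,L_\theta\{Y_i, S_i^{T}\eta_0+\theta_0(Z_i)\} + o_p(1).$$

The result now follows by collecting all the terms. It is readily seen that the expansion is uniform in $\gamma\in[L,R]$.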

A.2. Weak convergence

Weak convergence is trivial. Examining the form of the test statistic Tn,pro(γ) in equation (8), we see that it is linear in γ and can be written as Un + γVn, where (Un, Vn) are jointly asymptotically normally distributed.

Appendix B: Argument for result 2

Define

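$$\Omega(z) = \sum_{j=1}^{J}f_j(z)\,E\{L_{jj\theta}(\cdot)\mid Z_j=z\}$$

and

$$A(B,z_1,z_2) = \sum_{j=1}^{J}\sum_{k\ne j}f_j(z_1)\,E\Big\{L_{jk\theta}(\cdot)\,\frac{B(Z_k,z_2)}{\Omega(Z_k)}\,\Big|\,Z_j=z_1\Big\}, \qquad Q(z_1,z_2) = \sum_{j=1}^{J}\sum_{k\ne j}f_{jk}(z_1,z_2)\,\frac{E\{L_{jk\theta}(\cdot)\mid Z_j=z_1, Z_k=z_2\}}{\Omega(z_2)},$$

where $f_j(z)$ is the density of $Z_j$ and $f_{jk}(z_1,z_2)$ is the bivariate density of $(Z_j,Z_k)$, which are assumed to have bounded support and are positive on the support. Let $G(z_1,z_2)$ be the solution to

$$G(z_1,z_2) = Q(z_1,z_2) - A(G,z_1,z_2).$$

Using the results of Lin and Carroll (2006) we obtain that, uniformly in z,

$$\hat\theta(z,\eta_0)-\theta_0(z) = \frac{\phi_2h^{2}}{2}\,b(z) - n^{-1}\sum_{i=1}^{n}\sum_{j=1}^{J}\frac{K_h(Z_{ij}-z)\,L_{ij\theta}(\cdot)}{\Omega(z)} + n^{-1}\sum_{i=1}^{n}\sum_{j=1}^{J}\frac{L_{ij\theta}(\cdot)\,G(z,Z_{ij})}{\Omega(z)} + o_p\Big\{h^{4}+\frac{\log(n)}{nh}\Big\}, \qquad (15)$$

$$\hat\eta-\eta_0 = M_1^{-1}n^{-1}\sum_{i=1}^{n}\sum_{j=1}^{J}\{S_{ij}+\theta_\eta(Z_{ij},\eta_0)\}\,L_{ij\theta}(\cdot) + o_p(n^{-1/2}). \qquad (16)$$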

Define

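$$T_{k,n}(\gamma) = \sum_{j=1}^{J}\big[X_j\{1+\gamma\theta_0(Z_j)\}+\theta_\beta(Z_j,0,\eta_0,\gamma)\big]L_{jk\theta}(\cdot), \qquad T_{\eta,n}(\gamma) = \sum_{j=1}^{J}\sum_{k=1}^{J}\big[X_j\{1+\gamma\theta_0(Z_j)\}+\theta_\beta(Z_j,0,\eta_0,\gamma)\big]\{S_k+\theta_\eta(Z_k,\eta_0)\}^{T}L_{jk\theta}(\cdot).$$

It is easily shown that

$$T_{n,\mathrm{pro}}(\gamma) = n^{-1/2}\sum_{i=1}^{n}\sum_{j=1}^{J}\big[X_{ij}\{1+\gamma\theta_0(Z_{ij})\}+\theta_\beta(Z_{ij},0,\eta_0,\gamma)\big]L_{ij\theta}(\cdot) + n^{-1/2}\sum_{i=1}^{n}T_{i\eta,n}(\gamma)(\hat\eta-\eta_0) + n^{-1/2}\sum_{i=1}^{n}\sum_{k=1}^{J}T_{ik,n}(\gamma)\{\hat\theta(Z_{ik},\eta_0)-\theta_0(Z_{ik})\} + n^{-1/2}\sum_{i=1}^{n}\sum_{j=1}^{J}L_{ij\theta}(\cdot)\{\hat\theta_\beta(Z_{ij},0,\hat\eta,\gamma)-\theta_\beta(Z_{ij},0,\eta_0,\gamma)\} + o_p(1).$$

Using equation (16) and the fact that $E\{T_{\eta,n}(\gamma)\}=M_2$, it is easy to see that

$$n^{-1/2}\sum_{i=1}^{n}T_{i\eta,n}(\gamma)(\hat\eta-\eta_0) = M_2M_1^{-1}n^{-1/2}\sum_{i=1}^{n}\sum_{j=1}^{J}\{S_{ij}+\theta_\eta(Z_{ij},\eta_0)\}\,L_{ij\theta}(\cdot) + o_p(1).$$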

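Next, using equation (15), we now derive that, up to terms of $o_p(1)$,

$$n^{-1/2}\sum_{i=1}^{n}\sum_{k=1}^{J}T_{ik,n}(\gamma)\{\hat\theta(Z_{ik},\eta_0)-\theta_0(Z_{ik})\} = -n^{-1/2}\sum_{i=1}^{n}\sum_{k=1}^{J}T_{ik,n}(\gamma)\Big\{n^{-1}\sum_{r=1}^{n}\sum_{j=1}^{J}\frac{K_h(Z_{rj}-Z_{ik})\,L_{rj\theta}(\cdot)}{\Omega(Z_{ik})}\Big\} + n^{-1/2}\sum_{i=1}^{n}\sum_{k=1}^{J}T_{ik,n}(\gamma)\Big\{n^{-1}\sum_{r=1}^{n}\sum_{j=1}^{J}\frac{L_{rj\theta}(\cdot)\,G(Z_{ik},Z_{rj})}{\Omega(Z_{ik})}\Big\} = n^{-1/2}\sum_{r=1}^{n}\sum_{j=1}^{J}L_{rj\theta}(\cdot)\{-C_1(Z_{rj},\gamma)+C_2(Z_{rj},\gamma)\},$$

where we define

$$C_1(z,\gamma) = \sum_{k=1}^{J}\frac{f_k(z)\,E\{T_{ik,n}(\gamma)\mid Z_k=z\}}{\Omega(z)}, \qquad C_2(z,\gamma) = E\Big[\sum_{k=1}^{J}\frac{E\{T_{ik,n}(\gamma)\mid Z_k\}\,G(Z_k,z)}{\Omega(Z_k)}\Big].$$

We now note that

$$\sum_{k=1}^{J}f_k(z)\,E\{T_{ik,n}(\gamma)\mid Z_k=z\} = 0$$

by definition of $\theta_\beta(\cdot)$ with $\beta_0=0$ and hence $C_1(z,\gamma)=C_2(z,\gamma)=0$.

Finally, we recognize that $\hat\theta_\beta(\cdot)$ is the repeated measures regression of $Q_{ij}$ on $Z_{ij}$ and hence yields an asymptotic expansion similar to equation (15). Together with the fact that $E\{L_{j\theta}(\cdot)\mid X,S,Z\}=0$, it is now straightforward to show that the fourth term in the expansion of $T_{n,\mathrm{pro}}(\gamma)$ is $o_p(1)$, completing the proof.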

Appendix C: Argument for result 3

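Under the null hypothesis, $\theta_\beta(z,0,\delta_0,\gamma)$ solves

$$0 = \sum_{j=1}^{J}f_j(z)\,E\Big(\sum_{k=1}^{J}\big[X_k\{1+\gamma\theta_0(Z_k)\}+\theta_\beta(Z_k,0,\delta_0,\gamma)\big]L_{jk\theta}(\cdot)\,\Big|\,Z_j=z\Big). \qquad (17)$$

Recall that $K_h(z)=h^{-1}K(z/h)$ and $G_h(z)=(1,z/h)^{T}$. Consider the problem of solving, for $\{m(z), m^{(1)}(z)\}$,

$$0 = n^{-1}\sum_{i=1}^{n}\sum_{j=1}^{J}K_h(Z_{ij}-z)\,G_h(Z_{ij}-z)\Big[\sum_{k\ne j}\nu_{ijk}\{Q_{ik}-m(Z_{ik})\} + \nu_{ijj}Q_{ij} - \nu_{ijj}\,G_h(Z_{ij}-z)^{T}\{m(z),m^{(1)}(z)\}^{T}\Big],$$

where $\nu_{ijk}=L_{ijk\theta}(\cdot)$. Define

$$F_n(z) = n^{-1}\sum_{i=1}^{n}\sum_{j=1}^{J}\nu_{ijj}\,K_h(Z_{ij}-z)\,G_h(Z_{ij}-z)^{T}.$$

The solution then satisfies

$$F_n(z)\{m(z),m^{(1)}(z)\}^{T} = n^{-1}\sum_{i=1}^{n}\sum_{j=1}^{J}K_h(Z_{ij}-z)\Big[\sum_{k\ne j}\nu_{ijk}\{Q_{ik}-m(Z_{ik})\} + \nu_{ijj}Q_{ij}\Big].$$

Note that

$$F_n(z) = \sum_{j=1}^{J}E(\nu_{jj}\mid Z_j=z)\begin{pmatrix}f_j(z) & 0\\ 0 & \phi_2\end{pmatrix} + o_p(1),$$

where $\phi_2=\int z^{2}K(z)\,dz$. Hence, taking the limit of both sides we obtain that $m(z)$ satisfies

$$\sum_{j=1}^{J}E(\nu_{jj}\mid Z_j=z)\,f_j(z)\,m(z) = \sum_{j=1}^{J}f_j(z)\sum_{k\ne j}E[\nu_{jk}\{Q_k-m(Z_k)\}\mid Z_j=z] + \sum_{j=1}^{J}f_j(z)\,E(\nu_{jj}Q_j\mid Z_j=z),$$

which is identical to equation (17) with $m(z)=\theta_\beta(z,0,\delta_0,\gamma)$. This completes the argument.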

Contributor Information

Arnab Maity, Texas A&M University, College Station, USA.

Raymond J. Carroll, Texas A&M University, College Station, USA.

Enno Mammen, University of Mannheim, Germany.

Nilanjan Chatterjee, National Cancer Institute, Rockville, USA.

References

  1. Chatterjee N, Kalaylioglu Z, Moslehi R, Peters U, Wacholder S. Powerful multi-locus tests for genetic association in the presence of gene-gene and gene-environment interactions. Am. J. Hum. Genet. 2006;79:1002–1016. doi: 10.1086/509704.
  2. Claeskens G, Carroll RJ. An asymptotic theory for model selection inference in general semiparametric problems. Biometrika. 2007;94:249–265.
  3. Claeskens G, Van Keilegom I. Bootstrap confidence bands for regression curves and their derivatives. Ann. Statist. 2003;31:1852–1884.
  4. Davies RB. Hypothesis testing when a nuisance parameter is present only under the null hypothesis. Biometrika. 1987;74:33–43.
  5. Hayes RB, Reding D, Kopp W, Subar AF, Bhat N, Rothman N, Caporaso N, Ziegler RG, Johnson CC, Weissfeld JL, Hoover RN, Hartge P, Palace C, Gohagan JK. Etiologic and early marker studies in the prostate, lung, colorectal and ovarian (PLCO) cancer screening trial. Contr. Clin. Trials. 2000;21(suppl 6):349S–355S. doi: 10.1016/s0197-2456(00)00101-x.
  6. Horowitz JT, Spokoiny VG. An adaptive, rate-optimal test of a parametric model against a non-parametric alternative. Econometrica. 2001;69:599–631.
  7. Huggins R. Understanding nonparametric estimation for clustered data. Biometrika. 2006;93:486–489.
  8. Kraft P, Yen Y-C, Stram DO, Morrison J, Gauderman WJ. Exploiting gene-environment interaction to detect genetic associations. Hum. Hered. 2007;63:111–119. doi: 10.1159/000099183.
  9. Li Y, Ruppert D. On the asymptotics of penalized splines. Biometrika. 2008;95:415–437.
  10. Lin DY, Zou F. Assessing genomewide statistical significance in linkage studies. Genet. Epidem. 2004;27:202–214. doi: 10.1002/gepi.20017.
  11. Lin X, Carroll RJ. Semiparametric estimation in general repeated measures problems. J. R. Statist. Soc. B. 2006;68:69–88.
  12. Lin X, Wang N, Welsh A, Carroll RJ. Equivalent kernels of smoothing splines in nonparametric regression for clustered data. Biometrika. 2004;91:177–193.
  13. Silverman B. Spline smoothing: the equivalent variable kernel method. Ann. Statist. 1984;12:898–916.
  14. Tukey JW. One degree of freedom for non-additivity. Biometrics. 1949;5:232–242.
  15. Wang N. Marginal nonparametric kernel regression accounting for within-subject correlation. Biometrika. 2003;90:43–52.
