Author manuscript; available in PMC: 2018 Jan 29.
Published in final edited form as: Stat Sin. 2018 Jan;28:63–92. doi: 10.5705/ss.202016.0063

Two-Sample Tests for High-Dimensional Linear Regression with an Application to Detecting Interactions

Yin Xia 1, Tianxi Cai 2, T Tony Cai 3
PMCID: PMC5788049  NIHMSID: NIHMS874424  PMID: 29386856

Abstract

Motivated by applications in genomics, we consider in this paper global and multiple testing for the comparisons of two high-dimensional linear regression models. A procedure for testing the equality of the two regression vectors globally is proposed and shown to be particularly powerful against sparse alternatives. We then introduce a multiple testing procedure for identifying unequal coordinates while controlling the false discovery rate and false discovery proportion. Theoretical justifications are provided to guarantee the validity of the proposed tests and optimality results are established under sparsity assumptions on the regression coefficients. The proposed testing procedures are easy to implement. Numerical properties of the procedures are investigated through simulation and data analysis. The results show that the proposed tests maintain the desired error rates under the null and have good power under the alternative at moderate sample sizes. The procedures are applied to the Framingham Offspring study to investigate the interactions between smoking and cardiovascular related genetic mutations important for an inflammation marker.

Keywords: False discovery proportion, false discovery rate, high-dimensional linear regression, hypothesis testing, multiple comparisons, sparsity, two-sample tests

1 Introduction

As we enter a new era of data science, called by some the “information century”, research in several novel genomics and epigenomics fields is well underway. Large-scale genome-wide scans, such as genome-wide association studies, have become widely available tools for identifying common genetic variants that contribute to complex diseases and treatment responses (McCarthy et al. (2008); Venter et al. (2001)). However, there is growing evidence that genetic variants alone explain only a small proportion of the variation in complex disease phenotypes. Most complex diseases result from an interplay between genes and environment (Hunter (2005)). It is thus of substantial interest to rigorously study the effects of the environment and its interaction with genetic predispositions on disease phenotypes.

When the environmental factor is a binary variable such as smoking status or gender, such interaction problems can be addressed through the two-sample high-dimensional regression framework. Specifically, interaction detection can be formulated based on comparing two high-dimensional regression models

Yd=μd+Xdβd+εd,  for d=1,2, (1)

and identifying the nonzero components of β1 − β2, where βd = (β1,d, …, βp,d)T ∈ ℝp, μd = (μ1,d, …, μnd,d)T, Xd = (X1,·,dT, …, Xnd,·,dT)T, Yd = (Y1,d, …, Ynd,d)T, and εd = (ε1,d, …, εnd,d)T, with {εk,d} being independent and identically distributed (i.i.d.) random variables with mean zero and variance σεd2, independent of Xk,·,d, k = 1, …, nd. Two-sample interaction detection problems arise in many other biomedical settings. For example, when the two samples represent diseased and non-diseased groups and Y represents a diagnostic test, the non-zero components of β1 − β2 represent the covariates that affect the diagnostic accuracy of Y (Pepe (2003)). When the two samples represent two treatment groups, the proposed testing procedures have important applications in personalized medicine: the non-zero components of β1 − β2 correspond to markers useful for individualized treatment selection, since the rule that optimizes the treatment selection for an individual patient with genomic markers X can be formed based on (β1 − β2)TX (Matsouaka et al. (2014)). However, the high dimensionality of the genomic data presents substantial statistical challenges in efficiently identifying gene-environment interactions and markers useful for personalized treatment selection.
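For concreteness, model (1) can be simulated as follows. This is a minimal sketch with standard Gaussian covariates; the sample sizes, sparsity level, and signal strength are illustrative choices rather than values used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_two_sample(n1=100, n2=100, p=50, s=5, delta=1.0):
    """Draw (X_d, Y_d) from Y_d = mu_d + X_d beta_d + eps_d for d = 1, 2,
    where beta_1 and beta_2 differ in `s` coordinates by `delta`."""
    beta1 = np.zeros(p)
    beta1[:s] = 1.0                               # a few nonzero coefficients
    diff_idx = rng.choice(p, size=s, replace=False)
    beta2 = beta1.copy()
    beta2[diff_idx] += delta                      # the unequal coordinates
    X1 = rng.standard_normal((n1, p))
    X2 = rng.standard_normal((n2, p))
    Y1 = 0.5 + X1 @ beta1 + rng.standard_normal(n1)    # mu_1 = 0.5
    Y2 = -0.5 + X2 @ beta2 + rng.standard_normal(n2)   # mu_2 = -0.5
    return (X1, Y1), (X2, Y2), diff_idx

sample1, sample2, truth = simulate_two_sample()
```

The nonzero entries of β1 − β2 (here, `truth`) are exactly the coordinates the multiple testing procedure of Section 4 aims to recover.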

There is a paucity of literature focusing on multiple testing of the regression coefficients in the high-dimensional two-sample setting while controlling the false discovery rate (FDR) and false discovery proportion (FDP). For example, Zhang and Zhang (2014), Van de Geer et al. (2014), and Javanmard and Montanari (2013, 2014) considered confidence intervals and tests for a given coordinate of a high-dimensional linear regression vector. Procedures that are based on the “de-biased” Lasso estimators were proposed. The focus was solely on inference for a given coordinate and simultaneous testing of all coordinates was not considered. Recently, Liu and Luo (2014) investigated the one-sample version of the multiple testing problem, testing simultaneously

H0,i : βi,1 = 0 versus H1,i : βi,1 ≠ 0,  i = 1, …, p,

with the control of FDR. They constructed the test statistics based on bias-corrected sample covariances of the residuals and inverse regression, as explained in detail in Section 2.2. The one-sample setting is simpler than the two-sample multiple testing problem considered in the present paper. For example, their proposed test statistics have desirable theoretical properties due to the facts that (i) they are asymptotically normally distributed under H0,i:βi,1=0, and (ii) the correlation between two test statistics is equal to the partial correlation between two covariates, which is fully determined by the precision matrix. However, those properties no longer hold when we extend the hypothesis testing problem to two samples as described in (3).

In this paper, we are interested in developing efficient procedures for comparing β1 and β2. The first goal is to develop a global test for

H0 : β1 = β2  versus  H1 : β1 ≠ β2 (2)

that is powerful against sparse alternatives. We then develop a procedure for simultaneously testing the hypotheses

H0,i : βi,1 = βi,2  versus  H1,i : βi,1 ≠ βi,2,  i = 1, …, p, (3)

with FDR and FDP control. The test statistics are constructed using the covariances between the residuals of the fitted regression models and the inverse regression models. Although the techniques build on the inverse regression method developed in Liu and Luo (2014) for the one-sample case, the two-sample case poses significant additional difficulties in both methodology development and technical analysis. We point out two such major challenges here; a more detailed discussion is given in Section 2.3.

  1. The construction of test statistics is much more involved than the one-sample case. This is mainly due to the fact that the difference of regression coefficients can no longer be reduced to the difference of residual covariances as in the one-sample setting. Furthermore, corrections of the test statistics are essential in the two-sample case to establish the asymptotic normality.

  2. The technical analyses of the two-sample case are much more challenging. This is because the one-sample case can be easily reduced to a weakly correlated testing problem provided that the precision matrix of the covariates is sparse or nearly sparse, while the two-sample case cannot as the correlation structure is much more complicated.

The properties of the proposed testing procedures are investigated theoretically as well as numerically through simulation and data analysis. Theoretical justifications are provided to ensure the validity of the proposed tests and optimality results are established under sparsity assumptions on the regression coefficients. A simulation study is carried out to demonstrate that the proposed tests maintain the desired error rates under the null and have good power under the alternative at moderate sample sizes. The simulation results also show that the new multiple testing procedure outperforms the well known Benjamini-Yekutieli procedure (Benjamini and Yekutieli (2001)). In addition, the proposed testing procedures are illustrated by an application to the Framingham Offspring Study (Kannel et al., 1979) to study how smoking and its interaction with a genetic predisposition affect an inflammation marker which plays an important role in the risk of developing cardiovascular disease.

The rest of the paper is organized as follows. In Section 2, we introduce the construction of the new test statistics and discuss the technical differences and theoretical challenges of the two-sample testing problems. Section 3 develops a maximum-type statistic Mn and the corresponding test for the global hypothesis H0 : β1 = β2 through the inverse regression framework. We establish in this section the asymptotic null distribution of Mn and show the optimality results under sparse alternatives. Large-scale multiple testing with FDR and FDP control is presented in Section 4. Section 5 investigates the numerical performance of the proposed procedures by simulations. In Section 6, we apply the proposed procedures to the Framingham Offspring Study. The proofs of the main results are given in Section 8.

2 Methodology

2.1 Notation and Definitions

We first introduce the notation and definitions that will be used throughout the paper. For a vector βd = (β1,d, …, βp,d)T ∈ ℝp, define the ℓq norm by |βd|q = (Σi=1p |βi,d|q)1/q for 1 ≤ q ≤ ∞. For subscripts, we use the convention that i stands for the ith entry of a vector and (i, j) for the entry in the ith row and jth column of a matrix, k represents the kth sample, and d is the group indicator. Let Xd = (X1,·,dT, …, Xnd,·,dT)T be the nd × p data matrix, and Yd = (Y1,d, …, Ynd,d)T be the nd × 1 response vector, for d = 1, 2. Throughout, suppose that we have i.i.d. random samples {Yk,d, Xk,·,d, 1 ≤ k ≤ nd} with Xk,·,d = (Xk,1,d, …, Xk,p,d) being a random vector with covariance matrix Σd for d = 1, 2. Define Σd−1 = Ωd = (ωi,j,d).

For any vector μd ∈ ℝp, let μ−i,d denote the (p − 1)-dimensional vector formed by removing the ith entry from μd. For a symmetric matrix Ad, let λmax(Ad) and λmin(Ad) denote the largest and smallest eigenvalues of Ad, respectively. For any n × p matrix Ad, Ai,−j,d denotes the ith row of Ad with its jth entry removed and A−i,j,d denotes the jth column of Ad with its ith entry removed; A−i,−j,d denotes the (n − 1) × (p − 1) submatrix of Ad with its ith row and jth column removed. Let A·,−j,d denote the n × (p − 1) submatrix of Ad with the jth column removed, Ai,·,d denote the ith row of Ad, and A·,j,d denote the jth column of Ad. Let Ā·,j,d = n−1 Σi=1n Ai,j,d, Ā(·,j,d) = (Ā·,j,d, …, Ā·,j,d)T ∈ ℝn, and Ā(·,−j,d) be the n × (p − 1) matrix whose columns are the corresponding column-mean vectors. Let Ād = n−1 Σi=1n Ai,·,d. For a matrix Ω = (ωi,j)p×p, the matrix 1-norm is the maximum absolute column sum, ‖Ω‖L1 = max1≤j≤p Σi=1p |ωi,j|, the elementwise infinity norm is ‖Ω‖∞ = max1≤i,j≤p |ωi,j|, and the elementwise ℓ1 norm is ‖Ω‖1 = Σi=1p Σj=1p |ωi,j|. For a set ℋ, let |ℋ| be the cardinality of ℋ. For two sequences of real numbers {an} and {bn}, write an = O(bn) if there exists a constant C such that |an| ≤ C|bn| holds for all n, write an = o(bn) if limn→∞ an/bn = 0, and write an ≍ bn if there are positive constants c and C such that c ≤ an/bn ≤ C for all n.

2.2 Test Statistics

To form the test statistics, we consider the inverse regression models obtained by regressing Xk,i,d on (Yk,d, Xk,−i,d), as introduced in Liu and Luo (2014):

Xk,i,1 = αi,1 + (Yk,1, Xk,−i,1)γi,1 + ηk,i,1,  (k = 1, …, n1),
Xk,i,2 = αi,2 + (Yk,2, Xk,−i,2)γi,2 + ηk,i,2,  (k = 1, …, n2),

where for d = 1, 2, ηk,i,d has mean zero and variance σηi,d2 and is uncorrelated with (Yk,d, Xk,−i,d), and γi,d = (γi,1,d, …, γi,p,d)T satisfies

γi,d = σηi,d2 (βi,d/σεd2, −βi,d β−i,dT/σεd2 − Ω−i,i,dT)T, (4)

where σηi,d2 = (βi,d2/σεd2 + ωi,i,d)−1, as provided in Liu and Luo (2014).

Remark 1

Equation (4) can be obtained directly as follows. Denote the covariance matrix of Z = (Xk,i,d, Yk,d, Xk,−i,d) by Σ = Cov(Z). Section 2.5 of Anderson (2003) shows that γi,d can be obtained by γi,d = Σ22−1Σ21, where Σ22 = Cov(Z1) with Z1 = (Yk,d, Xk,−i,d), and Σ21 = Cov(Z1, Xk,i,d) is the covariance between Z1 and Xk,i,d. Then (4) follows from the regression model Yd = μd + Xdβd + εd and the fact that Xd and εd are uncorrelated with each other.
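The identity (4) can be verified numerically at the population level: build Cov(Y, X) from an arbitrary Σ, β, and σε2, compute γi,d = Σ22−1Σ21 as in Remark 1, and compare with the closed form. The dimensions and parameter values below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)

# Population-level check of identity (4), for i = 1 (Python index 0).
p = 4
A = rng.standard_normal((p, p))
Sigma_x = A @ A.T + p * np.eye(p)      # covariance of X (well conditioned)
Omega = np.linalg.inv(Sigma_x)         # precision matrix Omega = Sigma_x^{-1}
beta = np.array([1.0, -0.5, 0.0, 2.0])
sig_eps2 = 1.5                         # Var(eps)

i = 0
cov_yx = Sigma_x @ beta                # Cov(Y, X) = Sigma_x beta
var_y = beta @ cov_yx + sig_eps2       # Var(Y) = beta' Sigma_x beta + Var(eps)

# gamma_{i,d} = Sigma_22^{-1} Sigma_21 with Z_1 = (Y, X_{-i}), as in Remark 1
Sigma22 = np.block([
    [np.array([[var_y]]),           np.delete(cov_yx, i)[None, :]],
    [np.delete(cov_yx, i)[:, None], np.delete(np.delete(Sigma_x, i, 0), i, 1)]])
Sigma21 = np.concatenate([[cov_yx[i]], np.delete(Sigma_x[i], i)])
gamma = np.linalg.solve(Sigma22, Sigma21)

# Closed form (4): sig_eta2 * (beta_i/sig_eps2, -beta_i beta_{-i}/sig_eps2 - Omega_{-i,i})
sig_eta2 = 1.0 / (beta[i] ** 2 / sig_eps2 + Omega[i, i])
gamma_formula = sig_eta2 * np.concatenate(
    [[beta[i] / sig_eps2],
     -beta[i] * np.delete(beta, i) / sig_eps2 - np.delete(Omega[:, i], i)])
```

Here `np.allclose(gamma, gamma_formula)` holds, consistent with (4) and with σηi,d2 = (βi,d2/σεd2 + ωi,i,d)−1.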

Because ri,d = Cov(εk,d, ηk,i,d) can be expressed as γi,1,d Cov(εk,d, Yk,d) = γi,1,d σεd2 = σηi,d2 βi,d, the null hypotheses in the global testing problem (2) and the entry-wise testing problem (3) are, respectively, equivalent to

H0 : max1≤i≤p |ri,1/σηi,12 − ri,2/σηi,22| = 0, (5)

and

H0,i : ri,1/σηi,12 = ri,2/σηi,22,  i = 1, …, p, (6)

and we base the tests on the estimates of { ri,d/σηi,d2, i = 1, …, p; d = 1, 2}.

Define the residuals

ε̂k,d = Yk,d − Ȳd − (Xk,·,d − X̄d)β̂d,
η̂k,i,d = Xk,i,d − X̄·,i,d − (Yk,d − Ȳd, Xk,−i,d − X̄·,−i,d)γ̂i,d,

where β̂d = (β̂1,d, …, β̂p,d) and γ̂i,d = (γ̂i,1,d, …, γ̂i,p,d) are the respective estimators of βd and γi,d, satisfying

max{|β̂d − βd|1, max1≤i≤p |γ̂i,d − γi,d|1} = OP(an1),
max{|β̂d − βd|2, max1≤i≤p |γ̂i,d − γi,d|2} = OP(an2), (7)

for some an1 and an2 such that

max{an1an2, an22} = o{(n log p)−1/2}, and an1 = o(1/log p). (8)

Estimators β̂d and γ̂i,d that satisfy (7) and (8) can be obtained easily via standard methods such as the Lasso and the Dantzig selector; see, for example, Xia et al. (2015) and Liu and Luo (2014).

Based on the residuals ε̂k,d and η̂k,i,d, a natural estimator of ri,d is the sample covariance between the residuals,

r̃i,d = nd−1 Σk=1nd ε̂k,d η̂k,i,d.

Because r̃i,d tends to be biased, we define a bias-corrected estimator of ri,d as

r̂i,d = r̃i,d + σ̂εd2 γ̂i,1,d + σ̂ηi,d2 β̂i,d, (9)

where σ̂εd2 = nd−1 Σk=1nd ε̂k,d2 and σ̂ηi,d2 = nd−1 Σk=1nd η̂k,i,d2 are the sample variances satisfying

max{|σ̂εd2 − σεd2|, max1≤i≤p |σ̂ηi,d2 − σηi,d2|} = OP{(log p/nd)1/2},

which can be obtained by Lemma 2 in Xia et al. (2015) under conditions (7) and (8). By Lemma 2, the bias of r̃i,d is then of order max{βi,d(log p/nd)1/2, (nd log p)−1/2}.

Remark 2

The most straightforward way to estimate ri,d is to use the sample covariance between the error terms, nd−1 Σk=1nd εk,d ηk,i,d. However, the error terms are unknown, so we use the sample covariance between the residuals, r̃i,d, instead. The bias of r̃i,d exceeds the desired rate (nd log p)−1/2, and thus we estimate the difference between nd−1 Σk=1nd εk,d ηk,i,d and r̃i,d, which, up to order (nd log p)−1/2, is equal to σ̂εd2 γ̂i,1,d + σ̂ηi,d2 β̂i,d. Hence, we define r̂i,d = r̃i,d + σ̂εd2 γ̂i,1,d + σ̂ηi,d2 β̂i,d as in (9).

For i = 1, …, p and d = 1, 2, a natural estimator of ri,d/σηi,d2 can then be defined by

Ti,d=r^i,d/σ^ηi,d2. (10)

Subsequently, we may test the hypotheses (2) and (3) using the estimators 𝒯 = {Ti,1 − Ti,2 : i = 1, …, p}. However, since the Ti,1 − Ti,2 in 𝒯 are heteroscedastic, with possibly a wide range of variability, we instead consider a standardized version of Ti,1 − Ti,2. Specifically, let

Ui,d = nd−1 Σk=1nd {εk,dηk,i,d − E(εk,dηk,i,d)} and Ũi,d = (ri,d + Ui,d)/σηi,d2.

It can be shown in Lemma 2 that, uniformly in i = 1, …, p,

|Ti,d − Ũi,d| = OP{βi,d(log p/nd)1/2} + oP{(nd log p)−1/2}.

Noting that θi,d=Var(Ũi,d)=Var(εk,dηk,i,d/σηi,d2)/nd=(σεd2/σηi,d2+βi,d2)/nd, we estimate θi,d by

θ̂i,d = (σ̂εd2/σ̂ηi,d2 + β̂i,d2)/nd,

and define the standardized statistics

Wi = (Ti,1 − Ti,2)/(θ̂i,1 + θ̂i,2)1/2,  i = 1, …, p. (11)

We base the tests for (2) and (3) on {Wi, i = 1, …, p}, which will be studied in detail in Sections 3 and 4.
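To make the construction concrete, the following sketch assembles the statistics Wi of (11) from data: it centres each sample, fits the forward regression and the p inverse regressions with a simple proximal-gradient Lasso (a stand-in for any estimator satisfying the rate conditions (7) and (8)), forms the bias-corrected covariances r̂i,d of (9), the estimators Ti,d of (10), and the variance estimates θ̂i,d. The penalty level and solver are illustrative choices, not the tuned values of Section 5.

```python
import numpy as np

def lasso_ista(X, y, lam, n_iter=500):
    """Minimise (1/2n)||y - Xb||_2^2 + lam*||b||_1 by proximal gradient (ISTA)."""
    n, p = X.shape
    b = np.zeros(p)
    step = 1.0 / (np.linalg.norm(X, 2) ** 2 / n)     # 1 / Lipschitz constant
    for _ in range(n_iter):
        b = b - step * (X.T @ (X @ b - y) / n)       # gradient step
        b = np.sign(b) * np.maximum(np.abs(b) - step * lam, 0.0)  # soft-threshold
    return b

def w_statistics(samples, lam=0.2):
    """Compute W_i of (11) for samples = [(X_1, Y_1), (X_2, Y_2)]."""
    p = samples[0][0].shape[1]
    T = np.zeros((2, p))
    theta = np.zeros((2, p))
    for d, (X, y) in enumerate(samples):
        n = X.shape[0]
        Xc, yc = X - X.mean(0), y - y.mean()
        beta = lasso_ista(Xc, yc, lam)
        eps = yc - Xc @ beta                         # forward-model residuals
        sig_eps = eps @ eps / n
        for i in range(p):
            Z = np.column_stack([yc, np.delete(Xc, i, axis=1)])
            gam = lasso_ista(Z, Xc[:, i], lam)
            eta = Xc[:, i] - Z @ gam                 # inverse-regression residuals
            sig_eta = eta @ eta / n
            r_tilde = eps @ eta / n
            r_hat = r_tilde + sig_eps * gam[0] + sig_eta * beta[i]   # (9)
            T[d, i] = r_hat / sig_eta                                # (10)
            theta[d, i] = (sig_eps / sig_eta + beta[i] ** 2) / n
    return (T[0] - T[1]) / np.sqrt(theta[0] + theta[1])              # (11)
```

Under the null β1 = β2, the resulting Wi are approximately standard normal, which is what both the global statistic of Section 3 and the FDR procedure of Section 4 exploit.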

2.3 Discussion

We discuss here the substantial differences between the two-sample and one-sample cases and the necessity for significant adjustments and corrections in the two-sample setting.

The proposed tests are based on estimators of ri,1/σηi,12 − ri,2/σηi,22. Here we estimate ri,d = Cov(εk,d, ηk,i,d) by constructing a bias-corrected sample covariance between the residuals, r̂i,d, as defined in (9). That is, we need to estimate the difference between the naive estimate r̃i,d and an unbiased estimate of ri,d, which is nd−1 Σk=1nd εk,d ηk,i,d.

Liu and Luo (2014) considered the one-sample case of the multiple testing problem (3), where ri/σηi2 = 0 is equivalent to ri = 0 under the null hypothesis, and ri is easier to estimate. The procedure in Liu and Luo (2014) is thus based on the estimation of ri instead of ri/σηi2. In the two-sample case, ri,1/σηi,12 = ri,2/σηi,22 is not equivalent to ri,1 = ri,2. Thus, it is necessary to construct testing procedures based directly on estimators of ri,1/σηi,12 − ri,2/σηi,22.

Furthermore, in the one-sample case, the asymptotic normality of Ti can be established because βi,1 = 0 under the null, as shown in Lemma 2. Thus the theoretical properties of the individual test statistics are easier to obtain. In the two-sample case, βi,1 and βi,2 are not necessarily equal to 0 under the null, and corrections are thus essential in order to show that Wi is close to a normal random variable; the technical details are much more complicated.

More importantly, in the one-sample case, βi,1 = 0 under the null hypothesis, and thus Corr(εkηk,i, εkηk,j) = ωi,j/(ωi,iωj,j)1/2, which is fully determined by the precision matrix of the covariates and thus simplifies the calculations. In the two-sample version, βi,1 = βi,2 under the null hypothesis, but they are not necessarily equal to zero. The calculation of Corr(εk,dηk,i,d, εk,dηk,j,d), which determines the correlation between Wi and Wj, is much more involved, and it is shown in the proof of Theorem 4 that

ξi,j,d = Corr(εk,dηk,i,d, εk,dηk,j,d) = (ωi,j,dσεd2 + 2βi,dβj,d){(ωi,i,dσεd2 + 2βi,d2)(ωj,j,dσεd2 + 2βj,d2)}−1/2. (12)

The technical analysis for establishing the theoretical results in Sections 3 and 4 is thus much more challenging.

3 Global Test

In this section, we wish to test the global hypothesis

H0 : β1 = β2 versus H1 : β1 ≠ β2.

We propose a procedure based on the standardized statistics {Wi, i = 1, …, p}

Mn = max1≤i≤p Wi2 = max1≤i≤p (Ti,1 − Ti,2)2/(θ̂i,1 + θ̂i,2). (13)

It is shown in Section 3.1 that, under certain regularity conditions, Mn − 2 log p + log log p converges to a Gumbel distribution under the null, and the asymptotic α-level test can thus be defined as

Ψα = I(Mn ≥ qα + 2 log p − log log p), (14)

where qα is the 1 − α quantile of the Gumbel distribution with cumulative distribution function exp{−π−1/2 exp(−t/2)}, that is,

qα = −log(π) − 2 log log{(1 − α)−1}.

We reject the null hypothesis H0 whenever Ψα = 1.
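Given the Wi, the decision rule (14) is a few lines of code; a sketch, with the Gumbel quantile qα computed from its closed form:

```python
import numpy as np

def global_test(W, alpha=0.05):
    """Asymptotic alpha-level test (14): reject H0 : beta_1 = beta_2 iff
    M_n = max_i W_i^2 exceeds q_alpha + 2 log p - log log p."""
    p = W.size
    M_n = np.max(W ** 2)
    # 1 - alpha quantile of the Gumbel law exp(-pi^{-1/2} exp(-t/2))
    q_alpha = -np.log(np.pi) - 2.0 * np.log(np.log(1.0 / (1.0 - alpha)))
    return bool(M_n >= q_alpha + 2.0 * np.log(p) - np.log(np.log(p)))

# A single large |W_i| drives a rejection; all-zero statistics do not.
reject_null = global_test(np.zeros(400))              # False
reject_alt = global_test(np.r_[np.zeros(399), 8.0])   # True
```

The maximum-type statistic is what makes the test powerful against sparse alternatives: one sufficiently large standardized difference suffices for rejection.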

3.1 Asymptotic Null Distribution

We first introduce some regularity conditions under which Mn − 2 log p + log log p converges weakly to a Gumbel random variable with distribution function exp{−π−1/2 exp(−t/2)}.

  • (C1)

    log p = o(n1/5), n1 ≍ n2, and for some constants C0, C1, C2 > 0, C0−1 ≤ λmin(Ωd) ≤ λmax(Ωd) ≤ C0, C1−1 ≤ σεd2 ≤ C1, and |βd| ≤ C2 for d = 1, 2. There exists some τ > 0 such that |Aτ| = O(pr) with r < 1/4, where Aτ = {i : |βi,d| ≥ (log p)−2−τ, 1 ≤ i ≤ p, for d = 1 or 2}.

  • (C2)

    Let Dd be the diagonal of Ωd and let (ξi,j,d) = Rd = Dd−1/2ΩdDd−1/2, for d = 1, 2. Assume max1≤i<j≤p |ξi,j,d| ≤ ξd for some constant 0 < ξd < 1.

  • (C3)

    There exists some constant K > 0 such that maxa:Var(aTXk,·,dT)=1 E exp{K(aTXk,·,dT)2} and E exp(Kεk,d2) are finite.

Condition (C1) on the eigenvalues is commonly used in the high-dimensional setting and implies that most of the variables are not highly correlated with each other. Condition (C2) is also mild; for example, if max1≤i≠j≤p |ξi,j,d| = 1, then Ωd is singular. (C3) is a sub-Gaussian tail condition, and it can be weakened to a polynomial tail condition if p < nc for some constant c > 0.

Theorem 1

Suppose (C1), (C2), (C3), (7), and (8) hold. Then under H0, for any t ∈ ℝ,

P(Mn − 2 log p + log log p ≤ t) → exp{−π−1/2 exp(−t/2)}, as n1, n2, p → ∞, (15)

where Mn is defined in (13). Under H0, the convergence in (15) is uniform for all {Yk,d, Xk,·,d : k = 1, 2, …, nd} satisfying (C1), (C2), (C3), (7), and (8).

Remark 3

The analysis can be extended to test H0 : βG,1 = βG,2 versus H1 : βG,1 ≠ βG,2 for a given index set G. We can construct the test statistic as MG,n = maxi∈G Wi2, and obtain a similar Gumbel limiting null distribution by replacing p with |G|, as n1, n2, |G| → ∞. Condition (C1) will be slightly different, with Aτ replaced by AG = {i : |βi,d| ≥ (log p)−2−τ, i ∈ G, for d = 1 or 2}.

Remark 4

Condition (C1) is slightly stronger than the conditions in Liu and Luo (2014), as we need |Aτ| = O(pr) with r < 1/4. This is due to a major difference between the one-sample and two-sample cases: the global null H0 : β = 0 is a simple null in the one-sample case, whereas the null H0 : β1 = β2 is composite in the two-sample case. In the one-sample case, Ti is a nearly unbiased estimate of βi because βi = 0 under the global null. However, in the two-sample case, as stated in Lemma 2, additional correction terms involving βi,d are needed in order to make Ti,d nearly unbiased, because βi,1 and βi,2 are not necessarily equal to 0 under the null. Thus, slightly stronger conditions on Aτ are needed.

3.2 Asymptotic Power

We now analyze the asymptotic power of the test Ψα given in (14). The test is shown to be particularly powerful against a large class of sparse alternatives and the power is minimax rate optimal. We first define a class of regression coefficients:

𝒰(c) = {(β1, β2) : max1≤i≤p |βi,1 − βi,2|/(θi,1 + θi,2)1/2 ≥ c(log p)1/2}. (16)

We show that the null hypothesis H0 can be rejected by the test Ψα with overwhelming probability if (β1, β2) ∈ 𝒰(2√2).

Theorem 2

Let the test Ψα be given in (14). Suppose (C1), (C3), (7) and (8) hold. Then

inf(β1,β2)∈𝒰(2√2) P(Ψα = 1) → 1, as n, p → ∞.

Theorem 2 shows that the null parameter set in which β1 = β2 is asymptotically distinguishable from 𝒰(2√2) by the test Ψα.

We further show that the lower bound in (16) is rate optimal. Let 𝒯α be the set of all α-level tests, that is, P(Tα = 1) ≤ α under H0 for all Tα ∈ 𝒯α. If c in (16) is sufficiently small, then any α-level test is unable to reject the null hypothesis correctly uniformly over (β1, β2) ∈ 𝒰(c) with probability tending to one.

Theorem 3

Suppose that log p = o(n). Let α, β > 0 and α + β < 1. Then there exists a constant c0 > 0 such that for all sufficiently large n and p,

inf(β1,β2)∈𝒰(c0) supTα∈𝒯α P(Tα = 1) ≤ 1 − β.

Theorem 3 shows that the order (log p)1/2 in the lower bound of max1≤i≤p {|βi,1 − βi,2|/(θi,1 + θi,2)1/2} in (16) cannot be further improved.

4 Multiple Testing with False Discovery Rate Control

4.1 Multiple Testing Procedure

If the global null hypothesis is rejected, it is then of interest to identify the subset of variables in X that interact with the group indicator. This can be achieved by simultaneously testing on the entries of β1β2 with FDR and FDP control,

H0,i : βi,1 = βi,2 versus H1,i : βi,1 ≠ βi,2,  1 ≤ i ≤ p. (17)

The standardized differences Ti,1 − Ti,2 are given by the test statistics Wi = (Ti,1 − Ti,2)/(θ̂i,1 + θ̂i,2)1/2 as in (11). Let t be the threshold such that H0,i is rejected if |Wi| ≥ t. Let ℋ0 = {i : βi,1 = βi,2, 1 ≤ i ≤ p} be the set of true nulls. Let R0(t) = Σi∈ℋ0 I(|Wi| ≥ t) and R(t) = Σ1≤i≤p I(|Wi| ≥ t), respectively, denote the total number of false positives and the total number of rejections. The FDP and FDR are defined as

FDP(t) = R0(t)/max{R(t), 1},  FDR(t) = E{FDP(t)}.

Ideally, we select the threshold level as

t0 = inf{0 ≤ t ≤ (2 log p)1/2 : FDP(t) ≤ α}.

However, ℋ0 is unknown, and due to the sparsity of β1 − β2 we estimate Σi∈ℋ0 I{|Wi| ≥ t} by 2p{1 − Φ(t)}, where Φ(t) is the standard normal cumulative distribution function. This leads to the following multiple testing procedure.

  1. Calculate the test statistics Wi = (Ti,1Ti,2)/(θ̂i,1 + θ̂i,2)1/2 as in (11).

  2. For a given 0 ≤ α ≤ 1, calculate
    t̂ = inf{0 ≤ t ≤ (2 log p)1/2 : 2p{1 − Φ(t)}/max{R(t), 1} ≤ α}.

    If t̂ does not exist, set t̂ = (2 log p)1/2.

  3. For 1 ≤ i ≤ p, reject H0,i if and only if |Wi| ≥ t̂.
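Given the statistics Wi, steps 2 and 3 reduce to a one-dimensional search; a sketch, with Φ evaluated via the error function and an illustrative W mixing nulls with a few strong signals:

```python
import math
import numpy as np

def fdr_threshold(W, alpha=0.1):
    """Steps 2-3: smallest t in [0, sqrt(2 log p)] at which the estimated
    FDP, 2p(1 - Phi(t)) / max(R(t), 1), drops below alpha."""
    p = W.size
    t_max = math.sqrt(2.0 * math.log(p))
    Phi = lambda t: 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))
    for t in np.linspace(0.0, t_max, 2000):       # grid search for t-hat
        R = max(int(np.sum(np.abs(W) >= t)), 1)   # number of rejections
        if 2.0 * p * (1.0 - Phi(t)) / R <= alpha:
            return float(t)
    return t_max                                  # t-hat does not exist

W = np.concatenate([np.zeros(90), np.full(10, 6.0)])  # 90 nulls, 10 signals
t_hat = fdr_threshold(W)
rejected = np.abs(W) >= t_hat                 # reject H_{0,i} iff |W_i| >= t-hat
```

Estimating the false positive count by 2p{1 − Φ(t)} rather than 2|ℋ0|{1 − Φ(t)} is what makes the procedure slightly conservative when β1 − β2 is not sparse.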

4.2 Theoretical Properties

We now investigate the theoretical properties of this multiple testing procedure. For any 1 ≤ ip, define

Γi(γ) = {j : 1 ≤ j ≤ p, |ξi,j,d| ≥ (log p)−2−γ, d = 1, 2},

where ξi,j,d is defined in Condition (C2). Under regularity conditions, this procedure controls the FDP and FDR at the pre-specified level α, asymptotically.

Theorem 4

Let

𝒮ρ = {i : 1 ≤ i ≤ p, |βi,1 − βi,2|/(θi,1 + θi,2)1/2 ≥ (log p)1/2+ρ}.

Suppose for some ρ > 0 and some δ > 0, |𝒮ρ| ≥ [1/(π1/2α) + δ](log p)1/2. Suppose that |Aτ ∩ ℋ0| = o(pν) for any ν > 0, where Aτ is given in Condition (C1). Assume that p0 = |ℋ0| ≥ cp for some c > 0, and that (7) and (8) hold. If there exists some γ > 0 such that max1≤i≤p |Γi(γ)| = o(pν) for any ν > 0, then under (C1)–(C3) with p ≤ cnr for some c > 0 and r > 0, we have

lim(n,p)→∞ FDR(t̂)/(αp0/p) = 1,
FDP(t̂)/(αp0/p) → 1

in probability, as (n, p) → ∞.

The condition on |𝒮ρ| is mild, because among the p hypotheses in total it only requires a small number of entries with standardized difference exceeding (log p)1/2+ρ/n1/2 for some constant ρ > 0. The technical condition |Aτ ∩ ℋ0| = o(pν) for any ν > 0 ensures that most of the regression residuals are not highly correlated with each other under the null hypotheses H0,i : βi,1 = βi,2.

5 Simulation Study

We examine the numerical performance, including the sizes and powers, of both the global and the multiple testing procedures through simulation studies. We investigated the performance of both procedures under two sets of simulations. For the first, we generated the data by considering two constructions of regression coefficients under three matrix models, with covariates being a combination of continuous and discrete random variables. For the second set, we studied the numerical performance of the proposed multiple testing procedure in a setting similar to the data application described in Section 6. We compared the proposed multiple testing procedure with the Benjamini-Yekutieli (B-Y) procedure (Benjamini and Yekutieli (2001)), and show that the B-Y procedure is much more conservative and has lower power in all cases.

5.1 Implementation Details

The proposed testing procedures require the estimation of the regression coefficients βd and γi,d, for i = 1, …, p and d = 1, 2. One may use the Lasso to estimate these parameters, as follows:

β̂d = DX−1/2 arg minu {(2nd)−1 |(Xd − X̄d)DX−1/2u − (Yd − Ȳd)|22 + λn|u|1}, (18)

and

γ̂i,d = Di,d−1/2 arg minv {(2nd)−1 |((Yd, X·,−i,d) − (Ȳd, X̄(·,−i,d)))Di,d−1/2v − (X·,i,d − X̄(·,i,d))|22 + λi,n|v|1}, (19)

where DX = diag(Σ̂), Di,d = diag(σ̂Yd, Σ̂−i,−i), λn = κ{σ̂Yd log p/nd}1/2 and λi,n = κ{σ̂i,i log p/nd}1/2, in which σ̂Yd is the sample variance of Yd and Σ̂ = (σ̂i,j) is the sample covariance matrix of Xd. In the global testing of H0 : β1 = β2, we chose the tuning parameter κ = 2.

For multiple testing of H0,i : βi,1 = βi,2, we selected the tuning parameters λn and λi,n in (18) and (19) adaptively from the data, with the principle of making Σi∈ℋ0 I{|Wi| ≥ t} and 2{1 − Φ(t)}|ℋ0| as close as possible. That is, a good choice of the tuning parameters should minimize the error

∫c1 [Σi I{|Wi(b)| ≥ Φ−1(1 − α/2)}/(αp) − 1]2 dα,

where c > 0 and Wi(b) is the statistic computed with the corresponding tuning parameter. Step 2 below is a discretization of this integral. The algorithm is summarized as follows.

  1. Let λn = (b/20){σ̂Yd log p/nd}1/2 and λi,n = (b/20){σ̂i,i log p/nd}1/2 for b = 1, …, 40. For each b, calculate β̂d(b) and γ̂i,d(b), i = 1, …, p, d = 1, 2. Based on these estimates of the regression coefficients, construct the corresponding statistics Wi(b) for each b.

  2. Choose b̂ as the minimizer
    b̂ = arg min1≤b≤40 Σs=110 [Σ1≤i≤p I{|Wi(b)| ≥ Φ−1(1 − s[1 − Φ{(log p)1/2}]/10)}/(2ps[1 − Φ{(log p)1/2}]/10) − 1]2.

The tuning parameters λn and λi,n are then chosen to be

λn = (b̂/20){σ̂Yd log p/nd}1/2 and λi,n = (b̂/20){σ̂i,i log p/nd}1/2. (20)
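Step 2 of the algorithm can be sketched as follows, assuming the statistics Wi(b) from Step 1 have already been computed; the candidate vectors passed in below are synthetic stand-ins (well-calibrated versus inflated null statistics) rather than output of the Lasso fits (18) and (19).

```python
import math
from statistics import NormalDist
import numpy as np

def select_b(W_by_b, p):
    """Pick the candidate b minimising the discretized calibration error:
    the observed number of rejections at each threshold should match the
    null expectation 2p * s[1 - Phi((log p)^{1/2})]/10, s = 1, ..., 10."""
    nd = NormalDist()
    tail = 1.0 - nd.cdf(math.sqrt(math.log(p)))   # 1 - Phi((log p)^{1/2})
    def calib_error(W):
        total = 0.0
        for s in range(1, 11):
            level = s * tail / 10.0
            t = nd.inv_cdf(1.0 - level)
            observed = float(np.sum(np.abs(W) >= t))
            total += (observed / (2.0 * p * level) - 1.0) ** 2
        return total
    return min(W_by_b, key=lambda b: calib_error(W_by_b[b]))

rng = np.random.default_rng(0)
p = 5000
W_by_b = {1: rng.standard_normal(p),        # well-calibrated null statistics
          2: 3.0 * rng.standard_normal(p)}  # badly inflated statistics
b_hat = select_b(W_by_b, p)                 # selects 1 here
```

The criterion rewards tuning parameters under which the empirical tail counts of the Wi(b) match the standard normal tail, which is the behavior the FDP estimate 2p{1 − Φ(t)} relies on.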

5.2 Simulation Under Different Matrix Models

We first generated the design matrices Xk,d, for k = 1, …, nd and d = 1, 2, with some of the covariates being continuous and the others discrete. For simplicity, we generated Xk,d from the same distribution for d = 1, 2. As a first step, for three different matrix models, we obtained i.i.d. samples Xk,d ~ N(0, Σ(m)), for k = 1, …, nd, with m = 1, 2, and 3. We then replaced l covariates of Xk,d by one of the three discrete values 0, 1, or 2, with probability 1/3 each, where l is a random integer between ⌊p/2⌋ and p. We first introduce the matrix models Σ(m) = (Ω(m))−1 used in the simulations. Let D = (Di,j) be a diagonal matrix with Di,i ~ Unif(1, 3) for i = 1, …, p. The following models were used to generate the design matrices.

  • Model 1: Ω*(1) = (ωi,j*(1)), where ωi,i*(1) = 1, ωi,i+1*(1) = ωi+1,i*(1) = 0.6, ωi,i+2*(1) = ωi+2,i*(1) = 0.3, and ωi,j*(1) = 0 otherwise. Ω(1) = D1/2Ω*(1)D1/2.

  • Model 2: Ω*(2) = (ωi,j*(2)), where ωi,j*(2) = ωj,i*(2) = 0.5 for i = 10(k − 1) + 1 and 10(k − 1) + 2 ≤ j ≤ 10(k − 1) + 10, 1 ≤ k ≤ p/10, and ωi,j*(2) = 0 otherwise. Ω(2) = D1/2(Ω*(2) + δI)/(1 + δ)D1/2 with δ = |λmin(Ω*(2))| + 0.05.

  • Model 3: Ω*(3) = (ωi,j*(3)), where ωi,i*(3) = 1, ωi,j*(3) = 0.8 × Bernoulli(1, 0.05) for i < j and ωj,i*(3) = ωi,j*(3). Ω(3) = D1/2(Ω*(3) + δI)/(1 + δ)D1/2 with δ = |λmin(Ω*(3))| + 0.05.
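As an illustration, Model 1 can be generated as follows (Models 2 and 3 follow the same pattern, with their own Ω* and the δI shift to enforce positive definiteness); the replacement of a random subset of covariates by the discrete values 0, 1, 2 is omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

def model1_precision(p):
    """Omega(1) = D^{1/2} Omega*(1) D^{1/2}: banded Omega*(1) with 1 on the
    diagonal, 0.6 on the first off-diagonal and 0.3 on the second."""
    Om = np.eye(p)
    k = np.arange(p - 1)
    Om[k, k + 1] = Om[k + 1, k] = 0.6
    k = np.arange(p - 2)
    Om[k, k + 2] = Om[k + 2, k] = 0.3
    d = np.sqrt(rng.uniform(1, 3, size=p))     # D^{1/2} with D_ii ~ Unif(1, 3)
    return Om * np.outer(d, d)                 # D^{1/2} Om* D^{1/2}, elementwise

Omega1 = model1_precision(50)
Sigma1 = np.linalg.inv(Omega1)                 # covariance for X ~ N(0, Sigma(1))
X = rng.multivariate_normal(np.zeros(50), Sigma1, size=100)
```

The banded Ω*(1) is positive definite by construction, so no δI shift is needed for this model.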

Global Test

For the global testing of H0 : β1 = β2, the sample sizes were taken to be n = n1 = n2 = 100, while the dimension p varied over the values 100, 200, 400, and 1000. Under the global null hypothesis, we have β1 = β2 = β, and two scenarios for generating β were considered. For case 1, 10 nonzero locations {k1, …, k10} of β were randomly generated, with magnitudes βki,1 = 2i0.5n1−0.15, i = 1, …, 10. For case 2, s nonzero locations of β were randomly selected, with s = 5, 8, 10, and 15 for p = 100, 200, 400, and 1000, respectively. The nonzero locations had magnitudes drawn arbitrarily between −10 and 10. The error terms εk,d were generated as normal random variables with mean 0 and variances taking values between 0.5 and 2.5. The nominal significance level for all the tests was set at α1 = 0.05.

Table 1 shows that the sizes of the global test Ψα1 are close to the nominal level for both cases under all matrix models. This reflects the fact that the null distribution of the test statistic Mn is well approximated by its limiting null distribution, as shown in Theorem 1. The empirical sizes are slightly below the nominal level in some cases for lower dimensions, as similarly observed in Xia et al. (2015), due to correlation among the variables. Table 1 also shows that the proposed test is powerful in all settings, even though β1 and β2 differ in only five or fewer locations with magnitudes of the order (log p/n)1/2.

Table 1.

Empirical sizes and powers (%) for global testing with α1 = 0.05, n1 = n2 = 100, and 1000 replications.

         Case 1                       Case 2
p        Model 1  Model 2  Model 3    Model 1  Model 2  Model 3

Size
100        4.1      3.2      2.9        4.4      2.9      2.8
400        4.8      3.8      3.7        4.0      4.1      3.5
1000       6.1      4.4      5.4        5.9      4.6      6.4

Power
100       71.9     64.3     67.4       95.1     97.1     96.6
400       88.3     86.2     83.5       82.3     77.0     82.1
1000      95.1     92.6     97.9       47.3     42.0     48.1

To evaluate the power of the global test, we selected five locations, {k1, …, k5}, among the nonzero locations of β1, and set βkj,2 = βkj,1 + uj, j = 1, …, 5, where uj was drawn uniformly from the set [−2β(log p/n)1/2, −β(2 log p/n)1/2] ∪ [β(2 log p/n)1/2, 2β(log p/n)1/2], with β = max1≤i≤p |βi,1|. The actual sizes and powers in percentage for each case under the three matrix models, reported in Table 1, are estimated from 1000 replications. For each replication, the nonzero locations and magnitudes of the regression coefficients could vary.

Multiple Testing

For simultaneous testing of {H0,i : βi,1 − βi,2 = 0, 1 ≤ i ≤ p} with FDR control, we first generated β1 according to the above two cases. For case 1, ten nonzero locations {k1′, …, k10′} for β2 were randomly generated, and the locations could vary between the two vectors. The magnitudes were generated as βki′,2 = 4i0.5n2−0.15, i = 1, …, 10. For case 2, s nonzero locations for β2 were randomly selected, again with s = 5, 8, 10, and 15 for p = 100, 200, 400, and 1000, respectively, also with magnitudes drawn arbitrarily between −10 and 10.

In Table 2, we present the empirical FDR and true discovery rate (power) of the proposed procedure (NEW) and the B-Y procedure at the FDR level of α2 = 0.1, based on 100 replications, where the power is summarized based on

(1/100) Σl=1100 {Σi∈ℋ1 I(|Wi,l| ≥ t̂)/|ℋ1|},

where Wi,l denotes the standardized difference for the lth replication and ℋ1 denotes the set of nonzero locations of β1 − β2. The results suggest that, across all configurations, the FDRs are controlled under the nominal level α by both procedures. However, the B-Y procedure is extremely conservative in all scenarios. For the new FDR procedure, the empirical FDRs are also conservative, due to the correlations among the regression residuals under the nulls H0,i, and also due to the fact that we use p to estimate |ℋ0| because the latter is usually unknown. Furthermore, the total number of true signals is small in all cases because of the sparsity of the regression coefficients; for example, when the total number of true signals is ten, the FDP for each replication tends to be either 0 or some number close to 0.1, which also causes the conservatism of the FDR estimate. In case 2, the empirical FDR gets closer to the nominal level as the dimension increases, because the number of true signals grows with p. In summary, the new procedure has empirical FDR much closer to the nominal level than the B-Y procedure in all cases. Table 2 also shows that the FDR control procedure introduced in Section 4 is more powerful than the B-Y procedure across the different scenarios.

Table 2.

Empirical FDRs and powers (%) for the new FDR procedure and B-Y procedure with α = 0.1, n_1 = n_2 = 100, based on 100 replications.

p Method Case 1 Case 2

Model 1 Model 2 Model 3 Model 1 Model 2 Model 3
Size

100 NEW 5.9 5.8 6.8 3.8 4.5 3.6
B-Y 0.3 1.0 0.7 0.1 0.3 0.7

400 NEW 6.7 7.4 6.8 6.2 5.5 5.5
B-Y 0.4 0.6 0.4 0.2 0.7 0.5

1000 NEW 6.2 6.0 6.1 9.4 9.4 9.8
B-Y 0.6 1.0 0.4 1.5 1.6 1.4

Power

100 NEW 95.3 94.7 94.7 93.3 92.1 90.4
B-Y 91.5 88.1 88.5 88.6 90.3 88.3

400 NEW 92.7 88.2 90.8 84.3 82.9 83.6
B-Y 86.1 82.2 84.3 81.5 78.7 81.3

1000 NEW 84.7 82.7 85.1 71.7 70.4 71.9
B-Y 77.7 75.0 77.6 66.2 64.5 66.1

5.3 Simulation by Mimicking Data

We now consider a simulation setting that mimics the data analyzed in Section 6, where p = 119, n_1 = 46, and n_2 = 417. We investigated both constructions of the regression coefficients considered in Section 5.2, with ten nonzero locations, under all three matrix models, with covariates that are a mixture of continuous and discrete random variables. The nominal level was set at α = 0.1, and the empirical FDRs and powers of both FDR procedures, reported in Table 3, were evaluated based on 100 replications. As in Section 5.2, the empirical FDRs are close to the nominal level under this data setting for the new FDR procedure, while the B-Y procedure is again very conservative. For case 1, the empirical FDRs of the new procedure are slightly larger than the nominal level because n_1 is much smaller than n_2 here, so that β_1 and β_2 have magnitudes much closer to each other by construction. The performance of the new method in case 2 is less conservative than in Section 5.2 because there are ten nonzero locations among the p = 119 regression coefficients. Table 3 also indicates that the new procedure is more powerful than the B-Y procedure in all scenarios under this data setting.

Table 3.

Empirical FDRs and powers (%) for the new FDR procedure and B-Y procedure under the data setting with α = 0.1, p = 119, n_1 = 46, n_2 = 417, based on 100 replications.

p Method Case 1 Case 2

Model 1 Model 2 Model 3 Model 1 Model 2 Model 3
Size

119 NEW 9.4 11.2 11.0 8.7 8.9 8.8
B-Y 2.2 3.0 2.9 1.7 1.4 1.6

Power

119 NEW 83.6 81.7 83.9 79.6 78.2 80.3
B-Y 76.2 72.1 74.8 73.7 72.6 74.6

6 Data Analysis

We illustrate our proposed methods using the Framingham Offspring Study (Kannel et al. (1979)) of coronary artery disease (CAD). Over the past three decades, various risk prediction models for CAD have been developed (Wilson et al. (1998); Ridker et al. (2007)). Unlike those for many other diseases, risk models such as the Framingham Risk Score have been incorporated into clinical practice guidelines (Lloyd-Jones et al. (2004); D'Agostino Sr et al. (2008)). However, these models, largely based on traditional clinical risk factors, have recognized limitations in their clinical utility. It is thus important to explore avenues beyond the routine clinical measures to improve prediction. One potential approach is to fully understand the roles of intermediate phenotypes, such as C-reactive protein (CRP), and genomic markers. In recent years, many genome-wide association studies (GWAS) have been conducted to identify CAD-related single-nucleotide polymorphism (SNP) mutations. The newly identified SNPs, while significantly associated with CAD risk or the intermediate phenotypes of CAD, explain very little of the genetic risk for the trait (Humphries et al. (2008); Paynter et al. (2009)). This coincides with the growing awareness that the failure to identify genetic scores that significantly improve risk prediction for complex traits may be in part due to failure to account for the interplay of genes and environment. It is thus of substantial interest to study the environment and its interaction with genetic predisposition in causing human diseases.

Here, we use data from the Framingham Offspring Study to examine how the interaction between smoking and genetic risk factors affects the inflammation marker CRP, since the inflammation system plays a vital role in the atherosclerotic process (Ross (1999)). We focus on the 463 female participants with complete information on CRP, 116 SNPs previously reported as associated with CAD intermediate phenotypes, the two leading principal components that adjust for population stratification, as well as age and smoking status at exam seven. Smoking is known to roughly double the lifetime risk of CAD and is thought to increase cardiovascular risk through several different mechanisms. We examine the interaction between smoking and the genetic markers, as well as other risk factors, based on the proposed method. We fit separate linear regression models for smokers and for non-smokers, and variables with significantly different coefficients between the two groups are deemed to have an interactive effect with smoking.
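To make this comparison strategy concrete, here is a simplified low-dimensional sketch (illustrative only: it uses ordinary least squares with classical standard errors as a stand-in for the paper's high-dimensional procedure, and the function names are ours). Each group is fit separately, and coefficients are compared through standardized differences:

```python
import numpy as np

def ols_coef_se(X, y):
    # Classical OLS with an intercept; returns slope estimates and their
    # standard errors. A low-dimensional stand-in for the debiased
    # high-dimensional estimates used in the paper.
    n, p = X.shape
    Xd = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    resid = y - Xd @ beta
    sigma2 = resid @ resid / (n - p - 1)
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(Xd.T @ Xd)))
    return beta[1:], se[1:]

def standardized_differences(X1, y1, X2, y2):
    # W_i = (beta_hat_{i,1} - beta_hat_{i,2}) / sqrt(se_{i,1}^2 + se_{i,2}^2):
    # a large |W_i| flags a coefficient that differs between the two groups,
    # i.e., a candidate interaction with the grouping variable (here, smoking).
    b1, se1 = ols_coef_se(X1, y1)
    b2, se2 = ols_coef_se(X2, y2)
    return (b1 - b2) / np.sqrt(se1**2 + se2**2)
```

In the actual analysis the two groups are smokers and non-smokers, the response is CRP, and the covariates are the SNPs and clinical variables; the multiple testing procedure of Section 4 is then applied to the resulting standardized differences.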

The effects on CRP of the top eight SNPs, rs11585329, rs17583120, rs17132534, rs11214606, rs17529477, rs10891552, rs4293, and rs4351, are considered significantly modified by smoking. Interestingly, the interaction between smoking and rs11585329 has been reported as an important contributor to the risk of colorectal cancer, and inflammation is a hallmark of cancer (Liu et al. (2013)). SNP rs17132534 belongs to the UCP2 gene, whose main function is the control of mitochondria-derived reactive oxygen species. A variant in UCP2 has previously been shown to interact with smoking to influence plasma markers of oxidative stress, and hence is likely to be associated with prospective CHD risk (Stephens et al. (2008)). SNPs rs10891552, rs17529477, and rs11214606 all belong to the DRD2 gene, which is linked to addictive behaviors, including alcoholism and smoking. Smoking was found to modify the effects of polymorphisms in the DRD2 gene on gastric cancer risk (Ikeda et al. (2008)). SNPs rs4293 and rs4351 belong to the ACE gene, linked with hypertension and CAD among other disorders. Interactions between smoking and polymorphisms in the ACE gene have been reported for blood pressure and coronary atherosclerosis (Hibi et al. (1997); Sayed-Tabatabaei et al. (2004); Schut et al. (2004)).

7 Extension to Non-Binary Environmental Variable

Motivated by applications in genomics, we have proposed hypothesis testing procedures for detecting the interactions between environment and genomic markers when the environmental variable is binary, such as smoking status, as illustrated in Section 6. Our testing approach can be extended to detect the interactions when the environmental variable is discrete and finite, but non-binary. Specifically, suppose the environmental variable takes K possible values. Interaction detection can then be formulated based on comparing K high-dimensional regression models

Y_d = \mu_d + X_d\beta_d + \varepsilon_d, \quad \text{for } d = 1, \ldots, K.

One wishes to develop a global test for

H_0: \beta_1 = \beta_2 = \cdots = \beta_K \quad \text{versus} \quad H_1: \beta_l \ne \beta_k \text{ for some } 1 \le l < k \le K, \qquad (21)

as well as develop a procedure for simultaneously testing the hypotheses

H_{0,i}: \beta_{i,1} = \beta_{i,2} = \cdots = \beta_{i,K} \quad \text{versus} \quad H_{1,i}: \beta_{i,l} \ne \beta_{i,k} \text{ for some } 1 \le l < k \le K, \quad i = 1, \ldots, p, \qquad (22)

with FDR and FDP control.

The test statistics for each model can be formulated similarly as in Section 2.2. For d = 1, …, K, we let

T_{i,d} = \hat{r}_{i,d}/\hat{\sigma}_{\eta_{i,d}}^2

and estimate θi,d by

\hat{\theta}_{i,d} = \left(\hat{\sigma}_{\varepsilon_d}^2/\hat{\sigma}_{\eta_{i,d}}^2 + \hat{\beta}_{i,d}^2\right)/n_d.

Then the pairwise standardized statistics can be defined by

W_i^{(l,k)} = \frac{T_{i,l} - T_{i,k}}{(\hat{\theta}_{i,l} + \hat{\theta}_{i,k})^{1/2}}, \quad 1 \le l < k \le K, \quad i = 1, \ldots, p.

When K is finite, we then construct the sum-of-squares-type test statistic

S_i = \sum_{1 \le l < k \le K} \left(W_i^{(l,k)}\right)^2.

As in Cai and Xia (2014), it can be shown that the limiting null distribution of S_i is a mixture of chi-squared distributions. Based on this fact, we can further develop global and multiple testing procedures. When the environmental variable is binary, the test statistics S_i reduce to (11) in Section 2.2. On the other hand, if the environmental variable is continuous, the testing problem is significantly different and outside the scope of the current paper; we leave it for future research.
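Assuming the per-group statistics T_{i,d} and variance estimates θ̂_{i,d} are already available (their estimation follows Section 2.2), the sum-of-squares statistic above can be assembled as in this illustrative sketch (the function name is ours):

```python
import numpy as np
from itertools import combinations

def sum_of_squares_stats(T, theta):
    # T, theta: (K, p) arrays holding T_{i,d} and theta_hat_{i,d} for the K
    # groups. Returns S_i = sum over all pairs l < k of (W_i^{(l,k)})^2,
    # where W_i^{(l,k)} is the pairwise standardized difference.
    K, p = T.shape
    S = np.zeros(p)
    for l, k in combinations(range(K), 2):
        W_lk = (T[l] - T[k]) / np.sqrt(theta[l] + theta[k])
        S += W_lk ** 2
    return S
```

For K = 2 there is a single pair, so S_i reduces to the squared standardized difference W_i² used in the two-sample test.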

8 Proofs

We prove the main results in this section. We begin by collecting technical lemmas that will be used in the proof of the main theorems.

8.1 Technical Lemmas

The first lemma is the classical Bonferroni inequality.

Lemma 1 (Bonferroni inequality)

Let B = \bigcup_{t=1}^{p} B_t. For any k < [p/2], we have

\sum_{t=1}^{2k} (-1)^{t-1} F_t \le P(B) \le \sum_{t=1}^{2k-1} (-1)^{t-1} F_t,

where F_t = \sum_{1 \le i_1 < \cdots < i_t \le p} P(B_{i_1} \cap \cdots \cap B_{i_t}).

For d = 1, 2, let U_{i,d} = n_d^{-1}\sum_{k=1}^{n_d}\{\varepsilon_{k,d}\eta_{k,i,d} - E(\varepsilon_{k,d}\eta_{k,i,d})\} and \tilde{U}_{i,d} = \beta_{i,d} + U_{i,d}/\sigma_{\eta_{i,d}}^2. The following lemma is essentially proved in Liu and Luo (2014).

Lemma 2

Suppose that Conditions (C1), (C3), (7) and (8) hold. Then

T_{i,d} = \tilde{U}_{i,d} + \left(\tilde{\sigma}_{\varepsilon_d}^2/\sigma_{\varepsilon_d}^2 + \tilde{\sigma}_{\eta_{i,d}}^2/\sigma_{\eta_{i,d}}^2 - 2\right)\beta_{i,d} + o_P\{(n_d \log p)^{-1/2}\},

where \tilde{\sigma}_{\varepsilon_d}^2 = n_d^{-1}\sum_{k=1}^{n_d}(\varepsilon_{k,d} - \bar{\varepsilon}_d)^2 and \tilde{\sigma}_{\eta_{i,d}}^2 = n_d^{-1}\sum_{k=1}^{n_d}(\eta_{k,i,d} - \bar{\eta}_{i,d})^2, with \bar{\varepsilon}_d = n_d^{-1}\sum_{k=1}^{n_d}\varepsilon_{k,d} and \bar{\eta}_{i,d} = n_d^{-1}\sum_{k=1}^{n_d}\eta_{k,i,d}. Consequently, uniformly in i = 1, …, p,

|T_{i,d} - \tilde{U}_{i,d}| = O_P\{\beta_{i,d}(\log p/n_d)^{1/2}\} + o_P\{(n_d \log p)^{-1/2}\}.

Lemma 3

Let X_k ~ N(\mu_1, \Sigma_1) for k = 1, …, n_1 and Y_k ~ N(\mu_2, \Sigma_2) for k = 1, …, n_2.

Define

\tilde{\Sigma}_1 = (\tilde{\sigma}_{i,j,1})_{p \times p} = \frac{1}{n_1}\sum_{k=1}^{n_1}(X_k - \mu_1)(X_k - \mu_1)^{\top}, \qquad \tilde{\Sigma}_2 = (\tilde{\sigma}_{i,j,2})_{p \times p} = \frac{1}{n_2}\sum_{k=1}^{n_2}(Y_k - \mu_2)(Y_k - \mu_2)^{\top}.

Then, for some constant C > 0, σ̃i,j,1 − σ̃i,j,2 satisfies the large deviation bound

P\left[\max_{(i,j) \in \mathcal{S}} \frac{(\tilde{\sigma}_{i,j,1} - \tilde{\sigma}_{i,j,2} - \sigma_{i,j,1} + \sigma_{i,j,2})^2}{\mathrm{Var}\{(X_{k,i} - \mu_{1,i})(X_{k,j} - \mu_{1,j})\}/n_1 + \mathrm{Var}\{(Y_{k,i} - \mu_{2,i})(Y_{k,j} - \mu_{2,j})\}/n_2} \ge x^2\right] \le C|\mathcal{S}|\{1 - \Phi(x)\} + O(p^{-1})

uniformly for 0 ≤ x ≤ (8 log p)^{1/2} and any subset \mathcal{S} ⊆ {(i, j) : 1 ≤ i ≤ j ≤ p}.

The complete proof of this lemma can be found in the supplementary material of Xia et al. (2015).

8.2 Proof of Theorem 1

To prove Theorem 1, we first show that the terms in A_τ are negligible. Then we focus on the terms in ℋ\A_τ, where ℋ = {1, …, p}, and show that P(\max_{i \in \mathcal{H}\setminus A_\tau} W_i^2 - 2\log p + \log\log p \le t) \to \exp(-\pi^{-1/2}\exp(-t/2)), where W_i is defined in (11).

Define

V_i = \frac{U_{i,1}/\sigma_{\eta_{i,1}}^2 - U_{i,2}/\sigma_{\eta_{i,2}}^2}{(\theta_{i,1} + \theta_{i,2})^{1/2}},

where \theta_{i,d} = \mathrm{Var}(\tilde{U}_{i,d}) = \mathrm{Var}(\varepsilon_{k,d}\eta_{k,i,d}/\sigma_{\eta_{i,d}}^2)/n_d = (\sigma_{\varepsilon_d}^2/\sigma_{\eta_{i,d}}^2 + \beta_{i,d}^2)/n_d, for d = 1, 2. By Lemma 2 in Xia et al. (2015), under conditions (7) and (8), we have

|\hat{\sigma}_{\varepsilon_d}^2 - \sigma_{\varepsilon_d}^2| = O_P\{(\log p/n_d)^{1/2}\}, \quad \text{and} \quad \max_i |\hat{\sigma}_{\eta_{i,d}}^2 - \sigma_{\eta_{i,d}}^2| = O_P\{(\log p/n_d)^{1/2}\}. \qquad (23)

Thus we have

\max_i |\hat{\theta}_{i,d} - \theta_{i,d}| = o_P\{1/(n_d \log p)\}. \qquad (24)

By Lemma 2, we have

W_i = V_i + b_i + o_P\{(\log p)^{-1/2}\},

where b_i = \{(\tilde{\sigma}_{\varepsilon_1}^2/\sigma_{\varepsilon_1}^2 + \tilde{\sigma}_{\eta_{i,1}}^2/\sigma_{\eta_{i,1}}^2 - 2)\beta_{i,1} - (\tilde{\sigma}_{\varepsilon_2}^2/\sigma_{\varepsilon_2}^2 + \tilde{\sigma}_{\eta_{i,2}}^2/\sigma_{\eta_{i,2}}^2 - 2)\beta_{i,2}\}/(\hat{\theta}_{i,1} + \hat{\theta}_{i,2})^{1/2}. Note that, for i ∈ ℋ\A_τ, β_{i,d} = o{(log p)^{−1}}. Thus we have \max_{i \in \mathcal{H}\setminus A_\tau} |W_i - V_i| = o_P\{(\log p)^{-1/2}\}. For i ∈ A_τ,

b_i \le \left|\frac{(\tilde{\sigma}_{\varepsilon_1}^2 - \sigma_{\varepsilon_1}^2)\beta_{i,1}/\sigma_{\varepsilon_1}^2 - (\tilde{\sigma}_{\varepsilon_2}^2 - \sigma_{\varepsilon_2}^2)\beta_{i,2}/\sigma_{\varepsilon_2}^2}{\{\mathrm{Var}(\varepsilon_{k,1}^2)\beta_{i,1}^2/(\sigma_{\varepsilon_1}^4 n_1) + \mathrm{Var}(\varepsilon_{k,2}^2)\beta_{i,2}^2/(\sigma_{\varepsilon_2}^4 n_2)\}^{1/2}}\right| + \left|\frac{(\tilde{\sigma}_{\eta_{i,1}}^2 - \sigma_{\eta_{i,1}}^2)\beta_{i,1}/\sigma_{\eta_{i,1}}^2 - (\tilde{\sigma}_{\eta_{i,2}}^2 - \sigma_{\eta_{i,2}}^2)\beta_{i,2}/\sigma_{\eta_{i,2}}^2}{\{\mathrm{Var}(\eta_{k,i,1}^2)\beta_{i,1}^2/(\sigma_{\eta_{i,1}}^4 n_1) + \mathrm{Var}(\eta_{k,i,2}^2)\beta_{i,2}^2/(\sigma_{\eta_{i,2}}^4 n_2)\}^{1/2}}\right| + o_P\{(\log p)^{1/2}\}.

Because the indices i of the random variables appear only in the second term above, by Lemma 3 and the condition that |A_τ| = O(p^r) with r < 1/4, we have

P\left(\max_{i \in A_\tau} W_i^2 \ge 2\log p - \log\log p + t\right) \le |A_\tau|\left\{P(V_i^2 \ge 2r\log p) + P(\tilde{b}_i^2 \ge 2r\log p)\right\} + o(1) = o(1),

where \tilde{b}_i = \left|\frac{(\tilde{\sigma}_{\eta_{i,1}}^2 - \sigma_{\eta_{i,1}}^2)\beta_{i,1}/\sigma_{\eta_{i,1}}^2 - (\tilde{\sigma}_{\eta_{i,2}}^2 - \sigma_{\eta_{i,2}}^2)\beta_{i,2}/\sigma_{\eta_{i,2}}^2}{\{\mathrm{Var}(\eta_{k,i,1}^2)\beta_{i,1}^2/(\sigma_{\eta_{i,1}}^4 n_1) + \mathrm{Var}(\eta_{k,i,2}^2)\beta_{i,2}^2/(\sigma_{\eta_{i,2}}^4 n_2)\}^{1/2}}\right|. Thus, it suffices to show that

P\left(\max_{i \in \mathcal{H}\setminus A_\tau} V_i^2 - 2\log p + \log\log p \le t\right) \to \exp(-\pi^{-1/2}\exp(-t/2)).

Let q = |ℋ\A_τ| and suppose n_2/n_1 ≤ K_1 with K_1 ≥ 1. Define Z_{k,i} = (n_2/n_1)\{\varepsilon_{k,1}\eta_{k,i,1} - E(\varepsilon_{k,1}\eta_{k,i,1})\}/\sigma_{\eta_{i,1}}^2 for 1 ≤ k ≤ n_1 and Z_{k,i} = -\{\varepsilon_{k-n_1,2}\eta_{k-n_1,i,2} - E(\varepsilon_{k-n_1,2}\eta_{k-n_1,i,2})\}/\sigma_{\eta_{i,2}}^2 for n_1 + 1 ≤ k ≤ n_1 + n_2. Thus we have

V_i = \frac{\sum_{k=1}^{n_1+n_2} Z_{k,i}}{(n_2^2\theta_{i,1}/n_1 + n_2\theta_{i,2})^{1/2}}.

Without loss of generality, we assume \sigma_{\varepsilon_d}^2 = \sigma_{\eta_{i,d}}^2 = 1. Define

\hat{V}_i = \frac{\sum_{k=1}^{n_1+n_2} \hat{Z}_{k,i}}{(n_2^2\theta_{i,1}/n_1 + n_2\theta_{i,2})^{1/2}},

where \hat{Z}_{k,i} = Z_{k,i}I(|Z_{k,i}| \le \tau_n) - E\{Z_{k,i}I(|Z_{k,i}| \le \tau_n)\} and \tau_n = (4K_1/K)\log(p+n). Note that \max_{i \in \mathcal{H}\setminus A_\tau} V_i^2 = \max_{1 \le i \le q} V_i^2, and that

\max_{1 \le i \le q} n^{-1/2}\sum_{k=1}^{n_1+n_2} E\left[|Z_{k,i}|I\{|Z_{k,i}| \ge (4K_1/K)\log(p+n)\}\right] \le Cn^{1/2}\max_{1 \le k \le n_1+n_2}\max_{1 \le i \le q} E\left[|Z_{k,i}|I\{|Z_{k,i}| \ge (4K_1/K)\log(p+n)\}\right] \le Cn^{1/2}(p+n)^{-2}\max_{1 \le k \le n_1+n_2}\max_{1 \le i \le q} E\left[|Z_{k,i}|\exp\{(K/2)|Z_{k,i}|\}\right] \le Cn^{1/2}(p+n)^{-2}.

Hence, P\{\max_{1 \le i \le q} |V_i - \hat{V}_i| \ge (\log p)^{-1}\} \le P(\max_{1 \le i \le q}\max_{1 \le k \le n_1+n_2} |Z_{k,i}| \ge \tau_n) = O(p^{-1}). By the fact that |\max_{1 \le i \le q} V_i^2 - \max_{1 \le i \le q} \hat{V}_i^2| \le 2\max_{1 \le i \le q}|\hat{V}_i|\max_{1 \le i \le q}|V_i - \hat{V}_i| + \max_{1 \le i \le q}|V_i - \hat{V}_i|^2, it suffices to prove that, for any t ∈ ℝ, as n, p → ∞,

P\left(\max_{1 \le i \le q} \hat{V}_i^2 - 2\log p + \log\log p \le t\right) \to \exp(-\pi^{-1/2}\exp(-t/2)). \qquad (25)

By Lemma 1, for any integer l with 0 < l < q/2,

\sum_{d=1}^{2l}(-1)^{d-1}\sum_{1 \le i_1 < \cdots < i_d \le q} P\left(\bigcap_{j=1}^{d} F_{i_j}\right) \le P\left(\max_{1 \le i \le q} \hat{V}_i^2 \ge y_p\right) \le \sum_{d=1}^{2l-1}(-1)^{d-1}\sum_{1 \le i_1 < \cdots < i_d \le q} P\left(\bigcap_{j=1}^{d} F_{i_j}\right), \qquad (26)

where y_p = 2\log p - \log\log p + t and F_{i_j} = \{\hat{V}_{i_j}^2 \ge y_p\}. Let \tilde{Z}_{k,i} = \hat{Z}_{k,i}/(n_2\theta_{i,1}/n_1 + \theta_{i,2})^{1/2} for i = 1, …, q, and let W_k = (\tilde{Z}_{k,i_1}, \ldots, \tilde{Z}_{k,i_d}) for 1 ≤ k ≤ n_1 + n_2. Define |a|_{\min} = \min_{1 \le i \le d} |a_i| for any vector a ∈ ℝ^d. Then we have

P\left(\bigcap_{j=1}^{d} F_{i_j}\right) = P\left(\left|n_2^{-1/2}\sum_{k=1}^{n_1+n_2} W_k\right|_{\min} \ge y_p^{1/2}\right).

Then it follows from Theorem 1 in Zaïtsev (1987) that

P\left(\left|n_2^{-1/2}\sum_{k=1}^{n_1+n_2} W_k\right|_{\min} \ge y_p^{1/2}\right) \le P\left\{|N_d|_{\min} \ge y_p^{1/2} - \varepsilon_n(\log p)^{-1/2}\right\} + c_1 d^{5/2}\exp\left\{-\frac{n^{1/2}\varepsilon_n}{c_2 d^3 \tau_n(\log p)^{1/2}}\right\}, \qquad (27)

where c_1 > 0 and c_2 > 0 are constants, ε_n → 0 at a rate to be specified later, and N_d = (N_{m_1}, …, N_{m_d}) is a normal random vector with E(N_d) = 0 and Cov(N_d) = (n_1/n_2)\,\mathrm{Cov}(W_1) + \mathrm{Cov}(W_{n_1+1}). Here d is a fixed integer that does not depend on n, p. Because log p = o(n^{1/5}), we can let ε_n → 0 sufficiently slowly that, for any large M > 0,

c_1 d^{5/2}\exp\left\{-\frac{n^{1/2}\varepsilon_n}{c_2 d^3 \tau_n(\log p)^{1/2}}\right\} = O(p^{-M}). \qquad (28)

Combining (26), (27), and (28) we have

P\left(\max_{1 \le i \le q} \hat{V}_i^2 \ge y_p\right) \le \sum_{d=1}^{2l-1}(-1)^{d-1}\sum_{1 \le i_1 < \cdots < i_d \le q} P\left\{|N_d|_{\min} \ge y_p^{1/2} - \varepsilon_n(\log p)^{-1/2}\right\} + o(1). \qquad (29)

Similarly, using Theorem 1 in Zaïtsev (1987) again, we can get

P\left(\max_{1 \le i \le q} \hat{V}_i^2 \ge y_p\right) \ge \sum_{d=1}^{2l}(-1)^{d-1}\sum_{1 \le i_1 < \cdots < i_d \le q} P\left\{|N_d|_{\min} \ge y_p^{1/2} + \varepsilon_n(\log p)^{-1/2}\right\} - o(1). \qquad (30)

The following lemma is shown in the supplementary material of Cai et al. (2013), with q ≤ p and y_p = 2 log p − log log p + t.

Lemma 4

For any fixed integer d ≥ 1 and real number t ∈ ℝ,

\sum_{1 \le i_1 < \cdots < i_d \le q} P\left\{|N_d|_{\min} \ge y_p^{1/2} \pm \varepsilon_n(\log p)^{-1/2}\right\} = \frac{1}{d!}\left\{\pi^{-1/2}\exp(-t/2)\right\}^d \{1 + o(1)\}. \qquad (31)

It then follows from Lemma 4, (29), and (30) that

\limsup_{n,p\to\infty} P\left(\max_{1 \le i \le q} \hat{V}_i^2 \ge y_p\right) \le \sum_{d=1}^{2l-1}(-1)^{d-1}\frac{1}{d!}\left\{\pi^{-1/2}\exp(-t/2)\right\}^d,
\liminf_{n,p\to\infty} P\left(\max_{1 \le i \le q} \hat{V}_i^2 \ge y_p\right) \ge \sum_{d=1}^{2l}(-1)^{d-1}\frac{1}{d!}\left\{\pi^{-1/2}\exp(-t/2)\right\}^d,

for any positive integer l. By letting l → ∞, we obtain (25) and Theorem 1 is proved.
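The limiting value in the last step can be checked directly: with x = π^{−1/2}exp(−t/2), the two truncated alternating sums are partial sums of the exponential series,

```latex
\sum_{d=1}^{\infty} (-1)^{d-1} \frac{x^{d}}{d!} \;=\; 1 - e^{-x},
\qquad x = \pi^{-1/2}\exp(-t/2),
```

so both bounds converge to 1 − exp{−π^{−1/2}exp(−t/2)} as l → ∞, which is exactly the tail form of (25).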

8.3 Proof of Theorem 2

Let M_{n1} = \max_{1 \le i \le p}\{T_{i,1} - T_{i,2} - (\beta_{i,1} - \beta_{i,2})\}^2/(\hat{\theta}_{i,1} + \hat{\theta}_{i,2}). It follows from the proof of Theorem 1 that P(M_{n1} \le 2\log p - 2^{-1}\log\log p) \to 1 as n, p → ∞. By (23), (24), and the inequalities

\max_{1 \le i \le p} \frac{(\beta_{i,1} - \beta_{i,2})^2}{\hat{\theta}_{i,1} + \hat{\theta}_{i,2}} \le 2M_{n1} + 2M_n, \qquad \max_{1 \le i \le p} \frac{|\beta_{i,1} - \beta_{i,2}|}{(\theta_{i,1} + \theta_{i,2})^{1/2}} \ge 2\sqrt{2}(\log p)^{1/2},

we have P(Mnqα + 2 log p − log log p) → 1 as n, p → ∞.

8.4 Proof of Theorem 3

To prove the lower bound, we first construct a worst case scenario to test between β1 and β2. We apply the arguments in Baraud (2002) to prove the result.

Without loss of generality, we assume \sigma_{\varepsilon_d}^2 = 1, σ_{i,i,d} = 1, and σ_{i,j,d} = 0 for i ≠ j, d = 1, 2, and n_1 = n_2 = n. Let m̂ be a random index uniformly drawn from ℋ = {1, …, p}. We construct a class of β_1 vectors, 𝒩 = {β(m̂), m̂ ∈ ℋ}, such that β_{m̂,1} = ρ and β_{i,1} = 0 for i ≠ m̂, with ρ = c(log p/n)^{1/2}, where c < 1/2 is a constant. Let β_2 = 0 and let β_1 be uniformly distributed on 𝒩. Let μ_ρ be the induced distribution of β_1 − β_2. Note that μ_ρ is a probability measure on \{\delta \in \mathcal{S}_1 : \|\delta\|_2^2 = \rho^2\}, where 𝒮_1 is the class of p-dimensional vectors with one nonzero entry. Then the likelihood ratio between the samples {Y_{k,1}, X_{k,·,1}} and {Y_{k,2}, X_{k,·,2}} can be calculated as

L_{\mu_\rho} = E_{\hat{m}}\left[\prod_{k=1}^{n} |\Sigma(\hat{m})|^{-1/2}\exp\left\{-\frac{1}{2}Z_k^{\top}\left(\Omega(\hat{m}) - I\right)Z_k\right\}\right],

where Σ(m̂) = Ω(m̂)^{−1} is the covariance matrix of {Y_{k,1}, X_{k,·,1}}, and {Z_1, …, Z_n} are i.i.d. samples generated from N(0, I). Because Var(Y_{k,1}) = \sigma_{\hat{m},\hat{m},1}\beta_{\hat{m},1}^2 + 1, Var(Y_{k,2}) = 1, and Cov(Y_{k,d}, X_{k,i,d}) = β_{i,d}σ_{i,i,d}, it can be easily calculated that |Σ(m̂)| = 1 and Ω(m̂) = (ω_{i,j}(m̂)) with ω_{1,1}(m̂) = 1, ω_{1,m̂+1}(m̂) = ω_{m̂+1,1}(m̂) = −ρ, ω_{m̂+1,m̂+1}(m̂) = 1 + ρ², and ω_{i,j}(m̂) = 0 otherwise. Hence

E(L_{\mu_\rho}^2) = p^{-2}\sum_{m,m'} E\left[\prod_{k=1}^{n}\exp\left\{-\frac{1}{2}Z_k^{\top}\left(\Omega(m) + \Omega(m') - 2I\right)Z_k\right\}\right].

Write Ω(m) + Ω(m′) − 2I = (a_{i,j}). It is easy to see that, when m ≠ m′, a_{i,i} = ρ² and a_{1,i} = −ρ for i = m + 1 and m′ + 1, a_{j,i} = a_{i,j}, and a_{i,j} = 0 otherwise; when m = m′, a_{i,i} = 2ρ² and a_{1,i} = −2ρ for i = m + 1, a_{j,i} = a_{i,j}, and a_{i,j} = 0 otherwise. Thus we have

E(L_{\mu_\rho}^2) = \left[E\left(\exp\left\{\rho(x_1x_2 + x_2x_3) - \rho^2(x_2^2 + x_3^2)/2\right\}\right)\right]^n + p^{-1}\left[E\left(\exp\left\{2\rho x_1x_2 - \rho^2 x_2^2\right\}\right)\right]^n,

where x_1, x_2, x_3 are independent standard normal random variables. Because E(\exp\{\rho(x_1x_2 + x_2x_3)\}) = 1 + \rho^2, E(\exp\{-\rho^2x_2^2/2\}) = (1 + \rho^2)^{-1/2}, and E(\exp\{2\rho x_1x_2\}) = 1 + 2\rho^2, we have

E(L_{\mu_\rho}^2) = 1 + p^{2c^2 - 1}\{1 + o(1)\} = 1 + o(1).

Theorem 3 then follows from the arguments in Baraud (2002).

8.5 Proof of Theorem 4

We first show that t̂, as defined in Section 4.1, is attained in the interval [0, (2 log p)^{1/2}]. We then show that A_τ is negligible, so that we can focus on the set ℋ\A_τ. Finally, we establish the FDP result by dividing the null set into small subsets and controlling the variance of R_0(t) on each subset; the FDR result then follows as well.

Under the condition of Theorem 4, we have

\sum_{1 \le i \le p} I\{|W_i| \ge (2\log p)^{1/2}\} \ge \{1/(\pi^{1/2}\alpha) + \delta\}(\log p)^{1/2},

with probability going to one. Hence, with probability tending to one, we have

\frac{2p\{1 - \Phi((2\log p)^{1/2})\}}{\sum_{1 \le i \le p} I\{|W_i| \ge (2\log p)^{1/2}\}} \le 2p\{1 - \Phi((2\log p)^{1/2})\}\left\{1/(\pi^{1/2}\alpha) + \delta\right\}^{-1}(\log p)^{-1/2} \le \alpha.

Let t_p = (2\log p - 2\log\log p)^{1/2}. Because 1 - \Phi(t_p) \ge \{1 + o(1)\}\{(2\pi)^{1/2}t_p\}^{-1}\exp(-t_p^2/2), we have P(t̂ ≤ t_p) → 1 according to the definition of t̂ in Section 4.1. For 0 ≤ t̂ ≤ t_p,

\frac{2p\{1 - \Phi(\hat{t})\}}{\max\left\{\sum_{1 \le i \le p} I\{|W_i| \ge \hat{t}\}, 1\right\}} = \alpha.

Thus, to prove Theorem 4, it suffices to prove

\frac{\left|\sum_{i \in \mathcal{H}_0} I\{|W_i| \ge t\} - p_0 G(t)\right|}{p G(t)} \to 0,

in probability, uniformly for 0 ≤ t ≤ t_p, where G(t) = 2{1 − Φ(t)} and p_0 = |ℋ_0|. We claim that it suffices to show

\frac{\left|\sum_{i \in \mathcal{H}_0 \setminus A_\tau} I\{|V_i| \ge t\} - p_0 G(t)\right|}{p G(t)} \to 0, \qquad (32)

in probability. We now consider two cases.

  1. If t = {2 log p + o(log p)}^{1/2}, the proof of Theorem 1 yields that P(\max_{i \in A_\tau} W_i^2 \ge t^2) = o(1). Thus, it suffices to prove
    \frac{\left|\sum_{i \in \mathcal{H}_0 \setminus A_\tau} I\{|W_i| \ge t\} - p_0 G(t)\right|}{p G(t)} \to 0,
    in probability. We showed in the proof of Theorem 1 that \max_{i \in \mathcal{H}_0 \setminus A_\tau} |W_i - V_i| = o_P\{(\log p)^{-1/2}\}. Thus it suffices to show (32).
  2. If t ≤ (C log p)^{1/2} for some C < 2, we have
    \frac{\sum_{i \in A_\tau \cap \mathcal{H}_0} I\{|W_i| \ge t\}}{p G(t)} \le |A_\tau \cap \mathcal{H}_0| \cdot O(p^{C/2 - 1}) \to 0
    in probability. Thus, it is again enough to show (32).

Let 0 ≤ t_0 < t_1 < ⋯ < t_b = t_p be such that t_ι − t_{ι−1} = υ_p for 1 ≤ ι ≤ b − 1 and t_b − t_{b−1} ≤ υ_p, where υ_p = 1/\{\log p\,(\log_4 p)\}. Thus we have b ∼ t_p/υ_p. For any t such that t_{ι−1} ≤ t ≤ t_ι, by the fact that G(t + o((log p)^{−1/2}))/G(t) = 1 + o(1) uniformly in 0 ≤ t ≤ c(log p)^{1/2} for any constant c, we have

\frac{\sum_{i \in \mathcal{H}_0 \setminus A_\tau} I(|V_i| \ge t_\iota)}{p_0 G(t_\iota)} \cdot \frac{G(t_\iota)}{G(t_{\iota-1})} \le \frac{\sum_{i \in \mathcal{H}_0 \setminus A_\tau} I(|V_i| \ge t)}{p_0 G(t)} \le \frac{\sum_{i \in \mathcal{H}_0 \setminus A_\tau} I(|V_i| \ge t_{\iota-1})}{p_0 G(t_{\iota-1})} \cdot \frac{G(t_{\iota-1})}{G(t_\iota)}.

Thus it suffices to prove

\max_{0 \le \iota \le b} \frac{\left|\sum_{i \in \mathcal{H}_0 \setminus A_\tau}\{I(|V_i| \ge t_\iota) - G(t_\iota)\}\right|}{p_0 G(t_\iota)} \to 0,

in probability. Define ℋ̃0 = ℋ0 \ Aτ. Note that

P\left[\max_{0 \le \iota \le b} \frac{\left|\sum_{i \in \tilde{\mathcal{H}}_0}\{I(|V_i| \ge t_\iota) - G(t_\iota)\}\right|}{p_0 G(t_\iota)} \ge \varepsilon\right] \le \sum_{\iota=0}^{b} P\left[\frac{\left|\sum_{i \in \tilde{\mathcal{H}}_0}\{I(|V_i| \ge t_\iota) - G(t_\iota)\}\right|}{p_0 G(t_\iota)} \ge \varepsilon\right] \le \frac{1}{\upsilon_p}\int_0^{t_p} P\left\{\left|\frac{\sum_{i \in \tilde{\mathcal{H}}_0} I(|V_i| \ge t)}{p_0 G(t)} - 1\right| \ge \varepsilon\right\} dt + \sum_{\iota=b-1}^{b} P\left[\frac{\left|\sum_{i \in \tilde{\mathcal{H}}_0}\{I(|V_i| \ge t_\iota) - G(t_\iota)\}\right|}{p_0 G(t_\iota)} \ge \varepsilon\right].

Thus, it suffices to show, for any ε > 0,

\int_0^{t_p} P\left\{\frac{\left|\sum_{i \in \tilde{\mathcal{H}}_0}\{I(|V_i| \ge t) - P(|V_i| \ge t)\}\right|}{p_0 G(t)} \ge \varepsilon\right\} dt = o(\upsilon_p). \qquad (33)

Note that

E\left|\frac{\sum_{i \in \tilde{\mathcal{H}}_0}\{I(|V_i| \ge t) - P(|V_i| \ge t)\}}{p_0 G(t)}\right|^2 = \frac{\sum_{i,j \in \tilde{\mathcal{H}}_0}\{P(|V_i| \ge t, |V_j| \ge t) - P(|V_i| \ge t)P(|V_j| \ge t)\}}{p_0^2 G^2(t)}.

We divide the index pairs i, j ∈ ℋ̃_0 into three subsets: ℋ̃_{01} = {i, j ∈ ℋ̃_0 : i = j}, ℋ̃_{02} = {i, j ∈ ℋ̃_0 : i ∈ Γ_j(γ) or j ∈ Γ_i(γ), i ≠ j}, and ℋ̃_{03} = ℋ̃_0 \ (ℋ̃_{01} ∪ ℋ̃_{02}). Then we have

\frac{\sum_{i,j \in \tilde{\mathcal{H}}_{01}}\{P(|V_i| \ge t, |V_j| \ge t) - P(|V_i| \ge t)P(|V_j| \ge t)\}}{p_0^2 G^2(t)} \le \frac{C}{p_0 G(t)}. \qquad (34)

We now verify equation (12). Note that \mathrm{Cov}(\varepsilon_{k,d}\eta_{k,i,d}, \varepsilon_{k,d}\eta_{k,j,d}) = E(\varepsilon_{k,d}^2\eta_{k,i,d}\eta_{k,j,d}) - E(\varepsilon_{k,d}\eta_{k,i,d})E(\varepsilon_{k,d}\eta_{k,j,d}). Because \mathrm{Cov}(\varepsilon_{k,d}, \eta_{k,i,d}) = \sigma_{\eta_{i,d}}^2\beta_{i,d}, we have E(\varepsilon_{k,d}\eta_{k,i,d})E(\varepsilon_{k,d}\eta_{k,j,d}) = \sigma_{\eta_{i,d}}^2\sigma_{\eta_{j,d}}^2\beta_{i,d}\beta_{j,d}. Note that

E(\varepsilon_{k,d}^2\eta_{k,i,d}\eta_{k,j,d}) = E\{\varepsilon_{k,d}^2(\eta_{k,i,d} + \varepsilon_{k,d}\gamma_{i,1,d})(\eta_{k,j,d} + \varepsilon_{k,d}\gamma_{j,1,d})\} - E\{\varepsilon_{k,d}^2(\eta_{k,i,d} + \varepsilon_{k,d}\gamma_{i,1,d})\varepsilon_{k,d}\gamma_{j,1,d}\} - E(\varepsilon_{k,d}^3\gamma_{i,1,d}\eta_{k,j,d}).

By definition, \varepsilon_{k,d} is independent of \eta_{k,i,d} + \varepsilon_{k,d}\gamma_{i,1,d}. Thus, we have

E(\varepsilon_{k,d}^2\eta_{k,i,d}\eta_{k,j,d}) = \sigma_{\varepsilon_d}^2 E\{(\eta_{k,i,d} + \varepsilon_{k,d}\gamma_{i,1,d})(\eta_{k,j,d} + \varepsilon_{k,d}\gamma_{j,1,d})\} - E(\varepsilon_{k,d}^3\gamma_{i,1,d}\eta_{k,j,d}).

Note that

E(\varepsilon_{k,d}^3\gamma_{i,1,d}\eta_{k,j,d}) = E\{\varepsilon_{k,d}^3\gamma_{i,1,d}(\eta_{k,j,d} + \varepsilon_{k,d}\gamma_{j,1,d})\} - E(\varepsilon_{k,d}^4\gamma_{i,1,d}\gamma_{j,1,d}) = -3\gamma_{i,1,d}\gamma_{j,1,d}\sigma_{\varepsilon_d}^4,

and that

E\{(\eta_{k,i,d} + \varepsilon_{k,d}\gamma_{i,1,d})(\eta_{k,j,d} + \varepsilon_{k,d}\gamma_{j,1,d})\} = \mathrm{Cov}(\eta_{k,i,d}, \eta_{k,j,d}) + \gamma_{i,1,d}\mathrm{Cov}(\varepsilon_{k,d}, \eta_{k,j,d}) + \gamma_{j,1,d}\mathrm{Cov}(\varepsilon_{k,d}, \eta_{k,i,d}) + \gamma_{i,1,d}\gamma_{j,1,d}\sigma_{\varepsilon_d}^2.

Combining these, we have \mathrm{Cov}(\varepsilon_{k,d}\eta_{k,i,d}, \varepsilon_{k,d}\eta_{k,j,d}) = (\omega_{i,j,d}\sigma_{\varepsilon_d}^2 + 2\beta_{i,d}\beta_{j,d})\sigma_{\eta_{i,d}}^2\sigma_{\eta_{j,d}}^2. Thus

\xi_{i,j,d} = \mathrm{Corr}(\varepsilon_{k,d}\eta_{k,i,d}, \varepsilon_{k,d}\eta_{k,j,d}) = \frac{\omega_{i,j,d}\sigma_{\varepsilon_d}^2 + 2\beta_{i,d}\beta_{j,d}}{\{(\omega_{i,i,d}\sigma_{\varepsilon_d}^2 + 2\beta_{i,d}^2)(\omega_{j,j,d}\sigma_{\varepsilon_d}^2 + 2\beta_{j,d}^2)\}^{1/2}}.

Note that, for i ∈ ℋ̃_0, we have β_{i,d} = O((log p)^{−2−τ}), and so |Corr(V_i, V_j)| ≤ ξ < 1 for i, j ∈ ℋ̃_{02}, where ξ = max{ξ_1, ξ_2} + ε with ξ_d defined in (C2) and ε < 1 − max{ξ_1, ξ_2}. Hence

\frac{\sum_{i,j \in \tilde{\mathcal{H}}_{02}}\{P(|V_i| \ge t, |V_j| \ge t) - P(|V_i| \ge t)P(|V_j| \ge t)\}}{p_0^2 G^2(t)} \le \frac{C p^{1+\nu}\, t^{-2}\exp\{-t^2/(1+\xi)\}}{p^2 G^2(t)} \le C p^{\nu-1}\{G(t)\}^{-2\xi/(1+\xi)}. \qquad (35)

It remains to consider the subset ℋ̃_{03}, in which V_i and V_j are only weakly correlated. It is easy to check that \max_{i,j \in \tilde{\mathcal{H}}_{03}} P(|V_i| \ge t, |V_j| \ge t) = \{1 + O((\log p)^{-1-\gamma})\}G^2(t). Hence,

\frac{\sum_{i,j \in \tilde{\mathcal{H}}_{03}}\{P(|V_i| \ge t, |V_j| \ge t) - P(|V_i| \ge t)P(|V_j| \ge t)\}}{p_0^2 G^2(t)} = O\{(\log p)^{-1-\gamma}\}. \qquad (36)

Equation (33) and the FDP result then follow by combining (34), (35), and (36), and the FDR result is also proved.

Acknowledgments

The research of Yin Xia was supported in part by “The Recruitment Program of Global Experts” Youth Project from China, the startup fund from Fudan University and NSF Grant DMS-1612906.

The research of Tianxi Cai was supported in part by NIH Grants R01 GM079330, P50 MH106933, and U54 HG007963.

The research of Tony Cai was supported in part by NSF Grants DMS-1208982 and DMS-1403708, and NIH Grant R01 CA127334.

References

  1. Anderson TW. An Introduction to Multivariate Statistical Analysis. 3rd ed. Wiley-Interscience; New York: 2003. [Google Scholar]
  2. Baraud Y. Non-asymptotic minimax rates of testing in signal detection. Bernoulli. 2002;8(5):577–606. [Google Scholar]
  3. Benjamini Y, Yekutieli D. The control of the false discovery rate in multiple testing under dependency. Annals of Statistics. 2001;29(4):1165–1188. [Google Scholar]
  4. Cai T, Liu W, Xia Y. Two-sample covariance matrix testing and support recovery in high-dimensional and sparse settings. J. Am. Statist. Assoc. 2013;108(501):265–277. [Google Scholar]
  5. Cai TT, Xia Y. High-dimensional sparse MANOVA. Journal of Multivariate Analysis. 2014;131:174–196. [Google Scholar]
  6. D’Agostino R, Sr, Vasan R, Pencina M, Wolf P, Cobain M, Massaro J, Kannel W. General cardiovascular risk profile for use in primary care: the Framingham Heart Study. Circulation. 2008;117(6):743. doi: 10.1161/CIRCULATIONAHA.107.699579. [DOI] [PubMed] [Google Scholar]
  7. Hibi K, Ishigami T, Kimura K, Nakao M, Iwamoto T, Tamura K, Nemoto T, Shimizu T, Mochida Y, Ochiai H, et al. Angiotensin-converting enzyme gene polymorphism adds risk for the severity of coronary atherosclerosis in smokers. Hypertension. 1997;30(3):574–579. doi: 10.1161/01.hyp.30.3.574. [DOI] [PubMed] [Google Scholar]
  8. Humphries S, Yiannakouris N, Talmud P. Cardiovascular disease risk prediction using genetic information (gene scores): is it really informative? Current Opinion in Lipidology. 2008;19(2):128. doi: 10.1097/MOL.0b013e3282f5283e. [DOI] [PubMed] [Google Scholar]
  9. Hunter DJ. Gene–environment interactions in human diseases. Nature Reviews Genetics. 2005;6(4):287–298. doi: 10.1038/nrg1578. [DOI] [PubMed] [Google Scholar]
  10. Ikeda S, Sasazuki S, Natsukawa S, Shaura K, Koizumi Y, Kasuga Y, Ohnami S, Sakamoto H, Yoshida T, Iwasaki M, et al. Screening of 214 single nucleotide polymorphisms in 44 candidate cancer susceptibility genes: a case–control study on gastric and colorectal cancers in the japanese population. The American journal of gastroenterology. 2008;103(6):1476–1487. doi: 10.1111/j.1572-0241.2008.01810.x. [DOI] [PubMed] [Google Scholar]
  11. Javanmard A, Montanari A. Hypothesis testing in high-dimensional regression under the Gaussian random design model: Asymptotic theory. 2013. [Google Scholar]
  12. Javanmard A, Montanari A. Confidence intervals and hypothesis testing for high-dimensional regression. The Journal of Machine Learning Research. 2014;15(1):2869–2909. [Google Scholar]
  13. Kannel W, Feinleib M, McNamara P, Garrison R, Castelli W. An investigation of coronary heart disease in families The Framingham Offspring Study. American Journal of Epidemiology. 1979;110(3):281–290. doi: 10.1093/oxfordjournals.aje.a112813. [DOI] [PubMed] [Google Scholar]
  14. Liu L, Zhong R, Wei S, Xiang H, Chen J, Xie D, Yin J, Zou L, Sun J, Chen W, et al. The leptin gene family and colorectal cancer: interaction with smoking behavior and family history of cancer. PloS one. 2013;8(4):e60777. doi: 10.1371/journal.pone.0060777. [DOI] [PMC free article] [PubMed] [Google Scholar]
  15. Liu W, Luo S. Hypothesis testing for high-dimensional regression models. Technical report 2014 [Google Scholar]
  16. Lloyd-Jones D, Wilson P, Larson M, Beiser A, Leip E, D'Agostino R, Levy D. Framingham risk score and prediction of lifetime risk for coronary heart disease. The American Journal of Cardiology. 2004;94(1):20–24. doi: 10.1016/j.amjcard.2004.03.023. [DOI] [PubMed] [Google Scholar]
  17. Matsouaka RA, Li J, Cai T. Evaluating marker-guided treatment selection strategies. Biometrics. 2014;70(3):489–499. doi: 10.1111/biom.12179. [DOI] [PMC free article] [PubMed] [Google Scholar]
  18. McCarthy MI, Abecasis GR, Cardon LR, Goldstein DB, Little J, Ioannidis JP, Hirschhorn JN. Genome-wide association studies for complex traits: consensus, uncertainty and challenges. Nature Reviews Genetics. 2008;9(5):356–369. doi: 10.1038/nrg2344. [DOI] [PubMed] [Google Scholar]
  19. Paynter N, Chasman D, Buring J, Shiffman D, Cook N, Ridker P. Cardiovascular disease risk prediction with and without knowledge of genetic variation at chromosome 9p21.3. Annals of Internal Medicine. 2009;150(2):65. doi: 10.7326/0003-4819-150-2-200901200-00003. [DOI] [PMC free article] [PubMed] [Google Scholar]
  20. Pepe MS. The statistical evaluation of medical tests for classification and prediction. Oxford University Press; 2003. [Google Scholar]
  21. Ridker P, Buring J, Rifai N, Cook N. Development and validation of improved algorithms for the assessment of global cardiovascular risk in women: the Reynolds Risk Score. Journal of American Medical Association. 2007;297(6):611. doi: 10.1001/jama.297.6.611. [DOI] [PubMed] [Google Scholar]
  22. Ross R. Atherosclerosis is an inflammatory disease. American Heart Journal. 1999;138(5):S419–S420. doi: 10.1016/s0002-8703(99)70266-8. [DOI] [PubMed] [Google Scholar]
  23. Sayed-Tabatabaei F, Schut A, Hofman A, Bertoli-Avella A, Vergeer J, Witteman J, van Duijn C. A study of gene–environment interaction on the gene for angiotensin converting enzyme: a combined functional and population based approach. Journal of medical genetics. 2004;41(2):99–103. doi: 10.1136/jmg.2003.013441. [DOI] [PMC free article] [PubMed] [Google Scholar]
  24. Schut AF, Sayed-Tabatabaei FA, Witteman JC, Avella AM, Vergeer JM, Pols HA, Hofman A, Deinum J, van Duijn CM. Smoking-dependent effects of the angiotensin-converting enzyme gene insertion/deletion polymorphism on blood pressure. Journal of hypertension. 2004;22(2):313–319. doi: 10.1097/00004872-200402000-00015. [DOI] [PubMed] [Google Scholar]
  25. Stephens JW, Bain SC, Humphries SE. Gene–environment interaction and oxidative stress in cardiovascular disease. Atherosclerosis. 2008;200(2):229–238. doi: 10.1016/j.atherosclerosis.2008.04.003. [DOI] [PubMed] [Google Scholar]
  26. Van de Geer S, Bühlmann P, Ritov Y, Dezeure R, et al. On asymptotically optimal confidence regions and tests for high-dimensional models. The Annals of Statistics. 2014;42(3):1166–1202. [Google Scholar]
  27. Venter JC, Adams MD, Myers EW, Li PW, Mural RJ, Sutton GG, Smith HO, Yandell M, Evans CA, Holt RA, et al. The sequence of the human genome. Science. 2001;291(5507):1304–1351. doi: 10.1126/science.1058040. [DOI] [PubMed] [Google Scholar]
  28. Wilson P, D’Agostino R, Levy D, Belanger A, Silbershatz H, Kannel W. Prediction of coronary heart disease using risk factor categories. Circulation. 1998;97(18):1837. doi: 10.1161/01.cir.97.18.1837. [DOI] [PubMed] [Google Scholar]
  29. Xia Y, Cai T, Cai TT. Testing differential network with applications to detecting gene by gene interactions. Biometrika. 2015;102:247–266. doi: 10.1093/biomet/asu074. [DOI] [PMC free article] [PubMed] [Google Scholar]
  30. Zaïtsev AY. On the Gaussian approximation of convolutions under multidimensional analogues of S. N. Bernstein's inequality conditions. Probab. Theory Rel. 1987;74(4):535–566. [Google Scholar]
  31. Zhang C-H, Zhang SS. Confidence intervals for low dimensional parameters in high dimensional linear models. Journal of the Royal Statistical Society: Series B. 2014;76(1):217–242. [Google Scholar]
