Author manuscript; available in PMC: 2018 Jan 29.
Published in final edited form as: Stat Sin. 2018 Jan;28:63–92. doi: 10.5705/ss.202016.0063

Two-Sample Tests for High-Dimensional Linear Regression with an Application to Detecting Interactions

Yin Xia 1, Tianxi Cai 2, T Tony Cai 3
PMCID: PMC5788049  NIHMSID: NIHMS874424  PMID: 29386856

Abstract

Motivated by applications in genomics, we consider in this paper global and multiple testing for the comparisons of two high-dimensional linear regression models. A procedure for testing the equality of the two regression vectors globally is proposed and shown to be particularly powerful against sparse alternatives. We then introduce a multiple testing procedure for identifying unequal coordinates while controlling the false discovery rate and false discovery proportion. Theoretical justifications are provided to guarantee the validity of the proposed tests and optimality results are established under sparsity assumptions on the regression coefficients. The proposed testing procedures are easy to implement. Numerical properties of the procedures are investigated through simulation and data analysis. The results show that the proposed tests maintain the desired error rates under the null and have good power under the alternative at moderate sample sizes. The procedures are applied to the Framingham Offspring study to investigate the interactions between smoking and cardiovascular related genetic mutations important for an inflammation marker.

Keywords: False discovery proportion, false discovery rate, high-dimensional linear regression, hypothesis testing, multiple comparisons, sparsity, two-sample tests

1 Introduction

As we enter a new era of data science, called by some the “information century”, research in several novel genomics and epigenomics fields is well underway. Large-scale genome-wide scans, such as genome-wide association studies, have become widely available tools for identifying common genetic variants that contribute to complex diseases and treatment responses (McCarthy et al. (2008); Venter et al. (2001)). However, there is growing evidence that genetic variants alone explain only a small proportion of the variation in complex disease phenotypes. Most complex diseases result from an interplay between genes and environment (Hunter (2005)). It is thus of substantial interest to rigorously study the effects of the environment and its interaction with genetic predispositions on disease phenotypes.

When the environmental factor is a binary variable such as smoking status or gender, such interaction problems can be addressed through the two-sample high-dimensional regression framework. Specifically, interaction detection can be formulated based on comparing two high-dimensional regression models

Yd=μd+Xdβd+εd,  for d=1,2, (1)

and identifying the nonzero components of β1 − β2, where βd = (β1,d, …, βp,d)T ∈ ℝp, μd = (μ1,d, …, μnd,d)T, Xd = (X1,·,dT, …, Xnd,·,dT)T, Yd = (Y1,d, …, Ynd,d)T, and εd = (ε1,d, …, εnd,d)T, with {εk,d} being independent and identically distributed (i.i.d.) random variables with mean zero and variance σεd2, independent of Xk,·,d, k = 1, …, nd. Two-sample interaction detection problems arise in many other biomedical settings. For example, when the two samples represent diseased and non-diseased groups and Y represents a diagnostic test, the non-zero components of β1 − β2 represent the covariates that affect the diagnostic accuracy of Y (Pepe (2003)). When the two samples represent two treatment groups, the proposed testing procedures have important applications in personalized medicine: the non-zero components of β1 − β2 correspond to markers useful for individualized treatment selection, since the rule that optimizes the treatment selection for an individual patient with genomic markers X can be formed based on (β1 − β2)TX (Matsouaka et al. (2014)). However, the high dimensionality of the genomic data presents substantial statistical challenges in efficiently identifying gene-environment interactions and markers useful for personalized treatment selection.
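For concreteness, model (1) can be simulated as follows. This is a minimal sketch with standard Gaussian covariates; the sample sizes, sparsity level, and signal strength are illustrative choices rather than values used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_two_sample(n1=100, n2=100, p=50, s=5, delta=1.0):
    """Draw (X_d, Y_d) from Y_d = mu_d + X_d beta_d + eps_d for d = 1, 2,
    where beta_1 and beta_2 differ in `s` coordinates by `delta`."""
    beta1 = np.zeros(p)
    beta1[:s] = 1.0                               # a few nonzero coefficients
    diff_idx = rng.choice(p, size=s, replace=False)
    beta2 = beta1.copy()
    beta2[diff_idx] += delta                      # the unequal coordinates
    X1 = rng.standard_normal((n1, p))
    X2 = rng.standard_normal((n2, p))
    Y1 = 0.5 + X1 @ beta1 + rng.standard_normal(n1)    # mu_1 = 0.5
    Y2 = -0.5 + X2 @ beta2 + rng.standard_normal(n2)   # mu_2 = -0.5
    return (X1, Y1), (X2, Y2), diff_idx

sample1, sample2, truth = simulate_two_sample()
```

The nonzero entries of β1 − β2 (here, `truth`) are exactly the coordinates the multiple testing procedure of Section 4 aims to recover.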

There is a paucity of literature focusing on multiple testing of the regression coefficients in the high-dimensional two-sample setting while controlling the false discovery rate (FDR) and false discovery proportion (FDP). For example, Zhang and Zhang (2014), Van de Geer et al. (2014), and Javanmard and Montanari (2013, 2014) considered confidence intervals and tests for a given coordinate of a high-dimensional linear regression vector. Procedures that are based on the “de-biased” Lasso estimators were proposed. The focus was solely on inference for a given coordinate and simultaneous testing of all coordinates was not considered. Recently, Liu and Luo (2014) investigated the one-sample version of the multiple testing problem, testing simultaneously

H0,i : βi,1 = 0 versus H1,i : βi,1 ≠ 0,  i = 1, …, p,

with the control of FDR. They constructed the test statistics based on bias-corrected sample covariances of the residuals and inverse regression, as explained in detail in Section 2.2. The one-sample setting is simpler than the two-sample multiple testing problem considered in the present paper. For example, their proposed test statistics have desirable theoretical properties due to the facts that (i) they are asymptotically normally distributed under H0,i:βi,1=0, and (ii) the correlation between two test statistics is equal to the partial correlation between two covariates, which is fully determined by the precision matrix. However, those properties no longer hold when we extend the hypothesis testing problem to two samples as described in (3).

In this paper, we are interested in developing efficient procedures for comparing β1 and β2. The first goal is to develop a global test for

H0 : β1 = β2  versus  H1 : β1 ≠ β2 (2)

that is powerful against sparse alternatives. We then develop a procedure for simultaneously testing the hypotheses

H0,i : βi,1 = βi,2  versus  H1,i : βi,1 ≠ βi,2,  i = 1, …, p, (3)

with FDR and FDP control. The test statistics are constructed using the covariances between the residuals of the fitted regression models and the inverse regression models. Although the techniques build on the inverse regression method developed in Liu and Luo (2014) for the one-sample case, the two-sample case poses significant additional difficulties in both methodology development and technical analysis. We point out two such major challenges here; a more detailed discussion is given in Section 2.3.

  1. The construction of test statistics is much more involved than the one-sample case. This is mainly due to the fact that the difference of regression coefficients can no longer be reduced to the difference of residual covariances as in the one-sample setting. Furthermore, corrections of the test statistics are essential in the two-sample case to establish the asymptotic normality.

  2. The technical analyses of the two-sample case are much more challenging. This is because the one-sample case can be easily reduced to a weakly correlated testing problem provided that the precision matrix of the covariates is sparse or nearly sparse, while the two-sample case cannot as the correlation structure is much more complicated.

The properties of the proposed testing procedures are investigated theoretically as well as numerically through simulation and data analysis. Theoretical justifications are provided to ensure the validity of the proposed tests and optimality results are established under sparsity assumptions on the regression coefficients. A simulation study is carried out to demonstrate that the proposed tests maintain the desired error rates under the null and have good power under the alternative at moderate sample sizes. The simulation results also show that the new multiple testing procedure outperforms the well known Benjamini-Yekutieli procedure (Benjamini and Yekutieli (2001)). In addition, the proposed testing procedures are illustrated by an application to the Framingham Offspring Study (Kannel et al., 1979) to study how smoking and its interaction with a genetic predisposition affect an inflammation marker which plays an important role in the risk of developing cardiovascular disease.

The rest of the paper is organized as follows. In Section 2, we introduce the construction of the new test statistics and discuss the technical differences and theoretical challenges of the two-sample testing problems. Section 3 develops a maximum-type statistic Mn and the corresponding test for the global hypothesis H0 : β1 = β2 through the inverse regression framework. We establish in this section the asymptotic null distribution of Mn and show the optimality results under sparse alternatives. Large-scale multiple testing with FDR and FDP control is presented in Section 4. Section 5 investigates the numerical performance of the proposed procedures by simulations. In Section 6, we apply the proposed procedures to the Framingham Offspring Study. The proofs of the main results are given in Section 8.

2 Methodology

2.1 Notation and Definitions

We first introduce the notation and definitions that will be used throughout the paper. For a vector βd = (β1,d, …, βp,d)T ∈ ℝp, define the ℓq norm by |βd|q = (Σi=1p |βi,d|q)1/q for 1 ≤ q ≤ ∞. For subscripts, we use the convention that i stands for the ith entry of a vector and (i, j) for the entry in the ith row and jth column of a matrix, k represents the kth sample, and d is the group indicator. Let Xd = (X1,·,dT, …, Xnd,·,dT)T be the nd × p data matrix, and Yd = (Y1,d, …, Ynd,d)T be the nd × 1 response vector, for d = 1, 2. Throughout, suppose that we have i.i.d. random samples {Yk,d, Xk,·,d, 1 ≤ k ≤ nd} with Xk,·,d = (Xk,1,d, …, Xk,p,d) being a random vector with covariance matrix Σd for d = 1, 2. Define Σd−1 = Ωd = (ωi,j,d).

For any vector μd ∈ ℝp, let μ−i,d denote the (p − 1)-dimensional vector formed by removing the ith entry from μd. For a symmetric matrix Ad, let λmax(Ad) and λmin(Ad) denote the largest and smallest eigenvalues of Ad, respectively. For any n × p matrix Ad, Ai,−j,d denotes the ith row of Ad with its jth entry removed and A−i,j,d denotes the jth column of Ad with its ith entry removed; A−i,−j,d denotes the (n − 1) × (p − 1) submatrix of Ad with its ith row and jth column removed. Let A·,−j,d denote the n × (p − 1) submatrix of Ad with the jth column removed, Ai,·,d denote the ith row of Ad, and A·,j,d denote the jth column of Ad. Let Ā·,j,d = n−1 Σi=1n Ai,j,d, Ā(·,j,d) = (Ā·,j,d, …, Ā·,j,d)T ∈ ℝn, and Ā(·,−j,d) be the n × (p − 1) matrix whose columns are the corresponding column-mean vectors. Let Ād = n−1 Σi=1n Ai,·,d. For a matrix Ω = (ωi,j)p×p, the matrix 1-norm is the maximum absolute column sum, ‖Ω‖L1 = max1≤j≤p Σi=1p |ωi,j|, the elementwise infinity norm is ‖Ω‖∞ = max1≤i,j≤p |ωi,j|, and the elementwise ℓ1 norm is ‖Ω‖1 = Σi=1p Σj=1p |ωi,j|. For a set ℋ, let |ℋ| be the cardinality of ℋ. For two sequences of real numbers {an} and {bn}, write an = O(bn) if there exists a constant C such that |an| ≤ C|bn| holds for all n, write an = o(bn) if limn→∞ an/bn = 0, and write an ≍ bn if there are positive constants c and C such that c ≤ an/bn ≤ C for all n.

2.2 Test Statistics

To form the test statistics, we consider the inverse regression models obtained by regressing Xk,i,d on (Yk,d, Xk,−i,d), as introduced in Liu and Luo (2014):

Xk,i,1 = αi,1 + (Yk,1, Xk,−i,1)γi,1 + ηk,i,1,  (k = 1, …, n1),
Xk,i,2 = αi,2 + (Yk,2, Xk,−i,2)γi,2 + ηk,i,2,  (k = 1, …, n2),

where for d = 1, 2, ηk,i,d has mean zero and variance σηi,d2 and is uncorrelated with (Yk,d, Xk,−i,d), and γi,d = (γi,1,d, …, γi,p,d)T satisfies

γi,d = σηi,d2 (βi,d/σεd2, −βi,d β−i,dT/σεd2 − Ω−i,i,dT)T, (4)

where σηi,d2 = (βi,d2/σεd2 + ωi,i,d)−1, as provided in Liu and Luo (2014).

Remark 1

Equation (4) can be obtained directly as follows. Denote the covariance matrix of Z = (Xk,i,d, Yk,d, Xk,−i,d) by Σ = Cov(Z). Section 2.5 of Anderson (2003) shows that γi,d can be obtained by γi,d = Σ22−1Σ21, where Σ22 = Cov(Z1) with Z1 = (Yk,d, Xk,−i,d), and Σ21 = Cov(Z1, Xk,i,d) is the covariance between Z1 and Xk,i,d. Then (4) follows from the regression model Yd = μd + Xdβd + εd and the fact that Xd and εd are uncorrelated with each other.
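The identity (4) can be verified numerically at the population level: build Cov(Y, X) from an arbitrary Σ, β, and σε2, compute γi,d = Σ22−1Σ21 as in Remark 1, and compare with the closed form. The dimensions and parameter values below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)

# Population-level check of identity (4), for i = 1 (Python index 0).
p = 4
A = rng.standard_normal((p, p))
Sigma_x = A @ A.T + p * np.eye(p)      # covariance of X (well conditioned)
Omega = np.linalg.inv(Sigma_x)         # precision matrix Omega = Sigma_x^{-1}
beta = np.array([1.0, -0.5, 0.0, 2.0])
sig_eps2 = 1.5                         # Var(eps)

i = 0
cov_yx = Sigma_x @ beta                # Cov(Y, X) = Sigma_x beta
var_y = beta @ cov_yx + sig_eps2       # Var(Y) = beta' Sigma_x beta + Var(eps)

# gamma_{i,d} = Sigma_22^{-1} Sigma_21 with Z_1 = (Y, X_{-i}), as in Remark 1
Sigma22 = np.block([
    [np.array([[var_y]]),           np.delete(cov_yx, i)[None, :]],
    [np.delete(cov_yx, i)[:, None], np.delete(np.delete(Sigma_x, i, 0), i, 1)]])
Sigma21 = np.concatenate([[cov_yx[i]], np.delete(Sigma_x[i], i)])
gamma = np.linalg.solve(Sigma22, Sigma21)

# Closed form (4): sig_eta2 * (beta_i/sig_eps2, -beta_i beta_{-i}/sig_eps2 - Omega_{-i,i})
sig_eta2 = 1.0 / (beta[i] ** 2 / sig_eps2 + Omega[i, i])
gamma_formula = sig_eta2 * np.concatenate(
    [[beta[i] / sig_eps2],
     -beta[i] * np.delete(beta, i) / sig_eps2 - np.delete(Omega[:, i], i)])
```

Here `np.allclose(gamma, gamma_formula)` holds, consistent with (4) and with σηi,d2 = (βi,d2/σεd2 + ωi,i,d)−1.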

Because ri,d = Cov(εk,d, ηk,i,d) can be expressed as γi,1,d Cov(εk,d, Yk,d) = γi,1,d σεd2 = σηi,d2 βi,d, the null hypotheses in the global testing problem (2) and the entry-wise testing problem (3) are, respectively, equivalent to

H0 : max1≤i≤p |ri,1/σηi,12 − ri,2/σηi,22| = 0, (5)

and

H0,i : ri,1/σηi,12 = ri,2/σηi,22,  i = 1, …, p, (6)

and we base the tests on the estimates of { ri,d/σηi,d2, i = 1, …, p; d = 1, 2}.

Define the residuals

ε̂k,d = Yk,d − Ȳd − (Xk,·,d − X̄d)β̂d,
η̂k,i,d = Xk,i,d − X̄·,i,d − (Yk,d − Ȳd, Xk,−i,d − X̄·,−i,d)γ̂i,d,

where β̂d = (β̂1,d, …, β̂p,d) and γ̂i,d = (γ̂i,1,d, …, γ̂i,p,d) are the respective estimators of βd and γi,d, satisfying

max{|β̂d − βd|1, max1≤i≤p |γ̂i,d − γi,d|1} = OP(an1),
max{|β̂d − βd|2, max1≤i≤p |γ̂i,d − γi,d|2} = OP(an2), (7)

for some an1 and an2 such that

max{an1an2, an22} = o{(n log p)−1/2}, and an1 = o(1/log p). (8)

Estimators β̂d and γ̂i,d that satisfy (7) and (8) can be obtained easily via standard methods such as the Lasso and the Dantzig selector; see, for example, Xia et al. (2015) and Liu and Luo (2014).

Based on the residuals ε̂k,d and η̂k,i,d, a natural estimator of ri,d is the sample covariance between the residuals,

r̃i,d = nd−1 Σk=1nd ε̂k,d η̂k,i,d.

Because r̃i,d tends to be biased, we define a bias-corrected estimator of ri,d as

r̂i,d = r̃i,d + σ̂εd2 γ̂i,1,d + σ̂ηi,d2 β̂i,d, (9)

where σ̂εd2 = nd−1 Σk=1nd ε̂k,d2 and σ̂ηi,d2 = nd−1 Σk=1nd η̂k,i,d2 are the sample variances satisfying

max{|σ̂εd2 − σεd2|, max1≤i≤p |σ̂ηi,d2 − σηi,d2|} = OP{(log p/nd)1/2},

which can be obtained by Lemma 2 in Xia et al. (2015) under conditions (7) and (8). By Lemma 2, the bias of r̃i,d is then of order max{βi,d(log p/nd)1/2, (nd log p)−1/2}.

Remark 2

The most straightforward way to estimate ri,d is to use the sample covariance between the error terms, nd−1 Σk=1nd εk,d ηk,i,d. However, the error terms are unknown, so we use the sample covariance between the residuals, r̃i,d, instead. The bias of r̃i,d exceeds the desired rate (nd log p)−1/2, and thus we estimate the difference between nd−1 Σk=1nd εk,d ηk,i,d and r̃i,d, which, up to order (nd log p)−1/2, is equal to σ̂εd2 γ̂i,1,d + σ̂ηi,d2 β̂i,d. Hence, we define r̂i,d = r̃i,d + σ̂εd2 γ̂i,1,d + σ̂ηi,d2 β̂i,d as in (9).

For i = 1, …, p and d = 1, 2, a natural estimator of ri,d/σηi,d2 can then be defined by

Ti,d=r^i,d/σ^ηi,d2. (10)

Subsequently, we may test the hypotheses (2) and (3) using the estimators 𝒯 = {Ti,1 − Ti,2 : i = 1, …, p}. However, since the Ti,1 − Ti,2 in 𝒯 are heteroscedastic, with possibly a wide range of variability, we instead consider a standardized version of Ti,1 − Ti,2. Specifically, let

Ui,d = nd−1 Σk=1nd {εk,dηk,i,d − E(εk,dηk,i,d)} and Ũi,d = (ri,d + Ui,d)/σηi,d2.

It can be shown in Lemma 2 that, uniformly in i = 1, …, p,

|Ti,d − Ũi,d| = OP{βi,d(log p/nd)1/2} + oP{(nd log p)−1/2}.

Noting that θi,d=Var(Ũi,d)=Var(εk,dηk,i,d/σηi,d2)/nd=(σεd2/σηi,d2+βi,d2)/nd, we estimate θi,d by

θ̂i,d = (σ̂εd2/σ̂ηi,d2 + β̂i,d2)/nd,

and define the standardized statistics

Wi = (Ti,1 − Ti,2)/(θ̂i,1 + θ̂i,2)1/2,  i = 1, …, p. (11)

We base the tests for (2) and (3) on {Wi, i = 1, …, p}, which will be studied in detail in Sections 3 and 4.
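To make the construction concrete, the following sketch assembles the statistics Wi of (11) from data: it centres each sample, fits the forward regression and the p inverse regressions with a simple proximal-gradient Lasso (a stand-in for any estimator satisfying the rate conditions (7) and (8)), forms the bias-corrected covariances r̂i,d of (9), the estimators Ti,d of (10), and the variance estimates θ̂i,d. The penalty level and solver are illustrative choices, not the tuned values of Section 5.

```python
import numpy as np

def lasso_ista(X, y, lam, n_iter=500):
    """Minimise (1/2n)||y - Xb||_2^2 + lam*||b||_1 by proximal gradient (ISTA)."""
    n, p = X.shape
    b = np.zeros(p)
    step = 1.0 / (np.linalg.norm(X, 2) ** 2 / n)     # 1 / Lipschitz constant
    for _ in range(n_iter):
        b = b - step * (X.T @ (X @ b - y) / n)       # gradient step
        b = np.sign(b) * np.maximum(np.abs(b) - step * lam, 0.0)  # soft-threshold
    return b

def w_statistics(samples, lam=0.2):
    """Compute W_i of (11) for samples = [(X_1, Y_1), (X_2, Y_2)]."""
    p = samples[0][0].shape[1]
    T = np.zeros((2, p))
    theta = np.zeros((2, p))
    for d, (X, y) in enumerate(samples):
        n = X.shape[0]
        Xc, yc = X - X.mean(0), y - y.mean()
        beta = lasso_ista(Xc, yc, lam)
        eps = yc - Xc @ beta                         # forward-model residuals
        sig_eps = eps @ eps / n
        for i in range(p):
            Z = np.column_stack([yc, np.delete(Xc, i, axis=1)])
            gam = lasso_ista(Z, Xc[:, i], lam)
            eta = Xc[:, i] - Z @ gam                 # inverse-regression residuals
            sig_eta = eta @ eta / n
            r_tilde = eps @ eta / n
            r_hat = r_tilde + sig_eps * gam[0] + sig_eta * beta[i]   # (9)
            T[d, i] = r_hat / sig_eta                                # (10)
            theta[d, i] = (sig_eps / sig_eta + beta[i] ** 2) / n
    return (T[0] - T[1]) / np.sqrt(theta[0] + theta[1])              # (11)
```

Under the null β1 = β2, the resulting Wi are approximately standard normal, which is what both the global statistic of Section 3 and the FDR procedure of Section 4 exploit.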

2.3 Discussion

We discuss here the substantial differences between the two-sample and one-sample cases and the necessity for significant adjustments and corrections in the two-sample setting.

The proposed tests are based on estimators of ri,1/σηi,12 − ri,2/σηi,22. Here we estimate ri,d = Cov(εk,d, ηk,i,d) by constructing a bias-corrected sample covariance between the residuals, r̂i,d, as defined in (9). That is, we need to estimate the difference between the naive estimate r̃i,d and an unbiased estimate of ri,d, which is nd−1 Σk=1nd εk,d ηk,i,d.

Liu and Luo (2014) considered the one-sample case of the multiple testing problem (3), where ri/σηi2 = 0 is equivalent to ri = 0 under the null hypothesis, and ri is easier to estimate. The procedure in Liu and Luo (2014) is thus based on the estimation of ri instead of ri/σηi2. In the two-sample case, ri,1/σηi,12 = ri,2/σηi,22 is not equivalent to ri,1 = ri,2. Thus, it is necessary to construct testing procedures based directly on estimators of ri,1/σηi,12 − ri,2/σηi,22.

Furthermore, in the one-sample case, the asymptotic normality of Ti can be established because βi,1 = 0 under the null, as shown in Lemma 2. Thus the theoretical properties of the individual test statistics are easier to obtain. In the two-sample case, βi,1 and βi,2 are not necessarily equal to 0 under the null, and corrections are thus essential in order to show that Wi is close to a normal random variable; the technical details are much more complicated.

More importantly, in the one-sample case, βi,1 = 0 under the null hypothesis, and thus Corr(εkηk,i, εkηk,j) = ωi,j/(ωi,iωj,j)1/2, which is fully determined by the precision matrix of the covariates and thus simplifies the calculations. In the two-sample version, βi,1 = βi,2 under the null hypothesis, but they are not necessarily equal to zero. The calculation of Corr(εk,dηk,i,d, εk,dηk,j,d), which determines the correlation between Wi and Wj, is much more involved, and it is shown in the proof of Theorem 4 that

ξi,j,d = Corr(εk,dηk,i,d, εk,dηk,j,d) = (ωi,j,dσεd2 + 2βi,dβj,d){(ωi,i,dσεd2 + 2βi,d2)(ωj,j,dσεd2 + 2βj,d2)}−1/2. (12)

The technical analysis for establishing the theoretical results in Sections 3 and 4 is thus much more challenging.

3 Global Test

In this section, we wish to test the global hypothesis

H0 : β1 = β2 versus H1 : β1 ≠ β2.

We propose a procedure based on the standardized statistics {Wi, i = 1, …, p}

Mn = max1≤i≤p Wi2 = max1≤i≤p (Ti,1 − Ti,2)2/(θ̂i,1 + θ̂i,2). (13)

It is shown in Section 3.1 that, under certain regularity conditions, Mn − 2 log p + log log p converges to a Gumbel distribution under the null, and the asymptotic α-level test can thus be defined as

Ψα = I(Mn ≥ qα + 2 log p − log log p), (14)

where qα is the 1 − α quantile of the Gumbel distribution with cumulative distribution function exp{−π−1/2 exp(−t/2)}, that is,

qα = −log(π) − 2 log log{(1 − α)−1}.

We reject the null hypothesis H0 whenever Ψα = 1.
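Given the Wi, the decision rule (14) is a few lines of code; a sketch, with the Gumbel quantile qα computed from its closed form:

```python
import numpy as np

def global_test(W, alpha=0.05):
    """Asymptotic alpha-level test (14): reject H0 : beta_1 = beta_2 iff
    M_n = max_i W_i^2 exceeds q_alpha + 2 log p - log log p."""
    p = W.size
    M_n = np.max(W ** 2)
    # 1 - alpha quantile of the Gumbel law exp(-pi^{-1/2} exp(-t/2))
    q_alpha = -np.log(np.pi) - 2.0 * np.log(np.log(1.0 / (1.0 - alpha)))
    return bool(M_n >= q_alpha + 2.0 * np.log(p) - np.log(np.log(p)))

# A single large |W_i| drives a rejection; all-zero statistics do not.
reject_null = global_test(np.zeros(400))              # False
reject_alt = global_test(np.r_[np.zeros(399), 8.0])   # True
```

The maximum-type statistic is what makes the test powerful against sparse alternatives: one sufficiently large standardized difference suffices for rejection.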

3.1 Asymptotic Null Distribution

We first introduce some regularity conditions under which Mn − 2 log p + log log p converges weakly to a Gumbel random variable with distribution function exp{−π−1/2 exp(−t/2)}.

  • (C1)

    log p = o(n1/5), n1 ≍ n2, and for some constants C0, C1, C2 > 0, C0−1 ≤ λmin(Ωd) ≤ λmax(Ωd) ≤ C0, C1−1 ≤ σεd2 ≤ C1, and |βd| ≤ C2 for d = 1, 2. There exists some τ > 0 such that |Aτ| = O(pr) with r < 1/4, where Aτ = {i : |βi,d| ≥ (log p)−2−τ, 1 ≤ i ≤ p, for d = 1 or 2}.

  • (C2)

    Let Dd be the diagonal of Ωd and let (ξi,j,d) = Rd = Dd−1/2ΩdDd−1/2, for d = 1, 2. Assume max1≤i<j≤p |ξi,j,d| ≤ ξd for some constant 0 < ξd < 1.

  • (C3)

    There exists some constant K > 0 such that maxa:Var(aTXk,·,dT)=1 E exp{K(aTXk,·,dT)2} and E exp(Kεk,d2) are finite.

Condition (C1) on the eigenvalues is commonly used in the high-dimensional setting and implies that most of the variables are not highly correlated with each other. Condition (C2) is also mild; for example, if max1≤i≠j≤p |ξi,j,d| = 1, then Ωd is singular. (C3) is a sub-Gaussian tail condition, and it can be weakened to a polynomial tail condition if p < nc for some constant c > 0.

Theorem 1

Suppose (C1), (C2), (C3), (7), and (8) hold. Then under H0, for any t ∈ ℝ,

P(Mn − 2 log p + log log p ≤ t) → exp{−π−1/2 exp(−t/2)}, as n1, n2, p → ∞, (15)

where Mn is defined in (13). Under H0, the convergence in (15) is uniform for all {Yk,d, Xk,·,d : k = 1, 2, …, nd} satisfying (C1), (C2), (C3), (7), and (8).

Remark 3

The analysis can be extended to test H0 : βG,1 = βG,2 versus H1 : βG,1 ≠ βG,2 for a given index set G. We can construct the test statistic as MG,n = maxi∈G Wi2, and obtain a similar Gumbel limiting null distribution by replacing p with |G|, as n1, n2, |G| → ∞. Condition (C1) will be slightly different, with Aτ replaced by AG = {i : |βi,d| ≥ (log p)−2−τ, i ∈ G, for d = 1 or 2}.

Remark 4

Condition (C1) is slightly stronger than the conditions in Liu and Luo (2014), as we need |Aτ| = O(pr) with r < 1/4. This is due to a major difference between the one-sample and two-sample cases: the global null H0 : β = 0 is a simple null in the one-sample case, whereas the null H0 : β1 = β2 is composite in the two-sample case. In the one-sample case, Ti is a nearly unbiased estimate of βi because βi = 0 under the global null. However, in the two-sample case, as stated in Lemma 2, additional correction terms involving βi,d are needed in order to make Ti,d nearly unbiased, because βi,1 and βi,2 are not necessarily equal to 0 under the null. Thus, slightly stronger conditions on Aτ are needed.

3.2 Asymptotic Power

We now analyze the asymptotic power of the test Ψα given in (14). The test is shown to be particularly powerful against a large class of sparse alternatives and the power is minimax rate optimal. We first define a class of regression coefficients:

𝒰(c) = {(β1, β2) : max1≤i≤p |βi,1 − βi,2|/(θi,1 + θi,2)1/2 ≥ c(log p)1/2}. (16)

We show that the null hypothesis H0 can be rejected by the test Ψα with overwhelming probability if (β1, β2) ∈ 𝒰(2√2).

Theorem 2

Let the test Ψα be given in (14). Suppose (C1), (C3), (7) and (8) hold. Then

inf(β1,β2)∈𝒰(2√2) P(Ψα = 1) → 1, as n, p → ∞.

Theorem 2 shows that the null parameter set in which β1 = β2 is asymptotically distinguishable from 𝒰(2√2) by the test Ψα.

We further show that the lower bound in (16) is rate optimal. Let 𝒯α be the set of all α-level tests, that is, P(Tα = 1) ≤ α under H0 for all Tα ∈ 𝒯α. If c in (16) is sufficiently small, then any α-level test is unable to reject the null hypothesis correctly uniformly over (β1, β2) ∈ 𝒰(c) with probability tending to one.

Theorem 3

Suppose that log p = o(n). Let α, β > 0 and α + β < 1. Then there exists a constant c0 > 0 such that for all sufficiently large n and p,

inf(β1,β2)∈𝒰(c0) supTα∈𝒯α P(Tα = 1) ≤ 1 − β.

Theorem 3 shows that the order (log p)1/2 in the lower bound of max1≤i≤p {|βi,1 − βi,2|/(θi,1 + θi,2)1/2} in (16) cannot be further improved.

4 Multiple Testing with False Discovery Rate Control

4.1 Multiple Testing Procedure

If the global null hypothesis is rejected, it is then of interest to identify the subset of variables in X that interact with the group indicator. This can be achieved by simultaneously testing on the entries of β1β2 with FDR and FDP control,

H0,i : βi,1 = βi,2 versus H1,i : βi,1 ≠ βi,2,  1 ≤ i ≤ p. (17)

The standardized differences Ti,1 − Ti,2 are given by the test statistics Wi = (Ti,1 − Ti,2)/(θ̂i,1 + θ̂i,2)1/2 as in (11). Let t be the threshold such that H0,i is rejected if |Wi| ≥ t. Let ℋ0 = {i : βi,1 = βi,2, 1 ≤ i ≤ p} be the set of true nulls. Let R0(t) = Σi∈ℋ0 I(|Wi| ≥ t) and R(t) = Σ1≤i≤p I(|Wi| ≥ t), respectively, denote the total number of false positives and the total number of rejections. The FDP and FDR are defined as

FDP(t) = R0(t)/max{R(t), 1},  FDR(t) = E{FDP(t)}.

Ideally, we select the threshold level as

t0 = inf{0 ≤ t ≤ (2 log p)1/2 : FDP(t) ≤ α}.

However, ℋ0 is unknown, and due to the sparsity of β1 − β2 we estimate Σi∈ℋ0 I{|Wi| ≥ t} by 2p{1 − Φ(t)}, where Φ(t) is the standard normal cumulative distribution function. This leads to the following multiple testing procedure.

  1. Calculate the test statistics Wi = (Ti,1Ti,2)/(θ̂i,1 + θ̂i,2)1/2 as in (11).

  2. For a given 0 ≤ α ≤ 1, calculate
    t̂ = inf{0 ≤ t ≤ (2 log p)1/2 : 2p{1 − Φ(t)}/max{R(t), 1} ≤ α}.

    If t̂ does not exist, set t̂ = (2 log p)1/2.

  3. For 1 ≤ i ≤ p, reject H0,i if and only if |Wi| ≥ t̂.
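Given the statistics Wi, steps 2 and 3 reduce to a one-dimensional search; a sketch, with Φ evaluated via the error function and an illustrative W mixing nulls with a few strong signals:

```python
import math
import numpy as np

def fdr_threshold(W, alpha=0.1):
    """Steps 2-3: smallest t in [0, sqrt(2 log p)] at which the estimated
    FDP, 2p(1 - Phi(t)) / max(R(t), 1), drops below alpha."""
    p = W.size
    t_max = math.sqrt(2.0 * math.log(p))
    Phi = lambda t: 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))
    for t in np.linspace(0.0, t_max, 2000):       # grid search for t-hat
        R = max(int(np.sum(np.abs(W) >= t)), 1)   # number of rejections
        if 2.0 * p * (1.0 - Phi(t)) / R <= alpha:
            return float(t)
    return t_max                                  # t-hat does not exist

W = np.concatenate([np.zeros(90), np.full(10, 6.0)])  # 90 nulls, 10 signals
t_hat = fdr_threshold(W)
rejected = np.abs(W) >= t_hat                 # reject H_{0,i} iff |W_i| >= t-hat
```

Estimating the false positive count by 2p{1 − Φ(t)} rather than 2|ℋ0|{1 − Φ(t)} is what makes the procedure slightly conservative when β1 − β2 is not sparse.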

4.2 Theoretical Properties

We now investigate the theoretical properties of this multiple testing procedure. For any 1 ≤ ip, define

Γi(γ) = {j : 1 ≤ j ≤ p, |ξi,j,d| ≥ (log p)−2−γ, d = 1, 2},

where ξi,j,d is defined in Condition (C2). Under regularity conditions, this procedure controls the FDP and FDR at the pre-specified level α, asymptotically.

Theorem 4

Let

𝒮ρ = {i : 1 ≤ i ≤ p, |βi,1 − βi,2|/(θi,1 + θi,2)1/2 ≥ (log p)1/2+ρ}.

Suppose for some ρ > 0 and some δ > 0, |𝒮ρ| ≥ [1/(π1/2α) + δ](log p)1/2. Suppose that |Aτ ∩ ℋ0| = o(pν) for any ν > 0, where Aτ is given in Condition (C1). Assume that p0 = |ℋ0| ≥ cp for some c > 0, and that (7) and (8) hold. If there exists some γ > 0 such that max1≤i≤p |Γi(γ)| = o(pν) for any ν > 0, then under (C1)–(C3) with p ≤ cnr for some c > 0 and r > 0, we have

lim(n,p)→∞ FDR(t̂)/(αp0/p) = 1,
FDP(t̂)/(αp0/p) → 1

in probability, as (n, p) → ∞.

The condition on |𝒮ρ| is mild, because among the p hypotheses in total it only requires a small number of entries with standardized difference exceeding (log p)1/2+ρ/n1/2 for some constant ρ > 0. The technical condition |Aτ ∩ ℋ0| = o(pν) for any ν > 0 ensures that most of the regression residuals are not highly correlated with each other under the null hypotheses H0,i : βi,1 = βi,2.

5 Simulation Study

We examine the numerical performance, including the sizes and powers, of both the global and the multiple testing procedures through simulation studies. We investigated the performance of both procedures under two sets of simulations. For the first, we generated the data by considering two constructions of regression coefficients under three matrix models, with covariates being a combination of continuous and discrete random variables. For the second set, we studied the numerical performance of the proposed multiple testing procedure in a setting similar to the data application described in Section 6. We compared the proposed multiple testing procedure with the Benjamini-Yekutieli (B-Y) procedure (Benjamini and Yekutieli (2001)), and show that the B-Y procedure is much more conservative and has lower power in all cases.

5.1 Implementation Details

The proposed testing procedures require the estimation of the regression coefficients βd and γi,d, for i = 1, …, p and d = 1, 2. One may use the Lasso to estimate these parameters, as follows:

β̂d = DX−1/2 arg minu {(2nd)−1 |(Xd − X̄d)DX−1/2u − (Yd − Ȳd)|22 + λn|u|1}, (18)

and

γ̂i,d = Di,d−1/2 arg minv {(2nd)−1 |((Yd, X·,−i,d) − (Ȳd, X̄(·,−i,d)))Di,d−1/2v − (X·,i,d − X̄(·,i,d))|22 + λi,n|v|1}, (19)

where DX = diag(Σ̂), Di,d = diag(σ̂Yd, Σ̂−i,−i), λn = κ{σ̂Yd log p/nd}1/2 and λi,n = κ{σ̂i,i log p/nd}1/2, in which σ̂Yd is the sample variance of Yd and Σ̂ = (σ̂i,j) is the sample covariance matrix of Xd. In the global testing of H0 : β1 = β2, we chose the tuning parameter κ = 2.

For multiple testing of H0,i : βi,1 = βi,2, we selected the tuning parameters λn and λi,n in (18) and (19) adaptively from the data, with the principle of making Σi∈ℋ0 I{|Wi| ≥ t} and 2{1 − Φ(t)}|ℋ0| as close as possible. That is, a good choice of the tuning parameters should minimize the error

∫c1 [Σi I{|Wi(b)| ≥ Φ−1(1 − α/2)}/(αp) − 1]2 dα,

where c > 0 and Wi(b) is the statistic computed with the corresponding tuning parameter. Step 2 below is a discretization of this integral. The algorithm is summarized as follows.

  1. Let λn = (b/20){σ̂Yd log p/nd}1/2 and λi,n = (b/20){σ̂i,i log p/nd}1/2 for b = 1, …, 40. For each b, calculate β̂d(b) and γ̂i,d(b), i = 1, …, p, d = 1, 2. Based on these estimates of the regression coefficients, construct the corresponding statistics Wi(b) for each b.

  2. Choose b̂ as the minimizer
    b̂ = arg min1≤b≤40 Σs=110 [Σ1≤i≤p I{|Wi(b)| ≥ Φ−1(1 − s[1 − Φ{(log p)1/2}]/10)}/(2ps[1 − Φ{(log p)1/2}]/10) − 1]2.

The tuning parameters λn and λi,n are then chosen to be

λn = (b̂/20){σ̂Yd log p/nd}1/2 and λi,n = (b̂/20){σ̂i,i log p/nd}1/2. (20)
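Step 2 of the algorithm can be sketched as follows, assuming the statistics Wi(b) from Step 1 have already been computed; the candidate vectors passed in below are synthetic stand-ins (well-calibrated versus inflated null statistics) rather than output of the Lasso fits (18) and (19).

```python
import math
from statistics import NormalDist
import numpy as np

def select_b(W_by_b, p):
    """Pick the candidate b minimising the discretized calibration error:
    the observed number of rejections at each threshold should match the
    null expectation 2p * s[1 - Phi((log p)^{1/2})]/10, s = 1, ..., 10."""
    nd = NormalDist()
    tail = 1.0 - nd.cdf(math.sqrt(math.log(p)))   # 1 - Phi((log p)^{1/2})
    def calib_error(W):
        total = 0.0
        for s in range(1, 11):
            level = s * tail / 10.0
            t = nd.inv_cdf(1.0 - level)
            observed = float(np.sum(np.abs(W) >= t))
            total += (observed / (2.0 * p * level) - 1.0) ** 2
        return total
    return min(W_by_b, key=lambda b: calib_error(W_by_b[b]))

rng = np.random.default_rng(0)
p = 5000
W_by_b = {1: rng.standard_normal(p),        # well-calibrated null statistics
          2: 3.0 * rng.standard_normal(p)}  # badly inflated statistics
b_hat = select_b(W_by_b, p)                 # selects 1 here
```

The criterion rewards tuning parameters under which the empirical tail counts of the Wi(b) match the standard normal tail, which is the behavior the FDP estimate 2p{1 − Φ(t)} relies on.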

5.2 Simulation Under Different Matrix Models

We first generated the design matrices Xk,d, for k = 1, …, nd and d = 1, 2, with some of the covariates being continuous and the others discrete. For simplicity, we generated Xk,d from the same distribution for d = 1, 2. As a first step, for three different matrix models, we obtained i.i.d. samples Xk,d ~ N(0, Σ(m)), for k = 1, …, nd, with m = 1, 2, and 3. We then replaced l covariates of Xk,d by one of the three discrete values 0, 1, or 2, with probability 1/3 each, where l is a random integer between ⌊p/2⌋ and p. We first introduce the matrix models Σ(m) = (Ω(m))−1 used in the simulations. Let D = (Di,j) be a diagonal matrix with Di,i ~ Unif(1, 3) for i = 1, …, p. The following models were used to generate the design matrices.

  • Model 1: Ω*(1) = (ωi,j*(1)), where ωi,i*(1) = 1, ωi,i+1*(1) = ωi+1,i*(1) = 0.6, ωi,i+2*(1) = ωi+2,i*(1) = 0.3, and ωi,j*(1) = 0 otherwise. Ω(1) = D1/2Ω*(1)D1/2.

  • Model 2: Ω*(2) = (ωi,j*(2)), where ωi,j*(2) = ωj,i*(2) = 0.5 for i = 10(k − 1) + 1 and 10(k − 1) + 2 ≤ j ≤ 10(k − 1) + 10, 1 ≤ k ≤ p/10, and ωi,j*(2) = 0 otherwise. Ω(2) = D1/2(Ω*(2) + δI)/(1 + δ)D1/2 with δ = |λmin(Ω*(2))| + 0.05.

  • Model 3: Ω*(3) = (ωi,j*(3)), where ωi,i*(3) = 1, ωi,j*(3) = 0.8 × Bernoulli(1, 0.05) for i < j and ωj,i*(3) = ωi,j*(3). Ω(3) = D1/2(Ω*(3) + δI)/(1 + δ)D1/2 with δ = |λmin(Ω*(3))| + 0.05.
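As an illustration, Model 1 can be generated as follows (Models 2 and 3 follow the same pattern, with their own Ω* and the δI shift to enforce positive definiteness); the replacement of a random subset of covariates by the discrete values 0, 1, 2 is omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

def model1_precision(p):
    """Omega(1) = D^{1/2} Omega*(1) D^{1/2}: banded Omega*(1) with 1 on the
    diagonal, 0.6 on the first off-diagonal and 0.3 on the second."""
    Om = np.eye(p)
    k = np.arange(p - 1)
    Om[k, k + 1] = Om[k + 1, k] = 0.6
    k = np.arange(p - 2)
    Om[k, k + 2] = Om[k + 2, k] = 0.3
    d = np.sqrt(rng.uniform(1, 3, size=p))     # D^{1/2} with D_ii ~ Unif(1, 3)
    return Om * np.outer(d, d)                 # D^{1/2} Om* D^{1/2}, elementwise

Omega1 = model1_precision(50)
Sigma1 = np.linalg.inv(Omega1)                 # covariance for X ~ N(0, Sigma(1))
X = rng.multivariate_normal(np.zeros(50), Sigma1, size=100)
```

The banded Ω*(1) is positive definite by construction, so no δI shift is needed for this model.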

Global Test

For the global testing of H0 : β1 = β2, the sample sizes were taken to be n = n1 = n2 = 100, while the dimension p varied over the values 100, 200, 400, and 1000. Under the global null hypothesis, we have β1 = β2 = β, and two scenarios for generating β were considered. For case 1, 10 nonzero locations {k1, …, k10} of β were randomly generated, with magnitudes βki,1 = 2i0.5n1−0.15, i = 1, …, 10. For case 2, s nonzero locations of β were randomly selected, with s = 5, 8, 10, and 15 for p = 100, 200, 400, and 1000, respectively. The nonzero locations had magnitudes drawn arbitrarily between −10 and 10. The error terms εk,d were generated as normal random variables with mean 0 and variances taking values between 0.5 and 2.5. The nominal significance level for all the tests was set at α1 = 0.05.

Table 1 shows that the sizes of the global test Ψα1 are close to the nominal level for both cases under all matrix models. This reflects the fact that the null distribution of the test statistic Mn is well approximated by its limiting null distribution, as shown in Theorem 1. The empirical sizes are slightly below the nominal level in some cases for lower dimensions, as similarly observed in Xia et al. (2015), due to correlation among the variables. Table 1 also shows that the proposed test is powerful in all settings, even though β1 and β2 differ in only five or fewer locations with magnitudes of the order (log p/n)1/2.

Table 1.

Empirical sizes and powers (%) for global testing with α1 = 0.05, n1 = n2 = 100, and 1000 replications.

         Case 1                       Case 2
p        Model 1  Model 2  Model 3    Model 1  Model 2  Model 3

Size
100        4.1      3.2      2.9        4.4      2.9      2.8
400        4.8      3.8      3.7        4.0      4.1      3.5
1000       6.1      4.4      5.4        5.9      4.6      6.4

Power
100       71.9     64.3     67.4       95.1     97.1     96.6
400       88.3     86.2     83.5       82.3     77.0     82.1
1000      95.1     92.6     97.9       47.3     42.0     48.1

To evaluate the power of the global test, we selected five locations, {k1, …, k5}, among the nonzero locations of β1, and set βkj,2 = βkj,1 + uj, j = 1, …, 5, where uj was drawn uniformly from the set [−2β(log p/n)1/2, −β(2 log p/n)1/2] ∪ [β(2 log p/n)1/2, 2β(log p/n)1/2], with β = max1≤i≤p |βi,1|. The actual sizes and powers in percentage for each case under the three matrix models, reported in Table 1, are estimated from 1000 replications. For each replication, the nonzero locations and magnitudes of the regression coefficients could vary.

Multiple Testing

For simultaneous testing of {H0,i : βi,1 − βi,2 = 0, 1 ≤ i ≤ p} with FDR control, we first generated β1 according to the above two cases. For case 1, ten nonzero locations {k1′, …, k10′} for β2 were randomly generated, and the locations could vary between the two vectors. The magnitudes were generated as βki′,2 = 4i0.5n2−0.15, i = 1, …, 10. For case 2, s nonzero locations for β2 were randomly selected, again with s = 5, 8, 10, and 15 for p = 100, 200, 400, and 1000, respectively, also with magnitudes drawn arbitrarily between −10 and 10.

In Table 2, we present the empirical FDR and true discovery rate (power) of the proposed procedure (NEW) and the B-Y procedure at the FDR level of α2 = 0.1, based on 100 replications, where the power is summarized based on

(1/100) Σl=1100 {Σi∈ℋ1 I(|Wi,l| ≥ t̂)/|ℋ1|},

where Wi,l denotes the standardized difference for the lth replication and ℋ1 denotes the set of nonzero locations of β1 − β2. The results suggest that, across all configurations, the FDRs are controlled under the nominal level α by both procedures. However, the B-Y procedure is extremely conservative in all scenarios. For the new FDR procedure, the empirical FDRs are also conservative, due to the correlations among the regression residuals under the nulls H0,i, and also due to the fact that we use p to estimate |ℋ0| because the latter is usually unknown. Furthermore, the total number of true signals is small in all cases because of the sparsity of the regression coefficients; for example, when the total number of true signals is ten, the FDP for each replication tends to be either 0 or some number close to 0.1, which also causes the conservatism of the FDR estimate. In case 2, the empirical FDR gets closer to the nominal level as the dimension increases, because the number of true signals grows with p. In summary, the new procedure has empirical FDR much closer to the nominal level than the B-Y procedure in all cases. Table 2 also shows that the FDR control procedure introduced in Section 4 is more powerful than the B-Y procedure across the different scenarios.

Table 2.

Empirical FDRs and powers (%) for the new FDR procedure and B-Y procedure with α = 0.1, n_1 = n_2 = 100, based on 100 replications.

p Method Case 1 Case 2

Model 1 Model 2 Model 3 Model 1 Model 2 Model 3
Size

100 NEW 5.9 5.8 6.8 3.8 4.5 3.6
B-Y 0.3 1.0 0.7 0.1 0.3 0.7

400 NEW 6.7 7.4 6.8 6.2 5.5 5.5
B-Y 0.4 0.6 0.4 0.2 0.7 0.5

1000 NEW 6.2 6.0 6.1 9.4 9.4 9.8
B-Y 0.6 1.0 0.4 1.5 1.6 1.4

Power

100 NEW 95.3 94.7 94.7 93.3 92.1 90.4
B-Y 91.5 88.1 88.5 88.6 90.3 88.3

400 NEW 92.7 88.2 90.8 84.3 82.9 83.6
B-Y 86.1 82.2 84.3 81.5 78.7 81.3

1000 NEW 84.7 82.7 85.1 71.7 70.4 71.9
B-Y 77.7 75.0 77.6 66.2 64.5 66.1

5.3 Simulation by Mimicking Data

We now consider a simulation setting that mimics the data analyzed in Section 6, where p = 119, n_1 = 46, and n_2 = 417. We investigated both constructions of the regression coefficients considered in Section 5.2, with ten nonzero locations, under all three matrix models, with covariates that are a mixture of continuous and discrete random variables. The nominal level was set at α = 0.1, and the empirical FDRs and powers of both FDR procedures, reported in Table 3, were evaluated based on 100 replications. As in Section 5.2, the empirical FDRs are close to the nominal level under this data setting for the new FDR procedure, while the B-Y procedure is again very conservative. For case 1, the empirical FDRs of the new procedure are slightly larger than the nominal level because n_1 is much smaller than n_2 here, so that β_1 and β_2 have magnitudes much closer to each other by construction. The performance of the new method in case 2 is less conservative than in Section 5.2 because there are ten nonzero locations among the p = 119 regression coefficients. Table 3 also indicates that the new procedure is more powerful than the B-Y procedure in all scenarios under this data setting.

Table 3.

Empirical FDRs and powers (%) for the new FDR procedure and B-Y procedure under the data setting with α = 0.1, p = 119, n_1 = 46, n_2 = 417, based on 100 replications.

p Method Case 1 Case 2

Model 1 Model 2 Model 3 Model 1 Model 2 Model 3
Size

119 NEW 9.4 11.2 11.0 8.7 8.9 8.8
B-Y 2.2 3.0 2.9 1.7 1.4 1.6

Power

119 NEW 83.6 81.7 83.9 79.6 78.2 80.3
B-Y 76.2 72.1 74.8 73.7 72.6 74.6

6 Data Analysis

We illustrate our proposed methods using the Framingham Offspring Study (Kannel et al. (1979)) of coronary artery disease (CAD). Over the past three decades, various risk prediction models for CAD have been developed (Wilson et al. (1998); Ridker et al. (2007)). Unlike those for many other diseases, risk models such as the Framingham Risk Score have been incorporated into clinical practice guidelines (Lloyd-Jones et al. (2004); D'Agostino Sr et al. (2008)). However, these models, largely based on traditional clinical risk factors, have recognized limitations in their clinical utility. It is thus important to explore avenues beyond the routine clinical measures to improve prediction. One potential approach is to fully understand the roles of intermediate phenotypes, such as C-reactive protein (CRP), and genomic markers. In recent years, many genome-wide association studies (GWAS) have been conducted to identify CAD-related single-nucleotide polymorphism (SNP) mutations. The newly identified SNPs, while significantly associated with CAD risk or the intermediate phenotypes of CAD, explain very little of the genetic risk for the trait (Humphries et al. (2008); Paynter et al. (2009)). This coincides with the growing awareness that the failure to identify genetic scores that significantly improve risk prediction for complex traits may be in part due to failure to account for the interplay of genes and environment. It is thus of substantial interest to study the environment and its interaction with genetic predisposition in causing human diseases.

Here, we use data from the Framingham Offspring Study to examine how the interaction between smoking and genetic risk factors affects the inflammation marker CRP, since the inflammation system plays a vital role in the atherosclerotic process (Ross (1999)). We focus on the 463 female participants with complete information on CRP, 116 SNPs previously reported as associated with CAD intermediate phenotypes, the two leading principal components that adjust for population stratification, as well as age and smoking status at exam seven. Smoking is known to roughly double the lifetime risk of CAD and is thought to increase cardiovascular risk through several different mechanisms. We examine the interaction between smoking and the genetic markers, as well as other risk factors, based on the proposed method. We fit separate linear regression models for smokers and for non-smokers, and variables with significantly different coefficients between the two groups are deemed to have an interactive effect with smoking.
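To make this comparison strategy concrete, here is a simplified low-dimensional sketch (illustrative only: it uses ordinary least squares with classical standard errors as a stand-in for the paper's high-dimensional procedure, and the function names are ours). Each group is fit separately, and coefficients are compared through standardized differences:

```python
import numpy as np

def ols_coef_se(X, y):
    # Classical OLS with an intercept; returns slope estimates and their
    # standard errors. A low-dimensional stand-in for the debiased
    # high-dimensional estimates used in the paper.
    n, p = X.shape
    Xd = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    resid = y - Xd @ beta
    sigma2 = resid @ resid / (n - p - 1)
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(Xd.T @ Xd)))
    return beta[1:], se[1:]

def standardized_differences(X1, y1, X2, y2):
    # W_i = (beta_hat_{i,1} - beta_hat_{i,2}) / sqrt(se_{i,1}^2 + se_{i,2}^2):
    # a large |W_i| flags a coefficient that differs between the two groups,
    # i.e., a candidate interaction with the grouping variable (here, smoking).
    b1, se1 = ols_coef_se(X1, y1)
    b2, se2 = ols_coef_se(X2, y2)
    return (b1 - b2) / np.sqrt(se1**2 + se2**2)
```

In the actual analysis the two groups are smokers and non-smokers, the response is CRP, and the covariates are the SNPs and clinical variables; the multiple testing procedure of Section 4 is then applied to the resulting standardized differences.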

The effects on CRP of the top eight SNPs, rs11585329, rs17583120, rs17132534, rs11214606, rs17529477, rs10891552, rs4293, and rs4351, are considered significantly modified by smoking. Interestingly, the interaction between smoking and rs11585329 has been reported as an important contributor to the risk of colorectal cancer, and inflammation is a hallmark of cancer (Liu et al. (2013)). SNP rs17132534 belongs to the UCP2 gene, whose main function is the control of mitochondria-derived reactive oxygen species. A variant in UCP2 has previously been shown to interact with smoking to influence plasma markers of oxidative stress, and hence is likely to be associated with prospective CHD risk (Stephens et al. (2008)). SNPs rs10891552, rs17529477, and rs11214606 all belong to the DRD2 gene, which is linked to addictive behaviors, including alcoholism and smoking. Smoking was found to modify the effects of polymorphisms in the DRD2 gene on gastric cancer risk (Ikeda et al. (2008)). SNPs rs4293 and rs4351 belong to the ACE gene, linked with hypertension and CAD among other disorders. Interactions between smoking and polymorphisms in the ACE gene have been reported for blood pressure and coronary atherosclerosis (Hibi et al. (1997); Sayed-Tabatabaei et al. (2004); Schut et al. (2004)).

7 Extension to Non-Binary Environmental Variable

Motivated by applications in genomics, we have proposed hypothesis testing procedures for detecting the interactions between environment and genomic markers when the environmental variable is binary, such as smoking status, as illustrated in Section 6. Our testing approach can be extended to detect the interactions when the environmental variable is discrete and finite, but non-binary. Specifically, suppose the environmental variable takes K possible values. Interaction detection can then be formulated based on comparing K high-dimensional regression models

Y_d = \mu_d + X_d\beta_d + \varepsilon_d, \quad \text{for } d = 1, \ldots, K.

One wishes to develop a global test for

H_0: \beta_1 = \beta_2 = \cdots = \beta_K \quad \text{versus} \quad H_1: \beta_l \ne \beta_k \text{ for some } 1 \le l < k \le K, \qquad (21)

as well as develop a procedure for simultaneously testing the hypotheses

H_{0,i}: \beta_{i,1} = \beta_{i,2} = \cdots = \beta_{i,K} \quad \text{versus} \quad H_{1,i}: \beta_{i,l} \ne \beta_{i,k} \text{ for some } 1 \le l < k \le K, \quad i = 1, \ldots, p, \qquad (22)

with FDR and FDP control.

The test statistics for each model can be formulated similarly as in Section 2.2. For d = 1, …, K, we let

T_{i,d} = \hat{r}_{i,d}/\hat{\sigma}_{\eta_{i,d}}^2

and estimate θi,d by

\hat{\theta}_{i,d} = \left(\hat{\sigma}_{\varepsilon_d}^2/\hat{\sigma}_{\eta_{i,d}}^2 + \hat{\beta}_{i,d}^2\right)/n_d.

Then the pairwise standardized statistics can be defined by

W_i^{(l,k)} = \frac{T_{i,l} - T_{i,k}}{(\hat{\theta}_{i,l} + \hat{\theta}_{i,k})^{1/2}}, \quad 1 \le l < k \le K, \quad i = 1, \ldots, p.

When K is finite, we then construct the sum-of-squares-type test statistic

S_i = \sum_{1 \le l < k \le K} \left(W_i^{(l,k)}\right)^2.

As in Cai and Xia (2014), it can be shown that the limiting null distribution of S_i is a mixture of chi-squared distributions. Based on this fact, we can further develop global and multiple testing procedures. When the environmental variable is binary, the test statistics S_i reduce to (11) in Section 2.2. On the other hand, if the environmental variable is continuous, the testing problem is significantly different and outside the scope of the current paper; we leave it for future research.
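Assuming the per-group statistics T_{i,d} and variance estimates θ̂_{i,d} are already available (their estimation follows Section 2.2), the sum-of-squares statistic above can be assembled as in this illustrative sketch (the function name is ours):

```python
import numpy as np
from itertools import combinations

def sum_of_squares_stats(T, theta):
    # T, theta: (K, p) arrays holding T_{i,d} and theta_hat_{i,d} for the K
    # groups. Returns S_i = sum over all pairs l < k of (W_i^{(l,k)})^2,
    # where W_i^{(l,k)} is the pairwise standardized difference.
    K, p = T.shape
    S = np.zeros(p)
    for l, k in combinations(range(K), 2):
        W_lk = (T[l] - T[k]) / np.sqrt(theta[l] + theta[k])
        S += W_lk ** 2
    return S
```

For K = 2 there is a single pair, so S_i reduces to the squared standardized difference W_i² used in the two-sample test.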

8 Proofs

We prove the main results in this section. We begin by collecting technical lemmas that will be used in the proof of the main theorems.

8.1 Technical Lemmas

The first lemma is the classical Bonferroni inequality.

Lemma 1 (Bonferroni inequality)

Let B = \bigcup_{t=1}^{p} B_t. For any k < [p/2], we have

\sum_{t=1}^{2k} (-1)^{t-1} F_t \le P(B) \le \sum_{t=1}^{2k-1} (-1)^{t-1} F_t,

where F_t = \sum_{1 \le i_1 < \cdots < i_t \le p} P(B_{i_1} \cap \cdots \cap B_{i_t}).

For d = 1, 2, let U_{i,d} = n_d^{-1}\sum_{k=1}^{n_d}\{\varepsilon_{k,d}\eta_{k,i,d} - E(\varepsilon_{k,d}\eta_{k,i,d})\} and \tilde{U}_{i,d} = \beta_{i,d} + U_{i,d}/\sigma_{\eta_{i,d}}^2. The following lemma is essentially proved in Liu and Luo (2014).

Lemma 2

Suppose that Conditions (C1), (C3), (7) and (8) hold. Then

T_{i,d} = \tilde{U}_{i,d} + \left(\tilde{\sigma}_{\varepsilon_d}^2/\sigma_{\varepsilon_d}^2 + \tilde{\sigma}_{\eta_{i,d}}^2/\sigma_{\eta_{i,d}}^2 - 2\right)\beta_{i,d} + o_P\{(n_d \log p)^{-1/2}\},

where \tilde{\sigma}_{\varepsilon_d}^2 = n_d^{-1}\sum_{k=1}^{n_d}(\varepsilon_{k,d} - \bar{\varepsilon}_d)^2 and \tilde{\sigma}_{\eta_{i,d}}^2 = n_d^{-1}\sum_{k=1}^{n_d}(\eta_{k,i,d} - \bar{\eta}_{i,d})^2, with \bar{\varepsilon}_d = n_d^{-1}\sum_{k=1}^{n_d}\varepsilon_{k,d} and \bar{\eta}_{i,d} = n_d^{-1}\sum_{k=1}^{n_d}\eta_{k,i,d}. Consequently, uniformly in i = 1, …, p,

|T_{i,d} - \tilde{U}_{i,d}| = O_P\{\beta_{i,d}(\log p/n_d)^{1/2}\} + o_P\{(n_d \log p)^{-1/2}\}.

Lemma 3

Let X_k ~ N(\mu_1, \Sigma_1) for k = 1, …, n_1 and Y_k ~ N(\mu_2, \Sigma_2) for k = 1, …, n_2.

Define

\tilde{\Sigma}_1 = (\tilde{\sigma}_{i,j,1})_{p \times p} = \frac{1}{n_1}\sum_{k=1}^{n_1}(X_k - \mu_1)(X_k - \mu_1)^{\top}, \qquad \tilde{\Sigma}_2 = (\tilde{\sigma}_{i,j,2})_{p \times p} = \frac{1}{n_2}\sum_{k=1}^{n_2}(Y_k - \mu_2)(Y_k - \mu_2)^{\top}.

Then, for some constant C > 0, σ̃i,j,1 − σ̃i,j,2 satisfies the large deviation bound

P\left[\max_{(i,j) \in \mathcal{S}} \frac{(\tilde{\sigma}_{i,j,1} - \tilde{\sigma}_{i,j,2} - \sigma_{i,j,1} + \sigma_{i,j,2})^2}{\mathrm{Var}\{(X_{k,i} - \mu_{1,i})(X_{k,j} - \mu_{1,j})\}/n_1 + \mathrm{Var}\{(Y_{k,i} - \mu_{2,i})(Y_{k,j} - \mu_{2,j})\}/n_2} \ge x^2\right] \le C|\mathcal{S}|\{1 - \Phi(x)\} + O(p^{-1})

uniformly for 0 ≤ x ≤ (8 log p)^{1/2} and any subset \mathcal{S} ⊆ {(i, j) : 1 ≤ i ≤ j ≤ p}.

The complete proof of this lemma can be found in the supplementary material of Xia et al. (2015).

8.2 Proof of Theorem 1

To prove Theorem 1, we first show that the terms in A_τ are negligible. Then we focus on the terms in ℋ\A_τ, where ℋ = {1, …, p}, and show that P(\max_{i \in \mathcal{H}\setminus A_\tau} W_i^2 - 2\log p + \log\log p \le t) \to \exp(-\pi^{-1/2}\exp(-t/2)), where W_i is defined in (11).

Define

V_i = \frac{U_{i,1}/\sigma_{\eta_{i,1}}^2 - U_{i,2}/\sigma_{\eta_{i,2}}^2}{(\theta_{i,1} + \theta_{i,2})^{1/2}},

where \theta_{i,d} = \mathrm{Var}(\tilde{U}_{i,d}) = \mathrm{Var}(\varepsilon_{k,d}\eta_{k,i,d}/\sigma_{\eta_{i,d}}^2)/n_d = (\sigma_{\varepsilon_d}^2/\sigma_{\eta_{i,d}}^2 + \beta_{i,d}^2)/n_d, for d = 1, 2. By Lemma 2 in Xia et al. (2015), under conditions (7) and (8), we have

|\hat{\sigma}_{\varepsilon_d}^2 - \sigma_{\varepsilon_d}^2| = O_P\{(\log p/n_d)^{1/2}\}, \quad \text{and} \quad \max_i |\hat{\sigma}_{\eta_{i,d}}^2 - \sigma_{\eta_{i,d}}^2| = O_P\{(\log p/n_d)^{1/2}\}. \qquad (23)

Thus we have

\max_i |\hat{\theta}_{i,d} - \theta_{i,d}| = o_P\{1/(n_d \log p)\}. \qquad (24)

By Lemma 2, we have

W_i = V_i + b_i + o_P\{(\log p)^{-1/2}\},

where b_i = \{(\tilde{\sigma}_{\varepsilon_1}^2/\sigma_{\varepsilon_1}^2 + \tilde{\sigma}_{\eta_{i,1}}^2/\sigma_{\eta_{i,1}}^2 - 2)\beta_{i,1} - (\tilde{\sigma}_{\varepsilon_2}^2/\sigma_{\varepsilon_2}^2 + \tilde{\sigma}_{\eta_{i,2}}^2/\sigma_{\eta_{i,2}}^2 - 2)\beta_{i,2}\}/(\hat{\theta}_{i,1} + \hat{\theta}_{i,2})^{1/2}. Note that, for i ∈ ℋ\A_τ, β_{i,d} = o{(log p)^{−1}}. Thus we have \max_{i \in \mathcal{H}\setminus A_\tau} |W_i - V_i| = o_P\{(\log p)^{-1/2}\}. For i ∈ A_τ,

b_i \le \left|\frac{(\tilde{\sigma}_{\varepsilon_1}^2 - \sigma_{\varepsilon_1}^2)\beta_{i,1}/\sigma_{\varepsilon_1}^2 - (\tilde{\sigma}_{\varepsilon_2}^2 - \sigma_{\varepsilon_2}^2)\beta_{i,2}/\sigma_{\varepsilon_2}^2}{\{\mathrm{Var}(\varepsilon_{k,1}^2)\beta_{i,1}^2/(\sigma_{\varepsilon_1}^4 n_1) + \mathrm{Var}(\varepsilon_{k,2}^2)\beta_{i,2}^2/(\sigma_{\varepsilon_2}^4 n_2)\}^{1/2}}\right| + \left|\frac{(\tilde{\sigma}_{\eta_{i,1}}^2 - \sigma_{\eta_{i,1}}^2)\beta_{i,1}/\sigma_{\eta_{i,1}}^2 - (\tilde{\sigma}_{\eta_{i,2}}^2 - \sigma_{\eta_{i,2}}^2)\beta_{i,2}/\sigma_{\eta_{i,2}}^2}{\{\mathrm{Var}(\eta_{k,i,1}^2)\beta_{i,1}^2/(\sigma_{\eta_{i,1}}^4 n_1) + \mathrm{Var}(\eta_{k,i,2}^2)\beta_{i,2}^2/(\sigma_{\eta_{i,2}}^4 n_2)\}^{1/2}}\right| + o_P\{(\log p)^{1/2}\}.

Because the indices i of the random variables appear only in the second term above, by Lemma 3 and the condition that |A_τ| = O(p^r) with r < 1/4, we have

P\left(\max_{i \in A_\tau} W_i^2 \ge 2\log p - \log\log p + t\right) \le |A_\tau|\left\{P(V_i^2 \ge 2r\log p) + P(\tilde{b}_i^2 \ge 2r\log p)\right\} + o(1) = o(1),

where \tilde{b}_i = \left|\frac{(\tilde{\sigma}_{\eta_{i,1}}^2 - \sigma_{\eta_{i,1}}^2)\beta_{i,1}/\sigma_{\eta_{i,1}}^2 - (\tilde{\sigma}_{\eta_{i,2}}^2 - \sigma_{\eta_{i,2}}^2)\beta_{i,2}/\sigma_{\eta_{i,2}}^2}{\{\mathrm{Var}(\eta_{k,i,1}^2)\beta_{i,1}^2/(\sigma_{\eta_{i,1}}^4 n_1) + \mathrm{Var}(\eta_{k,i,2}^2)\beta_{i,2}^2/(\sigma_{\eta_{i,2}}^4 n_2)\}^{1/2}}\right|. Thus, it suffices to show that

P\left(\max_{i \in \mathcal{H}\setminus A_\tau} V_i^2 - 2\log p + \log\log p \le t\right) \to \exp(-\pi^{-1/2}\exp(-t/2)).

Let q = |ℋ\A_τ| and suppose n_2/n_1 ≤ K_1 with K_1 ≥ 1. Define Z_{k,i} = (n_2/n_1)\{\varepsilon_{k,1}\eta_{k,i,1} - E(\varepsilon_{k,1}\eta_{k,i,1})\}/\sigma_{\eta_{i,1}}^2 for 1 ≤ k ≤ n_1 and Z_{k,i} = -\{\varepsilon_{k-n_1,2}\eta_{k-n_1,i,2} - E(\varepsilon_{k-n_1,2}\eta_{k-n_1,i,2})\}/\sigma_{\eta_{i,2}}^2 for n_1 + 1 ≤ k ≤ n_1 + n_2. Thus we have

V_i = \frac{\sum_{k=1}^{n_1+n_2} Z_{k,i}}{(n_2^2\theta_{i,1}/n_1 + n_2\theta_{i,2})^{1/2}}.

Without loss of generality, we assume \sigma_{\varepsilon_d}^2 = \sigma_{\eta_{i,d}}^2 = 1. Define

\hat{V}_i = \frac{\sum_{k=1}^{n_1+n_2} \hat{Z}_{k,i}}{(n_2^2\theta_{i,1}/n_1 + n_2\theta_{i,2})^{1/2}},

where \hat{Z}_{k,i} = Z_{k,i}I(|Z_{k,i}| \le \tau_n) - E\{Z_{k,i}I(|Z_{k,i}| \le \tau_n)\} and \tau_n = (4K_1/K)\log(p+n). Note that \max_{i \in \mathcal{H}\setminus A_\tau} V_i^2 = \max_{1 \le i \le q} V_i^2, and that

\max_{1 \le i \le q} n^{-1/2}\sum_{k=1}^{n_1+n_2} E\left[|Z_{k,i}|I\{|Z_{k,i}| \ge (4K_1/K)\log(p+n)\}\right] \le Cn^{1/2}\max_{1 \le k \le n_1+n_2}\max_{1 \le i \le q} E\left[|Z_{k,i}|I\{|Z_{k,i}| \ge (4K_1/K)\log(p+n)\}\right] \le Cn^{1/2}(p+n)^{-2}\max_{1 \le k \le n_1+n_2}\max_{1 \le i \le q} E\left[|Z_{k,i}|\exp\{(K/2)|Z_{k,i}|\}\right] \le Cn^{1/2}(p+n)^{-2}.

Hence, P\{\max_{1 \le i \le q} |V_i - \hat{V}_i| \ge (\log p)^{-1}\} \le P(\max_{1 \le i \le q}\max_{1 \le k \le n_1+n_2} |Z_{k,i}| \ge \tau_n) = O(p^{-1}). By the fact that |\max_{1 \le i \le q} V_i^2 - \max_{1 \le i \le q} \hat{V}_i^2| \le 2\max_{1 \le i \le q}|\hat{V}_i|\max_{1 \le i \le q}|V_i - \hat{V}_i| + \max_{1 \le i \le q}|V_i - \hat{V}_i|^2, it suffices to prove that, for any t ∈ ℝ, as n, p → ∞,

P\left(\max_{1 \le i \le q} \hat{V}_i^2 - 2\log p + \log\log p \le t\right) \to \exp(-\pi^{-1/2}\exp(-t/2)). \qquad (25)

By Lemma 1, for any integer l with 0 < l < q/2,

\sum_{d=1}^{2l}(-1)^{d-1}\sum_{1 \le i_1 < \cdots < i_d \le q} P\left(\bigcap_{j=1}^{d} F_{i_j}\right) \le P\left(\max_{1 \le i \le q} \hat{V}_i^2 \ge y_p\right) \le \sum_{d=1}^{2l-1}(-1)^{d-1}\sum_{1 \le i_1 < \cdots < i_d \le q} P\left(\bigcap_{j=1}^{d} F_{i_j}\right), \qquad (26)

where y_p = 2\log p - \log\log p + t and F_{i_j} = \{\hat{V}_{i_j}^2 \ge y_p\}. Let \tilde{Z}_{k,i} = \hat{Z}_{k,i}/(n_2\theta_{i,1}/n_1 + \theta_{i,2})^{1/2} for i = 1, …, q, and let W_k = (\tilde{Z}_{k,i_1}, \ldots, \tilde{Z}_{k,i_d}) for 1 ≤ k ≤ n_1 + n_2. Define |a|_{\min} = \min_{1 \le i \le d} |a_i| for any vector a ∈ ℝ^d. Then we have

P\left(\bigcap_{j=1}^{d} F_{i_j}\right) = P\left(\left|n_2^{-1/2}\sum_{k=1}^{n_1+n_2} W_k\right|_{\min} \ge y_p^{1/2}\right).

Then it follows from Theorem 1 in Zaïtsev (1987) that

P\left(\left|n_2^{-1/2}\sum_{k=1}^{n_1+n_2} W_k\right|_{\min} \ge y_p^{1/2}\right) \le P\left\{|N_d|_{\min} \ge y_p^{1/2} - \varepsilon_n(\log p)^{-1/2}\right\} + c_1 d^{5/2}\exp\left\{-\frac{n^{1/2}\varepsilon_n}{c_2 d^3 \tau_n(\log p)^{1/2}}\right\}, \qquad (27)

where c_1 > 0 and c_2 > 0 are constants, ε_n → 0 at a rate to be specified later, and N_d = (N_{m_1}, …, N_{m_d}) is a normal random vector with E(N_d) = 0 and Cov(N_d) = (n_1/n_2)\,\mathrm{Cov}(W_1) + \mathrm{Cov}(W_{n_1+1}). Here d is a fixed integer that does not depend on n, p. Because log p = o(n^{1/5}), we can let ε_n → 0 sufficiently slowly that, for any large M > 0,

c_1 d^{5/2}\exp\left\{-\frac{n^{1/2}\varepsilon_n}{c_2 d^3 \tau_n(\log p)^{1/2}}\right\} = O(p^{-M}). \qquad (28)

Combining (26), (27), and (28) we have

P\left(\max_{1 \le i \le q} \hat{V}_i^2 \ge y_p\right) \le \sum_{d=1}^{2l-1}(-1)^{d-1}\sum_{1 \le i_1 < \cdots < i_d \le q} P\left\{|N_d|_{\min} \ge y_p^{1/2} - \varepsilon_n(\log p)^{-1/2}\right\} + o(1). \qquad (29)

Similarly, using Theorem 1 in Zaïtsev (1987) again, we can get

P\left(\max_{1 \le i \le q} \hat{V}_i^2 \ge y_p\right) \ge \sum_{d=1}^{2l}(-1)^{d-1}\sum_{1 \le i_1 < \cdots < i_d \le q} P\left\{|N_d|_{\min} \ge y_p^{1/2} + \varepsilon_n(\log p)^{-1/2}\right\} - o(1). \qquad (30)

The following lemma is shown in the supplementary material of Cai et al. (2013), with q ≤ p and y_p = 2 log p − log log p + t.

Lemma 4

For any fixed integer d ≥ 1 and real number t ∈ ℝ,

\sum_{1 \le i_1 < \cdots < i_d \le q} P\left\{|N_d|_{\min} \ge y_p^{1/2} \pm \varepsilon_n(\log p)^{-1/2}\right\} = \frac{1}{d!}\left\{\pi^{-1/2}\exp(-t/2)\right\}^d \{1 + o(1)\}. \qquad (31)

It then follows from Lemma 4, (29), and (30) that

\limsup_{n,p\to\infty} P\left(\max_{1 \le i \le q} \hat{V}_i^2 \ge y_p\right) \le \sum_{d=1}^{2l-1}(-1)^{d-1}\frac{1}{d!}\left\{\pi^{-1/2}\exp(-t/2)\right\}^d,
\liminf_{n,p\to\infty} P\left(\max_{1 \le i \le q} \hat{V}_i^2 \ge y_p\right) \ge \sum_{d=1}^{2l}(-1)^{d-1}\frac{1}{d!}\left\{\pi^{-1/2}\exp(-t/2)\right\}^d,

for any positive integer l. By letting l → ∞, we obtain (25) and Theorem 1 is proved.
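The limiting value in the last step can be checked directly: with x = π^{−1/2}exp(−t/2), the two truncated alternating sums are partial sums of the exponential series,

```latex
\sum_{d=1}^{\infty} (-1)^{d-1} \frac{x^{d}}{d!} \;=\; 1 - e^{-x},
\qquad x = \pi^{-1/2}\exp(-t/2),
```

so both bounds converge to 1 − exp{−π^{−1/2}exp(−t/2)} as l → ∞, which is exactly the tail form of (25).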

8.3 Proof of Theorem 2

Let M_{n1} = \max_{1 \le i \le p}\{T_{i,1} - T_{i,2} - (\beta_{i,1} - \beta_{i,2})\}^2/(\hat{\theta}_{i,1} + \hat{\theta}_{i,2}). It follows from the proof of Theorem 1 that P(M_{n1} \le 2\log p - 2^{-1}\log\log p) \to 1 as n, p → ∞. By (23), (24), and the inequalities

\max_{1 \le i \le p} \frac{(\beta_{i,1} - \beta_{i,2})^2}{\hat{\theta}_{i,1} + \hat{\theta}_{i,2}} \le 2M_{n1} + 2M_n, \qquad \max_{1 \le i \le p} \frac{|\beta_{i,1} - \beta_{i,2}|}{(\theta_{i,1} + \theta_{i,2})^{1/2}} \ge 2\sqrt{2}(\log p)^{1/2},

we have P(Mnqα + 2 log p − log log p) → 1 as n, p → ∞.

8.4 Proof of Theorem 3

To prove the lower bound, we first construct a worst case scenario to test between β1 and β2. We apply the arguments in Baraud (2002) to prove the result.

Without loss of generality, we assume \sigma_{\varepsilon_d}^2 = 1, σ_{i,i,d} = 1, and σ_{i,j,d} = 0 for i ≠ j, d = 1, 2, and n_1 = n_2 = n. Let m̂ be a random index uniformly drawn from ℋ = {1, …, p}. We construct a class of β_1 vectors, 𝒩 = {β(m̂), m̂ ∈ ℋ}, such that β_{m̂,1} = ρ and β_{i,1} = 0 for i ≠ m̂, with ρ = c(log p/n)^{1/2}, where c < 1/2 is a constant. Let β_2 = 0 and let β_1 be uniformly distributed on 𝒩. Let μ_ρ be the induced distribution of β_1 − β_2. Note that μ_ρ is a probability measure on \{\delta \in \mathcal{S}_1 : \|\delta\|_2^2 = \rho^2\}, where 𝒮_1 is the class of p-dimensional vectors with one nonzero entry. Then the likelihood ratio between the samples {Y_{k,1}, X_{k,·,1}} and {Y_{k,2}, X_{k,·,2}} can be calculated as

L_{\mu_\rho} = E_{\hat{m}}\left[\prod_{k=1}^{n} |\Sigma(\hat{m})|^{-1/2}\exp\left\{-\frac{1}{2}Z_k^{\top}\left(\Omega(\hat{m}) - I\right)Z_k\right\}\right],

where Σ(m̂) = Ω(m̂)^{−1} is the covariance matrix of {Y_{k,1}, X_{k,·,1}}, and {Z_1, …, Z_n} are i.i.d. samples generated from N(0, I). Because Var(Y_{k,1}) = \sigma_{\hat{m},\hat{m},1}\beta_{\hat{m},1}^2 + 1, Var(Y_{k,2}) = 1, and Cov(Y_{k,d}, X_{k,i,d}) = β_{i,d}σ_{i,i,d}, it can be easily calculated that |Σ(m̂)| = 1 and Ω(m̂) = (ω_{i,j}(m̂)) with ω_{1,1}(m̂) = 1, ω_{1,m̂+1}(m̂) = ω_{m̂+1,1}(m̂) = −ρ, ω_{m̂+1,m̂+1}(m̂) = 1 + ρ², and ω_{i,j}(m̂) = 0 otherwise. Hence

E(L_{\mu_\rho}^2) = p^{-2}\sum_{m,m'} E\left[\prod_{k=1}^{n}\exp\left\{-\frac{1}{2}Z_k^{\top}\left(\Omega(m) + \Omega(m') - 2I\right)Z_k\right\}\right].

Write Ω(m) + Ω(m′) − 2I = (a_{i,j}). It is easy to see that, when m ≠ m′, a_{i,i} = ρ² and a_{1,i} = −ρ for i = m + 1 and m′ + 1, a_{j,i} = a_{i,j}, and a_{i,j} = 0 otherwise; when m = m′, a_{i,i} = 2ρ² and a_{1,i} = −2ρ for i = m + 1, a_{j,i} = a_{i,j}, and a_{i,j} = 0 otherwise. Thus we have

E(L_{\mu_\rho}^2) = \left[E\left(\exp\left\{\rho(x_1x_2 + x_2x_3) - \rho^2(x_2^2 + x_3^2)/2\right\}\right)\right]^n + p^{-1}\left[E\left(\exp\left\{2\rho x_1x_2 - \rho^2 x_2^2\right\}\right)\right]^n,

where x_1, x_2, x_3 are independent standard normal random variables. Because E(\exp\{\rho(x_1x_2 + x_2x_3)\}) = 1 + \rho^2, E(\exp\{-\rho^2x_2^2/2\}) = (1 + \rho^2)^{-1/2}, and E(\exp\{2\rho x_1x_2\}) = 1 + 2\rho^2, we have

E(L_{\mu_\rho}^2) = 1 + p^{2c^2 - 1}\{1 + o(1)\} = 1 + o(1).

Theorem 3 then follows from the arguments in Baraud (2002).

8.5 Proof of Theorem 4

We first show that t̂, as defined in Section 4.1, is attained in the interval [0, (2 log p)^{1/2}]. We then show that A_τ is negligible, so that we can focus on the set ℋ\A_τ. Finally, we establish the FDP result by dividing the null set into small subsets and controlling the variance of R_0(t) on each subset; the FDR result then follows as well.

Under the condition of Theorem 4, we have

\sum_{1 \le i \le p} I\{|W_i| \ge (2\log p)^{1/2}\} \ge \{1/(\pi^{1/2}\alpha) + \delta\}(\log p)^{1/2},

with probability going to one. Hence, with probability tending to one, we have

\frac{2p\{1 - \Phi((2\log p)^{1/2})\}}{\sum_{1 \le i \le p} I\{|W_i| \ge (2\log p)^{1/2}\}} \le 2p\{1 - \Phi((2\log p)^{1/2})\}\left\{1/(\pi^{1/2}\alpha) + \delta\right\}^{-1}(\log p)^{-1/2} \le \alpha.

Let t_p = (2\log p - 2\log\log p)^{1/2}. Because 1 - \Phi(t_p) \ge \{1 + o(1)\}\{(2\pi)^{1/2}t_p\}^{-1}\exp(-t_p^2/2), we have P(t̂ ≤ t_p) → 1 according to the definition of t̂ in Section 4.1. For 0 ≤ t̂ ≤ t_p,

\frac{2p\{1 - \Phi(\hat{t})\}}{\max\left\{\sum_{1 \le i \le p} I\{|W_i| \ge \hat{t}\}, 1\right\}} = \alpha.

Thus, to prove Theorem 4, it suffices to prove

\frac{\left|\sum_{i \in \mathcal{H}_0} I\{|W_i| \ge t\} - p_0 G(t)\right|}{p G(t)} \to 0,

in probability, uniformly for 0 ≤ t ≤ t_p, where G(t) = 2{1 − Φ(t)} and p_0 = |ℋ_0|. We claim that it suffices to show

\frac{\left|\sum_{i \in \mathcal{H}_0 \setminus A_\tau} I\{|V_i| \ge t\} - p_0 G(t)\right|}{p G(t)} \to 0, \qquad (32)

in probability. We now consider two cases.

  1. If t = {2 log p + o(log p)}^{1/2}, the proof of Theorem 1 yields that P(\max_{i \in A_\tau} W_i^2 \ge t^2) = o(1). Thus, it suffices to prove
    \frac{\left|\sum_{i \in \mathcal{H}_0 \setminus A_\tau} I\{|W_i| \ge t\} - p_0 G(t)\right|}{p G(t)} \to 0,
    in probability. We showed in the proof of Theorem 1 that \max_{i \in \mathcal{H}_0 \setminus A_\tau} |W_i - V_i| = o_P\{(\log p)^{-1/2}\}. Thus it suffices to show (32).
  2. If t ≤ (C log p)^{1/2} for some C < 2, we have
    \frac{\sum_{i \in A_\tau \cap \mathcal{H}_0} I\{|W_i| \ge t\}}{p G(t)} \le |A_\tau \cap \mathcal{H}_0| \cdot O(p^{C/2 - 1}) \to 0
    in probability. Thus, it is again enough to show (32).

Let 0 ≤ t_0 < t_1 < ⋯ < t_b = t_p be such that t_ι − t_{ι−1} = υ_p for 1 ≤ ι ≤ b − 1 and t_b − t_{b−1} ≤ υ_p, where υ_p = 1/\{\log p\,(\log_4 p)\}. Thus we have b ∼ t_p/υ_p. For any t such that t_{ι−1} ≤ t ≤ t_ι, by the fact that G(t + o((log p)^{−1/2}))/G(t) = 1 + o(1) uniformly in 0 ≤ t ≤ c(log p)^{1/2} for any constant c, we have

\frac{\sum_{i \in \mathcal{H}_0 \setminus A_\tau} I(|V_i| \ge t_\iota)}{p_0 G(t_\iota)} \cdot \frac{G(t_\iota)}{G(t_{\iota-1})} \le \frac{\sum_{i \in \mathcal{H}_0 \setminus A_\tau} I(|V_i| \ge t)}{p_0 G(t)} \le \frac{\sum_{i \in \mathcal{H}_0 \setminus A_\tau} I(|V_i| \ge t_{\iota-1})}{p_0 G(t_{\iota-1})} \cdot \frac{G(t_{\iota-1})}{G(t_\iota)}.

Thus it suffices to prove

\max_{0 \le \iota \le b} \frac{\left|\sum_{i \in \mathcal{H}_0 \setminus A_\tau}\{I(|V_i| \ge t_\iota) - G(t_\iota)\}\right|}{p_0 G(t_\iota)} \to 0,

in probability. Define ℋ̃0 = ℋ0 \ Aτ. Note that

P\left[\max_{0 \le \iota \le b} \frac{\left|\sum_{i \in \tilde{\mathcal{H}}_0}\{I(|V_i| \ge t_\iota) - G(t_\iota)\}\right|}{p_0 G(t_\iota)} \ge \varepsilon\right] \le \sum_{\iota=0}^{b} P\left[\frac{\left|\sum_{i \in \tilde{\mathcal{H}}_0}\{I(|V_i| \ge t_\iota) - G(t_\iota)\}\right|}{p_0 G(t_\iota)} \ge \varepsilon\right] \le \frac{1}{\upsilon_p}\int_0^{t_p} P\left\{\left|\frac{\sum_{i \in \tilde{\mathcal{H}}_0} I(|V_i| \ge t)}{p_0 G(t)} - 1\right| \ge \varepsilon\right\} dt + \sum_{\iota=b-1}^{b} P\left[\frac{\left|\sum_{i \in \tilde{\mathcal{H}}_0}\{I(|V_i| \ge t_\iota) - G(t_\iota)\}\right|}{p_0 G(t_\iota)} \ge \varepsilon\right].

Thus, it suffices to show, for any ε > 0,

\int_0^{t_p} P\left\{\frac{\left|\sum_{i \in \tilde{\mathcal{H}}_0}\{I(|V_i| \ge t) - P(|V_i| \ge t)\}\right|}{p_0 G(t)} \ge \varepsilon\right\} dt = o(\upsilon_p). \qquad (33)

Note that

E\left|\frac{\sum_{i \in \tilde{\mathcal{H}}_0}\{I(|V_i| \ge t) - P(|V_i| \ge t)\}}{p_0 G(t)}\right|^2 = \frac{\sum_{i,j \in \tilde{\mathcal{H}}_0}\{P(|V_i| \ge t, |V_j| \ge t) - P(|V_i| \ge t)P(|V_j| \ge t)\}}{p_0^2 G^2(t)}.

We divide the index pairs i, j ∈ ℋ̃_0 into three subsets: ℋ̃_{01} = {i, j ∈ ℋ̃_0 : i = j}, ℋ̃_{02} = {i, j ∈ ℋ̃_0 : i ∈ Γ_j(γ) or j ∈ Γ_i(γ), i ≠ j}, and ℋ̃_{03} = ℋ̃_0 \ (ℋ̃_{01} ∪ ℋ̃_{02}). Then we have

\frac{\sum_{i,j \in \tilde{\mathcal{H}}_{01}}\{P(|V_i| \ge t, |V_j| \ge t) - P(|V_i| \ge t)P(|V_j| \ge t)\}}{p_0^2 G^2(t)} \le \frac{C}{p_0 G(t)}. \qquad (34)

We now verify equation (12). Note that \mathrm{Cov}(\varepsilon_{k,d}\eta_{k,i,d}, \varepsilon_{k,d}\eta_{k,j,d}) = E(\varepsilon_{k,d}^2\eta_{k,i,d}\eta_{k,j,d}) - E(\varepsilon_{k,d}\eta_{k,i,d})E(\varepsilon_{k,d}\eta_{k,j,d}). Because \mathrm{Cov}(\varepsilon_{k,d}, \eta_{k,i,d}) = \sigma_{\eta_{i,d}}^2\beta_{i,d}, we have E(\varepsilon_{k,d}\eta_{k,i,d})E(\varepsilon_{k,d}\eta_{k,j,d}) = \sigma_{\eta_{i,d}}^2\sigma_{\eta_{j,d}}^2\beta_{i,d}\beta_{j,d}. Note that

E(\varepsilon_{k,d}^2\eta_{k,i,d}\eta_{k,j,d}) = E\{\varepsilon_{k,d}^2(\eta_{k,i,d} + \varepsilon_{k,d}\gamma_{i,1,d})(\eta_{k,j,d} + \varepsilon_{k,d}\gamma_{j,1,d})\} - E\{\varepsilon_{k,d}^2(\eta_{k,i,d} + \varepsilon_{k,d}\gamma_{i,1,d})\varepsilon_{k,d}\gamma_{j,1,d}\} - E(\varepsilon_{k,d}^3\gamma_{i,1,d}\eta_{k,j,d}).

By definition, \varepsilon_{k,d} is independent of \eta_{k,i,d} + \varepsilon_{k,d}\gamma_{i,1,d}. Thus, we have

E(\varepsilon_{k,d}^2\eta_{k,i,d}\eta_{k,j,d}) = \sigma_{\varepsilon_d}^2 E\{(\eta_{k,i,d} + \varepsilon_{k,d}\gamma_{i,1,d})(\eta_{k,j,d} + \varepsilon_{k,d}\gamma_{j,1,d})\} - E(\varepsilon_{k,d}^3\gamma_{i,1,d}\eta_{k,j,d}).

Note that

E(\varepsilon_{k,d}^3\gamma_{i,1,d}\eta_{k,j,d}) = E\{\varepsilon_{k,d}^3\gamma_{i,1,d}(\eta_{k,j,d} + \varepsilon_{k,d}\gamma_{j,1,d})\} - E(\varepsilon_{k,d}^4\gamma_{i,1,d}\gamma_{j,1,d}) = -3\gamma_{i,1,d}\gamma_{j,1,d}\sigma_{\varepsilon_d}^4,

and that

E\{(\eta_{k,i,d} + \varepsilon_{k,d}\gamma_{i,1,d})(\eta_{k,j,d} + \varepsilon_{k,d}\gamma_{j,1,d})\} = \mathrm{Cov}(\eta_{k,i,d}, \eta_{k,j,d}) + \gamma_{i,1,d}\mathrm{Cov}(\varepsilon_{k,d}, \eta_{k,j,d}) + \gamma_{j,1,d}\mathrm{Cov}(\varepsilon_{k,d}, \eta_{k,i,d}) + \gamma_{i,1,d}\gamma_{j,1,d}\sigma_{\varepsilon_d}^2.

Combining these, we have \mathrm{Cov}(\varepsilon_{k,d}\eta_{k,i,d}, \varepsilon_{k,d}\eta_{k,j,d}) = (\omega_{i,j,d}\sigma_{\varepsilon_d}^2 + 2\beta_{i,d}\beta_{j,d})\sigma_{\eta_{i,d}}^2\sigma_{\eta_{j,d}}^2. Thus

\xi_{i,j,d} = \mathrm{Corr}(\varepsilon_{k,d}\eta_{k,i,d}, \varepsilon_{k,d}\eta_{k,j,d}) = \frac{\omega_{i,j,d}\sigma_{\varepsilon_d}^2 + 2\beta_{i,d}\beta_{j,d}}{\{(\omega_{i,i,d}\sigma_{\varepsilon_d}^2 + 2\beta_{i,d}^2)(\omega_{j,j,d}\sigma_{\varepsilon_d}^2 + 2\beta_{j,d}^2)\}^{1/2}}.

Note that, for i ∈ ℋ̃_0, we have β_{i,d} = O((log p)^{−2−τ}), and so |Corr(V_i, V_j)| ≤ ξ < 1 for i, j ∈ ℋ̃_{02}, where ξ = max{ξ_1, ξ_2} + ε with ξ_d defined in (C2) and ε < 1 − max{ξ_1, ξ_2}. Hence

\frac{\sum_{i,j \in \tilde{\mathcal{H}}_{02}}\{P(|V_i| \ge t, |V_j| \ge t) - P(|V_i| \ge t)P(|V_j| \ge t)\}}{p_0^2 G^2(t)} \le \frac{C p^{1+\nu}\, t^{-2}\exp\{-t^2/(1+\xi)\}}{p^2 G^2(t)} \le C p^{\nu-1}\{G(t)\}^{-2\xi/(1+\xi)}. \qquad (35)

It remains to consider the subset ℋ̃_{03}, in which V_i and V_j are only weakly correlated. It is easy to check that \max_{i,j \in \tilde{\mathcal{H}}_{03}} P(|V_i| \ge t, |V_j| \ge t) = \{1 + O((\log p)^{-1-\gamma})\}G^2(t). Hence,

\frac{\sum_{i,j \in \tilde{\mathcal{H}}_{03}}\{P(|V_i| \ge t, |V_j| \ge t) - P(|V_i| \ge t)P(|V_j| \ge t)\}}{p_0^2 G^2(t)} = O\{(\log p)^{-1-\gamma}\}. \qquad (36)

Equation (33) and the FDP result then follow by combining (34), (35), and (36), and the FDR result is also proved.

Acknowledgments

The research of Yin Xia was supported in part by “The Recruitment Program of Global Experts” Youth Project from China, the startup fund from Fudan University and NSF Grant DMS-1612906.

The research of Tianxi Cai was supported in part by NIH Grants R01 GM079330, P50 MH106933, and U54 HG007963.

The research of Tony Cai was supported in part by NSF Grants DMS-1208982 and DMS-1403708, and NIH Grant R01 CA127334.

References

  1. Anderson TW. An Introduction to Multivariate Statistical Analysis. 3rd ed. Wiley-Interscience; New York: 2003. [Google Scholar]
  2. Baraud Y. Non-asymptotic minimax rates of testing in signal detection. Bernoulli. 2002;8(5):577–606. [Google Scholar]
  3. Benjamini Y, Yekutieli D. The control of the false discovery rate in multiple testing under dependency. Annals of Statistics. 2001;29(4):1165–1188. [Google Scholar]
  4. Cai T, Liu W, Xia Y. Two-sample covariance matrix testing and support recovery in high-dimensional and sparse settings. J. Am. Statist. Assoc. 2013;108(501):265–277. [Google Scholar]
  5. Cai TT, Xia Y. High-dimensional sparse MANOVA. Journal of Multivariate Analysis. 2014;131:174–196. [Google Scholar]
  6. D’Agostino R, Sr, Vasan R, Pencina M, Wolf P, Cobain M, Massaro J, Kannel W. General cardiovascular risk profile for use in primary care: the Framingham Heart Study. Circulation. 2008;117(6):743. doi: 10.1161/CIRCULATIONAHA.107.699579. [DOI] [PubMed] [Google Scholar]
  7. Hibi K, Ishigami T, Kimura K, Nakao M, Iwamoto T, Tamura K, Nemoto T, Shimizu T, Mochida Y, Ochiai H, et al. Angiotensin-converting enzyme gene polymorphism adds risk for the severity of coronary atherosclerosis in smokers. Hypertension. 1997;30(3):574–579. doi: 10.1161/01.hyp.30.3.574. [DOI] [PubMed] [Google Scholar]
  8. Humphries S, Yiannakouris N, Talmud P. Cardiovascular disease risk prediction using genetic information (gene scores): is it really informative? Current Opinion in Lipidology. 2008;19(2):128. doi: 10.1097/MOL.0b013e3282f5283e. [DOI] [PubMed] [Google Scholar]
  9. Hunter DJ. Gene–environment interactions in human diseases. Nature Reviews Genetics. 2005;6(4):287–298. doi: 10.1038/nrg1578. [DOI] [PubMed] [Google Scholar]
  10. Ikeda S, Sasazuki S, Natsukawa S, Shaura K, Koizumi Y, Kasuga Y, Ohnami S, Sakamoto H, Yoshida T, Iwasaki M, et al. Screening of 214 single nucleotide polymorphisms in 44 candidate cancer susceptibility genes: a case–control study on gastric and colorectal cancers in the japanese population. The American journal of gastroenterology. 2008;103(6):1476–1487. doi: 10.1111/j.1572-0241.2008.01810.x. [DOI] [PubMed] [Google Scholar]
  11. Javanmard A, Montanari A. Hypothesis testing in high-dimensional regression under the Gaussian random design model: Asymptotic theory. 2013. [Google Scholar]
  12. Javanmard A, Montanari A. Confidence intervals and hypothesis testing for high-dimensional regression. The Journal of Machine Learning Research. 2014;15(1):2869–2909. [Google Scholar]
  13. Kannel W, Feinleib M, McNamara P, Garrison R, Castelli W. An investigation of coronary heart disease in families The Framingham Offspring Study. American Journal of Epidemiology. 1979;110(3):281–290. doi: 10.1093/oxfordjournals.aje.a112813. [DOI] [PubMed] [Google Scholar]
  14. Liu L, Zhong R, Wei S, Xiang H, Chen J, Xie D, Yin J, Zou L, Sun J, Chen W, et al. The leptin gene family and colorectal cancer: interaction with smoking behavior and family history of cancer. PloS one. 2013;8(4):e60777. doi: 10.1371/journal.pone.0060777. [DOI] [PMC free article] [PubMed] [Google Scholar]
  15. Liu W, Luo S. Hypothesis testing for high-dimensional regression models. Technical report 2014 [Google Scholar]
  16. Lloyd-Jones D, Wilson P, Larson M, Beiser A, Leip E, D'Agostino R, Levy D. Framingham risk score and prediction of lifetime risk for coronary heart disease. The American Journal of Cardiology. 2004;94(1):20–24. doi: 10.1016/j.amjcard.2004.03.023. [DOI] [PubMed] [Google Scholar]
  17. Matsouaka RA, Li J, Cai T. Evaluating marker-guided treatment selection strategies. Biometrics. 2014;70(3):489–499. doi: 10.1111/biom.12179. [DOI] [PMC free article] [PubMed] [Google Scholar]
  18. McCarthy MI, Abecasis GR, Cardon LR, Goldstein DB, Little J, Ioannidis JP, Hirschhorn JN. Genome-wide association studies for complex traits: consensus, uncertainty and challenges. Nature Reviews Genetics. 2008;9(5):356–369. doi: 10.1038/nrg2344. [DOI] [PubMed] [Google Scholar]
  19. Paynter N, Chasman D, Buring J, Shiffman D, Cook N, Ridker P. Cardiovascular disease risk prediction with and without knowledge of genetic variation at chromosome 9p21.3. Annals of Internal Medicine. 2009;150(2):65. doi: 10.7326/0003-4819-150-2-200901200-00003. [DOI] [PMC free article] [PubMed] [Google Scholar]
  20. Pepe MS. The statistical evaluation of medical tests for classification and prediction. Oxford University Press; 2003. [Google Scholar]
  21. Ridker P, Buring J, Rifai N, Cook N. Development and validation of improved algorithms for the assessment of global cardiovascular risk in women: the Reynolds Risk Score. Journal of American Medical Association. 2007;297(6):611. doi: 10.1001/jama.297.6.611. [DOI] [PubMed] [Google Scholar]
  22. Ross R. Atherosclerosis is an inflammatory disease. American Heart Journal. 1999;138(5):S419–S420. doi: 10.1016/s0002-8703(99)70266-8. [DOI] [PubMed] [Google Scholar]
  23. Sayed-Tabatabaei F, Schut A, Hofman A, Bertoli-Avella A, Vergeer J, Witteman J, van Duijn C. A study of gene–environment interaction on the gene for angiotensin converting enzyme: a combined functional and population based approach. Journal of medical genetics. 2004;41(2):99–103. doi: 10.1136/jmg.2003.013441. [DOI] [PMC free article] [PubMed] [Google Scholar]
  24. Schut AF, Sayed-Tabatabaei FA, Witteman JC, Avella AM, Vergeer JM, Pols HA, Hofman A, Deinum J, van Duijn CM. Smoking-dependent effects of the angiotensin-converting enzyme gene insertion/deletion polymorphism on blood pressure. Journal of hypertension. 2004;22(2):313–319. doi: 10.1097/00004872-200402000-00015. [DOI] [PubMed] [Google Scholar]
  25. Stephens JW, Bain SC, Humphries SE. Gene–environment interaction and oxidative stress in cardiovascular disease. Atherosclerosis. 2008;200(2):229–238. doi: 10.1016/j.atherosclerosis.2008.04.003. [DOI] [PubMed] [Google Scholar]
  26. Van de Geer S, Bühlmann P, Ritov Y, Dezeure R, et al. On asymptotically optimal confidence regions and tests for high-dimensional models. The Annals of Statistics. 2014;42(3):1166–1202. [Google Scholar]
  27. Venter JC, Adams MD, Myers EW, Li PW, Mural RJ, Sutton GG, Smith HO, Yandell M, Evans CA, Holt RA, et al. The sequence of the human genome. Science. 2001;291(5507):1304–1351. doi: 10.1126/science.1058040. [DOI] [PubMed] [Google Scholar]
  28. Wilson P, D’Agostino R, Levy D, Belanger A, Silbershatz H, Kannel W. Prediction of coronary heart disease using risk factor categories. Circulation. 1998;97(18):1837. doi: 10.1161/01.cir.97.18.1837. [DOI] [PubMed] [Google Scholar]
  29. Xia Y, Cai T, Cai TT. Testing differential network with applications to detecting gene by gene interactions. Biometrika. 2015;102:247–266. doi: 10.1093/biomet/asu074. [DOI] [PMC free article] [PubMed] [Google Scholar]
  30. Zaïtsev AY. On the Gaussian approximation of convolutions under multidimensional analogues of S. N. Bernstein's inequality conditions. Probab. Theory Rel. 1987;74(4):535–566. [Google Scholar]
  31. Zhang C-H, Zhang SS. Confidence intervals for low dimensional parameters in high dimensional linear models. Journal of the Royal Statistical Society: Series B. 2014;76(1):217–242. [Google Scholar]
