Abstract
Motivated by applications in genomics, we consider in this paper global and multiple testing for the comparison of two high-dimensional linear regression models. A procedure for testing the equality of the two regression vectors globally is proposed and shown to be particularly powerful against sparse alternatives. We then introduce a multiple testing procedure for identifying unequal coordinates while controlling the false discovery rate and false discovery proportion. Theoretical justifications are provided to guarantee the validity of the proposed tests, and optimality results are established under sparsity assumptions on the regression coefficients. The proposed testing procedures are easy to implement. Numerical properties of the procedures are investigated through simulation and data analysis. The results show that the proposed tests maintain the desired error rates under the null and have good power under the alternative at moderate sample sizes. The procedures are applied to the Framingham Offspring Study to investigate the interactions between smoking and cardiovascular-related genetic mutations important for an inflammation marker.
Keywords: False discovery proportion, false discovery rate, high-dimensional linear regression, hypothesis testing, multiple comparisons, sparsity, two-sample tests
1 Introduction
As we enter a new era of data science, called by some the “information century”, research in several novel genomics and epigenomics fields is well underway. Large-scale genome-wide scans, such as genome-wide association studies, have become widely available tools for identifying common genetic variants that contribute to complex diseases and treatment responses (McCarthy et al. (2008); Venter et al. (2001)). However, there is growing evidence that genetic variants alone explain only a small proportion of the variation in complex disease phenotypes. Most complex diseases are a result of the interplay between genes and environment (Hunter (2005)). It is thus of substantial interest to rigorously study the effects of environment and its interaction with genetic predispositions on disease phenotypes.
When the environmental factor is a binary variable such as smoking status or gender, such interaction problems can be addressed through the two-sample high-dimensional regression framework. Specifically, interaction detection can be formulated based on comparing two high-dimensional regression models
Yd = μd + Xdβd + εd,  d = 1, 2, (1)
and identifying the nonzero components of β1 − β2, where βd = (β1,d, …, βp,d)T ∈ ℝp, μd = (μ1,d, …, μnd,d)T, Xd is the nd × p design matrix with rows Xk,·,d, Yd = (Y1,d, …, Ynd,d)T, and εd = (ε1,d, …, εnd,d)T, with {εk,d} being independent and identically distributed (i.i.d.) random variables with mean zero and variance σεd2, independent of Xk,·,d, k = 1, …, nd. Two-sample interaction detection problems arise in many other biomedical settings. For example, when the two samples represent diseased and non-diseased groups and Y represents a diagnostic test, the non-zero components of β1 − β2 represent the covariates that affect the diagnostic accuracy of Y (Pepe (2003)). When the two samples represent two treatment groups, the proposed testing procedures have important applications in personalized medicine. The non-zero components of β1 − β2 correspond to markers useful for individualized treatment selection, since the rule that optimizes the treatment selection for an individual patient with genomic markers X can be formed based on (β1 − β2)TX (Matsouaka et al. (2014)). However, the high dimensionality of the genomic data presents substantial statistical challenges in efficiently identifying gene-environment interactions and markers useful for personalized treatment selection.
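To fix ideas, the following minimal sketch (for illustration only; the sample sizes, sparsity level, intercepts, and effect sizes below are arbitrary choices, not taken from the paper) simulates data from the two-sample model (1) with a sparse difference β1 − β2.

```python
import numpy as np

rng = np.random.default_rng(0)
n1, n2, p = 100, 100, 200

def simulate_group(n, p, beta, sigma_eps, rng):
    """Draw one group's design, errors, and responses from model (1) with intercept mu_d = 1."""
    X = rng.standard_normal((n, p))               # i.i.d. covariates, purely for illustration
    eps = sigma_eps * rng.standard_normal(n)      # errors independent of X
    Y = 1.0 + X @ beta + eps
    return X, Y

beta1 = np.zeros(p)
beta1[:10] = rng.uniform(-1.0, 1.0, size=10)      # sparse regression coefficients
beta2 = beta1.copy()
beta2[:3] += 0.5                                  # interaction signal: beta_1 and beta_2 differ in 3 coordinates

X1, Y1 = simulate_group(n1, p, beta1, 1.0, rng)
X2, Y2 = simulate_group(n2, p, beta2, 1.0, rng)
print(np.nonzero(beta1 - beta2)[0])               # coordinates a multiple testing procedure should flag
```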
There is a paucity of literature focusing on multiple testing of the regression coefficients in the high-dimensional two-sample setting while controlling the false discovery rate (FDR) and false discovery proportion (FDP). For example, Zhang and Zhang (2014), Van de Geer et al. (2014), and Javanmard and Montanari (2013, 2014) considered confidence intervals and tests for a given coordinate of a high-dimensional linear regression vector. Procedures that are based on the “de-biased” Lasso estimators were proposed. The focus was solely on inference for a given coordinate and simultaneous testing of all coordinates was not considered. Recently, Liu and Luo (2014) investigated the one-sample version of the multiple testing problem, testing simultaneously
H0,i : βi = 0 versus H1,i : βi ≠ 0,  1 ≤ i ≤ p,
with the control of FDR. They constructed the test statistics based on bias-corrected sample covariances of the residuals from the forward and inverse regressions, as explained in detail in Section 2.2. The one-sample setting is simpler than the two-sample multiple testing problem considered in the present paper. For example, their proposed test statistics have desirable theoretical properties due to the facts that (i) they are asymptotically normally distributed under the null hypothesis βi = 0, and (ii) the correlation between two test statistics is equal to the partial correlation between the two covariates, which is fully determined by the precision matrix. However, those properties no longer hold when we extend the hypothesis testing problem to two samples as described in (3).
In this paper, we are interested in developing efficient procedures for testing β1 − β2. The first goal is to develop a global test for
H0 : β1 = β2 versus H1 : β1 ≠ β2, (2)
that is powerful against sparse alternatives. We then develop a procedure for simultaneously testing the hypotheses
H0,i : βi,1 = βi,2 versus H1,i : βi,1 ≠ βi,2,  1 ≤ i ≤ p, (3)
with FDR and FDP control. The test statistics are constructed using the covariances between the residuals of the fitted regression models and the inverse regression models. Although the techniques build on the inverse regression method developed in Liu and Luo (2014) for the one-sample case, the two-sample case poses significant additional difficulties in both methodology development and technical analyses. We point out here two such major challenges and more detailed discussion is given in Section 2.3.
The construction of test statistics is much more involved than the one-sample case. This is mainly due to the fact that the difference of regression coefficients can no longer be reduced to the difference of residual covariances as in the one-sample setting. Furthermore, corrections of the test statistics are essential in the two-sample case to establish the asymptotic normality.
The technical analyses of the two-sample case are much more challenging. This is because the one-sample case can be easily reduced to a weakly correlated testing problem provided that the precision matrix of the covariates is sparse or nearly sparse, while the two-sample case cannot as the correlation structure is much more complicated.
The properties of the proposed testing procedures are investigated theoretically as well as numerically through simulation and data analysis. Theoretical justifications are provided to ensure the validity of the proposed tests and optimality results are established under sparsity assumptions on the regression coefficients. A simulation study is carried out to demonstrate that the proposed tests maintain the desired error rates under the null and have good power under the alternative at moderate sample sizes. The simulation results also show that the new multiple testing procedure outperforms the well known Benjamini-Yekutieli procedure (Benjamini and Yekutieli (2001)). In addition, the proposed testing procedures are illustrated by an application to the Framingham Offspring Study (Kannel et al., 1979) to study how smoking and its interaction with a genetic predisposition affect an inflammation marker which plays an important role in the risk of developing cardiovascular disease.
The rest of the paper is organized as follows. In Section 2, we introduce the construction of the new test statistics and discuss the technical differences and theoretical challenges of the two-sample testing problems. Section 3 develops a maximum-type statistic Mn and the corresponding test for the global hypothesis H0 : β1 = β2 through the inverse regression framework. We establish in this section the asymptotic null distribution of Mn and show the optimality results under sparse alternatives. Large-scale multiple testing with FDR and FDP control is presented in Section 4. Section 5 investigates the numerical performance of the proposed procedures by simulations. In Section 6, we apply the proposed procedures to the Framingham Offspring Study. Section 7 discusses an extension to a non-binary environmental variable, and the proofs of the main results are given in Section 8.
2 Methodology
2.1 Notation and Definitions
We first introduce the notation and definitions that will be used throughout the paper. For a vector βd = (β1,d, …, βp,d)T ∈ ℝp, define the ℓq norm by |βd|q = (Σ1≤i≤p |βi,d|q)1/q for 1 ≤ q < ∞ and |βd|∞ = max1≤i≤p |βi,d|. For subscripts, we use the convention that i stands for the ith entry of a vector and (i, j) for the entry in the ith row and jth column of a matrix, k represents the kth sample, and d is the group indicator. Let Xd be the nd × p data matrix with rows Xk,·,d, and let Yd = (Y1,d, …, Ynd,d)T be the nd × 1 response vector, for d = 1, 2. Throughout, suppose that we have i.i.d. random samples {Yk,d, Xk,·,d, 1 ≤ k ≤ nd} with Xk,·,d = (Xk,1,d, …, Xk,p,d) being a random vector with covariance matrix Σd, for d = 1, 2.
For any vector μd ∈ ℝp, let μ−i,d denote the (p − 1)-dimensional vector formed by removing the ith entry from μd. For a symmetric matrix Ad, let λmax(Ad) and λmin(Ad) denote the largest and smallest eigenvalues of Ad, respectively. For any n × p matrix Ad, Ai,−j,d denotes the ith row of Ad with its jth entry removed, and A−i,j,d denotes the jth column of Ad with its ith entry removed. A−i,−j,d denotes the (n − 1) × (p − 1) submatrix of Ad with its ith row and jth column removed. Let A·,−j,d denote the n × (p − 1) submatrix of Ad with the jth column removed, Ai,·,d denote the ith row of Ad, and A·,j,d denote the jth column of Ad. For a matrix Ω = (ωi,j)p×p, the matrix 1-norm is the maximum absolute column sum, ‖Ω‖L1 = max1≤j≤p Σ1≤i≤p |ωi,j|, the matrix elementwise infinity norm is defined to be ‖Ω‖∞ = max1≤i,j≤p |ωi,j|, and the elementwise ℓ1 norm is ‖Ω‖1 = Σ1≤i,j≤p |ωi,j|. For a set ℋ, let |ℋ| be the cardinality of ℋ. For two sequences of real numbers {an} and {bn}, write an = O(bn) if there exists a constant C such that |an| ≤ C|bn| holds for all n, write an = o(bn) if limn→∞ an/bn = 0, and write an ≍ bn if there are positive constants c and C such that c ≤ an/bn ≤ C for all n.
2.2 Test Statistics
To form the test statistics, we consider the inverse regression models obtained by regressing Xk,i,d on (Yk,d, Xk,−i,d), as introduced in Liu and Luo (2014),
Xk,i,d = αi,d + (Yk,d, Xk,−i,d)γi,d + ηk,i,d,  k = 1, …, nd,
where for d = 1, 2, αi,d is an intercept, ηk,i,d has mean zero and variance σηi,d2 and is uncorrelated with (Yk,d, Xk,−i,d), and γi,d = (γi,1,d, …, γi,p,d)T satisfies
(4) |
where , as provided in Liu and Luo (2014).
Remark 1
Equation (4) can be obtained directly as follows. Denote the covariance matrix of Z = (Xk,i,d, Yk,d, Xk,−i,d) by Σ = Cov(Z). Section 2.5 of Anderson (2003) shows that γi,d can be obtained by γi,d = Σ22−1Σ21, where Σ22 = Cov(Z1) with Z1 = (Yk,d, Xk,−i,d) and Σ21 = Cov(Z1, Xk,i,d) is the covariance between Z1 and Xk,i,d. Then (4) follows from the regression model Yd = μd + Xdβd + εd and the fact that Xd and εd are uncorrelated with each other.
Because ri,d = Cov(εk,d, ηk,i,d) can be expressed as ri,d = −βi,dσηi,d2, the null hypotheses in the global testing problem (2) and the entry-wise testing problem (3) are, respectively, equivalent to
H0 : ri,1/σηi,12 = ri,2/σηi,22 for all 1 ≤ i ≤ p versus H1 : ri,1/σηi,12 ≠ ri,2/σηi,22 for some 1 ≤ i ≤ p, (5)
and
H0,i : ri,1/σηi,12 = ri,2/σηi,22 versus H1,i : ri,1/σηi,12 ≠ ri,2/σηi,22,  1 ≤ i ≤ p, (6)
and we base the tests on the estimates of {ri,d/σηi,d2, i = 1, …, p; d = 1, 2}.
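As a quick numerical check (an illustration, not part of the paper; the covariance model, coefficients, and sample size below are arbitrary), the identity ri,d = −βi,dσηi,d2 can be verified by Monte Carlo: fit the inverse regression by least squares on a large sample and compare the empirical covariance of the two error terms with −βi,d times the empirical variance of ηk,i,d.

```python
import numpy as np

# Monte Carlo check of r_i = Cov(eps, eta_i) = -beta_i * Var(eta_i), with illustrative values.
rng = np.random.default_rng(1)
n, p, i = 200_000, 5, 2                      # large n so sample moments are close to population values
beta = np.array([1.0, 0.0, -0.7, 0.3, 0.0])
A = rng.standard_normal((p, p))
Sigma = A @ A.T / p + np.eye(p)              # an arbitrary positive-definite covariate covariance
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
eps = rng.standard_normal(n)                 # forward regression error, independent of X
Y = X @ beta + eps

# Inverse regression: regress X_i on (Y, X_{-i}) by least squares; eta_i is the residual.
Z = np.column_stack([Y, np.delete(X, i, axis=1)])
Zc = Z - Z.mean(axis=0)
xi = X[:, i] - X[:, i].mean()
gamma_i = np.linalg.lstsq(Zc, xi, rcond=None)[0]
eta_i = xi - Zc @ gamma_i

r_i = np.cov(eps, eta_i)[0, 1]               # sample covariance between the two error terms
print(r_i, -beta[i] * np.var(eta_i))         # the two values should nearly coincide
```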
Define the residuals
ε̂k,d = Yk,d − Ȳd − (Xk,·,d − X̄·,d)β̂d  and  η̂k,i,d = Xk,i,d − X̄i,d − (Yk,d − Ȳd, Xk,−i,d − X̄−i,d)γ̂i,d,
where Ȳd, X̄·,d, X̄i,d, and X̄−i,d denote the corresponding sample means, and β̂d = (β̂1,d, …, β̂p,d)T and γ̂i,d = (γ̂i,1,d, …, γ̂i,p,d)T are estimators of βd and γi,d, respectively, satisfying
(7) |
for some an1 and an2 such that
(8) |
Estimators β̂d and γ̂i,d that satisfy (7) and (8) can be obtained easily via standard methods such as the lasso and the Dantzig selector; see, for example, Xia et al. (2015) and Liu and Luo (2014).
Based on the residuals ε̂k,d and η̂k,i,d, a natural estimator of ri,d is the sample covariance between the residuals,
r̃i,d = nd−1 Σ1≤k≤nd ε̂k,dη̂k,i,d.
Because r̃i,d tends to be biased, we define a bias corrected estimator for ri,d as
(9) |
where σ̂εd2 = nd−1 Σ1≤k≤nd ε̂k,d2 and σ̂ηi,d2 = nd−1 Σ1≤k≤nd η̂k,i,d2 are the sample variances of the residuals, whose rates of convergence can be obtained by Lemma 2 in Xia et al. (2015) under conditions (7) and (8). By Lemma 2, the bias of r̂i,d is then of order max{|βi,d|(log p/nd)1/2, (nd log p)−1/2}.
Remark 2
The most straightforward way to estimate ri,d is to use the sample covariance between the error terms, nd−1 Σ1≤k≤nd εk,dηk,i,d. However, the error terms are unknown, so we use the sample covariance between the residuals, r̃i,d, instead. The bias of r̃i,d exceeds the desired rate (nd log p)−1/2, and we therefore correct r̃i,d by an estimate of its difference from the error-term sample covariance, which is accurate up to order (nd log p)−1/2. This leads to the bias-corrected estimator r̂i,d defined in (9).
For i = 1, …, p and d = 1, 2, a natural estimator of ri,d/σηi,d2 can then be defined by
Ti,d = r̂i,d/σ̂ηi,d2. (10)
Subsequently, we may test the hypotheses (2) and (3) using the estimators 𝒯 = {Ti,1 − Ti,2 : i = 1, …, p}. However, since Ti,1 − Ti,2 in 𝒯 are heteroscedastic with possibly a wide range of variability, we instead consider a standardized version of Ti,1 − Ti,2. Specifically, let
It can be shown in Lemma 2 that, uniformly in i = 1, …, p,
Noting that , we estimate θi,d by
and define the standardized statistics
Wi = (Ti,1 − Ti,2)/(θ̂i,1 + θ̂i,2)1/2,  i = 1, …, p. (11)
We base the tests for (2) and (3) on {Wi, i = 1, …, p}, which will be studied in detail in Sections 3 and 4.
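The sketch below illustrates the residual construction of this section with the Lasso, using the simulated (X1, Y1) and (X2, Y2) from the earlier sketch. It is only schematic: the bias correction in (9) and the variance estimates θ̂i,d in (11) are not reproduced, and the penalty level alpha is a hypothetical choice rather than the tuning rule described in Section 5.1.

```python
import numpy as np
from sklearn.linear_model import Lasso

def fitted_residuals(X, Y, i, alpha=0.1):
    """Residuals eps_hat (Y on X) and eta_hat_i (X_i on (Y, X_{-i})), both via the Lasso."""
    eps_hat = Y - Lasso(alpha=alpha).fit(X, Y).predict(X)               # forward regression residuals
    Z = np.column_stack([Y, np.delete(X, i, axis=1)])
    eta_hat = X[:, i] - Lasso(alpha=alpha).fit(Z, X[:, i]).predict(Z)   # inverse regression residuals
    return eps_hat, eta_hat

def naive_statistic(X, Y, i, alpha=0.1):
    """Plug-in analogue of T_{i,d} without the bias correction of (9): r_tilde / sigma_eta_hat^2."""
    eps_hat, eta_hat = fitted_residuals(X, Y, i, alpha)
    r_tilde = np.mean(eps_hat * eta_hat)      # sample covariance of the residuals (both have mean near 0)
    return r_tilde / np.var(eta_hat)

# Usage: the two-sample comparison for coordinate i would contrast
# naive_statistic(X1, Y1, i) and naive_statistic(X2, Y2, i), standardized as in (11).
```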
2.3 Discussion
We discuss here the substantial differences between the two-sample and one-sample cases and the necessity for significant adjustments and corrections in the two-sample setting.
The proposed tests are based on estimators of ri,d/σηi,d2. Here we estimate ri,d = Cov(εk,d, ηk,i,d) by constructing a bias-corrected sample covariance between the residuals, r̂i,d, as defined in (9). That is, we need to estimate the difference between the naive estimate r̃i,d and an unbiased estimate of ri,d, namely the sample covariance of the error terms, nd−1 Σ1≤k≤nd εk,dηk,i,d.
Liu and Luo (2014) considered the one-sample case of the multiple testing problem (3), in which the null hypothesis βi = 0 is equivalent to ri = 0, and ri is easier to estimate. The procedure in Liu and Luo (2014) is therefore based on the estimation of ri instead of ri/σηi2. In the two-sample case, βi,1 = βi,2 is not equivalent to ri,1 = ri,2. Thus, it is necessary to construct testing procedures based directly on estimators of ri,d/σηi,d2.
Furthermore, in the one-sample case, the asymptotic normality of Ti can be established because βi = 0 under the null, as shown in Lemma 2. Thus the theoretical properties of the individual test statistics are easier to obtain. In the two-sample case, βi,1 and βi,2 are not necessarily equal to 0 under the null, and corrections are thus essential in order to show that Wi is close to a normal random variable; the technical details are much more complicated.
More importantly, in the one-sample case, βi = 0 under the null hypothesis, and thus Corr(εkηk,i, εkηk,j) = ωi,j / (ωi,iωj,j), which is fully determined by the precision matrix of the covariates and thus simplifies the calculations. In the two-sample version, βi,1 = βi,2 under the null hypothesis, but they are not necessarily equal to zero. The calculation of Corr(εk,dηk,i,d, εk,dηk,j,d), which determines the correlation between Wi and Wj, is much more involved, and it is shown in the proof of Theorem 4 that
(12) |
The technical analysis for establishing the theoretical results in Sections 3 and 4 is thus much more challenging.
3 Global Test
In this section, we wish to test the global hypothesis
We propose a procedure based on the standardized statistics {Wi, i = 1, …, p}
Mn = max1≤i≤p Wi2. (13)
It is shown in Section 3.1 that, under certain regularity conditions, Mn − 2 log p + log log p converges to a Gumbel distribution under the null, and the asymptotic α-level test can thus be defined as
Ψα = I{Mn ≥ qα + 2 log p − log log p}, (14)
where qα is the 1 − α quantile of the Gumbel distribution with cumulative distribution function exp(−π−1/2e−t/2), that is, qα = −log π − 2 log log(1 − α)−1.
We reject the null hypothesis H0 whenever Ψα = 1.
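A minimal sketch of the global test (13)-(14) follows (an illustration only, not the authors' code); it assumes a vector W of standardized statistics from (11) is already available and simply evaluates the maximum-type statistic against the Gumbel-based threshold.

```python
import numpy as np

def global_test(W, alpha=0.05):
    """Maximum-type statistic M_n = max_i W_i^2 and the decision of the test Psi_alpha in (14)."""
    p = len(W)
    M_n = np.max(np.asarray(W) ** 2)
    q_alpha = -np.log(np.pi) - 2.0 * np.log(np.log(1.0 / (1.0 - alpha)))   # 1 - alpha Gumbel quantile
    return M_n, bool(M_n >= q_alpha + 2.0 * np.log(p) - np.log(np.log(p)))

# Null calibration check with independent N(0, 1) statistics as an idealized stand-in for W.
rng = np.random.default_rng(2)
rejections = [global_test(rng.standard_normal(500))[1] for _ in range(2000)]
print(np.mean(rejections))   # close to, typically slightly below, the nominal 0.05
```

The slightly conservative rejection rate of this idealized check matches the empirical sizes reported in Table 1.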
3.1 Asymptotic Null Distribution
We first introduce some regularity conditions, under which, Mn − 2 log p+log log p converges weakly to a Gumbel random variable with distribution function exp(−π−1/2e−t/2).
-
(C1)
log p = o(n1/5), n1 ≍ n2, and for some constants C0,C1,C2 > 0, , and |βd|∞ ≤ C2 for d = 1, 2. There exists some τ > 0 such that |Aτ| = O(pr) with r < 1/4, where Aτ = {i : |βi,d| ≥ (log p)−2−τ, 1 ≤ i ≤ p, for d = 1 or 2}.
-
(C2)
Let Dd be the diagonal of Ωd and let (ξi,j,d)p×p = Dd−1/2ΩdDd−1/2, for d = 1, 2. Assume that max1≤i<j≤p |ξi,j,d| ≤ ξd for some constant 0 < ξd < 1.
-
(C3)
There exists some constant K > 0 such that and are finite.
Condition (C1) on the eigenvalues is commonly used in the high-dimensional setting and implies that most of the variables are not highly correlated with each other. Condition (C2) is also mild; for example, if max1≤i<j≤p |ξi,j,d| = 1, then Ωd is singular. (C3) is a sub-Gaussian-type tail condition, and it can be weakened to a polynomial tail condition if p < nc for some constant c > 0.
Theorem 1
Suppose (C1), (C2), (C3), (7), and (8) hold. Then under H0, for any t ∈ ℝ,
P(Mn − 2 log p + log log p ≤ t) → exp(−π−1/2e−t/2),  as n1, n2, p → ∞, (15)
where Mn is defined in (13). Under H0, the convergence in (15) is uniform for all {Yk,d, Xk,·,d : k = 1, 2, …, nd} satisfying (C1), (C2), (C3), (7), and (8).
Remark 3
The analysis can be extended to test H0 : βG,1 = βG,2 versus H1 : βG,1 ≠ βG,2 for a given index set G. We can construct the test statistic as Mn,G = maxi∈G Wi2, and obtain a similar Gumbel limiting null distribution by replacing p with |G|, as n1, n2, |G| → ∞. Condition (C1) will be slightly different, with Aτ replaced by AG,τ = {i : |βi,d| ≥ (log p)−2−τ, i ∈ G, for d = 1 or 2}.
Remark 4
Condition (C1) is slightly stronger than the conditions in Liu and Luo (2014), as we need |Aτ| = O(pr) with r < 1/4. This is due to a major difference between the one-sample and two-sample cases: the global null H0 : β = 0 is a simple null in the one-sample case, whereas the null H0 : β1 = β2 is composite in the two-sample case. In the one-sample case, Ti is a nearly unbiased estimate of βi because βi = 0 under the global null. However, in the two-sample case, as stated in Lemma 2, additional correction terms involving βi,d are needed in order to make Ti,d nearly unbiased, because βi,1 and βi,2 are not necessarily equal to 0 under the null. Thus, slightly stronger conditions on Aτ are needed.
3.2 Asymptotic Power
We now analyze the asymptotic power of the test Ψα given in (14). The test is shown to be particularly powerful against a large class of sparse alternatives and the power is minimax rate optimal. We first define a class of regression coefficients:
𝒰(c) = {(β1, β2) : max1≤i≤p |βi,1 − βi,2|/(θi,1 + θi,2)1/2 ≥ c(log p)1/2}. (16)
We show that the null hypothesis H0 can be rejected by the test Ψα with overwhelming probability if (β1, β2) ∈ 𝒰(c) for a sufficiently large constant c.
Theorem 2
Let the test Ψα be given in (14). Suppose (C1), (C3), (7) and (8) hold. Then
Theorem 2 shows that the null parameter set in which β1 = β2 is asymptotically distinguishable from 𝒰(c) by the test Ψα.
We further show that the lower bound in (16) is rate optimal. Let 𝒯α be the set of all α-level tests, that is, tests Tα with P(Tα = 1) ≤ α under H0. If c in (16) is sufficiently small, then any α-level test is unable to reject the null hypothesis correctly, uniformly over (β1, β2) ∈ 𝒰(c), with probability tending to one.
Theorem 3
Suppose that log p = o(n). Let α, β > 0 and α + β < 1. Then there exists a constant c0 > 0 such that for all sufficiently large n and p,
Theorem 3 shows that the order (log p)1/2 in the lower bound of max1≤i≤p{|βi,1 − βi,2|/(θi,1 + θi,2)1/2} in (16) cannot be further improved.
4 Multiple Testing with False Discovery Rate Control
4.1 Multiple Testing Procedure
If the global null hypothesis is rejected, it is then of interest to identify the subset of variables in X that interact with the group indicator. This can be achieved by simultaneously testing on the entries of β1 − β2 with FDR and FDP control,
H0,i : βi,1 = βi,2 versus H1,i : βi,1 ≠ βi,2,  1 ≤ i ≤ p. (17)
The standardized statistics Wi = (Ti,1 − Ti,2)/(θ̂i,1 + θ̂i,2)1/2 defined in (11) serve as the test statistics. Let t be the threshold such that H0,i is rejected if |Wi| ≥ t. Let ℋ0 = {i : βi,1 = βi,2, 1 ≤ i ≤ p} be the set of true nulls. Let R0(t) = Σi∈ℋ0 I(|Wi| ≥ t) and R(t) = Σ1≤i≤p I(|Wi| ≥ t) denote, respectively, the total number of false positives and the total number of rejections. The FDP and FDR are defined as
FDP(t) = R0(t)/max{R(t), 1}  and  FDR(t) = E{FDP(t)}.
Ideally, we would select the threshold level as
t0 = inf{0 ≤ t ≤ (2 log p)1/2 : FDP(t) ≤ α}.
However, ℋ0 is unknown, and we estimate Σi∈ℋ0 I{|Wi| ≥ t} by 2p{1 − Φ(t)} due to the sparsity of β1 − β2, where Φ(t) is the standard normal cumulative distribution function. This leads to the following multiple testing procedure.
- Calculate the test statistics Wi = (Ti,1 − Ti,2)/(θ̂i,1 + θ̂i,2)1/2 as in (11).
- For a given 0 ≤ α ≤ 1, calculate
  t̂ = inf{0 ≤ t ≤ (2 log p)1/2 : 2p{1 − Φ(t)}/max{R(t), 1} ≤ α}.
  If t̂ does not exist, set t̂ = (2 log p)1/2.
- For 1 ≤ i ≤ p, reject H0,i if and only if |Wi| ≥ t̂.
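A compact sketch of this thresholding rule is given below (an illustration only); it assumes the statistics W from (11) are available and scans the observed |Wi| as candidate thresholds, which yields the same rejection set as taking the infimum over t.

```python
import numpy as np
from scipy.stats import norm

def fdr_threshold(W, alpha=0.1):
    """Smallest threshold t in [0, sqrt(2 log p)] with 2p{1 - Phi(t)} / max(R(t), 1) <= alpha."""
    W = np.asarray(W)
    p = len(W)
    t_max = np.sqrt(2.0 * np.log(p))
    for t in np.sort(np.abs(W)):              # candidate thresholds: the observed |W_i|
        if t > t_max:
            break
        R = np.sum(np.abs(W) >= t)            # total number of rejections at threshold t
        if 2.0 * p * (1.0 - norm.cdf(t)) / max(R, 1) <= alpha:
            return t
    return t_max                              # if no such t exists, fall back to sqrt(2 log p)

rng = np.random.default_rng(3)
W = rng.standard_normal(1000)
W[:20] += 4.0                                 # a few strong signals for illustration
t_hat = fdr_threshold(W)
print(t_hat, np.sum(np.abs(W) >= t_hat))      # threshold and number of rejected hypotheses
```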
4.2 Theoretical Properties
We now investigate the theoretical properties of this multiple testing procedure. For any 1 ≤ i ≤ p and γ > 0, define
Γi(γ) = {j : 1 ≤ j ≤ p, |ξi,j,d| ≥ (log p)−2−γ for d = 1 or 2},
where ξi,j,d is defined in Condition (C2). Under regularity conditions, this procedure controls the FDP and FDR at the pre-specified level α, asymptotically.
where ξi,j,d is defined in Condition (C2). Under regularity conditions, this procedure controls the FDP and FDR at the pre-specified level α, asymptotically.
Theorem 4
Let
Suppose for some ρ > 0 and some δ > 0, |𝒮ρ| ≥ [1/(π1/2α) + δ](log p)1/2. Suppose that |Aτ ∩ ℋ0| = o(pν) for any ν > 0, where Aτ is given in Condition (C1). Assume that p0 = |ℋ0| ≥ cp for some c > 0, and (7) and (8) hold. If there exists some γ > 0 such that max1≤i≤p |Γi(γ)| = o(pν) for any ν > 0, then under (C1) – (C3) with p ≤ cnr for some c > 0 and r > 0, we have
in probability, as (n, p) → ∞.
The condition on |𝒮ρ| is mild, because among the p hypotheses in total, it only requires a small number of entries to have standardized difference exceeding (log p)1/2+ρ/n1/2 for some constant ρ > 0. The technical condition |Aτ ∩ ℋ0| = o(pν) for any ν > 0 ensures that most of the regression residuals are not highly correlated with each other under the null hypotheses H0,i : βi,1 = βi,2.
5 Simulation Study
We examine the numerical performance, including the sizes and powers, of both the global and the multiple testing procedures through simulation studies. We investigated the performance of both procedures under two sets of simulations. For the first, we generated the data by considering two constructions of regression coefficients under three matrix models, with covariates being a combination of continuous and discrete random variables. For the second set, we studied the numerical performance of the proposed multiple testing procedure in a setting similar to the data application described in Section 6. We compared the proposed multiple testing procedure with the Benjamini-Yekutieli (B-Y) procedure (Benjamini and Yekutieli (2001)) and show that the B-Y procedure is much more conservative and has lower power in all cases.
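For reference, the comparator is the standard Benjamini-Yekutieli step-up rule applied to two-sided normal p-values of the statistics Wi; a brief sketch (an illustration, not the paper's implementation) is given below.

```python
import numpy as np
from scipy.stats import norm

def benjamini_yekutieli(W, alpha=0.1):
    """Indices rejected by the B-Y step-up rule applied to two-sided normal p-values of W."""
    W = np.asarray(W)
    p = len(W)
    pvals = 2.0 * (1.0 - norm.cdf(np.abs(W)))
    order = np.argsort(pvals)                          # hypotheses sorted by p-value
    c_p = np.sum(1.0 / np.arange(1, p + 1))            # the B-Y correction factor
    ok = np.nonzero(pvals[order] <= np.arange(1, p + 1) * alpha / (p * c_p))[0]
    return order[: ok[-1] + 1] if ok.size else np.array([], dtype=int)
```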
5.1 Implementation Details
The proposed testing procedures required the estimation of the regression coefficients βd and γi,d, for i = 1, …, p and d = 1, 2. One may use the Lasso to estimate these parameters, as follows.
(18) |
and
(19) |
where DX = diag(Σ̂), Di,d = diag(σ̂Yd, Σ̂−i,−i), and , in which σ̂Yd is the sample variance of Yd and Σ̂ = (σ̂i,j) is the sample covariance matrix of Xd. In the global testing of H0 : β1 = β2, we chose the tuning parameter κ = 2.
For multiple testing of H0,i : βi,1 = βi,2, we selected the tuning parameters λn and λi,n in (18) and (19) adaptively by the data with the principle of making Σi∈ℋ0 I{|Wi| ≥ t} and 2{1 − Φ(t)}|ℋ0| as close as possible. That is, a good choice of the tuning parameters should minimize the error
where c > 0 and the statistics Wi are computed with the corresponding tuning parameters. Step 2 below is a discretization of the above integral. The algorithm is summarized as follows.
- For b = 1, …, 40, let λn(b) and λi,n(b) denote the candidate tuning parameters. For each b, calculate the corresponding estimates β̂d(b) and γ̂i,d(b), i = 1, …, p, d = 1, 2, and, based on these estimated regression coefficients, construct the corresponding statistics Wi(b).
- Choose b̂ as the minimizer of
The tuning parameters λn and λi,n are then chosen to be
(20) |
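The displays defining the candidate grid and the discretized criterion are not reproduced above, so the scoring rule in the sketch below (comparing empirical tail counts of |Wi| with the null expectation 2p{1 − Φ(t)} over an assumed grid of thresholds) is only one reading of the stated principle, not the authors' exact formula.

```python
import numpy as np
from scipy.stats import norm

def tuning_error(W):
    """Discrepancy between observed tail counts of |W_i| and the null expectation 2p{1 - Phi(t)}."""
    W = np.asarray(W)
    p = len(W)
    grid = np.sqrt(np.log(p)) * np.arange(1, 11) / 10.0          # assumed grid of thresholds
    err = 0.0
    for t in grid:
        expected = 2.0 * p * (1.0 - norm.cdf(t))
        observed = np.sum(np.abs(W) >= t)
        err += (observed / max(expected, 1.0) - 1.0) ** 2
    return err

# Usage: compute the statistics W(b) for each candidate b = 1, ..., 40 and keep the b
# minimizing tuning_error(W(b)); the selected b gives the tuning parameters chosen in (20).
```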
5.2 Simulation Under Different Matrix Models
We first generated the design matrices Xk,·,d, for k = 1, …, nd and d = 1, 2, with some of the covariates being continuous and the others being discrete. For simplicity, we generated Xk,·,d from the same distribution for d = 1, 2. As a first step, for three different matrix models, we obtained i.i.d. samples Xk,·,d ~ N(0, Σ(m)), for k = 1, …, nd, with m = 1, 2, and 3, where Σ(m) = (Ω(m))−1. We then replaced l covariates of Xk,·,d by one of three discrete values 0, 1, or 2, with probability 1/3 each, where l is a random integer between ⌊p/2⌋ and p. We first introduce the matrix models used in the simulations. Let D = (Di,j) be a diagonal matrix with Di,i drawn from Unif(1, 3) for i = 1, …, p. The following models were used to generate the design matrices.
Model 1: , where and otherwise. Ω(1) = D1/2Ω*(1)D1/2.
Model 2: , where for i = 10(k − 1) + 1 and 10(k − 1) + 2 ≤ j ≤ 10(k − 1) + 10, 1 ≤ k ≤ p/10. otherwise. Ω(2) = D1/2(Ω*(2) + δI)/(1 + δ)D1/2 with δ = |λmin(Ω*(2))| + 0.05.
Model 3: , where for i < j and . Ω(3) = D1/2(Ω*(3) + δI)/(1 + δ)D1/2 with δ = |λmin(Ω*(3))| + 0.05.
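A sketch of this design generation is given below (illustrative only); for brevity it uses the identity covariance in place of the paper's Σ(m), so the Ω(m) constructions above are not reproduced.

```python
import numpy as np

def simulate_design(n, p, rng):
    """Gaussian design in which a random subset of columns is replaced by categorical values {0, 1, 2}."""
    X = rng.multivariate_normal(np.zeros(p), np.eye(p), size=n)      # placeholder for N(0, Sigma(m))
    l = rng.integers(p // 2, p + 1)                                  # random number of discrete covariates
    cols = rng.choice(p, size=l, replace=False)
    X[:, cols] = rng.choice([0.0, 1.0, 2.0], size=(n, l))            # each value with probability 1/3
    return X

X1 = simulate_design(100, 200, np.random.default_rng(4))
```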
Global Test
For the global testing of H0 : β1 = β2, the sample sizes were taken to be n = n1 = n2 = 100, while the dimension p varied over the values 100, 200, 400, and 1000. Under the global null hypothesis, we have β1 = β2 = β, and two scenarios for generating β were considered. For case 1, 10 nonzero locations {k1, …, k10} of β were randomly generated with magnitudes , i = 1, …, 10. For case 2, s nonzero locations of β were randomly selected, with s = 5, 8, 10, and 15 for p = 100, 200, 400, and 1000, respectively. The nonzero coefficients took values between −10 and 10. The error terms εk,d were generated as normal random variables with mean 0 and variances taking values between 0.5 and 2.5. The nominal significance level for all the tests was set at α1 = 0.05.
Table 1 shows that the sizes of the global test Ψα1 are close to the nominal level for both cases under all matrix models. This reflects the fact that the null distribution of the test statistics Mn is well approximated by its limiting null distribution, as shown in Theorem 1. The empirical sizes are slightly below the nominal level in some cases for lower dimensions, as similarly observed in Xia et al. (2015), due to correlation among the variables. It is also shown in Table 1 that the proposed test is powerful in all settings, though β1 and β2 only differ in five or fewer locations with magnitudes of the order .
Table 1. Empirical sizes and powers (%) of the global test under Models 1–3, estimated from 1000 replications with n1 = n2 = 100 and nominal level 0.05.

| p | Case 1, Model 1 | Case 1, Model 2 | Case 1, Model 3 | Case 2, Model 1 | Case 2, Model 2 | Case 2, Model 3 |
|---|---|---|---|---|---|---|
| Size | | | | | | |
| 100 | 4.1 | 3.2 | 2.9 | 4.4 | 2.9 | 2.8 |
| 400 | 4.8 | 3.8 | 3.7 | 4.0 | 4.1 | 3.5 |
| 1000 | 6.1 | 4.4 | 5.4 | 5.9 | 4.6 | 6.4 |
| Power | | | | | | |
| 100 | 71.9 | 64.3 | 67.4 | 95.1 | 97.1 | 96.6 |
| 400 | 88.3 | 86.2 | 83.5 | 82.3 | 77.0 | 82.1 |
| 1000 | 95.1 | 92.6 | 97.9 | 47.3 | 42.0 | 48.1 |
To evaluate the power of the global test, we selected five locations, {k1, …, k5}, among the nonzero locations of β1, and set βkj,2 = βkj,1 + uj, j = 1, …, 5, where uj was drawn uniformly from the set [−2β(log p/n)1/2, −β(2 log p/n)1/2] ∪ [β(2 log p/n)1/2, 2β(log p/n)1/2], with β = max1≤i≤p |βi,1|. The actual sizes and powers in percentage for each case under the three matrix models, reported in Table 1, are estimated from 1000 replications. For each replication, the nonzero locations and magnitudes of the regression coefficients could vary.
Multiple Testing
For simultaneous testing of {H0,i : βi,1 − βi,2 = 0, for 1 ≤ i ≤ p} with FDR control, we first generated β1 according to the above two cases. For case 1, ten nonzero locations for β2 were randomly generated, and the locations could differ between the two vectors. The magnitudes were generated with values , i = 1, …, 10. For case 2, s nonzero locations for β2 were randomly selected, again with s = 5, 8, 10, and 15 for p = 100, 200, 400, and 1000, respectively, also with values between −10 and 10.
In Table 2, we present the empirical FDR and true discovery rate (power) of the proposed procedure (NEW) and the B-Y procedure at the FDR level α2 = 0.1, based on 100 replications, where the power is summarized as
(1/100) Σ1≤l≤100 Σi∈ℋ1 I(|Wi,l| ≥ t̂)/|ℋ1|,
where Wi,l denotes the standardized difference for the lth replication and ℋ1 denotes the set of nonzero locations of β1 − β2. The results suggest that, across all configurations, the FDRs are well controlled under the nominal level α by both FDR control procedures. However, the B-Y procedure is extremely conservative in all scenarios. For the new FDR procedure, the empirical FDRs are also conservative, due to the correlations among the regression residuals under the nulls H0,i, and also due to the fact that we use |ℋ| = p to estimate |ℋ0|, because the latter is usually unknown. Furthermore, the total number of true signals is small in all cases due to the sparsity of the regression coefficients; for example, when the total number of true signals is ten, the FDP for each replication tends to be either 0 or some number close to 0.1, which also contributes to the conservatism of the FDR estimation. In case 2, the empirical FDR gets closer to the nominal level as the dimension increases, because the number of true signals increases when p grows. In summary, the new procedure has empirical FDR much closer to the nominal level than the B-Y procedure in all cases. Table 2 also shows that the FDR control procedure introduced in Section 4 is more powerful than the B-Y procedure across the different scenarios.
Table 2. Empirical FDR and power (%) of the new procedure (NEW) and the B-Y procedure at FDR level 0.1, based on 100 replications.

| p | Method | Case 1, Model 1 | Case 1, Model 2 | Case 1, Model 3 | Case 2, Model 1 | Case 2, Model 2 | Case 2, Model 3 |
|---|---|---|---|---|---|---|---|
| Empirical FDR | | | | | | | |
| 100 | NEW | 5.9 | 5.8 | 6.8 | 3.8 | 4.5 | 3.6 |
| | B-Y | 0.3 | 1.0 | 0.7 | 0.1 | 0.3 | 0.7 |
| 400 | NEW | 6.7 | 7.4 | 6.8 | 6.2 | 5.5 | 5.5 |
| | B-Y | 0.4 | 0.6 | 0.4 | 0.2 | 0.7 | 0.5 |
| 1000 | NEW | 6.2 | 6.0 | 6.1 | 9.4 | 9.4 | 9.8 |
| | B-Y | 0.6 | 1.0 | 0.4 | 1.5 | 1.6 | 1.4 |
| Power | | | | | | | |
| 100 | NEW | 95.3 | 94.7 | 94.7 | 93.3 | 92.1 | 90.4 |
| | B-Y | 91.5 | 88.1 | 88.5 | 88.6 | 90.3 | 88.3 |
| 400 | NEW | 92.7 | 88.2 | 90.8 | 84.3 | 82.9 | 83.6 |
| | B-Y | 86.1 | 82.2 | 84.3 | 81.5 | 78.7 | 81.3 |
| 1000 | NEW | 84.7 | 82.7 | 85.1 | 71.7 | 70.4 | 71.9 |
| | B-Y | 77.7 | 75.0 | 77.6 | 66.2 | 64.5 | 66.1 |
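The empirical FDR and power entries in Tables 2 and 3 are averages over replications of the false discovery proportion and the true positive proportion; a small helper that tallies these quantities for one replication is sketched below (illustrative only; the function and argument names are hypothetical).

```python
def fdp_and_tpp(rejected, H1):
    """False discovery proportion and true positive proportion for one replication."""
    rejected, H1 = set(rejected), set(H1)
    fdp = len(rejected - H1) / max(len(rejected), 1)    # rejected nulls among all rejections
    tpp = len(rejected & H1) / max(len(H1), 1)          # recovered signals among all true signals
    return fdp, tpp

# The empirical FDR and power columns are the averages of fdp and tpp over the 100 replications.
```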
5.3 Simulation by Mimicking Data
We now consider a simulation setting mimicking the data considered in Section 6, where p = 119, n1 = 46, and n2 = 417. We investigated both constructions of the regression coefficients considered in Section 5.2, with ten nonzero locations, under all three matrix models, with covariates being a combination of continuous and discrete random variables. The nominal level was set at α3 = 0.1, and the empirical FDRs and powers for both FDR procedures, as reported in Table 3, were evaluated based on 100 replications. As in Section 5.2, the empirical FDRs of the new FDR procedure stay close to the nominal level under the data setting, while the B-Y procedure is again very conservative. For case 1, the empirical FDRs of the new procedure are slightly larger than the nominal level, due to the fact that n1 is much smaller than n2 in this setting, and thus β1 and β2 have magnitudes much closer to each other based on their construction. The performance of the new method for case 2 is less conservative than in Section 5.2 because there are ten nonzero locations for the regression coefficients when the dimension is only 119 in the data setting. Table 3 also indicates that the new procedure is more powerful than the B-Y procedure under the data setting in all scenarios.
Table 3. Empirical FDR and power (%) of the new procedure (NEW) and the B-Y procedure in the data-mimicking setting (p = 119, n1 = 46, n2 = 417) at FDR level 0.1, based on 100 replications.

| p | Method | Case 1, Model 1 | Case 1, Model 2 | Case 1, Model 3 | Case 2, Model 1 | Case 2, Model 2 | Case 2, Model 3 |
|---|---|---|---|---|---|---|---|
| Empirical FDR | | | | | | | |
| 119 | NEW | 9.4 | 11.2 | 11.0 | 8.7 | 8.9 | 8.8 |
| | B-Y | 2.2 | 3.0 | 2.9 | 1.7 | 1.4 | 1.6 |
| Power | | | | | | | |
| 119 | NEW | 83.6 | 81.7 | 83.9 | 79.6 | 78.2 | 80.3 |
| | B-Y | 76.2 | 72.1 | 74.8 | 73.7 | 72.6 | 74.6 |
6 Data Analysis
We illustrate our proposed methods using the Framingham Offspring Study (Kannel et al. (1979)) of coronary artery disease (CAD). Over the past three decades, various risk prediction models for CAD have been developed (Wilson et al. (1998); Ridker et al. (2007)). Unlike those for many other diseases, risk models such as the Framingham Risk Score have been incorporated into clinical practice guidelines (Lloyd-Jones et al. (2004); D’Agostino Sr et al. (2008)). However, these models, largely based on traditional clinical risk factors, have recognized limitations in their clinical utility. It is thus important to explore avenues beyond routine clinical measures to improve prediction. One potential approach is to fully understand the roles of intermediate phenotypes, such as the C-reactive protein (CRP), and genomic markers. In recent years, many genome-wide association studies (GWAS) have been conducted to identify CAD-related single-nucleotide polymorphism (SNP) mutations. The newly identified SNPs, while significantly associated with CAD risk or the intermediate phenotypes of CAD, explain very little of the genetic risk for the trait (Humphries et al. (2008); Paynter et al. (2009)). This coincides with the growing awareness that the failure to identify genetic scores that significantly improve risk prediction for complex traits may be in part due to a failure to account for the interplay of genes and environment. It is thus of substantial interest to study environment and its interaction with genetic predispositions in causing human diseases.
Here, we use data from the Framingham Offspring Study to examine how the interaction between smoking and genetic risk factors affects the inflammation marker CRP, since the inflammation system plays a vital role in the atherosclerotic process (Ross (1999)). We focus on the 463 female participants with complete information on CRP, 116 SNPs previously reported as associated with CAD intermediate phenotypes, two leading principal components that adjust for population stratification, as well as age and smoking status at exam seven. Smoking is known to roughly double the lifetime risk of CAD and is thought to increase cardiovascular risk via a few different mechanisms. We examine the interaction between smoking and the genetic markers, as well as other risk factors, based on the proposed method. We fit separate linear regression models for smokers and for non-smokers; the variables with significantly different coefficients between smokers and non-smokers are deemed to have an interactive effect with smoking.
The effects of the top eight SNPs, rs11585329, rs17583120, rs17132534, rs11214606, rs17529477, rs10891552, rs4293, and rs4351, on CRP are identified as significantly modified by smoking. Interestingly, the interaction between smoking and rs11585329 has been reported as an important contributor to the risk of colorectal cancer, and inflammation is a hallmark of cancer (Liu et al. (2013)). SNP rs17132534 belongs to the UCP2 gene, whose main function is the control of mitochondria-derived reactive oxygen species. A variant in UCP2 has previously been shown to interact with smoking to influence plasma markers of oxidative stress and hence is likely to be associated with prospective CHD risk (Stephens et al. (2008)). SNPs rs10891552, rs17529477, and rs11214606 all belong to the DRD2 gene, which is linked to addictive behaviors, including alcoholism and smoking. Smoking was found to modify the effects of polymorphisms in the DRD2 gene on gastric cancer risk (Ikeda et al. (2008)). SNPs rs4293 and rs4351 belong to the ACE gene, linked with hypertension and CAD among other disorders. Interactions between smoking and polymorphisms in the ACE gene have been reported for blood pressure and coronary atherosclerosis (Hibi et al. (1997); Sayed-Tabatabaei et al. (2004); Schut et al. (2004)).
7 Extension to Non-Binary Environmental Variable
Motivated by applications in genomics, we have proposed hypothesis testing procedures for detecting the interactions between environment and genomic markers when the environmental variable is binary, such as smoking status, as illustrated in Section 6. Our testing approach can be extended to detect interactions when the environmental variable is discrete and finite, but non-binary. Specifically, suppose the environmental variable takes K possible values. Interaction detection can then be formulated based on comparing the K high-dimensional regression models
Yd = μd + Xdβd + εd,  d = 1, …, K.
One wishes to develop a global test for
H0 : β1 = β2 = ⋯ = βK versus H1 : βd ≠ βd′ for some 1 ≤ d < d′ ≤ K, (21)
as well as develop a procedure for simultaneously testing the hypotheses
H0,i : βi,1 = βi,2 = ⋯ = βi,K versus H1,i : βi,d ≠ βi,d′ for some 1 ≤ d < d′ ≤ K,  1 ≤ i ≤ p, (22)
with FDR and FDP control.
The test statistics for each model can be formulated similarly as in Section 2.2. For d = 1, …, K, we let
and estimate θi,d by
Then the pairwise standardized statistics can be defined by
Wi,d,d′ = (Ti,d − Ti,d′)/(θ̂i,d + θ̂i,d′)1/2,  1 ≤ d < d′ ≤ K.
If K is finite, we then construct a sum-of-squares-type test statistic Si by combining the pairwise standardized statistics.
As in Cai and Xia (2014), it can be shown that the limiting null distribution of Si is a mixture of chi-square distributions. Based on this fact, we can further develop global and multiple testing procedures. When the environmental variable is binary, the statistics Si reduce to the statistics in (11) of Section 2.2. On the other hand, if the environmental variable is continuous, the testing problem is significantly different and beyond the scope of the current paper. We leave it to future research.
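The displays for the K-sample statistics are not reproduced above; as one possible reading (an assumption, not the paper's exact definition), Si can be formed as the sum of squares of all pairwise standardized differences, as sketched below with hypothetical arrays T and theta holding the per-group statistics Ti,d and variance estimates θ̂i,d of Section 2.2.

```python
import numpy as np
from itertools import combinations

def pairwise_sum_of_squares(T, theta):
    """S_i = sum over pairs (d, d') of [(T_{i,d} - T_{i,d'}) / sqrt(theta_{i,d} + theta_{i,d'})]^2."""
    K, p = T.shape
    S = np.zeros(p)
    for d1, d2 in combinations(range(K), 2):
        W = (T[d1] - T[d2]) / np.sqrt(theta[d1] + theta[d2])   # pairwise standardized statistics
        S += W ** 2
    return S
```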
8 Proofs
We prove the main results in this section. We begin by collecting technical lemmas that will be used in the proof of the main theorems.
8.1 Technical Lemmas
The first lemma is the classical Bonferroni inequality.
Lemma 1 (Bonferroni inequality)
Let B = ⋃1≤i≤p Bi. For any k < [p/2], we have
Σ1≤t≤2k (−1)t−1 Ft ≤ P(B) ≤ Σ1≤t≤2k−1 (−1)t−1 Ft,
where Ft = Σ1≤i1<⋯<it≤p P(Bi1 ∩ ⋯ ∩ Bit).
For d = 1, 2, let and . The following lemma is essentially proved in Liu and Luo (2014).
Lemma 2
Suppose that Conditions (C1), (C3), (7) and (8) hold. Then
where and with and . Consequently, uniformly in i = 1, …, p,
Lemma 3
Let Xk ~ N(μ1, Σ1) for k = 1, …, n1 and Yk ~ N(μ2, Σ2) for k = 1, …, n2.
Define
Then, for some constant C > 0, σ̃i,j,1 − σ̃i,j,2 satisfies the large deviation bound
uniformly for 0 ≤ x ≤ (8 log p)1/2 and any subset 𝑆 ⊆ {(i, j) : 1 ≤ i ≤ j ≤ p}.
The complete proof of this lemma can be found in the supplementary material of Xia et al. (2015).
8.2 Proof of Theorem 1
To prove Theorem 1, we first show that the terms in Aτ are negligible. Then we focus on the terms in ℋ\Aτ, where ℋ = {1, …, p}, and show that , where Wi is defined in (11).
Define
where , for d = 1, 2. By Lemma 2 in Xia et al. (2015), under conditions (7) and (8), we have
(23) |
Thus we have
(24) |
By Lemma 2, we have
where . Note that for i ∈ ℋ\Aτ, βi,d = o{(log p)−1}. Thus we have maxi∈ℋ\Aτ |Wi − Vi| = oP{(log p)−1/2}. For i ∈ Aτ,
Due to the fact that the indices of the random variables only show up in the second term here, by Lemma 3 and the condition that |Aτ| = O(pr) with r < 1/4, we have
where . Thus, it suffices to show that
Let q = |ℋ\Aτ| and let n2/n1 ≤ K1 with K1 ≥ 1. Define for 1 ≤ k ≤ n1 and for n1 + 1 ≤ k ≤ n1 + n2. Thus we have
Without loss of generality, we assume . Define
where Ẑk,i = Zk,iI(|Zk,i| ≤ τn) − E{Zk,iI(|Zk,i| ≤ τn)}, and τn = (4K1/K) log(p + n). Note that , and that
Hence, P{max1≤i≤q |Vi − V̂i| ≥ (log p)−1} ≤ P(max1≤i≤q max1≤k≤n1+n2 |Zk,i| ≥ τn) = O(p−1). By the fact that , it suffices to prove that for any t ∈ ℝ, as n, p → ∞,
(25) |
By Lemma 1, for any integer l with 0 < l < q/2,
(26) |
where yp = 2 log p − log log p + t and . Let Z̃k,i = Ẑk,i/(n2θi,1/n1 + θi,2)1/2 for i = 1, …, q and Wk = (Z̃k,i1, …, Z̃k,id), for 1 ≤ k ≤ n1 + n2. Define |a|min = min1≤i≤d |ai| for any vector a ∈ Rd. Then we have
Then it follows from Theorem 1 in Zaïtsev (1987) that
(27) |
where c1 > 0 and c2 > 0 are constants, εn → 0 which will be specified later, and Nd = (Nm1, …, Nmd) is a normal random vector with E(Nd) = 0 and Cov(Nd) = n1/n2 Cov(W1) + Cov(Wn1+1). Here d is a fixed integer that does not depend on n, p. Because log p = o(n1/5), we can let εn → 0 sufficiently slowly that, for any large M > 0,
(28) |
Combining (26), (27), and (28) we have
(29) |
Similarly, using Theorem 1 in Zaïtsev (1987) again, we can get
(30) |
The following lemma is shown in the supplementary material of Cai et al. (2013) with q ≍ p and yp = 2 log p − log log p + t.
Lemma 4
For any fixed integer d ≥ 1 and real number t ∈ ℝ,
(31) |
It then follows from Lemma 4, (29), and (30) that
for any positive integer l. By letting l → ∞, we obtain (25) and Theorem 1 is proved.
8.3 Proof of Theorem 2
Let . It follows from the proof of Theorem 1 that , as n, p → ∞. By (23), (24), and the inequalities
we have P(Mn ≥ qα + 2 log p − log log p) → 1 as n, p → ∞.
8.4 Proof of Theorem 3
To prove the lower bound, we first construct a worst case scenario to test between β1 and β2. We apply the arguments in Baraud (2002) to prove the result.
Without loss of generality, we assume , σi,i,d = 1, σi,j,d = 0, i ≠ j for d = 1, 2, and n1 = n2. Let m̂ be a random entry uniformly drawn from ℋ = {1, …, p}. We construct a class of β1, 𝒩 = {β(m̂), m̂ ∈ ℋ}, such that, βm̂,1 = ρ and βi,1 = 0 for i ≠ m̂, with ρ = c(log p/n)1/2, where c < 1/2 is a constant. Let β2 = 0 and β1 be uniformly distributed on 𝒩. Let μρ be the distribution on β1 − β2. Note that μρ is a probability measure on , where 𝑆1 is a class of p-dimensional vectors with one nonzero entry. Then the likelihood ratio between samples {Yk,1, Xk,·,1} and {Yk,2, Xk,·,2} can be calculated as
where Σ(m̂) = Ω(m̂)−1 is the covariance matrix of {Yk,1, Xk,·,1} and {Z1, …, Zn} are i.i.d samples generated from N(0, I). Because , Var(Yk,2) = 1 and Cov(Yk,d, Xk,i,d) = βi,dσi,i,d. It can be easily calculated that |Σ(m̂)| = 1 and with , and otherwise. Hence
With Ω(m) + Ω(m′) − 2I = (ai,j), it is easy to see that, when m ≠ m′, ai,i = ρ2 and a1,i = −ρ for i = m + 1 or m′ + 1, aj,i = ai,j and ai,j = 0 otherwise; when m = m′, ai,i = 2ρ2 and a1,i = −2ρ for i = m + 1, aj,i = ai,j and ai,j = 0 otherwise. Thus we have
where x1, x2, x3 are independent standard normal random variables. Because E(exp{ρ(x1x2 + x2x3)}) = 1 + ρ2 and E(exp{2ρx1x2}) = 1 + 2ρ2, we have
Theorem 3 is thus proved by Baraud (2002).
8.5 Proof of Theorem 4
We first show that t̂, as defined in Section 4.1, is attained in the interval [0, (2 log p)1/2]. We then show that Aτ is negligible and we focus on the set ℋ\Aτ. We then show the FDP result by dividing the null set into small subsets and controlling the variance of R0(t) for each subset, and the FDR result will thus also be proved.
Under the condition of Theorem 4, we have
with probability going to one. Hence, with probability tending to one, we have
Let tp = (2 log p − 2 log log p)1/2. Because , we have P(1 ≤ t̂ ≤ tp) → 1 according to the definition of t̂ in Section 4.1. For 0 ≤ t̂ ≤ tp,
Thus, to prove Theorem 4, it suffices to prove
in probability, uniformly for 0 ≤ t ≤ tp, where G(t) = 2(1 − Φ(t)) and p0 = |ℋ0|. We will show that it suffices to show
(32) |
in probability. We now consider two cases.
- If t = {2 log p + o(log p)}1/2, the proof of Theorem 1 yields that . Thus, it suffices to prove
in probability. We show in the proof of Theorem 1 that maxi∈ℋ0\Aτ |Wi − Vi| = oP{(log p)−1/2}. Thus it suffices to show (32).
- If t ≤ (C log p)1/2 for some C < 2, we have
in probability. Thus, it is again enough to show (32).
Let 0 ≤ t0 < t1 < ⋯ < tb = tp such that tι − tι−1 = υp for 1 ≤ ι ≤ b − 1 and tb − tb−1 ≤ υp, where . Thus we have b ~ tp/υp. For any t such that tι−1 ≤ t ≤ tι, by the fact that G(t + o((log p)−1/2))/G(t) = 1 + o(1) uniformly in 0 ≤ t ≤ c(log p)1/2 for any constant c, we have
Thus it suffices to prove
in probability. Define ℋ̃0 = ℋ0 \ Aτ. Note that
Thus, it suffices to show, for any ε > 0,
(33) |
Note that
We divide the indices i, j ∈ ℋ̃0 into the subsets ℋ̃01 = {i, j ∈ ℋ̃0, i = j}, ℋ̃02 = {i, j ∈ ℋ̃0 : i ∈ Γj(γ) or j ∈ Γi(γ)}, and ℋ̃03 = ℋ̃0 \ (ℋ̃01 ∪ ℋ̃02). Then we have
(34) |
We now show the equation (12). Note that . Because , we have . Note that
By definition, εk,d is independent of ηk,i,d + εk,dγi,1,d. Thus, we have
Note that
and that
We have . Thus
Note that, for i ∈ ℋ̃0, we have βi,d = O((log p)−2−τ) and so | Corr(Vi, Vj)| ≤ ξ < 1, where ξ = max{ξ1, ξ2} + ε with ξd defined in (C2) and ε < 1 − max{ξ1, ξ2}, for i, j ∈ ℋ̃02. Hence
(35) |
It remains to consider the subset ℋ̃03, in which Vi and Vj are weakly correlated. It is easy to check that maxi,j∈ ℋ̃03 P(|Vi| ≥ t, |Vj | ≥ t) = (1 + O{(log p)−1−γ})G2(t). Hence,
(36) |
Equation (33) and the FDP result then follow by combining (34), (35), and (36), and the FDR result is also proved.
Acknowledgments
The research of Yin Xia was supported in part by “The Recruitment Program of Global Experts” Youth Project from China, the startup fund from Fudan University and NSF Grant DMS-1612906.
The research of Tianxi Cai was supported in part by NIH Grants R01 GM079330, P50 MH106933, and U54 HG007963.
The research of Tony Cai was supported in part by NSF Grants DMS-1208982 and DMS-1403708, and NIH Grant R01 CA127334.
References
- Anderson TW. An Introduction to Multivariate Statistical Analysis. 3rd ed. Wiley-Interscience; New York: 2003.
- Baraud Y. Non-asymptotic minimax rates of testing in signal detection. Bernoulli. 2002;8(5):577–606.
- Benjamini Y, Yekutieli D. The control of the false discovery rate in multiple testing under dependency. Annals of Statistics. 2001:1165–1188.
- Cai T, Liu W, Xia Y. Two-sample covariance matrix testing and support recovery in high-dimensional and sparse settings. J. Am. Statist. Assoc. 2013;108(501):265–277.
- Cai TT, Xia Y. High-dimensional sparse MANOVA. Journal of Multivariate Analysis. 2014;131:174–196.
- D’Agostino R Sr, Vasan R, Pencina M, Wolf P, Cobain M, Massaro J, Kannel W. General cardiovascular risk profile for use in primary care: the Framingham Heart Study. Circulation. 2008;117(6):743. doi: 10.1161/CIRCULATIONAHA.107.699579.
- Hibi K, Ishigami T, Kimura K, Nakao M, Iwamoto T, Tamura K, Nemoto T, Shimizu T, Mochida Y, Ochiai H, et al. Angiotensin-converting enzyme gene polymorphism adds risk for the severity of coronary atherosclerosis in smokers. Hypertension. 1997;30(3):574–579. doi: 10.1161/01.hyp.30.3.574.
- Humphries S, Yiannakouris N, Talmud P. Cardiovascular disease risk prediction using genetic information (gene scores): is it really informative? Current Opinion in Lipidology. 2008;19(2):128. doi: 10.1097/MOL.0b013e3282f5283e.
- Hunter DJ. Gene–environment interactions in human diseases. Nature Reviews Genetics. 2005;6(4):287–298. doi: 10.1038/nrg1578.
- Ikeda S, Sasazuki S, Natsukawa S, Shaura K, Koizumi Y, Kasuga Y, Ohnami S, Sakamoto H, Yoshida T, Iwasaki M, et al. Screening of 214 single nucleotide polymorphisms in 44 candidate cancer susceptibility genes: a case–control study on gastric and colorectal cancers in the Japanese population. The American Journal of Gastroenterology. 2008;103(6):1476–1487. doi: 10.1111/j.1572-0241.2008.01810.x.
- Javanmard A, Montanari A. Hypothesis testing in high-dimensional regression under the Gaussian random design model: Asymptotic theory. 2013.
- Javanmard A, Montanari A. Confidence intervals and hypothesis testing for high-dimensional regression. The Journal of Machine Learning Research. 2014;15(1):2869–2909.
- Kannel W, Feinleib M, McNamara P, Garrison R, Castelli W. An investigation of coronary heart disease in families: The Framingham Offspring Study. American Journal of Epidemiology. 1979;110(3):281–290. doi: 10.1093/oxfordjournals.aje.a112813.
- Liu L, Zhong R, Wei S, Xiang H, Chen J, Xie D, Yin J, Zou L, Sun J, Chen W, et al. The leptin gene family and colorectal cancer: interaction with smoking behavior and family history of cancer. PLoS ONE. 2013;8(4):e60777. doi: 10.1371/journal.pone.0060777.
- Liu W, Luo S. Hypothesis testing for high-dimensional regression models. Technical report. 2014.
- Lloyd-Jones D, Wilson P, Larson M, Beiser A, Leip E, D’Agostino R, Levy D. Framingham risk score and prediction of lifetime risk for coronary heart disease. The American Journal of Cardiology. 2004;94(1):20–24. doi: 10.1016/j.amjcard.2004.03.023.
- Matsouaka RA, Li J, Cai T. Evaluating marker-guided treatment selection strategies. Biometrics. 2014;70(3):489–499. doi: 10.1111/biom.12179.
- McCarthy MI, Abecasis GR, Cardon LR, Goldstein DB, Little J, Ioannidis JP, Hirschhorn JN. Genome-wide association studies for complex traits: consensus, uncertainty and challenges. Nature Reviews Genetics. 2008;9(5):356–369. doi: 10.1038/nrg2344.
- Paynter N, Chasman D, Buring J, Shiffman D, Cook N, Ridker P. Cardiovascular disease risk prediction with and without knowledge of genetic variation at chromosome 9p21.3. Annals of Internal Medicine. 2009;150(2):65. doi: 10.7326/0003-4819-150-2-200901200-00003.
- Pepe MS. The Statistical Evaluation of Medical Tests for Classification and Prediction. Oxford University Press; 2003.
- Ridker P, Buring J, Rifai N, Cook N. Development and validation of improved algorithms for the assessment of global cardiovascular risk in women: the Reynolds Risk Score. Journal of the American Medical Association. 2007;297(6):611. doi: 10.1001/jama.297.6.611.
- Ross R. Atherosclerosis is an inflammatory disease. American Heart Journal. 1999;138(5):S419–S420. doi: 10.1016/s0002-8703(99)70266-8.
- Sayed-Tabatabaei F, Schut A, Hofman A, Bertoli-Avella A, Vergeer J, Witteman J, van Duijn C. A study of gene–environment interaction on the gene for angiotensin converting enzyme: a combined functional and population based approach. Journal of Medical Genetics. 2004;41(2):99–103. doi: 10.1136/jmg.2003.013441.
- Schut AF, Sayed-Tabatabaei FA, Witteman JC, Avella AM, Vergeer JM, Pols HA, Hofman A, Deinum J, van Duijn CM. Smoking-dependent effects of the angiotensin-converting enzyme gene insertion/deletion polymorphism on blood pressure. Journal of Hypertension. 2004;22(2):313–319. doi: 10.1097/00004872-200402000-00015.
- Stephens JW, Bain SC, Humphries SE. Gene–environment interaction and oxidative stress in cardiovascular disease. Atherosclerosis. 2008;200(2):229–238. doi: 10.1016/j.atherosclerosis.2008.04.003.
- Van de Geer S, Bühlmann P, Ritov Y, Dezeure R. On asymptotically optimal confidence regions and tests for high-dimensional models. The Annals of Statistics. 2014;42(3):1166–1202.
- Venter JC, Adams MD, Myers EW, Li PW, Mural RJ, Sutton GG, Smith HO, Yandell M, Evans CA, Holt RA, et al. The sequence of the human genome. Science. 2001;291(5507):1304–1351. doi: 10.1126/science.1058040.
- Wilson P, D’Agostino R, Levy D, Belanger A, Silbershatz H, Kannel W. Prediction of coronary heart disease using risk factor categories. Circulation. 1998;97(18):1837. doi: 10.1161/01.cir.97.18.1837.
- Xia Y, Cai T, Cai TT. Testing differential network with applications to detecting gene by gene interactions. Biometrika. 2015;102:247–266. doi: 10.1093/biomet/asu074.
- Zaïtsev AY. On the Gaussian approximation of convolutions under multidimensional analogues of S. N. Bernstein’s inequality conditions. Probability Theory and Related Fields. 1987;74(4):535–566.
- Zhang C-H, Zhang SS. Confidence intervals for low dimensional parameters in high dimensional linear models. Journal of the Royal Statistical Society: Series B. 2014;76(1):217–242.