Abstract
This paper is concerned with testing linear hypotheses in high-dimensional generalized linear models. To deal with linear hypotheses, we first propose a constrained partial regularization method and study its statistical properties. We further introduce an algorithm for solving regularization problems with folded-concave penalty functions and linear constraints. To test linear hypotheses, we propose a partial penalized likelihood ratio test, a partial penalized score test and a partial penalized Wald test. We show that the limiting null distributions of these three test statistics are χ2 distributions with the same degrees of freedom, and that under local alternatives, they asymptotically follow non-central χ2 distributions with the same degrees of freedom and noncentrality parameter, provided the number of parameters involved in the test hypothesis grows to ∞ at a certain rate. Simulation studies are conducted to examine the finite sample performance of the proposed tests. Empirical analysis of a real data example is used to illustrate the proposed testing procedures.
Keywords: High-dimensional testing, Linear hypothesis, Likelihood ratio statistics, Score test, Wald test
1. Introduction.
During the last three decades, much work has been devoted to developing variable selection techniques for high dimensional regression models. Fan and Lv (2010) present a selective overview of this topic. There have been some recent works on hypothesis testing for the Lasso (Tibshirani, 1996) in high-dimensional linear models. Lockhart et al. (2014) proposed the covariance test, which produces a sequence of p-values as the tuning parameter λn decreases and features become nonzero in the Lasso. This approach does not give confidence intervals or p-values for an individual variable's coefficient. Taylor et al. (2014) and Lee et al. (2016) extended the covariance testing framework to test hypotheses about individual features, after conditioning on a model selected by the Lasso. However, their framework permits inference only about features which have nonzero coefficients in a Lasso regression; this set of features likely varies across samples, making the interpretation difficult. Moreover, these works focused on high dimensional linear regression models, and it remains unknown whether their results can be extended to a more general setting.
This paper focuses on generalized linear models (GLM, McCullagh and Nelder, 1989). Let Y be the response, and X be its associated fixed-design covariate vector. The GLM assumes that the distribution of Y belongs to the exponential family. With the canonical link, the exponential family has the following probability density function
(1.1) f(y | X; β0, ϕ0) = c(y, ϕ0) exp{(yXTβ0 − b(XTβ0))/ϕ0},
where β0 is a p-dimensional vector of regression coefficients, and ϕ0 is some positive nuisance parameter. In this paper, we assume that b(·) is thrice continuously differentiable with b′′(·) > 0.
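For concreteness, three canonical members of this family appear throughout the paper, with the standard cumulant functions

b(θ) = θ2/2 (Gaussian linear model), b(θ) = log(1 + eθ) (logistic regression), b(θ) = eθ (Poisson regression),

and in each case b′′(θ) > 0, as required.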
We study testing the linear hypothesis H0: Cβ0,M = t in GLM, where β0,M is a subvector of β0, the true regression coefficient vector. The number of covariates p can be much larger than the sample size n, while the number of parameters in β0,M is assumed to be much smaller than n. Such hypotheses are of particular interest when the goal is to explore the group structure of β0. Moreover, they include the very important class of hypotheses H0: β0,M = 0, obtained by setting C to be the identity matrix and t = 0. In the literature, Fan and Peng (2004) proposed a penalized likelihood ratio test for H0a : Cβ0,S = 0 in GLM, where β0,S is the vector consisting of all nonzero elements of β0, when p = o(n1/5), where n stands for the sample size. Wang and Cui (2013) extended Fan and Peng (2004)'s proposal and considered a penalized likelihood ratio statistic for testing H0: β0,M = 0, again requiring p = o(n1/5). Ning and Liu (2017) proposed a decorrelated score test for H0: β0,M = 0 under the setting of high dimensional penalized M-estimators with nonconvex penalties. Recently, Fang, Ning and Liu (2017) extended the proposal of Ning and Liu (2017) and developed a class of decorrelated Wald, score and partial likelihood ratio tests for Cox's model with high dimensional survival data. Zhang and Cheng (2017) proposed a maximal-type statistic based on the desparsified Lasso estimator (van de Geer et al., 2014) and a bootstrap-assisted testing procedure for H0: β0,M = 0, allowing M to be an arbitrary subset of [1,...,p]. In this paper, we aim to develop the theory of the Wald test, score test and likelihood ratio test for H0: Cβ0,M = t in GLM under the ultrahigh dimensional setting (i.e., p grows exponentially with n).
It is well known that the Wald, score and likelihood ratio tests are equivalent in the fixed p case. However, it can be challenging to generalize these statistics to the setting with ultrahigh dimensionality. To better understand this point, we take the Wald statistic for illustration. Consider the null hypothesis H0: β0,M = 0. Analogous to the classical Wald statistic, in the high dimensional setting one might consider a quadratic-form statistic in β̂M, for some penalized regression estimator β̂ and an estimator of its variance. The choice of the estimator is essential here: penalized regression estimators such as the Lasso or the Dantzig estimator (Candes and Tao, 2007) cannot be used due to their large biases when p ≫ n. The non-concave penalized estimator does not have this bias issue, but the minimal signal conditions imposed in Fan and Peng (2004) and Fan and Lv (2011) imply that the associated Wald statistic does not have any power for local alternatives of the type β0,M = hn for some sequence hn such that ∥hn∥2 ≪ λn, where ∥ · ∥2 is the Euclidean norm. Moreover, to implement the score and the likelihood ratio statistics, we need to estimate the regression parameter under the null, which involves penalized likelihood estimation under linear constraints. This is a very challenging task and has rarely been studied: (a) the associated estimation and variable selection theory is not standard from a theoretical perspective, and (b) there is a lack of constrained optimization algorithms that can produce sparse estimators from a computational perspective.
We briefly summarize our contributions as follows. First, we consider the more general form of hypothesis H0: Cβ0,M = t, whereas the existing literature mainly focuses on testing H0: β0,M = 0. Besides, we also allow the number of linear constraints to diverge with n. Our tests are therefore applicable to a wider range of real applications involving a growing set of linear hypotheses. Second, we propose a partial penalized Wald, a partial penalized score and a partial penalized likelihood-ratio statistic based on the class of folded-concave penalty functions, and show their equivalence in the high dimensional setting. We derive the asymptotic distributions of our test statistics under the null hypothesis and the local alternatives. Third, we systematically study the partial penalized estimator with linear constraints. We derive its rate of convergence and limiting distribution. These results are significant in their own right. The unconstrained and constrained estimators share similar forms, but the constrained estimator is more efficient, due to the additional information contained in the constraints under the null hypothesis. Fourth, we introduce an algorithm for solving regularization problems with folded-concave penalty functions and equality constraints, based on the alternating direction method of multipliers (ADMM, cf. Boyd et al., 2011).
The rest of the paper is organized as follows. We study the statistical properties of the constrained partial penalized estimator with folded concave penalty functions in Section 2. We formally define our partial penalized Wald, score and likelihood-ratio statistics, establish their limiting distributions, and show their equivalence in Section 3. Detailed implementations of our testing procedures are given in Section 3.3, where we introduce our algorithm for solving the constrained partial penalized regression problems. Simulation studies are presented in Section 4. The proof of Theorem 3.1 is presented in Section 5. Other proofs and additional numerical results are presented in the supplementary material (Shi et al., 2018).
2. Constrained partial penalized regression.
2.1. Model setup.
Suppose that {Xi,Yi}, i = 1,··· ,n, is a sample from model (1.1). Denote by Y = (Y1,...,Yn)T the n-dimensional response vector and by X = (X1,···,Xn)T the n×p design matrix. We assume the covariates Xi are fixed by design. Let Xj denote the jth column of X. To simplify the presentation, for any r×q matrix Φ and any set J ⊆ [1,2,...,q], we denote by ΦJ the submatrix of Φ formed by the columns in J. Similarly, for any q-dimensional vector ϕ, ϕJ stands for the subvector of ϕ formed by the elements in J. We further denote by ΦJ1,J2 the submatrix of Φ formed by the rows in J1 and the columns in J2, for any J1 ⊆ [1,...,r] and J2 ⊆ [1,...,q]. Let |J| be the number of elements in J, and define Jc = [1,...,q] − J to be the complement of J.
In this paper, we assume log p = O(na) for some 0 < a < 1 and focus on the following testing problem:
(2.1) H0: Cβ0,M = t versus Ha: Cβ0,M ≠ t,
for a given index set M ⊆ [1,...,p], a given r×m matrix C and an r-dimensional vector t, where m = |M|. We assume that the matrix C is of full row rank. This implies that there are no redundant or contradictory constraints in (2.1), and hence r ≤ m.
Define the partial penalized likelihood function

Qn(β) = Ln(β) − ∑j∈Mc pλ(|βj|), where Ln(β) = ∑i{YiβT Xi − b(βT Xi)}/n,

for some penalty function pλ(·) with a tuning parameter λ. Further define
(2.2) β̂0 = argmax{β: CβM = t} Qn(β),
(2.3) β̂a = argmaxβ Qn(β).
Note that in (2.2) and (2.3), we do not add penalties to the parameters involved in the constraints. This enables us to avoid imposing a minimal signal condition on the elements of β0,M. Thus, the corresponding likelihood ratio test, Wald test and score test have power at local alternatives.
We present a lemma characterizing the constrained local maximizer in the supplementary material (see Lemma S.1). In Section 3, we show that these partial penalized estimators help us obtain valid statistical inference about the null hypothesis.
2.2. Partial penalized regression with linear constraint.
In this section, we study the statistical properties of β̂0 and β̂a by restricting pλ to the class of folded concave penalty functions. Popular penalty functions such as SCAD (Fan and Li, 2001) and MCP (Zhang, 2010) belong to this class. Let ρ(t0,λ) = pλ(t0)/λ for λ > 0. We assume that ρ(t0,λ) is increasing and concave in t0 ∈ [0,∞), and has a continuous derivative ρ′(t0,λ) with ρ′(0+,λ) > 0. In addition, assume that ρ′(t0,λ) is increasing in λ ∈ (0,∞) and that ρ′(0+,λ) is independent of λ. For any vector v = (v1,...,vq)T, define

ρ̄′(v,λ) = (ρ′(|v1|,λ)sgn(v1),...,ρ′(|vq|,λ)sgn(vq))T,
where sgn(·) denotes the sign function. We further define the local concavity of the penalty function ρ at v with ∥v∥0 = q as

κ(ρ,v,λ) = limε→0+ max1≤j≤q sup{t1<t2∈(|vj|−ε,|vj|+ε)} −{ρ′(t2,λ) − ρ′(t1,λ)}/(t2 − t1).
We assume that the true regression coefficient β0 is sparse and satisfies Cβ0,M − t = hn for some sequence of vectors hn → 0. When hn = 0, the null holds; otherwise, the alternative holds. Let S denote the support of β0 outside M, that is, S = {j ∈ Mc : β0,j ≠ 0}, and let s = |S|. Let dn be the half minimum signal of β0,S, i.e., dn = minj∈S |β0,j|/2. We impose the following conditions.
(A1) Assume that
for some constant c > 0, where, for any vector v = (v1,...,vq)T, diag(v) denotes a diagonal matrix with the jth diagonal element being vj, |v| = (|v1|,...,|vq|)T, and ∥B∥2,∞ = supv:∥v∥2=1 ∥Bv∥∞ for any matrix B with q rows.
(A2) Assume that where , for j = 0,a.
(A3) Assume that there exist some constants M and v0 such that
(A4) Assume that , and λmax ((CCT)−1) = O(1).
In Section S4.1 of the supplementary material, we show that Condition (A1) holds with probability tending to 1 if the covariate vectors X1,...,Xn are uniformly bounded or are realizations from a sub-Gaussian distribution. The first condition in (A2) is a minimum signal assumption imposed on the nonzero elements outside M only. This is due to the partial penalization, which enables us to evaluate the uncertainty of the estimation for small signals in β0,M. Such conditions are not assumed in van de Geer et al. (2014) and Ning and Liu (2017) for testing H0: β0,M = 0. However, we note that these authors impose some additional assumptions on the design matrix. For example, the validity of the decorrelated score statistic depends on the sparsity of w*. For testing univariate parameters, this requires the degree of a particular node in the graph to be relatively small when the covariate follows a Gaussian graphical model (see Remark 6 in Ning and Liu, 2017). In Section S4.3 of the supplementary material, we show that Condition (A3) holds for linear, logistic and Poisson regression models.
Theorem 2.1. Suppose that Conditions (A1)-(A4) hold, and , then the following holds: (i) With probability tending to 1, and defined in (2.2) and (2.3) must satisfy . (ii) and . If further s + m =o(n1/3), then we have
where I is the identity matrix, Kn is the (m + s) × (m + s) matrix
and Pn is the (m + s) × (m + s) projection matrix
where Or×s is an r × s zero matrix.
Remark 2.1. Since dn ≫ √((s + m)/n), Theorem 2.1(ii) implies that each element of β̂0,S and β̂a,S is nonzero. This together with result (i) shows the sign consistency of β̂0 and β̂a.
Remark 2.2. Theorem 2.1 implies that the constrained estimator converges at a rate of √((s + m − r)/n). In contrast, the unconstrained estimator β̂a defined in (2.3) converges at a rate of √((s + m)/n). This suggests that when hn is relatively small, the constrained estimator converges faster than the unconstrained one whenever s + m − r ≪ s + m. This result is expected, with the following intuition: the more information about β0 we have, the more accurate the estimator will be.
Remark 2.3. Under certain regularity conditions, Theorem 2.1 implies that
where ξ0 and V0 are the limits of the corresponding asymptotic bias and variance terms, respectively. Similarly, we can show
where Va denotes the corresponding limiting covariance matrix. Note that aT V0a ≤ aT Vaa for any a ∈ ℝs+m. Under the null, we have ξ0 = 0, which suggests that β̂0 is more efficient than β̂a in terms of a smaller asymptotic variance. Under the alternative, β̂0 is asymptotically biased. This can be interpreted as a bias-variance trade-off between β̂0 and β̂a.
3. Partial penalized Wald, score and likelihood ratio statistics.
3.1. Test statistics.
We begin by introducing our partial penalized likelihood ratio statistic,
(3.1) TL = 2n{Ln(β̂a) − Ln(β̂0)}/φ̂,
where Ln(β) = ∑i{YiβT Xi − b(βT Xi)}/n, β̂0 and β̂a are defined in (2.2) and (2.3) respectively, and φ̂ is some consistent estimator of ϕ0 in (1.1). For Gaussian linear models, ϕ0 corresponds to the error variance. For logistic or Poisson regression models, ϕ0 = 1.
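The log-likelihood Ln and the statistic TL are straightforward to compute. The following Python sketch (our own illustration, using the reconstructed form of (3.1); the function names are hypothetical) makes the computation explicit:

```python
import numpy as np

# Cumulant functions b(theta) for the three canonical GLMs in this paper.
B_FUNS = {
    "gaussian": lambda t: 0.5 * t ** 2,
    "logistic": lambda t: np.logaddexp(0.0, t),  # log(1 + e^t), overflow-safe
    "poisson": np.exp,
}

def log_lik(beta, X, Y, b):
    """L_n(beta) = (1/n) * sum_i {Y_i * beta^T X_i - b(beta^T X_i)}."""
    eta = X @ beta
    return np.mean(Y * eta - b(eta))

def lr_statistic(beta0_hat, beta_a_hat, X, Y, b, phi_hat):
    """T_L = 2n {L_n(beta_a_hat) - L_n(beta0_hat)} / phi_hat, as in (3.1)."""
    n = X.shape[0]
    return 2.0 * n * (log_lik(beta_a_hat, X, Y, b) - log_lik(beta0_hat, X, Y, b)) / phi_hat
```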
The partial penalized Wald statistic is based on Cβ̂a,M − t. Define Ωn = Kn−1 and denote by Ωmm its submatrix formed by the first m rows and columns. It follows from Theorem 2.1 that its asymptotic variance is equal to CΩmmCT. Let Ŝa denote the support of β̂a outside M. Then, with probability tending to 1, we have Ŝa = S. Define Ω̂n as the plug-in estimate of Ωn based on (β̂a, Ŝa), and Ω̂mm as its submatrix formed by its first m rows and columns. The partial penalized Wald statistic is defined by
(3.2) TW = n(Cβ̂a,M − t)T(CΩ̂mmCT)−1(Cβ̂a,M − t)/φ̂.
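A corresponding sketch of the Wald statistic, under the reconstructed quadratic form of (3.2) with a plug-in estimate Ω̂mm (a minimal illustration; names are hypothetical):

```python
import numpy as np

def wald_statistic(beta_a_M, C, t_vec, Omega_mm_hat, phi_hat, n):
    """T_W = n (C beta_a_M - t)^T (C Omega_mm C^T)^{-1} (C beta_a_M - t) / phi_hat."""
    d = C @ beta_a_M - t_vec
    Psi_hat = C @ Omega_mm_hat @ C.T          # estimated asymptotic variance of C beta_a_M
    return n * d @ np.linalg.solve(Psi_hat, d) / phi_hat
```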
Analogous to the classical score statistic, we define our partial penalized score statistic as
(3.3) TS = n{∇Ln(β̂0)}T K̂0−1 {∇Ln(β̂0)}/φ̂,

where the gradient ∇Ln(β̂0) is restricted to the coordinates in M ∪ Ŝ0, Ŝ0 denotes the support of β̂0 outside M, and K̂0 is the plug-in estimate of Kn based on (β̂0, Ŝ0).
3.2. Limiting distributions of the test statistics.
For a given significance level α, we reject the null hypothesis when T > χ2α(r) for T = TL, TW or TS, where χ2α(r) is the upper α-quantile of a central χ2 distribution with r degrees of freedom and r is the number of constraints. Assume r is fixed. When φ̂ is consistent for ϕ0, it follows from Theorem 2.1 that TL, TW and TS converge asymptotically to a (non-central) χ2 distribution with r degrees of freedom. However, when r diverges with n, there is no such theoretical guarantee, because the concept of weak convergence is not well defined in such settings.
To resolve this issue, we observe that when the following holds,

supx |Pr(T ≤ x) − Pr{χ2(r,γn) ≤ x}| → 0,

where χ2(r,γn) is a chi-square random variable with r degrees of freedom and noncentrality parameter γn, which is allowed to vary with n, our testing procedure remains valid using the χ2 approximation.
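Operationally, the χ2 approximation means that p-values and rejection decisions can be computed from central χ2(r) quantiles, as in the following short sketch:

```python
from scipy.stats import chi2

def p_value_and_decision(T, r, alpha=0.05):
    """Calibrate any of T_L, T_W, T_S against the chi2(r) null approximation."""
    p_value = chi2.sf(T, df=r)                     # Pr{chi2(r) > T}
    return p_value, T > chi2.ppf(1 - alpha, r)     # reject when T exceeds the upper-alpha quantile
```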
Theorem 3.1. Assume Conditions (A1)-(A4) hold, s + m = o(n1/3), and . Further assume the following holds:
(3.4) |
Then, we have
(3.5) supx |Pr(T ≤ x) − Pr{χ2(r,γn) ≤ x}| → 0
for T = TW, TS or TL, where γn = n hnT(CΩmmCT)−1hn/ϕ0.
REMARK 3.1. By (3.5), it is immediate to see that

supx |Pr(T1 ≤ x) − Pr(T2 ≤ x)| → 0
for any T1,T2 ∈ {TW ,TS,TL}. This establishes the equivalence between the partial penalized Wald, score and likelihood-ratio statistics. Condition (3.4) is the key to guaranteeing the χ2 approximation in (3.5). When r = O(1), this condition is equivalent to
which corresponds to the Lyapunov condition that ensures the asymptotic normality of the underlying estimators. When r diverges, (3.4) guarantees that the following Lyapunov-type bound goes to 0,
where Z represents an r-dimensional multivariate normal vector with identity covariance matrix, and the supremum is taken over all convex subsets of ℝr. The scaling factor r1/4 accounts for the dependence of the above Lyapunov-type estimate on the dimension, and it remains unknown whether the factor r1/4 can be improved (see related discussions in Bentkus, 2004).
REMARK 3.2. Theorem 3.1 implies that our testing procedures are consistent. When the null holds, we have hn = 0 and hence γn = 0. This together with (3.5) suggests that our tests have correct size under the null. Under the alternative, we have hn ≠ 0 and hence γn ≠ 0. Since χ2(r,0) is stochastically smaller than χ2(r,γn), (3.5) implies that our tests have non-negligible power under Ha. We summarize these results in the following corollary.
COROLLARY 3.1. Assume Conditions (A1)-(A3) and (3.4) hold, s + m = o(n1/3), λmax((CCT)−1) = O(1), and φ̂ is consistent for ϕ0. Then, under the null hypothesis, for any 0 < α < 1, we have

Pr{T > χ2α(r)} → α

for T = TW, TL and TS, where χ2α(r) is the critical value of the χ2-distribution with r degrees of freedom at level α. Under the alternatives Cβ0,M − t = hn, for sequences hn satisfying the same condition as in (A4), we have, for any 0 < α < 1 and T = TW, TS and TL,

Pr{T > χ2α(r)} − Pr{χ2(r,γn) > χ2α(r)} → 0,

where γn = n hnT(CΩmmCT)−1hn/ϕ0.
REMARK 3.3. Corollary 3.1 shows that the asymptotic power functions of the proposed test statistics are
(3.6) Pr{χ2(r,γn) > χ2α(r)}.
It follows from Theorem 2 in Ghosh (1973) that the asymptotic power function decreases as r increases for a given γn. This is the same as for the traditional likelihood ratio, score and Wald tests. However, hn is an r-dimensional vector in our setting. Thus, one may easily construct an example in which γn grows as r increases. As a result, the asymptotic power function need not be a monotone increasing function of r.
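The asymptotic power function (3.6) can be evaluated numerically with the noncentral χ2 distribution; the small sketch below (our own illustration) shows the two competing effects of r:

```python
from scipy.stats import chi2, ncx2

def asymptotic_power(r, gamma_n, alpha=0.05):
    """Power approximation (3.6): Pr{chi2(r, gamma_n) > chi2_alpha(r)}."""
    crit = chi2.ppf(1 - alpha, df=r)
    return ncx2.sf(crit, df=r, nc=gamma_n)

# For fixed gamma_n the power decreases in r (Ghosh, 1973); if gamma_n grows
# with r, e.g. gamma_n = 2r, the power need not be monotone in r.
for r in (1, 4, 8, 12):
    print(r, round(asymptotic_power(r, gamma_n=5.0), 3),
             round(asymptotic_power(r, gamma_n=2.0 * r), 3))
```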
In Section S3 of Shi et al. (2018), we study in depth how penalizing individual coefficients affects the power, and find that the tests are most advantageous if each unpenalized variable is either an important variable (i.e., in S) or a variable in M.
REMARK 3.4. Notice that the null hypothesis reduces to H0: β0,M = 0 if we set C to be the identity matrix and t = 0. The Wald test based on the desparsified Lasso estimator (van de Geer et al., 2014) and the decorrelated score test (Ning and Liu, 2017) can also be applied to testing such hypotheses. Based on (3.6), we show in Section S1 of Shi et al. (2018) that these two tests achieve less power than the proposed partial penalized tests. This is due to the increased variances of the de-sparsified Lasso estimator and the decorrelated score statistic after the debiasing procedure.
3.3. Some implementation issues.
3.3.1. Constrained partial penalized regression.
To construct our test statistics, we need to compute the partial penalized estimators β̂0 and β̂a. Our algorithm is based upon the alternating direction method of multipliers (ADMM), which is a variant of the standard augmented Lagrangian method. Below, we present our algorithm for estimating β̂0; the unconstrained estimator β̂a can be computed similarly. For a fixed regularization parameter λ, define β̂0(λ) as the maximizer of Qn(β) subject to CβM = t.

The above optimization problem is equivalent to

(3.7) minβ,θ −Ln(β) + ∑j∈Mc pλ(|θj|) subject to CβM = t, β = θ.

The augmented Lagrangian for (3.7) is

Lρ(β,θ,v) = −Ln(β) + ∑j∈Mc pλ(|θj|) + vT(β − θ) + (ρ/2)∥β − θ∥22, with β restricted to {β : CβM = t},

for a given ρ > 0. Applying the dual ascent method yields the following updates

βk+1 = argmin{β: CβM=t} Lρ(β, θk, vk), θk+1 = argminθ Lρ(βk+1, θ, vk), vk+1 = vk + ρ(βk+1 − θk+1),

for the (k + 1)th iteration.
Since Ln is twice differentiable, βk+1 can be obtained by the Newton-Raphson algorithm. θk+1 may have a closed form for some popular penalties such as the Lasso, SCAD or MCP penalty. In our implementation, we use the SCAD penalty (Fan and Li, 2001),

pλ(t) = λt I(t ≤ λ) + {2aλt − t2 − λ2}/{2(a − 1)} I(λ < t ≤ aλ) + {(a + 1)λ2/2} I(t > aλ), t ≥ 0,

and set a = 3.7 and ρ = 1.
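The following Python sketch (our own schematic, not the authors' code) instantiates this ADMM for the Gaussian case, where the β-update reduces to a single equality-constrained least-squares solve via a KKT system and the θ-update is the SCAD thresholding rule with ρ = 1; the function names and the embedding of C into ℝp are illustrative assumptions:

```python
import numpy as np

def scad_threshold(z, lam, a=3.7):
    """Closed-form minimizer of p_lam(|t|) + 0.5 * (t - z)^2 (SCAD with rho = 1)."""
    az = np.abs(z)
    return np.where(
        az <= 2.0 * lam,
        np.sign(z) * np.maximum(az - lam, 0.0),                       # soft-thresholding zone
        np.where(az <= a * lam,
                 ((a - 1.0) * z - np.sign(z) * a * lam) / (a - 2.0),  # interpolation zone
                 z),                                                  # no shrinkage zone
    )

def admm_constrained_gaussian(X, Y, M_idx, C, t_vec, lam, rho=1.0, n_iter=500):
    """Sketch of the ADMM of Section 3.3.1 for the Gaussian case, using the
    splitting beta = theta and leaving the coefficients in M unpenalized."""
    n, p = X.shape
    r = C.shape[0]
    penalized = np.ones(p, dtype=bool)
    penalized[M_idx] = False                      # no penalty on coefficients in M
    Cfull = np.zeros((r, p))
    Cfull[:, M_idx] = C                           # embed C beta_M = t into R^p
    K = np.zeros((p + r, p + r))                  # KKT matrix for the beta-update
    K[:p, :p] = X.T @ X / n + rho * np.eye(p)
    K[:p, p:] = Cfull.T
    K[p:, :p] = Cfull
    beta, theta, v = np.zeros(p), np.zeros(p), np.zeros(p)
    for _ in range(n_iter):
        rhs = np.concatenate([X.T @ Y / n + rho * theta - v, t_vec])
        beta = np.linalg.solve(K, rhs)[:p]        # equality-constrained LS step
        z = beta + v / rho
        theta = np.where(penalized, scad_threshold(z, lam), z)
        v = v + rho * (beta - theta)              # dual ascent step
    return theta                                  # sparse constrained estimate
```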
To obtain β̂0, we compute β̂0(λ) for a series of log-spaced values of λ in [λmin, λmax] for some 0 < λmin < λmax. Then we choose λ̂ by minimizing the following information criterion:

IC(λ) = −2nLn(β̂0(λ)) + cn∥β̂0(λ)∥0,

where cn = max{log n, log(log n) log p}. Using arguments similar to those in Schwarz (1978) and Fan and Tang (2013), we can show that such an information criterion is consistent in both the fixed-p and ultrahigh-dimensional settings.
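A hedged sketch of the tuning-parameter search, reusing the ADMM sketch above; the n log(RSS/n) goodness-of-fit term is our assumption (a GIC form in the spirit of Fan and Tang, 2013), while cn follows the text:

```python
import numpy as np

def select_lambda(X, Y, M_idx, C, t_vec, lambdas):
    """Pick lambda minimizing n*log(RSS/n) + c_n * df over a grid of lambdas."""
    n, p = X.shape
    c_n = max(np.log(n), np.log(np.log(n)) * np.log(p))
    best_lam, best_beta, best_ic = None, None, np.inf
    for lam in lambdas:
        beta = admm_constrained_gaussian(X, Y, M_idx, C, t_vec, lam)
        rss = np.sum((Y - X @ beta) ** 2)
        ic = n * np.log(rss / n) + c_n * np.count_nonzero(beta)
        if ic < best_ic:
            best_lam, best_beta, best_ic = lam, beta, ic
    return best_lam, best_beta
```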
3.3.2. Estimation of the nuisance parameter.
It can be shown that ϕ0 = 1 for logistic or Poisson regression models. In linear regression models, ϕ0 equals the error variance. In our implementation, we estimate ϕ0 by the residual variance of the fit based on β̂a, where β̂a is defined in (2.3).
In Section S2 of the supplementary material (Shi et al., 2018), we show that φ̂ is consistent for ϕ0 under the conditions in Theorem 2.1; the argument relies on the selection consistency of β̂a. Alternatively, one can estimate ϕ0 using refitted cross-validation (Fan, Guo and Hao, 2012) or the scaled lasso (Sun and Zhang, 2013).
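In code, a natural residual-based sketch is the following; the degrees-of-freedom correction by the number of nonzero coefficients is an assumption of this illustration:

```python
import numpy as np

def estimate_phi(X, Y, beta_a_hat):
    """Residual-variance estimate of phi_0 in the linear model."""
    resid = Y - X @ beta_a_hat
    return resid @ resid / (len(Y) - np.count_nonzero(beta_a_hat))
```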
4. Numerical Examples.
In this section, we examine the finite sample performance of the proposed tests. Simulation results for linear regression and logistic regression are presented in the main text. In the supplementary material (Shi et al., 2018), we present simulation results for Poisson log-linear model and illustrate the proposed methodology by a real data example.
4.1. Linear regression.
Simulated data with sample size n = 100 were generated from the linear model

Y = XTβ0 + ε,

where ε ∼ N(0,1), X ∼ N(0p,Σ), and h(1) and h(2) are some constants. The true value of β0 is specified in terms of h(1), h(2) and zero blocks, where 0q denotes a zero vector of length q.
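The design can be simulated as follows (a sketch; β0 must be supplied by the user since its exact specification depends on h(1) and h(2)):

```python
import numpy as np

def simulate_linear(beta0, n=100, rho=0.5, seed=0):
    """Generate (X, Y) with Y = X^T beta0 + eps, eps ~ N(0, 1), and
    X ~ N(0_p, Sigma), where Sigma_ij = rho^|i-j| (Sigma = I when rho = 0)."""
    rng = np.random.default_rng(seed)
    p = len(beta0)
    Sigma = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    Y = X @ beta0 + rng.standard_normal(n)
    return X, Y
```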
4.1.1. Testing linear hypothesis.
We focus on testing the following three pairs of hypotheses:
These hypotheses examine linear structures between two regression coefficients. When testing the first pair of hypotheses, we set h(2) = 0, and hence the null holds if and only if h(1) = 0. Similarly, when testing the second and third pairs, we set h(1) = 0, and hence the null hypotheses hold if and only if h(2) = 0.
We consider two different dimensions, p = 50 and p = 200, and two different covariance matrices, Σ = I and Σ = {0.5|i−j|}. This yields a total of 4 settings. For each hypothesis and each setting, we further consider four scenarios, setting h(j) = 0, 0.1, 0.2, 0.4. Therefore, the null holds under the first scenario and the alternative holds under the remaining three. Table 1 summarizes the rejection probabilities of the proposed tests under the settings where Σ = {0.5|i−j|}. Rejection probabilities of the proposed tests under the settings where Σ = I are given in Table S1 in the supplementary material. The rejection probabilities are evaluated via 600 simulation replications.
Table 1.
Rejection probabilities (%) of the partial penalized Wald, score and likelihood ratio statistics with standard errors in parentheses (%), under the setting where Σ = {0.5|i−j|}.
|  | p = 50 |  |  | p = 200 |  |  |
| --- | --- | --- | --- | --- | --- | --- |
|  | TL | TW | TS | TL | TW | TS |
| h(1) (first pair) |  |  |  |  |  |  |
| 0 | 4.33(0.83) | 4.33(0.83) | 4.67(0.86) | 5.67(0.94) | 5.67(0.94) | 5.67(0.94) |
| 0.1 | 13.17(1.38) | 13.50(1.40) | 13.50(1.40) | 11.67(1.31) | 11.67(1.31) | 11.67(1.31) |
| 0.2 | 39.83(2.00) | 40.17(2.00) | 40.00(2.00) | 39.67(2.00) | 39.67(2.00) | 39.67(2.00) |
| 0.4 | 92.33(1.09) | 93.17(1.03) | 93.17(1.03) | 92.67(1.06) | 92.67(1.06) | 92.67(1.06) |
| h(2) (second pair) |  |  |  |  |  |  |
| 0 | 5.17(0.90) | 5.17(0.90) | 5.67(0.94) | 5.33(0.92) | 5.33(0.92) | 5.33(0.92) |
| 0.1 | 11.00(1.28) | 11.00(1.28) | 11.33(1.29) | 12.50(1.35) | 12.50(1.35) | 12.50(1.35) |
| 0.2 | 30.67(1.88) | 30.67(1.88) | 31.00(1.89) | 33.67(1.93) | 33.67(1.93) | 33.67(1.93) |
| 0.4 | 85.17(1.45) | 85.00(1.46) | 85.00(1.46) | 87.83(1.33) | 87.83(1.33) | 87.83(1.33) |
| h(2) (third pair) |  |  |  |  |  |  |
| 0 | 6.50(1.01) | 6.33(0.99) | 6.50(1.01) | 5.67(0.94) | 5.67(0.94) | 5.67(0.94) |
| 0.1 | 11.83(1.32) | 11.67(1.31) | 11.67(1.31) | 11.00(1.28) | 11.00(1.28) | 11.00(1.28) |
| 0.2 | 31.67(1.90) | 31.50(1.90) | 31.67(1.90) | 33.17(1.92) | 33.17(1.92) | 33.17(1.92) |
| 0.4 | 84.33(1.48) | 84.17(1.49) | 84.50(1.48) | 86.00(1.42) | 86.17(1.41) | 86.17(1.41) |
Based on the results, it can be seen that under the null hypotheses, the Type I error rates of the three tests are well controlled and close to the nominal level for all four settings. Under the alternative hypotheses, the powers of the three test statistics increase as h(1) or h(2) increases, showing the consistency of our testing procedures. Moreover, the empirical rejection rates of the three test statistics are very close across all scenarios and settings. For example, the rejection rates are exactly the same across the three statistics when p = 200 in Table 1, although the values of the three statistics in our simulations are slightly different. This is consistent with our theoretical finding that these statistics are asymptotically equivalent even in high dimensional settings. Figures S1, S2 and S3 in the supplementary material depict the kernel density estimates of the three test statistics under the null hypotheses with different combinations of p and the covariance matrices. It can be seen that the three test statistics converge to their limiting distributions under the null hypotheses.
4.1.2. Testing univariate parameter.
Consider testing the following two pairs of hypotheses:
We set h(2) = 0 when testing the first pair of hypotheses, and set h(1) = 0 when testing the second. Therefore, the first null hypothesis is equivalent to h(1) = 0 and the second is equivalent to h(2) = 0. We use the same 4 settings described in Section 4.1.1. For each setting, we set h(1) = 0.1, 0.2, 0.4 under the first alternative and h(2) = 0.1, 0.2, 0.4 under the second. Comparison is made among the following test statistics:
The proposed likelihood ratio (TL), Wald (TW) and score (TS) statistics.
The Wald test statistic based on the de-sparsified Lasso estimator (van de Geer et al., 2014).
The decorrelated score statistic (Ning and Liu, 2017).
The de-sparsified Lasso test statistic is computed via the R package hdi (Dezeure et al., 2015). We calculate the decorrelated score statistic according to Section 4.1 in Ning and Liu (2017). More specifically, the initial estimator is computed by a penalized linear regression with the SCAD penalty function, and the estimate ŵ of w* is computed by a penalized linear regression with the ℓ1 penalty function (see Equation (4.4) in Ning and Liu, 2017). These penalized regressions are implemented via the R package ncvreg (Breheny and Huang, 2011). The tuning parameters are selected via 10-fold cross-validation. The rejection probabilities of these test statistics under the settings where Σ = {0.5|i−j|} are reported in Table 2. In the supplementary material, we report the rejection probabilities of these test statistics under the settings where Σ = I in Table S2. Results are averaged over 600 simulation replications.
Table 2.
Rejection probabilities (%) of the partial penalized Wald, score and likelihood ratio statistics, the Wald test statistic based on the de-sparsified Lasso estimator and the decorrelated score statistic under the settings where Σ = {0.5|i−j|}, with standard errors in parentheses (%).
|  | TL | TW | TS | de-sparsified Lasso | decorrelated score |
| --- | --- | --- | --- | --- | --- |
| h(1), p = 50 |  |  |  |  |  |
| 0 | 5.17(0.90) | 5.33(0.92) | 5.50(0.93) | 12.67(1.36) | 7.00(1.04) |
| 0.1 | 15.67(1.48) | 16.00(1.50) | 16.00(1.50) | 6.00(0.97) | 14.67(1.44) |
| 0.2 | 41.00(2.01) | 41.33(2.01) | 41.50(2.01) | 14.83(1.45) | 38.83(1.99) |
| 0.4 | 92.50(1.08) | 93.00(1.04) | 93.00(1.04) | 67.67(1.91) | 88.67(1.29) |
| h(1), p = 200 |  |  |  |  |  |
| 0 | 4.83(0.88) | 4.83(0.88) | 4.83(0.88) | 21.83(1.69) | 5.50(0.93) |
| 0.1 | 11.00(1.28) | 11.00(1.28) | 11.00(1.28) | 5.83(0.96) | 10.83(1.27) |
| 0.2 | 40.50(2.00) | 40.50(2.00) | 40.50(2.00) | 6.17(0.98) | 37.83(1.98) |
| 0.4 | 91.50(1.14) | 91.50(1.14) | 91.50(1.14) | 49.33(2.04) | 88.00(1.33) |
| h(2), p = 50 |  |  |  |  |  |
| 0 | 6.33(0.99) | 6.00(0.97) | 6.50(1.00) | 5.33(0.92) | 3.00(0.70) |
| 0.1 | 13.67(1.40) | 13.50(1.40) | 14.00(1.42) | 5.33(0.92) | 9.17(1.18) |
| 0.2 | 40.17(2.00) | 40.33(2.00) | 40.50(2.00) | 15.67(1.48) | 28.50(1.84) |
| 0.4 | 90.83(1.18) | 91.33(1.15) | 91.67(1.13) | 69.17(1.89) | 83.33(1.52) |
| h(2), p = 200 |  |  |  |  |  |
| 0 | 5.67(0.94) | 5.67(0.94) | 5.67(0.94) | 6.50(1.01) | 2.67(0.66) |
| 0.1 | 13.67(1.40) | 13.67(1.40) | 13.67(1.40) | 3.67(0.77) | 8.17(1.12) |
| 0.2 | 39.17(1.99) | 39.17(1.99) | 39.17(1.99) | 9.67(1.21) | 24.67(1.76) |
| 0.4 | 91.50(1.14) | 91.50(1.14) | 91.50(1.14) | 51.33(2.04) | 80.50(1.62) |
From Table 2, it can be seen that the Wald test based on the de-sparsified Lasso estimator fails under the settings where Σ = {0.5|i−j|}: under the null hypotheses, its Type I error rates are greater than 12%. Under the alternative hypotheses, the proposed test statistics and the decorrelated score test are more powerful than the de-sparsified Lasso test in almost all cases. Besides, we note that TL, TW, TS and the decorrelated score test perform comparably under the settings where Σ = I. When Σ = {0.5|i−j|}, however, the proposed test statistics achieve greater power than the decorrelated score test. This is in line with our theoretical findings (see Section S1 of the supplementary material for details).
4.1.3. Effects of m.
In Section 4.1.1, we considered linear hypotheses involving two parameters only. As suggested by one of the referees, we further examine our test statistics under settings where more regression parameters are involved in the hypotheses. More specifically, we consider the following three pairs of hypotheses:
The numbers of parameters involved in the three pairs of hypotheses are equal to 4, 8 and 12, respectively. We consider the same 4 settings described in Section 4.1.1. For each setting, we set h(1) = 0, 0.2, 0.4, 0.8 and h(2) = 0. Hence, the null hypotheses hold when h(1) = 0 and the alternatives hold when h(1) > 0. We report the rejection probabilities over 600 replications in Table 3, under the settings where Σ = {0.5|i−j|}. Rejection probabilities under the settings where Σ = I are reported in Table S3 in the supplementary material.
Table 3.
Rejection probabilities (%) of the partial penalized Wald, score and likelihood ratio statistics with standard errors in parentheses (%), under the settings where Σ = {0.5|i−j|}.
|  | p = 50 |  |  | p = 200 |  |  |
| --- | --- | --- | --- | --- | --- | --- |
|  | TL | TW | TS | TL | TW | TS |
| h(1) (m = 4) |  |  |  |  |  |  |
| 0 | 4.83(0.88) | 4.50(0.85) | 4.67(0.86) | 4.83(0.88) | 4.83(0.88) | 4.83(0.88) |
| 0.2 | 28.17(1.84) | 28.17(1.84) | 28.50(1.84) | 28.50(1.84) | 28.50(1.84) | 28.50(1.84) |
| 0.4 | 80.33(1.62) | 80.17(1.63) | 80.33(1.62) | 79.83(1.64) | 79.83(1.64) | 79.83(1.64) |
| 0.8 | 99.83(0.17) | 100.00(0.00) | 100.00(0.00) | 100.00(0.00) | 100.00(0.00) | 100.00(0.00) |
| h(1) (m = 8) |  |  |  |  |  |  |
| 0 | 4.50(0.85) | 4.50(0.85) | 4.50(0.85) | 5.00(0.89) | 5.00(0.89) | 5.00(0.89) |
| 0.2 | 18.17(1.57) | 18.33(1.58) | 18.33(1.58) | 18.33(1.58) | 18.33(1.58) | 18.33(1.58) |
| 0.4 | 53.83(2.04) | 54.17(2.03) | 54.00(2.03) | 57.33(2.02) | 57.33(2.02) | 57.33(2.02) |
| 0.8 | 98.50(0.50) | 99.00(0.41) | 99.00(0.41) | 98.50(0.50) | 98.50(0.50) | 98.50(0.50) |
| h(1) (m = 12) |  |  |  |  |  |  |
| 0 | 5.17(0.90) | 5.00(0.89) | 5.17(0.90) | 5.67(0.94) | 5.67(0.94) | 5.67(0.94) |
| 0.2 | 14.33(1.43) | 14.33(1.43) | 14.33(1.43) | 13.67(1.40) | 13.67(1.40) | 13.67(1.40) |
| 0.4 | 42.00(2.01) | 42.17(2.02) | 42.17(2.02) | 41.67(2.01) | 41.67(2.01) | 41.67(2.01) |
| 0.8 | 92.83(1.05) | 92.83(1.05) | 92.83(1.05) | 93.00(1.04) | 93.00(1.04) | 93.00(1.04) |
The Type I error rates of the three test statistics are close to the nominal level under the null hypotheses. Under the alternative hypotheses, the powers of the test statistics increase as h(1) increases. Moreover, we note that the powers decrease as m increases. This is in line with Corollary 3.1, which states that the asymptotic power function of our test statistics is a function of r and γn. Recall that γn = n hnT(CΩmmCT)−1hn/ϕ0. Consider the following sequence of null hypotheses indexed by m ≥ 2: Cmβ0 = 0, where Cm = (1,···,1, 0p−m). Let γn,m denote the corresponding noncentrality parameter. Under the given settings, Ωmm = (ωij) is a banded matrix with ωij = 0 for |i − j| ≥ 2, ωij = −1/(1 − ρ2) for |i − j| = 1, ω11 = ωmm = 1/{ρ(1 − ρ2)}, and ωjj = (1 + ρ2)/{ρ(1 − ρ2)} for j ≠ 1, m, where ρ is the autocorrelation between X1 and X2. It is immediate to see that γn,m decreases as m increases.
4.2. Logistic regression.
In this example, we generate data with sample size n = 300 from the logistic regression model

logit{Pr(Y = 1 | X)} = XTβ0,

where logit(p) = log{p/(1 − p)} is the logit link function and X ∼ N(0p,Σ).
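A matching sketch for the logistic design (β0 supplied by the user, as its exact specification again depends on h(1) and h(2)):

```python
import numpy as np

def simulate_logistic(beta0, n=300, rho=0.5, seed=0):
    """Generate (X, Y) with logit{Pr(Y = 1 | X)} = X^T beta0 and the same
    AR(1) design, Sigma_ij = rho^|i-j|, as in Section 4.1."""
    rng = np.random.default_rng(seed)
    p = len(beta0)
    Sigma = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    prob = 1.0 / (1.0 + np.exp(-(X @ beta0)))
    Y = rng.binomial(1, prob)
    return X, Y
```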
4.2.1. Testing linear hypothesis.
We consider the same linear hypotheses as those in Section 4.1.1:
Similarly, we set h(2) = 0 when testing the first pair of hypotheses, and set h(1) = 0 when testing the second and third pairs. Therefore, the first null hypothesis is equivalent to h(1) = 0 and the others are equivalent to h(2) = 0. We use the same 4 settings described in Section 4.1.1. For each of the four settings, we set h(j) = 0.2, 0.4, 0.8 under the alternatives. The rejection probabilities over 600 replications are given in Table S4 in the supplementary material. We also plot the kernel density estimates of the three test statistics under the null hypotheses in Figures S4, S5 and S6 in the supplementary material. The findings are very similar to those in the previous examples.
4.2.2. Testing univariate parameter.
To compare the proposed partial penalized Wald (TW), score (TS) and likelihood ratio (TL) test statistics with the Wald test based on the de-sparsified Lasso estimator and the decorrelated score test, we consider testing the following hypotheses:
Similar to Section 4.1.2, we set h(2) = 0 when testing the first pair of hypotheses, and set h(1) = 0 when testing the second. We set h(1) = 0 under the first null hypothesis and h(1) = 0.2, 0.4, 0.8 under the first alternative, and we set h(2) = 0 under the second null hypothesis and h(2) = 0.2, 0.4, 0.8 under the second alternative. We consider the same 4 settings described in Section 4.1.1. The de-sparsified Lasso test statistic is computed via the R package hdi, and the decorrelated score statistic is obtained according to Section 4.2 of Ning and Liu (2017). We compute the initial estimator by fitting a penalized logistic regression with the SCAD penalty function, and calculate ŵ by fitting a penalized linear regression with the ℓ1 penalty function. These penalized regressions are implemented via the R package ncvreg. We report the rejection probabilities of TW, TS, TL and the two competing tests in Table S5 in the supplementary article, based on 600 simulation replications.
Based on the results, it can be seen that the Type I error rates of the de-sparsified Lasso test and the decorrelated score test are significantly larger than the nominal level in almost all cases. On the other hand, the Type I error rates of the proposed test statistics are close to the nominal level under the null hypotheses. Besides, under the alternatives, the powers of the proposed test statistics are greater than or equal to those of the two competing tests in all cases.
4.2.3. Effects of m.
As in Section 4.1.3, we further examine the proposed test statistics by allowing more regression coefficients to appear in the linear hypotheses. Similarly, we consider the following three pairs of hypotheses:
We set h(2) = 0 throughout, and set h(1) = 0 under the null hypotheses and h(1) = 0.4, 0.8, 1.6 under the alternative hypotheses. The same 4 settings described in Section 4.1.1 are used. The rejection probabilities of the proposed test statistics are reported in Table S6 in the supplementary article. Results are averaged over 600 replications. Findings are very similar to those in Section 4.1.3.
5. Technical proofs.
This section contains the proof of Theorem 3.1. To establish Theorem 3.1, we need the following lemma, whose proof is given in Section 5.1. For any symmetric and positive definite matrix A ∈ ℝq×q, it follows from the spectral theorem that A = UT ΛU for some orthogonal matrix U and diagonal matrix Λ = diag(λ1,...,λq). Since the diagonal elements of Λ are positive, we use Λ1/2 and Λ−1/2 to denote the diagonal matrices diag(λ11/2,...,λq1/2) and diag(λ1−1/2,...,λq−1/2), respectively. In addition, we define A1/2 = UT Λ1/2U and A−1/2 = UT Λ−1/2U.
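These matrix powers are easy to compute numerically via an eigendecomposition, as in this short sketch (note numpy's eigh uses the convention A = U diag(λ) UT):

```python
import numpy as np

def sym_mat_power(A, power):
    """A^{power} for symmetric positive definite A via the spectral theorem."""
    lam, U = np.linalg.eigh(A)            # A = U diag(lam) U^T
    return (U * lam ** power) @ U.T       # U diag(lam^power) U^T

A = np.array([[2.0, 0.5], [0.5, 1.0]])
assert np.allclose(sym_mat_power(A, 0.5) @ sym_mat_power(A, 0.5), A)           # A^{1/2} A^{1/2} = A
assert np.allclose(sym_mat_power(A, -0.5) @ A @ sym_mat_power(A, -0.5), np.eye(2))
```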
LEMMA 5.1. Under the conditions in Theorem 3.1, we have
(5.1) |
(5.2) |
(5.3) |
(5.4) |
(5.5) |
(5.6) |
(5.7) |
where Ψ = CΩmm CT and
We break the proof into four steps. In the first three steps, we show TW /r, TS/r and TL/r are equivalent to T0/r, respectively, where
and
In the final step, we show the χ2 approximation (3.5) holds for TW ,TS and TL.
Step 1: We first show that TW /r is equivalent to T0/r. It follows from Theorem 2.1 that
for some vector Ra that satisfies
(5.8) |
Therefore, we have
(5.9) |
where J0 = [1,...,m]. Since , it follows from (5.9) that
and hence
(5.10) |
By (5.8) and (5.5) in Lemma 5.1, we have
This together with (5.10) gives
(5.11) |
Note that
By Markov’s inequality, we have
(5.12) |
Besides, it follows from (5.4) in Lemma 5.1 and Condition (A4) that
(5.13) |
This together with (5.11) and (5.12) implies that
(5.14) |
Combining this together with (5.6) in Lemma 5.1 gives
The last term is op(r) under the condition s+m = o(n1/3). By the definition of TW, we have shown that
(5.15) |
where
Under the conditions in Theorem 3.1, we have . Since ϕ0 > 0, we have
(5.16) |
which together with (5.15) entails that TW = TW,0 + op(r).
It follows from (5.10)-(5.13) and the condition s + m = o(n1/3) that
(5.17) |
where
By (5.16), we obtain TW,0 = TW,1 + op(r) and hence TW = TW,1 + op(r). In the following, we show TW,1 = T0 + op(r).
Observe that
(5.18) |
It follows from (5.12), (5.13), (5.16) and the given conditions that the right-hand side (RHS) of (5.18) is of the order op(r). This proves TW,1 = T0 + op(r).
Step 2: We show that TS/r is equivalent to T0/r. Based on the proof of Theorem 2.1 in Section S5.1 of the supplementary article, we have
(5.19) |
and
(5.20) |
Combining (5.1) with (5.20) gives
which together with (5.19) implies that
By (5.3), we have
(5.21) |
It follows from (5.5) and (5.13) that
This together with (5.3) yields
(5.22) |
Notice that
It follows from Markov's inequality that
Combining this with (5.21) and (5.22) yields
(5.23) |
This together with (5.7) and the condition s + m = o(n1/3) gives that
When , we have
Since , we obtain , where
This together with (5.16) implies that |TS − TS,0| = op(r). Using similar arguments in (5.17) and (5.18), we can show that TS,0/r is equivalent to TS,1/r, where TS,1 is defined as
Recall that
we have
This proves the equivalence between TS/r and T0/r.
Step 3: By Theorem 2.1, we have
Notice that
It follows that
(5.24) |
Similar to (5.23), we can show that
(5.25) |
Under this event, using a third-order Taylor expansion, we obtain that
where n∥R∥∞ is upper bounded by
for some β* lying on the line segment between and . By Theorem 2.1, we have and with probability tending to 1. By Condition (A1), we obtain
This together with (5.25) yields that
The last term is op(r), since r ≤ s + m and s + m = o(n1/3).
Similarly, we can show
As a result, we have
(5.26) |
Recall that β̂a is the maximizer defined in (2.3). By Theorem 2.1, with probability tending to 1, we have . Under the given conditions, we have
This together with (5.25) yields
By (5.26), we obtain that
In view of (5.24), using similar arguments in (5.17), we can show that
As a result, we have
By (5.16), this shows
Under the condition , we can show . As a result, we have TL = T0 + op(r).
Step 4: We first show the χ2 approximation (3.5) holds for T = T0. Recall that
By the definition of ωn, we have
With some calculation, we can show that
(5.27) |
It follows from Condition (A3) that
This implies maxi = 1,...,n
Hence, with some calculations, we have
where O(1) denotes some positive constant, the first inequality follows from the Cauchy-Schwarz inequality, the last inequality follows from the fact that
and the last equality is due to Condition (3.4).
This together with (5.27) and an application of Lemma S.3 in the supplementary material gives that
(5.28) |
where Z ∈ ℝr stands for a mean zero Gaussian random vector with identity covariance matrix, and the supremum is taken over all convex sets ∈ ℝr.
Consider the following class of sets:
indexed by x ∈ ℝ. It follows from (5.28) that
Note that is equivalent to T0 ≤ x, and Pr(Z ∈ ) = Pr(χ2(r,γn) ≤ x) where . This implies
(5.29) |
Consider any statistic T* = T0 + op(r). For any x and ε > 0, it follows from (5.29) that
(5.30) |
Besides, by Lemma S.4, we have
(5.31) |
Combining (5.30) with (5.31), we obtain that
(5.32) |
In the first three steps, we have shown that T0 = TS + op(r) = TW + op(r) = TL + op(r). This together with (5.32) implies that the χ2 approximation holds for our partial penalized Wald, score and likelihood ratio statistics. The proof is hence completed.
5.1. Proof of Lemma 5.1.
Assertion (5.1) follows directly from Condition (A1). This means the square root of the maximum eigenvalue of Kn is O(1), which, by definition, proves (5.2). Under Condition (A1), we have . Using the same arguments, we have . Hence, (5.3) is proven. We now show that (5.4) holds. It follows from the condition λmax((CCT)−1) = O(1) in Condition (A4) that liminfn λmin(CCT) > 0, and hence
This implies that for sufficiently large n, we have
(5.33) |
By (5.1), we have liminfn λmin(Ωn) > 0, or equivalently,
Hence, we have
where J0 = [1,...,m]. Note that this implies
Therefore, we obtain
(5.34) |
Combining this together with (5.33) yields
By definition, this suggests
or equivalently,
This gives (5.4).
Using Cauchy-Schwarz inequality, we have
Observe that
(5.35) |
Besides, by (5.34), we have
which together with (5.35) implies that I1I2 = O(1). This proves (5.5).
We now show that (5.6) holds. Assume, for now, that we have
(5.36) |
where
Note that
Under Condition (A1), we have lim infn λmin(Kn) > 0. Under the condition max(s,m) = o(n1/2), this together with (5.36) implies
(5.37) |
with probability tending to 1. Hence, we have
(5.38) |
By Lemma S.2, this gives
and hence,
where J0 = [1,...,m]. Using Lemma S.2 again, we obtain
(5.39) |
By definition, we have . According to Theorem 2.1, we have that with probability tending to 1, where . When , we have and . Therefore, by (5.39), we have
Using Cauchy-Schwarz inequality, we obtain
(5.40) |
by (5.34). Let Ψ = CΩmmCT, we obtain
(5.41) |
by (5.40) and the fact that
Similar to (5.37), by (5.41), we can show that
(5.42) |
Combining (5.41) together with (5.42), we obtain
This proves (5.6).
Similar to (5.38), we can show
By (5.2), we obtain
This proves (5.7).
It remains to show (5.36). Since Kn and are symmetric, by Lemma S.5, it suffices to show
By definition, this requires showing that
For any vector a ∈ ℝq, we have . Hence, it suffices to show
(5.43) |
Using Taylor’s theorem, we have
(5.44) |
By Theorem 2.1, we have . Hence, we have
By Condition (A1),
By Cauchy-Schwarz inequality, we have
This proves (5.43).
Acknowledgements.
The authors wish to thank the Associate Editor and anonymous referees for their constructive comments, which led to significant improvements of this work.
Supported by NSF grant DMS 1555244, NCI grant P01 CA142538
Supported by NSF grant DMS 1512422, NIH grants P50 DA039838, P50 DA036107 and T32 LM012415, and NNSFC grants 11690014 and 11690015
SUPPLEMENTARY MATERIAL
Supplement to “Partial penalization for high dimensional testing with linear constraints”: This supplemental material includes power comparisons with existing test statistics, additional numerical studies on Poisson regression and a real data application, discussions of Conditions (A1)-(A4), some technical lemmas and the proof of Theorem 2.1.
Contributor Information
Chengchun Shi, Email: cshi4@ncsu.edu, Department of Statistics, North Carolina State University, Raleigh, NC 27695-8203, USA.
Rui Song, Email: rsong@ncsu.edu, Department of Statistics, North Carolina State University, Raleigh, NC 27695-8203, USA.
Zhao Chen, Email: zuc4@psu.edu, Department of Statistics, and The Methodology Center, the Pennsylvania State University, University Park, PA 16802-2111, USA.
Runze Li, Email: rzli@psu.edu, Department of Statistics, and The Methodology Center, the Pennsylvania State University, University Park, PA 16802-2111, USA.
References.
- Bentkus V (2004). A Lyapunov type bound in Rd. Teor. Veroyatn. Primen. 49 400–410.
- Boyd S, Parikh N, Chu E, Peleato B and Eckstein J (2011). Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning 3 1–122.
- Breheny P and Huang J (2011). Coordinate descent algorithms for nonconvex penalized regression, with applications to biological feature selection. Ann. Appl. Stat. 5 232–253.
- Candes E and Tao T (2007). The Dantzig selector: statistical estimation when p is much larger than n. Ann. Statist. 35 2313–2351.
- Dezeure R, Bühlmann P, Meier L and Meinshausen N (2015). High-dimensional inference: confidence intervals, p-values and R-software hdi. Statist. Sci. 30 533–558.
- Fan J, Guo S and Hao N (2012). Variance estimation using refitted cross-validation in ultrahigh dimensional regression. J. R. Stat. Soc. Ser. B. Stat. Methodol. 74 37–65.
- Fan J and Li R (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. J. Amer. Statist. Assoc. 96 1348–1360.
- Fan J and Lv J (2010). A selective overview of variable selection in high dimensional feature space. Statist. Sinica 20 101–148.
- Fan J and Lv J (2011). Nonconcave penalized likelihood with NP-dimensionality. IEEE Trans. Inform. Theory 57 5467–5484.
- Fan J and Peng H (2004). Nonconcave penalized likelihood with a diverging number of parameters. Ann. Statist. 32 928–961.
- Fan Y and Tang CY (2013). Tuning parameter selection in high dimensional penalized likelihood. J. R. Stat. Soc. Ser. B. Stat. Methodol. 75 531–552.
- Fang EX, Ning Y and Liu H (2017). Testing and confidence intervals for high dimensional proportional hazards models. J. Roy. Statist. Soc. Ser. B 79 1415–1437.
- Ghosh BK (1973). Some monotonicity theorems for χ2, F and t distributions with applications. J. Roy. Statist. Soc. Ser. B 35 480–492.
- Lee JD, Sun DL, Sun Y and Taylor JE (2016). Exact post-selection inference, with application to the lasso. Ann. Statist. 44 907–927.
- Lockhart R, Taylor J, Tibshirani RJ and Tibshirani R (2014). A significance test for the lasso. Ann. Statist. 42 413–468.
- McCullagh P and Nelder JA (1989). Generalized Linear Models. Chapman and Hall.
- Ning Y and Liu H (2017). A general theory of hypothesis tests and confidence regions for sparse high dimensional models. Ann. Statist. 45 158–195.
- Schwarz G (1978). Estimating the dimension of a model. Ann. Statist. 6 461–464.
- Shi C, Song R, Chen Z and Li R (2018). Supplement to “Partial penalization for high dimensional testing with linear constraints”.
- Sun T and Zhang C-H (2013). Sparse matrix inversion with scaled lasso. J. Mach. Learn. Res. 14 3385–3418.
- Taylor J, Lockhart R, Tibshirani RJ and Tibshirani R (2014). Post-selection adaptive inference for least angle regression and the lasso. arXiv preprint.
- Tibshirani R (1996). Regression shrinkage and selection via the lasso. J. Roy. Statist. Soc. Ser. B 58 267–288.
- van de Geer S, Bühlmann P, Ritov Y and Dezeure R (2014). On asymptotically optimal confidence regions and tests for high-dimensional models. Ann. Statist. 42 1166–1202.
- Wang S and Cui H (2013). Partial penalized likelihood ratio test under sparse case. arXiv preprint arXiv:1312.3723.
- Zhang C-H (2010). Nearly unbiased variable selection under minimax concave penalty. Ann. Statist. 38 894–942.
- Zhang X and Cheng G (2017). Simultaneous inference for high-dimensional linear models. J. Amer. Statist. Assoc. 112 757–768.