Author manuscript; available in PMC: 2020 Oct 1.
Published in final edited form as: Ann Stat. 2019 Aug 3;47(5):2671–2703. doi: 10.1214/18-AOS1761

LINEAR HYPOTHESIS TESTING FOR HIGH DIMENSIONAL GENERALIZED LINEAR MODELS

Chengchun Shi 1, Rui Song 2, Zhao Chen 3, Runze Li 4
PMCID: PMC6750760  NIHMSID: NIHMS996283  PMID: 31534282

Abstract

This paper is concerned with testing linear hypotheses in high-dimensional generalized linear models. To deal with linear hypotheses, we first propose a constrained partial regularization method and study its statistical properties. We further introduce an algorithm for solving regularization problems with folded-concave penalty functions and linear constraints. To test linear hypotheses, we propose a partial penalized likelihood ratio test, a partial penalized score test and a partial penalized Wald test. We show that the limiting null distributions of these three test statistics are χ² distributions with the same degrees of freedom and that, under local alternatives, they asymptotically follow noncentral χ² distributions with the same degrees of freedom and noncentrality parameter, provided the number of parameters involved in the test hypothesis grows to ∞ at a certain rate. Simulation studies are conducted to examine the finite sample performance of the proposed tests. Empirical analysis of a real data example is used to illustrate the proposed testing procedures.

Keywords: High-dimensional testing, Linear hypothesis, Likelihood ratio statistics, Score test, Wald test

1. Introduction.

During the last three decades, many works have been devoted to developing variable selection techniques for high dimensional regression models. Fan and Lv (2010) present a selective overview of this topic. There are some recent works on hypothesis testing for the Lasso (Tibshirani, 1996) in high-dimensional linear models. Lockhart et al. (2014) proposed the covariance test, which produces a sequence of p-values as the tuning parameter λ_n decreases and features become nonzero in the Lasso. This approach does not give confidence intervals or p-values for an individual variable's coefficient. Taylor et al. (2014) and Lee et al. (2016) extended the covariance testing framework to test hypotheses about individual features, after conditioning on a model selected by the Lasso. However, their framework permits inference only about features with nonzero coefficients in a Lasso regression; this set of features likely varies across samples, making interpretation difficult. Moreover, these works focused on high dimensional linear regression models, and it remains unknown whether their results can be extended to more general settings.

This paper will focus on generalized linear models (GLM, McCullagh and Nelder, 1989). Let Y be the response, and X be its associated fixed-design covariate vector. The GLM assumes that the distribution of Y belongs to the exponential family. The exponential family with canonical link has the following probability density function

\[
\exp\left\{\frac{Y\beta_0^T X - b(\beta_0^T X)}{\phi_0}\right\} c(Y), \tag{1.1}
\]

where β0 is a p-dimensional vector of regression coefficients, and ϕ0 is some positive nuisance parameter. In this paper, we assume that b(·) is thrice continuously differentiable with b′′(·) > 0.

We study testing the linear hypothesis H_0: Cβ_{0,M} = t in GLM, where β_{0,M} is a subvector of β_0, the true regression coefficients. The number of covariates p can be much larger than the sample size n, while the number of parameters in β_{0,M} is assumed to be much smaller than n. Such hypotheses are of particular interest when the goal is to explore the group structure of β_0. Moreover, this class includes the important special case β_{0,M} = 0, obtained by setting C to be the identity matrix and t = 0. In the literature, Fan and Peng (2004) proposed a penalized likelihood ratio test for H_{0a}: Cβ_{0,S} = 0 in GLM, where β_{0,S} is the vector consisting of all nonzero elements of β_0, when p = o(n^{1/5}), where n stands for the sample size. Wang and Cui (2013) extended Fan and Peng (2004)'s proposal and considered a penalized likelihood ratio statistic for testing H_{0b}: β_{0,M} = 0, again requiring p = o(n^{1/5}). Ning and Liu (2017) proposed a decorrelated score test for H_{0c}: β_{0,M} = 0 under the setting of high dimensional penalized M-estimators with nonconvex penalties. Recently, Fang, Ning and Liu (2017) extended the proposal of Ning and Liu (2017) and developed a class of decorrelated Wald, score and partial likelihood ratio tests for Cox's model with high dimensional survival data. Zhang and Cheng (2017) proposed a maximal-type statistic based on the desparsified Lasso estimator (van de Geer et al., 2014) and a bootstrap-assisted testing procedure for H_{0d}: β_{0,M} = 0, allowing M to be an arbitrary subset of [1,...,p]. In this paper, we aim to develop the theory of the Wald, score and likelihood ratio tests for H_0: Cβ_{0,M} = t in GLM under the ultrahigh dimensional setting (i.e., p grows exponentially with n).

It is well known that the Wald, score and likelihood ratio tests are equivalent in the fixed-p case. However, it can be challenging to generalize these statistics to the setting of ultrahigh dimensionality. To better understand this point, we take the Wald statistic for illustration. Consider the null hypothesis H_0: β_{0,M} = 0. Analogous to the classical Wald statistic, in the high dimensional setting one might consider the statistic β̂_M^T{cov̂(β̂_M)}^{-1}β̂_M for some penalized regression estimator β̂ and its variance estimator cov̂(β̂_M). The choice of the estimator is essential here: penalized regression estimators such as the Lasso or the Dantzig selector (Candes and Tao, 2007) cannot be used, due to their large biases when p ≫ n. The nonconcave penalized estimator does not have this bias issue, but the minimal signal conditions imposed in Fan and Peng (2004) and Fan and Lv (2011) imply that the associated Wald statistic has no power against local alternatives of the type H_a: β_{0,M} = h_n for sequences h_n with ‖h_n‖_2 ≪ λ_n, where ‖·‖_2 is the Euclidean norm. Moreover, to implement the score and likelihood ratio statistics, we need to estimate the regression parameter under the null, which involves penalized likelihood estimation under linear constraints. This is a very challenging task that has rarely been studied: (a) from a theoretical perspective, the associated estimation and variable selection properties are nonstandard, and (b) from a computational perspective, there is a lack of constrained optimization algorithms that can produce sparse estimators.

We briefly summarize our contributions as follows. First, we consider a more general form of hypothesis, whereas the existing literature mainly focuses on testing β_{0,M} = 0; we also allow the number of linear constraints to diverge with n. Our tests are therefore applicable to a wider range of real applications involving a growing set of linear hypotheses. Second, we propose a partial penalized Wald statistic, a partial penalized score statistic and a partial penalized likelihood-ratio statistic based on the class of folded-concave penalty functions, and show their equivalence in the high dimensional setting. We derive the asymptotic distributions of our test statistics under the null hypothesis and under local alternatives. Third, we systematically study the partial penalized estimator with linear constraints, deriving its rate of convergence and limiting distribution. These results are significant in their own right. The unconstrained and constrained estimators share similar forms, but the constrained estimator is more efficient, owing to the additional information contained in the constraints under the null hypothesis. Fourth, we introduce an algorithm for solving regularization problems with folded-concave penalty functions and equality constraints, based on the alternating direction method of multipliers (ADMM, cf. Boyd et al., 2011).

The rest of the paper is organized as follows. We study the statistical properties of the constrained partial penalized estimator with folded concave penalty functions in Section 2. We formally define our partial penalized Wald, score and likelihood-ratio statistics, establish their limiting distributions, and show their equivalence in Section 3. Detailed implementations of our testing procedures are given in Section 3.3, where we introduce our algorithm for solving the constrained partial penalized regression problems. Simulation studies are presented in Section 4. The proof of Theorem 3.1 is presented in Section 5. Other proofs and additional numerical results are presented in the supplementary material (Shi et al., 2018).

2. Constrained partial penalized regression.

2.1. Model setup.

Suppose that {X_i, Y_i}, i = 1,···,n, is a sample from model (1.1). Denote by Y = (Y_1,...,Y_n)^T the n-dimensional response vector and by X = (X_1,···,X_n)^T the n×p design matrix. We assume the covariates X_i form a fixed design. Let X_j denote the jth column of X. To simplify the presentation, for any r×q matrix Φ and any set J ⊆ [1,2,...,q], we denote by Φ_J the submatrix of Φ formed by the columns in J. Similarly, for any q-dimensional vector φ, φ_J stands for the subvector of φ formed by the elements in J. We further denote by Φ_{J1,J2} the submatrix of Φ formed by the rows in J1 and the columns in J2, for any J1 ⊆ [1,...,r] and J2 ⊆ [1,...,q]. Let |J| be the number of elements in J, and define J^c = [1,...,q] − J to be the complement of J.

In this paper, we assume log p = O(na) for some 0 < a < 1 and focus on the following testing problem:

\[
H_0: C\beta_{0,\mathcal{M}} = t, \tag{2.1}
\]

for a given set M ⊆ [1,...,p], an r×|M| matrix C and an r-dimensional vector t. We assume that the matrix C has full row rank, which implies that there are no redundant or contradictory constraints in (2.1). Let m = |M|; then r ≤ m.

Define the partial penalized likelihood function

\[
Q_n(\beta,\lambda) = \frac{1}{n}\sum_{i=1}^n \left\{Y_i\beta^T X_i - b(\beta^T X_i)\right\} - \sum_{j\in\mathcal{M}^c} p_\lambda(|\beta_j|),
\]

for some penalty function pλ(·) with a tuning parameter λ. Further define

\[
\hat\beta_0 = \arg\max_\beta Q_n(\beta,\lambda_{n,0}) \quad \text{subject to } C\beta_{\mathcal{M}} = t, \tag{2.2}
\]
\[
\hat\beta_a = \arg\max_\beta Q_n(\beta,\lambda_{n,a}). \tag{2.3}
\]

Note that in (2.2) and (2.3) we do not penalize the parameters involved in the constraints. This enables us to avoid imposing a minimal signal condition on the elements of β_{0,M}, so the corresponding likelihood ratio, Wald and score tests have power against local alternatives.

We present a lemma characterizing the constrained local maximizer β̂_0 in the supplementary material (see Lemma S.1). In Section 3, we show that these partial penalized estimators enable valid statistical inference about the null hypothesis.

2.2. Partial penalized regression with linear constraint.

In this section, we study the statistical properties of β̂_0 and β̂_a when p_λ is restricted to the class of folded concave penalty functions. Popular penalty functions such as SCAD (Fan and Li, 2001) and MCP (Zhang, 2010) belong to this class. Let ρ(t_0, λ) = p_λ(t_0) for λ > 0. We assume that ρ(t_0, λ) is increasing and concave in t_0 ∈ [0,∞), and has a continuous derivative ρ′(t_0, λ) with ρ′(0+, λ) > 0. In addition, we assume that ρ′(t_0, λ) is increasing in λ ∈ (0,∞) and that ρ′(0+, λ) is independent of λ. For any vector v = (v_1,...,v_q)^T, define

\[
\bar\rho(v,\lambda) = \left\{\mathrm{sgn}(v_1)\rho'(|v_1|,\lambda),\dots,\mathrm{sgn}(v_q)\rho'(|v_q|,\lambda)\right\}^T,
\]
\[
\mu(v) = \left\{b'(v_1),\dots,b'(v_q)\right\}^T, \qquad \Sigma(v) = \mathrm{diag}\left\{b''(v_1),\dots,b''(v_q)\right\},
\]

where sgn(·) denotes the sign function. We further define the local concavity of the penalty function ρ at v with ∥v∥0 = q as

\[
\kappa(\rho,v,\lambda) = \lim_{\epsilon\to 0^+}\ \max_{1\le j\le q}\ \sup_{t_1<t_2\in(|v_j|-\epsilon,\,|v_j|+\epsilon)} -\frac{\rho'(t_2,\lambda)-\rho'(t_1,\lambda)}{t_2-t_1}.
\]

We assume that the true regression coefficient β_0 is sparse and satisfies Cβ_{0,M} − t = h_n for some sequence of vectors h_n → 0. When h_n = 0, the null holds; otherwise, the alternative holds. Let S = {j ∈ M^c : β_{0,j} ≠ 0} and s = |S|. Let d_n be half the minimum signal of β_{0,S}, i.e., d_n = min_{j∈S}|β_{0,j}|/2. Define N_0 = {β ∈ ℝ^p : ‖β_{S∪M} − β_{0,S∪M}‖_2 ≤ √((s+m)log(n)/n), β_{(S∪M)^c} = 0}. We impose the following conditions.

(A1) Assume that

\[
\max_{1\le j\le p}\|X_j\|_\infty = O\big(\sqrt{n/\log(p)}\big), \qquad \max_{1\le j\le p}\|X_j\|_2 = O(\sqrt{n}),
\]
\[
\inf_{\beta\in N_0}\lambda_{\min}\big(X_{S\cup\mathcal{M}}^T\Sigma(X\beta)X_{S\cup\mathcal{M}}\big) \ge cn, \qquad \lambda_{\max}\big(X_{S\cup\mathcal{M}}^T\Sigma(X\beta_0)X_{S\cup\mathcal{M}}\big) = O(n),
\]
\[
\big\|X_{(S\cup\mathcal{M})^c}^T\Sigma(X\beta_0)X_{S\cup\mathcal{M}}\big\|_{2,\infty} = O(n),
\]
\[
\max_{1\le j\le p}\ \sup_{\beta\in N_0}\lambda_{\max}\big(X_{S\cup\mathcal{M}}^T\,\mathrm{diag}\{|X_j|\circ|b'''(X\beta)|\}\,X_{S\cup\mathcal{M}}\big) = O(n),
\]

for some constant c > 0, where for any vector v = (v_1,...,v_q)^T, diag(v) denotes the diagonal matrix whose jth diagonal element is v_j, |v| = (|v_1|,...,|v_q|)^T, and ‖B‖_{2,∞} = sup_{v:‖v‖_2=1}‖Bv‖_∞ for any matrix B with q rows.

(A2) Assume that d_n ≫ λ_{n,j} ≫ max{√((s+m)/n), √((log p)/n)}, p′_{λ_{n,j}}(d_n) = o((s+m)^{−1/2}n^{−1/2}), and λ_{n,j}κ_{0,j} = o(1), where κ_{0,j} = max_{β∈N_0} κ(ρ, β, λ_{n,j}), for j = 0, a.

(A3) Assume that there exist some constants M and v0 such that

\[
\max_{1\le i\le n} E\left\{\exp\left(\frac{|Y_i-\mu(\beta_0^T X_i)|}{M}\right) - 1 - \frac{|Y_i-\mu(\beta_0^T X_i)|}{M}\right\} M^2 \le \frac{v_0^2}{2}.
\]

(A4) Assume that ‖h_n‖_2 = O(√(min(s+m−r, r)/n)) and λ_max((CC^T)^{−1}) = O(1).

In Section S4.1 of the supplementary material, we show that Condition (A1) holds with probability tending to 1 if the covariate vectors X_1,...,X_n are uniformly bounded or are realizations from a sub-Gaussian distribution. The first condition in (A2) is a minimum signal assumption imposed only on the nonzero elements in M^c. This is due to partial penalization, which enables us to evaluate the uncertainty of the estimation for small signals. Such conditions are not assumed in van de Geer et al. (2014) and Ning and Liu (2017) for testing H_0: β_{0,M} = 0. However, we note that these authors impose some additional assumptions on the design matrix. For example, the validity of the decorrelated score statistic depends on the sparsity of w*. For testing univariate parameters, this requires the degree of a particular node in the graph to be relatively small when the covariates follow a Gaussian graphical model (see Remark 6 in Ning and Liu, 2017). In Section S4.3 of the supplementary material, we show that Condition (A3) holds for linear, logistic and Poisson regression models.

Theorem 2.1. Suppose that Conditions (A1)-(A4) hold and s + m = o(n). Then the following holds: (i) With probability tending to 1, β̂_0 and β̂_a defined in (2.2) and (2.3) satisfy β̂_{0,(S∪M)^c} = β̂_{a,(S∪M)^c} = 0. (ii) ‖β̂_{a,S∪M} − β_{0,S∪M}‖_2 = O_p(√((s+m)/n)) and ‖β̂_{0,S∪M} − β_{0,S∪M}‖_2 = O_p(√((s+m−r)/n)). If further s + m = o(n^{1/3}), then we have

\[
\sqrt{n}\begin{pmatrix}\hat\beta_{a,\mathcal{M}}-\beta_{0,\mathcal{M}}\\ \hat\beta_{a,S}-\beta_{0,S}\end{pmatrix}
= \frac{1}{\sqrt{n}}K_n^{-1}\begin{pmatrix}X_{\mathcal{M}}^T\\ X_S^T\end{pmatrix}\{Y-\mu(X\beta_0)\} + o_p(1),
\]
\[
\sqrt{n}\begin{pmatrix}\hat\beta_{0,\mathcal{M}}-\beta_{0,\mathcal{M}}\\ \hat\beta_{0,S}-\beta_{0,S}\end{pmatrix}
= \frac{1}{\sqrt{n}}K_n^{-1/2}(I-P_n)K_n^{-1/2}\begin{pmatrix}X_{\mathcal{M}}^T\\ X_S^T\end{pmatrix}\{Y-\mu(X\beta_0)\}
- \sqrt{n}\,K_n^{-1/2}P_nK_n^{1/2}\begin{pmatrix}C^T(CC^T)^{-1}h_n\\ 0\end{pmatrix} + o_p(1),
\]

where I is the identity matrix, Kn is the (m + s) × (m + s) matrix

\[
K_n = \frac{1}{n}\begin{pmatrix}X_{\mathcal{M}}^T\Sigma(X\beta_0)X_{\mathcal{M}} & X_{\mathcal{M}}^T\Sigma(X\beta_0)X_S\\ X_S^T\Sigma(X\beta_0)X_{\mathcal{M}} & X_S^T\Sigma(X\beta_0)X_S\end{pmatrix},
\]

and P_n is the (m + s) × (m + s) projection matrix

\[
P_n = K_n^{-1/2}\begin{pmatrix}C^T\\ O_{r\times s}^T\end{pmatrix}\left\{(C\ O_{r\times s})\,K_n^{-1}\begin{pmatrix}C^T\\ O_{r\times s}^T\end{pmatrix}\right\}^{-1}(C\ O_{r\times s})\,K_n^{-1/2},
\]

where Or×s is an r × s zero matrix.

Remark 2.1. Since d_n ≫ √((s+m)/n), Theorem 2.1(ii) implies that, with probability tending to 1, each element of β̂_{0,S} and β̂_{a,S} is nonzero. Together with result (i), this shows the sign consistency of β̂_{0,M^c} and β̂_{a,M^c}.

Remark 2.2. Theorem 2.1 implies that the constrained estimator β̂_0 converges at the rate O_p(√((s+m−r)/n)), whereas the unconstrained estimator converges at the rate O_p(√((s+m)/n)). Thus, when h_n is relatively small and s + m − r ≪ s + m, the constrained estimator β̂_0 converges faster than the unconstrained estimator β̂_a defined in (2.3). This result is expected: the more information about β_0 we have, the more accurate the estimator will be.

Remark 2.3. Under certain regularity conditions, Theorem 2.1 implies that

\[
\sqrt{n}\left\{(\hat\beta_{0,\mathcal{M}}-\beta_{0,\mathcal{M}})^T, (\hat\beta_{0,S}-\beta_{0,S})^T\right\}^T \rightarrow N(\xi_0, V_0),
\]

where ξ_0 and V_0 are the limits of −√n K_n^{−1/2}P_nK_n^{1/2}{(C^T(CC^T)^{−1}h_n)^T, 0^T}^T and K_n^{−1/2}(I−P_n)K_n^{−1/2}, respectively. Similarly, we can show

\[
\sqrt{n}\left\{(\hat\beta_{a,\mathcal{M}}-\beta_{0,\mathcal{M}})^T, (\hat\beta_{a,S}-\beta_{0,S})^T\right\}^T \rightarrow N(0, V_a),
\]

where V_a = lim_n K_n^{−1}. Note that a^T V_0 a ≤ a^T V_a a for any a ∈ ℝ^{s+m}. Under the null, we have ξ_0 = 0, which suggests that β̂_0 is more efficient than β̂_a in the sense of a smaller asymptotic variance. Under the alternative, β̂_{0,M} is asymptotically biased. This can be interpreted as a bias-variance trade-off between β̂_0 and β̂_a.

3. Partial penalized Wald, score and likelihood ratio statistics.

3.1. Test statistics.

We begin by introducing our partial penalized likelihood ratio statistic,

\[
T_L = 2n\{L_n(\hat\beta_a) - L_n(\hat\beta_0)\}/\hat\phi, \tag{3.1}
\]

where L_n(β) = (1/n)∑_{i=1}^n {Y_iβ^T X_i − b(β^T X_i)}, β̂_0 and β̂_a are defined in (2.2) and (2.3), respectively, and φ̂ is a consistent estimator of φ_0 in (1.1). For Gaussian linear models, φ_0 corresponds to the error variance; for logistic or Poisson regression models, φ_0 = 1.

The partial penalized Wald statistic is based on √n(Cβ̂_{a,M} − t). Define Ω_n = K_n^{−1} and denote by Ω_{mm} the submatrix formed by the first m rows and columns of Ω_n. It follows from Theorem 2.1 that the asymptotic variance of √n(Cβ̂_{a,M} − t) is equal to CΩ_{mm}C^T φ_0. Let Ŝ_a = {j ∈ M^c : β̂_{a,j} ≠ 0}. Then, with probability tending to 1, we have Ŝ_a = S. Define

\[
\hat\Omega_a = n\begin{pmatrix}X_{\mathcal{M}}^T\Sigma(X\hat\beta_a)X_{\mathcal{M}} & X_{\mathcal{M}}^T\Sigma(X\hat\beta_a)X_{\hat S_a}\\ X_{\hat S_a}^T\Sigma(X\hat\beta_a)X_{\mathcal{M}} & X_{\hat S_a}^T\Sigma(X\hat\beta_a)X_{\hat S_a}\end{pmatrix}^{-1},
\]

and Ω^a,mm as its submatrix formed by its first m rows and columns. The partial penalized Wald statistic is defined by

\[
T_W = n(C\hat\beta_{a,\mathcal{M}} - t)^T(C\hat\Omega_{a,mm}C^T)^{-1}(C\hat\beta_{a,\mathcal{M}} - t)/\hat\phi. \tag{3.2}
\]

Analogous to the classical score statistic, we define our partial penalized score statistic as

\[
T_S = \frac{1}{n\hat\phi}\{Y-\mu(X\hat\beta_0)\}^T(X_{\mathcal{M}}\ X_{\hat S_0})\,\hat\Omega_0\,(X_{\mathcal{M}}\ X_{\hat S_0})^T\{Y-\mu(X\hat\beta_0)\}, \tag{3.3}
\]

where S^0={jMc:β^0,j0}, and

\[
\hat\Omega_0 = n\begin{pmatrix}X_{\mathcal{M}}^T\Sigma(X\hat\beta_0)X_{\mathcal{M}} & X_{\mathcal{M}}^T\Sigma(X\hat\beta_0)X_{\hat S_0}\\ X_{\hat S_0}^T\Sigma(X\hat\beta_0)X_{\mathcal{M}} & X_{\hat S_0}^T\Sigma(X\hat\beta_0)X_{\hat S_0}\end{pmatrix}^{-1}.
\]
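To make the three statistics concrete, the following sketch (our illustration, not the authors' code) evaluates T_L, T_W and T_S for the Gaussian linear model, where b(u) = u²/2, μ(u) = u and Σ(·) reduces to the identity; the partial penalized fits β̂_0 and β̂_a are taken as given, and the function name and interface are ours.

```python
import numpy as np

def partial_penalized_stats(X, Y, beta0, beta_a, M, C, t, phi_hat):
    """Sketch of T_L (3.1), T_W (3.2), T_S (3.3) for the Gaussian linear
    model (b(u) = u^2/2, mu(u) = u), given constrained/unconstrained fits."""
    n, p = X.shape
    Ln = lambda b: np.mean(Y * (X @ b) - 0.5 * (X @ b) ** 2)   # L_n(beta)
    TL = 2.0 * n * (Ln(beta_a) - Ln(beta0)) / phi_hat          # (3.1)

    m = len(M)
    S_a = [j for j in range(p) if j not in M and beta_a[j] != 0]
    MS = list(M) + S_a
    Omega_a = n * np.linalg.inv(X[:, MS].T @ X[:, MS])         # Sigma = I here
    diff = C @ beta_a[list(M)] - t
    TW = n * diff @ np.linalg.solve(C @ Omega_a[:m, :m] @ C.T, diff) / phi_hat

    S_0 = [j for j in range(p) if j not in M and beta0[j] != 0]
    MS0 = list(M) + S_0
    Omega_0 = n * np.linalg.inv(X[:, MS0].T @ X[:, MS0])
    U = X[:, MS0].T @ (Y - X @ beta0)                          # score vector
    TS = U @ Omega_0 @ U / (n * phi_hat)                       # (3.3)
    return TL, TW, TS
```

Under the null, the three values should be close for large n, reflecting the asymptotic equivalence established below.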

3.2. Limiting distributions of the test statistics.

For a given significance level α, we reject the null hypothesis when T > χ²_α(r) for T = T_L, T_W or T_S, where χ²_α(r) is the upper α-quantile of the central χ² distribution with r degrees of freedom and r is the number of constraints. When r is fixed and φ̂ is consistent for φ_0, it follows from Theorem 2.1 that T_L, T_W and T_S converge in distribution to a (noncentral) χ² distribution with r degrees of freedom. When r diverges with n, however, there is no such theoretical guarantee, since weak convergence to a fixed limiting distribution is no longer well defined. To resolve this issue, we observe that when the following holds,

\[
\sup_x\big|\Pr(T\le x) - \Pr(\chi^2(r,\gamma_n)\le x)\big| \to 0,
\]

where χ²(r,γ_n) denotes a chi-square random variable with r degrees of freedom and noncentrality parameter γ_n, which may vary with n, our testing procedure remains valid via the χ² approximation.
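In code, the rejection rule and the corresponding p-value are one line each; a minimal sketch using SciPy (helper name ours):

```python
from scipy.stats import chi2

def chi2_test(T, r, alpha=0.05):
    # Reject H0 when T exceeds the upper-alpha quantile of chi2 with r df;
    # also report the p-value Pr(chi2(r) > T).
    crit = chi2.ppf(1 - alpha, df=r)
    return T > crit, chi2.sf(T, df=r)
```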

Theorem 3.1. Assume that Conditions (A1)-(A4) hold, s + m = o(n^{1/3}), and |φ̂ − φ_0| = o_p(1). Further assume the following holds:

\[
\frac{r^{1/4}}{n^{3/2}}\sum_{i=1}^n\big\{(X_{i,\mathcal{M}\cup S})^T K_n^{-1} X_{i,\mathcal{M}\cup S}\big\}^{3/2} \to 0. \tag{3.4}
\]

Then, we have

\[
\sup_x\big|\Pr(T\le x) - \Pr(\chi^2(r,\gamma_n)\le x)\big| \to 0, \tag{3.5}
\]

for T = T_W, T_S or T_L, where γ_n = n h_n^T(CΩ_{mm}C^T)^{−1}h_n/φ_0.

REMARK 3.1. By (3.5), it is immediate to see that

\[
\sup_x\big|\Pr(T_1\le x) - \Pr(T_2\le x)\big| \to 0,
\]

for any T_1, T_2 ∈ {T_W, T_S, T_L}. This establishes the equivalence of the partial penalized Wald, score and likelihood-ratio statistics. Condition (3.4) is the key to guaranteeing the χ² approximation in (3.5). When r = O(1), this condition is equivalent to

\[
\frac{1}{n^{3/2}}\sum_{i=1}^n\big\{(X_{i,\mathcal{M}\cup S})^T K_n^{-1} X_{i,\mathcal{M}\cup S}\big\}^{3/2} \to 0,
\]

which corresponds to the Lyapunov condition that ensures the asymptotic normality of β̂_{0,M∪S} and β̂_{a,M∪S}. When r diverges, (3.4) guarantees that the following Lyapunov-type bound goes to 0,

\[
\sup_{\mathcal{C}}\Big|\Pr\big(\sqrt{n}\,(C\Omega_{mm}C^T)^{-1/2}(C\hat\beta_{a,\mathcal{M}}-t)/\sqrt{\phi_0}\in\mathcal{C}\big) - \Pr(Z\in\mathcal{C})\Big| \to 0,
\]

where Z represents an r-dimensional multivariate normal vector with identity covariance matrix, and the supremum is taken over all convex subsets 𝒞 of ℝ^r. The scaling factor r^{1/4} accounts for the dependence of this Lyapunov-type estimate on the dimension; it remains unknown whether the factor r^{1/4} can be improved (see related discussions in Bentkus, 2004).

REMARK 3.2. Theorem 3.1 implies that our testing procedures are consistent. When the null holds, we have h_n = 0 and hence γ_n = 0. Together with (3.5), this shows that our tests have the correct size under the null. Under the alternative, we have h_n ≠ 0 and hence γ_n ≠ 0. Since χ²(r,0) is stochastically smaller than χ²(r,γ_n), (3.5) implies that our tests have non-negligible power under H_a. We summarize these results in the following corollary.

COROLLARY 3.1. Assume that Conditions (A1)-(A3) and (3.4) hold, s + m = o(n^{1/3}), λ_max((CC^T)^{−1}) = O(1), and |φ̂ − φ_0| = o_p(1). Then, under the null hypothesis, for any 0 < α < 1, we have

\[
\lim_{n\to\infty}\Pr\big(T > \chi^2_\alpha(r)\big) = \alpha,
\]

for T = T_W, T_L and T_S, where χ²_α(r) is the critical value of the χ²-distribution with r degrees of freedom at level α. Under the alternative Cβ_{0,M} − t = h_n for some h_n satisfying ‖h_n‖_2 = O(√(min(s+m−r, r)/n)), we have, for any 0 < α < 1 and T = T_W, T_S and T_L,

\[
\lim_{n\to\infty}\Big|\Pr\big(T > \chi^2_\alpha(r)\big) - \Pr\big(\chi^2(r,\gamma_n) > \chi^2_\alpha(r)\big)\Big| = 0,
\]

where γ_n = n h_n^T(CΩ_{mm}C^T)^{−1}h_n/φ_0.

REMARK 3.3. Corollary 3.1 shows that the asymptotic power functions of the proposed test statistics are

\[
\Pr\big(\chi^2(r,\gamma_n) > \chi^2_\alpha(r)\big). \tag{3.6}
\]

It follows from Theorem 2 in Ghosh (1973) that, for a given γ_n, the asymptotic power function decreases as r increases, just as for the traditional likelihood ratio, score and Wald tests. However, h_n is an r-dimensional vector in our setting, so one may easily construct examples in which γ_n grows as r increases. As a result, the asymptotic power function need not be a monotonically decreasing function of r.
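The asymptotic power (3.6) is easy to evaluate numerically with the noncentral χ² distribution. The sketch below (our illustration, with an arbitrary noncentrality value) shows both effects: for fixed γ_n the power decays in r, while a γ_n growing with r can offset the decay.

```python
from scipy.stats import chi2, ncx2

def asymptotic_power(r, gamma_n, alpha=0.05):
    # Pr(chi2(r, gamma_n) > chi2_alpha(r)), i.e. equation (3.6)
    crit = chi2.ppf(1 - alpha, df=r)
    return ncx2.sf(crit, df=r, nc=gamma_n)

# fixed noncentrality: power decreases as the number of constraints r grows
fixed = [asymptotic_power(r, 8.0) for r in (1, 4, 8)]
# a noncentrality growing linearly in r can offset the decrease
growing = [asymptotic_power(r, 8.0 * r) for r in (1, 4, 8)]
```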

In Section S3 of Shi et al. (2018), we study in depth how the penalty on each individual coefficient affects the power, and find that the tests are most advantageous when each unpenalized variable is either an important variable (i.e., in S) or a variable in M.

REMARK 3.4. Notice that the null hypothesis reduces to β_{0,M} = 0 if we set C to be the identity matrix and t = 0. The Wald test based on the desparsified Lasso estimator (van de Geer et al., 2014) and the decorrelated score test (Ning and Liu, 2017) can also be applied to test such hypotheses. Based on (3.6), we show in Section S1 of Shi et al. (2018) that these two tests achieve less power than the proposed partial penalized tests. This is due to the increased variances of the desparsified Lasso estimator and the decorrelated score statistic after the debiasing procedure.

3.3. Some implementation issues.

3.3.1. Constrained partial penalized regression.

To construct our test statistics, we need to compute the partial penalized estimators β̂_0 and β̂_a. Our algorithm is based on the alternating direction method of multipliers (ADMM), a variant of the standard augmented Lagrangian method. Below, we present the algorithm for computing β̂_0; the unconstrained estimator β̂_a can be computed similarly. For a fixed regularization parameter λ, define

\[
\hat\beta_0^{\lambda} = \arg\min_\beta\Big(-L_n(\beta) + \sum_{j\in\mathcal{M}^c}p_\lambda(|\beta_j|)\Big), \quad \text{subject to } C\beta_{\mathcal{M}} = t.
\]

The above optimization problem is equivalent to

\[
(\hat\beta_0^{\lambda}, \hat\theta_0^{\lambda}) = \arg\min_{\beta\in\mathbb{R}^p,\ \theta\in\mathbb{R}^{p-m}}\Big(-L_n(\beta) + \sum_{j=1}^{p-m}p_\lambda(|\theta_j|)\Big), \quad \text{subject to } C\beta_{\mathcal{M}} = t,\ \beta_{\mathcal{M}^c} = \theta. \tag{3.7}
\]

The augmented Lagrangian for (3.7) is

\[
L_\rho(\beta,\theta,v) = -L_n(\beta) + \sum_{j=1}^{p-m}p_\lambda(|\theta_j|) + v^T\begin{pmatrix}C\beta_{\mathcal{M}}-t\\ \beta_{\mathcal{M}^c}-\theta\end{pmatrix} + \frac{\rho}{2}\|C\beta_{\mathcal{M}}-t\|_2^2 + \frac{\rho}{2}\|\beta_{\mathcal{M}^c}-\theta\|_2^2,
\]

for a given ρ > 0. Applying the dual ascent method yields the following algorithm:

\[
\beta^{k+1} = \arg\min_\beta\Big\{-L_n(\beta) + (v^k)^T\begin{pmatrix}C\beta_{\mathcal{M}}-t\\ \beta_{\mathcal{M}^c}-\theta^k\end{pmatrix} + \frac{\rho}{2}\Big\|\begin{pmatrix}C\beta_{\mathcal{M}}-t\\ \beta_{\mathcal{M}^c}-\theta^k\end{pmatrix}\Big\|_2^2\Big\},
\]
\[
\theta^{k+1} = \arg\min_\theta\Big\{\sum_{j=1}^{p-m}p_\lambda(|\theta_j|) + \frac{\rho}{2}\|\beta^{k+1}_{\mathcal{M}^c}-\theta\|_2^2 + (v^k)^T\begin{pmatrix}C\beta^{k+1}_{\mathcal{M}}-t\\ \beta^{k+1}_{\mathcal{M}^c}-\theta\end{pmatrix}\Big\},
\]
\[
v^{k+1} = v^k + \rho\begin{pmatrix}C\beta^{k+1}_{\mathcal{M}}-t\\ \beta^{k+1}_{\mathcal{M}^c}-\theta^{k+1}\end{pmatrix},
\]

for the (k + 1)th iteration.
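For the Gaussian linear model, −L_n is quadratic, so the β-update is a single linear solve. The sketch below is our code, not the authors' implementation: it stacks the two constraints as Aβ = (t; θ), and for brevity replaces the SCAD θ-update with the Lasso soft-thresholding prox (the SCAD prox could be substituted directly).

```python
import numpy as np

def admm_constrained(X, Y, C, t, M, lam, rho=1.0, n_iter=500):
    # ADMM sketch for (3.7) in the Gaussian linear model. The stacked
    # constraint is A beta = (t; theta), with dual variable v.
    n, p = X.shape
    Mc = [j for j in range(p) if j not in M]
    r = C.shape[0]
    A = np.zeros((r + len(Mc), p))
    A[:r, M] = C                      # rows enforcing C beta_M = t
    A[r:, Mc] = np.eye(len(Mc))       # rows enforcing beta_Mc = theta
    G = X.T @ X / n + rho * A.T @ A   # Hessian of the beta-subproblem
    beta = np.zeros(p)
    theta = np.zeros(len(Mc))
    v = np.zeros(r + len(Mc))
    for _ in range(n_iter):
        d = np.concatenate([t, theta])
        # beta-update: minimize -L_n(beta) + v^T(A beta - d) + (rho/2)||A beta - d||^2
        beta = np.linalg.solve(G, X.T @ Y / n + A.T @ (rho * d - v))
        # theta-update: soft-thresholding prox (stand-in for the SCAD prox)
        z = beta[Mc] + v[r:] / rho
        theta = np.sign(z) * np.maximum(np.abs(z) - lam / rho, 0.0)
        # dual ascent on the stacked residual
        v = v + rho * (A @ beta - np.concatenate([t, theta]))
    return beta, theta
```

At convergence, Cβ_M ≈ t and β_{M^c} ≈ θ, with v collecting both sets of multipliers.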

Since L_n is twice differentiable, β^{k+1} can be obtained by the Newton-Raphson algorithm. The update θ^{k+1} has a closed form for some popular penalties, such as the Lasso, SCAD and MCP penalties. In our implementation, we use the SCAD penalty,

\[
p_\lambda(|\beta_j|) = \lambda\int_0^{|\beta_j|}\Big\{I(t\le\lambda) + \frac{(a\lambda-t)_+}{(a-1)\lambda}I(t>\lambda)\Big\}\,dt,
\]

and set a = 3.7 and ρ = 1.
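With ρ = 1, the resulting SCAD θ-update is the classical three-piece thresholding rule of Fan and Li (2001); a sketch of this proximal map (function name ours):

```python
import numpy as np

def scad_prox(z, lam, a=3.7):
    # Elementwise minimizer of p_lambda(|t|) + (1/2)(t - z)^2 (rho = 1):
    # soft-thresholding up to 2*lam, a linear blend on (2*lam, a*lam],
    # and the identity beyond a*lam.
    z = np.asarray(z, dtype=float)
    out = np.empty_like(z)
    absz = np.abs(z)
    small = absz <= 2 * lam
    mid = (absz > 2 * lam) & (absz <= a * lam)
    out[small] = np.sign(z[small]) * np.maximum(absz[small] - lam, 0.0)
    out[mid] = ((a - 1) * z[mid] - np.sign(z[mid]) * a * lam) / (a - 2)
    out[~small & ~mid] = z[~small & ~mid]
    return out
```

The three pieces meet continuously at |z| = 2λ and |z| = aλ, and large inputs pass through unshrunk, which is the source of the penalty's unbiasedness for strong signals.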

To obtain β̂_0, we compute β̂_0^λ for a sequence of log-spaced values of λ in [λ_min, λ_max], for some 0 < λ_min < λ_max. We then choose β̂_0 = β̂_0^{λ̂}, where λ̂ minimizes the following information criterion:

\[
\hat\lambda = \arg\min_{\lambda}\big(-nL_n(\hat\beta_0^{\lambda}) + c_n\|\hat\beta_0^{\lambda}\|_0\big),
\]

where c_n = max{log n, log(log n) log p}. Using arguments similar to those in Schwarz (1978) and Fan and Tang (2013), one can show that this information criterion is consistent in both the fixed-p and the ultrahigh-dimensional settings.
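A sketch of this tuning rule (interface ours), assuming the fits β̂_0^λ over the λ-grid have already been computed:

```python
import numpy as np

def select_lambda(lambdas, fits, Ln, n, p):
    # IC(lambda) = -n * L_n(beta_hat_lambda) + c_n * ||beta_hat_lambda||_0,
    # with c_n = max{log n, log(log n) * log p}; return the minimizer.
    cn = max(np.log(n), np.log(np.log(n)) * np.log(p))
    ics = [-n * Ln(b) + cn * np.count_nonzero(b) for b in fits]
    return lambdas[int(np.argmin(ics))]
```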

3.3.2. Estimation of the nuisance parameter.

It can be shown that φ_0 = 1 for logistic and Poisson regression models. In linear regression models, we have φ_0 = E(Y_i − β_0^T X_i)². In our implementation, we estimate φ_0 by

\[
\hat\phi = \frac{1}{n-|\hat S_a|-m}\sum_{i=1}^n\big(Y_i-\hat\beta_a^T X_i\big)^2,
\]

where β̂_a is defined in (2.3).

In Section S2 of the supplementary material (Shi et al., 2018), we show that φ̂ = φ_0 + O_p(n^{−1/2}) under the conditions of Theorem 2.1, which imply selection consistency. Alternatively, one can estimate φ_0 using refitted cross-validation (Fan, Guo and Hao, 2012) or the scaled lasso (Sun and Zhang, 2013).
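For the linear model, the estimator above is a one-liner; a sketch (function name ours):

```python
import numpy as np

def estimate_phi(X, Y, beta_a, M):
    # Residual variance with degrees-of-freedom correction n - |S_hat_a| - m;
    # the correction is justified once the support of beta_a is selected
    # consistently.
    n, p = X.shape
    s_hat = sum(1 for j in range(p) if j not in M and beta_a[j] != 0)
    resid = Y - X @ beta_a
    return float(resid @ resid) / (n - s_hat - len(M))
```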

4. Numerical Examples.

In this section, we examine the finite sample performance of the proposed tests. Simulation results for linear regression and logistic regression are presented in the main text. In the supplementary material (Shi et al., 2018), we present simulation results for Poisson log-linear model and illustrate the proposed methodology by a real data example.

4.1. Linear regression.

Simulated data with sample size n = 100 were generated from

\[
Y = 2X_1 - (2+h^{(1)})X_2 + h^{(2)}X_3 + \varepsilon,
\]

where ε ~ N(0,1), X ~ N(0_p, Σ), and h^{(1)} and h^{(2)} are constants. The true value is β_0 = (2, −2−h^{(1)}, h^{(2)}, 0_{p−3}^T)^T, where 0_q denotes a zero vector of length q.
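One replicate of this design can be generated as follows (our sketch; the seed, defaults and function name are ours, and ρ = 0 recovers Σ = I):

```python
import numpy as np

def simulate_linear(n=100, p=50, h1=0.0, h2=0.0, rho=0.5, seed=0):
    # Y = 2*X1 - (2 + h1)*X2 + h2*X3 + eps, with X ~ N(0, Sigma),
    # Sigma_{ij} = rho^{|i-j|}, and eps ~ N(0, 1).
    rng = np.random.default_rng(seed)
    idx = np.arange(p)
    Sigma = rho ** np.abs(np.subtract.outer(idx, idx))
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    beta = np.zeros(p)
    beta[:3] = [2.0, -(2.0 + h1), h2]
    Y = X @ beta + rng.standard_normal(n)
    return X, Y, beta
```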

4.1.1. Testing linear hypothesis.

We focus on testing the following three pairs of hypotheses:

\[
H_0^{(1)}: \beta_{0,1}+\beta_{0,2}=0 \quad\text{vs.}\quad H_a^{(1)}: \beta_{0,1}+\beta_{0,2}\neq 0,
\]
\[
H_0^{(2)}: \beta_{0,2}+\beta_{0,3}=-2 \quad\text{vs.}\quad H_a^{(2)}: \beta_{0,2}+\beta_{0,3}\neq -2,
\]
\[
H_0^{(3)}: \beta_{0,3}+\beta_{0,4}=0 \quad\text{vs.}\quad H_a^{(3)}: \beta_{0,3}+\beta_{0,4}\neq 0.
\]

These hypotheses test linear structures between two regression coefficients. When testing H0(1), we set h(2) = 0, and hence H0(1) holds if and only if h(1) = 0. Similarly, when testing H0(2) and H0(3), we set h(1) = 0, and hence the null hypotheses hold if and only if h(2) = 0.

We consider two different dimensions, p = 50 and p = 200, and two different covariance matrices Σ, namely Σ = I and Σ = {0.5^{|i−j|}}_{i,j=1,...,p}, yielding a total of 4 settings. For each hypothesis and each setting, we further consider four scenarios, setting h(j) = 0, 0.1, 0.2, 0.4; the null holds under the first scenario and the alternative holds under the remaining three. Table 1 summarizes the rejection probabilities for H0(1), H0(2) and H0(3) under the settings where Σ = {0.5^{|i−j|}}. Rejection probabilities of the proposed tests under the settings where Σ = I are given in Table S1 in the supplementary material. The rejection probabilities are evaluated via 600 simulation replications.

Table 1.

Rejection probabilities (%) of the partial penalized Wald, score and likelihood ratio statistics with standard errors in parenthesis (%), under the setting where Σ={0.5|ij|}i,j=1,,p.

p = 50 p = 200

TL TW TS TL TW TS
h(1) H0(1)

0 4.33(0.83) 4.33(0.83) 4.67(0.86) 5.67(0.94) 5.67(0.94) 5.67(0.94)
0.1 13.17(1.38) 13.50(1.40) 13.50(1.40) 11.67(1.31) 11.67(1.31) 11.67(1.31)
0.2 39.83(2.00) 40.17(2.00) 40.00(2.00) 39.67(2.00) 39.67(2.00) 39.67(2.00)
0.4 92.33(1.09) 93.17(1.03) 93.17(1.03) 92.67(1.06) 92.67(1.06) 92.67(1.06)

h(2) H0(2)

0 5.17(0.90) 5.17(0.90) 5.67(0.94) 5.33(0.92) 5.33(0.92) 5.33(0.92)
0.1 11.00(1.28) 11.00(1.28) 11.33(1.29) 12.50(1.35) 12.50(1.35) 12.50(1.35)
0.2 30.67(1.88) 30.67(1.88) 31.00(1.89) 33.67(1.93) 33.67(1.93) 33.67(1.93)
0.4 85.17(1.45) 85.00(1.46) 85.00(1.46) 87.83(1.33) 87.83(1.33) 87.83(1.33)

h(2) H0(3)

0 6.50 (1.01) 6.33(0.99) 6.50(1.01) 5.67(0.94) 5.67(0.94) 5.67(0.94)
0.1 11.83 (1.32) 11.67(1.31) 11.67(1.31) 11.00(1.28) 11.00(1.28) 11.00(1.28)
0.2 31.67 (1.90) 31.50(1.90) 31.67(1.90) 33.17(1.92) 33.17(1.92) 33.17(1.92)
0.4 84.33 (1.48) 84.17(1.49) 84.50(1.48) 86.00(1.42) 86.17(1.41) 86.17(1.41)

Based on the results, it can be seen that, under the null hypotheses, the Type I error rates of the three tests are well controlled and close to the nominal level in all four settings. Under the alternative hypotheses, the powers of the three test statistics increase as h(1) or h(2) increases, showing the consistency of our testing procedures. Moreover, the empirical rejection rates of the three test statistics are very close across all scenarios and settings. For example, the rejection rates for testing H0(1) and H0(2) are exactly the same when p = 200 in Table 1, although the values of the three statistics in our simulations differ slightly. This is consistent with our theoretical finding that these statistics are asymptotically equivalent even in high dimensional settings. Figures S1, S2 and S3 in the supplementary material depict the kernel density estimates of the three test statistics under H0(1) and H0(2) for different combinations of p and the covariance matrices. It can be seen that the three test statistics converge to their limiting distributions under the null hypotheses.

4.1.2. Testing univariate parameter.

Consider testing the following two pairs of hypotheses:

\[
H_0^{(4)}: \beta_{0,2}=-2 \quad\text{vs.}\quad H_a^{(4)}: \beta_{0,2}\neq -2,
\]
\[
H_0^{(5)}: \beta_{0,3}=0 \quad\text{vs.}\quad H_a^{(5)}: \beta_{0,3}\neq 0.
\]

We set h(2) = 0 when testing H0(4), and set h(1) = 0 when testing H0(5). Therefore, H0(4) is equivalent to h(1) = 0 and H0(5) is equivalent to h(2) = 0. We use the same 4 settings described in Section 4.1.1. For each setting, we set h(1) = 0.1,0.2,0.4 under Ha(4) and h(2) = 0.1,0.2,0.4 under Ha(5). Comparison is made among the following test statistics:

  1. The proposed likelihood ratio (TL), Wald (TW) and score (TS) statistics.

  2. The Wald test statistic based on the de-sparsified Lasso estimator (TWD).

  3. The decorrelated score statistic (TSD).

The test statistic TWD is computed via the R package hdi (Dezeure et al., 2015). We calculate TSD according to Section 4.1 in Ning and Liu (2017). More specifically, the initial estimator β̂ is computed by a penalized linear regression with the SCAD penalty function, and ŵ is computed by a penalized linear regression with the l1 penalty function (see Equation (4.4) in Ning and Liu, 2017). These penalized regressions are implemented via the R package ncvreg (Breheny and Huang, 2011). The tuning parameters are selected via 10-fold cross-validation. The rejection probabilities of these test statistics under the settings where Σ = {0.5^{|i−j|}} are reported in Table 2. In the supplementary material, we report the rejection probabilities of these test statistics under the settings where Σ = I in Table S2. Results are averaged over 600 simulation replications.

Table 2.

Rejection probabilities (%) of the partial penalized Wald, score and likelihood ratio statistics, the Wald test statistic based on the de-sparsified Lasso estimator and the decorrelated score statistic under the settings where Σ={0.5|ij|}i,j=1,,p, with standard errors in parenthesis (%).

TL TW TS TWD TSD
h(1) H0(4) and p = 50

0 5.17(0.90) 5.33(0.92) 5.50(0.93) 12.67(1.36) 7.00(1.04)
0.1 15.67(1.48) 16.00(1.50) 16.00(1.50) 6.00(0.97) 14.67(1.44)
0.2 41.00(2.01) 41.33(2.01) 41.50(2.01) 14.83(1.45) 38.83(1.99)
0.4 92.50(1.08) 93.00(1.04) 93.00(1.04) 67.67(1.91) 88.67(1.29)

H0(4) and p = 200

0 4.83(0.88) 4.83(0.88) 4.83(0.88) 21.83(1.69) 5.50(0.93)
0.1 11.00(1.28) 11.00(1.28) 11.00(1.28) 5.83(0.96) 10.83(1.27)
0.2 40.50(2.00) 40.50(2.00) 40.50(2.00) 6.17(0.98) 37.83(1.98)
0.4 91.50(1.14) 91.50(1.14) 91.50(1.14) 49.33(2.04) 88.00(1.33)

h(2) H0(5) and p = 50

0 6.33(0.99) 6.00(0.97) 6.50(1.00) 5.33(0.92) 3.00(0.70)
0.1 13.67(1.40) 13.50(1.40) 14.00(1.42) 5.33(0.92) 9.17(1.18)
0.2 40.17(2.00) 40.33(2.00) 40.50(2.00) 15.67(1.48) 28.50(1.84)
0.4 90.83(1.18) 91.33(1.15) 91.67(1.13) 69.17(1.89) 83.33(1.52)

H0(5) and p = 200

0 5.67(0.94) 5.67(0.94) 5.67(0.94) 6.50(1.01) 2.67(0.66)
0.1 13.67(1.40) 13.67(1.40) 13.67(1.40) 3.67(0.77) 8.17(1.12)
0.2 39.17(1.99) 39.17(1.99) 39.17(1.99) 9.67(1.21) 24.67(1.76)
0.4 91.50(1.14) 91.50(1.14) 91.50(1.14) 51.33(2.04) 80.50(1.62)

From Table 2, it can be seen that TWD fails for testing H0(4) under the settings where Σ = {0.5^{|i−j|}}: under the null hypotheses, its Type I error rates exceed 12%. Under the alternative hypotheses, the proposed test statistics and the decorrelated score test are more powerful than TWD in almost all cases. We also note that TL, TW, TS and TWD perform comparably under the settings where Σ = I. When Σ = {0.5^{|i−j|}}, however, the proposed test statistics achieve greater power than TSD. This is in line with our theoretical findings (see Section S1 of the supplementary material for details).

4.1.3. Effects of m.

In Section 4.1.1, we considered linear hypotheses involving only two parameters. As suggested by one of the referees, we further examine our test statistics under settings where more regression parameters are involved in the hypotheses. More specifically, we consider the following three pairs of hypotheses:

\[
H_0^{(6)}: \sum_{j=1}^{4}\beta_{0,j}=0 \quad\text{vs.}\quad H_a^{(6)}: \sum_{j=1}^{4}\beta_{0,j}\neq 0,
\]
\[
H_0^{(7)}: \sum_{j=1}^{8}\beta_{0,j}=0 \quad\text{vs.}\quad H_a^{(7)}: \sum_{j=1}^{8}\beta_{0,j}\neq 0,
\]
\[
H_0^{(8)}: \sum_{j=1}^{12}\beta_{0,j}=0 \quad\text{vs.}\quad H_a^{(8)}: \sum_{j=1}^{12}\beta_{0,j}\neq 0.
\]

The numbers of parameters involved in H0(6),H0(7) and H0(8) are equal to 4, 8 and 12, respectively. We consider the same 4 settings described in Section 4.1.1. For each setting, we set h(1) = 0,0.2,0.4,0.8 and h(2) = 0. Hence, the null hypotheses hold when h(1) = 0 and the alternatives hold when h(1) > 0. We report the rejection probabilities over 600 replications in Table 3, under the settings where Σ = {0.5|i–j|}. Rejection probabilities under the settings where Σ = I are reported in Table S3 in the supplementary material.

Table 3.

Rejection probabilities (%) of the partial penalized Wald, score and likelihood ratio statistics with standard errors in parentheses (%), under the settings where Σ = {0.5^{|i−j|}}_{i,j=1,…,p}.

p = 50 p = 200

TL TW TS TL TW TS
h(1) H0(6)

0 4.83(0.88) 4.50(0.85) 4.67(0.86) 4.83(0.88) 4.83(0.88) 4.83(0.88)
0.2 28.17(1.84) 28.17(1.84) 28.50(1.84) 28.50(1.84) 28.50(1.84) 28.50(1.84)
0.4 80.33(1.62) 80.17(1.63) 80.33(1.62) 79.83(1.64) 79.83(1.64) 79.83(1.64)
0.8 99.83(0.17) 100.00(0.00) 100.00(0.00) 100.00(0.00) 100.00(0.00) 100.00(0.00)

h(1) H0(7)

0 4.50(0.85) 4.50(0.85) 4.50(0.85) 5.00(0.89) 5.00(0.89) 5.00(0.89)
0.2 18.17(1.57) 18.33(1.58) 18.33(1.58) 18.33(1.58) 18.33(1.58) 18.33(1.58)
0.4 53.83(2.04) 54.17(2.03) 54.00(2.03) 57.33(2.02) 57.33(2.02) 57.33(2.02)
0.8 98.50(0.50) 99.00(0.41) 99.00(0.41) 98.50(0.50) 98.50(0.50) 98.50(0.50)

h(1) H0(8)

0 5.17 (0.90) 5.00(0.89) 5.17(0.90) 5.67(0.94) 5.67(0.94) 5.67(0.94)
0.2 14.33 (1.43) 14.33(1.43) 14.33(1.43) 13.67(1.40) 13.67(1.40) 13.67(1.40)
0.4 42.00 (2.01) 42.17(2.02) 42.17(2.02) 41.67(2.01) 41.67(2.01) 41.67(2.01)
0.8 92.83 (1.05) 92.83(1.05) 92.83(1.05) 93.00(1.04) 93.00(1.04) 93.00(1.04)

The Type I error rates of the three test statistics are close to the nominal level under the null hypotheses. The powers of the test statistics increase as h(1) increases, under the alternative hypotheses. Moreover, we note that the powers decrease as m increases. This is in line with Corollary 3.1, which states that the asymptotic power function of our test statistics is a function of r and γ_n. Recall that γ_n = n h_n^T(CΩ_{mm}C^T)^{−1}h_n/φ_0. Consider the following sequence of null hypotheses indexed by m ≥ 2: C_mβ_0 = 0, where C_m = (1,···,1, 0_{p−m}) contains m ones. Let γ_{n,m} = n h_n^T(C_mΩ_{mm}C_m^T)^{−1}h_n/φ_0. Under the given settings, Ω_{mm} = (ω_{ij}) is a banded matrix with ω_{ij} = 0 for |i−j| ≥ 2, ω_{ij} = −1/(1−ρ²) for |i−j| = 1, ω_{11} = ω_{mm} = 1/{ρ(1−ρ²)}, and ω_{jj} = (1+ρ²)/{ρ(1−ρ²)} for j ≠ 1, m, where ρ is the correlation between X_1 and X_2. It is immediate to see that γ_{n,m} decreases as m increases.
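As an illustration, the monotonicity of γ_{n,m} in m can be checked numerically. The sketch below (Python; the values n = 300, h_n = 0.4, φ_0 = 1 and ρ = 0.5 are our illustrative choices, not quantities fixed by the paper) builds the banded Ω_{mm} described above, with C_m a single row of m ones:

```python
import numpy as np

def gamma_nm(m, rho=0.5, n=300, h=0.4, phi0=1.0):
    # Banded Omega_mm with the entries stated above
    omega = np.full(m, (1 + rho**2) / (rho * (1 - rho**2)))
    omega[0] = omega[-1] = 1 / (rho * (1 - rho**2))
    O = np.diag(omega)
    off = -1 / (1 - rho**2)
    O += np.diag(np.full(m - 1, off), 1) + np.diag(np.full(m - 1, off), -1)
    C = np.ones((1, m))                # C_m: a single row of m ones
    psi = (C @ O @ C.T)[0, 0]          # C_m Omega_mm C_m^T (a scalar here)
    return n * h**2 / (psi * phi0)     # gamma_{n,m} with scalar h_n = h

vals = [gamma_nm(m) for m in (2, 4, 8, 12)]
assert all(a > b for a, b in zip(vals, vals[1:]))  # gamma_{n,m} decreases in m
```

The resulting decreasing sequence γ_{n,2} > γ_{n,4} > γ_{n,8} > γ_{n,12} mirrors the power loss observed in Table 3 as m grows.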

4.2. Logistic regression.

In this example, we generate data with sample size n = 300 from the logistic regression model

\[
\mathrm{logit}\{\Pr(Y=1|X)\}=2X_1-(2+h^{(1)})X_2+h^{(2)}X_3,
\]

where logit(p) = log{p/(1−p)} is the logit link function, and X ∼ N(0_p, Σ).
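For concreteness, this data-generating process can be sketched as follows (Python; we pick p = 50 and Σ = {0.5^{|i−j|}}, one of the four settings of Section 4.1.1, with an arbitrary seed; the sign β_{0,2} = −(2+h^{(1)}) is chosen so that H0(1): β_{0,1}+β_{0,2} = 0 holds exactly when h^{(1)} = 0, as stated below):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, h1, h2 = 300, 50, 0.0, 0.0     # h1 = h2 = 0 corresponds to the null settings
idx = np.arange(p)
Sigma = 0.5 ** np.abs(idx[:, None] - idx[None, :])   # Sigma_ij = 0.5^{|i-j|}
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
beta = np.zeros(p)
beta[:3] = [2.0, -(2.0 + h1), h2]    # only the first three coefficients are nonzero
prob = 1.0 / (1.0 + np.exp(-X @ beta))   # inverse of the logit link
Y = rng.binomial(1, prob)                # binary responses
```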

4.2.1. Testing linear hypothesis.

We consider the same linear hypotheses as those in Section 4.1.1:

\[
H_0^{(1)}:\ \beta_{0,1}+\beta_{0,2}=0,\quad\text{v.s.}\quad H_a^{(1)}:\ \beta_{0,1}+\beta_{0,2}\neq 0;
\]
\[
H_0^{(2)}:\ \beta_{0,2}+\beta_{0,3}=-2,\quad\text{v.s.}\quad H_a^{(2)}:\ \beta_{0,2}+\beta_{0,3}\neq -2;
\]
\[
H_0^{(3)}:\ \beta_{0,3}+\beta_{0,4}=0,\quad\text{v.s.}\quad H_a^{(3)}:\ \beta_{0,3}+\beta_{0,4}\neq 0.
\]

Similarly, we set h(2) = 0 when testing H0(1), and set h(1) = 0 when testing H0(2). Therefore, H0(1) is equivalent to h(1) = 0 and H0(2) is equivalent to h(2) = 0. We use the same 4 settings described in Section 4.1.1. For each of the four settings, we set h(j) = 0.2,0.4,0.8 under Ha(j). The rejection probabilities for H0(1) and H0(2) over 600 replications are given in Table S4 in the supplementary material. We also plot the kernel density estimates of three test statistics under H0(1) and H0(2) in Figures S4, S5 and S6 in the supplementary material. The findings are very similar to those in the previous examples.

4.2.2. Testing univariate parameter.

To compare the proposed partial penalized Wald (TW ), score (TS) and likelihood ratio (TL) test statistics with the Wald test based on the de-sparsified Lasso estimator (TWD) and the decorrelated score test (TSD), we consider testing the following hypotheses:

\[
H_0^{(4)}:\ \beta_{0,2}=-2,\quad\text{v.s.}\quad H_a^{(4)}:\ \beta_{0,2}\neq -2;\qquad
H_0^{(5)}:\ \beta_{0,3}=0,\quad\text{v.s.}\quad H_a^{(5)}:\ \beta_{0,3}\neq 0.
\]

Similar to Section 4.1.2, we set h(2) = 0 when testing H0(4), and set h(1) = 0 when testing H0(5). We set h(1) = 0 under H0(4), h(1) = 0.2, 0.4, 0.8 under Ha(4), and set h(2) = 0 under H0(5), h(2) = 0.2, 0.4, 0.8 under Ha(5). We consider the same 4 settings described in Section 4.1.1. The test statistic TWD is computed via the R package hdi, and TSD is obtained according to Section 4.2 of Ning and Liu (2017). We compute the initial estimator β̂ in TSD by fitting a penalized logistic regression with the SCAD penalty function, and calculate ω̂ by fitting a penalized linear regression with the ℓ1 penalty function. These penalized regressions are implemented via the R package ncvreg. We report the rejection probabilities of TW, TS, TL, TWD and TSD in Table S5 in the supplementary article, based on 600 simulation replications.

Based on the results, it can be seen that the Type I error rates of TWD and TSD are significantly larger than the nominal level in almost all cases for testing H0(4). On the other hand, the Type I error rates of the proposed test statistics are close to the nominal level under H0(4). Besides, under Ha(5), the powers of the proposed test statistics are greater than or equal to those of TWD and TSD in all cases.

4.2.3. Effects of m.

As in Section 4.1.3, we further examine the proposed test statistics by allowing more regression coefficients to appear in the linear hypotheses. Similarly, we consider the following three pairs of hypotheses:

\[
H_0^{(6)}:\ \sum_{j=1}^{4}\beta_{0,j}=0,\quad\text{v.s.}\quad H_a^{(6)}:\ \sum_{j=1}^{4}\beta_{0,j}\neq 0;
\]
\[
H_0^{(7)}:\ \sum_{j=1}^{8}\beta_{0,j}=0,\quad\text{v.s.}\quad H_a^{(7)}:\ \sum_{j=1}^{8}\beta_{0,j}\neq 0;
\]
\[
H_0^{(8)}:\ \sum_{j=1}^{12}\beta_{0,j}=0,\quad\text{v.s.}\quad H_a^{(8)}:\ \sum_{j=1}^{12}\beta_{0,j}\neq 0.
\]

We set h(2) = 0, and set h(1) = 0 under the null hypotheses, h(1) = 0.4,0.8,1.6 under the alternative hypotheses. The same 4 settings described in Section 4.1.1 are used. The rejection probabilities of the proposed test statistics are reported in Table S6 in the supplementary article. Results are averaged over 600 replications. Findings are very similar to those in Section 4.1.3.

5. Technical proofs.

This section consists of the proof of Theorem 3.1. To establish Theorem 3.1, we need the following lemma. The proof of this lemma is given in Section 5.1. For any symmetric and positive definite matrix A ∈ ℝ^{q×q}, it follows from the spectral theorem that A = U^TΛU for some orthogonal matrix U and diagonal matrix Λ = diag(λ_1,...,λ_q). Since the diagonal elements of Λ are positive, we use Λ^{1/2} and Λ^{−1/2} to denote the diagonal matrices diag(λ_1^{1/2},...,λ_q^{1/2}) and diag(λ_1^{−1/2},...,λ_q^{−1/2}), respectively. In addition, we define A^{1/2} = U^TΛ^{1/2}U and A^{−1/2} = U^TΛ^{−1/2}U.
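As an aside, the spectral construction of A^{1/2} and A^{−1/2} just described can be checked numerically; a minimal sketch (Python, using numpy's symmetric eigendecomposition, which returns A = VΛV^T so that U = V^T; the test matrix is arbitrary):

```python
import numpy as np

def mat_power(A, power):
    # Spectral decomposition of a symmetric PD matrix: A = V diag(lam) V^T
    lam, V = np.linalg.eigh(A)
    return (V * lam**power) @ V.T

rng = np.random.default_rng(1)
B = rng.standard_normal((5, 5))
A = B @ B.T + 5 * np.eye(5)              # symmetric positive definite test matrix
half = mat_power(A, 0.5)                 # A^{1/2}
neg_half = mat_power(A, -0.5)            # A^{-1/2}
assert np.allclose(half @ half, A)       # A^{1/2} A^{1/2} = A
assert np.allclose(half @ neg_half, np.eye(5))  # A^{1/2} A^{-1/2} = I
```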

LEMMA 5.1. Under the conditions in Theorem 3.1, we have

\[
\lambda_{\max}(K_n)=O(1),\tag{5.1}
\]
\[
\lambda_{\max}(K_n^{1/2})=O(1),\tag{5.2}
\]
\[
\lambda_{\max}(K_n^{-1/2})=O(1),\tag{5.3}
\]
\[
\lambda_{\max}\{(C\Omega_{mm}C^T)^{-1}\}=O(1),\tag{5.4}
\]
\[
\|\Psi^{-1/2}C\|_2=O(1),\tag{5.5}
\]
\[
\|\Psi^{1/2}(C\hat\Omega_{a,mm}C^T)^{-1}\Psi^{1/2}-I\|_2=O_p\Bigl(\frac{s+m}{\sqrt{n}}\Bigr),\tag{5.6}
\]
\[
\|I-K_n^{1/2}\hat K_{n,0}^{-1}K_n^{1/2}\|_2=O_p\Bigl(\frac{s+m}{\sqrt{n}}\Bigr),\tag{5.7}
\]

where Ψ = CΩ_{mm}C^T and
\[
\hat K_{n,0}=\frac{1}{n}\begin{pmatrix}X_M^T\Sigma(X\hat\beta_0)X_M & X_M^T\Sigma(X\hat\beta_0)X_S\\ X_S^T\Sigma(X\hat\beta_0)X_M & X_S^T\Sigma(X\hat\beta_0)X_S\end{pmatrix}.
\]

We break the proof into four steps. In the first three steps, we show that TW/r, TS/r and TL/r are equivalent to T0/r, respectively, where
\[
T_0=\frac{1}{\phi_0}(\omega_n+\sqrt{n}h_n)^T(C\Omega_{mm}C^T)^{-1}(\omega_n+\sqrt{n}h_n),
\]
and
\[
\omega_n=\frac{1}{\sqrt{n}}\begin{pmatrix}C^T\\O_{r\times s}^T\end{pmatrix}^TK_n^{-1}\begin{pmatrix}X_M^T\\X_S^T\end{pmatrix}\{Y-\mu(X\beta_0)\}.
\]

In the final step, we show the χ2 approximation (3.5) holds for TW ,TS and TL.

Step 1: We first show that TW/r is equivalent to T0/r. It follows from Theorem 2.1 that
\[
\sqrt{n}\begin{pmatrix}\hat\beta_{a,M}-\beta_{0,M}\\ \hat\beta_{a,S}-\beta_{0,S}\end{pmatrix}=\frac{1}{\sqrt{n}}K_n^{-1}\begin{pmatrix}X_M^T\\ X_S^T\end{pmatrix}\{Y-\mu(X\beta_0)\}+R_a,
\]
for some vector R_a that satisfies
\[
\|R_a\|_2=o_p(1).\tag{5.8}
\]
Therefore, we have
\[
\sqrt{n}\,C(\hat\beta_{a,M}-\beta_{0,M})=\omega_n+CR_{a,J_0},\tag{5.9}
\]
where J_0 = [1,...,m]. Since Cβ_{0,M} = t + h_n, it follows from (5.9) that
\[
\sqrt{n}(C\hat\beta_{a,M}-t)=\omega_n+CR_{a,J_0}+\sqrt{n}h_n,
\]
and hence
\[
\sqrt{n}\,\Psi^{-1/2}(C\hat\beta_{a,M}-t)=\Psi^{-1/2}(\omega_n+CR_{a,J_0}+\sqrt{n}h_n).\tag{5.10}
\]

By (5.8) and (5.5) in Lemma 5.1, we have
\[
\|\Psi^{-1/2}CR_{a,J_0}\|_2\le\|\Psi^{-1/2}C\|_2\|R_{a,J_0}\|_2\le\|\Psi^{-1/2}C\|_2\|R_a\|_2=o_p(1).
\]
This together with (5.10) gives
\[
\sqrt{n}\,\Psi^{-1/2}(C\hat\beta_{a,M}-t)=\Psi^{-1/2}(\omega_n+\sqrt{n}h_n)+o_p(1).\tag{5.11}
\]
Note that
\[
E\|\Psi^{-1/2}\omega_n\|_2^2=\operatorname{tr}(\Psi^{-1/2}E\omega_n\omega_n^T\Psi^{-1/2})=\phi_0\operatorname{tr}(\Psi^{-1/2}\Psi\Psi^{-1/2})=r\phi_0.
\]
By Markov's inequality, we have
\[
\|\Psi^{-1/2}\omega_n\|_2=O_p(\sqrt{r}).\tag{5.12}
\]
Besides, it follows from (5.4) in Lemma 5.1 and Condition (A4) that
\[
\|\sqrt{n}\,\Psi^{-1/2}h_n\|_2=O(\sqrt{r}).\tag{5.13}
\]
This together with (5.11) and (5.12) implies that
\[
\|\sqrt{n}\,\Psi^{-1/2}(C\hat\beta_{a,M}-t)\|_2=O_p(\sqrt{r}).\tag{5.14}
\]
Combining this together with (5.6) in Lemma 5.1 gives
\[
\begin{aligned}
&\bigl|\{\sqrt{n}\Psi^{-1/2}(C\hat\beta_{a,M}-t)\}^T\{\Psi^{1/2}(C\hat\Omega_{a,mm}C^T)^{-1}\Psi^{1/2}-I\}\{\sqrt{n}\Psi^{-1/2}(C\hat\beta_{a,M}-t)\}\bigr|\\
&\qquad\le\|\sqrt{n}\Psi^{-1/2}(C\hat\beta_{a,M}-t)\|_2^2\,\|\Psi^{1/2}(C\hat\Omega_{a,mm}C^T)^{-1}\Psi^{1/2}-I\|_2=O_p\Bigl(r\,\frac{s+m}{\sqrt{n}}\Bigr).
\end{aligned}
\]

The last term is o_p(r) under the condition s + m = o(n^{1/3}). By the definition of TW, we have shown that
\[
\hat\phi\,|T_W-T_{W,0}|=o_p(r),\tag{5.15}
\]
where
\[
T_{W,0}=\frac{n(C\hat\beta_{a,M}-t)^T\Psi^{-1}(C\hat\beta_{a,M}-t)}{\hat\phi}.
\]
Under the conditions in Theorem 3.1, we have φ̂ = φ_0 + o_p(1). Since φ_0 > 0, we have
\[
1/\hat\phi=O_p(1),\tag{5.16}
\]
which together with (5.15) entails that TW = T_{W,0} + o_p(r).

It follows from (5.10)-(5.13) and the condition s + m = o(n^{1/3}) that
\[
\begin{aligned}
\hat\phi\,T_{W,0}&=\|\Psi^{-1/2}\omega_n+\sqrt{n}\Psi^{-1/2}h_n+o_p(1)\|_2^2\\
&=\|\Psi^{-1/2}\omega_n+\sqrt{n}\Psi^{-1/2}h_n\|_2^2+o_p(1)+o_p(\|\Psi^{-1/2}(\omega_n+\sqrt{n}h_n)\|_2)\\
&=\|\Psi^{-1/2}\omega_n+\sqrt{n}\Psi^{-1/2}h_n\|_2^2+o_p(1)+o_p(\sqrt{r})\\
&=\|\Psi^{-1/2}\omega_n+\sqrt{n}\Psi^{-1/2}h_n\|_2^2+o_p(r)=\hat\phi\,T_{W,1}+o_p(r),
\end{aligned}\tag{5.17}
\]
where
\[
T_{W,1}=\frac{\|\Psi^{-1/2}\omega_n+\sqrt{n}\Psi^{-1/2}h_n\|_2^2}{\hat\phi}.
\]
By (5.16), we obtain T_{W,0} = T_{W,1} + o_p(r) and hence TW = T_{W,1} + o_p(r). In the following, we show T_{W,1} = T_0 + o_p(r).

Observe that
\[
|T_{W,1}-T_0|=\frac{|\phi_0-\hat\phi|}{\hat\phi\,\phi_0}\|\Psi^{-1/2}\omega_n+\sqrt{n}\Psi^{-1/2}h_n\|_2^2.\tag{5.18}
\]

It follows from (5.12), (5.13), (5.16) and the condition |φ̂ − φ_0| = o_p(1) that the right-hand side (RHS) of (5.18) is of the order o_p(r). This proves T_{W,1} = T_0 + o_p(r).
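Step 1 reduces TW to a quadratic form in √n(Cβ̂_{a,M} − t), whose null law is χ² with r degrees of freedom. As a purely illustrative Monte Carlo sketch (Python; Ψ, φ_0 and the replication count are invented), we simulate w ∼ N(0, φ_0Ψ), which plays the role of √n(Cβ̂_{a,M} − t) under the null, and form the Wald quadratic form:

```python
import numpy as np

rng = np.random.default_rng(4)
r, phi0, B = 2, 1.5, 100_000
A = rng.standard_normal((r, r))
Psi = A @ A.T + np.eye(r)                  # a generic positive definite Psi (invented)
L = np.linalg.cholesky(phi0 * Psi)
w = rng.standard_normal((B, r)) @ L.T      # w ~ N(0, phi0 * Psi) under the null
TW = np.einsum('bi,ij,bj->b', w, np.linalg.inv(Psi), w) / phi0  # Wald quadratic form
assert abs(TW.mean() - r) < 0.05           # E chi^2(r) = r
assert abs((TW > 5.991).mean() - 0.05) < 0.01  # 5.991: ~95% quantile of chi^2(2)
```

The empirical mean and the 5% rejection rate agree with the central χ²(r) limit, consistent with the approximation established in Step 4.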

Step 2: We show that TS/r is equivalent to T0/r. Based on the proof of Theorem 2.1 in Section S5.1 of the supplementary article, we have
\[
\frac{1}{\sqrt{n}}\begin{pmatrix}X_M^T\\X_S^T\end{pmatrix}\{Y-\mu(X\hat\beta_0)\}=\frac{1}{\sqrt{n}}\begin{pmatrix}X_M^T\\X_S^T\end{pmatrix}\{Y-\mu(X\beta_0)\}-\frac{1}{\sqrt{n}}\begin{pmatrix}X_M^T\\X_S^T\end{pmatrix}\Sigma(X\beta_0)\begin{pmatrix}X_M^T\\X_S^T\end{pmatrix}^T\begin{pmatrix}\hat\beta_{0,M}-\beta_{0,M}\\ \hat\beta_{0,S}-\beta_{0,S}\end{pmatrix}+o_p(1),\tag{5.19}
\]
and
\[
\sqrt{n}\begin{pmatrix}\hat\beta_{0,M}-\beta_{0,M}\\ \hat\beta_{0,S}-\beta_{0,S}\end{pmatrix}=\frac{1}{\sqrt{n}}K_n^{-1/2}(I-P_n)K_n^{-1/2}\begin{pmatrix}X_M^T\\X_S^T\end{pmatrix}\{Y-\mu(X\beta_0)\}-\sqrt{n}K_n^{-1}\begin{pmatrix}C^T\\O_{r\times s}^T\end{pmatrix}\Psi^{-1}h_n+o_p(1).\tag{5.20}
\]
Combining (5.1) with (5.20) gives
\[
\sqrt{n}\,K_n\begin{pmatrix}\hat\beta_{0,M}-\beta_{0,M}\\ \hat\beta_{0,S}-\beta_{0,S}\end{pmatrix}=\frac{1}{\sqrt{n}}K_n^{1/2}(I-P_n)K_n^{-1/2}\begin{pmatrix}X_M^T\\X_S^T\end{pmatrix}\{Y-\mu(X\beta_0)\}-\sqrt{n}\begin{pmatrix}C^T\\O_{r\times s}^T\end{pmatrix}\Psi^{-1}h_n+o_p(1),
\]
which together with (5.19) implies that
\[
\begin{aligned}
\frac{1}{\sqrt{n}}\begin{pmatrix}X_M^T\\X_S^T\end{pmatrix}\{Y-\mu(X\hat\beta_0)\}&=\frac{1}{\sqrt{n}}\begin{pmatrix}X_M^T\\X_S^T\end{pmatrix}\{Y-\mu(X\beta_0)\}+o_p(1)\\
&\quad-\frac{1}{\sqrt{n}}K_n^{1/2}(I-P_n)K_n^{-1/2}\begin{pmatrix}X_M^T\\X_S^T\end{pmatrix}\{Y-\mu(X\beta_0)\}+\sqrt{n}\begin{pmatrix}C^T\\O_{r\times s}^T\end{pmatrix}\Psi^{-1}h_n\\
&=\frac{1}{\sqrt{n}}K_n^{1/2}P_nK_n^{-1/2}\begin{pmatrix}X_M^T\\X_S^T\end{pmatrix}\{Y-\mu(X\beta_0)\}+\sqrt{n}\begin{pmatrix}C^T\\O_{r\times s}^T\end{pmatrix}\Psi^{-1}h_n+o_p(1).
\end{aligned}
\]
By (5.3), we have
\[
\frac{1}{\sqrt{n}}K_n^{-1/2}\begin{pmatrix}X_M^T\\X_S^T\end{pmatrix}\{Y-\mu(X\hat\beta_0)\}=\frac{1}{\sqrt{n}}P_nK_n^{-1/2}\begin{pmatrix}X_M^T\\X_S^T\end{pmatrix}\{Y-\mu(X\beta_0)\}+\sqrt{n}\,K_n^{-1/2}\begin{pmatrix}C^T\\O_{r\times s}^T\end{pmatrix}\Psi^{-1}h_n+o_p(1).\tag{5.21}
\]
It follows from (5.5) and (5.13) that
\[
\Bigl\|\begin{pmatrix}C^T\\O_{r\times s}^T\end{pmatrix}\Psi^{-1}h_n\Bigr\|_2\le\|C^T\Psi^{-1/2}\|_2\|\Psi^{-1/2}h_n\|_2=O\Bigl(\sqrt{\frac{r}{n}}\Bigr).
\]
This together with (5.3) yields
\[
\Bigl\|\sqrt{n}\,K_n^{-1/2}\begin{pmatrix}C^T\\O_{r\times s}^T\end{pmatrix}\Psi^{-1}h_n\Bigr\|_2=O(\sqrt{r}).\tag{5.22}
\]

Notice that
\[
E\Bigl\|\frac{1}{\sqrt{n}}P_nK_n^{-1/2}\begin{pmatrix}X_M^T\\X_S^T\end{pmatrix}\{Y-\mu(X\beta_0)\}\Bigr\|_2^2=\phi_0\operatorname{tr}(P_n)=\phi_0\operatorname{rank}(P_n)=\phi_0 r.
\]
It follows from Markov's inequality that
\[
\Bigl\|\frac{1}{\sqrt{n}}P_nK_n^{-1/2}\begin{pmatrix}X_M^T\\X_S^T\end{pmatrix}\{Y-\mu(X\beta_0)\}\Bigr\|_2=O_p(\sqrt{r}).
\]
Combining this with (5.21) and (5.22) yields
\[
\Bigl\|\frac{1}{\sqrt{n}}K_n^{-1/2}\begin{pmatrix}X_M^T\\X_S^T\end{pmatrix}\{Y-\mu(X\hat\beta_0)\}\Bigr\|_2=O_p(\sqrt{r}).\tag{5.23}
\]
This together with (5.7) and the condition s + m = o(n^{1/3}) gives that
\[
\begin{aligned}
&\Bigl|\frac{1}{n}\Bigl[\begin{pmatrix}X_M^T\\X_S^T\end{pmatrix}\{Y-\mu(X\hat\beta_0)\}\Bigr]^T(K_n^{-1}-\hat K_{n,0}^{-1})\begin{pmatrix}X_M^T\\X_S^T\end{pmatrix}\{Y-\mu(X\hat\beta_0)\}\Bigr|\\
&\qquad\le\Bigl\|\frac{1}{\sqrt{n}}K_n^{-1/2}\begin{pmatrix}X_M^T\\X_S^T\end{pmatrix}\{Y-\mu(X\hat\beta_0)\}\Bigr\|_2^2\,\|I-K_n^{1/2}\hat K_{n,0}^{-1}K_n^{1/2}\|_2=O_p\Bigl(r\,\frac{s+m}{\sqrt{n}}\Bigr)=o_p(r).
\end{aligned}
\]

When Ŝ_0 = S, we have
\[
\hat\phi\,T_S=\frac{1}{n}\Bigl[\begin{pmatrix}X_M^T\\X_S^T\end{pmatrix}\{Y-\mu(X\hat\beta_0)\}\Bigr]^T\hat K_{n,0}^{-1}\begin{pmatrix}X_M^T\\X_S^T\end{pmatrix}\{Y-\mu(X\hat\beta_0)\}.
\]
Since Pr(Ŝ_0 = S) → 1, we obtain φ̂|T_S − T_{S,0}| = o_p(r), where
\[
T_{S,0}=\frac{1}{n\hat\phi}\Bigl[\begin{pmatrix}X_M^T\\X_S^T\end{pmatrix}\{Y-\mu(X\hat\beta_0)\}\Bigr]^TK_n^{-1}\begin{pmatrix}X_M^T\\X_S^T\end{pmatrix}\{Y-\mu(X\hat\beta_0)\}.
\]
This together with (5.16) implies that |T_S − T_{S,0}| = o_p(r). Using similar arguments to (5.17) and (5.18), we can show that T_{S,0}/r is equivalent to T_{S,1}/r, where T_{S,1} is defined as
\[
\frac{1}{\phi_0}\Bigl\|\frac{1}{\sqrt{n}}P_nK_n^{-1/2}\begin{pmatrix}X_M^T\\X_S^T\end{pmatrix}\{Y-\mu(X\beta_0)\}+\sqrt{n}\,K_n^{-1/2}\begin{pmatrix}C^T\\O_{r\times s}^T\end{pmatrix}\Psi^{-1}h_n\Bigr\|_2^2.
\]
Recalling that
\[
P_n=K_n^{-1/2}\begin{pmatrix}C^T\\O_{r\times s}^T\end{pmatrix}\Psi^{-1}\begin{pmatrix}C^T\\O_{r\times s}^T\end{pmatrix}^TK_n^{-1/2},
\]
we have
\[
T_{S,1}=\frac{1}{\phi_0}\Bigl\|K_n^{-1/2}\begin{pmatrix}C^T\\O_{r\times s}^T\end{pmatrix}\Psi^{-1}\omega_n+\sqrt{n}\,K_n^{-1/2}\begin{pmatrix}C^T\\O_{r\times s}^T\end{pmatrix}\Psi^{-1}h_n\Bigr\|_2^2=\frac{1}{\phi_0}\|\Psi^{-1/2}\omega_n+\sqrt{n}\Psi^{-1/2}h_n\|_2^2=T_0.
\]
This proves the equivalence between TS/r and T0/r.
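The expectation computation in Step 2 relies on P_n being an orthogonal projection, so that tr(P_n) = rank(P_n) = r. This generic projection identity can be checked numerically (Python; the full-column-rank matrix below is arbitrary, not an object from the paper):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((10, 3))              # any full-column-rank matrix, r = 3
P = A @ np.linalg.inv(A.T @ A) @ A.T          # orthogonal projection onto col(A)
assert np.allclose(P @ P, P)                  # idempotent
assert np.isclose(np.trace(P), 3)             # tr(P) = rank(P) = r
```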

Step 3: By Theorem 2.1, we have
\[
\sqrt{n}\begin{pmatrix}\hat\beta_{a,M}-\hat\beta_{0,M}\\ \hat\beta_{a,S}-\hat\beta_{0,S}\end{pmatrix}=\frac{1}{\sqrt{n}}K_n^{-1/2}P_nK_n^{-1/2}\begin{pmatrix}X_M^T\\X_S^T\end{pmatrix}\{Y-\mu(X\beta_0)\}+\sqrt{n}\,K_n^{-1/2}P_nK_n^{1/2}\begin{pmatrix}C^T(CC^T)^{-1}h_n\\0\end{pmatrix}+o_p(1).
\]
Notice that
\[
K_n^{-1/2}P_nK_n^{1/2}\begin{pmatrix}C^T(CC^T)^{-1}h_n\\0\end{pmatrix}=K_n^{-1}\begin{pmatrix}C^T\\O_{r\times s}^T\end{pmatrix}\Psi^{-1}\begin{pmatrix}C^T\\O_{r\times s}^T\end{pmatrix}^T\begin{pmatrix}C^T(CC^T)^{-1}h_n\\0\end{pmatrix}=K_n^{-1}\begin{pmatrix}C^T\\O_{r\times s}^T\end{pmatrix}\Psi^{-1}h_n.
\]
It follows that
\[
\sqrt{n}\begin{pmatrix}\hat\beta_{a,M}-\hat\beta_{0,M}\\ \hat\beta_{a,S}-\hat\beta_{0,S}\end{pmatrix}=\frac{1}{\sqrt{n}}K_n^{-1/2}P_nK_n^{-1/2}\begin{pmatrix}X_M^T\\X_S^T\end{pmatrix}\{Y-\mu(X\beta_0)\}+\sqrt{n}\,K_n^{-1}\begin{pmatrix}C^T\\O_{r\times s}^T\end{pmatrix}\Psi^{-1}h_n+o_p(1).\tag{5.24}
\]
Similar to (5.23), we can show that
\[
n\|\hat\beta_{a,M\cup S}-\hat\beta_{0,M\cup S}\|_2^2=O_p(r).\tag{5.25}
\]

Under the event β̂_{0,(M∪S)^c} = β̂_{a,(M∪S)^c} = 0, using a third-order Taylor expansion, we obtain that
\[
\begin{aligned}
L_n(\hat\beta_0)-L_n(\hat\beta_a)&=\frac{1}{n}\begin{pmatrix}\hat\beta_{0,M}-\hat\beta_{a,M}\\ \hat\beta_{0,S}-\hat\beta_{a,S}\end{pmatrix}^T\begin{pmatrix}X_M^T\\X_S^T\end{pmatrix}\{Y-\mu(X\hat\beta_a)\}\\
&\quad-\frac{1}{2n}\begin{pmatrix}\hat\beta_{0,M}-\hat\beta_{a,M}\\ \hat\beta_{0,S}-\hat\beta_{a,S}\end{pmatrix}^T\begin{pmatrix}X_M^T\\X_S^T\end{pmatrix}\Sigma(X\hat\beta_a)\begin{pmatrix}X_M^T\\X_S^T\end{pmatrix}^T\begin{pmatrix}\hat\beta_{0,M}-\hat\beta_{a,M}\\ \hat\beta_{0,S}-\hat\beta_{a,S}\end{pmatrix}+\begin{pmatrix}\hat\beta_{0,M}-\hat\beta_{a,M}\\ \hat\beta_{0,S}-\hat\beta_{a,S}\end{pmatrix}^TR,
\end{aligned}
\]
where n‖R‖_∞ is upper bounded by
\[
\max_{j\in M\cup S}\bigl|(\hat\beta_{0,M\cup S}-\hat\beta_{a,M\cup S})^TX_{M\cup S}^T\operatorname{diag}\{|X_j|\circ|b'''(X\beta^*)|\}X_{M\cup S}(\hat\beta_{0,M\cup S}-\hat\beta_{a,M\cup S})\bigr|\le\|\hat\beta_{0,M\cup S}-\hat\beta_{a,M\cup S}\|_2^2\max_{j\in M\cup S}\lambda_{\max}\bigl(X_{M\cup S}^T\operatorname{diag}\{|X_j|\circ|b'''(X\beta^*)|\}X_{M\cup S}\bigr),
\]
for some β* lying on the line segment between β̂_a and β̂_0. By Theorem 2.1, we have β*_{(M∪S)^c} = 0 and ‖β*_{M∪S} − β_{0,M∪S}‖_2 ≲ √{(s+m)log n/n} with probability tending to 1. By Condition (A1), we obtain
\[
\|R\|_\infty=O_p\Bigl(\frac{r}{n}\Bigr).
\]
This together with (5.25) yields that
\[
\Bigl|\begin{pmatrix}\hat\beta_{a,M}-\hat\beta_{0,M}\\ \hat\beta_{a,S}-\hat\beta_{0,S}\end{pmatrix}^TR\Bigr|\le\|\hat\beta_{a,M\cup S}-\hat\beta_{0,M\cup S}\|_1\|R\|_\infty=O_p\Bigl(\frac{r}{n}\sqrt{\frac{r(s+m)}{n}}\Bigr).
\]
The last term is o_p(r/n) since r ≤ s + m and s + m = o(n^{1/3}).

Similarly, we can show
\[
\Bigl|\begin{pmatrix}\hat\beta_{a,M}-\hat\beta_{0,M}\\ \hat\beta_{a,S}-\hat\beta_{0,S}\end{pmatrix}^T\begin{pmatrix}X_M^T\\X_S^T\end{pmatrix}\bigl\{\Sigma(X\hat\beta_a)-\Sigma(X\beta_0)\bigr\}\begin{pmatrix}X_M^T\\X_S^T\end{pmatrix}^T\begin{pmatrix}\hat\beta_{a,M}-\hat\beta_{0,M}\\ \hat\beta_{a,S}-\hat\beta_{0,S}\end{pmatrix}\Bigr|=o_p(r).
\]
As a result, we have
\[
n\{L_n(\hat\beta_0)-L_n(\hat\beta_a)\}=\begin{pmatrix}\hat\beta_{0,M}-\hat\beta_{a,M}\\ \hat\beta_{0,S}-\hat\beta_{a,S}\end{pmatrix}^T\begin{pmatrix}X_M^T\\X_S^T\end{pmatrix}\{Y-\mu(X\hat\beta_a)\}-\frac{n}{2}\begin{pmatrix}\hat\beta_{0,M}-\hat\beta_{a,M}\\ \hat\beta_{0,S}-\hat\beta_{a,S}\end{pmatrix}^TK_n\begin{pmatrix}\hat\beta_{0,M}-\hat\beta_{a,M}\\ \hat\beta_{0,S}-\hat\beta_{a,S}\end{pmatrix}+o_p(r).\tag{5.26}
\]
Recall that β̂_a is the maximizer of nL_n(β) − n∑_{j∉M}p_{λ_{n,a}}(|β_j|). By Theorem 2.1, we have with probability tending to 1 that min_{j∈S}|β̂_{a,j}| ≥ d_n. Under the condition p′_{λ_{n,a}}(d_n) = o((s+m)^{−1/2}n^{−1/2}), we have
\[
\begin{pmatrix}X_M^T\\X_S^T\end{pmatrix}\{Y-\mu(X\hat\beta_a)\}=n\begin{pmatrix}0\\ \bar\rho(\hat\beta_{a,S},\lambda_{n,a})\end{pmatrix}=o_p(n^{1/2}).
\]
This together with (5.25) yields
\[
\begin{pmatrix}\hat\beta_{0,M}-\hat\beta_{a,M}\\ \hat\beta_{0,S}-\hat\beta_{a,S}\end{pmatrix}^T\begin{pmatrix}X_M^T\\X_S^T\end{pmatrix}\{Y-\mu(X\hat\beta_a)\}=o_p(r).
\]
By (5.26), we obtain that
\[
n\{L_n(\hat\beta_0)-L_n(\hat\beta_a)\}=-\frac{n}{2}\begin{pmatrix}\hat\beta_{0,M}-\hat\beta_{a,M}\\ \hat\beta_{0,S}-\hat\beta_{a,S}\end{pmatrix}^TK_n\begin{pmatrix}\hat\beta_{0,M}-\hat\beta_{a,M}\\ \hat\beta_{0,S}-\hat\beta_{a,S}\end{pmatrix}+o_p(r).
\]
In view of (5.24), using similar arguments to (5.17), we can show that
\[
\Bigl|n\begin{pmatrix}\hat\beta_{0,M}-\hat\beta_{a,M}\\ \hat\beta_{0,S}-\hat\beta_{a,S}\end{pmatrix}^TK_n\begin{pmatrix}\hat\beta_{0,M}-\hat\beta_{a,M}\\ \hat\beta_{0,S}-\hat\beta_{a,S}\end{pmatrix}-\Bigl\|\frac{1}{\sqrt{n}}P_nK_n^{-1/2}\begin{pmatrix}X_M^T\\X_S^T\end{pmatrix}\{Y-\mu(X\beta_0)\}+\sqrt{n}\,K_n^{-1/2}\begin{pmatrix}C^T\\O_{r\times s}^T\end{pmatrix}\Psi^{-1}h_n\Bigr\|_2^2\Bigr|=o_p(r).
\]
As a result, we have
\[
\Bigl\|\frac{1}{\sqrt{n}}P_nK_n^{-1/2}\begin{pmatrix}X_M^T\\X_S^T\end{pmatrix}\{Y-\mu(X\beta_0)\}+\sqrt{n}\,K_n^{-1/2}\begin{pmatrix}C^T\\O_{r\times s}^T\end{pmatrix}\Psi^{-1}h_n\Bigr\|_2^2-2n\{L_n(\hat\beta_a)-L_n(\hat\beta_0)\}=o_p(r).
\]
By (5.16), this shows
\[
\Bigl|T_L-\frac{\phi_0}{\hat\phi}T_0\Bigr|=o_p(r).
\]
Under the condition |φ̂ − φ_0| = o_p(1), we can show |T_0(1 − φ_0/φ̂)| = o_p(r). As a result, we have TL = T0 + o_p(r).

Step 4: We first show the χ² approximation (3.5) holds for T = T_0. Recall that
\[
T_0=\frac{1}{\phi_0}\|\Psi^{-1/2}\omega_n+\sqrt{n}\Psi^{-1/2}h_n\|_2^2.
\]
By the definition of ω_n, we have
\[
\frac{1}{\sqrt{\phi_0}}\Psi^{-1/2}\omega_n=\frac{1}{\sqrt{n\phi_0}}\Psi^{-1/2}\begin{pmatrix}C^T\\O_{r\times s}^T\end{pmatrix}^TK_n^{-1}\begin{pmatrix}X_M^T\\X_S^T\end{pmatrix}\{Y-\mu(X\beta_0)\}=\sum_{i=1}^n\frac{1}{\sqrt{n\phi_0}}\Psi^{-1/2}\begin{pmatrix}C^T\\O_{r\times s}^T\end{pmatrix}^TK_n^{-1}\{Y_i-\mu(\beta_0^TX_i)\}\begin{pmatrix}X_{i,M}\\X_{i,S}\end{pmatrix}=\sum_{i=1}^n\xi_i.
\]
With some calculation, we can show that
\[
\sum_{i=1}^n\operatorname{cov}(\xi_i)=I_r.\tag{5.27}
\]

It follows from Condition (A3) that
\[
\max_{i=1,\dots,n}E\Bigl(\frac{|Y_i-\mu(\beta_0^TX_i)|^3}{6M^3}M^2\Bigr)\le\max_{i=1,\dots,n}E\Bigl\{\exp\Bigl(\frac{|Y_i-\mu(\beta_0^TX_i)|}{M}\Bigr)-1-\frac{|Y_i-\mu(\beta_0^TX_i)|}{M}\Bigr\}M^2\le v_0^2.
\]
This implies max_{i=1,...,n} E|Y_i − μ(β_0^TX_i)|³ = O(1).

Hence, with some calculations, we have
\[
\begin{aligned}
r^{1/4}\sum_{i=1}^nE\|\xi_i\|_2^3&=\frac{r^{1/4}}{(n\phi_0)^{3/2}}\sum_{i=1}^nE\Bigl\|\Psi^{-1/2}\begin{pmatrix}C^T\\O_{r\times s}^T\end{pmatrix}^TK_n^{-1}X_{i,M\cup S}\{Y_i-\mu(\beta_0^TX_i)\}\Bigr\|_2^3\\
&=\frac{r^{1/4}}{(n\phi_0)^{3/2}}\sum_{i=1}^n\Bigl\|\Psi^{-1/2}\begin{pmatrix}C^T\\O_{r\times s}^T\end{pmatrix}^TK_n^{-1}X_{i,M\cup S}\Bigr\|_2^3E|Y_i-\mu(\beta_0^TX_i)|^3\\
&\le O(1)\frac{r^{1/4}}{(n\phi_0)^{3/2}}\sum_{i=1}^n\Bigl\|\Psi^{-1/2}\begin{pmatrix}C^T\\O_{r\times s}^T\end{pmatrix}^TK_n^{-1}X_{i,M\cup S}\Bigr\|_2^3\\
&\le O(1)\frac{r^{1/4}}{(n\phi_0)^{3/2}}\sum_{i=1}^n\Bigl\|\Psi^{-1/2}\begin{pmatrix}C^T\\O_{r\times s}^T\end{pmatrix}^TK_n^{-1/2}\Bigr\|_2^3\|K_n^{-1/2}X_{i,M\cup S}\|_2^3\\
&\le O(1)\frac{r^{1/4}}{(n\phi_0)^{3/2}}\sum_{i=1}^n\{X_{i,M\cup S}^TK_n^{-1}X_{i,M\cup S}\}^{3/2}=o(1),
\end{aligned}
\]
where O(1) denotes some positive constant, the first inequality follows from the Cauchy-Schwarz inequality, the last inequality follows from the fact that
\[
\Bigl\|\Psi^{-1/2}\begin{pmatrix}C^T\\O_{r\times s}^T\end{pmatrix}^TK_n^{-1/2}\Bigr\|_2^2=\lambda_{\max}\Bigl\{\Psi^{-1/2}\begin{pmatrix}C^T\\O_{r\times s}^T\end{pmatrix}^TK_n^{-1}\begin{pmatrix}C^T\\O_{r\times s}^T\end{pmatrix}\Psi^{-1/2}\Bigr\}=1,
\]
and the last equality is due to Condition (3.4).

This together with (5.27) and an application of Lemma S.3 in the supplementary material gives that
\[
\sup_{\mathcal{C}}\Bigl|\Pr\Bigl(\frac{1}{\sqrt{\phi_0}}\Psi^{-1/2}\omega_n\in\mathcal{C}\Bigr)-\Pr(Z\in\mathcal{C})\Bigr|\to 0,\tag{5.28}
\]
where Z ∈ ℝ^r stands for a mean zero Gaussian random vector with identity covariance matrix, and the supremum is taken over all convex sets 𝒞 ⊆ ℝ^r.

Consider the following class of sets:
\[
\mathcal{C}_x=\Bigl\{z\in\mathbb{R}^r:\Bigl\|z+\sqrt{\frac{n}{\phi_0}}\Psi^{-1/2}h_n\Bigr\|_2^2\le x\Bigr\},
\]
indexed by x ∈ ℝ. It follows from (5.28) that
\[
\sup_{x}\Bigl|\Pr\Bigl(\frac{1}{\sqrt{\phi_0}}\Psi^{-1/2}\omega_n\in\mathcal{C}_x\Bigr)-\Pr(Z\in\mathcal{C}_x)\Bigr|\to 0.
\]
Note that φ_0^{−1/2}Ψ^{−1/2}ω_n ∈ 𝒞_x is equivalent to T_0 ≤ x, and Pr(Z ∈ 𝒞_x) = Pr(χ²(r,γ_n) ≤ x), where γ_n = n h_n^TΨ^{−1}h_n/φ_0. This implies
\[
\sup_x|\Pr(T_0\le x)-\Pr(\chi^2(r,\gamma_n)\le x)|\to 0.\tag{5.29}
\]

Consider any statistic T* = T_0 + o_p(r). For any x and ε > 0, it follows from (5.29) that
\[
\Pr(\chi^2(r,\gamma_n)\le x-r\varepsilon)-o(1)\le\Pr(T_0\le x-r\varepsilon)\le\Pr(T^*\le x)+o(1)\le\Pr(T_0\le x+r\varepsilon)+o(1)\le\Pr(\chi^2(r,\gamma_n)\le x+r\varepsilon)+o(1).\tag{5.30}
\]
Besides, by Lemma S.4, we have
\[
\lim_{\varepsilon\to 0}\limsup_{n}|\Pr(\chi^2(r,\gamma_n)\le x+r\varepsilon)-\Pr(\chi^2(r,\gamma_n)\le x-r\varepsilon)|=0.\tag{5.31}
\]
Combining (5.30) with (5.31), we obtain that
\[
\sup_x|\Pr(T^*\le x)-\Pr(\chi^2(r,\gamma_n)\le x)|\to 0.\tag{5.32}
\]
In the first three steps, we have shown T_0 = T_S + o_p(r) = T_W + o_p(r) = T_L + o_p(r). This together with (5.32) implies that the χ² approximation holds for our partial penalized Wald, score and likelihood ratio statistics. The proof is hence completed.
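The limiting χ²(r, γ_n) law can be checked by direct simulation, since χ²(r, γ_n) is the squared norm of an N(δ, I_r) vector with ‖δ‖² = γ_n. A Monte Carlo sketch (Python; r, the γ_n values and the simulation size are illustrative choices of ours):

```python
import numpy as np

rng = np.random.default_rng(3)
r, n_sim = 2, 200_000
crit = 5.991  # approximate 95% quantile of a central chi^2 with r = 2 d.f.

def power(gamma):
    # chi^2(r, gamma) is the squared norm of N(delta, I_r) with ||delta||_2^2 = gamma
    delta = np.zeros(r)
    delta[0] = np.sqrt(gamma)
    draws = ((rng.standard_normal((n_sim, r)) + delta) ** 2).sum(axis=1)
    return (draws > crit).mean()

assert abs(power(0.0) - 0.05) < 0.01          # size under the null (gamma_n = 0)
assert power(0.0) < power(5.0) < power(15.0)  # power increases with gamma_n
```

The rejection probability increasing in γ_n is consistent with the monotonicity results of Ghosh (1973) cited in the references.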

5.1. Proof of Lemma 5.1.

Assertion (5.1) is directly implied by Condition (A1). This means the square root of the maximum eigenvalue of K_n is O(1); by definition, this proves (5.2). Under Condition (A1), we have λ_max(K_n^{−1}) = O(1). Using the same arguments, we have λ_max(K_n^{−1/2}) = O(1). Hence, (5.3) is proven. We now show (5.4) holds. It follows from the condition λ_max((CC^T)^{−1}) = O(1) in Condition (A4) that liminf_n λ_min(CC^T) > 0, and hence

\[
a_0\equiv\liminf_n\inf_{a\in\mathbb{R}^r:\|a\|_2=1}\|C^Ta\|_2^2=\liminf_n\inf_{a\in\mathbb{R}^r:\|a\|_2=1}a^TCC^Ta>0.
\]

This implies that for sufficiently large n, we have

\[
\|C^Ta\|_2\ge\sqrt{a_0/2}\,\|a\|_2,\quad\forall a\neq 0.\tag{5.33}
\]

By (5.1), we have liminf_n λ_min(Ω) > 0, where Ω = K_n^{−1}; or equivalently,
\[
\inf_{a\in\mathbb{R}^{m+s}:\|a\|_2=1}\liminf_n a^T\Omega a>0.
\]
Hence, we have
\[
\inf_{a\in\mathbb{R}^{m+s}:\|a\|_2=1,\,a_{J_0^c}=0}\liminf_n a^T\Omega a>0,
\]
where J_0 = [1,...,m]. Note that this implies
\[
\inf_{a\in\mathbb{R}^{m}:\|a\|_2=1}\liminf_n a^T\Omega_{mm}a>0.
\]
Therefore, we obtain
\[
\liminf_n\lambda_{\min}(\Omega_{mm})>0.\tag{5.34}
\]

Combining this together with (5.33) yields
\[
\inf_{a\in\mathbb{R}^r:\|a\|_2=1}\liminf_n a^TC\Omega_{mm}C^Ta\ge\inf_{a\in\mathbb{R}^m:\|a\|_2\ge\sqrt{a_0/2}}\liminf_n a^T\Omega_{mm}a>0.
\]
By definition, this suggests
\[
\liminf_n\lambda_{\min}(C\Omega_{mm}C^T)>0,
\]
or equivalently,
\[
\lambda_{\max}\{(C\Omega_{mm}C^T)^{-1}\}=O(1).
\]
This gives (5.4).

Using the Cauchy-Schwarz inequality, we have
\[
\|(C\Omega_{mm}C^T)^{-1/2}C\|_2\le\underbrace{\|(C\Omega_{mm}C^T)^{-1/2}C\Omega_{mm}^{1/2}\|_2}_{I_1}\,\underbrace{\|\Omega_{mm}^{-1/2}\|_2}_{I_2}.
\]
Observe that
\[
I_1^2=\lambda_{\max}\{(C\Omega_{mm}C^T)^{-1/2}C\Omega_{mm}C^T(C\Omega_{mm}C^T)^{-1/2}\}=1.\tag{5.35}
\]
Besides, by (5.34), we have
\[
I_2^2=\lambda_{\max}(\Omega_{mm}^{-1})=O(1),
\]
which together with (5.35) implies that I_1I_2 = O(1). This proves (5.5).

We now show (5.6) holds. Assume for now that
\[
\|K_n-\hat K_{n,a}\|_2=O_p\Bigl(\frac{s+m}{\sqrt{n}}\Bigr),\tag{5.36}
\]
where
\[
\hat K_{n,a}=\frac{1}{n}\begin{pmatrix}X_M^T\Sigma(X\hat\beta_a)X_M & X_M^T\Sigma(X\hat\beta_a)X_S\\ X_S^T\Sigma(X\hat\beta_a)X_M & X_S^T\Sigma(X\hat\beta_a)X_S\end{pmatrix}.
\]

Note that
\[
\liminf_n\lambda_{\min}(\hat K_{n,a})\ge\liminf_n\inf_{a\in\mathbb{R}^{m+s}:\|a\|_2=1}a^TK_na-\limsup_n\sup_{a\in\mathbb{R}^{m+s}:\|a\|_2=1}|a^T(\hat K_{n,a}-K_n)a|\ge\liminf_n\lambda_{\min}(K_n)-\limsup_n\|K_n-\hat K_{n,a}\|_2.
\]
Under Condition (A1), we have liminf_n λ_min(K_n) > 0. Under the condition max(s,m) = o(n^{1/2}), this together with (5.36) implies
\[
\liminf_n\lambda_{\min}(\hat K_{n,a})>0,\tag{5.37}
\]
with probability tending to 1. Hence, we have
\[
\|K_n^{-1}-\hat K_{n,a}^{-1}\|_2=\|K_n^{-1}(K_n-\hat K_{n,a})\hat K_{n,a}^{-1}\|_2\le\lambda_{\max}(K_n^{-1})\|K_n-\hat K_{n,a}\|_2\,\lambda_{\max}(\hat K_{n,a}^{-1})=O_p\Bigl(\frac{s+m}{\sqrt{n}}\Bigr).\tag{5.38}
\]

By Lemma S.2, this gives
\[
\sup_{a\in\mathbb{R}^{m+s}:\|a\|_2=1}|a^T(K_n^{-1}-\hat K_{n,a}^{-1})a|=O_p\Bigl(\frac{s+m}{\sqrt{n}}\Bigr),
\]
and hence,
\[
\sup_{a\in\mathbb{R}^{m+s}:\|a\|_2=1,\,a_{J_0^c}=0}|a^T(K_n^{-1}-\hat K_{n,a}^{-1})a|=O_p\Bigl(\frac{s+m}{\sqrt{n}}\Bigr),
\]
where J_0 = [1,...,m]. Using Lemma S.2 again, we obtain
\[
\|(K_n^{-1})_{J_0,J_0}-(\hat K_{n,a}^{-1})_{J_0,J_0}\|_2=O_p\Bigl(\frac{s+m}{\sqrt{n}}\Bigr).\tag{5.39}
\]
By definition, we have Ω_{mm} = (K_n^{−1})_{J_0,J_0}. According to Theorem 2.1, we have that with probability tending to 1, Ŝ_a = S, where Ŝ_a = {j ∈ M^c : β̂_{a,j} ≠ 0}. When Ŝ_a = S, we have K̂_{n,a}^{−1} = Ω̂_a and (K̂_{n,a}^{−1})_{J_0,J_0} = Ω̂_{a,mm}. Therefore, by (5.39), we have
\[
\|\Omega_{mm}-\hat\Omega_{a,mm}\|_2=O_p\Bigl(\frac{s+m}{\sqrt{n}}\Bigr).
\]

Using the Cauchy-Schwarz inequality, we obtain
\[
\|\Omega_{mm}^{-1/2}(\Omega_{mm}-\hat\Omega_{a,mm})\Omega_{mm}^{-1/2}\|_2\le\|\Omega_{mm}^{-1/2}\|_2^2\,\|\Omega_{mm}-\hat\Omega_{a,mm}\|_2=O_p\Bigl(\frac{s+m}{\sqrt{n}}\Bigr),\tag{5.40}
\]
by (5.34). Recalling that Ψ = CΩ_{mm}C^T, we obtain
\[
\|\Psi^{-1/2}C(\Omega_{mm}-\hat\Omega_{a,mm})C^T\Psi^{-1/2}\|_2\le\|\Psi^{-1/2}C\Omega_{mm}^{1/2}\|_2^2\,\|\Omega_{mm}^{-1/2}(\Omega_{mm}-\hat\Omega_{a,mm})\Omega_{mm}^{-1/2}\|_2=O_p\Bigl(\frac{s+m}{\sqrt{n}}\Bigr),\tag{5.41}
\]
by (5.40) and the fact that
\[
\|\Psi^{-1/2}C\Omega_{mm}^{1/2}\|_2^2=\lambda_{\max}(\Psi^{-1/2}\Psi\Psi^{-1/2})=O(1).
\]
Similar to (5.37), by (5.41), we can show that
\[
\liminf_n\lambda_{\min}(\Psi^{-1/2}C\hat\Omega_{a,mm}C^T\Psi^{-1/2})>0,\tag{5.42}
\]
with probability tending to 1. Combining (5.41) together with (5.42), we obtain
\[
\|(\Psi^{-1/2}C\hat\Omega_{a,mm}C^T\Psi^{-1/2})^{-1}-I\|_2\le\|(\Psi^{-1/2}C\hat\Omega_{a,mm}C^T\Psi^{-1/2})^{-1}\|_2\,\|\Psi^{-1/2}C(\Omega_{mm}-\hat\Omega_{a,mm})C^T\Psi^{-1/2}\|_2=O_p\Bigl(\frac{s+m}{\sqrt{n}}\Bigr).
\]
This proves (5.6), since (Ψ^{−1/2}CΩ̂_{a,mm}C^TΨ^{−1/2})^{−1} − I = Ψ^{1/2}(CΩ̂_{a,mm}C^T)^{−1}Ψ^{1/2} − I.

Similar to (5.38), we can show
\[
\|K_n^{-1}-\hat K_{n,0}^{-1}\|_2=O_p\Bigl(\frac{s+m}{\sqrt{n}}\Bigr).
\]
By (5.2), we obtain
\[
\|I-K_n^{1/2}\hat K_{n,0}^{-1}K_n^{1/2}\|_2\le\|K_n^{1/2}\|_2\,\|K_n^{-1}-\hat K_{n,0}^{-1}\|_2\,\|K_n^{1/2}\|_2=O_p\Bigl(\frac{s+m}{\sqrt{n}}\Bigr).
\]
This proves (5.7).

It remains to show (5.36). Since K_n and K̂_{n,a} are symmetric, by Lemma S.5, it suffices to show
\[
\|K_n-\hat K_{n,a}\|_\infty=O_p\Bigl(\frac{s+m}{\sqrt{n}}\Bigr).
\]
By definition, this requires us to show
\[
\max_{j\in S\cup M}\|(X_j)^T\{\Sigma(X\hat\beta_a)-\Sigma(X\beta_0)\}X_{M\cup S}\|_1=O_p(\sqrt{n}(s+m)).
\]
For any vector a ∈ ℝ^q, we have ‖a‖_1 ≤ √q‖a‖_2. Hence, it suffices to show
\[
\max_{j\in S\cup M}\|(X_j)^T\{\Sigma(X\hat\beta_a)-\Sigma(X\beta_0)\}X_{M\cup S}\|_2=O_p(\sqrt{n(s+m)}).\tag{5.43}
\]

Using Taylor's theorem, we have
\[
(X_j)^T\{\Sigma(X\hat\beta_a)-\Sigma(X\beta_0)\}X_{M\cup S}=\int_0^1(\hat\beta_a-\beta_0)^TX^T\operatorname{diag}\bigl\{X_j\circ b'''(X\{t\hat\beta_a+(1-t)\beta_0\})\bigr\}X_{M\cup S}\,dt.\tag{5.44}
\]
By Theorem 2.1, we have Pr(β̂_a ∈ 𝒩_0) → 1. Hence, we have
\[
\Pr\bigl(\forall t\in[0,1]:\ t\hat\beta_a+(1-t)\beta_0\in\mathcal{N}_0\bigr)\to 1.
\]
By Condition (A1),
\[
\sup_{t\in[0,1]}\lambda_{\max}\bigl\{X_{M\cup S}^T\operatorname{diag}\bigl(|X_j|\circ|b'''(X\{t\hat\beta_a+(1-t)\beta_0\})|\bigr)X_{M\cup S}\bigr\}=O_p(n).
\]
By the Cauchy-Schwarz inequality, we have
\[
\begin{aligned}
\|(X_j)^T\{\Sigma(X\hat\beta_a)-\Sigma(X\beta_0)\}X_{M\cup S}\|_2&\le\sup_{t\in[0,1]}\bigl\|(\hat\beta_a-\beta_0)^TX^T\operatorname{diag}\bigl\{X_j\circ b'''(X\{t\hat\beta_a+(1-t)\beta_0\})\bigr\}X_{M\cup S}\bigr\|_2\\
&\le\|\hat\beta_a-\beta_0\|_2\sup_{t\in[0,1]}\lambda_{\max}\bigl\{X_{M\cup S}^T\operatorname{diag}\bigl(|X_j|\circ|b'''(X\{t\hat\beta_a+(1-t)\beta_0\})|\bigr)X_{M\cup S}\bigr\}\\
&=O_p(\sqrt{n(s+m)}).
\end{aligned}
\]
This proves (5.43).


Acknowledgements.

The authors wish to thank the Associate Editor and anonymous referees for their constructive comments, which led to significant improvement of this work.

Supported by NSF grant DMS 1555244, NCI grant P01 CA142538

Supported by NSF grant DMS 1512422, NIH grants P50 DA039838, P50 DA036107 and T32 LM012415, and NNSFC grants 11690014 and 11690015

Footnotes

SUPPLEMENTARY MATERIAL

Supplement to “Partial penalization for high dimensional testing with linear constraints”: (doi: COMPLETED BY THE TYPESETTER; .pdf). This supplemental material includes power comparisons with existing test statistics, additional numerical studies on Poisson regression and a real data application, discussions of Conditions (A1)-(A4), some technical lemmas and the proof of Theorem 2.1.

Contributor Information

Chengchun Shi, Email: cshi4@ncsu.edu, Department of Statistics, North Carolina State University, Raleigh, NC 27695-8203, USA.

Rui Song, Email: rsong@ncsu.edu, Department of Statistics, North Carolina State University, Raleigh, NC 27695-8203, USA.

Zhao Chen, Email: zuc4@psu.edu, Department of Statistics, and The Methodology Center, the Pennsylvania State University, University Park, PA 16802-2111, USA.

Runze Li, Email: rzli@psu.edu, Department of Statistics, and The Methodology Center, the Pennsylvania State University, University Park, PA 16802-2111, USA.

References.

  1. Bentkus V (2004). A Lyapunov type bound in Rd. Teor. Veroyatn. Primen 49 400–410.
  2. Boyd S, Parikh N, Chu E, Peleato B and Eckstein J (2011). Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning 3 1–122.
  3. Breheny P and Huang J (2011). Coordinate descent algorithms for nonconvex penalized regression, with applications to biological feature selection. Ann. Appl. Stat 5 232–253.
  4. Candes E and Tao T (2007). The Dantzig selector: statistical estimation when p is much larger than n. Ann. Statist 35 2313–2351.
  5. Dezeure R, Bühlmann P, Meier L and Meinshausen N (2015). High-dimensional inference: confidence intervals, p-values and R-software hdi. Statist. Sci 30 533–558.
  6. Fan J, Guo S and Hao N (2012). Variance estimation using refitted cross-validation in ultrahigh dimensional regression. J. R. Stat. Soc. Ser. B. Stat. Methodol 74 37–65.
  7. Fan J and Li R (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. J. Amer. Statist. Assoc 96 1348–1360.
  8. Fan J and Lv J (2010). A selective overview of variable selection in high dimensional feature space. Statist. Sinica 20 101–148.
  9. Fan J and Lv J (2011). Nonconcave penalized likelihood with NP-dimensionality. IEEE Trans. Inform. Theory 57 5467–5484.
  10. Fan J and Peng H (2004). Nonconcave penalized likelihood with a diverging number of parameters. Ann. Statist 32 928–961.
  11. Fan Y and Tang CY (2013). Tuning parameter selection in high dimensional penalized likelihood. J. R. Stat. Soc. Ser. B. Stat. Methodol 75 531–552.
  12. Fang EX, Ning Y and Liu H (2017). Testing and confidence intervals for high dimensional proportional hazards models. J. Roy. Statist. Soc. Ser. B 79 1415–1437.
  13. Ghosh BK (1973). Some monotonicity theorems for χ2, F and t distributions with applications. J. Roy. Statist. Soc. Ser. B 35 480–492.
  14. Lee JD, Sun DL, Sun Y and Taylor JE (2016). Exact post-selection inference, with application to the lasso. Ann. Statist 44 907–927.
  15. Lockhart R, Taylor J, Tibshirani RJ and Tibshirani R (2014). A significance test for the lasso. Ann. Statist 42 413–468.
  16. McCullagh P and Nelder JA (1989). Generalized Linear Models. Chapman and Hall.
  17. Ning Y and Liu H (2017). A general theory of hypothesis tests and confidence regions for sparse high dimensional models. Ann. Statist 45 158–195.
  18. Schwarz G (1978). Estimating the dimension of a model. Ann. Statist 6 461–464.
  19. Shi C, Song R, Chen Z and Li R (2018). Supplement to “Partial penalization for high dimensional testing with linear constraints”.
  20. Sun T and Zhang C-H (2013). Sparse matrix inversion with scaled lasso. J. Mach. Learn. Res 14 3385–3418.
  21. Taylor J, Lockhart R, Tibshirani RJ and Tibshirani R (2014). Post-selection adaptive inference for least angle regression and the lasso. arXiv preprint.
  22. Tibshirani R (1996). Regression shrinkage and selection via the lasso. J. Roy. Statist. Soc. Ser. B 58 267–288.
  23. van de Geer S, Bühlmann P, Ritov Y and Dezeure R (2014). On asymptotically optimal confidence regions and tests for high-dimensional models. Ann. Statist 42 1166–1202.
  24. Wang S and Cui H (2013). Partial Penalized Likelihood Ratio Test under Sparse Case. arXiv preprint arXiv:1312.3723.
  25. Zhang C-H (2010). Nearly unbiased variable selection under minimax concave penalty. Ann. Statist 38 894–942.
  26. Zhang X and Cheng G (2017). Simultaneous inference for high-dimensional linear models. J. Amer. Statist. Assoc 112 757–768.
