Abstract
This paper is concerned with testing linear hypotheses in high-dimensional generalized linear models. To deal with linear hypotheses, we first propose a constrained partial regularization method and study its statistical properties. We further introduce an algorithm for solving regularization problems with folded-concave penalty functions and linear constraints. To test linear hypotheses, we propose a partial penalized likelihood ratio test, a partial penalized score test and a partial penalized Wald test. We show that the limiting null distributions of these three test statistics are χ2 distributions with the same degrees of freedom, and that under local alternatives, they asymptotically follow non-central χ2 distributions with the same degrees of freedom and noncentrality parameter, provided the number of parameters involved in the test hypothesis grows to ∞ at a certain rate. Simulation studies are conducted to examine the finite sample performance of the proposed tests. Empirical analysis of a real data example is used to illustrate the proposed testing procedures.
Keywords: High-dimensional testing, Linear hypothesis, Likelihood ratio statistics, Score test, Wald test
1. Introduction.
During the last three decades, much work has been devoted to developing variable selection techniques for high dimensional regression models. Fan and Lv (2010) present a selective overview of this topic. There have been some recent works on hypothesis testing for the Lasso (Tibshirani, 1996) in high-dimensional linear models. Lockhart et al. (2014) proposed the covariance test, which produces a sequence of p-values as the tuning parameter λn decreases and features become nonzero in the Lasso. This approach does not give confidence intervals or p-values for an individual variable's coefficient. Taylor et al. (2014) and Lee et al. (2016) extended the covariance testing framework to test hypotheses about individual features, after conditioning on a model selected by the Lasso. However, their framework permits inference only about features which have nonzero coefficients in a Lasso regression; this set of features likely varies across samples, making the interpretation difficult. Moreover, these works focused on high dimensional linear regression models, and it remains unknown whether their results can be extended to a more general setting.
This paper focuses on generalized linear models (GLM, McCullagh and Nelder, 1989). Let Y be the response, and X be its associated fixed-design covariate vector. The GLM assumes that the distribution of Y belongs to the exponential family. With the canonical link, the exponential family has the following probability density function
(1.1) f(y | X; β0, ϕ0) = c(y, ϕ0) exp{(yXTβ0 − b(XTβ0))/ϕ0},
where β0 is a p-dimensional vector of regression coefficients, and ϕ0 is some positive nuisance parameter. In this paper, we assume that b(·) is thrice continuously differentiable with b′′(·) > 0.
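For concreteness, three canonical members of this family appear throughout the paper, with the standard cumulant functions

b(θ) = θ2/2 (Gaussian linear model), b(θ) = log(1 + eθ) (logistic regression), b(θ) = eθ (Poisson regression),

and in each case b′′(θ) > 0, as required.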
We study testing the linear hypothesis H0: Cβ0,M = t in GLM, where β0,M is a subvector of β0, the true regression coefficient vector. The number of covariates p can be much larger than the sample size n, while the number of parameters in β0,M is assumed to be much smaller than n. Such hypotheses are of particular interest when the goal is to explore the group structure of β0. Moreover, they include the very important class of hypotheses H0: β0,M = 0, obtained by setting C to be the identity matrix and t = 0. In the literature, Fan and Peng (2004) proposed a penalized likelihood ratio test for H0a : Cβ0,S = 0 in GLM, where β0,S is the vector consisting of all nonzero elements of β0, when p = o(n1/5), where n stands for the sample size. Wang and Cui (2013) extended Fan and Peng (2004)'s proposal and considered a penalized likelihood ratio statistic for testing H0: β0,M = 0, again requiring p = o(n1/5). Ning and Liu (2017) proposed a decorrelated score test for H0: β0,M = 0 under the setting of high dimensional penalized M-estimators with nonconvex penalties. Recently, Fang, Ning and Liu (2017) extended the proposal of Ning and Liu (2017) and developed a class of decorrelated Wald, score and partial likelihood ratio tests for Cox's model with high dimensional survival data. Zhang and Cheng (2017) proposed a maximal-type statistic based on the desparsified Lasso estimator (van de Geer et al., 2014) and a bootstrap-assisted testing procedure for H0: β0,M = 0, allowing M to be an arbitrary subset of [1,...,p]. In this paper, we aim to develop the theory of the Wald test, score test and likelihood ratio test for H0: Cβ0,M = t in GLM under the ultrahigh dimensional setting (i.e., p grows exponentially with n).
It is well known that the Wald, score and likelihood ratio tests are equivalent in the fixed p case. However, it can be challenging to generalize these statistics to the setting with ultrahigh dimensionality. To better understand this point, we take the Wald statistic for illustration. Consider the null hypothesis H0: β0,M = 0. Analogous to the classical Wald statistic, in the high dimensional setting one might consider a quadratic-form statistic in β̂M, for some penalized regression estimator β̂ and an estimator of its variance. The choice of the estimator is essential here: penalized regression estimators such as the Lasso or the Dantzig estimator (Candes and Tao, 2007) cannot be used due to their large biases when p ≫ n. The non-concave penalized estimator does not have this bias issue, but the minimal signal conditions imposed in Fan and Peng (2004) and Fan and Lv (2011) imply that the associated Wald statistic does not have any power for local alternatives of the type β0,M = hn for some sequence hn such that ∥hn∥2 ≪ λn, where ∥ · ∥2 is the Euclidean norm. Moreover, to implement the score and the likelihood ratio statistics, we need to estimate the regression parameter under the null, which involves penalized likelihood estimation under linear constraints. This is a very challenging task and has rarely been studied: (a) the associated estimation and variable selection theory is not standard from a theoretical perspective, and (b) there is a lack of constrained optimization algorithms that can produce sparse estimators from a computational perspective.
We briefly summarize our contributions as follows. First, we consider the more general form of hypothesis H0: Cβ0,M = t, whereas the existing literature mainly focuses on testing H0: β0,M = 0. Besides, we also allow the number of linear constraints to diverge with n. Our tests are therefore applicable to a wider range of real applications involving a growing set of linear hypotheses. Second, we propose a partial penalized Wald, a partial penalized score and a partial penalized likelihood-ratio statistic based on the class of folded-concave penalty functions, and show their equivalence in the high dimensional setting. We derive the asymptotic distributions of our test statistics under the null hypothesis and the local alternatives. Third, we systematically study the partial penalized estimator with linear constraints. We derive its rate of convergence and limiting distribution. These results are significant in their own right. The unconstrained and constrained estimators share similar forms, but the constrained estimator is more efficient, due to the additional information contained in the constraints under the null hypothesis. Fourth, we introduce an algorithm for solving regularization problems with folded-concave penalty functions and equality constraints, based on the alternating direction method of multipliers (ADMM, cf. Boyd et al., 2011).
The rest of the paper is organized as follows. We study the statistical properties of the constrained partial penalized estimator with folded concave penalty functions in Section 2. We formally define our partial penalized Wald, score and likelihood-ratio statistics, establish their limiting distributions, and show their equivalence in Section 3. Detailed implementations of our testing procedures are given in Section 3.3, where we introduce our algorithm for solving the constrained partial penalized regression problems. Simulation studies are presented in Section 4. The proof of Theorem 3.1 is presented in Section 5. Other proofs and additional numerical results are presented in the supplementary material (Shi et al., 2018).
2. Constrained partial penalized regression.
2.1. Model setup.
Suppose that {Xi,Yi}, i = 1,··· ,n, is a sample from model (1.1). Denote by Y = (Y1,...,Yn)T the n-dimensional response vector and by X = (X1,···,Xn)T the n×p design matrix. We assume the covariates Xi are fixed by design. Let Xj denote the jth column of X. To simplify the presentation, for any r×q matrix Φ and any set J ⊆ [1,2,...,q], we denote by ΦJ the submatrix of Φ formed by the columns in J. Similarly, for any q-dimensional vector ϕ, ϕJ stands for the subvector of ϕ formed by the elements in J. We further denote by ΦJ1,J2 the submatrix of Φ formed by the rows in J1 and the columns in J2, for any J1 ⊆ [1,...,r] and J2 ⊆ [1,...,q]. Let |J| be the number of elements in J, and define Jc = [1,...,q] − J to be the complement of J.
In this paper, we assume log p = O(na) for some 0 < a < 1 and focus on the following testing problem:
(2.1) H0: Cβ0,M = t versus Ha: Cβ0,M ≠ t,
for a given index set M ⊆ [1,...,p], a given r×m matrix C and an r-dimensional vector t, where m = |M|. We assume that the matrix C is of full row rank. This implies that there are no redundant or contradictory constraints in (2.1), and hence r ≤ m.
Define the partial penalized likelihood function

Qn(β) = Ln(β) − ∑j∈Mc pλ(|βj|), where Ln(β) = ∑i{YiβT Xi − b(βT Xi)}/n,

for some penalty function pλ(·) with a tuning parameter λ. Further define
(2.2) β̂0 = argmax{β: CβM = t} Qn(β),
(2.3) β̂a = argmaxβ Qn(β).
Note that in (2.2) and (2.3), we do not add penalties to the parameters involved in the constraints. This enables us to avoid imposing a minimal signal condition on the elements of β0,M. Thus, the corresponding likelihood ratio test, Wald test and score test have power at local alternatives.
We present a lemma characterizing the constrained local maximizer in the supplementary material (see Lemma S.1). In Section 3, we show that these partial penalized estimators help us obtain valid statistical inference about the null hypothesis.
2.2. Partial penalized regression with linear constraint.
In this section, we study the statistical properties of β̂0 and β̂a by restricting pλ to the class of folded concave penalty functions. Popular penalty functions such as SCAD (Fan and Li, 2001) and MCP (Zhang, 2010) belong to this class. Let ρ(t0,λ) = pλ(t0)/λ for λ > 0. We assume that ρ(t0,λ) is increasing and concave in t0 ∈ [0,∞), and has a continuous derivative ρ′(t0,λ) with ρ′(0+,λ) > 0. In addition, assume that ρ′(t0,λ) is increasing in λ ∈ (0,∞) and that ρ′(0+,λ) is independent of λ. For any vector v = (v1,...,vq)T, define

ρ̄′(v,λ) = (ρ′(|v1|,λ)sgn(v1),...,ρ′(|vq|,λ)sgn(vq))T,
where sgn(·) denotes the sign function. We further define the local concavity of the penalty function ρ at v with ∥v∥0 = q as

κ(ρ,v,λ) = limε→0+ max1≤j≤q sup{t1<t2∈(|vj|−ε,|vj|+ε)} −{ρ′(t2,λ) − ρ′(t1,λ)}/(t2 − t1).
We assume that the true regression coefficient β0 is sparse and satisfies Cβ0,M − t = hn for some sequence of vectors hn → 0. When hn = 0, the null holds; otherwise, the alternative holds. Let S denote the support of β0 outside M, that is, S = {j ∈ Mc : β0,j ≠ 0}, and let s = |S|. Let dn be the half minimum signal of β0,S, i.e., dn = minj∈S |β0,j|/2. We impose the following conditions.
(A1) Assume that
for some constant c > 0, where, for any vector v = (v1,...,vq)T, diag(v) denotes a diagonal matrix with the jth diagonal element being vj, |v| = (|v1|,...,|vq|)T, and ∥B∥2,∞ = supv:∥v∥2=1 ∥Bv∥∞ for any matrix B with q rows.
(A2) Assume that where , for j = 0,a.
(A3) Assume that there exist some constants M and v0 such that
(A4) Assume that , and λmax ((CCT)−1) = O(1).
In Section S4.1 of the supplementary material, we show that Condition (A1) holds with probability tending to 1 if the covariate vectors X1,...,Xn are uniformly bounded or are realizations from a sub-Gaussian distribution. The first condition in (A2) is a minimum signal assumption imposed on the nonzero elements outside M only. This is due to the partial penalization, which enables us to evaluate the uncertainty of the estimation for small signals in β0,M. Such conditions are not assumed in van de Geer et al. (2014) and Ning and Liu (2017) for testing H0: β0,M = 0. However, we note that these authors impose some additional assumptions on the design matrix. For example, the validity of the decorrelated score statistic depends on the sparsity of w*. For testing univariate parameters, this requires the degree of a particular node in the graph to be relatively small when the covariate follows a Gaussian graphical model (see Remark 6 in Ning and Liu, 2017). In Section S4.3 of the supplementary material, we show that Condition (A3) holds for linear, logistic and Poisson regression models.
Theorem 2.1. Suppose that Conditions (A1)-(A4) hold, and , then the following holds: (i) With probability tending to 1, and defined in (2.2) and (2.3) must satisfy . (ii) and . If further s + m =o(n1/3), then we have
where I is the identity matrix, Kn is the (m + s) × (m + s) matrix
and Pn is the (m + s) × (m + s) projection matrix
where Or×s is an r × s zero matrix.
Remark 2.1. Since dn ≫ √((s + m)/n), Theorem 2.1(ii) implies that each element of β̂0,S and β̂a,S is nonzero. This together with result (i) shows the sign consistency of β̂0 and β̂a.
Remark 2.2. Theorem 2.1 implies that the constrained estimator converges at a rate of √((s + m − r)/n). In contrast, the unconstrained estimator β̂a defined in (2.3) converges at a rate of √((s + m)/n). This suggests that when hn is relatively small, the constrained estimator converges faster than the unconstrained one whenever s + m − r ≪ s + m. This result is expected, with the following intuition: the more information about β0 we have, the more accurate the estimator will be.
Remark 2.3. Under certain regularity conditions, Theorem 2.1 implies that
where ξ0 and V0 are the limits of the corresponding asymptotic bias and variance terms, respectively. Similarly, we can show
where Va denotes the corresponding limiting covariance matrix. Note that aT V0a ≤ aT Vaa for any a ∈ ℝs+m. Under the null, we have ξ0 = 0, which suggests that β̂0 is more efficient than β̂a in terms of a smaller asymptotic variance. Under the alternative, β̂0 is asymptotically biased. This can be interpreted as a bias-variance trade-off between β̂0 and β̂a.
3. Partial penalized Wald, score and likelihood ratio statistics.
3.1. Test statistics.
We begin by introducing our partial penalized likelihood ratio statistic,
(3.1) TL = 2n{Ln(β̂a) − Ln(β̂0)}/φ̂,
where Ln(β) = ∑i{YiβT Xi − b(βT Xi)}/n, β̂0 and β̂a are defined in (2.2) and (2.3) respectively, and φ̂ is some consistent estimator of ϕ0 in (1.1). For Gaussian linear models, ϕ0 corresponds to the error variance. For logistic or Poisson regression models, ϕ0 = 1.
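The log-likelihood Ln and the statistic TL are straightforward to compute. The following Python sketch (our own illustration, using the reconstructed form of (3.1); the function names are hypothetical) makes the computation explicit:

```python
import numpy as np

# Cumulant functions b(theta) for the three canonical GLMs in this paper.
B_FUNS = {
    "gaussian": lambda t: 0.5 * t ** 2,
    "logistic": lambda t: np.logaddexp(0.0, t),  # log(1 + e^t), overflow-safe
    "poisson": np.exp,
}

def log_lik(beta, X, Y, b):
    """L_n(beta) = (1/n) * sum_i {Y_i * beta^T X_i - b(beta^T X_i)}."""
    eta = X @ beta
    return np.mean(Y * eta - b(eta))

def lr_statistic(beta0_hat, beta_a_hat, X, Y, b, phi_hat):
    """T_L = 2n {L_n(beta_a_hat) - L_n(beta0_hat)} / phi_hat, as in (3.1)."""
    n = X.shape[0]
    return 2.0 * n * (log_lik(beta_a_hat, X, Y, b) - log_lik(beta0_hat, X, Y, b)) / phi_hat
```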
The partial penalized Wald statistic is based on Cβ̂a,M − t. Define Ωn = Kn−1 and denote by Ωmm its submatrix formed by the first m rows and columns. It follows from Theorem 2.1 that its asymptotic variance is equal to CΩmmCT. Let Ŝa denote the support of β̂a outside M. Then, with probability tending to 1, we have Ŝa = S. Define Ω̂n as the plug-in estimate of Ωn based on (β̂a, Ŝa), and Ω̂mm as its submatrix formed by its first m rows and columns. The partial penalized Wald statistic is defined by
(3.2) TW = n(Cβ̂a,M − t)T(CΩ̂mmCT)−1(Cβ̂a,M − t)/φ̂.
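A corresponding sketch of the Wald statistic, under the reconstructed quadratic form of (3.2) with a plug-in estimate Ω̂mm (a minimal illustration; names are hypothetical):

```python
import numpy as np

def wald_statistic(beta_a_M, C, t_vec, Omega_mm_hat, phi_hat, n):
    """T_W = n (C beta_a_M - t)^T (C Omega_mm C^T)^{-1} (C beta_a_M - t) / phi_hat."""
    d = C @ beta_a_M - t_vec
    Psi_hat = C @ Omega_mm_hat @ C.T          # estimated asymptotic variance of C beta_a_M
    return n * d @ np.linalg.solve(Psi_hat, d) / phi_hat
```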
Analogous to the classical score statistic, we define our partial penalized score statistic as
(3.3) TS = n{∇Ln(β̂0)}T K̂0−1 {∇Ln(β̂0)}/φ̂,

where the gradient ∇Ln(β̂0) is restricted to the coordinates in M ∪ Ŝ0, Ŝ0 denotes the support of β̂0 outside M, and K̂0 is the plug-in estimate of Kn based on (β̂0, Ŝ0).
3.2. Limiting distributions of the test statistics.
For a given significance level α, we reject the null hypothesis when T > χ2α(r) for T = TL, TW or TS, where χ2α(r) is the upper α-quantile of a central χ2 distribution with r degrees of freedom and r is the number of constraints. Assume r is fixed. When φ̂ is consistent for ϕ0, it follows from Theorem 2.1 that TL, TW and TS converge asymptotically to a (non-central) χ2 distribution with r degrees of freedom. However, when r diverges with n, there is no such theoretical guarantee, because the concept of weak convergence is not well defined in such settings.
To resolve this issue, we observe that when the following holds,

supx |Pr(T ≤ x) − Pr{χ2(r,γn) ≤ x}| → 0,

where χ2(r,γn) is a chi-square random variable with r degrees of freedom and noncentrality parameter γn, which is allowed to vary with n, our testing procedure remains valid using the χ2 approximation.
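Operationally, the χ2 approximation means that p-values and rejection decisions can be computed from central χ2(r) quantiles, as in the following short sketch:

```python
from scipy.stats import chi2

def p_value_and_decision(T, r, alpha=0.05):
    """Calibrate any of T_L, T_W, T_S against the chi2(r) null approximation."""
    p_value = chi2.sf(T, df=r)                     # Pr{chi2(r) > T}
    return p_value, T > chi2.ppf(1 - alpha, r)     # reject when T exceeds the upper-alpha quantile
```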
Theorem 3.1. Assume Conditions (A1)-(A4) hold, s + m = o(n1/3), and . Further assume the following holds:
(3.4) |
Then, we have
(3.5) supx |Pr(T ≤ x) − Pr{χ2(r,γn) ≤ x}| → 0
for T = TW, TS or TL, where γn = n hnT(CΩmmCT)−1hn/ϕ0.
REMARK 3.1. By (3.5), it is immediate to see that

supx |Pr(T1 ≤ x) − Pr(T2 ≤ x)| → 0
for any T1,T2 ∈ {TW ,TS,TL}. This establishes the equivalence between the partial penalized Wald, score and likelihood-ratio statistics. Condition (3.4) is the key to guaranteeing the χ2 approximation in (3.5). When r = O(1), this condition is equivalent to
which corresponds to the Lyapunov condition that ensures the asymptotic normality of the underlying estimators. When r diverges, (3.4) guarantees that the following Lyapunov-type bound goes to 0,
where Z represents an r-dimensional multivariate normal vector with identity covariance matrix, and the supremum is taken over all convex subsets of ℝr. The scaling factor r1/4 accounts for the dependence of the above Lyapunov-type estimate on the dimension, and it remains unknown whether the factor r1/4 can be improved (see related discussions in Bentkus, 2004).
REMARK 3.2. Theorem 3.1 implies that our testing procedures are consistent. When the null holds, we have hn = 0 and hence γn = 0. This together with (3.5) suggests that our tests have correct size under the null. Under the alternative, we have hn ≠ 0 and hence γn ≠ 0. Since χ2(r,0) is stochastically smaller than χ2(r,γn), (3.5) implies that our tests have non-negligible power under Ha. We summarize these results in the following corollary.
COROLLARY 3.1. Assume Conditions (A1)-(A3) and (3.4) hold, s + m = o(n1/3), λmax((CCT)−1) = O(1), and φ̂ is consistent for ϕ0. Then, under the null hypothesis, for any 0 < α < 1, we have

Pr{T > χ2α(r)} → α

for T = TW, TL and TS, where χ2α(r) is the critical value of the χ2-distribution with r degrees of freedom at level α. Under the alternatives Cβ0,M − t = hn, for sequences hn satisfying the same condition as in (A4), we have, for any 0 < α < 1 and T = TW, TS and TL,

Pr{T > χ2α(r)} − Pr{χ2(r,γn) > χ2α(r)} → 0,

where γn = n hnT(CΩmmCT)−1hn/ϕ0.
REMARK 3.3. Corollary 3.1 shows that the asymptotic power functions of the proposed test statistics are
(3.6) Pr{χ2(r,γn) > χ2α(r)}.
It follows from Theorem 2 in Ghosh (1973) that the asymptotic power function decreases as r increases for a given γn. This is the same as for the traditional likelihood ratio, score and Wald tests. However, hn is an r-dimensional vector in our setting. Thus, one may easily construct an example in which γn grows as r increases. As a result, the asymptotic power function need not be a monotone increasing function of r.
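The asymptotic power function (3.6) can be evaluated numerically with the noncentral χ2 distribution; the small sketch below (our own illustration) shows the two competing effects of r:

```python
from scipy.stats import chi2, ncx2

def asymptotic_power(r, gamma_n, alpha=0.05):
    """Power approximation (3.6): Pr{chi2(r, gamma_n) > chi2_alpha(r)}."""
    crit = chi2.ppf(1 - alpha, df=r)
    return ncx2.sf(crit, df=r, nc=gamma_n)

# For fixed gamma_n the power decreases in r (Ghosh, 1973); if gamma_n grows
# with r, e.g. gamma_n = 2r, the power need not be monotone in r.
for r in (1, 4, 8, 12):
    print(r, round(asymptotic_power(r, gamma_n=5.0), 3),
             round(asymptotic_power(r, gamma_n=2.0 * r), 3))
```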
In Section S3 of Shi et al. (2018), we study in depth how penalizing individual coefficients affects the power, and find that the tests are most advantageous if each unpenalized variable is either an important variable (i.e., in S) or a variable in M.
REMARK 3.4. Notice that the null hypothesis reduces to H0: β0,M = 0 if we set C to be the identity matrix and t = 0. The Wald test based on the desparsified Lasso estimator (van de Geer et al., 2014) and the decorrelated score test (Ning and Liu, 2017) can also be applied to testing such hypotheses. Based on (3.6), we show in Section S1 of Shi et al. (2018) that these two tests achieve less power than the proposed partial penalized tests. This is due to the increased variances of the de-sparsified Lasso estimator and the decorrelated score statistic after the debiasing procedure.
3.3. Some implementation issues.
3.3.1. Constrained partial penalized regression.
To construct our test statistics, we need to compute the partial penalized estimators β̂0 and β̂a. Our algorithm is based upon the alternating direction method of multipliers (ADMM), which is a variant of the standard augmented Lagrangian method. Below, we present our algorithm for estimating β̂0; the unconstrained estimator β̂a can be computed similarly. For a fixed regularization parameter λ, define β̂0(λ) as the maximizer of Qn(β) subject to CβM = t.

The above optimization problem is equivalent to

(3.7) minβ,θ −Ln(β) + ∑j∈Mc pλ(|θj|) subject to CβM = t, β = θ.

The augmented Lagrangian for (3.7) is

Lρ(β,θ,v) = −Ln(β) + ∑j∈Mc pλ(|θj|) + vT(β − θ) + (ρ/2)∥β − θ∥22, with β restricted to {β : CβM = t},

for a given ρ > 0. Applying the dual ascent method yields the following updates

βk+1 = argmin{β: CβM=t} Lρ(β, θk, vk), θk+1 = argminθ Lρ(βk+1, θ, vk), vk+1 = vk + ρ(βk+1 − θk+1),

for the (k + 1)th iteration.
Since Ln is twice differentiable, βk+1 can be obtained by the Newton-Raphson algorithm. θk+1 may have a closed form for some popular penalties such as the Lasso, SCAD or MCP penalty. In our implementation, we use the SCAD penalty (Fan and Li, 2001),

pλ(t) = λt I(t ≤ λ) + {2aλt − t2 − λ2}/{2(a − 1)} I(λ < t ≤ aλ) + {(a + 1)λ2/2} I(t > aλ), t ≥ 0,

and set a = 3.7 and ρ = 1.
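The following Python sketch (our own schematic, not the authors' code) instantiates this ADMM for the Gaussian case, where the β-update reduces to a single equality-constrained least-squares solve via a KKT system and the θ-update is the SCAD thresholding rule with ρ = 1; the function names and the embedding of C into ℝp are illustrative assumptions:

```python
import numpy as np

def scad_threshold(z, lam, a=3.7):
    """Closed-form minimizer of p_lam(|t|) + 0.5 * (t - z)^2 (SCAD with rho = 1)."""
    az = np.abs(z)
    return np.where(
        az <= 2.0 * lam,
        np.sign(z) * np.maximum(az - lam, 0.0),                       # soft-thresholding zone
        np.where(az <= a * lam,
                 ((a - 1.0) * z - np.sign(z) * a * lam) / (a - 2.0),  # interpolation zone
                 z),                                                  # no shrinkage zone
    )

def admm_constrained_gaussian(X, Y, M_idx, C, t_vec, lam, rho=1.0, n_iter=500):
    """Sketch of the ADMM of Section 3.3.1 for the Gaussian case, using the
    splitting beta = theta and leaving the coefficients in M unpenalized."""
    n, p = X.shape
    r = C.shape[0]
    penalized = np.ones(p, dtype=bool)
    penalized[M_idx] = False                      # no penalty on coefficients in M
    Cfull = np.zeros((r, p))
    Cfull[:, M_idx] = C                           # embed C beta_M = t into R^p
    K = np.zeros((p + r, p + r))                  # KKT matrix for the beta-update
    K[:p, :p] = X.T @ X / n + rho * np.eye(p)
    K[:p, p:] = Cfull.T
    K[p:, :p] = Cfull
    beta, theta, v = np.zeros(p), np.zeros(p), np.zeros(p)
    for _ in range(n_iter):
        rhs = np.concatenate([X.T @ Y / n + rho * theta - v, t_vec])
        beta = np.linalg.solve(K, rhs)[:p]        # equality-constrained LS step
        z = beta + v / rho
        theta = np.where(penalized, scad_threshold(z, lam), z)
        v = v + rho * (beta - theta)              # dual ascent step
    return theta                                  # sparse constrained estimate
```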
To obtain β̂0, we compute β̂0(λ) for a series of log-spaced values of λ in [λmin, λmax] for some 0 < λmin < λmax. Then we choose λ̂ by minimizing the following information criterion:

IC(λ) = −2nLn(β̂0(λ)) + cn∥β̂0(λ)∥0,

where cn = max{log n, log(log n) log p}. Using arguments similar to those in Schwarz (1978) and Fan and Tang (2013), we can show that such an information criterion is consistent in both the fixed-p and ultrahigh-dimensional settings.
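A hedged sketch of the tuning-parameter search, reusing the ADMM sketch above; the n log(RSS/n) goodness-of-fit term is our assumption (a GIC form in the spirit of Fan and Tang, 2013), while cn follows the text:

```python
import numpy as np

def select_lambda(X, Y, M_idx, C, t_vec, lambdas):
    """Pick lambda minimizing n*log(RSS/n) + c_n * df over a grid of lambdas."""
    n, p = X.shape
    c_n = max(np.log(n), np.log(np.log(n)) * np.log(p))
    best_lam, best_beta, best_ic = None, None, np.inf
    for lam in lambdas:
        beta = admm_constrained_gaussian(X, Y, M_idx, C, t_vec, lam)
        rss = np.sum((Y - X @ beta) ** 2)
        ic = n * np.log(rss / n) + c_n * np.count_nonzero(beta)
        if ic < best_ic:
            best_lam, best_beta, best_ic = lam, beta, ic
    return best_lam, best_beta
```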
3.3.2. Estimation of the nuisance parameter.
It can be shown that ϕ0 = 1 for logistic or Poisson regression models. In linear regression models, ϕ0 equals the error variance. In our implementation, we estimate ϕ0 by the residual variance of the fit based on β̂a, where β̂a is defined in (2.3).
In Section S2 of the supplementary material (Shi et al., 2018), we show that φ̂ is consistent for ϕ0 under the conditions in Theorem 2.1; the argument relies on the selection consistency of β̂a. Alternatively, one can estimate ϕ0 using refitted cross-validation (Fan, Guo and Hao, 2012) or the scaled lasso (Sun and Zhang, 2013).
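In code, a natural residual-based sketch is the following; the degrees-of-freedom correction by the number of nonzero coefficients is an assumption of this illustration:

```python
import numpy as np

def estimate_phi(X, Y, beta_a_hat):
    """Residual-variance estimate of phi_0 in the linear model."""
    resid = Y - X @ beta_a_hat
    return resid @ resid / (len(Y) - np.count_nonzero(beta_a_hat))
```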
4. Numerical Examples.
In this section, we examine the finite sample performance of the proposed tests. Simulation results for linear regression and logistic regression are presented in the main text. In the supplementary material (Shi et al., 2018), we present simulation results for Poisson log-linear model and illustrate the proposed methodology by a real data example.
4.1. Linear regression.
Simulated data with sample size n = 100 were generated from the linear model

Y = XTβ0 + ε,

where ε ∼ N(0,1), X ∼ N(0p,Σ), and h(1) and h(2) are some constants. The true value of β0 is specified in terms of h(1), h(2) and zero blocks, where 0q denotes a zero vector of length q.
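The design can be simulated as follows (a sketch; β0 must be supplied by the user since its exact specification depends on h(1) and h(2)):

```python
import numpy as np

def simulate_linear(beta0, n=100, rho=0.5, seed=0):
    """Generate (X, Y) with Y = X^T beta0 + eps, eps ~ N(0, 1), and
    X ~ N(0_p, Sigma), where Sigma_ij = rho^|i-j| (Sigma = I when rho = 0)."""
    rng = np.random.default_rng(seed)
    p = len(beta0)
    Sigma = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    Y = X @ beta0 + rng.standard_normal(n)
    return X, Y
```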
4.1.1. Testing linear hypothesis.
We focus on testing the following three pairs of hypotheses:
These hypotheses examine linear structures between two regression coefficients. When testing the first pair of hypotheses, we set h(2) = 0, and hence the null holds if and only if h(1) = 0. Similarly, when testing the second and third pairs, we set h(1) = 0, and hence the null hypotheses hold if and only if h(2) = 0.
We consider two different dimensions, p = 50 and p = 200, and two different covariance matrices, Σ = I and Σ = {0.5|i−j|}. This yields a total of 4 settings. For each hypothesis and each setting, we further consider four scenarios, setting h(j) = 0, 0.1, 0.2, 0.4. Therefore, the null holds under the first scenario and the alternative holds under the remaining three. Table 1 summarizes the rejection probabilities of the proposed tests under the settings where Σ = {0.5|i−j|}. Rejection probabilities of the proposed tests under the settings where Σ = I are given in Table S1 in the supplementary material. The rejection probabilities are evaluated via 600 simulation replications.
Table 1.
Rejection probabilities (%) of the partial penalized Wald, score and likelihood ratio statistics with standard errors in parentheses (%), under the setting where Σ = {0.5|i−j|}.
|  | p = 50 |  |  | p = 200 |  |  |
| --- | --- | --- | --- | --- | --- | --- |
|  | TL | TW | TS | TL | TW | TS |
| h(1) (first pair) |  |  |  |  |  |  |
| 0 | 4.33(0.83) | 4.33(0.83) | 4.67(0.86) | 5.67(0.94) | 5.67(0.94) | 5.67(0.94) |
| 0.1 | 13.17(1.38) | 13.50(1.40) | 13.50(1.40) | 11.67(1.31) | 11.67(1.31) | 11.67(1.31) |
| 0.2 | 39.83(2.00) | 40.17(2.00) | 40.00(2.00) | 39.67(2.00) | 39.67(2.00) | 39.67(2.00) |
| 0.4 | 92.33(1.09) | 93.17(1.03) | 93.17(1.03) | 92.67(1.06) | 92.67(1.06) | 92.67(1.06) |
| h(2) (second pair) |  |  |  |  |  |  |
| 0 | 5.17(0.90) | 5.17(0.90) | 5.67(0.94) | 5.33(0.92) | 5.33(0.92) | 5.33(0.92) |
| 0.1 | 11.00(1.28) | 11.00(1.28) | 11.33(1.29) | 12.50(1.35) | 12.50(1.35) | 12.50(1.35) |
| 0.2 | 30.67(1.88) | 30.67(1.88) | 31.00(1.89) | 33.67(1.93) | 33.67(1.93) | 33.67(1.93) |
| 0.4 | 85.17(1.45) | 85.00(1.46) | 85.00(1.46) | 87.83(1.33) | 87.83(1.33) | 87.83(1.33) |
| h(2) (third pair) |  |  |  |  |  |  |
| 0 | 6.50(1.01) | 6.33(0.99) | 6.50(1.01) | 5.67(0.94) | 5.67(0.94) | 5.67(0.94) |
| 0.1 | 11.83(1.32) | 11.67(1.31) | 11.67(1.31) | 11.00(1.28) | 11.00(1.28) | 11.00(1.28) |
| 0.2 | 31.67(1.90) | 31.50(1.90) | 31.67(1.90) | 33.17(1.92) | 33.17(1.92) | 33.17(1.92) |
| 0.4 | 84.33(1.48) | 84.17(1.49) | 84.50(1.48) | 86.00(1.42) | 86.17(1.41) | 86.17(1.41) |
Based on the results, it can be seen that under the null hypotheses, the Type I error rates of the three tests are well controlled and close to the nominal level for all four settings. Under the alternative hypotheses, the powers of the three test statistics increase as h(1) or h(2) increases, showing the consistency of our testing procedures. Moreover, the empirical rejection rates of the three test statistics are very close across all scenarios and settings. For example, the rejection rates are exactly the same across the three statistics when p = 200 in Table 1, although the values of the three statistics in our simulations are slightly different. This is consistent with our theoretical finding that these statistics are asymptotically equivalent even in high dimensional settings. Figures S1, S2 and S3 in the supplementary material depict the kernel density estimates of the three test statistics under the null hypotheses with different combinations of p and the covariance matrices. It can be seen that the three test statistics converge to their limiting distributions under the null hypotheses.
4.1.2. Testing univariate parameter.
Consider testing the following two pairs of hypotheses:
We set h(2) = 0 when testing the first pair of hypotheses, and set h(1) = 0 when testing the second. Therefore, the first null hypothesis is equivalent to h(1) = 0 and the second is equivalent to h(2) = 0. We use the same 4 settings described in Section 4.1.1. For each setting, we set h(1) = 0.1, 0.2, 0.4 under the first alternative and h(2) = 0.1, 0.2, 0.4 under the second. Comparison is made among the following test statistics:
The proposed likelihood ratio (TL), Wald (TW) and score (TS) statistics.
The Wald test statistic based on the de-sparsified Lasso estimator (van de Geer et al., 2014).
The decorrelated score statistic (Ning and Liu, 2017).
The de-sparsified Lasso test statistic is computed via the R package hdi (Dezeure et al., 2015). We calculate the decorrelated score statistic according to Section 4.1 in Ning and Liu (2017). More specifically, the initial estimator is computed by a penalized linear regression with the SCAD penalty function, and the estimate ŵ of w* is computed by a penalized linear regression with the ℓ1 penalty function (see Equation (4.4) in Ning and Liu, 2017). These penalized regressions are implemented via the R package ncvreg (Breheny and Huang, 2011). The tuning parameters are selected via 10-fold cross-validation. The rejection probabilities of these test statistics under the settings where Σ = {0.5|i−j|} are reported in Table 2. In the supplementary material, we report the rejection probabilities of these test statistics under the settings where Σ = I in Table S2. Results are averaged over 600 simulation replications.
Table 2.
Rejection probabilities (%) of the partial penalized Wald, score and likelihood ratio statistics, the Wald test statistic based on the de-sparsified Lasso estimator and the decorrelated score statistic under the settings where Σ = {0.5|i−j|}, with standard errors in parentheses (%).
|  | TL | TW | TS | de-sparsified Lasso | decorrelated score |
| --- | --- | --- | --- | --- | --- |
| h(1), p = 50 |  |  |  |  |  |
| 0 | 5.17(0.90) | 5.33(0.92) | 5.50(0.93) | 12.67(1.36) | 7.00(1.04) |
| 0.1 | 15.67(1.48) | 16.00(1.50) | 16.00(1.50) | 6.00(0.97) | 14.67(1.44) |
| 0.2 | 41.00(2.01) | 41.33(2.01) | 41.50(2.01) | 14.83(1.45) | 38.83(1.99) |
| 0.4 | 92.50(1.08) | 93.00(1.04) | 93.00(1.04) | 67.67(1.91) | 88.67(1.29) |
| h(1), p = 200 |  |  |  |  |  |
| 0 | 4.83(0.88) | 4.83(0.88) | 4.83(0.88) | 21.83(1.69) | 5.50(0.93) |
| 0.1 | 11.00(1.28) | 11.00(1.28) | 11.00(1.28) | 5.83(0.96) | 10.83(1.27) |
| 0.2 | 40.50(2.00) | 40.50(2.00) | 40.50(2.00) | 6.17(0.98) | 37.83(1.98) |
| 0.4 | 91.50(1.14) | 91.50(1.14) | 91.50(1.14) | 49.33(2.04) | 88.00(1.33) |
| h(2), p = 50 |  |  |  |  |  |
| 0 | 6.33(0.99) | 6.00(0.97) | 6.50(1.00) | 5.33(0.92) | 3.00(0.70) |
| 0.1 | 13.67(1.40) | 13.50(1.40) | 14.00(1.42) | 5.33(0.92) | 9.17(1.18) |
| 0.2 | 40.17(2.00) | 40.33(2.00) | 40.50(2.00) | 15.67(1.48) | 28.50(1.84) |
| 0.4 | 90.83(1.18) | 91.33(1.15) | 91.67(1.13) | 69.17(1.89) | 83.33(1.52) |
| h(2), p = 200 |  |  |  |  |  |
| 0 | 5.67(0.94) | 5.67(0.94) | 5.67(0.94) | 6.50(1.01) | 2.67(0.66) |
| 0.1 | 13.67(1.40) | 13.67(1.40) | 13.67(1.40) | 3.67(0.77) | 8.17(1.12) |
| 0.2 | 39.17(1.99) | 39.17(1.99) | 39.17(1.99) | 9.67(1.21) | 24.67(1.76) |
| 0.4 | 91.50(1.14) | 91.50(1.14) | 91.50(1.14) | 51.33(2.04) | 80.50(1.62) |
From Table 2, it can be seen that the Wald test based on the de-sparsified Lasso estimator fails under the settings where Σ = {0.5|i−j|}: under the null hypotheses, its Type I error rates are greater than 12%. Under the alternative hypotheses, the proposed test statistics and the decorrelated score test are more powerful than the de-sparsified Lasso test in almost all cases. Besides, we note that TL, TW, TS and the decorrelated score test perform comparably under the settings where Σ = I. When Σ = {0.5|i−j|}, however, the proposed test statistics achieve greater power than the decorrelated score test. This is in line with our theoretical findings (see Section S1 of the supplementary material for details).
4.1.3. Effects of m.
In Section 4.1.1, we considered linear hypotheses involving two parameters only. As suggested by one of the referees, we further examine our test statistics under settings where more regression parameters are involved in the hypotheses. More specifically, we consider the following three pairs of hypotheses:
The numbers of parameters involved in the three pairs of hypotheses are equal to 4, 8 and 12, respectively. We consider the same 4 settings described in Section 4.1.1. For each setting, we set h(1) = 0, 0.2, 0.4, 0.8 and h(2) = 0. Hence, the null hypotheses hold when h(1) = 0 and the alternatives hold when h(1) > 0. We report the rejection probabilities over 600 replications in Table 3, under the settings where Σ = {0.5|i−j|}. Rejection probabilities under the settings where Σ = I are reported in Table S3 in the supplementary material.
Table 3.
Rejection probabilities (%) of the partial penalized Wald, score and likelihood ratio statistics with standard errors in parentheses (%), under the settings where Σ = {0.5|i−j|}.
|  | p = 50 |  |  | p = 200 |  |  |
| --- | --- | --- | --- | --- | --- | --- |
|  | TL | TW | TS | TL | TW | TS |
| h(1) (m = 4) |  |  |  |  |  |  |
| 0 | 4.83(0.88) | 4.50(0.85) | 4.67(0.86) | 4.83(0.88) | 4.83(0.88) | 4.83(0.88) |
| 0.2 | 28.17(1.84) | 28.17(1.84) | 28.50(1.84) | 28.50(1.84) | 28.50(1.84) | 28.50(1.84) |
| 0.4 | 80.33(1.62) | 80.17(1.63) | 80.33(1.62) | 79.83(1.64) | 79.83(1.64) | 79.83(1.64) |
| 0.8 | 99.83(0.17) | 100.00(0.00) | 100.00(0.00) | 100.00(0.00) | 100.00(0.00) | 100.00(0.00) |
| h(1) (m = 8) |  |  |  |  |  |  |
| 0 | 4.50(0.85) | 4.50(0.85) | 4.50(0.85) | 5.00(0.89) | 5.00(0.89) | 5.00(0.89) |
| 0.2 | 18.17(1.57) | 18.33(1.58) | 18.33(1.58) | 18.33(1.58) | 18.33(1.58) | 18.33(1.58) |
| 0.4 | 53.83(2.04) | 54.17(2.03) | 54.00(2.03) | 57.33(2.02) | 57.33(2.02) | 57.33(2.02) |
| 0.8 | 98.50(0.50) | 99.00(0.41) | 99.00(0.41) | 98.50(0.50) | 98.50(0.50) | 98.50(0.50) |
| h(1) (m = 12) |  |  |  |  |  |  |
| 0 | 5.17(0.90) | 5.00(0.89) | 5.17(0.90) | 5.67(0.94) | 5.67(0.94) | 5.67(0.94) |
| 0.2 | 14.33(1.43) | 14.33(1.43) | 14.33(1.43) | 13.67(1.40) | 13.67(1.40) | 13.67(1.40) |
| 0.4 | 42.00(2.01) | 42.17(2.02) | 42.17(2.02) | 41.67(2.01) | 41.67(2.01) | 41.67(2.01) |
| 0.8 | 92.83(1.05) | 92.83(1.05) | 92.83(1.05) | 93.00(1.04) | 93.00(1.04) | 93.00(1.04) |
The Type I error rates of the three test statistics are close to the nominal level under the null hypotheses. Under the alternative hypotheses, the powers of the test statistics increase as h(1) increases. Moreover, we note that the powers decrease as m increases. This is in line with Corollary 3.1, which states that the asymptotic power function of our test statistics is a function of r and γn. Recall that γn = n hnT(CΩmmCT)−1hn/ϕ0. Consider the following sequence of null hypotheses indexed by m ≥ 2: Cmβ0 = 0, where Cm = (1,···,1, 0p−m). Let γn,m denote the corresponding noncentrality parameter. Under the given settings, Ωmm = (ωij) is a banded matrix with ωij = 0 for |i − j| ≥ 2, ωij = −1/(1 − ρ2) for |i − j| = 1, ω11 = ωmm = 1/{ρ(1 − ρ2)}, and ωjj = (1 + ρ2)/{ρ(1 − ρ2)} for j ≠ 1, m, where ρ is the autocorrelation between X1 and X2. It is immediate to see that γn,m decreases as m increases.
4.2. Logistic regression.
In this example, we generate data with sample size n = 300 from the logistic regression model

logit{Pr(Y = 1 | X)} = XTβ0,

where logit(p) = log{p/(1 − p)} is the logit link function and X ∼ N(0p,Σ).
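A matching sketch for the logistic design (β0 supplied by the user, as its exact specification again depends on h(1) and h(2)):

```python
import numpy as np

def simulate_logistic(beta0, n=300, rho=0.5, seed=0):
    """Generate (X, Y) with logit{Pr(Y = 1 | X)} = X^T beta0 and the same
    AR(1) design, Sigma_ij = rho^|i-j|, as in Section 4.1."""
    rng = np.random.default_rng(seed)
    p = len(beta0)
    Sigma = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    prob = 1.0 / (1.0 + np.exp(-(X @ beta0)))
    Y = rng.binomial(1, prob)
    return X, Y
```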
4.2.1. Testing linear hypothesis.
We consider the same linear hypotheses as those in Section 4.1.1:
Similarly, we set h(2) = 0 when testing the first pair of hypotheses, and set h(1) = 0 when testing the second and third pairs. Therefore, the first null hypothesis is equivalent to h(1) = 0 and the others are equivalent to h(2) = 0. We use the same 4 settings described in Section 4.1.1. For each of the four settings, we set h(j) = 0.2, 0.4, 0.8 under the alternatives. The rejection probabilities over 600 replications are given in Table S4 in the supplementary material. We also plot the kernel density estimates of the three test statistics under the null hypotheses in Figures S4, S5 and S6 in the supplementary material. The findings are very similar to those in the previous examples.
4.2.2. Testing univariate parameter.
To compare the proposed partial penalized Wald (TW), score (TS) and likelihood ratio (TL) test statistics with the Wald test based on the de-sparsified Lasso estimator and the decorrelated score test, we consider testing the following hypotheses:
Similar to Section 4.1.2, we set h(2) = 0 when testing the first pair of hypotheses, and set h(1) = 0 when testing the second. We set h(1) = 0 under the first null hypothesis and h(1) = 0.2, 0.4, 0.8 under the first alternative, and we set h(2) = 0 under the second null hypothesis and h(2) = 0.2, 0.4, 0.8 under the second alternative. We consider the same 4 settings described in Section 4.1.1. The de-sparsified Lasso test statistic is computed via the R package hdi, and the decorrelated score statistic is obtained according to Section 4.2 of Ning and Liu (2017). We compute the initial estimator by fitting a penalized logistic regression with the SCAD penalty function, and calculate ŵ by fitting a penalized linear regression with the ℓ1 penalty function. These penalized regressions are implemented via the R package ncvreg. We report the rejection probabilities of TW, TS, TL and the two competing tests in Table S5 in the supplementary article, based on 600 simulation replications.
Based on the results, it can be seen that the Type I error rates of the de-sparsified Lasso test and the decorrelated score test are significantly larger than the nominal level in almost all cases. On the other hand, the Type I error rates of the proposed test statistics are close to the nominal level under the null hypotheses. Besides, under the alternatives, the powers of the proposed test statistics are greater than or equal to those of the two competing tests in all cases.
4.2.3. Effects of m.
As in Section 4.1.3, we further examine the proposed test statistics by allowing more regression coefficients to appear in the linear hypotheses. Similarly, we consider the following three pairs of hypotheses:
We set h(2) = 0 throughout, and set h(1) = 0 under the null hypotheses and h(1) = 0.4, 0.8, 1.6 under the alternative hypotheses. The same 4 settings described in Section 4.1.1 are used. The rejection probabilities of the proposed test statistics are reported in Table S6 in the supplementary article. Results are averaged over 600 replications. Findings are very similar to those in Section 4.1.3.
5. Technical proofs.
This section contains the proof of Theorem 3.1. To establish Theorem 3.1, we need the following lemma, whose proof is given in Section 5.1. For any symmetric and positive definite matrix A ∈ ℝq×q, it follows from the spectral theorem that A = UT ΛU for some orthogonal matrix U and diagonal matrix Λ = diag(λ1,...,λq). Since the diagonal elements of Λ are positive, we use Λ1/2 and Λ−1/2 to denote the diagonal matrices diag(λ11/2,...,λq1/2) and diag(λ1−1/2,...,λq−1/2), respectively. In addition, we define A1/2 = UT Λ1/2U and A−1/2 = UT Λ−1/2U.
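These matrix powers are easy to compute numerically via an eigendecomposition, as in this short sketch (note numpy's eigh uses the convention A = U diag(λ) UT):

```python
import numpy as np

def sym_mat_power(A, power):
    """A^{power} for symmetric positive definite A via the spectral theorem."""
    lam, U = np.linalg.eigh(A)            # A = U diag(lam) U^T
    return (U * lam ** power) @ U.T       # U diag(lam^power) U^T

A = np.array([[2.0, 0.5], [0.5, 1.0]])
assert np.allclose(sym_mat_power(A, 0.5) @ sym_mat_power(A, 0.5), A)           # A^{1/2} A^{1/2} = A
assert np.allclose(sym_mat_power(A, -0.5) @ A @ sym_mat_power(A, -0.5), np.eye(2))
```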
LEMMA 5.1. Under the conditions in Theorem 3.1, we have
(5.1) |
(5.2) |
(5.3) |
(5.4) |
(5.5) |
(5.6) |
(5.7) |
where Ψ = CΩmm CT and
We break the proof into four steps. In the first three steps, we show TW /r, TS/r and TL/r are equivalent to T0/r, respectively, where
and
In the final step, we show the χ2 approximation (3.5) holds for TW ,TS and TL.
Step 1: We first show that TW /r is equivalent to T0/r. It follows from Theorem 2.1 that
for some vector Ra that satisfies
(5.8) |
Therefore, we have
(5.9) |
where J0 = [1,...,m]. Since , it follows from (5.9) that
and hence
(5.10) |
By (5.8) and (5.5) in Lemma 5.1, we have
This together with (5.10) gives
(5.11) |
Note that
By Markov’s inequality, we have
(5.12) |
Besides, it follows from (5.4) in Lemma 5.1 and Condition (A4) that
(5.13) |
This together with (5.11) and (5.12) implies that
(5.14) |
Combining this together with (5.6) in Lemma 5.1 gives
The last term is op(r) under the condition s+m = o(n1/3). By the definition of TW, we have shown that
(5.15) |
where
Under the conditions in Theorem 3.1, we have . Since ϕ0 > 0, we have
(5.16) |
which together with (5.15) entails that TW = TW,0 + op(r).
It follows from (5.10)-(5.13) and the condition s + m = o(n1/3) that
(5.17) |
where
By (5.16), we obtain TW,0 = TW,1 + op(r) and hence TW = TW,1 + op(r). In the following, we show TW,1 = T0 + op(r).
Observe that
(5.18) |
It follows from (5.12), (5.13), (5.16) and the given conditions that the right-hand side (RHS) of (5.18) is of the order op(r). This proves TW,1 = T0 + op(r).
Step 2: We show that TS/r is equivalent to T0/r. Based on the proof of Theorem 2.1 in Section S5.1 of the supplementary article, we have
(5.19) |
and
(5.20) |
Combining (5.1) with (5.20) gives
which together with (5.19) implies that
By (5.3), we have
(5.21) |
It follows from (5.5) and (5.13) that
This together with (5.3) yields
(5.22) |
Notice that
It follows from Markov's inequality that
Combining this with (5.21) and (5.22) yields
(5.23) |
This together with (5.7) and the condition s + m = o(n1/3) gives that
When , we have
Since , we obtain , where
This together with (5.16) implies that |TS − TS,0| = op(r). Using similar arguments in (5.17) and (5.18), we can show that TS,0/r is equivalent to TS,1/r, where TS,1 is defined as
Recall that
we have
This proves the equivalence between TS/r and T0/r.
Step 3: By Theorem 2.1, we have
Notice that
It follows that
(5.24) |
Similar to (5.23), we can show that
(5.25) |
Under this event, using a third-order Taylor expansion, we obtain that
where n∥R∥∞ is upper bounded by
for some β* lying on the line segment between and . By Theorem 2.1, we have and with probability tending to 1. By Condition (A1), we obtain
This together with (5.25) yields that
The last term is op(r), since r ≤ s + m and s + m = o(n1/3).
Similarly, we can show
As a result, we have
(5.26) |
Recall that β̂a is the maximizer defined in (2.3). By Theorem 2.1, with probability tending to 1, we have . Under the given conditions, we have
This together with (5.25) yields
By (5.26), we obtain that
In view of (5.24), using similar arguments in (5.17), we can show that
As a result, we have
By (5.16), this shows
Under the condition , we can show . As a result, we have TL = T0 + op(r).
Step 4: We first show the χ2 approximation (3.5) holds for T = T0. Recall that
By the definition of ωn, we have
With some calculation, we can show that
(5.27) |
It follows from Condition (A3) that
This implies maxi = 1,...,n
Hence, with some calculations, we have
where O(1) denotes some positive constant, the first inequality follows from the Cauchy-Schwarz inequality, the last inequality follows from the fact that
and the last equality is due to Condition (3.4).
This together with (5.27) and an application of Lemma S.3 in the supplementary material gives that
(5.28) |
where Z ∈ ℝr stands for a mean zero Gaussian random vector with identity covariance matrix, and the supremum is taken over all convex sets ∈ ℝr.
Consider the following class of sets:
indexed by x ∈ ℝ. It follows from (5.28) that
Note that is equivalent to T0 ≤ x, and Pr(Z ∈ ) = Pr(χ2(r,γn) ≤ x) where . This implies
(5.29) |
Consider any statistic T* = T0 + op(r). For any x and ε > 0, it follows from (5.29) that
(5.30) |
Besides, by Lemma S.4, we have
(5.31) |
Combining (5.30) with (5.31), we obtain that
(5.32) |
In the first three steps, we have shown that T0 = TS + op(r) = TW + op(r) = TL + op(r). This together with (5.32) implies that the χ2 approximation holds for our partial penalized Wald, score and likelihood ratio statistics. The proof is hence completed.
5.1. Proof of Lemma 5.1.
Assertion (5.1) follows directly from Condition (A1). This means the square root of the maximum eigenvalue of Kn is O(1), which, by definition, proves (5.2). Under Condition (A1), we have . Using the same arguments, we have . Hence, (5.3) is proven. We now show that (5.4) holds. It follows from the condition λmax((CCT)−1) = O(1) in Condition (A4) that liminfn λmin(CCT) > 0, and hence
This implies that for sufficiently large n, we have
(5.33) |
By (5.1), we have liminfn λmin(Ωn) > 0, or equivalently,
Hence, we have
where J0 = [1,...,m]. Note that this implies
Therefore, we obtain
(5.34) |
Combining this together with (5.33) yields
By definition, this suggests
or equivalently,
This gives (5.4).
Using Cauchy-Schwarz inequality, we have
Observe that
(5.35) |
Besides, by (5.34), we have
which together with (5.35) implies that I1I2 = O(1). This proves (5.5).
We now show that (5.6) holds. Assume, for now, that we have
(5.36) |
where
Note that
Under Condition (A1), we have lim infn λmin(Kn) > 0. Under the condition max(s,m) = o(n1/2), this together with (5.36) implies
(5.37) |
with probability tending to 1. Hence, we have
(5.38) |
By Lemma S.2, this gives
and hence,
where J0 = [1,...,m]. Using Lemma S.2 again, we obtain
(5.39) |
By definition, we have . According to Theorem 2.1, we have that with probability tending to 1, where . When , we have and . Therefore, by (5.39), we have
Using Cauchy-Schwarz inequality, we obtain
(5.40) |
by (5.34). Let Ψ = CΩmmCT, we obtain
(5.41) |
by (5.40) and the fact that
Similar to (5.37), by (5.41), we can show that
(5.42) |
Combining (5.41) together with (5.42), we obtain
This proves (5.6).
Similar to (5.38), we can show
By (5.2), we obtain
This proves (5.7).
It remains to show (5.36). Since Kn and are symmetric, by Lemma S.5, it suffices to show
By definition, this requires showing that
For any vector a ∈ ℝq, we have . Hence, it suffices to show
(5.43) |
Using Taylor’s theorem, we have
(5.44) |
By Theorem 2.1, we have . Hence, we have
By Condition (A1),
By Cauchy-Schwarz inequality, we have
This proves (5.43).
Acknowledgements.
The authors wish to thank the Associate Editor and anonymous referees for their constructive comments, which led to significant improvements of this work.
Supported by NSF grant DMS 1555244, NCI grant P01 CA142538
Supported by NSF grant DMS 1512422, NIH grants P50 DA039838, P50 DA036107 and T32 LM012415, and NNSFC grants 11690014 and 11690015
SUPPLEMENTARY MATERIAL
Supplement to “Partial penalization for high dimensional testing with linear constraints”: This supplemental material includes power comparisons with existing test statistics, additional numerical studies on Poisson regression and a real data application, discussions of Conditions (A1)-(A4), some technical lemmas and the proof of Theorem 2.1.
Contributor Information
Chengchun Shi, Email: cshi4@ncsu.edu, Department of Statistics, North Carolina State University, Raleigh, NC 27695-8203, USA.
Rui Song, Email: rsong@ncsu.edu, Department of Statistics, North Carolina State University, Raleigh, NC 27695-8203, USA.
Zhao Chen, Email: zuc4@psu.edu, Department of Statistics, and The Methodology Center, the Pennsylvania State University, University Park, PA 16802-2111, USA.
Runze Li, Email: rzli@psu.edu, Department of Statistics, and The Methodology Center, the Pennsylvania State University, University Park, PA 16802-2111, USA.
References.
- Bentkus V (2004). A Lyapunov type bound in Rd. Teor. Veroyatn. Primen. 49 400–410.
- Boyd S, Parikh N, Chu E, Peleato B and Eckstein J (2011). Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning 3 1–122.
- Breheny P and Huang J (2011). Coordinate descent algorithms for nonconvex penalized regression, with applications to biological feature selection. Ann. Appl. Stat. 5 232–253.
- Candes E and Tao T (2007). The Dantzig selector: statistical estimation when p is much larger than n. Ann. Statist. 35 2313–2351.
- Dezeure R, Bühlmann P, Meier L and Meinshausen N (2015). High-dimensional inference: confidence intervals, p-values and R-software hdi. Statist. Sci. 30 533–558.
- Fan J, Guo S and Hao N (2012). Variance estimation using refitted cross-validation in ultrahigh dimensional regression. J. R. Stat. Soc. Ser. B. Stat. Methodol. 74 37–65.
- Fan J and Li R (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. J. Amer. Statist. Assoc. 96 1348–1360.
- Fan J and Lv J (2010). A selective overview of variable selection in high dimensional feature space. Statist. Sinica 20 101–148.
- Fan J and Lv J (2011). Nonconcave penalized likelihood with NP-dimensionality. IEEE Trans. Inform. Theory 57 5467–5484.
- Fan J and Peng H (2004). Nonconcave penalized likelihood with a diverging number of parameters. Ann. Statist. 32 928–961.
- Fan Y and Tang CY (2013). Tuning parameter selection in high dimensional penalized likelihood. J. R. Stat. Soc. Ser. B. Stat. Methodol. 75 531–552.
- Fang EX, Ning Y and Liu H (2017). Testing and confidence intervals for high dimensional proportional hazards models. J. Roy. Statist. Soc. Ser. B 79 1415–1437.
- Ghosh BK (1973). Some monotonicity theorems for χ2, F and t distributions with applications. J. Roy. Statist. Soc. Ser. B 35 480–492.
- Lee JD, Sun DL, Sun Y and Taylor JE (2016). Exact post-selection inference, with application to the lasso. Ann. Statist. 44 907–927.
- Lockhart R, Taylor J, Tibshirani RJ and Tibshirani R (2014). A significance test for the lasso. Ann. Statist. 42 413–468.
- McCullagh P and Nelder JA (1989). Generalized Linear Models. Chapman and Hall.
- Ning Y and Liu H (2017). A general theory of hypothesis tests and confidence regions for sparse high dimensional models. Ann. Statist. 45 158–195.
- Schwarz G (1978). Estimating the dimension of a model. Ann. Statist. 6 461–464.
- Shi C, Song R, Chen Z and Li R (2018). Supplement to “Partial penalization for high dimensional testing with linear constraints”.
- Sun T and Zhang C-H (2013). Sparse matrix inversion with scaled lasso. J. Mach. Learn. Res. 14 3385–3418.
- Taylor J, Lockhart R, Tibshirani RJ and Tibshirani R (2014). Post-selection adaptive inference for least angle regression and the lasso. arXiv preprint.
- Tibshirani R (1996). Regression shrinkage and selection via the lasso. J. Roy. Statist. Soc. Ser. B 58 267–288.
- van de Geer S, Bühlmann P, Ritov Y and Dezeure R (2014). On asymptotically optimal confidence regions and tests for high-dimensional models. Ann. Statist. 42 1166–1202.
- Wang S and Cui H (2013). Partial penalized likelihood ratio test under sparse case. arXiv preprint arXiv:1312.3723.
- Zhang C-H (2010). Nearly unbiased variable selection under minimax concave penalty. Ann. Statist. 38 894–942.
- Zhang X and Cheng G (2017). Simultaneous inference for high-dimensional linear models. J. Amer. Statist. Assoc. 112 757–768.